Commit Graph

1863 Commits

Author SHA1 Message Date
Sean Christopherson
363010e1dd KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from
nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one
and only path where KVM invokes nested_vmx_vmexit() with
EXIT_REASON_EXTERNAL_INTERRUPT.  A future fix will perform a last-minute
check on L2's nested posted interrupt notification vector, just before
injecting a nested VM-Exit.  To handle that scenario correctly, KVM needs
to get the interrupt _before_ injecting VM-Exit, as simply querying the
highest priority interrupt, via kvm_cpu_has_interrupt(), would result in
TOCTOU bug, as a new, higher priority interrupt could arrive between
kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().

Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't
suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in
kvm_get_apic_interrupt(), and acknowledging the interrupt before nested
VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01.

Open code a rough equivalent to kvm_cpu_get_interrupt() so that the IRQ
is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU
issue described above.

Opportunistically convert the WARN_ON() to a WARN_ON_ONCE().  If KVM has
a bug that results in a false positive from kvm_cpu_has_interrupt(),
spamming dmesg won't help the situation.

Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as
they are always "wanted" by L0.

Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:14:58 -07:00
Kai Huang
7efb4d8a39 KVM: VMX: Also clear SGX EDECCSSA in KVM CPU caps when SGX is disabled
When SGX EDECCSSA support was added to KVM in commit 16a7fe3728
("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest"), it
forgot to clear the X86_FEATURE_SGX_EDECCSSA bit in KVM CPU caps when
KVM SGX is disabled.  Fix it.

Fixes: 16a7fe3728 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240905120837.579102-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:13:55 -07:00
Sean Christopherson
590b09b1d8 KVM: x86: Register "emergency disable" callbacks when virt is enabled
Register the "disable virtualization in an emergency" callback just
before KVM enables virtualization in hardware, as there is no functional
need to keep the callbacks registered while KVM happens to be loaded, but
is inactive, i.e. if KVM hasn't enabled virtualization.

Note, unregistering the callback every time the last VM is destroyed could
have measurable latency due to the synchronize_rcu() needed to ensure all
references to the callback are dropped before KVM is unloaded.  But the
latency should be a small fraction of the total latency of disabling
virtualization across all CPUs, and userspace can set enable_virt_at_load
to completely eliminate the runtime overhead.

Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
There is no reason to force vendor code to do the registration, and either
way KVM would need a new kvm_x86_ops hook.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:34 -04:00
Sean Christopherson
0617a769ce KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM
Rename x86's the per-CPU vendor hooks used to enable virtualization in
hardware to align with the recently renamed arch hooks.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:33 -04:00
Sean Christopherson
1876dd69df KVM: x86: Add fastpath handling of HLT VM-Exits
Add a fastpath for HLT VM-Exits by immediately re-entering the guest if
it has a pending wake event.  When virtual interrupt delivery is enabled,
i.e. when KVM doesn't need to manually inject interrupts, this allows KVM
to stay in the fastpath run loop when a vIRQ arrives between the guest
doing CLI and STI;HLT.  Without AMD's Idle HLT-intercept support, the CPU
generates a HLT VM-Exit even though KVM will immediately resume the guest.

Note, on bare metal, it's relatively uncommon for a modern guest kernel to
actually trigger this scenario, as the window between the guest checking
for a wake event and committing to HLT is quite small.  But in a nested
environment, the timings change significantly, e.g. rudimentary testing
showed that ~50% of HLT exits where HLT-polling was successful would be
serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets
a wake event before KVM schedules out the vCPU, the wake event was pending
even before the VM-Exit.

Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com
Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:22 -07:00
Sean Christopherson
73b42dc69b KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's
IPI virtualization support, but only for AMD.  While not stated anywhere
in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs
store the 64-bit ICR as two separate 32-bit values in ICR and ICR2.  When
IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled,
KVM needs to match CPU behavior as some ICR ICR writes will be handled by
the CPU, not by KVM.

Add a kvm_x86_ops knob to control the underlying format used by the CPU to
store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether
or not x2AVIC is enabled.  If KVM is handling all ICR writes, the storage
format for x2APIC mode doesn't matter, and having the behavior follow AMD
versus Intel will provide better test coverage and ease debugging.

Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 16:25:05 -07:00
Sean Christopherson
b848f24bd7 KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()
Rename all APIs related to feature MSRs from get_msr_feature() to
get_feature_msr().  The APIs get "feature MSRs", not "MSR features".
And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't
describe the helper itself.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:56 -07:00
Sean Christopherson
74c6c98a59 KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry
Refactor get_msr_feature() to take the index and data pointer as distinct
parameters in anticipation of eliminating "struct kvm_msr_entry" usage
further up the primary callchain.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:30 -07:00
Sean Christopherson
aaecae7b6a KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTED
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to
try and make it more clear that access was denied because the MSR itself
is unsupported/unknown.  "INVALID" is too ambiguous, as it could just as
easily mean the value for WRMSR as invalid.

Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that
_are_ actually implemented by KVM, e.g. if the MSR is unsupported because
an associated feature flag is not present in guest CPUID.

Opportunistically beef up the comments for the internal MSR error codes.

Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:29 -07:00
Sean Christopherson
b58b808cbe KVM: x86: Move MSR_TYPE_{R,W,RW} values from VMX to x86, as enums
Move VMX's MSR_TYPE_{R,W,RW} #defines to x86.h, as enums, so that they can
be used by common x86 code, e.g. instead of doing "bool write".

Opportunistically tweak the definitions to make it more obvious that the
values are bitmasks, not arbitrary ascending values.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:29 -07:00
Vitaly Kuznetsov
2ab637df5f KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load()
GCC 12.3.0 complains about a potential NULL pointer dereference in
evmcs_load() as hv_get_vp_assist_page() can return NULL. In fact, this
cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a
valid VP assist page and aborts enabling the feature otherwise. CPU
onlining path is also checked in vmx_hardware_enable().

To make the compiler happy and to future proof the code, add a KVM_BUG_ON()
sentinel. It doesn't seem to be possible (and logical) to observe
evmcs_load() happening without an active vCPU so it is presumed that
kvm_get_running_vcpu() can't return NULL.

No functional change intended.

Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240816130124.286226-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:18 -07:00
Maxim Levitsky
41ab0d59fa KVM: nVMX: Use vmx_segment_cache_clear() instead of open coded equivalent
In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of
setting segment_cache.bitmask directly.  Using the helper minimizes the
chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g.
if KVM ends up doing more than just zero the bitmask when purging the
cache.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240725175232.337266-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:17 -07:00
Sean Christopherson
653ea4489e KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter.  Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.

More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM.  On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.

Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter.  And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field.  But for the MSR lists, KVM
is blindly processing MSRs; the  MSR filters are the _only_ way for
userspace to deny access.

This partially reverts commit ac8d6cad3c ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:16 -07:00
Kai Huang
d9aa56edad KVM: VMX: Do not account for temporary memory allocation in ECREATE emulation
In handle_encls_ecreate(), a page is allocated to store a copy of SECS
structure used by the ENCLS[ECREATE] leaf from the guest.  This page is
only used temporarily and is freed after use in handle_encls_ecreate().

Don't account for the memory allocation of this page per [1].

Link: https://lore.kernel.org/kvm/b999afeb588eb75d990891855bc6d58861968f23.camel@intel.com/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240715101224.90958-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:15 -07:00
Qiang Liu
caf22c6dd3 KVM: VMX: Modify the BUILD_BUG_ON_MSG of the 32-bit field in the vmcs_check16 function
According to the SDM, the meaning of field bit 0 is:
Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit,
and natural-width fields. So there is no 32-bit high field here,
it should be a 32-bit field instead.

Signed-off-by: Qiang Liu <liuq131@chinatelecom.cn>
Link: https://lore.kernel.org/r/20240702064609.52487-1-liuq131@chinatelecom.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:15 -07:00
Xin Li
566975f6ec KVM: nVMX: Use macros and #defines in vmx_restore_vmx_misc()
Use macros in vmx_restore_vmx_misc() instead of open coding everything
using BIT_ULL() and GENMASK_ULL().  Opportunistically split feature bits
and reserved bits into separate variables, and add a comment explaining
the subset logic (it's not immediately obvious that the set of feature
bits is NOT the set of _supported_ feature bits).

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog, drop #defines]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:54 -07:00
Xin Li
8f56b14e9f KVM: VMX: Open code VMX preemption timer rate mask in its accessor
Use vmx_misc_preemption_timer_rate() to get the rate in hardware_setup(),
and open code the rate's bitmask in vmx_misc_preemption_timer_rate() so
that the function looks like all the helpers that grab values from
VMX_BASIC and VMX_MISC MSR values.

No functional change intended.

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:54 -07:00
Sean Christopherson
dc1e67f70f KVM VMX: Move MSR_IA32_VMX_MISC bit defines to asm/vmx.h
Move the handful of MSR_IA32_VMX_MISC bit defines that are currently in
msr-indx.h to vmx.h so that all of the VMX_MISC defines and wrappers can
be found in a single location.

Opportunistically use BIT_ULL() instead of open coding hex values, add
defines for feature bits that are architecturally defined, and move the
defines down in the file so that they are colocated with the helpers for
getting fields from VMX_MISC.

No functional change intended.

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:53 -07:00
Sean Christopherson
92e648042c KVM: nVMX: Add a helper to encode VMCS info in MSR_IA32_VMX_BASIC
Add a helper to encode the VMCS revision, size, and supported memory types
in MSR_IA32_VMX_BASIC, i.e. when synthesizing KVM's supported BASIC MSR
value, and delete the now unused VMCS size and memtype shift macros.

For a variety of reasons, KVM has shifted (pun intended) to using helpers
to *get* information from the VMX MSRs, as opposed to defined MASK and
SHIFT macros for direct use.  Provide a similar helper for the nested VMX
code, which needs to *set* information, so that KVM isn't left with a mix
of SHIFT macros and dedicated helpers.

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:52 -07:00
Xin Li
c97b106fa8 KVM: nVMX: Use macros and #defines in vmx_restore_vmx_basic()
Use macros in vmx_restore_vmx_basic() instead of open coding everything
using BIT_ULL() and GENMASK_ULL().  Opportunistically split feature bits
and reserved bits into separate variables, and add a comment explaining
the subset logic (it's not immediately obvious that the set of feature
bits is NOT the set of _supported_ feature bits).

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog, drop #defines]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:51 -07:00
Xin Li
9df398ff7d KVM: VMX: Track CPU's MSR_IA32_VMX_BASIC as a single 64-bit value
Track the "basic" capabilities VMX MSR as a single u64 in vmcs_config
instead of splitting it across three fields, that obviously don't combine
into a single 64-bit value, so that KVM can use the macros that define MSR
bits using their absolute position.  Replace all open coded shifts and
masks, many of which are relative to the "high" half, with the appropriate
macro.

Opportunistically use VMX_BASIC_32BIT_PHYS_ADDR_ONLY instead of an open
coded equivalent, and clean up the related comment to not reference a
specific SDM section (to the surprise of no one, the comment is stale).

No functional change intended (though obviously the code generation will
be quite different).

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:50 -07:00
Sean Christopherson
e7e80b66fb x86/cpu: KVM: Add common defines for architectural memory types (PAT, MTRRs, etc.)
Add defines for the architectural memory types that can be shoved into
various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs,
etc.  While most MSRs/registers support only a subset of all memory types,
the values themselves are architectural and identical across all users.

Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi
header, but add compile-time assertions to connect the dots (and sanity
check that the msr-index.h values didn't get fat-fingered).

Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the
EPTP holds a single memory type in 3 of its 64 bits; those bits just
happen to be 2:0, i.e. don't need to be shifted.

Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in
setup_vmcs_config().

No functional change intended.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:46 -07:00
Paolo Bonzini
208a352a54 KVM VMX changes for 6.11
- Remove an unnecessary EPT TLB flush when enabling hardware.
 
  - Fix a series of bugs that cause KVM to fail to detect nested pending posted
    interrupts as valid wake eents for a vCPU executing HLT in L2 (with
    HLT-exiting disable by L1).
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvX0ACgkQOlYIJqCj
 N/2Aiw/9Htwy4MfJ2zdTX0ypZx6CUAVY0B7R2q9LVaqlBBL02dLoNWn9ndf7J2pd
 TJKtp39sHzf342ghti/Za5+mZgRgXA9IjQ5cvcQQjfmjDdDODygEc12otISeSNqq
 uL2jbUZzzjbcQyUrXkeFptVcNFpaiOG0dFfvnoi1csWzXVf7t+CD+8/3kjVm2Qt7
 vQXkV4yN7tNiYOvaukfXP7Og9ALpF8g8ok3YmXVXDPMu7+R7G+P6j3mVWr9ABMPj
 LOmC+5Z/sscMFw1Io3XHuWoF5socQARXEzJNLCblDaw3GMlSj4LNxif2M/6B7bmR
 nQVtiegj9K1Fc3OGOqPJcAIRPI4O9nMmf7uOwvXmOlwDSk7rCxF/yPk7Cto2+UXm
 6mnLcH1l0/VaidW+a7rUAcDGIlWwgfw0F6tp2j6FdVl2Lx/IThcrkn0teLY1gAW8
 CMi/BfTBEXO5583O3+ZCAzVQzeKnWR3yqwJe0oSftB1/rPkPD8PQ39MH8LuJJJxi
 CN1W4R1/taQdOxMZqggDvS1biz7gwpjNGtnWsO9szAgMEXVjf2M1HOZVcT2e2997
 81xDMdZaJSfd26tm7PhWtQnVPqyMZ6vqqIiq7FlIbEEkAE75Kbg4fUn/4y4WRnh9
 3Gog6MZPu/MA5TbwvcZ/sy/CRfFu0HKm5q98oArhjSyU8C7oGeQ=
 =W1/6
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.11

 - Remove an unnecessary EPT TLB flush when enabling hardware.

 - Fix a series of bugs that cause KVM to fail to detect nested pending posted
   interrupts as valid wake eents for a vCPU executing HLT in L2 (with
   HLT-exiting disable by L1).

 - Misc cleanups
2024-07-16 09:56:41 -04:00
Paolo Bonzini
cda231cd42 KVM x86/pmu changes for 6.11
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
    '0' and writes from userspace are ignored.
 
  - Update to the newfangled Intel CPU FMS infrastructure.
 
  - Use macros instead of open-coded literals to clean up KVM's manipulation of
    FIXED_CTR_CTRL MSRs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRu3oACgkQOlYIJqCj
 N/0f/Q/+PoRFKrr9ENwlVjxmq7DBJOzrEiht5EH89bQdpYL0pcmv6I+n+Z77o08X
 l49YFO2zVq26dMCe8EFDuQrZpqKjOS/qEc+/zTsLu4lx8NJD1gqYJLJryejgtdQI
 +GefPVIN11TvlDDjuuxSWgKUCAevk8s3PRe+zbUwlsHmw+GVky8dJoe71QbW27rK
 hL7Y2pOe5Y8MgRAadxlhm6QmgOnz3RKKYs9t/HMzi2gQP1TuvPxnYtMC3Gz5pVe+
 w3Ak7M4fh8Z7FbQsoNY5h3IdigG6eFrssqHX4QpCXr/G5L9vAgUmSR93/M8jLjNv
 wAkUulLx7vFeTlOXjqcEJSn0U6mX/48pt68vrPB5ES1Rx28RB5s9tzYXCGtCmSxv
 nHmMDc3YUbg6tp2hvliMqjsN0j5l2GQiX7LJwH2Ma9qQFlTHPmFwJGS4hciki6c5
 obCK2vXBoS1jyxrZx8qUhIcJl2oigv3hwihN2YqZ0Q4QDwllv8cw4BeABQByYR9x
 T91PQ0biiJ9vWCkALbyzYOpy+grdHCblwYW9+FM/qZBGH0ouPzDyZWPrRBLX12pH
 fEgDMB3vT9JqQ5tyafd0MHuAVlrDVHYEY+lmXplzFGKEFonBkN7HmDzAOKafuCuj
 GnIe0Sa1JnHVPNomx2dnG6Sku6/tPIfERuHEXrR9zkUJsacnfRY=
 =pJxP
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86/pmu changes for 6.11

 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
   '0' and writes from userspace are ignored.

 - Update to the newfangled Intel CPU FMS infrastructure.

 - Use macros instead of open-coded literals to clean up KVM's manipulation of
   FIXED_CTR_CTRL MSRs.
2024-07-16 09:55:15 -04:00
Paolo Bonzini
5c5ddf7107 KVM x86 MTRR virtualization removal
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
 hack, and instead always honor guest PAT on CPUs that support self-snoop.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj
 N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9
 Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9
 No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi
 V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K
 Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf
 KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn
 qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi
 yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR
 KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h
 5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn
 aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM=
 =yb10
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MTRR virtualization removal

Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16 09:54:57 -04:00
Paolo Bonzini
5dcc1e7614 KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
    move "shadow_phys_bits" into the structure as "maxphyaddr".
 
  - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
    bus frequency, because TDX.
 
  - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
 
  - Clean up KVM's handling of vendor specific emulation to consistently act on
    "compatible with Intel/AMD", versus checking for a specific vendor.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj
 N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc
 l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t
 XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn
 c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9
 Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7
 OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm
 T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH
 +YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc
 lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk
 9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA
 xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY=
 =Cf2R
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.11

 - Add a global struct to consolidate tracking of host values, e.g. EFER, and
   move "shadow_phys_bits" into the structure as "maxphyaddr".

 - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.

 - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.

 - Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.

 - Misc cleanups
2024-07-16 09:53:05 -04:00
Paolo Bonzini
86014c1e20 KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
 
  - Setup empty IRQ routing when creating a VM to avoid having to synchronize
    SRCU when creating a split IRQCHIP on x86.
 
  - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
    that arch code can use for hooking both sched_in() and sched_out().
 
  - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
    truncating a bogus value from userspace, e.g. to help userspace detect bugs.
 
  - Mark a vCPU as preempted if and only if it's scheduled out while in the
    KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
    memory when retrieving guest state during live migration blackout.
 
  - A few minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj
 N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha
 vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe
 o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV
 PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT
 QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7
 GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+
 jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg
 2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe
 b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8
 GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO
 d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o=
 =BalU
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 6.11

 - Enable halt poll shrinking by default, as Intel found it to be a clear win.

 - Setup empty IRQ routing when creating a VM to avoid having to synchronize
   SRCU when creating a split IRQCHIP on x86.

 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
   that arch code can use for hooking both sched_in() and sched_out().

 - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
   truncating a bogus value from userspace, e.g. to help userspace detect bugs.

 - Mark a vCPU as preempted if and only if it's scheduled out while in the
   KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
   memory when retrieving guest state during live migration blackout.

 - A few minor cleanups
2024-07-16 09:51:36 -04:00
Dapeng Mi
f287bef6dd KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max number
Refine the macros which define maximum General Purpose (GP) and fixed
counter numbers.

Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the
maximum supported General Purpose (GP) counter number ambiguously across
Intel and AMD platforms. This would cause issues if AMD begins to support
more GP counters than Intel.

Thus a bunch of new macros including vendor specific and vendor
independent are introduced to replace the old macros. The vendor
independent macros are used in x86 common code to hide vendor difference
and eliminate the ambiguity.

No logic changes are introduced in this patch.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 09:12:16 -07:00
Sean Christopherson
321ef62b0c KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is
pending delivery, in vmx_has_nested_events() and drop the one-off
kvm_x86_ops.guest_apic_has_interrupt() hook.

In addition to dropping a superfluous hook, this fixes a bug where KVM
would incorrectly treat virtual interrupts _for L2_ as always enabled due
to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(),
treating IRQs as enabled if L2 is active and vmcs12 is configured to exit
on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake
event based on L1's IRQ blocking status.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:06 -07:00
Sean Christopherson
27c4fa42b1 KVM: nVMX: Check for pending posted interrupts when looking for nested events
Check for pending (and notified!) posted interrupts when checking if L2
has a pending wake event, as fully posted/notified virtual interrupt is a
valid wake event for HLT.

Note that KVM must check vmx->nested.pi_pending to avoid prematurely
waking L2, e.g. even if KVM sees a non-zero PID.PIR and PID.0N=1, the
virtual interrupt won't actually be recognized until a notification IRQ is
received by the vCPU or the vCPU does (nested) VM-Enter.

Fixes: 26844fee6a ("KVM: x86: never write to memory from kvm_vcpu_check_block()")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reported-by: Jim Mattson <jmattson@google.com>
Closes: https://lore.kernel.org/all/20231207010302.2240506-1-jmattson@google.com
Link: https://lore.kernel.org/r/20240607172609.3205077-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:06 -07:00
Sean Christopherson
322a569c4b KVM: VMX: Split out the non-virtualization part of vmx_interrupt_blocked()
Move the non-VMX chunk of the "interrupt blocked" checks to a separate
helper so that KVM can reuse the code to detect if interrupts are blocked
for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake
event.  If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2
HLTs, then L2 virtual interrupts are valid wake events, but if and only if
interrupts are unblocked for L2.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:05 -07:00
Sean Christopherson
32f55e475c KVM: nVMX: Request immediate exit iff pending nested event needs injection
When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.

Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.

At present time, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:04 -07:00
Sean Christopherson
d83c36d822 KVM: nVMX: Add a helper to get highest pending from Posted Interrupt vector
Add a helper to retrieve the highest pending vector given a Posted
Interrupt descriptor.  While the actual operation is straightforward, it's
surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's
find_highest_vector(), which doesn't work with PID.PIR due to the APIC's
IRR and ISR component registers being physically discontiguous (they're
4-byte registers aligned at 16-byte intervals).

To make PIR handling more consistent with respect to IRR and ISR handling,
return -1 to indicate "no interrupt pending".

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:03 -07:00
Kai Huang
92c1e3cbf0 KVM: VMX: Switch __vmx_exit() and kvm_x86_vendor_exit() in vmx_exit()
In the vmx_init() error handling path, the __vmx_exit() is done before
kvm_x86_vendor_exit().  They should follow the same order in vmx_exit().

But currently __vmx_exit() is done after kvm_x86_vendor_exit() in
vmx_exit().  Switch the order of them to fix.

Fixes: e32b120071 ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240627010524.3732488-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:58:23 -07:00
Sean Christopherson
23b2c5088d KVM: VMX: Remove unnecessary INVEPT[GLOBAL] from hardware enable path
Remove the completely pointess global INVEPT, i.e. EPT TLB flush, from
KVM's VMX enablement path.  KVM always does a targeted TLB flush when
using a "new" EPT root, in quotes because "new" simply means a root that
isn't currently being used by the vCPU.

KVM also _deliberately_ runs with stale TLB entries for defunct roots,
i.e. doesn't do a TLB flush when vCPUs stop using roots, precisely because
KVM does the flush on first use.  As called out by the comment in
kvm_mmu_load(), the reason KVM flushes on first use is because KVM can't
guarantee the correctness of past hypervisors.

Jumping back to the global INVEPT, when the painfully terse commit
1439442c7b ("KVM: VMX: Enable EPT feature for KVM") was added, the
effective TLB flush being performed was:

  static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
  {
          vpid_sync_vcpu_all(to_vmx(vcpu));
  }

I.e. KVM was not flushing EPT TLB entries when allocating a "new" root,
which very strongly suggests that the global INVEPT during hardware
enabling was a misguided hack that addressed the most obvious symptom,
but failed to fix the underlying bug.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240608001003.3296640-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:57:25 -07:00
Sean Christopherson
cb9fb5fc12 KVM: nVMX: Update VMCS12_REVISION comment to state it should never change
Rewrite the comment above VMCS12_REVISION to unequivocally state that the
ID must never change.  KVM_{G,S}ET_NESTED_STATE have been officially
supported for some time now, i.e. changing VMCS12_REVISION would break
userspace.

Opportunistically add a blurb to the CHECK_OFFSET() comment to make it
explicitly clear that new fields are allowed, i.e. that the restriction
on the layout is all about backwards compatibility.

No functional change intended.

Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240613190103.1054877-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:55:00 -07:00
Jeff Johnson
8815d77cbc KVM: x86: Add missing MODULE_DESCRIPTION() macros
Add module descriptions for the vendor modules to fix allmodconfig
'make W=1' warnings:

  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-intel.o
  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-amd.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com
[sean: split kvm.ko change to separate commit]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:47:12 -07:00
Sean Christopherson
3dee3b1874 KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run()
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(),
drop the redundant store from vcpu_run().  The flag is cleared only when
VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's
impossible for l1tf_flush_l1d to be cleared between loading the vCPU and
calling vcpu_run().

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:47 -07:00
Sean Christopherson
8fbb696a8f KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying
off the recently added kvm_vcpu.scheduled_out as appropriate.

Note, there is a very slight functional change, as PLE shrink updates will
now happen after blasting WBINVD, but that is quite uninteresting as the
two operations do not interact in any way.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:44 -07:00
Sean Christopherson
5d9c07febb KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()
Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation
of moving the sched_in logic, which handles shrinking the PLE window, into
vmx_vcpu_load().

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:43 -07:00
Sean Christopherson
3b65a692a5 KVM: x86/pmu: Add a helper to enable bits in FIXED_CTR_CTRL
Add a helper, intel_pmu_enable_fixed_counter_bits(), to dedup code that
enables fixed counter bits, i.e. when KVM clears bits in the reserved mask
used to detect invalid MSR_CORE_PERF_FIXED_CTR_CTRL values.

No functional change intended.

Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240608000819.3296176-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 09:35:58 -07:00
Binbin Wu
d5989a3533 KVM: VMX: Remove unused declaration of vmx_request_immediate_exit()
After commit 0ec3d6d1f1 "KVM: x86: Fully defer to vendor code to decide
how to force immediate exit", vmx_request_immediate_exit() was removed.
Commit 5f18c642ff "KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch
VMX and TDX" added its declaration by accident.  Remove it.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20240506075025.2251131-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 09:35:25 -07:00
Hou Wenlong
c7d4c5f019 KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definition
The check_apicv_inhibit_reasons() callback implementation was dropped in
the commit b3f257a846 ("KVM: x86: Track required APICv inhibits with
variable, not callback"), but the definition removal was missed in the
final version patch (it was removed in the v4). Therefore, it should be
dropped, and the vmx_check_apicv_inhibit_reasons() function declaration
should also be removed.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 09:20:52 -07:00
Sean Christopherson
377b2f359d KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
Unconditionally honor guest PAT on CPUs that support self-snoop, as
Intel has confirmed that CPUs that support self-snoop always snoop caches
and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
even in the presence of aliased memtypes, thus there is no need to trust
the guest behaves and only honor PAT as a last resort, as KVM does today.

Honoring guest PAT is desirable for use cases where the guest has access
to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
(mediated, for all intents and purposes) GPU is exposed to the guest, along
with buffers that are consumed directly by the physical GPU, i.e. which
can't be proxied by the host to ensure writes from the guest are performed
with the correct memory type for the GPU.

Cc: Yiwei Zhang <zzyiwei@google.com>
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-07 07:18:03 -07:00
Sean Christopherson
e1548088ff KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer
honors guest MTRR memtypes, as forcing UC memory for VMs with
non-coherent DMA only makes sense if the guest is using something other
than PAT to configure the memtype for the DMA region.

Furthermore, KVM has forced WB memory for CR0.CD=1 since commit
fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known
VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does
so with non-coherent DMA.

Lastly, commit fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was
from the same author as commit b18d5431ac ("KVM: x86: fix CR0.CD
virtualization"), and followed by a mere month.  I.e. forcing UC memory
was likely the result of code inspection or perhaps misdiagnosed failures,
and not the necessitate by a concrete use case.

Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now
AMD-only, and to take an erratum for lack of CR0.CD virtualization on
Intel.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 08:13:14 -07:00
Sean Christopherson
0a7b73559b KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
adds no value, negatively impacts guest performance, and is a maintenance
burden due to it's complexity and oddities.

KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
that may perform non-coherent DMA access.  From a hardware virtualization
perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
shadowing paging doesn't magically account for guest MTRRs, nor does NPT.

Unwinding and deciphering KVM's murky history, the MTRR virtualization
code appears to be the result of misdiagnosed issues when EPT + VT-d with
passthrough devices was enabled years and years ago.  And importantly, the
underlying bugs that were fudged around by honoring guest MTRR memtypes
have since been fixed (though rather poorly in some cases).

The zapping GFNs logic in the MTRR virtualization code came from:

  commit efdfe536d8
  Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
  Date:   Wed May 13 14:42:27 2015 +0800

    KVM: MMU: fix MTRR update

    Currently, whenever guest MTRR registers are changed
    kvm_mmu_reset_context is called to switch to the new root shadow page
    table, however, it's useless since:
    1) the cache type is not cached into shadow page's attribute so that
       the original root shadow page will be reused

    2) the cache type is set on the last spte, that means we should sync
       the last sptes when MTRR is changed

    This patch fixs this issue by drop all the spte in the gfn range which
    is being updated by MTRR

which was a fix for:

  commit 0bed3b568b
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Oct 9 16:01:54 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Wed Dec 31 16:51:44 2008 +0200

      KVM: Improve MTRR structure

      As well as reset mmu context when set MTRR.

which was part of a "MTRR/PAT support for EPT" series that also added:

+       if (mt_mask) {
+               mt_mask = get_memory_type(vcpu, gfn) <<
+                         kvm_x86_ops->get_mt_mask_shift();
+               spte |= mt_mask;
+       }

where get_memory_type() was a truly gnarly helper to retrieve the guest
MTRR memtype for a given memtype.  And *very* subtly, at the time of that
change, KVM *always* set VMX_EPT_IGMT_BIT,

        kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
                VMX_EPT_WRITABLE_MASK |
                VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
                VMX_EPT_IGMT_BIT);

which came in via:

  commit 928d4bf747
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Nov 6 14:55:45 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Tue Nov 11 21:00:37 2008 +0200

      KVM: VMX: Set IGMT bit in EPT entry

      There is a potential issue that, when guest using pagetable without vmexit when
      EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
      which would be inconsistent with host side and would cause host MCE due to
      inconsistent cache attribute.

      The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
      memory type to protect host (notice that all memory mapped by KVM should be WB).

Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added
the whole "ignoreIGMT things as a bug fix for issues that were detected
during EPT + VT-d + passthrough enabling, but it was applied earlier
because it was a generic fix.

Jumping back to 0bed3b568b ("KVM: Improve MTRR structure"), the other
relevant code, or rather lack thereof, is the handling of *host* MMIO.
That fix came in a bit later, but given the author and timing, it's safe
to say it was all part of the same EPT+VT-d enabling mess.

  commit 2aaf69dcee
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Wed Jan 21 16:52:16 2009 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Sun Feb 15 02:47:37 2009 +0200

    KVM: MMU: Map device MMIO as UC in EPT

    Software are not allow to access device MMIO using cacheable memory type, the
    patch limit MMIO region with UC and WC(guest can select WC using PAT and
    PCD/PWT).

In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization
was obviously never tested on NPT until much later, which lends further
credence to the theory/argument that this was all the result of
misdiagnosed issues.

Discussion from the EPT+MTRR enabling thread[*] more or less confirms that
Sheng Yang was trying to resolve issues with passthrough MMIO.

 * Sheng Yang
  : Do you mean host(qemu) would access this memory and if we set it to guest
  : MTRR, host access would be broken? We would cover this in our shadow MTRR
  : patch, for we encountered this in video ram when doing some experiment with
  : VGA assignment.

And in the same thread, there's also what appears to be confirmation of
Intel running into issues with Windows XP related to a guest device driver
mapping DMA with WC in the PAT.

 * Avi Kavity
  : Sheng Yang wrote:
  : > Yes... But it's easy to do with assigned devices' mmio, but what if guest
  : > specific some non-mmio memory's memory type? E.g. we have met one issue in
  : > Xen, that a assigned-device's XP driver specific one memory region as buffer,
  : > and modify the memory type then do DMA.
  : >
  : > Only map MMIO space can be first step, but I guess we can modify assigned
  : > memory region memory type follow guest's?
  : >
  :
  : With ept/npt, we can't, since the memory type is in the guest's
  : pagetable entries, and these are not accessible.

[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com

So, for the most part, what likely happened is that 15 years ago, a few
engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially
"fixed" passthrough device MMIO by emulating *guest* MTRRs.  Except for
the below case, everything since then has been a result of those two
intertwined changes.

The one exception, which is actually yet more confirmation of all of the
above, is the revert of Paolo's attempt at "full" virtualization of guest
MTRRs:

  commit 606decd670
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:12:47 2015 +0200

    Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"

    This reverts commit fd717f1101.
    It was reported to cause Machine Check Exceptions (bug 104091).

...

  commit fd717f1101
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:38:13 2015 +0200

    KVM: x86: apply guest MTRR virtualization on host reserved pages

    Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
    However, the guest could prefer a different page type than UC for
    such pages. A good example is that pass-throughed VGA frame buffer is
    not always UC as host expected.

    This patch enables full use of virtual guest MTRRs.

I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC
in EPT" and got the same result: machine checks, likely due to the guest
MTRRs not being trustworthy/sane at all times.

Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that
too got reverted.  Unfortunately, it doesn't appear that anyone ever found
a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused
extremely slow boot times doesn't appear to have a definitive root cause.

  commit fc07e76ac7
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:20:22 2015 +0200

    Revert "KVM: SVM: use NPT page attributes"

    This reverts commit 3c2e7f7de3.
    Initializing the mapping from MTRR to PAT values was reported to
    fail nondeterministically, and it also caused extremely slow boot
    (due to caching getting disabled---bug 103321) with assigned devices.

...

  commit 3c2e7f7de3
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:32:17 2015 +0200

    KVM: SVM: use NPT page attributes

    Right now, NPT page attributes are not used, and the final page
    attribute depends solely on gPAT (which however is not synced
    correctly), the guest MTRRs and the guest page attributes.

    However, we can do better by mimicking what is done for VMX.
    In the absence of PCI passthrough, the guest PAT can be ignored
    and the page attributes can be just WB.  If passthrough is being
    used, instead, keep respecting the guest PAT, and emulate the guest
    MTRRs through the PAT field of the nested page tables.

    The only snag is that WP memory cannot be emulated correctly,
    because Linux's default PAT setting only includes the other types.

In short, honoring guest MTRRs for VMX was initially a workaround of
sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host
MMIO.  And while there *are* known cases where honoring guest MTRRs is
desirable, e.g. passthrough VGA frame buffers, the desired behavior in
that case is to get WC instead of UC, i.e. at this point it's for
performance, not correctness.

Furthermore, the complete absence of MTRR virtualization on NPT and
shadow paging proves that, while KVM theoretically can do better, it's
by no means necessary for correctnesss.

Lastly, since kernels mostly rely on firmware to do MTRR setup, and the
host typically provides guest firmware, honoring guest MTRRs is effectively
honoring *host* userspace memtypes, which is also backwards.  I.e. it
would be far better for host userspace to communicate its desired memtype
directly to KVM (or perhaps indirectly via VMAs in the host kernel), not
through guest MTRRs.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 08:13:14 -07:00
Alejandro Jimenez
f992572120 KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasons
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by
renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to
APICV_INHIBIT_REASON_DISABLED.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:28 -07:00
Dapeng Mi
75430c412a KVM: x86/pmu: Manipulate FIXED_CTR_CTRL MSR with macros
Magic numbers are used to manipulate the bit fields of
FIXED_CTR_CTRL MSR. This makes reading code become difficult, so use
pre-defined macros to replace these magic numbers.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240430005239.13527-3-dapeng1.mi@linux.intel.com
[sean: drop unnecessary curly braces]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 14:25:22 -07:00
Dapeng Mi
0e102ce3d4 KVM: x86/pmu: Change ambiguous _mask suffix to _rsvd in kvm_pmu
Several '_mask' suffixed variables such as, global_ctrl_mask, are
defined in kvm_pmu structure. However the _mask suffix is ambiguous and
misleading since it's not a real mask with positive logic. On the contrary
it represents the reserved bits of corresponding MSRs and these bits
should not be accessed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240430005239.13527-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 14:23:14 -07:00
Paolo Bonzini
f9d1b541d0 Merge branch 'kvm-fixes-6.10-1' into HEAD
* Fixes and debugging help for the #VE sanity check.  Also disable
  it by default, even for CONFIG_DEBUG_KERNEL, because it was found
  to trigger spuriously (most likely a processor erratum as the
  exact symptoms vary by generation).

* Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled
  situation (GIF=0 or interrupt shadow) when the processor supports
  virtual NMI.  While generally KVM will not request an NMI window
  when virtual NMIs are supported, in this case it *does* have to
  single-step over the interrupt shadow or enable the STGI intercept,
  in order to deliver the latched second NMI.

* Drop support for hand tuning APIC timer advancement from userspace.
  Since we have adaptive tuning, and it has proved to work well,
  drop the module parameter for manual configuration and with it a
  few stupid bugs that it had.
2024-06-03 13:10:11 -04:00
Tony Luck
0c468a6a02 KVM: VMX: Switch to new Intel CPU model infrastructure
Use x86_vfm (vendor, family, module) to detect CPUs that are affected by
PERF_GLOBAL_CTRL bugs instead of manually checking the family and model.
The new VFM infrastructure encodes all information in one handy location.

No functional change intended.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-10-tony.luck@intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 09:10:43 -07:00
Sean Christopherson
82897db912 KVM: x86: Move shadow_phys_bits into "kvm_host", as "maxphyaddr"
Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's
global "kvm_host" variable, so that it is automatically exported for use
in vendor modules.  Rename the variable/field to maxphyaddr to more
clearly capture what value it holds, now that it's used outside of the
MMU (and because the "shadow" part is more than a bit misleading as the
variable is not at all unique to shadow paging).

Recomputing the raw/true host.MAXPHYADDR on every use can be subtly
expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as
a nested hypervisor.  Vendor code already has access to the information,
e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so
there's no tangible benefit to making it MMU-only.

Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:58:55 -07:00
Sean Christopherson
7974c0643e KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc...
Add "struct kvm_host_values kvm_host" to hold the various host values
that KVM snapshots during initialization.  Bundling the host values into
a single struct simplifies adding new MSRs and other features with host
state/values that KVM cares about, and provides a one-stop shop.  E.g.
adding a new value requires one line, whereas tracking each value
individual often requires three: declaration, definition, and export.

No functional change intended.

Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:58:53 -07:00
Sean Christopherson
bca99c0356 KVM: x86/mmu: Print SPTEs on unexpected #VE
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT
Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE
didn't have SUPPRESS_VE set.

Opportunistically assert that the underlying exit reason was indeed an EPT
Violation, as the CPU has *really* gone off the rails if a #VE occurs due
to a completely unexpected exit reason.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:28:45 -04:00
Sean Christopherson
743f177336 KVM: VMX: Dump VMCS on unexpected #VE
Dump the VMCS on an unexpected #VE, otherwise it's practically impossible
to figure out why the #VE occurred.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:28:04 -04:00
Sean Christopherson
9031b42139 KVM: nVMX: Always handle #VEs in L0 (never forward #VEs from L2 to L1)
Always handle #VEs, e.g. due to prove EPT Violation #VE failures, in L0,
as KVM does not expose any #VE capabilities to L1, i.e. any and all #VEs
are KVM's responsibility.

Fixes: 8131cf5b4f ("KVM: VMX: Introduce test mode related to EPT violation VE")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:26 -04:00
Sean Christopherson
d1b32ecdc8 KVM: nVMX: Initialize #VE info page for vmcs02 when proving #VE support
Point vmcs02.VE_INFORMATION_ADDRESS at the vCPU's #VE info page when
initializing vmcs02, otherwise KVM will run L2 with EPT Violation #VE
enabled and a VE info address pointing at pfn 0.

Fixes: 8131cf5b4f ("KVM: VMX: Introduce test mode related to EPT violation VE")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:25 -04:00
Sean Christopherson
40e8a6901a KVM: VMX: Don't kill the VM on an unexpected #VE
Don't terminate the VM on an unexpected #VE, as it's extremely unlikely
the #VE is fatal to the guest, and even less likely that it presents a
danger to the host.  Simply resume the guest on "failure", as the #VE info
page's BUSY field will prevent converting any more EPT Violations to #VEs
for the vCPU (at least, that's what the BUSY field is supposed to do).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:17 -04:00
Linus Torvalds
f4b0c4b508 ARM:
* Move a lot of state that was previously stored on a per vcpu
   basis into a per-CPU area, because it is only pertinent to the
   host while the vcpu is loaded. This results in better state
   tracking, and a smaller vcpu structure.
 
 * Add full handling of the ERET/ERETAA/ERETAB instructions in
   nested virtualisation. The last two instructions also require
   emulating part of the pointer authentication extension.
   As a result, the trap handling of pointer authentication has
   been greatly simplified.
 
 * Turn the global (and not very scalable) LPI translation cache
   into a per-ITS, scalable cache, making non directly injected
   LPIs much cheaper to make visible to the vcpu.
 
 * A batch of pKVM patches, mostly fixes and cleanups, as the
   upstreaming process seems to be resuming. Fingers crossed!
 
 * Allocate PPIs and SGIs outside of the vcpu structure, allowing
   for smaller EL2 mapping and some flexibility in implementing
   more or less than 32 private IRQs.
 
 * Purge stale mpidr_data if a vcpu is created after the MPIDR
   map has been created.
 
 * Preserve vcpu-specific ID registers across a vcpu reset.
 
 * Various minor cleanups and improvements.
 
 LoongArch:
 
 * Add ParaVirt IPI support.
 
 * Add software breakpoint support.
 
 * Add mmio trace events support.
 
 RISC-V:
 
 * Support guest breakpoints using ebreak
 
 * Introduce per-VCPU mp_state_lock and reset_cntx_lock
 
 * Virtualize SBI PMU snapshot and counter overflow interrupts
 
 * New selftests for SBI PMU and Guest ebreak
 
 * Some preparatory work for both TDX and SNP page fault handling.
   This also cleans up the page fault path, so that the priorities
   of various kinds of fauls (private page, no memory, write
   to read-only slot, etc.) are easier to follow.
 
 x86:
 
 * Minimize amount of time that shadow PTEs remain in the special
   REMOVED_SPTE state.  This is a state where the mmu_lock is held for
   reading but concurrent accesses to the PTE have to spin; shortening
   its use allows other vCPUs to repopulate the zapped region while
   the zapper finishes tearing down the old, defunct page tables.
 
 * Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field,
   which is defined by hardware but left for software use.  This lets KVM
   communicate its inability to map GPAs that set bits 51:48 on hosts
   without 5-level nested page tables.  Guest firmware is expected to
   use the information when mapping BARs; this avoids that they end up at
   a legal, but unmappable, GPA.
 
 * Fixed a bug where KVM would not reject accesses to MSR that aren't
   supposed to exist given the vCPU model and/or KVM configuration.
 
 * As usual, a bunch of code cleanups.
 
 x86 (AMD):
 
 * Implement a new and improved API to initialize SEV and SEV-ES VMs, which
   will also be extendable to SEV-SNP.  The new API specifies the desired
   encryption in KVM_CREATE_VM and then separately initializes the VM.
   The new API also allows customizing the desired set of VMSA features;
   the features affect the measurement of the VM's initial state, and
   therefore enabling them cannot be done tout court by the hypervisor.
 
   While at it, the new API includes two bugfixes that couldn't be
   applied to the old one without a flag day in userspace or without
   affecting the initial measurement.  When a SEV-ES VM is created with
   the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are
   rejected once the VMSA has been encrypted.  Also, the FPU and AVX
   state will be synchronized and encrypted too.
 
 * Support for GHCB version 2 as applicable to SEV-ES guests.  This, once
   more, is only accessible when using the new KVM_SEV_INIT2 flow for
   initialization of SEV-ES VMs.
 
 x86 (Intel):
 
 * An initial bunch of prerequisite patches for Intel TDX were merged.
   They generally don't do anything interesting.  The only somewhat user
   visible change is a new debugging mode that checks that KVM's MMU
   never triggers a #VE virtualization exception in the guest.
 
 * Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
   L1, as per the SDM.
 
 Generic:
 
 * Use vfree() instead of kvfree() for allocations that always use vcalloc()
   or __vcalloc().
 
 * Remove .change_pte() MMU notifier - the changes to non-KVM code are
   small and Andrew Morton asked that I also take those through the KVM
   tree.  The callback was only ever implemented by KVM (which was also the
   original user of MMU notifiers) but it had been nonfunctional ever since
   calls to set_pte_at_notify were wrapped with invalidate_range_start
   and invalidate_range_end... in 2012.
 
 Selftests:
 
 * Enhance the demand paging test to allow for better reporting and stressing
   of UFFD performance.
 
 * Convert the steal time test to generate TAP-friendly output.
 
 * Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
   time across two different clock domains.
 
 * Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
 
 * Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell
   script, to play nice with running in a minimal userspace environment.
 
 * Allow skipping the RSEQ test's sanity check that the vCPU was able to
   complete a reasonable number of KVM_RUNs, as the assert can fail on a
   completely valid setup.  If the test is run on a large-ish system that is
   otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
   vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
   which results in the vCPU having very little net runtime before the next
   migration due to high wakeup latencies.
 
 * Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
   a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
   every test to #define _GNU_SOURCE is painful.
 
 * Provide a global pseudo-RNG instance for all tests, so that library code can
   generate random, but determinstic numbers.
 
 * Use the global pRNG to randomly force emulation of select writes from guest
   code on x86, e.g. to help validate KVM's emulation of locked accesses.
 
 * Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
   handlers at VM creation, instead of forcing tests to manually trigger the
   related setup.
 
 Documentation:
 
 * Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZE878UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOukQf+LcvZsWtrC7Wd5K9SQbYXaS4Rk6P6
 JHoQW2d0hUN893J2WibEw+l1J/0vn5JumqHXyZgJ7CbaMtXkWWQTwDSDLuURUKpv
 XNB3Sb17G87NH+s1tOh0tA9h5upbtlHVHvrtIwdbb9+XHgQ6HTL4uk+HdfO/p9fW
 cWBEZAKoWcCIa99Numv3pmq5vdrvBlNggwBugBS8TH69EKMw+V1Vu1SFkIdNDTQk
 NJJ28cohoP3wnwlIHaXSmU4RujipPH3Lm/xupyA5MwmzO713eq2yUqV49jzhD5/I
 MA4Ruvgrdm4wpp89N9lQMyci91u6q7R9iZfMu0tSg2qYI3UPKIdstd8sOA==
 =2lED
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Move a lot of state that was previously stored on a per vcpu basis
     into a per-CPU area, because it is only pertinent to the host while
     the vcpu is loaded. This results in better state tracking, and a
     smaller vcpu structure.

   - Add full handling of the ERET/ERETAA/ERETAB instructions in nested
     virtualisation. The last two instructions also require emulating
     part of the pointer authentication extension. As a result, the trap
     handling of pointer authentication has been greatly simplified.

   - Turn the global (and not very scalable) LPI translation cache into
     a per-ITS, scalable cache, making non directly injected LPIs much
     cheaper to make visible to the vcpu.

   - A batch of pKVM patches, mostly fixes and cleanups, as the
     upstreaming process seems to be resuming. Fingers crossed!

   - Allocate PPIs and SGIs outside of the vcpu structure, allowing for
     smaller EL2 mapping and some flexibility in implementing more or
     less than 32 private IRQs.

   - Purge stale mpidr_data if a vcpu is created after the MPIDR map has
     been created.

   - Preserve vcpu-specific ID registers across a vcpu reset.

   - Various minor cleanups and improvements.

  LoongArch:

   - Add ParaVirt IPI support

   - Add software breakpoint support

   - Add mmio trace events support

  RISC-V:

   - Support guest breakpoints using ebreak

   - Introduce per-VCPU mp_state_lock and reset_cntx_lock

   - Virtualize SBI PMU snapshot and counter overflow interrupts

   - New selftests for SBI PMU and Guest ebreak

   - Some preparatory work for both TDX and SNP page fault handling.

     This also cleans up the page fault path, so that the priorities of
     various kinds of fauls (private page, no memory, write to read-only
     slot, etc.) are easier to follow.

  x86:

   - Minimize amount of time that shadow PTEs remain in the special
     REMOVED_SPTE state.

     This is a state where the mmu_lock is held for reading but
     concurrent accesses to the PTE have to spin; shortening its use
     allows other vCPUs to repopulate the zapped region while the zapper
     finishes tearing down the old, defunct page tables.

   - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
     field, which is defined by hardware but left for software use.

     This lets KVM communicate its inability to map GPAs that set bits
     51:48 on hosts without 5-level nested page tables. Guest firmware
     is expected to use the information when mapping BARs; this avoids
     that they end up at a legal, but unmappable, GPA.

   - Fixed a bug where KVM would not reject accesses to MSR that aren't
     supposed to exist given the vCPU model and/or KVM configuration.

   - As usual, a bunch of code cleanups.

  x86 (AMD):

   - Implement a new and improved API to initialize SEV and SEV-ES VMs,
     which will also be extendable to SEV-SNP.

     The new API specifies the desired encryption in KVM_CREATE_VM and
     then separately initializes the VM. The new API also allows
     customizing the desired set of VMSA features; the features affect
     the measurement of the VM's initial state, and therefore enabling
     them cannot be done tout court by the hypervisor.

     While at it, the new API includes two bugfixes that couldn't be
     applied to the old one without a flag day in userspace or without
     affecting the initial measurement. When a SEV-ES VM is created with
     the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
     once the VMSA has been encrypted. Also, the FPU and AVX state will
     be synchronized and encrypted too.

   - Support for GHCB version 2 as applicable to SEV-ES guests.

     This, once more, is only accessible when using the new
     KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.

  x86 (Intel):

   - An initial bunch of prerequisite patches for Intel TDX were merged.

     They generally don't do anything interesting. The only somewhat
     user visible change is a new debugging mode that checks that KVM's
     MMU never triggers a #VE virtualization exception in the guest.

   - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
     VM-Exit to L1, as per the SDM.

  Generic:

   - Use vfree() instead of kvfree() for allocations that always use
     vcalloc() or __vcalloc().

   - Remove .change_pte() MMU notifier - the changes to non-KVM code are
     small and Andrew Morton asked that I also take those through the
     KVM tree.

     The callback was only ever implemented by KVM (which was also the
     original user of MMU notifiers) but it had been nonfunctional ever
     since calls to set_pte_at_notify were wrapped with
     invalidate_range_start and invalidate_range_end... in 2012.

  Selftests:

   - Enhance the demand paging test to allow for better reporting and
     stressing of UFFD performance.

   - Convert the steal time test to generate TAP-friendly output.

   - Fix a flaky false positive in the xen_shinfo_test due to comparing
     elapsed time across two different clock domains.

   - Skip the MONITOR/MWAIT test if the host doesn't actually support
     MWAIT.

   - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
     shell script, to play nice with running in a minimal userspace
     environment.

   - Allow skipping the RSEQ test's sanity check that the vCPU was able
     to complete a reasonable number of KVM_RUNs, as the assert can fail
     on a completely valid setup.

     If the test is run on a large-ish system that is otherwise idle,
     and the test isn't affined to a low-ish number of CPUs, the vCPU
     task can be repeatedly migrated to CPUs that are in deep sleep
     states, which results in the vCPU having very little net runtime
     before the next migration due to high wakeup latencies.

   - Define _GNU_SOURCE for all selftests to fix a warning that was
     introduced by a change to kselftest_harness.h late in the 6.9
     cycle, and because forcing every test to #define _GNU_SOURCE is
     painful.

   - Provide a global pseudo-RNG instance for all tests, so that library
     code can generate random, but determinstic numbers.

   - Use the global pRNG to randomly force emulation of select writes
     from guest code on x86, e.g. to help validate KVM's emulation of
     locked accesses.

   - Allocate and initialize x86's GDT, IDT, TSS, segments, and default
     exception handlers at VM creation, instead of forcing tests to
     manually trigger the related setup.

  Documentation:

   - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
  selftests/kvm: remove dead file
  KVM: selftests: arm64: Test vCPU-scoped feature ID registers
  KVM: selftests: arm64: Test that feature ID regs survive a reset
  KVM: selftests: arm64: Store expected register value in set_id_regs
  KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
  KVM: arm64: Only reset vCPU-scoped feature ID regs once
  KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
  KVM: arm64: Rename is_id_reg() to imply VM scope
  KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
  KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
  KVM: arm64: Fix hvhe/nvhe early alias parsing
  KVM: SEV: Allow per-guest configuration of GHCB protocol version
  KVM: SEV: Add GHCB handling for termination requests
  KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
  KVM: SEV: Add support to handle AP reset MSR protocol
  KVM: x86: Explicitly zero kvm_caps during vendor module load
  KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
  KVM: x86: Fully re-initialize supported_vm_types on vendor module load
  KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
  KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
  ...
2024-05-15 14:46:43 -07:00
Paolo Bonzini
31a6cd7f16 KVM VMX changes for 6.10:
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
    L1, as per the SDM.
 
  - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
    used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
    "real" exit_qualification, which is tracked elsewhere.
 
  - Add a sanity check to assert that EPT Violations are the only sources of
    nested PML Full VM-Exits.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+qzEACgkQOlYIJqCj
 N/3O0Q/9HZruiL9vzMrLBKgFgWCxQHO2fy+EixuwzVBHunQGOsVnDCO2p+PWnF0p
 kuW/MEZhZfLYnXoDi5/AP12G9qtDhlSNnfSl2gn+BMXqyGSYpcoXuM/zTjM24wLd
 PXKkPirYMpVR2+lHsD7l8YK2I+qc7UfbRkCyJegBgGwUBs13/TBD6Rum3Aa9Q+dX
 IcwjomH+MdHDFPnpfHjksA+G79Ckkqmu/DbOAlCqw1dUSC8oyV9tE/EKStSBzjZ+
 OGMSm7Kl0T+km1JyH60H1ivbUbT3gJxpezoYL9EbO25VPrdldKP+ohqbtew/8ttk
 UP/oW3mL79I7L06ZqqxZKDDj4JGvz53UhhAylZcBPw0P3v9TQF3wm59K4eM9btNt
 eyIaT0SAbcigHAniM+3FPkq443hRxDvLNF5E66Ez03HhhkEz3ZsyNH1oPnQK0Crq
 N1e+NGuKsTAPBzc3sSSrxOHnCajTUQ9WYjOpfdSgWsL6TQOmXIvHl0tE2ILrvDc/
 f+VG62veqa9CCmX5B2lUT0yX9nXvyXKwVpJY9RSQIhB46sA8zjSZsZRCQFkDI5Gx
 pzjxjcXtydAMWpn5qUvpD0B6agMlP6WUJHlu+ezmBQuSUHr+2PHY5dEj9442SusF
 98VGJy8APxDhidK5TaJJXWmDfKNhEaWboMcTnWM1TwY/qLfDsVU=
 =0ncM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.10:

 - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
   L1, as per the SDM.

 - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
   used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
   "real" exit_qualification, which is tracked elsewhere.

 - Add a sanity check to assert that EPT Violations are the only sources of
   nested PML Full VM-Exits.
2024-05-12 03:17:17 -04:00
Paolo Bonzini
4232da23d7 Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.10

1. Add ParaVirt IPI support.
2. Add software breakpoint support.
3. Add mmio trace events support.
2024-05-10 13:20:18 -04:00
Jacob Pan
2254808b53 x86/irq: Remove bitfields in posted interrupt descriptor
Mixture of bitfields and types is weird and really not intuitive, remove
bitfields and use typed data exclusively. Bitfields often result in
inferior machine code.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-4-jacob.jun.pan@linux.intel.com
Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
2024-04-30 00:54:42 +02:00
Jacob Pan
699f67512f KVM: VMX: Move posted interrupt descriptor out of VMX code
To prepare native usage of posted interrupts, move the PID declarations out
of VMX code such that they can be shared.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20240423174114.526704-2-jacob.jun.pan@linux.intel.com
2024-04-30 00:54:42 +02:00
Linus Torvalds
817772266d * Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD.  This fixes a warning
   "Unpatched return thunk in use. This should not happen!" when running
   KVM selftests.
 
 * Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
   would allow userspace to refresh the cache with a bogus GPA.  The bug has
   existed for quite some time, but was exposed by a new sanity check added in
   6.9 (to ensure a cache is either GPA-based or HVA-based).
 
 * Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
   behind during a 6.9 cleanup.
 
 * Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
   results in an array overflow (detected by KASAN).
 
 * Fix a bug where KVM incorrectly clears root_role.direct when userspace sets
   guest CPUID.
 
 * Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used
   by a nested guest, if KVM is using Page-Modification Logging and the nested
   hypervisor is NOT using EPT.
 
 x86 PMU:
 
 * Drop support for virtualizing adaptive PEBS, as KVM's implementation is
   architecturally broken without an obvious/easy path forward, and because
   exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
   host kernel addresses to the guest.
 
 * Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at
   RESET time, as done by both Intel and AMD processors.
 
 * Disable LBR virtualization on CPUs that don't support LBR callstacks, as
   KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
   perf event, and would fail on such CPUs.
 
 Tests:
 
 * Fix a flaw in the max_guest_memory selftest that results in it exhausting
   the supply of ucall structures when run with more than 256 vCPUs.
 
 * Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYjdqcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPNRAgAh1AdKBAWnq9bFN2Np1kSAcRAk3bs
 REDq/0iD1T9TvIwEmE1lHaRuqvCSO15WW+DKvbs7TS8zA0DyY7X/x8sIIy5YzZ5C
 bQ+JXiqk55OAj0sPskBpCvE5qEreuU8qAit57+8OseKWs57EICvJjrfsRnHlmIub
 pgGas3I42LjIgsuZRr2kjv+GrvaiikW+wWK6sq3CvPzTtHV196d26AK5l4NOoLkY
 0FTbBIYUSJ7wxs92xuTed5mZ7JFZdsa5DVMXF5MRZ9W6g2vZCLbqCNRddRhSAsl0
 gKmqZkuPTB7AnGQbJ2h/aKFT0ydsguzqbbKq62sK7ft5f1CUlbp9luDC9w==
 =99rq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "This is a bit on the large side, mostly due to two changes:

   - Changes to disable some broken PMU virtualization (see below for
     details under "x86 PMU")

   - Clean up SVM's enter/exit assembly code so that it can be compiled
     without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched
     return thunk in use. This should not happen!" when running KVM
     selftests.

  Everything else is small bugfixes and selftest changes:

   - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure
     where KVM would allow userspace to refresh the cache with a bogus
     GPA. The bug has existed for quite some time, but was exposed by a
     new sanity check added in 6.9 (to ensure a cache is either
     GPA-based or HVA-based).

   - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that
     got left behind during a 6.9 cleanup.

   - Fix a math goof in x86's hugepage logic for
     KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow
     (detected by KASAN).

   - Fix a bug where KVM incorrectly clears root_role.direct when
     userspace sets guest CPUID.

   - Fix a dirty logging bug in the where KVM fails to write-protect
     SPTEs used by a nested guest, if KVM is using Page-Modification
     Logging and the nested hypervisor is NOT using EPT.

  x86 PMU:

   - Drop support for virtualizing adaptive PEBS, as KVM's
     implementation is architecturally broken without an obvious/easy
     path forward, and because exposing adaptive PEBS can leak host LBRs
     to the guest, i.e. can leak host kernel addresses to the guest.

   - Set the enable bits for general purpose counters in
     PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD
     processors.

   - Disable LBR virtualization on CPUs that don't support LBR
     callstacks, as KVM unconditionally uses
     PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and
     would fail on such CPUs.

  Tests:

   - Fix a flaw in the max_guest_memory selftest that results in it
     exhausting the supply of ucall structures when run with more than
     256 vCPUs.

   - Mark KVM_MEM_READONLY as supported for RISC-V in
     set_memory_region_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits)
  KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
  KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test
  KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting
  KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()
  KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status
  KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update
  KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
  perf/x86/intel: Expose existence of callback support to KVM
  KVM: VMX: Snapshot LBR capabilities during module initialization
  KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
  KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
  KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD
  KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run()
  KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area
  KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area
  KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted
  KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run()
  KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV
  KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding
  KVM: SVM: Remove a useless zeroing of allocated memory
  ...
2024-04-20 11:10:51 -07:00
Isaku Yamahata
8131cf5b4f KVM: VMX: Introduce test mode related to EPT violation VE
To support TDX, KVM is enhanced to operate with #VE.  For TDX, KVM uses the
suppress #VE bit in EPT entries selectively, in order to be able to trap
non-present conditions.  However, #VE isn't used for VMX and it's a bug
if it happens.  To be defensive and test that VMX case isn't broken
introduce an option ept_violation_ve_test and when it's set, BUG the vm.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:21 -04:00
Paolo Bonzini
fb29541ead KVM, x86: add architectural support code for #VE
Dump the contents of the #VE info data structure and assert that #VE does
not happen, but do not yet do anything with it.

No functional change intended, separated for clarity only.

Extracted from a patch by Isaku Yamahata <isaku.yamahata@intel.com>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:20 -04:00
Sean Christopherson
2325a21ac1 KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
TDX uses different ABI to get information about VM exit.  Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS.  The eventual code will be

VMX:
  - get exit reason, intr_info, exit_qualification, and etc from VMCS
  - call NMI/INTR handlers (common code)

TDX:
  - get exit reason, intr_info, exit_qualification, and etc from guest
    registers
  - call NMI/INTR handlers (common code)

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:42:24 -04:00
Paolo Bonzini
5f18c642ff KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM.  TDX doesn't allow VMM to operate VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures.  This means we must have a TDX
version of kvm_x86_ops.

The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure.  To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.

To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX.  Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.

The eventually converted code will look like this:

vmx.c:
  vmx_op() { ... }
  VMX initialization
tdx.c:
  tdx_op() { ... }
  TDX initialization
x86_ops.h:
  vmx_op();
  tdx_op();
main.c:
  static vt_op() { if (tdx) tdx_op() else vmx_op() }
  static struct kvm_x86_ops vt_x86_ops = {
        .op = vt_op,
  initialization functions call both VMX and TDX initialization

Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:42:24 -04:00
Sean Christopherson
bb9dc85908 KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
Disable LBR virtualization if the CPU doesn't support callstacks, which
were introduced in HSW (see commit e9d7f7cd97 ("perf/x86/intel: Add
basic Haswell LBR call stack support"), as KVM unconditionally configures
the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR
virtualization always fails on pre-HSW CPUs.

Simply disable LBR support on such CPUs, as it has never worked, i.e.
there is no risk of breaking an existing setup, and figuring out a way
to performantly context switch LBRs on old CPUs is not worth the effort.

Fixes: be635e34c2 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:48 -07:00
Sean Christopherson
447112d7ed KVM: VMX: Snapshot LBR capabilities during module initialization
Snapshot VMX's LBR capabilities once during module initialization instead
of calling into perf every time a vCPU reconfigures its vPMU.  This will
allow massaging the LBR capabilities, e.g. if the CPU doesn't support
callstacks, without having to remember to update multiple locations.

Opportunistically tag vmx_get_perf_capabilities() with __init, as it's
only called from vmx_set_cpu_caps().

Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:46 -07:00
Sean Christopherson
23ffe4bbf8 KVM: nVMX: Add a sanity check that nested PML Full stems from EPT Violations
Add a WARN_ON_ONCE() sanity check to verify that a nested PML Full VM-Exit
is only synthesized when the original VM-Exit from L2 was an EPT Violation.
While KVM can fallthrough to kvm_mmu_do_page_fault() if an EPT Misconfig
occurs on a stale MMIO SPTE, KVM should not treat the access as a write
(there isn't enough information to know *what* the access was), i.e. KVM
should never try to insert a PML entry in that case.

Link: https://lore.kernel.org/r/20240209221700.393189-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:24:36 -07:00
Sean Christopherson
a946607868 KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exception
Move the exit_qualification field that is used to track information about
in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception",
i.e. associate the information with the actual nEPT violation instead of
the vCPU.  To handle bits that are pulled from vmcs.EXIT_QUALIFICATION,
i.e. that are propagated from the "original" EPT violation VM-Exit, simply
grab them from the VMCS on-demand when injecting a nEPT Violation or a PML
Full VM-exit.

Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch
is outright dangerous, e.g. see commit d7f0a00e43 ("KVM: VMX: Report
up-to-date exit qualification to userspace").

Opportunstically add a comment to call out that PML Full and EPT Violation
VM-Exits use the same bit to report NMI blocking information.

Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:24:36 -07:00
Sean Christopherson
0c47651403 KVM: nVMX: Clear EXIT_QUALIFICATION when injecting an EPT Misconfig
Explicitly clear the EXIT_QUALIFCATION field when injecting an EPT
misconfig into L1, as required by the VMX architecture.  Per the SDM:

  This field is saved for VM exits due to the following causes:
  debug exceptions; page-fault exceptions; start-up IPIs (SIPIs);
  system-management interrupts (SMIs) that arrive immediately after the
  execution of I/O instructions; task switches; INVEPT; INVLPG; INVPCID;
  INVVPID; LGDT; LIDT; LLDT; LTR; SGDT; SIDT; SLDT; STR; VMCLEAR; VMPTRLD;
  VMPTRST; VMREAD; VMWRITE; VMXON; WBINVD; WBNOINVD; XRSTORS; XSAVES;
  control-register accesses; MOV DR; I/O instructions; MWAIT; accesses to
  the APIC-access page; EPT violations; EOI virtualization; APIC-write
  emulation; page-modification log full; SPP-related events; and
  instruction timeout. For all other VM exits, this field is cleared.

Generating EXIT_QUALIFICATION from vcpu->arch.exit_qualification is wrong
for all (two) paths that lead to nested_ept_inject_page_fault().  For EPT
violations (the common case), vcpu->arch.exit_qualification will have been
set by handle_ept_violation() to vmcs02.EXIT_QUALIFICATION, i.e. contains
the information of a EPT violation and thus is likely non-zero.

For an EPT misconfig, which can reach FNAME(walk_addr_generic) and thus
inject a nEPT misconfig if KVM created an MMIO SPTE that became stale,
vcpu->arch.exit_qualification will hold the information from the last EPT
violation VM-Exit, as vcpu->arch.exit_qualification is _only_ written by
handle_ept_violation().

Fixes: 4704d0befb ("KVM: nVMX: Exiting from L2 to L1")
Link: https://lore.kernel.org/r/20240209221700.393189-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:24:36 -07:00
Tao Su
7f2817ef52 KVM: VMX: Ignore MKTME KeyID bits when intercepting #PF for allow_smaller_maxphyaddr
Use the raw/true host.MAXPHYADDR when deciding whether or not KVM must
intercept #PFs when allow_smaller_maxphyaddr is enabled, as any adjustments
the kernel makes to boot_cpu_data.x86_phys_bits to account for MKTME KeyID
bits do not apply to the guest physical address space.  I.e. the KeyID are
off-limits for host physical addresses, but are not reserved for GPAs as
far as hardware is concerned.

Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240319031111.495006-1-tao1.su@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08 14:22:10 -07:00
Sean Christopherson
9e985cbf29 KVM: x86/pmu: Disable support for adaptive PEBS
Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.

Bug #1 is that KVM doesn't account for the upper 32 bits of
IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g
fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters()
stores local variables as u8s and truncates the upper bits too, etc.

Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value
for PEBS events, perf will _always_ generate an adaptive record, even if
the guest requested a basic record.  Note, KVM will also enable adaptive
PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the
guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero,
i.e. the guest will only ever see Basic records.

Bug #3 is in perf.  intel_pmu_disable_fixed() doesn't clear the upper
bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and
intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE
either.  I.e. perf _always_ enables ADAPTIVE counters, regardless of what
KVM requests.

Bug #4 is that adaptive PEBS *might* effectively bypass event filters set
by the host, as "Updated Memory Access Info Group" records information
that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER.

Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least
zeros) when entering a vCPU with adaptive PEBS, which allows the guest
to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries"
records.

Disable adaptive PEBS support as an immediate fix due to the severity of
the LBR leak in particular, and because fixing all of the bugs will be
non-trivial, e.g. not suitable for backporting to stable kernels.

Note!  This will break live migration, but trying to make KVM play nice
with live migration would be quite complicated, wouldn't be guaranteed to
work (i.e. KVM might still kill/confuse the guest), and it's not clear
that there are any publicly available VMMs that support adaptive PEBS,
let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't
support PEBS in any capacity.

Link: https://lore.kernel.org/all/20240306230153.786365-1-seanjc@google.com
Link: https://lore.kernel.org/all/ZeepGjHCeSfadANM@google.com
Fixes: c59a1f106f ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: stable@vger.kernel.org
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhang Xiong <xiong.y.zhang@intel.com>
Cc: Lv Zhiyuan <zhiyuan.lv@intel.com>
Cc: Dapeng Mi <dapeng1.mi@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Acked-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20240307005833.827147-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08 13:20:25 -07:00
Pawan Gupta
95a6ccbdc7 x86/bhi: Mitigate KVM by default
BHI mitigation mode spectre_bhi=auto does not deploy the software
mitigation by default. In a cloud environment, it is a likely scenario
where userspace is trusted but the guests are not trusted. Deploying
system wide mitigation in such cases is not desirable.

Update the auto mode to unconditionally mitigate against malicious
guests. Deploy the software sequence at VMexit in auto mode also, when
hardware mitigation is not available. Unlike the force =on mode,
software sequence is not deployed at syscalls in auto mode.

Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:06 +02:00
Pawan Gupta
7390db8aea x86/bhi: Add support for clearing branch history at syscall entry
Branch History Injection (BHI) attacks may allow a malicious application to
influence indirect branch prediction in kernel by poisoning the branch
history. eIBRS isolates indirect branch targets in ring0.  The BHB can
still influence the choice of indirect branch predictor entry, and although
branch predictor entries are isolated between modes when eIBRS is enabled,
the BHB itself is not isolated between modes.

Alder Lake and new processors supports a hardware control BHI_DIS_S to
mitigate BHI.  For older processors Intel has released a software sequence
to clear the branch history on parts that don't support BHI_DIS_S. Add
support to execute the software sequence at syscall entry and VMexit to
overwrite the branch history.

For now, branch history is not cleared at interrupt entry, as malicious
applications are not believed to have sufficient control over the
registers, since previous register state is cleared at interrupt
entry. Researchers continue to poke at this area and it may become
necessary to clear at interrupt entry as well in the future.

This mitigation is only defined here. It is enabled later.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Linus Torvalds
4f712ee0cb S390:
* Changes to FPU handling came in via the main s390 pull request
 
 * Only deliver to the guest the SCLP events that userspace has
   requested.
 
 * More virtual vs physical address fixes (only a cleanup since
   virtual and physical address spaces are currently the same).
 
 * Fix selftests undefined behavior.
 
 x86:
 
 * Fix a restriction that the guest can't program a PMU event whose
   encoding matches an architectural event that isn't included in the
   guest CPUID.  The enumeration of an architectural event only says
   that if a CPU supports an architectural event, then the event can be
   programmed *using the architectural encoding*.  The enumeration does
   NOT say anything about the encoding when the CPU doesn't report support
   the event *in general*.  It might support it, and it might support it
   using the same encoding that made it into the architectural PMU spec.
 
 * Fix a variety of bugs in KVM's emulation of RDPMC (more details on
   individual commits) and add a selftest to verify KVM correctly emulates
   RDMPC, counter availability, and a variety of other PMC-related
   behaviors that depend on guest CPUID and therefore are easier to
   validate with selftests than with custom guests (aka kvm-unit-tests).
 
 * Zero out PMU state on AMD if the virtual PMU is disabled, it does not
   cause any bug but it wastes time in various cases where KVM would check
   if a PMC event needs to be synthesized.
 
 * Optimize triggering of emulated events, with a nice ~10% performance
   improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
   guest.
 
 * Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
 
 * Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace with an internal error exit code.
 
 * Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
 
 * Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
 
 * Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB).  KVM
   doesn't support yielding in the middle of processing a zap, and 1GiB
   granularity resulted in multi-millisecond lags that are quite impolite
   for CONFIG_PREEMPT kernels.
 
 * Allocate write-tracking metadata on-demand to avoid the memory overhead when
   a kernel is built with i915 virtualization support but the workloads use
   neither shadow paging nor i915 virtualization.
 
 * Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives.
 
 * Fix the debugregs ABI for 32-bit KVM.
 
 * Rework the "force immediate exit" code so that vendor code ultimately decides
   how and when to force the exit, which allowed some optimization for both
   Intel and AMD.
 
 * Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, causing extra unnecessary work.
 
 * Cleanup the logic for checking if the currently loaded vCPU is in-kernel.
 
 * Harden against underflowing the active mmu_notifier invalidation
   count, so that "bad" invalidations (usually due to bugs elsehwere in the
   kernel) are detected earlier and are less likely to hang the kernel.
 
 x86 Xen emulation:
 
 * Overlay pages can now be cached based on host virtual address,
   instead of guest physical addresses.  This removes the need to
   reconfigure and invalidate the cache if the guest changes the
   gpa but the underlying host virtual address remains the same.
 
 * When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.
 
 * Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).
 
 * Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.
 
 RISC-V:
 
 * Support exception and interrupt handling in selftests
 
 * New self test for RISC-V architectural timer (Sstc extension)
 
 * New extension support (Ztso, Zacas)
 
 * Support userspace emulation of random number seed CSRs.
 
 ARM:
 
 * Infrastructure for building KVM's trap configuration based on the
   architectural features (or lack thereof) advertised in the VM's ID
   registers
 
 * Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
   x86's WC) at stage-2, improving the performance of interacting with
   assigned devices that can tolerate it
 
 * Conversion of KVM's representation of LPIs to an xarray, utilized to
   address serialization some of the serialization on the LPI injection
   path
 
 * Support for _architectural_ VHE-only systems, advertised through the
   absence of FEAT_E2H0 in the CPU's ID register
 
 * Miscellaneous cleanups, fixes, and spelling corrections to KVM and
   selftests
 
 LoongArch:
 
 * Set reserved bits as zero in CPUCFG.
 
 * Start SW timer only when vcpu is blocking.
 
 * Do not restart SW timer when it is expired.
 
 * Remove unnecessary CSR register saving during enter guest.
 
 * Misc cleanups and fixes as usual.
 
 Generic:
 
 * cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always
   true on all architectures except MIPS (where Kconfig determines the
   available depending on CPU capabilities).  It is replaced either by
   an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM)
   everywhere else.
 
 * Factor common "select" statements in common code instead of requiring
   each architecture to specify it
 
 * Remove thoroughly obsolete APIs from the uapi headers.
 
 * Move architecture-dependent stuff to uapi/asm/kvm.h
 
 * Always flush the async page fault workqueue when a work item is being
   removed, especially during vCPU destruction, to ensure that there are no
   workers running in KVM code when all references to KVM-the-module are gone,
   i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
 
 * Grab a reference to the VM's mm_struct in the async #PF worker itself instead
   of gifting the worker a reference, so that there's no need to remember
   to *conditionally* clean up after the worker.
 
 Selftests:
 
 * Reduce boilerplate especially when utilize selftest TAP infrastructure.
 
 * Add basic smoke tests for SEV and SEV-ES, along with a pile of library
   support for handling private/encrypted/protected memory.
 
 * Fix benign bugs where tests neglect to close() guest_memfd files.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z
 eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP
 6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd
 j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK
 Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y
 5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA==
 =mqOV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Linus Torvalds
685d982112 Core x86 changes for v6.9:
- The biggest change is the rework of the percpu code,
   to support the 'Named Address Spaces' GCC feature,
   by Uros Bizjak:
 
    - This allows C code to access GS and FS segment relative
      memory via variables declared with such attributes,
      which allows the compiler to better optimize those accesses
      than the previous inline assembly code.
 
    - The series also includes a number of micro-optimizations
      for various percpu access methods, plus a number of
      cleanups of %gs accesses in assembly code.
 
    - These changes have been exposed to linux-next testing for
      the last ~5 months, with no known regressions in this area.
 
 - Fix/clean up __switch_to()'s broken but accidentally
   working handling of FPU switching - which also generates
   better code.
 
 - Propagate more RIP-relative addressing in assembly code,
   to generate slightly better code.
 
 - Rework the CPU mitigations Kconfig space to be less idiosyncratic,
   to make it easier for distros to follow & maintain these options.
 
 - Rework the x86 idle code to cure RCU violations and
   to clean up the logic.
 
 - Clean up the vDSO Makefile logic.
 
 - Misc cleanups and fixes.
 
 [ Please note that there's a higher number of merge commits in
   this branch (three) than is usual in x86 topic trees. This happened
   due to the long testing lifecycle of the percpu changes that
   involved 3 merge windows, which generated a longer history
   and various interactions with other core x86 changes that we
   felt better about to carry in a single branch. ]
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvB0gRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jUqRAAqnEQPiabF5acQlHrwviX+cjSobDlqtH5
 9q2AQy9qaEHapzD0XMOxvFye6XIvehGOGxSPvk6CoviSxBND8rb56lvnsEZuLeBV
 Bo5QSIL2x42Zrvo11iPHwgXZfTIusU90sBuKDRFkYBAxY3HK2naMDZe8MAsYCUE9
 nwgHF8DDc/NYiSOXV8kosWoWpNIkoK/STyH5bvTQZMqZcwyZ49AIeP1jGZb/prbC
 e/rbnlrq5Eu6brpM7xo9kELO0Vhd34urV14KrrIpdkmUKytW2KIsyvW8D6fqgDBj
 NSaQLLcz0pCXbhF+8Nqvdh/1coR4L7Ymt08P1rfEjCsQgb/2WnSAGUQuC5JoGzaj
 ngkbFcZllIbD9gNzMQ1n4Aw5TiO+l9zxCqPC/r58Uuvstr+K9QKlwnp2+B3Q73Ft
 rojIJ04NJL6lCHdDgwAjTTks+TD2PT/eBWsDfJ/1pnUWttmv9IjMpnXD5sbHxoiU
 2RGGKnYbxXczYdq/ALYDWM6JXpfnJZcXL3jJi0IDcCSsb92xRvTANYFHnTfyzGfw
 EHkhbF4e4Vy9f6QOkSP3CvW5H26BmZS9DKG0J9Il5R3u2lKdfbb5vmtUmVTqHmAD
 Ulo5cWZjEznlWCAYSI/aIidmBsp9OAEvYd+X7Z5SBIgTfSqV7VWHGt0BfA1heiVv
 F/mednG0gGc=
 =3v4F
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory
        via variables declared with such attributes, which allows the
        compiler to better optimize those accesses than the previous
        inline assembly code.

      - The series also includes a number of micro-optimizations for
        various percpu access methods, plus a number of cleanups of %gs
        accesses in assembly code.

      - These changes have been exposed to linux-next testing for the
        last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK              => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO             => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY       => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY      => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS                  => CONFIG_MITIGATION_SLS
  ...
2024-03-11 19:53:15 -07:00
Linus Torvalds
720c857907 Support for x86 Fast Return and Event Delivery (FRED):
FRED is a replacement for IDT event delivery on x86 and addresses most of
 the technical nightmares which IDT exposes:
 
  1) Exception cause registers like CR2 need to be manually preserved in
     nested exception scenarios.
 
  2) Hardware interrupt stack switching is suboptimal for nested exceptions
     as the interrupt stack mechanism rewinds the stack on each entry which
     requires a massive effort in the low level entry of #NMI code to handle
     this.
 
  3) No hardware distinction between entry from kernel or from user which
     makes establishing kernel context more complex than it needs to be
     especially for unconditionally nestable exceptions like NMI.
 
  4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a
     problem when the perf NMI takes a fault when collecting a stack trace.
 
  5) Partial restore of ESP when returning to a 16-bit segment
 
  6) Limitation of the vector space which can cause vector exhaustion on
     large systems.
 
  7) Inability to differentiate NMI sources
 
 FRED addresses these shortcomings by:
 
  1) An extended exception stack frame which the CPU uses to save exception
     cause registers. This ensures that the meta information for each
     exception is preserved on stack and avoids the extra complexity of
     preserving it in software.
 
  2) Hardware interrupt stack switching is non-rewinding if a nested
     exception uses the currently interrupt stack.
 
  3) The entry points for kernel and user context are separate and GS BASE
     handling which is required to establish kernel context for per CPU
     variable access is done in hardware.
 
  4) NMIs are now nesting protected. They are only reenabled on the return
     from NMI.
 
  5) FRED guarantees full restore of ESP
 
  6) FRED does not put a limitation on the vector space by design because it
     uses a central entry points for kernel and user space and the CPUstores
     the entry type (exception, trap, interrupt, syscall) on the entry stack
     along with the vector number. The entry code has to demultiplex this
     information, but this removes the vector space restriction.
 
     The first hardware implementations will still have the current
     restricted vector space because lifting this limitation requires
     further changes to the local APIC.
 
  7) FRED stores the vector number and meta information on stack which
     allows having more than one NMI vector in future hardware when the
     required local APIC changes are in place.
 
 The series implements the initial FRED support by:
 
  - Reworking the existing entry and IDT handling infrastructure to
    accomodate for the alternative entry mechanism.
 
  - Expanding the stack frame to accomodate for the extra 16 bytes FRED
    requires to store context and meta information
 
  - Providing FRED specific C entry points for events which have information
    pushed to the extended stack frame, e.g. #PF and #DB.
 
  - Providing FRED specific C entry points for #NMI and #MCE
 
  - Implementing the FRED specific ASM entry points and the C code to
    demultiplex the events
 
  - Providing detection and initialization mechanisms and the necessary
    tweaks in context switching, GS BASE handling etc.
 
 The FRED integration aims for maximum code reuse vs. the existing IDT
 implementation to the extent possible and the deviation in hot paths like
 context switching are handled with alternatives to minimalize the
 impact. The low level entry and exit paths are seperate due to the extended
 stack frame and the hardware based GS BASE swichting and therefore have no
 impact on IDT based systems.
 
 It has been extensively tested on existing systems and on the FRED
 simulation and as of now there are know outstanding problems.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuKPgTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWyUEACevJMHU+Ot9zqBPizSWxByM1uunHbp
 bjQXhaFeskd3mt7k7HU6GsPRSmC3q4lliP1Y9ypfbU0DvYSI2h/PhMWizjhmot2y
 nIvFpl51r/NsI+JHx1oXcFetz0eGHEqBui/4YQ/swgOCMymYgfqgHhazXTdldV3g
 KpH9/8W3AeGvw79uzXFH9tjBzTkbvywpam3v0LYNDJWTCuDkilyo8PjhsgRZD4x3
 V9f1nLD7nSHZW8XLoktdJJ38bKwI2Lhao91NQ0ErwopekA4/9WphZEKsDpidUSXJ
 sn1O148oQ8X92IO2OaQje8XC5pLGr5GqQBGPWzRH56P/Vd3+WOwBxaFoU6Drxc5s
 tIe23ZjkVcpA8EEG7BQBZV1Un/NX7XaCCnMniOt0RauXw+1NaslX7t/tnUAh5F1V
 TWCH4D0I0oJ0qJ7kNliGn2BP3agYXOVg81xVEUjT6KfHcYU4ImUrwi+BkeNXuXtL
 Ch5ADnbYAcUjWLFnAmEmaRtfmfNGY5T7PeGFHW2RRkaOJ88v5g14Voo6gPJaDUPn
 wMQ0nLq1xN4xZWF6ZgfRqAhArvh20k38ZujRku5vXEqnhOugQ76TF2UYiFEwOXbQ
 8jcM+yEBLGgBz7tGMwmIAml6kfxaFF1KPpdrtcPxNkGlbE6KTSuIolLx2YGUvlSU
 6/O8nwZy49ckmQ==
 =Ib7w
 -----END PGP SIGNATURE-----

Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 FRED support from Thomas Gleixner:
 "Support for x86 Fast Return and Event Delivery (FRED).

  FRED is a replacement for IDT event delivery on x86 and addresses most
  of the technical nightmares which IDT exposes:

   1) Exception cause registers like CR2 need to be manually preserved
      in nested exception scenarios.

   2) Hardware interrupt stack switching is suboptimal for nested
      exceptions as the interrupt stack mechanism rewinds the stack on
      each entry which requires a massive effort in the low level entry
      of #NMI code to handle this.

   3) No hardware distinction between entry from kernel or from user
      which makes establishing kernel context more complex than it needs
      to be especially for unconditionally nestable exceptions like NMI.

   4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
      is a problem when the perf NMI takes a fault when collecting a
      stack trace.

   5) Partial restore of ESP when returning to a 16-bit segment

   6) Limitation of the vector space which can cause vector exhaustion
      on large systems.

   7) Inability to differentiate NMI sources

  FRED addresses these shortcomings by:

   1) An extended exception stack frame which the CPU uses to save
      exception cause registers. This ensures that the meta information
      for each exception is preserved on stack and avoids the extra
      complexity of preserving it in software.

   2) Hardware interrupt stack switching is non-rewinding if a nested
      exception uses the currently interrupt stack.

   3) The entry points for kernel and user context are separate and GS
      BASE handling which is required to establish kernel context for
      per CPU variable access is done in hardware.

   4) NMIs are now nesting protected. They are only reenabled on the
      return from NMI.

   5) FRED guarantees full restore of ESP

   6) FRED does not put a limitation on the vector space by design
      because it uses a central entry points for kernel and user space
      and the CPUstores the entry type (exception, trap, interrupt,
      syscall) on the entry stack along with the vector number. The
      entry code has to demultiplex this information, but this removes
      the vector space restriction.

      The first hardware implementations will still have the current
      restricted vector space because lifting this limitation requires
      further changes to the local APIC.

   7) FRED stores the vector number and meta information on stack which
      allows having more than one NMI vector in future hardware when the
      required local APIC changes are in place.

  The series implements the initial FRED support by:

   - Reworking the existing entry and IDT handling infrastructure to
     accomodate for the alternative entry mechanism.

   - Expanding the stack frame to accomodate for the extra 16 bytes FRED
     requires to store context and meta information

   - Providing FRED specific C entry points for events which have
     information pushed to the extended stack frame, e.g. #PF and #DB.

   - Providing FRED specific C entry points for #NMI and #MCE

   - Implementing the FRED specific ASM entry points and the C code to
     demultiplex the events

   - Providing detection and initialization mechanisms and the necessary
     tweaks in context switching, GS BASE handling etc.

  The FRED integration aims for maximum code reuse vs the existing IDT
  implementation to the extent possible and the deviation in hot paths
  like context switching are handled with alternatives to minimalize the
  impact. The low level entry and exit paths are seperate due to the
  extended stack frame and the hardware based GS BASE swichting and
  therefore have no impact on IDT based systems.

  It has been extensively tested on existing systems and on the FRED
  simulation and as of now there are no outstanding problems"

* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/fred: Fix init_task thread stack pointer initialization
  MAINTAINERS: Add a maintainer entry for FRED
  x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
  x86/fred: Invoke FRED initialization code to enable FRED
  x86/fred: Add FRED initialization functions
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: No ESPFIX needed when FRED is enabled
  ...
2024-03-11 16:00:17 -07:00
Paolo Bonzini
e9025cdd8c KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing
    fixed counters and architectural event encodings based on whether or not
    guest CPUID reports support for the _architectural_ encoding.
 
  - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
    priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
 
  - Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
    and a variety of other PMC-related behaviors that depend on guest CPUID,
    i.e. are difficult to validate via KVM-Unit-Tests.
 
  - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
    cycles, e.g. when checking if a PMC event needs to be synthesized when
    skipping an instruction.
 
  - Optimize triggering of emulated events, e.g. for "count instructions" events
    when skipping an instruction, which yields a ~10% performance improvement in
    VM-Exit microbenchmarks when a vPMU is exposed to the guest.
 
  - Tighten the check for "PMI in guest" to reduce false positives if an NMI
    arrives in the host while KVM is handling an IRQ VM-Exit.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUFQACgkQOlYIJqCj
 N/11dhAAnr9e6mPmXvaH4YKcvOGgTmwIQdi5W4IBzGm27ErEb0Vyskx3UATRhRm+
 gZyp3wNgEA9LeifICDNu4ypn7HZcl2VtRql6FYcB8Bcu8OiHfU8PhWL0/qrpY20e
 zffUj2tDweq2ft9Iks1SQJD0sxFkcXIcSKOffP7pRZJHFTKLltGORXwxzd9HJHPY
 nc4nERKegK2yH4A4gY6nZ0oV5L3OMUNHx815db5Y+HxXOIjBCjTQiNNd6mUdyX1N
 C5sIiElXLdvRTSDvirHfA32LqNwnajDGox4QKZkB3wszCxJ3kRd4OCkTEKMYKHxd
 KoKCJQnAdJFFW9xqbT8nNKXZ+hg2+ZQuoSaBuwKryf7jWi0e6a7jcV0OH+cQSZw7
 UNudKhs3r4ambfvnFp2IVZlZREMDB+LAjo2So48Jn/JGCAzqte3XqwVKskn9pS9S
 qeauXCdOLioZALYtTBl8RM1rEY5mbwQrpPv9CzbeU09qQ/hpXV14W9GmbyeOZcI1
 T1cYgEqlLuifRluwT/hxrY321+4noF116gSK1yb07x/sJU8/lhRooEk9V562066E
 qo6nIvc7Bv9gTGLwo6VReKSPcTT/6t3HwgPsRjqe+evso3EFN9f9hG+uPxtO6TUj
 pdPm3mkj2KfxDdJLf+Ys16gyGdiwI0ZImIkA0uLdM0zftNsrb4Y=
 =vayI
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.9:

 - Fix several bugs where KVM speciously prevents the guest from utilizing
   fixed counters and architectural event encodings based on whether or not
   guest CPUID reports support for the _architectural_ encoding.

 - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
   priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.

 - Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
   and a variety of other PMC-related behaviors that depend on guest CPUID,
   i.e. are difficult to validate via KVM-Unit-Tests.

 - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
   cycles, e.g. when checking if a PMC event needs to be synthesized when
   skipping an instruction.

 - Optimize triggering of emulated events, e.g. for "count instructions" events
   when skipping an instruction, which yields a ~10% performance improvement in
   VM-Exit microbenchmarks when a vPMU is exposed to the guest.

 - Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11 10:41:09 -04:00
Paolo Bonzini
b00471a552 KVM VMX changes for 6.9:
- Fix a bug where KVM would report stale/bogus exit qualification information
    when exiting to userspace due to an unexpected VM-Exit while the CPU was
    vectoring an exception.
 
  - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
 
  - Clean up the logic for massaging the passthrough MSR bitmaps when userspace
    changes its MSR filter.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUh4ACgkQOlYIJqCj
 N/2CRA//VKa4KE8zgF3xM6Btyt2NegPgmGYVyhmMHTvARyHDIl5nURy++uXseb4f
 UXQLGcoGS+CIiaMohQFhCOjoNvv/2LR9P72qVV2WjQjFxVGBchybz8bjrqIDSSvY
 TuiPJApIfZtLryFFcowo8jLEBQv3JKgfgn9r2hBwVDcYP13wSz0Z4AWntHIqBxNa
 DW75wo7wnBFzy2RfUdtAgucpbmEihqSoKA+YjUT+0GRLBI7rWbFxdEKqe3BIM/7n
 4NoJXbOmw7mlhTNumZYsF5sKiyOihBOdtUL1TDgKWjJScgmwG+KCSvrp5Ko4PZpo
 uyWWcIbskQ+cTO6dHDoIJTVPsCDxo3PgVJKG1T60CV68NavwxXCUGri1n1ZNyYH/
 bXxEW7dTGHX0TDSt3dcyVOYdZFHbaIbqpu1EXlrzBm1hnruQ1C1uBQGHZ/X+Yo86
 0qdq9SgXJ48tykr8BDruIHMy0Q8jbXxl67oXR0CdRjJGM+H9f+7RefnRN9wPYFhy
 n6Hl3kbezwCZb+8RO34Hq2CpKzNlKRHlJDhUq1ypd2vXPw8FDq1aQYKih0jAzyJQ
 yCdUueBJgo8OJtSL4HGEHvgkLHR4/XERgbCOxpSYNbqIjahAwNtbfHryUnJRWRb5
 V3QczG/TtGfEpVblbRzn3Atbft4rM5a9Z3+s0siB6C2w8wyPmZg=
 =oJso
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.9:

 - Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace due to an unexpected VM-Exit while the CPU was
   vectoring an exception.

 - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.

 - Clean up the logic for massaging the passthrough MSR bitmaps when userspace
   changes its MSR filter.
2024-03-11 10:31:29 -04:00
Paolo Bonzini
c9cd0beae9 KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that
    triggered KMSAN false positives (though in fairness in KMSAN, it's comically
    difficult to see that the uninitialized memory is never truly consumed).
 
  - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
    DR6 and DR7.
 
  - Rework the "force immediate exit" code so that vendor code ultimately
    decides how and when to force the exit.  This allows VMX to further optimize
    handling preemption timer exits, and allows SVM to avoid sending a duplicate
    IPI (SVM also has a need to force an exit).
 
  - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
    vCPU creation ultimately failed, and add WARN to guard against similar bugs.
 
  - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
    (for directed yield), and simplify the logic for checking if the currently
    loaded vCPU is in-kernel.
 
  - Misc cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrRjQACgkQOlYIJqCj
 N/2Dzw//b+ptSBAl1kGBRmk/DqsX7J9ZkQYCQOTeh1vXiUM+XRTSQoArN0Oo1roy
 3wcEnQ0beVw7jMuzZ8UUuTfU8WUMja/kwltnqXYNHwLnb6yH0I/BIengXWdUdAMc
 FmgPZ4qJR2IzKYzvDsc3eEQ515O8UHWakyVDnmLBtiakAeBcUTYceHpEEPpzE5y5
 ODASTQKM9o/h8R8JwKFTJ8/mrOLNcsu5SycwFdnmubLJCrNWtJWTijA6y1lh6shn
 hbEJex+ESoC2v8p7IP53u1SGJubVlPajt+RkYJtlEI3WVsevp024eYcF4nb1OjXi
 qS2Y3W7DQGWvyCBoSzoMY+9nRMgyOOpHYetdiz+9oZOmnjiYWY0ku59U7Gv+Aotj
 AUbCn4Ry/OpqsuZ7Oo7i3IT8R7uzsTeNNdxhYBn1OQquBEZ0KBYXlZkGfTk9K0t0
 Fhka/5Zu6fBlg5J+zCyaXUGmsGWBo/9HxsC5z1JuKo8fatro5qyqYE5KiM01dkqc
 6FET6gL+fFprC5c67JGRPdEtk6F9Emb+6oiTTA8/8q8JQQAKiJKk95Nlq7KzPfVS
 A5RQPTuTJ7acE/5CY4zB1DdxCjqgnonBEA2ULnA/J10Rk8orHJRnGJcEwKEyDrZh
 HpsxIIqt++i8KffORpCym6zSAVYuQjn1mu7MGth+zuCqhcEpBfc=
 =GX0O
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.9:

 - Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives (though in fairness in KMSAN, it's comically
   difficult to see that the uninitialized memory is never truly consumed).

 - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
   DR6 and DR7.

 - Rework the "force immediate exit" code so that vendor code ultimately
   decides how and when to force the exit.  This allows VMX to further optimize
   handling preemption timer exits, and allows SVM to avoid sending a duplicate
   IPI (SVM also has a need to force an exit).

 - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, and add WARN to guard against similar bugs.

 - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
   (for directed yield), and simplify the logic for checking if the currently
   loaded vCPU is in-kernel.

 - Misc cleanups and fixes.
2024-03-11 10:24:56 -04:00
Sean Christopherson
259720c37d KVM: VMX: Combine "check" and "get" APIs for passthrough MSR lookups
Combine possible_passthrough_msr_slot() and is_valid_passthrough_msr()
into a single function, vmx_get_passthrough_msr_slot(), and have the
combined helper return the slot on success, using a negative value to
indicate "failure".

Combining the operations avoids iterating over the array of passthrough
MSRs twice for relevant MSRs.

Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20240223202104.3330974-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 12:29:46 -08:00
Dongli Zhang
bab22040d7 KVM: VMX: return early if msr_bitmap is not supported
The vmx_msr_filter_changed() may directly/indirectly calls only
vmx_enable_intercept_for_msr() or vmx_disable_intercept_for_msr(). Those
two functions may exit immediately if !cpu_has_vmx_msr_bitmap().

vmx_msr_filter_changed()
-> vmx_disable_intercept_for_msr()
-> pt_update_intercept_for_msr()
   -> vmx_set_intercept_for_msr()
      -> vmx_enable_intercept_for_msr()
      -> vmx_disable_intercept_for_msr()

Therefore, we exit early if !cpu_has_vmx_msr_bitmap().

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20240223202104.3330974-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 09:36:48 -08:00
Dongli Zhang
8e24eeedfd KVM: VMX: fix comment to add LBR to passthrough MSRs
According to the is_valid_passthrough_msr(), the LBR MSRs are also
passthrough MSRs, since the commit 1b5ac3226a ("KVM: vmx/pmu:
Pass-through LBR msrs when the guest LBR event is ACTIVE").

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20240223202104.3330974-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 09:36:48 -08:00
Sean Christopherson
0ec3d6d1f1 KVM: x86: Fully defer to vendor code to decide how to force immediate exit
Now that vmx->req_immediate_exit is used only in the scope of
vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp
the VMX preemption to force a VM-Exit and let vendor code fully handle
forcing a VM-Exit.

Opportunsitically drop __kvm_request_immediate_exit() and just have
vendor code call smp_send_reschedule() directly.  SVM already does this
when injecting an event while also trying to single-step an IRET, i.e.
it's not exactly secret knowledge that KVM uses a reschedule IPI to force
an exit.

Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:41 -08:00
Sean Christopherson
7b3d1bbf8d KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2
Eat VMX treemption timer exits in the fastpath regardless of whether L1 or
L2 is active.  The VM-Exit is 100% KVM-induced, i.e. there is nothing
directly related to the exit that KVM needs to do on behalf of the guest,
thus there is no reason to wait until the slow path to do nothing.

Opportunistically add comments explaining why preemption timer exits for
emulating the guest's APIC timer need to go down the slow path.

Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:37 -08:00
Sean Christopherson
bf1a49436e KVM: x86: Move handling of is_guest_mode() into fastpath exit handlers
Let the fastpath code decide which exits can/can't be handled in the
fastpath when L2 is active, e.g. when KVM generates a VMX preemption
timer exit to forcefully regain control, there is no "work" to be done and
so such exits can be handled in the fastpath regardless of whether L1 or
L2 is active.

Moving the is_guest_mode() check into the fastpath code also makes it
easier to see that L2 isn't allowed to use the fastpath in most cases,
e.g. it's not immediately obvious why handle_fastpath_preemption_timer()
is called from the fastpath and the normal path.

Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
11776aa0cf KVM: VMX: Handle forced exit due to preemption timer in fastpath
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the
exit fastpath, i.e. avoid calling back into handle_preemption_timer() for
the same exit.  There is no work to be done for forced exits, as the name
suggests the goal is purely to get control back in KVM.

In addition to shaving a few cycles, this will allow cleanly separating
handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g.
it's not immediately obvious why _apparently_ calling
handle_fastpath_preemption_timer() twice on a "slow" exit is necessary:
the "slow" call is necessary to handle exits from L2, which are excluded
from the fastpath by vmx_vcpu_run().

Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
e6b5d16bbd KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exits
Re-enter the guest in the fast path if VMX preeemption timer VM-Exit was
"spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by
some miracle the timer expired before any other VM-Exit occurred.  This is
just an intermediate step to cleaning up the preemption timer handling,
optimizing these types of spurious VM-Exits is not interesting as they are
extremely rare/infrequent.

Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
9c9025ea00 KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is
forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to
inject an event but needs to first complete some other operation.
Knowing that KVM is (or isn't) forcing an exit is useful information when
debugging issues related to event injection.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
2a5f091ce1 KVM: x86: Open code all direct reads to guest DR6 and DR7
Bite the bullet, and open code all direct reads of DR6 and DR7.  KVM
currently has a mix of open coded accesses and calls to kvm_get_dr(),
which is confusing and ugly because there's no rhyme or reason as to why
any particular chunk of code uses kvm_get_dr().

The obvious alternative is to force all accesses through kvm_get_dr(),
but it's not at all clear that doing so would be a net positive, e.g. even
if KVM ends up wanting/needing to force all reads through a common helper,
e.g. to play caching games, the cost of reverting this change is likely
lower than the ongoing cost of maintaining weird, arbitrary code.

No functional change intended.

Cc: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Sean Christopherson
fc5375dd8c KVM: x86: Make kvm_get_dr() return a value, not use an out parameter
Convert kvm_get_dr()'s output parameter to a return value, and clean up
most of the mess that was created by forcing callers to provide a pointer.

No functional change intended.

Acked-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Pawan Gupta
43fb862de8 KVM/VMX: Move VERW closer to VMentry for MDS mitigation
During VMentry VERW is executed to mitigate MDS. After VERW, any memory
access like register push onto stack may put host data in MDS affected
CPU buffers. A guest can then use MDS to sample host data.

Although likelihood of secrets surviving in registers at current VERW
callsite is less, but it can't be ruled out. Harden the MDS mitigation
by moving the VERW mitigation late in VMentry path.

Note that VERW for MMIO Stale Data mitigation is unchanged because of
the complexity of per-guest conditional VERW which is not easy to handle
that late in asm with no GPRs available. If the CPU is also affected by
MDS, VERW is unconditionally executed late in asm regardless of guest
having MMIO access.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:59 -08:00
Sean Christopherson
706a189dcf KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
VMLAUNCH.  Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
for MDS mitigations as late as possible without needing to duplicate VERW
for both paths.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-5-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:54 -08:00
Pawan Gupta
6613d82e61 x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
The VERW mitigation at exit-to-user is enabled via a static branch
mds_user_clear. This static branch is never toggled after boot, and can
be safely replaced with an ALTERNATIVE() which is convenient to use in
asm.

Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user
path. Also remove the now redundant VERW in exc_nmi() and
arch_exit_to_user_mode().

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:49 -08:00
Paolo Bonzini
22d0bc0721 KVM x86 fixes for 6.8:
- Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
    if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
    the UNITIALIZED state, the spurious request will result in KVM exiting to
    userspace, which in turn causes QEMU to constantly acquire and release
    QEMU's global mutex, to the point where the BSP is unable to make forward
    progress.
 
  - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
    incorrectly truncated, and ultimately causes KVM to think a fixed counter
    has already been disabled (KVM thinks the old value is '0').
 
  - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
    that is ultimately ignored due to ignore_msrs=true doesn't zero the output
    as intended.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKt90SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5e5wP/jU3Zuul2e7fb4E6RN/GPhAFSTzG7Cwe
 4lVSSSPmOQsEXTKwCOMj7fgwF9qVSLzLRi62MKziTJY/1FDsTcI3xlM7nM2wwQC2
 26evIzI3qB54rHQdviuh1jwh6scZH7xLw7kANE+8x4skkm6AZB1IUnj3utR3fEPj
 mIUA5kGQxEAEDrn0TFzrRgIw4JngKjrCwmpT+vbmR37flC+Rwv8jr4JY1E3cBAT3
 KEilv3Fg07gbvagWGZNSSUNqQos5MsnLifdryKbA/vuIJf+j/01CMo5KtLKshiaX
 t4gXPldVZDXdxjH6im0wRAX4s/FpZg3vVje2OxPbzwMVb5+XvLewzjzagQ1lFA3I
 gsNXF8uGdYn0fb8T/wQG4ulWBw6A844PSmGONCwLDA+GZuL9xjMIK5d1litvb/im
 bEP1Ahv6UcnDNKHqRzuFXQENiS2uQdJNLs7p291oDNkTm/CGjDUgFXPuaCehWrUf
 ZZf1dxmIPM/Xt2j19mS/HnTHD114A8t1GTx799kBXbG4x0ScVQclkhRk6yFG3ObA
 14uXxxAdEBoZGBJ2yr5FbddvRLswbWugFoxKbtCZ/CHMopOUQcRRmRb7Lm1NHLtg
 Ae/sHO6gQ1xcrbwpMCq+6RjFK57yW+n1TB8ZTmAE2RQynGqzReSTlUNtfn3yMg4v
 hz+2zGzezoeN
 =92ae
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.8:

 - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
   if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
   the UNITIALIZED state, the spurious request will result in KVM exiting to
   userspace, which in turn causes QEMU to constantly acquire and release
   QEMU's global mutex, to the point where the BSP is unable to make forward
   progress.

 - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
   incorrectly truncated, and ultimately causes KVM to think a fixed counter
   has already been disabled (KVM thinks the old value is '0').

 - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
   that is ultimately ignored due to ignore_msrs=true doesn't zero the output
   as intended.
2024-02-14 12:34:43 -05:00
Ingo Molnar
4589f199eb Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before
applying more patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-14 10:49:37 +01:00
Linus Torvalds
4356e9f841 work around gcc bugs with 'asm goto' with outputs
We've had issues with gcc and 'asm goto' before, and we created a
'asm_volatile_goto()' macro for that in the past: see commits
3f0116c323 ("compiler/gcc4: Add quirk for 'asm goto' miscompilation
bug") and a9f180345f ("compiler/gcc4: Make quirk for
asm_volatile_goto() unconditional").

Then, much later, we ended up removing the workaround in commit
43c249ea0b ("compiler-gcc.h: remove ancient workaround for gcc PR
58670") because we no longer supported building the kernel with the
affected gcc versions, but we left the macro uses around.

Now, Sean Christopherson reports a new version of a very similar
problem, which is fixed by re-applying that ancient workaround.  But the
problem in question is limited to only the 'asm goto with outputs'
cases, so instead of re-introducing the old workaround as-is, let's
rename and limit the workaround to just that much less common case.

It looks like there are at least two separate issues that all hit in
this area:

 (a) some versions of gcc don't mark the asm goto as 'volatile' when it
     has outputs:

        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420

     which is easy to work around by just adding the 'volatile' by hand.

 (b) Internal compiler errors:

        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422

     which are worked around by adding the extra empty 'asm' as a
     barrier, as in the original workaround.

but the problem Sean sees may be a third thing since it involves bad
code generation (not an ICE) even with the manually added 'volatile'.

but the same old workaround works for this case, even if this feels a
bit like voodoo programming and may only be hiding the issue.

Reported-and-tested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andrew Pinski <quic_apinski@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-09 15:57:48 -08:00
Chao Gao
d7f0a00e43 KVM: VMX: Report up-to-date exit qualification to userspace
Use vmx_get_exit_qual() to read the exit qualification.

vcpu->arch.exit_qualification is cached for EPT violation only and even
for EPT violation, it is stale at this point because the up-to-date
value is cached later in handle_ept_violation().

Fixes: 70bcd708df ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits")
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 07:47:53 -08:00
Mingwei Zhang
05519c86d6 KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those).  As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can leads to incorrect
counts and spurious PMIs in the guest.

Fixes: 76d287b234 ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()")
Cc: stable@vger.kernel.org
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com
[sean: rewrite changelog to call out the effects of the bug]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-02 14:07:27 -08:00
Sean Christopherson
f19063b1ca KVM: x86/pmu: Snapshot event selectors that KVM emulates in software
Snapshot the event selectors for the events that KVM emulates in software,
which is currently instructions retired and branch instructions retired.
The event selectors a tied to the underlying CPU, i.e. are constant for a
given platform even though perf doesn't manage the mappings as such.

Getting the event selectors from perf isn't exactly cheap, especially if
mitigations are enabled, as at least one indirect call is involved.

Snapshot the values in KVM instead of optimizing perf as working with the
raw event selectors will be required if KVM ever wants to emulate events
that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id"
entry.

Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
e5a65d4f72 KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap
Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops
that iterate over valid PMCs given a bitmap (and because seeing checkpatch
whine about bad macro style is always amusing).

No functional change intended.

Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
b31880ca2f KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code
Add a common helper for *internal* PMC lookups, and delete the ops hook
and Intel's implementation.  Keep AMD's implementation, but rename it to
amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for
both KVM-internal and guest-initiated lookups.

Because KVM tracks all counters in a single bitmap, getting a counter
when iterating over a bitmap, e.g. of all valid PMCs, requires a small
amount of math, that while simple, isn't super obvious and doesn't use the
same semantics as PMC lookups from RDPMC!  Although AMD doesn't support
fixed counters, the common PMU code still behaves as if there a split, the
high half of which just happens to always be empty.

Opportunstically add a comment to explain both what is going on, and why
KVM uses a single bitmap, e.g. the boilerplate for iterating over separate
bitmaps could be done via macros, so it's not (just) about deduplicating
code.

Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:47 -08:00
Sean Christopherson
be6b067dae KVM: x86/pmu: Add common define to capture fixed counters offset
Add a common define to "officially" solidify KVM's split of counters,
i.e. to commit to using bits 31:0 to track general purpose counters and
bits 63:32 to track fixed counters (which only Intel supports).  KVM
already bleeds this behavior all over common PMU code, and adding a KVM-
defined macro allows clarifying that the value is a _base_, as oppposed to
the _flag_ that is used to access fixed PMCs via RDPMC (which perf
confusingly calls INTEL_PMC_FIXED_RDPMC_BASE).

No functional change intended.

Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:34:31 -08:00
Sean Christopherson
f933b88e20 KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
Move the purging of common PMU metadata from intel_pmu_refresh() to
kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if
the VM is supposed to have a vPMU.

KVM already denies access to the PMU based on kvm->arch.enable_pmu, as
get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already
violates AMD's architecture by not virtualizing a PMU (kernels have long
since learned to not panic when the PMU is unavailable).  But configuring
the PMU as if it were enabled causes unwanted side effects, e.g. calls to
kvm_pmu_trigger_event() waste an absurd number of cycles due to the
all_valid_pmc_idx bitmap being non-zero.

Fixes: b1d66dad65 ("KVM: x86/svm: Add module param to control PMU virtualization")
Reported-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com
Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:34:31 -08:00
Xin Li
70d0fe5d09 KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in
IRQ/NMI induced VM exits.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com
2024-01-31 22:03:20 +01:00
Sean Christopherson
a634c76b2c KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC types
Explicitly check for attempts to read unsupported PMC types instead of
letting the bounds check fail.  Functionally, letting the check fail is
ok, but it's unnecessarily subtle and does a poor job of documenting the
architectural behavior that KVM is emulating.

Reviewed-by: Dapeng Mi  <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
7a0fc734c2 KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flag
Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an
explicit value, not a flag (minus one wart).  While non-architectural PMUs
do use bit 31 as a flag (for "fast" reads), architectural PMUs use the
upper half of ECX to encode the type.  From the SDM:

  ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of
  the PMC to be read within that type

Note, that the known supported types are 4000H and 2000H, i.e. look a lot
like flags, doesn't contradict the above statement that ECX[31:16] holds
the type, at least not by any sane reading of the SDM.

Keep the explicitly clearing of the FIXED "flag", as KVM subtly relies on
that behavior to disallow unsupported types while allowing the correct
indices for fixed counters.  This wart will be cleaned up in short order.

Opportunistically grab the per-type bitmask in the if-else blocks to
eliminate the one-off usage of the local "fixed" bool.

Reported-by: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
5728a4a0ea KVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUs
Inject #GP on RDPMC if the "fast" flag is set for architectural Intel
PMUs, i.e. if the PMU version is non-zero.  Per Intel's SDM, and confirmed
on bare metal, the "fast" flag is supported only for non-architectural
PMUs, and is reserved for architectural PMUs.

  If the processor does not support architectural performance monitoring
  (CPUID.0AH:EAX[7:0]=0), ECX[30:0] specifies the index of the PMC to be
  read. Setting ECX[31] selects “fast” read mode if supported. In this mode,
  RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero.

  If the processor does support architectural performance monitoring
  (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16] specifies type of PMC while ECX[15:0]
  specifies the index of the PMC to be read within that type. The following
  PMC types are currently defined:
  — General-purpose counters use type 0. The index x (to read IA32_PMCx)
    must be less than the value enumerated by CPUID.0AH.EAX[15:8] (thus
    ECX[15:8] must be zero).
  — Fixed-function counters use type 4000H. The index x (to read
    IA32_FIXED_CTRx) can be used if either CPUID.0AH.EDX[4:0] > x or
    CPUID.0AH.ECX[x] = 1 (thus ECX[15:5] must be 0).
  — Performance metrics use type 2000H. This type can be used only if
    IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type,
    the index in ECX[15:0] is implementation specific.

Opportunistically WARN if KVM ever actually tries to complete RDPMC for a
non-architectural PMU, and drop the non-existent "support" for fast RDPMC,
as KVM doesn't support such PMUs, i.e. kvm_pmu_rdpmc() should reject the
RDPMC before getting to the Intel code.

Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Fixes: 67f4d4288c ("KVM: x86: rdpmc emulation checks the counter incorrectly")
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
d652981db0 KVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUs
Move the handling of "fast" RDPMC instructions, which drop bits 63:32 of
the count, to Intel.  The "fast" flag, and all modifiers for that matter,
are Intel-only and aren't supported by AMD.

Opportunistically replace open coded bit crud with proper #defines, and
add comments to try and disentangle the flags vs. values mess for
non-architectural vs. architectural PMUs.

Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
7bb7fce136 KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check.  On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP.  While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.

Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs.  On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.

Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.

Fixes: 8061252ee0 ("KVM: SVM: Add intercept checks for remaining twobyte instructions")
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
7a277c2241 KVM: x86/pmu: Get eventsel for fixed counters from perf
Get the event selectors used to effectively request fixed counters for
perf events from perf itself instead of hardcoding them in KVM and hoping
that they match the underlying hardware.  While fixed counters 0 and 1 use
architectural events, as of ffbe4ab0be ("perf/x86/intel: Extend the
ref-cycles event to GP counters") fixed counter 2 (reference TSC cycles)
may use a software-defined pseudo-encoding or a real hardware-defined
encoding.

Reported-by: Kan Liang <kan.liang@linux.intel.com>
Closes: https://lkml.kernel.org/r/4281eee7-6423-4ec8-bb18-c6aeee1faf2c%40linux.intel.com
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
61bb2ad795 KVM: x86/pmu: Setup fixed counters' eventsel during PMU initialization
Set the eventsel for all fixed counters during PMU initialization, the
eventsel is hardcoded and consumed if and only if the counter is supported,
i.e. there is no reason to redo the setup every time the PMU is refreshed.

Configuring all KVM-supported fixed counter also eliminates a potential
pitfall if/when KVM supports discontiguous fixed counters, in which case
configuring only nr_arch_fixed_counters will be insufficient (ignoring the
fact that KVM will need many other changes to support discontiguous fixed
counters).

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
db9e008a0f KVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodings
Drop KVM's enumeration of Intel's architectural event encodings, and
instead open code the three encodings (of which only two are real) that
KVM uses to emulate fixed counters.  Now that KVM doesn't incorrectly
enforce the availability of architectural encodings, there is no reason
for KVM to ever care about the encodings themselves, at least not in the
current format of an array indexed by the encoding's position in CPUID.

Opportunistically add a comment to explain why KVM cares about eventsel
values for fixed counters.

Suggested-by: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
cbbd1aa891 KVM: x86/pmu: Allow programming events that match unsupported arch events
Remove KVM's bogus restriction that the guest can't program an event whose
encoding matches an unsupported architectural event.  The enumeration of
an architectural event only says that if a CPU supports an architectural
event, then the event can be programmed using the architectural encoding.
The enumeration does NOT say anything about the encoding when the CPU
doesn't report support the architectural event.

Preventing the guest from counting events whose encoding happens to match
an architectural event breaks existing functionality whenever Intel adds
an architectural encoding that was *ever* used for a CPU that doesn't
enumerate support for the architectural event, even if the encoding is for
the exact same event!

E.g. the architectural encoding for Top-Down Slots is 0x01a4.  Broadwell
CPUs, which do not support the Top-Down Slots architectural event, 0x01a4
is a valid, model-specific event.  Denying guest usage of 0x01a4 if/when
KVM adds support for Top-Down slots would break any Broadwell-based guest.

Reported-by: Kan Liang <kan.liang@linux.intel.com>
Closes: https://lore.kernel.org/all/2004baa6-b494-462c-a11f-8104ea152c6a@linux.intel.com
Fixes: a21864486f ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event")
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
5eb7fcbdea KVM: x86/pmu: Always treat Fixed counters as available when supported
Treat fixed counters as available when they are supported, i.e. don't
silently ignore an enabled fixed counter just because guest CPUID says the
associated general purpose architectural event is unavailable.

KVM originally treated fixed counters as always available, but that got
changed as part of a fix to avoid confusing REF_CPU_CYCLES, which does NOT
map to an architectural event, with the actual architectural event used
associated with bit 7, TOPDOWN_SLOTS.

The commit justified the change with:

    If the event is marked as unavailable in the Intel guest CPUID
    0AH.EBX leaf, we need to avoid any perf_event creation, whether
    it's a gp or fixed counter.

but that justification doesn't mesh with reality.  The Intel SDM uses
"architectural events" to refer to both general purpose events (the ones
with the reverse polarity mask in CPUID.0xA.EBX) and the events for fixed
counters, e.g. the SDM makes statements like:

  Each of the fixed-function PMC can count only one architectural
  performance event.

but the fact that fixed counter 2 (TSC reference cycles) doesn't have an
associated general purpose architectural makes trying to apply the mask
from CPUID.0xA.EBX impossible.

Furthermore, the lack of enumeration for an architectural event in CPUID
only means the CPU doesn't officially support the architectural encoding,
i.e. it doesn't mean using the architectural encoding _won't_ work, it
sipmly means there are no guarantees that it will work as expected.  E.g.
if KVM is running in a VM that advertises a fixed counters but not the
corresponding architectural event encoding, and perf decides to use a
general purpose counter instead of a fixed counter, odds are very good
that the underlying hardware actually does support the architectrual
encoding, and that programming the encoding will count the right thing.

In other words, asking perf to count the event will probably work, whereas
intentionally doing nothing is obviously guaranteed to fail.

Note, at the time of the change, KVM didn't enforce hardware support, i.e.
didn't prevent userspace from enumerating support in guest CPUID.0xA.EBX
for architectural events that aren't supported in hardware.  I.e. silently
dropping the fixed counter didn't somehow protection against counting the
wrong event, it just enforced guest CPUID.  And practically speaking, this
issue is almost certainly limited to running KVM on a funky virtual CPU
model.  No known real hardware has an asymmetric PMU where a fixed counter
is supported but the associated architectural event is not.

Fixes: a21864486f ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event")
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:01 -08:00
Linus Torvalds
09d1c6a80f Generic:
- Use memdup_array_user() to harden against overflow.
 
 - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
 
 - Clean up Kconfigs that all KVM architectures were selecting
 
 - New functionality around "guest_memfd", a new userspace API that
   creates an anonymous file and returns a file descriptor that refers
   to it.  guest_memfd files are bound to their owning virtual machine,
   cannot be mapped, read, or written by userspace, and cannot be resized.
   guest_memfd files do however support PUNCH_HOLE, which can be used to
   switch a memory area between guest_memfd and regular anonymous memory.
 
 - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
   per-page attributes for a given page of guest memory; right now the
   only attribute is whether the guest expects to access memory via
   guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
   TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees
   confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
 
 x86:
 
 - Support for "software-protected VMs" that can use the new guest_memfd
   and page attributes infrastructure.  This is mostly useful for testing,
   since there is no pKVM-like infrastructure to provide a meaningfully
   reduced TCB.
 
 - Fix a relatively benign off-by-one error when splitting huge pages during
   CLEAR_DIRTY_LOG.
 
 - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
   TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
 
 - Use more generic lockdep assertions in paths that don't actually care
   about whether the caller is a reader or a writer.
 
 - let Xen guests opt out of having PV clock reported as "based on a stable TSC",
   because some of them don't expect the "TSC stable" bit (added to the pvclock
   ABI by KVM, but never set by Xen) to be set.
 
 - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
 
 - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
   flushes on nested transitions, i.e. always satisfies flush requests.  This
   allows running bleeding edge versions of VMware Workstation on top of KVM.
 
 - Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
 
 - On AMD machines with vNMI, always rely on hardware instead of intercepting
   IRET in some cases to detect unmasking of NMIs
 
 - Support for virtualizing Linear Address Masking (LAM)
 
 - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state
   prior to refreshing the vPMU model.
 
 - Fix a double-overflow PMU bug by tracking emulated counter events using a
   dedicated field instead of snapshotting the "previous" counter.  If the
   hardware PMC count triggers overflow that is recognized in the same VM-Exit
   that KVM manually bumps an event count, KVM would pend PMIs for both the
   hardware-triggered overflow and for KVM-triggered overflow.
 
 - Turn off KVM_WERROR by default for all configs so that it's not
   inadvertantly enabled by non-KVM developers, which can be problematic for
   subsystems that require no regressions for W=1 builds.
 
 - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
   "features".
 
 - Don't force a masterclock update when a vCPU synchronizes to the current TSC
   generation, as updating the masterclock can cause kvmclock's time to "jump"
   unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
 
 - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
   partly as a super minor optimization, but mostly to make KVM play nice with
   position independent executable builds.
 
 - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
   CONFIG_HYPERV as a minor optimization, and to self-document the code.
 
 - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
   at build time.
 
 ARM64:
 
 - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
   base granule sizes. Branch shared with the arm64 tree.
 
 - Large Fine-Grained Trap rework, bringing some sanity to the
   feature, although there is more to come. This comes with
   a prefix branch shared with the arm64 tree.
 
 - Some additional Nested Virtualization groundwork, mostly
   introducing the NV2 VNCR support and retargetting the NV
   support to that version of the architecture.
 
 - A small set of vgic fixes and associated cleanups.
 
 Loongarch:
 
 - Optimization for memslot hugepage checking
 
 - Cleanup and fix some HW/SW timer issues
 
 - Add LSX/LASX (128bit/256bit SIMD) support
 
 RISC-V:
 
 - KVM_GET_REG_LIST improvement for vector registers
 
 - Generate ISA extension reg_list using macros in get-reg-list selftest
 
 - Support for reporting steal time along with selftest
 
 s390:
 
 - Bugfixes
 
 Selftests:
 
 - Fix an annoying goof where the NX hugepage test prints out garbage
   instead of the magic token needed to run the test.
 
 - Fix build errors when a header is delete/moved due to a missing flag
   in the Makefile.
 
 - Detect if KVM bugged/killed a selftest's VM and print out a helpful
   message instead of complaining that a random ioctl() failed.
 
 - Annotate the guest printf/assert helpers with __printf(), and fix the
   various bugs that were lurking due to lack of said annotation.
 
 There are two non-KVM patches buried in the middle of guest_memfd support:
 
   fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
   mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
 
 The first is small and mostly suggested-by Christian Brauner; the second
 a bit less so but it was written by an mm person (Vlastimil Babka).
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmWcMWkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroO15gf/WLmmg3SET6Uzw9iEq2xo28831ZA+
 6kpILfIDGKozV5safDmMvcInlc/PTnqOFrsKyyN4kDZ+rIJiafJdg/loE0kPXBML
 wdR+2ix5kYI1FucCDaGTahskBDz8Lb/xTpwGg9BFLYFNmuUeHc74o6GoNvr1uliE
 4kLZL2K6w0cSMPybUD+HqGaET80ZqPwecv+s1JL+Ia0kYZJONJifoHnvOUJ7DpEi
 rgudVdgzt3EPjG0y1z6MjvDBXTCOLDjXajErlYuZD3Ej8N8s59Dh2TxOiDNTLdP4
 a4zjRvDmgyr6H6sz+upvwc7f4M4p+DBvf+TkWF54mbeObHUYliStqURIoA==
 =66Ws
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fail to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertantly enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargetting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is delete/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...
2024-01-17 13:03:37 -08:00
Breno Leitao
aefb2f2e61 x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.

[ mingo: Converted a few more uses in comments/messages as well. ]

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-10 10:52:28 +01:00
Paolo Bonzini
8ecb10bcbf KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
 checks for most virtual address usage in 64-bit mode, such that only the most
 significant bit of the untranslated address bits must match the polarity of the
 last translated address bit.  This allows software to use ignored, untranslated
 address bits for metadata, e.g. to efficiently tag pointers for address
 sanitization.
 
 LAM can be enabled separately for user pointers and supervisor pointers, and
 for userspace LAM can be select between 48-bit and 57-bit masking
 
  - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
  - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
 
 For user pointers, LAM enabling utilizes two previously-reserved high bits from
 CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
 61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
 
  - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
  - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
  - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
 
 For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
 the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
 
  - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
  - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
  - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
 
 The modified LAM canonicality checks:
  - LAM_S48                : [ 1 ][ metadata ][ 1 ]
                               63               47
  - LAM_U48                : [ 0 ][ metadata ][ 0 ]
                               63               47
  - LAM_S57                : [ 1 ][ metadata ][ 1 ]
                               63               56
  - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                               63               56
  - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                               63               56..47
 
 The bulk of KVM support for LAM is to emulate LAM's modified canonicality
 checks.  The approach taken by KVM is to "fill" the metadata bits using the
 highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
 to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
 from the raw, untagged virtual address is kept for the canonicality check. This
 untagging allows
 
 Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
 touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
 enabling via CR3 and CR4 bits, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+k4SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5KygQAKTSEmfdox6MSYzGVzAVHBD/8oSTZAGf
 4l96Np3sZiX0ujWP7aW1GaIdGL27Yf1bQrKIrODR4xepaosVPpoZZbnLFQ4Jm16D
 OuwEQL06LV91Lv5XuPkNdq3nMVi1X3wjiKLvP451oCGv8JdxsjXSlFr8ZmDoCfmS
 NCjkPyitdK+/xOMY5WcrkHD/6VMMiM+5A+CrG7DkaTaqBJQSUXG1NvTKhhxey6Rq
 OZv0GPv7QVMhHv1NX0Y3LyoiGyWXAoFRnbk/N3yVBOnXcpJ+HBwWiNLRpxmZOQj/
 CTo0VvUH/ZkN6zGvAb75/9puFHNliA/QCW1hp+ShXnNdn1eNdS7nhhPrzVqtCTy2
 QeNWM/z5v9Wa1norPqDxzqWlh2bWW8JU0soX7Q+quN0d7YjVvmmUluL3Lw/V2zmb
 gFM2ZY43QHlmLVic4sSraK1LEcYFzjexzpTLhee2gNp+l2y0D0c1/hXukCk6YNUM
 gad9DH8P9d7By7Eyr0ZaPHSJbuBW1PqZhot5gCg9nCn4pnT2/y7wXsLj6VAw8gdr
 dWNu2MZWDuH0/d4aKfw2veAECbHUK2daok4ufPDj5nYLVVWCs4HU0U7HlYL2CX7/
 TdWOCwtpFtKoN1NHz8mpET7xldxLPnFkByL+SxypTZurAZXoSnEG71IbO5pJ2iIf
 wHQkXgM+XimA
 =qUZ2
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 support for virtualizing Linear Address Masking (LAM)

Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit.  This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.

LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking

 - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
 - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.

For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:

 - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
 - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
 - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers

For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:

 - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers

The modified LAM canonicality checks:
 - LAM_S48                : [ 1 ][ metadata ][ 1 ]
                              63               47
 - LAM_U48                : [ 0 ][ metadata ][ 0 ]
                              63               47
 - LAM_S57                : [ 1 ][ metadata ][ 1 ]
                              63               56
 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                              63               56
 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                              63               56..47

The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks.  The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows

Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
2024-01-08 08:10:12 -05:00
Paolo Bonzini
01edb1cfbd KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fail to stop/reset counters and other state
    prior to refreshing the vPMU model.
 
  - Fix a double-overflow PMU bug by tracking emulated counter events using a
    dedicated field instead of snapshotting the "previous" counter.  If the
    hardware PMC count triggers overflow that is recognized in the same VM-Exit
    that KVM manually bumps an event count, KVM would pend PMIs for both the
    hardware-triggered overflow and for KVM-triggered overflow.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/rsSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5Q8gQAJc4y9NOd09kYXpI+DhkTVe6v07dmYds
 NzBI2uViqxXFwA5pTs5VTVVYAl1FEmK6NvIVnJdc3epSYRSqyaeN/Z2NoulNxekj
 /jLA/aA4+dTeJf2lfMFeH65IIuSJhuhyGeZV31RfW3NzEmlglcsb74QkHnJB8rLQ
 RFJXZcOxSSap72AWxKmxk0alRaI6ONZ9NyqOWFWjZdQuAE7id9Ae5OixKUrlJkmR
 6CbY8ra51MFIXQEsomVlcl5b1DNiv0drPPf5YaC9T4CERtt5yZxpvZeTPhq70evm
 OutoZpzfi69cF1fFCxqN5cWZSt1C/Bu3xp8+ILI1+bZkMCV/ty85DU6hfMZQZzcV
 JeJkRg/AAgOrG4dtHskwg9LDMs867kgbaqZ8l8K7Dt8rGmcLc5/rZ1ZdjTStFj6V
 ukmVKMAVgkmh88u62wQ5HjrN1IE1oE6nmDp3zivfPuohEr49A8mAT02A2x9AVxAr
 HvmwfDMA92xOGSRAN9Gt0mbOA+G0WZe4A36XgPEXloYeskYZgHzgW2hT6VWTd86O
 ydU9s4L8g+Fy4jcObAiKsT8YwFgAMfVXZKTXvuTME4m/WUNBCrYCwqEOp/NM5qrk
 qYWVXxOMMjZo71tQfvSPu1TWCtW/4ckvmqMrdQosgwLFy5pSqgXEwTruDvbJ1KWU
 KhIWVbUfmgFA
 =+Emh
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.8:

 - Fix a variety of bugs where KVM fail to stop/reset counters and other state
   prior to refreshing the vPMU model.

 - Fix a double-overflow PMU bug by tracking emulated counter events using a
   dedicated field instead of snapshotting the "previous" counter.  If the
   hardware PMC count triggers overflow that is recognized in the same VM-Exit
   that KVM manually bumps an event count, KVM would pend PMIs for both the
   hardware-triggered overflow and for KVM-triggered overflow.
2024-01-08 08:10:08 -05:00
Paolo Bonzini
33d0403fda KVM x86 misc changes for 6.8:
- Turn off KVM_WERROR by default for all configs so that it's not
    inadvertantly enabled by non-KVM developers, which can be problematic for
    subsystems that require no regressions for W=1 builds.
 
  - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
    "features".
 
  - Don't force a masterclock update when a vCPU synchronizes to the current TSC
    generation, as updating the masterclock can cause kvmclock's time to "jump"
    unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
 
  - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
    partly as a super minor optimization, but mostly to make KVM play nice with
    position independent executable builds.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+7sSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5/pwQAL8jIapIWP54VWxWlcTZFtCptGSobGlv
 cBS4L091/bYuMB/jO0pPtD+apzsYt3WmJ+tRsNA7Yctzh9BDE3XxbV7pKVIUpz9P
 TLCtYU2hPzp3vC6WCryjtU0OHxEnYMGHE1RCB7/bRblz+q6td7+MLZHcEUdwv83l
 3pVM5+tNyQBog40frEVf+z7wrXzz2FgnauJn70X1UUs40VuiTzi6FqfLn6QK95xQ
 8QPpjGFep7wQ6RgC4cPKiWSaP5PypCCpr4lMSKrKAf4iaKJdO1CYxEPeu0LcyFhR
 DUM3zb+AZ/FVrisRWUnjke4Epb87ikoMQBlflrI9+o4cNJQaxEHAzTMGO+u4oucy
 KwnXtNYM3lKGvDEvoUSBDphNayzcchn+0qk8YKB+XvClYSOtGi+NsWUB4x+M6crM
 960cidF/CzYZL/IDj9GW2Tb+IiPJarmazdbqDmMpQiAKz0KE3tezGiysB6d6VJs1
 V+KWOaSzAT9GsBKvGnPDHQaZ20vK+YsGB/TMWvpg3rFLTyV5QFM17UNdXyJlX0g8
 G0v+gf7j3MKm156H2yYW0XhIAfhstc1Xb8fTDQjJ3pZn6us2NAtFgnrIpbL31Z7E
 yaSgZuxetswbNwVSECUGlH4/zAtQudBfAt837Nu4eSCjMrJE4SPrrwpbTqp0SPXd
 1VZbGc70QFf7
 =O4hV
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.8:

 - Turn off KVM_WERROR by default for all configs so that it's not
   inadvertantly enabled by non-KVM developers, which can be problematic for
   subsystems that require no regressions for W=1 builds.

 - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
   "features".

 - Don't force a masterclock update when a vCPU synchronizes to the current TSC
   generation, as updating the masterclock can cause kvmclock's time to "jump"
   unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.

 - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
   partly as a super minor optimization, but mostly to make KVM play nice with
   position independent executable builds.
2024-01-08 08:10:04 -05:00
Bjorn Helgaas
54aa699e80 arch/x86: Fix typos
Fix typos, most reported by "codespell arch/x86".  Only touches comments,
no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-03 11:46:22 +01:00
Vitaly Kuznetsov
5a30f97683 KVM: nVMX: Hide more stuff under CONFIG_KVM_HYPERV
'hv_evmcs_vmptr'/'hv_evmcs_map'/'hv_evmcs' fields in 'struct nested_vmx'
should not be used when !CONFIG_KVM_HYPERV, hide them when
!CONFIG_KVM_HYPERV.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-16-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:35:13 -08:00
Vitaly Kuznetsov
c98842b26c KVM: nVMX: Introduce accessor to get Hyper-V eVMCS pointer
There's a number of 'vmx->nested.hv_evmcs' accesses in nested.c, introduce
'nested_vmx_evmcs()' accessor to hide them all in !CONFIG_KVM_HYPERV case.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-15-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:35:08 -08:00
Vitaly Kuznetsov
453e42b055 KVM: nVMX: Introduce helpers to check if Hyper-V evmptr12 is valid/set
In order to get rid of raw 'vmx->nested.hv_evmcs_vmptr' accesses when
!CONFIG_KVM_HYPERV, introduce a pair of helpers:

nested_vmx_is_evmptr12_valid() to check that eVMPTR points to a valid
address.

nested_vmx_is_evmptr12_valid() to check that eVMPTR either points to a
valid address or is in 'pending' port-migration state (meaning it is
supposed to be valid but the exact address wasn't acquired from guest's
memory yet).

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-14-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:35:03 -08:00
Vitaly Kuznetsov
b4f69df0f6 KVM: x86: Make Hyper-V emulation optional
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.

Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.

Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:57 -08:00
Vitaly Kuznetsov
f973146267 KVM: nVMX: Move guest_cpuid_has_evmcs() to hyperv.h
In preparation for making Hyper-V emulation optional, move Hyper-V specific
guest_cpuid_has_evmcs() to hyperv.h.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-12-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:45 -08:00
Vitaly Kuznetsov
b2e02f82b7 KVM: nVMX: Split off helper for emulating VMCLEAR on Hyper-V eVMCS
To avoid overloading handle_vmclear() with Hyper-V specific details and to
prepare the code to making Hyper-V emulation optional, create a dedicated
nested_evmcs_handle_vmclear() helper.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-9-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:27 -08:00
Vitaly Kuznetsov
af9d544a45 KVM: x86: Introduce helper to handle Hyper-V paravirt TLB flush requests
As a preparation to making Hyper-V emulation optional, introduce a helper
to handle pending KVM_REQ_HV_TLB_FLUSH requests.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-8-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:23 -08:00
Vitaly Kuznetsov
e7ad84db4d KVM: VMX: Split off hyperv_evmcs.{ch}
Some Enlightened VMCS related code is needed both by Hyper-V on KVM and
KVM on Hyper-V. As a preparation to making Hyper-V emulation optional,
create dedicated 'hyperv_evmcs.{ch}' files which are used by both.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-7-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:19 -08:00
Vitaly Kuznetsov
50a82b0eb8 KVM: VMX: Split off vmx_onhyperv.{ch} from hyperv.{ch}
hyperv.{ch} is currently a mix of stuff which is needed by both Hyper-V on
KVM and KVM on Hyper-V. As a preparation to making Hyper-V emulation
optional, put KVM-on-Hyper-V specific code into dedicated files.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-4-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:06 -08:00
Vitaly Kuznetsov
cfef5af3cb KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.

While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:01 -08:00
Sean Christopherson
89acf1237b KVM: x86/pmu: Update sample period in pmc_write_counter()
Update a PMC's sample period in pmc_write_counter() to deduplicate code
across all callers of pmc_write_counter().  Opportunistically move
pmc_write_counter() into pmc.c now that it's doing more work.  WRMSR isn't
such a hot path that an extra CALL+RET pair will be problematic, and the
order of function definitions needs to be changed anyways, i.e. now is a
convenient time to eat the churn.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:55 -08:00
Sean Christopherson
cbb359d81a KVM: x86/pmu: Move PMU reset logic to common x86 code
Move the common (or at least "ignored") aspects of resetting the vPMU to
common x86 code, along with the stop/release helpers that are no used only
by the common pmu.c.

There is no need to manually handle fixed counters as all_valid_pmc_idx
tracks both fixed and general purpose counters, and resetting the vPMU is
far from a hot path, i.e. the extra bit of overhead to the PMC from the
index is a non-issue.

Zero fixed_ctr_ctrl in common code even though it's Intel specific.
Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed
counters via all_valid_pmc_idx, but not clearing the associated control
bits, would be odd/confusing.

Make the .reset() hook optional as SVM no longer needs vendor specific
handling.

Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Uros Bizjak
15223c4f97 KVM: SVM,VMX: Use %rip-relative addressing to access kvm_rebooting
Instruction with %rip-relative address operand is one byte shorter than
its absolute address counterpart and is also compatible with position
independent executable (-fpie) build.

No functional changes intended.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:54 -08:00
Binbin Wu
183bdd161c KVM: x86: Use KVM-governed feature framework to track "LAM enabled"
Use the governed feature framework to track if Linear Address Masking (LAM)
is "enabled", i.e. if LAM can be used by the guest.

Using the framework to avoid the relative expensive call guest_cpuid_has()
during cr3 and vmexit handling paths for LAM.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:09 -08:00
Robert Hoo
3098e6eca8 KVM: x86: Virtualize LAM for user pointer
Add support to allow guests to set the new CR3 control bits for Linear
Address Masking (LAM) and add implementation to get untagged address for
user pointers.

LAM modifies the canonical check for 64-bit linear addresses, allowing
software to use the masked/ignored address bits for metadata.  Hardware
masks off the metadata bits before using the linear addresses to access
memory.  LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and
LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes
VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for
virtualization.

When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set
any of the two LAM control bits. However, when EPT is off, the actual CR3
used by the guest is generated from the shadow MMU root which is different
from the CR3 that is *set* by the guest, and KVM needs to manually apply
any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen*
by the guest.

KVM manually checks guest's CR3 to make sure it points to a valid guest
physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend
this check to allow the two LAM control bits to be set. After check, LAM
bits of guest CR3 will be stripped off to extract guest physical address.

In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3
and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when
L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2
directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being
valid physical address. Extend such check to allow the new LAM control bits
too.

Note, LAM doesn't have a global control bit to turn on/off LAM completely,
but purely depends on hardware's CPUID to determine it can be enabled or
not. That means, when EPT is on, even when KVM doesn't expose LAM to guest,
the guest can still set LAM control bits in CR3 w/o causing problem. This
is an unfortunate virtualization hole. KVM could choose to intercept CR3 in
this case and inject fault but this would hurt performance when running a
normal VM w/o LAM support.  This is undesirable. Just choose to let the
guest do such illegal thing as the worst case is guest being killed when
KVM eventually find out such illegal behaviour and that the guest is
misbehaving.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:08 -08:00
Robert Hoo
93d1c9f498 KVM: x86: Virtualize LAM for supervisor pointer
Add support to allow guests to set the new CR4 control bit for LAM and add
implementation to get untagged address for supervisor pointers.

LAM modifies the canonicality check applied to 64-bit linear addresses for
data accesses, allowing software to use of the untranslated address bits for
metadata and masks the metadata bits before using them as linear addresses
to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM
for supervisor pointers. It also changes VMENTER to allow the bit to be set
in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP
is allowed to be set even not in 64-bit mode, but it will not take effect
since LAM only applies to 64-bit linear addresses.

Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu
supporting LAM or not. Leave it intercepted to prevent guest from setting
the bit if LAM is not exposed to guest as well as to avoid vmread every time
when KVM fetches its value, with the expectation that guest won't toggle the
bit frequently.

Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to
allow guests to enable LAM for supervisor pointers in nested VMX operation.

Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM
doesn't need to emulate TLB flush based on it.  There's no other features
or vmx_exec_controls connection, and no other code needed in
{kvm,vmx}_set_cr4().

Skip address untag for instruction fetches (which includes branch targets),
operand of INVLPG instructions, and implicit system accesses, all of which
are not subject to untagging.  Note, get_untagged_addr() isn't invoked for
implicit system accesses as there is no reason to do so, but check the
flag anyways for documentation purposes.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:07 -08:00
Binbin Wu
b39bd520a6 KVM: x86: Untag addresses for LAM emulation where applicable
Stub in vmx_get_untagged_addr() and wire up calls from the emulator (via
get_untagged_addr()) and "direct" calls from various VM-Exit handlers in
VMX where LAM untagging is supposed to be applied.  Defer implementing
the guts of vmx_get_untagged_addr() to future patches purely to make the
changes easier to consume.

LAM is active only for 64-bit linear addresses and several types of
accesses are exempted.

- Cases need to untag address (handled in get_vmx_mem_address())
  Operand(s) of VMX instructions and INVPCID.
  Operand(s) of SGX ENCLS.

- Cases LAM doesn't apply to (no change needed)
  Operand of INVLPG.
  Linear address in INVPCID descriptor.
  Linear address in INVVPID descriptor.
  BASEADDR specified in SECS of ECREATE.

Note:
  - LAM doesn't apply to write to control registers or MSRs
  - LAM masking is applied before walking page tables, i.e. the faulting
    linear address in CR2 doesn't contain the metadata.
  - The guest linear address saved in VMCS doesn't contain metadata.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-10-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:07 -08:00
Binbin Wu
9c8021d4ae KVM: x86: Remove kvm_vcpu_is_illegal_gpa()
Remove kvm_vcpu_is_illegal_gpa() and use !kvm_vcpu_is_legal_gpa() instead.
The "illegal" helper actually predates the "legal" helper, the only reason
the "illegal" variant wasn't removed by commit 4bda0e9786 ("KVM: x86:
Add a helper to check for a legal GPA") was to avoid code churn.  Now that
CR3 has a dedicated helper, there are fewer callers, and so the code churn
isn't that much of a deterrent.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-8-binbin.wu@linux.intel.com
[sean: provide a bit of history in the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:05 -08:00
Binbin Wu
2c49db455e KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legality
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide
a clear distinction between CR3 and GPA checks.  This will allow exempting
bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks,
e.g. for upcoming features that will use high bits in CR3 for feature
enabling.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:04 -08:00
Chao Peng
8569992d64 KVM: Use gfn instead of hva for mmu_notifier_retry
Currently in mmu_notifier invalidate path, hva range is recorded and then
checked against by mmu_invalidate_retry_hva() in the page fault handling
path. However, for the soon-to-be-introduced private memory, a page fault
may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Message-Id: <20231027182217.3615211-4-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 05:28:53 -05:00
Paolo Bonzini
be47941980 KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
    running an SEV-ES guest.
 
  - Clean up handling "failures" when KVM detects it can't emulate the "skip"
    action for an instruction that has already been partially emulated.  Drop a
    hack in the SVM code that was fudging around the emulator code not giving
    SVM enough information to do the right thing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8GHYSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5hwkQAIR8l1gWz/caz29biBzmRnDS+aZOXcYM
 8V8WBJqJgMKE9egibF4sADAlhInXzg19Xr7bQs6VfuvmdXrCn0UJ/nLorX+H85A2
 pph6iNlWO6tyQAjvk/AieaeUyZOqpCFmKOgxfN2Fr/Lrn7u3AdjXC20qPeFJSLXr
 YOTCQ704yvjjJp4yVA8JlclAQu38hanKiO5SZdlLzbuhUgWwQk4DVP2ZsYnhX+RO
 F6exxORvMnYF/LJe/kR2/DMLf2JWvyUmjRrGWoeRoksOw5BlXMc5HyTPHSJ2jDac
 lJaNtmZkTY1bDVWZk7N03ze5aFJa4DaqJdIFLtgujrFW8thog0P48aH6vmKi4UAA
 bXme9GFYbmJTkemaGRnrzidFV12uPNvvanS+1PDOw4sn4HpscoMSpZw5PeH2kBwV
 6uKNCJCwLtk8oe50yroKD7rJ/ASB7CeoqzbIL9s2TA0HSAskIf65T4eZp01uniyd
 Q98yCdrG2mudsg5aU5yMfe0LwZby5BB5kUCqIe4hyRC68GJR8wkAzhaFRgCn4aJE
 yaTyjnT2V3PGMEEJOPFdSF3VQGztljzQiXlEvBVj3zvMGQNTo2NhmS3ka4W+wW5G
 avRYv8dITlGRs6J2gV1vp8Eb5LzDrwRpRURSmzeP5rR58saKdljTZgNfOzfLeFr1
 WhLzonLz52IS
 =U0fq
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.7:

 - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
   running an SEV-ES guest.

 - Clean up handling "failures" when KVM detects it can't emulate the "skip"
   action for an instruction that has already been partially emulated.  Drop a
   hack in the SVM code that was fudging around the emulator code not giving
   SVM enough information to do the right thing.
2023-10-31 10:22:43 -04:00
Paolo Bonzini
f0f59d069e KVM x86 MMU changes for 6.7:
- Clean up code that deals with honoring guest MTRRs when the VM has
    non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
 
  - Zap EPT entries when non-coherent DMA assignment stops/start to prevent
    using stale entries with the wrong memtype.
 
  - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
    there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
    This will also allow for future optimizations of handling guest MTRR updates
    for VMs with non-coherent DMA and the quirk enabled.
 
  - Harden the fast page fault path to guard against encountering an invalid
    root when walking SPTEs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8FG8SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5tYMP+gJd3raPnpmai4NyaFaZNP6/5YsXuUMj
 XBvHH7hBGHmjd1sV+O62fhUvNk4+M/1f1rERutP4s7yXEXxQfC9G/MQFgLBfyiW8
 xR+RQkNrz8HsG8mHFBZ0Ei6OofhP+BRTYDRU7kbctKDh/4Hp5AOZAxYHs/ZhOho1
 Lw6upZbQLCkdt72eEKbfocg6Tf400hWEyarBRXFe4KJzWq7KMjAPgqA/3Vx0lF6u
 zX73Zr6tV0mcf3QXd58Q4CUwOuwMo1aTangmOhEeC09JplF2okLV36h6WrCF8qqO
 gvmDrMA450Yc215peOJGBJzoZJrNjMIHZ2m+4Ifag6Z/jJoam4vjzUZmmrzx+Gbj
 Ot5lmXCVRXCdHmUNdYQ6yR27WaVP3C3ItkxwNZGMPoh2G08NGyLLY1kwzRyITEH4
 M9jYTRBZaeue57ad5Ms9FaneBLWwPxajTX90rWZbl2kzfd8PG5cF1VroESBLoa0f
 I2kDcd7988xLTOMl1sfO8ci21Ve7rQc0hA6WlOXrDxb26OvYrftYXeXOCowN6kqP
 czXIu5ZPmLI1btimZQXGMdxKkw5wwe3wDC3y5gKrm+rTfORUXoOUDoITIpmPCnAp
 Dzfr5la3RI1GjHhzR80x4vXQC9BgJ9WrEwJub/RqVfE3T3ohw+NZl+AeM1xB9eT1
 2mJWm6GFEm9Y
 =Zfbr
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.7:

 - Clean up code that deals with honoring guest MTRRs when the VM has
   non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.

 - Zap EPT entries when non-coherent DMA assignment stops/start to prevent
   using stale entries with the wrong memtype.

 - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
   there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
   This will also allow for future optimizations of handling guest MTRR updates
   for VMs with non-coherent DMA and the quirk enabled.

 - Harden the fast page fault path to guard against encountering an invalid
   root when walking SPTEs.
2023-10-31 10:17:43 -04:00
Paolo Bonzini
f292dc8aad KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
    forcing more common use cases to eat the extra memory overhead.
 
  - Add IBPB and SBPB virtualization support.
 
  - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
    creating the original vCPU would cause KVM to try to synchronize the vCPU's
    TSC and thus clobber the correct TSC being set by userspace.
 
  - Compute guest wall clock using a single TSC read to avoid generating an
    inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
 
  - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
     about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
 
  - Don't apply side effects to Hyper-V's synthetic timer on writes from
    userspace to fix an issue where the auto-enable behavior can trigger
    spurious interrupts, i.e. do auto-enabling only for guest writes.
 
  - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
    without PML enabled.
 
  - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
 
  - Use octal notation for file permissions through KVM x86.
 
  - Fix a handful of typo fixes and warts.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8EugSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5xS0P+gPTDO81CUZO70LrO2W4E7toRBf/F9x1
 /v5D/76p9hG32Z6+BJs/xxDxJFagw75MtoR5oKivtXiip3TxbfOyDOlaQkIRo85E
 /d95il/LRidL3Mv3TXRj1lykXnxSSz9tigAGEZti1Y9Fn9fXEIwurJH7dU5cBI1E
 fin5bsDaTNRjG4jjTiEUbnKPRTlD/S7CQJn4CaYvZhMv/eJkYDLyBBVy4VLoLzvD
 ctL6VJQLGPVxbxr9mEmulaqMrSuDIQQLkRVQJAViKyerBInTEc5d/GPCHuE8O3zi
 0r/QSJbMS9titWLz07NhJ1UH4VJNyaEhRlyJPSFhBW4h6dzUb3EXdUe0Hwa+JH/S
 H2cVqsANItTCIhvDtuEGIRDahu0eD+63h90InJ0gEVL1kSJS+UWZHB71PkUEQgAV
 2OsuT1D26fuxrv+0b9ioBZURycqKw++zGsrwyVhe77eBgqBJ12tbL4TAD+QNjaQ5
 HZTCe6YV83gZoOMeVkoTGSf96s9lGORgxsaAIXmFuLB9RVCVXhVh0ph2HZsnV8Hw
 ZXEXpBEFo7GUhb0NIvsk2W73QL87A3fLv15yITWc8KuC7/dXP9z6KpSKjFySS69X
 uWD1MVx6shhvbg97UzoJlXc3/z0aVzmdZJudE5d0gcFvAjIItqp6ICPOoKxfj8pT
 tqRZu3kVHd61
 =sfp8
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.7:

 - Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
   forcing more common use cases to eat the extra memory overhead.

 - Add IBPB and SBPB virtualization support.

 - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
   creating the original vCPU would cause KVM to try to synchronize the vCPU's
   TSC and thus clobber the correct TSC being set by userspace.

 - Compute guest wall clock using a single TSC read to avoid generating an
   inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.

 - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
    about a "Firmware Bug" if the bit isn't set for select F/M/S combos.

 - Don't apply side effects to Hyper-V's synthetic timer on writes from
   userspace to fix an issue where the auto-enable behavior can trigger
   spurious interrupts, i.e. do auto-enabling only for guest writes.

 - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
   without PML enabled.

 - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.

 - Use octal notation for file permissions through KVM x86.

 - Fix a handful of typo fixes and warts.
2023-10-31 10:15:15 -04:00
Paolo Bonzini
f233646760 KVM x86 APIC changes for 6.7:
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
    handling KVM_SET_LAPIC.  Purging the PID after loading APIC state results in
    lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
    state, i.e. can immediately pend an IRQ if the expiry is in the past.
 
  - Clear the ICR.BUSY bit when handling trap-like x2APIC writes to suppress a
    WARN due to KVM expecting the BUSY bit to be cleared when sending IPIs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8DrkSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5Z54P/jQbjmwbvAVDa1g2HGr16ouFb3v+rCgV
 eXiN2/8tDlQl9JkvZsjOY1AtJauS3RZu/SncT3IxCCvHVrKKt0xgnhWT//rAMX3N
 zrrFxXJmilFdd2ijsuK8flHVFO5i06Kx1NRjSmAMb76et1et1EI0I62F1rom8ocn
 l0tryTkV0qhWKlgxzLBNzzhzxxA2Mvb6vFhut+nfa0IisnQ+roh5dV4JS7Cdhy+5
 qIQ4J86JdDR6kmraZLCv8r7vIQn5EQPVCAC2WWalds7iWUW3ixy01NLweVLaVtA3
 xWRHYK7azvu3SJ9HvmSMRZzzTElQBYnE2d37ycyYewHbDUJ6FCXxh7H+BehYaVBW
 CK6GLaMHY8vyXLcB2cRAzZ7GgBSSDevL11/CMc2ueGc51AOsDCxFh6v69TU+/PiQ
 dMOYiNTuwV9QC3raIXPCc2Q9rqGp20xBqXIv5XLEyIxqcFQESvtRggg6zJdBZ3zB
 T5/xAXDf++rKFiu1wskFAbwsk3M4T36u1WlK9iayC9dPvCDGRzsle3tt9npmoj2X
 3e0Uz7fkRLki1YZW1qLmvqoYJ9bX9B9BjuKkORIqSqXuwcgPYEzHt1GcRByI2FSk
 KSTnNwgs/Bs498LsDdU1Z740ZkYFGmu2nK+CUvwxP+uejPsnlhik5kGqLSIeA7zA
 9kDpcdV+tZKJ
 =atZ2
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-apic-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM x86 APIC changes for 6.7:

 - Purge VMX's posted interrupt descriptor *before* loading APIC state when
   handling KVM_SET_LAPIC.  Purging the PID after loading APIC state results in
   lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
   state, i.e. can immediately pend an IRQ if the expiry is in the past.

 - Clear the ICR.BUSY bit when handling trap-like x2APIC writes.  This avoids a
   WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
2023-10-31 10:11:19 -04:00
Peng Hao
26951ec862 KVM: x86: Use octal for file permission
Convert all module params to octal permissions to improve code readability
and to make checkpatch happy:

  WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
           octal permissions '0444'.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20231013113020.77523-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 10:29:10 -07:00
Yan Zhao
c9f65a3f2d KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
For KVM_X86_QUIRK_CD_NW_CLEARED is on, remove the IPAT (ignore PAT) bit in
EPT memory types when cache is disabled and non-coherent DMA are present.

To correctly emulate CR0.CD=1, UC + IPAT are required as memtype in EPT.
However, as with commit fb279950ba ("KVM: vmx: obey
KVM_QUIRK_CD_NW_CLEARED"), WB + IPAT are now returned to workaround a BIOS
issue that guest MTRRs are enabled too late. Without this workaround, a
super slow guest boot-up is expected during the pre-guest-MTRR-enabled
period due to UC as the effective memory type for all guest memory.

Absent emulating CR0.CD=1 with UC, it makes no sense to set IPAT when KVM
is honoring the guest memtype.
Removing the IPAT bit in this patch allows effective memory type to honor
PAT values as well, as WB is the weakest memtype. It means if a guest
explicitly claims UC as the memtype in PAT, the effective memory is UC
instead of previous WB. If, for some unknown reason, a guest meets a slow
boot-up issue with the removal of IPAT, it's desired to fix the blamed PAT
in the guest.

Returning guest MTRR type as if CR0.CD=0 is also not preferred because
KVMs ABI for the quirk also requires KVM to force WB memtype regardless of
guest MTRRs to workaround the slow guest boot-up issue.

In the future, honoring guest PAT will also allow KVM to more precisely
zap SPTEs when the effective memtype changes.  E.g. by not forcing WB when
CR0.CD=1, instead of zapping SPTEs when guest MTRRs change, KVM can skip
MTRR-induced zaps if CR0.CD=1 and zap SPTEs for non-WB MTRR ranges when
CR0.CD is toggled (WB MTRR SPTEs can be kept because they're WB regardless
of CR0.CD).

The change of removing IPAT has been verified with normal boot-up time
on old OVMF of commit c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7 as well,
dated back to Apr 14 2015.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20230714065326.20557-1-yan.y.zhao@intel.com
[sean: massage changelog to apply patch without full series]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-10 17:04:42 -07:00
Sean Christopherson
aeb904f6b9 KVM: x86: Refactor can_emulate_instruction() return to be more expressive
Refactor and rename can_emulate_instruction() to allow vendor code to
return more than true/false, e.g. to explicitly differentiate between
"retry", "fault", and "unhandleable".  For now, just do the plumbing, a
future patch will expand SVM's implementation to signal outright failure
if KVM attempts EMULTYPE_SKIP on an SEV guest.

No functional change intended (or rather, none that are visible to the
guest or userspace).

Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:08:53 -07:00
Haitao Shan
9cfec6d097 KVM: x86: Fix lapic timer interrupt lost after loading a snapshot.
When running android emulator (which is based on QEMU 2.12) on
certain Intel hosts with kernel version 6.3-rc1 or above, guest
will freeze after loading a snapshot. This is almost 100%
reproducible. By default, the android emulator will use snapshot
to speed up the next launching of the same android guest. So
this breaks the android emulator badly.

I tested QEMU 8.0.4 from Debian 12 with an Ubuntu 22.04 guest by
running command "loadvm" after "savevm". The same issue is
observed. At the same time, none of our AMD platforms is impacted.
More experiments show that loading the KVM module with
"enable_apicv=false" can workaround it.

The issue started to show up after commit 8e6ed96cdd ("KVM: x86:
fire timer when it is migrated and expired, and in oneshot mode").
However, as is pointed out by Sean Christopherson, it is introduced
by commit 967235d320 ("KVM: vmx: clear pending interrupts on
KVM_SET_LAPIC"). commit 8e6ed96cdd ("KVM: x86: fire timer when
it is migrated and expired, and in oneshot mode") just makes it
easier to hit the issue.

Having both commits, the oneshot lapic timer gets fired immediately
inside the KVM_SET_LAPIC call when loading the snapshot. On Intel
platforms with APIC virtualization and posted interrupt processing,
this eventually leads to setting the corresponding PIR bit. However,
the whole PIR bits get cleared later in the same KVM_SET_LAPIC call
by apicv_post_state_restore. This leads to timer interrupt lost.

The fix is to move vmx_apicv_post_state_restore to the beginning of
the KVM_SET_LAPIC call and rename to vmx_apicv_pre_state_restore.
What vmx_apicv_post_state_restore does is actually clearing any
former apicv state and this behavior is more suitable to carry out
in the beginning.

Fixes: 967235d320 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Haitao Shan <hshan@google.com>
Link: https://lore.kernel.org/r/20230913000215.478387-1-hshan@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 10:19:29 -07:00
Roman Kagan
b29a2acd36 KVM: x86/pmu: Truncate counter value to allowed width on write
Performance counters are defined to have width less than 64 bits.  The
vPMU code maintains the counters in u64 variables but assumes the value
to fit within the defined width.  However, for Intel non-full-width
counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
truncated to 32 bits and then sign-extended to full 64 bits.  If a
negative value is set, it's sign-extended to 64 bits, but then in
kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
previous value for overflow detection.

That previous value is not truncated, so it always evaluates bigger than
the truncated new one, and a PMI is injected.  If the PMI handler writes
a negative counter value itself, the vCPU never quits the PMI loop.

Turns out that Linux PMI handler actually does write the counter with
the value just read with RDPMC, so when no full-width support is exposed
via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
a negative value, it locks up.

This has been observed in the field, for example, when the guest configures
atop to use perfevents and runs two instances of it simultaneously.

To address the problem, maintain the invariant that the counter value
always fits in the defined bit width, by truncating the received value
in the respective set_msr methods.  For better readability, factor the
out into a helper function, pmc_write_counter(), shared by vmx and svm
parts.

Fixes: 9cd803d496 ("KVM: x86: Update vPMCs when retiring instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Roman Kagan <rkagan@amazon.de>
Link: https://lore.kernel.org/all/20230504120042.785651-1-rkagan@amazon.de
Tested-by: Like Xu <likexu@tencent.com>
[sean: tweak changelog, s/set/write in the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25 14:30:44 -07:00
Linus Torvalds
0c02183427 ARM:
* Clean up vCPU targets, always returning generic v8 as the preferred target
 
 * Trap forwarding infrastructure for nested virtualization (used for traps
   that are taken from an L2 guest and are needed by the L1 hypervisor)
 
 * FEAT_TLBIRANGE support to only invalidate specific ranges of addresses
   when collapsing a table PTE to a block PTE.  This avoids that the guest
   refills the TLBs again for addresses that aren't covered by the table PTE.
 
 * Fix vPMU issues related to handling of PMUver.
 
 * Don't unnecessary align non-stack allocations in the EL2 VA space
 
 * Drop HCR_VIRT_EXCP_MASK, which was never used...
 
 * Don't use smp_processor_id() in kvm_arch_vcpu_load(),
   but the cpu parameter instead
 
 * Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort()
 
 * Remove prototypes without implementations
 
 RISC-V:
 
 * Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest
 
 * Added ONE_REG interface for SATP mode
 
 * Added ONE_REG interface to enable/disable multiple ISA extensions
 
 * Improved error codes returned by ONE_REG interfaces
 
 * Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V
 
 * Added get-reg-list selftest for KVM RISC-V
 
 s390:
 
 * PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch)
   Allows a PV guest to use crypto cards. Card access is governed by
   the firmware and once a crypto queue is "bound" to a PV VM every
   other entity (PV or not) looses access until it is not bound
   anymore. Enablement is done via flags when creating the PV VM.
 
 * Guest debug fixes (Ilya)
 
 x86:
 
 * Clean up KVM's handling of Intel architectural events
 
 * Intel bugfixes
 
 * Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use debug
   registers and generate/handle #DBs
 
 * Clean up LBR virtualization code
 
 * Fix a bug where KVM fails to set the target pCPU during an IRTE update
 
 * Fix fatal bugs in SEV-ES intrahost migration
 
 * Fix a bug where the recent (architecturally correct) change to reinject
   #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)
 
 * Retry APIC map recalculation if a vCPU is added/enabled
 
 * Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
   "emergency disabling" behavior to KVM actually being loaded, and move all of
   the logic within KVM
 
 * Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
   ratio MSR cannot diverge from the default when TSC scaling is disabled
   up related code
 
 * Add a framework to allow "caching" feature flags so that KVM can check if
   the guest can use a feature without needing to search guest CPUID
 
 * Rip out the ancient MMU_DEBUG crud and replace the useful bits with
   CONFIG_KVM_PROVE_MMU
 
 * Fix KVM's handling of !visible guest roots to avoid premature triple fault
   injection
 
 * Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface
   that is needed by external users (currently only KVMGT), and fix a variety
   of issues in the process
 
 This last item had a silly one-character bug in the topic branch that
 was sent to me.  Because it caused pretty bad selftest failures in
 some configurations, I decided to squash in the fix.  So, while the
 exact commit ids haven't been in linux-next, the code has (from the
 kvm-x86 tree).
 
 Generic:
 
 * Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass
   action specific data without needing to constantly update the main handlers.
 
 * Drop unused function declarations
 
 Selftests:
 
 * Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
 
 * Add support for printf() in guest code and covert all guest asserts to use
   printf-based reporting
 
 * Clean up the PMU event filter test and add new testcases
 
 * Include x86 selftests in the KVM x86 MAINTAINERS entry
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmT1m0kUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMNgggAiN7nz6UC423FznuI+yO3TLm8tkx1
 CpKh5onqQogVtchH+vrngi97cfOzZb1/AtifY90OWQi31KEWhehkeofcx7G6ERhj
 5a9NFADY1xGBsX4exca/VHDxhnzsbDWaWYPXw5vWFWI6erft9Mvy3tp1LwTvOzqM
 v8X4aWz+g5bmo/DWJf4Wu32tEU6mnxzkrjKU14JmyqQTBawVmJ3RYvHVJ/Agpw+n
 hRtPAy7FU6XTdkmq/uCT+KUHuJEIK0E/l1js47HFAqSzwdW70UDg14GGo1o4ETxu
 RjZQmVNvL57yVgi6QU38/A0FWIsWQm5IlaX1Ug6x8pjZPnUKNbo9BY4T1g==
 =W+4p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Clean up vCPU targets, always returning generic v8 as the preferred
     target

   - Trap forwarding infrastructure for nested virtualization (used for
     traps that are taken from an L2 guest and are needed by the L1
     hypervisor)

   - FEAT_TLBIRANGE support to only invalidate specific ranges of
     addresses when collapsing a table PTE to a block PTE. This avoids
     that the guest refills the TLBs again for addresses that aren't
     covered by the table PTE.

   - Fix vPMU issues related to handling of PMUver.

   - Don't unnecessary align non-stack allocations in the EL2 VA space

   - Drop HCR_VIRT_EXCP_MASK, which was never used...

   - Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu
     parameter instead

   - Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort()

   - Remove prototypes without implementations

  RISC-V:

   - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest

   - Added ONE_REG interface for SATP mode

   - Added ONE_REG interface to enable/disable multiple ISA extensions

   - Improved error codes returned by ONE_REG interfaces

   - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V

   - Added get-reg-list selftest for KVM RISC-V

  s390:

   - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch)

     Allows a PV guest to use crypto cards. Card access is governed by
     the firmware and once a crypto queue is "bound" to a PV VM every
     other entity (PV or not) looses access until it is not bound
     anymore. Enablement is done via flags when creating the PV VM.

   - Guest debug fixes (Ilya)

  x86:

   - Clean up KVM's handling of Intel architectural events

   - Intel bugfixes

   - Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use
     debug registers and generate/handle #DBs

   - Clean up LBR virtualization code

   - Fix a bug where KVM fails to set the target pCPU during an IRTE
     update

   - Fix fatal bugs in SEV-ES intrahost migration

   - Fix a bug where the recent (architecturally correct) change to
     reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to
     skip it)

   - Retry APIC map recalculation if a vCPU is added/enabled

   - Overhaul emergency reboot code to bring SVM up to par with VMX, tie
     the "emergency disabling" behavior to KVM actually being loaded,
     and move all of the logic within KVM

   - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the
     TSC ratio MSR cannot diverge from the default when TSC scaling is
     disabled up related code

   - Add a framework to allow "caching" feature flags so that KVM can
     check if the guest can use a feature without needing to search
     guest CPUID

   - Rip out the ancient MMU_DEBUG crud and replace the useful bits with
     CONFIG_KVM_PROVE_MMU

   - Fix KVM's handling of !visible guest roots to avoid premature
     triple fault injection

   - Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the
     API surface that is needed by external users (currently only
     KVMGT), and fix a variety of issues in the process

  Generic:

   - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier
     events to pass action specific data without needing to constantly
     update the main handlers.

   - Drop unused function declarations

  Selftests:

   - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs

   - Add support for printf() in guest code and covert all guest asserts
     to use printf-based reporting

   - Clean up the PMU event filter test and add new testcases

   - Include x86 selftests in the KVM x86 MAINTAINERS entry"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (279 commits)
  KVM: x86/mmu: Include mmu.h in spte.h
  KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
  KVM: x86/mmu: Disallow guest from using !visible slots for page tables
  KVM: x86/mmu: Harden TDP MMU iteration against root w/o shadow page
  KVM: x86/mmu: Harden new PGD against roots without shadow pages
  KVM: x86/mmu: Add helper to convert root hpa to shadow page
  drm/i915/gvt: Drop final dependencies on KVM internal details
  KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers
  KVM: x86/mmu: Drop @slot param from exported/external page-track APIs
  KVM: x86/mmu: Bug the VM if write-tracking is used but not enabled
  KVM: x86/mmu: Assert that correct locks are held for page write-tracking
  KVM: x86/mmu: Rename page-track APIs to reflect the new reality
  KVM: x86/mmu: Drop infrastructure for multiple page-track modes
  KVM: x86/mmu: Use page-track notifiers iff there are external users
  KVM: x86/mmu: Move KVM-only page-track declarations to internal header
  KVM: x86: Remove the unused page-track hook track_flush_slot()
  drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region()
  KVM: x86: Add a new page-track hook to handle memslot deletion
  drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot
  KVM: x86: Reject memslot MOVE operations if KVMGT is attached
  ...
2023-09-07 13:52:20 -07:00
Paolo Bonzini
6d5e3c318a KVM x86 changes for 6.6:
- Misc cleanups
 
  - Retry APIC optimized recalculation if a vCPU is added/enabled
 
  - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
    "emergency disabling" behavior to KVM actually being loaded, and move all of
    the logic within KVM
 
  - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
    ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
    up related code
 
  - Add a framework to allow "caching" feature flags so that KVM can check if
    the guest can use a feature without needing to search guest CPUID
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS
 Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis
 KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas
 yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7
 wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058
 5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t
 vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT
 DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8
 2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl
 wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp
 pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J
 gLdtzs8M9K9t
 =yGM1
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD

KVM x86 changes for 6.6:

 - Misc cleanups

 - Retry APIC optimized recalculation if a vCPU is added/enabled

 - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
   "emergency disabling" behavior to KVM actually being loaded, and move all of
   the logic within KVM

 - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
   ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
   up related code

 - Add a framework to allow "caching" feature flags so that KVM can check if
   the guest can use a feature without needing to search guest CPUID
2023-08-31 13:36:33 -04:00
Paolo Bonzini
755e732dde KVM: x86: VMX changes for 6.6:
- Misc cleanups
 
  - Fix a bug where KVM reads a stale vmcs.IDT_VECTORING_INFO_FIELD when trying
    to handle NMI VM-Exits
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTufN0SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5tIsP/0J3wexeADGWuj7cckymPBdbZbt4Jkd7
 LDpOPEjCgFew+uAt7xXRfV4AOnXiSTrdos2tpjQGuNPeg18AAwv6Ulq67ia3AF7E
 t/0A0Lp/TK+L+tuGDcEYhBAm+TXyM7yknbnw+meL+VHBK5MfucGAimBcsR6zhZNA
 VeXgB4+XJ9PS+p01ixQFwrBXs1Orm9UOGJ6P8OlGWoNFYJVX70t+Rxk4Rgs/buUk
 pXHllofEL70MchrIJ/0rZUh9CY/foFnpb76h4QASbI401yQsJV+p1FPgu4kyVSfE
 aBhshJBxbrS9bcQ2K0Bl++JXFc1BDkhxcEIUh/P8t9bmG2vzYlKs+ifVtXeD7uYb
 wENIWNO8FsYnTj9e8dpQFfi7spiemcMnayd/TEmt0cyJdC8vPgSR1TnmrLlrM/Fs
 msOZyaOxZaxHjYtEvk5H/u834GgWwNq3fxw0Ap6ic+pJ2gcSay4NQGRtJx2wB4MY
 VdR6w1wMttJpDMmQgmY+pVI91RGjE3O/bLUzKySGSrgQZhqW6I+/itT4IXt9pBST
 A3pi9WCsbzQr1v4/97a/OO2zKsdA7WOD/l0CcUoEjBzRxbhSazASC8PJHnKDBPhE
 OsD5DEKh03bVfgIMGev9fTbfudAfLomRX1JrGx8DXUBYgckgieXM5U89pf9Fpr//
 24dstKOTCGwD
 =6/Oc
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.6' of https://github.com/kvm-x86/linux into HEAD

KVM: x86: VMX changes for 6.6:

 - Misc cleanups

 - Fix a bug where KVM reads a stale vmcs.IDT_VECTORING_INFO_FIELD when trying
   to handle NMI VM-Exits
2023-08-31 13:32:06 -04:00
Paolo Bonzini
8783790a5e KVM x86 PMU changes for 6.6:
- Clean up KVM's handling of Intel architectural events
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTuejESHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5pMwP+wUH17mXy3Q3d3L7YxKemIsdQLozb12+
 VaeE0EGxmDy2RRKAZ+B4O7iHGknmtyC2iW4r//Q3xaJd61M2ir6Zfib0f4OalTr6
 5W1iElF91JCuo0Fg6aLs64xgUn61iblSGTJgGxMrr2YNhzExDvY41wCOHf4ZBToL
 NsEYXyT14Pk7K3qfDk/ENMyb69bVbwz/aGF/v0lzFKfKRPU6Uw516l+qX1feKAdG
 BUZtXULKl4eZHkKtJXdRrvmhHlWwQoa25s8J05RgEuQMeVHLhi6mXtY4+NAfmZar
 Cn+e0VcyRnwe6rF4swkjoQvX1GLOgLTltiZrE9d0FOtWhgPNAF9T2fPcMhNtvuGI
 bts7Njufgm85IdkCGNHDi+Z8W5iJz+PgtkdGKKY/u4cDFfJBGngEKLnGDSVkVl6x
 ndGPO5i2vKaJW26VQxPz7L+ra2qmcU3qSPIDnkodHrDFogh6G7pa4NaL58Kc2TLm
 KG2L3x6DxiCWRAYoq0h37Zl6Ye2THzcAkErBzV64Iqn5ehJk6DMPFVibom4nWm77
 4v2U5d1dp68O/1orkZyovsZ0E35L4am0TXlVEVInxRHstLl07YL47KAibvtrx1sB
 6n67dNKwIc61Gavp2IOuTiuP0Y6bEtK7vWPI/oaaF+OlzVU/Yk3jCS/B9dxs5pVF
 ZrcQGYSaqVq6
 =dNMK
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.6' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.6:

 - Clean up KVM's handling of Intel architectural events
2023-08-31 13:31:32 -04:00
Sean Christopherson
50011c2a24 KVM: VMX: Refresh available regs and IDT vectoring info before NMI handling
Reset the mask of available "registers" and refresh the IDT vectoring
info snapshot in vmx_vcpu_enter_exit(), before KVM potentially handles a
an NMI VM-Exit.  One of the "registers" that KVM VMX lazily loads is the
vmcs.VM_EXIT_INTR_INFO field, which is holds the vector+type on "exception
or NMI" VM-Exits, i.e. is needed to identify NMIs.  Clearing the available
registers bitmask after handling NMIs results in KVM querying info from
the last VM-Exit that read vmcs.VM_EXIT_INTR_INFO, and leads to both
missed NMIs and spurious NMIs in the host.

Opportunistically grab vmcs.IDT_VECTORING_INFO_FIELD early in the VM-Exit
path too, e.g. to guard against similar consumption of stale data.  The
field is read on every "normal" VM-Exit, and there's no point in delaying
the inevitable.

Reported-by: Like Xu <like.xu.linux@gmail.com>
Fixes: 11df586d77 ("KVM: VMX: Handle NMI VM-Exits in noinstr region")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230825014532.2846714-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-28 20:07:43 -07:00
Sean Christopherson
9ca0c1a126 KVM: VMX: Delete ancient pr_warn() about KVM_SET_TSS_ADDR not being set
Delete KVM's printk about KVM_SET_TSS_ADDR not being called.  When the
printk was added by commit 776e58ea3d ("KVM: unbreak userspace that does
not sets tss address"), KVM also stuffed a "hopefully safe" value, i.e.
the message wasn't purely informational.  For reasons unknown, ostensibly
to try and help people running outdated qemu-kvm versions, the message got
left behind when KVM's stuffing was removed by commit 4918c6ca68
("KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU").

Today, the message is completely nonsensical, as it has been over a decade
since KVM supported userspace running a Real Mode guest, on a CPU without
unrestricted guest support, without doing KVM_SET_TSS_ADDR before KVM_RUN.
I.e. KVM's ABI has required KVM_SET_TSS_ADDR for 10+ years.

To make matters worse, the message is prone to false positives as it
triggers when simply *creating* a vCPU due to RESET putting vCPUs into
Real Mode, even when the user has no intention of ever *running* the vCPU
in a Real Mode.  E.g. KVM selftests stuff 64-bit mode and never touch Real
Mode, but trigger the message even though they run just fine without
doing KVM_SET_TSS_ADDR.  Creating "dummy" vCPUs, e.g. to probe features,
can also trigger the message.  In both scenarios, the message confuses
users and falsely implies that they've done something wrong.

Reported-by: Thorsten Glaser <t.glaser@tarent.de>
Closes: https://lkml.kernel.org/r/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956%40tarent.de
Link: https://lore.kernel.org/r/20230815174215.433222-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25 09:05:56 -07:00
Sean Christopherson
1c18efdaa3 KVM: nVMX: Use KVM-governed feature framework to track "nested VMX enabled"
Track "VMX exposed to L1" via a governed feature flag instead of using a
dedicated helper to provide the same functionality.  The main goal is to
drive convergence between VMX and SVM with respect to querying features
that are controllable via module param (SVM likes to cache nested
features), avoiding the guest CPUID lookups at runtime is just a bonus
and unlikely to provide any meaningful performance benefits.

Note, X86_FEATURE_VMX is set in kvm_cpu_caps if and only if "nested" is
true, and the CPU obviously supports VMX if KVM+VMX is running.  I.e. the
check on "nested" is now implicitly down by the kvm_cpu_cap_has() check
in kvm_governed_feature_check_and_set().

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviwed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:40:55 -07:00
Sean Christopherson
fe60e8f65f KVM: x86: Use KVM-governed feature framework to track "XSAVES enabled"
Use the governed feature framework to track if XSAVES is "enabled", i.e.
if XSAVES can be used by the guest.  Add a comment in the SVM code to
explain the very unintuitive logic of deliberately NOT checking if XSAVES
is enumerated in the guest CPUID model.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
662f681578 KVM: VMX: Rename XSAVES control to follow KVM's preferred "ENABLE_XYZ"
Rename the XSAVES secondary execution control to follow KVM's preferred
style so that XSAVES related logic can use common macros that depend on
KVM's preferred style.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230815203653.519297-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
0497d2ac9b KVM: VMX: Check KVM CPU caps, not just VMX MSR support, for XSAVE enabling
Check KVM CPU capabilities instead of raw VMX support for XSAVES when
determining whether or not XSAVER can/should be exposed to the guest.
Practically speaking, it's nonsensical/impossible for a CPU to support
"enable XSAVES" without XSAVES being supported natively.  The real
motivation for checking kvm_cpu_cap_has() is to allow using the governed
feature's standard check-and-set logic.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
1143c0b85c KVM: VMX: Recompute "XSAVES enabled" only after CPUID update
Recompute whether or not XSAVES is enabled for the guest only if the
guest's CPUID model changes instead of redoing the computation every time
KVM generates vmcs01's secondary execution controls.  The boot_cpu_has()
and cpu_has_vmx_xsaves() checks should never change after KVM is loaded,
and if they do the kernel/KVM is hosed.

Opportunistically add a comment explaining _why_ XSAVES is effectively
exposed to the guest if and only if XSAVE is also exposed to the guest.

Practically speaking, no functional change intended (KVM will do fewer
computations, but should still see the same xsaves_enabled value whenever
KVM looks at it).

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Shiyuan Gao
7d18eef136 KVM: VMX: Rename vmx_get_max_tdp_level() to vmx_get_max_ept_level()
In VMX, ept_level looks better than tdp_level and is consistent with
SVM's get_npt_level().

Signed-off-by: Shiyuan Gao <gaoshiyuan@baidu.com>
Link: https://lore.kernel.org/r/20230810113853.98114-1-gaoshiyuan@baidu.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:34:04 -07:00
Dave Hansen
28b8235238 x86/apic: Wrap IPI calls into helper functions
Move them to one place so the static call conversion gets simpler.

No functional change.

[ dhansen: merge against recent x86/apic changes ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09 12:00:55 -07:00
Sean Christopherson
2d63699099 KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure.  The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.

Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies.  Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.

This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.

Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:29 -07:00
Sean Christopherson
a788fbb763 KVM: VMX: Skip VMCLEAR logic during emergency reboots if CR4.VMXE=0
Bail from vmx_emergency_disable() without processing the list of loaded
VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON.  It should be
impossible for the list to have entries if VMX is already disabled, and
even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
processing the list is pointless even if it somehow isn't empty.

Assuming no existing KVM bugs, this should be a glorified nop.  The
primary motivation for the change is to avoid having code that looks like
it does VMCLEAR, but then skips VMXON, which is nonsensical.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
6ae44e012f KVM: x86: Force kvm_rebooting=true during emergency reboot/crash
Set kvm_rebooting when virtualization is disabled in an emergency so that
KVM eats faults on virtualization instructions even if kvm_reboot() isn't
reached.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
f9a8866040 KVM: VMX: Ensure CPU is stable when probing basic VMX support
Disable migration when probing VMX support during module load to ensure
the CPU is stable, mostly to match similar SVM logic, where allowing
migration effective requires deliberately writing buggy code.  As a bonus,
KVM won't report the wrong CPU to userspace if VMX is unsupported, but in
practice that is a very, very minor bonus as the only way that reporting
the wrong CPU would actually matter is if hardware is broken or if the
system is misconfigured, i.e. if KVM gets migrated from a CPU that _does_
support VMX to a CPU that does _not_ support VMX.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
22e420e127 x86/virt: KVM: Move VMXOFF helpers into KVM VMX
Now that VMX is disabled in emergencies via the virt callbacks, move the
VMXOFF helpers into KVM, the only remaining user.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
b6a6af0d19 x86/virt: KVM: Open code cpu_has_vmx() in KVM VMX
Fold the raw CPUID check for VMX into kvm_is_vmx_supported(), its sole
user.  Keep the check even though KVM also checks X86_FEATURE_VMX, as the
intent is to provide a unique error message if VMX is unsupported by
hardware, whereas X86_FEATURE_VMX may be clear due to firmware and/or
kernel actions.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
119b5cb4ff x86/reboot: KVM: Handle VMXOFF in KVM's reboot callback
Use KVM VMX's reboot/crash callback to do VMXOFF in an emergency instead
of manually and blindly doing VMXOFF.  There's no need to attempt VMXOFF
if a hypervisor, i.e. KVM, isn't loaded/active, i.e. if the CPU can't
possibly be post-VMXON.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
5e408396c6 x86/reboot: Harden virtualization hooks for emergency reboot
Provide dedicated helpers to (un)register virt hooks used during an
emergency crash/reboot, and WARN if there is an attempt to overwrite
the registered callback, or an attempt to do an unpaired unregister.

Opportunsitically use rcu_assign_pointer() instead of RCU_INIT_POINTER(),
mainly so that the set/unset paths are more symmetrical, but also because
any performance gains from using RCU_INIT_POINTER() are meaningless for
this code.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
b23c83ad2c x86/reboot: VMCLEAR active VMCSes before emergency reboot
VMCLEAR active VMCSes before any emergency reboot, not just if the kernel
may kexec into a new kernel after a crash.  Per Intel's SDM, the VMX
architecture doesn't require the CPU to flush the VMCS cache on INIT.  If
an emergency reboot doesn't RESET CPUs, cached VMCSes could theoretically
be kept and only be written back to memory after the new kernel is booted,
i.e. could effectively corrupt memory after reboot.

Opportunistically remove the setting of the global pointer to NULL to make
checkpatch happy.

Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
6d88d0ee5d KVM: x86/pmu: Require nr fixed_pmc_events to match nr max fixed counters
Assert that the number of known fixed_pmc_events matches the max number of
fixed counters supported by KVM, and clean up related code.

Opportunistically extend setup_fixed_pmc_eventsel()'s use of
array_index_nospec() to cover fixed_counters, as nr_arch_fixed_counters is
set based on userspace input (but capped using KVM-controlled values).

Link: https://lore.kernel.org/r/20230607010206.1425277-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:44:36 -07:00
Sean Christopherson
bc9658999b KVM: x86/pmu: Simplify intel_hw_event_available()
Walk only the "real", i.e. non-pseudo, architectural events when checking
if a hardware event is available, i.e. isn't disabled by guest CPUID.
Skipping pseudo-arch events in the loop body is unnecessarily convoluted,
especially now that KVM has enums that delineate between real and pseudo
events.

Link: https://lore.kernel.org/r/20230607010206.1425277-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:44:36 -07:00
Sean Christopherson
0033fa3549 KVM: x86/pmu: Use enums instead of hardcoded magic for arch event indices
Add "enum intel_pmu_architectural_events" to replace the magic numbers for
the (pseudo-)architectural events, and to give a meaningful name to each
event so that new readers don't need psychic powers to understand what the
code is doing.

Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230607010206.1425277-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:44:36 -07:00
Sean Christopherson
775bc09865 KVM: VMX: Drop manual TLB flush when migrating vmcs.APIC_ACCESS_ADDR
Remove the superfluous flush of the current TLB in VMX's handling of
migration of the APIC-access page, as a full TLB flush on all vCPUs will
have already been performed in response to kvm_unmap_gfn_range() *if*
there were SPTEs pointing at the APIC-access page.  And if there were no
valid SPTEs, then there can't possibly be TLB entries to flush.

The extra flush was added by commit fb6c819843 ("kvm: vmx: Flush TLB
when the APIC-access address changes"), with the justification of "because
the SDM says so".  The SDM said, and still says:

 As detailed in Section xx.x.x, an access to the APIC-access page might
 not cause an APIC-access VM exit if software does not properly invalidate
 information that may be cached from the EPT paging structures. If EPT was
 in use on a logical processor at one time with EPTP X, it is recommended
 that software use the INVEPT instruction with the “single-context” INVEPT
 type and with EPTP X in the INVEPT descriptor before a VM entry on the
 same logical processor that enables EPT with EPTP X and either (a) the
 "virtualize APIC accesses" VM- execution control was changed from 0 to 1;
 or (b) the value of the APIC-access address was changed.

But the "recommendation" for (b) is predicated on there actually being
a valid EPT translation *and* possible TLB entries for the GPA (or guest
VA when using shadow paging).  It's possible that a different vCPU has
established a mapping for the new page, but the current vCPU can't have
entered the guest, i.e. can't have created a TLB entry, between flushing
the old mappings and changing its vmcs.APIC_ACCESS_ADDR.

kvm_unmap_gfn_range() waits for all vCPUs to ack KVM_REQ_APIC_PAGE_RELOAD,
and then flushes remote TLBs (which may or may not also pend a request).
Thus the vCPU is guaranteed to update vmcs.APIC_ACCESS_ADDR before
re-entering the guest and before it can possibly create new TLB entries.

In other words, KVM does flush in this case, it just does so earlier
on while handling the page migration.

Note, VMX also flushes if the vCPU is migrated to a new pCPU, i.e. if
the vCPU is migrated to a pCPU that entered the guest for a different
vCPU.

Suggested-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230721233858.2343941-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:41:15 -07:00
Sean Christopherson
550ba57faa KVM: VMX: Drop unnecessary vmx_fb_clear_ctrl_available "cache"
Now that KVM snapshots the host's MSR_IA32_ARCH_CAPABILITIES, drop the
similar snapshot/cache of whether or not KVM is allowed to manipulate
MSR_IA32_MCU_OPT_CTRL.FB_CLEAR_DIS.  The motivation for the cache was
presumably to avoid the RDMSR, e.g. boot_cpu_has_bug() is quite cheap, and
modifying the vCPU's MSR_IA32_ARCH_CAPABILITIES is an infrequent option
and a relatively slow path.

Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:37:30 -07:00
Sean Christopherson
a2fd5d02ba KVM: x86: Snapshot host's MSR_IA32_ARCH_CAPABILITIES
Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead
of reading the MSR every time KVM wants to query the host state, e.g. when
initializing the default value during vCPU creation.  The paths that query
ARCH_CAPABILITIES aren't particularly performance sensitive, but creating
vCPUs is a frequent enough operation that burning 8 bytes is a good
trade-off.

Alternatively, KVM could add a field in kvm_caps and thus skip the
on-demand calculations entirely, but a pure snapshot isn't possible due to
the way KVM handles the l1tf_vmx_mitigation module param.  And unlike the
other "supported" fields in kvm_caps, KVM doesn't enforce the "supported"
value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets
userspace advertise whatever it wants.  Those problems are solvable, but
it's not clear there is real benefit versus snapshotting the host value,
and grabbing the host value will allow additional cleanup of KVM's
FB_CLEAR_CTRL code.

Link: https://lore.kernel.org/all/20230524061634.54141-2-chao.gao@intel.com
Cc: Chao Gao <chao.gao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:37:26 -07:00
Like Xu
1d6664fadd KVM: x86: Use sysfs_emit() instead of sprintf()
Use sysfs_emit() instead of the sprintf() for sysfs entries. sysfs_emit()
knows the maximum of the temporary buffer used for outputting sysfs
content and avoids overrunning the buffer length.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230625073438.57427-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-31 13:55:26 -07:00
Sean Christopherson
c4abd73520 KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guest
Stuff CR0 and/or CR4 to be compliant with a restricted guest if and only
if KVM itself is not configured to utilize unrestricted guests, i.e. don't
stuff CR0/CR4 for a restricted L2 that is running as the guest of an
unrestricted L1.  Any attempt to VM-Enter a restricted guest with invalid
CR0/CR4 values should fail, i.e. in a nested scenario, KVM (as L0) should
never observe a restricted L2 with incompatible CR0/CR4, since nested
VM-Enter from L1 should have failed.

And if KVM does observe an active, restricted L2 with incompatible state,
e.g. due to a KVM bug, fudging CR0/CR4 instead of letting VM-Enter fail
does more harm than good, as KVM will often neglect to undo the side
effects, e.g. won't clear rmode.vm86_active on nested VM-Exit, and thus
the damage can easily spill over to L1.  On the other hand, letting
VM-Enter fail due to bad guest state is more likely to contain the damage
to L2 as KVM relies on hardware to perform most guest state consistency
checks, i.e. KVM needs to be able to reflect a failed nested VM-Enter into
L1 irrespective of (un)restricted guest behavior.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Fixes: bddd82d19e ("KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 11:05:32 -04:00
Sean Christopherson
26a0652cb4 KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode.  The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).

Opportunistically fix a benign typo in the prototype for is_valid_cr4().

Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 11:05:31 -04:00
Sean Christopherson
a062dad7ec KVM: VMX: Use vmread_error() to report VM-Fail in "goto" path
Use vmread_error() to report VM-Fail on VMREAD for the "asm goto" case,
now that trampoline case has yet another wrapper around vmread_error() to
play nice with instrumentation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 11:05:26 -04:00
Sean Christopherson
c20d403fd0 KVM: VMX: Make VMREAD error path play nice with noinstr
Mark vmread_error_trampoline() as noinstr, and add a second trampoline
for the CONFIG_CC_HAS_ASM_GOTO_OUTPUT=n case to enable instrumentation
when handling VM-Fail on VMREAD.  VMREAD is used in various noinstr
flows, e.g. immediately after VM-Exit, and objtool rightly complains that
the call to the error trampoline leaves a no-instrumentation section
without annotating that it's safe to do so.

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0xc9:
  call to vmread_error_trampoline() leaves .noinstr.text section

Note, strictly speaking, enabling instrumentation in the VM-Fail path
isn't exactly safe, but if VMREAD fails the kernel/system is likely hosed
anyways, and logging that there is a fatal error is more important than
*maybe* encountering slightly unsafe instrumentation.

Reported-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 11:05:26 -04:00
Peng Hao
fd1815ea70 KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipiv
The pid_table of ipiv is the persistent memory allocated by
per-vcpu, which should be counted into the memory cgroup.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <CAPm50aLxCQ3TQP2Lhc0PX3y00iTRg+mniLBqNDOC=t9CLxMwwA@mail.gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 11:05:25 -04:00
Paolo Bonzini
255006adb3 KVM VMX changes for 6.5:
- Fix missing/incorrect #GP checks on ENCLS
 
  - Use standard mmu_notifier hooks for handling APIC access page
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaLDYSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5ovYP/ib86UG9QXwoEKx0mIyLQ5q1jD+StvxH
 18SIH62+MXAtmz2E+EmXIySW76diOKCngApJ11WTERPwpZYEpcITh2D2Jp/vwgk5
 xUPK+WKYQs1SGpJu3wXhLE1u6mB7X9p7EaXRSKG67P7YK09gTaOik1/3h6oNrGO+
 KI06reCQN1PstKTfrZXxYpRlfDc761YaAmSZ79Bg+bK9PisFqme7TJ2mAqNZPFPd
 E7ho/UOEyWRSyd5VMsuOUB760pMQ9edKrs+38xNDp5N+0Fh0ItTjuAcd2KVWMZyW
 Fk+CJq4kCqTlEik5OwcEHsTGJGBFscGPSO+T0YtVfSZDdtN/rHN7l8RGquOebVTG
 Ldm5bg4agu4lXsqqzMxn8J9SkbNg3xno79mMSc2185jS2HLt5Hu6PzQnQ2tEtHJQ
 IuovmssHOVKDoYODOg0tq8UMydgT3hAvC7YJCouubCjxUUw+22nhN3EDuAhbJhtT
 DgQNGT7GmsrKIWLEjbm6EpLLOdJdB7/U1MrEshLS015a/DUz4b3ZGYApneifJL8h
 nGE2Wu+36xGUVNLgDMdvd+R17WdyQa+f+9KjUGy71KelFV4vI4A3JwvH0aIsTyHZ
 LGlQBZqelc66GYwMiqVC0GYGRtrdgygQopfstvZJ3rYiHZV/mdhB5A0T4J2Xvh2Q
 bnDNzsSFdsH5
 =PjYj
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.5' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.5:

 - Fix missing/incorrect #GP checks on ENCLS

 - Use standard mmu_notifier hooks for handling APIC access page

 - Misc cleanups
2023-07-01 07:20:04 -04:00
Paolo Bonzini
751d77fefa KVM x86/pmu changes for 6.5:
- Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
    included along the way
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaHFgSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5twMP/15ZJFqZVigVQoATJeeR9tWUuyJe95xM
 lyfnTel91Sg8XOamdwBGi7jLpaDgj34Jm0cfM7/4LbJk2/taeaCLYmJd5w9FXvaw
 EkytQGO85hVNe2XuY+h+XxSIxpflKxgFuUnOwcDk2QbKgASzNSG/mJ9ZBx8PNVXD
 FnyOqpbbYDFspWWvUOAI/RkHnr/dALjXJsSUMvuh3nz5e1NTyubjCAZg+/bse2nR
 s8FrcSh4B0Lg0h4r2fdJ4sAiM/qWhcCIhq5svyTAcUG0T4rMS40LrosJOw3wkBRM
 dyZYXy6GEENeCFJPhenF1mTE1embFyZp89PV/FCNRZXODbnM4kheJFT9gucAjlKi
 ZafRcutrkYIVf4lZCMofDfQGLX/GCEJnwUPKyGygIsPoDRrdR7OLrFycON5bxocr
 9NBNG+2teQFbnt5irB/bBGojtIZtu3OEylkuRjQUQ3lJYQ5r6LddarI9acIu1SHt
 4rRfh8QN5qmMvVblaQzggOr6BPtmPr8QqMEMFncaUMCsV/82hRAEfvj2rifGFJNo
 Axz1ajMfirxyM45WzredUkzzsbphiiegPBELCLRZfHmaEhJ8P7t7wvri0bXt9YdI
 vjSfX+6ulOgDC+xAazE0gEJO4Uh5+g3Y+1e0fr43ltWzUOWdCQskzD3LE9DkqIXj
 KAaCuHYbYpIZ
 =MwqV
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.5' of https://github.com/kvm-x86/linux into HEAD

KVM x86/pmu changes for 6.5:

 - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
   included along the way
2023-07-01 07:18:51 -04:00
Like Xu
6a08083f29 KVM: x86/pmu: Disable vPMU if the minimum num of counters isn't met
Disable PMU support when running on AMD and perf reports fewer than four
general purpose counters. All AMD PMUs must define at least four counters
due to AMD's legacy architecture hardcoding the number of counters
without providing a way to enumerate the number of counters to software,
e.g. from AMD's APM:

 The legacy architecture defines four performance counters (PerfCtrn)
 and corresponding event-select registers (PerfEvtSeln).

Virtualizing fewer than four counters can lead to guest instability as
software expects four counters to be available. Rather than bleed AMD
details into the common code, just define a const unsigned int and
provide a convenient location to document why Intel and AMD have different
mins (in particular, AMD's lack of any way to enumerate less than four
counters to the guest).

Keep the minimum number of counters at Intel at one, even though old P6
and Core Solo/Duo processor effectively require a minimum of two counters.
KVM can, and more importantly has up until this point, supported a vPMU so
long as the CPU has at least one counter.  Perf's support for P6/Core CPUs
does require two counters, but perf will happily chug along with a single
counter when running on a modern CPU.

Cc: Jim Mattson <jmattson@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: set Intel min to '1', not '2']
Link: https://lore.kernel.org/r/20230603011058.1038821-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 17:31:44 -07:00
Like Xu
13afa29ae4 KVM: x86/pmu: Provide Intel PMU's pmc_is_enabled() as generic x86 code
Move the Intel PMU implementation of pmc_is_enabled() to common x86 code
as pmc_is_globally_enabled(), and drop AMD's implementation.  AMD PMU
currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the
semantics for AMD are unchanged.  And when support for AMD PMU v2 comes
along, the common behavior will also Just Work.

Signed-off-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 17:31:44 -07:00
Like Xu
c85cdc1cc1 KVM: x86/pmu: Move handling PERF_GLOBAL_CTRL and friends to common x86
Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL,
a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code.
AMD PerfMonV2 defines three registers that have the same semantics as
Intel's variants, just with different names and indices.  Conveniently,
since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and
AMD's version shows up in v2, KVM can use common code for the existence
check as well.

Signed-off-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 17:31:44 -07:00
Like Xu
30dab5c0b6 KVM: x86/pmu: Reject userspace attempts to set reserved GLOBAL_STATUS bits
Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set
reserved bits.  Allowing userspace to stuff reserved bits doesn't harm KVM
itself, but it's architecturally wrong and the guest can't clear the
unsupported bits, e.g. makes the guest's PMI handler very confused.

Signed-off-by: Like Xu <likexu@tencent.com>
[sean: rewrite changelog to avoid use of #GP, rebase on name change]
Link: https://lore.kernel.org/r/20230603011058.1038821-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 17:31:44 -07:00
Like Xu
8de18543df KVM: x86/pmu: Move reprogram_counters() to pmu.h
Move reprogram_counters() out of Intel specific PMU code and into pmu.h so
that it can be used to implement AMD PMU v2 support.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: rewrite changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 17:31:44 -07:00
Sean Christopherson
53550b8922 KVM: x86/pmu: Rename global_ovf_ctrl_mask to global_status_mask
Rename global_ovf_ctrl_mask to global_status_mask to avoid confusion now
that Intel has renamed GLOBAL_OVF_CTRL to GLOBAL_STATUS_RESET in PMU v4.
GLOBAL_OVF_CTRL and GLOBAL_STATUS_RESET are the same MSR index, i.e. are
just different names for the same thing, but the SDM provides different
entries in the IA-32 Architectural MSRs table, which gets really confusing
when looking at PMU v4 definitions since it *looks* like GLOBAL_STATUS has
bits that don't exist in GLOBAL_OVF_CTRL, but in reality the bits are
simply defined in the GLOBAL_STATUS_RESET entry.

No functional change intended.

Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 17:31:44 -07:00
Sean Christopherson
878940b33d KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
Re-request an APIC-access page reload if there is a relevant mmu_notifier
invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
vCPUs until the backing pfn for the APIC-access page is "officially"
stable.  Relying on the primary MMU to not make changes after invoking
->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
would also trigger an ->invalidate_range(), but using ->invalidate_range()
to fudge around KVM not honoring past and in-progress invalidations is a
bit hacky.

Honoring invalidations will allow using KVM's standard mmu_notifier hooks
to detect APIC-access page reloads, which will in turn allow removing
KVM's implementation of ->invalidate_range() (the APIC-access page case is
a true one-off).

Opportunistically add a comment to explain why doing nothing if a memslot
isn't found is functionally correct.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230602011518.787006-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 15:07:04 -07:00
Sean Christopherson
c3a1e119a3 KVM: VMX: Inject #GP, not #UD, if SGX2 ENCLS leafs are unsupported
Per Intel's SDM, unsupported ENCLS leafs result in a #GP, not a #UD.
SGX1 is a special snowflake as the SGX1 flag is used by the CPU as a
"soft" disable, e.g. if software disables machine check reporting, i.e.
having SGX but not SGX1 is effectively "SGX completely unsupported" and
and thus #UDs.

Fixes: 9798adbc04 ("KVM: VMX: Frame in ENCLS handler for SGX virtualization")
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230405234556.696927-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02 10:12:03 -07:00
Sean Christopherson
5e50082c8c KVM: VMX: Inject #GP on ENCLS if vCPU has paging disabled (CR0.PG==0)
Inject a #GP when emulating/forwarding a valid ENCLS leaf if the vCPU has
paging disabled, e.g. if KVM is intercepting ECREATE to enforce additional
restrictions.  The pseudocode in the SDM lists all #GP triggers, including
CR0.PG=0, as being checked after the ENLCS-exiting checks, i.e. the
VM-Exit will occur before the CPU performs the CR0.PG check.

Fixes: 70210c044b ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230405234556.696927-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02 10:10:54 -07:00
Jon Kohler
331f229768 KVM: VMX: restore vmx_vmexit alignment
Commit 8bd200d23e ("KVM: VMX: Flatten __vmx_vcpu_run()") changed
vmx_vmexit from SYM_FUNC_START to SYM_INNER_LABEL, accidentally
removing 16 byte alignment as SYM_FUNC_START uses SYM_A_ALIGN and
SYM_INNER_LABEL does not. Josh mentioned [1] this was unintentional.

Fix by changing to SYM_INNER_LABEL_ALIGN instead.

[1] https://lore.kernel.org/lkml/Y3adkSe%2FJ70PqUyt@p183

Fixes: 8bd200d23e ("KVM: VMX: Flatten __vmx_vcpu_run()")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
CC: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20230531155821.80590-1-jon@nutanix.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 14:20:04 -07:00
Jinrong Liang
33ab767c26 KVM: x86/pmu: Remove redundant check for MSR_IA32_DS_AREA set handler
After commit 2de154f541 ("KVM: x86/pmu: Provide "error" semantics
for unsupported-but-known PMU MSRs"), the guest_cpuid_has(DS) check
is not necessary any more since if the guest supports X86_FEATURE_DS,
it never returns 1. And if the guest does not support this feature,
the set_msr handler will get false from kvm_pmu_is_valid_msr() before
reaching this point. Therefore, the check will not be true in all cases
and can be safely removed, which also simplifies the code and improves
its readability.

Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Link: https://lore.kernel.org/r/20230411130338.8592-1-cloudliang@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 14:04:24 -07:00
Sean Christopherson
dee321977a KVM: x86: Move common handling of PAT MSR writes to kvm_set_msr_common()
Move the common check-and-set handling of PAT MSR writes out of vendor
code and into kvm_set_msr_common().  This aligns writes with reads, which
are already handled in common code, i.e. makes the handling of reads and
writes symmetrical in common code.

Alternatively, the common handling in kvm_get_msr_common() could be moved
to vendor code, but duplicating code is generally undesirable (even though
the duplicatated code is trivial in this case), and guest writes to PAT
should be rare, i.e. the overhead of the extra function call is a
non-issue in practice.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230511233351.635053-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 13:41:06 -07:00
Wenyao Hai
a33ba1bf0d KVM: VMX: Open code writing vCPU's PAT in VMX's MSR handler
Open code setting "vcpu->arch.pat" in vmx_set_msr() instead of bouncing
through kvm_set_msr_common() to get to the same code in kvm_mtrr_set_msr().
This aligns VMX with SVM, avoids hiding a very simple operation behind a
relatively complicated function call (finding the PAT MSR case in
kvm_set_msr_common() is non-trivial), and most importantly, makes it clear
that not unwinding the VMCS updates if kvm_set_msr_common() isn't a bug
(because kvm_set_msr_common() can never fail for PAT).

Opportunistically set vcpu->arch.pat before updating the VMCS info so that
a future patch can move the common bits (back) into kvm_set_msr_common()
without a functional change.

Note, MSR_IA32_CR_PAT is 0x277, and is very subtly handled by

	case 0x200 ... MSR_IA32_MC0_CTL2 - 1:

in kvm_set_msr_common().

Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Wenyao Hai <haiwenyao@uniontech.com>
[sean: massage changelog, hoist setting vcpu->arch.pat up]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230511233351.635053-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 13:41:05 -07:00
Sean Christopherson
023cfa6fc2 KVM: VMX: Use proper accessor to read guest CR4 in handle_desc()
Use kvm_is_cr4_bit_set() to read guest CR4.UMIP when sanity checking that
a descriptor table VM-Exit occurs if and only if guest.CR4.UMIP=1.  UMIP
can't be guest-owned, i.e. using kvm_read_cr4_bits() to decache guest-
owned bits isn't strictly necessary, but eliminating raw reads of
vcpu->arch.cr4 is desirable as it makes it easy to visually audit KVM for
correctness.

Opportunistically add a compile-time assertion that UMIP isn't guest-owned
as letting the guest own UMIP isn't compatible with emulation (or any CR4
bit that is emulated by KVM).

Opportunistically change the WARN_ON() to a ONCE variant.  When the WARN
fires, it fires _a lot_, and spamming the kernel logs ends up doing more
harm than whatever led to KVM's unnecessary emulation.

Reported-by: Robert Hoo <robert.hu@intel.com>
Link: https://lore.kernel.org/all/20230310125718.1442088-4-robert.hu@intel.com
Link: https://lore.kernel.org/r/20230413231914.1482782-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26 13:50:42 -07:00
Sean Christopherson
3243b93c16 KVM: VMX: Treat UMIP as emulated if and only if the host doesn't have UMIP
Advertise UMIP as emulated if and only if the host doesn't natively
support UMIP, otherwise vmx_umip_emulated() is misleading when the host
_does_ support UMIP.  Of the four users of vmx_umip_emulated(), two
already check for native support, and the logic in vmx_set_cpu_caps() is
relevant if and only if UMIP isn't natively supported as UMIP is set in
KVM's caps by kvm_set_cpu_caps() when UMIP is present in hardware.

That leaves KVM's stuffing of X86_CR4_UMIP into the default cr4_fixed1
value enumerated for nested VMX.  In that case, checking for (lack of)
host support is actually a bug fix of sorts, as enumerating UMIP support
based solely on descriptor table exiting works only because KVM doesn't
sanity check MSR_IA32_VMX_CR4_FIXED1.  E.g. if a (very theoretical) host
supported UMIP in hardware but didn't allow UMIP+VMX, KVM would advertise
UMIP but not actually emulate UMIP.  Of course, KVM would explode long
before it could run a nested VM on said theoretical CPU, as KVM doesn't
modify host CR4 when enabling VMX, i.e. would load an "illegal" value into
vmcs.HOST_CR4.

Reported-by: Robert Hoo <robert.hu@intel.com>
Link: https://lore.kernel.org/all/20230310125718.1442088-2-robert.hu@intel.com
Link: https://lore.kernel.org/r/20230413231914.1482782-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26 13:50:42 -07:00
Xiaoyao Li
82dc11b82b KVM: VMX: Move the comment of CR4.MCE handling right above the code
Move the comment about keeping the hosts CR4.MCE loaded in hardware above
the code that actually modifies the hardware CR4 value.

No functional change indented.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230410125017.1305238-3-xiaoyao.li@intel.com
[sean: elaborate in changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26 13:43:16 -07:00
Xiaoyao Li
334006b78c KVM: VMX: Use kvm_read_cr4() to get cr4 value
Directly use vcpu->arch.cr4 is not recommended since it gets stale value
if the cr4 is not available.

Use kvm_read_cr4() instead to ensure correct value.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230410125017.1305238-2-xiaoyao.li@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26 13:41:43 -07:00
Sean Christopherson
ad45413d22 KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for ECREATE
Explicitly check the vCPU's supported XCR0 when determining whether or not
the XFRM for ECREATE is valid.  Checking CPUID works because KVM updates
guest CPUID.0x12.1 to restrict the leaf to a subset of the guest's allowed
XCR0, but that is rather subtle and KVM should not modify guest CPUID
except for modeling true runtime behavior (allowed XFRM is most definitely
not "runtime" behavior).

Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-21 04:05:51 -04:00
Linus Torvalds
c8c655c34e s390:
* More phys_to_virt conversions
 
 * Improvement of AP management for VSIE (nested virtualization)
 
 ARM64:
 
 * Numerous fixes for the pathological lock inversion issue that
   plagued KVM/arm64 since... forever.
 
 * New framework allowing SMCCC-compliant hypercalls to be forwarded
   to userspace, hopefully paving the way for some more features
   being moved to VMMs rather than be implemented in the kernel.
 
 * Large rework of the timer code to allow a VM-wide offset to be
   applied to both virtual and physical counters as well as a
   per-timer, per-vcpu offset that complements the global one.
   This last part allows the NV timer code to be implemented on
   top.
 
 * A small set of fixes to make sure that we don't change anything
   affecting the EL1&0 translation regime just after having having
   taken an exception to EL2 until we have executed a DSB. This
   ensures that speculative walks started in EL1&0 have completed.
 
 * The usual selftest fixes and improvements.
 
 KVM x86 changes for 6.4:
 
 * Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
   and by giving the guest control of CR0.WP when EPT is enabled on VMX
   (VMX-only because SVM doesn't support per-bit controls)
 
 * Add CR0/CR4 helpers to query single bits, and clean up related code
   where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
   as a bool
 
 * Move AMD_PSFD to cpufeatures.h and purge KVM's definition
 
 * Avoid unnecessary writes+flushes when the guest is only adding new PTEs
 
 * Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations
   when emulating invalidations
 
 * Clean up the range-based flushing APIs
 
 * Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
   A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
   changed SPTE" overhead associated with writing the entire entry
 
 * Track the number of "tail" entries in a pte_list_desc to avoid having
   to walk (potentially) all descriptors during insertion and deletion,
   which gets quite expensive if the guest is spamming fork()
 
 * Disallow virtualizing legacy LBRs if architectural LBRs are available,
   the two are mutually exclusive in hardware
 
 * Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
   after KVM_RUN, similar to CPUID features
 
 * Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
 
 * Apply PMU filters to emulated events and add test coverage to the
   pmu_event_filter selftest
 
 x86 AMD:
 
 * Add support for virtual NMIs
 
 * Fixes for edge cases related to virtual interrupts
 
 x86 Intel:
 
 * Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
   not being reported due to userspace not opting in via prctl()
 
 * Fix a bug in emulation of ENCLS in compatibility mode
 
 * Allow emulation of NOP and PAUSE for L2
 
 * AMX selftests improvements
 
 * Misc cleanups
 
 MIPS:
 
 * Constify MIPS's internal callbacks (a leftover from the hardware enabling
   rework that landed in 6.3)
 
 Generic:
 
 * Drop unnecessary casts from "void *" throughout kvm_main.c
 
 * Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
   size by 8 bytes on 64-bit kernels by utilizing a padding hole
 
 Documentation:
 
 * Fix goof introduced by the conversion to rST
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32
 Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI
 WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC
 huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY
 XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb
 j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw==
 =S2Hq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "s390:

   - More phys_to_virt conversions

   - Improvement of AP management for VSIE (nested virtualization)

  ARM64:

   - Numerous fixes for the pathological lock inversion issue that
     plagued KVM/arm64 since... forever.

   - New framework allowing SMCCC-compliant hypercalls to be forwarded
     to userspace, hopefully paving the way for some more features being
     moved to VMMs rather than be implemented in the kernel.

   - Large rework of the timer code to allow a VM-wide offset to be
     applied to both virtual and physical counters as well as a
     per-timer, per-vcpu offset that complements the global one. This
     last part allows the NV timer code to be implemented on top.

   - A small set of fixes to make sure that we don't change anything
     affecting the EL1&0 translation regime just after having having
     taken an exception to EL2 until we have executed a DSB. This
     ensures that speculative walks started in EL1&0 have completed.

   - The usual selftest fixes and improvements.

  x86:

   - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
     enabled, and by giving the guest control of CR0.WP when EPT is
     enabled on VMX (VMX-only because SVM doesn't support per-bit
     controls)

   - Add CR0/CR4 helpers to query single bits, and clean up related code
     where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
     return as a bool

   - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

   - Avoid unnecessary writes+flushes when the guest is only adding new
     PTEs

   - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
     optimizations when emulating invalidations

   - Clean up the range-based flushing APIs

   - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
     single A/D bit using a LOCK AND instead of XCHG, and skip all of
     the "handle changed SPTE" overhead associated with writing the
     entire entry

   - Track the number of "tail" entries in a pte_list_desc to avoid
     having to walk (potentially) all descriptors during insertion and
     deletion, which gets quite expensive if the guest is spamming
     fork()

   - Disallow virtualizing legacy LBRs if architectural LBRs are
     available, the two are mutually exclusive in hardware

   - Disallow writes to immutable feature MSRs (notably
     PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features

   - Overhaul the vmx_pmu_caps selftest to better validate
     PERF_CAPABILITIES

   - Apply PMU filters to emulated events and add test coverage to the
     pmu_event_filter selftest

   - AMD SVM:
       - Add support for virtual NMIs
       - Fixes for edge cases related to virtual interrupts

   - Intel AMX:
       - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
         XTILE_DATA is not being reported due to userspace not opting in
         via prctl()
       - Fix a bug in emulation of ENCLS in compatibility mode
       - Allow emulation of NOP and PAUSE for L2
       - AMX selftests improvements
       - Misc cleanups

  MIPS:

   - Constify MIPS's internal callbacks (a leftover from the hardware
     enabling rework that landed in 6.3)

  Generic:

   - Drop unnecessary casts from "void *" throughout kvm_main.c

   - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
     struct size by 8 bytes on 64-bit kernels by utilizing a padding
     hole

  Documentation:

   - Fix goof introduced by the conversion to rST"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
  KVM: s390: pci: fix virtual-physical confusion on module unload/load
  KVM: s390: vsie: clarifications on setting the APCB
  KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: selftests: Test the PMU event "Instructions retired"
  KVM: selftests: Copy full counter values from guest in PMU event filter test
  KVM: selftests: Use error codes to signal errors in PMU event filter test
  KVM: selftests: Print detailed info in PMU event filter asserts
  KVM: selftests: Add helpers for PMC asserts in PMU event filter test
  KVM: selftests: Add a common helper for the PMU event filter guest code
  KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: selftests: Add test to verify KVM's supported XCR0
  ...
2023-05-01 12:06:20 -07:00
Paolo Bonzini
b3c98052d4 KVM VMX changes for 6.4:
- Fix a bug in emulation of ENCLS in compatibility mode
 
  - Allow emulation of NOP and PAUSE for L2
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGuYgSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5kQUP/jV5Q8ZeVCzlf6ZCeAHnWX/Hahsv6i6H
 ooNL8W6p8FI5xlYOWh8J02JpmLUrNWURCPqvr0oYLm4r1UlJ/OGjyuKB8d7SZ7z/
 RaLN7tppMod527J+Qm3ptHQbTKAGHe4dEoiX46cuvTEcCxrsVykYltvfD1rNuSQA
 VcaNJkkcHv/KuItUHLAuntCAiFvbD1gYNLfUAC7e0htGjLRLxg3+ugHEiFcJ3c6y
 z4ged1toYLGD962jWSIgokFbivfUNZT25WlZjBliMa/E8+ckTAzmc67UJYvhNBOM
 HyAHs0hp+XtSgfcCgNkI+WDrFXXgxa+QQcMFvRWacS3Hx6tgJoQ51FRMevmumn0O
 zBPk3+BOquhknqb5NbmwRZoLExffo+86fFlDcgszzvV4Y/vBfp/XTsuJZCnaiMDZ
 wdmJoF4mhRDtgt7yORltpjHqp3yRmLqMNUb3sxXLRA9D+edo9mr8SXujOnukmXoH
 o/ZpEollTPUQ7od/uDIvDyosWvb65IbYwsKGdOanfBacVrxy5OPM38mPF7u9AyzD
 Gn81H/OhwhpTSBAX7kLMGeK/QGkyIBEUM1levdmcAk0nKYQzHzsI7tMYfqwXuQSu
 qKAcF+qtpOReWmb4KaJZ7c0HQIBQHOKQ6exXxnQJuLjnAHS0674NxMDkT5a1EGRL
 Q9OPSTSYBMDC
 =FOnk
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.4:

 - Fix a bug in emulation of ENCLS in compatibility mode

 - Allow emulation of NOP and PAUSE for L2

 - Misc cleanups
2023-04-26 15:56:51 -04:00
Paolo Bonzini
48b1893ae3 KVM x86 PMU changes for 6.4:
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
    the two are mutually exclusive in hardware
 
  - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
    after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
    validate PERF_CAPABILITIES
 
  - Apply PMU filters to emulated events and add test coverage to the
    pmu_event_filter selftest
 
  - Misc cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGtd4SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5Z9kP/i3WZ40hevvQvB/5cEpxxmxYDwCYnnjM
 hiQgK5jT4SrMTmVjLgkNdI2PogQoS4CX+GC7lcA9bvse84hjuPvgOflb2B+p2UQi
 Ytbr9g/tfKNIpnKIk9mcPcSObN9vm2Kgt7n28rtPrHWj89eQzgc66eijqdpKBLxA
 c3crVR8krwYAQK0tmzHq1+H6hB369YbHAHyTTRRI/bNWnqKblnvUbt0NL2aBusa9
 rNMaOdRtinLpy2dmuX/b3japRB8QTnlf7zpPIF4cBEhbYXy5woClZpf1D2fCA6Er
 XFbEoYawMVd9UeJYbW4z5yErLT83eYoGp4U0eFXWp6fvh8nZlgCGvBKE9g4mmqwj
 aSLaTR5eVN2qlw6jXVeg3unCo8Eyl36AwYwve2L6sFmBvZvNV5iz2eQ7rrOe4oE3
 dnTUaLQ8I2SVg04MbYmCq5W+frTL/I7kqNpbccL1Z3R5WO4y5gz63mug6NfLIvhR
 t45TAIaifxBfcXQsBZM3v2KUK/xQrD3AbJmFKh54L2CKqiGaNWsMLX+6NZ7LZWgf
 8rEqsVkkQDgF7z8eXai4TR26nYfSX6g9gDqtOH73L87aJ7PJk5cRoDWQ1sWs1e/l
 4HA/L0Bo/3pnKAa0ZWxJOixmzqY49gNQf3dj8gt3jk3y2ijbAivshiSpPBmIxn0u
 QLeOf/LGvipl
 =m18F
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.4:

 - Disallow virtualizing legacy LBRs if architectural LBRs are available,
   the two are mutually exclusive in hardware

 - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
   after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
   validate PERF_CAPABILITIES

 - Apply PMU filters to emulated events and add test coverage to the
   pmu_event_filter selftest

 - Misc cleanups and fixes
2023-04-26 15:53:36 -04:00
Paolo Bonzini
807b758496 KVM x86 MMU changes for 6.4:
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
    guest is only adding new PTEs
 
  - Overhaul .sync_page() and .invlpg() to share the .sync_page()
    implementation, i.e. utilize .sync_page()'s optimizations when emulating
    invalidations
 
  - Clean up the range-based flushing APIs
 
  - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
    A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
    changed SPTE" overhead associated with writing the entire entry
 
  - Track the number of "tail" entries in a pte_list_desc to avoid having
    to walk (potentially) all descriptors during insertion and deletion,
    which gets quite expensive if the guest is spamming fork()
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGsvASHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5XnoP/0D8rQmrA0xPHK81zYS1E71tsR/itO/T
 CQMSB4PhEqvcRUaWOuhLBRUW+noWzaOkjkMYK2uoPTdtme7v9+Ar7EtfrWYHrBWD
 IxHCAymo3a5dQPUc3Nb77u6HjRAOokPSqSz5jE4qAjlniW09feruro2Phi+BTme4
 JjxTc/7Oh0Fu26+mK7mJHiw3fV1x3YznnnRPrKGrVQes5L6ozNICkUZ6nvuJUVMk
 lTNHNQbG8PqJZnfWG7VIKRn1vdfXwEfnvyucGVEqFfPLkOXqJHyqMVmIOtvsH7C5
 l8j36+lBZwtFh2jk2EsXOTb6sS7l1MSvyHLlbaJaqqffP+77Hf1n0fROur0k9Yse
 jJJejJWxZ/SvjMt/bOA+4ybGafZH0lt20DsDWnat5GSQ1EVT1CInN2p8OY8pdecR
 QOJBqnNUOykC7/Pyad+IxTxwrOSNCYh+5aYG8AdGquZvNUEwjffVJqrmxDvklY8Z
 DTYwGKgNY7NsP/dV0WYYElsAuHiKwiDZL15KftiQebO1fPcZDpTzDo83/8UMfGxh
 yegngcNX9Qi7lWtLkUMy8A99UvejM0QrS/Zt8v1zjlQ8PjreZLLBWsNpe0ufIMRk
 31ZAC2OS4Koi3wZ54tA7Z1Kh11meGhAk5Ti7sNke0rDqB9UMmj6UKw121cSRvW7q
 W6O4U3YeGpKx
 =zb4u
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.4:

 - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
   guest is only adding new PTEs

 - Overhaul .sync_page() and .invlpg() to share the .sync_page()
   implementation, i.e. utilize .sync_page()'s optimizations when emulating
   invalidations

 - Clean up the range-based flushing APIs

 - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
   A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
   changed SPTE" overhead associated with writing the entire entry

 - Track the number of "tail" entries in a pte_list_desc to avoid having
   to walk (potentially) all descriptors during insertion and deletion,
   which gets quite expensive if the guest is spamming fork()

 - Misc cleanups
2023-04-26 15:50:01 -04:00
Paolo Bonzini
a1c288f87d KVM x86 changes for 6.4:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
    and by giving the guest control of CR0.WP when EPT is enabled on VMX
    (VMX-only because SVM doesn't support per-bit controls)
 
  - Add CR0/CR4 helpers to query single bits, and clean up related code
    where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
    as a bool
 
  - Move AMD_PSFD to cpufeatures.h and purge KVM's definition
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGr2sSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5b80P/2ayACpc7iV2DysXkrxOdn1JmMu9BeHd
 3oMb7bydf79LMNAO+NKPqVjo74yZ/Lh8UyufJGgF3HnSCdumx5Iklyx6/2PUHu/I
 8xT1H7VlIGQMcNy0G4hMus34ZcafJl4y+BXgMEqEErLcy3n598UvFGJ+C0/4lnux
 2Gk7dLASHq/mVVKReBM/kD4RhCVy5Venz6zkk9KbwDLHAmfejVK5bSqDYAnO1WtV
 IBWetxlVyMZCnfPV2drhzgNVwiHvYvCaMBW+cUk5cH8Z2r0VZVDERmc1D4/rd04t
 xs9lMk6CdNU7REQfblA0xMgeO/dNAXq5Fs4FfcM8OTBZU32KKafPhgW1uj2Sv+9l
 nbb1XxZ7C0EcBhKVbUD6zRl05vjHwxlRgoi0yWUqERthFKNXHV42JJgaNn4fxDYS
 tOBKBNkM9z6tCGN2aZv6GwhsEyY2y7oLdbZUGK9/FM3mF1VBASms1BTwokJXTxCD
 pkOpAGeN5hxOlC4/wl6iHJTrz9oaJUj5E5kMD1oK6oQJgnnfqH0kVTG/ui/OUtJg
 8N3amYO/d7InFvuE0f9R6TqZVhTN2QefHmNJaEldsmYp1NMI8Ep8JIhQKRA2LZVE
 CGRxyrPj5CESerAItAI6tshEre5W8aScEzhpmd6HgHmahhQJsCEj+3q/J8FPWLG/
 iQ3GnggrklfU
 =qj7D
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM x86 changes for 6.4:

 - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
   and by giving the guest control of CR0.WP when EPT is enabled on VMX
   (VMX-only because SVM doesn't support per-bit controls)

 - Add CR0/CR4 helpers to query single bits, and clean up related code
   where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
   as a bool

 - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

 - Misc cleanups
2023-04-26 15:49:23 -04:00
Sean Christopherson
4984563823 KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted
Extend VMX's nested intercept logic for emulated instructions to handle
"pause" interception, in quotes because KVM's emulator doesn't filter out
NOPs when checking for nested intercepts.  Failure to allow emulation of
NOPs results in KVM injecting a #UD into L2 on any NOP that collides with
the emulator's definition of PAUSE, i.e. on all single-byte NOPs.

For PAUSE itself, honor L1's PAUSE-exiting control, but ignore PLE to
avoid unnecessarily injecting a #UD into L2.  Per the SDM, the first
execution of PAUSE after VM-Entry is treated as the beginning of a new
loop, i.e. will never trigger a PLE VM-Exit, and so L1 can't expect any
given execution of PAUSE to deterministically exit.

  ... the processor considers this execution to be the first execution of
  PAUSE in a loop. (It also does so for the first execution of PAUSE at
  CPL 0 after VM entry.)

All that said, the PLE side of things is currently a moot point, as KVM
doesn't expose PLE to L1.

Note, vmx_check_intercept() is still wildly broken when L1 wants to
intercept an instruction, as KVM injects a #UD instead of synthesizing a
nested VM-Exit.  That issue extends far beyond NOP/PAUSE and needs far
more effort to fix, i.e. is a problem for the future.

Fixes: 07721feee4 ("KVM: nVMX: Don't emulate instructions in guest mode")
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230405002359.418138-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-11 09:35:49 -07:00
Sean Christopherson
8a1300ff95 KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme
Rename the Hyper-V hooks for TLB flushing to match the naming scheme used
by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code,
arch hooks from common code, etc.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-10 15:17:29 -07:00
Like Xu
4fa5843d81 KVM: x86/pmu: Fix a typo in kvm_pmu_request_counter_reprogam()
Fix a "reprogam" => "reprogram" typo in kvm_pmu_request_counter_reprogam().

Fixes: 68fb4757e8 ("KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()")
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230310113349.31799-1-likexu@tencent.com
[sean: trim the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07 09:07:41 -07:00
Like Xu
649bccd7fa KVM: x86/pmu: Rewrite reprogram_counters() to improve performance
A valid pmc is always tested before using pmu->reprogram_pmi. Eliminate
this part of the redundancy by setting the counter's bitmask directly,
and in addition, trigger KVM_REQ_PMU only once to save more cpu cycles.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230214050757.9623-4-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 16:04:31 -07:00
Sean Christopherson
8bca8c5ce4 KVM: VMX: Refactor intel_pmu_{g,}set_msr() to align with other helpers
Invert the flows in intel_pmu_{g,s}et_msr()'s case statements so that
they follow the kernel's preferred style of:

        if (<not valid>)
                return <error>

        <commit change>
        return <success>

which is also the style used by every other {g,s}et_msr() helper (except
AMD's PMU variant, which doesn't use a switch statement).

Modify the "set" paths with costly side effects, i.e. that reprogram
counters, to skip only the side effects, i.e. to perform reserved bits
checks even if the value is unchanged.  None of the reserved bits checks
are expensive, so there's no strong justification for skipping them, and
guarding only the side effect makes it slightly more obvious what is being
skipped and why.

No functional change intended (assuming no reserved bit bugs).

Link: https://lkml.kernel.org/r/Y%2B6cfen%2FCpO3%2FdLO%40google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 16:03:20 -07:00
Sean Christopherson
957d0f70e9 KVM: x86/pmu: Zero out LBR capabilities during PMU refresh
Zero out the LBR capabilities during PMU refresh to avoid exposing LBRs
to the guest against userspace's wishes. If userspace modifies the
guest's CPUID model or invokes KVM_CAP_PMU_CAPABILITY to disable vPMU
after an initial KVM_SET_CPUID2, but before the first KVM_RUN, KVM will
retain the previous LBR info due to bailing before refreshing the LBR
descriptor.

Note, this is a very theoretical bug, there is no known use case where a
VMM would deliberately enable the vPMU via KVM_SET_CPUID2, and then later
disable the vPMU.

Link: https://lore.kernel.org/r/20230311004618.920745-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:58:43 -07:00
Sean Christopherson
5757f5b956 KVM: x86: Add macros to track first...last VMX feature MSRs
Add macros to track the range of VMX feature MSRs that are emulated by
KVM to reduce the maintenance cost of extending the set of emulated MSRs.

Note, KVM doesn't necessarily emulate all known/consumed VMX MSRs, e.g.
PROCBASED_CTLS3 is consumed by KVM to enable IPI virtualization, but is
not emulated as KVM doesn't emulate/virtualize IPI virtualization for
nested guests.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-06 14:57:22 -07:00
Sean Christopherson
da3db168fb KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD
Virtualize FLUSH_L1D so that the guest can use the performant L1D flush
if one of the many mitigations might require a flush in the guest, e.g.
Linux provides an option to flush the L1D when switching mms.

Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware
and exposed to the guest, i.e. always let the guest write it directly if
FLUSH_L1D is fully supported.

Forward writes to hardware in host context on the off chance that KVM
ends up emulating a WRMSR, or in the really unlikely scenario where
userspace wants to force a flush.  Restrict these forwarded WRMSRs to
the known command out of an abundance of caution.  Passing through the
MSR means the guest can throw any and all values at hardware, but doing
so in host context is arguably a bit more dangerous.

Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com
Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:37 -04:00
Sean Christopherson
903358c7ed KVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code
Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the
logic to kvm_set_msr_common().  Now that the MSR interception toggling is
handled as part of setting guest CPUID, the VMX and SVM paths are
identical.

Opportunistically massage the code to make it a wee bit denser.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
9a4c485013 KVM: VMX: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
52887af565 KVM: x86: Revert MSR_IA32_FLUSH_CMD.FLUSH_L1D enabling
Revert the recently added virtualizing of MSR_IA32_FLUSH_CMD, as both
the VMX and SVM are fatally buggy to guests that use MSR_IA32_FLUSH_CMD or
MSR_IA32_PRED_CMD, and because the entire foundation of the logic is
flawed.

The most immediate problem is an inverted check on @cmd that results in
rejecting legal values.  SVM doubles down on bugs and drops the error,
i.e. silently breaks all guest mitigations based on the command MSRs.

The next issue is that neither VMX nor SVM was updated to mark
MSR_IA32_FLUSH_CMD as being a possible passthrough MSR,
which isn't hugely problematic, but does break MSR filtering and triggers
a WARN on VMX designed to catch this exact bug.

The foundational issues stem from the MSR_IA32_FLUSH_CMD code reusing
logic from MSR_IA32_PRED_CMD, which in turn was likely copied from KVM's
support for MSR_IA32_SPEC_CTRL.  The copy+paste from MSR_IA32_SPEC_CTRL
was misguided as MSR_IA32_PRED_CMD (and MSR_IA32_FLUSH_CMD) is a
write-only MSR, i.e. doesn't need the same "deferred passthrough"
shenanigans as MSR_IA32_SPEC_CTRL.

Revert all MSR_IA32_FLUSH_CMD enabling in one fell swoop so that there is
no point where KVM advertises, but does not support, L1D_FLUSH.

This reverts commits 45cf86f261,
723d5fb0ff, and
a807b78ad0.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lkml.kernel.org/r/20230317190432.GA863767%40dev-arch.thelio-3990X
Cc: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Message-Id: <20230322011440.2195485-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:35 -04:00
Sean Christopherson
098f4c061e KVM: x86/pmu: Disallow legacy LBRs if architectural LBRs are available
Disallow enabling LBR support if the CPU supports architectural LBRs.
Traditional LBR support is absent on CPU models that have architectural
LBRs, and KVM doesn't yet support arch LBRs, i.e. KVM will pass through
non-existent MSRs if userspace enables LBRs for the guest.

Cc: stable@vger.kernel.org
Cc: Yang Weijiang <weijiang.yang@intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: be635e34c2 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Tested-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230128001427.2548858-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 17:00:17 -07:00
Binbin Wu
548bd27428 KVM: VMX: Use is_64_bit_mode() to check 64-bit mode in SGX handler
sgx_get_encls_gva() uses is_long_mode() to check 64-bit mode, however,
SGX system leaf instructions are valid in compatibility mode, should
use is_64_bit_mode() instead.

Fixes: 70210c044b ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230404032502.27798-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 11:52:19 -07:00
Sean Christopherson
80962ec912 KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode
Don't report an error code to L1 when synthesizing a nested VM-Exit and
L2 is in Real Mode.  Per Intel's SDM, regarding the error code valid bit:

  This bit is always 0 if the VM exit occurred while the logical processor
  was in real-address mode (CR0.PE=0).

The bug was introduced by a recent fix for AMD's Paged Real Mode, which
moved the error code suppression from the common "queue exception" path
to the "inject exception" path, but missed VMX's "synthesize VM-Exit"
path.

Fixes: b97f074583 ("KVM: x86: determine if an exception has an error code only when injecting it.")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322143300.2209476-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27 10:15:11 -04:00
Binbin Wu
607475cfa0 KVM: x86: Add helpers to query individual CR0/CR4 bits
Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora
of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(),
and to make each caller's intent more obvious.

Defer converting helpers that do truly ugly casts from "unsigned long" to
"int", e.g. is_pse(), to a future commit so that their conversion is more
isolated.

Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3();
the local variable is used only once, immediately after its declaration.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com
[sean: move "obvious" conversions to this commit, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 10:10:53 -07:00
Mathias Krause
fb509f76ac KVM: VMX: Make CR0.WP a guest owned bit
Guests like grsecurity that make heavy use of CR0.WP to implement kernel
level W^X will suffer from the implied VMEXITs.

With EPT there is no need to intercept a guest change of CR0.WP, so
simply make it a guest owned bit if we can do so.

This implies that a read of a guest's CR0.WP bit might need a VMREAD.
However, the only potentially affected user seems to be kvm_init_mmu()
which is a heavy operation to begin with. But also most callers already
cache the full value of CR0 anyway, so no additional VMREAD is needed.
The only exception is nested_vmx_load_cr3().

This change is VMX-specific, as SVM has no such fine grained control
register intercept control.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-7-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 07:47:26 -07:00
Mathias Krause
74cdc83691 KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
want to know the state of certain bits instead of the whole register.

This not only makes the intent cleaner, it also avoids a potential
VMREAD in case the tested bits aren't guest owned.

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-5-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 07:47:25 -07:00
Yu Zhang
f6cde92083 KVM: nVMX: Add helpers to setup VMX control msr configs
nested_vmx_setup_ctls_msrs() is used to set up the various VMX MSR
controls for nested VMX. But it is a bit lengthy, just add helpers
to setup the configuration of VMX MSRs.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230119141946.585610-2-yu.c.zhang@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-21 10:20:28 -07:00
Yu Zhang
ad36aab37a KVM: nVMX: Remove outdated comments in nested_vmx_setup_ctls_msrs()
nested_vmx_setup_ctls_msrs() initializes the vmcs_conf.nested,
which stores the global VMX MSR configurations when nested is
supported, regardless of any particular CPUID settings for one
VM.

Commit 6defc59184 ("KVM: nVMX: include conditional controls
in /dev/kvm KVM_GET_MSRS") added the some feature flags for
secondary proc-based controls, so that those features can be
available in KVM_GET_MSRS. Yet this commit did not remove the
obsolete comments in nested_vmx_setup_ctls_msrs().

Just fix the comments, and no functional change intended.

Fixes: 6defc59184 ("KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS")
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230119141946.585610-1-yu.c.zhang@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-21 10:11:57 -07:00
Lai Jiangshan
2c86c444e2 KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in nested_ept_invalidate_addr()
Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg().

No functional change intended.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-1-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:53 -07:00
Emanuele Giuseppe Esposito
a807b78ad0 kvm: vmx: Add IA32_FLUSH_CMD guest support
Expose IA32_FLUSH_CMD to the guest if the guest CPUID enumerates
support for this MSR. As with IA32_PRED_CMD, permission for
unintercepted writes to this MSR will be granted to the guest after
the first non-zero write.

Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230201132905.549148-2-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:17:55 -04:00
Sean Christopherson
fbc722aac1 KVM: VMX: Rename "KVM is using eVMCS" static key to match its wrapper
Rename enable_evmcs to __kvm_is_using_evmcs to match its wrapper, and to
avoid confusion with enabling eVMCS for nested virtualization, i.e. have
"enable eVMCS" be reserved for "enable eVMCS support for L1".

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230211003534.564198-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:58 -04:00
Sean Christopherson
19f10315fd KVM: VMX: Stub out enable_evmcs static key for CONFIG_HYPERV=n
Wrap enable_evmcs in a helper and stub it out when CONFIG_HYPERV=n in
order to eliminate the static branch nop placeholders.  clang-14 is clever
enough to elide the nop, but gcc-12 is not.  Stubbing out the key reduces
the size of kvm-intel.ko by ~7.5% (200KiB) when compiled with gcc-12
(there are a _lot_ of VMCS accesses throughout KVM).

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230211003534.564198-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:57 -04:00
Sean Christopherson
68ac422149 KVM: nVMX: Move EVMCS1_SUPPORT_* macros to hyperv.c
Move the macros that define the set of VMCS controls that are supported
by eVMCS1 from hyperv.h to hyperv.c, i.e. make them "private".   The
macros should never be consumed directly by KVM at-large since the "final"
set of supported controls depends on guest CPUID.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230211003534.564198-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:57 -04:00
Rong Tao
53293cb81b KVM: VMX: Use tabs instead of spaces for indentation
Code indentation should use tabs where possible and miss a '*'.

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Message-Id: <tencent_A492CB3F9592578451154442830EA1B02C07@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 09:40:55 -04:00
Rong Tao
06e1854728 KVM: VMX: Fix indentation coding style issue
Code indentation should use tabs where possible.

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Message-Id: <tencent_31E6ACADCB6915E157CF5113C41803212107@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 09:40:55 -04:00
Paolo Bonzini
77900bffed KVM: nVMX: remove unnecessary #ifdef
nested_vmx_check_controls() has already run by the time KVM checks host state,
so the "host address space size" exit control can only be set on x86-64 hosts.
Simplify the condition at the cost of adding some dead code to 32-bit kernels.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 09:40:54 -04:00
Paolo Bonzini
112e66017b KVM: nVMX: add missing consistency checks for CR0 and CR4
The effective values of the guest CR0 and CR4 registers may differ from
those included in the VMCS12.  In particular, disabling EPT forces
CR4.PAE=1 and disabling unrestricted guest mode forces CR0.PG=CR0.PE=1.

Therefore, checks on these bits cannot be delegated to the processor
and must be performed by KVM.

Reported-by: Reima ISHII <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 09:40:54 -04:00
Paolo Bonzini
ddad47bfb9 KVM x86 APIC changes for 6.3:
- Remove a superfluous variables from apic_get_tmcct()
 
  - Fix various edge cases in x2APIC MSR emulation
 
  - Mark APIC timer as expired if its in one-shot mode and the count
    underflows while the vCPU task was being migrated
 
  - Reset xAPIC when userspace forces "impossible" x2APIC => xAPIC transition
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsB58SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5CK0P/1hhxUWokhNJX0skgf8uKhxTf8bLAq5F
 xr221M4Ac9YwjJaS0p4PJVSLVJxcVXHsyvanCOQh6AE8q1Ugz+iDLr2gAI+fHbJY
 lnczpAj1UhhttaLSOl13/31TaJdE2Ep0/q3+5vf1qQrOJYkElKpiDYbf3M8T5G72
 pguUFhKKKeZcCB99Jpr0u0HupiwCZoYWvdx7mvzRhi11bWaUyYIWc9CBETmAb4kN
 1UAmov16UrVOFAg/ssde6qPgUsAgB8XwJjta6oIQLeEm70L5ci6g/2Tw0IEwMybR
 yLCCST9eATl2U/hPV4KwBzSN1gHCAx4JDp4TKBR8ic+c+Z8CceIZln05fz6rQ8Sz
 ljyaRVFhaQZyZpjrZJ0h3kqMG1JT/Q4Hj9dq8RZJ0K73KVuCspxaJDHqp6a2p9D0
 dDacDkD3LFIPBdem3hHcpmV2XduaMfQwspObJORarkkQTZZS6erxmPvK/6Quvmbk
 UdD+6hvuSQA8rxNKXF+fOBsnK/1xYvzkVis0sxMwthkSDvENdcPbmlD6kHLz52cg
 Jt+yw/85oIg7zBgEkG2c8+5bB2hw0SRPQBlW4j29jYUhRwXwHxuovllFS2GU7iIc
 fVNtocw5Q9WATp752va4bVjv9XeYBmExn99fd3xvFenTa/ya4+5gNFK8vc9zL++J
 x3fDhAPXmQHJ
 =ieB+
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-apic-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM x86 APIC changes for 6.3:

 - Remove a superfluous variables from apic_get_tmcct()

 - Fix various edge cases in x2APIC MSR emulation

 - Mark APIC timer as expired if its in one-shot mode and the count
   underflows while the vCPU task was being migrated

 - Reset xAPIC when userspace forces "impossible" x2APIC => xAPIC transition
2023-02-21 20:00:44 -05:00
Paolo Bonzini
33436335e9 KVM/riscv changes for 6.3
- Fix wrong usage of PGDIR_SIZE to check page sizes
 - Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect()
 - Redirect illegal instruction traps to guest
 - SBI PMU support for guest
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmPifFIACgkQrUjsVaLH
 LAcEyxAAinMBaBhiPmwWZQvcCzh/UFmJo8BQCwAPuwoc/a4ZGAR7ylzd0oJilP8M
 wSgX6Ad8XF+CEW2VpxW9nwyi41N25ep1Lrf8vOaWy9L9QNUo0t15WrCIbXT2p399
 HrK9fz7HHKKIMsJy+rYb9EepdmMf55xtr1Y/EjyvhoDQbrEMlKsAODYz/SUoriQG
 Tn3cCYBzLdvzDzu0xXM9v+nsetWXdajK/v4je+mE3NQceXhePAO4oVWP4IpnoROd
 ZQm3evvVdf0WtKG9curxwMB7jjBqDBFrcLYl0qHGa7pi2o5PzVM7esgaV47KwetH
 IgA/Mrf1IfzpgM7VYDDax5wUHlKj63KisqU0J8rU3PUloQXaWqv7+ho51t9GzZ/i
 9x4uyO/evVntgyTw6HCbqmQJDgEtJiG1ydrR/ydBMYHLnh7LPY2UpKgcqmirtbkK
 1/DYDp84vikQ5VW1hc8IACdoBShh9Moh4xsEStzkTrIeHcZCjtORXUh8UIPZ0Mu2
 7Mnkktu9I55SLwA3rwH/EYT1ISrOV1G+q3wfqgeLpn8YUWwCIiqWQ5Ur0/WSMJse
 uJ3HedZDzj9T4n4khX+mKEYh6joAafQZag+4TID2lRSwd0S/mpeC22hYrViMdDmq
 yhE+JNin/sz4AVaHNzGwfqk2NC2RFl9aRn2X0xTwyBubif9pKMQ=
 =spUL
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.3-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.3

- Fix wrong usage of PGDIR_SIZE to check page sizes
- Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect()
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
2023-02-15 12:33:28 -05:00
Paolo Bonzini
27b025ebb0 KVM VMX changes for 6.3:
- Handle NMI VM-Exits before leaving the noinstr region
 
  - A few trivial cleanups in the VM-Enter flows
 
  - Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
    EPTP switching (or any other VM function) for L1
 
  - Fix a crash when using eVMCS's enlighted MSR bitmaps
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsL4USHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5fvQP/jOhCos/8VrNPJ5vtVkELSgbefbLFPbR
 ThCefVekbnbqkJ4PdxLFOw/nzK+NREnoet02lvLXIeBt243qMeDWbNuGncZ+rqqX
 ohJ6ALIS7ag2egG2ZllM4KTV1jWnm7bG6C8QsJwrFuG7wHe69lPYNnv34OD4l8QO
 PR7reXBl7iyqQehF0rCqkYvtpv1QI+VlVd3ODCg/0vaZX4AOPZI6X1hX2+myUfTy
 LI1oQ88SQ0ptoH5o7xBtNQw7Zknm/SP82Djzpg/1Wpr3cZv4g/1vblNcxpU/D+3I
 Cp8xM42c82GIX1Zk6uYfsoCyQZ7Bd38llbgcLM5Fi0F2IuKJKoN1iuH77RJ1Juqf
 v+SkRmFAr5Uf3UYVp4ck4FxTRH58ooDPeIv2jdaguyE9NgRfbVsghn6QDC+gkZMx
 z4xdwCBXgsAAkhJwOLqDy98ZrQZoy1kxlmG2jLIe5RvOzI4WQDUJkJAd+u0dBD/W
 U7PRqbqfJFtvvKN784BTa4CO8m+KwaNz4Xrwcg4YtBnKBgDU/cxFUZnnSEHdi9Pr
 tVf462dfReK3slK3rkPw5HlacFdQsgfX6+VX1eeFKY+gE88TXoZks/X7ma24MXEw
 hOu6VZMat1NjkfcgUcMffBVZlejK7C6oXTeVmqVKsx6hUs+zzqxpQ05AT4TMbAUI
 gPTHJyy9ByHO
 =WXGG
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.3:

 - Handle NMI VM-Exits before leaving the noinstr region

 - A few trivial cleanups in the VM-Enter flows

 - Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
   EPTP switching (or any other VM function) for L1

 - Fix a crash when using eVMCS's enlighted MSR bitmaps
2023-02-15 12:23:19 -05:00
Paolo Bonzini
157ed9cb04 KVM x86 PMU changes for 6.3:
- Add support for created masked events for the PMU filter to allow
    userspace to heavily restrict what events the guest can use without
    needing to create an absurd number of events
 
  - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
    support is disabled
 
  - Add PEBS support for Intel SPR
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsFZ4SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5eKEP/0qeZsOQot53wkf+wNiGh1X6qDacBPFP
 A8GAPC70fEisxAt776DeKEBwikHpARPglCt1Il9dFvkG+0jgYpvPu8UGF1LpouKX
 cD/7itr2k8GZlXZBg2Rgu3TRyFBJEGHT6tAu7PBhZyL6yWQDUxao8FPFrRGfmJ7O
 Z6eFMo1cERNHICQm+W/2TBd1xguiF+m4CXKlA70R4wzM37aPF9o5HvmIwAvPzyhU
 w4WzcIQbjVPs1VpBTzwPqRmyZ8omSlDYo7VqmsDiRtJbucqgbhFI2wR+nyImFCa9
 D2pI5TV3CFTt0fvd8SZpH19nR3S6cMLCXONOsijmvR2BmS3PhJMP4dMm5m4R06nF
 RBtnTj9fkbeL1ghFEkMxHBZVTG3bBlO4ySOxIqNHCvPjqQ37mJ+xP4C8kcIC9p5F
 +xL3AvZ7zenPv3A29SY9YH+QvZLBwyDJzAsveLeYkLFoJxoDT4glOY/Wpi1rkZ17
 /zHDZWoF49l1Eu3Bql0hFetkCreUNFGpa4moUmEC0evYOvV2WCb+39TDXZ8CPCGD
 +cDiRnD8MFQpBw47F03EnFheFHxiJoL0Clv5vvM3C+xOq2J9WVG9mqQWCk+4ta2B
 Um4D++0a9lwvJhOImaR7uyiV3K7oVm+rU8+46x+nTNGaIP2bnE+vronY+b6KGeUx
 7+xzTKlYygGe
 =ev5v
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.3' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.3:

 - Add support for created masked events for the PMU filter to allow
   userspace to heavily restrict what events the guest can use without
   needing to create an absurd number of events

 - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
   support is disabled

 - Add PEBS support for Intel SPR
2023-02-15 08:23:24 -05:00
Alexandru Matei
93827a0a36 KVM: VMX: Fix crash due to uninitialized current_vmcs
KVM enables 'Enlightened VMCS' and 'Enlightened MSR Bitmap' when running as
a nested hypervisor on top of Hyper-V. When MSR bitmap is updated,
evmcs_touch_msr_bitmap function uses current_vmcs per-cpu variable to mark
that the msr bitmap was changed.

vmx_vcpu_create() modifies the msr bitmap via vmx_disable_intercept_for_msr
-> vmx_msr_bitmap_l01_changed which in the end calls this function. The
function checks for current_vmcs if it is null but the check is
insufficient because current_vmcs is not initialized. Because of this, the
code might incorrectly write to the structure pointed by current_vmcs value
left by another task. Preemption is not disabled, the current task can be
preempted and moved to another CPU while current_vmcs is accessed multiple
times from evmcs_touch_msr_bitmap() which leads to crash.

The manipulation of MSR bitmaps by callers happens only for vmcs01 so the
solution is to use vmx->vmcs01.vmcs instead of current_vmcs.

  BUG: kernel NULL pointer dereference, address: 0000000000000338
  PGD 4e1775067 P4D 0
  Oops: 0002 [#1] PREEMPT SMP NOPTI
  ...
  RIP: 0010:vmx_msr_bitmap_l01_changed+0x39/0x50 [kvm_intel]
  ...
  Call Trace:
   vmx_disable_intercept_for_msr+0x36/0x260 [kvm_intel]
   vmx_vcpu_create+0xe6/0x540 [kvm_intel]
   kvm_arch_vcpu_create+0x1d1/0x2e0 [kvm]
   kvm_vm_ioctl_create_vcpu+0x178/0x430 [kvm]
   kvm_vm_ioctl+0x53f/0x790 [kvm]
   __x64_sys_ioctl+0x8a/0xc0
   do_syscall_64+0x5c/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: ceef7d10df ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Link: https://lore.kernel.org/r/20230123221208.4964-1-alexandru.matei@uipath.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-07 09:02:50 -08:00
Yu Zhang
496c917b09 KVM: nVMX: Simplify the setting of SECONDARY_EXEC_ENABLE_VMFUNC for nested.
Values of base settings for nested proc-based VM-Execution control MSR come
from the ones for non-nested. And for SECONDARY_EXEC_ENABLE_VMFUNC flag,
KVM currently a) first mask off it from vmcs_conf->cpu_based_2nd_exec_ctrl;
b) then check it against the same source; c) and reset it again if host has
it.

So just simplify this, by not masking off SECONDARY_EXEC_ENABLE_VMFUNC in
the first place.

No functional change.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20221109075413.1405803-3-yu.c.zhang@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-07 01:51:19 +00:00
Yu Zhang
41acdd4197 KVM: VMX: Do not trap VMFUNC instructions for L1 guests.
Explicitly disable VMFUNC in vmcs01 to document that KVM doesn't support
any VM-Functions for L1.  WARN in the dedicated VMFUNC handler if an exit
occurs while L1 is active, but keep the existing handlers as fallbacks to
avoid killing the VM as an unexpected VMFUNC VM-Exit isn't fatal

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20221109075413.1405803-2-yu.c.zhang@linux.intel.com
[sean: don't kill the VM on an unexpected VMFUNC from L1, reword changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-07 01:51:17 +00:00
Sean Christopherson
e76ae52747 KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs
Add helpers to print unimplemented MSR accesses and condition all such
prints on report_ignored_msrs, i.e. honor userspace's request to not
print unimplemented MSRs.  Even though vcpu_unimpl() is ratelimited,
printing can still be problematic, e.g. if a print gets stalled when host
userspace is writing MSRs during live migration, an effective stall can
result in very noticeable disruption in the guest.

E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU
counters while the PMU was disabled in KVM.

  -   99.75%     0.00%  [.] __ioctl
   - __ioctl
      - 99.74% entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_ioctl
         - do_vfs_ioctl
            - 92.48% kvm_vcpu_ioctl
               - kvm_arch_vcpu_ioctl
                  - 85.12% kvm_set_msr_ignored_check
                       svm_set_msr
                       kvm_set_msr_common
                       printk
                       vprintk_func
                       vprintk_default
                       vprintk_emit
                       console_unlock
                       call_console_drivers
                       univ8250_console_write
                       serial8250_console_write
                       uart_console_write

Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Sean Christopherson
8911ce6669 KVM: x86/pmu: Cap kvm_pmu_cap.num_counters_gp at KVM's internal max
Limit kvm_pmu_cap.num_counters_gp during kvm_init_pmu_capability() based
on the vendor PMU capabilities so that consuming num_counters_gp naturally
does the right thing.  This fixes a mostly theoretical bug where KVM could
over-report its PMU support in KVM_GET_SUPPORTED_CPUID for leaf 0xA, e.g.
if the number of counters reported by perf is greater than KVM's
hardcoded internal limit.  Incorporating input from the AMD PMU also
avoids over-reporting MSRs to save when running on AMD.

Link: https://lore.kernel.org/r/20230124234905.3774678-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Like Xu
2a3003e950 KVM: x86/pmu: Drop event_type and rename "struct kvm_event_hw_type_mapping"
After commit ("02791a5c362b KVM: x86/pmu: Use PERF_TYPE_RAW
to merge reprogram_{gp,fixed}counter()"), vPMU starts to directly
use the hardware event eventsel and unit_mask to reprogram perf_event,
and the event_type field in the "struct kvm_event_hw_type_mapping"
is simply no longer being used.

Convert the struct into an anonymous struct as the current name is
obsolete as the structure no longer has any mapping semantics, and
placing the struct definition directly above its sole user makes its
easier to understand what the array is filling in.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20221205122048.16023-1-likexu@tencent.com
[sean: drop new comment, use anonymous struct]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 14:13:44 -08:00
Sean Christopherson
11df586d77 KVM: VMX: Handle NMI VM-Exits in noinstr region
Move VMX's handling of NMI VM-Exits into vmx_vcpu_enter_exit() so that
the NMI is handled prior to leaving the safety of noinstr.  Handling the
NMI after leaving noinstr exposes the kernel to potential ordering
problems as an instrumentation-induced fault, e.g. #DB, #BP, #PF, etc.
will unblock NMIs when IRETing back to the faulting instruction.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:41 -08:00
Sean Christopherson
4f76e86f7e KVM: VMX: Provide separate subroutines for invoking NMI vs. IRQ handlers
Split the asm subroutines for handling NMIs versus IRQs that occur in the
guest so that the NMI handler can be called from a noinstr section.  As a
bonus, the NMI path doesn't need an indirect branch.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:41 -08:00
Sean Christopherson
54a3b70a75 x86/entry: KVM: Use dedicated VMX NMI entry for 32-bit kernels too
Use a dedicated entry for invoking the NMI handler from KVM VMX's VM-Exit
path for 32-bit even though using a dedicated entry for 32-bit isn't
strictly necessary.  Exposing a single symbol will allow KVM to reference
the entry point in assembly code without having to resort to more #ifdefs
(or #defines).  identry.h is intended to be included from asm files only
once, and so simply including idtentry.h in KVM assembly isn't an option.

Bypassing the ESP fixup and CR3 switching in the standard NMI entry code
is safe as KVM always handles NMIs that occur in the guest on a kernel
stack, with a kernel CR3.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:40 -08:00
Sean Christopherson
432727f1cb KVM: VMX: Always inline to_vmx() and to_kvm_vmx()
Tag to_vmx() and to_kvm_vmx() __always_inline as they both just reflect
the passed in pointer (the embedded struct is the first field in the
container), and drop the @vmx param from vmx_vcpu_enter_exit(), which
likely existed purely to make noinstr validation happy.

Amusingly, when the compiler decides to not inline the helpers, e.g. for
KASAN builds, to_vmx() and to_kvm_vmx() may end up pointing at the same
symbol, which generates very confusing objtool warnings.  E.g. the use of
to_vmx() in a future patch led to objtool complaining about to_kvm_vmx(),
and only once all use of to_kvm_vmx() was commented out did to_vmx() pop
up in the obj tool report.

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x160: call to to_kvm_vmx()
                               leaves .noinstr.text section

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:40 -08:00
Sean Christopherson
11633f6950 KVM: VMX: Always inline eVMCS read/write helpers
Tag all evmcs_{read,write}() helpers __always_inline so that they can be
freely used in noinstr sections, e.g. to get the VM-Exit reason in
vcpu_vmx_enter_exit() (in a future patch).  For consistency and to avoid
more spot fixes in the future, e.g. see commit 010050a863 ("x86/kvm:
Always inline evmcs_write64()"), tag all accessors even though
evmcs_read32() is the only anticipated use case in the near future.  In
practice, non-KASAN builds are all but guaranteed to inline the helpers
anyways.

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x107: call to evmcs_read32()
                               leaves .noinstr.text section

Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:27 -08:00
Sean Christopherson
8578f59657 KVM: VMX: Allow VM-Fail path of VMREAD helper to be instrumented
Allow instrumentation in the VM-Fail path of __vmcs_readl() so that the
helper can be used in noinstr functions, e.g. to get the exit reason in
vmx_vcpu_enter_exit() in order to handle NMI VM-Exits in the noinstr
section.  While allowing instrumentation isn't technically safe, KVM has
much bigger problems if VMREAD fails in a noinstr section.

Note, all other VMX instructions also allow instrumentation in their
VM-Fail paths for similar reasons, VMREAD was simply omitted by commit
3ebccdf373 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
because VMREAD wasn't used in a noinstr section at the time.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:26 -08:00
Sean Christopherson
fc9465be8a KVM: x86: Make vmx_get_exit_qual() and vmx_get_intr_info() noinstr-friendly
Add an extra special noinstr-friendly helper to test+mark a "register"
available and use it when caching vmcs.EXIT_QUALIFICATION and
vmcs.VM_EXIT_INTR_INFO.  Make the caching helpers __always_inline too so
that they can be used in noinstr functions.

A future fix will move VMX's handling of NMI exits into the noinstr
vmx_vcpu_enter_exit() so that the NMI is processed before any kind of
instrumentation can trigger a fault and thus IRET, i.e. so that KVM
doesn't invoke the NMI handler with NMIs enabled.

Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:35:38 -08:00
Alexey Dobriyan
e8733482f5 KVM: VMX: don't use "unsigned long" in vmx_vcpu_enter_exit()
__vmx_vcpu_run_flags() returns "unsigned int" and uses only 2 bits of it
so using "unsigned long" is very much pointless.  Furthermore,
__vmx_vcpu_run() and vmx_spec_ctrl_restore_host() take an "unsigned int",
i.e. actually relying on an "unsigned long" value won't work.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Y3e7UW0WNV2AZmsZ@p183
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:07:01 -08:00
Sean Christopherson
5a7a64779e KVM: VMX: Access @flags as a 32-bit value in __vmx_vcpu_run()
Access @flags using 32-bit operands when saving and testing @flags for
VMX_RUN_VMRESUME, as using 8-bit operands is unnecessarily fragile due
to relying on VMX_RUN_VMRESUME being in bits 0-7.  The behavior of
treating @flags a single byte is a holdover from when the param was
"bool launched", i.e. is not deliberate.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20221119003747.2615229-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:07:00 -08:00
Aaron Lewis
6a5cba7bed KVM: x86/pmu: Correct the mask used in a pmu event filter lookup
When checking if a pmu event the guest is attempting to program should
be filtered, only consider the event select + unit mask in that
decision. Use an architecture specific mask to mask out all other bits,
including bits 35:32 on Intel.  Those bits are not part of the event
select and should not be considered in that decision.

Fixes: 66bb8a065f ("KVM: x86: PMU Event Filter")
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-2-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:10 -08:00
David Matlack
ee661d8ea9 KVM: x86: Replace cpu_dirty_logging_count with nr_memslots_dirty_logging
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging.
Both fields count the number of memslots that have dirty-logging enabled,
with the only difference being that cpu_dirty_logging_count is only
incremented when using PML. So while nr_memslots_dirty_logging is not a
direct replacement for cpu_dirty_logging_count, it can be combined with
enable_pml to get the same information.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:19 -08:00
Sean Christopherson
02efd818a6 KVM: VMX: Intercept reads to invalid and write-only x2APIC registers
Intercept reads to invalid (non-existent) and write-only x2APIC registers
when configuring VMX's MSR bitmaps for x2APIC+APICv.  When APICv is fully
enabled, Intel hardware doesn't validate the registers on RDMSR and
instead blindly retrieves data from the vAPIC page, i.e. it's software's
responsibility to intercept reads to non-existent and write-only MSRs.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230107011025.565472-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:36 -08:00
Sean Christopherson
c39857ce8d KVM: VMX: Always intercept accesses to unsupported "extended" x2APIC regs
Don't clear the "read" bits for x2APIC registers above SELF_IPI (APIC regs
0x400 - 0xff0, MSRs 0x840 - 0x8ff).  KVM doesn't emulate registers in that
space (there are a smattering of AMD-only extensions) and so should
intercept reads in order to inject #GP.  When APICv is fully enabled,
Intel hardware doesn't validate the registers on RDMSR and instead blindly
retrieves data from the vAPIC page, i.e. it's software's responsibility to
intercept reads to non-existent MSRs.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20230107011025.565472-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:36 -08:00
Paolo Bonzini
f15a87c006 Merge branch 'kvm-lapic-fix-and-cleanup' into HEAD
The first half or so patches fix semi-urgent, real-world relevant APICv
and AVIC bugs.

The second half fixes a variety of AVIC and optimized APIC map bugs
where KVM doesn't play nice with various edge cases that are
architecturally legal(ish), but are unlikely to occur in most real world
scenarios

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-24 06:08:01 -05:00
Hendrik Borghorst
a44b331614 KVM: x86/vmx: Do not skip segment attributes if unusable bit is set
When serializing and deserializing kvm_sregs, attributes of the segment
descriptors are stored by user space. For unusable segments,
vmx_segment_access_rights skips all attributes and sets them to 0.

This means we zero out the DPL (Descriptor Privilege Level) for unusable
entries.

Unusable segments are - contrary to their name - usable in 64bit mode and
are used by guests to for example create a linear map through the
NULL selector.

VMENTER checks if SS.DPL is correct depending on the CS segment type.
For types 9 (Execute Only) and 11 (Execute Read), CS.DPL must be equal to
SS.DPL [1].

We have seen real world guests setting CS to a usable segment with DPL=3
and SS to an unusable segment with DPL=3. Once we go through an sregs
get/set cycle, SS.DPL turns to 0. This causes the virtual machine to crash
reproducibly.

This commit changes the attribute logic to always preserve attributes for
unusable segments. According to [2] SS.DPL is always saved on VM exits,
regardless of the unusable bit so user space applications should have saved
the information on serialization correctly.

[3] specifies that besides SS.DPL the rest of the attributes of the
descriptors are undefined after VM entry if unusable bit is set. So, there
should be no harm in setting them all to the previous state.

[1] Intel SDM Vol 3C 26.3.1.2 Checks on Guest Segment Registers
[2] Intel SDM Vol 3C 27.3.2 Saving Segment Registers and Descriptor-Table
Registers
[3] Intel SDM Vol 3C 26.3.2.2 Loading Guest Segment Registers and
Descriptor-Table Registers

Cc: Alexander Graf <graf@amazon.de>
Cc: stable@vger.kernel.org
Signed-off-by: Hendrik Borghorst <hborghor@amazon.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221114164823.69555-1-hborghor@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-22 04:09:28 -05:00
Sean Christopherson
b3f257a846 KVM: x86: Track required APICv inhibits with variable, not callback
Track the per-vendor required APICv inhibits with a variable instead of
calling into vendor code every time KVM wants to query the set of
required inhibits.  The required inhibits are a property of the vendor's
virtualization architecture, i.e. are 100% static.

Using a variable allows the compiler to inline the check, e.g. generate
a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking
inhibits for performance reasons.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:34 -05:00
Sean Christopherson
5063c41beb KVM: x86: Inhibit APICv/AVIC if the optimized physical map is disabled
Inhibit APICv/AVIC if the optimized physical map is disabled so that KVM
KVM provides consistent APIC behavior if xAPIC IDs are aliased due to
vcpu_id being truncated and the x2APIC hotplug hack isn't enabled.  If
the hotplug hack is disabled, events that are emulated by KVM will follow
architectural behavior (all matching vCPUs receive events, even if the
"match" is due to truncation), whereas APICv and AVIC will deliver events
only to the first matching vCPU, i.e. the vCPU that matches without
truncation.

Note, the "extra" inhibit is needed because  KVM deliberately ignores
mismatches due to truncation when applying the APIC_ID_MODIFIED inhibit
so that large VMs (>255 vCPUs) can run with APICv/AVIC.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:31 -05:00
Sean Christopherson
c482f2cebe KVM: x86: Move APIC access page helper to common x86 code
Move the APIC access page allocation helper function to common x86 code,
the allocation routine is virtually identical between APICv (VMX) and
AVIC (SVM).  Keep APICv's gfn_to_page() + put_page() sequence, which
verifies that a backing page can be allocated, i.e. that the system isn't
under heavy memory pressure.  Forcing the backing page to be populated
isn't strictly necessary, but skipping the effective prefetch only delays
the inevitable.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:25 -05:00
Chao Gao
c82a5c5c53 KVM: x86: Do compatibility checks when onlining CPU
Do compatibility checks when enabling hardware to effectively add
compatibility checks when onlining a CPU.  Abort enabling, i.e. the
online process, if the (hotplugged) CPU is incompatible with the known
good setup.

At init time, KVM does compatibility checks to ensure that all online
CPUs support hardware virtualization and a common set of features. But
KVM uses hotplugged CPUs without such compatibility checks. On Intel
CPUs, this leads to #GP if the hotplugged CPU doesn't support VMX, or
VM-Entry failure if the hotplugged CPU doesn't support all features
enabled by KVM.

Note, this is little more than a NOP on SVM, as SVM already checks for
full SVM support during hardware enabling.

Opportunistically add a pr_err() if setup_vmcs_config() fails, and
tweak all error messages to output which CPU failed.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-41-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:48:31 -05:00
Sean Christopherson
d83420c2d7 KVM: x86: Move CPU compat checks hook to kvm_x86_ops (from kvm_x86_init_ops)
Move the .check_processor_compatibility() callback from kvm_x86_init_ops
to kvm_x86_ops to allow a future patch to do compatibility checks during
CPU hotplug.

Do kvm_ops_update() before compat checks so that static_call() can be
used during compat checks.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-40-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:48:28 -05:00
Sean Christopherson
8504ef2139 KVM: VMX: Shuffle support checks and hardware enabling code around
Reorder code in vmx.c so that the VMX support check helpers reside above
the hardware enabling helpers, which will allow KVM to perform support
checks during hardware enabling (in a future patch).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-38-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:47:42 -05:00
Sean Christopherson
d419313249 KVM: x86: Do VMX/SVM support checks directly in vendor code
Do basic VMX/SVM support checks directly in vendor code instead of
implementing them via kvm_x86_ops hooks.  Beyond the superficial benefit
of providing common messages, which isn't even clearly a net positive
since vendor code can provide more precise/detailed messages, there's
zero advantage to bouncing through common x86 code.

Consolidating the checks will also simplify performing the checks
across all CPUs (in a future patch).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-37-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:47:42 -05:00
Sean Christopherson
462689b37f KVM: VMX: Use current CPU's info to perform "disabled by BIOS?" checks
Use this_cpu_has() instead of boot_cpu_has() to perform the effective
"disabled by BIOS?" checks for VMX.  This will allow consolidating code
between vmx_disabled_by_bios() and vmx_check_processor_compat().

Checking the boot CPU isn't a strict requirement as any divergence in VMX
enabling between the boot CPU and other CPUs will result in KVM refusing
to load thanks to the aforementioned vmx_check_processor_compat().

Furthermore, using the boot CPU was an unintentional change introduced by
commit a4d0b2fdbc ("KVM: VMX: Use VMX feature flag to query BIOS
enabling").  Prior to using the feature flags, KVM checked the raw MSR
value from the current CPU.

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-36-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:47:41 -05:00
Sean Christopherson
8d20bd6381 KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code.  In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.

Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.

Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.

Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl().  But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:47:35 -05:00
Sean Christopherson
08a9d59c6a KVM: x86: Use KBUILD_MODNAME to specify vendor module name
Use KBUILD_MODNAME to specify the vendor module name instead of manually
writing out the name to make it a bit more obvious that the name isn't
completely arbitrary.  A future patch will also use KBUILD_MODNAME to
define pr_fmt, at which point using KBUILD_MODNAME for kvm_x86_ops.name
further reinforces the intended usage of kvm_x86_ops.name.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-34-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:41:30 -05:00
Sean Christopherson
81a1cf9f89 KVM: Drop kvm_arch_check_processor_compat() hook
Drop kvm_arch_check_processor_compat() and its support code now that all
architecture implementations are nops.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:41:28 -05:00
Sean Christopherson
3045c483ee KVM: x86: Do CPU compatibility checks in x86 code
Move the CPU compatibility checks to pure x86 code, i.e. drop x86's use
of the common kvm_x86_check_cpu_compat() arch hook.  x86 is the only
architecture that "needs" to do per-CPU compatibility checks, moving
the logic to x86 will allow dropping the common code, and will also
give x86 more control over when/how the compatibility checks are
performed, e.g. TDX will need to enable hardware (do VMXON) in order to
perform compatibility checks.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:41:26 -05:00
Sean Christopherson
58ca193031 KVM: VMX: Make VMCS configuration/capabilities structs read-only after init
Tag vmcs_config and vmx_capability structs as __init, the canonical
configuration is generated during hardware_setup() and must never be
modified after that point.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-31-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:41:25 -05:00
Sean Christopherson
e32b120071 KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace
Call kvm_init() only after _all_ setup is complete, as kvm_init() exposes
/dev/kvm to userspace and thus allows userspace to create VMs (and call
other ioctls).  E.g. KVM will encounter a NULL pointer when attempting to
add a vCPU to the per-CPU loaded_vmcss_on_cpu list if userspace is able to
create a VM before vmx_init() configures said list.

 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: 0002 [#1] SMP
 CPU: 6 PID: 1143 Comm: stable Not tainted 6.0.0-rc7+ #988
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:vmx_vcpu_load_vmcs+0x68/0x230 [kvm_intel]
  <TASK>
  vmx_vcpu_load+0x16/0x60 [kvm_intel]
  kvm_arch_vcpu_load+0x32/0x1f0 [kvm]
  vcpu_load+0x2f/0x40 [kvm]
  kvm_arch_vcpu_create+0x231/0x310 [kvm]
  kvm_vm_ioctl+0x79f/0xe10 [kvm]
  ? handle_mm_fault+0xb1/0x220
  __x64_sys_ioctl+0x80/0xb0
  do_syscall_64+0x2b/0x50
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f5a6b05743b
  </TASK>
 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel(+) kvm irqbypass

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:41:01 -05:00
Sean Christopherson
4f8396b96a KVM: x86: Move guts of kvm_arch_init() to standalone helper
Move the guts of kvm_arch_init() to a new helper, kvm_x86_vendor_init(),
so that VMX can do _all_ arch and vendor initialization before calling
kvm_init().  Calling kvm_init() must be the _very_ last step during init,
as kvm_init() exposes /dev/kvm to userspace, i.e. allows creating VMs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:41:00 -05:00
Sean Christopherson
451d39e800 KVM: VMX: Move Hyper-V eVMCS initialization to helper
Move Hyper-V's eVMCS initialization to a dedicated helper to clean up
vmx_init(), and add a comment to call out that the Hyper-V init code
doesn't need to be unwound if vmx_init() ultimately fails.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221130230934.1014142-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:40:58 -05:00
Sean Christopherson
da66de44b0 KVM: VMX: Don't bother disabling eVMCS static key on module exit
Don't disable the eVMCS static key on module exit, kvm_intel.ko owns the
key so there can't possibly be users after the kvm_intel.ko is unloaded,
at least not without much bigger issues.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:40:57 -05:00
Sean Christopherson
2916b70fc3 KVM: VMX: Reset eVMCS controls in VP assist page during hardware disabling
Reset the eVMCS controls in the per-CPU VP assist page during hardware
disabling instead of waiting until kvm-intel's module exit.  The controls
are activated if and only if KVM creates a VM, i.e. don't need to be
reset if hardware is never enabled.

Doing the reset during hardware disabling will naturally fix a potential
NULL pointer deref bug once KVM disables CPU hotplug while enabling and
disabling hardware (which is necessary to fix a variety of bugs).  If the
kernel is running as the root partition, the VP assist page is unmapped
during CPU hot unplug, and so KVM's clearing of the eVMCS controls needs
to occur with CPU hot(un)plug disabled, otherwise KVM could attempt to
write to a CPU's VP assist page after it's unmapped.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221130230934.1014142-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:40:55 -05:00
Vitaly Kuznetsov
80edc49f6a KVM: VMX: Resurrect vmcs_conf sanitization for KVM-on-Hyper-V
Commit 9bcb90650e ("KVM: VMX: Get rid of eVMCS specific VMX controls
sanitization") dropped 'vmcs_conf' sanitization for KVM-on-Hyper-V because
there's no known Hyper-V version which would expose a feature
unsupported in eVMCS in VMX feature MSRs. This works well for all
currently existing Hyper-V version, however, future Hyper-V versions
may add features which are supported by KVM and are currently missing
in eVMCSv1 definition (e.g. APIC virtualization, PML,...). When this
happens, existing KVMs will get broken. With the inverted 'unsupported
by eVMCSv1' checks, we can resurrect vmcs_conf sanitization and make
KVM future proof.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:12 -05:00
Vitaly Kuznetsov
c128d3fd38 KVM: nVMX: Prepare to sanitize tertiary execution controls with eVMCS
In preparation to restoring vmcs_conf sanitization for KVM-on-Hyper-V,
(and for completeness) add tertiary VM-execution controls to
'evmcs_supported_ctrls'.

No functional change intended as KVM doesn't yet expose
MSR_IA32_VMX_PROCBASED_CTLS3 to its guests.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:11 -05:00
Vitaly Kuznetsov
96d6955d21 KVM: nVMX: Invert 'unsupported by eVMCSv1' check
When a new feature gets implemented in KVM, EVMCS1_UNSUPPORTED_* defines
need to be adjusted to avoid the situation when the feature is exposed
to the guest but there's no corresponding eVMCS field[s] for it. This
is not obvious and fragile. Invert 'unsupported by eVMCSv1' check and
make it 'supported by eVMCSv1' instead, this way it's much harder to
make a mistake. New features will get added to EVMCS1_SUPPORTED_*
defines when the corresponding fields are added to eVMCS definition.

No functional change intended. EVMCS1_SUPPORTED_* defines are composed
by taking KVM_{REQUIRED,OPTIONAL}_VMX_ defines and filtering out what
was previously known as EVMCS1_UNSUPPORTED_*.

From all the controls, SECONDARY_EXEC_TSC_SCALING requires special
handling as it's actually present in eVMCSv1 definition but is not
currently supported for Hyper-V-on-KVM, just for KVM-on-Hyper-V. As
evmcs_supported_ctrls will be used for both scenarios, just add it
there instead of EVMCS1_SUPPORTED_2NDEXEC.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:10 -05:00
Vitaly Kuznetsov
a2ad080b4d KVM: nVMX: Sanitize primary processor-based VM-execution controls with eVMCS too
The only unsupported primary processor-based VM-execution control at the
moment is CPU_BASED_ACTIVATE_TERTIARY_CONTROLS and KVM doesn't expose it
in nested VMX feature MSRs anyway (see nested_vmx_setup_ctls_msrs())
but in preparation to inverting "unsupported with eVMCS" checks (and
for completeness) it's better to sanitize MSR_IA32_VMX_PROCBASED_CTLS/
MSR_IA32_VMX_TRUE_PROCBASED_CTLS too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:08 -05:00
Vitaly Kuznetsov
1567037614 KVM: VMX: Resurrect vmcs_conf sanitization for KVM-on-Hyper-V
Commit 9bcb90650e ("KVM: VMX: Get rid of eVMCS specific VMX controls
sanitization") dropped 'vmcs_conf' sanitization for KVM-on-Hyper-V because
there's no known Hyper-V version which would expose a feature
unsupported in eVMCS in VMX feature MSRs. This works well for all
currently existing Hyper-V version, however, future Hyper-V versions
may add features which are supported by KVM and are currently missing
in eVMCSv1 definition (e.g. APIC virtualization, PML,...). When this
happens, existing KVMs will get broken. With the inverted 'unsupported
by eVMCSv1' checks, we can resurrect vmcs_conf sanitization and make
KVM future proof.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27 06:02:56 -05:00
Vitaly Kuznetsov
746b183391 KVM: nVMX: Prepare to sanitize tertiary execution controls with eVMCS
In preparation to restoring vmcs_conf sanitization for KVM-on-Hyper-V,
(and for completeness) add tertiary VM-execution controls to
'evmcs_supported_ctrls'.

No functional change intended as KVM doesn't yet expose
MSR_IA32_VMX_PROCBASED_CTLS3 to its guests.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27 06:02:55 -05:00
Vitaly Kuznetsov
70b31e50fb KVM: nVMX: Invert 'unsupported by eVMCSv1' check
When a new feature gets implemented in KVM, EVMCS1_UNSUPPORTED_* defines
need to be adjusted to avoid the situation when the feature is exposed
to the guest but there's no corresponding eVMCS field[s] for it. This
is not obvious and fragile. Invert 'unsupported by eVMCSv1' check and
make it 'supported by eVMCSv1' instead, this way it's much harder to
make a mistake. New features will get added to EVMCS1_SUPPORTED_*
defines when the corresponding fields are added to eVMCS definition.

No functional change intended. EVMCS1_SUPPORTED_* defines are composed
by taking KVM_{REQUIRED,OPTIONAL}_VMX_ defines and filtering out what
was previously known as EVMCS1_UNSUPPORTED_*.

From all the controls, SECONDARY_EXEC_TSC_SCALING requires special
handling as it's actually present in eVMCSv1 definition but is not
currently supported for Hyper-V-on-KVM, just for KVM-on-Hyper-V. As
evmcs_supported_ctrls will be used for both scenarios, just add it
there instead of EVMCS1_SUPPORTED_2NDEXEC.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27 06:02:55 -05:00
Vitaly Kuznetsov
bbbaf8cd17 KVM: nVMX: Sanitize primary processor-based VM-execution controls with eVMCS too
The only unsupported primary processor-based VM-execution control at the
moment is CPU_BASED_ACTIVATE_TERTIARY_CONTROLS and KVM doesn't expose it
in nested VMX feature MSRs anyway (see nested_vmx_setup_ctls_msrs())
but in preparation to inverting "unsupported with eVMCS" checks (and
for completeness) it's better to sanitize MSR_IA32_VMX_PROCBASED_CTLS/
MSR_IA32_VMX_TRUE_PROCBASED_CTLS too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27 06:02:54 -05:00
Sean Christopherson
a0860d68a2 KVM: nVMX: Don't stuff secondary execution control if it's not supported
When stuffing the allowed secondary execution controls for nested VMX in
response to CPUID updates, don't set the allowed-1 bit for a feature that
isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config.

WARN if KVM attempts to manipulate a feature that isn't supported.  All
features that are currently stuffed are always advertised to L1 for
nested VMX if they are supported in KVM's base configuration, and no
additional features should ever be added to the CPUID-induced stuffing
(updating VMX MSRs in response to CPUID updates is a long-standing KVM
flaw that is slowly being fixed).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213062306.667649-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23 12:32:03 -05:00
Sean Christopherson
31de69f4ee KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1
Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the
feature is supported in hardware and enabled in KVM's base, non-nested
configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported.
This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail
if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and
obviously allows L1 to enable the feature for L2.

KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing
the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when
updating secondary controls in response to KVM_SET_CPUID(2), but (a) that
depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID
updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction
that the guest value must be a strict subset of the supported host value.

Although no past commit explicitly enabled nested support for WAITPKG,
doing so is safe and functionally correct from an architectural
perspective as no additional KVM support is needed to virtualize TPAUSE,
UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards
VM-Exits to L1 as necessary (commit bf653b78f9, "KVM: vmx: Introduce
handle_unexpected_vmexit and handle WAITPKG vmexit").

Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in
hardware, i.e. always runs both L1 and L2 with the host's power management
settings for TPAUSE and UMWAIT.  See commit bf09fb6cba ("KVM: VMX: Stop
context switching MSR_IA32_UMWAIT_CONTROL") for more details.

Fixes: e69e72faa3 ("KVM: x86: Add support for user wait instructions")
Cc: stable@vger.kernel.org
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20221213062306.667649-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23 12:22:37 -05:00
Sean Christopherson
057b18756b KVM: nVMX: Document that ignoring memory failures for VMCLEAR is deliberate
Explicitly drop the result of kvm_vcpu_write_guest() when writing the
"launch state" as part of VMCLEAR emulation, and add a comment to call
out that KVM's behavior is architecturally valid.  Intel's pseudocode
effectively says that VMCLEAR is a nop if the target VMCS address isn't
in memory, e.g. if the address points at MMIO.

Add a FIXME to call out that suppressing failures on __copy_to_user() is
wrong, as memory (a memslot) does exist in that case.  Punt the issue to
the future as open coding kvm_vcpu_write_guest() just to make sure the
guest dies with -EFAULT isn't worth the extra complexity.  The flaw will
need to be addressed if KVM ever does something intelligent on uaccess
failures, e.g. to support post-copy demand paging, but in that case KVM
will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open
code a core KVM helper.

No functional change intended.

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527765 ("Error handling issues")
Fixes: 587d7e72ae ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220154224.526568-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23 12:16:49 -05:00
Sean Christopherson
2d6cd68636 KVM: VMX: Move MSR_IA32_FEAT_CTL.LOCKED check into "is valid" helper
Move the check on IA32_FEATURE_CONTROL being locked, i.e. read-only from
the guest, into the helper to check the overall validity of the incoming
value.  Opportunistically rename the helper to make it clear that it
returns a bool.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220607232353.3375324-3-seanjc@google.com
2022-11-30 16:29:53 -08:00
Sean Christopherson
d2a00af206 KVM: VMX: Allow userspace to set all supported FEATURE_CONTROL bits
Allow userspace to set all supported bits in MSR IA32_FEATURE_CONTROL
irrespective of the guest CPUID model, e.g. via KVM_SET_MSRS.  KVM's ABI
is that userspace is allowed to set MSRs before CPUID, i.e. can set MSRs
to values that would fault according to the guest CPUID model.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220607232353.3375324-2-seanjc@google.com
2022-11-30 16:29:53 -08:00
Sean Christopherson
0b5e7a16a0 KVM: VMX: Make vmread_error_trampoline() uncallable from C code
Declare vmread_error_trampoline() as an opaque symbol so that it cannot
be called from C code, at least not without some serious fudging.  The
trampoline always passes parameters on the stack so that the inline
VMREAD sequence doesn't need to clobber registers.  regparm(0) was
originally added to document the stack behavior, but it ended up being
confusing because regparm(0) is a nop for 64-bit targets.

Opportunustically wrap the trampoline and its declaration in #ifdeffery
to make it even harder to invoke incorrectly, to document why it exists,
and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT
is true for all supported toolchains.

No functional change intended.

Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com
2022-11-30 16:27:47 -08:00
Sean Christopherson
4a8fd4a720 KVM: nVMX: Reword comments about generating nested CR0/4 read shadows
Reword the comments that (attempt to) document nVMX's overrides of the
CR0/4 read shadows for L2 after calling vmx_set_cr0/4().  The important
behavior that needs to be documented is that KVM needs to override the
shadows to account for L1's masks even though the shadows are set by the
common helpers (and that setting the shadows first would result in the
correct shadows being clobbered).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220831000721.4066617-1-seanjc@google.com
2022-11-30 16:27:17 -08:00
Jim Mattson
2e7eab8142 KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS
According to Intel's document on Indirect Branch Restricted
Speculation, "Enabling IBRS does not prevent software from controlling
the predicted targets of indirect branches of unrelated software
executed later at the same predictor mode (for example, between two
different user applications, or two different virtual machines). Such
isolation can be ensured through use of the Indirect Branch Predictor
Barrier (IBPB) command." This applies to both basic and enhanced IBRS.

Since L1 and L2 VMs share hardware predictor modes (guest-user and
guest-kernel), hardware IBRS is not sufficient to virtualize
IBRS. (The way that basic IBRS is implemented on pre-eIBRS parts,
hardware IBRS is actually sufficient in practice, even though it isn't
sufficient architecturally.)

For virtual CPUs that support IBRS, add an indirect branch prediction
barrier on emulated VM-exit, to ensure that the predicted targets of
indirect branches executed in L1 cannot be controlled by software that
was executed in L2.

Since we typically don't intercept guest writes to IA32_SPEC_CTRL,
perform the IBPB at emulated VM-exit regardless of the current
IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be
deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is
clear at emulated VM-exit.

This is CVE-2022-2196.

Fixes: 5c911beff2 ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02")
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30 16:15:44 -08:00
Jim Mattson
4f20998958 KVM: VMX: Guest usage of IA32_SPEC_CTRL is likely
At this point in time, most guests (in the default, out-of-the-box
configuration) are likely to use IA32_SPEC_CTRL.  Therefore, drop the
compiler hint that it is unlikely for KVM to be intercepting WRMSR of
IA32_SPEC_CTRL.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221019213620.1953281-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30 16:15:44 -08:00