Commit Graph

11000 Commits

Author SHA1 Message Date
Sean Christopherson
8f2a27752e KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps
Switch all queries (except XSAVES) of guest features from guest CPUID to
guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls
to guest_cpu_cap_has().

Keep guest_cpuid_has() around for XSAVES, but subsume its helper
guest_cpuid_get_register() and add a compile-time assertion to prevent
using guest_cpuid_has() for any other feature.  Add yet another comment
for XSAVE to explain why KVM is allowed to query its raw guest CPUID.

Opportunistically drop the unused guest_cpuid_clear(), as there should be
no circumstance in which KVM needs to _clear_ a guest CPUID feature now
that everything is tracked via cpu_caps.  E.g. KVM may need to _change_
a feature to emulate dynamic CPUID flags, but KVM should never need to
clear a feature in guest CPUID to prevent it from being used by the guest.

Delete the last remnants of the governed features framework, as the lone
holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for
governed vs. ungoverned features.

Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when
computing reserved CR4 bits is a nop when viewed as a whole, as KVM's
capabilities are already incorporated into the calculation, i.e. if a
feature is present in guest CPUID but unsupported by KVM, its CR4 bit
was already being marked as reserved, checking guest_cpu_cap_has() simply
double-stamps that it's a reserved bit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:15 -08:00
Sean Christopherson
820545bdfe KVM: x86: Shuffle code to prepare for dropping guest_cpuid_has()
Move the implementations of guest_has_{spec_ctrl,pred_cmd}_msr() down
below guest_cpu_cap_has() so that their use of guest_cpuid_has() can be
replaced with calls to guest_cpu_cap_has().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-50-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:14 -08:00
Sean Christopherson
75d4642fce KVM: x86: Update guest cpu_caps at runtime for dynamic CPUID-based features
When updating guest CPUID entries to emulate runtime behavior, e.g. when
the guest enables a CR4-based feature that is tied to a CPUID flag, also
update the vCPU's cpu_caps accordingly.  This will allow replacing all
usage of guest_cpuid_has() with guest_cpu_cap_has().

Note, this relies on kvm_set_cpuid() taking a snapshot of cpu_caps before
invoking kvm_update_cpuid_runtime(), i.e. when KVM is updating CPUID
entries that *may* become the vCPU's CPUID, so that unwinding to the old
cpu_caps is possible if userspace tries to set bogus CPUID information.

Note #2, none of the features in question use guest_cpu_cap_has() at this
time, i.e. aside from settings bits in cpu_caps, this is a glorified nop.

Cc: Yang Weijiang <weijiang.yang@intel.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-49-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:13 -08:00
Sean Christopherson
1f66590d7f KVM: x86: Update OS{XSAVE,PKE} bits in guest CPUID irrespective of host support
When making runtime CPUID updates, change OSXSAVE and OSPKE even if their
respective base features (XSAVE, PKU) are not supported by the host.  KVM
already incorporates host support in the vCPU's effective reserved CR4 bits.
I.e. OSXSAVE and OSPKE can be set if and only if the host supports them.

And conversely, since KVM's ABI is that KVM owns the dynamic OS feature
flags, clearing them when they obviously aren't supported and thus can't
be enabled is arguably a fix.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-48-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:12 -08:00
Sean Christopherson
cfd1574526 KVM: x86: Drop unnecessary check that cpuid_entry2_find() returns right leaf
Drop an unnecessary check that kvm_find_cpuid_entry_index(), i.e.
cpuid_entry2_find(), returns the correct leaf when getting CPUID.0x7.0x0
to update X86_FEATURE_OSPKE.  cpuid_entry2_find() never returns an entry
for the wrong function.  And not that it matters, but cpuid_entry2_find()
will always return a precise match for CPUID.0x7.0x0 since the index is
significant.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-47-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:11 -08:00
Sean Christopherson
963180ae06 KVM: x86: Avoid double CPUID lookup when updating MWAIT at runtime
Move the handling of X86_FEATURE_MWAIT during CPUID runtime updates to
utilize the lookup done for other CPUID.0x1 features.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-46-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:10 -08:00
Sean Christopherson
e592ec657d KVM: x86: Initialize guest cpu_caps based on KVM support
Constrain all guest cpu_caps based on KVM support instead of constraining
only the few features that KVM _currently_ needs to verify are actually
supported by KVM.  The intent of cpu_caps is to track what the guest is
actually capable of using, not the raw, unfiltered CPUID values that the
guest sees.

I.e. KVM should always consult it's only support when making decisions
based on guest CPUID, and the only reason KVM has historically made the
checks opt-in was due to lack of centralized tracking.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:09 -08:00
Sean Christopherson
d4b9ff3d55 KVM: x86: Treat MONTIOR/MWAIT as a "partially emulated" feature
Enumerate MWAIT in cpuid_func_emulated(), but only if the caller wants to
include "partially emulated" features, i.e. features that KVM kinda sorta
emulates, but with major caveats.  This will allow initializing the guest
cpu_caps based on the set of features that KVM virtualizes and/or emulates,
without needing to handle things like MONITOR/MWAIT as one-off exceptions.

Adding one-off handling for individual features is quite painful,
especially when considering future hardening.  It's very doable to verify,
at compile time, that every CPUID-based feature that KVM queries when
emulating guest behavior is actually known to KVM, e.g. to prevent KVM
bugs where KVM emulates some feature but fails to advertise support to
userspace.  In other words, any features that are special cased, i.e. not
handled generically in the CPUID framework, would also need to be special
cased for any hardening efforts that build on said framework.

Link: https://lore.kernel.org/r/20241128013424.4096668-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:08 -08:00
Sean Christopherson
ff402f56e8 KVM: x86: Extract code for generating per-entry emulated CPUID information
Extract the meat of __do_cpuid_func_emulated() into a separate helper,
cpuid_func_emulated(), so that cpuid_func_emulated() can be used with a
single CPUID entry.  This will allow marking emulated features as fully
supported in the guest cpu_caps without needing to hardcode the set of
emulated features in multiple locations.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:07 -08:00
Sean Christopherson
a7a308f863 KVM: x86: Initialize guest cpu_caps based on guest CPUID
Initialize a vCPU's capabilities based on the guest CPUID provided by
userspace instead of simply zeroing the entire array.  This is the first
step toward using cpu_caps to query *all* CPUID-based guest capabilities,
i.e. will allow converting all usage of guest_cpuid_has() to
guest_cpu_cap_has().

Zeroing the array was the logical choice when using cpu_caps was opt-in,
e.g. "unsupported" was generally a safer default, and the whole point of
governed features is that KVM would need to check host and guest support,
i.e. making everything unsupported by default didn't require more code.

But requiring KVM to manually "enable" every CPUID-based feature in
cpu_caps would require an absurd amount of boilerplate code.

Follow existing CPUID/kvm_cpu_caps nomenclature where possible, e.g. for
the change() and clear() APIs.  Replace check_and_set() with constrain()
to try and capture that KVM is constraining userspace's desired guest
feature set based on KVM's capabilities.

This is intended to be gigantic nop, i.e. should not have any impact on
guest or KVM functionality.

This is also an intermediate step; a future commit will also incorporate
KVM support into the vCPU's cpu_caps before converting guest_cpuid_has()
to guest_cpu_cap_has().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:06 -08:00
Sean Christopherson
7ea34578ae KVM: x86: Replace guts of "governed" features with comprehensive cpu_caps
Replace the internals of the governed features framework with a more
comprehensive "guest CPU capabilities" implementation, i.e. with a guest
version of kvm_cpu_caps.  Keep the skeleton of governed features around
for now as vmx_adjust_sec_exec_control() relies on detecting governed
features to do the right thing for XSAVES, and switching all guest feature
queries to guest_cpu_cap_has() requires subtle and non-trivial changes,
i.e. is best done as a standalone change.

Tracking *all* guest capabilities that KVM cares will allow excising the
poorly named "governed features" framework, and effectively optimizes all
KVM queries of guest capabilities, i.e. doesn't require making a
subjective decision as to whether or not a feature is worth "governing",
and doesn't require adding the code to do so.

The cost of tracking all features is currently 92 bytes per vCPU on 64-bit
kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features.
That cost is well worth paying even if the only benefit was eliminating
the "governed features" terminology.  And practically speaking, the real
cost is zero unless those 92 bytes pushes the size of vcpu_vmx or vcpu_svm
into a new order-N allocation, and if that happens there are better ways
to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR
state separate allocations.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:05 -08:00
Sean Christopherson
2c5e168e5c KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().

The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.

Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:03 -08:00
Sean Christopherson
9aa470f5dd KVM: x86: Advertise HYPERVISOR in KVM_GET_SUPPORTED_CPUID
Unconditionally advertise "support" for the HYPERVISOR feature in CPUID,
as the flag simply communicates to the guest that's it's running under a
hypervisor.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:02 -08:00
Sean Christopherson
9be4ec35d6 KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID
Unconditionally advertise TSC_DEADLINE_TIMER via KVM_GET_SUPPORTED_CPUID,
as KVM always emulates deadline mode, *if* the VM has an in-kernel local
APIC.  The odds of a VMM emulating the local APIC in userspace, not
emulating the TSC deadline timer, _and_ reflecting
KVM_GET_SUPPORTED_CPUID back into KVM_SET_CPUID2, i.e. the risk of
over-advertising and breaking any setups, is extremely low.

KVM has _unconditionally_ advertised X2APIC via CPUID since commit
0d1de2d901 ("KVM: Always report x2apic as supported feature"), and it
is completely impossible for userspace to emulate X2APIC as KVM doesn't
support forwarding the MSR accesses to userspace.  I.e. KVM has relied on
userspace VMMs to not misreport local APIC capabilities for nearly 13
years.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:01 -08:00
Sean Christopherson
136d605b43 KVM: x86: Remove all direct usage of cpuid_entry2_find()
Convert all use of cpuid_entry2_find() to kvm_find_cpuid_entry{,index}()
now that cpuid_entry2_find() operates on the vCPU state, i.e. now that
there is no need to use cpuid_entry2_find() directly in order to pass in
non-vCPU state.

To help prevent unwanted usage of cpuid_entry2_find(), #undef
KVM_CPUID_INDEX_NOT_SIGNIFICANT, i.e. force KVM to use
kvm_find_cpuid_entry().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:00 -08:00
Sean Christopherson
8b30cb367c KVM: x86: Move kvm_find_cpuid_entry{,_index}() up near cpuid_entry2_find()
Move kvm_find_cpuid_entry{,_index}() "up" in cpuid.c so that they are
colocated with cpuid_entry2_find(), e.g. to make it easier to see the
effective guts of the helpers without having to bounce around cpuid.c.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:59 -08:00
Sean Christopherson
285185f8e4 KVM: x86: Always operate on kvm_vcpu data in cpuid_entry2_find()
Now that KVM sets vcpu->arch.cpuid_{entries,nent} before processing the
incoming CPUID entries during KVM_SET_CPUID{,2}, drop the @entries and
@nent params from cpuid_entry2_find() and unconditionally operate on the
vCPU state.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-35-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:58 -08:00
Sean Christopherson
a5b3271808 KVM: x86: Remove unnecessary caching of KVM's PV CPUID base
Now that KVM only searches for KVM's PV CPUID base when userspace sets
guest CPUID, drop the cache and simply do the search every time.

Practically speaking, this is a nop except for situations where userspace
sets CPUID _after_ running the vCPU, which is anything but a hot path,
e.g. QEMU does so only when hotplugging a vCPU.  And on the flip side,
caching guest CPUID information, especially information that is used to
query/modify _other_ CPUID state, is inherently dangerous as it's all too
easy to use stale information, i.e. KVM should only cache CPUID state when
the performance and/or programming benefits justify it.

Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:56 -08:00
Sean Christopherson
63d8c702c2 KVM: x86: Clear PV_UNHALT for !HLT-exiting only when userspace sets CPUID
Now that KVM disallows disabling HLT-exiting after vCPUs have been created,
i.e. now that it's impossible for kvm_hlt_in_guest() to change while vCPUs
are running, apply KVM's PV_UNHALT quirk only when userspace is setting
guest CPUID.

Opportunistically rename the helper to make it clear that KVM's behavior
is a quirk that should never have been added.  KVM's documentation
explicitly states that userspace should not advertise PV_UNHALT if
HLT-exiting is disabled, but for unknown reasons, commit caa057a2ca
("KVM: X86: Provide a capability to disable HLT intercepts") didn't stop
at documenting the requirement and also massaged the incoming guest CPUID.

Unfortunately, it's quite likely that userspace has come to rely on KVM's
behavior, i.e. the code can't simply be deleted.  The only reason KVM
doesn't have an "official" quirk is that there is no known use case where
disabling the quirk would make sense, i.e. letting userspace disable the
quirk would further increase KVM's burden without any benefit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:55 -08:00
Sean Christopherson
8c01290bda KVM: x86: Swap incoming guest CPUID into vCPU before massaging in KVM_SET_CPUID2
When handling KVM_SET_CPUID{,2}, swap the old and new CPUID arrays and
lengths before processing the new CPUID, and simply undo the swap if
setting the new CPUID fails for whatever reason.

To keep the diff reasonable, continue passing the entry array and length
to most helpers, and defer the more complete cleanup to future commits.

For any sane VMM, setting "bad" CPUID state is not a hot path (or even
something that is surviable), and setting guest CPUID before it's known
good will allow removing all of KVM's infrastructure for processing CPUID
entries directly (as opposed to operating on vcpu->arch.cpuid_entries).

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:54 -08:00
Sean Christopherson
6174004ebd KVM: x86: Add a macro to init CPUID features that KVM emulates in software
Now that kvm_cpu_cap_init() is a macro with its own scope, add EMUL_F() to
OR-in features that KVM emulates in software, i.e. that don't depend on
the feature being available in hardware.  The contained scope
of kvm_cpu_cap_init() allows using a local variable to track the set of
emulated leaves, which in addition to avoiding confusing and/or
unnecessary variables, helps prevent misuse of EMUL_F().

Link: https://lore.kernel.org/r/20241128013424.4096668-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:53 -08:00
Sean Christopherson
5c8de4b3a5 KVM: x86: Add a macro to init CPUID features that ignore host kernel support
Add a macro for use in kvm_set_cpu_caps() to automagically initialize
features that KVM wants to support based solely on the CPU's capabilities,
e.g. KVM advertises LA57 support if it's available in hardware, even if
the host kernel isn't utilizing 57-bit virtual addresses.

Track a features that are passed through to userspace (from hardware) in
a local variable, and simply OR them in *after* adjusting the capabilities
that came from boot_cpu_data.

Note, eliminating the open-coded call to cpuid_ecx() also fixes a largely
benign bug where KVM could incorrectly report LA57 support on Intel CPUs
whose max supported CPUID is less than 7, i.e. if the max supported leaf
(<7) happened to have bit 16 set.  In practice, barring a funky virtual
machine setup, the bug is benign as all known CPUs that support VMX also
support leaf 7.

Link: https://lore.kernel.org/r/20241128013424.4096668-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:52 -08:00
Sean Christopherson
3d142340d7 KVM: x86: Harden CPU capabilities processing against out-of-scope features
Add compile-time assertions to verify that usage of F() and friends in
kvm_set_cpu_caps() is scoped to the correct CPUID word, e.g. to detect
bugs where KVM passes a feature bit from word X into word y.

Add a one-off assertion in the aliased feature macro to ensure that only
word 0x8000_0001.EDX aliased the features defined for 0x1.EDX.

To do so, convert kvm_cpu_cap_init() to a macro and have it define a
local variable to track which CPUID word is being initialized that is
then used to validate usage of F() (all of the inputs are compile-time
constants and thus can be fed into BUILD_BUG_ON()).

Redefine KVM_VALIDATE_CPU_CAP_USAGE after kvm_set_cpu_caps() to be a nop
so that F() can be used in other flows that aren't as easily hardened,
e.g. __do_cpuid_func_emulated() and __do_cpuid_func().

Invoke KVM_VALIDATE_CPU_CAP_USAGE() in SF() and X86_64_F() to ensure the
validation occurs, e.g. if the usage of F() is completely compiled out
(which shouldn't happen for boot_cpu_has(), but could happen in the future,
e.g. if KVM were to use cpu_feature_enabled()).

Link: https://lore.kernel.org/r/20241128013424.4096668-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:51 -08:00
Sean Christopherson
8d862c270b KVM: x86: #undef SPEC_CTRL_SSBD in cpuid.c to avoid macro collisions
Undefine SPEC_CTRL_SSBD, which is #defined by msr-index.h to represent the
enable flag in MSR_IA32_SPEC_CTRL, to avoid issues with the macro being
unpacked into its raw value when passed to KVM's F() macro.  This will
allow using multiple layers of macros in F() and friends, e.g. to harden
against incorrect usage of F().

No functional change intended (cpuid.c doesn't consume SPEC_CTRL_SSBD).

Link: https://lore.kernel.org/r/20241128013424.4096668-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:50 -08:00
Sean Christopherson
46505c0f69 KVM: x86: Handle kernel- and KVM-defined CPUID words in a single helper
Merge kvm_cpu_cap_init() and kvm_cpu_cap_init_kvm_defined() into a single
helper.  The only advantage of separating the two was to make it somewhat
obvious that KVM directly initializes the KVM-defined words, whereas using
a common helper will allow for hardening both kernel- and KVM-defined
CPUID words without needing copy+paste.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:49 -08:00
Sean Christopherson
264969b48a KVM: x86: Add a macro to precisely handle aliased 0x1.EDX CPUID features
Add a macro to precisely handle CPUID features that AMD duplicated from
CPUID.0x1.EDX into CPUID.0x8000_0001.EDX.  This will allow adding an
assert that all features passed to kvm_cpu_cap_init() match the word being
processed, e.g. to prevent passing a feature from CPUID 0x7 to CPUID 0x1.

Because the kernel simply reuses the X86_FEATURE_* definitions from
CPUID.0x1.EDX, KVM's use of the aliased features would result in false
positives from such an assert.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:48 -08:00
Sean Christopherson
6eac4d99a9 KVM: x86: Add a macro to init CPUID features that are 64-bit only
Add a macro to mask-in feature flags that are supported only on 64-bit
kernels/KVM.  In addition to reducing overall #ifdeffery, using a macro
will allow hardening the kvm_cpu_cap initialization sequences to assert
that the features being advertised are indeed included in the word being
initialized.  And arguably using *F() macros through is more readable.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:47 -08:00
Sean Christopherson
3cc359ca29 KVM: x86: Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init()
Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init() in anticipation of merging
it with kvm_cpu_cap_init_kvm_defined(), and in anticipation of _setting_
bits in the helper (a future commit will play macro games to set emulated
feature flags via kvm_cpu_cap_init()).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:46 -08:00
Sean Christopherson
ccf93de484 KVM: x86: Unpack F() CPUID feature flag macros to one flag per line of code
Refactor kvm_set_cpu_caps() to express each supported (or not) feature
flag on a separate line, modulo a handful of cases where KVM does not, and
likely will not, support a sequence of flags.  This will allow adding
fancier macros with longer, more descriptive names without resulting in
absurd line lengths and/or weird code.  Isolating each flag also makes it
far easier to review changes, reduces code conflicts, and generally makes
it easier to resolve conflicts.  Lastly, it allows co-locating comments
for notable flags, e.g. MONITOR, precisely with the relevant flag.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:45 -08:00
Sean Christopherson
96cbc766ba KVM: x86: Account for max supported CPUID leaf when getting raw host CPUID
Explicitly zero out the feature word in kvm_cpu_caps if the word's
associated CPUID function is greater than the max leaf supported by the
CPU.  For such unsupported functions, Intel CPUs return the output from
the last supported leaf, not all zeros.

Practically speaking, this is likely a benign bug, as KVM uses the raw
host CPUID to mask the kernel's computed capabilities, and the kernel does
perform max leaf checks when populating boot_cpu_data.  The only way KVM's
goof could be problematic is if the kernel force-set a feature in a leaf
that is completely unsupported, _and_ the max supported leaf happened to
return a value with '1' the same bit position.  Which is theoretically
possible, but extremely unlikely.  And even if that did happen, it's
entirely possible that KVM would still provide the correct functionality;
the kernel did set the capability after all.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:43 -08:00
Sean Christopherson
6416b0fb16 KVM: x86: Do reverse CPUID sanity checks in __feature_leaf()
Do the compile-time sanity checks on reverse_cpuid in __feature_leaf() so
that higher level APIs don't need to "manually" perform the sanity checks.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:42 -08:00
Sean Christopherson
f21958e328 KVM: x86: Don't update PV features caches when enabling enforcement capability
Revert the chunk of commit 01b4f510b9 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") that forced a PV features cache refresh
during KVM_CAP_ENFORCE_PV_FEATURE_CPUID, as whatever ioctl() ordering
issue it alleged to have fixed never existed upstream, and likely never
existed in any kernel.

At the time of the commit, there was a tangentially related ioctl()
ordering issue, as toggling KVM_X86_DISABLE_EXITS_HLT after KVM_SET_CPUID2
would have resulted in KVM potentially leaving KVM_FEATURE_PV_UNHALT set.
But (a) that bug affected the entire guest CPUID, not just the cache, (b)
commit 01b4f510b9 didn't address that bug, it only refreshed the cache
(with the bad CPUID), and (c) setting KVM_X86_DISABLE_EXITS_HLT after vCPU
creation is completely broken as KVM configures HLT-exiting only during
vCPU creation, which is why KVM_CAP_X86_DISABLE_EXITS is now disallowed if
vCPUs have been created.

Another tangentially related bug was KVM's failure to clear the cache when
handling KVM_SET_CPUID2, but again commit 01b4f510b9 did nothing to fix
that bug.

The most plausible explanation for the what commit 01b4f510b9 was trying
to fix is a bug that existed in Google's internal kernel that was the
source of commit 01b4f510b9.  At the time, Google's internal kernel had
not yet picked up commit 0d3b2ba16b ("KVM: X86: Go on updating other
CPUID leaves when leaf 1 is absent"), i.e. KVM would not initialize the
PV features cache if KVM_SET_CPUID2 was called without a CPUID.0x1 entry.

Of course, no sane real world VMM would omit CPUID.0x1, including the KVM
selftest added by commit ac4a4d6de2 ("selftests: kvm: test enforcement
of paravirtual cpuid features").  And the test didn't actually try to
verify multiple orderings, nor did the selftest enter the guest without
doing KVM_SET_CPUID2, so who knows what motivated the change.

Regardless of why commit 01b4f510b9 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") was added, refreshing the cache during
KVM_CAP_ENFORCE_PV_FEATURE_CPUID isn't necessary.

Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:41 -08:00
Sean Christopherson
01d1059d63 KVM: x86: Zero out PV features cache when the CPUID leaf is not present
Clear KVM's PV feature cache prior when processing a new guest CPUID so
that KVM doesn't keep a stale cache entry if userspace does KVM_SET_CPUID2
multiple times, once with a PV features entry, and a second time without.

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:40 -08:00
Sean Christopherson
c829ccd4d9 KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
Reject KVM_CAP_X86_DISABLE_EXITS if userspace attempts to disable MWAIT or
HLT exits and KVM previously reported (via KVM_CHECK_EXTENSION) that
disabling the exit(s) is not allowed.  E.g. because MWAIT isn't supported
or the CPU doesn't have an always-running APIC timer, or because KVM is
configured to mitigate cross-thread vulnerabilities.

Cc: Kechen Lu <kechenl@nvidia.com>
Fixes: 4d5422cea3 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Fixes: 6f0f2d5ef8 ("KVM: x86: Mitigate the cross-thread return address predictions bug")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:36 -08:00
Sean Christopherson
04cd8f8628 KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling
PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless,
e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after
vCPU creation.  vCPUs may also end up with an inconsistent configuration
if exits are disabled between creation of multiple vCPUs.

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:35 -08:00
Sean Christopherson
21d7f06d1a KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU creation
Drop the manual initialization of maxphyaddr and reserved_gpa_bits during
vCPU creation now that kvm_arch_vcpu_create() unconditionally invokes
kvm_vcpu_after_set_cpuid(), which handles all such CPUID caching.

None of the helpers between the existing code in kvm_arch_vcpu_create()
and the call to kvm_vcpu_after_set_cpuid() consume maxphyaddr or
reserved_gpa_bits (though auditing vmx_vcpu_create() and svm_vcpu_create()
isn't exactly easy).

Link: https://lore.kernel.org/r/20241128013424.4096668-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:34 -08:00
Sean Christopherson
ac32cbd4df KVM: x86/pmu: Drop now-redundant refresh() during init()
Drop the manual kvm_pmu_refresh() from kvm_pmu_init() now that
kvm_arch_vcpu_create() performs the refresh via kvm_vcpu_after_set_cpuid().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:33 -08:00
Sean Christopherson
b0c3d68717 KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h
Let vendor code inline __kvm_is_valid_cr4() now x86.c's cr4_reserved_bits
no longer exists, as keeping cr4_reserved_bits local to x86.c was the only
reason for "hiding" the definition of __kvm_is_valid_cr4().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:32 -08:00
Sean Christopherson
7520a53b8e KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX
Drop x86.c's local pre-computed cr4_reserved bits and instead fold KVM's
reserved bits into the guest's reserved bits.  This fixes a bug where VMX's
set_cr4_guest_host_mask() fails to account for KVM-reserved bits when
deciding which bits can be passed through to the guest.  In most cases,
letting the guest directly write reserved CR4 bits is ok, i.e. attempting
to set the bit(s) will still #GP, but not if a feature is available in
hardware but explicitly disabled by the host, e.g. if FSGSBASE support is
disabled via "nofsgsbase".

Note, the extra overhead of computing host reserved bits every time
userspace sets guest CPUID is negligible.  The feature bits that are
queried are packed nicely into a handful of words, and so checking and
setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the
total cost will be in the noise even if the number of checked CR4 bits
doubles over the next few years.  In other words, x86 will run out of CR4
bits long before the overhead becomes problematic.

Note #2, __cr4_reserved_bits() starts from CR4_RESERVED_BITS, which is
why the existing __kvm_cpu_cap_has() processing doesn't explicitly OR in
CR4_RESERVED_BITS (and why the new code doesn't do so either).

Fixes: 2ed41aa631 ("KVM: VMX: Intercept guest reserved CR4 bits to inject #GP fault")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:26 -08:00
Sean Christopherson
ec3d4440b2 KVM: x86: Explicitly do runtime CPUID updates "after" initial setup
Explicitly perform runtime CPUID adjustments as part of the "after set
CPUID" flow to guard against bugs where KVM consumes stale vCPU/CPUID
state during kvm_update_cpuid_runtime().  E.g. see commit 4736d85f0d
("KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT").

Whacking each mole individually is not sustainable or robust, e.g. while
the aforemention commit fixed KVM's PV features, the same issue lurks for
Xen and Hyper-V features, Xen and Hyper-V simply don't have any runtime
features (though spoiler alert, neither should KVM).

Updating runtime features in the "full" path will also simplify adding a
snapshot of the guest's capabilities, i.e. of caching the intersection of
guest CPUID and kvm_cpu_caps (modulo a few edge cases).

Link: https://lore.kernel.org/r/20241128013424.4096668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:25 -08:00
Sean Christopherson
85e5ba83c0 KVM: x86: Do all post-set CPUID processing during vCPU creation
During vCPU creation, process KVM's default, empty CPUID as if userspace
set an empty CPUID to ensure consistent and correct behavior with respect
to guest CPUID.  E.g. if userspace never sets guest CPUID, KVM will never
configure cr4_guest_rsvd_bits, and thus create divergent, incorrect, guest-
visible behavior due to letting the guest set any KVM-supported CR4 bits
despite the features not being allowed per guest CPUID.

Note!  This changes KVM's ABI, as lack of full CPUID processing allowed
userspace to stuff garbage vCPU state, e.g. userspace could set CR4 to a
guest-unsupported value via KVM_SET_SREGS.  But it's extremely unlikely
that this is a breaking change, as KVM already has many flows that require
userspace to set guest CPUID before loading vCPU state.  E.g. multiple MSR
flows consult guest CPUID on host writes, and KVM_SET_SREGS itself already
relies on guest CPUID being up-to-date, as KVM's validity check on CR3
consumes CPUID.0x7.1 (for LAM) and CPUID.0x80000008 (for MAXPHYADDR).

Furthermore, the plan is to commit to enforcing guest CPUID for userspace
writes to MSRs, at which point bypassing sregs CPUID checks is even more
nonsensical.

Link: https://lore.kernel.org/r/20241128013424.4096668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:24 -08:00
Sean Christopherson
4b027f5af9 KVM: x86: Limit use of F() and SF() to kvm_cpu_cap_{mask,init_kvm_defined}()
Define and undefine the F() and SF() macros precisely around
kvm_set_cpu_caps() to make it all but impossible to use the macros outside
of kvm_cpu_cap_{mask,init_kvm_defined}().  Currently, F() is a simple
passthrough, but SF() is actively dangerous as it checks that the scattered
feature is supported by the host kernel.

And usage outside of the aforementioned helpers will run afoul of future
changes to harden KVM's CPUID management.

Opportunistically switch to feature_bit() when stuffing LA57 based on raw
hardware support.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:23 -08:00
Sean Christopherson
ccf4c1d15d KVM: x86: Use feature_bit() to clear CONSTANT_TSC when emulating CPUID
When clearing CONSTANT_TSC during CPUID emulation due to a Hyper-V quirk,
use feature_bit() instead of SF() to ensure the bit is actually cleared.
SF() evaluates to zero if the _host_ doesn't support the feature.  I.e.
KVM could keep the bit set if userspace advertised CONSTANT_TSC despite
it not being supported in hardware.

Note, translating from a scattered feature to a the hardware version is
done by __feature_translate(), not SF().  The sole purpose of SF() is to
check kernel support for the scattered feature, *before* translation.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:22 -08:00
Sean Christopherson
036e78a942 KVM: SVM: Remove redundant TLB flush on guest CR4.PGE change
Drop SVM's direct TLB flush when CR4.PGE is toggled and NPT is enabled, as
KVM already guarantees TLBs are flushed appropriately.

For the call from cr_trap(), kvm_post_set_cr4() requests TLB_FLUSH_GUEST
(which is a superset of TLB_FLUSH_CURRENT) when CR4.PGE is toggled,
regardless of whether or not KVM is using TDP.

The calls from nested_vmcb02_prepare_save() and nested_svm_vmexit() are
checking guest (L2) vs. host (L1) CR4, and so a flush is unnecessary as L2
is defined to use a different ASID (from L1's perspective).

Lastly, the call from svm_set_cr0() passes in the current CR4 value, i.e.
can't toggle PGE.

Link: https://lore.kernel.org/r/20241127235312.4048445-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 16:15:28 -08:00
Sean Christopherson
45d522d3ee KVM: SVM: Macrofy SEV=n versions of sev_xxx_guest()
Define sev_{,es_,snp_}guest() as "false" when SEV is disabled via Kconfig,
i.e. when CONFIG_KVM_AMD_SEV=n.  Despite the helpers being __always_inline,
gcc-12 is somehow incapable of realizing that the return value is a
compile-time constant and generates sub-optimal code.

Opportunistically clump the paths together to reduce the amount of
ifdeffery.

No functional change intended.

Link: https://lore.kernel.org/r/20241127234659.4046347-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 16:15:27 -08:00
Juergen Gross
2d5faa6a84 KVM/x86: add comment to kvm_mmu_do_page_fault()
On a first glance it isn't obvious why calling kvm_tdp_page_fault() in
kvm_mmu_do_page_fault() is special cased, as the general case of using
an indirect case would result in calling of kvm_tdp_page_fault()
anyway.

Add a comment to explain the reason.

Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20241108161416.28552-1-jgross@suse.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 15:27:34 -08:00
Sean Christopherson
76bce9f101 KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX
can defer the update until after nested VM-Exit if an EOI for L1's vAPIC
occurs while L2 is active.

Note, commit d39850f57d ("KVM: x86: Drop @vcpu parameter from
kvm_x86_ops.hwapic_isr_update()") removed the parameter with the
justification that doing so "allows for a decent amount of (future)
cleanup in the APIC code", but it's not at all clear what cleanup was
intended, or if it was ever realized.

No functional change intended.

Cc: stable@vger.kernel.org
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 15:18:30 -08:00
Sean Christopherson
1201f226c8 KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to
avoid the overead of CPUID during runtime.  The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e.
is constant for a given CPU, and thus can be cached during module load.

On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values.  The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.

As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:

  SKX 11650
  ICX 22350
  EMR 28850

Using cached values, the latency drops to:

  SKX 6850
  ICX 9000
  EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR.  The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).

 SKX:
  CPUID.0xD.2  = 348 cycles
  CPUID.0xD.3  = 400 cycles
  CPUID.0xD.4  = 276 cycles
  CPUID.0xD.5  = 236 cycles
  <other sub-leaves are similar>

 EMR:
  CPUID.0xD.2  = 1138 cycles
  CPUID.0xD.3  = 1362 cycles
  CPUID.0xD.4  = 1068 cycles
  CPUID.0xD.5  = 910 cycles
  CPUID.0xD.6  = 914 cycles
  CPUID.0xD.7  = 1350 cycles
  CPUID.0xD.8  = 734 cycles
  CPUID.0xD.9  = 766 cycles
  CPUID.0xD.10 = 732 cycles
  CPUID.0xD.11 = 718 cycles
  CPUID.0xD.12 = 734 cycles
  CPUID.0xD.13 = 1700 cycles
  CPUID.0xD.14 = 1126 cycles
  CPUID.0xD.15 = 898 cycles
  CPUID.0xD.16 = 716 cycles
  CPUID.0xD.17 = 748 cycles
  CPUID.0xD.18 = 776 cycles

Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandotory
intercept on both Intel and AMD.  E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2.  That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-13 13:58:10 -05:00
Peter Zijlstra
2190966fbc x86: Convert unreachable() to BUG()
Avoid unreachable() as it can (and will in the absence of UBSAN)
generate fallthrough code. Use BUG() so we get a UD2 trap (with
unreachable annotation).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org
2024-12-02 12:01:43 +01:00
Linus Torvalds
9f16d5e6f2 The biggest change here is eliminating the awful idea that KVM had, of
essentially guessing which pfns are refcounted pages.  The reason to
 do so was that KVM needs to map both non-refcounted pages (for example
 BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain
 refcounted pages.  However, the result was security issues in the past,
 and more recently the inability to map VM_IO and VM_PFNMAP memory
 that _is_ backed by struct page but is not refcounted.  In particular
 this broke virtio-gpu blob resources (which directly map host graphics
 buffers into the guest as "vram" for the virtio-gpu device) with the
 amdgpu driver, because amdgpu allocates non-compound higher order pages
 and the tail pages could not be mapped into KVM.
 
 This requires adjusting all uses of struct page in the per-architecture
 code, to always work on the pfn whenever possible.  The large series that
 did this, from David Stevens and Sean Christopherson, also cleaned up
 substantially the set of functions that provided arch code with the
 pfn for a host virtual addresses.  The previous maze of twisty little
 passages, all different, is replaced by five functions (__gfn_to_page,
 __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages)
 saving almost 200 lines of code.
 
 ARM:
 
 * Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker
 
 * Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI
 
 * Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps
 
 * PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest
 
 * Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM
 
 * Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection
 
 * Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
 
 LoongArch:
 
 * Add iocsr and mmio bus simulation in kernel.
 
 * Add in-kernel interrupt controller emulation.
 
 * Add support for virtualization extensions to the eiointc irqchip.
 
 PPC:
 
 * Drop lingering and utterly obsolete references to PPC970 KVM, which was
   removed 10 years ago.
 
 * Fix incorrect documentation references to non-existing ioctls
 
 RISC-V:
 
 * Accelerate KVM RISC-V when running as a guest
 
 * Perf support to collect KVM guest statistics from host side
 
 s390:
 
 * New selftests: more ucontrol selftests and CPU model sanity checks
 
 * Support for the gen17 CPU model
 
 * List registers supported by KVM_GET/SET_ONE_REG in the documentation
 
 x86:
 
 * Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes.  Even if the hardware
   A/D tracking is disabled, it is possible to use the hardware-defined A/D
   bits to track if a PFN is Accessed and/or Dirty, and that removes a lot
   of special cases.
 
 * Elide TLB flushes when aging secondary PTEs, as has been done in x86's
   primary MMU for over 10 years.
 
 * Recover huge pages in-place in the TDP MMU when dirty page logging is
   toggled off, instead of zapping them and waiting until the page is
   re-accessed to create a huge mapping.  This reduces vCPU jitter.
 
 * Batch TLB flushes when dirty page logging is toggled off.  This reduces
   the time it takes to disable dirty logging by ~3x.
 
 * Remove the shrinker that was (poorly) attempting to reclaim shadow page
   tables in low-memory situations.
 
 * Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
 * Advertise CPUIDs for new instructions in Clearwater Forest
 
 * Quirk KVM's misguided behavior of initialized certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which in turn can lead to save/restore failures.
 
 * Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.
 
 * Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe; harden the
   cache accessors to try to prevent similar issues from occuring in the
   future.  The issue that triggered this change was already fixed in 6.12,
   but was still kinda latent.
 
 * Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
 * Minor cleanups
 
 * Switch hugepage recovery thread to use vhost_task.  These kthreads can
   consume significant amounts of CPU time on behalf of a VM or in response
   to how the VM behaves (for example how it accesses its memory); therefore
   KVM tried to place the thread in the VM's cgroups and charge the CPU
   time consumed by that work to the VM's container.  However the kthreads
   did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM
   instances inside could not complete freezing.  Fix this by replacing the
   kthread with a PF_USER_WORKER thread, via the vhost_task abstraction.
   Another 100+ lines removed, with generally better behavior too like
   having these threads properly parented in the process tree.
 
 * Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't
   really work; there was really nothing to work around anyway: the broken
   patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL
   MSR is virtualized and therefore unaffected by the erratum.
 
 * Fix 6.12 regression where CONFIG_KVM will be built as a module even
   if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.
 
 x86 selftests:
 
 * x86 selftests can now use AVX.
 
 Documentation:
 
 * Use rST internal links
 
 * Reorganize the introduction to the API document
 
 Generic:
 
 * Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter long
   due to having to wait for all CPUs become quiescent.  In general both reads
   and writes are rare, but userspace that supports confidential computing is
   introducing the use of "helper" vCPUs that may jump from one host processor
   to another.  Those will be very happy to trigger a synchronize_rcu(), and
   the effect on performance is quite the disaster.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmc9MRYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP00QgArxqxBIGLCW5t7bw7vtNq63QYRyh4
 dTiDguLiYQJ+AXmnRu11R6aPC7HgMAvlFCCmH+GEce4WEgt26hxCmncJr/aJOSwS
 letCS7TrME16PeZvh25A1nhPBUw6mTF1qqzgcdHMrqXG8LuHoGcKYGSRVbkf3kfI
 1ZoMq1r8ChXbVVmCx9DQ3gw1TVr5Dpjs2voLh8rDSE9Xpw0tVVabHu3/NhQEz/F+
 t8/nRaqH777icCHIf9PCk5HnarHxLAOvhM2M0Yj09PuBcE5fFQxpxltw/qiKQqqW
 ep4oquojGl87kZnhlDaac2UNtK90Ws+WxxvCwUmbvGN0ZJVaQwf4FvTwig==
 =lWpE
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The biggest change here is eliminating the awful idea that KVM had of
  essentially guessing which pfns are refcounted pages.

  The reason to do so was that KVM needs to map both non-refcounted
  pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
  VMAs that contain refcounted pages.

  However, the result was security issues in the past, and more recently
  the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
  struct page but is not refcounted. In particular this broke virtio-gpu
  blob resources (which directly map host graphics buffers into the
  guest as "vram" for the virtio-gpu device) with the amdgpu driver,
  because amdgpu allocates non-compound higher order pages and the tail
  pages could not be mapped into KVM.

  This requires adjusting all uses of struct page in the
  per-architecture code, to always work on the pfn whenever possible.
  The large series that did this, from David Stevens and Sean
  Christopherson, also cleaned up substantially the set of functions
  that provided arch code with the pfn for a host virtual addresses.

  The previous maze of twisty little passages, all different, is
  replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
  non-__ versions of these two, and kvm_prefetch_pages) saving almost
  200 lines of code.

  ARM:

   - Support for stage-1 permission indirection (FEAT_S1PIE) and
     permission overlays (FEAT_S1POE), including nested virt + the
     emulated page table walker

   - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
     call was introduced in PSCIv1.3 as a mechanism to request
     hibernation, similar to the S4 state in ACPI

   - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
     part of it, introduce trivial initialization of the host's MPAM
     context so KVM can use the corresponding traps

   - PMU support under nested virtualization, honoring the guest
     hypervisor's trap configuration and event filtering when running a
     nested guest

   - Fixes to vgic ITS serialization where stale device/interrupt table
     entries are not zeroed when the mapping is invalidated by the VM

   - Avoid emulated MMIO completion if userspace has requested
     synchronous external abort injection

   - Various fixes and cleanups affecting pKVM, vCPU initialization, and
     selftests

  LoongArch:

   - Add iocsr and mmio bus simulation in kernel.

   - Add in-kernel interrupt controller emulation.

   - Add support for virtualization extensions to the eiointc irqchip.

  PPC:

   - Drop lingering and utterly obsolete references to PPC970 KVM, which
     was removed 10 years ago.

   - Fix incorrect documentation references to non-existing ioctls

  RISC-V:

   - Accelerate KVM RISC-V when running as a guest

   - Perf support to collect KVM guest statistics from host side

  s390:

   - New selftests: more ucontrol selftests and CPU model sanity checks

   - Support for the gen17 CPU model

   - List registers supported by KVM_GET/SET_ONE_REG in the
     documentation

  x86:

   - Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
     improve documentation, harden against unexpected changes.

     Even if the hardware A/D tracking is disabled, it is possible to
     use the hardware-defined A/D bits to track if a PFN is Accessed
     and/or Dirty, and that removes a lot of special cases.

   - Elide TLB flushes when aging secondary PTEs, as has been done in
     x86's primary MMU for over 10 years.

   - Recover huge pages in-place in the TDP MMU when dirty page logging
     is toggled off, instead of zapping them and waiting until the page
     is re-accessed to create a huge mapping. This reduces vCPU jitter.

   - Batch TLB flushes when dirty page logging is toggled off. This
     reduces the time it takes to disable dirty logging by ~3x.

   - Remove the shrinker that was (poorly) attempting to reclaim shadow
     page tables in low-memory situations.

   - Clean up and optimize KVM's handling of writes to
     MSR_IA32_APICBASE.

   - Advertise CPUIDs for new instructions in Clearwater Forest

   - Quirk KVM's misguided behavior of initialized certain feature MSRs
     to their maximum supported feature set, which can result in KVM
     creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
     a non-zero value results in the vCPU having invalid state if
     userspace hides PDCM from the guest, which in turn can lead to
     save/restore failures.

   - Fix KVM's handling of non-canonical checks for vCPUs that support
     LA57 to better follow the "architecture", in quotes because the
     actual behavior is poorly documented. E.g. most MSR writes and
     descriptor table loads ignore CR4.LA57 and operate purely on
     whether the CPU supports LA57.

   - Bypass the register cache when querying CPL from kvm_sched_out(),
     as filling the cache from IRQ context is generally unsafe; harden
     the cache accessors to try to prevent similar issues from occuring
     in the future. The issue that triggered this change was already
     fixed in 6.12, but was still kinda latent.

   - Advertise AMD_IBPB_RET to userspace, and fix a related bug where
     KVM over-advertises SPEC_CTRL when trying to support cross-vendor
     VMs.

   - Minor cleanups

   - Switch hugepage recovery thread to use vhost_task.

     These kthreads can consume significant amounts of CPU time on
     behalf of a VM or in response to how the VM behaves (for example
     how it accesses its memory); therefore KVM tried to place the
     thread in the VM's cgroups and charge the CPU time consumed by that
     work to the VM's container.

     However the kthreads did not process SIGSTOP/SIGCONT, and therefore
     cgroups which had KVM instances inside could not complete freezing.

     Fix this by replacing the kthread with a PF_USER_WORKER thread, via
     the vhost_task abstraction. Another 100+ lines removed, with
     generally better behavior too like having these threads properly
     parented in the process tree.

   - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
     didn't really work; there was really nothing to work around anyway:
     the broken patch was meant to fix nested virtualization, but the
     PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
     erratum.

   - Fix 6.12 regression where CONFIG_KVM will be built as a module even
     if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
     'y'.

  x86 selftests:

   - x86 selftests can now use AVX.

  Documentation:

   - Use rST internal links

   - Reorganize the introduction to the API document

  Generic:

   - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
     instead of RCU, so that running a vCPU on a different task doesn't
     encounter long due to having to wait for all CPUs become quiescent.

     In general both reads and writes are rare, but userspace that
     supports confidential computing is introducing the use of "helper"
     vCPUs that may jump from one host processor to another. Those will
     be very happy to trigger a synchronize_rcu(), and the effect on
     performance is quite the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...
2024-11-23 16:00:50 -08:00
Linus Torvalds
bf9aa14fc5 A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
 
     posix-timers are currently auto-rearmed by the kernel when the signal
     of the timer is ignored so that the timer signal can be delivered once
     the corresponding signal is unignored.
 
     This requires to throttle the timer to prevent a DoS by small intervals
     and keeps the system pointlessly out of low power states for no value.
     This is a long standing non-trivial problem due to the lock order of
     posix-timer lock and the sighand lock along with life time issues as
     the timer and the sigqueue have different life time rules.
 
     Cure this by:
 
      * Embedding the sigqueue into the timer struct to have the same life
        time rules. Aside of that this also avoids the lookup of the timer
        in the signal delivery and rearm path as it's just a always valid
        container_of() now.
 
      * Queuing ignored timer signals onto a seperate ignored list.
 
      * Moving queued timer signals onto the ignored list when the signal is
        switched to SIG_IGN before it could be delivered.
 
      * Walking the ignored list when SIG_IGN is lifted and requeue the
        signals to the actual signal lists. This allows the signal delivery
        code to rearm the timer.
 
     This also required to consolidate the signal delivery rules so they are
     consistent across all situations. With that all self test scenarios
     finally succeed.
 
   - Core infrastructure for VFS multigrain timestamping
 
     This is required to allow the kernel to use coarse grained time stamps
     by default and switch to fine grained time stamps when inode attributes
     are actively observed via getattr().
 
     These changes have been provided to the VFS tree as well, so that the
     VFS specific infrastructure could be built on top.
 
   - Cleanup and consolidation of the sleep() infrastructure
 
     * Move all sleep and timeout functions into one file
 
     * Rework udelay() and ndelay() into proper documented inline functions
       and replace the hardcoded magic numbers by proper defines.
 
     * Rework the fsleep() implementation to take the reality of the timer
       wheel granularity on different HZ values into account. Right now the
       boundaries are hard coded time ranges which fail to provide the
       requested accuracy on different HZ settings.
 
     * Update documentation for all sleep/timeout related functions and fix
       up stale documentation links all over the place
 
     * Fixup a few usage sites
 
   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
 
     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as that's
     the clock which drives the timekeeping adjustments via the various user
     space daemons through adjtimex(2).
 
     The non TAI based clock domains are accessible via the file descriptor
     based posix clocks, but their usability is very limited. They can't be
     accessed fast as they always go all the way out to the hardware and
     they cannot be utilized in the kernel itself.
 
     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.
 
     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the kernel
     provides access to clock MONOTONIC, REALTIME etc.
 
     Instead of creating a duplicated infrastructure this rework converts
     timekeeping and adjtimex(2) into generic functionality which operates
     on pointers to data structures instead of using static variables.
 
     This allows to provide time accessors and adjtimex(2) functionality for
     the independent PTP clocks in a subsequent step.
 
   - Consolidate hrtimer initialization
 
     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.
 
     That's an extra unnecessary step and makes Rust support less straight
     forward than it should be.
 
     Provide a new set of hrtimer_setup*() functions and convert the core
     code and a few usage sites of the less frequently used interfaces over.
 
     The bulk of the htimer_init() to hrtimer_setup() conversion is already
     prepared and scheduled for the next merge window.
 
   - Drivers:
 
     * Ensure that the global timekeeping clocksource is utilizing the
       cluster 0 timer on MIPS multi-cluster systems.
 
       Otherwise CPUs on different clusters use their cluster specific
       clocksource which is not guaranteed to be synchronized with other
       clusters.
 
     * Mostly boring cleanups, fixes, improvements and code movement
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
 tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
 6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
 EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
 8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
 bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
 DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
 UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
 Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
 AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
 JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
 hZZ4hdM9+3T7cg==
 =2VC6
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A rather large update for timekeeping and timers:

   - The final step to get rid of auto-rearming posix-timers

     posix-timers are currently auto-rearmed by the kernel when the
     signal of the timer is ignored so that the timer signal can be
     delivered once the corresponding signal is unignored.

     This requires to throttle the timer to prevent a DoS by small
     intervals and keeps the system pointlessly out of low power states
     for no value. This is a long standing non-trivial problem due to
     the lock order of posix-timer lock and the sighand lock along with
     life time issues as the timer and the sigqueue have different life
     time rules.

     Cure this by:

       - Embedding the sigqueue into the timer struct to have the same
         life time rules. Aside of that this also avoids the lookup of
         the timer in the signal delivery and rearm path as it's just a
         always valid container_of() now.

       - Queuing ignored timer signals onto a seperate ignored list.

       - Moving queued timer signals onto the ignored list when the
         signal is switched to SIG_IGN before it could be delivered.

       - Walking the ignored list when SIG_IGN is lifted and requeue the
         signals to the actual signal lists. This allows the signal
         delivery code to rearm the timer.

     This also required to consolidate the signal delivery rules so they
     are consistent across all situations. With that all self test
     scenarios finally succeed.

   - Core infrastructure for VFS multigrain timestamping

     This is required to allow the kernel to use coarse grained time
     stamps by default and switch to fine grained time stamps when inode
     attributes are actively observed via getattr().

     These changes have been provided to the VFS tree as well, so that
     the VFS specific infrastructure could be built on top.

   - Cleanup and consolidation of the sleep() infrastructure

       - Move all sleep and timeout functions into one file

       - Rework udelay() and ndelay() into proper documented inline
         functions and replace the hardcoded magic numbers by proper
         defines.

       - Rework the fsleep() implementation to take the reality of the
         timer wheel granularity on different HZ values into account.
         Right now the boundaries are hard coded time ranges which fail
         to provide the requested accuracy on different HZ settings.

       - Update documentation for all sleep/timeout related functions
         and fix up stale documentation links all over the place

       - Fixup a few usage sites

   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
     clocks

     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as
     that's the clock which drives the timekeeping adjustments via the
     various user space daemons through adjtimex(2).

     The non TAI based clock domains are accessible via the file
     descriptor based posix clocks, but their usability is very limited.
     They can't be accessed fast as they always go all the way out to
     the hardware and they cannot be utilized in the kernel itself.

     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.

     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the
     kernel provides access to clock MONOTONIC, REALTIME etc.

     Instead of creating a duplicated infrastructure this rework
     converts timekeeping and adjtimex(2) into generic functionality
     which operates on pointers to data structures instead of using
     static variables.

     This allows to provide time accessors and adjtimex(2) functionality
     for the independent PTP clocks in a subsequent step.

   - Consolidate hrtimer initialization

     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.

     That's an extra unnecessary step and makes Rust support less
     straight forward than it should be.

     Provide a new set of hrtimer_setup*() functions and convert the
     core code and a few usage sites of the less frequently used
     interfaces over.

     The bulk of the htimer_init() to hrtimer_setup() conversion is
     already prepared and scheduled for the next merge window.

   - Drivers:

       - Ensure that the global timekeeping clocksource is utilizing the
         cluster 0 timer on MIPS multi-cluster systems.

         Otherwise CPUs on different clusters use their cluster specific
         clocksource which is not guaranteed to be synchronized with
         other clusters.

       - Mostly boring cleanups, fixes, improvements and code movement"

* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  posix-timers: Fix spurious warning on double enqueue versus do_exit()
  clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
  clocksource/drivers/gpx: Remove redundant casts
  clocksource/drivers/timer-ti-dm: Fix child node refcount handling
  dt-bindings: timer: actions,owl-timer: convert to YAML
  clocksource/drivers/ralink: Add Ralink System Tick Counter driver
  clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
  clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
  clocksource/drivers:sp804: Make user selectable
  clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
  hrtimers: Delete hrtimer_init_on_stack()
  alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
  io_uring: Switch to use hrtimer_setup_on_stack()
  sched/idle: Switch to use hrtimer_setup_on_stack()
  hrtimers: Delete hrtimer_init_sleeper_on_stack()
  wait: Switch to use hrtimer_setup_sleeper_on_stack()
  timers: Switch to use hrtimer_setup_sleeper_on_stack()
  net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
  futex: Switch to use hrtimer_setup_sleeper_on_stack()
  fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
  ...
2024-11-19 16:35:06 -08:00
Sean Christopherson
9ee62c33c0 KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
Rework CONFIG_KVM_X86's dependency to only check if KVM_INTEL or KVM_AMD
is selected, i.e. not 'n'.  Having KVM_X86 depend directly on the vendor
modules results in KVM_X86 being set to 'm' if at least one of KVM_INTEL
or KVM_AMD is enabled, but neither is 'y', regardless of the value of KVM
itself.

The documentation for def_tristate doesn't explicitly state that this is
the intended behavior, but it does clearly state that the "if" section is
parsed as a dependency, i.e. the behavior is consistent with how tristate
dependencies are handled in general.

  Optionally dependencies for this default value can be added with "if".

Fixes: ea4290d77b ("KVM: x86: leave kvm.ko out of the build if no vendor module is requested")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241118172002.1633824-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-19 19:34:51 -05:00
Arnd Bergmann
1331343af6 KVM: x86: add back X86_LOCAL_APIC dependency
Enabling KVM now causes a build failure on x86-32 if X86_LOCAL_APIC
is disabled:

arch/x86/kvm/svm/svm.c: In function 'svm_emergency_disable_virtualization_cpu':
arch/x86/kvm/svm/svm.c:597:9: error: 'kvm_rebooting' undeclared (first use in this function); did you mean 'kvm_irq_routing'?
  597 |         kvm_rebooting = true;
      |         ^~~~~~~~~~~~~
      |         kvm_irq_routing
arch/x86/kvm/svm/svm.c:597:9: note: each undeclared identifier is reported only once for each function it appears in
make[6]: *** [scripts/Makefile.build:221: arch/x86/kvm/svm/svm.o] Error 1
In file included from include/linux/rculist.h:11,
                 from include/linux/hashtable.h:14,
                 from arch/x86/kvm/svm/avic.c:18:
arch/x86/kvm/svm/avic.c: In function 'avic_pi_update_irte':
arch/x86/kvm/svm/avic.c:909:38: error: 'struct kvm' has no member named 'irq_routing'
  909 |         irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
      |                                      ^~
include/linux/rcupdate.h:538:17: note: in definition of macro '__rcu_dereference_check'
  538 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \

Move the dependency to the same place as before.

Fixes: ea4290d77b ("KVM: x86: leave kvm.ko out of the build if no vendor module is requested")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410060426.e9Xsnkvi-lkp@intel.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sean Christopherson <seanjc@google.com>
[sean: add Cc to stable, tweak shortlog scope]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241118172002.1633824-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-19 19:34:51 -05:00
Sean Christopherson
85434c3c73 Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
Revert back to clearing VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL in KVM's
golden VMCS config, as applying the workaround during vCPU creation is
pointless and broken.  KVM *unconditionally* clears the controls in the
values returned by vmx_vmentry_ctrl() and vmx_vmexit_ctrl(), as KVM loads
PERF_GLOBAL_CTRL if and only if its necessary to do so.  E.g. if KVM wants
to run the guest with the same PERF_GLOBAL_CTRL as the host, then there's
no need to re-load the MSR on entry and exit.

Even worse, the buggy commit failed to apply the erratum where it's
actually needed, add_atomic_switch_msr().  As a result, KVM completely
ignores the erratum for all intents and purposes, i.e. uses the flawed
VMCS controls to load PERF_GLOBAL_CTRL.

To top things off, the patch was intended to be dropped, as the premise
of an L1 VMM being able to pivot on FMS is flawed, and KVM can (and now
does) fully emulate the controls in software.  Simply revert the commit,
as all upstream supported kernels that have the buggy commit should also
have commit f4c93d1a0e ("KVM: nVMX: Always emulate PERF_GLOBAL_CTRL
VM-Entry/VM-Exit controls"), i.e. the (likely theoretical) live migration
concern is a complete non-issue.

Opportunistically drop the manual "kvm: " scope from the warning about
the erratum, as KVM now uses pr_fmt() to provide the correct scope (v6.1
kernels and earlier don't, but the erratum only applies to CPUs that are
15+ years old; it's not worth a separate patch).

This reverts commit 9d78d6fb18.

Link: https://lore.kernel.org/all/YtnZmCutdd5tpUmz@google.com
Fixes: 9d78d6fb18 ("KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()")
Cc: stable@vger.kernel.org
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-ID: <20241119011433.1797921-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-19 19:34:35 -05:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
 scope where they'd been created, getting rid of reassignments
 and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
 
 We are getting very close to having the memory safety of that stuff
 trivial to verify.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
 CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
 =gVH3
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaing boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00
Paolo Bonzini
d96c77bd4e KVM: x86: switch hugepage recovery thread to vhost_task
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.

However, because of these threads, cgroups which have kvm instances
inside never complete freezing.  This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 -nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.

Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only it
inherits the userspace process's cgroups, it has other niceties like
being parented properly in the process tree.  Use it instead of the
homegrown alternative.

Incidentally, the new code is also better behaved when you flip recovery
back and forth to disabled and back to enabled.  If your recovery period
is 1 minute, it will run the next recovery after 1 minute independent
of how many times you flipped the parameter.

(Commit message based on emails from Tejun).

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-14 13:20:04 -05:00
Paolo Bonzini
b467ab82a9 KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
For userspace that wants to disable KVM_X86_QUIRK_STUFF_FEATURE_MSRS, it
is useful to know what bits can be set to 1 in MSR_PLATFORM_INFO (apart
from the TSC ratio).  The right way to do that is via /dev/kvm's
feature MSR mechanism.

In fact, MSR_PLATFORM_INFO is already a feature MSR for the purpose of
blocking updates after the vCPU is run, but KVM_GET_MSRS did not return
a valid value for it.

Just like in a VM that leaves KVM_X86_QUIRK_STUFF_FEATURE_MSRS enabled,
the TSC ratio field is left to 0.  Only bit 31 is set.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-13 14:40:56 -05:00
Tao Su
a0423af92c x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
Latest Intel platform Clearwater Forest has introduced new instructions
enumerated by CPUIDs of SHA512, SM3, SM4 and AVX-VNNI-INT16. Advertise
these CPUIDs to userspace so that guests can query them directly.

SHA512, SM3 and SM4 are on an expected-dense CPUID leaf and some other
bits on this leaf have kernel usages. Considering they have not truly
kernel usages, hide them in /proc/cpuinfo.

These new instructions only operate in xmm, ymm registers and have no new
VMX controls, so there is no additional host enabling required for guests
to use these instructions, i.e. advertising these CPUIDs to userspace is
safe.

Tested-by: Jiaan Lu <jiaan.lu@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Message-ID: <20241105054825.870939-1-tao1.su@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-13 14:40:40 -05:00
Paolo Bonzini
2e9a2c624e Merge branch 'kvm-docs-6.13' into HEAD
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago.

- Fix incorrect references to non-existing ioctls

- List registers supported by KVM_GET/SET_ONE_REG on s390

- Use rST internal links

- Reorganize the introduction to the API document
2024-11-13 07:18:12 -05:00
Paolo Bonzini
bb4409a9e7 KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
  - Quirk KVM's misguided behavior of initialized certain feature MSRs to
    their maximum supported feature set, which can result in KVM creating
    invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
    value results in the vCPU having invalid state if userspace hides PDCM
    from the guest, which can lead to save/restore failures.
 
  - Fix KVM's handling of non-canonical checks for vCPUs that support LA57
    to better follow the "architecture", in quotes because the actual
    behavior is poorly documented.  E.g. most MSR writes and descriptor
    table loads ignore CR4.LA57 and operate purely on whether the CPU
    supports LA57.
 
  - Bypass the register cache when querying CPL from kvm_sched_out(), as
    filling the cache from IRQ context is generally unsafe, and harden the
    cache accessors to try to prevent similar issues from occuring in the
    future.
 
  - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
    over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
  - Minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczowUACgkQOlYIJqCj
 N/3G8w//RYIslQkHXZXovQvhHKM9RBxdg6FjU0Do2KLN/xnR+JdjrBAI3HnG0TVu
 TEsA6slaTdvAFudSBZ27rORPARJ3XCSnDT+NflDR2UqtmGYFXxixcs4LQRUC3I2L
 tS/e847Qfp7/+kXFYQuH6YmMftCf7SQNbUPU3EwSXY8seUJB6ZhO89WXgrBtaMH+
 94b7EdkP86L4dqiEMGr/q/46/ewFpPlB4WWFNIAY+SoZebQ+C8gcAsGDzfG6giNV
 HxdehWNp5nJ34k9Tudt2FseL/IclA3nXZZIl2dU6PWLkjgvDqL2kXLHpAQTpKuXf
 ZzBM4wjdVqZRQHscLgX6+0/uOJDf9/iJs/dwD9PbzLhAJnHF3SWRAy/grzpChLFR
 Yil5zagtdhjqSKDf2FsUCJ7lwaST0fhHvmZZx4loIhcmN0/rvvrhhfLsJgCWv/Kf
 RWMkiSQGhxAZUXjOJDQB+wTHv7ZoJy51ov7/rYjr49jCJrCVO6I6yQ6lNKDCYClR
 vDS3yK6fiEXbC09iudm74FBl2KO+BwJKhnidekzzKGv34RHRAux5Oo54J5jJMhBA
 tXFxALGwKr1KAI7vLK8ByZzwZjN5pDmKHUuOAlJNy9XQi+b4zbbREkXlcqZ5VgxA
 xCibpnLzJFoHS8y7c76wfz4mCkWSWzuC9Rzy4sTkHMzgJESmZiE=
 =dESL
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.13

 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.

 - Quirk KVM's misguided behavior of initialized certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which can lead to save/restore failures.

 - Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.

 - Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe, and harden the
   cache accessors to try to prevent similar issues from occuring in the
   future.

 - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.

 - Minor cleanups
2024-11-13 06:33:00 -05:00
Paolo Bonzini
ef6fdc0e4c KVM VMX change for 6.13
- Remove __invept()'s unused @gpa param, which was left behind when KVM
    dropped code for invalidating a specific GPA (Intel never officially
    documented support for single-address INVEPT; presumably pre-production
    CPUs supported it at some point).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczqIYACgkQOlYIJqCj
 N/1hJg//fxOBC74PXNtAuLFXnqxDVf06541SQcyxPGspFiaeSDFcaLbu6avRdPuI
 QchkX7yCAPZ3N8BAnJxva0XqLAyhuPlAFeeXeqYfFUF6Wzrzbn2fp4SGEI2tklKe
 TKINdjNxU+0sx/O+BWtJopBz0JpDyvL6paRNW4uvlra6NS3w9ppS3ju+tx8+cBeF
 ez53Dn0lCtvGJL54sp0isB/WrkXSGM5lzpnyEJ1tCRcSuT4eNSn7ZMXHiS7sUQIj
 2nog4nh8v454LdA+nvrqUz6/nVGURyWU9FVTbfOWrbivEFPADlNKEYx8iaiw9mEV
 g9rEJ/bJPpacsnC1e/kiBh+X68LulK3Vrf9mwlQwMYfrQQPRS7hQvNvNfpmP0AVM
 /H9YFXLkDH2LXoqckTZMtT208mp8VEZlCSVt0z0Di6aVmSbubkCwEOPDBGbWB1Y1
 B4UdLX/ooJFQhCnMB5wfOdcTLQaOmnpHN44qde0tdRv6yRBEj1AEd4QkGbPrtjxI
 JBCISdCtc/Q36HWgUDHR56TA+Q1G5KYVzpHQ4VUhf5hoUFVrojQeUcrEnLogUjbO
 E/L/L5UiOJJGnpVhMILs0eUjPVq68kIjMgTeuAnAAVuZZxhMlbuwf5L0LFIZTo9p
 eLrMWuY4wVmxaHtezJpm2gQIm8ZZPlPBO/F21wFvKKs5UxoLu7g=
 =n8HN
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM VMX change for 6.13

 - Remove __invept()'s unused @gpa param, which was left behind when KVM
   dropped code for invalidating a specific GPA (Intel never officially
   documented support for single-address INVEPT; presumably pre-production
   CPUs supported it at some point).
2024-11-13 06:32:43 -05:00
Sean Christopherson
aa0d42cacf KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support
for virtualizing Intel PT via guest/host mode unless BROKEN=y.  There are
myriad bugs in the implementation, some of which are fatal to the guest,
and others which put the stability and health of the host at risk.

For guest fatalities, the most glaring issue is that KVM fails to ensure
tracing is disabled, and *stays* disabled prior to VM-Enter, which is
necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing
is enabled (enforced via a VMX consistency check).  Per the SDM:

  If the logical processor is operating with Intel PT enabled (if
  IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
  IA32_RTIT_CTL" VM-entry control must be 0.

On the host side, KVM doesn't validate the guest CPUID configuration
provided by userspace, and even worse, uses the guest configuration to
decide what MSRs to save/load at VM-Enter and VM-Exit.  E.g. configuring
guest CPUID to enumerate more address ranges than are supported in hardware
will result in KVM trying to passthrough, save, and load non-existent MSRs,
which generates a variety of WARNs, ToPA ERRORs in the host, a potential
deadlock, etc.

Fixes: f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")
Cc: stable@vger.kernel.org
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20241101185031.1799556-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Sean Christopherson
d3ddef46f2 KVM: x86: Unconditionally set irr_pending when updating APICv state
Always set irr_pending (to true) when updating APICv status to fix a bug
where KVM fails to set irr_pending when userspace sets APIC state and
APICv is disabled, which ultimate results in KVM failing to inject the
pending interrupt(s) that userspace stuffed into the vIRR, until another
interrupt happens to be emulated by KVM.

Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
be true if APICv is enabled, because not all vIRR updates will be visible
to KVM.

Hit the bug with a big hammer, even though strictly speaking KVM can scan
the vIRR and set/clear irr_pending as appropriate for this specific case.
The bug was introduced by commit 755c2bf878 ("KVM: x86: lapic: don't
touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as
the shortlog suggests, deleted code that updated irr_pending.

Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with
with the crucial difference that kvm_apic_update_apicv() did the scan even
when APICv was being *disabled*, e.g. due to an AVIC inhibition.

        struct kvm_lapic *apic = vcpu->arch.apic;

        if (vcpu->arch.apicv_active) {
                /* irr_pending is always true when apicv is activated. */
                apic->irr_pending = true;
                apic->isr_count = 1;
        } else {
                apic->irr_pending = (apic_search_irr(apic) != -1);
                apic->isr_count = count_vectors(apic->regs + APIC_ISR);
        }

And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d
("kvm: lapic: Introduce APICv update helper function"), prior to which KVM
unconditionally set irr_pending to true in kvm_apic_set_state(), i.e.
assumed that the new virtual APIC state could have a pending IRQ.

Furthermore, in addition to introducing this issue, commit 755c2bf878
also papered over the underlying bug: KVM doesn't ensure CPUs and devices
see APICv as disabled prior to searching the IRR.  Waiting until KVM
emulates an EOI to update irr_pending "works", but only because KVM won't
emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of
memory barriers in between.  I.e. leaving irr_pending set is basically
hacking around bad ordering.

So, effectively revert to the pre-b26a695a1d78 behavior for state restore,
even though it's sub-optimal if no IRQs are pending, in order to provide a
minimal fix, but leave behind a FIXME to document the ugliness.  With luck,
the ordering issue will be fixed and the mess will be cleaned up in the
not-too-distant future.

Fixes: 755c2bf878 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reported-by: Yong He <zhuangel570@gmail.com>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241106015135.2462147-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Dionna Glaze
e3a7792d96 kvm: svm: Fix gctx page leak on invalid inputs
Ensure that snp gctx page allocation is adequately deallocated on
failure during snp_launch_start.

Fixes: 136d8bc931 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command")

CC: Sean Christopherson <seanjc@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Ashish Kalra <ashish.kalra@amd.com>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: John Allen <john.allen@amd.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: "David S. Miller" <davem@davemloft.net>
CC: Michael Roth <michael.roth@amd.com>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Russ Weight <russ.weight@linux.dev>
CC: Danilo Krummrich <dakr@redhat.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: "Rafael J. Wysocki" <rafael@kernel.org>
CC: Tianfei zhang <tianfei.zhang@intel.com>
CC: Alexey Kardashevskiy <aik@amd.com>

Signed-off-by: Dionna Glaze <dionnaglaze@google.com>
Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Nam Cao
f6e12766c5 KVM: x86/xen: Initialize hrtimer in kvm_xen_init_vcpu()
The hrtimer is initialized in the KVM_XEN_VCPU_SET_ATTR ioctl. That caused
problem in the past, because the hrtimer can be initialized multiple times,
which was fixed by commit af735db312 ("KVM: x86/xen: Initialize Xen timer
only once"). This commit avoids initializing the timer multiple times by
checking the field 'function' of struct hrtimer to determine if it has
already been initialized.

This is not required and in the way to make the function field private.

Move the hrtimer initialization into kvm_xen_init_vcpu() so that it will
only be initialized once.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/9c33c7224d97d08f4fa30d3cc8687981c1d3e953.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Sean Christopherson
e5d253c60e KVM: SVM: Propagate error from snp_guest_req_init() to userspace
If snp_guest_req_init() fails, return the provided error code up the
stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2
failed, as opposed to some random operation later in VM setup failing
because SNP wasn't actually enabled for the VM.

Note, KVM itself doesn't consult the return value from __sev_guest_init(),
i.e. the fallout is purely that userspace may be confused.

Fixes: 88caf544c9 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com
Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 22:03:04 -08:00
Sean Christopherson
2657b82a78 KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled
When getting the current VPID, e.g. to emulate a guest TLB flush, return
vpid01 if L2 is running but with VPID disabled, i.e. if VPID is disabled
in vmcs12.  Architecturally, if VPID is disabled, then the guest and host
effectively share VPID=0.  KVM emulates this behavior by using vpid01 when
running an L2 with VPID disabled (see prepare_vmcs02_early_rare()), and so
KVM must also treat vpid01 as the current VPID while L2 is active.

Unconditionally treating vpid02 as the current VPID when L2 is active
causes KVM to flush TLB entries for vpid02 instead of vpid01, which
results in TLB entries from L1 being incorrectly preserved across nested
VM-Enter to L2 (L2=>L1 isn't problematic, because the TLB flush after
nested VM-Exit flushes vpid01).

The bug manifests as failures in the vmx_apicv_test KVM-Unit-Test, as KVM
incorrectly retains TLB entries for the APIC-access page across a nested
VM-Enter.

Opportunisticaly add comments at various touchpoints to explain the
architectural requirements, and also why KVM uses vpid01 instead of vpid02.

All credit goes to Chao, who root caused the issue and identified the fix.

Link: https://lore.kernel.org/all/ZwzczkIlYGX+QXJz@intel.com
Fixes: 2b4a5a5d56 ("KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST")
Cc: stable@vger.kernel.org
Cc: Like Xu <like.xu.linux@gmail.com>
Debugged-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241031202011.1580522-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 21:10:49 -08:00
Sean Christopherson
a75b7bb46a KVM: x86: Short-circuit all of kvm_apic_set_base() if MSR value is unchanged
Do nothing in all of kvm_apic_set_base(), not just __kvm_apic_set_base(),
if the incoming MSR value is the same as the current value.  Validating
the mode transitions is obviously unnecessary, and rejecting the write is
pointless if the vCPU already has an invalid value, e.g. if userspace is
doing weird things and modified guest CPUID after setting MSR_IA32_APICBASE.

Bailing early avoids kvm_recalculate_apic_map()'s slow path in the rare
scenario where the map is DIRTY due to some other vCPU dirtying the map,
in which case it's the other vCPU/task's responsibility to recalculate the
map.

Note, kvm_lapic_reset() calls __kvm_apic_set_base() only when emulating
RESET, in which case the old value is guaranteed to be zero, and the new
value is guaranteed to be non-zero.  I.e. all callers of
__kvm_apic_set_base() effectively pre-check for the MSR value actually
changing.  Don't bother keeping the check in __kvm_apic_set_base(), as no
additional callers are expected, and implying that the MSR might already
be non-zero at the time of kvm_lapic_reset() could confuse readers.

Link: https://lore.kernel.org/r/20241101183555.1794700-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:55 -08:00
Sean Christopherson
c9155eb012 KVM: x86: Unpack msr_data structure prior to calling kvm_apic_set_base()
Pass in the new value and "host initiated" as separate parameters to
kvm_apic_set_base(), as forcing the KVM_SET_SREGS path to declare and fill
an msr_data structure is awkward and kludgy, e.g. __set_sregs_common()
doesn't even bother to set the proper MSR index.

No functional change intended.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20241101183555.1794700-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:46 -08:00
Sean Christopherson
ff6ce56e1d KVM: x86: Make kvm_recalculate_apic_map() local to lapic.c
Make kvm_recalculate_apic_map() local to lapic.c now that all external
callers are gone.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-8-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:46 -08:00
Sean Christopherson
7d1cb7cee9 KVM: x86: Rename APIC base setters to better capture their relationship
Rename kvm_set_apic_base() and kvm_lapic_set_base() to kvm_apic_set_base()
and __kvm_apic_set_base() respectively to capture that the underscores
version is a "special" variant (it exists purely to avoid recalculating
the optimized map multiple times when stuffing the RESET value).

Opportunistically add a comment explaining why kvm_lapic_reset() uses the
inner helper.  Note, KVM deliberately invokes kvm_arch_vcpu_create() while
kvm->lock is NOT held so that vCPU setup isn't serialized if userspace is
creating multiple/all vCPUs in parallel.  I.e. triggering an extra
recalculation is not limited to theoretical/rare edge cases, and so is
worth avoiding.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-7-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:46 -08:00
Sean Christopherson
c9c9acfcd5 KVM: x86: Move kvm_set_apic_base() implementation to lapic.c (from x86.c)
Move kvm_set_apic_base() to lapic.c so that the bulk of KVM's local APIC
code resides in lapic.c, regardless of whether or not KVM is emulating the
local APIC in-kernel.  This will also allow making various helpers visible
only to lapic.c.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-6-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:46 -08:00
Sean Christopherson
adfec1f459 KVM: x86: Inline kvm_get_apic_mode() in lapic.h
Inline kvm_get_apic_mode() in lapic.h to avoid a CALL+RET as well as an
export.  The underlying kvm_apic_mode() helper is public information, i.e.
there is no state/information that needs to be hidden from vendor modules.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-5-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:46 -08:00
Sean Christopherson
d91060e342 KVM: x86: Get vcpu->arch.apic_base directly and drop kvm_get_apic_base()
Access KVM's emulated APIC base MSR value directly instead of bouncing
through a helper, as there is no reason to add a layer of indirection, and
there are other MSRs with a "set" but no "get", e.g. EFER.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-4-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:45 -08:00
Sean Christopherson
8166d25579 KVM: x86: Drop superfluous kvm_lapic_set_base() call when setting APIC state
Now that kvm_lapic_set_base() does nothing if the "new" APIC base MSR is
the same as the current value, drop the kvm_lapic_set_base() call in the
KVM_SET_LAPIC flow that passes in the current value, as it too does
nothing.

Note, the purpose of invoking kvm_lapic_set_base() was purely to set
apic->base_address (see commit 5dbc8f3fed ("KVM: use kvm_lapic_set_base()
to change apic_base")).  And there is no evidence that explicitly setting
apic->base_address in KVM_SET_LAPIC ever had any functional impact; even
in the original commit 96ad2cc613 ("KVM: in-kernel LAPIC save and restore
support"), all flows that set apic_base also set apic->base_address to the
same address.  E.g. svm_create_vcpu() did open code a write to apic_base,

	svm->vcpu.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;

but it also called kvm_create_lapic() when irqchip_in_kernel() is true.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-3-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:45 -08:00
Sean Christopherson
d7d770bed9 KVM: x86: Short-circuit all kvm_lapic_set_base() if MSR value isn't changing
Do nothing in kvm_lapic_set_base() if the APIC base MSR value is the same
as the current value.  All flows except the handling of the base address
explicitly take effect if and only if relevant bits are changing.

For the base address, invoking kvm_lapic_set_base() before KVM initializes
the base to APIC_DEFAULT_PHYS_BASE during vCPU RESET would be a KVM bug,
i.e. KVM _must_ initialize apic->base_address before exposing the vCPU (to
userspace or KVM at-large).

Note, the inhibit is intended to be set if the base address is _changed_
from the default, i.e. is also covered by the RESET behavior.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20241009181742.1128779-2-seanjc@google.com
Link: https://lore.kernel.org/r/20241101183555.1794700-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 20:57:45 -08:00
Vipin Sharma
4cf20d4254 KVM: x86/mmu: Drop per-VM zapped_obsolete_pages list
Drop the per-VM zapped_obsolete_pages list now that the usage from the
defunct mmu_shrinker is gone, and instead use a local list to track pages
in kvm_zap_obsolete_pages(), the sole remaining user of
zapped_obsolete_pages.

Opportunistically add an assertion to verify and document that slots_lock
must be held, i.e. that there can only be one active instance of
kvm_zap_obsolete_pages() at any given time, and by doing so also prove
that using a local list instead of a per-VM list doesn't change any
functionality (beyond trivialities like list initialization).

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 19:22:53 -08:00
Vipin Sharma
fe140e611d KVM: x86/mmu: Remove KVM's MMU shrinker
Remove KVM's MMU shrinker and (almost) all of its related code, as the
current implementation is very disruptive to VMs (if it ever runs),
without providing any meaningful benefit[1].

Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages
from the per-vCPU caches[2], but given that no one has complained about
lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU
was enabled by default, it's safe to say that there is likely no real use
case for initiating reclaim of KVM's page tables from the shrinker.

And while clever/cute, reclaiming the per-vCPU caches doesn't scale the
same way that reclaiming in-use page table pages does.  E.g. the amount of
memory being used by a VM doesn't always directly correlate with the
number vCPUs, and even when it does, reclaiming a few pages from per-vCPU
caches likely won't make much of a dent in the VM's total memory usage,
especially for VMs with huge amounts of memory.

Lastly, if it turns out that there is a strong use case for dropping the
per-vCPU caches, re-introducing the shrinker registration is trivial
compared to the complexity of actually reclaiming pages from the caches.

[1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com
[2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: keep zapped_obsolete_pages for now, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 19:18:22 -08:00
David Matlack
06c4cd957b KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging
WARN and bail out of recover_huge_pages_range() if dirty logging is
enabled. KVM shouldn't be recovering huge pages during dirty logging
anyway, since KVM needs to track writes at 4KiB. However it's not out of
the possibility that that changes in the future.

If KVM wants to recover huge pages during dirty logging, make_huge_spte()
must be updated to write-protect the new huge page mapping. Otherwise,
writes through the newly recovered huge page mapping will not be tracked.

Note that this potential risk did not exist back when KVM zapped to
recover huge page mappings, since subsequent accesses would just be
faulted in at PG_LEVEL_4K if dirty logging was enabled.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-7-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:23 -08:00
David Matlack
430e264b76 KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte()
Rename make_huge_page_split_spte() to make_small_spte(). This ensures
that the usage of "small_spte" and "huge_spte" are consistent between
make_huge_spte() and make_small_spte().

This should also reduce some confusion as make_huge_page_split_spte()
almost reads like it will create a huge SPTE, when in fact it is
creating a small SPTE to split the huge SPTE.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-6-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:23 -08:00
David Matlack
13e2e4f62a KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping
Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.

Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).

Recovering huge pages reduces the number of page faults when dirty
logging is disabled:

 $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

 Before: 393,599      kvm:kvm_page_fault
 After:  262,575      kvm:kvm_page_fault

vCPU throughput and the latency of disabling dirty-logging are about
equal compared to zapping, but avoiding faults can be beneficial to
remove vCPU jitter in extreme scenarios.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:22 -08:00
Sean Christopherson
dd2e7dbc4a KVM: x86/mmu: Refactor TDP MMU iter need resched check
Refactor the TDP MMU iterator "need resched" checks into a helper
function so they can be called from a different code path in a
subsequent commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-4-dmatlack@google.com
[sean: rebase on a swapped order of checks]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:21 -08:00
Sean Christopherson
38b0ac4716 KVM: x86/mmu: Demote the WARN on yielded in xxx_cond_resched() to KVM_MMU_WARN_ON
Convert the WARN in tdp_mmu_iter_cond_resched() that the iterator hasn't
already yielded to a KVM_MMU_WARN_ON() so the code is compiled out for
production kernels (assuming production kernels disable KVM_PROVE_MMU).

Checking for a needed reschedule is a hot path, and KVM sanity checks
iter->yielded in several other less-hot paths, i.e. the odds of KVM not
flagging that something went sideways are quite low.  Furthermore, the
odds of KVM not noticing *and* the WARN detecting something worth
investigating are even lower.

Link: https://lore.kernel.org/r/20241031170633.1502783-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:18 -08:00
Sean Christopherson
e287e43167 KVM: x86/mmu: Check yielded_gfn for forward progress iff resched is needed
Swap the order of the checks in tdp_mmu_iter_cond_resched() so that KVM
checks to see if a resched is needed _before_ checking to see if yielding
must be disallowed to guarantee forward progress.  Iterating over TDP MMU
SPTEs is a hot path, e.g. tearing down a root can touch millions of SPTEs,
and not needing to reschedule is by far the common case.  On the other
hand, disallowing yielding because forward progress has not been made is a
very rare case.

Returning early for the common case (no resched), effectively reduces the
number of checks from 2 to 1 for the common case, and should make the code
slightly more predictable for the CPU.

To resolve a weird conundrum where the forward progress check currently
returns false, but the need resched check subtly returns iter->yielded,
which _should_ be false (enforced by a WARN), return false unconditionally
(which might also help make the sequence more predictable).  If KVM has a
bug where iter->yielded is left danging, continuing to yield is neither
right nor wrong, it was simply an artifact of how the original code was
written.

Unconditionally returning false when yielding is unnecessary or unwanted
will also allow extracting the "should resched" logic to a separate helper
in a future patch.

Cc: David Matlack <dmatlack@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20241031170633.1502783-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:18 -08:00
Al Viro
6348be02ee fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are
immediately followed by leaving the scope.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Sean Christopherson
1ded7a57b8 KVM: x86: Remove ordering check b/w MSR_PLATFORM_INFO and MISC_FEATURES_ENABLES
Drop KVM's odd restriction that disallows clearing CPUID_FAULT in
MSR_PLATFORM_INFO if CPL>0 CPUID faulting is enabled in
MSR_MISC_FEATURES_ENABLES.  KVM generally doesn't require specific
ordering when userspace sets MSRs, and the completely arbitrary order of
MSRs in emulated_msrs_all means that a userspace that uses KVM's list
verbatim could run afoul of the check.

Dropping the restriction obviously means that userspace could stuff a
nonsensical vCPU model, but that's the case all over KVM.  KVM typically
restricts userspace MSR writes only when it makes things easier for KVM
and/or userspace.

Link: https://lore.kernel.org/r/20240802185511.305849-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:34 -07:00
Sean Christopherson
a5d563890b KVM: x86: Reject userspace attempts to access ARCH_CAPABILITIES w/o support
Reject userspace accesses to ARCH_CAPABILITIES if the MSR isn't supposed
to exist, according to guest CPUID.  However, "reject" accesses with
KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0' are
ignored if KVM advertised support ARCH_CAPABILITIES.

KVM's ABI is that userspace must set guest CPUID prior to setting MSRs,
and that setting MSRs that aren't supposed exist is disallowed (modulo the
'0' exemption).

Link: https://lore.kernel.org/r/20240802185511.305849-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:34 -07:00
Sean Christopherson
a103911119 KVM: VMX: Remove restriction that PMU version > 0 for PERF_CAPABILITIES
Drop the restriction that the PMU version is non-zero when handling writes
to PERF_CAPABILITIES now that KVM unconditionally checks for PDCM support.

Link: https://lore.kernel.org/r/20240802185511.305849-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:33 -07:00
Sean Christopherson
d75cac366f KVM: x86: Reject userspace attempts to access PERF_CAPABILITIES w/o PDCM
Reject userspace accesses to PERF_CAPABILITIES if PDCM isn't set in guest
CPUID, i.e. if the vCPU doesn't actually have PERF_CAPABILITIES.  But!  Do
so via KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0'
are ignored if KVM advertised support PERF_CAPABILITIES.

KVM's ABI is that userspace must set guest CPUID prior to setting MSRs,
and that setting MSRs that aren't supposed exist is disallowed (modulo the
'0' exemption).

Link: https://lore.kernel.org/r/20240802185511.305849-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:32 -07:00
Sean Christopherson
dcb988cdac KVM: x86: Quirk initialization of feature MSRs to KVM's max configuration
Add a quirk to control KVM's misguided initialization of select feature
MSRs to KVM's max configuration, as enabling features by default violates
KVM's approach of letting userspace own the vCPU model, and is actively
problematic for MSRs that are conditionally supported, as the vCPU will
end up with an MSR value that userspace can't restore.  E.g. if the vCPU
is configured with PDCM=0, userspace will save and attempt to restore a
non-zero PERF_CAPABILITIES, thanks to KVM's meddling.

Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:31 -07:00
Sean Christopherson
bc2ca3680b KVM: x86: Disallow changing MSR_PLATFORM_INFO after vCPU has run
Tag MSR_PLATFORM_INFO as a feature MSR (because it is), i.e. disallow it
from being modified after the vCPU has run.

To make KVM's selftest compliant, simply delete the userspace MSR write
that restores KVM's original value at the end of the test.  Verifying that
userspace can write back what it originally read is uninteresting in this
particular case, because KVM doesn't enforce _any_ bits in the MSR, i.e.
userspace should be able to write any arbitrary value.

Link: https://lore.kernel.org/r/20240802185511.305849-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:30 -07:00
Sean Christopherson
2142ac663a KVM: x86: Co-locate initialization of feature MSRs in kvm_arch_vcpu_create()
Bunch all of the feature MSR initialization in kvm_arch_vcpu_create() so
that it can be easily quirked in a future patch.

No functional change intended.

Link: https://lore.kernel.org/r/20240802185511.305849-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:29 -07:00
Maxim Levitsky
90a877216e KVM: nVMX: fix canonical check of vmcs12 HOST_RIP
HOST_RIP canonical check should check the L1 of CR4.LA57 stored in
the vmcs12 rather than the current L1's because it is legal to change
the CR4.LA57 value during VM exit from L2 to L1.

This is a theoretical bug though, because it is highly unlikely that a
VM exit will change the CR4.LA57 from the value it had on VM entry.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-5-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:27 -07:00
Maxim Levitsky
9245fd6b85 KVM: x86: model canonical checks more precisely
As a result of a recent investigation, it was determined that x86 CPUs
which support 5-level paging, don't always respect CR4.LA57 when doing
canonical checks.

In particular:

1. MSRs which contain a linear address, allow full 57-bitcanonical address
regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE.

2. All hidden segment bases and GDT/IDT bases also behave like MSRs.
This means that full 57-bit canonical address can be loaded to them
regardless of CR4.LA57, both using MSRS (e.g GS_BASE) and instructions
(e.g LGDT).

3. TLB invalidation instructions also allow the user to use full 57-bit
address regardless of the CR4.LA57.

Finally, it must be noted that the CPU doesn't prevent the user from
disabling 5-level paging, even when the full 57-bit canonical address is
present in one of the registers mentioned above (e.g GDT base).

In fact, this can happen without any userspace help, when the CPU enters
SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE are left to contain
a non-canonical address in regard to the new mode.

Since most of the affected MSRs and all segment bases can be read and
written freely by the guest without any KVM intervention, this patch makes
the emulator closely follow hardware behavior, which means that the
emulator doesn't take in the account the guest CPUID support for 5-level
paging, and only takes in the account the host CPU support.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:26 -07:00
Maxim Levitsky
c534b37b75 KVM: x86: Add X86EMUL_F_MSR and X86EMUL_F_DT_LOAD to aid canonical checks
Add emulation flags for MSR accesses and Descriptor Tables loads, and pass
the new flags as appropriate to emul_is_noncanonical_address().  The flags
will be used to perform the correct canonical check, as the type of access
affects whether or not CR4.LA57 is consulted when determining the canonical
bit.

No functional change is intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com
[sean: split to separate patch, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:25 -07:00
Maxim Levitsky
16ccadefa2 KVM: x86: Route non-canonical checks in emulator through emulate_ops
Add emulate_ops.is_canonical_addr() to perform (non-)canonical checks in
the emulator, which will allow extending is_noncanonical_address() to
support different flavors of canonical checks, e.g. for descriptor table
bases vs. MSRs, without needing duplicate logic in the emulator.

No functional change is intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com
[sean: separate from additional of flags, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:25 -07:00
Maxim Levitsky
e52ad1ddd0 KVM: x86: drop x86.h include from cpuid.h
Drop x86.h include from cpuid.h to allow the x86.h to include the cpuid.h
instead.

Also fix various places where x86.h was implicitly included via cpuid.h

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-2-mlevitsk@redhat.com
[sean: fixup a missed include in mtrr.c]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:23 -07:00
Sean Christopherson
eecf398545 KVM: x86: Use '0' for guest RIP if PMI encounters protected guest state
Explicitly return '0' for guest RIP when handling a PMI VM-Exit for a vCPU
with protected guest state, i.e. when KVM can't read the real RIP.  While
there is no "right" value, and profiling a protect guest is rather futile,
returning the last known RIP is worse than returning obviously "bad" data.
E.g. for SEV-ES+, the last known RIP will often point somewhere in the
guest's boot flow.

Opportunistically add WARNs to effectively assert that the in_kernel() and
get_ip() callbacks are restricted to the common PMI handler, as the return
values for the protected guest state case are largely arbitrary, i.e. only
make any sense whatsoever for PMIs, where the returned values have no
functional impact and thus don't truly matter.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:23 -07:00
Sean Christopherson
1c932fc762 KVM: x86: Add lockdep-guarded asserts on register cache usage
When lockdep is enabled, assert that KVM accesses the register caches if
and only if cache fills are guaranteed to consume fresh data, i.e. when
KVM when KVM is in control of the code sequence.  Concretely, the caches
can only be used from task context (synchronous) or when handling a PMI
VM-Exit (asynchronous, but only in specific windows where the caches are
in a known, stable state).

Generally speaking, there are very few flows where reading register state
from an asynchronous context is correct or even necessary.  So, rather
than trying to figure out a generic solution, simply disallow using the
caches outside of task context by default, and deal with any future
exceptions on a case-by-case basis _if_ they arise.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:22 -07:00
Sean Christopherson
f0e7012c4b KVM: x86: Bypass register cache when querying CPL from kvm_sched_out()
When querying guest CPL to determine if a vCPU was preempted while in
kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from
the VMCS on Intel CPUs.  If the kernel is running with full preemption
enabled, using the register cache in the preemption path can result in
stale and/or uninitialized data being cached in the segment cache.

In particular the following scenario is currently possible:

 - vCPU is just created, and the vCPU thread is preempted before
   SS.AR_BYTES is written in vmx_vcpu_reset().

 - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() =>
   vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.

 - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't
   invoke vmx_segment_cache_clear() to invalidate the cache.

As a result, KVM retains a stale value in the cache, which can be read,
e.g. via KVM_GET_SREGS.  Usually this is not a problem because the VMX
segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM
selftests) reads and writes system registers just after the vCPU was
created, _without_ modifying SS.AR_BYTES, userspace will write back the
stale '0' value and ultimately will trigger a VM-Entry failure due to
incorrect SS segment type.

Note, the VM-Enter failure can also be avoided by moving the call to
vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all
segments.  However, while that change is correct and desirable (and will
come along shortly), it does not address the underlying problem that
accessing KVM's register caches from !task context is generally unsafe.

In addition to fixing the immediate bug, bypassing the cache for this
particular case will allow hardening KVM register caching log to assert
that the caches are accessed only when KVM _knows_ it is safe to do so.

Fixes: de63ad4cf4 ("KVM: X86: implement the logic for spinlock optimization")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:21 -07:00
Jim Mattson
de572491a9 KVM: x86: AMD's IBPB is not equivalent to Intel's IBPB
From Intel's documentation [1], "CPUID.(EAX=07H,ECX=0):EDX[26]
enumerates support for indirect branch restricted speculation (IBRS)
and the indirect branch predictor barrier (IBPB)." Further, from [2],
"Software that executed before the IBPB command cannot control the
predicted targets of indirect branches (4) executed after the command
on the same logical processor," where footnote 4 reads, "Note that
indirect branches include near call indirect, near jump indirect and
near return instructions. Because it includes near returns, it follows
that **RSB entries created before an IBPB command cannot control the
predicted targets of returns executed after the command on the same
logical processor.**" [emphasis mine]

On the other hand, AMD's IBPB "may not prevent return branch
predictions from being specified by pre-IBPB branch targets" [3].

However, some AMD processors have an "enhanced IBPB" [terminology
mine] which does clear the return address predictor. This feature is
enumerated by CPUID.80000008:EDX.IBPB_RET[bit 30] [4].

Adjust the cross-vendor features enumerated by KVM_GET_SUPPORTED_CPUID
accordingly.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/cpuid-enumeration-and-architectural-msrs.html
[2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/speculative-execution-side-channel-mitigations.html#Footnotes
[3] https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1040.html
[4] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf

Fixes: 0c54914d0c ("KVM: x86: use Intel speculation bugs and features as derived in generic x86 code")
Suggested-by: Venkatesh Srinivas <venkateshs@chromium.org>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241011214353.1625057-5-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:20 -07:00
Jim Mattson
71dd5d5300 KVM: x86: Advertise AMD_IBPB_RET to userspace
This is an inherent feature of IA32_PRED_CMD[0], so it is trivially
virtualizable (as long as IA32_PRED_CMD[0] is virtualized).

Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241011214353.1625057-4-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:19 -07:00
Sean Christopherson
3ffe874ea3 KVM: x86: Ensure vcpu->mode is loaded from memory in kvm_vcpu_exit_request()
Wrap kvm_vcpu_exit_request()'s load of vcpu->mode with READ_ONCE() to
ensure the variable is re-loaded from memory, as there is no guarantee the
caller provides the necessary annotations to ensure KVM sees a fresh value,
e.g. the VM-Exit fastpath could theoretically reuse the pre-VM-Enter value.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240828232013.768446-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:21:46 -07:00
Kai Huang
6e44d2427b KVM: x86: Fix a comment inside __kvm_set_or_clear_apicv_inhibit()
Change svm_vcpu_run() to vcpu_enter_guest() in the comment of
__kvm_set_or_clear_apicv_inhibit() to make it reflect the fact.

When one thread updates VM's APICv state due to updating the APICv
inhibit reasons, it kicks off all vCPUs and makes them wait until the
new reason has been updated and can be seen by all vCPUs.

There was one WARN() to make sure VM's APICv state is consistent with
vCPU's APICv state in the svm_vcpu_run().  Commit ee49a89329 ("KVM:
x86: Move SVM's APICv sanity check to common x86") moved that WARN() to
x86 common code vcpu_enter_guest() due to the logic is not unique to
SVM, and added comments to both __kvm_set_or_clear_apicv_inhibit() and
vcpu_enter_guest() to explain this.

However, although the comment in __kvm_set_or_clear_apicv_inhibit()
mentioned the WARN(), it seems forgot to reflect that the WARN() had
been moved to x86 common, i.e., it still mentioned the svm_vcpu_run()
but not vcpu_enter_guest().  Fix it.

Note after the change the first line that contains vcpu_enter_guest()
exceeds 80 characters, but leave it as is to make the diff clean.

Fixes: ee49a89329 ("KVM: x86: Move SVM's APICv sanity check to common x86")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/e462e7001b8668649347f879c66597d3327dbac2.1728383775.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:21:45 -07:00
Kai Huang
ef86fe036d KVM: x86: Fix a comment inside kvm_vcpu_update_apicv()
The sentence "... so that KVM can the AVIC doorbell to ..." doesn't have
a verb.  Fix it.

After adding the verb 'use', that line exceeds 80 characters.  Thus wrap
the 'to' to the next line.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/666e991edf81e1fccfba9466f3fe65965fcba897.1728383775.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:21:44 -07:00
David Matlack
35ef80eb29 KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEs
Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes
when zapping collapsible SPTEs, rather than freezing them first.

Freezing the SPTE first is not required. It is fine for another thread
holding mmu_lock for read to immediately install a present entry before
TLBs are flushed because the underlying mapping is not changing. vCPUs
that translate through the stale 4K mappings or a new huge page mapping
will still observe the same GPA->HPA translations.

KVM must only flush TLBs before dropping RCU (to avoid use-after-free of
the zapped page tables) and before dropping mmu_lock (to synchronize
with mmu_notifiers invalidating mappings).

In VMs backed with 2MiB pages, batching TLB flushes improves the time it
takes to zap collapsible SPTEs to disable dirty logging:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

 Before: Disabling dirty logging time: 14.334453428s (131072 flushes)
 After:  Disabling dirty logging time: 4.794969689s  (76 flushes)

Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen
SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting
on the zapped mapping can now immediately install a new huge mapping and
proceed with guest execution.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:25:43 -07:00
David Matlack
8ccd51cb59 KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level()
Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All
callers pass in PG_LEVEL_NUM, so @max_level can be replaced with
PG_LEVEL_NUM in the function body.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:25:42 -07:00
Sean Christopherson
b9883ee40d KVM: x86: Don't emit TLB flushes when aging SPTEs for mmu_notifiers
Follow x86's primary MMU, which hasn't flushed TLBs when clearing Accessed
bits for 10+ years, and skip all TLB flushes when aging SPTEs in response
to a clear_flush_young() mmu_notifier event.  As documented in x86's
ptep_clear_flush_young(), the probability and impact of "bad" reclaim due
to stale A-bit information is relatively low, whereas the performance cost
of TLB flushes is relatively high.  I.e. the cost of flushing TLBs
outweighs the benefits.

On KVM x86, the cost of TLB flushes is even higher, as KVM doesn't batch
TLB flushes for mmu_notifier events (KVM's mmu_notifier contract with MM
makes it all but impossible), and sending IPIs forces all running vCPUs to
go through a VM-Exit => VM-Enter roundtrip.

Furthermore, MGLRU aging of secondary MMUs is expected to use flush-less
mmu_notifiers, i.e. flushing for the !MGLRU will make even less sense, and
will be actively confusing as it wouldn't be clear why KVM "needs" to
flush TLBs for legacy LRU aging, but not for MGLRU aging.

Cc: James Houghton <jthoughton@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/20240926013506.860253-18-jthoughton@google.com
Link: https://lore.kernel.org/r/20241011021051.1557902-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:25:41 -07:00
Sean Christopherson
8564911751 KVM: x86/mmu: Set Dirty bit for new SPTEs, even if _hardware_ A/D bits are disabled
When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if
hardware A/D bits are disabled.  Only EPT allows A/D bits to be disabled,
and for EPT, the bits are software-available (ignored by hardware) when
A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty
to track dirty pages in software.

Link: https://lore.kernel.org/r/20241011021051.1557902-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:25:39 -07:00
Sean Christopherson
c9b625625b KVM: x86/mmu: Dedup logic for detecting TLB flushes on leaf SPTE changes
Now that the shadow MMU and TDP MMU have identical logic for detecting
required TLB flushes when updating SPTEs, move said logic to a helper so
that the TDP MMU code can benefit from the comments that are currently
exclusive to the shadow MMU.

No functional change intended.

Link: https://lore.kernel.org/r/20241011021051.1557902-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:25:37 -07:00
Sean Christopherson
51192ebdd1 KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found
Return immediately if a young SPTE is found when testing, but not updating,
SPTEs.  The return value is a boolean, i.e. whether there is one young SPTE
or fifty is irrelevant (ignoring the fact that it's impossible for there to
be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU
roots).

Link: https://lore.kernel.org/r/20241011021051.1557902-15-seanjc@google.com
[sean: use guard(rcu)(), as suggested by Paolo]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:23:30 -07:00
Sean Christopherson
526e609f05 KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range
Skip invalid TDP MMU roots when aging a gfn range.  There is zero reason
to process invalid roots, as they by definition hold stale information.
E.g. if a root is invalid because its from a previous memslot generation,
in the unlikely event the root has a SPTE for the gfn, then odds are good
that the gfn=>hva mapping is different, i.e. doesn't map to the hva that
is being aged by the primary MMU.

Link: https://lore.kernel.org/r/20241011021051.1557902-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:47 -07:00
Sean Christopherson
7971801b56 KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled
Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware,
i.e. propagate accessed information to SPTE.Accessed even when KVM is
doing manual tracking by making SPTEs not-present.  In addition to
eliminating a small amount of code in is_accessed_spte(), this also paves
the way for preserving Accessed information when a SPTE is zapped in
response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped
because NUMA balancing kicks in.

Note, EPT is the only flavor of paging in which A/D bits are conditionally
enabled, and the Accessed (and Dirty) bit is software-available when A/D
bits are disabled.

Note #2, there are currently no concrete plans to preserve Accessed
information.  Explorations on that front were the initial catalyst, but
the cleanup is the motivation for the actual commit.

Link: https://lore.kernel.org/r/20241011021051.1557902-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:47 -07:00
Sean Christopherson
53510b9125 KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled
Set shadow_dirty_mask to the architectural EPT Dirty bit value even if
A/D bits are disabled at the module level, i.e. even if KVM will never
enable A/D bits in hardware.  Doing so provides consistent behavior for
Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets
shadow_accessed_mask but not shadow_dirty_mask.

Functionally, this should be one big nop, as consumption of
shadow_dirty_mask is always guarded by a check that hardware A/D bits are
enabled.

Link: https://lore.kernel.org/r/20241011021051.1557902-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
3835819fb1 KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled
Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D
bits are enabled, set shadow_accessed_mask for EPT even when A/D bits
are disabled in hardware.  This will allow using shadow_accessed_mask for
software purposes, e.g. to preserve accessed status in a non-present SPTE
acros NUMA balancing, if something like that is ever desirable.

Link: https://lore.kernel.org/r/20241011021051.1557902-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
a5da5dde4b KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled
Add a dedicated flag to track if KVM has enabled A/D bits at the module
level, instead of inferring the state based on whether or not the MMU's
shadow_accessed_mask is non-zero.  This will allow defining and using
shadow_accessed_mask even when A/D bits aren't used by hardware.

Link: https://lore.kernel.org/r/20241011021051.1557902-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
1a175082b1 KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable
Do a remote TLB flush if installing a leaf SPTE overwrites an existing
leaf SPTE (with the same target pfn, which is enforced by a BUG() in
handle_changed_spte()) and clears the MMU-Writable bit.  Since the TDP MMU
passes ACC_ALL to make_spte(), i.e. always requests a Writable SPTE, the
only scenario in which make_spte() should create a !MMU-Writable SPTE is
if the gfn is write-tracked or if KVM is prefetching a SPTE.

When write-protecting for write-tracking, KVM must hold mmu_lock for write,
i.e. can't race with a vCPU faulting in the SPTE.  And when prefetching a
SPTE, the TDP MMU takes care to avoid clobbering a shadow-present SPTE,
i.e. it should be impossible to replace a MMU-writable SPTE with a
!MMU-writable SPTE when handling a TDP MMU fault.

Cc: David Matlack <dmatlack@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
67c9380292 KVM: x86/mmu: Fold mmu_spte_update_no_track() into mmu_spte_update()
Fold the guts of mmu_spte_update_no_track() into mmu_spte_update() now
that the latter doesn't flush when clearing A/D bits, i.e. now that there
is no need to explicitly avoid TLB flushes when aging SPTEs.

Opportunistically WARN if mmu_spte_update() requests a TLB flush when
aging SPTEs, as aging should never modify a SPTE in such a way that KVM
thinks a TLB flush is needed.

Link: https://lore.kernel.org/r/20241011021051.1557902-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
010344122d KVM: x86/mmu: Drop ignored return value from kvm_tdp_mmu_clear_dirty_slot()
Drop the return value from kvm_tdp_mmu_clear_dirty_slot() as its sole
caller ignores the result (KVM flushes after clearing dirty logs based on
the logs themselves, not based on SPTEs).

Cc: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
856cf4a60c KVM: x86/mmu: Don't flush TLBs when clearing Dirty bit in shadow MMU
Don't force a TLB flush when an SPTE update in the shadow MMU happens to
clear the Dirty bit, as KVM unconditionally flushes TLBs when enabling
dirty logging, and when clearing dirty logs, KVM flushes based on its
software structures, not the SPTEs.  I.e. the flows that care about
accurate Dirty bit information already ensure there are no stale TLB
entries.

Opportunistically drop is_dirty_spte() as mmu_spte_update() was the sole
caller.

Link: https://lore.kernel.org/r/20241011021051.1557902-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
b7ed46b201 KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit
Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as
access tracking tolerates false negatives, as evidenced by the
mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB
flush.

In practice, this is very nearly a nop.  spte_write_protect() and
spte_clear_dirty() never clear the Accessed bit.  make_spte() always
sets the Accessed bit for !prefetch scenarios.  FNAME(sync_spte) only sets
SPTE if the protection bits are changing, i.e. if a flush will be needed
regardless of the Accessed bits.  And FNAME(pte_prefetch) sets SPTE if and
only if the old SPTE is !PRESENT.

That leaves kvm_arch_async_page_ready() as the one path that will generate
a !ACCESSED SPTE *and* overwrite a PRESENT SPTE.  And that's very arguably
a bug, as clobbering a valid SPTE in that case is nonsensical.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Link: https://lore.kernel.org/r/20241011021051.1557902-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:46 -07:00
Sean Christopherson
0387d79e24 KVM: x86/mmu: Fold all of make_spte()'s writable handling into one if-else
Now that make_spte() no longer uses a funky goto to bail out for a special
case of its unsync handling, combine all of the unsync vs. writable logic
into a single if-else statement.

No functional change intended.

Link: https://lore.kernel.org/r/20241011021051.1557902-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:45 -07:00
Sean Christopherson
cc7ed3358e KVM: x86/mmu: Always set SPTE's dirty bit if it's created as writable
When creating a SPTE, always set the Dirty bit if the Writable bit is set,
i.e. if KVM is creating a writable mapping.  If two (or more) vCPUs are
racing to install a writable SPTE on a !PRESENT fault, only the "winning"
vCPU will create a SPTE with W=1 and D=1, all "losers" will generate a
SPTE with W=1 && D=0.

As a result, tdp_mmu_map_handle_target_level() will fail to detect that
the losing faults are effectively spurious, and will overwrite the D=1
SPTE with a D=0 SPTE.  For normal VMs, overwriting a present SPTE is a
small performance blip; KVM blasts a remote TLB flush, but otherwise life
goes on.

For upcoming TDX VMs, overwriting a present SPTE is much more costly, and
can even lead to the VM being terminated if KVM isn't careful, e.g. if KVM
attempts TDH.MEM.PAGE.AUG because the TDX code doesn't detect that the
new SPTE is actually the same as the old SPTE (which would be a bug in its
own right).

Suggested-by: Sagi Shahar <sagis@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:45 -07:00
Sean Christopherson
081976992f KVM: x86/mmu: Flush remote TLBs iff MMU-writable flag is cleared from RO SPTE
Don't force a remote TLB flush if KVM happens to effectively "refresh" a
read-only SPTE that is still MMU-Writable, as KVM allows MMU-Writable SPTEs
to have Writable TLB entries, even if the SPTE is !Writable.  Remote TLBs
need to be flushed only when creating a read-only SPTE for write-tracking,
i.e. when installing a !MMU-Writable SPTE.

In practice, especially now that KVM doesn't overwrite existing SPTEs when
prefetching, KVM will rarely "refresh" a read-only, MMU-Writable SPTE,
i.e. this is unlikely to eliminate many, if any, TLB flushes.  But, more
precisely flushing makes it easier to understand exactly when KVM does and
doesn't need to flush.

Note, x86 architecturally requires relevant TLB entries to be invalidated
on a page fault, i.e. there is no risk of putting a vCPU into an infinite
loop of read-only page faults.

Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:46:45 -07:00
Yan Zhao
bc17fccb37 KVM: VMX: Remove the unused variable "gpa" in __invept()
Remove the unused variable "gpa" in __invept().

The INVEPT instruction only supports two types: VMX_EPT_EXTENT_CONTEXT (1)
and VMX_EPT_EXTENT_GLOBAL (2). Neither of these types requires a third
variable "gpa".

The "gpa" variable for __invept() is always set to 0 and was originally
introduced for the old non-existent type VMX_EPT_EXTENT_INDIVIDUAL_ADDR
(0). This type was removed by commit 2b3c5cbc0d ("kvm: don't use bit24
for detecting address-specific invalidation capability") and
commit 63f3ac4813 ("KVM: VMX: clean up declaration of VPID/EPT
invalidation types").

Since this variable is not useful for error handling either, remove it to
avoid confusion.

No functional changes expected.

Cc: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241014045931.1061-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 12:28:37 -07:00
Sean Christopherson
66bc627e7f KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
Don't mark pages/folios as accessed in the primary MMU when zapping SPTEs,
as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking
is unnecessary and wasteful.  KVM participates in page aging via
mmu_notifiers, so there's no need to push "accessed" updates to the
primary MMU.

And if KVM zaps a SPTe in response to an mmu_notifier, marking it accessed
_after_ the primary MMU has decided to zap the page is likely to go
unnoticed, i.e. odds are good that, if the page is being zapped for
reclaim, the page will be swapped out regardless of whether or not KVM
marks the page accessed.

Dropping x86's use of kvm_set_pfn_accessed() also paves the way for
removing kvm_pfn_to_refcounted_page() and all its users.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-83-seanjc@google.com>
2024-10-25 13:01:35 -04:00
Sean Christopherson
93091f0fc7 KVM: VMX: Use __kvm_faultin_page() to get APIC access page/pfn
Use __kvm_faultin_page() get the APIC access page so that KVM can
precisely release the refcounted page, i.e. to remove yet another user
of kvm_pfn_to_refcounted_page().  While the path isn't handling a guest
page fault, the semantics are effectively the same; KVM just happens to
be mapping the pfn into a VMCS field instead of a secondary MMU.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-52-seanjc@google.com>
2024-10-25 13:00:48 -04:00
Sean Christopherson
cb444acb69 KVM: VMX: Hold mmu_lock until page is released when updating APIC access page
Hold mmu_lock across kvm_release_pfn_clean() when refreshing the APIC
access page address to ensure that KVM doesn't mark a page/folio as
accessed after it has been unmapped.  Practically speaking marking a folio
accesses is benign in this scenario, as KVM does hold a reference (it's
really just marking folios dirty that is problematic), but there's no
reason not to be paranoid (moving the APIC access page isn't a hot path),
and no reason to be different from other mmu_notifier-protected flows in
KVM.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-51-seanjc@google.com>
2024-10-25 13:00:48 -04:00
Sean Christopherson
dc06193532 KVM: Move x86's API to release a faultin page to common KVM
Move KVM x86's helper that "finishes" the faultin process to common KVM
so that the logic can be shared across all architectures.  Note, not all
architectures implement a fast page fault path, but the gist of the
comment applies to all architectures.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-50-seanjc@google.com>
2024-10-25 13:00:47 -04:00
Sean Christopherson
8eaa98004b KVM: x86/mmu: Don't mark unused faultin pages as accessed
When finishing guest page faults, don't mark pages as accessed if KVM
is resuming the guest _without_ installing a mapping, i.e. if the page
isn't being used.  While it's possible that marking the page accessed
could avoid minor thrashing due to reclaiming a page that the guest is
about to access, it's far more likely that the gfn=>pfn mapping was
was invalidated, e.g. due a memslot change, or because the corresponding
VMA is being modified.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-49-seanjc@google.com>
2024-10-25 13:00:47 -04:00
Sean Christopherson
8dd861cc07 KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns
Now that all x86 page fault paths precisely track refcounted pages, use
Use kvm_page_fault.refcounted_page to put references to struct page memory
when finishing page faults.  This is a baby step towards eliminating
kvm_pfn_to_refcounted_page().

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-48-seanjc@google.com>
2024-10-25 13:00:47 -04:00
Sean Christopherson
1fbee5b01a KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn()
Provide the "struct page" associated with a guest_memfd pfn as an output
from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can
directly put the page instead of having to rely on
kvm_pfn_to_refcounted_page().

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-47-seanjc@google.com>
2024-10-25 13:00:47 -04:00
Sean Christopherson
54ba8c98a2 KVM: x86/mmu: Convert page fault paths to kvm_faultin_pfn()
Convert KVM x86 to use the recently introduced __kvm_faultin_pfn().
Opportunstically capture the refcounted_page grabbed by KVM for use in
future changes.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-45-seanjc@google.com>
2024-10-25 13:00:47 -04:00
Sean Christopherson
0cad68cab1 KVM: x86/mmu: Mark pages/folios dirty at the origin of make_spte()
Move the marking of folios dirty from make_spte() out to its callers,
which have access to the _struct page_, not just the underlying pfn.
Once all architectures follow suit, this will allow removing KVM's ugly
hack where KVM elevates the refcount of VM_MIXEDMAP pfns that happen to
be struct page memory.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-42-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
7103853952 KVM: x86/mmu: Add helper to "finish" handling a guest page fault
Add a helper to finish/complete the handling of a guest page, e.g. to
mark the pages accessed and put any held references.  In the near
future, this will allow improving the logic without having to copy+paste
changes into all page fault paths.  And in the less near future, will
allow sharing the "finish" API across all architectures.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-41-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
fa8fe58d1e KVM: x86/mmu: Add common helper to handle prefetching SPTEs
Deduplicate the prefetching code for indirect and direct MMUs.  The core
logic is the same, the only difference is that indirect MMUs need to
prefetch SPTEs one-at-a-time, as contiguous guest virtual addresses aren't
guaranteed to yield contiguous guest physical addresses.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-40-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
64d5cd99f7 KVM: x86/mmu: Put direct prefetched pages via kvm_release_page_clean()
Use kvm_release_page_clean() to put prefeteched pages instead of calling
put_page() directly.  This will allow de-duplicating the prefetch code
between indirect and direct MMUs.

Note, there's a small functional change as kvm_release_page_clean() marks
the page/folio as accessed.  While it's not strictly guaranteed that the
guest will access the page, KVM won't intercept guest accesses, i.e. won't
mark the page accessed if it _is_ accessed by the guest (unless A/D bits
are disabled, but running without A/D bits is effectively limited to
pre-HSW Intel CPUs).

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-39-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
447c375c91 KVM: x86/mmu: Add "mmu" prefix fault-in helpers to free up generic names
Prefix x86's faultin_pfn helpers with "mmu" so that the mmu-less names can
be used by common KVM for similar APIs.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-38-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
fcd366b95e KVM: x86: Don't fault-in APIC access page during initial allocation
Drop the gfn_to_page() lookup when installing KVM's internal memslot for
the APIC access page, as KVM doesn't need to immediately fault-in the page
now that the page isn't pinned.  In the extremely unlikely event the
kernel can't allocate a 4KiB page, KVM can just as easily return -EFAULT
on the future page fault.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-37-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
365e319208 KVM: Pass in write/dirty to kvm_vcpu_map(), not kvm_vcpu_unmap()
Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them
pass "true" as a @writable param to kvm_vcpu_map(), and thus create a
read-only mapping when possible.

Note, creating read-only mappings can be theoretically slower, as they
don't play nice with fast GUP due to the need to break CoW before mapping
the underlying PFN.  But practically speaking, creating a mapping isn't
a super hot path, and getting a writable mapping for reading is weird and
confusing.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-34-seanjc@google.com>
2024-10-25 12:59:07 -04:00
Sean Christopherson
7afe79f573 KVM: nVMX: Mark vmcs12's APIC access page dirty when unmapping
Mark the APIC access page as dirty when unmapping it from KVM.  The fact
that the page _shouldn't_ be written doesn't guarantee the page _won't_ be
written.  And while the contents are likely irrelevant, the values _are_
visible to the guest, i.e. dropping writes would be visible to the guest
(though obviously highly unlikely to be problematic in practice).

Marking the map dirty will allow specifying the write vs. read-only when
*mapping* the memory, which in turn will allow creating read-only maps.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-33-seanjc@google.com>
2024-10-25 12:58:00 -04:00
Sean Christopherson
a629ef9518 KVM: nVMX: Add helper to put (unmap) vmcs12 pages
Add a helper to dedup unmapping the vmcs12 pages.  This will reduce the
amount of churn when a future patch refactors the kvm_vcpu_unmap() API.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-26-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson
2e34f942a5 KVM: nVMX: Drop pointless msr_bitmap_map field from struct nested_vmx
Remove vcpu_vmx.msr_bitmap_map and instead use an on-stack structure in
the one function that uses the map, nested_vmx_prepare_msr_bitmap().

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-25-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson
efaaabc6c6 KVM: nVMX: Rely on kvm_vcpu_unmap() to track validity of eVMCS mapping
Remove the explicit evmptr12 validity check when deciding whether or not
to unmap the eVMCS pointer, and instead rely on kvm_vcpu_unmap() to play
nice with a NULL map->hva, i.e. to do nothing if the map is invalid.

Note, vmx->nested.hv_evmcs_map is zero-allocated along with the rest of
vcpu_vmx, i.e. the map starts out invalid/NULL.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-24-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson
cccefb0a0d KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-19-seanjc@google.com>
2024-10-25 12:57:58 -04:00
Sean Christopherson
084ecf95a0 KVM: x86/mmu: Drop kvm_page_fault.hva, i.e. don't track intermediate hva
Remove kvm_page_fault.hva as it is never read, only written.  This will
allow removing the @hva param from __gfn_to_pfn_memslot().

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-18-seanjc@google.com>
2024-10-25 12:57:58 -04:00
David Stevens
6769d1bcd3 KVM: Replace "async" pointer in gfn=>pfn with "no_wait" and error code
Add a pfn error code to communicate that hva_to_pfn() failed because I/O
was needed and disallowed, and convert @async to a constant @no_wait
boolean.  This will allow eliminating the @no_wait param by having callers
pass in FOLL_NOWAIT along with other FOLL_* flags.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: David Stevens <stevensd@chromium.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-17-seanjc@google.com>
2024-10-25 12:57:58 -04:00
Sean Christopherson
e2d2ca71ac KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIs
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass
"false", and remove a comment blurb about KVM running only the "GUP fast"
part in atomic context.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-25 12:57:58 -04:00
Sean Christopherson
6419bc5207 KVM: Rename gfn_to_page_many_atomic() to kvm_prefetch_pages()
Rename gfn_to_page_many_atomic() to kvm_prefetch_pages() to try and
communicate its true purpose, as the "atomic" aspect is essentially a
side effect of the fact that x86 uses the API while holding mmu_lock.
E.g. even if mmu_lock weren't held, KVM wouldn't want to fault-in pages,
as the goal is to opportunistically grab surrounding pages that have
already been accessed and/or dirtied by the host, and to do so quickly.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-12-seanjc@google.com>
2024-10-25 12:55:12 -04:00
Sean Christopherson
661fa987e4 KVM: x86/mmu: Use gfn_to_page_many_atomic() when prefetching indirect PTEs
Use gfn_to_page_many_atomic() instead of gfn_to_pfn_memslot_atomic() when
prefetching indirect PTEs (direct_pte_prefetch_many() already uses the
"to page" APIS).  Functionally, the two are subtly equivalent, as the "to
pfn" API short-circuits hva_to_pfn() if hva_to_pfn_fast() fails, i.e. is
just a wrapper for get_user_page_fast_only()/get_user_pages_fast_only().

Switching to the "to page" API will allow dropping the @atomic parameter
from the entire hva_to_pfn() callchain.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-11-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
5f6a3badbb KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs
Now that KVM doesn't clobber Accessed bits of shadow-present SPTEs,
e.g. when prefetching, mark folios as accessed only when zapping leaf
SPTEs, which is a rough heuristic for "only in response to an mmu_notifier
invalidation".  Page aging and LRUs are tolerant of false negatives, i.e.
KVM doesn't need to be precise for correctness, and re-marking folios as
accessed when zapping entire roots or when zapping collapsible SPTEs is
expensive and adds very little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value.  Similarly, because KVM
marks folios as accessed when creating SPTEs, marking all folios as
accessed when userspace happens to delete a memslot doesn't add value.
The folio was marked access when the old SPTE was created, and will be
marked accessed yet again if a vCPU accesses the pfn again after reloading
a new root.  Zapping collapsible SPTEs is a similar story; marking folios
accessed just because userspace disable dirty logging is a side effect of
KVM behavior, not a deliberate goal.

As an intermediate step, a.k.a. bisection point, towards *never* marking
folios accessed when dropping SPTEs, mark folios accessed when the primary
MMU might be invalidating mappings, as such zappings are not KVM initiated,
i.e. might actually be related to page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-10-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
aa85986e71 KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifying
Mark pages/folios dirty when creating SPTEs to map PFNs into the guest,
not when zapping or modifying SPTEs, as marking folios dirty when zapping
or modifying SPTEs can be extremely inefficient.  E.g. when KVM is zapping
collapsible SPTEs to reconstitute a hugepage after disbling dirty logging,
KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are
guaranteed to be in a single folio (the SPTE couldn't potentially be huge
if that weren't the case).  The problem only becomes worse for 1GiB
HugeTLB pages, as KVM can mark a single folio dirty 512*512 times.

Marking a folio dirty when mapping is functionally safe as KVM drops all
relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures
that the guest can't dirty a folio after access has been removed.

And because KVM already marks folios dirty when zapping/modifying SPTEs
for KVM reasons, i.e. not in response to an mmu_notifier invalidation,
there is no danger of "prematurely" marking a folio dirty.  E.g. if a
filesystems cleans a folio without first removing write access, then there
already exists races where KVM could mark a folio dirty before remote TLBs
are flushed, i.e. before guest writes are guaranteed to stop.  Furthermore,
x86 is literally the only architecture that marks folios dirty on the
backend; every other KVM architecture marks folios dirty at map time.

x86's unique behavior likely stems from the fact that x86's MMU predates
mmu_notifiers.  Long, long ago, before mmu_notifiers were added, marking
pages dirty when zapping SPTEs was logical, and perhaps even necessary, as
KVM held references to pages, i.e. kept a page's refcount elevated while
the page was mapped into the guest.  At the time, KVM's rmap_remove()
simply did:

        if (is_writeble_pte(*spte))
                kvm_release_pfn_dirty(pfn);
        else
                kvm_release_pfn_clean(pfn);

i.e. dropped the refcount and marked the page dirty at the same time.
After mmu_notifiers were introduced, commit acb66dd051 ("KVM: MMU:
don't hold pagecount reference for mapped sptes pages") removed the
refcount logic, but kept the dirty logic, i.e. converted the above to:

	if (is_writeble_pte(*spte))
		kvm_release_pfn_dirty(pfn);

And for KVM x86, that's essentially how things have stayed over the last
~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at
zap/modification time, e.g. the behavior was blindly carried forward to
the TDP MMU.

Practically speaking, the only downside to marking a folio dirty during
mapping is that KVM could trigger writeback of memory that was never
actually written.  Except that can't actually happen if KVM marks folios
dirty if and only if a writable SPTE is created (as done here), because
KVM always marks writable SPTEs as dirty during make_spte().  See commit
9b51a63024 ("KVM: MMU: Explicitly set D-bit for writable spte."), circa
2015.

Note, KVM's access tracking logic for prefetched SPTEs is a bit odd.  If a
guest PTE is dirty and writable, KVM will create a writable SPTE, but then
mark the SPTE for access tracking.  Which isn't wrong, just a bit odd, as
it results in _more_ precise dirty tracking for MMUs _without_ A/D bits.

To keep things simple, mark the folio dirty before access tracking comes
into play, as an access-tracked SPTE can be restored in the fast page
fault path, i.e. without holding mmu_lock.  While writing SPTEs and
accessing memslots outside of mmu_lock is safe, marking a folio dirty is
not.  E.g. if the fast path gets interrupted _just_ after setting a SPTE,
the primary MMU could theoretically invalidate and free a folio before KVM
marks it dirty.  Unlike the shadow MMU, which waits for CPUs to respond to
an IPI, the TDP MMU only guarantees the page tables themselves won't be
freed (via RCU).

Opportunistically update a few stale comments.

Cc: David Matlack <dmatlack@google.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-9-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
4e44ab0a77 KVM: x86/mmu: Mark new SPTE as Accessed when synchronizing existing SPTE
Set the Accessed bit when making a "new" SPTE during SPTE synchronization,
as _clearing_ the Accessed bit is counter-productive, and even if the
Accessed bit wasn't set in the old SPTE, odds are very good the guest will
access the page in the near future, as the most common case where KVM
synchronizes a shadow-present SPTE is when the guest is making the gPTE
read-only for Copy-on-Write (CoW).

Preserving the Accessed bit will allow dropping the logic that propagates
the Accessed bit to the underlying struct page when overwriting an existing
SPTE, without undue risk of regressing page aging.

Note, KVM's current behavior is very deliberate, as SPTE synchronization
was the only "speculative" access type as of commit 947da53830 ("KVM:
MMU: Set the accessed bit on non-speculative shadow ptes").

But, much has changed since 2008, and more changes are on the horizon.
Spurious clearing of the Accessed (and Dirty) was mitigated by commit
e6722d9211 ("KVM: x86/mmu: Reduce the update to the spte in
FNAME(sync_spte)"), which changed FNAME(sync_spte) to only overwrite SPTEs
if the protections are actually changing.  I.e. KVM is already preserving
Accessed information for SPTEs that aren't dropping protections.

And with the aforementioned future change to NOT mark the page/folio as
accessed, KVM's SPTEs will become the "source of truth" so to speak, in
which case clearing the Accessed bit outside of page aging becomes very
undesirable.

Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-8-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
63c5754472 KVM: x86/mmu: Invert @can_unsync and renamed to @synchronizing
Invert the polarity of "can_unsync" and rename the parameter to
"synchronizing" to allow a future change to set the Accessed bit if KVM
is synchronizing an existing SPTE.  Querying "can_unsync" in that case is
nonsensical, as the fact that KVM can't unsync SPTEs doesn't provide any
justification for setting the Accessed bit.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-7-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
6385d01eec KVM: x86/mmu: Don't overwrite shadow-present MMU SPTEs when prefaulting
Treat attempts to prefetch/prefault MMU SPTEs as spurious if there's an
existing shadow-present SPTE, as overwriting a SPTE that may have been
create by a "real" fault is at best confusing, and at worst potentially
harmful.  E.g. mmu_try_to_unsync_pages() doesn't unsync when prefetching,
which creates a scenario where KVM could try to replace a Writable SPTE
with a !Writable SPTE, as sp->unsync is checked prior to acquiring
mmu_unsync_pages_lock.

Note, this applies to three of the four flavors of "prefetch" in KVM:

  - KVM_PRE_FAULT_MEMORY
  - Async #PF (host or PV)
  - Prefetching

The fourth flavor, SPTE synchronization, i.e. FNAME(sync_spte), _only_
overwrites shadow-present SPTEs when calling make_spte().  But SPTE
synchronization specifically uses mmu_spte_update(), and so naturally
avoids the @prefetch check in mmu_set_spte().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-6-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
2867eb782c KVM: x86/mmu: Skip the "try unsync" path iff the old SPTE was a leaf SPTE
Apply make_spte()'s optimization to skip trying to unsync shadow pages if
and only if the old SPTE was a leaf SPTE, as non-leaf SPTEs in direct MMUs
are always writable, i.e. could trigger a false positive and incorrectly
lead to KVM creating a SPTE without write-protecting or marking shadow
pages unsync.

This bug only affects the TDP MMU, as the shadow MMU only overwrites a
shadow-present SPTE when synchronizing SPTEs (and only 4KiB SPTEs can be
unsync).  Specifically, mmu_set_spte() drops any non-leaf SPTEs *before*
calling make_spte(), whereas the TDP MMU can do a direct replacement of a
page table with the leaf SPTE.

Opportunistically update the comment to explain why skipping the unsync
stuff is safe, as opposed to simply saying "it's someone else's problem".

Cc: stable@vger.kernel.org
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-5-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
037bc38b29 KVM: Drop KVM_ERR_PTR_BAD_PAGE and instead return NULL to indicate an error
Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just
a leftover bit of weirdness from days of old when KVM stuffed a "bad" page
into the guest instead of actually handling missing pages.  See commit
cea7bb2128 ("KVM: MMU: Make gfn_to_page() always safe").

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-2-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
f559b2e9c5 KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits
4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't
enforce 32-byte alignment of nCR3.

In the absolute worst case scenario, failure to ignore bits 4:0 can result
in an out-of-bounds read, e.g. if the target page is at the end of a
memslot, and the VMM isn't using guard pages.

Per the APM:

  The CR3 register points to the base address of the page-directory-pointer
  table. The page-directory-pointer table is aligned on a 32-byte boundary,
  with the low 5 address bits 4:0 assumed to be 0.

And the SDM's much more explicit:

  4:0    Ignored

Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow
that is broken.

Fixes: e4e517b4be ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Reported-by: Kirk Swidowski <swidowski@google.com>
Cc: Andy Nguyen <theflow@google.com>
Cc: 3pvd <3pvd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009140838.1036226-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:31:06 -04:00
Maxim Levitsky
731285fbb6 KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
Reset the segment cache after segment initialization in vmx_vcpu_reset()
to harden KVM against caching stale/uninitialized data.  Without the
recent fix to bypass the cache in kvm_arch_vcpu_put(), the following
scenario is possible:

 - vCPU is just created, and the vCPU thread is preempted before
   SS.AR_BYTES is written in vmx_vcpu_reset().

 - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() =>
   vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.

 - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't
   invoke vmx_segment_cache_clear() to invalidate the cache.

As a result, KVM retains a stale value in the cache, which can be read,
e.g. via KVM_GET_SREGS.  Usually this is not a problem because the VMX
segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM
selftests) reads and writes system registers just after the vCPU was
created, _without_ modifying SS.AR_BYTES, userspace will write back the
stale '0' value and ultimately will trigger a VM-Entry failure due to
incorrect SS segment type.

Invalidating the cache after writing the VMCS doesn't address the general
issue of cache accesses from IRQ context being unsafe, but it does prevent
KVM from clobbering the VMCS, i.e. mitigates the harm done _if_ KVM has a
bug that results in an unsafe cache access.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: 2fb92db1ec ("KVM: VMX: Cache vmcs segment fields")
[sean: rework changelog to account for previous patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009175002.1118178-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:31:06 -04:00
Sean Christopherson
28cf497881 KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
mmu_invalidate_in_progress is elevated, or that the range is being zapped
due to memslot removal (loosely detected by slots_lock being held).
Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
as KVM's page fault path snapshots state before acquiring mmu_lock, and
thus can create SPTEs with stale information if vCPUs aren't forced to
retry faults (due to seeing an in-progress or past MMU invalidation).

Memslot removal is a special case, as the memslot is retrieved outside of
mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
instead relies on SRCU synchronization to ensure any in-flight page faults
are fully resolved before zapping SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009192345.1148353-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:31:05 -04:00
Sean Christopherson
58a20a9435 KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
When performing a targeted zap on memslot removal, zap only MMU pages that
shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
unnecessary.  Furthermore, for_each_gfn_valid_sp() arguably shouldn't
exist, because it doesn't do what most people would it expect it to do.
The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
means that the exact gfn comparison will not get a match, even when a SP
does "cover" a gfn, or was even created specifically for a gfn.

For memslot deletion specifically, KVM's behavior will vary significantly
based on the size and alignment of a memslot, and in weird ways.  E.g. for
a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
it's only 4KiB aligned.  And as described below, zapping SPs in the
aligned case overzaps for direct MMUs, as odds are good the upper-level
SPs are serving other memslots.

To iterate over all potentially-relevant gfns, KVM would need to make a
pass over the hash table for each level, with the gfn used for lookup
rounded for said level.  And then check that the SP is of the correct
level, too, e.g. to avoid over-zapping.

But even then, KVM would massively overzap, as processing every level is
all but guaranteed to zap SPs that serve other memslots, especially if the
memslot being removed is relatively small.  KVM could mitigate that issue
by processing only levels that can be possible guest huge pages, i.e. are
less likely to be re-used for other memslot, but while somewhat logical,
that's quite arbitrary and would be a bit of a mess to implement.

So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
absolutely must be zapped.

Cc: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241009192345.1148353-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:08:17 -04:00
Paolo Bonzini
c8d430db8e KVM/arm64 fixes for 6.12, take #1
- Fix pKVM error path on init, making sure we do not change critical
   system registers as we're about to fail
 
 - Make sure that the host's vector length is at capped by a value
   common to all CPUs
 
 - Fix kvm_has_feat*() handling of "negative" features, as the current
   code is pretty broken
 
 - Promote Joey to the status of official reviewer, while James steps
   down -- hopefully only temporarly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmb++hkACgkQI9DQutE9
 ekNDyQ/9GwamcXC4KfYFtfQrcNRl/6RtlF/PFC0R6iiD1OoqNFHv2D/zscxtOj5a
 nw3gbof1Y59eND/6dubDzk82/A1Ff6bXpygybSQ6LG6Jba7H+01XxvvB0SMTLJ1S
 7hREe6m1EBHG/4VJk2Mx8iHJ7OjgZiTivojjZ1tY2Ez3nSUecL8prjqBFft3lAhg
 rFb20iJiijoZDgEjFZq/gWDxPq5m3N51tushqPRIMJ6wt8TeLYx3uUd2DTO0MzG/
 1K2vGbc1O6010jiR+PO3szi7uJFZfb58IsKCx7/w2e9AbzpYx4BXHKCax00DlGAP
 0PiuEMqG82UXR5a58UQrLC2aonh5VNj7J1Lk3qLb0NCimu6PdYWyIGNsKzAF/f4s
 tRVTRqcPr0RN/IIoX9vFjK3CKF9FcwAtctoO7IbxLKp+OGbPXk7Fk/gmhXKRubPR
 +4L4DCcARTcBflnWDzdLaz02fr13UfhM80mekJXlS1YHlSArCfbrsvjNrh4iL+G0
 UDamq8+8ereN0kT+ZM2jw3iw+DaF2kg24OEEfEQcBHZTS9HqBNVPplqqNSWRkjTl
 WSB79q1G6iOYzMUQdULP4vFRv1OePgJzg/voqMRZ6fUSuNgkpyXT0fLf5X12weq9
 NBnJ09Eh5bWfRIpdMzI1E1Qjfsm7E6hEa79DOnHmiLgSdVk3M9o=
 =Rtrz
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.12, take #1

- Fix pKVM error path on init, making sure we do not change critical
  system registers as we're about to fail

- Make sure that the host's vector length is at capped by a value
  common to all CPUs

- Fix kvm_has_feat*() handling of "negative" features, as the current
  code is pretty broken

- Promote Joey to the status of official reviewer, while James steps
  down -- hopefully only temporarly
2024-10-06 03:59:22 -04:00
Paolo Bonzini
ea4290d77b KVM: x86: leave kvm.ko out of the build if no vendor module is requested
kvm.ko is nothing but library code shared by kvm-intel.ko and kvm-amd.ko.
It provides no functionality on its own and it is unnecessary unless one
of the vendor-specific module is compiled.  In particular, /dev/kvm is
not created until one of kvm-intel.ko or kvm-amd.ko is loaded.

Use CONFIG_KVM to decide if it is built-in or a module, but use the
vendor-specific modules for the actual decision on whether to build it.

This also fixes a build failure when CONFIG_KVM_INTEL and CONFIG_KVM_AMD
are both disabled.  The cpu_emergency_register_virt_callback() function
is called from kvm.ko, but it is only defined if at least one of
CONFIG_KVM_INTEL and CONFIG_KVM_AMD is provided.

Fixes: 590b09b1d8 ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-06 03:53:41 -04:00
Paolo Bonzini
fcd1ec9cb5 KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
As was tried in commit 4e103134b8 ("KVM: x86/mmu: Zap only the relevant
pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs,
need to be zapped.  All of the accounting for a shadow page is tied to the
memslot, i.e. the shadow page holds a reference to the memslot, for all
intents and purposes.  Deleting the memslot without removing all relevant
shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled,
results in NULL pointer derefs when tearing down the VM.

Reintroduce from that commit the code that walks the whole memslot when
there are active shadow MMU pages.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03 18:51:13 -04:00
Linus Torvalds
3efc57369a x86:
* KVM currently invalidates the entirety of the page tables, not just
   those for the memslot being touched, when a memslot is moved or deleted.
   The former does not have particularly noticeable overhead, but Intel's
   TDX will require the guest to re-accept private pages if they are
   dropped from the secure EPT, which is a non starter.  Actually,
   the only reason why this is not already being done is a bug which
   was never fully investigated and caused VM instability with assigned
   GeForce GPUs, so allow userspace to opt into the new behavior.
 
 * Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10
   functionality that is on the horizon).
 
 * Rework common MSR handling code to suppress errors on userspace accesses to
   unsupported-but-advertised MSRs.  This will allow removing (almost?) all of
   KVM's exemptions for userspace access to MSRs that shouldn't exist based on
   the vCPU model (the actual cleanup is non-trivial future work).
 
 * Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the
   64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv)
   stores the entire 64-bit value at the ICR offset.
 
 * Fix a bug where KVM would fail to exit to userspace if one was triggered by
   a fastpath exit handler.
 
 * Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when
   there's already a pending wake event at the time of the exit.
 
 * Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest
   state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN
   (the SHUTDOWN hits the VM altogether, not the nested guest)
 
 * Overhaul the "unprotect and retry" logic to more precisely identify cases
   where retrying is actually helpful, and to harden all retry paths against
   putting the guest into an infinite retry loop.
 
 * Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in
   the shadow MMU.
 
 * Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for
   adding multi generation LRU support in KVM.
 
 * Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled,
   i.e. when the CPU has already flushed the RSB.
 
 * Trace the per-CPU host save area as a VMCB pointer to improve readability
   and cleanup the retrieval of the SEV-ES host save area.
 
 * Remove unnecessary accounting of temporary nested VMCB related allocations.
 
 * Set FINAL/PAGE in the page fault error code for EPT violations if and only
   if the GVA is valid.  If the GVA is NOT valid, there is no guest-side page
   table walk and so stuffing paging related metadata is nonsensical.
 
 * Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of
   emulating posted interrupt delivery to L2.
 
 * Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.
 
 * Harden eVMCS loading against an impossible NULL pointer deref (really truly
   should be impossible).
 
 * Minor SGX fix and a cleanup.
 
 * Misc cleanups
 
 Generic:
 
 * Register KVM's cpuhp and syscore callbacks when enabling virtualization in
   hardware, as the sole purpose of said callbacks is to disable and re-enable
   virtualization as needed.
 
 * Enable virtualization when KVM is loaded, not right before the first VM
   is created.  Together with the previous change, this simplifies a
   lot the logic of the callbacks, because their very existence implies
   virtualization is enabled.
 
 * Fix a bug that results in KVM prematurely exiting to userspace for coalesced
   MMIO/PIO in many cases, clean up the related code, and add a testcase.
 
 * Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_
   the gpa+len crosses a page boundary, which thankfully is guaranteed to not
   happen in the current code base.  Add WARNs in more helpers that read/write
   guest memory to detect similar bugs.
 
 Selftests:
 
 * Fix a goof that caused some Hyper-V tests to be skipped when run on bare
   metal, i.e. NOT in a VM.
 
 * Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest.
 
 * Explicitly include one-off assets in .gitignore.  Past Sean was completely
   wrong about not being able to detect missing .gitignore entries.
 
 * Verify userspace single-stepping works when KVM happens to handle a VM-Exit
   in its fastpath.
 
 * Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmb201AUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOM1gf+Ij7dpCh0KwoNYlHfW2aCHAv3PqQd
 cKMDSGxoCernbJEyPO/3qXNUK+p4zKedk3d92snW3mKa+cwxMdfthJ3i9d7uoNiw
 7hAgcfKNHDZGqAQXhx8QcVF3wgp+diXSyirR+h1IKrGtCCmjMdNC8ftSYe6voEkw
 VTVbLL+tER5H0Xo5UKaXbnXKDbQvWLXkdIqM8dtLGFGLQ2PnF/DdMP0p6HYrKf1w
 B7LBu0rvqYDL8/pS82mtR3brHJXxAr9m72fOezRLEUbfUdzkTUi/b1vEe6nDCl0Q
 i/PuFlARDLWuetlR0VVWKNbop/C/l4EmwCcKzFHa+gfNH3L9361Oz+NzBw==
 =Q7kz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull x86 kvm updates from Paolo Bonzini:
 "x86:

   - KVM currently invalidates the entirety of the page tables, not just
     those for the memslot being touched, when a memslot is moved or
     deleted.

     This does not traditionally have particularly noticeable overhead,
     but Intel's TDX will require the guest to re-accept private pages
     if they are dropped from the secure EPT, which is a non starter.

     Actually, the only reason why this is not already being done is a
     bug which was never fully investigated and caused VM instability
     with assigned GeForce GPUs, so allow userspace to opt into the new
     behavior.

   - Advertise AVX10.1 to userspace (effectively prep work for the
     "real" AVX10 functionality that is on the horizon)

   - Rework common MSR handling code to suppress errors on userspace
     accesses to unsupported-but-advertised MSRs

     This will allow removing (almost?) all of KVM's exemptions for
     userspace access to MSRs that shouldn't exist based on the vCPU
     model (the actual cleanup is non-trivial future work)

   - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC)
     splits the 64-bit value into the legacy ICR and ICR2 storage,
     whereas Intel (APICv) stores the entire 64-bit value at the ICR
     offset

   - Fix a bug where KVM would fail to exit to userspace if one was
     triggered by a fastpath exit handler

   - Add fastpath handling of HLT VM-Exit to expedite re-entering the
     guest when there's already a pending wake event at the time of the
     exit

   - Fix a WARN caused by RSM entering a nested guest from SMM with
     invalid guest state, by forcing the vCPU out of guest mode prior to
     signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the
     nested guest)

   - Overhaul the "unprotect and retry" logic to more precisely identify
     cases where retrying is actually helpful, and to harden all retry
     paths against putting the guest into an infinite retry loop

   - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping
     rmaps in the shadow MMU

   - Refactor pieces of the shadow MMU related to aging SPTEs in
     prepartion for adding multi generation LRU support in KVM

   - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is
     enabled, i.e. when the CPU has already flushed the RSB

   - Trace the per-CPU host save area as a VMCB pointer to improve
     readability and cleanup the retrieval of the SEV-ES host save area

   - Remove unnecessary accounting of temporary nested VMCB related
     allocations

   - Set FINAL/PAGE in the page fault error code for EPT violations if
     and only if the GVA is valid. If the GVA is NOT valid, there is no
     guest-side page table walk and so stuffing paging related metadata
     is nonsensical

   - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
     instead of emulating posted interrupt delivery to L2

   - Add a lockdep assertion to detect unsafe accesses of vmcs12
     structures

   - Harden eVMCS loading against an impossible NULL pointer deref
     (really truly should be impossible)

   - Minor SGX fix and a cleanup

   - Misc cleanups

  Generic:

   - Register KVM's cpuhp and syscore callbacks when enabling
     virtualization in hardware, as the sole purpose of said callbacks
     is to disable and re-enable virtualization as needed

   - Enable virtualization when KVM is loaded, not right before the
     first VM is created

     Together with the previous change, this simplifies a lot the logic
     of the callbacks, because their very existence implies
     virtualization is enabled

   - Fix a bug that results in KVM prematurely exiting to userspace for
     coalesced MMIO/PIO in many cases, clean up the related code, and
     add a testcase

   - Fix a bug in kvm_clear_guest() where it would trigger a buffer
     overflow _if_ the gpa+len crosses a page boundary, which thankfully
     is guaranteed to not happen in the current code base. Add WARNs in
     more helpers that read/write guest memory to detect similar bugs

  Selftests:

   - Fix a goof that caused some Hyper-V tests to be skipped when run on
     bare metal, i.e. NOT in a VM

   - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES
     guest

   - Explicitly include one-off assets in .gitignore. Past Sean was
     completely wrong about not being able to detect missing .gitignore
     entries

   - Verify userspace single-stepping works when KVM happens to handle a
     VM-Exit in its fastpath

   - Misc cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  Documentation: KVM: fix warning in "make htmldocs"
  s390: Enable KVM_S390_UCONTROL config in debug_defconfig
  selftests: kvm: s390: Add VM run test case
  KVM: SVM: let alternatives handle the cases when RSB filling is required
  KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
  KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
  KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
  KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
  KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
  KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
  KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
  KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
  KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
  KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
  KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
  KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
  KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
  KVM: x86: Update retry protection fields when forcing retry on emulation failure
  KVM: x86: Apply retry protection to "unprotect on failure" path
  KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
  ...
2024-09-28 09:20:14 -07:00
Linus Torvalds
f8ffbc365f struct fd layout change (and conversion to accessor helpers)
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
 592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
 =HUl5
 -----END PGP SIGNATURE-----

Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' updates from Al Viro:
 "Just the 'struct fd' layout change, with conversion to accessor
  helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
2024-09-23 09:35:36 -07:00
Paolo Bonzini
3f8df62852 Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.12:

 - Set FINAL/PAGE in the page fault error code for EPT Violations if and only
   if the GVA is valid.  If the GVA is NOT valid, there is no guest-side page
   table walk and so stuffing paging related metadata is nonsensical.

 - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of
   emulating posted interrupt delivery to L2.

 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.

 - Harden eVMCS loading against an impossible NULL pointer deref (really truly
   should be impossible).

 - Minor SGX fix and a cleanup.
2024-09-17 12:41:23 -04:00
Paolo Bonzini
55e6f8f29d Merge tag 'kvm-x86-svm-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.12:

 - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled,
   i.e. when the CPU has already flushed the RSB.

 - Trace the per-CPU host save area as a VMCB pointer to improve readability
   and cleanup the retrieval of the SEV-ES host save area.

 - Remove unnecessary accounting of temporary nested VMCB related allocations.
2024-09-17 12:41:13 -04:00
Paolo Bonzini
43d97b2ebd Merge tag 'kvm-x86-pat_vmx_msrs-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX and x86 PAT MSR macro cleanup for 6.12:

 - Add common defines for the x86 architectural memory types, i.e. the types
   that are shared across PAT, MTRRs, VMCSes, and EPTPs.

 - Clean up the various VMX MSR macros to make the code self-documenting
   (inasmuch as possible), and to make it less painful to add new macros.
2024-09-17 12:40:39 -04:00
Paolo Bonzini
5d55a052e3 Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.12:

 - Overhaul the "unprotect and retry" logic to more precisely identify cases
   where retrying is actually helpful, and to harden all retry paths against
   putting the guest into an infinite retry loop.

 - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in
   the shadow MMU.

 - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for
   adding MGLRU support in KVM.

 - Misc cleanups
2024-09-17 12:39:53 -04:00
Paolo Bonzini
41786cc5ea Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.12

 - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10
   functionality that is on the horizon).

 - Rework common MSR handling code to suppress errors on userspace accesses to
   unsupported-but-advertised MSRs.  This will allow removing (almost?) all of
   KVM's exemptions for userspace access to MSRs that shouldn't exist based on
   the vCPU model (the actual cleanup is non-trivial future work).

 - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the
   64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv)
   stores the entire 64-bit value a the ICR offset.

 - Fix a bug where KVM would fail to exit to userspace if one was triggered by
   a fastpath exit handler.

 - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when
   there's already a pending wake event at the time of the exit.

 - Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of
   guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is
   supposed to hit L1, not L2).
2024-09-17 11:38:23 -04:00
Paolo Bonzini
c09dd2bb57 Merge branch 'kvm-redo-enable-virt' into HEAD
Register KVM's cpuhp and syscore callbacks when enabling virtualization in
hardware, as the sole purpose of said callbacks is to disable and re-enable
virtualization as needed.

The primary motivation for this series is to simplify dealing with enabling
virtualization for Intel's TDX, which needs to enable virtualization
when kvm-intel.ko is loaded, i.e. long before the first VM is created.

That said, this is a nice cleanup on its own.  By registering the callbacks
on-demand, the callbacks themselves don't need to check kvm_usage_count,
because their very existence implies a non-zero count.

Patch 1 (re)adds a dedicated lock for kvm_usage_count.  This avoids a
lock ordering issue between cpus_read_lock() and kvm_lock.  The lock
ordering issue still exist in very rare cases, and will be fixed for
good by switching vm_list to an (S)RCU-protected list.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-17 11:38:20 -04:00
Paolo Bonzini
55f50b2f86 Merge branch 'kvm-memslot-zap-quirk' into HEAD
Today whenever a memslot is moved or deleted, KVM invalidates the entire
page tables and generates fresh ones based on the new memslot layout.

This behavior traditionally was kept because of a bug which was never
fully investigated and caused VM instability with assigned GeForce
GPUs.  It generally does not have a huge overhead, because the old
MMU is able to reuse cached page tables and the new one is more
scalabale and can resolve EPT violations/nested page faults in parallel,
but it has worse performance if the guest frequently deletes and
adds small memslots, and it's entirely not viable for TDX.  This is
because TDX requires re-accepting of private pages after page dropping.

For non-TDX VMs, this series therefore introduces the
KVM_X86_QUIRK_SLOT_ZAP_ALL quirk, enabling users to control the behavior
of memslot zapping when a memslot is moved/deleted.  The quirk is turned
on by default, leading to the zapping of all SPTEs when a memslot is
moved/deleted; users however have the option to turn off the quirk,
which limits the zapping only to those SPTEs hat lie within the range
of memslot being moved/deleted.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-17 11:38:19 -04:00
Linus Torvalds
d42f7708e2 Do not always honor guest PAT on CPUs that support self-snoop.
This triggers an issue in the bochsdrm driver, which used ioremap()
 instead of ioremap_wc() to map the video RAM.  The revert lets video
 RAM use the WB memory type instead of the slower UC memory type.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmbmhVcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPC8gf9GT8ynM3/+csYTKKvJ3acvhvFoIl0
 xzptkWh9tal7H1jPG+BFq44o8DbAcb9u5pxPng9ng5ojmPwticBRWt6dpWyKurTm
 WKT2JRCV/6/sPDu8WrMLli9c/9P85ETFAyAPyr4CO4/rPg173qtLT5zxWjsLi0xz
 ZtVdAdHj041skYH8REYyRm2zolq/PIj7TWWAYZVRWgX2AkQeRq//g51MpBgLfbYt
 BNL7TLqpaD3ZSNHXsTZDn3c1jh9VnPGFPa+QSq2a6JgPPqCuJzs7RpPwMzTRlkoT
 agdRf8Wj082u1kqMGCHLXHGQybevauLs+yQYRkojpxj774PPNzH3kEGOEA==
 =9Eo3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.11' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fix from Paolo Bonzini:
 "Do not always honor guest PAT on CPUs that support self-snoop.

  This triggers an issue in the bochsdrm driver, which used ioremap()
  instead of ioremap_wc() to map the video RAM.

  The revert lets video RAM use the WB memory type instead of the slower
  UC memory type"

* tag 'for-linus-6.11' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
2024-09-15 09:35:50 +02:00
Paolo Bonzini
9d70f3fec1 Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
This reverts commit 377b2f359d.

This caused a regression with the bochsdrm driver, which used ioremap()
instead of ioremap_wc() to map the video RAM.  After the commit, the
WB memory type is used without the IGNORE_PAT, resulting in the slower
UC memory type.  In fact, UC is slow enough to basically cause guests
to not boot... but only on new processors such as Sapphire Rapids and
Cascade Lake.  Coffee Lake for example works properly, though that might
also be an effect of being on a larger, more NUMA system.

The driver has been fixed but that does not help older guests.  Until we
figure out whether Cascade Lake and newer processors are working as
intended, revert the commit.  Long term we might add a quirk, but the
details depend on whether the processors are working as intended: for
example if they are, the quirk might reference bochs-compatible devices,
e.g. in the name and documentation, so that userspace can disable the
quirk by default and only leave it enabled if such a device is being
exposed to the guest.

If instead this is actually a bug in CLX+, then the actions we need to
take are different and depend on the actual cause of the bug.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-15 02:49:33 -04:00
Amit Shah
4440337af4 KVM: SVM: let alternatives handle the cases when RSB filling is required
Remove superfluous RSB filling after a VMEXIT when the CPU already has
flushed the RSB after a VMEXIT when AutoIBRS is enabled.

The initial implementation for adding RETPOLINES added an ALTERNATIVES
implementation for filling the RSB after a VMEXIT in commit 117cc7a908
("x86/retpoline: Fill return stack buffer on vmexit").

Later, X86_FEATURE_RSB_VMEXIT was added in commit 9756bba284
("x86/speculation: Fill RSB on vmexit for IBRS") to handle stuffing the
RSB if RETPOLINE=y *or* KERNEL_IBRS=y, i.e. to also stuff the RSB if the
kernel is configured to do IBRS mitigations on entry/exit.

The AutoIBRS (on AMD) feature implementation added in commit e7862eda30
("x86/cpu: Support AMD Automatic IBRS") used the already-implemented logic
for EIBRS in spectre_v2_determine_rsb_fill_type_on_vmexit() -- but did not
update the code at VMEXIT to act on the mode selected in that function --
resulting in VMEXITs continuing to clear the RSB when RETPOLINES are
enabled, despite the presence of AutoIBRS.

Signed-off-by: Amit Shah <amit.shah@amd.com>
Link: https://lore.kernel.org/r/20240807123531.69677-1-amit@kernel.org
[sean: massage changeloge, drop comment about AMD not needing RSB_VMEXIT_LITE]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-10 10:27:53 -07:00
Sean Christopherson
f300948251 KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
Set PFERR_GUEST_{FINAL,PAGE}_MASK based on EPT_VIOLATION_GVA_TRANSLATED if
and only if EPT_VIOLATION_GVA_IS_VALID is also set in exit qualification.
Per the SDM, bit 8 (EPT_VIOLATION_GVA_TRANSLATED) is valid if and only if
bit 7 (EPT_VIOLATION_GVA_IS_VALID) is set, and is '0' if bit 7 is '0'.

  Bit 7 (a.k.a. EPT_VIOLATION_GVA_IS_VALID)

  Set if the guest linear-address field is valid.  The guest linear-address
  field is valid for all EPT violations except those resulting from an
  attempt to load the guest PDPTEs as part of the execution of the MOV CR
  instruction and those due to trace-address pre-translation

  Bit 8 (a.k.a. EPT_VIOLATION_GVA_TRANSLATED)

  If bit 7 is 1:
    • Set if the access causing the EPT violation is to a guest-physical
      address that is the translation of a linear address.
    • Clear if the access causing the EPT violation is to a paging-structure
      entry as part of a page walk or the update of an accessed or dirty bit.
      Reserved if bit 7 is 0 (cleared to 0).

Failure to guard the logic on GVA_IS_VALID results in KVM marking the page
fault as PFERR_GUEST_PAGE_MASK when there is no known GVA, which can put
the vCPU into an infinite loop due to kvm_mmu_page_fault() getting false
positive on its PFERR_NESTED_GUEST_PAGE logic (though only because that
logic is also buggy/flawed).

In practice, this is largely a non-issue because so GVA_IS_VALID is almost
always set.  However, when TDX comes along, GVA_IS_VALID will *never* be
set, as the TDX Module deliberately clears bits 12:7 in exit qualification,
e.g. so that the faulting virtual address and other metadata that aren't
practically useful for the hypervisor aren't leaked to the untrusted host.

  When exit is due to EPT violation, bits 12-7 of the exit qualification
  are cleared to 0.

Fixes: eebed24389 ("kvm: nVMX: Add support for fast unprotection of nested guest page tables")
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:33:22 -07:00
Sean Christopherson
9a5bff7f5e KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
Use KVM_PAGES_PER_HPAGE() instead of open coding equivalent logic that is
anything but obvious.

No functional change intended, and verified by compiling with the below
assertions:

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_4K));

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_1G));

Link: https://lore.kernel.org/r/20240809194335.1726916-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:08 -07:00
Sean Christopherson
7645829145 KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
Replace all of the open coded '1' literals used to mark a PTE list as
having many/multiple entries with a proper define.  It's hard enough to
read the code with one magic bit, and a future patch to support "locking"
a single rmap will add another.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:07 -07:00
Sean Christopherson
7aac9dc680 KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
Fold mmu_spte_age() into its sole caller now that aging and testing for
young SPTEs is handled in a common location, i.e. doesn't require more
helpers.

Opportunistically remove the use of mmu_spte_get_lockless(), as mmu_lock
is held (for write!), and marking SPTEs for access tracking outside of
mmu_lock is unsafe (at least, as written).  I.e. using the lockless
accessor is quite misleading.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:06 -07:00
Sean Christopherson
c17f150000 KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
Rework kvm_handle_gfn_range() into an aging-specic helper,
kvm_rmap_age_gfn_range().  In addition to purging a bunch of unnecessary
boilerplate code, this sets the stage for aging rmap SPTEs outside of
mmu_lock.

Note, there's a small functional change, as kvm_test_age_gfn() will now
return immediately if a young SPTE is found, whereas previously KVM would
continue iterating over other levels.

Link: https://lore.kernel.org/r/20240809194335.1726916-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:05 -07:00
Sean Christopherson
548f87f667 KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
Convert kvm_unmap_gfn_range(), which is the helper that zaps rmap SPTEs in
response to an mmu_notifier invalidation, to use __kvm_rmap_zap_gfn_range()
and feed in range->may_block.  In other words, honor NEED_RESCHED by way of
cond_resched() when zapping rmaps.  This fixes a long-standing issue where
KVM could process an absurd number of rmap entries without ever yielding,
e.g. if an mmu_notifier fired on a PUD (or larger) range.

Opportunistically rename __kvm_zap_rmap() to kvm_zap_rmap(), and drop the
old kvm_zap_rmap().  Ideally, the shuffling would be done in a different
patch, but that just makes the compiler unhappy, e.g.

  arch/x86/kvm/mmu/mmu.c:1462:13: error: ‘kvm_zap_rmap’ defined but not used

Reported-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20240809194335.1726916-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:04 -07:00
Sean Christopherson
dd9eaad744 KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
Add a dedicated helper to walk and zap rmaps for a given memslot so that
the code can be shared between KVM-initiated zaps and mmu_notifier
invalidations.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:03 -07:00
Sean Christopherson
5b1fb116e1 KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
Add a @can_yield param to __walk_slot_rmaps() to control whether or not
dropping mmu_lock and conditionally rescheduling is allowed.  This will
allow using __walk_slot_rmaps() and thus cond_resched() to handle
mmu_notifier invalidations, which usually allow blocking/yielding, but not
when invoked by the OOM killer.

Link: https://lore.kernel.org/r/20240809194335.1726916-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:02 -07:00
Sean Christopherson
0a37fffda1 KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
Move walk_slot_rmaps() and friends up near for_each_slot_rmap_range() so
that the walkers can be used to handle mmu_notifier invalidations, and so
that similar function has some amount of locality in code.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:01 -07:00
Sean Christopherson
98a69b96ca KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
WARN if KVM gets an MMIO cache hit on a RET_PF_WRITE_PROTECTED fault, as
KVM should return RET_PF_WRITE_PROTECTED if and only if there is a memslot,
and creating a memslot is supposed to invalidate the MMIO cache by virtue
of changing the memslot generation.

Keep the code around mainly to provide a convenient location to document
why emulated MMIO should be impossible.

Suggested-by: Yuan Yao <yuan.yao@linux.intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:36 -07:00
Sean Christopherson
d859b16161 KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
Explicitly query the list of to-be-zapped shadow pages when checking to
see if unprotecting a gfn for retry has succeeded, i.e. if KVM should
retry the faulting instruction.

Add a comment to explain why the list needs to be checked before zapping,
which is the primary motivation for this change.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:35 -07:00
Sean Christopherson
6b3dcabc10 KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
Fold kvm_mmu_unprotect_page() into kvm_mmu_unprotect_gfn_and_retry() now
that all other direct usage is gone.

No functional change intended.

Link: https://lore.kernel.org/r/20240831001538.336683-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:34 -07:00
Sean Christopherson
2876624e1a KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
Rename reexecute_instruction() to kvm_unprotect_and_retry_on_failure() to
make the intent and purpose of the helper much more obvious.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:33 -07:00
Sean Christopherson
4df685664b KVM: x86: Update retry protection fields when forcing retry on emulation failure
When retrying the faulting instruction after emulation failure, refresh
the infinite loop protection fields even if no shadow pages were zapped,
i.e. avoid hitting an infinite loop even when retrying the instruction as
a last-ditch effort to avoid terminating the guest.

Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:32 -07:00
Sean Christopherson
dabc4ff70c KVM: x86: Apply retry protection to "unprotect on failure" path
Use kvm_mmu_unprotect_gfn_and_retry() in reexecute_instruction() to pick
up protection against infinite loops, e.g. if KVM somehow manages to
encounter an unsupported instruction and unprotecting the gfn doesn't
allow the vCPU to make forward progress.  Other than that, the retry-on-
failure logic is a functionally equivalent, open coded version of
kvm_mmu_unprotect_gfn_and_retry().

Note, the emulation failure path still isn't fully protected, as KVM
won't update the retry protection fields if no shadow pages are zapped
(but this change is still a step forward).  That flaw will be addressed
in a future patch.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:32 -07:00
Sean Christopherson
19ab2c8be0 KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
Don't bother unprotecting the target gfn if EMULTYPE_WRITE_PF_TO_SP is
set, as KVM will simply report the emulation failure to userspace.  This
will allow converting reexecute_instruction() to use
kvm_mmu_unprotect_gfn_instead_retry() instead of kvm_mmu_unprotect_page().

Link: https://lore.kernel.org/r/20240831001538.336683-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:31 -07:00
Sean Christopherson
6205257395 KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation
Drop the manual pfn look when retrying an instruction that KVM failed to
emulation in response to a #PF due to a write-protected gfn.  Now that KVM
sets EMULTYPE_ALLOW_RETRY_PF if and only if the page fault hit a write-
protected gfn, i.e. if and only if there's a writable memslot, there's no
need to redo the lookup to avoid retrying an instruction that failed on
emulated MMIO (no slot, or a write to a read-only slot).

I.e. KVM will never attempt to retry an instruction that failed on
emulated MMIO, whereas that was not the case prior to the introduction of
RET_PF_WRITE_PROTECTED.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:30 -07:00
Sean Christopherson
b299c273c0 KVM: x86/mmu: Move event re-injection unprotect+retry into common path
Move the event re-injection unprotect+retry logic into
kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
the #PF actually hit a write-protected gfn.  Note, there is a small
possibility that the gfn was unprotected by a different tasking between
hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
the guest immediately anyways because KVM will treat the fault as spurious.

As a bonus, unprotecting _after_ handling the page fault also addresses the
case where the installing a SPTE to handle fault encounters a shadowed PTE,
i.e. *creates* a read-only SPTE.

Opportunstically add a comment explaining what on earth the intent of the
code is, as based on the changelog from commit 577bdc4966 ("KVM: Avoid
instruction emulation when event delivery is pending").

Link: https://lore.kernel.org/r/20240831001538.336683-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:29 -07:00
Sean Christopherson
29e495bdf8 KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting
When getting a gpa from a gva to unprotect the associated gfn when an
event is awating reinjection, walk the guest PTEs for WRITE as there's no
point in unprotecting the gfn if the guest is unable to write the page,
i.e. if write-protection can't trigger emulation.

Note, the entire flow should be guarded on the access being a write, and
even better should be conditioned on actually triggering a write-protect
fault.  This will be addressed in a future commit.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:28 -07:00
Sean Christopherson
b7e948898e KVM: x86/mmu: Don't try to unprotect an INVALID_GPA
If getting the gpa for a gva fails, e.g. because the gva isn't mapped in
the guest page tables, don't try to unprotect the invalid gfn.  This is
mostly a performance fix (avoids unnecessarily taking mmu_lock), as
for_each_gfn_valid_sp_with_gptes() won't explode on garbage input, it's
simply pointless.

Link: https://lore.kernel.org/r/20240831001538.336683-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:27 -07:00
Sean Christopherson
2df354e37c KVM: x86: Fold retry_instruction() into x86_emulate_instruction()
Now that retry_instruction() is reasonably tiny, fold it into its sole
caller, x86_emulate_instruction().  In addition to getting rid of the
absurdly confusing retry_instruction() name, handling the retry in
x86_emulate_instruction() pairs it back up with the code that resets
last_retry_{eip,address}.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:26 -07:00
Sean Christopherson
41e6e367d5 KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()
Move the sanity checks for EMULTYPE_ALLOW_RETRY_PF to the top of
x86_emulate_instruction().  In addition to deduplicating a small amount
of code, this makes the connection between EMULTYPE_ALLOW_RETRY_PF and
EMULTYPE_PF even more explicit, and will allow dropping retry_instruction()
entirely.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:25 -07:00
Sean Christopherson
dfaae8447c KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
Try to unprotect shadow pages if and only if indirect_shadow_pages is non-
zero, i.e. iff there is at least one protected such shadow page.  Pre-
checking indirect_shadow_pages avoids taking mmu_lock for write when the
gfn is write-protected by a third party, i.e. not for KVM shadow paging,
and in the *extremely* unlikely case that a different task has already
unprotected the last shadow page.

Link: https://lore.kernel.org/r/20240831001538.336683-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:24 -07:00
Sean Christopherson
01dd4d3192 KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path
Move the anti-infinite-loop protection provided by last_retry_{eip,addr}
into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that
never hits the emulator, as well as reexecute_instruction(), which is the
last ditch "might as well try it" logic that kicks in when emulation fails
on an instruction that faulted on a write-protected gfn.

Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry
fields and deduplicate other code (with more to come).

Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:23 -07:00
Sean Christopherson
9c19129e53 KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry
Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not
an unsigned long.  This fixes a bug where 32-bit KVM would unprotect and
retry the wrong gfn if the gpa had bits 63:32!=0.  In practice, this bug
is functionally benign, as unprotecting the wrong gfn is purely a
performance issue (thanks to the anti-infinite-loop logic).  And of course,
almost no one runs 32-bit KVM these days.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:22 -07:00
Sean Christopherson
019f3f84a4 KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip
Read RIP from vCPU state instead of pulling it from the emulation context
when filling last_retry_eip, which is part of the anti-infinite-loop
protection used when unprotecting and retrying instructions that hit a
write-protected gfn.

This will allow reusing the anti-infinite-loop protection in flows that
never make it into the emulator.

No functional change intended, as ctxt->eip is set to kvm_rip_read() in
init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with
EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through
x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet).

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:21 -07:00
Sean Christopherson
c1edcc41c3 KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped
Resume the guest and thus skip emulation of a non-PTE-writing instruction
if and only if unprotecting the gfn actually zapped at least one shadow
page.  If the gfn is write-protected for some reason other than shadow
paging, attempting to unprotect the gfn will effectively fail, and thus
retrying the instruction is all but guaranteed to be pointless.  This bug
has existed for a long time, but was effectively fudged around by the
retry RIP+address anti-loop detection.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:21 -07:00
Sean Christopherson
2fb2b7877b KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
When doing "fast unprotection" of nested TDP page tables, skip emulation
if and only if at least one gfn was unprotected, i.e. continue with
emulation if simply resuming is likely to hit the same fault and risk
putting the vCPU into an infinite loop.

Note, it's entirely possible to get a false negative, e.g. if a different
vCPU faults on the same gfn and unprotects the gfn first, but that's a
relatively rare edge case, and emulating is still functionally ok, i.e.
saving a few cycles by avoiding emulation isn't worth the risk of putting
the vCPU into an infinite loop.

Opportunistically rewrite the relevant comment to document in gory detail
exactly what scenario the "fast unprotect" logic is handling.

Fixes: 147277540b ("kvm: svm: Add support for additional SVM NPF error codes")
Cc: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:20 -07:00
Sean Christopherson
989a84c93f KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
Trigger KVM's various "unprotect gfn" paths if and only if the page fault
was a write to a write-protected gfn.  To do so, add a new page fault
return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track
such page faults.

If a page fault requires emulation for any MMIO (or any reason besides
write-protection), trying to unprotect the gfn is pointless and risks
putting the vCPU into an infinite loop.  E.g. KVM will put the vCPU into
an infinite loop if the vCPU manages to trigger MMIO on a page table walk.

Fixes: 147277540b ("kvm: svm: Add support for additional SVM NPF error codes")
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:19 -07:00
Sean Christopherson
4ececec19a KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper
Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a
more appropriately named is_write_to_guest_page_table().  The macro name
is misleading, because while all nNPT walks match PAGE|WRITE|PRESENT, the
reverse is not true.

No functional change intended.

Link: https://lore.kernel.org/r/20240831001538.336683-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:18 -07:00
Sean Christopherson
3dde46a21a KVM: nVMX: Assert that vcpu->mutex is held when accessing secondary VMCSes
Add lockdep assertions in get_vmcs12() and get_shadow_vmcs12() to verify
the vCPU's mutex is held, as the returned VMCS objects are dynamically
allocated/freed when nested VMX is turned on/off, i.e. accessing vmcs12
structures without holding vcpu->mutex is susceptible to use-after-free.

Waive the assertion if the VM is being destroyed, as KVM currently forces
a nested VM-Exit when freeing the vCPU.  If/when that wart is fixed, the
assertion can/should be converted to an unqualified lockdep assertion.
See also https://lore.kernel.org/all/Zsd0TqCeY3B5Sb5b@google.com.

Link: https://lore.kernel.org/r/20240906043413.1049633-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:15:03 -07:00
Sean Christopherson
1ed0f119c5 KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at VM-Enter
Explicitly invalidate posted_intr_nv when emulating nested VM-Enter and
posted interrupts are disabled to make it clear that posted_intr_nv is
valid if and only if nested posted interrupts are enabled, and as a cheap
way to harden against KVM bugs.

KVM initializes posted_intr_nv to -1 at vCPU creation and resets it to -1
when unloading vmcs12 and/or leaving nested mode, i.e. this is not a bug
fix (or at least, it's not intended to be a bug fix).

Note, tracking nested.posted_intr_nv as a u16 subtly adds a measure of
safety, as it prevents unintentionally matching KVM's informal "no IRQ"
vector of -1, stored as a signed int.  Because a u16 can be always be
represented as a signed int, the effective "invalid" value of
posted_intr_nv, 65535, will be preserved as-is when comparing against an
int, i.e. will be zero-extended, not sign-extended, and thus won't get a
false positive if KVM is buggy and compares posted_intr_nv against -1.

Opportunistically add a comment in vmx_deliver_nested_posted_interrupt()
to call out that it must check vmx->nested.posted_intr_nv, not the vector
in vmcs12, which is presumably the _entire_ reason nested.posted_intr_nv
exists.  E.g. vmcs12 is a KVM-controlled snapshot, so there are no TOCTOU
races to worry about, the only potential badness is if the vCPU leaves
nested and frees vmcs12 between the sender checking is_guest_mode() and
dereferencing the vmcs12 pointer.

Link: https://lore.kernel.org/r/20240906043413.1049633-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:15:02 -07:00
Sean Christopherson
aa9477966a KVM: x86: Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt()
Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt() now that nVMX
essentially open codes kvm_get_apic_interrupt() in order to correctly
emulate nested posted interrupts.

Opportunistically stop exporting kvm_cpu_get_interrupt(), as the
aforementioned nVMX flow was the only user in vendor code.

Link: https://lore.kernel.org/r/20240906043413.1049633-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:15:01 -07:00
Sean Christopherson
6e0b456547 KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection
When synthensizing a nested VM-Exit due to an external interrupt, pend a
nested posted interrupt if the external interrupt vector matches L2's PI
notification vector, i.e. if the interrupt is a PI notification for L2.
This fixes a bug where KVM will incorrectly inject VM-Exit instead of
processing nested posted interrupt when IPI virtualization is enabled.

Per the SDM, detection of the notification vector doesn't occur until the
interrupt is acknowledge and deliver to the CPU core.

  If the external-interrupt exiting VM-execution control is 1, any unmasked
  external interrupt causes a VM exit (see Section 26.2). If the "process
  posted interrupts" VM-execution control is also 1, this behavior is
  changed and the processor handles an external interrupt as follows:

    1. The local APIC is acknowledged; this provides the processor core
       with an interrupt vector, called here the physical vector.
    2. If the physical vector equals the posted-interrupt notification
       vector, the logical processor continues to the next step. Otherwise,
       a VM exit occurs as it would normally due to an external interrupt;
       the vector is saved in the VM-exit interruption-information field.

For the most part, KVM has avoided problems because a PI NV for L2 that
arrives will L2 is active will be processed by hardware, and KVM checks
for a pending notification vector during nested VM-Enter.  Thus, to hit
the bug, the PI NV interrupt needs to sneak its way into L1's vIRR while
L2 is active.

Without IPI virtualization, the scenario is practically impossible to hit,
modulo L1 doing weird things (see below), as the ordering between
vmx_deliver_posted_interrupt() and nested VM-Enter effectively guarantees
that either the sender will see the vCPU as being in_guest_mode(), or the
receiver will see the interrupt in its vIRR.

With IPI virtualization, introduced by commit d588bb9be1 ("KVM: VMX:
enable IPI virtualization"), the sending CPU effectively implements a rough
equivalent of vmx_deliver_posted_interrupt(), sans the nested PI NV check.
If the target vCPU has a valid PID, the CPU will send a PI NV interrupt
based on _L1's_ PID, as the sender's because IPIv table points at L1 PIDs.

  PIR := 32 bytes at PID_ADDR;
  // under lock
  PIR[V] := 1;
  store PIR at PID_ADDR;
  // release lock

  NotifyInfo := 8 bytes at PID_ADDR + 32;
  // under lock
  IF NotifyInfo.ON = 0 AND NotifyInfo.SN = 0; THEN
    NotifyInfo.ON := 1;
    SendNotify := 1;
  ELSE
    SendNotify := 0;
  FI;
  store NotifyInfo at PID_ADDR + 32;
  // release lock

  IF SendNotify = 1; THEN
    send an IPI specified by NotifyInfo.NDST and NotifyInfo.NV;
  FI;

As a result, the target vCPU ends up receiving an interrupt on KVM's
POSTED_INTR_VECTOR while L2 is running, with an interrupt in L1's PIR for
L2's nested PI NV.  The POSTED_INTR_VECTOR interrupt triggers a VM-Exit
from L2 to L0, KVM moves the interrupt from L1's PIR to vIRR, triggers a
KVM_REQ_EVENT prior to re-entry to L2, and calls vmx_check_nested_events(),
effectively bypassing all of KVM's "early" checks on nested PI NV.

Without IPI virtualization, the bug can likely be hit only if L1 programs
an assigned device to _post_ an interrupt to L2's notification vector, by
way of L1's PID.PIR.  Doing so would allow the interrupt to get into L1's
vIRR without KVM checking vmcs12's NV.  Which is architecturally allowed,
but unlikely behavior for a hypervisor.

Cc: Zeng Guang <guang.zeng@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20240906043413.1049633-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:15:00 -07:00
Sean Christopherson
8c23670f2b KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQ
In the should-be-impossible scenario that kvm_cpu_get_interrupt() doesn't
return a valid vector after checking kvm_cpu_has_interrupt(), skip VM-Exit
injection to reduce the probability of crashing/confusing L1.  Now that
KVM gets the IRQ _before_ calling nested_vmx_vmexit(), squashing the
VM-Exit injection is trivial since there are no actions that need to be
undone.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20240906043413.1049633-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:14:59 -07:00
Sean Christopherson
363010e1dd KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from
nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one
and only path where KVM invokes nested_vmx_vmexit() with
EXIT_REASON_EXTERNAL_INTERRUPT.  A future fix will perform a last-minute
check on L2's nested posted interrupt notification vector, just before
injecting a nested VM-Exit.  To handle that scenario correctly, KVM needs
to get the interrupt _before_ injecting VM-Exit, as simply querying the
highest priority interrupt, via kvm_cpu_has_interrupt(), would result in
TOCTOU bug, as a new, higher priority interrupt could arrive between
kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().

Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't
suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in
kvm_get_apic_interrupt(), and acknowledging the interrupt before nested
VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01.

Open code a rough equivalent to kvm_cpu_get_interrupt() so that the IRQ
is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU
issue described above.

Opportunistically convert the WARN_ON() to a WARN_ON_ONCE().  If KVM has
a bug that results in a false positive from kvm_cpu_has_interrupt(),
spamming dmesg won't help the situation.

Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as
they are always "wanted" by L0.

Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:14:58 -07:00
Sean Christopherson
a194a3a13c KVM: x86: Move "ack" phase of local APIC IRQ delivery to separate API
Split the "ack" phase, i.e. the movement of an interrupt from IRR=>ISR,
out of kvm_get_apic_interrupt() and into a separate API so that nested
VMX can acknowledge a specific interrupt _after_ emulating a VM-Exit from
L2 to L1.

To correctly emulate nested posted interrupts while APICv is active, KVM
must:

  1. find the highest pending interrupt.
  2. check if that IRQ is L2's notification vector
  3. emulate VM-Exit if the IRQ is NOT the notification vector
  4. ACK the IRQ in L1 _after_ VM-Exit

When APICv is active, the process of moving the IRQ from the IRR to the
ISR also requires a VMWRITE to update vmcs01.GUEST_INTERRUPT_STATUS.SVI,
and so acknowledging the interrupt before switching to vmcs01 would result
in marking the IRQ as in-service in the wrong VMCS.

KVM currently fudges around this issue by doing kvm_get_apic_interrupt()
smack dab in the middle of emulating VM-Exit, but that hack doesn't play
nice with nested posted interrupts, as notification vector IRQs don't
trigger a VM-Exit in the first place.

Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240906043413.1049633-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:14:57 -07:00
Kai Huang
7efb4d8a39 KVM: VMX: Also clear SGX EDECCSSA in KVM CPU caps when SGX is disabled
When SGX EDECCSSA support was added to KVM in commit 16a7fe3728
("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest"), it
forgot to clear the X86_FEATURE_SGX_EDECCSSA bit in KVM CPU caps when
KVM SGX is disabled.  Fix it.

Fixes: 16a7fe3728 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240905120837.579102-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:13:55 -07:00
Yue Haibing
4ca077f26d KVM: x86: Remove some unused declarations
Commit 238adc7705 ("KVM: Cleanup LAPIC interface") removed
kvm_lapic_get_base() but leave declaration.

And other two declarations were never implenmented since introduction.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240830022537.2403873-1-yuehaibing@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:12:43 -07:00
Sean Christopherson
3f6821aa14 KVM: x86: Forcibly leave nested if RSM to L2 hits shutdown
Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if
RSM fails when resuming L2 (a.k.a. guest mode).  Architecturally, shutdown
on RSM occurs _before_ the transition back to guest mode on both Intel and
AMD.

On Intel, per the SDM pseudocode, SMRAM state is loaded before critical
VMX state:

  restore state normally from SMRAM;
  ...
  CR4.VMXE := value stored internally;
  IF internal storage indicates that the logical processor had been in
     VMX operation (root or non-root)
  THEN
     enter VMX operation (root or non-root);
     restore VMX-critical state as defined in Section 32.14.1;
     ...
     restore current VMCS pointer;
  FI;

AMD's APM is both less clearcut and more explicit.  Because AMD CPUs save
VMCB and guest state in SMRAM itself, given the lack of anything in the
APM to indicate a shutdown in guest mode is possible, a straightforward
reading of the clause on invalid state is that _what_ state is invalid is
irrelevant, i.e. all roads lead to shutdown.

  An RSM causes a processor shutdown if an invalid-state condition is
  found in the SMRAM state-save area.

This fixes a bug found by syzkaller where synthesizing shutdown for L2
led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn
caused KVM to complain about trying to cancel a nested VM-Enter (see
commit 759cbd5967 ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM
entry which is a result of RSM").

Note, Paolo pointed out that KVM shouldn't set nested_run_pending until
after loading SMRAM state.  But as above, that's only half the story, KVM
shouldn't transition to guest mode either.  Unfortunately, fixing that
mess requires rewriting the nVMX and nSVM RSM flows to not piggyback
their nested VM-Enter flows, as executing the nested VM-Enter flows after
loading state from SMRAM would clobber much of said state.

For now, add a FIXME to call out that transitioning to guest mode before
loading state from SMRAM is wrong.

Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@mail.gmail.com
Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
Reported-by: Zheyu Ma <zheyuma97@gmail.com>
Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
Analyzed-by: Michal Wilczynski <michal.wilczynski@intel.com>
Cc: Kishen Maloor <kishen.maloor@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906161337.1118412-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:09:49 -07:00
Linus Torvalds
9d4c304001 KVM: x86: don't fall through case statements without annotations
clang warns on this because it has an unannotated fall-through between
cases:

   arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]

and while we could annotate it as a fallthrough, the proper fix is to
just add the break for this case, instead of falling through to the
default case and the break there.

gcc also has that warning, but it looks like gcc only warns for the
cases where they fall through to "real code", rather than to just a
break.  Odd.

Fixes: d30d9ee94c ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-06 15:23:33 -07:00
Steven Rostedt
59cbd4eea4 KVM: Remove HIGH_RES_TIMERS dependency
Commit 92b5265d38 ("KVM: Depend on HIGH_RES_TIMERS") added a dependency
to high resolution timers with the comment:

    KVM lapic timer and tsc deadline timer based on hrtimer,
    setting a leftmost node to rb tree and then do hrtimer reprogram.
    If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
    do nothing and then make kvm lapic timer and tsc deadline timer fail.

That was back in 2012, where hrtimer_start_range_ns() would do the
reprogramming with hrtimer_enqueue_reprogram(). But as that was a nop with
high resolution timers disabled, this did not work. But a lot has changed
in the last 12 years.

For example, commit 49a2a07514 ("hrtimer: Kick lowres dynticks targets on
timer enqueue") modifies __hrtimer_start_range_ns() to work with low res
timers. There's been lots of other changes that make low res work.

ChromeOS has tested this before as well, and it hasn't seen any issues
with running KVM with high res timers disabled.  There could be problems,
especially at low HZ, for guests that do not support kvmclock and rely
on precise delivery of periodic timers to keep their clock running.
This can be the APIC timer (provided by the kernel), the RTC (provided
by userspace), or the i8254 (choice of kernel/userspace).  These guests
are few and far between these days, and in the case of the APIC timer +
Intel hosts we can use the preemption timer (which is TSC-based and has
better latency _and_ accuracy).

In KVM, only x86 is requiring CONFIG_HIGH_RES_TIMERS; perhaps a "depends
on HIGH_RES_TIMERS || EXPERT" could be added to virt/kvm, or a pr_warn
could be added to kvm_init if HIGH_RES_TIMERS are not enabled.  But in
general, it seems that there must be other code in the kernel (maybe
sound/?) that is relying on having high-enough HZ or hrtimers but that's
not documented anywhere.  Whenever you disable it you probably need to
know what you're doing and what your workload is; so the dependency is
not particularly interesting, and we can just remove it.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Message-ID: <20240821095127.45d17b19@gandalf.local.home>
[Added the last two paragraphs to the commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-05 12:04:54 -04:00
Sean Christopherson
590b09b1d8 KVM: x86: Register "emergency disable" callbacks when virt is enabled
Register the "disable virtualization in an emergency" callback just
before KVM enables virtualization in hardware, as there is no functional
need to keep the callbacks registered while KVM happens to be loaded, but
is inactive, i.e. if KVM hasn't enabled virtualization.

Note, unregistering the callback every time the last VM is destroyed could
have measurable latency due to the synchronize_rcu() needed to ensure all
references to the callback are dropped before KVM is unloaded.  But the
latency should be a small fraction of the total latency of disabling
virtualization across all CPUs, and userspace can set enable_virt_at_load
to completely eliminate the runtime overhead.

Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
There is no reason to force vendor code to do the registration, and either
way KVM would need a new kvm_x86_ops hook.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:34 -04:00
Sean Christopherson
0617a769ce KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM
Rename x86's the per-CPU vendor hooks used to enable virtualization in
hardware to align with the recently renamed arch hooks.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:33 -04:00
Sean Christopherson
071f24ad28 KVM: Rename arch hooks related to per-CPU virtualization enabling
Rename the per-CPU hooks used to enable virtualization in hardware to
align with the KVM-wide helpers in kvm_main.c, and to better capture that
the callbacks are invoked on every online CPU.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:33 -04:00
Tom Dohrmann
d30d9ee94c KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM
Until recently, KVM_CAP_READONLY_MEM was unconditionally supported on
x86, but this is no longer the case for SEV-ES and SEV-SNP VMs.

When KVM_CHECK_EXTENSION is invoked on a VM, only advertise
KVM_CAP_READONLY_MEM when it's actually supported.

Fixes: 66155de93b ("KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)")
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michael Roth <michael.roth@amd.com>
Signed-off-by: Tom Dohrmann <erbse.13@gmx.de>
Message-ID: <20240902144219.3716974-1-erbse.13@gmx.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-02 10:56:10 -04:00
Sean Christopherson
1876dd69df KVM: x86: Add fastpath handling of HLT VM-Exits
Add a fastpath for HLT VM-Exits by immediately re-entering the guest if
it has a pending wake event.  When virtual interrupt delivery is enabled,
i.e. when KVM doesn't need to manually inject interrupts, this allows KVM
to stay in the fastpath run loop when a vIRQ arrives between the guest
doing CLI and STI;HLT.  Without AMD's Idle HLT-intercept support, the CPU
generates a HLT VM-Exit even though KVM will immediately resume the guest.

Note, on bare metal, it's relatively uncommon for a modern guest kernel to
actually trigger this scenario, as the window between the guest checking
for a wake event and committing to HLT is quite small.  But in a nested
environment, the timings change significantly, e.g. rudimentary testing
showed that ~50% of HLT exits where HLT-polling was successful would be
serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets
a wake event before KVM schedules out the vCPU, the wake event was pending
even before the VM-Exit.

Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com
Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:22 -07:00
Sean Christopherson
70cdd23851 KVM: x86: Reorganize code in x86.c to co-locate vCPU blocking/running helpers
Shuffle code around in x86.c so that the various helpers related to vCPU
blocking/running logic are (a) located near each other and (b) ordered so
that HLT emulation can use kvm_vcpu_has_events() in a future path.

No functional change intended.

Link: https://lore.kernel.org/r/20240802195120.325560-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:21 -07:00
Sean Christopherson
f7f39c50ed KVM: x86: Exit to userspace if fastpath triggers one on instruction skip
Exit to userspace if a fastpath handler triggers such an exit, which can
happen when skipping the instruction, e.g. due to userspace
single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an
emulation failure.

Fixes: 404d5d7bff ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:21 -07:00
Sean Christopherson
ea60229af7 KVM: x86: Dedup fastpath MSR post-handling logic
Now that the WRMSR fastpath for x2APIC_ICR and TSC_DEADLINE are identical,
ignoring the backend MSR handling, consolidate the common bits of skipping
the instruction and setting the return value.

No functional change intended.

Link: https://lore.kernel.org/r/20240802195120.325560-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:21 -07:00
Sean Christopherson
0dd45f2cd8 KVM: x86: Re-enter guest if WRMSR(X2APIC_ICR) fastpath is successful
Re-enter the guest in the fastpath if WRMSR emulation for x2APIC's ICR is
successful, as no additional work is needed, i.e. there is no code unique
for WRMSR exits between the fastpath and the "!= EXIT_FASTPATH_NONE" check
in __vmx_handle_exit().

Link: https://lore.kernel.org/r/20240802195120.325560-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:21 -07:00
Sean Christopherson
32071fa355 KVM: SVM: Track the per-CPU host save area as a VMCB pointer
The host save area is a VMCB, track it as such to help readers follow
along, but mostly to cleanup/simplify the retrieval of the SEV-ES host
save area.

Note, the compile-time assertion that

  offsetof(struct vmcb, save) == EXPECTED_VMCB_CONTROL_AREA_SIZE

ensures that the SEV-ES save area is indeed at offset 0x400 (whoever added
the expected/architectural VMCB offsets apparently likes decimal).

No functional change intended.

Link: https://lore.kernel.org/r/20240802204511.352017-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:06:12 -07:00
Sean Christopherson
48547fe75e KVM: SVM: Add a helper to convert a SME-aware PA back to a struct page
Add __sme_pa_to_page() to pair with __sme_page_pa() and use it to replace
open coded equivalents, including for "iopm_base", which previously
avoided having to do __sme_clr() by storing the raw PA in the global
variable.

Opportunistically convert __sme_page_pa() to a helper to provide type
safety.

No functional change intended.

Link: https://lore.kernel.org/r/20240802204511.352017-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:06:12 -07:00
Sean Christopherson
1dc9cc1c4c KVM: x86/mmu: Reword a misleading comment about checking gpte_changed()
Rewrite the comment in FNAME(fetch) to explain why KVM needs to check that
the gPTE is still fresh before continuing the shadow page walk, even if
KVM already has a linked shadow page for the gPTE in question.

No functional change intended.

Link: https://lore.kernel.org/r/20240802203900.348808-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:05:55 -07:00
Sean Christopherson
7d67b03e6f KVM: x86/mmu: Drop pointless "return" wrapper label in FNAME(fetch)
Drop the pointless and poorly named "out_gpte_changed" label, in
FNAME(fetch), and instead return RET_PF_RETRY directly.

No functional change intended.

Link: https://lore.kernel.org/r/20240802203900.348808-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:05:55 -07:00
Sean Christopherson
174b6e4a25 KVM: x86/mmu: Decrease indentation in logic to sync new indirect shadow page
Combine the back-to-back if-statements for synchronizing children when
linking a new indirect shadow page in order to decrease the indentation,
and to make it easier to "see" the logic in its entirety.

No functional change intended.

Link: https://lore.kernel.org/r/20240802203900.348808-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:05:55 -07:00
Sean Christopherson
73b42dc69b KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's
IPI virtualization support, but only for AMD.  While not stated anywhere
in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs
store the 64-bit ICR as two separate 32-bit values in ICR and ICR2.  When
IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled,
KVM needs to match CPU behavior as some ICR ICR writes will be handled by
the CPU, not by KVM.

Add a kvm_x86_ops knob to control the underlying format used by the CPU to
store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether
or not x2AVIC is enabled.  If KVM is handling all ICR writes, the storage
format for x2APIC mode doesn't matter, and having the behavior follow AMD
versus Intel will provide better test coverage and ease debugging.

Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 16:25:05 -07:00
Sean Christopherson
d33234342f KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode()
Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a
local helper to _read_ the x2APIC ICR can be added and used in the
nodecode path without needing a forward declaration.

No functional change intended.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 16:21:50 -07:00
Sean Christopherson
71bf395a27 KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bits
Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that
are must-be-zero on both Intel and AMD, i.e. any reserved bits other than
the BUSY bit, which Intel ignores and basically says is undefined.

KVM's xapic_state_test selftest has been fudging the bug since commit
4b88b1a518 ("KVM: selftests: Enhance handling WRMSR ICR register in
x2APIC mode"), which essentially removed the testcase instead of fixing
the bug.

WARN if the nodecode path triggers a #GP, as the CPU is supposed to check
reserved bits for ICR when it's partially virtualized.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240719235107.3023592-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 16:21:50 -07:00
Vitaly Kuznetsov
5fa9f0480c KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP
SEV-SNP support is present since commit 1dfe571c12 ("KVM: SEV: Add
initial SEV-SNP support") but Kconfig entry wasn't updated and still
mentions SEV and SEV-ES only. Add SEV-SNP there and, while on it, expand
'SEV' in the description as 'Encrypted VMs' is not what 'SEV' stands for.

No functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240828122111.160273-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-28 05:46:25 -07:00
Ravi Bangoria
54950bfe2b KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing
If host supports Bus Lock Detect, KVM advertises it to guests even if
SVM support is absent. Additionally, guest wouldn't be able to use it
despite guest CPUID bit being set. Fix it by unconditionally clearing
the feature bit in KVM cpu capability.

Reported-by: Jim Mattson <jmattson@google.com>
Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com
Fixes: 76ea438b4a ("KVM: X86: Expose bus lock debug exception to guest")
Cc: stable@vger.kernel.org
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240808062937.1149-4-ravi.bangoria@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:17:44 -07:00
Sean Christopherson
44dd0f5732 KVM: x86: Suppress userspace access failures on unsupported, "emulated" MSRs
Extend KVM's suppression of userspace MSR access failures to MSRs that KVM
reports as emulated, but are ultimately unsupported, e.g. if the VMX MSRs
are emulated by KVM, but are unsupported given the vCPU model.

Suggested-by: Weijiang Yang <weijiang.yang@intel.com>
Reviewed-by: Weijiang Yang <weijiang.yang@intel.com>
Link: https://lore.kernel.org/r/20240802181935.292540-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:07:39 -07:00
Sean Christopherson
64a5d7a109 KVM: x86: Suppress failures on userspace access to advertised, unsupported MSRs
Extend KVM's suppression of failures due to a userspace access to an
unsupported, but advertised as a "to save" MSR to all MSRs, not just those
that happen to reach the default case statements in kvm_get_msr_common()
and kvm_set_msr_common().  KVM's soon-to-be-established ABI is that if an
MSR is advertised to userspace, then userspace is allowed to read the MSR,
and write back the value that was read, i.e. why an MSR is unsupported
doesn't change KVM's ABI.

Practically speaking, this is very nearly a nop, as the only other paths
that return KVM_MSR_RET_UNSUPPORTED are {svm,vmx}_get_feature_msr(), and
it's unlikely, though not impossible, that userspace is using KVM_GET_MSRS
on unsupported MSRs.

The primary goal of moving the suppression to common code is to allow
returning KVM_MSR_RET_UNSUPPORTED as appropriate throughout KVM, without
having to manually handle the "is userspace accessing an advertised"
waiver.  I.e. this will allow formalizing KVM's ABI without incurring a
high maintenance cost.

Link: https://lore.kernel.org/r/20240802181935.292540-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:07:38 -07:00
Sean Christopherson
3adef90345 KVM: x86: Hoist x86.c's global msr_* variables up above kvm_do_msr_access()
Move the definitions of the various MSR arrays above kvm_do_msr_access()
so that kvm_do_msr_access() can query the arrays when handling failures,
e.g. to squash errors if userspace tries to read an MSR that isn't fully
supported, but that KVM advertised as being an MSR-to-save.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:07:37 -07:00
Sean Christopherson
1cec203498 KVM: x86: Funnel all fancy MSR return value handling into a common helper
Add a common helper, kvm_do_msr_access(), to invoke the "leaf" APIs that
are type and access specific, and more importantly to handle errors that
are returned from the leaf APIs.  I.e. turn kvm_msr_ignored_check() from a
a helper that is called on an error, into a trampoline that detects errors
*and* applies relevant side effects, e.g. logging unimplemented accesses.

Because the leaf APIs are used for guest accesses, userspace accesses, and
KVM accesses, and because KVM supports restricting access to MSRs from
userspace via filters, the error handling is subtly non-trivial.  E.g. KVM
has had at least one bug escape due to making each "outer" function handle
errors.  See commit 3376ca3f1a ("KVM: x86: Fix KVM_GET_MSRS stack info
leak").

Link: https://lore.kernel.org/r/20240802181935.292540-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:07:36 -07:00
Sean Christopherson
7075f16361 KVM: x86: Refactor kvm_get_feature_msr() to avoid struct kvm_msr_entry
Refactor kvm_get_feature_msr() to take the components of kvm_msr_entry as
separate parameters, along with a vCPU pointer, i.e. to give it the same
prototype as kvm_{g,s}et_msr_ignored_check().  This will allow using a
common inner helper for handling accesses to "regular" and feature MSRs.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:07:35 -07:00
Sean Christopherson
b848f24bd7 KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()
Rename all APIs related to feature MSRs from get_msr_feature() to
get_feature_msr().  The APIs get "feature MSRs", not "MSR features".
And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't
describe the helper itself.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:56 -07:00
Sean Christopherson
74c6c98a59 KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry
Refactor get_msr_feature() to take the index and data pointer as distinct
parameters in anticipation of eliminating "struct kvm_msr_entry" usage
further up the primary callchain.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:30 -07:00
Sean Christopherson
aaecae7b6a KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTED
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to
try and make it more clear that access was denied because the MSR itself
is unsupported/unknown.  "INVALID" is too ambiguous, as it could just as
easily mean the value for WRMSR as invalid.

Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that
_are_ actually implemented by KVM, e.g. if the MSR is unsupported because
an associated feature flag is not present in guest CPUID.

Opportunistically beef up the comments for the internal MSR error codes.

Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:29 -07:00
Sean Christopherson
b58b808cbe KVM: x86: Move MSR_TYPE_{R,W,RW} values from VMX to x86, as enums
Move VMX's MSR_TYPE_{R,W,RW} #defines to x86.h, as enums, so that they can
be used by common x86 code, e.g. instead of doing "bool write".

Opportunistically tweak the definitions to make it more obvious that the
values are bitmasks, not arbitrary ascending values.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:29 -07:00
Sean Christopherson
74a0e79df6 KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value
Inject a #GP if the guest attempts to change MSR_AMD64_DE_CFG from its
*current* value, not if the guest attempts to write a value other than
KVM's set of supported bits.  As per the comment and the changelog of the
original code, the intent is to effectively make MSR_AMD64_DE_CFG read-
only for the guest.

Opportunistically use a more conventional equality check instead of an
exclusive-OR check to detect attempts to change bits.

Fixes: d1d93fa90f ("KVM: SVM: Add MSR-based feature support for serializing LFENCE")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240802181935.292540-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:24 -07:00
Sean Christopherson
acf2923271 KVM: x86/mmu: Clean up function comments for dirty logging APIs
Rework the function comment for kvm_arch_mmu_enable_log_dirty_pt_masked()
into the body of the function, as it has gotten a bit stale, is harder to
read without the code context, and is the last source of warnings for W=1
builds in KVM x86 due to using a kernel-doc comment without documenting
all parameters.

Opportunistically subsume the functions comments for
kvm_mmu_write_protect_pt_masked() and kvm_mmu_clear_dirty_pt_masked(), as
there is no value in regurgitating similar information at a higher level,
and capturing the differences between write-protection and PML-based dirty
logging is best done in a common location.

No functional change intended.

Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20240802202006.340854-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:40:04 -07:00
Li Chen
e0183a42e3 KVM: x86: Use this_cpu_ptr() in kvm_user_return_msr_cpu_online
Use this_cpu_ptr() instead of open coding the equivalent in
kvm_user_return_msr_cpu_online.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://lore.kernel.org/r/87zfp96ojk.wl-me@linux.beauty
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:37:41 -07:00
Vitaly Kuznetsov
2ab637df5f KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load()
GCC 12.3.0 complains about a potential NULL pointer dereference in
evmcs_load() as hv_get_vp_assist_page() can return NULL. In fact, this
cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a
valid VP assist page and aborts enabling the feature otherwise. CPU
onlining path is also checked in vmx_hardware_enable().

To make the compiler happy and to future proof the code, add a KVM_BUG_ON()
sentinel. It doesn't seem to be possible (and logical) to observe
evmcs_load() happening without an active vCPU so it is presumed that
kvm_get_running_vcpu() can't return NULL.

No functional change intended.

Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240816130124.286226-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:18 -07:00
Maxim Levitsky
41ab0d59fa KVM: nVMX: Use vmx_segment_cache_clear() instead of open coded equivalent
In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of
setting segment_cache.bitmask directly.  Using the helper minimizes the
chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g.
if KVM ends up doing more than just zero the bitmask when purging the
cache.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240725175232.337266-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:17 -07:00
Sean Christopherson
653ea4489e KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter.  Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.

More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM.  On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.

Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter.  And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field.  But for the MSR lists, KVM
is blindly processing MSRs; the  MSR filters are the _only_ way for
userspace to deny access.

This partially reverts commit ac8d6cad3c ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:16 -07:00
Kai Huang
d9aa56edad KVM: VMX: Do not account for temporary memory allocation in ECREATE emulation
In handle_encls_ecreate(), a page is allocated to store a copy of SECS
structure used by the ENCLS[ECREATE] leaf from the guest.  This page is
only used temporarily and is freed after use in handle_encls_ecreate().

Don't account for the memory allocation of this page per [1].

Link: https://lore.kernel.org/kvm/b999afeb588eb75d990891855bc6d58861968f23.camel@intel.com/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240715101224.90958-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:15 -07:00
Qiang Liu
caf22c6dd3 KVM: VMX: Modify the BUILD_BUG_ON_MSG of the 32-bit field in the vmcs_check16 function
According to the SDM, the meaning of field bit 0 is:
Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit,
and natural-width fields. So there is no 32-bit high field here,
it should be a 32-bit field instead.

Signed-off-by: Qiang Liu <liuq131@chinatelecom.cn>
Link: https://lore.kernel.org/r/20240702064609.52487-1-liuq131@chinatelecom.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:15 -07:00
Yongqiang Liu
c501062bb2 KVM: SVM: Remove unnecessary GFP_KERNEL_ACCOUNT in svm_set_nested_state()
The fixed size temporary variables vmcb_control_area and vmcb_save_area
allocated in svm_set_nested_state() are released when the function exits.
Meanwhile, svm_set_nested_state() also have vcpu mutex held to avoid
massive concurrency allocation, so we don't need to set GFP_KERNEL_ACCOUNT.

Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Link: https://lore.kernel.org/r/20240821112737.3649937-1-liuyongqiang13@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:09 -07:00
Xin Li
566975f6ec KVM: nVMX: Use macros and #defines in vmx_restore_vmx_misc()
Use macros in vmx_restore_vmx_misc() instead of open coding everything
using BIT_ULL() and GENMASK_ULL().  Opportunistically split feature bits
and reserved bits into separate variables, and add a comment explaining
the subset logic (it's not immediately obvious that the set of feature
bits is NOT the set of _supported_ feature bits).

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog, drop #defines]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:54 -07:00
Xin Li
8f56b14e9f KVM: VMX: Open code VMX preemption timer rate mask in its accessor
Use vmx_misc_preemption_timer_rate() to get the rate in hardware_setup(),
and open code the rate's bitmask in vmx_misc_preemption_timer_rate() so
that the function looks like all the helpers that grab values from
VMX_BASIC and VMX_MISC MSR values.

No functional change intended.

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:54 -07:00
Sean Christopherson
dc1e67f70f KVM VMX: Move MSR_IA32_VMX_MISC bit defines to asm/vmx.h
Move the handful of MSR_IA32_VMX_MISC bit defines that are currently in
msr-indx.h to vmx.h so that all of the VMX_MISC defines and wrappers can
be found in a single location.

Opportunistically use BIT_ULL() instead of open coding hex values, add
defines for feature bits that are architecturally defined, and move the
defines down in the file so that they are colocated with the helpers for
getting fields from VMX_MISC.

No functional change intended.

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:53 -07:00
Sean Christopherson
92e648042c KVM: nVMX: Add a helper to encode VMCS info in MSR_IA32_VMX_BASIC
Add a helper to encode the VMCS revision, size, and supported memory types
in MSR_IA32_VMX_BASIC, i.e. when synthesizing KVM's supported BASIC MSR
value, and delete the now unused VMCS size and memtype shift macros.

For a variety of reasons, KVM has shifted (pun intended) to using helpers
to *get* information from the VMX MSRs, as opposed to defined MASK and
SHIFT macros for direct use.  Provide a similar helper for the nested VMX
code, which needs to *set* information, so that KVM isn't left with a mix
of SHIFT macros and dedicated helpers.

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:52 -07:00
Xin Li
c97b106fa8 KVM: nVMX: Use macros and #defines in vmx_restore_vmx_basic()
Use macros in vmx_restore_vmx_basic() instead of open coding everything
using BIT_ULL() and GENMASK_ULL().  Opportunistically split feature bits
and reserved bits into separate variables, and add a comment explaining
the subset logic (it's not immediately obvious that the set of feature
bits is NOT the set of _supported_ feature bits).

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog, drop #defines]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:51 -07:00
Xin Li
9df398ff7d KVM: VMX: Track CPU's MSR_IA32_VMX_BASIC as a single 64-bit value
Track the "basic" capabilities VMX MSR as a single u64 in vmcs_config
instead of splitting it across three fields, that obviously don't combine
into a single 64-bit value, so that KVM can use the macros that define MSR
bits using their absolute position.  Replace all open coded shifts and
masks, many of which are relative to the "high" half, with the appropriate
macro.

Opportunistically use VMX_BASIC_32BIT_PHYS_ADDR_ONLY instead of an open
coded equivalent, and clean up the related comment to not reference a
specific SDM section (to the surprise of no one, the comment is stale).

No functional change intended (though obviously the code generation will
be quite different).

Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:50 -07:00
Sean Christopherson
b6717d35d8 KVM: x86: Stuff vCPU's PAT with default value at RESET, not creation
Move the stuffing of the vCPU's PAT to the architectural "default" value
from kvm_arch_vcpu_create() to kvm_vcpu_reset(), guarded by !init_event,
to better capture that the default value is the value "Following Power-up
or Reset".  E.g. setting PAT only during creation would break if KVM were
to expose a RESET ioctl() to userspace (which is unlikely, but that's not
a good reason to have unintuitive code).

No functional change.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:48 -07:00
Sean Christopherson
beb2e44604 x86/cpu: KVM: Move macro to encode PAT value to common header
Move pat/memtype.c's PAT() macro to msr-index.h as PAT_VALUE(), and use it
in KVM to define the default (Power-On / RESET) PAT value instead of open
coding an inscrutable magic number.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240605231918.2915961-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:47 -07:00
Sean Christopherson
e7e80b66fb x86/cpu: KVM: Add common defines for architectural memory types (PAT, MTRRs, etc.)
Add defines for the architectural memory types that can be shoved into
various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs,
etc.  While most MSRs/registers support only a subset of all memory types,
the values themselves are architectural and identical across all users.

Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi
header, but add compile-time assertions to connect the dots (and sanity
check that the msr-index.h values didn't get fat-fingered).

Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the
EPTP holds a single memory type in 3 of its 64 bits; those bits just
happen to be 2:0, i.e. don't need to be shifted.

Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in
setup_vmcs_config().

No functional change intended.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:46 -07:00
Maxim Levitsky
dad1613e05 KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASE
If these msrs are read by the emulator (e.g due to 'force emulation' prefix),
SVM code currently fails to extract the corresponding segment bases,
and return them to the emulator.

Fix that.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240802151608.72896-3-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:34 -07:00
Sean Christopherson
4bcdd831d9 KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly
leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX
reads guest memory.

Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN
via sync_regs(), which already holds SRCU.  I.e. trying to precisely use
kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause
problems.  Acquiring SRCU isn't all that expensive, so for simplicity,
grab it unconditionally for KVM_SET_VCPU_EVENTS.

 =============================
 WARNING: suspicious RCU usage
 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted
 -----------------------------
 include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by repro/1071:
  #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm]

 stack backtrace:
 CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Call Trace:
  <TASK>
  dump_stack_lvl+0x7f/0x90
  lockdep_rcu_suspicious+0x13f/0x1a0
  kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm]
  kvm_vcpu_read_guest+0x3e/0x90 [kvm]
  nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel]
  load_vmcs12_host_state+0x432/0xb40 [kvm_intel]
  vmx_leave_nested+0x30/0x40 [kvm_intel]
  kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm]
  kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm]
  ? mark_held_locks+0x49/0x70
  ? kvm_vcpu_ioctl+0x7d/0x970 [kvm]
  ? kvm_vcpu_ioctl+0x497/0x970 [kvm]
  kvm_vcpu_ioctl+0x497/0x970 [kvm]
  ? lock_acquire+0xba/0x2d0
  ? find_held_lock+0x2b/0x80
  ? do_user_addr_fault+0x40c/0x6f0
  ? lock_release+0xb7/0x270
  __x64_sys_ioctl+0x82/0xb0
  do_syscall_64+0x6c/0x170
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 RIP: 0033:0x7ff11eb1b539
  </TASK>

Fixes: f7e570780e ("KVM: x86: Forcibly leave nested virt when SMM state is toggled")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:33 -07:00
Sean Christopherson
28cec7f08b KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs
Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to
fault-in SPTEs will fail miserably due to root.hpa pointing at garbage.

Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO
in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the
WARN isn't user-triggerable and won't run afoul of warn-on-panic because
the kernel would already be panicking.

  BUG: unable to handle page fault for address: 000029ffffffffe8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP
  CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm]
  RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206
  RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78
  RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080
  RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225
  R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080
  R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000
  FS:  00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0xcc/0x160 [kvm]
   kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm]
   kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm]
   kvm_vcpu_ioctl+0x761/0x8c0 [kvm]
   __x64_sys_ioctl+0x82/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Modules linked in: kvm_intel kvm
  CR2: 000029ffffffffe8
  ---[ end trace 0000000000000000 ]---

Fixes: 6e01b7601d ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Reported-by: syzbot+23786faffb695f17edaa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000002b84dc061dd73544@google.com
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: xingwei lee <xrivendell7@gmail.com>
Tested-by: yuxin wang <wang1315768607@163.com>
Link: https://lore.kernel.org/r/20240723000211.3352304-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:32 -07:00
Yan Zhao
e03a7caa53 KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename
Replace "removed" with "frozen" in comments as appropriate to complete the
rename of REMOVED_SPTE to FROZEN_SPTE.

Fixes: 964cea8171 ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20240712233438.518591-1-rick.p.edgecombe@intel.com
[sean: write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:31 -07:00
Tao Su
1c450ffef5 KVM: x86: Advertise AVX10.1 CPUID to userspace
Advertise AVX10.1 related CPUIDs, i.e. report AVX10 support bit via
CPUID.(EAX=07H, ECX=01H):EDX[bit 19] and new CPUID leaf 0x24H so that
guest OS and applications can query the AVX10.1 CPUIDs directly. Intel
AVX10 represents the first major new vector ISA since the introduction of
Intel AVX512, which will establish a common, converged vector instruction
set across all Intel architectures[1].

AVX10.1 is an early version of AVX10, that enumerates the Intel AVX512
instruction set at 128, 256, and 512 bits which is enabled on
Granite Rapids. I.e., AVX10.1 is only a new CPUID enumeration with no
new functionality.   New features, e.g. Embedded Rounding and Suppress
All Exceptions (SAE) will be introduced in AVX10.2.

Advertising AVX10.1 is safe because there is nothing to enable for AVX10.1,
i.e. it's purely a new way to enumerate support, thus there will never be
anything for the kernel to enable. Note just the CPUID checking is changed
when using AVX512 related instructions, e.g. if using one AVX512
instruction needs to check (AVX512 AND AVX512DQ), it can check
((AVX512 AND AVX512DQ) OR AVX10.1) after checking XCR0[7:5].

The versions of AVX10 are expected to be inclusive, e.g. version N+1 is
a superset of version N. Per the spec, the version can never be 0, just
advertise AVX10.1 if it's supported in hardware. Moreover, advertising
AVX10_{128,256,512} needs to land in the same commit as advertising basic
AVX10.1 support, otherwise KVM would advertise an impossible CPU model.
E.g. a CPU with AVX512 but not AVX10.1/512 is impossible per the SDM.

As more and more AVX related CPUIDs are added (it would have resulted in
around 40-50 CPUID flags when developing AVX10), the versioning approach
is introduced. But incrementing version numbers are bad for virtualization.
E.g. if AVX10.2 has a feature that shouldn't be enumerated to guests for
whatever reason, then KVM can't enumerate any "later" features either,
because the only way to hide the problematic AVX10.2 feature is to set the
version to AVX10.1 or lower[2]. But most AVX features are just passed
through and don't have virtualization controls, so AVX10 should not be
problematic in practice, so long as Intel honors their promise that future
versions will be supersets of past versions.

[1] https://cdrdv2.intel.com/v1/dl/getContent/784267
[2] https://lore.kernel.org/all/Zkz5Ak0PQlAN8DxK@google.com/

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240819062327.3269720-1-tao1.su@linux.intel.com
[sean: minor changelog tweaks]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:25 -07:00
Thorsten Blum
1448d4a935 KVM: x86: Optimize local variable in start_sw_tscdeadline()
Change the data type of the local variable this_tsc_khz to u32 because
virtual_tsc_khz is also declared as u32.

Since do_div() casts the divisor to u32 anyway, changing the data type
of this_tsc_khz to u32 also removes the following Coccinelle/coccicheck
warning reported by do_div.cocci:

  WARNING: do_div() does a 64-by-32 division, please consider using div64_ul instead

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/r/20240814203345.2234-2-thorsten.blum@toblux.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:24 -07:00
Yan Zhao
aa8d1f48d3 KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
VMs. Make sure KVM behave as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs.

The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
- when enabled:  Invalidate/zap all SPTEs ("zap-all"),
- when disabled: Precisely zap only the leaf SPTEs within the range of the
                 moving/deleting memory slot ("zap-slot-leafs-only").

"zap-all" is today's KVM behavior to work around a bug [1] where the
changing the zapping behavior of memslot move/deletion would cause VM
instability for VMs with an Nvidia GPU assigned; while
"zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
memory slot range, improving performance in certain scenarios [2], and
meeting the functional requirements for TDX.

Previous attempts to select "zap-slot-leafs-only" include a per-VM
capability approach [3] (which was not preferred because the root cause of
the bug remained unidentified) and a per-memslot flag approach [4]. Sean
and Paolo finally recommended the implementation of this quirk and
explained that it's the least bad option [5].

By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
"zap-all". Users have the option to disable the quirk to select
"zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
unaffected by this bug.

For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
always selected without user's opt-in, regardless of if the user opts for
"zap-all".
This is because it is assumed until proven otherwise that non-
KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
importantly, it's because TDX must have "zap-slot-leafs-only" always
selected. In TDX's case a memslot's GPA range can be a mixture of "private"
or "shared" memory. Shared is roughly analogous to how EPT is handled for
normal VMs, but private GPAs need lots of special treatment:
1) "zap-all" would require to zap private root page or non-leaf entries or
   at least leaf-entries beyond the deleting memslot scope. However, TDX
   demands that the root page of the private page table remains unchanged,
   with leaf entries being zapped before non-leaf entries, and any dropped
   private guest pages must be re-accepted by the guest.
2) if "zap-all" zaps only shared page tables, it would result in private
   pages still being mapped when the memslot is gone. This may affect even
   other processes if later the gmem fd was whole punched, causing the
   pages being freed on the host while still mapped in the TD, because
   there's no pgoff to the gfn information to zap the private page table
   after memslot is gone.

So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
complicating quirk disabling interface (current quirk disabling interface
is limited, no way to query quirks, or force them to be disabled).

Add a new function kvm_mmu_zap_memslot_leafs() to implement
"zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as
1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
   it be moved. It is only deleted by KVM when APICv is permanently
   inhibited.
2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
   APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on
   costly IPIs.

Suggested-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:29:11 -04:00
Sean Christopherson
4b7c3f6d04 KVM: x86: Make x2APIC ID 100% readonly
Ignore the userspace provided x2APIC ID when fixing up APIC state for
KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM.  Commit
a92e2543d6 ("KVM: x86: use hardware-compatible format for APIC ID
register"), which added the fixup, didn't intend to allow userspace to
modify the x2APIC ID.  In fact, that commit is when KVM first started
treating the x2APIC ID as readonly, apparently to fix some race:

 static inline u32 kvm_apic_id(struct kvm_lapic *apic)
 {
-       return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff;
+       /* To avoid a race between apic_base and following APIC_ID update when
+        * switching to x2apic_mode, the x2apic mode returns initial x2apic id.
+        */
+       if (apic_x2apic_mode(apic))
+               return apic->vcpu->vcpu_id;
+
+       return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
 }

Furthermore, KVM doesn't support delivering interrupts to vCPUs with a
modified x2APIC ID, but KVM *does* return the modified value on a guest
RDMSR and for KVM_GET_LAPIC.  I.e. no remotely sane setup can actually
work with a modified x2APIC ID.

Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map
calculation, which expects the LDR to align with the x2APIC ID.

  WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm]
  CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
  RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm]
  Call Trace:
   <TASK>
   kvm_apic_set_state+0x1cf/0x5b0 [kvm]
   kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm]
   kvm_vcpu_ioctl+0x663/0x8a0 [kvm]
   __x64_sys_ioctl+0xb8/0xf0
   do_syscall_64+0x56/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
  RIP: 0033:0x7fade8b9dd6f

Unfortunately, the WARN can still trigger for other CPUs than the current
one by racing against KVM_SET_LAPIC, so remove it completely.

Reported-by: Michal Luczaj <mhal@rbox.co>
Closes: https://lore.kernel.org/all/814baa0c-1eaa-4503-129f-059917365e80@rbox.co
Reported-by: Haoyu Wu <haoyuwu254@gmail.com>
Closes: https://lore.kernel.org/all/20240126161633.62529-1-haoyuwu254@gmail.com
Reported-by: syzbot+545f1326f405db4e1c3e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c2a6b9061cbca3c3@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240802202941.344889-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13 12:01:46 -04:00
Isaku Yamahata
15e1c3d659 KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
Use this_cpu_ptr() instead of open coding the equivalent in various
user return MSR helpers.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Message-ID: <20240802201630.339306-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13 10:24:37 -04:00
Yue Haibing
b098495e69 KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()
There is no caller in tree since introduction in commit b4f69df0f6 ("KVM:
x86: Make Hyper-V emulation optional")

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Message-ID: <20240803113233.128185-1-yuehaibing@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13 09:28:48 -04:00
Dan Carpenter
cd2d006065 KVM: SVM: Fix an error code in sev_gmem_post_populate()
The copy_from_user() function returns the number of bytes which it
was not able to copy.  Return -EFAULT instead.

Fixes: dee5a47cc7 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-ID: <20240612115040.2423290-4-dan.carpenter@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13 06:08:40 -04:00
Dan Carpenter
92b6c2f007 KVM: SVM: Fix uninitialized variable bug
If snp_lookup_rmpentry() fails then "assigned" is printed in the error
message but it was never initialized.  Initialize it to false.

Fixes: dee5a47cc7 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-ID: <20240612115040.2423290-3-dan.carpenter@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13 06:05:10 -04:00
Al Viro
1da91ea87a introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
	Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
	This commit converts (almost) all of f.file to
fd_file(f).  It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).

	NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12 22:00:43 -04:00
Paolo Bonzini
1773014a97 * fix latent bug in how usage of large pages is determined for
confidential VMs
 
 * fix "underline too short" in docs
 
 * eliminate log spam from limited APIC timer periods
 
 * disallow pre-faulting of memory before SEV-SNP VMs are initialized
 
 * delay clearing and encrypting private memory until it is added to
   guest page tables
 
 * this change also enables another small cleanup: the checks in
   SNP_LAUNCH_UPDATE that limit it to non-populated, private pages
   can now be moved in the common kvm_gmem_populate() function
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmar0uEUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMf9Af9EZ0k0HHltM+iUSqKW+hcfnyjRSlh
 MI2m8ZFF4Ra4a/H2CYWbUZSZd6U2TGQoy0cz8vN12uiaaRFSXHAzkoy1zhJGYujq
 ljCUx46Ovo6DDfA1ve9jPdHQNOKWy6Js8yheP+i58Pau1u9fWTewfvWnrwkMgnfD
 lkrSfnWhw7aBy7jTSd8KflRU/IugP2/ApsIhrjZZ9sFGncAwPBbb8NL/u5tI/l6f
 VDp1in5a5gk2PhVRVzvINUxNzhcyuQ0wC07N+B4H+3U0NLg4CwiTBJr/yz0OOWz6
 ThA20/fLTrs5jc2f5APk1EjGT8pqeMJYydI2FdqafSfY0PcTZJtXvzgdSw==
 =CwzF
 -----END PGP SIGNATURE-----

Merge branch 'kvm-fixes' into HEAD

* fix latent bug in how usage of large pages is determined for
  confidential VMs

* fix "underline too short" in docs

* eliminate log spam from limited APIC timer periods

* disallow pre-faulting of memory before SEV-SNP VMs are initialized

* delay clearing and encrypting private memory until it is added to
  guest page tables

* this change also enables another small cleanup: the checks in
  SNP_LAUNCH_UPDATE that limit it to non-populated, private pages
  can now be moved in the common kvm_gmem_populate() function
2024-08-02 12:33:43 -04:00
Ackerley Tng
aca0ec970d KVM: x86/mmu: fix determination of max NPT mapping level for private pages
The `if (req_max_level)` test was meant ignore req_max_level if
PG_LEVEL_NONE was returned. Hence, this function should return
max_level instead of the ignored req_max_level.

This is only a latent issue for now, since guest_memfd does not
support large pages.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Message-ID: <20240801173955.1975034-1-ackerleytng@google.com>
Fixes: f32fb32820 ("KVM: x86: Add hook for determining max NPT mapping level")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-01 14:13:11 -04:00
Paolo Bonzini
e4ee544792 KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns
This check is currently performed by sev_gmem_post_populate(), but it
applies to all callers of kvm_gmem_populate(): the point of the function
is that the memory is being encrypted and some work has to be done
on all the gfns in order to encrypt them.

Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior
to invoking the callback, and stop the operation if a shared page
is encountered.  Because CONFIG_KVM_PRIVATE_MEM in principle does
not require attributes, this makes kvm_gmem_populate() depend on
CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:15 -04:00
Paolo Bonzini
4b5f67120a KVM: extend kvm_range_has_memory_attributes() to check subset of attributes
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE,
KVM code such as kvm_mem_is_private() is written to expect their existence.
Allow using kvm_range_has_memory_attributes() as a multi-page version of
kvm_mem_is_private(), without it breaking later when more attributes are
introduced.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:15 -04:00
Paolo Bonzini
de80252414 KVM: guest_memfd: move check for already-populated page to common code
Do not allow populating the same page twice with startup data.  In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.

Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.

Races between calls to the ioctl are not possible because
kvm_gmem_populate() holds slots_lock and the VM should not be running.
But again, even if this worked on other confidential computing technology,
it doesn't matter to guest_memfd.c whether this is something fishy
such as missing synchronization in userspace, or rather something
intentional.  One of the racers wins, and the page is initialized by
either kvm_gmem_prepare_folio() or kvm_gmem_populate().

Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
the same errno that kvm_gmem_populate() is using.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Paolo Bonzini
7239ed7467 KVM: remove kvm_arch_gmem_prepare_needed()
It is enough to return 0 if a guest need not do any preparation.
This is in fact how sev_gmem_prepare() works for non-SNP guests,
and it extends naturally to Intel hosts: the x86 callback for
gmem_prepare is optional and returns 0 if not defined.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Paolo Bonzini
564429a6bd KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*
Add "ARCH" to the symbols; shortly, the "prepare" phase will include both
the arch-independent step to clear out contents left in the page by the
host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE.
For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Paolo Bonzini
5932ca411e KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:

  thread A, sev_gmem_post_populate: called
  thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
  thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
  thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
  RMP #PF

Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.

Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:

  "KVM maps memory as if the vCPU generated a stage-2 read page fault"

which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Jim Mattson
0005ca2076 KVM: x86: Eliminate log spam from limited APIC timer periods
SAP's vSMP MemoryONE continuously requests a local APIC timer period
less than 500 us, resulting in the following kernel log spam:

  kvm: vcpu 15: requested 70240 ns lapic timer period limited to 500000 ns
  kvm: vcpu 19: requested 52848 ns lapic timer period limited to 500000 ns
  kvm: vcpu 15: requested 70256 ns lapic timer period limited to 500000 ns
  kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns
  kvm: vcpu 9: requested 70208 ns lapic timer period limited to 500000 ns
  kvm: vcpu 9: requested 387520 ns lapic timer period limited to 500000 ns
  kvm: vcpu 9: requested 70160 ns lapic timer period limited to 500000 ns
  kvm: vcpu 66: requested 205744 ns lapic timer period limited to 500000 ns
  kvm: vcpu 9: requested 70224 ns lapic timer period limited to 500000 ns
  kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns
  limit_periodic_timer_frequency: 7569 callbacks suppressed
  ...

To eliminate this spam, change the pr_info_ratelimited() in
limit_periodic_timer_frequency() to pr_info_once().

Reported-by: James Houghton <jthoughton@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-ID: <20240724190640.2449291-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 13:33:25 -04:00
Linus Torvalds
2c9b351240 ARM:
* Initial infrastructure for shadow stage-2 MMUs, as part of nested
   virtualization enablement
 
 * Support for userspace changes to the guest CTR_EL0 value, enabling
   (in part) migration of VMs between heterogenous hardware
 
 * Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of
   the protocol
 
 * FPSIMD/SVE support for nested, including merged trap configuration
   and exception routing
 
 * New command-line parameter to control the WFx trap behavior under KVM
 
 * Introduce kCFI hardening in the EL2 hypervisor
 
 * Fixes + cleanups for handling presence/absence of FEAT_TCRX
 
 * Miscellaneous fixes + documentation updates
 
 LoongArch:
 
 * Add paravirt steal time support.
 
 * Add support for KVM_DIRTY_LOG_INITIALLY_SET.
 
 * Add perf kvm-stat support for loongarch.
 
 RISC-V:
 
 * Redirect AMO load/store access fault traps to guest
 
 * perf kvm stat support
 
 * Use guest files for IMSIC virtualization, when available
 
 ONE_REG support for the Zimop, Zcmop, Zca, Zcf, Zcd, Zcb and Zawrs ISA
 extensions is coming through the RISC-V tree.
 
 s390:
 
 * Assortment of tiny fixes which are not time critical
 
 x86:
 
 * Fixes for Xen emulation.
 
 * Add a global struct to consolidate tracking of host values, e.g. EFER
 
 * Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.
 
 * Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
 
 * Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.
 
 * Drop MTRR virtualization, and instead always honor guest PAT on CPUs
   that support self-snoop.
 
 * Update to the newfangled Intel CPU FMS infrastructure.
 
 * Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
   '0' and writes from userspace are ignored.
 
 * Misc cleanups
 
 x86 - MMU:
 
 * Small cleanups, renames and refactoring extracted from the upcoming
   Intel TDX support.
 
 * Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't
   hold leafs SPTEs.
 
 * Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager
   page splitting, to avoid stalling vCPUs when splitting huge pages.
 
 * Bug the VM instead of simply warning if KVM tries to split a SPTE that is
   non-present or not-huge.  KVM is guaranteed to end up in a broken state
   because the callers fully expect a valid SPTE, it's all but dangerous
   to let more MMU changes happen afterwards.
 
 x86 - AMD:
 
 * Make per-CPU save_area allocations NUMA-aware.
 
 * Force sev_es_host_save_area() to be inlined to avoid calling into an
   instrumentable function from noinstr code.
 
 * Base support for running SEV-SNP guests.  API-wise, this includes
   a new KVM_X86_SNP_VM type, encrypting/measure the initial image into
   guest memory, and finalizing it before launching it.  Internally,
   there are some gmem/mmu hooks needed to prepare gmem-allocated pages
   before mapping them into guest private memory ranges.
 
   This includes basic support for attestation guest requests, enough to
   say that KVM supports the GHCB 2.0 specification.
 
   There is no support yet for loading into the firmware those signing
   keys to be used for attestation requests, and therefore no need yet
   for the host to provide certificate data for those keys.  To support
   fetching certificate data from userspace, a new KVM exit type will be
   needed to handle fetching the certificate from userspace. An attempt to
   define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle
   this was introduced in v1 of this patchset, but is still being discussed
   by community, so for now this patchset only implements a stub version
   of SNP Extended Guest Requests that does not provide certificate data.
 
 x86 - Intel:
 
 * Remove an unnecessary EPT TLB flush when enabling hardware.
 
 * Fix a series of bugs that cause KVM to fail to detect nested pending posted
   interrupts as valid wake eents for a vCPU executing HLT in L2 (with
   HLT-exiting disable by L1).
 
 * KVM: x86: Suppress MMIO that is triggered during task switch emulation
 
   Explicitly suppress userspace emulated MMIO exits that are triggered when
   emulating a task switch as KVM doesn't support userspace MMIO during
   complex (multi-step) emulation.  Silently ignoring the exit request can
   result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to
   userspace for some other reason prior to purging mmio_needed.
 
   See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write exits if
   emulator detects exception") for more details on KVM's limitations with
   respect to emulated MMIO during complex emulator flows.
 
 Generic:
 
 * Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE,
   because the special casing needed by these pages is not due to just
   unmovability (and in fact they are only unmovable because the CPU cannot
   access them).
 
 * New ioctl to populate the KVM page tables in advance, which is useful to
   mitigate KVM page faults during guest boot or after live migration.
   The code will also be used by TDX, but (probably) not through the ioctl.
 
 * Enable halt poll shrinking by default, as Intel found it to be a clear win.
 
 * Setup empty IRQ routing when creating a VM to avoid having to synchronize
   SRCU when creating a split IRQCHIP on x86.
 
 * Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
   that arch code can use for hooking both sched_in() and sched_out().
 
 * Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
   truncating a bogus value from userspace, e.g. to help userspace detect bugs.
 
 * Mark a vCPU as preempted if and only if it's scheduled out while in the
   KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
   memory when retrieving guest state during live migration blackout.
 
 Selftests:
 
 * Remove dead code in the memslot modification stress test.
 
 * Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs.
 
 * Print the guest pseudo-RNG seed only when it changes, to avoid spamming the
   log for tests that create lots of VMs.
 
 * Make the PMU counters test less flaky when counting LLC cache misses by
   doing CLFLUSH{OPT} in every loop iteration.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmaZQB0UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkZwf/bv2jiENaLFNGPe/VqTKMQ6PHQLMG
 +sNHx6fJPP35gTM8Jqf0/7/ummZXcSuC1mWrzYbecZm7Oeg3vwNXHZ4LquwwX6Dv
 8dKcUzLbWDAC4WA3SKhi8C8RV2v6E7ohy69NtAJmFWTc7H95dtIQm6cduV2osTC3
 OEuHe1i8d9umk6couL9Qhm8hk3i9v2KgCsrfyNrQgLtS3hu7q6yOTR8nT0iH6sJR
 KE5A8prBQgLmF34CuvYDw4Hu6E4j+0QmIqodovg2884W1gZQ9LmcVqYPaRZGsG8S
 iDdbkualLKwiR1TpRr3HJGKWSFdc7RblbsnHRvHIZgFsMQiimh4HrBSCyQ==
 =zepX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Initial infrastructure for shadow stage-2 MMUs, as part of nested
     virtualization enablement

   - Support for userspace changes to the guest CTR_EL0 value, enabling
     (in part) migration of VMs between heterogenous hardware

   - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
     of the protocol

   - FPSIMD/SVE support for nested, including merged trap configuration
     and exception routing

   - New command-line parameter to control the WFx trap behavior under
     KVM

   - Introduce kCFI hardening in the EL2 hypervisor

   - Fixes + cleanups for handling presence/absence of FEAT_TCRX

   - Miscellaneous fixes + documentation updates

  LoongArch:

   - Add paravirt steal time support

   - Add support for KVM_DIRTY_LOG_INITIALLY_SET

   - Add perf kvm-stat support for loongarch

  RISC-V:

   - Redirect AMO load/store access fault traps to guest

   - perf kvm stat support

   - Use guest files for IMSIC virtualization, when available

  s390:

   - Assortment of tiny fixes which are not time critical

  x86:

   - Fixes for Xen emulation

   - Add a global struct to consolidate tracking of host values, e.g.
     EFER

   - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
     effective APIC bus frequency, because TDX

   - Print the name of the APICv/AVIC inhibits in the relevant
     tracepoint

   - Clean up KVM's handling of vendor specific emulation to
     consistently act on "compatible with Intel/AMD", versus checking
     for a specific vendor

   - Drop MTRR virtualization, and instead always honor guest PAT on
     CPUs that support self-snoop

   - Update to the newfangled Intel CPU FMS infrastructure

   - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
     it reads '0' and writes from userspace are ignored

   - Misc cleanups

  x86 - MMU:

   - Small cleanups, renames and refactoring extracted from the upcoming
     Intel TDX support

   - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
     that can't hold leafs SPTEs

   - Unconditionally drop mmu_lock when allocating TDP MMU page tables
     for eager page splitting, to avoid stalling vCPUs when splitting
     huge pages

   - Bug the VM instead of simply warning if KVM tries to split a SPTE
     that is non-present or not-huge. KVM is guaranteed to end up in a
     broken state because the callers fully expect a valid SPTE, it's
     all but dangerous to let more MMU changes happen afterwards

  x86 - AMD:

   - Make per-CPU save_area allocations NUMA-aware

   - Force sev_es_host_save_area() to be inlined to avoid calling into
     an instrumentable function from noinstr code

   - Base support for running SEV-SNP guests. API-wise, this includes a
     new KVM_X86_SNP_VM type, encrypting/measure the initial image into
     guest memory, and finalizing it before launching it. Internally,
     there are some gmem/mmu hooks needed to prepare gmem-allocated
     pages before mapping them into guest private memory ranges

     This includes basic support for attestation guest requests, enough
     to say that KVM supports the GHCB 2.0 specification

     There is no support yet for loading into the firmware those signing
     keys to be used for attestation requests, and therefore no need yet
     for the host to provide certificate data for those keys.

     To support fetching certificate data from userspace, a new KVM exit
     type will be needed to handle fetching the certificate from
     userspace.

     An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
     exit type to handle this was introduced in v1 of this patchset, but
     is still being discussed by community, so for now this patchset
     only implements a stub version of SNP Extended Guest Requests that
     does not provide certificate data

  x86 - Intel:

   - Remove an unnecessary EPT TLB flush when enabling hardware

   - Fix a series of bugs that cause KVM to fail to detect nested
     pending posted interrupts as valid wake eents for a vCPU executing
     HLT in L2 (with HLT-exiting disable by L1)

   - KVM: x86: Suppress MMIO that is triggered during task switch
     emulation

     Explicitly suppress userspace emulated MMIO exits that are
     triggered when emulating a task switch as KVM doesn't support
     userspace MMIO during complex (multi-step) emulation

     Silently ignoring the exit request can result in the
     WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
     for some other reason prior to purging mmio_needed

     See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write
     exits if emulator detects exception") for more details on KVM's
     limitations with respect to emulated MMIO during complex emulator
     flows

  Generic:

   - Rename the AS_UNMOVABLE flag that was introduced for KVM to
     AS_INACCESSIBLE, because the special casing needed by these pages
     is not due to just unmovability (and in fact they are only
     unmovable because the CPU cannot access them)

   - New ioctl to populate the KVM page tables in advance, which is
     useful to mitigate KVM page faults during guest boot or after live
     migration. The code will also be used by TDX, but (probably) not
     through the ioctl

   - Enable halt poll shrinking by default, as Intel found it to be a
     clear win

   - Setup empty IRQ routing when creating a VM to avoid having to
     synchronize SRCU when creating a split IRQCHIP on x86

   - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with
     a flag that arch code can use for hooking both sched_in() and
     sched_out()

   - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
     truncating a bogus value from userspace, e.g. to help userspace
     detect bugs

   - Mark a vCPU as preempted if and only if it's scheduled out while in
     the KVM_RUN loop, e.g. to avoid marking it preempted and thus
     writing guest memory when retrieving guest state during live
     migration blackout

  Selftests:

   - Remove dead code in the memslot modification stress test

   - Treat "branch instructions retired" as supported on all AMD Family
     17h+ CPUs

   - Print the guest pseudo-RNG seed only when it changes, to avoid
     spamming the log for tests that create lots of VMs

   - Make the PMU counters test less flaky when counting LLC cache
     misses by doing CLFLUSH{OPT} in every loop iteration"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
  crypto: ccp: Add the SNP_VLEK_LOAD command
  KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
  KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
  KVM: x86: Replace static_call_cond() with static_call()
  KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
  x86/sev: Move sev_guest.h into common SEV header
  KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
  KVM: x86: Suppress MMIO that is triggered during task switch emulation
  KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
  KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
  KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
  KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
  KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
  KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
  KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
  KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
  KVM: Document KVM_PRE_FAULT_MEMORY ioctl
  mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
  perf kvm: Add kvm-stat for loongarch64
  LoongArch: KVM: Add PV steal time support in guest side
  ...
2024-07-20 12:41:03 -07:00
Wei Wang
5d766508fd KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
Similar to kvm_x86_call(), kvm_pmu_call() is added to streamline the usage
of static calls of kvm_pmu_ops, which improves code readability.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-4-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:12 -04:00
Wei Wang
896046474f KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
Introduces kvm_x86_call(), to streamline the usage of static calls of
kvm_x86_ops. The current implementation of these calls is verbose and
could lead to alignment challenges. This makes the code susceptible to
exceeding the "80 columns per single line of code" limit as defined in
the coding-style document. Another issue with the existing implementation
is that the addition of kvm_x86_ prefix to hooks at the static_call sites
hinders code readability and navigation. kvm_x86_call() is added to
improve code readability and maintainability, while adhering to the coding
style guidelines.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:12 -04:00
Wei Wang
f4854bf741 KVM: x86: Replace static_call_cond() with static_call()
The use of static_call_cond() is essentially the same as static_call() on
x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace
it with static_call() to simplify the code.

Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-2-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:11 -04:00
Paolo Bonzini
bc9cd5a219 Merge branch 'kvm-6.11-sev-attestation' into HEAD
The GHCB 2.0 specification defines 2 GHCB request types to allow SNP guests
to send encrypted messages/requests to firmware: SNP Guest Requests and SNP
Extended Guest Requests. These encrypted messages are used for things like
servicing attestation requests issued by the guest. Implementing support for
these is required to be fully GHCB-compliant.

For the most part, KVM only needs to handle forwarding these requests to
firmware (to be issued via the SNP_GUEST_REQUEST firmware command defined
in the SEV-SNP Firmware ABI), and then forwarding the encrypted response to
the guest.

However, in the case of SNP Extended Guest Requests, the host is also
able to provide the certificate data corresponding to the endorsement key
used by firmware to sign attestation report requests. This certificate data
is provided by userspace because:

  1) It allows for different keys/key types to be used for each particular
     guest with requiring any sort of KVM API to configure the certificate
     table in advance on a per-guest basis.

  2) It provides additional flexibility with how attestation requests might
     be handled during live migration where the certificate data for
     source/dest might be different.

  3) It allows all synchronization between certificates and firmware/signing
     key updates to be handled purely by userspace rather than requiring
     some in-kernel mechanism to facilitate it. [1]

To support fetching certificate data from userspace, a new KVM exit type will
be needed to handle fetching the certificate from userspace. An attempt to
define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle this
was introduced in v1 of this patchset, but is still being discussed by
community, so for now this patchset only implements a stub version of SNP
Extended Guest Requests that does not provide certificate data, but is still
enough to provide compliance with the GHCB 2.0 spec.
2024-07-16 11:44:23 -04:00
Michael Roth
74458e4859 KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
Version 2 of GHCB specification added support for the SNP Extended Guest
Request Message NAE event. This event serves a nearly identical purpose
to the previously-added SNP_GUEST_REQUEST event, but for certain message
types it allows the guest to supply a buffer to be used for additional
information in some cases.

Currently the GHCB spec only defines extended handling of this sort in
the case of attestation requests, where the additional buffer is used to
supply a table of certificate data corresponding to the attestion
report's signing key. Support for this extended handling will require
additional KVM APIs to handle coordinating with userspace.

Whether or not the hypervisor opts to provide this certificate data is
optional. However, support for processing SNP_EXTENDED_GUEST_REQUEST
GHCB requests is required by the GHCB 2.0 specification for SNP guests,
so for now implement a stub implementation that provides an empty
certificate table to the guest if it supplies an additional buffer, but
otherwise behaves identically to SNP_GUEST_REQUEST.

Reviewed-by: Carlos Bilbao <carlos.bilbao.osdev@gmail.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240701223148.3798365-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 11:44:00 -04:00
Brijesh Singh
88caf544c9 KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
Version 2 of GHCB specification added support for the SNP Guest Request
Message NAE event. The event allows for an SEV-SNP guest to make
requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.

This is used by guests primarily to request attestation reports from
firmware. There are other request types are available as well, but the
specifics of what guest requests are being made generally does not
affect how they are handled by the hypervisor, which only serves as a
proxy for the guest requests and firmware responses.

Implement handling for these events.

When an SNP Guest Request is issued, the guest will provide its own
request/response pages, which could in theory be passed along directly
to firmware. However, these pages would need special care:

  - Both pages are from shared guest memory, so they need to be
    protected from migration/etc. occurring while firmware reads/writes
    to them. At a minimum, this requires elevating the ref counts and
    potentially needing an explicit pinning of the memory. This places
    additional restrictions on what type of memory backends userspace
    can use for shared guest memory since there would be some reliance
    on using refcounted pages.

  - The response page needs to be switched to Firmware-owned state
    before the firmware can write to it, which can lead to potential
    host RMP #PFs if the guest is misbehaved and hands the host a
    guest page that KVM is writing to for other reasons (e.g. virtio
    buffers).

Both of these issues can be avoided completely by using
separately-allocated bounce pages for both the request/response pages
and passing those to firmware instead. So that's the approach taken
here.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
[mdr: ensure FW command failures are indicated to guest, drop extended
 request handling to be re-written as separate patch, massage commit]
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240701223148.3798365-2-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 11:44:00 -04:00
Sean Christopherson
2a1fc7dc36 KVM: x86: Suppress MMIO that is triggered during task switch emulation
Explicitly suppress userspace emulated MMIO exits that are triggered when
emulating a task switch as KVM doesn't support userspace MMIO during
complex (multi-step) emulation.  Silently ignoring the exit request can
result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to
userspace for some other reason prior to purging mmio_needed.

See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write exits if
emulator detects exception") for more details on KVM's limitations with
respect to emulated MMIO during complex emulator flows.

Reported-by: syzbot+2fb9f8ed752c01bc9a3f@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240712144841.1230591-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 09:57:45 -04:00
Sean Christopherson
9fe17d2ada KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
Tweak the definition of make_huge_page_split_spte() to eliminate an
unnecessarily long line, and opportunistically initialize child_spte to
make it more obvious that the child is directly derived from the huge
parent.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240712151335.1242633-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 09:56:56 -04:00
Sean Christopherson
3d4415ed75 KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
Bug the VM instead of simply warning if KVM tries to split a SPTE that is
non-present or not-huge.  KVM is guaranteed to end up in a broken state as
the callers fully expect a valid SPTE, e.g. the shadow MMU will add an
rmap entry, and all MMUs will account the expected small page.  Returning
'0' is also technically wrong now that SHADOW_NONPRESENT_VALUE exists,
i.e. would cause KVM to create a potential #VE SPTE.

While it would be possible to have the callers gracefully handle failure,
doing so would provide no practical value as the scenario really should be
impossible, while the error handling would add a non-trivial amount of
noise.

Fixes: a3fe5dbda0 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled")
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240712151335.1242633-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 09:56:56 -04:00
Paolo Bonzini
208a352a54 KVM VMX changes for 6.11
- Remove an unnecessary EPT TLB flush when enabling hardware.
 
  - Fix a series of bugs that cause KVM to fail to detect nested pending posted
    interrupts as valid wake eents for a vCPU executing HLT in L2 (with
    HLT-exiting disable by L1).
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvX0ACgkQOlYIJqCj
 N/2Aiw/9Htwy4MfJ2zdTX0ypZx6CUAVY0B7R2q9LVaqlBBL02dLoNWn9ndf7J2pd
 TJKtp39sHzf342ghti/Za5+mZgRgXA9IjQ5cvcQQjfmjDdDODygEc12otISeSNqq
 uL2jbUZzzjbcQyUrXkeFptVcNFpaiOG0dFfvnoi1csWzXVf7t+CD+8/3kjVm2Qt7
 vQXkV4yN7tNiYOvaukfXP7Og9ALpF8g8ok3YmXVXDPMu7+R7G+P6j3mVWr9ABMPj
 LOmC+5Z/sscMFw1Io3XHuWoF5socQARXEzJNLCblDaw3GMlSj4LNxif2M/6B7bmR
 nQVtiegj9K1Fc3OGOqPJcAIRPI4O9nMmf7uOwvXmOlwDSk7rCxF/yPk7Cto2+UXm
 6mnLcH1l0/VaidW+a7rUAcDGIlWwgfw0F6tp2j6FdVl2Lx/IThcrkn0teLY1gAW8
 CMi/BfTBEXO5583O3+ZCAzVQzeKnWR3yqwJe0oSftB1/rPkPD8PQ39MH8LuJJJxi
 CN1W4R1/taQdOxMZqggDvS1biz7gwpjNGtnWsO9szAgMEXVjf2M1HOZVcT2e2997
 81xDMdZaJSfd26tm7PhWtQnVPqyMZ6vqqIiq7FlIbEEkAE75Kbg4fUn/4y4WRnh9
 3Gog6MZPu/MA5TbwvcZ/sy/CRfFu0HKm5q98oArhjSyU8C7oGeQ=
 =W1/6
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.11

 - Remove an unnecessary EPT TLB flush when enabling hardware.

 - Fix a series of bugs that cause KVM to fail to detect nested pending posted
   interrupts as valid wake eents for a vCPU executing HLT in L2 (with
   HLT-exiting disable by L1).

 - Misc cleanups
2024-07-16 09:56:41 -04:00
Paolo Bonzini
1229cbefa6 KVM SVM changes for 6.11
- Make per-CPU save_area allocations NUMA-aware.
 
  - Force sev_es_host_save_area() to be inlined to avoid calling into an
    instrumentable function from noinstr code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvOsACgkQOlYIJqCj
 N/29pQ/9FTbHZa5RFqBPWY2Q8TKkCsguzJVMM3xxqq4lIFCwoWhXp1GBjkuT/L8g
 oHbRqHgl+cFKJ3xHUF7idyXtRQPRKJ/Dz+zz6RgAEIoeij+pxmjrqhm+kDAbfBkh
 L2Oz83B+PTdof3h0lD1nqtf449O/2istn5acasn7JrXefv+AI/VvqnaL7iMpb8zg
 K0Pscwgqtzl2okiQP3jAQfK9DbLuoQ1yHPRPIajijxJr7zGjsacg4Iju849zgbSo
 F569gvqm3ILD9oTzrBKy+Xb7GLtRCIXjuaI88TKwiqhJ6huc+lhQR3z3ogdpK5Qa
 nAj+c2/qcWbLXVe0/0Owks8Htm4wRDOhO7AMQ2Nk8Cg98VT09V7AMGoW36civnP2
 7X2m4dyigDF504r4YJWHfyJ9sifcXkxocuPKBWOTfgK3kzu6SVVTpbFg8RnLgISt
 RqOR1uuh6dipDIUV/QoRnBkGATuhVOfCkt0V1ymonFpwwqTWZnW40QjvFmNGjE7M
 Z4sCtnMFPal78w1s5vQtJ7WKgbRs49GSEg4ib/+qaS8xVZKoT/cTMn9hBX93PtOw
 6HG+o8P8zq+JvtGegxHwKzUr/4mgo2B5wDGizT+2HZdpVM3pF0ILQIrJ+fgGL+lu
 SBdEbyiNf1LFx51y1qYDu4ZOiRs2NFg3FRzdKdC9wldCnmb9V7Q=
 =RxHs
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.11

 - Make per-CPU save_area allocations NUMA-aware.

 - Force sev_es_host_save_area() to be inlined to avoid calling into an
   instrumentable function from noinstr code.
2024-07-16 09:55:39 -04:00
Paolo Bonzini
cda231cd42 KVM x86/pmu changes for 6.11
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
    '0' and writes from userspace are ignored.
 
  - Update to the newfangled Intel CPU FMS infrastructure.
 
  - Use macros instead of open-coded literals to clean up KVM's manipulation of
    FIXED_CTR_CTRL MSRs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRu3oACgkQOlYIJqCj
 N/0f/Q/+PoRFKrr9ENwlVjxmq7DBJOzrEiht5EH89bQdpYL0pcmv6I+n+Z77o08X
 l49YFO2zVq26dMCe8EFDuQrZpqKjOS/qEc+/zTsLu4lx8NJD1gqYJLJryejgtdQI
 +GefPVIN11TvlDDjuuxSWgKUCAevk8s3PRe+zbUwlsHmw+GVky8dJoe71QbW27rK
 hL7Y2pOe5Y8MgRAadxlhm6QmgOnz3RKKYs9t/HMzi2gQP1TuvPxnYtMC3Gz5pVe+
 w3Ak7M4fh8Z7FbQsoNY5h3IdigG6eFrssqHX4QpCXr/G5L9vAgUmSR93/M8jLjNv
 wAkUulLx7vFeTlOXjqcEJSn0U6mX/48pt68vrPB5ES1Rx28RB5s9tzYXCGtCmSxv
 nHmMDc3YUbg6tp2hvliMqjsN0j5l2GQiX7LJwH2Ma9qQFlTHPmFwJGS4hciki6c5
 obCK2vXBoS1jyxrZx8qUhIcJl2oigv3hwihN2YqZ0Q4QDwllv8cw4BeABQByYR9x
 T91PQ0biiJ9vWCkALbyzYOpy+grdHCblwYW9+FM/qZBGH0ouPzDyZWPrRBLX12pH
 fEgDMB3vT9JqQ5tyafd0MHuAVlrDVHYEY+lmXplzFGKEFonBkN7HmDzAOKafuCuj
 GnIe0Sa1JnHVPNomx2dnG6Sku6/tPIfERuHEXrR9zkUJsacnfRY=
 =pJxP
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86/pmu changes for 6.11

 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
   '0' and writes from userspace are ignored.

 - Update to the newfangled Intel CPU FMS infrastructure.

 - Use macros instead of open-coded literals to clean up KVM's manipulation of
   FIXED_CTR_CTRL MSRs.
2024-07-16 09:55:15 -04:00
Paolo Bonzini
5c5ddf7107 KVM x86 MTRR virtualization removal
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
 hack, and instead always honor guest PAT on CPUs that support self-snoop.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj
 N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9
 Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9
 No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi
 V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K
 Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf
 KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn
 qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi
 yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR
 KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h
 5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn
 aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM=
 =yb10
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MTRR virtualization removal

Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16 09:54:57 -04:00
Paolo Bonzini
34b69edecb KVM x86 MMU changes for 6.11
- Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't
    hole leafs SPTEs.
 
  - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager
    page splitting to avoid stalling vCPUs when splitting huge pages.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuqAACgkQOlYIJqCj
 N/3wwQ//d1HyNn/INUq+KZzaNgPPRXB/phsyAiHg0N9w3COlB3WsPlVtX9u04mHe
 12O8IDUmK4TufxPYrxdfMJPRup0Ewb0BOu5+n6fGOBzfaeBDlLF/SbX65I4KaiNI
 taqfhBCW/4Jis9ESvrJpOZSdv7pAA2Q67aCKVoKrd4Vbrw/96lnrB1GLL662XzZ+
 b7jm8nANoJgY4dLm7MVm33aSDQU35EXVHDWC9eWiaJmuXnFf7guf0rLKD1zkmNTI
 fVpUmZUFI7pcZNMB8u7JNmMofx748yrDe/MT6GxcEoLky6YKLYSLv3tZfywO2OrO
 vBJagYd1dy3798QQOCyqvtqc4OyHzv5jmwyLiKLGVgtavhYUWhFQUVSNy3p003S/
 NfvLFOrT+cBAYE0D898bdoX0cvQQggdgC5UEXjzGaZAfG0TMRMv3klGSUS3NABnE
 owtdV/2qIRsC+bybLhqaYvib5zjDrZDtzUU6+2wt0ugWrvF4Qn/RdnFmOWedGJ51
 Mr0xwhL0wekKvO8QaF55b9JO8wyaN4UYrUkPLmuK3/AICPU1m9CfQmu2iIe1bdsd
 303X94LOmsKNRlWTe8SWj5xWrC8LUH0P4g56/gT36ye08tzy7dfmX7T/6VwkVxS4
 pGRFLhlV8rqxaCSDgJqs+EdpKhfGpo5LuBcwZzO1YQNcDxoKO0I=
 =5Mgp
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.11

 - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't
   hold leafs SPTEs.

 - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager
   page splitting to avoid stalling vCPUs when splitting huge pages.

 - Misc cleanups
2024-07-16 09:53:28 -04:00
Paolo Bonzini
5dcc1e7614 KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
    move "shadow_phys_bits" into the structure as "maxphyaddr".
 
  - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
    bus frequency, because TDX.
 
  - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
 
  - Clean up KVM's handling of vendor specific emulation to consistently act on
    "compatible with Intel/AMD", versus checking for a specific vendor.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj
 N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc
 l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t
 XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn
 c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9
 Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7
 OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm
 T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH
 +YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc
 lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk
 9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA
 xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY=
 =Cf2R
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.11

 - Add a global struct to consolidate tracking of host values, e.g. EFER, and
   move "shadow_phys_bits" into the structure as "maxphyaddr".

 - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.

 - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.

 - Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.

 - Misc cleanups
2024-07-16 09:53:05 -04:00
Paolo Bonzini
86014c1e20 KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
 
  - Setup empty IRQ routing when creating a VM to avoid having to synchronize
    SRCU when creating a split IRQCHIP on x86.
 
  - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
    that arch code can use for hooking both sched_in() and sched_out().
 
  - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
    truncating a bogus value from userspace, e.g. to help userspace detect bugs.
 
  - Mark a vCPU as preempted if and only if it's scheduled out while in the
    KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
    memory when retrieving guest state during live migration blackout.
 
  - A few minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj
 N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha
 vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe
 o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV
 PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT
 QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7
 GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+
 jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg
 2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe
 b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8
 GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO
 d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o=
 =BalU
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 6.11

 - Enable halt poll shrinking by default, as Intel found it to be a clear win.

 - Setup empty IRQ routing when creating a VM to avoid having to synchronize
   SRCU when creating a split IRQCHIP on x86.

 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
   that arch code can use for hooking both sched_in() and sched_out().

 - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
   truncating a bogus value from userspace, e.g. to help userspace detect bugs.

 - Mark a vCPU as preempted if and only if it's scheduled out while in the
   KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
   memory when retrieving guest state during live migration blackout.

 - A few minor cleanups
2024-07-16 09:51:36 -04:00
Paolo Bonzini
f4501e8bc8 KVM Xen:
Fix a bug where KVM fails to check the validity of an incoming userspace
 virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRtxMACgkQOlYIJqCj
 N/0B9A/+PeiWgW+AZ5Bl3YLLTMUi2Q1pKapT5hdNAthdDC72ewDDqtfXQ/Lcaubq
 j1ElrxXP5IjMBq7U65hT9hc/f6AxZat2LkTSBRM7uGrhE2UTF+/GmfJve8vUHZsA
 XtH9JRqNhBfr5EOkyh4AwvzvmzaiuJemPeLtQuEwlgaECs9byG1ILhudD/KiVStw
 3Tw4Zw2oKzpo3hiW+5/YZBEkpO+CM50ByP9/SEPUhnG8sCqlGzMEI27bOvDgEDQs
 vV2fIj992lrSa9OX6oA/xz3/UE1t3ruo8mHjP/r3+sxr1SnnwbmR0YbiGEVdAjJI
 Pllwx6E9Fxi8Nqq/6Dy4XxUhNG+G8+ozwvr8wbLHzXFIpCGZFN5xgweuHtdEg0x8
 mNxekTZrsLE58t5MpvTUMMHiHsn0KqtPqg1g3c2A7znZnzYxMHrIYyUm+uEOIZ/w
 Q93WT7s3SiaELgENsx3uda3Q0i2gKK3x7gbLk/9N2ciFZXgFZMwdZa7FivJr5OoT
 wp8r/btKTTaVGyn3x3y/Tum5XpyMNsKdxEKQ4n3aaIkfS2APfiHjQFyzkI2uLePG
 Tz/SZ/PT2rxpRuYL54mHzovPsIJ3NqBA+OTWCHKI/EwyWeyW9g05rZnmJar4ZYQ5
 pdHqWgGi0wMrPWG+GQv0FMBj7EIM4Z3dA4/BRyRObnMmp1XtIb4=
 =Vgz5
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.10-11' of https://github.com/kvm-x86/linux into HEAD

KVM Xen:

Fix a bug where KVM fails to check the validity of an incoming userspace
virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
2024-07-16 09:51:14 -04:00
Linus Torvalds
208c6772d3 - This is basically PeterZ's idea to nest the alternative macros to avoid the
need to "spell out" the number of alternates in an ALTERNATIVE_n() macro and
   thus have an ever-increasing complexity in those definitions.
 
   For ease of bisection, the old macros are converted to the new, nested
   variants in a step-by-step manner so that in case an issue is encountered
   during testing, one can pinpoint the place where it fails easier. Because
   debugging alternatives is a serious pain.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaU/+MACgkQEsHwGGHe
 VUpMGhAAqVbB3DZohv0Oa4BRRvaKFuQ3L7H0NTjK/pbT3EG+phol0zrHby2MnGjD
 HWXskps86n91QBB/06vAyZRimV/dvAPvlSKllsRx6ie3VCE4FJPzA4nTWQn/41dC
 HWamj78mQuSMgioLzIYdTY79KObtJcUw/X/xz+TTMemfkzkQxukKY7+Y71nZbuKi
 rUuCSrfAWNHQaIaoGs2JowGw7te7yNOtKQMCW5TdNLwvJfOAECuoLIFeiEcWHvoO
 uGl6FTABNLp26wmaeceUxdjBbTJcM3iV3joZQYED7B+mbJcU/a7tZw7I+mavPrbh
 Y6+EOn7rzR0wbcmj0iJ74TKr+uKDme/Qzm3YEKgGvJPj9tRjTDwxWRBnyTeCMbav
 NkKVwWTep8K+1qJtGVBwACY6iz89u3P8V5owD8O++KIPQa8rA0m8pN5gaU3PVYYQ
 D2UUdqXWIPIFoD4Sveb/WFU8OJKY+Nx7IK8KD03h5tiXW8MmGSa2e5b57gIfCLP7
 DbSHyCkTiqEdBrSM4/RaVVckD6NZ39M87H+iV51vYUkCYmODa/riMj0M7SVMi5Jo
 S/30jvdHEzWnmDBbOsn9d1XbvB5I+zz3BrcZQ2VSyBx+Y9m+SZ9qsyMEkOWw4uKM
 kfFp+XlYVWSnQFo7jY3UUIgVrU9dmdX1WxNX7/2HABDjg3MtWog=
 =tGap
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 alternatives updates from Borislav Petkov:
 "This is basically PeterZ's idea to nest the alternative macros to
  avoid the need to "spell out" the number of alternates in an
  ALTERNATIVE_n() macro and thus have an ever-increasing complexity in
  those definitions.

  For ease of bisection, the old macros are converted to the new, nested
  variants in a step-by-step manner so that in case an issue is
  encountered during testing, one can pinpoint the place where it fails
  easier.

  Because debugging alternatives is a serious pain"

* tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer
  x86/alternative: Replace the old macros
  x86/alternative: Convert the asm ALTERNATIVE_3() macro
  x86/alternative: Convert the asm ALTERNATIVE_2() macro
  x86/alternative: Convert the asm ALTERNATIVE() macro
  x86/alternative: Convert ALTERNATIVE_3()
  x86/alternative: Convert ALTERNATIVE_TERNARY()
  x86/alternative: Convert alternative_call_2()
  x86/alternative: Convert alternative_call()
  x86/alternative: Convert alternative_io()
  x86/alternative: Convert alternative_input()
  x86/alternative: Convert alternative_2()
  x86/alternative: Convert alternative()
  x86/alternatives: Add nested alternatives macros
  x86/alternative: Zap alternative_ternary()
2024-07-15 19:11:28 -07:00
Paolo Bonzini
c8b8b8190a LoongArch KVM changes for v6.11
1. Add ParaVirt steal time support.
 2. Add some VM migration enhancement.
 3. Add perf kvm-stat support for loongarch.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmaOS6UWHGNoZW5odWFj
 YWlAa2VybmVsLm9yZwAKCRAChivD8uImehejD/9pACGe3h3krXLcFVWXOFIu5Hpc
 5kQLP0lSPJ/o5Xs8t/oPLrnDX70z90wXI1LOmltc7h32MSwFa2l8COQh+sN5eJBQ
 PNyt7u7bMipp0yJS4Gl3LQQ5vklcGOSpQc/gbeXnVx8J/tz+Mo9YGGLIXVRXRM6W
 Ri8D2VVFiwzQQYeTpPo1u1Ob8C6mA4KOppwvhscMTM3vj4NMbsinBzRnR0lG0Tdw
 meFhxDPly1Ksxsbnj9UGO6UnEY0A2SLONs6MiO4y4DtoqoDlw/lbqFJuYo4vvbx1
 pxtjyirD/PX/wjslQFWUOuU0hMfAodera+JupZ5BZWfcG8FltA4DQfDsm/U9RjK/
 7gGNnr8Xk2/tp6+4AVV+HU2iTgRvq+mXCL72zSy2Y4r7ElBAANDfk4n+Zn/PWisn
 U9wwV8Ue7tVB15BRpRsg77NzBidiCFEe/6flWYiX2y24ke71gwDJBGUy8hMdKt6t
 4Cq8atsU0MvDAzfYMsK9JjskJp4UFq6wb1tXbbuADM4TDhnzlK6s6h3vM+pFlh/f
 my7fDH8/2qsCWhBDM4pmsJskVp+I1GOk/80RjTQISwx7iHktJWvxNYTaisK2fvD5
 Qs1IUWfNFbDX0Lr0QpN6j6X4rZkghR4R6XoFkd4nkicwi+UHVn3oK9GSqv24QJn9
 7+Ev3dfRTUYLd6mC4Q==
 =DpIK
 -----END PGP SIGNATURE-----

Merge tag 'loongarch-kvm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.11

1. Add ParaVirt steal time support.
2. Add some VM migration enhancement.
3. Add perf kvm-stat support for loongarch.
2024-07-12 11:24:12 -04:00
Paolo Bonzini
f3996d4d79 Merge branch 'kvm-prefault' into HEAD
Pre-population has been requested several times to mitigate KVM page faults
during guest boot or after live migration.  It is also required by TDX
before filling in the initial guest memory with measured contents.
Introduce it as a generic API.
2024-07-12 11:18:45 -04:00
Paolo Bonzini
6e01b7601d KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest
memory.  It can be called right after KVM_CREATE_VCPU creates a vCPU,
since at that point kvm_mmu_create() and kvm_init_mmu() are called and
the vCPU is ready to invoke the KVM page fault handler.

The helper function kvm_tdp_map_page() takes care of the logic to
process RET_PF_* return values and convert them to success or errno.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12 11:17:47 -04:00
Paolo Bonzini
58ef24699b KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
The guest memory population logic will need to know what page size or level
(4K, 2M, ...) is mapped.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12 11:17:36 -04:00
Sean Christopherson
f5e7f00cf1 KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
Move the accounting of the result of kvm_mmu_do_page_fault() to its
callers, as only pf_fixed is common to guest page faults and async #PFs,
and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12 11:17:35 -04:00
Sean Christopherson
5186ec223b KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault
handler, instead of conditionally bumping it in kvm_mmu_do_page_fault().
The "real" page fault handler is the only path that should ever increment
the number of taken page faults, as all other paths that "do page fault"
are by definition not handling faults that occurred in the guest.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12 11:17:35 -04:00
Borislav Petkov (AMD)
0d3db1f14a x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer
objtool complains:

  arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup
  vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup

Make sure %rSP is an output operand to the respective asm() statements.

The test_cc() hunk and ALT_OUTPUT_SP() courtesy of peterz. Also from him
add some helpful debugging info to the documentation.

Now on to the explanations:

tl;dr: The alternatives macros are pretty fragile.

If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp
reference for objtool so that a stack frame gets properly generated, the
inline asm input operand with positional argument 0 in clear_page():

	"0" (page)

gets "renumbered" due to the added

	: "+r" (current_stack_pointer), "=D" (page)

and then gcc says:

  ./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’

The fix is to use an explicit "D" constraint which points to a singleton
register class (gcc terminology) which ends up doing what is expected
here: the page pointer - input and output - should be in the same %rdi
register.

Other register classes have more than one register in them - example:
"r" and "=r" or "A":

  ‘A’
	The ‘a’ and ‘d’ registers.  This class is used for
	instructions that return double word results in the ‘ax:dx’
	register pair.  Single word values will be allocated either in
	‘ax’ or ‘dx’.

so using "D" and "=D" just works in this particular case.

And yes, one would say, sure, why don't you do "+D" but then:

  : "+r" (current_stack_pointer), "+D" (page)
  : [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms),
  : "cc", "memory", "rax", "rcx")

now find the Waldo^Wcomma which throws a wrench into all this.

Because that silly macro has an "input..." consume-all last macro arg
and in it, one is supposed to supply input *and* clobbers, leading to
silly syntax snafus.

Yap, they need to be cleaned up, one fine day...

Closes: https://lore.kernel.org/oe-kbuild-all/202406141648.jO9qNGLa-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local
2024-07-01 12:41:11 +02:00
Peng Hao
dd103407ca KVM: X86: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variables
Some variables allocated in kvm_arch_vcpu_ioctl are released when
the function exits, so there is no need to set GFP_KERNEL_ACCOUNT.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20240624012016.46133-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 09:52:16 -07:00
Dapeng Mi
f287bef6dd KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max number
Refine the macros which define maximum General Purpose (GP) and fixed
counter numbers.

Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the
maximum supported General Purpose (GP) counter number ambiguously across
Intel and AMD platforms. This would cause issues if AMD begins to support
more GP counters than Intel.

Thus a bunch of new macros including vendor specific and vendor
independent are introduced to replace the old macros. The vendor
independent macros are used in x86 common code to hide vendor difference
and eliminate the ambiguity.

No logic changes are introduced in this patch.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 09:12:16 -07:00
Sean Christopherson
45405155d8 KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject
WARN if a blocking vCPU is awakened by a valid wake event that KVM can't
inject, e.g. because KVM needs to complete a nested VM-enter, or needs to
re-inject an exception.  For the nested VM-Enter case, KVM is supposed to
clear "nested_run_pending" if L1 puts L2 into HLT, i.e. entering HLT
"completes" the nested VM-Enter.  And for already-injected exceptions, it
should be impossible for the vCPU to be in a blocking state if a VM-Exit
occurred while an exception was being vectored.

Link: https://lore.kernel.org/r/20240607172609.3205077-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:07 -07:00
Sean Christopherson
321ef62b0c KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is
pending delivery, in vmx_has_nested_events() and drop the one-off
kvm_x86_ops.guest_apic_has_interrupt() hook.

In addition to dropping a superfluous hook, this fixes a bug where KVM
would incorrectly treat virtual interrupts _for L2_ as always enabled due
to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(),
treating IRQs as enabled if L2 is active and vmcs12 is configured to exit
on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake
event based on L1's IRQ blocking status.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:06 -07:00
Sean Christopherson
27c4fa42b1 KVM: nVMX: Check for pending posted interrupts when looking for nested events
Check for pending (and notified!) posted interrupts when checking if L2
has a pending wake event, as fully posted/notified virtual interrupt is a
valid wake event for HLT.

Note that KVM must check vmx->nested.pi_pending to avoid prematurely
waking L2, e.g. even if KVM sees a non-zero PID.PIR and PID.0N=1, the
virtual interrupt won't actually be recognized until a notification IRQ is
received by the vCPU or the vCPU does (nested) VM-Enter.

Fixes: 26844fee6a ("KVM: x86: never write to memory from kvm_vcpu_check_block()")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reported-by: Jim Mattson <jmattson@google.com>
Closes: https://lore.kernel.org/all/20231207010302.2240506-1-jmattson@google.com
Link: https://lore.kernel.org/r/20240607172609.3205077-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:06 -07:00
Sean Christopherson
322a569c4b KVM: VMX: Split out the non-virtualization part of vmx_interrupt_blocked()
Move the non-VMX chunk of the "interrupt blocked" checks to a separate
helper so that KVM can reuse the code to detect if interrupts are blocked
for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake
event.  If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2
HLTs, then L2 virtual interrupts are valid wake events, but if and only if
interrupts are unblocked for L2.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:05 -07:00
Sean Christopherson
32f55e475c KVM: nVMX: Request immediate exit iff pending nested event needs injection
When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.

Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.

At present time, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:04 -07:00
Sean Christopherson
d83c36d822 KVM: nVMX: Add a helper to get highest pending from Posted Interrupt vector
Add a helper to retrieve the highest pending vector given a Posted
Interrupt descriptor.  While the actual operation is straightforward, it's
surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's
find_highest_vector(), which doesn't work with PID.PIR due to the APIC's
IRR and ISR component registers being physically discontiguous (they're
4-byte registers aligned at 16-byte intervals).

To make PIR handling more consistent with respect to IRR and ISR handling,
return -1 to indicate "no interrupt pending".

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:03 -07:00
Kai Huang
92c1e3cbf0 KVM: VMX: Switch __vmx_exit() and kvm_x86_vendor_exit() in vmx_exit()
In the vmx_init() error handling path, the __vmx_exit() is done before
kvm_x86_vendor_exit().  They should follow the same order in vmx_exit().

But currently __vmx_exit() is done after kvm_x86_vendor_exit() in
vmx_exit().  Switch the order of them to fix.

Fixes: e32b120071 ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240627010524.3732488-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:58:23 -07:00
Sean Christopherson
23b2c5088d KVM: VMX: Remove unnecessary INVEPT[GLOBAL] from hardware enable path
Remove the completely pointess global INVEPT, i.e. EPT TLB flush, from
KVM's VMX enablement path.  KVM always does a targeted TLB flush when
using a "new" EPT root, in quotes because "new" simply means a root that
isn't currently being used by the vCPU.

KVM also _deliberately_ runs with stale TLB entries for defunct roots,
i.e. doesn't do a TLB flush when vCPUs stop using roots, precisely because
KVM does the flush on first use.  As called out by the comment in
kvm_mmu_load(), the reason KVM flushes on first use is because KVM can't
guarantee the correctness of past hypervisors.

Jumping back to the global INVEPT, when the painfully terse commit
1439442c7b ("KVM: VMX: Enable EPT feature for KVM") was added, the
effective TLB flush being performed was:

  static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
  {
          vpid_sync_vcpu_all(to_vmx(vcpu));
  }

I.e. KVM was not flushing EPT TLB entries when allocating a "new" root,
which very strongly suggests that the global INVEPT during hardware
enabling was a misguided hack that addressed the most obvious symptom,
but failed to fix the underlying bug.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240608001003.3296640-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:57:25 -07:00
Sean Christopherson
cb9fb5fc12 KVM: nVMX: Update VMCS12_REVISION comment to state it should never change
Rewrite the comment above VMCS12_REVISION to unequivocally state that the
ID must never change.  KVM_{G,S}ET_NESTED_STATE have been officially
supported for some time now, i.e. changing VMCS12_REVISION would break
userspace.

Opportunistically add a blurb to the CHECK_OFFSET() comment to make it
explicitly clear that new fields are allowed, i.e. that the restriction
on the layout is all about backwards compatibility.

No functional change intended.

Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240613190103.1054877-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:55:00 -07:00
Sean Christopherson
704ec48fc2 KVM: SVM: Use sev_es_host_save_area() helper when initializing tsc_aux
Use sev_es_host_save_area() instead of open coding an equivalent when
setting the MSR_TSC_AUX field during setup.

No functional change intended.

Link: https://lore.kernel.org/r/20240617210432.1642542-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:53:00 -07:00
Sean Christopherson
34830b3c02 KVM: SVM: Force sev_es_host_save_area() to be inlined (for noinstr usage)
Force sev_es_host_save_area() to be always inlined, as it's used in the
low level VM-Enter/VM-Exit path, which is non-instrumentable.

  vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xb0: call to
    sev_es_host_save_area() leaves .noinstr.text section
  vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xbf: call to
    sev_es_host_save_area.isra.0() leaves .noinstr.text section

Fixes: c92be2fd8e ("KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area")
Reported-by: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240617210432.1642542-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:53:00 -07:00
Jeff Johnson
8815d77cbc KVM: x86: Add missing MODULE_DESCRIPTION() macros
Add module descriptions for the vendor modules to fix allmodconfig
'make W=1' warnings:

  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-intel.o
  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-amd.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com
[sean: split kvm.ko change to separate commit]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:47:12 -07:00
Pei Li
ebbdf37ce9 KVM: Validate hva in kvm_gpc_activate_hva() to fix __kvm_gpc_refresh() WARN
Check that the virtual address is "ok" when activating a gfn_to_pfn_cache
with a host VA to ensure that KVM never attempts to use a bad address.

This fixes a bug where KVM fails to check the incoming address when
handling KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA in kvm_xen_vcpu_set_attr().

Reported-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fd555292a1da3180fc82
Tested-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240627-bug5-v2-1-2c63f7ee6739@gmail.com
[sean: rewrite changelog with --verbose]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:31:46 -07:00
Michael Roth
cf6d9d2d24 KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests
With commit 27bd5fdc24 ("KVM: SEV-ES: Prevent MSR access post VMSA
encryption"), older VMMs like QEMU 9.0 and older will fail when booting
SEV-ES guests with something like the following error:

  qemu-system-x86_64: error: failed to get MSR 0x174
  qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.

This is because older VMMs that might still call
svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if
those interfaces were essentially just noops because of the vCPU state
being encrypted and stored separately in the VMSA. Now those VMMs will
get an -EINVAL and generally crash.

Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of
the stricter limitations of what vCPU state can be sync'd during
guest run-time, so newer QEMU for instance will work both for legacy
KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2.

So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with
-EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing
with either an older VMM and so it needs to assume that returning
-EINVAL might break the VMM.

Address this by only returning -EINVAL if the guest was started with
KVM_SEV_INIT2. Otherwise, just silently return.

Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Nikunj A Dadhania <nikunj@amd.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/
Fixes: 27bd5fdc24 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption")
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240604233510.764949-1-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-21 07:11:29 -04:00
Rick Edgecombe
c2f38f75fc KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep()
Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of
passing fault->addr and then converting it to a GFN.

Future changes will make fault->addr and fault->gfn differ when running
TDX guests. The GFN will be conceptually the same as it is for normal VMs,
but fault->addr may contain a TDX specific bit that differentiates between
"shared" and "private" memory. This bit will be used to direct faults to
be handled on different roots, either the normal "direct" root or a new
type of root that handles private memory. The TDP iterators will process
the traditional GFN concept and apply the required TDX specifics depending
on the root type. For this reason, it needs to operate on regular GFN and
not the addr, which may contain these special TDX specific bits.

Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then
immediately converts it to a GFN with a bit shift. However, this would
unfortunately retain the TDX specific bits in what is supposed to be a
traditional GFN. Excluding TDX's needs, it is also is unnecessary to pass
fault->addr and convert it to a GFN when the GFN is already on hand.

So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and
use it directly.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 18:43:31 -04:00
Rick Edgecombe
964cea8171 KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other
multi-part operations.

REMOVED_SPTE is used as a non-present intermediate value for multi-part
operations that can happen when a thread doesn't have an MMU write lock.
Today these operations are when removing PTEs.

However, future changes will want to use the same concept for setting a
PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it
to FROZEN_SPTE so it can be used for both types of operations.

Also rename the relevant helpers and comments that refer to "removed"
within the context of the SPTE value. Take care to not update naming
referring the "remove" operations, which are still distinct.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 18:43:31 -04:00
Paolo Bonzini
02b0d3b9d4 Merge branch 'kvm-6.10-fixes' into HEAD 2024-06-20 17:31:50 -04:00
Isaku Yamahata
8a4e2742a5 KVM: x86/tdp_mmu: Sprinkle __must_check
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace
the SPTE value and returns -EBUSY on failure.  The caller must check the
return value and retry.  Add __must_check to it, as well as to two more
functions that forward the return value of __tdp_mmu_set_spte_atomic to
their caller.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 17:28:44 -04:00
Sean Christopherson
f3ced000a2 KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC
routes, irrespective of whether the I/O APIC is emulated by userspace or
by KVM.  If a level-triggered interrupt routed through the I/O APIC is
pending or in-service for a vCPU, KVM needs to intercept EOIs on said
vCPU even if the vCPU isn't the destination for the new routing, e.g. if
servicing an interrupt using the old routing races with I/O APIC
reconfiguration.

Commit fceb3a36c2 ("KVM: x86: ioapic: Fix level-triggered EOI and
userspace I/OAPIC reconfigure race") fixed the common cases, but
kvm_apic_pending_eoi() only checks if an interrupt is in the local
APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is
pending in the PIR.

Failure to intercept EOI can manifest as guest hangs with Windows 11 if
the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't
expose a more modern form of time to the guest.

Cc: stable@vger.kernel.org
Cc: Adamos Ttofari <attofari@amazon.de>
Cc: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240611014845.82795-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20 14:18:02 -04:00
David Matlack
a6816314af KVM: Introduce vcpu->wants_to_run
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run
loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit
was not set.

Replace all references to vcpu->run->immediate_exit with
!vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a
malicious userspace could invoked KVM_RUN with immediate_exit=true and
then after KVM reads it to set wants_to_run=false, flip it to false.
This would result in the vCPU running in KVM_RUN with
wants_to_run=false. This wouldn't cause any real bugs today but is a
dangerous landmine.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18 09:20:01 -07:00
Sean Christopherson
d29bf2ca14 KVM: x86: Prevent excluding the BSP on setting max_vcpu_ids
If the BSP vCPU ID was already set, ensure it doesn't get excluded when
limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID.

[mks: provide commit message, code by Sean]

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18 08:59:50 -07:00
Mathias Krause
7c305d5118 KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_ID
Do not accept IDs which are definitely invalid by limit checking the
passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was
already set.

This ensures invalid values, especially on 64-bit systems, don't go
unnoticed and lead to a valid id by chance when truncated by the final
assignment.

Fixes: 73880c80aa ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18 08:59:36 -07:00
David Matlack
0089c055b5 KVM: x86/mmu: Avoid reacquiring RCU if TDP MMU fails to allocate an SP
Avoid needlessly reacquiring the RCU read lock if the TDP MMU fails to
allocate a shadow page during eager page splitting. Opportunistically
drop the local variable ret as well now that it's no longer necessary.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240611220512.2426439-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-14 09:25:03 -07:00
David Matlack
3d4a5a45ca KVM: x86/mmu: Unnest TDP MMU helpers that allocate SPs for eager splitting
Move the implementation of tdp_mmu_alloc_sp_for_split() to its one and
only caller to reduce unnecessary nesting and make it more clear why the
eager split loop continues after allocating a new SP.

Opportunistically drop the double-underscores from
__tdp_mmu_alloc_sp_for_split() now that its parent is gone.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240611220512.2426439-4-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-14 09:24:23 -07:00
David Matlack
e1c04f7a9f KVM: x86/mmu: Hard code GFP flags for TDP MMU eager split allocations
Now that the GFP_NOWAIT case is gone, hard code GFP_KERNEL_ACCOUNT when
allocating shadow pages during eager page splitting in the TDP MMU.
Opportunistically replace use of __GFP_ZERO with allocations that zero
to improve readability.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240611220512.2426439-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-14 09:23:40 -07:00
David Matlack
cf3ff0ee24 KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting
Always drop mmu_lock to allocate shadow pages in the TDP MMU when doing
eager page splitting. Dropping mmu_lock during eager page splitting is
cheap since KVM does not have to flush remote TLBs, and avoids stalling
vCPU threads that are taking page faults while KVM is eager splitting
under mmu_lock held for write.

This change reduces 20%+ dips in MySQL throughput during live migration
in a 160 vCPU VM while userspace is issuing CLEAR_DIRTY_LOG ioctls
(tested with 1GiB and 8GiB CLEARs). Userspace could issue finer-grained
CLEARs, which would also reduce contention on mmu_lock, but doing so
will increase the rate of remote TLB flushing, since KVM must flush TLBs
before returning from CLEAR_DITY_LOG.

When there isn't contention on mmu_lock[1], this change does not regress
the time it takes to perform eager page splitting (the cost of releasing
and re-acquiring an uncontended lock is minimal on x86).

[1] Tested with dirty_log_perf_test, which does not run vCPUs during
eager page splitting, and with a 16 vCPU VM Live Migration with
manual-protect disabled (where mmu_lock is held for read).

Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240611220512.2426439-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-14 09:22:22 -07:00
Sean Christopherson
caa7278829 KVM: x86/mmu: Rephrase comment about synthetic PFERR flags in #PF handler
Reword the BUILD_BUG_ON() comment in the legacy #PF handler to explicitly
describe how asserting that synthetic PFERR flags are limited to bits 31:0
protects KVM against inadvertently passing a synthetic flag to the common
page fault handler.

No functional change intended.

Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240608001108.3296879-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-14 09:20:47 -07:00
Sean Christopherson
3dee3b1874 KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run()
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(),
drop the redundant store from vcpu_run().  The flag is cleared only when
VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's
impossible for l1tf_flush_l1d to be cleared between loading the vCPU and
calling vcpu_run().

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:47 -07:00
Sean Christopherson
ef2e18ef37 KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load
Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting
it only when the vCPU is being scheduled back in.  The flag is processed
only when VM-Enter is imminent, and KVM obviously needs to load the vCPU
before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no
meaningful value.  I.e. the flag _will_ be set either way, it's simply a
matter of when.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:46 -07:00
Sean Christopherson
2a27c43140 KVM: Delete the now unused kvm_arch_sched_in()
Delete kvm_arch_sched_in() now that all implementations are nops.

Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:45 -07:00
Sean Christopherson
8fbb696a8f KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying
off the recently added kvm_vcpu.scheduled_out as appropriate.

Note, there is a very slight functional change, as PLE shrink updates will
now happen after blasting WBINVD, but that is quite uninteresting as the
two operations do not interact in any way.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:44 -07:00
Sean Christopherson
5d9c07febb KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()
Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation
of moving the sched_in logic, which handles shrinking the PLE window, into
vmx_vcpu_load().

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:43 -07:00
Yi Wang
e3c89f5dd1 KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIP
Now that KVM sets up empty IRQ routing during VM creation, don't recreate
empty routing during KVM_CAP_SPLIT_IRQCHIP.  Setting IRQ routes during
KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the
synchronize_srcu_expedited() call in kvm_set_irq_routing().

Note, the empty routing is guaranteed to be intact as KVM x86 only allows
changing the IRQ routing after an in-kernel IRQCHIP has been created, and
KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP.

Signed-off-by: Yi Wang <foxywang@tencent.com>
Link: https://lore.kernel.org/r/20240506101751.3145407-3-foxywang@tencent.com
[sean: massage changelog, remove unused empty_routing array]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:40 -07:00
Sean Christopherson
3b65a692a5 KVM: x86/pmu: Add a helper to enable bits in FIXED_CTR_CTRL
Add a helper, intel_pmu_enable_fixed_counter_bits(), to dedup code that
enables fixed counter bits, i.e. when KVM clears bits in the reserved mask
used to detect invalid MSR_CORE_PERF_FIXED_CTR_CTRL values.

No functional change intended.

Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240608000819.3296176-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 09:35:58 -07:00
Thomas Prescher
85542adb65 KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag
When a vCPU is interrupted by a signal while running a nested guest,
KVM will exit to userspace with L2 state. However, userspace has no
way to know whether it sees L1 or L2 state (besides calling
KVM_GET_STATS_FD, which does not have a stable ABI).

This causes multiple problems:

The simplest one is L2 state corruption when userspace marks the sregs
as dirty. See this mailing list thread [1] for a complete discussion.

Another problem is that if userspace decides to continue by emulating
instructions, it will unknowingly emulate with L2 state as if L1
doesn't exist, which can be considered a weird guest escape.

Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data
structure, which is set when the vCPU exited while running a nested
guest.  Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to
advertise the functionality to userspace.

[1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 09:24:31 -07:00
Sean Christopherson
1028893a73 KVM: x86: Bury guest_cpuid_is_amd_or_hygon() in cpuid.c
Move guest_cpuid_is_amd_or_hygon() into cpuid.c now that, except for one
Intel quirk in the emulator, KVM checks for AMD vs. Intel *compatible*
vCPUs, not exact vendors, i.e. now that there should not be any reason for
KVM at-large to care about the exact vendor.

Opportunistically refactor the guts of the helper to use "entry" instead
of "best", and short circuit the !entry path to make the common case more
readable.

Link: https://lore.kernel.org/r/20240405235603.1173076-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:39 -07:00
Sean Christopherson
bdaff4f92b KVM: x86: Open code vendor_intel() in string_registers_quirk()
Open code the is_guest_vendor_intel() check in string_registers_quirk() to
discourage makiking exact vendor==Intel checks in the emulator, and to
remove the rather awful #ifdeffery.

The string quirk is literally the only Intel specific, *non-architectural*
behavior that KVM emulates.  All Intel specific behavior that is
architecturally defined applies to all vendors that are compatible with
Intel's architecture, i.e. should use guest_cpuid_is_intel_compatible().

Link: https://lore.kernel.org/r/20240405235603.1173076-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:39 -07:00
Sean Christopherson
4067c2395e KVM: x86: Allow SYSENTER in Compatibility Mode for all Intel compat vCPUs
Emulate SYSENTER in Compatibility Mode for all vCPUs models that are
compatible with Intel's architecture, as the behavior if SYSENTER is
architecturally defined in Intel's SDM, i.e. should be followed by any
CPU that implements Intel's architecture.

Link: https://lore.kernel.org/r/20240405235603.1173076-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:39 -07:00
Sean Christopherson
dc2b8b2b52 KVM: SVM: Emulate SYSENTER RIP/RSP behavior for all Intel compat vCPUs
Emulate bits 63:32 of the SYSENTER_R{I,S}P MSRs for all vCPUs that are
compatible with Intel's architecture, not just strictly vCPUs that have
vendor==Intel.  The behavior of bits 63:32 is architecturally defined in
the SDM, i.e. not some uarch specific quirk of Intel CPUs.

Link: https://lore.kernel.org/r/20240405235603.1173076-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:38 -07:00
Sean Christopherson
d99e4cb2ae KVM: x86: Use "is Intel compatible" helper to emulate SYSCALL in !64-bit
Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in
32-bit Protected Mode (including Compatibility Mode) should #UD or succeed.
The existing code already does the exact equivalent of
guest_cpuid_is_intel_compatible(), just in a rather roundabout way.

No functional change intended.

Link: https://lore.kernel.org/r/20240405235603.1173076-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:38 -07:00
Sean Christopherson
c092fc879f KVM: x86: Inhibit code #DBs in MOV-SS shadow for all Intel compat vCPUs
Treat code #DBs as inhibited in MOV/POP-SS shadows for vCPU models that
are Intel compatible, not just strictly vCPUs with vendor==Intel.  The
behavior is explicitly called out in the SDM, and thus architectural, i.e.
applies to all CPUs that implement Intel's architecture, and isn't a quirk
that is unique to CPUs manufactured by Intel:

  However, if an instruction breakpoint is placed on an instruction located
  immediately after a POP SS/MOV SS instruction, the breakpoint will be
  suppressed as if EFLAGS.RF were 1.

Applying the behavior strictly to Intel wasn't intentional, KVM simply
didn't have a concept of "Intel compatible" as of commit baf67ca8e5
("KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is active").

Link: https://lore.kernel.org/r/20240405235603.1173076-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:38 -07:00
Sean Christopherson
6463e5e418 KVM: x86: Apply Intel's TSC_AUX reserved-bit behavior to Intel compat vCPUs
Extend Intel's check on MSR_TSC_AUX[63:32] to all vCPU models that are
Intel compatible, i.e. aren't AMD or Hygon in KVM's world, as the behavior
is architectural, i.e. applies to any CPU that is compatible with Intel's
architecture.  Applying the behavior strictly to Intel wasn't intentional,
KVM simply didn't have a concept of "Intel compatible" as of commit
61a05d444d ("KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to
guest CPU model").

Link: https://lore.kernel.org/r/20240405235603.1173076-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:38 -07:00
Sean Christopherson
5a4f8b3026 KVM: x86/pmu: Squash period for checkpointed events based on host HLE/RTM
Zero out the sampling period for checkpointed events if the host supports
HLE or RTM, i.e. supports transactions and thus checkpointed events, not
based on whether the vCPU vendor model is Intel.  Perf's refusal to allow
a sample period for checkpointed events is based purely on whether or not
the CPU supports HLE/RTM transactions, i.e. perf has no knowledge of the
vCPU vendor model.

Note, it is _extremely_ unlikely that the existing code is a problem in
real world usage, as there are far, far bigger hurdles that would need to
be cleared to support cross-vendor vPMUs.  The motivation is mainly to
eliminate the use of guest_cpuid_is_intel(), in order to get to a state
where KVM pivots on AMD vs. Intel compatibility, i.e. doesn't check for
exactly vendor==Intel except in rare circumstances (i.e. for CPU quirks).

Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20240405235603.1173076-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 14:29:38 -07:00
Binbin Wu
d5989a3533 KVM: VMX: Remove unused declaration of vmx_request_immediate_exit()
After commit 0ec3d6d1f1 "KVM: x86: Fully defer to vendor code to decide
how to force immediate exit", vmx_request_immediate_exit() was removed.
Commit 5f18c642ff "KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch
VMX and TDX" added its declaration by accident.  Remove it.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20240506075025.2251131-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 09:35:25 -07:00
Hou Wenlong
c7d4c5f019 KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definition
The check_apicv_inhibit_reasons() callback implementation was dropped in
the commit b3f257a846 ("KVM: x86: Track required APICv inhibits with
variable, not callback"), but the definition removal was missed in the
final version patch (it was removed in the v4). Therefore, it should be
dropped, and the vmx_check_apicv_inhibit_reasons() function declaration
should also be removed.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 09:20:52 -07:00
Sean Christopherson
377b2f359d KVM: VMX: Always honor guest PAT on CPUs that support self-snoop
Unconditionally honor guest PAT on CPUs that support self-snoop, as
Intel has confirmed that CPUs that support self-snoop always snoop caches
and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
even in the presence of aliased memtypes, thus there is no need to trust
the guest behaves and only honor PAT as a last resort, as KVM does today.

Honoring guest PAT is desirable for use cases where the guest has access
to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
(mediated, for all intents and purposes) GPU is exposed to the guest, along
with buffers that are consumed directly by the physical GPU, i.e. which
can't be proxied by the host to ensure writes from the guest are performed
with the correct memory type for the GPU.

Cc: Yiwei Zhang <zzyiwei@google.com>
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-07 07:18:03 -07:00
Yan Zhao
65a4de0ffd KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
Ensure a full memory barrier is emitted in the VM-Exit path, as a full
barrier is required on Intel CPUs to evict WC buffers.  This will allow
unconditionally honoring guest PAT on Intel CPUs that support self-snoop.

As srcu_read_lock() is always called in the VM-Exit path and it internally
has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a
second fence and make sure smp_mb() is called without dependency on
implementation details of srcu_read_lock().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[sean: massage changelog]
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-07 07:18:02 -07:00
Sean Christopherson
e1548088ff KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer
honors guest MTRR memtypes, as forcing UC memory for VMs with
non-coherent DMA only makes sense if the guest is using something other
than PAT to configure the memtype for the DMA region.

Furthermore, KVM has forced WB memory for CR0.CD=1 since commit
fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known
VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does
so with non-coherent DMA.

Lastly, commit fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was
from the same author as commit b18d5431ac ("KVM: x86: fix CR0.CD
virtualization"), and followed by a mere month.  I.e. forcing UC memory
was likely the result of code inspection or perhaps misdiagnosed failures,
and not the necessitate by a concrete use case.

Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now
AMD-only, and to take an erratum for lack of CR0.CD virtualization on
Intel.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 08:13:14 -07:00
Sean Christopherson
0a7b73559b KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
adds no value, negatively impacts guest performance, and is a maintenance
burden due to it's complexity and oddities.

KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
that may perform non-coherent DMA access.  From a hardware virtualization
perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
shadowing paging doesn't magically account for guest MTRRs, nor does NPT.

Unwinding and deciphering KVM's murky history, the MTRR virtualization
code appears to be the result of misdiagnosed issues when EPT + VT-d with
passthrough devices was enabled years and years ago.  And importantly, the
underlying bugs that were fudged around by honoring guest MTRR memtypes
have since been fixed (though rather poorly in some cases).

The zapping GFNs logic in the MTRR virtualization code came from:

  commit efdfe536d8
  Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
  Date:   Wed May 13 14:42:27 2015 +0800

    KVM: MMU: fix MTRR update

    Currently, whenever guest MTRR registers are changed
    kvm_mmu_reset_context is called to switch to the new root shadow page
    table, however, it's useless since:
    1) the cache type is not cached into shadow page's attribute so that
       the original root shadow page will be reused

    2) the cache type is set on the last spte, that means we should sync
       the last sptes when MTRR is changed

    This patch fixs this issue by drop all the spte in the gfn range which
    is being updated by MTRR

which was a fix for:

  commit 0bed3b568b
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Oct 9 16:01:54 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Wed Dec 31 16:51:44 2008 +0200

      KVM: Improve MTRR structure

      As well as reset mmu context when set MTRR.

which was part of a "MTRR/PAT support for EPT" series that also added:

+       if (mt_mask) {
+               mt_mask = get_memory_type(vcpu, gfn) <<
+                         kvm_x86_ops->get_mt_mask_shift();
+               spte |= mt_mask;
+       }

where get_memory_type() was a truly gnarly helper to retrieve the guest
MTRR memtype for a given memtype.  And *very* subtly, at the time of that
change, KVM *always* set VMX_EPT_IGMT_BIT,

        kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
                VMX_EPT_WRITABLE_MASK |
                VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
                VMX_EPT_IGMT_BIT);

which came in via:

  commit 928d4bf747
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Nov 6 14:55:45 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Tue Nov 11 21:00:37 2008 +0200

      KVM: VMX: Set IGMT bit in EPT entry

      There is a potential issue that, when guest using pagetable without vmexit when
      EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
      which would be inconsistent with host side and would cause host MCE due to
      inconsistent cache attribute.

      The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
      memory type to protect host (notice that all memory mapped by KVM should be WB).

Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added
the whole "ignoreIGMT things as a bug fix for issues that were detected
during EPT + VT-d + passthrough enabling, but it was applied earlier
because it was a generic fix.

Jumping back to 0bed3b568b ("KVM: Improve MTRR structure"), the other
relevant code, or rather lack thereof, is the handling of *host* MMIO.
That fix came in a bit later, but given the author and timing, it's safe
to say it was all part of the same EPT+VT-d enabling mess.

  commit 2aaf69dcee
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Wed Jan 21 16:52:16 2009 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Sun Feb 15 02:47:37 2009 +0200

    KVM: MMU: Map device MMIO as UC in EPT

    Software are not allow to access device MMIO using cacheable memory type, the
    patch limit MMIO region with UC and WC(guest can select WC using PAT and
    PCD/PWT).

In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization
was obviously never tested on NPT until much later, which lends further
credence to the theory/argument that this was all the result of
misdiagnosed issues.

Discussion from the EPT+MTRR enabling thread[*] more or less confirms that
Sheng Yang was trying to resolve issues with passthrough MMIO.

 * Sheng Yang
  : Do you mean host(qemu) would access this memory and if we set it to guest
  : MTRR, host access would be broken? We would cover this in our shadow MTRR
  : patch, for we encountered this in video ram when doing some experiment with
  : VGA assignment.

And in the same thread, there's also what appears to be confirmation of
Intel running into issues with Windows XP related to a guest device driver
mapping DMA with WC in the PAT.

 * Avi Kavity
  : Sheng Yang wrote:
  : > Yes... But it's easy to do with assigned devices' mmio, but what if guest
  : > specific some non-mmio memory's memory type? E.g. we have met one issue in
  : > Xen, that a assigned-device's XP driver specific one memory region as buffer,
  : > and modify the memory type then do DMA.
  : >
  : > Only map MMIO space can be first step, but I guess we can modify assigned
  : > memory region memory type follow guest's?
  : >
  :
  : With ept/npt, we can't, since the memory type is in the guest's
  : pagetable entries, and these are not accessible.

[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com

So, for the most part, what likely happened is that 15 years ago, a few
engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially
"fixed" passthrough device MMIO by emulating *guest* MTRRs.  Except for
the below case, everything since then has been a result of those two
intertwined changes.

The one exception, which is actually yet more confirmation of all of the
above, is the revert of Paolo's attempt at "full" virtualization of guest
MTRRs:

  commit 606decd670
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:12:47 2015 +0200

    Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"

    This reverts commit fd717f1101.
    It was reported to cause Machine Check Exceptions (bug 104091).

...

  commit fd717f1101
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:38:13 2015 +0200

    KVM: x86: apply guest MTRR virtualization on host reserved pages

    Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
    However, the guest could prefer a different page type than UC for
    such pages. A good example is that pass-throughed VGA frame buffer is
    not always UC as host expected.

    This patch enables full use of virtual guest MTRRs.

I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC
in EPT" and got the same result: machine checks, likely due to the guest
MTRRs not being trustworthy/sane at all times.

Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that
too got reverted.  Unfortunately, it doesn't appear that anyone ever found
a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused
extremely slow boot times doesn't appear to have a definitive root cause.

  commit fc07e76ac7
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:20:22 2015 +0200

    Revert "KVM: SVM: use NPT page attributes"

    This reverts commit 3c2e7f7de3.
    Initializing the mapping from MTRR to PAT values was reported to
    fail nondeterministically, and it also caused extremely slow boot
    (due to caching getting disabled---bug 103321) with assigned devices.

...

  commit 3c2e7f7de3
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:32:17 2015 +0200

    KVM: SVM: use NPT page attributes

    Right now, NPT page attributes are not used, and the final page
    attribute depends solely on gPAT (which however is not synced
    correctly), the guest MTRRs and the guest page attributes.

    However, we can do better by mimicking what is done for VMX.
    In the absence of PCI passthrough, the guest PAT can be ignored
    and the page attributes can be just WB.  If passthrough is being
    used, instead, keep respecting the guest PAT, and emulate the guest
    MTRRs through the PAT field of the nested page tables.

    The only snag is that WP memory cannot be emulated correctly,
    because Linux's default PAT setting only includes the other types.

In short, honoring guest MTRRs for VMX was initially a workaround of
sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host
MMIO.  And while there *are* known cases where honoring guest MTRRs is
desirable, e.g. passthrough VGA frame buffers, the desired behavior in
that case is to get WC instead of UC, i.e. at this point it's for
performance, not correctness.

Furthermore, the complete absence of MTRR virtualization on NPT and
shadow paging proves that, while KVM theoretically can do better, it's
by no means necessary for correctnesss.

Lastly, since kernels mostly rely on firmware to do MTRR setup, and the
host typically provides guest firmware, honoring guest MTRRs is effectively
honoring *host* userspace memtypes, which is also backwards.  I.e. it
would be far better for host userspace to communicate its desired memtype
directly to KVM (or perhaps indirectly via VMAs in the host kernel), not
through guest MTRRs.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 08:13:14 -07:00
Alejandro Jimenez
f992572120 KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasons
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by
renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to
APICV_INHIBIT_REASON_DISABLED.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:28 -07:00
Alejandro Jimenez
69148ccec6 KVM: x86: Print names of apicv inhibit reasons in traces
Use the tracing infrastructure helper __print_flags() for printing flag
bitfields, to enhance the trace output by displaying a string describing
each of the inhibit reasons set.

The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap
value, requiring the user to consult the source file where the inhibit
reasons are defined to decode the trace output.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:27 -07:00
Isaku Yamahata
6fef518594 KVM: x86: Add a capability to configure bus frequency for APIC timer
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the
frequency in nanoseconds. When using this capability, the user space
VMM should configure CPUID leaf 0x15 to advertise the frequency.

Vishal reported that the TDX guest kernel expects a 25MHz APIC bus
frequency but ends up getting interrupts at a significantly higher rate.

The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.

In addition, per Intel SDM:
    "The APIC timer frequency will be the processor’s bus clock or core
     crystal clock frequency (when TSC/core crystal clock ratio is
     enumerated in CPUID leaf 0x15) divided by the value specified in
     the divide configuration register."

The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.

The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
enumerated, the guest kernel uses it as the APIC bus frequency. If not,
the guest kernel measures the frequency based on other known timers like
the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest
kernel expects a 25MHz timer frequency but gets timer interrupt more
frequently due to the 1GHz frequency used by KVM.

To ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow the userspace to tell KVM to use the same frequency that
TDX mandates instead of the default 1Ghz.

Reported-by: Vishal Annapurve <vannapurve@google.com>
Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:27 -07:00
Isaku Yamahata
b460256b16 KVM: x86: Make nanoseconds per APIC bus cycle a VM variable
Introduce the VM variable "nanoseconds per APIC bus cycle" in
preparation to make the APIC bus frequency configurable.

The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.

In addition, per Intel SDM:
    "The APIC timer frequency will be the processor’s bus clock or core
     crystal clock frequency (when TSC/core crystal clock ratio is
     enumerated in CPUID leaf 0x15) divided by the value specified in
     the divide configuration register."

The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.

Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare
for allowing userspace to tell KVM to use the frequency that TDX mandates
instead of the default 1Ghz. Doing so ensures that the guest doesn't have
a conflicting view of the APIC bus frequency.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[reinette: rework changelog]
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:26 -07:00
Isaku Yamahata
f9e1cbf180 KVM: x86: hyper-v: Calculate APIC bus frequency for Hyper-V
Remove APIC_BUS_FREQUENCY and calculate it based on nanoseconds per APIC
bus cycle.  APIC_BUS_FREQUENCY is used only for HV_X64_MSR_APIC_FREQUENCY.
The MSR is not frequently read, calculate it every time.

There are two constants related to the APIC bus frequency:
APIC_BUS_FREQUENCY and APIC_BUS_CYCLE_NS.
Only one value is required because one can be calculated from the other:
   APIC_BUS_CYCLES_NS = 1000 * 1000 * 1000 / APIC_BUS_FREQUENCY.

Remove APIC_BUS_FREQUENCY and instead calculate it when needed.
This prepares for support of configurable APIC bus frequency by
requiring to change only a single variable.

Suggested-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Maxim Levitsky <maximlevitsky@gmail.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[reinette: rework changelog]
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/76a659d0898e87ebd73ee7c922f984a87a6ab370.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:25 -07:00
Ravi Bangoria
f99b052256 KVM: SNP: Fix LBR Virtualization for SNP guest
SEV-ES and thus SNP guest mandates LBR Virtualization to be _always_ ON.
Although commit b7e4be0a22 ("KVM: SEV-ES: Delegate LBR virtualization
to the processor") did the correct change for SEV-ES guests, it missed
the SNP. Fix it.

Reported-by: Srikanth Aithal <sraithal@amd.com>
Fixes: b7e4be0a22 ("KVM: SEV-ES: Delegate LBR virtualization to the processor")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240605114810.1304-1-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05 07:50:50 -04:00
Tao Su
db574f2f96 KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr
Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn().
Before checking the mismatch of private vs. shared, mmu_invalidate_seq is
saved to fault->mmu_seq, which can be used to detect an invalidation
related to the gfn occurred, i.e. KVM will not install a mapping in page
table if fault->mmu_seq != mmu_invalidate_seq.

Currently there is a second snapshot of mmu_invalidate_seq, which may not
be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute
may be changed between the two snapshots, but the gfn may be mapped in
page table without hindrance. Therefore, drop the second snapshot as it
has no obvious benefits.

Fixes: f6adeae81f ("KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()")
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Message-ID: <20240528102234.2162763-1-tao1.su@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05 06:45:06 -04:00
Li RongQing
99a49093ce KVM: SVM: Consider NUMA affinity when allocating per-CPU save_area
save_area of per-CPU svm_data are dominantly accessed from their
own local CPUs, so allocate them node-local for performance reason

so rename __snp_safe_alloc_page as snp_safe_alloc_page_node which
accepts numa node id as input parameter, svm_cpu_init call it with
node id switched from cpu id

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-4-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 16:14:11 -07:00
Li RongQing
9f44286d77 KVM: SVM: not account memory allocation for per-CPU svm_data
The allocation for the per-CPU save area in svm_cpu_init shouldn't
be accounted, So introduce  __snp_safe_alloc_page helper, which has
gfp flag as input, svm_cpu_init calls __snp_safe_alloc_page with
GFP_KERNEL, snp_safe_alloc_page calls __snp_safe_alloc_page with
GFP_KERNEL_ACCOUNT as input

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-3-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 16:14:11 -07:00
Li RongQing
f51af34686 KVM: SVM: remove useless input parameter in snp_safe_alloc_page
The input parameter 'vcpu' in snp_safe_alloc_page is not used.
Therefore, remove it.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-2-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 16:14:11 -07:00
Dapeng Mi
75430c412a KVM: x86/pmu: Manipulate FIXED_CTR_CTRL MSR with macros
Magic numbers are used to manipulate the bit fields of
FIXED_CTR_CTRL MSR. This makes reading code become difficult, so use
pre-defined macros to replace these magic numbers.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240430005239.13527-3-dapeng1.mi@linux.intel.com
[sean: drop unnecessary curly braces]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 14:25:22 -07:00
Dapeng Mi
0e102ce3d4 KVM: x86/pmu: Change ambiguous _mask suffix to _rsvd in kvm_pmu
Several '_mask' suffixed variables such as, global_ctrl_mask, are
defined in kvm_pmu structure. However the _mask suffix is ambiguous and
misleading since it's not a real mask with positive logic. On the contrary
it represents the reserved bits of corresponding MSRs and these bits
should not be accessed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240430005239.13527-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 14:23:14 -07:00
Hou Wenlong
9ecc1c119b KVM: x86/mmu: Only allocate shadowed translation cache for sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL
Only the indirect SP with sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL might
have leaf gptes, so allocation of shadowed translation cache is needed
only for it. Then, it can use sp->shadowed_translation to determine
whether to use the information in the shadowed translation cache or not.
Also, extend the WARN in FNAME(sync_spte)() to ensure that this won't
break shadow_mmu_get_sp_for_split().

Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/5b0cda8a7456cda476b14fca36414a56f921dd52.1715398655.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 14:06:39 -07:00
Liang Chen
4f8973e65f KVM: x86: invalid_list not used anymore in mmu_shrink_scan
'invalid_list' is now gathered in KVM_MMU_ZAP_OLDEST_MMU_PAGES.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Link: https://lore.kernel.org/r/20240509044710.18788-1-liangchen.linux@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 10:55:15 -07:00
Paolo Bonzini
ab978c62e7 Merge branch 'kvm-6.11-sev-snp' into HEAD
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth:

* add some basic infrastructure and introduces a new KVM_X86_SNP_VM
  vm_type to handle differences versus the existing KVM_X86_SEV_VM and
  KVM_X86_SEV_ES_VM types.

* implement the KVM API to handle the creation of a cryptographic
  launch context, encrypt/measure the initial image into guest memory,
  and finalize it before launching it.

* implement handling for various guest-generated events such as page
  state changes, onlining of additional vCPUs, etc.

* implement the gmem/mmu hooks needed to prepare gmem-allocated pages
  before mapping them into guest private memory ranges as well as
  cleaning them up prior to returning them to the host for use as
  normal memory. Because those cleanup hooks supplant certain
  activities like issuing WBINVDs during KVM MMU invalidations, avoid
  duplicating that work to avoid unecessary overhead.

This merge leaves out support support for attestation guest requests
and for loading the signing keys to be used for attestation requests.
2024-06-03 13:19:46 -04:00
Paolo Bonzini
b3233c737e Merge branch 'kvm-fixes-6.10-1' into HEAD
* Fixes and debugging help for the #VE sanity check.  Also disable
  it by default, even for CONFIG_DEBUG_KERNEL, because it was found
  to trigger spuriously (most likely a processor erratum as the
  exact symptoms vary by generation).

* Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled
  situation (GIF=0 or interrupt shadow) when the processor supports
  virtual NMI.  While generally KVM will not request an NMI window
  when virtual NMIs are supported, in this case it *does* have to
  single-step over the interrupt shadow or enable the STGI intercept,
  in order to deliver the latched second NMI.

* Drop support for hand tuning APIC timer advancement from userspace.
  Since we have adaptive tuning, and it has proved to work well,
  drop the module parameter for manual configuration and with it a
  few stupid bugs that it had.
2024-06-03 13:18:08 -04:00
Paolo Bonzini
f9d1b541d0 Merge branch 'kvm-fixes-6.10-1' into HEAD
* Fixes and debugging help for the #VE sanity check.  Also disable
  it by default, even for CONFIG_DEBUG_KERNEL, because it was found
  to trigger spuriously (most likely a processor erratum as the
  exact symptoms vary by generation).

* Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled
  situation (GIF=0 or interrupt shadow) when the processor supports
  virtual NMI.  While generally KVM will not request an NMI window
  when virtual NMIs are supported, in this case it *does* have to
  single-step over the interrupt shadow or enable the STGI intercept,
  in order to deliver the latched second NMI.

* Drop support for hand tuning APIC timer advancement from userspace.
  Since we have adaptive tuning, and it has proved to work well,
  drop the module parameter for manual configuration and with it a
  few stupid bugs that it had.
2024-06-03 13:10:11 -04:00
Sean Christopherson
89a58812c4 KVM: x86: Drop support for hand tuning APIC timer advancement from userspace
Remove support for specifying a static local APIC timer advancement value,
and instead present a read-only boolean parameter to let userspace enable
or disable KVM's dynamic APIC timer advancement.  Realistically, it's all
but impossible for userspace to specify an advancement that is more
precise than what KVM's adaptive tuning can provide.  E.g. a static value
needs to be tuned for the exact hardware and kernel, and if KVM is using
hrtimers, likely requires additional tuning for the exact configuration of
the entire system.

Dropping support for a userspace provided value also fixes several flaws
in the interface.  E.g. KVM interprets a negative value other than -1 as a
large advancement, toggling between a negative and positive value yields
unpredictable behavior as vCPUs will switch from dynamic to static
advancement, changing the advancement in the middle of VM creation can
result in different values for vCPUs within a VM, etc.  Those flaws are
mostly fixable, but there's almost no justification for taking on yet more
complexity (it's minimal complexity, but still non-zero).

The only arguments against using KVM's adaptive tuning is if a setup needs
a higher maximum, or if the adjustments are too reactive, but those are
arguments for letting userspace control the absolute max advancement and
the granularity of each adjustment, e.g. similar to how KVM provides knobs
for halt polling.

Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com
Cc: Shuling Zhou <zhoushuling@huawei.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240522010304.1650603-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 13:08:05 -04:00
Ravi Bangoria
b7e4be0a22 KVM: SEV-ES: Delegate LBR virtualization to the processor
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. Although KVM currently enforces LBRV for SEV-ES guests, there
are multiple issues with it:

o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR
  interception is used to dynamically toggle LBRV for performance reasons,
  this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3:

  [guest ~]# wrmsr 0x1d9 0x4
  KVM: entry failed, hardware error 0xffffffff
  EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000

  Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
  No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR
  is of swap type A.

o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the
  VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA
  encryption.

[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
     2023, Vol 2, 15.35.2 Enabling SEV-ES.
     https://bugzilla.kernel.org/attachment.cgi?id=304653

Fixes: 376c6d2850 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Co-developed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-4-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 13:07:18 -04:00
Ravi Bangoria
d922056215 KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absent
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. So, prevent SEV-ES guests when LBRV support is missing.

[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
     2023, Vol 2, 15.35.2 Enabling SEV-ES.
     https://bugzilla.kernel.org/attachment.cgi?id=304653

Fixes: 376c6d2850 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-3-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 13:06:48 -04:00
Nikunj A Dadhania
27bd5fdc24 KVM: SEV-ES: Prevent MSR access post VMSA encryption
KVM currently allows userspace to read/write MSRs even after the VMSA is
encrypted. This can cause unintentional issues if MSR access has side-
effects. For ex, while migrating a guest, userspace could attempt to
migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling LBRV on
the target. Fix this by preventing access to those MSRs which are context
switched via the VMSA, once the VMSA is encrypted.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-2-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 13:06:48 -04:00
Tom Lendacky
b2ec042347 KVM: SVM: Remove the need to trigger an UNBLOCK event on AP creation
All SNP APs are initially started using the APIC INIT/SIPI sequence in
the guest. This sequence moves the AP MP state from
KVM_MP_STATE_UNINITIALIZED to KVM_MP_STATE_RUNNABLE, so there is no need
to attempt the UNBLOCK.

As it is, the UNBLOCK support in SVM is only enabled when AVIC is
enabled. When AVIC is disabled, AP creation is still successful.

Remove the KVM_REQ_UNBLOCK request from the AP creation code and revert
the changes to the vcpu_unblocking() kvm_x86_ops path.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 12:38:17 -04:00
Paolo Bonzini
73137f5924 KVM: SEV: Don't WARN() if RMP lookup fails when invalidating gmem pages
The hook only handles cleanup work specific to SNP, e.g. RMP table
entries and flushing caches for encrypted guest memory. When run on a
non-SNP-enabled host (currently only possible using
KVM_X86_SW_PROTECTED_VM, e.g. via KVM selftests), the callback is a noop
and will WARN due to the RMP table not being present. It's actually
expected in this case that the RMP table wouldn't be present and that
the hook should be a noop, so drop the WARN_ONCE().

Reported-by: Sean Christopherson <seanjc@google.com>
Closes: https://lore.kernel.org/kvm/ZkU3_y0UoPk5yAeK@google.com/
Fixes: 8eb01900b0 ("KVM: SEV: Implement gmem hook for invalidating private pages")
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 12:38:14 -04:00
Michael Roth
febff040b1 KVM: SEV: Automatically switch reclaimed pages to shared
Currently there's a consistent pattern of always calling
host_rmp_make_shared() immediately after snp_page_reclaim(), so go ahead
and handle it automatically as part of snp_page_reclaim(). Also rename
it to kvm_rmp_make_shared() to more easily distinguish it as a
KVM-specific variant of the more generic rmp_make_shared() helper.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03 12:36:52 -04:00
Tony Luck
0c468a6a02 KVM: VMX: Switch to new Intel CPU model infrastructure
Use x86_vfm (vendor, family, module) to detect CPUs that are affected by
PERF_GLOBAL_CTRL bugs instead of manually checking the family and model.
The new VFM infrastructure encodes all information in one handy location.

No functional change intended.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-10-tony.luck@intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 09:10:43 -07:00
Tony Luck
8387435beb KVM: x86/pmu: Switch to new Intel CPU model defines
Use X86_MATCH_VFM(), which does Vendor checking in addition to Family and
Model checking, to do FMS-based detection of PEBS features.

No functional change intended.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-9-tony.luck@intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 09:10:34 -07:00
Jim Mattson
ea19f7d0bf KVM: x86: Remove IA32_PERF_GLOBAL_OVF_CTRL from KVM_GET_MSR_INDEX_LIST
This MSR reads as 0, and any host-initiated writes are ignored, so
there's no reason to enumerate it in KVM_GET_MSR_INDEX_LIST.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231113184854.2344416-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 09:00:30 -07:00
Sean Christopherson
82897db912 KVM: x86: Move shadow_phys_bits into "kvm_host", as "maxphyaddr"
Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's
global "kvm_host" variable, so that it is automatically exported for use
in vendor modules.  Rename the variable/field to maxphyaddr to more
clearly capture what value it holds, now that it's used outside of the
MMU (and because the "shadow" part is more than a bit misleading as the
variable is not at all unique to shadow paging).

Recomputing the raw/true host.MAXPHYADDR on every use can be subtly
expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as
a nested hypervisor.  Vendor code already has access to the information,
e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so
there's no tangible benefit to making it MMU-only.

Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:58:55 -07:00
Sean Christopherson
c043eaaa6b KVM: x86/mmu: Snapshot shadow_phys_bits when kvm.ko is loaded
Snapshot shadow_phys_bits when kvm.ko is loaded, not when a vendor module
is loaded, to guard against usage of shadow_phys_bits before it is
initialized.  The computation isn't vendor specific in any way, i.e. there
there is no reason to wait to snapshot the value until a vendor module is
loaded, nor is there any reason to recompute the value every time a vendor
module is loaded.

Opportunistically convert it from "read mostly" to "read-only after init".

Link: https://lore.kernel.org/r/20240423221521.2923759-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:58:55 -07:00
Sean Christopherson
52c47f5897 KVM: SVM: Use KVM's snapshot of the host's XCR0 for SEV-ES host state
Use KVM's snapshot of the host's XCR0 when stuffing SEV-ES host state
instead of reading XCR0 from hardware.  XCR0 is only written during
boot, i.e. won't change while KVM is running (and KVM at large is hosed
if that doesn't hold true).

Link: https://lore.kernel.org/r/20240423221521.2923759-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:58:54 -07:00
Sean Christopherson
7974c0643e KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc...
Add "struct kvm_host_values kvm_host" to hold the various host values
that KVM snapshots during initialization.  Bundling the host values into
a single struct simplifies adding new MSRs and other features with host
state/values that KVM cares about, and provides a one-stop shop.  E.g.
adding a new value requires one line, whereas tracking each value
individual often requires three: declaration, definition, and export.

No functional change intended.

Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:58:53 -07:00
Sean Christopherson
b4bd556467 KVM: SVM: WARN on vNMI + NMI window iff NMIs are outright masked
When requesting an NMI window, WARN on vNMI support being enabled if and
only if NMIs are actually masked, i.e. if the vCPU is already handling an
NMI.  KVM's ABI for NMIs that arrive simultanesouly (from KVM's point of
view) is to inject one NMI and pend the other.  When using vNMI, KVM pends
the second NMI simply by setting V_NMI_PENDING, and lets the CPU do the
rest (hardware automatically sets V_NMI_BLOCKING when an NMI is injected).

However, if KVM can't immediately inject an NMI, e.g. because the vCPU is
in an STI shadow or is running with GIF=0, then KVM will request an NMI
window and trigger the WARN (but still function correctly).

Whether or not the GIF=0 case makes sense is debatable, as the intent of
KVM's behavior is to provide functionality that is as close to real
hardware as possible.  E.g. if two NMIs are sent in quick succession, the
probability of both NMIs arriving in an STI shadow is infinitesimally low
on real hardware, but significantly larger in a virtual environment, e.g.
if the vCPU is preempted in the STI shadow.  For GIF=0, the argument isn't
as clear cut, because the window where two NMIs can collide is much larger
in bare metal (though still small).

That said, KVM should not have divergent behavior for the GIF=0 case based
on whether or not vNMI support is enabled.  And KVM has allowed
simultaneous NMIs with GIF=0 for over a decade, since commit 7460fb4a34
("KVM: Fix simultaneous NMIs").  I.e. KVM's GIF=0 handling shouldn't be
modified without a *really* good reason to do so, and if KVM's behavior
were to be modified, it should be done irrespective of vNMI support.

Fixes: fa4c027a79 ("KVM: x86: Add support for SVM's Virtual NMI")
Cc: stable@vger.kernel.org
Cc: Santosh Shukla <Santosh.Shukla@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240522021435.1684366-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:34:44 -04:00
Sean Christopherson
76d5363c20 KVM: x86: Force KVM_WERROR if the global WERROR is enabled
Force KVM_WERROR if the global WERROR is enabled to avoid pestering the
user about a Kconfig that will ultimately be ignored.  Force KVM_WERROR
instead of making it mutually exclusive with WERROR to avoid generating a
.config builds KVM with -Werror, but has KVM_WERROR=n.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240517180341.974251-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:33:31 -04:00
Sean Christopherson
6af6142e3a KVM: x86: Disable KVM_INTEL_PROVE_VE by default
Disable KVM's "prove #VE" support by default, as it provides no functional
value, and even its sanity checking benefits are relatively limited.  I.e.
it should be fully opt-in even on debug kernels, especially since EPT
Violation #VE suppression appears to be buggy on some CPUs.

Opportunistically add a line in the help text to make it abundantly clear
that KVM_INTEL_PROVE_VE should never be enabled in a production
environment.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:33:15 -04:00
Sean Christopherson
bca99c0356 KVM: x86/mmu: Print SPTEs on unexpected #VE
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT
Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE
didn't have SUPPRESS_VE set.

Opportunistically assert that the underlying exit reason was indeed an EPT
Violation, as the CPU has *really* gone off the rails if a #VE occurs due
to a completely unexpected exit reason.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:28:45 -04:00
Sean Christopherson
743f177336 KVM: VMX: Dump VMCS on unexpected #VE
Dump the VMCS on an unexpected #VE, otherwise it's practically impossible
to figure out why the #VE occurred.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:28:04 -04:00
Sean Christopherson
837d557aba KVM: x86/mmu: Add sanity checks that KVM doesn't create EPT #VE SPTEs
Assert that KVM doesn't set a SPTE to a value that could trigger an EPT
Violation #VE on a non-MMIO SPTE, e.g. to help detect bugs even without
KVM_INTEL_PROVE_VE enabled, and to help debug actual #VE failures.

Note, this will run afoul of TDX support, which needs to reflect emulated
MMIO accesses into the guest as #VEs (which was the whole point of adding
EPT Violation #VE support in KVM).  The obvious fix for that is to exempt
MMIO SPTEs, but that's annoyingly difficult now that is_mmio_spte() relies
on a per-VM value.  However, resolving that conundrum is a future problem,
whereas getting KVM_INTEL_PROVE_VE healthy is a current problem.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:26 -04:00
Sean Christopherson
9031b42139 KVM: nVMX: Always handle #VEs in L0 (never forward #VEs from L2 to L1)
Always handle #VEs, e.g. due to prove EPT Violation #VE failures, in L0,
as KVM does not expose any #VE capabilities to L1, i.e. any and all #VEs
are KVM's responsibility.

Fixes: 8131cf5b4f ("KVM: VMX: Introduce test mode related to EPT violation VE")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:26 -04:00
Sean Christopherson
d1b32ecdc8 KVM: nVMX: Initialize #VE info page for vmcs02 when proving #VE support
Point vmcs02.VE_INFORMATION_ADDRESS at the vCPU's #VE info page when
initializing vmcs02, otherwise KVM will run L2 with EPT Violation #VE
enabled and a VE info address pointing at pfn 0.

Fixes: 8131cf5b4f ("KVM: VMX: Introduce test mode related to EPT violation VE")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:25 -04:00
Sean Christopherson
40e8a6901a KVM: VMX: Don't kill the VM on an unexpected #VE
Don't terminate the VM on an unexpected #VE, as it's extremely unlikely
the #VE is fatal to the guest, and even less likely that it presents a
danger to the host.  Simply resume the guest on "failure", as the #VE info
page's BUSY field will prevent converting any more EPT Violations to #VEs
for the vCPU (at least, that's what the BUSY field is supposed to do).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:27:17 -04:00
Isaku Yamahata
803482f472 KVM: x86/mmu: Use SHADOW_NONPRESENT_VALUE for atomic zap in TDP MMU
Use SHADOW_NONPRESENT_VALUE when zapping TDP MMU SPTEs with mmu_lock held
for read, tdp_mmu_zap_spte_atomic() was simply missed during the initial
development.

Fixes: 7f01cab849 ("KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[sean: write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240518000430.1118488-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23 12:24:39 -04:00
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Linus Torvalds
f4b0c4b508 ARM:
* Move a lot of state that was previously stored on a per vcpu
   basis into a per-CPU area, because it is only pertinent to the
   host while the vcpu is loaded. This results in better state
   tracking, and a smaller vcpu structure.
 
 * Add full handling of the ERET/ERETAA/ERETAB instructions in
   nested virtualisation. The last two instructions also require
   emulating part of the pointer authentication extension.
   As a result, the trap handling of pointer authentication has
   been greatly simplified.
 
 * Turn the global (and not very scalable) LPI translation cache
   into a per-ITS, scalable cache, making non directly injected
   LPIs much cheaper to make visible to the vcpu.
 
 * A batch of pKVM patches, mostly fixes and cleanups, as the
   upstreaming process seems to be resuming. Fingers crossed!
 
 * Allocate PPIs and SGIs outside of the vcpu structure, allowing
   for smaller EL2 mapping and some flexibility in implementing
   more or less than 32 private IRQs.
 
 * Purge stale mpidr_data if a vcpu is created after the MPIDR
   map has been created.
 
 * Preserve vcpu-specific ID registers across a vcpu reset.
 
 * Various minor cleanups and improvements.
 
 LoongArch:
 
 * Add ParaVirt IPI support.
 
 * Add software breakpoint support.
 
 * Add mmio trace events support.
 
 RISC-V:
 
 * Support guest breakpoints using ebreak
 
 * Introduce per-VCPU mp_state_lock and reset_cntx_lock
 
 * Virtualize SBI PMU snapshot and counter overflow interrupts
 
 * New selftests for SBI PMU and Guest ebreak
 
 * Some preparatory work for both TDX and SNP page fault handling.
   This also cleans up the page fault path, so that the priorities
   of various kinds of fauls (private page, no memory, write
   to read-only slot, etc.) are easier to follow.
 
 x86:
 
 * Minimize amount of time that shadow PTEs remain in the special
   REMOVED_SPTE state.  This is a state where the mmu_lock is held for
   reading but concurrent accesses to the PTE have to spin; shortening
   its use allows other vCPUs to repopulate the zapped region while
   the zapper finishes tearing down the old, defunct page tables.
 
 * Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field,
   which is defined by hardware but left for software use.  This lets KVM
   communicate its inability to map GPAs that set bits 51:48 on hosts
   without 5-level nested page tables.  Guest firmware is expected to
   use the information when mapping BARs; this avoids that they end up at
   a legal, but unmappable, GPA.
 
 * Fixed a bug where KVM would not reject accesses to MSR that aren't
   supposed to exist given the vCPU model and/or KVM configuration.
 
 * As usual, a bunch of code cleanups.
 
 x86 (AMD):
 
 * Implement a new and improved API to initialize SEV and SEV-ES VMs, which
   will also be extendable to SEV-SNP.  The new API specifies the desired
   encryption in KVM_CREATE_VM and then separately initializes the VM.
   The new API also allows customizing the desired set of VMSA features;
   the features affect the measurement of the VM's initial state, and
   therefore enabling them cannot be done tout court by the hypervisor.
 
   While at it, the new API includes two bugfixes that couldn't be
   applied to the old one without a flag day in userspace or without
   affecting the initial measurement.  When a SEV-ES VM is created with
   the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are
   rejected once the VMSA has been encrypted.  Also, the FPU and AVX
   state will be synchronized and encrypted too.
 
 * Support for GHCB version 2 as applicable to SEV-ES guests.  This, once
   more, is only accessible when using the new KVM_SEV_INIT2 flow for
   initialization of SEV-ES VMs.
 
 x86 (Intel):
 
 * An initial bunch of prerequisite patches for Intel TDX were merged.
   They generally don't do anything interesting.  The only somewhat user
   visible change is a new debugging mode that checks that KVM's MMU
   never triggers a #VE virtualization exception in the guest.
 
 * Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
   L1, as per the SDM.
 
 Generic:
 
 * Use vfree() instead of kvfree() for allocations that always use vcalloc()
   or __vcalloc().
 
 * Remove .change_pte() MMU notifier - the changes to non-KVM code are
   small and Andrew Morton asked that I also take those through the KVM
   tree.  The callback was only ever implemented by KVM (which was also the
   original user of MMU notifiers) but it had been nonfunctional ever since
   calls to set_pte_at_notify were wrapped with invalidate_range_start
   and invalidate_range_end... in 2012.
 
 Selftests:
 
 * Enhance the demand paging test to allow for better reporting and stressing
   of UFFD performance.
 
 * Convert the steal time test to generate TAP-friendly output.
 
 * Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
   time across two different clock domains.
 
 * Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
 
 * Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell
   script, to play nice with running in a minimal userspace environment.
 
 * Allow skipping the RSEQ test's sanity check that the vCPU was able to
   complete a reasonable number of KVM_RUNs, as the assert can fail on a
   completely valid setup.  If the test is run on a large-ish system that is
   otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
   vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
   which results in the vCPU having very little net runtime before the next
   migration due to high wakeup latencies.
 
 * Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
   a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
   every test to #define _GNU_SOURCE is painful.
 
 * Provide a global pseudo-RNG instance for all tests, so that library code can
   generate random, but determinstic numbers.
 
 * Use the global pRNG to randomly force emulation of select writes from guest
   code on x86, e.g. to help validate KVM's emulation of locked accesses.
 
 * Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
   handlers at VM creation, instead of forcing tests to manually trigger the
   related setup.
 
 Documentation:
 
 * Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZE878UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOukQf+LcvZsWtrC7Wd5K9SQbYXaS4Rk6P6
 JHoQW2d0hUN893J2WibEw+l1J/0vn5JumqHXyZgJ7CbaMtXkWWQTwDSDLuURUKpv
 XNB3Sb17G87NH+s1tOh0tA9h5upbtlHVHvrtIwdbb9+XHgQ6HTL4uk+HdfO/p9fW
 cWBEZAKoWcCIa99Numv3pmq5vdrvBlNggwBugBS8TH69EKMw+V1Vu1SFkIdNDTQk
 NJJ28cohoP3wnwlIHaXSmU4RujipPH3Lm/xupyA5MwmzO713eq2yUqV49jzhD5/I
 MA4Ruvgrdm4wpp89N9lQMyci91u6q7R9iZfMu0tSg2qYI3UPKIdstd8sOA==
 =2lED
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Move a lot of state that was previously stored on a per vcpu basis
     into a per-CPU area, because it is only pertinent to the host while
     the vcpu is loaded. This results in better state tracking, and a
     smaller vcpu structure.

   - Add full handling of the ERET/ERETAA/ERETAB instructions in nested
     virtualisation. The last two instructions also require emulating
     part of the pointer authentication extension. As a result, the trap
     handling of pointer authentication has been greatly simplified.

   - Turn the global (and not very scalable) LPI translation cache into
     a per-ITS, scalable cache, making non directly injected LPIs much
     cheaper to make visible to the vcpu.

   - A batch of pKVM patches, mostly fixes and cleanups, as the
     upstreaming process seems to be resuming. Fingers crossed!

   - Allocate PPIs and SGIs outside of the vcpu structure, allowing for
     smaller EL2 mapping and some flexibility in implementing more or
     less than 32 private IRQs.

   - Purge stale mpidr_data if a vcpu is created after the MPIDR map has
     been created.

   - Preserve vcpu-specific ID registers across a vcpu reset.

   - Various minor cleanups and improvements.

  LoongArch:

   - Add ParaVirt IPI support

   - Add software breakpoint support

   - Add mmio trace events support

  RISC-V:

   - Support guest breakpoints using ebreak

   - Introduce per-VCPU mp_state_lock and reset_cntx_lock

   - Virtualize SBI PMU snapshot and counter overflow interrupts

   - New selftests for SBI PMU and Guest ebreak

   - Some preparatory work for both TDX and SNP page fault handling.

     This also cleans up the page fault path, so that the priorities of
     various kinds of fauls (private page, no memory, write to read-only
     slot, etc.) are easier to follow.

  x86:

   - Minimize amount of time that shadow PTEs remain in the special
     REMOVED_SPTE state.

     This is a state where the mmu_lock is held for reading but
     concurrent accesses to the PTE have to spin; shortening its use
     allows other vCPUs to repopulate the zapped region while the zapper
     finishes tearing down the old, defunct page tables.

   - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
     field, which is defined by hardware but left for software use.

     This lets KVM communicate its inability to map GPAs that set bits
     51:48 on hosts without 5-level nested page tables. Guest firmware
     is expected to use the information when mapping BARs; this avoids
     that they end up at a legal, but unmappable, GPA.

   - Fixed a bug where KVM would not reject accesses to MSR that aren't
     supposed to exist given the vCPU model and/or KVM configuration.

   - As usual, a bunch of code cleanups.

  x86 (AMD):

   - Implement a new and improved API to initialize SEV and SEV-ES VMs,
     which will also be extendable to SEV-SNP.

     The new API specifies the desired encryption in KVM_CREATE_VM and
     then separately initializes the VM. The new API also allows
     customizing the desired set of VMSA features; the features affect
     the measurement of the VM's initial state, and therefore enabling
     them cannot be done tout court by the hypervisor.

     While at it, the new API includes two bugfixes that couldn't be
     applied to the old one without a flag day in userspace or without
     affecting the initial measurement. When a SEV-ES VM is created with
     the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
     once the VMSA has been encrypted. Also, the FPU and AVX state will
     be synchronized and encrypted too.

   - Support for GHCB version 2 as applicable to SEV-ES guests.

     This, once more, is only accessible when using the new
     KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.

  x86 (Intel):

   - An initial bunch of prerequisite patches for Intel TDX were merged.

     They generally don't do anything interesting. The only somewhat
     user visible change is a new debugging mode that checks that KVM's
     MMU never triggers a #VE virtualization exception in the guest.

   - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
     VM-Exit to L1, as per the SDM.

  Generic:

   - Use vfree() instead of kvfree() for allocations that always use
     vcalloc() or __vcalloc().

   - Remove .change_pte() MMU notifier - the changes to non-KVM code are
     small and Andrew Morton asked that I also take those through the
     KVM tree.

     The callback was only ever implemented by KVM (which was also the
     original user of MMU notifiers) but it had been nonfunctional ever
     since calls to set_pte_at_notify were wrapped with
     invalidate_range_start and invalidate_range_end... in 2012.

  Selftests:

   - Enhance the demand paging test to allow for better reporting and
     stressing of UFFD performance.

   - Convert the steal time test to generate TAP-friendly output.

   - Fix a flaky false positive in the xen_shinfo_test due to comparing
     elapsed time across two different clock domains.

   - Skip the MONITOR/MWAIT test if the host doesn't actually support
     MWAIT.

   - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
     shell script, to play nice with running in a minimal userspace
     environment.

   - Allow skipping the RSEQ test's sanity check that the vCPU was able
     to complete a reasonable number of KVM_RUNs, as the assert can fail
     on a completely valid setup.

     If the test is run on a large-ish system that is otherwise idle,
     and the test isn't affined to a low-ish number of CPUs, the vCPU
     task can be repeatedly migrated to CPUs that are in deep sleep
     states, which results in the vCPU having very little net runtime
     before the next migration due to high wakeup latencies.

   - Define _GNU_SOURCE for all selftests to fix a warning that was
     introduced by a change to kselftest_harness.h late in the 6.9
     cycle, and because forcing every test to #define _GNU_SOURCE is
     painful.

   - Provide a global pseudo-RNG instance for all tests, so that library
     code can generate random, but determinstic numbers.

   - Use the global pRNG to randomly force emulation of select writes
     from guest code on x86, e.g. to help validate KVM's emulation of
     locked accesses.

   - Allocate and initialize x86's GDT, IDT, TSS, segments, and default
     exception handlers at VM creation, instead of forcing tests to
     manually trigger the related setup.

  Documentation:

   - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
  selftests/kvm: remove dead file
  KVM: selftests: arm64: Test vCPU-scoped feature ID registers
  KVM: selftests: arm64: Test that feature ID regs survive a reset
  KVM: selftests: arm64: Store expected register value in set_id_regs
  KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
  KVM: arm64: Only reset vCPU-scoped feature ID regs once
  KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
  KVM: arm64: Rename is_id_reg() to imply VM scope
  KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
  KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
  KVM: arm64: Fix hvhe/nvhe early alias parsing
  KVM: SEV: Allow per-guest configuration of GHCB protocol version
  KVM: SEV: Add GHCB handling for termination requests
  KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
  KVM: SEV: Add support to handle AP reset MSR protocol
  KVM: x86: Explicitly zero kvm_caps during vendor module load
  KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
  KVM: x86: Fully re-initialize supported_vm_types on vendor module load
  KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
  KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
  ...
2024-05-15 14:46:43 -07:00
Brijesh Singh
6f627b4253 KVM: SVM: Add module parameter to enable SEV-SNP
Add a module parameter than can be used to enable or disable the SEV-SNP
feature. Now that KVM contains the support for the SNP set the GHCB
hypervisor feature flag to indicate that SNP is supported.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-18-michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:34 -04:00
Ashish Kalra
ea262f8a7c KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP
With SNP/guest_memfd, private/encrypted memory should not be mappable,
and MMU notifications for HVA-mapped memory will only be relevant to
unencrypted guest memory. Therefore, the rationale behind issuing a
wbinvd_on_all_cpus() in sev_guest_memory_reclaimed() should not apply
for SNP guests and can be ignored.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
[mdr: Add some clarifications in commit]
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-17-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:33 -04:00
Michael Roth
b2104024f4 KVM: x86: Implement hook for determining max NPT mapping level
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:

  - gmem allocates 2MB page
  - guest issues PVALIDATE on 2MB page
  - guest later converts a subpage to shared
  - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
  - KVM MMU splits NPT mapping to 4K
  - guest later converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.

Implement a kvm_x86_ops.private_max_mapping_level() hook for SEV that
checks for this condition and adjusts the mapping level accordingly.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-16-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:33 -04:00
Michael Roth
8eb01900b0 KVM: SEV: Implement gmem hook for invalidating private pages
Implement a platform hook to do the work of restoring the direct map
entries of gmem-managed pages and transitioning the corresponding RMP
table entries back to the default shared/hypervisor-owned state.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-15-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:32 -04:00
Michael Roth
4f2e7aa1cf KVM: SEV: Implement gmem hook for initializing private pages
This will handle the RMP table updates needed to put a page into a
private state before mapping it into an SEV-SNP guest.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-14-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:32 -04:00
Tom Lendacky
e366f92ea9 KVM: SEV: Support SEV-SNP AP Creation NAE event
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to alter the register state of the APs on their own. This allows
the guest a way of simulating INIT-SIPI.

A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
so as to avoid updating the VMSA pointer while the vCPU is running.

For CREATE
  The guest supplies the GPA of the VMSA to be used for the vCPU with
  the specified APIC ID. The GPA is saved in the svm struct of the
  target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
  to the vCPU and then the vCPU is kicked.

For CREATE_ON_INIT:
  The guest supplies the GPA of the VMSA to be used for the vCPU with
  the specified APIC ID the next time an INIT is performed. The GPA is
  saved in the svm struct of the target vCPU.

For DESTROY:
  The guest indicates it wishes to stop the vCPU. The GPA is cleared
  from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
  added to vCPU and then the vCPU is kicked.

The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
as a result of the event or as a result of an INIT. If a new VMSA is to
be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB
and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not
to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state
is set to KVM_MP_STATE_HALTED to prevent it from being run.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-13-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:32 -04:00
Brijesh Singh
c63cf135cc KVM: SEV: Add support to handle RMP nested page faults
When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters RMP check failure caused by the guest memory access
it raises the #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-12-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:31 -04:00
Michael Roth
9b54e248d2 KVM: SEV: Add support to handle Page State Change VMGEXIT
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.

Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
it is done for requests that don't use a GHCB page.

As with the MSR-based page-state changes, use the existing
KVM_HC_MAP_GPA_RANGE hypercall format to deliver these requests to
userspace via KVM_EXIT_HYPERCALL.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Co-developed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-11-michael.roth@amd.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:31 -04:00
Michael Roth
d46b7b6a5f KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

When using gmem, private/shared memory is allocated through separate
pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES
KVM ioctl to tell the KVM MMU whether or not a particular GFN should be
backed by private memory or not.

Forward these page state change requests to userspace so that it can
issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
entries when it is ready to map a private page into a guest.

Use the existing KVM_HC_MAP_GPA_RANGE hypercall format to deliver these
requests to userspace via KVM_EXIT_HYPERCALL.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Co-developed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-10-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:30 -04:00
Brijesh Singh
0c76b1d082 KVM: SEV: Add support to handle GHCB GPA register VMGEXIT
SEV-SNP guests are required to perform a GHCB GPA registration. Before
using a GHCB GPA for a vCPU the first time, a guest must register the
vCPU GHCB GPA. If hypervisor can work with the guest requested GPA then
it must respond back with the same GPA otherwise return -1.

On VMEXIT, verify that the GHCB GPA matches with the registered value.
If a mismatch is detected, then abort the guest.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-9-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:30 -04:00
Brijesh Singh
ad27ce1555 KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command
Add a KVM_SEV_SNP_LAUNCH_FINISH command to finalize the cryptographic
launch digest which stores the measurement of the guest at launch time.
Also extend the existing SNP firmware data structures to support
disabling the use of Versioned Chip Endorsement Keys (VCEK) by guests as
part of this command.

While finalizing the launch flow, the code also issues the LAUNCH_UPDATE
SNP firmware commands to encrypt/measure the initial VMSA pages for each
configured vCPU, which requires setting the RMP entries for those pages
to private, so also add handling to clean up the RMP entries for these
pages whening freeing vCPUs during shutdown.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Harald Hoyer <harald@profian.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-8-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:30 -04:00
Brijesh Singh
dee5a47cc7 KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
A key aspect of a launching an SNP guest is initializing it with a
known/measured payload which is then encrypted into guest memory as
pre-validated private pages and then measured into the cryptographic
launch context created with KVM_SEV_SNP_LAUNCH_START so that the guest
can attest itself after booting.

Since all private pages are provided by guest_memfd, make use of the
kvm_gmem_populate() interface to handle this. The general flow is that
guest_memfd will handle allocating the pages associated with the GPA
ranges being initialized by each particular call of
KVM_SEV_SNP_LAUNCH_UPDATE, copying data from userspace into those pages,
and then the post_populate callback will do the work of setting the
RMP entries for these pages to private and issuing the SNP firmware
calls to encrypt/measure them.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-7-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:29 -04:00
Brijesh Singh
136d8bc931 KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then at that point be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-6-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:29 -04:00
Brijesh Singh
1dfe571c12 KVM: SEV: Add initial SEV-SNP support
SEV-SNP builds upon existing SEV and SEV-ES functionality while adding
new hardware-based security protection. SEV-SNP adds strong memory
encryption and integrity protection to help prevent malicious
hypervisor-based attacks such as data replay, memory re-mapping, and
more, to create an isolated execution environment.

Define a new KVM_X86_SNP_VM type which makes use of these capabilities
and extend the KVM_SEV_INIT2 ioctl to support it. Also add a basic
helper to check whether SNP is enabled and set PFERR_PRIVATE_ACCESS for
private #NPFs so they are handled appropriately by KVM MMU.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240501085210.2213060-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:28 -04:00
Michael Roth
a8e3198333 KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y
SEV-SNP relies on private memory support to run guests, so make sure to
enable that support via the CONFIG_KVM_GENERIC_PRIVATE_MEM config
option.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:28 -04:00
Michael Roth
b74d002d3d KVM: MMU: Disable fast path if KVM_EXIT_MEMORY_FAULT is needed
For hardware-protected VMs like SEV-SNP guests, certain conditions like
attempting to perform a write to a page which is not in the state that
the guest expects it to be in can result in a nested/extended #PF which
can only be satisfied by the host performing an implicit page state
change to transition the page into the expected shared/private state.
This is generally handled by generating a KVM_EXIT_MEMORY_FAULT event
that gets forwarded to userspace to handle via
KVM_SET_MEMORY_ATTRIBUTES.

However, the fast_page_fault() code might misconstrue this situation as
being the result of a write-protected access, and treat it as a spurious
case when it sees that writes are already allowed for the sPTE. This
results in the KVM MMU trying to resume the guest rather than taking any
action to satisfy the real source of the #PF such as generating a
KVM_EXIT_MEMORY_FAULT, resulting in the guest spinning on nested #PFs.

Check for this condition and bail out of the fast path if it is
detected.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:28 -04:00
Paolo Bonzini
7323260373 Merge branch 'kvm-coco-hooks' into HEAD
Common patches for the target-independent functionality and hooks
that are needed by SEV-SNP and TDX.
2024-05-12 04:07:01 -04:00
Paolo Bonzini
7d41e24da2 KVM x86 misc changes for 6.10:
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
    is unused by hardware, so that KVM can communicate its inability to map GPAs
    that set bits 51:48 due to lack of 5-level paging.  Guest firmware is
    expected to use the information to safely remap BARs in the uppermost GPA
    space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.
 
  - Use vfree() instead of kvfree() for allocations that always use vcalloc()
    or __vcalloc().
 
  - Don't completely ignore same-value writes to immutable feature MSRs, as
    doing so results in KVM failing to reject accesses to MSR that aren't
    supposed to exist given the vCPU model and/or KVM configuration.
 
  - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
    KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
    ABSENT inhibit, even if userspace enables in-kernel local APIC).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+rlEACgkQOlYIJqCj
 N/3/xQ/7BvNl1aCJSIQy+yanCKK4wV0wWoY/hD+1wVge3zoaLZqLNHeR7fEa3vo+
 OSS/pOz+PT6DbkokZYjjVaGs6+pFqaYg5YvRE7SPbj903phm81H7v5ZLtwgOBcXx
 dG9cSLTaRhos0PxqoiLfmiGK5IDKmWuZyJzhw+nPh2YmxoRDO/4exsLA9xWWhQSh
 BjPf32cq69fn39Mo/KeANdLR1FEjvKItEty7St5r/OZFxejP8VPe1xuFxHPJn4U+
 FBbDe0DMXAPfoAQImBBhHUpm5Rp7Hwbh90tM8xY6rf3hvRZWmMCAX/Hx8C562M2b
 k6jB13gsoVesatT6lgKs2I0KGL7TSC0jLYG8aeREdBz6AEo5bkBegB5965MZYfGv
 T43i/zk+Ha5VIEURqE/CtocKF8AEjnUWLaIyL7VsDqaMslmaMdWzr8RouaO1snMT
 N/mfilzx9/rzltTV67TI8FSykPNxehwNoc9P8l+ulbW1KKIzpZCWxtIpQnT2TGdn
 89zAJ7LUbEAOnO+jMsJjld0fcNEmUqiqu9tezHuu0rVYErYqtfVhrWIf52r0AHDK
 HRY5FNcZzCE+8FFAVDNl92Of+mPeF47RELXNMLAT+1lm91ug4k62GF4UDw7hsbFo
 6+ductlj2DZlwxZVGKxKhBDxFg+AfsNCC1fZvYq+D/6ZE51eABo=
 =9RXP
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.10:

 - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
   is unused by hardware, so that KVM can communicate its inability to map GPAs
   that set bits 51:48 due to lack of 5-level paging.  Guest firmware is
   expected to use the information to safely remap BARs in the uppermost GPA
   space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.

 - Use vfree() instead of kvfree() for allocations that always use vcalloc()
   or __vcalloc().

 - Don't completely ignore same-value writes to immutable feature MSRs, as
   doing so results in KVM failing to reject accesses to MSR that aren't
   supposed to exist given the vCPU model and/or KVM configuration.

 - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
   KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
   ABSENT inhibit, even if userspace enables in-kernel local APIC).
2024-05-12 03:18:44 -04:00
Paolo Bonzini
5a1c72e07e KVM x86 MMU changes for 6.10:
- Process TDP MMU SPTEs that are are zapped while holding mmu_lock for read
    after replacing REMOVED_SPTE with '0' and flushing remote TLBs, which allows
    vCPU tasks to repopulate the zapped region while the zapper finishes tearing
    down the old, defunct page tables.
 
  - Fix a longstanding, likely benign-in-practice race where KVM could fail to
    detect a write from kvm_mmu_track_write() to a shadowed GPTE if the GPTE is
    first page table being shadowed.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+pUMACgkQOlYIJqCj
 N/2U6A//T3twYSURCUhM/3QYHDoH2RSldxQFs9i4+wJvXdvu4/VK08q1jPltTifm
 6QoloLzJq34rSPPsYAvKSicfhC9Trxz+Cks6oe2wJrDvNNzco+mksC0owj2FsdeO
 8pLh2VGqdmRU64afpnjTRneONJCsxTxHsoVdVEDSMhWiiFX9jj74QS2AbMB/XIli
 rFHK70kpEBTHGzg9E84xcjZb5DBB9+8jIGryWMtXfTAWHC0IO9gSAybLEoVAHZFL
 lUUGpeAs4P97mX28fQFqMm3ZffKE3hfHRfjEoW5BefnZeXYaABwF586I/w7QTjQI
 yHLgvh10a0a0X1hcCsDQFgy81uOLkbVDPUcBOTTY59DXT7Zp2il5bwcMvNBfaaUZ
 olR0auaeOxjPz4/WXd9JOZLaNJYCZqhEQnbEnt0RYcJ4MDULOocbD+D//+3yWPNp
 Dd6t8x73qXqa6GbtwOYWkMENwiDObTZaYBxTUhTd1z6gWpIeXx2fK8RRZ7/+/psF
 Pf/dzSvwOrXUpISQEVn6Q5sRlBS5nzd1vIWRoVe+pze2WYM3SX9E/3SksMCm+TRz
 Is8e+05HvjiaMpZeEjRjbUbBgpQakZYJ1TEwGbC6GLP/PUkssUluiDaQDxCwLPoQ
 bDb/I4NxDUbr0TaEvPszJuA1we8jGpQceq6wUo7n/mX2jC78Syo=
 =Izml
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.10' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.10:

 - Process TDP MMU SPTEs that are are zapped while holding mmu_lock for read
   after replacing REMOVED_SPTE with '0' and flushing remote TLBs, which allows
   vCPU tasks to repopulate the zapped region while the zapper finishes tearing
   down the old, defunct page tables.

 - Fix a longstanding, likely benign-in-practice race where KVM could fail to
   detect a write from kvm_mmu_track_write() to a shadowed GPTE if the GPTE is
   first page table being shadowed.
2024-05-12 03:18:30 -04:00
Paolo Bonzini
31a6cd7f16 KVM VMX changes for 6.10:
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
    L1, as per the SDM.
 
  - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
    used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
    "real" exit_qualification, which is tracked elsewhere.
 
  - Add a sanity check to assert that EPT Violations are the only sources of
    nested PML Full VM-Exits.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+qzEACgkQOlYIJqCj
 N/3O0Q/9HZruiL9vzMrLBKgFgWCxQHO2fy+EixuwzVBHunQGOsVnDCO2p+PWnF0p
 kuW/MEZhZfLYnXoDi5/AP12G9qtDhlSNnfSl2gn+BMXqyGSYpcoXuM/zTjM24wLd
 PXKkPirYMpVR2+lHsD7l8YK2I+qc7UfbRkCyJegBgGwUBs13/TBD6Rum3Aa9Q+dX
 IcwjomH+MdHDFPnpfHjksA+G79Ckkqmu/DbOAlCqw1dUSC8oyV9tE/EKStSBzjZ+
 OGMSm7Kl0T+km1JyH60H1ivbUbT3gJxpezoYL9EbO25VPrdldKP+ohqbtew/8ttk
 UP/oW3mL79I7L06ZqqxZKDDj4JGvz53UhhAylZcBPw0P3v9TQF3wm59K4eM9btNt
 eyIaT0SAbcigHAniM+3FPkq443hRxDvLNF5E66Ez03HhhkEz3ZsyNH1oPnQK0Crq
 N1e+NGuKsTAPBzc3sSSrxOHnCajTUQ9WYjOpfdSgWsL6TQOmXIvHl0tE2ILrvDc/
 f+VG62veqa9CCmX5B2lUT0yX9nXvyXKwVpJY9RSQIhB46sA8zjSZsZRCQFkDI5Gx
 pzjxjcXtydAMWpn5qUvpD0B6agMlP6WUJHlu+ezmBQuSUHr+2PHY5dEj9442SusF
 98VGJy8APxDhidK5TaJJXWmDfKNhEaWboMcTnWM1TwY/qLfDsVU=
 =0ncM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.10:

 - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
   L1, as per the SDM.

 - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
   used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
   "real" exit_qualification, which is tracked elsewhere.

 - Add a sanity check to assert that EPT Violations are the only sources of
   nested PML Full VM-Exits.
2024-05-12 03:17:17 -04:00
Paolo Bonzini
4232da23d7 Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.10

1. Add ParaVirt IPI support.
2. Add software breakpoint support.
3. Add mmio trace events support.
2024-05-10 13:20:18 -04:00
Paolo Bonzini
bbe10a5cc0 Merge branch 'kvm-sev-es-ghcbv2' into HEAD
While the main additions from GHCB protocol version 1 to version 2
revolve mostly around SEV-SNP support, there are a number of changes
applicable to SEV-ES guests as well. Pluck a handful patches from the
SNP hypervisor patchset for GHCB-related changes that are also applicable
to SEV-ES.  A KVM_SEV_INIT2 field lets userspace can control the maximum
GHCB protocol version advertised to guests and manage compatibility
across kernels/versions.
2024-05-10 13:18:59 -04:00
Paolo Bonzini
f36508422a Merge branch 'kvm-coco-pagefault-prep' into HEAD
A combination of prep work for TDX and SNP, and a clean up of the
page fault path to (hopefully) make it easier to follow the rules for
private memory, noslot faults, writes to read-only slots, etc.
2024-05-10 13:18:48 -04:00
Paolo Bonzini
1e21b53825 Merge branch 'kvm-vmx-ve' into HEAD
Allow a non-zero value for non-present SPTE and removed SPTE,
so that TDX can set the "suppress VE" bit.
2024-05-10 13:18:36 -04:00
Michael Roth
f32fb32820 KVM: x86: Add hook for determining max NPT mapping level
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest later converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.

Add a hook to determine the max NPT mapping size in situations like
this.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240501085210.2213060-3-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-10 13:11:48 -04:00
Michael Roth
a90764f0e4 KVM: guest_memfd: Add hook for invalidating memory
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231230172351.574091-6-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-10 13:11:48 -04:00
Paolo Bonzini
3bb2531e20 KVM: guest_memfd: Add hook for initializing memory
guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move these pages into a 'private' state by updating the
corresponding entries the RMP table.

Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.

The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).

In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned'.  Add an argument
(always true for now) to kvm_gmem_get_folio() that allows for the
preparation callback to be bypassed.  To detect possible issues in
the way userspace initializes memory, it is only possible to add an
unprepared page if it is not already included in the filemap.

Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231230172351.574091-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-10 13:11:46 -04:00
Michael Roth
4af663c2f6 KVM: SEV: Allow per-guest configuration of GHCB protocol version
The GHCB protocol version may be different from one guest to the next.
Add a field to track it for each KVM instance and extend KVM_SEV_INIT2
to allow it to be configured by userspace.

Now that all SEV-ES support for GHCB protocol version 2 is in place, go
ahead and default to it when creating SEV-ES guests through the new
KVM_SEV_INIT2 interface. Keep the older KVM_SEV_ES_INIT interface
restricted to GHCB protocol version 1.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501071048.2208265-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:28:05 -04:00
Michael Roth
8d1a36e42b KVM: SEV: Add GHCB handling for termination requests
GHCB version 2 adds support for a GHCB-based termination request that
a guest can issue when it reaches an error state and wishes to inform
the hypervisor that it should be terminated. Implement support for that
similarly to GHCB MSR-based termination requests that are already
available to SEV-ES guests via earlier versions of the GHCB protocol.

See 'Termination Request' in the 'Invoking VMGEXIT' section of the GHCB
specification for more details.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501071048.2208265-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:28:04 -04:00
Brijesh Singh
ae01818398 KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
Version 2 of the GHCB specification introduced advertisement of features
that are supported by the Hypervisor.

Now that KVM supports version 2 of the GHCB specification, bump the
maximum supported protocol version.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501071048.2208265-3-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:28:04 -04:00
Tom Lendacky
d916f00316 KVM: SEV: Add support to handle AP reset MSR protocol
Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
available in version 2 of the GHCB specification.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501071048.2208265-2-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:28:03 -04:00
Sean Christopherson
40269c03fd KVM: x86: Explicitly zero kvm_caps during vendor module load
Zero out all of kvm_caps when loading a new vendor module to ensure that
KVM can't inadvertently rely on global initialization of a field, and add
a comment above the definition of kvm_caps to call out that all fields
needs to be explicitly computed during vendor module load.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240423165328.2853870-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:07:35 -04:00
Sean Christopherson
555485bd86 KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
Effectively reset supported_mce_cap on vendor module load to ensure that
capabilities aren't unintentionally preserved across module reload, e.g.
if kvm-intel.ko added a module param to control LMCE support, or if
someone somehow managed to load a vendor module that doesn't support LMCE
after loading and unloading kvm-intel.ko.

Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have
a module param for LMCE, and there is no system in the world that supports
both kvm-intel.ko and kvm-amd.ko.

Fixes: c45dcc71b7 ("KVM: VMX: enable guest access to LMCE related MSRs")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240423165328.2853870-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:07:34 -04:00
Sean Christopherson
c43ad19045 KVM: x86: Fully re-initialize supported_vm_types on vendor module load
Recompute the entire set of supported VM types when a vendor module is
loaded, as preserving supported_vm_types across vendor module unload and
reload can result in VM types being incorrectly treated as supported.

E.g. if a vendor module is loaded with TDP enabled, unloaded, and then
reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly
retained.  Ditto for SEV_VM and SEV_ES_VM and their respective module
params in kvm-amd.ko.

Fixes: 2a955c4db1 ("KVM: x86: Add supported_vm_types to kvm_caps")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240423165328.2853870-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:07:34 -04:00
Sean Christopherson
2b1f435505 KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully
handle the unexpected behavior instead of continuing on with dangerous
state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot,
and so could install a bogus PFN into the guest.

The existing code is functionally ok, because kvm_faultin_pfn() pre-checks
all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily
unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same
memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set.
And checking only fault->slot would fall apart if KVM ever added a flag or
condition that forced emulation, similar to how KVM handles writes to
read-only memslots.

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:24 -04:00
Sean Christopherson
f3310e622f KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault()
to harden KVM against using "uninitialized" values.  In quotes because the
fields are actually zero-initialized, and zero is a legal value for both
page frame numbers and virtual addresses.  E.g. failure to set "pfn" prior
to creating an SPTE could result in KVM pointing at physical address '0',
which is far less desirable than KVM generating a SPTE with reserved PA
bits set and thus effectively killing the VM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:23 -04:00
Sean Christopherson
36d4492765 KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot" faults
Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot"
fault to ensure that KVM doesn't use a bogus virtual address, e.g. if
there *was* a slot but it's unusable (APIC access page), or if there
really was no slot, in which case fault->hva will be '0' (which is a
legal address for x86).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:23 -04:00
Sean Christopherson
f6adeae81f KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()
Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just
after the private versus shared check, so that there's no need to
repeatedly query whether or not a slot exists.  This also makes it more
obvious that, except for private vs. shared attributes, the process of
faulting in a pfn simply doesn't apply to gfns without a slot.

Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so
that it doesn't need to be duplicated in all paths that invoke
kvm_handle_noslot_fault(), and to minimize the probability of not stuffing
the right fields.

Leave the existing handle behind, but convert it to a WARN, to guard
against __kvm_faultin_pfn() unexpectedly nullifying fault->slot.

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:22 -04:00
Sean Christopherson
cd272fc439 KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()
Move the checks related to the validity of an access to a memslot from the
inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn().  This
allows emulating accesses to the APIC access page, which don't need to
resolve a pfn, even if there is a relevant in-progress mmu_notifier
invalidation.  Ditto for accesses to KVM internal memslots from L2, which
KVM also treats as emulated MMIO.

More importantly, this will allow for future cleanup by having the
"no memslot" case bail from kvm_faultin_pfn() very early on.

Go to rather extreme and gross lengths to make the change a glorified
nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the
related code is very subtle.  E.g. fault->slot can be nullified if it
points at the APIC access page, some flows in KVM x86 expect fault->pfn
to be KVM_PFN_NOSLOT, while others check only fault->slot, etc.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:22 -04:00
Sean Christopherson
bde9f9d27e KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIO
Explicitly detect and disallow private accesses to emulated MMIO in
kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
to perform the check.  This will allow the page fault path to go straight
to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240228024147.41573-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:21 -04:00
Sean Christopherson
5bd74f6eec KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC internal slots
Allow mapping KVM's internal memslots used for EPT without unrestricted
guest into L2, i.e. allow mapping the hidden TSS and the identity mapped
page tables into L2.  Unlike the APIC access page, there is no correctness
issue with letting L2 access the "hidden" memory.  Allowing these memslots
to be mapped into L2 fixes a largely theoretical bug where KVM could
incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures
consistent KVM behavior for L2.

If KVM is using TDP, but L1 is using shadow paging for L2, then routing
through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
and create an MMIO SPTE.  Creating an MMIO SPTE is ok, but only because
kvm_mmu_page_role.guest_mode ensure KVM uses different roots for L1 vs.
L2.  But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to
incorrectly treat an L1 access to the hidden TSS or identity mapped page
tables as MMIO.

Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't
actually prevent exposing KVM's internal memslots to L2, it simply forces
KVM to emulate the access.  In most cases, that will trigger MMIO,
amusingly due to filling vcpu->arch.mmio_gfn, but also because
vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC
accesses are ok.  But the hidden TSS and identity mapped page tables could
go either way (MMIO or access the private memslot's backing memory).

Alternatively, the inconsistent emulator behavior could be addressed by
forcing MMIO emulation for L2 access to all internal memslots, not just to
the APIC.  But that's arguably less correct than letting L2 access the
hidden TSS and identity mapped page tables, not to mention that it's
*extremely* unlikely anyone cares what KVM does in this case.  From L1's
perspective there is R/W memory at those memslots, the memory just happens
to be initialized with non-zero data.  Making the memory disappear when it
is accessed by L2 is far more magical and arbitrary than the memory
existing in the first place.

The APIC access page is special because KVM _must_ emulate the access to
do the right thing (emulate an APIC access instead of reading/writing the
APIC access page).  And despite what commit 3a2936dedd ("kvm: mmu: Don't
expose private memslots to L2") said, it's not just necessary when L1 is
accelerating L2's virtual APIC, it's just as important (likely *more*
imporant for correctness when L1 is passing through its own APIC to L2.

Fixes: 3a2936dedd ("kvm: mmu: Don't expose private memslots to L2")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:21 -04:00
Sean Christopherson
44f42ef37d KVM: x86/mmu: Move private vs. shared check above slot validity checks
Prioritize private vs. shared gfn attribute checks above slot validity
checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
userspace if there is no memslot, but emulate accesses to the APIC access
page even if the attributes mismatch.

Fixes: 8dd2eee9d5 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:20 -04:00
Sean Christopherson
07702e5a6d KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
WARN and skip the emulated MMIO fastpath if a private, reserved page fault
is encountered, as private+reserved should be an impossible combination
(KVM should never create an MMIO SPTE for a private access).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240228024147.41573-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:20 -04:00
Paolo Bonzini
cd389f5070 KVM: x86/mmu: check for invalid async page faults involving private memory
Right now the error code is not used when an async page fault is completed.
This is not a problem in the current code, but it is untidy.  For protected
VMs, we will also need to check that the page attributes match the current
state of the page, because asynchronous page faults can only occur on
shared pages (private pages go through kvm_faultin_pfn_private() instead of
__gfn_to_pfn_memslot()).

Start by piping the error code from kvm_arch_setup_async_pf() to
kvm_arch_async_page_ready() via the architecture-specific async page
fault data.  For now, it can be used to assert that there are no
async page faults on private memory.

Extracted from a patch by Isaku Yamahata.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:20 -04:00
Sean Christopherson
b3d5dc629c KVM: x86/mmu: Use synthetic page fault error code to indicate private faults
Add and use a synthetic, KVM-defined page fault error code to indicate
whether a fault is to private vs. shared memory.  TDX and SNP have
different mechanisms for reporting private vs. shared, and KVM's
software-protected VMs have no mechanism at all.  Usurp an error code
flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
and friends.

Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
for TDX and software-protected VMs as appropriate, but that would require
*clearing* the flag for SEV and SEV-ES VMs, which support encrypted
memory at the hardware layer, but don't utilize private memory at the
KVM layer.

Opportunistically add a comment to call out that the logic for software-
protected VMs is (and was before this commit) broken for nested MMUs, i.e.
for nested TDP, as the GPA is an L2 GPA.  Punt on trying to play nice with
nested MMUs as there is a _lot_ of functionality that simply doesn't work
for software-protected VMs, e.g. all of the paths where KVM accesses guest
memory need to be updated to be aware of private vs. shared memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20240228024147.41573-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:19 -04:00
Sean Christopherson
7bdbb820fe KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
as the error code for #PF is limited to 32 bits (and in practice, 16 bits
on Intel CPUS).  This behavior is architectural, is part of KVM's ABI
(see kvm_vcpu_events.error_code), and is explicitly documented as being
preserved for intecerpted #PF in both the APM:

  The error code saved in EXITINFO1 is the same as would be pushed onto
  the stack by a non-intercepted #PF exception in protected mode.

and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
32-bit field.

Simply drop the upper bits if hardware provides garbage, as spurious
information should do no harm (though in all likelihood hardware is buggy
and the kernel is doomed).

Handling all upper 32 bits in the #PF path will allow moving the sanity
check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
PFERR_GUEST_ENC_MASK without running afoul of the sanity check.

Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
bit minimizes the probability of a collision with the "other" vendor,
without needing to plumb more bits through microcode.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:19 -04:00
Isaku Yamahata
c9710130cc KVM: x86/mmu: Pass full 64-bit error code when handling page faults
Plumb the full 64-bit error code throughout the page fault handling code
so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
will be used to determine whether or not a fault is private vs. shared.

Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
the behavior of permission_fault() when invoked in the page fault path, as
KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().

Continue passing '0' from the async #PF worker, as guest_memfd and thus
private memory doesn't support async page faults.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[mdr: drop references/changes on rebase, update commit message]
Signed-off-by: Michael Roth <michael.roth@amd.com>
[sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240228024147.41573-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:18 -04:00
Sean Christopherson
dee281e4b4 KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
Move the sanity check that hardware never sets bits that collide with KVM-
define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
i.e. make the sanity check #NPF specific.  The legacy #PF path already
WARNs if _any_ of bits 63:32 are set, and the error code that comes from
VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
EXIT_QUALIFICATION into error code flags).

Add a compile-time assert in the legacy #PF handler to make sure that KVM-
define flags are covered by its existing sanity check on the upper bits.

Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
are removing the comment that defined it.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240228024147.41573-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:18 -04:00
Sean Christopherson
63b6206e2f KVM: x86: Remove separate "bit" defines for page fault error code masks
Open code the bit number directly in the PFERR_* masks and drop the
intermediate PFERR_*_BIT defines, as having to bounce through two macros
just to see which flag corresponds to which bit is quite annoying, as is
having to define two macros just to add recognition of a new flag.

Use ternary operator to derive the bit in permission_fault(), the one
function that actually needs the bit number as part of clever shifting
to avoid conditional branches.  Generally the compiler is able to turn
it into a conditional move, and if not it's not really a big deal.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240228024147.41573-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:17 -04:00
Sean Christopherson
d0bf8e6e44 KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
triggers emulation of any kind, as KVM doesn't currently support emulating
access to guest private memory.  Practically speaking, private faults and
emulation are already mutually exclusive, but there are many flow that
can result in KVM returning RET_PF_EMULATE, and adding one last check
to harden against weird, unexpected combinations and/or KVM bugs is
inexpensive.

Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240228024147.41573-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:16 -04:00
Alejandro Jimenez
51937f2aae KVM: x86: Remove VT-d mention in posted interrupt tracepoint
The kvm_pi_irte_update tracepoint is called from both SVM and VMX vendor
code, and while the "posted interrupt" naming is also adopted by SVM in
several places, VT-d specifically refers to Intel's "Virtualization
Technology for Directed I/O".

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/20240418021823.1275276-3-alejandro.j.jimenez@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-02 07:54:14 -07:00
Alejandro Jimenez
6982b34c21 KVM: x86: Only set APICV_INHIBIT_REASON_ABSENT if APICv is enabled
Use the APICv enablement status to determine if APICV_INHIBIT_REASON_ABSENT
needs to be set, instead of unconditionally setting the reason during
initialization.

Specifically, in cases where AVIC is disabled via module parameter or lack
of hardware support, unconditionally setting an inhibit reason due to the
absence of an in-kernel local APIC can lead to a scenario where the reason
incorrectly remains set after a local APIC has been created by either
KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. This is
because the helpers in charge of removing the inhibit return early if
enable_apicv is not true, and therefore the bit remains set.

This leads to confusion as to the cause why APICv is not active, since an
incorrect reason will be reported by tracepoints and/or a debugging tool
that examines the currently set inhibit reasons.

Fixes: ef8b4b7203 ("KVM: ensure APICv is considered inactive if there is no APIC")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/20240418021823.1275276-2-alejandro.j.jimenez@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-02 07:53:46 -07:00
Sean Christopherson
226d9b8f16 KVM: x86/mmu: Fix a largely theoretical race in kvm_mmu_track_write()
Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
to plug a (very, very theoretical) race where kvm_mmu_track_write() could
miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
*stale* SPTEs.

Without the barriers, because modern x86 CPUs allow (per the SDM):

  Reads may be reordered with older writes to different locations but not
  with older writes to the same location.

it's possible that the following could happen (terms of values being
visible/resolved):

 CPU0                          CPU1
 read memory[gfn] (=Y)
                               memory[gfn] Y=>X
                               read indirect_shadow_pages (=0)
 indirect_shadow_pages 0=>1

or conversely:

 CPU0                          CPU1
 indirect_shadow_pages 0=>1
                               read indirect_shadow_pages (=0)
 read memory[gfn] (=Y)
                               memory[gfn] Y=>X

E.g. in the below scenario, CPU0 could fail to zap SPTEs, and CPU1 could
fail to retry the faulting instruction, resulting in a KVM entering the
guest with a stale SPTE (map PTE=X instead of PTE=Y).

PTE = X;

CPU0:
    emulator_write_phys()
    PTE = Y
    kvm_page_track_write()
      kvm_mmu_track_write()
      // memory barrier missing here
      if (indirect_shadow_pages)
          zap();

CPU1:
   FNAME(page_fault)
     FNAME(walk_addr)
       FNAME(walk_addr_generic)
         gw->pte = PTE; // X

     FNAME(fetch)
       kvm_mmu_get_child_sp
         kvm_mmu_get_shadow_page
           __kvm_mmu_get_shadow_page
             kvm_mmu_alloc_shadow_page
               account_shadowed
                 indirect_shadow_pages++
                 // memory barrier missing here
       if (FNAME(gpte_changed)) // if (PTE == X)
           return RET_PF_RETRY;

In practice, this bug likely cannot be observed as both the 0=>1
transition and reordering of this scope are extremely rare occurrences.

Note, if the cost of the barrier (which is simply a locked ADD, see commit
450cbdd012 ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")),
is problematic, KVM could avoid the barrier by bailing earlier if checking
kvm_memslots_have_rmaps() is false.  But the odds of the barrier being
problematic is extremely low, *and* the odds of the extra checks being
meaningfully faster overall is also low.

Link: https://lore.kernel.org/r/20240423193114.2887673-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-02 07:49:06 -07:00
Sean Christopherson
1d294dfaba KVM: x86: Allow, don't ignore, same-value writes to immutable MSRs
When handling userspace writes to immutable feature MSRs for a vCPU that
has already run, fall through into the normal code to set the MSR instead
of immediately returning '0'.  I.e. allow such writes, instead of ignoring
such writes.  This fixes a bug where KVM incorrectly allows writes to the
VMX MSRs that enumerate which CR{0,4} can be set, but only if the vCPU has
already run.

The intent of returning '0' and thus ignoring the write, was to avoid any
side effects, e.g. refreshing the PMU and thus doing weird things with
perf events while the vCPU is running.  That approach sounds nice in
theory, but in practice it makes it all but impossible to maintain a sane
ABI, e.g. all VMX MSRs return -EBUSY if the CPU is post-VMXON, and the VMX
MSRs for fixed-1 CR bits are never writable, etc.

As for refreshing the PMU, kvm_set_msr_common() explicitly skips the PMU
refresh if MSR_IA32_PERF_CAPABILITIES is being written with the current
value, specifically to avoid unwanted side effects.  And if necessary,
adding similar logic for other MSRs is not difficult.

Fixes: 0094f62c7e ("KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN")
Reported-by: Jim Mattson <jmattson@google.com>
Cc: Raghavendra Rao Ananta <rananta@google.com>
Link: https://lore.kernel.org/r/20240408231500.1388122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-02 07:19:46 -07:00
Jacob Pan
2254808b53 x86/irq: Remove bitfields in posted interrupt descriptor
Mixture of bitfields and types is weird and really not intuitive, remove
bitfields and use typed data exclusively. Bitfields often result in
inferior machine code.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-4-jacob.jun.pan@linux.intel.com
Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
2024-04-30 00:54:42 +02:00
Jacob Pan
699f67512f KVM: VMX: Move posted interrupt descriptor out of VMX code
To prepare native usage of posted interrupts, move the PID declarations out
of VMX code such that they can be shared.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20240423174114.526704-2-jacob.jun.pan@linux.intel.com
2024-04-30 00:54:42 +02:00
Linus Torvalds
817772266d * Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD.  This fixes a warning
   "Unpatched return thunk in use. This should not happen!" when running
   KVM selftests.
 
 * Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
   would allow userspace to refresh the cache with a bogus GPA.  The bug has
   existed for quite some time, but was exposed by a new sanity check added in
   6.9 (to ensure a cache is either GPA-based or HVA-based).
 
 * Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
   behind during a 6.9 cleanup.
 
 * Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
   results in an array overflow (detected by KASAN).
 
 * Fix a bug where KVM incorrectly clears root_role.direct when userspace sets
   guest CPUID.
 
 * Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used
   by a nested guest, if KVM is using Page-Modification Logging and the nested
   hypervisor is NOT using EPT.
 
 x86 PMU:
 
 * Drop support for virtualizing adaptive PEBS, as KVM's implementation is
   architecturally broken without an obvious/easy path forward, and because
   exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
   host kernel addresses to the guest.
 
 * Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at
   RESET time, as done by both Intel and AMD processors.
 
 * Disable LBR virtualization on CPUs that don't support LBR callstacks, as
   KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
   perf event, and would fail on such CPUs.
 
 Tests:
 
 * Fix a flaw in the max_guest_memory selftest that results in it exhausting
   the supply of ucall structures when run with more than 256 vCPUs.
 
 * Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYjdqcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPNRAgAh1AdKBAWnq9bFN2Np1kSAcRAk3bs
 REDq/0iD1T9TvIwEmE1lHaRuqvCSO15WW+DKvbs7TS8zA0DyY7X/x8sIIy5YzZ5C
 bQ+JXiqk55OAj0sPskBpCvE5qEreuU8qAit57+8OseKWs57EICvJjrfsRnHlmIub
 pgGas3I42LjIgsuZRr2kjv+GrvaiikW+wWK6sq3CvPzTtHV196d26AK5l4NOoLkY
 0FTbBIYUSJ7wxs92xuTed5mZ7JFZdsa5DVMXF5MRZ9W6g2vZCLbqCNRddRhSAsl0
 gKmqZkuPTB7AnGQbJ2h/aKFT0ydsguzqbbKq62sK7ft5f1CUlbp9luDC9w==
 =99rq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "This is a bit on the large side, mostly due to two changes:

   - Changes to disable some broken PMU virtualization (see below for
     details under "x86 PMU")

   - Clean up SVM's enter/exit assembly code so that it can be compiled
     without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched
     return thunk in use. This should not happen!" when running KVM
     selftests.

  Everything else is small bugfixes and selftest changes:

   - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure
     where KVM would allow userspace to refresh the cache with a bogus
     GPA. The bug has existed for quite some time, but was exposed by a
     new sanity check added in 6.9 (to ensure a cache is either
     GPA-based or HVA-based).

   - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that
     got left behind during a 6.9 cleanup.

   - Fix a math goof in x86's hugepage logic for
     KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow
     (detected by KASAN).

   - Fix a bug where KVM incorrectly clears root_role.direct when
     userspace sets guest CPUID.

   - Fix a dirty logging bug in the where KVM fails to write-protect
     SPTEs used by a nested guest, if KVM is using Page-Modification
     Logging and the nested hypervisor is NOT using EPT.

  x86 PMU:

   - Drop support for virtualizing adaptive PEBS, as KVM's
     implementation is architecturally broken without an obvious/easy
     path forward, and because exposing adaptive PEBS can leak host LBRs
     to the guest, i.e. can leak host kernel addresses to the guest.

   - Set the enable bits for general purpose counters in
     PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD
     processors.

   - Disable LBR virtualization on CPUs that don't support LBR
     callstacks, as KVM unconditionally uses
     PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and
     would fail on such CPUs.

  Tests:

   - Fix a flaw in the max_guest_memory selftest that results in it
     exhausting the supply of ucall structures when run with more than
     256 vCPUs.

   - Mark KVM_MEM_READONLY as supported for RISC-V in
     set_memory_region_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits)
  KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
  KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test
  KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting
  KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()
  KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status
  KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update
  KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
  perf/x86/intel: Expose existence of callback support to KVM
  KVM: VMX: Snapshot LBR capabilities during module initialization
  KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
  KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
  KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD
  KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run()
  KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area
  KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area
  KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted
  KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run()
  KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV
  KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding
  KVM: SVM: Remove a useless zeroing of allocated memory
  ...
2024-04-20 11:10:51 -07:00
Isaku Yamahata
8131cf5b4f KVM: VMX: Introduce test mode related to EPT violation VE
To support TDX, KVM is enhanced to operate with #VE.  For TDX, KVM uses the
suppress #VE bit in EPT entries selectively, in order to be able to trap
non-present conditions.  However, #VE isn't used for VMX and it's a bug
if it happens.  To be defensive and test that VMX case isn't broken
introduce an option ept_violation_ve_test and when it's set, BUG the vm.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:21 -04:00
Paolo Bonzini
fb29541ead KVM, x86: add architectural support code for #VE
Dump the contents of the #VE info data structure and assert that #VE does
not happen, but do not yet do anything with it.

No functional change intended, separated for clarity only.

Extracted from a patch by Isaku Yamahata <isaku.yamahata@intel.com>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:20 -04:00
Sean Christopherson
949019b982 KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
TDX will use a different shadow PTE entry value for MMIO from VMX.  Add a
member to kvm_arch and track value for MMIO per-VM instead of a global
variable.  By using the per-VM EPT entry value for MMIO, the existing VMX
logic is kept working.  Introduce a separate setter function so that guest
TD can use a different value later.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:20 -04:00
Isaku Yamahata
7fa5e29291 KVM: x86/mmu: Add Suppress VE bit to EPT shadow_mmio_mask/shadow_present_mask
To make use of the same value of shadow_mmio_mask and shadow_present_mask
for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and
shadow_present_mask so that they can be common for both VMX and TDX.

TDX will require shadow_mmio_mask and shadow_present_mask to include
VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for
shared GPA.  For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the
spte value is defined so as to cause EPT misconfig.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:19 -04:00
Sean Christopherson
7f01cab849 KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE
For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
is not able to access the private memory of TD guest and do the emulation.
Instead, TD guest expects to receive #VE when it accesses the MMIO and then
it can explicitly make hypercall to KVM to get the expected information.

To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control.  And accordingly, for the MMIO spte for the shared GPA,
1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
violation happens on TD accessing MMIO range.  2. On EPT violation, KVM
sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
the #VE instead of EPT misconfiguration unlike VMX case.  For the shared GPA
that is not populated yet, EPT violation need to be triggered when TD guest
accesses such shared GPA.  The non-present SPTE value for shared GPA should
set "suppress #VE" bit.

Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE.  Unconditionally set the "suppress #VE" bit (which is bit 63)
for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
present bit is off; 2) for normal VMX guest, KVM never enables the
"EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:19 -04:00
Sean Christopherson
d8fa2031fa KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
The TDX support will need the "suppress #VE" bit (bit 63) set as the
initial value for SPTE.  To reduce code change size, introduce a new macro
SHADOW_NONPRESENT_VALUE for the initial value for the shadow page table
entry (SPTE) and replace hard-coded value 0 for it.  Initialize shadow page
tables with their value.

The plan is to unconditionally set the "suppress #VE" bit for both AMD and
Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and
ignored for non-present SPTE; 2) for conventional VMX guests, KVM never
enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is
ignored by hardware.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <acdf09bf60cad12c495005bf3495c54f6b3069c9.1705965635.git.isaku.yamahata@intel.com>
[Remove unnecessary CONFIG_X86_64 check. - Paolo]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 12:15:18 -04:00
Paolo Bonzini
a96cb3bf39 Merge x86 bugfixes from Linux 6.9-rc3
Pull fix for SEV-SNP late disable bugs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 09:02:22 -04:00
Paolo Bonzini
44ecfa3e5f Merge branch 'svm' of https://github.com/kvm-x86/linux into HEAD
Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD.  The "standard" __svm_vcpu_run() can't
be made 100% bulletproof, as RBP isn't restored on #VMEXIT, but that's
also the case for __vmx_vcpu_run(), and getting "close enough" is better
than not even trying.

As for SEV-ES, after yet another refresher on swap types, I realized
KVM can simply let the hardware restore registers after #VMEXIT, all
that's missing is storing the current values to the host save area
(they are swap type B).  This should provide 100% accuracy when using
stack frames for unwinding, and requires less assembly.

In between, build the SEV-ES code iff CONFIG_KVM_AMD_SEV=y, and yank out
"support" for 32-bit kernels in __svm_sev_es_vcpu_run, which was
unnecessarily polluting the code for a configuration that is disabled
at build time.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-17 11:44:37 -04:00
Paolo Bonzini
1c3bed8006 KVM fixes for 6.9-rcN:
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
    would allow userspace to refresh the cache with a bogus GPA.  The bug has
    existed for quite some time, but was exposed by a new sanity check added in
    6.9 (to ensure a cache is either GPA-based or HVA-based).
 
  - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
    behind during a 6.9 cleanup.
 
  - Disable support for virtualizing adaptive PEBS, as KVM's implementation is
    architecturally broken and can leak host LBRs to the guest.
 
  - Fix a bug where KVM neglects to set the enable bits for general purpose
    counters in PERF_GLOBAL_CTRL when initializing the virtual PMU.  Both Intel
    and AMD architectures require the bits to be set at RESET in order for v2
    PMUs to be backwards compatible with software that was written for v1 PMUs,
    i.e. for software that will never manually set the global enables.
 
  - Disable LBR virtualization on CPUs that don't support LBR callstacks, as
    KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
    virtual LBR perf event, i.e. KVM will always fail to create LBR events on
    such CPUs.
 
  - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
    results in an array overflow (detected by KASAN).
 
  - Fix a flaw in the max_guest_memory selftest that results in it exhausting
    the supply of ucall structures when run with more than 256 vCPUs.
 
  - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.
 
  - Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow
    root due KVM unnecessarily clobbering root_role.direct when userspace sets
    guest CPUID.
 
  - Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU
    SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1
    hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU
    to run L2).  For simplicity, KVM always disables PML when running L2, but
    the TDP MMU wasn't accounting for root-specific conditions that force write-
    protect based dirty logging.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmYYRoUACgkQOlYIJqCj
 N/2sDQ/8Dgd8lvzHieVZaWRCXzvtrmqZqxr08NTHJo4yqXiPxUd5z3lC1s6mSSQc
 RHAD21A6JstSdz6O6p3Y+koYws8YTVAZNhlBCiRnVyNuopEs+EVmUQQI5YfQiVFO
 0dX7aWRUlPH7q4OQVFhI7/owLahsuzvYCEFInWQt+586oQCpkPiiRRKF48d+n/Ba
 fuY2jYxmxI72lMoSVFE/ZSh23lKyhpyiJW/qMCBv2jbNFR8tkbrQkcuBMaHJ6Z7d
 f/7sJ4T5SA4VH+4fwctONqepAGk1jLcfZFl/21Peyf2Ieh/Oy1d1+MOmVgbpdUZR
 WE9pVsktoDMH4tMSgNI7uOgVIh43/mDVIoYwYnfrKFjoASGWpFJV7UOf87X2soVi
 MHxjYKc9PXkaG8Kua1jM0VB2jo7LKFtSoHjFBHLeKJa9Y2CS1eE8y0iWarZufEtA
 tlt6KUqOdICzB8lbNWLwRtB9jp3V/LYWRJ+YqL3QKiN9kpTB79qH+mIOjhzunASV
 RfkT8No76dCoTgX1e/qhElmWJ0OBB0zhtmELxHxGCH5AUZG4JgebyomsqkZaUAeM
 DMgMb3nZMiijW94n8xQCGVEJ1SHL3L70DtNFej3udY6Q49c6RDsoppkMSlO3D90r
 ratTwHhMc5KTk51zDW+DRmVgbBZwyhDfVK2KKJi37PbObfbJyIY=
 =0hRN
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.9-rcN' of https://github.com/kvm-x86/linux into HEAD

- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
  would allow userspace to refresh the cache with a bogus GPA.  The bug has
  existed for quite some time, but was exposed by a new sanity check added in
  6.9 (to ensure a cache is either GPA-based or HVA-based).

- Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
  behind during a 6.9 cleanup.

- Disable support for virtualizing adaptive PEBS, as KVM's implementation is
  architecturally broken and can leak host LBRs to the guest.

- Fix a bug where KVM neglects to set the enable bits for general purpose
  counters in PERF_GLOBAL_CTRL when initializing the virtual PMU.  Both Intel
  and AMD architectures require the bits to be set at RESET in order for v2
  PMUs to be backwards compatible with software that was written for v1 PMUs,
  i.e. for software that will never manually set the global enables.

- Disable LBR virtualization on CPUs that don't support LBR callstacks, as
  KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
  virtual LBR perf event, i.e. KVM will always fail to create LBR events on
  such CPUs.

- Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
  results in an array overflow (detected by KASAN).

- Fix a flaw in the max_guest_memory selftest that results in it exhausting
  the supply of ucall structures when run with more than 256 vCPUs.

- Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.

- Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow
  root due KVM unnecessarily clobbering root_role.direct when userspace sets
  guest CPUID.

- Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU
  SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1
  hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU
  to run L2).  For simplicity, KVM always disables PML when running L2, but
  the TDP MMU wasn't accounting for root-specific conditions that force write-
  protect based dirty logging.
2024-04-16 12:50:21 -04:00
Paolo Bonzini
1ab157ce57 KVM: SEV: use u64_to_user_ptr throughout
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:42:25 -04:00
Sean Christopherson
2325a21ac1 KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
TDX uses different ABI to get information about VM exit.  Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS.  The eventual code will be

VMX:
  - get exit reason, intr_info, exit_qualification, and etc from VMCS
  - call NMI/INTR handlers (common code)

TDX:
  - get exit reason, intr_info, exit_qualification, and etc from guest
    registers
  - call NMI/INTR handlers (common code)

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:42:24 -04:00
Paolo Bonzini
5f18c642ff KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM.  TDX doesn't allow VMM to operate VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures.  This means we must have a TDX
version of kvm_x86_ops.

The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure.  To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.

To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX.  Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.

The eventually converted code will look like this:

vmx.c:
  vmx_op() { ... }
  VMX initialization
tdx.c:
  tdx_op() { ... }
  TDX initialization
x86_ops.h:
  vmx_op();
  tdx_op();
main.c:
  static vt_op() { if (tdx) tdx_op() else vmx_op() }
  static struct kvm_x86_ops vt_x86_ops = {
        .op = vt_op,
  initialization functions call both VMX and TDX initialization

Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:42:24 -04:00
Sean Christopherson
e913ef159f KVM: x86: Split core of hypercall emulation to helper function
By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:42:23 -04:00
Paolo Bonzini
f9cecb3c50 Merge branch 'kvm-sev-init2' into HEAD
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic.  The first source of variability
that was encountered is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.

This series adds all the APIs that are needed to customize the features,
with room for future enhancements:

- a new /dev/kvm device attribute to retrieve the set of supported
  features (right now, only debug swap)

- a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
  replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT

It then puts the new op to work by including the VMSA features as a field
of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility; but I am considering
also making them use zero as the feature mask, and will gladly adjust the
patches if so requested.

In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
I could as well make SEV and SEV-ES use VM types.  This allows SEV-SNP
to reuse the KVM_SEV_INIT2 ioctl.

And while at it, KVM_SEV_INIT2 also includes two bugfixes.  First of all,
SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT,
reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
once the VMSA has been encrypted...  which is how the API should have
always behaved.  Second, they also synchronize the FPU and AVX state.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12 04:41:34 -04:00
David Matlack
b1a8d2b02b KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting
Drop the "If AD bits are enabled/disabled" verbiage from the comments
above kvm_tdp_mmu_clear_dirty_{slot,pt_masked}() since TDP MMU SPTEs may
need to be write-protected even when A/D bits are enabled. i.e. These
comments aren't technically correct.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-4-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:51 -07:00
David Matlack
feac19aa4c KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()
Drop the comments above clear_dirty_gfn_range() and
clear_dirty_pt_masked(), since each is word-for-word identical to the
comment above their parent function.

Leave the comment on the parent functions since they are APIs called by
the KVM/x86 MMU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:50 -07:00
David Matlack
2673dfb591 KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status
Check kvm_mmu_page_ad_need_write_protect() when deciding whether to
write-protect or clear D-bits on TDP MMU SPTEs, so that the TDP MMU
accounts for any role-specific reasons for disabling D-bit dirty logging.

Specifically, TDP MMU SPTEs must be write-protected when the TDP MMU is
being used to run an L2 (i.e. L1 has disabled EPT) and PML is enabled.
KVM always disables PML when running L2, even when L1 and L2 GPAs are in
the some domain, so failing to write-protect TDP MMU SPTEs will cause
writes made by L2 to not be reflected in the dirty log.

Reported-by: syzbot+900d58a45dcaab9e4821@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=900d58a45dcaab9e4821
Fixes: 5982a53926 ("KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot")
Cc: stable@vger.kernel.org
Cc: Vipin Sharma <vipinsh@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-2-dmatlack@google.com
[sean: massage shortlog and changelog, tweak ternary op formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:50 -07:00
Sean Christopherson
1bc26cb909 KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update
Set kvm_mmu_page_role.invalid to mark the various MMU root_roles invalid
during CPUID update in order to force a refresh, instead of zeroing out
the entire role.  This fixes a bug where kvm_mmu_free_roots() incorrectly
thinks a root is indirect, i.e. not a TDP MMU, due to "direct" being
zeroed, which in turn causes KVM to take mmu_lock for write instead of
read.

Note, paving over the entire role was largely unintentional, commit
7a458f0e1b ("KVM: x86/mmu: remove extended bits from mmu_role, rename
field") simply missed that "invalid" could be set.

Fixes: 576a15de8d ("KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read")
Reported-by: syzbot+dc308fcfcd53f987de73@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000009b38080614c49bdb@google.com
Cc: Phi Nguyen <phind.uet@gmail.com>
Link: https://lore.kernel.org/r/20240408231115.1387279-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:49 -07:00
Sean Christopherson
bb9dc85908 KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
Disable LBR virtualization if the CPU doesn't support callstacks, which
were introduced in HSW (see commit e9d7f7cd97 ("perf/x86/intel: Add
basic Haswell LBR call stack support"), as KVM unconditionally configures
the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR
virtualization always fails on pre-HSW CPUs.

Simply disable LBR support on such CPUs, as it has never worked, i.e.
there is no risk of breaking an existing setup, and figuring out a way
to performantly context switch LBRs on old CPUs is not worth the effort.

Fixes: be635e34c2 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:48 -07:00
Sean Christopherson
447112d7ed KVM: VMX: Snapshot LBR capabilities during module initialization
Snapshot VMX's LBR capabilities once during module initialization instead
of calling into perf every time a vCPU reconfigures its vPMU.  This will
allow massaging the LBR capabilities, e.g. if the CPU doesn't support
callstacks, without having to remember to update multiple locations.

Opportunistically tag vmx_get_perf_capabilities() with __init, as it's
only called from vmx_set_cpu_caps().

Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11 12:58:46 -07:00
Paolo Bonzini
f3b65bbaed KVM: delete .change_pte MMU notifier callback
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:18:27 -04:00
Paolo Bonzini
4dd5ecacb9 KVM: SEV: allow SEV-ES DebugSwap again
The DebugSwap feature of SEV-ES provides a way for confidential guests
to use data breakpoints.  Its status is record in VMSA, and therefore
attestation signatures depend on whether it is enabled or not.  In order
to avoid invalidating the signatures depending on the host machine, it
was disabled by default (see commit 5abf6dceb0, "SEV: disable SEV-ES
DebugSwap by default", 2024-03-09).

However, we now have a new API to create SEV VMs that allows enabling
DebugSwap based on what the user tells KVM to do, and we also changed the
legacy KVM_SEV_ES_INIT API to never enable DebugSwap.  It is therefore
possible to re-enable the feature without breaking compatibility with
kernels that pre-date the introduction of DebugSwap, so go ahead.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:26 -04:00
Paolo Bonzini
4f5defae70 KVM: SEV: introduce KVM_SEV_INIT2 operation
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic.  In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired.  Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.

Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct.  The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.

The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:25 -04:00
Paolo Bonzini
eb4441864e KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.

Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr().  However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:25 -04:00
Paolo Bonzini
26c44aa9e0 KVM: SEV: define VM types for SEV and SEV-ES
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:25 -04:00
Paolo Bonzini
4ebb105e6c KVM: SEV: introduce to_kvm_sev_info
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:24 -04:00
Paolo Bonzini
2a955c4db1 KVM: x86: Add supported_vm_types to kvm_caps
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES),
and also allows the vendor module to specify which VM types are supported.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:24 -04:00
Paolo Bonzini
517987e3fb KVM: x86: add fields to struct kvm_arch for CoCo features
Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.

We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected.  For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.

For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that when using VM types that will become an error.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:23 -04:00
Paolo Bonzini
605bbdc12b KVM: SEV: store VMSA features in kvm_sev_info
Right now, the set of features that are stored in the VMSA upon
initialization is fixed and depends on the module parameters for
kvm-amd.ko.  However, the hypervisor cannot really change it at will
because the feature word has to match between the hypervisor and whatever
computes a measurement of the VMSA for attestation purposes.

Add a field to kvm_sev_info that holds the set of features to be stored
in the VMSA; and query it instead of referring to the module parameters.

Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this
does not yet introduce any functional change, but it paves the way for
an API that allows customization of the features per-VM.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-6-pbonzini@redhat.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:23 -04:00
Paolo Bonzini
ac5c48027b KVM: SEV: publish supported VMSA features
Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.

The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:22 -04:00
Paolo Bonzini
546d714b08 KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR
Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function.  You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:22 -04:00
Paolo Bonzini
8d2aec3b2d KVM: x86: use u64_to_user_ptr()
There is no danger to the kernel if 32-bit userspace provides a 64-bit
value that has the high bits set, but for whatever reason happens to
resolve to an address that has something mapped there.  KVM uses the
checked version of get_user() and put_user(), so any faults are caught
properly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:22 -04:00
Paolo Bonzini
0d7bf5e5b0 KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y
Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs
in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers
is quite confusing.

To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks
and the #VMGEXIT handler. Stubs are also restricted to functions that
check sev_enabled and to the destruction functions sev_free_cpu() and
sev_vm_destroy(), where the style of their callers is to leave checks
to the callers.  Most call sites instead rely on dead code elimination
to take care of functions that are guarded with sev_guest() or
sev_es_guest().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:21 -04:00
Sean Christopherson
1ff3c89032 KVM: SVM: Invert handling of SEV and SEV_ES feature flags
Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them
in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled.  Aside
from the fact that sev_set_cpu_caps() is wildly misleading when it *clears*
capabilities, this will allow compiling out sev.c without falsely
advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:21 -04:00
Sandipan Das
49ff3b4aec KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.

For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.

This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.

Before:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]

After:

  $ perf record -e cycles:u true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]

Fixes: a16eb25b09 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: stable@vger.kernel.org
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 12:58:59 -04:00
Sean Christopherson
fd706c9b16 KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel.  To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.

Note!  This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor.  One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.

KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior.  And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel.  And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.

Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior.  The Intel code will be
cleaned up in the future.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 12:58:56 -04:00
Li RongQing
a952d608f0 KVM: Use vfree for memory allocated by vcalloc()/__vcalloc()
commit 37b2a6510a48("KVM: use __vcalloc for very large allocations")
replaced kvzalloc()/kvcalloc() with vcalloc(), but didn't replace kvfree()
with vfree().

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20240131012357.53563-1-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 12:18:38 -07:00
Gerd Hoffmann
b628cb523c KVM: x86: Advertise max mappable GPA in CPUID.0x80000008.GuestPhysBits
Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max
mappable GPA to userspace, i.e. the max GPA that is addressable by the
CPU itself.  Typically this is identical to the max effective GPA, except
in the case where the CPU supports MAXPHYADDR > 48 but does not support
5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the
fifth level TDP page table entry).

Enumerating the max mappable GPA via CPUID will allow guest firmware to
map resources like PCI bars in the highest possible address space, while
ensuring that the GPA is addressable by the CPU.  Without precise
knowledge about the max mappable GPA, the guest must assume that 5-level
paging is unsupported and thus restrict its mappings to the lower 48 bits.

Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace
doesn't have easy access to whether or not 5-level paging is supported,
and to play nice with userspace VMMs that reflect the supported CPUID
directly into the guest.

AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as:

  Maximum guest physical address size in bits.  This number applies
  only to guests using nested paging.  When this field is zero, refer
  to the PhysAddrSize field for the maximum guest physical address size.

Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
and KVM can use it as described above.  Real hardware always returns zero.

Leave GuestPhysBits as '0' when TDP is disabled in order to comply with
the APM's statement that GuestPhysBits "applies only to guest using nested
paging".  As above, guest firmware will likely create suboptimal mappings,
but that is a very minor issue and not a functional concern.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240313125844.912415-3-kraxel@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 12:18:37 -07:00
Gerd Hoffmann
6f5c960062 KVM: x86: Don't advertise guest.MAXPHYADDR as host.MAXPHYADDR in CPUID
Drop KVM's propagation of GuestPhysBits (CPUID leaf 80000008, EAX[23:16])
to HostPhysBits (same leaf, EAX[7:0]) when advertising the address widths
to userspace via KVM_GET_SUPPORTED_CPUID.

Per AMD, GuestPhysBits is intended for software use, and physical CPUs do
not set that field.  I.e. GuestPhysBits will be non-zero if and only if
KVM is running as a nested hypervisor, and in that case, GuestPhysBits is
NOT guaranteed to capture the CPU's effective MAXPHYADDR when running with
TDP enabled.

E.g. KVM will soon use GuestPhysBits to communicate the CPU's maximum
*addressable* guest physical address, which would result in KVM under-
reporting PhysBits when running as an L1 on a CPU with MAXPHYADDR=52,
but without 5-level paging.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240313125844.912415-2-kraxel@redhat.com
[sean: rewrite changelog with --verbose, Cc stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 12:18:22 -07:00
Sean Christopherson
23ffe4bbf8 KVM: nVMX: Add a sanity check that nested PML Full stems from EPT Violations
Add a WARN_ON_ONCE() sanity check to verify that a nested PML Full VM-Exit
is only synthesized when the original VM-Exit from L2 was an EPT Violation.
While KVM can fallthrough to kvm_mmu_do_page_fault() if an EPT Misconfig
occurs on a stale MMIO SPTE, KVM should not treat the access as a write
(there isn't enough information to know *what* the access was), i.e. KVM
should never try to insert a PML entry in that case.

Link: https://lore.kernel.org/r/20240209221700.393189-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:24:36 -07:00
Sean Christopherson
a946607868 KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exception
Move the exit_qualification field that is used to track information about
in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception",
i.e. associate the information with the actual nEPT violation instead of
the vCPU.  To handle bits that are pulled from vmcs.EXIT_QUALIFICATION,
i.e. that are propagated from the "original" EPT violation VM-Exit, simply
grab them from the VMCS on-demand when injecting a nEPT Violation or a PML
Full VM-exit.

Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch
is outright dangerous, e.g. see commit d7f0a00e43 ("KVM: VMX: Report
up-to-date exit qualification to userspace").

Opportunstically add a comment to call out that PML Full and EPT Violation
VM-Exits use the same bit to report NMI blocking information.

Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:24:36 -07:00
Sean Christopherson
0c47651403 KVM: nVMX: Clear EXIT_QUALIFICATION when injecting an EPT Misconfig
Explicitly clear the EXIT_QUALIFCATION field when injecting an EPT
misconfig into L1, as required by the VMX architecture.  Per the SDM:

  This field is saved for VM exits due to the following causes:
  debug exceptions; page-fault exceptions; start-up IPIs (SIPIs);
  system-management interrupts (SMIs) that arrive immediately after the
  execution of I/O instructions; task switches; INVEPT; INVLPG; INVPCID;
  INVVPID; LGDT; LIDT; LLDT; LTR; SGDT; SIDT; SLDT; STR; VMCLEAR; VMPTRLD;
  VMPTRST; VMREAD; VMWRITE; VMXON; WBINVD; WBNOINVD; XRSTORS; XSAVES;
  control-register accesses; MOV DR; I/O instructions; MWAIT; accesses to
  the APIC-access page; EPT violations; EOI virtualization; APIC-write
  emulation; page-modification log full; SPP-related events; and
  instruction timeout. For all other VM exits, this field is cleared.

Generating EXIT_QUALIFICATION from vcpu->arch.exit_qualification is wrong
for all (two) paths that lead to nested_ept_inject_page_fault().  For EPT
violations (the common case), vcpu->arch.exit_qualification will have been
set by handle_ept_violation() to vmcs02.EXIT_QUALIFICATION, i.e. contains
the information of a EPT violation and thus is likely non-zero.

For an EPT misconfig, which can reach FNAME(walk_addr_generic) and thus
inject a nEPT misconfig if KVM created an MMIO SPTE that became stale,
vcpu->arch.exit_qualification will hold the information from the last EPT
violation VM-Exit, as vcpu->arch.exit_qualification is _only_ written by
handle_ept_violation().

Fixes: 4704d0befb ("KVM: nVMX: Exiting from L2 to L1")
Link: https://lore.kernel.org/r/20240209221700.393189-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:24:36 -07:00
Sean Christopherson
27ca867042 KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD
Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD to skip objtool's
stack validation now that __svm_vcpu_run() and __svm_sev_es_vcpu_run()
create stack frames (though the former's effectiveness is dubious).

Note, due to a quirk in how OBJECT_FILES_NON_STANDARD was handled by the
build system prior to commit bf48d9b756 ("kbuild: change tool coverage
variables to take the path relative to $(obj)"), vmx/vmenter.S got lumped
in with svm/vmenter.S.  __vmx_vcpu_run() already plays nice with frame
pointers, i.e. it was collateral damage when commit 7f4b5cde24 ("kvm:
Disable objtool frame pointer checking for vmenter.S") added the
OBJECT_FILES_NON_STANDARD hack-a-fix.

Link: https://lore.kernel.org/all/20240217055504.2059803-1-masahiroy@kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:21:44 -07:00
Sean Christopherson
4367a75887 KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run()
Now that KVM uses the host save area to context switch RBP, i.e.
preserves RBP for the entirety of __svm_sev_es_vcpu_run(), create a stack
frame using the standared FRAME_{BEGIN,END} macros.

Note, __svm_sev_es_vcpu_run() is subtly not a leaf function as it can call
into ibpb_feature() via UNTRAIN_RET_VM.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:21:10 -07:00
Sean Christopherson
adac42bf42 KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area
Use the host save area to preserve volatile registers that are used in
__svm_sev_es_vcpu_run() to access function parameters after #VMEXIT.
Like saving/restoring non-volatile registers, there's no reason not to
take advantage of hardware restoring registers on #VMEXIT, as doing so
shaves a few instructions and the save area is going to be accessed no
matter what.

Converting all register save/restore code to use the host save area also
make it easier to follow the SEV-ES VMRUN flow in its entirety, as
opposed to having a mix of stack-based versus host save area save/restore.

Add a parameter to RESTORE_HOST_SPEC_CTRL_BODY so that the SEV-ES path
doesn't need to write @spec_ctrl_intercepted to memory just to play nice
with the common macro.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:21:10 -07:00
Sean Christopherson
c92be2fd8e KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area
Use the host save area to save/restore non-volatile (callee-saved)
registers in __svm_sev_es_vcpu_run() to take advantage of hardware loading
all registers from the save area on #VMEXIT.  KVM still needs to save the
registers it wants restored, but the loads are handled automatically by
hardware.

Aside from less assembly code, letting hardware do the restoration means
stack frames are preserved for the entirety of __svm_sev_es_vcpu_run().

Opportunistically add a comment to call out why @svm needs to be saved
across VMRUN->#VMEXIT, as it's not easy to decipher that from the macro
hell.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:20:29 -07:00
Sean Christopherson
87e8e360a0 KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted
POP @spec_ctrl_intercepted into RAX instead of RBX when discarding it from
the stack so that __svm_sev_es_vcpu_run() doesn't modify any non-volatile
registers.  __svm_sev_es_vcpu_run() doesn't return a value, and RAX is
already are clobbered multiple times in the #VMEXIT path.

This will allowing using the host save area to save/restore non-volatile
registers in __svm_sev_es_vcpu_run().

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:20:29 -07:00
Sean Christopherson
331282fdb1 KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run()
Drop 32-bit "support" from __svm_sev_es_vcpu_run(), as SEV/SEV-ES firmly
64-bit only.  The "support" was purely the result of bad copy+paste from
__svm_vcpu_run(), which in turn was slightly less bad copy+paste from
__vmx_vcpu_run().

Opportunistically convert to unadulterated register accesses so that it's
easier (but still not easy) to follow which registers hold what arguments,
and when.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:20:29 -07:00
Sean Christopherson
7774c8f32e KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV
Compile (and link) __svm_sev_es_vcpu_run() if and only if SEV support is
actually enabled.  This will allow dropping non-existent 32-bit "support"
from __svm_sev_es_vcpu_run() without causing undue confusion.

Intentionally don't provide a stub (but keep the declaration), as any sane
compiler, even with things like KASAN enabled, should eliminate the call
to __svm_sev_es_vcpu_run() since sev_es_guest() unconditionally returns
"false" if CONFIG_KVM_AMD_SEV=n.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:20:28 -07:00
Sean Christopherson
19597a71a0 KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding
Unconditionally create a stack frame in __svm_vcpu_run() to play nice with
unwinding via frame pointers, at least until the point where RBP is loaded
with the guest's value.  Don't bother conditioning the code on
CONFIG_FRAME_POINTER=y, as RBP needs to be saved and restored anyways (due
to it being clobbered with the guest's value); omitting the "MOV RSP, RBP"
is not worth the extra #ifdef.

Creating a stack frame will allow removing the OBJECT_FILES_NON_STANDARD
tag from vmenter.S once __svm_sev_es_vcpu_run() is fixed to not stomp all
over RBP for no reason.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:20:28 -07:00
Christophe JAILLET
4710e4fc3e KVM: SVM: Remove a useless zeroing of allocated memory
Remove KVM's unnecessary zeroing of memory when allocating the pages array
in sev_pin_memory() via __vmalloc(), as the array is only used to hold
kernel pointers.  The kmalloc() path for "small" regions doesn't zero the
array, and if KVM leaks state and/or accesses uninitialized data, then the
kernel has bigger problems.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/c7619a3d3cbb36463531a7c73ccbde9db587986c.1710004509.git.christophe.jaillet@wanadoo.fr
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 10:15:30 -07:00
David Matlack
aca48556c5 KVM: x86/mmu: Process atomically-zapped SPTEs after TLB flush
When zapping TDP MMU SPTEs under read-lock, processes zapped SPTEs *after*
flushing TLBs and after replacing the special REMOVED_SPTE with '0'.
When zapping an SPTE that points to a page table, processing SPTEs after
flushing TLBs minimizes contention on the child SPTEs (e.g. vCPUs won't
hit write-protection faults via stale, read-only child SPTEs), and
processing after replacing REMOVED_SPTE with '0' minimizes the amount of
time vCPUs will be blocked by the REMOVED_SPTE.

Processing SPTEs after setting the SPTE to '0', i.e. in parallel with the
SPTE potentially being replacing with a new SPTE, is safe because KVM does
not depend on completing the processing before a new SPTE is installed, and
the processing is done on a subset of the page tables that is disconnected
from the root, and thus unreachable by other tasks (after the TLB flush).
KVM already relies on similar logic, as kvm_mmu_zap_all_fast() can result
in KVM processing all SPTEs in a given root after vCPUs create mappings in
a new root.

In VMs with a large (400+) number of vCPUs, it can take KVM multiple
seconds to process a 1GiB region mapped with 4KiB entries, e.g. when
disabling dirty logging in a VM backed by 1GiB HugeTLB.  During those
seconds, if a vCPU accesses the 1GiB region being zapped it will be
stalled until KVM finishes processing the SPTE and replaces the
REMOVED_SPTE with 0.

Re-ordering the processing does speed up the atomic-zaps somewhat, but
the main benefit is avoiding blocking vCPU threads.

Before:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
 ...
 Disabling dirty logging time: 509.765146313s

 $ ./funclatency -m tdp_mmu_zap_spte_atomic

     msec                : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 8        |**                                      |
       256 -> 511        : 68       |******************                      |
       512 -> 1023       : 129      |**********************************      |
      1024 -> 2047       : 151      |****************************************|
      2048 -> 4095       : 60       |***************                         |

After:

 $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
 ...
 Disabling dirty logging time: 336.516838548s

 $ ./funclatency -m tdp_mmu_zap_spte_atomic

     msec                : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 12       |**                                      |
       256 -> 511        : 166      |****************************************|
       512 -> 1023       : 101      |************************                |
      1024 -> 2047       : 137      |*********************************       |

Note, KVM's processing of collapsible SPTEs is still extremely slow and
can be improved.  For example, a significant amount of time is spent
calling kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, even
when processing SPTEs that all map the same folio.  But avoiding blocking
vCPUs and contending SPTEs is valuable regardless of how fast KVM can
process collapsible SPTEs.

Link: https://lore.kernel.org/all/20240320005024.3216282-1-seanjc@google.com
Cc: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20240307194059.1357377-1-dmatlack@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09 09:37:47 -07:00
Linus Torvalds
2bb69f5fc7 x86 mitigations for the native BHI hardware vulnerabilty:
Branch History Injection (BHI) attacks may allow a malicious application to
 influence indirect branch prediction in kernel by poisoning the branch
 history. eIBRS isolates indirect branch targets in ring0.  The BHB can
 still influence the choice of indirect branch predictor entry, and although
 branch predictor entries are isolated between modes when eIBRS is enabled,
 the BHB itself is not isolated between modes.
 
 Add mitigations against it either with the help of microcode or with
 software sequences for the affected CPUs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYUKPMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofT8EACJJix+GzGUcJjOvfWFZcxwziY152hO
 5XSzHOZZL6oz5Yk/Rye/S9RVTN7aDjn1CEvI0cD/ULxaTP869sS9dDdUcHhEJ//5
 6hjqWsWiKc1QmLjBy3Pcb97GZHQXM5a9D1f6jXnJD+0FMLbQHpzSEBit0H4tv/TC
 75myGgYihvUbhN9/bL10M5fz+UADU42nChvPWDMr9ukljjCqa46tPTmKUIAW5TWj
 /xsyf+Nk+4kZpdaidKGhpof6KCV2rNeevvzUGN8Pv5y13iAmvlyplqTcQ6dlubnZ
 CuDX5Ji9spNF9WmhKpLgy5N+Ocb64oVHov98N2zw1sT1N8XOYcSM0fBj7SQIFURs
 L7T4jBZS+1c3ZGJPPFWIaGjV8w1ZMhelglwJxjY7ZgRD6fK3mwRx/ks54J8H4HjE
 FbirXaZLeKlscDIOKtnxxKoIGwpdGwLKQYi/wEw7F9NhCLSj9wMia+j3uYIUEEHr
 6xEiYEtyjcV3ocxagH7eiHyrasOKG64vjx2h1XodusBA2Wrvgm/jXlchUu+wb6B4
 LiiZJt+DmOdQ1h5j3r2rt3hw7+nWa7kyq34qfN6NSUCHiedp6q7BClueSaKiOCGk
 RoNibNiS+CqaxwGxj/RGuvajEJeEMCsLuCxzT3aeaDBsqscW6Ka/HkGA76Tpb5nJ
 E3JyjYE7AlG4rw==
 =W0W3
 -----END PGP SIGNATURE-----

Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mitigations from Thomas Gleixner:
 "Mitigations for the native BHI hardware vulnerabilty:

  Branch History Injection (BHI) attacks may allow a malicious
  application to influence indirect branch prediction in kernel by
  poisoning the branch history. eIBRS isolates indirect branch targets
  in ring0. The BHB can still influence the choice of indirect branch
  predictor entry, and although branch predictor entries are isolated
  between modes when eIBRS is enabled, the BHB itself is not isolated
  between modes.

  Add mitigations against it either with the help of microcode or with
  software sequences for the affected CPUs"

[ This also ends up enabling the full mitigation by default despite the
  system call hardening, because apparently there are other indirect
  calls that are still sufficiently reachable, and the 'auto' case just
  isn't hardened enough.

  We'll have some more inevitable tweaking in the future    - Linus ]

* tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Add BHI_NO
  x86/bhi: Mitigate KVM by default
  x86/bhi: Add BHI mitigation knob
  x86/bhi: Enumerate Branch History Injection (BHI) bug
  x86/bhi: Define SPEC_CTRL_BHI_DIS_S
  x86/bhi: Add support for clearing branch history at syscall entry
  x86/syscall: Don't force use of indirect calls for system calls
  x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
2024-04-08 20:07:51 -07:00
Tao Su
7f2817ef52 KVM: VMX: Ignore MKTME KeyID bits when intercepting #PF for allow_smaller_maxphyaddr
Use the raw/true host.MAXPHYADDR when deciding whether or not KVM must
intercept #PFs when allow_smaller_maxphyaddr is enabled, as any adjustments
the kernel makes to boot_cpu_data.x86_phys_bits to account for MKTME KeyID
bits do not apply to the guest physical address space.  I.e. the KeyID are
off-limits for host physical addresses, but are not reserved for GPAs as
far as hardware is concerned.

Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240319031111.495006-1-tao1.su@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08 14:22:10 -07:00
Sean Christopherson
de120e1d69 KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"
Set the enable bits for general purpose counters in IA32_PERF_GLOBAL_CTRL
when refreshing the PMU to emulate the MSR's architecturally defined
post-RESET behavior.  Per Intel's SDM:

  IA32_PERF_GLOBAL_CTRL:  Sets bits n-1:0 and clears the upper bits.

and

  Where "n" is the number of general-purpose counters available in the processor.

AMD also documents this behavior for PerfMonV2 CPUs in one of AMD's many
PPRs.

Do not set any PERF_GLOBAL_CTRL bits if there are no general purpose
counters, although a literal reading of the SDM would require the CPU to
set either bits 63:0 or 31:0.  The intent of the behavior is to globally
enable all GP counters; honor the intent, if not the letter of the law.

Leaving PERF_GLOBAL_CTRL '0' effectively breaks PMU usage in guests that
haven't been updated to work with PMUs that support PERF_GLOBAL_CTRL.
This bug was recently exposed when KVM added supported for AMD's
PerfMonV2, i.e. when KVM started exposing a vPMU with PERF_GLOBAL_CTRL to
guest software that only knew how to program v1 PMUs (that don't support
PERF_GLOBAL_CTRL).

Failure to emulate the post-RESET behavior results in such guests
unknowingly leaving all general purpose counters globally disabled (the
entire reason the post-RESET value sets the GP counter enable bits is to
maintain backwards compatibility).

The bug has likely gone unnoticed because PERF_GLOBAL_CTRL has been
supported on Intel CPUs for as long as KVM has existed, i.e. hardly anyone
is running guest software that isn't aware of PERF_GLOBAL_CTRL on Intel
PMUs.  And because up until v6.0, KVM _did_ emulate the behavior for Intel
CPUs, although the old behavior was likely dumb luck.

Because (a) that old code was also broken in its own way (the history of
this code is a comedy of errors), and (b) PERF_GLOBAL_CTRL was documented
as having a value of '0' post-RESET in all SDMs before March 2023.

Initial vPMU support in commit f5132b0138 ("KVM: Expose a version 2
architectural PMU to a guests") *almost* got it right (again likely by
dumb luck), but for some reason only set the bits if the guest PMU was
advertised as v1:

        if (pmu->version == 1) {
                pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
                return;
        }

Commit f19a0c2c2e ("KVM: PMU emulation: GLOBAL_CTRL MSR should be
enabled on reset") then tried to remedy that goof, presumably because
guest PMUs were leaving PERF_GLOBAL_CTRL '0', i.e. weren't enabling
counters.

        pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
                (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
        pmu->global_ctrl_mask = ~pmu->global_ctrl;

That was KVM's behavior up until commit c49467a45f ("KVM: x86/pmu:
Don't overwrite the pmu->global_ctrl when refreshing") removed
*everything*.  However, it did so based on the behavior defined by the
SDM , which at the time stated that "Global Perf Counter Controls" is
'0' at Power-Up and RESET.

But then the March 2023 SDM (325462-079US), stealthily changed its
"IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
table to say:

  IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.

Note, kvm_pmu_refresh() can be invoked multiple times, i.e. it's not a
"pure" RESET flow.  But it can only be called prior to the first KVM_RUN,
i.e. the guest will only ever observe the final value.

Note #2, KVM has always cleared global_ctrl during refresh (see commit
f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")),
i.e. there is no danger of breaking existing setups by clobbering a value
set by userspace.

Reported-by: Babu Moger <babu.moger@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240309013641.1413400-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08 13:20:27 -07:00
Rick Edgecombe
992b54bd08 KVM: x86/mmu: x86: Don't overflow lpage_info when checking attributes
Fix KVM_SET_MEMORY_ATTRIBUTES to not overflow lpage_info array and trigger
KASAN splat, as seen in the private_mem_conversions_test selftest.

When memory attributes are set on a GFN range, that range will have
specific properties applied to the TDP. A huge page cannot be used when
the attributes are inconsistent, so they are disabled for those the
specific huge pages. For internal KVM reasons, huge pages are also not
allowed to span adjacent memslots regardless of whether the backing memory
could be mapped as huge.

What GFNs support which huge page sizes is tracked by an array of arrays
'lpage_info' on the memslot, of ‘kvm_lpage_info’ structs. Each index of
lpage_info contains a vmalloc allocated array of these for a specific
supported page size. The kvm_lpage_info denotes whether a specific huge
page (GFN and page size) on the memslot is supported. These arrays include
indices for unaligned head and tail huge pages.

Preventing huge pages from spanning adjacent memslot is covered by
incrementing the count in head and tail kvm_lpage_info when the memslot is
allocated, but disallowing huge pages for memory that has mixed attributes
has to be done in a more complicated way. During the
KVM_SET_MEMORY_ATTRIBUTES ioctl KVM updates lpage_info for each memslot in
the range that has mismatched attributes. KVM does this a memslot at a
time, and marks a special bit, KVM_LPAGE_MIXED_FLAG, in the kvm_lpage_info
for any huge page. This bit is essentially a permanently elevated count.
So huge pages will not be mapped for the GFN at that page size if the
count is elevated in either case: a huge head or tail page unaligned to
the memslot or if KVM_LPAGE_MIXED_FLAG is set because it has mixed
attributes.

To determine whether a huge page has consistent attributes, the
KVM_SET_MEMORY_ATTRIBUTES operation checks an xarray to make sure it
consistently has the incoming attribute. Since level - 1 huge pages are
aligned to level huge pages, it employs an optimization. As long as the
level - 1 huge pages are checked first, it can just check these and assume
that if each level - 1 huge page contained within the level sized huge
page is not mixed, then the level size huge page is not mixed. This
optimization happens in the helper hugepage_has_attrs().

Unfortunately, although the kvm_lpage_info array representing page size
'level' will contain an entry for an unaligned tail page of size level,
the array for level - 1  will not contain an entry for each GFN at page
size level. The level - 1 array will only contain an index for any
unaligned region covered by level - 1 huge page size, which can be a
smaller region. So this causes the optimization to overflow the level - 1
kvm_lpage_info and perform a vmalloc out of bounds read.

In some cases of head and tail pages where an overflow could happen,
callers skip the operation completely as KVM_LPAGE_MIXED_FLAG is not
required to prevent huge pages as discussed earlier. But for memslots that
are smaller than the 1GB page size, it does call hugepage_has_attrs(). In
this case the huge page is both the head and tail page. The issue can be
observed simply by compiling the kernel with CONFIG_KASAN_VMALLOC and
running the selftest “private_mem_conversions_test”, which produces the
output like the following:

BUG: KASAN: vmalloc-out-of-bounds in hugepage_has_attrs+0x7e/0x110
Read of size 4 at addr ffffc900000a3008 by task private_mem_con/169
Call Trace:
  dump_stack_lvl
  print_report
  ? __virt_addr_valid
  ? hugepage_has_attrs
  ? hugepage_has_attrs
  kasan_report
  ? hugepage_has_attrs
  hugepage_has_attrs
  kvm_arch_post_set_memory_attributes
  kvm_vm_ioctl

It is a little ambiguous whether the unaligned head page (in the bug case
also the tail page) should be expected to have KVM_LPAGE_MIXED_FLAG set.
It is not functionally required, as the unaligned head/tail pages will
already have their kvm_lpage_info count incremented. The comments imply
not setting it on unaligned head pages is intentional, so fix the callers
to skip trying to set KVM_LPAGE_MIXED_FLAG in this case, and in doing so
not call hugepage_has_attrs().

Cc: stable@vger.kernel.org
Fixes: 90b4fe1798 ("KVM: x86: Disallow hugepages when memory attributes are mixed")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Peng <chao.p.peng@linux.intel.com>
Link: https://lore.kernel.org/r/20240314212902.2762507-1-rick.p.edgecombe@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08 13:20:26 -07:00
Sean Christopherson
9e985cbf29 KVM: x86/pmu: Disable support for adaptive PEBS
Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.

Bug #1 is that KVM doesn't account for the upper 32 bits of
IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g
fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters()
stores local variables as u8s and truncates the upper bits too, etc.

Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value
for PEBS events, perf will _always_ generate an adaptive record, even if
the guest requested a basic record.  Note, KVM will also enable adaptive
PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the
guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero,
i.e. the guest will only ever see Basic records.

Bug #3 is in perf.  intel_pmu_disable_fixed() doesn't clear the upper
bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and
intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE
either.  I.e. perf _always_ enables ADAPTIVE counters, regardless of what
KVM requests.

Bug #4 is that adaptive PEBS *might* effectively bypass event filters set
by the host, as "Updated Memory Access Info Group" records information
that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER.

Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least
zeros) when entering a vCPU with adaptive PEBS, which allows the guest
to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries"
records.

Disable adaptive PEBS support as an immediate fix due to the severity of
the LBR leak in particular, and because fixing all of the bugs will be
non-trivial, e.g. not suitable for backporting to stable kernels.

Note!  This will break live migration, but trying to make KVM play nice
with live migration would be quite complicated, wouldn't be guaranteed to
work (i.e. KVM might still kill/confuse the guest), and it's not clear
that there are any publicly available VMMs that support adaptive PEBS,
let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't
support PEBS in any capacity.

Link: https://lore.kernel.org/all/20240306230153.786365-1-seanjc@google.com
Link: https://lore.kernel.org/all/ZeepGjHCeSfadANM@google.com
Fixes: c59a1f106f ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: stable@vger.kernel.org
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhang Xiong <xiong.y.zhang@intel.com>
Cc: Lv Zhiyuan <zhiyuan.lv@intel.com>
Cc: Dapeng Mi <dapeng1.mi@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Acked-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20240307005833.827147-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08 13:20:25 -07:00
Daniel Sneddon
ed2e8d49b5 KVM: x86: Add BHI_NO
Intel processors that aren't vulnerable to BHI will set
MSR_IA32_ARCH_CAPABILITIES[BHI_NO] = 1;. Guests may use this BHI_NO bit to
determine if they need to implement BHI mitigations or not.  Allow this bit
to be passed to the guests.

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:06 +02:00
Pawan Gupta
95a6ccbdc7 x86/bhi: Mitigate KVM by default
BHI mitigation mode spectre_bhi=auto does not deploy the software
mitigation by default. In a cloud environment, it is a likely scenario
where userspace is trusted but the guests are not trusted. Deploying
system wide mitigation in such cases is not desirable.

Update the auto mode to unconditionally mitigate against malicious
guests. Deploy the software sequence at VMexit in auto mode also, when
hardware mitigation is not available. Unlike the force =on mode,
software sequence is not deployed at syscalls in auto mode.

Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:06 +02:00
Daniel Sneddon
0f4a837615 x86/bhi: Define SPEC_CTRL_BHI_DIS_S
Newer processors supports a hardware control BHI_DIS_S to mitigate
Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel
from userspace BHI attacks without having to manually overwrite the
branch history.

Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL.
Mitigation is enabled later.

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Pawan Gupta
7390db8aea x86/bhi: Add support for clearing branch history at syscall entry
Branch History Injection (BHI) attacks may allow a malicious application to
influence indirect branch prediction in kernel by poisoning the branch
history. eIBRS isolates indirect branch targets in ring0.  The BHB can
still influence the choice of indirect branch predictor entry, and although
branch predictor entries are isolated between modes when eIBRS is enabled,
the BHB itself is not isolated between modes.

Alder Lake and new processors supports a hardware control BHI_DIS_S to
mitigate BHI.  For older processors Intel has released a software sequence
to clear the branch history on parts that don't support BHI_DIS_S. Add
support to execute the software sequence at syscall entry and VMexit to
overwrite the branch history.

For now, branch history is not cleared at interrupt entry, as malicious
applications are not believed to have sufficient control over the
registers, since previous register state is cleared at interrupt
entry. Researchers continue to poke at this area and it may become
necessary to clear at interrupt entry as well in the future.

This mitigation is only defined here. It is enabled later.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Ingo Molnar
5f2ca44ed2 Merge branch 'linus' into x86/urgent, to pick up dependent commit
We want to fix:

  0e11073247 ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")

So merge in Linus's latest into x86/urgent to have it available.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-06 13:00:32 +02:00
Sean Christopherson
8cb4a9a82b x86/cpufeatures: Add CPUID_LNX_5 to track recently added Linux-defined word
Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate
compile-time assert in KVM to prevent direct lookups on the features in
CPUID_LNX_5.  KVM uses X86_FEATURE_* flags to manage guest CPUID, and so
must translate features that are scattered by Linux from the Linux-defined
bit to the hardware-defined bit, i.e. should never try to directly access
scattered features in guest CPUID.

Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a
compile-time assert in KVM's CPUID infrastructure to ensure that future
additions update cpuid_leafs along with NCAPINTS.

No functional change intended.

Fixes: 7f274e609f ("x86/cpufeatures: Add new word for scattered features")
Cc: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-04 17:42:19 -07:00
Borislav Petkov (AMD)
0ecaefb303 x86/CPU/AMD: Track SNP host status with cc_platform_*()
The host SNP worthiness can determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.

Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.

Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.

Move kdump_sev_callback() to its rightful place, while at it.

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
2024-04-04 10:40:30 +02:00
Borislav Petkov (AMD)
54f5f47b60 x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
The functionality to load SEV-SNP guests by the host will soon rely on
cc_platform* helpers because the cpu_feature* API with the early
patching is insufficient when SNP support needs to be disabled late.

Therefore, pull that functionality in.

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-4-bp@alien8.de
2024-04-04 10:40:23 +02:00
Paolo Bonzini
52b761b48f KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution
    are actually enabled before entering the guest in the nVHE
    configuration.
 
  - Restore out-of-range handler for stage-2 translation faults.
 
  - Several fixes to stage-2 TLB invalidations to avoid stale
    translations, possibly including partial walk caches.
 
  - Fix early handling of architectural VHE-only systems to ensure E2H is
    appropriately set.
 
  - Correct a format specifier warning in the arch_timer selftest.
 
  - Make the KVM banner message correctly handle all of the possible
    configurations.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZgtpWBccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFoilAQCQk6kLIeuih5QOe50fK4XkNsyg
 PGcxw0a0BP8cfjtJsgEArwLlfHQOTE4tRWtXyEHvapJfe/bE1hjLmzUJx7BwLQ4=
 =6hNq
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.9, part #1

 - Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.

 - Restore out-of-range handler for stage-2 translation faults.

 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.

 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.

 - Correct a format specifier warning in the arch_timer selftest.

 - Make the KVM banner message correctly handle all of the possible
   configurations.
2024-04-02 12:26:15 -04:00
Linus Torvalds
1d35aae78f Kbuild updates for v6.9
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
 
  - Use more threads when building Debian packages in parallel
 
  - Fix warnings shown during the RPM kernel package uninstallation
 
  - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
    Makefile
 
  - Support GCC's -fmin-function-alignment flag
 
  - Fix a null pointer dereference bug in modpost
 
  - Add the DTB support to the RPM package
 
  - Various fixes and cleanups in Kconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM
 VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC
 7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz
 vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG
 ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP
 OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff
 LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p
 +XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ
 FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ
 c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0
 IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V
 dWH7BPecS44h4KXx
 =tFdl
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)

 - Use more threads when building Debian packages in parallel

 - Fix warnings shown during the RPM kernel package uninstallation

 - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
   Makefile

 - Support GCC's -fmin-function-alignment flag

 - Fix a null pointer dereference bug in modpost

 - Add the DTB support to the RPM package

 - Various fixes and cleanups in Kconfig

* tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits)
  kconfig: tests: test dependency after shuffling choices
  kconfig: tests: add a test for randconfig with dependent choices
  kconfig: tests: support KCONFIG_SEED for the randconfig runner
  kbuild: rpm-pkg: add dtb files in kernel rpm
  kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig()
  kconfig: check prompt for choice while parsing
  kconfig: lxdialog: remove unused dialog colors
  kconfig: lxdialog: fix button color for blackbg theme
  modpost: fix null pointer dereference
  kbuild: remove GCC's default -Wpacked-bitfield-compat flag
  kbuild: unexport abs_srctree and abs_objtree
  kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1
  kconfig: remove named choice support
  kconfig: use linked list in get_symbol_str() to iterate over menus
  kconfig: link menus to a symbol
  kbuild: fix inconsistent indentation in top Makefile
  kbuild: Use -fmin-function-alignment when available
  alpha: merge two entries for CONFIG_ALPHA_GAMMA
  alpha: merge two entries for CONFIG_ALPHA_EV4
  kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj)
  ...
2024-03-21 14:41:00 -07:00
Paolo Bonzini
0d1756482e Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old
vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting
 disabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmX4yVUACgkQOlYIJqCj
 N/0BpQ/9Flr0fL9150AUb+yZofb0JTbVRgSNfvY12hr9vIp88KY/ryOw8OzlJy0v
 veXD3IqSxkClTp+i2ocRJi1zBVo3ww7s6VwWJwY9SkDEfIYyqRWu+Es/mHNZ/0HM
 BvMcwwyGDtHdZi2BHnztbfLzhh+AQvYm57RKBGyjTx76kdaYiiHwvHRIlJgYTC6q
 w4YBvInIys8Fj5dGKp1I72UvA0F+db9QOC4vxW/x/OAEcbMi6mMkEzdr3ftK5U/q
 8K4h1OvE3PfMXR3S0HDoqnGCenGX/93REhduOO36SfP5gupN0TzkgQwqIAWpqvER
 zQFdJ3+/6H07q83tlhpThggD7qgqQeg2a/DhFnj6AK5ima44zg+MrW3v14D42hY1
 GbBXz9CLWsnzm0ieZqaOhJW1Gx57a9AoXr5YZ7NGQxJ2fEaG7zSAzLMKP28+6PDT
 1OXlozPVAMYNL8xZmkA5+QIoBMRUQVaRhXmoW1wr7NqUqHcm6ILQl6DOIM4sGGXL
 TPMGjBkZwLVv0J5rtcSIIPoXChcB5V1DqqMyIuu+arAzoR8ulcETdqb6kJvyP1HT
 GQHtinqq/nc0cpaNhkmB4WkLg7fvMlvz5YNPQAEs+2ZZGTiwAo05jMv1Gpky3yI6
 XXQf+bhT7ghJdTJy0QKmUGw3YCDjrYXzfYfPEwwewVqAbIlrjFM=
 =o7dM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pvunhalt-6.9' of https://github.com/kvm-x86/linux into HEAD

Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old
vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting
disabled.
2024-03-18 19:19:08 -04:00
Paolo Bonzini
1d55934ed5 KVM SVM changes for 6.9:
- Add support for systems that are configured with SEV and SEV-ES+ enabled,
    but have all ASIDs assigned to SEV-ES+ guests, which effectively makes SEV
    unusuable.  Cleanup ASID handling to make supporting this scenario less
    brittle/ugly.
 
  - Return -EINVAL instead of -EBUSY if userspace attempts to invoke
    KVM_SEV{,ES}_INIT on an SEV+ guest.  The operation is simply invalid, and
    not related to resource contention in any way.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXeMssACgkQOlYIJqCj
 N/3UQg/8D5J0N1jqE6cnPsN3OA733Q+fRkfJd6zLUn5qJ8jqssxeNUiRCCUYIP8b
 ijuUB1/SCphQoIlAmy73+lmLOs2AMtW5Qaephekv4YZlSlsqIbIq12LJ88PGv/Gd
 WO6zxeWnIPh1jLvaHA5bqEg6VC/vyl0enCXaw6o0ll3UubAQ5wcHaYoW0SM28bT3
 mHJJBjElgvV9845y3sZkWYYP4AYAbrhNWVJLYgxZjByCYPHo5h0bffZKzniWxAZQ
 kANkotYJ2mMXAnagmuUvxOBxzSSVn7dYijR6u7eAx5PPodv9mptrFyY0XdGl0o8O
 MexEF4IQRpJN4JhFmC0Wm0Zw42TDq+CSBv2YqHEfnpgN7BYjIqiefx3+DdaQ3fwp
 czd+EVHHqDOklyCpBmOtZAtqSrSNAJn7OJk36Q/SCaEMbmgyE1nCNAZ7CubHpwET
 9jGumcQ2gd+fcw8Ju8ehxD9su7tQun93gIZ5DGGcw3/x0P85V5eWvafjqv5lNnZ+
 5uwHFqt9Bir1Pdk59MyWpIH1YZ//Us3KYe+yApRwyjxMpiilrkYYowvQbu0/3BKo
 0WcIDnTezYlF1EdHBruok/lgmIKm04FrlbxwAGFUFD0ClBSwZCr9K59gczX3v4sq
 giI4lWoHwRN79hM6QioeJcFDzSaxos9hppgcAw0+1fL8RsOPedA=
 =9jK/
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.9:

 - Add support for systems that are configured with SEV and SEV-ES+ enabled,
   but have all ASIDs assigned to SEV-ES+ guests, which effectively makes SEV
   unusuable.  Cleanup ASID handling to make supporting this scenario less
   brittle/ugly.

 - Return -EINVAL instead of -EBUSY if userspace attempts to invoke
   KVM_SEV{,ES}_INIT on an SEV+ guest.  The operation is simply invalid, and
   not related to resource contention in any way.
2024-03-18 19:03:26 -04:00
Linus Torvalds
4f712ee0cb S390:
* Changes to FPU handling came in via the main s390 pull request
 
 * Only deliver to the guest the SCLP events that userspace has
   requested.
 
 * More virtual vs physical address fixes (only a cleanup since
   virtual and physical address spaces are currently the same).
 
 * Fix selftests undefined behavior.
 
 x86:
 
 * Fix a restriction that the guest can't program a PMU event whose
   encoding matches an architectural event that isn't included in the
   guest CPUID.  The enumeration of an architectural event only says
   that if a CPU supports an architectural event, then the event can be
   programmed *using the architectural encoding*.  The enumeration does
   NOT say anything about the encoding when the CPU doesn't report support
   the event *in general*.  It might support it, and it might support it
   using the same encoding that made it into the architectural PMU spec.
 
 * Fix a variety of bugs in KVM's emulation of RDPMC (more details on
   individual commits) and add a selftest to verify KVM correctly emulates
   RDMPC, counter availability, and a variety of other PMC-related
   behaviors that depend on guest CPUID and therefore are easier to
   validate with selftests than with custom guests (aka kvm-unit-tests).
 
 * Zero out PMU state on AMD if the virtual PMU is disabled, it does not
   cause any bug but it wastes time in various cases where KVM would check
   if a PMC event needs to be synthesized.
 
 * Optimize triggering of emulated events, with a nice ~10% performance
   improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
   guest.
 
 * Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
 
 * Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace with an internal error exit code.
 
 * Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
 
 * Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
 
 * Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB).  KVM
   doesn't support yielding in the middle of processing a zap, and 1GiB
   granularity resulted in multi-millisecond lags that are quite impolite
   for CONFIG_PREEMPT kernels.
 
 * Allocate write-tracking metadata on-demand to avoid the memory overhead when
   a kernel is built with i915 virtualization support but the workloads use
   neither shadow paging nor i915 virtualization.
 
 * Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives.
 
 * Fix the debugregs ABI for 32-bit KVM.
 
 * Rework the "force immediate exit" code so that vendor code ultimately decides
   how and when to force the exit, which allowed some optimization for both
   Intel and AMD.
 
 * Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, causing extra unnecessary work.
 
 * Cleanup the logic for checking if the currently loaded vCPU is in-kernel.
 
 * Harden against underflowing the active mmu_notifier invalidation
   count, so that "bad" invalidations (usually due to bugs elsehwere in the
   kernel) are detected earlier and are less likely to hang the kernel.
 
 x86 Xen emulation:
 
 * Overlay pages can now be cached based on host virtual address,
   instead of guest physical addresses.  This removes the need to
   reconfigure and invalidate the cache if the guest changes the
   gpa but the underlying host virtual address remains the same.
 
 * When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.
 
 * Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).
 
 * Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.
 
 RISC-V:
 
 * Support exception and interrupt handling in selftests
 
 * New self test for RISC-V architectural timer (Sstc extension)
 
 * New extension support (Ztso, Zacas)
 
 * Support userspace emulation of random number seed CSRs.
 
 ARM:
 
 * Infrastructure for building KVM's trap configuration based on the
   architectural features (or lack thereof) advertised in the VM's ID
   registers
 
 * Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
   x86's WC) at stage-2, improving the performance of interacting with
   assigned devices that can tolerate it
 
 * Conversion of KVM's representation of LPIs to an xarray, utilized to
   address serialization some of the serialization on the LPI injection
   path
 
 * Support for _architectural_ VHE-only systems, advertised through the
   absence of FEAT_E2H0 in the CPU's ID register
 
 * Miscellaneous cleanups, fixes, and spelling corrections to KVM and
   selftests
 
 LoongArch:
 
 * Set reserved bits as zero in CPUCFG.
 
 * Start SW timer only when vcpu is blocking.
 
 * Do not restart SW timer when it is expired.
 
 * Remove unnecessary CSR register saving during enter guest.
 
 * Misc cleanups and fixes as usual.
 
 Generic:
 
 * cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always
   true on all architectures except MIPS (where Kconfig determines the
   available depending on CPU capabilities).  It is replaced either by
   an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM)
   everywhere else.
 
 * Factor common "select" statements in common code instead of requiring
   each architecture to specify it
 
 * Remove thoroughly obsolete APIs from the uapi headers.
 
 * Move architecture-dependent stuff to uapi/asm/kvm.h
 
 * Always flush the async page fault workqueue when a work item is being
   removed, especially during vCPU destruction, to ensure that there are no
   workers running in KVM code when all references to KVM-the-module are gone,
   i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
 
 * Grab a reference to the VM's mm_struct in the async #PF worker itself instead
   of gifting the worker a reference, so that there's no need to remember
   to *conditionally* clean up after the worker.
 
 Selftests:
 
 * Reduce boilerplate especially when utilize selftest TAP infrastructure.
 
 * Add basic smoke tests for SEV and SEV-ES, along with a pile of library
   support for handling private/encrypted/protected memory.
 
 * Fix benign bugs where tests neglect to close() guest_memfd files.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z
 eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP
 6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd
 j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK
 Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y
 5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA==
 =mqOV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Linus Torvalds
902861e34c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory.  Series
   "implement "memmap on memory" feature on s390".
 
 - More folio conversions from Matthew Wilcox in the series
 
 	"Convert memcontrol charge moving to use folios"
 	"mm: convert mm counter to take a folio"
 
 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes.  The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".
 
 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.
 
 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools".  Measured improvements are modest.
 
 - zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
   zswap: simplify zswap_swapoff()".
 
 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is hotplugged
   as system memory.
 
 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.
 
 - More DAMON work from SeongJae Park in the series
 
 	"mm/damon: make DAMON debugfs interface deprecation unignorable"
 	"selftests/damon: add more tests for core functionalities and corner cases"
 	"Docs/mm/damon: misc readability improvements"
 	"mm/damon: let DAMOS feeds and tame/auto-tune itself"
 
 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving policy
   wherein we allocate memory across nodes in a weighted fashion rather
   than uniformly.  This is beneficial in heterogeneous memory environments
   appearing with CXL.
 
 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
 
 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".
 
 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format.  Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.
 
 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP".  Mainly
   targeted at arm64, this significantly speeds up fork() when the process
   has a large number of pte-mapped folios.
 
 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP".  It
   implements batching during munmap() and other pte teardown situations.
   The microbenchmark improvements are nice.
 
 - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
   Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings").  Kernel build times on arm64 improved nicely.  Ryan's series
   "Address some contpte nits" provides some followup work.
 
 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page faults.
   He has also added a reproducer under the selftest code.
 
 - In the series "selftests/mm: Output cleanups for the compaction test",
   Mark Brown did what the title claims.
 
 - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
 
 - Even more zswap material from Nhat Pham.  The series "fix and extend
   zswap kselftests" does as claimed.
 
 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
   our handling of DAX on archiecctures which have virtually aliasing data
   caches.  The arm architecture is the main beneficiary.
 
 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
   improvements in worst-case mmap_lock hold times during certain
   userfaultfd operations.
 
 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series
 
 	"page_owner: print stacks and their outstanding allocations"
 	"page_owner: Fixup and cleanup"
 
 - Uladzislau Rezki has contributed some vmalloc scalability improvements
   in his series "Mitigate a vmap lock contention".  It realizes a 12x
   improvement for a certain microbenchmark.
 
 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".
 
 - Some zsmalloc maintenance work from Chengming Zhou in the series
 
 	"mm/zsmalloc: fix and optimize objects/page migration"
 	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
 
 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0.  This a step along the path to implementaton of the merging of
   large anonymous folios.  The series is named "Enable >0 order folio
   memory compaction".
 
 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages() to
   an iterator".
 
 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".
 
 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios.  The
   series is named "Split a folio to any lower order folios".
 
 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.
 
 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".
 
 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which are
   configured to use large numbers of hugetlb pages.
 
 - Matthew Wilcox's series "PageFlags cleanups" does that.
 
 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also.  S390 is affected.
 
 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".
 
 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
 
 - Also, of course, many singleton patches to many things.  Please see
   the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
 joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
 TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
 =TG55
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Linus Torvalds
216532e147 hardening updates for v6.9-rc1
- string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko)
 
 - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit
   Mogalapalli)
 
 - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael
   Ellerman)
 
 - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn)
 
 - Handle tail call optimization better in LKDTM (Douglas Anderson)
 
 - Use long form types in overflow.h (Andy Shevchenko)
 
 - Add flags param to string_get_size() (Andy Shevchenko)
 
 - Add Coccinelle script for potential struct_size() use (Jacob Keller)
 
 - Fix objtool corner case under KCFI (Josh Poimboeuf)
 
 - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng)
 
 - Add str_plural() helper (Michal Wajdeczko, Kees Cook)
 
 - Ignore relocations in .notes section
 
 - Add comments to explain how __is_constexpr() works
 
 - Fix m68k stack alignment expectations in stackinit Kunit test
 
 - Convert string selftests to KUnit
 
 - Add KUnit tests for fortified string functions
 
 - Improve reporting during fortified string warnings
 
 - Allow non-type arg to type_max() and type_min()
 
 - Allow strscpy() to be called with only 2 arguments
 
 - Add binary mode to leaking_addresses scanner
 
 - Various small cleanups to leaking_addresses scanner
 
 - Adding wrapping_*() arithmetic helper
 
 - Annotate initial signed integer wrap-around in refcount_t
 
 - Add explicit UBSAN section to MAINTAINERS
 
 - Fix UBSAN self-test warnings
 
 - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL
 
 - Reintroduce UBSAN's signed overflow sanitizer
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmXvm5kWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJiQqD/4mM6SWZpYHKlR1nEiqIyz7Hqr9
 g4oguuw6HIVNJXLyeBI5Hd43CTeHPA0e++EETqhUAt7HhErxfYJY+JB221nRYmu+
 zhhQ7N/xbTMV/Je7AR03kQjhiMm8LyEcM2X4BNrsAcoCieQzmO3g0zSp8ISzLUE0
 PEEmf1lOzMe3gK2KOFCPt5Hiz9sGWyN6at+BQubY18tQGtjEXYAQNXkpD5qhGn4a
 EF693r/17wmc8hvSsjf4AGaWy1k8crG0WfpMCZsaqftjj0BbvOC60IDyx4eFjpcy
 tGyAJKETq161AkCdNweIh2Q107fG3tm0fcvw2dv8Wt1eQCko6M8dUGCBinQs/thh
 TexjJFS/XbSz+IvxLqgU+C5qkOP23E0M9m1dbIbOFxJAya/5n16WOBlGr3ae2Wdq
 /+t8wVSJw3vZiku5emWdFYP1VsdIHUjVa5QizFaaRhzLGRwhxVV49SP4IQC/5oM5
 3MAgNOFTP6yRQn9Y9wP+SZs+SsfaIE7yfKa9zOi4S+Ve+LI2v4YFhh8NCRiLkeWZ
 R1dhp8Pgtuq76f/v0qUaWcuuVeGfJ37M31KOGIhi1sI/3sr7UMrngL8D1+F8UZMi
 zcLu+x4GtfUZCHl6znx1rNUBqE5S/5ndVhLpOqfCXKaQ+RAm7lkOJ3jXE2VhNkhp
 yVEmeSOLnlCaQjZvXQ==
 =OP+o
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "As is pretty normal for this tree, there are changes all over the
  place, especially for small fixes, selftest improvements, and improved
  macro usability.

  Some header changes ended up landing via this tree as they depended on
  the string header cleanups. Also, a notable set of changes is the work
  for the reintroduction of the UBSAN signed integer overflow sanitizer
  so that we can continue to make improvements on the compiler side to
  make this sanitizer a more viable future security hardening option.

  Summary:

   - string.h and related header cleanups (Tanzir Hasan, Andy
     Shevchenko)

   - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev,
     Harshit Mogalapalli)

   - selftests/powerpc: Fix load_unaligned_zeropad build failure
     (Michael Ellerman)

   - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn)

   - Handle tail call optimization better in LKDTM (Douglas Anderson)

   - Use long form types in overflow.h (Andy Shevchenko)

   - Add flags param to string_get_size() (Andy Shevchenko)

   - Add Coccinelle script for potential struct_size() use (Jacob
     Keller)

   - Fix objtool corner case under KCFI (Josh Poimboeuf)

   - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng)

   - Add str_plural() helper (Michal Wajdeczko, Kees Cook)

   - Ignore relocations in .notes section

   - Add comments to explain how __is_constexpr() works

   - Fix m68k stack alignment expectations in stackinit Kunit test

   - Convert string selftests to KUnit

   - Add KUnit tests for fortified string functions

   - Improve reporting during fortified string warnings

   - Allow non-type arg to type_max() and type_min()

   - Allow strscpy() to be called with only 2 arguments

   - Add binary mode to leaking_addresses scanner

   - Various small cleanups to leaking_addresses scanner

   - Adding wrapping_*() arithmetic helper

   - Annotate initial signed integer wrap-around in refcount_t

   - Add explicit UBSAN section to MAINTAINERS

   - Fix UBSAN self-test warnings

   - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL

   - Reintroduce UBSAN's signed overflow sanitizer"

* tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits)
  selftests/powerpc: Fix load_unaligned_zeropad build failure
  string: Convert helpers selftest to KUnit
  string: Convert selftest to KUnit
  sh: Fix build with CONFIG_UBSAN=y
  compiler.h: Explain how __is_constexpr() works
  overflow: Allow non-type arg to type_max() and type_min()
  VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
  lib/string_helpers: Add flags param to string_get_size()
  x86, relocs: Ignore relocations in .notes section
  objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks
  overflow: Use POD in check_shl_overflow()
  lib: stackinit: Adjust target string to 8 bytes for m68k
  sparc: vdso: Disable UBSAN instrumentation
  kernel.h: Move lib/cmdline.c prototypes to string.h
  leaking_addresses: Provide mechanism to scan binary files
  leaking_addresses: Ignore input device status lines
  leaking_addresses: Use File::Temp for /tmp files
  MAINTAINERS: Update LEAKING_ADDRESSES details
  fortify: Improve buffer overflow reporting
  fortify: Add KUnit tests for runtime overflows
  ...
2024-03-12 14:49:30 -07:00
Linus Torvalds
0e33cf955f * Mitigate RFDS vulnerability
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmXvZgoACgkQaDWVMHDJ
 krC2Eg//aZKBp97/DSzRqXKDwJzVUr0sGJ9cii0gVT1sI+1U6ZZCh/roVH4xOT5/
 HqtOOnQ+X0mwUx2VG3Yv2VPI7VW68sJ3/y9D8R4tnMEsyQ4CmDw96Pre3NyKr/Av
 jmW7SK94fOkpNFJOMk3zpk7GtRUlCsVkS1P61dOmMYduguhel/V20rWlx83BgnAY
 Rf/c3rBjqe8Ri3rzBP5icY/d6OgwoafuhME31DD/j6oKOh+EoQBvA4urj46yMTMX
 /mrK7hCm/wqwuOOvgGbo7sfZNBLCYy3SZ3EyF4beDERhPF1DaSvCwOULpGVJroqu
 SelFsKXAtEbYrDgsan+MYlx3bQv43q7PbHska1gjkH91plO4nAsssPr5VsusUKmT
 sq8jyBaauZb40oLOSgooL4RqAHrfs8q5695Ouwh/DB/XovMezUI1N/BkpGFmqpJI
 o2xH9P5q520pkB8pFhN9TbRuFSGe/dbWC24QTq1DUajo3M3RwcwX6ua9hoAKLtDF
 pCV5DNcVcXHD3Cxp0M5dQ5JEAiCnW+ZpUWgxPQamGDNW5PEvjDmFwql2uWw/qOuW
 lkheOIffq8ejUBQFbN8VXfIzzeeKQNFiIcViaqGITjIwhqdHAzVi28OuIGwtdh3g
 ywLzSC8yvyzgKrNBgtFMr3ucKN0FoPxpBro253xt2H7w8srXW64=
 =5V9t
 -----END PGP SIGNATURE-----

Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RFDS mitigation from Dave Hansen:
 "RFDS is a CPU vulnerability that may allow a malicious userspace to
  infer stale register values from kernel space. Kernel registers can
  have all kinds of secrets in them so the mitigation is basically to
  wait until the kernel is about to return to userspace and has user
  values in the registers. At that point there is little chance of
  kernel secrets ending up in the registers and the microarchitectural
  state can be cleared.

  This leverages some recent robustness fixes for the existing MDS
  vulnerability. Both MDS and RFDS use the VERW instruction for
  mitigation"

* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
  x86/rfds: Mitigate Register File Data Sampling (RFDS)
  Documentation/hw-vuln: Add documentation for RFDS
  x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-12 09:31:39 -07:00
Linus Torvalds
685d982112 Core x86 changes for v6.9:
- The biggest change is the rework of the percpu code,
   to support the 'Named Address Spaces' GCC feature,
   by Uros Bizjak:
 
    - This allows C code to access GS and FS segment relative
      memory via variables declared with such attributes,
      which allows the compiler to better optimize those accesses
      than the previous inline assembly code.
 
    - The series also includes a number of micro-optimizations
      for various percpu access methods, plus a number of
      cleanups of %gs accesses in assembly code.
 
    - These changes have been exposed to linux-next testing for
      the last ~5 months, with no known regressions in this area.
 
 - Fix/clean up __switch_to()'s broken but accidentally
   working handling of FPU switching - which also generates
   better code.
 
 - Propagate more RIP-relative addressing in assembly code,
   to generate slightly better code.
 
 - Rework the CPU mitigations Kconfig space to be less idiosyncratic,
   to make it easier for distros to follow & maintain these options.
 
 - Rework the x86 idle code to cure RCU violations and
   to clean up the logic.
 
 - Clean up the vDSO Makefile logic.
 
 - Misc cleanups and fixes.
 
 [ Please note that there's a higher number of merge commits in
   this branch (three) than is usual in x86 topic trees. This happened
   due to the long testing lifecycle of the percpu changes that
   involved 3 merge windows, which generated a longer history
   and various interactions with other core x86 changes that we
   felt better about to carry in a single branch. ]
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvB0gRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jUqRAAqnEQPiabF5acQlHrwviX+cjSobDlqtH5
 9q2AQy9qaEHapzD0XMOxvFye6XIvehGOGxSPvk6CoviSxBND8rb56lvnsEZuLeBV
 Bo5QSIL2x42Zrvo11iPHwgXZfTIusU90sBuKDRFkYBAxY3HK2naMDZe8MAsYCUE9
 nwgHF8DDc/NYiSOXV8kosWoWpNIkoK/STyH5bvTQZMqZcwyZ49AIeP1jGZb/prbC
 e/rbnlrq5Eu6brpM7xo9kELO0Vhd34urV14KrrIpdkmUKytW2KIsyvW8D6fqgDBj
 NSaQLLcz0pCXbhF+8Nqvdh/1coR4L7Ymt08P1rfEjCsQgb/2WnSAGUQuC5JoGzaj
 ngkbFcZllIbD9gNzMQ1n4Aw5TiO+l9zxCqPC/r58Uuvstr+K9QKlwnp2+B3Q73Ft
 rojIJ04NJL6lCHdDgwAjTTks+TD2PT/eBWsDfJ/1pnUWttmv9IjMpnXD5sbHxoiU
 2RGGKnYbxXczYdq/ALYDWM6JXpfnJZcXL3jJi0IDcCSsb92xRvTANYFHnTfyzGfw
 EHkhbF4e4Vy9f6QOkSP3CvW5H26BmZS9DKG0J9Il5R3u2lKdfbb5vmtUmVTqHmAD
 Ulo5cWZjEznlWCAYSI/aIidmBsp9OAEvYd+X7Z5SBIgTfSqV7VWHGt0BfA1heiVv
 F/mednG0gGc=
 =3v4F
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory
        via variables declared with such attributes, which allows the
        compiler to better optimize those accesses than the previous
        inline assembly code.

      - The series also includes a number of micro-optimizations for
        various percpu access methods, plus a number of cleanups of %gs
        accesses in assembly code.

      - These changes have been exposed to linux-next testing for the
        last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK              => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO             => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY       => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY      => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS                  => CONFIG_MITIGATION_SLS
  ...
2024-03-11 19:53:15 -07:00
Linus Torvalds
fcc196579a Misc cleanups, including a large series from Thomas Gleixner to
cure Sparse warnings.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvAFQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hkDRAAwASVCQ88kiGqNQtHibXlK54mAFGsc0xv
 T8OPds15DUzoLg/y8lw0X0DHly6MdGXVmygybejNIw2BN4lhLjQ7f4Ria7rv7LDy
 FcI1jfvysEMyYRFHGRefb/GBFzuEfKoROwf+QylGmKz0ZK674gNMngsI9pwOBdbe
 wElq3IkHoNuTUfH9QA4BvqGam1n122nvVTop3g0PMHWzx9ky8hd/BEUjXFZhfINL
 zZk3fwUbER2QYbhHt+BN2GRbdf2BrKvqTkXpKxyXTdnpiqAo0CzBGKerZ62H82qG
 n737Nib1lrsfM5yDHySnau02aamRXaGvCJUd6gpac1ZmNpZMWhEOT/0Tr/Nj5ztF
 lUAvKqMZn/CwwQky1/XxD0LHegnve0G+syqQt/7x7o1ELdiwTzOWMCx016UeodzB
 yyHf3Xx9J8nt3snlrlZBaGEfegg9ePLu5Vir7iXjg3vrloUW8A+GZM62NVxF4HVV
 QWF80BfWf8zbLQ/OS1382t1shaioIe5pEXzIjcnyVIZCiiP2/5kP2O6P4XVbwVlo
 Ca5eEt8U1rtsLUZaCzI2ZRTQf/8SLMQWyaV+ZmkVwcVdFoARC31EgdE5wYYoZOf6
 7Vl+rXd+rZCuTWk0ZgznCZEm75aaqukaQCBa2V8hIVociLFVzhg/Tjedv7s0CspA
 hNfxdN1LDZc=
 =0eJ7
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, including a large series from Thomas Gleixner to cure
  sparse warnings"

* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Drop unused declaration of proc_nmi_enabled()
  x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
  x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
  x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
  x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
  x86/percpu: Cure per CPU madness on UP
  smp: Consolidate smp_prepare_boot_cpu()
  x86/msr: Add missing __percpu annotations
  x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
  perf/x86/amd/uncore: Fix __percpu annotation
  x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
  x86/apm_32: Remove dead function apm_get_battery_status()
  x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11 19:37:56 -07:00
Linus Torvalds
38b334fc76 - Add the x86 part of the SEV-SNP host support. This will allow the
kernel to be used as a KVM hypervisor capable of running SNP (Secure
   Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal
   of the AMD confidential computing side, providing the most
   comprehensive confidential computing environment up to date.
 
   This is the x86 part and there is a KVM part which did not get ready
   in time for the merge window so latter will be forthcoming in the next
   cycle.
 
 - Rework the early code's position-dependent SEV variable references in
   order to allow building the kernel with clang and -fPIE/-fPIC and
   -mcmodel=kernel
 
 - The usual set of fixes, cleanups and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXvH0wACgkQEsHwGGHe
 VUrzmA//VS/n6dhHRnm/nAGngr4PeegkgV1OhyKYFfiZ272rT6P9QvblQrgcY0dc
 Ij1DOhEKlke51pTHvMOQ33B3P4Fuc0mx3dpCLY0up5V26kzQiKCjRKEkC4U1bcw8
 W4GqMejaR89bE14bYibmwpSib9T/uVsV65eM3xf1iF5UvsnoUaTziymDoy+nb43a
 B1pdd5vcl4mBNqXeEvt0qjg+xkMLpWUI9tJDB8mbMl/cnIFGgMZzBaY8oktHSROK
 QpuUnKegOgp1RXpfLbNjmZ2Q4Rkk4MNazzDzWq3EIxaRjXL3Qp507ePK7yeA2qa0
 J3jCBQc9E2j7lfrIkUgNIzOWhMAXM2YH5bvH6UrIcMi1qsWJYDmkp2MF1nUedjdf
 Wj16/pJbeEw1aKKIywJGwsmViSQju158vY3SzXG83U/A/Iz7zZRHFmC/ALoxZptY
 Bi7VhfcOSpz98PE3axnG8CvvxRDWMfzBr2FY1VmQbg6VBNo1Xl1aP/IH1I8iQNKg
 /laBYl/qP+1286TygF1lthYROb1lfEIJprgi2xfO6jVYUqPb7/zq2sm78qZRfm7l
 25PN/oHnuidfVfI/H3hzcGubjOG9Zwra8WWYBB2EEmelf21rT0OLqq+eS4T6pxFb
 GNVfc0AzG77UmqbrpkAMuPqL7LrGaSee4NdU3hkEdSphlx1/YTo=
 =c1ps
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Add the x86 part of the SEV-SNP host support.

   This will allow the kernel to be used as a KVM hypervisor capable of
   running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP
   is the ultimate goal of the AMD confidential computing side,
   providing the most comprehensive confidential computing environment
   up to date.

   This is the x86 part and there is a KVM part which did not get ready
   in time for the merge window so latter will be forthcoming in the
   next cycle.

 - Rework the early code's position-dependent SEV variable references in
   order to allow building the kernel with clang and -fPIE/-fPIC and
   -mcmodel=kernel

 - The usual set of fixes, cleanups and improvements all over the place

* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/sev: Disable KMSAN for memory encryption TUs
  x86/sev: Dump SEV_STATUS
  crypto: ccp - Have it depend on AMD_IOMMU
  iommu/amd: Fix failure return from snp_lookup_rmpentry()
  x86/sev: Fix position dependent variable references in startup code
  crypto: ccp: Make snp_range_list static
  x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
  Documentation: virt: Fix up pre-formatted text block for SEV ioctls
  crypto: ccp: Add the SNP_SET_CONFIG command
  crypto: ccp: Add the SNP_COMMIT command
  crypto: ccp: Add the SNP_PLATFORM_STATUS command
  x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
  KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
  iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
  crypto: ccp: Handle legacy SEV commands when SNP is enabled
  crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
  crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  x86/sev: Introduce an SNP leaked pages list
  crypto: ccp: Provide an API to issue SEV and SNP commands
  ...
2024-03-11 17:44:11 -07:00
Linus Torvalds
720c857907 Support for x86 Fast Return and Event Delivery (FRED):
FRED is a replacement for IDT event delivery on x86 and addresses most of
 the technical nightmares which IDT exposes:
 
  1) Exception cause registers like CR2 need to be manually preserved in
     nested exception scenarios.
 
  2) Hardware interrupt stack switching is suboptimal for nested exceptions
     as the interrupt stack mechanism rewinds the stack on each entry which
     requires a massive effort in the low level entry of #NMI code to handle
     this.
 
  3) No hardware distinction between entry from kernel or from user which
     makes establishing kernel context more complex than it needs to be
     especially for unconditionally nestable exceptions like NMI.
 
  4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a
     problem when the perf NMI takes a fault when collecting a stack trace.
 
  5) Partial restore of ESP when returning to a 16-bit segment
 
  6) Limitation of the vector space which can cause vector exhaustion on
     large systems.
 
  7) Inability to differentiate NMI sources
 
 FRED addresses these shortcomings by:
 
  1) An extended exception stack frame which the CPU uses to save exception
     cause registers. This ensures that the meta information for each
     exception is preserved on stack and avoids the extra complexity of
     preserving it in software.
 
  2) Hardware interrupt stack switching is non-rewinding if a nested
     exception uses the currently interrupt stack.
 
  3) The entry points for kernel and user context are separate and GS BASE
     handling which is required to establish kernel context for per CPU
     variable access is done in hardware.
 
  4) NMIs are now nesting protected. They are only reenabled on the return
     from NMI.
 
  5) FRED guarantees full restore of ESP
 
  6) FRED does not put a limitation on the vector space by design because it
     uses a central entry points for kernel and user space and the CPUstores
     the entry type (exception, trap, interrupt, syscall) on the entry stack
     along with the vector number. The entry code has to demultiplex this
     information, but this removes the vector space restriction.
 
     The first hardware implementations will still have the current
     restricted vector space because lifting this limitation requires
     further changes to the local APIC.
 
  7) FRED stores the vector number and meta information on stack which
     allows having more than one NMI vector in future hardware when the
     required local APIC changes are in place.
 
 The series implements the initial FRED support by:
 
  - Reworking the existing entry and IDT handling infrastructure to
    accomodate for the alternative entry mechanism.
 
  - Expanding the stack frame to accomodate for the extra 16 bytes FRED
    requires to store context and meta information
 
  - Providing FRED specific C entry points for events which have information
    pushed to the extended stack frame, e.g. #PF and #DB.
 
  - Providing FRED specific C entry points for #NMI and #MCE
 
  - Implementing the FRED specific ASM entry points and the C code to
    demultiplex the events
 
  - Providing detection and initialization mechanisms and the necessary
    tweaks in context switching, GS BASE handling etc.
 
 The FRED integration aims for maximum code reuse vs. the existing IDT
 implementation to the extent possible and the deviation in hot paths like
 context switching are handled with alternatives to minimalize the
 impact. The low level entry and exit paths are seperate due to the extended
 stack frame and the hardware based GS BASE swichting and therefore have no
 impact on IDT based systems.
 
 It has been extensively tested on existing systems and on the FRED
 simulation and as of now there are know outstanding problems.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuKPgTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWyUEACevJMHU+Ot9zqBPizSWxByM1uunHbp
 bjQXhaFeskd3mt7k7HU6GsPRSmC3q4lliP1Y9ypfbU0DvYSI2h/PhMWizjhmot2y
 nIvFpl51r/NsI+JHx1oXcFetz0eGHEqBui/4YQ/swgOCMymYgfqgHhazXTdldV3g
 KpH9/8W3AeGvw79uzXFH9tjBzTkbvywpam3v0LYNDJWTCuDkilyo8PjhsgRZD4x3
 V9f1nLD7nSHZW8XLoktdJJ38bKwI2Lhao91NQ0ErwopekA4/9WphZEKsDpidUSXJ
 sn1O148oQ8X92IO2OaQje8XC5pLGr5GqQBGPWzRH56P/Vd3+WOwBxaFoU6Drxc5s
 tIe23ZjkVcpA8EEG7BQBZV1Un/NX7XaCCnMniOt0RauXw+1NaslX7t/tnUAh5F1V
 TWCH4D0I0oJ0qJ7kNliGn2BP3agYXOVg81xVEUjT6KfHcYU4ImUrwi+BkeNXuXtL
 Ch5ADnbYAcUjWLFnAmEmaRtfmfNGY5T7PeGFHW2RRkaOJ88v5g14Voo6gPJaDUPn
 wMQ0nLq1xN4xZWF6ZgfRqAhArvh20k38ZujRku5vXEqnhOugQ76TF2UYiFEwOXbQ
 8jcM+yEBLGgBz7tGMwmIAml6kfxaFF1KPpdrtcPxNkGlbE6KTSuIolLx2YGUvlSU
 6/O8nwZy49ckmQ==
 =Ib7w
 -----END PGP SIGNATURE-----

Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 FRED support from Thomas Gleixner:
 "Support for x86 Fast Return and Event Delivery (FRED).

  FRED is a replacement for IDT event delivery on x86 and addresses most
  of the technical nightmares which IDT exposes:

   1) Exception cause registers like CR2 need to be manually preserved
      in nested exception scenarios.

   2) Hardware interrupt stack switching is suboptimal for nested
      exceptions as the interrupt stack mechanism rewinds the stack on
      each entry which requires a massive effort in the low level entry
      of #NMI code to handle this.

   3) No hardware distinction between entry from kernel or from user
      which makes establishing kernel context more complex than it needs
      to be especially for unconditionally nestable exceptions like NMI.

   4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
      is a problem when the perf NMI takes a fault when collecting a
      stack trace.

   5) Partial restore of ESP when returning to a 16-bit segment

   6) Limitation of the vector space which can cause vector exhaustion
      on large systems.

   7) Inability to differentiate NMI sources

  FRED addresses these shortcomings by:

   1) An extended exception stack frame which the CPU uses to save
      exception cause registers. This ensures that the meta information
      for each exception is preserved on stack and avoids the extra
      complexity of preserving it in software.

   2) Hardware interrupt stack switching is non-rewinding if a nested
      exception uses the currently interrupt stack.

   3) The entry points for kernel and user context are separate and GS
      BASE handling which is required to establish kernel context for
      per CPU variable access is done in hardware.

   4) NMIs are now nesting protected. They are only reenabled on the
      return from NMI.

   5) FRED guarantees full restore of ESP

   6) FRED does not put a limitation on the vector space by design
      because it uses a central entry points for kernel and user space
      and the CPUstores the entry type (exception, trap, interrupt,
      syscall) on the entry stack along with the vector number. The
      entry code has to demultiplex this information, but this removes
      the vector space restriction.

      The first hardware implementations will still have the current
      restricted vector space because lifting this limitation requires
      further changes to the local APIC.

   7) FRED stores the vector number and meta information on stack which
      allows having more than one NMI vector in future hardware when the
      required local APIC changes are in place.

  The series implements the initial FRED support by:

   - Reworking the existing entry and IDT handling infrastructure to
     accomodate for the alternative entry mechanism.

   - Expanding the stack frame to accomodate for the extra 16 bytes FRED
     requires to store context and meta information

   - Providing FRED specific C entry points for events which have
     information pushed to the extended stack frame, e.g. #PF and #DB.

   - Providing FRED specific C entry points for #NMI and #MCE

   - Implementing the FRED specific ASM entry points and the C code to
     demultiplex the events

   - Providing detection and initialization mechanisms and the necessary
     tweaks in context switching, GS BASE handling etc.

  The FRED integration aims for maximum code reuse vs the existing IDT
  implementation to the extent possible and the deviation in hot paths
  like context switching are handled with alternatives to minimalize the
  impact. The low level entry and exit paths are seperate due to the
  extended stack frame and the hardware based GS BASE swichting and
  therefore have no impact on IDT based systems.

  It has been extensively tested on existing systems and on the FRED
  simulation and as of now there are no outstanding problems"

* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/fred: Fix init_task thread stack pointer initialization
  MAINTAINERS: Add a maintainer entry for FRED
  x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
  x86/fred: Invoke FRED initialization code to enable FRED
  x86/fred: Add FRED initialization functions
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: No ESPFIX needed when FRED is enabled
  ...
2024-03-11 16:00:17 -07:00
Pawan Gupta
2a0180129d KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
Mitigation for RFDS requires RFDS_CLEAR capability which is enumerated
by MSR_IA32_ARCH_CAPABILITIES bit 27. If the host has it set, export it
to guests so that they can deploy the mitigation.

RFDS_NO indicates that the system is not vulnerable to RFDS, export it
to guests so that they don't deploy the mitigation unnecessarily. When
the host is not affected by X86_BUG_RFDS, but has RFDS_NO=0, synthesize
RFDS_NO to the guest.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11 13:13:50 -07:00
Paolo Bonzini
e9a2bba476 KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages
    that are "mapped" into guests via physical addresses.
 
  - Add support for using gfn_to_pfn caches with only a host virtual address,
    i.e. to bypass the "gfn" stage of the cache.  The primary use case is
    overlay pages, where the guest may change the gfn used to reference the
    overlay page, but the backing hva+pfn remains the same.
 
  - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
    of a gpa, so that userspace doesn't need to reconfigure and invalidate the
    cache/mapping if the guest changes the gpa (but userspace keeps the resolved
    hva the same).
 
  - When possible, use a single host TSC value when computing the deadline for
    Xen timers in order to improve the accuracy of the timer emulation.
 
  - Inject pending upcall events when the vCPU software-enables its APIC to fix
    a bug where an upcall can be lost (and to follow Xen's behavior).
 
  - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
    events fails, e.g. if the guest has aliased xAPIC IDs.
 
  - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
    refresh), and drop a now-redundant acquisition of xen_lock (that was
    protecting the shared_info cache) to fix a deadlock due to recursively
    acquiring xen_lock.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrblYACgkQOlYIJqCj
 N/3K4Q/+KZ8lrnNXvdHNCQdosA5DDXpqUcRzhlTUp82fncpdJ0LqrSMzMots2Eh9
 KC0jSPo8EkivF+Epug0+bpQBEaLXzTWhRcS1grePCDz2lBnxoHFSWjvaK2p14KlC
 LvxCJZjxyfLKHwKHpSndvO9hVFElCY3mvvE9KRcKeQAmrz1cz+DDMKelo1MuV8D+
 GfymhYc+UXpY41+6hQdznx+WoGoXKRameo3iGYuBoJjvKOyl4Wxkx9WSXIxxxuqG
 kHxjiWTR/jF1ITJl6PeMrFcGl3cuGKM/UfTOM6W2h6Wi3mhLpXveoVLnqR1kipIj
 btSzSVHL7C4WTPwOcyhwPzap+dJmm31c6N0uPScT7r9yhs+q5BDj26vcVcyPZUHo
 efIwmsnO2eQvuw+f8C6QqWCPaxvw46N0zxzwgc5uA3jvAC93y0l4v+xlAQsC0wzV
 0+BwU00cutH/3t3c/WPD5QcmRLH726VoFuTlaDufpoMU7gBVJ8rzjcusxR+5BKT+
 GJcAgZxZhEgvnzmTKd4Ec/mt+xZ2Erd+kV3MKCHvDPyj8jqy8FQ4DAWKGBR+h3WR
 rqAs2k8NPHyh3i1a3FL1opmxEGsRS+Cnc6Bi77cj9DxTr22JkgDJEuFR+Ues1z6/
 SpE889kt3w5zTo34+lNxNPlIKmO0ICwwhDL6pxJTWU7iWQnKypU=
 =GliW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
2024-03-11 10:42:55 -04:00
Paolo Bonzini
e9025cdd8c KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing
    fixed counters and architectural event encodings based on whether or not
    guest CPUID reports support for the _architectural_ encoding.
 
  - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
    priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
 
  - Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
    and a variety of other PMC-related behaviors that depend on guest CPUID,
    i.e. are difficult to validate via KVM-Unit-Tests.
 
  - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
    cycles, e.g. when checking if a PMC event needs to be synthesized when
    skipping an instruction.
 
  - Optimize triggering of emulated events, e.g. for "count instructions" events
    when skipping an instruction, which yields a ~10% performance improvement in
    VM-Exit microbenchmarks when a vPMU is exposed to the guest.
 
  - Tighten the check for "PMI in guest" to reduce false positives if an NMI
    arrives in the host while KVM is handling an IRQ VM-Exit.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUFQACgkQOlYIJqCj
 N/11dhAAnr9e6mPmXvaH4YKcvOGgTmwIQdi5W4IBzGm27ErEb0Vyskx3UATRhRm+
 gZyp3wNgEA9LeifICDNu4ypn7HZcl2VtRql6FYcB8Bcu8OiHfU8PhWL0/qrpY20e
 zffUj2tDweq2ft9Iks1SQJD0sxFkcXIcSKOffP7pRZJHFTKLltGORXwxzd9HJHPY
 nc4nERKegK2yH4A4gY6nZ0oV5L3OMUNHx815db5Y+HxXOIjBCjTQiNNd6mUdyX1N
 C5sIiElXLdvRTSDvirHfA32LqNwnajDGox4QKZkB3wszCxJ3kRd4OCkTEKMYKHxd
 KoKCJQnAdJFFW9xqbT8nNKXZ+hg2+ZQuoSaBuwKryf7jWi0e6a7jcV0OH+cQSZw7
 UNudKhs3r4ambfvnFp2IVZlZREMDB+LAjo2So48Jn/JGCAzqte3XqwVKskn9pS9S
 qeauXCdOLioZALYtTBl8RM1rEY5mbwQrpPv9CzbeU09qQ/hpXV14W9GmbyeOZcI1
 T1cYgEqlLuifRluwT/hxrY321+4noF116gSK1yb07x/sJU8/lhRooEk9V562066E
 qo6nIvc7Bv9gTGLwo6VReKSPcTT/6t3HwgPsRjqe+evso3EFN9f9hG+uPxtO6TUj
 pdPm3mkj2KfxDdJLf+Ys16gyGdiwI0ZImIkA0uLdM0zftNsrb4Y=
 =vayI
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.9:

 - Fix several bugs where KVM speciously prevents the guest from utilizing
   fixed counters and architectural event encodings based on whether or not
   guest CPUID reports support for the _architectural_ encoding.

 - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
   priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.

 - Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
   and a variety of other PMC-related behaviors that depend on guest CPUID,
   i.e. are difficult to validate via KVM-Unit-Tests.

 - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
   cycles, e.g. when checking if a PMC event needs to be synthesized when
   skipping an instruction.

 - Optimize triggering of emulated events, e.g. for "count instructions" events
   when skipping an instruction, which yields a ~10% performance improvement in
   VM-Exit microbenchmarks when a vPMU is exposed to the guest.

 - Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11 10:41:09 -04:00
Paolo Bonzini
b00471a552 KVM VMX changes for 6.9:
- Fix a bug where KVM would report stale/bogus exit qualification information
    when exiting to userspace due to an unexpected VM-Exit while the CPU was
    vectoring an exception.
 
  - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
 
  - Clean up the logic for massaging the passthrough MSR bitmaps when userspace
    changes its MSR filter.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUh4ACgkQOlYIJqCj
 N/2CRA//VKa4KE8zgF3xM6Btyt2NegPgmGYVyhmMHTvARyHDIl5nURy++uXseb4f
 UXQLGcoGS+CIiaMohQFhCOjoNvv/2LR9P72qVV2WjQjFxVGBchybz8bjrqIDSSvY
 TuiPJApIfZtLryFFcowo8jLEBQv3JKgfgn9r2hBwVDcYP13wSz0Z4AWntHIqBxNa
 DW75wo7wnBFzy2RfUdtAgucpbmEihqSoKA+YjUT+0GRLBI7rWbFxdEKqe3BIM/7n
 4NoJXbOmw7mlhTNumZYsF5sKiyOihBOdtUL1TDgKWjJScgmwG+KCSvrp5Ko4PZpo
 uyWWcIbskQ+cTO6dHDoIJTVPsCDxo3PgVJKG1T60CV68NavwxXCUGri1n1ZNyYH/
 bXxEW7dTGHX0TDSt3dcyVOYdZFHbaIbqpu1EXlrzBm1hnruQ1C1uBQGHZ/X+Yo86
 0qdq9SgXJ48tykr8BDruIHMy0Q8jbXxl67oXR0CdRjJGM+H9f+7RefnRN9wPYFhy
 n6Hl3kbezwCZb+8RO34Hq2CpKzNlKRHlJDhUq1ypd2vXPw8FDq1aQYKih0jAzyJQ
 yCdUueBJgo8OJtSL4HGEHvgkLHR4/XERgbCOxpSYNbqIjahAwNtbfHryUnJRWRb5
 V3QczG/TtGfEpVblbRzn3Atbft4rM5a9Z3+s0siB6C2w8wyPmZg=
 =oJso
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.9:

 - Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace due to an unexpected VM-Exit while the CPU was
   vectoring an exception.

 - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.

 - Clean up the logic for massaging the passthrough MSR bitmaps when userspace
   changes its MSR filter.
2024-03-11 10:31:29 -04:00
Paolo Bonzini
41ebae2ecd KVM x86 MMU changes for 6.9:
- Clean up code related to unprotecting shadow pages when retrying a guest
    instruction after failed #PF-induced emulation.
 
  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
    a reschedule is needed, e.g. if a high priority task needs to run.  Because
    KVM doesn't support yielding in the middle of processing a zapped non-leaf
    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
    attempting to schedule in a high priority.
 
  - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
    read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
 
  - Allocate write-tracking metadata on-demand to avoid the memory overhead when
    running kernels built with KVMGT support (external write-tracking enabled),
    but for workloads that don't use nested virtualization (shadow paging) or
    KVMGT.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrTH4ACgkQOlYIJqCj
 N/1q3xAAh3wpUDzRfkNkgGUbulhuJmQ72PiaW3NRoMo/3Rowegsdgt1N3/ec+fcJ
 Awx0KUM8Cju8O2Zqp6NzKwUkddCni8dHmOa55NJQuK2M1OpnE0RjBB94n+AFJZki
 mm8wKSKNgjlVeJDG87+RLPnbaeEvqYPp22oNKJyAPsimTbxvmhIqtg8qdyujGPXA
 Jke7LXgtVGav+nEzXiLh86VU/agoBJc/zt+hiuLvamU5Y8so+zReqFbrDtvsgtpV
 ryvMbDZxcPXKrsBP+B7syqUAbODcmh/wkzOCZ4Tby5yurEaw1rwpZIH0BRKRgGx2
 F2JqWayYsCOsrJ4DwQre8RfLMtbEKB2BBWkZlYyblAy0++1LcTP9pSk5YC5lSL71
 5Oszql9DKi10Vq5IfR/ehsr6mHXFr3AB7C7QefiXpytGbObQs8/f/OxinxaEajcs
 ERBgh+rcQ5p3kfdiHzuQjn7y45J7z21CKVhka4iKJtTxypBK4ZvkDOVqHuHppb5O
 aw6rC5HR1EKhSW4jz7QWrDExtDZ2X5HeYl8TgfHncSSJRc7urKYcSCHhXJsB6BPs
 iQf0xbHaIOyH9jmoqLZjz0QZmXB9fydQ/zAlFVXZsrNHvomayVjqrpl8UFTMdhuI
 zll9ynfRRHMUkIi1YubUlmFMgBeqOXGkfBFh8QUH3+YiI7Cwzh4=
 =SgFo
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.9:

 - Clean up code related to unprotecting shadow pages when retrying a guest
   instruction after failed #PF-induced emulation.

 - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
   a reschedule is needed, e.g. if a high priority task needs to run.  Because
   KVM doesn't support yielding in the middle of processing a zapped non-leaf
   SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
   attempting to schedule in a high priority.

 - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.

 - Allocate write-tracking metadata on-demand to avoid the memory overhead when
   running kernels built with KVMGT support (external write-tracking enabled),
   but for workloads that don't use nested virtualization (shadow paging) or
   KVMGT.
2024-03-11 10:29:22 -04:00
Paolo Bonzini
c9cd0beae9 KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that
    triggered KMSAN false positives (though in fairness in KMSAN, it's comically
    difficult to see that the uninitialized memory is never truly consumed).
 
  - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
    DR6 and DR7.
 
  - Rework the "force immediate exit" code so that vendor code ultimately
    decides how and when to force the exit.  This allows VMX to further optimize
    handling preemption timer exits, and allows SVM to avoid sending a duplicate
    IPI (SVM also has a need to force an exit).
 
  - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
    vCPU creation ultimately failed, and add WARN to guard against similar bugs.
 
  - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
    (for directed yield), and simplify the logic for checking if the currently
    loaded vCPU is in-kernel.
 
  - Misc cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrRjQACgkQOlYIJqCj
 N/2Dzw//b+ptSBAl1kGBRmk/DqsX7J9ZkQYCQOTeh1vXiUM+XRTSQoArN0Oo1roy
 3wcEnQ0beVw7jMuzZ8UUuTfU8WUMja/kwltnqXYNHwLnb6yH0I/BIengXWdUdAMc
 FmgPZ4qJR2IzKYzvDsc3eEQ515O8UHWakyVDnmLBtiakAeBcUTYceHpEEPpzE5y5
 ODASTQKM9o/h8R8JwKFTJ8/mrOLNcsu5SycwFdnmubLJCrNWtJWTijA6y1lh6shn
 hbEJex+ESoC2v8p7IP53u1SGJubVlPajt+RkYJtlEI3WVsevp024eYcF4nb1OjXi
 qS2Y3W7DQGWvyCBoSzoMY+9nRMgyOOpHYetdiz+9oZOmnjiYWY0ku59U7Gv+Aotj
 AUbCn4Ry/OpqsuZ7Oo7i3IT8R7uzsTeNNdxhYBn1OQquBEZ0KBYXlZkGfTk9K0t0
 Fhka/5Zu6fBlg5J+zCyaXUGmsGWBo/9HxsC5z1JuKo8fatro5qyqYE5KiM01dkqc
 6FET6gL+fFprC5c67JGRPdEtk6F9Emb+6oiTTA8/8q8JQQAKiJKk95Nlq7KzPfVS
 A5RQPTuTJ7acE/5CY4zB1DdxCjqgnonBEA2ULnA/J10Rk8orHJRnGJcEwKEyDrZh
 HpsxIIqt++i8KffORpCym6zSAVYuQjn1mu7MGth+zuCqhcEpBfc=
 =GX0O
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.9:

 - Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives (though in fairness in KMSAN, it's comically
   difficult to see that the uninitialized memory is never truly consumed).

 - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
   DR6 and DR7.

 - Rework the "force immediate exit" code so that vendor code ultimately
   decides how and when to force the exit.  This allows VMX to further optimize
   handling preemption timer exits, and allows SVM to avoid sending a duplicate
   IPI (SVM also has a need to force an exit).

 - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, and add WARN to guard against similar bugs.

 - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
   (for directed yield), and simplify the logic for checking if the currently
   loaded vCPU is in-kernel.

 - Misc cleanups and fixes.
2024-03-11 10:24:56 -04:00
Paolo Bonzini
961e2bfcf3 KVM/arm64 updates for 6.9
- Infrastructure for building KVM's trap configuration based on the
    architectural features (or lack thereof) advertised in the VM's ID
    registers
 
  - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
    x86's WC) at stage-2, improving the performance of interacting with
    assigned devices that can tolerate it
 
  - Conversion of KVM's representation of LPIs to an xarray, utilized to
    address serialization some of the serialization on the LPI injection
    path
 
  - Support for _architectural_ VHE-only systems, advertised through the
    absence of FEAT_E2H0 in the CPU's ID register
 
  - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
    selftests
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZepBjgAKCRCivnWIJHzd
 FnngAP93VxjCkJ+5qSmYpFNG6r0ECVIbLHFQ59nKn0+GgvbPEgEAwt8svdLdW06h
 njFTpdzvl4Po+aD/V9xHgqVz3kVvZwE=
 =1FbW
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.9

 - Infrastructure for building KVM's trap configuration based on the
   architectural features (or lack thereof) advertised in the VM's ID
   registers

 - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
   x86's WC) at stage-2, improving the performance of interacting with
   assigned devices that can tolerate it

 - Conversion of KVM's representation of LPIs to an xarray, utilized to
   address serialization some of the serialization on the LPI injection
   path

 - Support for _architectural_ VHE-only systems, advertised through the
   absence of FEAT_E2H0 in the CPU's ID register

 - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
   selftests
2024-03-11 10:02:32 -04:00
Paolo Bonzini
233d0bc4d8 LoongArch KVM changes for v6.9
1. Set reserved bits as zero in CPUCFG.
 2. Start SW timer only when vcpu is blocking.
 3. Do not restart SW timer when it is expired.
 4. Remove unnecessary CSR register saving during enter guest.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmXoeWIWHGNoZW5odWFj
 YWlAa2VybmVsLm9yZwAKCRAChivD8uImehb3D/9C5IrdyU/2f3fEUuuXO0a2ZS1p
 l2OT+yr7C6/jATokGcd+53CF8MzYawzuAT3tSXYyoqAxRu0HUkvuS1oA/eFM4EwV
 iIoUC3jnqcsQ5LCPt6yt+Tzgug64Xm5F4btYWIpmXgCJWx/VVG6+z3JarXAfA2it
 vgVMGgrrfHt68sEsenNFNgiJ5tCCubjR7XFwjM8rsL7AzUDdmXpF7gFyH2Ufgosi
 a5CxcPPauO1y5ZCGU4JU9QvxnVqW1kt/TRZIGqqGfULtlBSoZbD9zP3OcCQkL+ai
 SPNxvU5I+BeX6honpmO6aR/F1EphQhRji3ZKxI8UBo4aJD5+FtMG/YOEPI+ZAS0/
 JPuWpDqJH46SN3jfKTQay8jXc+mcnOYXJ9Yrixd4UCf66WJit/+BOma/wP638u2j
 RUzm1kqhNGad6QiDDtSjISM6sg6FozAGc/KhCkWAhV+lHLnfkXtaf3S+GIu5OiWz
 ETCKlmIGiy0y774+iftlD7RDRGmtrC4cx5ibl7cKKi62Y5vgujCdDofAyYC+D5cW
 puaIuHOx1hWtPRT9p1WfUL310ED+Qj3N2pDDcJcqdCIiRRZ5l/hxGS7V687a30WV
 GcegEqh19CjI9KDat4E1ld4jUHJxaFrw3pr2z3SP7cW3IgdngPJL57M0M2jSazaQ
 479xZPJ/i4xhJaKACg==
 =8HOW
 -----END PGP SIGNATURE-----

Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.9

* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
2024-03-11 09:56:54 -04:00
Linus Torvalds
137e0ec05a KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
   avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not writable
   from userspace, so there would be no way to write to a read-only
   guest_memfd).
 
 - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
   clear that such VMs are purely for development and testing.
 
 - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
   is to support confidential VMs with deterministic private memory (SNP
   and TDX) only in the TDP MMU.
 
 - Fix a bug in a GUEST_MEMFD dirty logging test that caused false passes.
 
 x86 fixes:
 
 - Fix missing marking of a guest page as dirty when emulating an atomic access.
 
 - Check for mmu_notifier invalidation events before faulting in the pfn,
   and before acquiring mmu_lock, to avoid unnecessary work and lock
   contention with preemptible kernels (including CONFIG_PREEMPT_DYNAMIC
   in non-preemptible mode).
 
 - Disable AMD DebugSwap by default, it breaks VMSA signing and will be
   re-enabled with a better VM creation API in 6.10.
 
 - Do the cache flush of converted pages in svm_register_enc_region() before
   dropping kvm->lock, to avoid a race with unregistering of the same region
   and the consequent use-after-free issue.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmXskdYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroN1TAf/SUGf4QuYG7nnfgWDR+goFO6Gx7NE
 pJr3kAwv6d2f+qTlURfGjnX929pgZDLgoTkXTNeZquN6LjgownxMjBIpymVobvAD
 AKvqJS/ECpryuehXbeqlxJxJn+TrxJ5r4QeNILMHc3AOZoiUqM6xl3zFfXWDNWVo
 IazwT8P3d8wxiHAxv1eG6OVWHxbcg31068FVKRX3f/bWPbVwROJrPkCopmz2BJvU
 6KYdYcn2rkpDTEM3ouDC/6gxJ9vpSY3+nW7Q7dNtGtOH2+BddfSA6I0rphCQWCNs
 uXOxd5bDrC+KmkiULTPostuvwBgIm1k9wC2kW9A4P2VEf6Ay+ZHEdAOBJQ==
 =+MT/
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "KVM GUEST_MEMFD fixes for 6.8:

   - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
     to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not
     writable from userspace, so there would be no way to write to a
     read-only guest_memfd).

   - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
     clear that such VMs are purely for development and testing.

   - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term
     plan is to support confidential VMs with deterministic private
     memory (SNP and TDX) only in the TDP MMU.

   - Fix a bug in a GUEST_MEMFD dirty logging test that caused false
     passes.

  x86 fixes:

   - Fix missing marking of a guest page as dirty when emulating an
     atomic access.

   - Check for mmu_notifier invalidation events before faulting in the
     pfn, and before acquiring mmu_lock, to avoid unnecessary work and
     lock contention with preemptible kernels (including
     CONFIG_PREEMPT_DYNAMIC in non-preemptible mode).

   - Disable AMD DebugSwap by default, it breaks VMSA signing and will
     be re-enabled with a better VM creation API in 6.10.

   - Do the cache flush of converted pages in svm_register_enc_region()
     before dropping kvm->lock, to avoid a race with unregistering of
     the same region and the consequent use-after-free issue"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  SEV: disable SEV-ES DebugSwap by default
  KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
  KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
  KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive
  KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases
  KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
  KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
  KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
  KVM: x86: Mark target gfn of emulated atomic instruction as dirty
2024-03-10 09:27:39 -07:00
Paolo Bonzini
7d8942d8e7 KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
    avoid creating ABI that KVM can't sanely support.
 
  - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
    clear that such VMs are purely a development and testing vehicle, and
    come with zero guarantees.
 
  - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
    is to support confidential VMs with deterministic private memory (SNP
    and TDX) only in the TDP MMU.
 
  - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
    when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXZB/8ACgkQOlYIJqCj
 N/3XlQ//RIsvqr38k7kELSKhCMyWgF4J57itABrHpMqAZu3gaAo5sETX8AGcHEe5
 mxmquxyNQSf4cthhWy1kzxjGCy6+fk+Z0Z7wzfz0Yd5D+FI6vpo3HhkjovLb2gpt
 kSrHuhJyuj2vkftNvdaz0nHX1QalVyIEnXnR3oqTmxUUsg6lp1x/zr5SP0KBXjo8
 ZzJtyFd0fkRXWpA792T7XPRBWrzPV31HYZBLX8sPlYmJATcbIx9rYSThgCN6XuVN
 bfE6wATsC+mwv5BpCoDFpCKmFcqSqamag9NGe5qE5mOby5DQGYTCRMCQB8YXXBR0
 97ppaY9ZJV4nOVjrYJn6IMOSMVNfoG7nTRFfcd0eFP4tlPEgHwGr5BGDaBtQPkrd
 KcgWJw8nS02eCA2iOE+FtCXvGJwKhTTjQ45w7rU4EcfUk603L5J4GO1ddmjMhPcP
 upGGcWDK9vCGrSUFTm8pyWp/NKRJPvAQEiQd/BweSk9+isQHTX2RYCQgPAQnwlTS
 wTg7ZPNSLoUkRYmd6r+TUT32ELJGNc8GLftMnxIwweq6V7AgNMi0HE60eMovuBNO
 7DAWWzfBEZmJv+0mNNZPGXczHVv4YvMWysRdKkhztBc3+sO7P3AL1zWIDlm5qwoG
 LpFeeI3qo3o5ZNaqGzkSop2pUUGNGpWCH46WmP0AG7RpzW/Natw=
 =M0td
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM GUEST_MEMFD fixes for 6.8:

 - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
   avoid creating ABI that KVM can't sanely support.

 - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
   clear that such VMs are purely a development and testing vehicle, and
   come with zero guarantees.

 - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
   is to support confidential VMs with deterministic private memory (SNP
   and TDX) only in the TDP MMU.

 - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
   when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09 11:48:35 -05:00
Paolo Bonzini
5abf6dceb0 SEV: disable SEV-ES DebugSwap by default
The DebugSwap feature of SEV-ES provides a way for confidential guests to use
data breakpoints.  However, because the status of the DebugSwap feature is
recorded in the VMSA, enabling it by default invalidates the attestation
signatures.  In 6.10 we will introduce a new API to create SEV VMs that
will allow enabling DebugSwap based on what the user tells KVM to do.
Contextually, we will change the legacy KVM_SEV_ES_INIT API to never
enable DebugSwap.

For compatibility with kernels that pre-date the introduction of DebugSwap,
as well as with those where KVM_SEV_ES_INIT will never enable it, do not enable
the feature by default.  If anybody wants to use it, for now they can enable
the sev_es_debug_swap_enabled module parameter, but this will result in a
warning.

Fixes: d1f85fbe83 ("KVM: SEV: Enable data breakpoints in SEV-ES")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-09 11:42:25 -05:00
Paolo Bonzini
39fee313fd Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:

 - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
   avoid creating ABI that KVM can't sanely support.

 - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
   clear that such VMs are purely a development and testing vehicle, and
   come with zero guarantees.

 - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
   is to support confidential VMs with deterministic private memory (SNP
   and TDX) only in the TDP MMU.

 - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
   when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09 11:42:17 -05:00
Paolo Bonzini
1b6c146df5 Merge tag 'kvm-x86-fixes-6.8-2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.8, round 2:

 - When emulating an atomic access, mark the gfn as dirty in the memslot
   to fix a bug where KVM could fail to mark the slot as dirty during live
   migration, ultimately resulting in guest data corruption due to a dirty
   page not being re-copied from the source to the target.

 - Check for mmu_notifier invalidation events before faulting in the pfn,
   and before acquiring mmu_lock, to avoid unnecessary work and lock
   contention.  Contending mmu_lock is especially problematic on preemptible
   kernels, as KVM may yield mmu_lock in response to the contention, which
   severely degrades overall performance due to vCPUs making it difficult
   for the task that triggered invalidation to make forward progress.

   Note, due to another kernel bug, this fix isn't limited to preemtible
   kernels, as any kernel built with CONFIG_PREEMPT_DYNAMIC=y will yield
   contended rwlocks and spinlocks.

   https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com
2024-03-09 11:42:06 -05:00
Peter Xu
e72c7c2b88 mm/treewide: drop pXd_large()
They're not used anymore, drop all of them.

Link: https://lkml.kernel.org/r/20240305043750.93762-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Peter Xu
0a845e0f63 mm/treewide: replace pud_large() with pud_leaf()
pud_large() is always defined as pud_leaf().  Merge their usages.  Chose
pud_leaf() because pud_leaf() is a global API, while pud_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Peter Xu
2f709f7bfd mm/treewide: replace pmd_large() with pmd_leaf()
pmd_large() is always defined as pmd_leaf().  Merge their usages.  Chose
pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Vitaly Kuznetsov
4736d85f0d KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
Commit ee3a5f9e3d ("KVM: x86: Do runtime CPUID update before updating
vcpu->arch.cpuid_entries") moved tweaking of the supplied CPUID
data earlier in kvm_set_cpuid() but __kvm_update_cpuid_runtime() actually
uses 'vcpu->arch.kvm_cpuid' (though __kvm_find_kvm_cpuid_features()) which
gets set later in kvm_set_cpuid(). In some cases, e.g. when kvm_set_cpuid()
is called for the first time and 'vcpu->arch.kvm_cpuid' is clear,
__kvm_find_kvm_cpuid_features() fails to find KVM PV feature entry and the
logic which clears KVM_FEATURE_PV_UNHALT after enabling
KVM_X86_DISABLE_EXITS_HLT does not work.

The logic, introduced by the commit ee3a5f9e3d ("KVM: x86: Do runtime
CPUID update before updating vcpu->arch.cpuid_entries") must stay: the
supplied CPUID data is tweaked by KVM first (__kvm_update_cpuid_runtime())
and checked later (kvm_check_cpuid()) and the actual data
(vcpu->arch.cpuid_*, vcpu->arch.kvm_cpuid, vcpu->arch.xen.cpuid,..) is only
updated on success.

Switch to searching for KVM_SIGNATURE in the supplied CPUID data to
discover KVM PV feature entry instead of using stale 'vcpu->arch.kvm_cpuid'.

While on it, drop pointless "&& (best->eax & (1 << KVM_FEATURE_PV_UNHALT)"
check when clearing KVM_FEATURE_PV_UNHALT bit.

Fixes: ee3a5f9e3d ("KVM: x86: Do runtime CPUID update before updating vcpu->arch.cpuid_entries")
Reported-and-tested-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240228101837.93642-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-06 09:50:15 -08:00
Vitaly Kuznetsov
92e82cf632 KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
Similar to kvm_find_kvm_cpuid_features()/__kvm_find_kvm_cpuid_features(),
introduce a helper to search for the specific hypervisor signature in any
struct kvm_cpuid_entry2 array, not only in vcpu->arch.cpuid_entries.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240228101837.93642-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-06 09:50:15 -08:00
David Woodhouse
7a36d68065 KVM: x86/xen: fix recursive deadlock in timer injection
The fast-path timer delivery introduced a recursive locking deadlock
when userspace configures a timer which has already expired and is
delivered immediately. The call to kvm_xen_inject_timer_irqs() can
call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock,
which is already held in kvm_xen_vcpu_get_attr().

 ============================================
 WARNING: possible recursive locking detected
 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G           O
 --------------------------------------------
 xen_shinfo_test/250013 is trying to acquire lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]

 but task is already holding lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]

Now that the gfn_to_pfn_cache has its own self-sufficient locking, its
callers no longer need to ensure serialization, so just stop taking
kvm->arch.xen.xen_lock from kvm_xen_set_evtchn().

Fixes: 77c9b9dea4 ("KVM: x86/xen: Use fast path for Xen timer delivery")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04 16:22:39 -08:00
David Woodhouse
66e3cf729b KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast
version will always work for physical unicast", justifying its use of
kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails.

In fact that assumption isn't true if X2APIC isn't in use by the guest
and there is (8-bit x)APIC ID aliasing. A single "unicast" destination
APIC ID *may* then be delivered to multiple vCPUs. Remove the warning,
and in fact it might as well just call kvm_irq_delivery_to_apic().

Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: fde0451be8 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04 16:22:37 -08:00
David Woodhouse
8e62bf2bfa KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
Linux guests since commit b1c3497e60 ("x86/xen: Add support for
HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU
upcall vector when it's advertised in the Xen CPUID leaves.

This upcall is injected through the guest's local APIC as an MSI, unlike
the older system vector which was merely injected by the hypervisor any
time the CPU was able to receive an interrupt and the upcall_pending
flags is set in its vcpu_info.

Effectively, that makes the per-CPU upcall edge triggered instead of
level triggered, which results in the upcall being lost if the MSI is
delivered when the local APIC is *disabled*.

Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC
for a vCPU is software enabled (in fact, on any write to the SPIV
register which doesn't disable the APIC). Do the same in KVM since KVM
doesn't provide a way for userspace to intervene and trap accesses to
the SPIV register of a local APIC emulated by KVM.

Fixes: fde0451be8 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04 16:22:36 -08:00
David Woodhouse
451a707813 KVM: x86/xen: improve accuracy of Xen timers
A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 5363952605 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
[sean: massage moved comment, tweak if statement formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04 16:22:32 -08:00
Thomas Gleixner
65efc4dc12 x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
Sparse complains rightfully about the missing declaration which has been
placed sloppily into the usage site:

  bugs.c:2223:6: sparse: warning: symbol 'itlb_multihit_kvm_mitigation' was not declared. Should it be static?

Add it to <asm/spec-ctrl.h> where it belongs and remove the one in the KVM code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.787173239@linutronix.de
2024-03-04 12:09:13 +01:00
Sean Christopherson
259720c37d KVM: VMX: Combine "check" and "get" APIs for passthrough MSR lookups
Combine possible_passthrough_msr_slot() and is_valid_passthrough_msr()
into a single function, vmx_get_passthrough_msr_slot(), and have the
combined helper return the slot on success, using a negative value to
indicate "failure".

Combining the operations avoids iterating over the array of passthrough
MSRs twice for relevant MSRs.

Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20240223202104.3330974-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 12:29:46 -08:00
Andrei Vagin
a364c014a2 kvm/x86: allocate the write-tracking metadata on-demand
The write-track is used externally only by the gpu/drm/i915 driver.
Currently, it is always enabled, if a kernel has been compiled with this
driver.

Enabling the write-track mechanism adds a two-byte overhead per page across
all memory slots. It isn't significant for regular VMs. However in gVisor,
where the entire process virtual address space is mapped into the VM, even
with a 39-bit address space, the overhead amounts to 256MB.

Rework the write-tracking mechanism to enable it on-demand in
kvm_page_track_register_notifier.

Here is Sean's comment about the locking scheme:

The only potential hiccup would be if taking slots_arch_lock would
deadlock, but it should be impossible for slots_arch_lock to be taken in
any other path that involves VFIO and/or KVMGT *and* can be coincident.
Except for kvm_arch_destroy_vm() (which deletes KVM's internal
memslots), slots_arch_lock is taken only through KVM ioctls(), and the
caller of kvm_page_track_register_notifier() *must* hold a reference to
the VM.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://lore.kernel.org/r/20240213192340.2023366-1-avagin@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 11:49:54 -08:00
Dongli Zhang
bab22040d7 KVM: VMX: return early if msr_bitmap is not supported
The vmx_msr_filter_changed() may directly/indirectly calls only
vmx_enable_intercept_for_msr() or vmx_disable_intercept_for_msr(). Those
two functions may exit immediately if !cpu_has_vmx_msr_bitmap().

vmx_msr_filter_changed()
-> vmx_disable_intercept_for_msr()
-> pt_update_intercept_for_msr()
   -> vmx_set_intercept_for_msr()
      -> vmx_enable_intercept_for_msr()
      -> vmx_disable_intercept_for_msr()

Therefore, we exit early if !cpu_has_vmx_msr_bitmap().

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20240223202104.3330974-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 09:36:48 -08:00
Dongli Zhang
8e24eeedfd KVM: VMX: fix comment to add LBR to passthrough MSRs
According to the is_valid_passthrough_msr(), the LBR MSRs are also
passthrough MSRs, since the commit 1b5ac3226a ("KVM: vmx/pmu:
Pass-through LBR msrs when the guest LBR event is ACTIVE").

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20240223202104.3330974-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 09:36:48 -08:00
Like Xu
812d432373 KVM: x86/pmu: Explicitly check NMI from guest to reducee false positives
Explicitly check that the source of external interrupt is indeed an NMI
in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples
(host samples labelled as guest samples) generated by perf/core NMI mode
if an NMI arrives after VM-Exit, but before kvm_after_interrupt():

 # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
 # Symbol                                   Overhead       sys       usr  guest sys  guest usr
 # .......................................  ........  ........  ........  .........  .........
 #
 # Before:
   [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
   [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
   [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
 # After:
   [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
   [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
   [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%

In the above case, perf records the samples labelled '[g]', the RIPs behind
the weird samples are actually being queried by perf_instruction_pointer()
after determining whether it's in GUEST state or not, and here's the issue:

If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().

During this window, if a PMI occurs on host (since the KVM instructions on
host are being executed), the control flow, with the help of the host NMI
context, will be transferred to perf/core to generate performance samples,
thus perf_instruction_pointer() and perf_guest_get_ip() is called.

Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
cause perf/core to mistakenly assume that the source RIP of the host NMI
belongs to the guest world and use perf_guest_get_ip() to get the RIP of
a vCPU that has already exited by a non-NMI interrupt.

Error samples are recorded and presented to the end-user via perf-report.
Such false positive samples could be eliminated by explicitly determining
if the exit reason is KVM_HANDLING_NMI.

Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
is cleared, it's also still possible that another PMI is generated on host.
Also for perf/core timer mode, the false positives are still possible since
those non-NMI sources of interrupts are not always being used by perf/core.

For events that are host-only, perf/core can and should eliminate false
positives by checking event->attr.exclude_guest, i.e. events that are
configured to exclude KVM guests should never fire in the guest.

Events that are configured to count host and guest are trickier, perhaps
impossible to handle with 100% accuracy?  And regardless of what accuracy
is provided by perf/core, improving KVM's accuracy is cheap and easy, with
no real downsides.

Fixes: dd60d21706 ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com
[sean: massage changelog, squash !!in_nmi() fixup from Like]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-26 15:57:22 -08:00
Oliver Upton
284851ee5c KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
The general expectation with debugfs is that any initialization failure
is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
implementations to return an error and kvm_create_vm_debugfs() allows
that to fail VM creation.

Change to a void return to discourage architectures from making debugfs
failures fatal for the VM. Seems like everyone already had the right
idea, as all implementations already return 0 unconditionally.

Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-23 21:44:58 +00:00
Sean Christopherson
d02c357e5b KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Retry page faults without acquiring mmu_lock, and without even faulting
the page into the primary MMU, if the resolved gfn is covered by an active
invalidation.  Contending for mmu_lock is especially problematic on
preemptible kernels as the mmu_notifier invalidation task will yield
mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault.  And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.

Faulting the page into the primary MMU is similarly problematic, as doing
so may acquire locks that need to be taken for the invalidation to
complete (the primary MMU has finer grained locks than KVM's MMU), and/or
may cause unnecessary churn (getting/putting pages, marking them accessed,
etc).

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Add a dedicated lockess version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock.  This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Reported-by: Friedrich Weber <f.weber@proxmox.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23 10:14:34 -08:00
Masahiro Yamada
bf48d9b756 kbuild: change tool coverage variables to take the path relative to $(obj)
Commit 54b8ae66ae ("kbuild: change *FLAGS_<basetarget>.o to take the
path relative to $(obj)") changed the syntax of per-file compiler flags.

The situation is the same for the following variables:

  OBJECT_FILES_NON_STANDARD_<basetarget>.o
  GCOV_PROFILE_<basetarget>.o
  KASAN_SANITIZE_<basetarget>.o
  KMSAN_SANITIZE_<basetarget>.o
  KMSAN_ENABLE_CHECKS_<basetarget>.o
  UBSAN_SANITIZE_<basetarget>.o
  KCOV_INSTRUMENT_<basetarget>.o
  KCSAN_SANITIZE_<basetarget>.o
  KCSAN_INSTRUMENT_BARRIERS_<basetarget>.o

The <basetarget> is the filename of the target with its directory and
suffix stripped.

This syntax comes into a trouble when two files with the same basename
appear in one Makefile, for example:

  obj-y += dir1/foo.o
  obj-y += dir2/foo.o
  OBJECT_FILES_NON_STANDARD_foo.o := y

OBJECT_FILES_NON_STANDARD_foo.o is applied to both dir1/foo.o and
dir2/foo.o. This syntax is not flexbile enough to handle cases where
one of them is a standard object, but the other is not.

It is more sensible to use the relative path to the Makefile, like this:

  obj-y += dir1/foo.o
  OBJECT_FILES_NON_STANDARD_dir1/foo.o := y
  obj-y += dir2/foo.o
  OBJECT_FILES_NON_STANDARD_dir2/foo.o := y

To maintain the current behavior, I made adjustments to the following two
Makefiles:

 - arch/x86/entry/vdso/Makefile, which compiles vclock_gettime.o, vgetcpu.o,
   and their vdso32 variants.

 - arch/x86/kvm/Makefile, which compiles vmx/vmenter.o and svm/vmenter.o

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Acked-by: Sean Christopherson <seanjc@google.com>
2024-02-23 21:06:21 +09:00
Sean Christopherson
5ef1d8c1dd KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
Do the cache flush of converted pages in svm_register_enc_region() before
dropping kvm->lock to fix use-after-free issues where region and/or its
array of pages could be freed by a different task, e.g. if userspace has
__unregister_enc_region_locked() already queued up for the region.

Note, the "obvious" alternative of using local variables doesn't fully
resolve the bug, as region->pages is also dynamically allocated.  I.e. the
region structure itself would be fine, but region->pages could be freed.

Flushing multiple pages under kvm->lock is unfortunate, but the entire
flow is a rare slow path, and the manual flush is only needed on CPUs that
lack coherency for encrypted memory.

Fixes: 19a23da539 ("Fix unsynchronized access to sev members through svm_register_enc_region")
Reported-by: Gabe Kirkpatrick <gkirkpatrick@google.com>
Cc: Josh Eads <josheads@google.com>
Cc: Peter Gonda <pgonda@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20240217013430.2079561-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-23 03:55:59 -05:00
Sean Christopherson
a1176ef5c9 KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
Advertise and support software-protected VMs if and only if the TDP MMU is
enabled, i.e. disallow KVM_SW_PROTECTED_VM if TDP is enabled for KVM's
legacy/shadow MMU.  TDP support for the shadow MMU is maintenance-only,
e.g. support for TDX and SNP will also be restricted to the TDP MMU.

Fixes: 89ea60c2c7 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Link: https://lore.kernel.org/r/20240222190612.2942589-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 17:07:06 -08:00
Sean Christopherson
422692098c KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
Rewrite the help message for KVM_SW_PROTECTED_VM to make it clear that
software-protected VMs are a development and testing vehicle for
guest_memfd(), and that attempting to use KVM_SW_PROTECTED_VM for anything
remotely resembling a "real" VM will fail.  E.g. any memory accesses from
KVM will incorrectly access shared memory, nested TDP is wildly broken,
and so on and so forth.

Update KVM's API documentation with similar warnings to discourage anyone
from attempting to run anything but selftests with KVM_X86_SW_PROTECTED_VM.

Fixes: 89ea60c2c7 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Link: https://lore.kernel.org/r/20240222190612.2942589-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 17:07:06 -08:00
Sean Christopherson
576a15de8d KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read
Free TDP MMU roots from vCPU context while holding mmu_lock for read, it
is completely legal to invoke kvm_tdp_mmu_put_root() as a reader.  This
eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path
after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated
roots, free obsolete roots, and allocate new roots in parallel.

On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap"
operation to run in parallel with freeing and allocating roots reduces the
worst case latency for a vCPU to reload a root from 2-3ms to <100us.

Link: https://lore.kernel.org/r/20240111020048.844847-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
dab285e4ec KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
Allocate TDP MMU roots while holding mmu_lock for read, and instead use
tdp_mmu_pages_lock to guard against duplicate roots.  This allows KVM to
create new roots without forcing kvm_tdp_mmu_zap_invalidated_roots() to
yield, e.g. allows vCPUs to load new roots after memslot deletion without
forcing the zap thread to detect contention and yield (or complete if the
kernel isn't preemptible).

Note, creating a new TDP MMU root as an mmu_lock reader is safe for two
reasons: (1) paths that must guarantee all roots/SPTEs are *visited* take
mmu_lock for write and so are still mutually exclusive, e.g. mmu_notifier
invalidations, and (2) paths that require all roots/SPTEs to *observe*
some given state without holding mmu_lock for write must ensure freshness
through some other means, e.g. toggling dirty logging must first wait for
SRCU readers to recognize the memslot flags change before processing
existing roots/SPTEs.

Link: https://lore.kernel.org/r/20240111020048.844847-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
f5238c2a60 KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
When allocating a new TDP MMU root, check for a usable root while holding
mmu_lock for read and only acquire mmu_lock for write if a new root needs
to be created.  There is no need to serialize other MMU operations if a
vCPU is simply grabbing a reference to an existing root, holding mmu_lock
for write is "necessary" (spoiler alert, it's not strictly necessary) only
to ensure KVM doesn't end up with duplicate roots.

Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
to setups that frequently delete memslots, i.e. which force all vCPUs to
reload all roots.

Link: https://lore.kernel.org/r/20240111020048.844847-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
d746182337 KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
When write-protecting SPTEs, don't process invalid roots as invalid roots
are unreachable, i.e. can't be used to access guest memory and thus don't
need to be write-protected.

Note, this is *almost* a nop for kvm_tdp_mmu_clear_dirty_pt_masked(),
which is called under slots_lock, i.e. is mutually exclusive with
kvm_mmu_zap_all_fast().  But it's possible for something other than the
"fast zap" thread to grab a reference to an invalid root and thus keep a
root alive (but completely empty) after kvm_mmu_zap_all_fast() completes.

The kvm_tdp_mmu_write_protect_gfn() case is more interesting as KVM write-
protects SPTEs for reasons other than dirty logging, e.g. if a KVM creates
a SPTE for a nested VM while a fast zap is in-progress.

Add another TDP MMU iterator to visit only valid roots, and
opportunistically convert kvm_tdp_mmu_get_vcpu_root_hpa() to said iterator.

Link: https://lore.kernel.org/r/20240111020048.844847-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
99b85fda91 KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
When zapping a GFN in response to an APICv or MTRR change, don't zap SPTEs
for invalid roots as KVM only needs to ensure the guest can't use stale
mappings for the GFN.  Unlike kvm_tdp_mmu_unmap_gfn_range(), which must
zap "unreachable" SPTEs to ensure KVM doesn't mark a page accessed/dirty,
kvm_tdp_mmu_zap_leafs() isn't used (and isn't intended to be used) to
handle freeing of host memory.

Link: https://lore.kernel.org/r/20240111020048.844847-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
6577f1efdf KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
Modify for_each_tdp_mmu_root() and __for_each_tdp_mmu_root_yield_safe() to
accept -1 for _as_id to mean "process all memslot address spaces".  That
way code that wants to process both SMM and !SMM doesn't need to iterate
over roots twice (and likely copy+paste code in the process).

Deliberately don't cast _as_id to an "int", just in case not casting helps
the compiler elide the "_as_id >=0" check when being passed an unsigned
value, e.g. from a memslot.

No functional change intended.

Link: https://lore.kernel.org/r/20240111020048.844847-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
fcdffe97f8 KVM: x86/mmu: Don't do TLB flush when zappings SPTEs in invalid roots
Don't force a TLB flush when zapping SPTEs in invalid roots as vCPUs
can't be actively using invalid roots (zapping SPTEs in invalid roots is
necessary only to ensure KVM doesn't mark a page accessed/dirty after it
is freed by the primary MMU).

Link: https://lore.kernel.org/r/20240111020048.844847-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
8ca983631f KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
Zap invalidated TDP MMU roots at maximum granularity, i.e. with more
frequent conditional resched checkpoints, in order to avoid running for an
extended duration (milliseconds, or worse) without honoring a reschedule
request.  And for kernels running with full or real-time preempt models,
zapping at 4KiB granularity also provides significantly reduced latency
for other tasks that are contending for mmu_lock (which isn't necessarily
an overall win for KVM, but KVM should do its best to honor the kernel's
preemption model).

To keep KVM's assertion that zapping at 1GiB granularity is functionally
ok, which is the main reason 1GiB was selected in the past, skip straight
to zapping at 1GiB if KVM is configured to prove the MMU.  Zapping roots
is far more common than a vCPU replacing a 1GiB page table with a hugepage,
e.g. generally happens multiple times during boot, and so keeping the test
coverage provided by root zaps is desirable, just not for production.

Cc: David Matlack <dmatlack@google.com>
Cc: Pattara Teerapong <pteerapong@google.com>
Link: https://lore.kernel.org/r/20240111020048.844847-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:28:45 -08:00
Sean Christopherson
322d79f1db KVM: x86: Clean up directed yield API for "has pending interrupt"
Directly return the boolean result of whether or not a vCPU has a pending
interrupt instead of effectively doing:

  if (true)
	return true;

  return false;

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:27:40 -08:00
Sean Christopherson
9b8615c5d3 KVM: x86: Rely solely on preempted_in_kernel flag for directed yield
Snapshot preempted_in_kernel using kvm_arch_vcpu_in_kernel() so that the
flag is "accurate" (or rather, consistent and deterministic within KVM)
for guests with protected state, and explicitly use preempted_in_kernel
when checking if a vCPU was preempted in kernel mode instead of bouncing
through kvm_arch_vcpu_in_kernel().

Drop the gnarly logic in kvm_arch_vcpu_in_kernel() that redirects to
preempted_in_kernel if the target vCPU is not the "running", i.e. loaded,
vCPU, as the only reason that code existed was for the directed yield case
where KVM wants to check the CPL of a vCPU that may or may not be loaded
on the current pCPU.

Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:27:03 -08:00
Sean Christopherson
77bcd9e623 KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Plumb in a dedicated hook for querying whether or not a vCPU was preempted
in-kernel.  Unlike literally every other architecture, x86's VMX can check
if a vCPU is in kernel context if and only if the vCPU is loaded on the
current pCPU.

x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying
kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel
as needed.  But that's unnecessary, confusing, and fragile, e.g. x86 has
had at least one bug where KVM incorrectly used a stale
preempted_in_kernel.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:26:26 -08:00
Sean Christopherson
fc3c94142b KVM: x86: Sanity check that kvm_has_noapic_vcpu is zero at module_exit()
WARN if kvm.ko is unloaded with an elevated kvm_has_noapic_vcpu to guard
against incorrect management of the key, e.g. to detect if KVM fails to
decrement the key in error paths.  Because kvm_has_noapic_vcpu is purely
an optimization, in all likelihood KVM could completely botch handling of
kvm_has_noapic_vcpu and no one would notice (which is a good argument for
deleting the key entirely, but that's a problem for another day).

Note, ideally the sanity check would be performance when kvm_usage_count
goes to zero, but adding an arch callback just for this sanity check isn't
at all worth doing.

Link: https://lore.kernel.org/r/20240209222047.394389-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:24:26 -08:00
Sean Christopherson
a78d904669 KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC code
Move incrementing and decrementing of kvm_has_noapic_vcpu into
kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug
where KVM fails to decrement the count if vCPU creation ultimately fails,
e.g. due to a memory allocation failing.

Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize
lapic_in_kernel() checks, and that optimization is quite dubious.  That,
and practically speaking no setup that cares at all about performance runs
with a userspace local APIC.

Reported-by: Li RongQing <lirongqing@baidu.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:24:09 -08:00
Sean Christopherson
0ec3d6d1f1 KVM: x86: Fully defer to vendor code to decide how to force immediate exit
Now that vmx->req_immediate_exit is used only in the scope of
vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp
the VMX preemption to force a VM-Exit and let vendor code fully handle
forcing a VM-Exit.

Opportunsitically drop __kvm_request_immediate_exit() and just have
vendor code call smp_send_reschedule() directly.  SVM already does this
when injecting an event while also trying to single-step an IRET, i.e.
it's not exactly secret knowledge that KVM uses a reschedule IPI to force
an exit.

Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:41 -08:00
Sean Christopherson
7b3d1bbf8d KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2
Eat VMX treemption timer exits in the fastpath regardless of whether L1 or
L2 is active.  The VM-Exit is 100% KVM-induced, i.e. there is nothing
directly related to the exit that KVM needs to do on behalf of the guest,
thus there is no reason to wait until the slow path to do nothing.

Opportunistically add comments explaining why preemption timer exits for
emulating the guest's APIC timer need to go down the slow path.

Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:37 -08:00
Sean Christopherson
bf1a49436e KVM: x86: Move handling of is_guest_mode() into fastpath exit handlers
Let the fastpath code decide which exits can/can't be handled in the
fastpath when L2 is active, e.g. when KVM generates a VMX preemption
timer exit to forcefully regain control, there is no "work" to be done and
so such exits can be handled in the fastpath regardless of whether L1 or
L2 is active.

Moving the is_guest_mode() check into the fastpath code also makes it
easier to see that L2 isn't allowed to use the fastpath in most cases,
e.g. it's not immediately obvious why handle_fastpath_preemption_timer()
is called from the fastpath and the normal path.

Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
11776aa0cf KVM: VMX: Handle forced exit due to preemption timer in fastpath
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the
exit fastpath, i.e. avoid calling back into handle_preemption_timer() for
the same exit.  There is no work to be done for forced exits, as the name
suggests the goal is purely to get control back in KVM.

In addition to shaving a few cycles, this will allow cleanly separating
handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g.
it's not immediately obvious why _apparently_ calling
handle_fastpath_preemption_timer() twice on a "slow" exit is necessary:
the "slow" call is necessary to handle exits from L2, which are excluded
from the fastpath by vmx_vcpu_run().

Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
e6b5d16bbd KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exits
Re-enter the guest in the fast path if VMX preeemption timer VM-Exit was
"spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by
some miracle the timer expired before any other VM-Exit occurred.  This is
just an intermediate step to cleaning up the preemption timer handling,
optimizing these types of spurious VM-Exits is not interesting as they are
extremely rare/infrequent.

Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
9c9025ea00 KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is
forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to
inject an event but needs to first complete some other operation.
Knowing that KVM is (or isn't) forcing an exit is useful information when
debugging issues related to event injection.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
dfeef3d3f3 KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
Remove reexecute_instruction()'s final check on the MMU being direct, as
EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a
shadow MMU.  Prior to commit 93c05d3ef2 ("KVM: x86: improve
reexecute_instruction"), the flag simply didn't exist (and KVM actually
returned "true" unconditionally for both types of MMUs).  I.e. the
explicit check for a direct MMU is simply leftover artifact from old code.

Link: https://lore.kernel.org/r/20240203002343.383056-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Sean Christopherson
515c18a64e KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop
the dedicated path entirely and always query indirect_shadow_pages when
deciding whether or not to try unprotecting the gfn.  For indirect, a.k.a.
shadow MMUs, checking indirect_shadow_pages is harmless; unless *every*
shadow page was somehow zapped while KVM was attempting to emulate the
instruction, indirect_shadow_pages is guaranteed to be non-zero.

Well, unless the instruction used a direct hugepage with 2-level paging
for its code page, but in that case, there's obviously nothing to
unprotect.  And in the extremely unlikely case all shadow pages were
zapped, there's again obviously nothing to unprotect.

Link: https://lore.kernel.org/r/20240203002343.383056-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Mingwei Zhang
474b99ed70 KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
Drop KVM's completely pointless acquisition of mmu_lock when deciding
whether or not to unprotect any shadow pages residing at the gfn before
resuming the guest to let it retry an instruction that KVM failed to
emulated.  In this case, indirect_shadow_pages is used as a coarse-grained
heuristic to check if there is any chance of there being a relevant shadow
page to unprotected.  But acquiring mmu_lock largely defeats any benefit
to the heuristic, as taking mmu_lock for write is likely far more costly
to the VM as a whole than unnecessarily walking mmu_page_hash.

Furthermore, the current code is already prone to false negatives and
false positives, as it drops mmu_lock before checking the flag and
unprotecting shadow pages.  And as evidenced by the lack of bug reports,
neither false positives nor false negatives are problematic.  A false
positive simply means that KVM will try to unprotect shadow pages that
have already been zapped.  And a false negative means that KVM will
resume the guest without unprotecting the gfn, i.e. if a shadow page was
_just_ created, the vCPU will hit the same page fault and do the whole
dance all over again, and detect and unprotect the shadow page the second
time around (or not, if something else zaps it first).

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: drop READ_ONCE() and comment change, rewrite changelog]
Link: https://lore.kernel.org/r/20240203002343.383056-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Sean Christopherson
2a5f091ce1 KVM: x86: Open code all direct reads to guest DR6 and DR7
Bite the bullet, and open code all direct reads of DR6 and DR7.  KVM
currently has a mix of open coded accesses and calls to kvm_get_dr(),
which is confusing and ugly because there's no rhyme or reason as to why
any particular chunk of code uses kvm_get_dr().

The obvious alternative is to force all accesses through kvm_get_dr(),
but it's not at all clear that doing so would be a net positive, e.g. even
if KVM ends up wanting/needing to force all reads through a common helper,
e.g. to play caching games, the cost of reverting this change is likely
lower than the ongoing cost of maintaining weird, arbitrary code.

No functional change intended.

Cc: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Sean Christopherson
fc5375dd8c KVM: x86: Make kvm_get_dr() return a value, not use an out parameter
Convert kvm_get_dr()'s output parameter to a return value, and clean up
most of the mess that was created by forcing callers to provide a pointer.

No functional change intended.

Acked-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Paul Durrant
003d914220 KVM: x86/xen: allow vcpu_info content to be 'safely' copied
If the guest sets an explicit vcpu_info GPA then, for any of the first 32
vCPUs, the content of the default vcpu_info in the shared_info page must be
copied into the new location. Because this copy may race with event
delivery (which updates the 'evtchn_pending_sel' field in vcpu_info),
event delivery needs to be deferred until the copy is complete.

Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen
that is used in atomic context if the vcpu_info PFN cache has been
invalidated so that the update of vcpu_info can be deferred until the
cache can be refreshed (on vCPU thread's the way back into guest context).

Use this shadow if the vcpu_info cache has been *deactivated*, so that
the VMM can safely copy the vcpu_info content and then re-activate the
cache with the new GPA. To do this, stop considering an inactive vcpu_info
cache as a hard error in kvm_xen_set_evtchn_fast(), and let the existing
kvm_gpc_check() fail and kick the vCPU (if necessary).

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-21-paul@xen.org
[sean: add a bit of verbosity to the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:21 -08:00
Paul Durrant
615451d8cb KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
Now that all relevant kernel changes and selftests are in place, enable the
new capability.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:19 -08:00
Paul Durrant
3991f35805 KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA
If the guest does not explicitly set the GPA of vcpu_info structure in
memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded
in the shared_info page may be used. As described in a previous commit,
the shared_info page is an overlay at a fixed HVA within the VMM, so in
this case it also more optimal to activate the vcpu_info cache with a
fixed HVA to avoid unnecessary invalidation if the guest memory layout
is modified.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org
[sean: use kvm_gpc_is_{gpa,hva}_active()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:17 -08:00
Paul Durrant
b9220d3279 KVM: x86/xen: allow shared_info to be mapped by fixed HVA
The shared_info page is not guest memory as such. It is a dedicated page
allocated by the VMM and overlaid onto guest memory in a GFN chosen by the
guest and specified in the XENMEM_add_to_physmap hypercall. The guest may
even request that shared_info be moved from one GFN to another by
re-issuing that hypercall, but the HVA is never going to change.

Because the shared_info page is an overlay the memory slots need to be
updated in response to the hypercall. However, memory slot adjustment is
not atomic and, whilst all vCPUs are paused, there is still the possibility
that events may be delivered (which requires the shared_info page to be
updated) whilst the shared_info GPA is absent. The HVA is never absent
though, so it makes much more sense to use that as the basis for the
kernel's mapping.

Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this
purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its
availability. Don't actually advertize it yet though. That will be done in
a subsequent patch, which will also add tests for the new attribute type.

Also update the KVM API documentation with the new attribute and also fix
it up to consistently refer to 'shared_info' (with the underscore).

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org
[sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:00:52 -08:00
Paul Durrant
18b99e4d6d KVM: x86/xen: re-initialize shared_info if guest (32/64-bit) mode is set
If the shared_info PFN cache has already been initialized then the content
of the shared_info page needs to be re-initialized whenever the guest
mode is (re)set.
Setting the guest mode is either done explicitly by the VMM via the
KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes
the MSR to set up the hypercall page.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-12-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:48 -08:00
Paul Durrant
c01c55a34f KVM: x86/xen: separate initialization of shared_info cache and content
A subsequent patch will allow shared_info to be initialized using either a
GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate
the initialization of the shared_info content from the activation of the
pfncache.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-11-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:47 -08:00
Paul Durrant
a4bff3df51 KVM: pfncache: remove KVM_GUEST_USES_PFN usage
As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any
callers of kvm_gpc_init(), and for good reason: the implementation is
incomplete/broken.  And it's not clear that there will ever be a user of
KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is
non-trivial.

Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping
KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid
having to reason about broken code as __kvm_gpc_refresh() evolves.

Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage
check in hva_to_pfn_retry() and hence the 'usage' argument to
kvm_gpc_init() are also redundant.

[1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org
[sean: explicitly call out that guest usage is incomplete]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:43 -08:00
Paul Durrant
78b74638eb KVM: pfncache: add a mark-dirty helper
At the moment pages are marked dirty by open-coded calls to
mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot
from the cache. After a subsequent patch these may not always be set
so add a helper now so that caller will protected from the need to know
about this detail.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org
[sean: decrease indentation, use gpa_to_gfn()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:42 -08:00
Paul Durrant
4438355ec6 KVM: x86/xen: mark guest pages dirty with the pfncache lock held
Sampling gpa and memslot from an unlocked pfncache may yield inconsistent
values so, since there is no problem with calling mark_page_dirty_in_slot()
with the pfncache lock held, relocate the calls in
kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events()
accordingly.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-4-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:41 -08:00
Masahiro Yamada
cd14b01846 treewide: replace or remove redundant def_bool in Kconfig files
'def_bool X' is a shorthand for 'bool' plus 'default X'.

'def_bool' is redundant where 'bool' is already present, so 'def_bool X'
can be replaced with 'default X', or removed if X is 'n'.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-20 20:47:45 +09:00
Pawan Gupta
43fb862de8 KVM/VMX: Move VERW closer to VMentry for MDS mitigation
During VMentry VERW is executed to mitigate MDS. After VERW, any memory
access like register push onto stack may put host data in MDS affected
CPU buffers. A guest can then use MDS to sample host data.

Although likelihood of secrets surviving in registers at current VERW
callsite is less, but it can't be ruled out. Harden the MDS mitigation
by moving the VERW mitigation late in VMentry path.

Note that VERW for MMIO Stale Data mitigation is unchanged because of
the complexity of per-guest conditional VERW which is not easy to handle
that late in asm with no GPRs available. If the CPU is also affected by
MDS, VERW is unconditionally executed late in asm regardless of guest
having MMIO access.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:59 -08:00
Sean Christopherson
706a189dcf KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
VMLAUNCH.  Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
for MDS mitigations as late as possible without needing to duplicate VERW
for both paths.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-5-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:54 -08:00
Pawan Gupta
6613d82e61 x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
The VERW mitigation at exit-to-user is enabled via a static branch
mds_user_clear. This static branch is never toggled after boot, and can
be safely replaced with an ALTERNATIVE() which is convenient to use in
asm.

Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user
path. Also remove the now redundant VERW in exc_nmi() and
arch_exit_to_user_mode().

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
2024-02-19 16:31:49 -08:00
Sean Christopherson
910c57dfa4 KVM: x86: Mark target gfn of emulated atomic instruction as dirty
When emulating an atomic access on behalf of the guest, mark the target
gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault.  This
fixes a bug where KVM effectively corrupts guest memory during live
migration by writing to guest memory without informing userspace that the
page is dirty.

Marking the page dirty got unintentionally dropped when KVM's emulated
CMPXCHG was converted to do a user access.  Before that, KVM explicitly
mapped the guest page into kernel memory, and marked the page dirty during
the unmap phase.

Mark the page dirty even if the CMPXCHG fails, as the old data is written
back on failure, i.e. the page is still written.  The value written is
guaranteed to be the same because the operation is atomic, but KVM's ABI
is that all writes are dirty logged regardless of the value written.  And
more importantly, that's what KVM did before the buggy commit.

Huge kudos to the folks on the Cc list (and many others), who did all the
actual work of triaging and debugging.

Fixes: 1c2361f667 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Cc: Pasha Tatashin <tatashin@google.com>
Cc: Michael Krebs <mkrebs@google.com>
base-commit: 6769ea8da8a93ed4630f1ce64df6aafcaabfce64
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240215010004.1456078-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-16 16:56:01 -08:00
Paolo Bonzini
2f8ebe43a0 KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:
- Remove redundant newlines from error messages.
 
  - Delete an unused variable in the AMX test (which causes build failures when
    compiling with -Werror).
 
  - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
    error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
    and the test eventually got skipped).
 
  - Fix TSC related bugs in several Hyper-V selftests.
 
  - Fix a bug in the dirty ring logging test where a sem_post() could be left
    pending across multiple runs, resulting in incorrect synchronization between
    the main thread and the vCPU worker thread.
 
  - Relax the dirty log split test's assertions on 4KiB mappings to fix false
    positives due to the number of mappings for memslot 0 (used for code and
    data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.
 
  - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
    function generates boolean values, and all callers treat the return value as
    a bool).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKupQSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5DiQP/RNSgLrE9+/3oyqo9zpbhio2dKqz4dIk
 8Ga1ZE4R89dyMB9jGKtWn3rEkyma3TsB+neVpG9ohHV6j25JJ0vNAkxQu3Gt+gkl
 uM1lh/IfXPnAKyuy6dW9tpgZYE1v2/KfdWjeEzzxfPjzY/LX3yFiiCKEnUmfjjzZ
 sSz91nV4KYS4b4xLWTIcBgNJuyLJuL05htTLmCu7t8DKOBHwHxXjSn8qqG8OvAjs
 FOhf0zgGJKBFdKOw2Y8XeDdKO0RTEyEPHaFILcLEsuhoVIbY5OUmLe32pAFzzMbG
 hPawUZ5CzC++e339gUgGkRNY80iSnGcYVcZa+ohxOsNBdOWko9z/eGWZUV7qkYDK
 dkPHMoDnSzUCE2eSYbEB1eR/KOfziJCWMS9SAIJbJxIGb1HYajikwAEZ6FNp3R+u
 MyCuNlV9TfsGgt4Dx8RctMeH2ROpORRu7h3WPFUBgG2/jOzPk/OR6U8hSzvmhTvL
 MykZ8IaLmUIYoK/nCY2iwy50lQRxtZ/htqWn3sidCBGY0DXdNlMhvd3Vk9jtUvY5
 Fgof0b564eYfk/qO3cMIDd2WFaDejP28JVSn0CNm6z9i54ubCKkSBEb4kTYXXnVK
 YBHvbZ21Vjg52trudvK5UPt599sxxNBNiSV32ckLFKHS4ZVGSFSBSbsAWiQF157i
 CbYntmtJhM+D
 =infW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:

 - Remove redundant newlines from error messages.

 - Delete an unused variable in the AMX test (which causes build failures when
   compiling with -Werror).

 - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
   error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
   and the test eventually got skipped).

 - Fix TSC related bugs in several Hyper-V selftests.

 - Fix a bug in the dirty ring logging test where a sem_post() could be left
   pending across multiple runs, resulting in incorrect synchronization between
   the main thread and the vCPU worker thread.

 - Relax the dirty log split test's assertions on 4KiB mappings to fix false
   positives due to the number of mappings for memslot 0 (used for code and
   data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.

 - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
   function generates boolean values, and all callers treat the return value as
   a bool).
2024-02-14 12:34:58 -05:00
Paolo Bonzini
22d0bc0721 KVM x86 fixes for 6.8:
- Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
    if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
    the UNITIALIZED state, the spurious request will result in KVM exiting to
    userspace, which in turn causes QEMU to constantly acquire and release
    QEMU's global mutex, to the point where the BSP is unable to make forward
    progress.
 
  - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
    incorrectly truncated, and ultimately causes KVM to think a fixed counter
    has already been disabled (KVM thinks the old value is '0').
 
  - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
    that is ultimately ignored due to ignore_msrs=true doesn't zero the output
    as intended.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKt90SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5e5wP/jU3Zuul2e7fb4E6RN/GPhAFSTzG7Cwe
 4lVSSSPmOQsEXTKwCOMj7fgwF9qVSLzLRi62MKziTJY/1FDsTcI3xlM7nM2wwQC2
 26evIzI3qB54rHQdviuh1jwh6scZH7xLw7kANE+8x4skkm6AZB1IUnj3utR3fEPj
 mIUA5kGQxEAEDrn0TFzrRgIw4JngKjrCwmpT+vbmR37flC+Rwv8jr4JY1E3cBAT3
 KEilv3Fg07gbvagWGZNSSUNqQos5MsnLifdryKbA/vuIJf+j/01CMo5KtLKshiaX
 t4gXPldVZDXdxjH6im0wRAX4s/FpZg3vVje2OxPbzwMVb5+XvLewzjzagQ1lFA3I
 gsNXF8uGdYn0fb8T/wQG4ulWBw6A844PSmGONCwLDA+GZuL9xjMIK5d1litvb/im
 bEP1Ahv6UcnDNKHqRzuFXQENiS2uQdJNLs7p291oDNkTm/CGjDUgFXPuaCehWrUf
 ZZf1dxmIPM/Xt2j19mS/HnTHD114A8t1GTx799kBXbG4x0ScVQclkhRk6yFG3ObA
 14uXxxAdEBoZGBJ2yr5FbddvRLswbWugFoxKbtCZ/CHMopOUQcRRmRb7Lm1NHLtg
 Ae/sHO6gQ1xcrbwpMCq+6RjFK57yW+n1TB8ZTmAE2RQynGqzReSTlUNtfn3yMg4v
 hz+2zGzezoeN
 =92ae
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.8:

 - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
   if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
   the UNITIALIZED state, the spurious request will result in KVM exiting to
   userspace, which in turn causes QEMU to constantly acquire and release
   QEMU's global mutex, to the point where the BSP is unable to make forward
   progress.

 - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
   incorrectly truncated, and ultimately causes KVM to think a fixed counter
   has already been disabled (KVM thinks the old value is '0').

 - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
   that is ultimately ignored due to ignore_msrs=true doesn't zero the output
   as intended.
2024-02-14 12:34:43 -05:00
Ingo Molnar
4589f199eb Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before
applying more patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-14 10:49:37 +01:00
Linus Torvalds
4356e9f841 work around gcc bugs with 'asm goto' with outputs
We've had issues with gcc and 'asm goto' before, and we created a
'asm_volatile_goto()' macro for that in the past: see commits
3f0116c323 ("compiler/gcc4: Add quirk for 'asm goto' miscompilation
bug") and a9f180345f ("compiler/gcc4: Make quirk for
asm_volatile_goto() unconditional").

Then, much later, we ended up removing the workaround in commit
43c249ea0b ("compiler-gcc.h: remove ancient workaround for gcc PR
58670") because we no longer supported building the kernel with the
affected gcc versions, but we left the macro uses around.

Now, Sean Christopherson reports a new version of a very similar
problem, which is fixed by re-applying that ancient workaround.  But the
problem in question is limited to only the 'asm goto with outputs'
cases, so instead of re-introducing the old workaround as-is, let's
rename and limit the workaround to just that much less common case.

It looks like there are at least two separate issues that all hit in
this area:

 (a) some versions of gcc don't mark the asm goto as 'volatile' when it
     has outputs:

        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420

     which is easy to work around by just adding the 'volatile' by hand.

 (b) Internal compiler errors:

        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422

     which are worked around by adding the extra empty 'asm' as a
     barrier, as in the original workaround.

but the problem Sean sees may be a third thing since it involves bad
code generation (not an ICE) even with the manually added 'volatile'.

but the same old workaround works for this case, even if this feels a
bit like voodoo programming and may only be hiding the issue.

Reported-and-tested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andrew Pinski <quic_apinski@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-09 15:57:48 -08:00
Paolo Bonzini
687d8f4c3d Merge branch 'kvm-kconfig'
Cleanups to Kconfig definitions for KVM

* replace HAVE_KVM with an architecture-dependent symbol, when CONFIG_KVM
  may or may not be available depending on CPU capabilities (MIPS)

* replace HAVE_KVM with IS_ENABLED(CONFIG_KVM) for host-side code that is
  not part of the KVM module, so that it is completely compiled out

* factor common "select" statements in common code instead of requiring
  each architecture to specify it
2024-02-08 08:47:51 -05:00
Paolo Bonzini
f48212ee8e treewide: remove CONFIG_HAVE_KVM
It has no users anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:45:36 -05:00
Paolo Bonzini
61df71ee99 kvm: move "select IRQ_BYPASS_MANAGER" to common code
CONFIG_IRQ_BYPASS_MANAGER is a dependency of the common code included by
CONFIG_HAVE_KVM_IRQ_BYPASS.  There is no advantage in adding the corresponding
"select" directive to each architecture.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:45:34 -05:00
Paolo Bonzini
8886640dad kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask
unused definitions in include/uapi/linux/kvm.h.  __KVM_HAVE_READONLY_MEM however
was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on
architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly
return nonzero.  This however does not make sense, and it prevented userspace
from supporting this architecture-independent feature without recompilation.

Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and
is only used in virt/kvm/kvm_main.c.  Userspace does not need to test it
and there should be no need for it to exist.  Remove it and replace it
with a Kconfig symbol within Linux source code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:06 -05:00
Julian Stecklina
64435aaa4a KVM: x86: rename push to emulate_push for consistency
push and emulate_pop are counterparts. Rename push to emulate_push and
harmonize its function signature with emulate_pop. This should remove
a bit of cognitive load when reading this code.

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-2-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 16:10:01 -08:00
Julian Stecklina
6fd1e3963f KVM: x86: Clean up partially uninitialized integer in emulate_pop()
Explicitly zero out variables passed to emulate_pop() as output params
to harden against consuming uninitialized data, and to make sanitizers
happy.  Many flows that use emulate_pop() pass an "unsigned long" so as
to be able to hold the largest possible operand, but the actual number
of bytes written is usually the word with of the vCPU.  E.g. if the vCPU
is in 16-bit or 32-bit mode (on a 64-bit host), the upper portion of the
output param will be uninitialized.

Passing around the uninitialized data is benign, as actual KVM usage of
the output is also tied to the word width, but passing around
uninitialized data makes some sanitizers rightly complain.

Note, initializing the data in emulate_pop() is not a safe alternative,
e.g. it would result in em_leave() clobbering RBP[31:16] if LEAVE were
emulated with a 16-bit stack.

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-1-julian.stecklina@cyberus-technology.de
[sean: massage changelog, drop em_popa() variable size change]]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 16:08:54 -08:00
Thomas Prescher
03f6298c7c KVM: x86/emulator: emulate movbe with operand-size prefix
The MOVBE instruction can come with an operand-size prefix (66h). In
this, case the x86 emulation code returns EMULATION_FAILED.

It turns out that em_movbe can already handle this case and all that
is missing is an entry in respective opcode tables to populate
gprefix->pfx_66.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231212095938.26731-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 13:16:11 -08:00
Chao Gao
d7f0a00e43 KVM: VMX: Report up-to-date exit qualification to userspace
Use vmx_get_exit_qual() to read the exit qualification.

vcpu->arch.exit_qualification is cached for EPT violation only and even
for EPT violation, it is stale at this point because the up-to-date
value is cached later in handle_ept_violation().

Fixes: 70bcd708df ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits")
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-07 07:47:53 -08:00
Sean Christopherson
fdd58834d1 KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
Return -EINVAL instead of -EBUSY if userspace attempts KVM_SEV{,ES}_INIT
on a VM that already has SEV active.  Returning -EBUSY is nonsencial as
it's impossible to deactivate SEV without destroying the VM, i.e. the VM
isn't "busy" in any sane sense of the word, and the odds of any userspace
wanting exactly -EBUSY on a userspace bug are minuscule.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:10:12 -08:00
Ashish Kalra
0aa6b90ef9 KVM: SVM: Add support for allowing zero SEV ASIDs
Some BIOSes allow the end user to set the minimum SEV ASID value
(CPUID 0x8000001F_EDX) to be greater than the maximum number of
encrypted guests, or maximum SEV ASID value (CPUID 0x8000001F_ECX)
in order to dedicate all the SEV ASIDs to SEV-ES or SEV-SNP.

The SEV support, as coded, does not handle the case where the minimum
SEV ASID value can be greater than the maximum SEV ASID value.
As a result, the following confusing message is issued:

[   30.715724] kvm_amd: SEV enabled (ASIDs 1007 - 1006)

Fix the support to properly handle this case.

Fixes: 916391a2d1 ("KVM: SVM: Add support for SEV-ES capability in KVM")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Cc: stable@vger.kernel.org
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240104190520.62510-1-Ashish.Kalra@amd.com
Link: https://lore.kernel.org/r/20240131235609.4161407-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:10:11 -08:00
Sean Christopherson
466eec4a22 KVM: SVM: Use unsigned integers when dealing with ASIDs
Convert all local ASID variables and parameters throughout the SEV code
from signed integers to unsigned integers.  As ASIDs are fundamentally
unsigned values, and the global min/max variables are appropriately
unsigned integers, too.

Functionally, this is a glorified nop as KVM guarantees min_sev_asid is
non-zero, and no CPU supports -1u as the _only_ asid, i.e. the signed vs.
unsigned goof won't cause problems in practice.

Opportunistically use sev_get_asid() in sev_flush_encrypted_page() instead
of open coding an equivalent.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:09:34 -08:00
Sean Christopherson
cc4ce37bed KVM: SVM: Set sev->asid in sev_asid_new() instead of overloading the return
Explicitly set sev->asid in sev_asid_new() when a new ASID is successfully
allocated, and return '0' to indicate success instead of overloading the
return value to multiplex the ASID with error codes.  There is exactly one
caller of sev_asid_new(), and sev_asid_free() already consumes sev->asid,
i.e. returning the ASID isn't necessary for flexibility, nor does it
provide symmetry between related APIs.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 11:08:44 -08:00
Mathias Krause
e1dda3afe2 KVM: x86: Fix broken debugregs ABI for 32 bit kernels
The ioctl()s to get and set KVM's debug registers are broken for 32 bit
kernels as they'd only copy half of the user register state because of a
UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4
bytes).

This makes it impossible for userland to set anything but DR0 without
resorting to bit folding tricks.

Switch to a loop for copying debug registers that'll implicitly do the
type conversion for us, if needed.

There are likely no users (left) for 32bit KVM, fix the bug nonetheless.

Fixes: a1efbe77c1 ("KVM: x86: Add support for saving&restoring debug registers")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240203124522.592778-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05 15:40:54 -08:00
Mathias Krause
3376ca3f1a KVM: x86: Fix KVM_GET_MSRS stack info leak
Commit 6abe9c1386 ("KVM: X86: Move ignore_msrs handling upper the
stack") changed the 'ignore_msrs' handling, including sanitizing return
values to the caller. This was fine until commit 12bc2132b1 ("KVM:
X86: Do the same ignore_msrs check for feature msrs") which allowed
non-existing feature MSRs to be ignored, i.e. to not generate an error
on the ioctl() level. It even tried to preserve the sanitization of the
return value. However, the logic is flawed, as '*data' will be
overwritten again with the uninitialized stack value of msr.data.

Fix this by simplifying the logic and always initializing msr.data,
vanishing the need for an additional error exit path.

Fixes: 12bc2132b1 ("KVM: X86: Do the same ignore_msrs check for feature msrs")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240203124522.592778-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05 11:20:51 -08:00
Mingwei Zhang
05519c86d6 KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those).  As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can leads to incorrect
counts and spurious PMIs in the guest.

Fixes: 76d287b234 ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()")
Cc: stable@vger.kernel.org
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com
[sean: rewrite changelog to call out the effects of the bug]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-02 14:07:27 -08:00
Tanzir Hasan
66a5c40f60 kernel.h: removed REPEAT_BYTE from kernel.h
This patch creates wordpart.h and includes it in asm/word-at-a-time.h
for all architectures. WORD_AT_A_TIME_CONSTANTS depends on kernel.h
because of REPEAT_BYTE. Moving this to another header and including it
where necessary allows us to not include the bloated kernel.h. Making
this implicit dependency on REPEAT_BYTE explicit allows for later
improvements in the lib/string.c inclusion list.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Tanzir Hasan <tanzirh@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20231226-libstringheader-v6-1-80aa08c7652c@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-01 09:47:59 -08:00
Sean Christopherson
83bdfe04c9 KVM: x86/pmu: Avoid CPL lookup if PMC enabline for USER and KERNEL is the same
Don't bother querying the CPL if a PMC is (not) counting for both USER and
KERNEL, i.e. if the end result is guaranteed to be the same regardless of
the CPL.  Querying the CPL on Intel requires a VMREAD, i.e. isn't free,
and a single CMP+Jcc is cheap.

Link: https://lore.kernel.org/r/20231110022857.1273836-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
e35529fb4a KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired
When triggering events, i.e. emulating PMC events in software, check for
a matching event selector before checking the event is allowed.  The "is
allowed" check *might* be cheap, but it could also be very costly, e.g. if
userspace has defined a large PMU event filter.  The event selector check
on the other hand is all but guaranteed to be <10 uops, e.g. looks
something like:

   0xffffffff8105e615 <+5>:	movabs $0xf0000ffff,%rax
   0xffffffff8105e61f <+15>:	xor    %rdi,%rsi
   0xffffffff8105e622 <+18>:	test   %rax,%rsi
   0xffffffff8105e625 <+21>:	sete   %al

Link: https://lore.kernel.org/r/20231110022857.1273836-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
afda2d7666 KVM: x86/pmu: Expand the comment about what bits are check emulating events
Expand the comment about what bits are and aren't checked when emulating
PMC events in software.  As pointed out by Jim, AMD's mask includes bits
35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33)
as well as reserved bits (34 and 45).

Checking The IN_TX* bits is actually correct, as it's safe to assert that
the vCPU can't be in an HLE/RTM transaction if KVM is emulating an
instruction, i.e. KVM *shouldn't count if either of those bits is set.

For the reserved bits, KVM is has equal odds of being right if Intel adds
new behavior, i.e. ignoring them is just as likely to be correct as
checking them.

Opportunistically explain *why* the other flags aren't checked.

Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
f19063b1ca KVM: x86/pmu: Snapshot event selectors that KVM emulates in software
Snapshot the event selectors for the events that KVM emulates in software,
which is currently instructions retired and branch instructions retired.
The event selectors a tied to the underlying CPU, i.e. are constant for a
given platform even though perf doesn't manage the mappings as such.

Getting the event selectors from perf isn't exactly cheap, especially if
mitigations are enabled, as at least one indirect call is involved.

Snapshot the values in KVM instead of optimizing perf as working with the
raw event selectors will be required if KVM ever wants to emulate events
that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id"
entry.

Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
d2b321ea93 KVM: x86/pmu: Process only enabled PMCs when emulating events in software
Mask off disabled counters based on PERF_GLOBAL_CTRL *before* iterating
over PMCs to emulate (branch) instruction required events in software.  In
the common case where the guest isn't utilizing the PMU, pre-checking for
enabled counters turns a relatively expensive search into a few AND uops
and a Jcc.

Sadly, PMUs without PERF_GLOBAL_CTRL, e.g. most existing AMD CPUs, are out
of luck as there is no way to check that a PMC isn't being used without
checking the PMC's event selector.

Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Link: https://lore.kernel.org/r/20231110022857.1273836-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
e5a65d4f72 KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap
Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops
that iterate over valid PMCs given a bitmap (and because seeing checkpatch
whine about bad macro style is always amusing).

No functional change intended.

Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
004a0aa56e KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming
Refactor the handling of the reprogramming bitmap to snapshot and clear
to-be-processed bits before doing the reprogramming, and then explicitly
set bits for PMCs that need to be reprogrammed (again).  This will allow
adding a macro to iterate over all valid PMCs without having to add
special handling for the reprogramming bit, which (a) can have bits set
for non-existent PMCs and (b) needs to clear such bits to avoid wasting
cycles in perpetuity.

Note, the existing behavior of clearing bits after reprogramming does NOT
have a race with kvm_vm_ioctl_set_pmu_event_filter().  Setting a new PMU
filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that
the vCPU isn't in the middle of reprogramming with a stale filter prior to
setting the bitmap.

Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Sean Christopherson
b31880ca2f KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code
Add a common helper for *internal* PMC lookups, and delete the ops hook
and Intel's implementation.  Keep AMD's implementation, but rename it to
amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for
both KVM-internal and guest-initiated lookups.

Because KVM tracks all counters in a single bitmap, getting a counter
when iterating over a bitmap, e.g. of all valid PMCs, requires a small
amount of math, that while simple, isn't super obvious and doesn't use the
same semantics as PMC lookups from RDPMC!  Although AMD doesn't support
fixed counters, the common PMU code still behaves as if there a split, the
high half of which just happens to always be empty.

Opportunstically add a comment to explain both what is going on, and why
KVM uses a single bitmap, e.g. the boilerplate for iterating over separate
bitmaps could be done via macros, so it's not (just) about deduplicating
code.

Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:47 -08:00
Sean Christopherson
be6b067dae KVM: x86/pmu: Add common define to capture fixed counters offset
Add a common define to "officially" solidify KVM's split of counters,
i.e. to commit to using bits 31:0 to track general purpose counters and
bits 63:32 to track fixed counters (which only Intel supports).  KVM
already bleeds this behavior all over common PMU code, and adding a KVM-
defined macro allows clarifying that the value is a _base_, as oppposed to
the _flag_ that is used to access fixed PMCs via RDPMC (which perf
confusingly calls INTEL_PMC_FIXED_RDPMC_BASE).

No functional change intended.

Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:34:31 -08:00
Sean Christopherson
f933b88e20 KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
Move the purging of common PMU metadata from intel_pmu_refresh() to
kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if
the VM is supposed to have a vPMU.

KVM already denies access to the PMU based on kvm->arch.enable_pmu, as
get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already
violates AMD's architecture by not virtualizing a PMU (kernels have long
since learned to not panic when the PMU is unavailable).  But configuring
the PMU as if it were enabled causes unwanted side effects, e.g. calls to
kvm_pmu_trigger_event() waste an absurd number of cycles due to the
all_valid_pmc_idx bitmap being non-zero.

Fixes: b1d66dad65 ("KVM: x86/svm: Add module param to control PMU virtualization")
Reported-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com
Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:34:31 -08:00
Vitaly Kuznetsov
9e62797fd7 KVM: x86: Make gtod_is_based_on_tsc() return 'bool'
gtod_is_based_on_tsc() is boolean in nature, i.e. it returns '1' for good
clocksources and '0' otherwise. Moreover, its result is used raw by
kvm_get_time_and_clockread()/kvm_get_walltime_and_clockread() which are
'bool'.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-6-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 08:58:16 -08:00
Kunwu Chan
0dbd054699 KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Note, KMEM_CACHE() uses the required alignment of the struct, '8' as the
alignment, whereas KVM's existing code passes '0'.  In the end, the two
values yield the same result as x86's minimum slab alignment is also '8'
(which is not at all coincidental).

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Link: https://lore.kernel.org/r/20240116100025.95702-1-chentao@kylinos.cn
[sean: call out alignment behavior]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31 15:34:26 -08:00
Maciej S. Szmigiero
d52734d00b KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
Since commit b0563468ee ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.

Because KVM CPU caps are initialized from the kernel boot CPU features this
makes the XSAVES feature also unavailable for KVM guests in this case.
At the same time the XSAVEC feature is left enabled.

Unfortunately, having XSAVEC but no XSAVES in CPUID breaks Hyper-V enabled
Windows Server 2016 VMs that have more than one vCPU.

Let's at least give users hint in the kernel log what could be wrong since
these VMs currently simply hang at boot with a black screen - giving no
clue what suddenly broke them and how to make them work again.

Trigger the kernel message hint based on the particular guest ID written to
the Guest OS Identity Hyper-V MSR implemented by KVM.

Defer this check to when the L1 Hyper-V hypervisor enables SVM in EFER
since we want to limit this message to Hyper-V enabled Windows guests only
(Windows session running nested as L2) but the actual Guest OS Identity MSR
write is done by L1 and happens before it enables SVM.

Fixes: b0563468ee ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <b83ab45c5e239e5d148b0ae7750133a67ac9575c.1706127425.git.maciej.szmigiero@oracle.com>
[Move some checks before mutex_lock(), rename function. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-31 16:21:00 -05:00
Tengfei Yu
9e05d9b067 KVM: x86: Check irqchip mode before create PIT
As the kvm api(https://docs.kernel.org/virt/kvm/api.html) reads,
KVM_CREATE_PIT2 call is only valid after enabling in-kernel irqchip
support via KVM_CREATE_IRQCHIP.

Without this check, I can create PIT first and enable irqchip-split
then, which may cause the PIT invalid because of lacking of in-kernel
PIC to inject the interrupt.

Signed-off-by: Tengfei Yu <moehanabichan@gmail.com>
Message-Id: <20240125050823.4893-1-moehanabichan@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-31 16:21:00 -05:00
Xin Li
70d0fe5d09 KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in
IRQ/NMI induced VM exits.

Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com
2024-01-31 22:03:20 +01:00
Prasad Pandit
6231c9e1a9 KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI'
request for a vcpu even when its 'events->nmi.pending' is zero.
Ex:
    qemu_thread_start
     kvm_vcpu_thread_fn
      qemu_wait_io_event
       qemu_wait_io_event_common
        process_queued_cpu_work
         do_kvm_cpu_synchronize_post_init/_reset
          kvm_arch_put_registers
           kvm_put_vcpu_events (cpu, level=[2|3])

This leads vCPU threads in QEMU to constantly acquire & release the
global mutex lock, delaying the guest boot due to lock contention.
Add check to make KVM_REQ_NMI request only if vcpu has NMI pending.

Fixes: bdedff2631 ("KVM: x86: Route pending NMIs from userspace through process_nmi()")
Cc: stable@vger.kernel.org
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20240103075343.549293-1-ppandit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31 07:35:07 -08:00
Sean Christopherson
a634c76b2c KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC types
Explicitly check for attempts to read unsupported PMC types instead of
letting the bounds check fail.  Functionally, letting the check fail is
ok, but it's unnecessarily subtle and does a poor job of documenting the
architectural behavior that KVM is emulating.

Reviewed-by: Dapeng Mi  <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Sean Christopherson
7a0fc734c2 KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flag
Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an
explicit value, not a flag (minus one wart).  While non-architectural PMUs
do use bit 31 as a flag (for "fast" reads), architectural PMUs use the
upper half of ECX to encode the type.  From the SDM:

  ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of
  the PMC to be read within that type

Note, that the known supported types are 4000H and 2000H, i.e. look a lot
like flags, doesn't contradict the above statement that ECX[31:16] holds
the type, at least not by any sane reading of the SDM.

Keep the explicitly clearing of the FIXED "flag", as KVM subtly relies on
that behavior to disallow unsupported types while allowing the correct
indices for fixed counters.  This wart will be cleaned up in short order.

Opportunistically grab the per-type bitmask in the if-else blocks to
eliminate the one-off usage of the local "fixed" bool.

Reported-by: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00