Commit Graph

Sean Christopherson
4880919aaf KVM: SVM: Merge "after set CPUID" intercept recalc helpers
Merge svm_recalc_intercepts_after_set_cpuid() and
svm_recalc_instruction_intercepts() such that the "after set CPUID" helper
simply invokes the type-specific helpers (MSRs vs. instructions), i.e.
make svm_recalc_intercepts_after_set_cpuid() a single entry point for all
intercept updates that need to be performed after a CPUID change.
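
A minimal sketch of the resulting single entry point (helper signatures assumed):

        static void svm_recalc_intercepts_after_set_cpuid(struct kvm_vcpu *vcpu)
        {
                /* Recalculate both flavors of intercepts in one place. */
                svm_recalc_msr_intercepts(vcpu);
                svm_recalc_instruction_intercepts(vcpu);
        }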

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:32 -07:00
Sean Christopherson
40ba80e4b0 KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller
Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there
is only the one caller (and because the "init" misnomer is even more
misleading than it was in the past).

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:31 -07:00
Sean Christopherson
049dff172b KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific
Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid()
to more precisely describe its role.  Strictly speaking, the name isn't
perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an
intercept, but practically speaking it's close enough.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:30 -07:00
Sean Christopherson
4ceca57e3f KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the
trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc
helper to react to the new userspace filter.

No functional change intended.

Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:30 -07:00
Sean Christopherson
160f143cc1 KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts.  The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.

Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.

Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.

Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:29 -07:00
Sean Christopherson
8a056ece45 KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts.  The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.

Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.

Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.

Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Borislav Petkov <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:28 -07:00
Sean Christopherson
405a63d4d3 KVM: x86: Move definition of X2APIC_MSR() to lapic.h
Dedup the definition of X2APIC_MSR and put it in the local APIC code
where it belongs.
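
For context, the macro maps a legacy xAPIC register offset to its x2APIC MSR
index; a sketch of its shape (assuming the usual APIC_BASE_MSR of 0x800):

        /* E.g. APIC_TASKPRI (offset 0x80) maps to MSR 0x808. */
        #define X2APIC_MSR(x)   (APIC_BASE_MSR + ((x) >> 4))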

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:28 -07:00
Sean Christopherson
cb53d07948 KVM: SVM: Drop "always" flag from list of possible passthrough MSRs
Drop the "always" flag from the array of possible passthrough MSRs, and
instead manually initialize the permissions for the handful of MSRs that
KVM passes through by default.  In addition to cutting down on boilerplate
copy+paste code and eliminating a misleading flag (the MSRs aren't always
passed through, e.g. thanks to MSR filters), this will allow for removing
the direct_access_msrs array entirely.

Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:27 -07:00
Sean Christopherson
3a0f09b361 KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest
Disable interception of the GHCB MSR if and only if the VM is an SEV-ES
guest.  While the exact behavior is completely undocumented in the APM,
common sense and testing on SEV-ES capable CPUs says that accesses to the
GHCB from non-SEV-ES guests will #GP.  I.e. from the guest's perspective,
no functional change intended.
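
A minimal sketch of the intended logic (helper names assumed from the
surrounding series):

        /* Only SEV-ES/SEV-SNP guests have a GHCB; leave the MSR intercepted otherwise. */
        if (sev_es_guest(vcpu->kvm))
                svm_disable_intercept_for_msr(vcpu, MSR_AMD64_SEV_ES_GHCB, MSR_TYPE_RW);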

Fixes: 376c6d2850 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:26 -07:00
Sean Christopherson
6b7315fe54 KVM: SVM: Implement and adopt VMX style MSR intercepts APIs
Add and use SVM MSR interception APIs (in most paths) to match VMX's
APIs and nomenclature.  Specifically, add SVM variants of:

        vmx_disable_intercept_for_msr(vcpu, msr, type)
        vmx_enable_intercept_for_msr(vcpu, msr, type)
        vmx_set_intercept_for_msr(vcpu, msr, type, intercept)

to eventually replace SVM's single helper:

        set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write)

which is awkward to use (in all cases, KVM either applies the same logic
for both reads and writes, or intercepts one of read or write), and is
unintuitive due to using '0' to indicate interception should be *set*.

Keep the guts of the old API for the moment to avoid churning the MSR
filter code, as that mess will be overhauled in the near future.  Leave
behind a temporary comment to call out that the shadow bitmaps have
inverted polarity relative to the bitmaps consumed by hardware.
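
Usage then mirrors the VMX pattern, e.g. (MSR and type constants illustrative):

        /* New API: explicit intent, no inverted '0 == intercept' semantics. */
        svm_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
        svm_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
        svm_set_intercept_for_msr(vcpu, MSR_TSC_AUX, MSR_TYPE_RW, intercept);

        /* Old API: '0' for allow_read/allow_write means *intercept* the access. */
        set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);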

No functional change intended.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:26 -07:00
Sean Christopherson
c38595ad69 KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets
Add macro-built helpers for testing, setting, and clearing MSRPM entries
without relying on precomputed offsets.  This sets the stage for eventually
removing general KVM use of precomputed offsets, which are quite confusing
and rather inefficient for the vast majority of KVM's usage.

Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed
offsets and accesses is at best unnecessary, and at worst introduces extra
operations to retrieve the individual bit from within the offset u32 value.
And simply calling them "offsets" is very confusing, as the "unit" of the
offset isn't immediately obvious.

Use the new helpers in set_msr_interception_bitmap() and
msr_write_intercepted() to verify the math and operations, but keep the
existing offset-based logic in set_msr_interception_bitmap() to sanity
check the "clear" and "set" operations.  Manipulating MSR interceptions
isn't a hot path and no kernel release is ever expected to contain this
specific version of set_msr_interception_bitmap() (it will be removed
entirely in the near future).
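
For intuition, the MSRPM dedicates two consecutive bits per MSR (read
intercept, then write intercept) within each 2 KiB range chunk; a minimal,
self-contained sketch of that bit math (helper names here are illustrative,
not the new KVM helpers):

        #include <stdint.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Bit index of an MSR's read-intercept bit within its 2 KiB MSRPM chunk;
         * the write-intercept bit immediately follows it. */
        static uint32_t msrpm_bit(uint32_t msr, uint32_t range_base)
        {
                return (msr - range_base) * 2;
        }

        static bool test_bit8(const uint8_t *map, uint32_t bit)
        {
                return map[bit / 8] & (1u << (bit % 8));
        }

        static void set_bit8(uint8_t *map, uint32_t bit)
        {
                map[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }

        int main(void)
        {
                uint8_t chunk[2048] = { 0 };    /* one MSRPM range chunk */
                uint32_t bit = msrpm_bit(0xc0000080 /* EFER */, 0xc0000000);

                set_bit8(chunk, bit);           /* intercept EFER reads */
                printf("read intercepted: %d, write intercepted: %d\n",
                       test_bit8(chunk, bit), test_bit8(chunk, bit + 1));
                return 0;
        }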

Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:25 -07:00
Sean Christopherson
4879dc9469 KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"
Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough"
MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge
vmcb01 with vmcb12 and write the result to vmcb02.  This will eventually
allow for the removal of svm_vcpu_init_msrpm().

Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm(), which
defaults to intercepting all MSRs, e.g. if there is a bug lurking elsewhere,
the worst case scenario from dropping the call to svm_vcpu_init_msrpm()
should be that KVM would fail to pass through MSRs to L2.

Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:24 -07:00
Sean Christopherson
f21ff2c8c9 KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge
Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only
for SEV-ES guests.  KVM doesn't support nested virtualization for SEV-ES,
and likely never will.

Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:24 -07:00
Sean Christopherson
9b72c3d59f KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps
Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to
merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap.  This will eventually
allow for the removal of direct_access_msrs, as the only path where
tracking the offsets is truly justified is the merge for nested SVM, where
merging in chunks is an easy way to batch uaccess reads/writes.
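
Conceptually, the merge boils down to OR'ing the two bitmaps chunk by chunk,
since a set bit means "intercept" (sketch; array and variable names are
illustrative, and in reality the vmcb12 bitmap is read from guest memory):

        for (i = 0; i < ARRAY_SIZE(nested_msrpm_offsets); i++) {
                u32 offset = nested_msrpm_offsets[i];
                u32 l0 = vmcb01_msrpm[offset];          /* KVM's intercepts */
                u32 l1 = vmcb12_msrpm[offset];          /* L1's intercepts  */

                vmcb02_msrpm[offset] = l0 | l1;         /* intercept if either wants it */
        }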

Opportunistically omit the x2APIC MSRs from the merge-specific array
instead of filtering them out at runtime.

Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and
TSC_AUX is mutually exclusive with nested virtualization, as KVM passes
through those MSRs only for SEV-ES guests, and KVM doesn't support nested
virtualization for SEV+ guests.  Defer removing those MSRs to a future
cleanup in order to make this refactoring as benign as possible.

Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:23 -07:00
Sean Christopherson
16e9584cc0 KVM: SVM: Clean up macros related to architectural MSRPM definitions
Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding
helpers that are available to SVM code, and opportunistically replace a
variety of open-coded literals with (hopefully) informative macros.

Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping
it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified
with "SVM_MSRPM".

Deliberately leave the ranges as open coded literals, as using macros to
define the ranges actually introduces more potential failure points, since
both the definitions and the usage have to be careful to use the correct
index.  The lack of clear intent behind the ranges will be addressed in
future patches.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:11 -07:00
Sean Christopherson
925149b6d0 KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs
Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better
capture its role, and opportunistically feed it @vcpu instead of @svm, as
grabbing "svm" only to turn around and grab svm->vcpu is rather silly.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
b1bccf7883 KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts
Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores),
as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is
held.
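
I.e. the double-underscore variants are sufficient because there is no
concurrent writer to guard against:

        __set_bit(bit, bitmap);         /* non-atomic RMW; fine under vcpu->mutex */
        set_bit(bit, bitmap);           /* atomic (LOCK-prefixed) RMW; unnecessary here */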

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
6353cd685c KVM: SVM: Kill the VM instead of the host if MSR interception is buggy
WARN and kill the VM instead of panicking the host if KVM attempts to set
or query MSR interception for an unsupported MSR.  Accessing the MSR
interception bitmaps only meaningfully affects post-VMRUN behavior, and
KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN,
i.e. there is no need to panic the entire host.

Opportunistically move the sanity checks above their use to index into the
MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to
doing that _and_ generating an out-of-bounds load.
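
A rough sketch of the resulting pattern (bounds expression illustrative):

        if (KVM_BUG_ON(bit_nr >= SVM_MSRPM_SIZE_BITS, vcpu->kvm))
                return;         /* VM is marked dead; don't also index out of bounds */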

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
b241c50c4e KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs
Drop the unnecessary and dangerous value-terminated behavior of
direct_access_msrs, and simply iterate over the actual size of the array.
The use in svm_set_x2apic_msr_interception() is especially sketchy, as it
relies on unused capacity being zero-initialized, and '0' being outside
the range of x2APIC MSRs.

To ensure the array and shadow_msr_intercept stay synchronized, simply
assert that their sizes are identical (note the six 64-bit-only MSRs).

Note, direct_access_msrs will soon be removed entirely; keeping the assert
synchronized with the array isn't expected to be a long-term maintenance
burden.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
f886515f9b KVM: SVM: Tag MSR bitmap initialization helpers with __init
Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used
only during hardware setup to map potential passthrough MSRs to offsets in
the bitmap.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
5ebd737308 KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails
WARN and reject module loading if there is a problem with KVM's MSR
interception bitmaps.  Panicking the host in this situation is inexcusable
since it is trivially easy to propagate the error up the stack.

Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
fb96d5cf0f KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup()
Allocate pages for the IOPM after initial setup has been completed in
svm_hardware_setup(), so that sanity checks can be added in the setup flow
without needing to free the IOPM pages.  The IOPM is only referenced (via
iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need
to allocate it early on.

No functional change intended (beyond the obvious ordering differences,
e.g. if the allocation fails).

Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:39 -07:00
Sean Christopherson
674ffc6503 KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest
Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context
switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's
CPUID model.  Letting the guest access SPEC_CTRL is generally benign, but
the guest would see inconsistent behavior if KVM happened to emulate an
access to the MSR.
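
A rough, simplified sketch (helper names assumed; the real check considers
the various AMD SPEC_CTRL-related CPUID bits as well):

        /* Virtual SPEC_CTRL means the CPU context switches the MSR, but only
         * stop intercepting it if the guest's CPUID actually has SPEC_CTRL. */
        if (cpu_feature_enabled(X86_FEATURE_V_SPEC_CTRL) &&
            guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL))
                svm_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);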

Fixes: d00b99c514 ("KVM: SVM: Add support for Virtual SPEC_CTRL")
Reported-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:39 -07:00
Maxim Levitsky
6b1dd26544 KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the
host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting
while running the guest.  When running with the "default treatment of SMIs"
in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that
is visible to host (non-SMM) software; instead, the CPU transitions directly
from VMX non-root operation to SMM.  And critically, DEBUGCTL isn't context switched
by hardware on SMI or RSM, i.e. SMM will run with whatever value was
resident in hardware at the time of the SMI.

Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting
events while the CPU is executing in SMM, which can pollute profiling and
potentially leak information into the guest.

Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner
run loop, as the bit can be toggled in IRQ context via IPI callback (SMP
function call), by way of /sys/devices/cpu/freeze_on_smi.
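
Conceptually, the pre-entry check amounts to something like (sketch; exact
placement and code differ):

        u64 host_debugctl = get_debugctlmsr();

        /* FREEZE_IN_SMM may have been toggled from IRQ context since the last entry. */
        if ((vcpu->arch.host_debugctl ^ host_debugctl) & DEBUGCTLMSR_FREEZE_IN_SMM) {
                vcpu->arch.host_debugctl = host_debugctl;
                run_flags |= KVM_RUN_LOAD_DEBUGCTL;
        }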

Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be
preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs,
i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and
at worst could lead to undesirable behavior in the future if AMD CPUs ever
happened to pick up a collision with the bit.

Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module
owns and controls GUEST_IA32_DEBUGCTL.

WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the
lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).

Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed
by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state().
Doing so avoids the need to track host_debugctl on a per-VMCS basis, as
GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and
load_vmcs12_host_state().  For the VM-Fail case, even though KVM won't
have actually entered the guest, vcpu_enter_guest() will have run with
vmcs02 active and thus could result in vmcs01 being run with a stale value.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:24 -07:00
Maxim Levitsky
7d0cce6cbe KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs
Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to
vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into
GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state
into the guest, and without needing to copy+paste the FREEZE_IN_SMM
logic into every patch that accesses GUEST_IA32_DEBUGCTL.
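
A minimal sketch of the accessors' shape (signatures assumed; the
FREEZE_IN_SMM stuffing described above is layered on top in a later patch):

        static inline void vmx_guest_debugctl_write(struct kvm_vcpu *vcpu, u64 val)
        {
                vmcs_write64(GUEST_IA32_DEBUGCTL, val);
        }

        static inline u64 vmx_guest_debugctl_read(void)
        {
                return vmcs_read64(GUEST_IA32_DEBUGCTL);
        }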

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: massage changelog, make inline, use in all prepare_vmcs02() cases]
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:24 -07:00
Maxim Levitsky
095686e6fc KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter
Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports
a subset of hardware functionality, i.e. KVM can't rely on hardware to
detect illegal/unsupported values.  Failure to check the vmcs12 value
would allow the guest to load any hardware-supported value while running L2.

Take care to exempt BTF and LBR from the validity check in order to match
KVM's behavior for writes via WRMSR, but without clobbering vmcs12.  Even
if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect
that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR
are being intercepted.
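
Roughly, the consistency check becomes (helper name illustrative; CC() is
nested VMX's consistency-check macro):

        /* BTF and LBR are exempted to match KVM's WRMSR emulation, without
         * clobbering the value L1 put in vmcs12. */
        u64 debugctl = vmcs12->guest_ia32_debugctl & ~(DEBUGCTLMSR_BTF | DEBUGCTLMSR_LBR);

        if (CC(!vmx_is_valid_debugctl(vcpu, debugctl)))
                return -EINVAL;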

Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set
*and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but
that would incur non-trivial complexity and wouldn't change the fact that
KVM's handling of DEBUGCTL is blatantly broken.  I.e. the extra complexity
is not worth carrying.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:23 -07:00
Sean Christopherson
8a4351ac30 KVM: VMX: Extract checking of guest's DEBUGCTL into helper
Move VMX's logic to check DEBUGCTL values into a standalone helper so that
the code can be used by nested VM-Enter to apply the same logic to the
value being loaded from vmcs12.

KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested
VM-Enter, as hardware may support features that KVM does not, i.e. relying
on hardware to detect invalid guest state will result in false negatives.
Unfortunately, that means applying KVM's funky suppression of BTF and LBR
to vmcs12 so as not to break existing guests.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:22 -07:00
Sean Christopherson
17ec2f9653 KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the
guest CPUID model, as debug support is supposed to be available if RTM is
supported, and there are no known downsides to letting the guest debug RTM
aborts.

Note, there are no known bug reports related to RTM_DEBUG, the primary
motivation is to reduce the probability of breaking existing guests when a
future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL
(KVM currently lets L2 run with whatever hardware supports; whoops).

Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to
DR7.RTM.

Fixes: 83c529151a ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:24 -07:00
Sean Christopherson
80c64c7afe KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag
Instruct vendor code to load the guest's DR6 into hardware via a new
KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to
load vcpu->arch.dr6 into hardware when DR6 can be read/written directly
by the guest.

Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM
thinks DR6 needs to be reloaded.  TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH
and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6.
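
On the SVM side, loading the guest's DR6 then keys off the flag rather than
a dedicated hook (sketch):

        if (run_flags & KVM_RUN_LOAD_GUEST_DR6)
                svm->vmcb->save.dr6 = vcpu->arch.dr6;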

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:24 -07:00
Sean Christopherson
2478b1b220 KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter
into a generic bitmap so that similar "take action" information can be
passed to vendor code without creating a pile of boolean parameters.

This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and
will also allow for adding similar functionality for re-loading debugctl
in the active VMCS.
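
The end state is a small set of independent flags (bit values illustrative):

        #define KVM_RUN_FORCE_IMMEDIATE_EXIT    BIT(0)
        #define KVM_RUN_LOAD_GUEST_DR6          BIT(1)
        #define KVM_RUN_LOAD_DEBUGCTL           BIT(2)

        /* was: vcpu_run(vcpu, force_immediate_exit) */
        exit_fastpath = kvm_x86_call(vcpu_run)(vcpu, run_flags);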

Opportunistically massage the TDX WARN and comment to prepare for adding
more run_flags, all of which are expected to be mutually exclusive with
TDX, i.e. should be WARNed on.

No functional change intended.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:24 -07:00
Sean Christopherson
7d390a9da8 KVM: TDX: Use kvm_arch_vcpu.host_debugctl to restore the host's DEBUGCTL
Use the kvm_arch_vcpu.host_debugctl snapshot to restore DEBUGCTL after
running a TD vCPU.  The final TDX series rebase was mishandled, likely due
to commit fb71c79593 ("KVM: x86: Snapshot the host's DEBUGCTL in common
x86") deleting the same line of code from vmx.h, i.e. creating a semantic
conflict of sorts, but no syntactic conflict.

Using the version in kvm_vcpu_arch picks up the ulong => u64 fix (which
isn't relevant to TDX) as well as the IRQ fix from commit 189ecdb3e1
("KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs").

Link: https://lore.kernel.org/all/20250307212053.2948340-10-pbonzini@redhat.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 8af0990375 ("KVM: TDX: Save and restore IA32_DEBUGCTL")
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:23 -07:00
Paolo Bonzini
28224ef02b KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
Allow userspace to advertise TDG.VP.VMCALL subfunctions that the
kernel also supports.  For each output register of GetTdVmCallInfo's
leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported
TDVMCALLs (userspace can set those blindly) and one for user-supported
TDVMCALLs (userspace can set those if it knows how to handle them).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 14:20:20 -04:00
Paolo Bonzini
4580dbef5c KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 14:09:50 -04:00
Binbin Wu
25e8b1dd48 KVM: TDX: Exit to userspace for GetTdVmCallInfo
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX,
to allow userspace to provide information about its support of the
TDVMCALLs beyond the GHCI base API when r12 is 1.

GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>,
<ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>,
<Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and
<Instruction.WRMSR>. They must be supported by the VMM to support TDX guests.

For GetTdVmCallInfo
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0,
  successful execution indicates all GHCI base TDVMCALLs listed above are
  supported.

  Update the KVM TDX document with the set of the GHCI base APIs.

- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it
  indicates the TDX guest is querying the supported TDVMCALLs beyond
  the GHCI base TDVMCALLs.
  Exit to userspace to let userspace set the TDVMCALL sub-function bit(s)
  in the leaf outputs accordingly.  KVM could additionally set the bit(s) for
  TDVMCALLs it supports itself (i.e. that don't need support from userspace)
  after returning from userspace and before entering the guest.  Currently no
  such TDVMCALLs are implemented, so KVM just uses the values returned from
  userspace.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:55:47 -04:00
Binbin Wu
cf207eac06 KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
Handle TDVMCALL for GetQuote to generate a TD-Quote.

GetQuote is a doorbell-like interface used by TDX guests to request VMM
to generate a TD-Quote signed by a service hosting TD-Quoting Enclave
operating on the host.  A TDX guest passes a TD Report (TDREPORT_STRUCT) in
a shared-memory area as a parameter.  The host VMM can access it and queue the
operation for a service hosting the TD-Quoting Enclave.  When completed, the
Quote is returned via the same shared-memory area.

KVM only checks that the GPA from the TDX guest has the shared-bit set and drops
the shared-bit before exiting to userspace to avoid bleeding the shared-bit
into KVM's exit ABI.  KVM forwards the request to userspace VMM (e.g. QEMU)
and userspace VMM queues the operation asynchronously.  KVM sets the return
code according to the 'ret' field set by userspace to notify the TDX guest
whether the request has been queued successfully or not.  When the request
has been queued successfully, the TDX guest can poll the status field in
the shared-memory area to check whether the Quote generation is completed
or not.  When completed, the generated Quote is returned via the same
buffer.

Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is
required to handle the exit reason as part of the initial support for TDX,
by reentering KVM to ensure that the TDVMCALL is complete.  While at it,
add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:09:32 -04:00
Binbin Wu
b5aafcb4ef KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and
return it for unimplemented TDVMCALL subfunctions.

Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not
implemented is vague because TDX guests can't tell whether the error is due
to the subfunction not being supported or to an invalid input to the
subfunction.
New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the
ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.

Before the change, for common guest implementations, when a TDX guest
receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases:
1. Some operand is invalid. It could change the operand to another value
   and retry.
2. The subfunction is not supported.

For case 1, an invalid operand usually means a guest implementation bug.
Since the TDX guest can't tell which case it is, the best practice for
handling TDVMCALL_STATUS_INVALID_OPERAND is to stop calling the leaf,
treating the failure as fatal if the TDVMCALL is essential or ignoring
it if the TDVMCALL is optional.

With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to
old TDX guests that do not know about it, but it is expected that the
guest will take the same action as for TDVMCALL_STATUS_INVALID_OPERAND.
Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND
specifically; for example Linux just checks for success.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:09:31 -04:00
Borislav Petkov (AMD)
31272abd59 KVM: SVM: Advertise TSA CPUID bits to guests
Synthesize the TSA CPUID feature bits for guests. Set TSA_{SQ,L1}_NO on
unaffected machines.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-06-17 17:17:12 +02:00
Borislav Petkov (AMD)
d8010d4ba4 x86/bugs: Add a Transient Scheduler Attacks mitigation
Add the required feature detection glue to bugs.c et al. in order to
support the TSA mitigation.

Co-developed-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-06-17 17:17:02 +02:00
Borislav Petkov (AMD)
f9af88a3d3 x86/bugs: Rename MDS machinery to something more generic
It will be used by other x86 mitigations.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-06-16 18:45:18 +02:00
Paolo Bonzini
8046d29dde KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY
Only let userspace pass the same addresses that were used in KVM_SET_USER_MEMORY_REGION
(or KVM_SET_USER_MEMORY_REGION2); gpas in the upper half of the address space
are an implementation detail of TDX and KVM.

Extracted from a patch by Sean Christopherson <seanjc@google.com>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-12 00:51:42 -04:00
Paolo Bonzini
aa2024c01a KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
A bug[*] was reported for the TDX case when enabling KVM_PRE_FAULT_MEMORY in QEMU.

It turns out that the @gpa passed to kvm_mmu_do_page_fault() doesn't have
the shared bit set when its memory attribute is shared, which leads to the
wrong root being used in tdp_mmu_get_root_for_fault().

Fix it by embedding the direct bits in the gpa that is passed to
kvm_tdp_map_page(), when the memory of the gpa is not private.

[*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250611001018.2179964-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-12 00:43:39 -04:00
Ashish Kalra
179a8427fc KVM: SEV: Disable SEV-SNP support on initialization failure
During platform init, SNP initialization may fail for several reasons,
such as firmware command failures and incompatible versions. However,
the KVM capability may continue to advertise support for it.

The platform may have SNP enabled, but if SNP_INIT fails then SNP is
not supported by KVM.

During KVM module initialization, query the SNP platform status to obtain
the SNP initialization state and use it as an additional condition to
determine support for SEV-SNP.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Pratik R. Sampat <prsampat@amd.com>
Signed-off-by: Pratik R. Sampat <prsampat@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Message-ID: <20250512221634.12045-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-11 14:27:06 -04:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.
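
Callers are otherwise unchanged, e.g. (sketch; struct and field names
illustrative):

        static void my_timeout(struct timer_list *t)
        {
                struct my_dev *dev = timer_container_of(dev, t, timer);

                /* identical semantics to the old from_timer(dev, t, timer) */
        }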

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
7f9039c524 Generic:
* Clean up locking of all vCPUs for a VM by using the *_nest_lock()
   family of functions, and move duplicated code to virt/kvm/.
   kernel/ patches acked by Peter Zijlstra.
 
 * Add MGLRU support to the access tracking perf test.
 
 ARM fixes:
 
 * Make the irqbypass hooks resilient to changes in the GSI<->MSI
   routing, avoiding stale vLPI mappings being left behind. The
   fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
   and nuking the vLPI mapping upon a routing change.
 
 * Close another VGIC race where vCPU creation races with VGIC
   creation, leading to in-flight vCPUs entering the kernel w/o private
   IRQs allocated.
 
 * Fix a build issue triggered by the recently added workaround for
   Ampere's AC04_CPU_23 erratum.
 
 * Correctly sign-extend the VA when emulating a TLBI instruction
   potentially targeting a VNCR mapping.
 
 * Avoid dereferencing a NULL pointer in the VGIC debug code, which can
   happen if the device doesn't have any mapping yet.
 
 s390:
 
 * Fix interaction between some filesystems and Secure Execution
 
 * Some cleanups and refactorings, preparing for an upcoming big series
 
 x86:
 
 * Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
   fix a race between AP destroy and VMRUN.
 
 * Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 
 * Refine and harden handling of spurious faults.
 
 * Add support for ALLOWED_SEV_FEATURES.
 
 * Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 
 * Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
   that utilize those bits.
 
 * Don't account temporary allocations in sev_send_update_data().
 
 * Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
 
 * Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
   SVM and VMX.
 
 * Advertise support to userspace for WRMSRNS and PREFETCHI.
 
 * Rescan I/O APIC routes after handling EOI that needed to be intercepted due
   to the old/previous routing, but not the new/current routing.
 
 * Add a module param to control and enumerate support for device posted
   interrupts.
 
 * Fix a potential overflow with nested virt on Intel systems running 32-bit kernels.
 
 * Flush shadow VMCSes on emergency reboot.
 
 * Add support for SNP to the various SEV selftests.
 
 * Add a selftest to verify fastops instructions via forced emulation.
 
 * Refine and optimize KVM's software processing of the posted interrupt bitmap, and share
   the harvesting code between KVM and the kernel's Posted MSI handler

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:
  Generic:

   - Clean up locking of all vCPUs for a VM by using the *_nest_lock()
     family of functions, and move duplicated code to virt/kvm/. kernel/
     patches acked by Peter Zijlstra

   - Add MGLRU support to the access tracking perf test

  ARM fixes:

   - Make the irqbypass hooks resilient to changes in the GSI<->MSI
     routing, avoiding stale vLPI mappings being left behind. The
     fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
     and nuking the vLPI mapping upon a routing change

   - Close another VGIC race where vCPU creation races with VGIC
     creation, leading to in-flight vCPUs entering the kernel w/o
     private IRQs allocated

   - Fix a build issue triggered by the recently added workaround for
     Ampere's AC04_CPU_23 erratum

   - Correctly sign-extend the VA when emulating a TLBI instruction
     potentially targeting a VNCR mapping

   - Avoid dereferencing a NULL pointer in the VGIC debug code, which
     can happen if the device doesn't have any mapping yet

  s390:

   - Fix interaction between some filesystems and Secure Execution

   - Some cleanups and refactorings, preparing for an upcoming big
     series

  x86:

   - Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
     to fix a race between AP destroy and VMRUN

   - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for
     the VM

   - Refine and harden handling of spurious faults

   - Add support for ALLOWED_SEV_FEATURES

   - Add #VMGEXIT to the set of handlers special cased for
     CONFIG_RETPOLINE=y

   - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
     features that utilize those bits

   - Don't account temporary allocations in sev_send_update_data()

   - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
     Threshold

   - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
     IBPB, between SVM and VMX

   - Advertise support to userspace for WRMSRNS and PREFETCHI

   - Rescan I/O APIC routes after handling EOI that needed to be
     intercepted due to the old/previous routing, but not the
     new/current routing

   - Add a module param to control and enumerate support for device
     posted interrupts

   - Fix a potential overflow with nested virt on Intel systems running
     32-bit kernels

   - Flush shadow VMCSes on emergency reboot

   - Add support for SNP to the various SEV selftests

   - Add a selftest to verify fastops instructions via forced emulation

   - Refine and optimize KVM's software processing of the posted
     interrupt bitmap, and share the harvesting code between KVM and the
     kernel's Posted MSI handler"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
  rtmutex_api: provide correct extern functions
  KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
  KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
  KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
  KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
  KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
  KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
  KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
  arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
  KVM: s390: Simplify and move pv code
  KVM: s390: Refactor and split some gmap helpers
  KVM: s390: Remove unneeded srcu lock
  s390: Remove unneeded includes
  s390/uv: Improve splitting of large folios that cannot be split while dirty
  s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
  s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
  rust: add helper for mutex_trylock
  RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
  KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
  x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
  ...
2025-06-02 12:24:58 -07:00
Linus Torvalds
43db111107 ARM:
* Add large stage-2 mapping (THP) support for non-protected guests when
   pKVM is enabled, clawing back some performance.
 
 * Enable nested virtualisation support on systems that support it,
   though it is disabled by default.
 
 * Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
   protected modes.
 
 * Large rework of the way KVM tracks architecture features and links
   them with the effects of control bits. While this has no functional
   impact, it ensures correctness of emulation (the data is automatically
   extracted from the published JSON files), and helps dealing with the
   evolution of the architecture.
 
 * Significant changes to the way pKVM tracks ownership of pages,
   avoiding page table walks by storing the state in the hypervisor's
   vmemmap. This in turn enables the THP support described above.
 
 * New selftest checking the pKVM ownership transition rules
 
 * Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
   even if the host didn't have it.
 
 * Fixes for the address translation emulation, which happened to be
   rather buggy in some specific contexts.
 
 * Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
   from the number of counters exposed to a guest and addressing a
   number of issues in the process.
 
 * Add a new selftest for the SVE host state being corrupted by a
   guest.
 
 * Keep HCR_EL2.xMO set at all times for systems running with the
   kernel at EL2, ensuring that the window for interrupts is slightly
   bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
 
 * Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
   from a pretty bad case of TLB corruption unless accesses to HCR_EL2
   are heavily synchronised.
 
 * Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
   tables in a human-friendly fashion.
 
 * and the usual random cleanups.
 
 LoongArch:
 
 * Don't flush tlb if the host supports hardware page table walks.
 
 * Add KVM selftests support.
 
 RISC-V:
 
 * Add vector registers to get-reg-list selftest
 
 * VCPU reset related improvements
 
 * Remove scounteren initialization from VCPU reset
 
 * Support VCPU reset from userspace using set_mpstate() ioctl
 
 x86:
 
 * Initial support for TDX in KVM.  This finally makes it possible to use the
   TDX module to run confidential guests on Intel processors.  This is quite a
   large series, including support for private page tables (managed by the
   TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs
   to userspace, and handling several special VM exits from the TDX module.
 
   This has been in the works for literally years and it's not really possible
   to describe everything here, so I'll defer to the various merge commits
   up to and including commit 7bcf7246c4 ("Merge branch 'kvm-tdx-finish-initial'
   into HEAD").

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "As far as x86 goes this pull request "only" includes TDX host support.

  Quotes are appropriate because (at 6k lines and 100+ commits) it is
  much bigger than the rest, which will come later this week and
  consists mostly of bugfixes and selftests. s390 changes will also come
  in the second batch.

  ARM:

   - Add large stage-2 mapping (THP) support for non-protected guests
     when pKVM is enabled, clawing back some performance.

   - Enable nested virtualisation support on systems that support it,
     though it is disabled by default.

   - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
     and protected modes.

   - Large rework of the way KVM tracks architecture features and links
     them with the effects of control bits. While this has no functional
     impact, it ensures correctness of emulation (the data is
     automatically extracted from the published JSON files), and helps
     dealing with the evolution of the architecture.

   - Significant changes to the way pKVM tracks ownership of pages,
     avoiding page table walks by storing the state in the hypervisor's
     vmemmap. This in turn enables the THP support described above.

   - New selftest checking the pKVM ownership transition rules

   - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
     even if the host didn't have it.

   - Fixes for the address translation emulation, which happened to be
     rather buggy in some specific contexts.

   - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
     from the number of counters exposed to a guest and addressing a
     number of issues in the process.

   - Add a new selftest for the SVE host state being corrupted by a
     guest.

   - Keep HCR_EL2.xMO set at all times for systems running with the
     kernel at EL2, ensuring that the window for interrupts is slightly
     bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

   - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
     from a pretty bad case of TLB corruption unless accesses to HCR_EL2
     are heavily synchronised.

   - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
     tables in a human-friendly fashion.

   - and the usual random cleanups.

  LoongArch:

   - Don't flush tlb if the host supports hardware page table walks.

   - Add KVM selftests support.

  RISC-V:

   - Add vector registers to get-reg-list selftest

   - VCPU reset related improvements

   - Remove scounteren initialization from VCPU reset

   - Support VCPU reset from userspace using set_mpstate() ioctl

  x86:

   - Initial support for TDX in KVM.

     This finally makes it possible to use the TDX module to run
     confidential guests on Intel processors. This is quite a large
     series, including support for private page tables (managed by the
     TDX module and mirrored in KVM for efficiency), forwarding some
     TDVMCALLs to userspace, and handling several special VM exits from
     the TDX module.

     This has been in the works for literally years and it's not really
     possible to describe everything here, so I'll defer to the various
     merge commits up to and including commit 7bcf7246c4 ('Merge
     branch 'kvm-tdx-finish-initial' into HEAD')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
  x86/tdx: mark tdh_vp_enter() as __flatten
  Documentation: virt/kvm: remove unreferenced footnote
  RISC-V: KVM: lock the correct mp_state during reset
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: Remove scounteren initialization
  KVM: RISC-V: remove unnecessary SBI reset state
  ...
2025-05-29 08:10:01 -07:00
Paolo Bonzini
8e86e73626 Merge branch 'kvm-lockdep-common' into HEAD
Introduce new mutex locking functions mutex_trylock_nest_lock() and
mutex_lock_killable_nest_lock() and use them to clean up locking
of all vCPUs for a VM.

For x86, this removes some complex code that was used instead
of lockdep's "nest_lock" feature.

For ARM and RISC-V, this removes a lockdep warning when the VM is
configured to have more than MAX_LOCK_DEPTH vCPUs, and removes a fair
amount of duplicate code by sharing the logic across all architectures.
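
A rough sketch of the shared locking pattern (simplified; the unwind of
already-locked vCPUs on failure is omitted):

        lockdep_assert_held(&kvm->lock);

        kvm_for_each_vcpu(i, vcpu, kvm) {
                /* Nests every vcpu->mutex under kvm->lock, so lockdep sees a
                 * single nesting level no matter how many vCPUs exist. */
                if (mutex_lock_killable_nest_lock(&vcpu->mutex, &kvm->lock))
                        return -EINTR;
        }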

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-28 06:29:17 -04:00
Maxim Levitsky
c560bc9286 x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
Use kvm_lock_all_vcpus instead of sev's own implementation.

Because kvm_lock_all_vcpus uses the _nest_lock feature of lockdep, which
ignores subclasses, there is no longer a need to use separate subclasses
for source and target VMs.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:16:41 -04:00
Paolo Bonzini
4e02d4f973 KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
    fix a race between AP destroy and VMRUN.
 
  - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 
  - Add support for ALLOWED_SEV_FEATURES.
 
  - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 
  - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
    that utilize those bits.
 
  - Don't account temporary allocations in sev_send_update_data().
 
  - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.

Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.16:

 - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
   fix a race between AP destroy and VMRUN.

 - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.

 - Add support for ALLOWED_SEV_FEATURES.

 - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.

 - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
   that utilize those bits.

 - Don't account temporary allocations in sev_send_update_data().

 - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-27 12:15:49 -04:00
Paolo Bonzini
3e89d5fdc7 KVM VMX changes for 6.16:
- Explicitly check MSR load/store list counts to fix a potential overflow on
    32-bit kernels.
 
  - Flush shadow VMCSes on emergency reboot.
 
  - Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
    SEV or TDX is disabled via Kconfig.
 
  - Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
    needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.

Merge tag 'kvm-x86-vmx-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.16:

 - Explicitly check MSR load/store list counts to fix a potential overflow on
   32-bit kernels.

 - Flush shadow VMCSes on emergency reboot.

 - Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
   SEV or TDX is disabled via Kconfig.

 - Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
   needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
2025-05-27 12:15:38 -04:00
Paolo Bonzini
db44dcbdf8 KVM x86 posted interrupt changes for 6.16:
Refine and optimize KVM's software processing of the PIR, and ultimately share
 PIR harvesting code between KVM and the kernel's Posted MSI handler

Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM x86 posted interrupt changes for 6.16:

Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
2025-05-27 12:15:01 -04:00
Paolo Bonzini
5d816c1399 KVM x86 MMU changes for 6.16:
- Refine and harden handling of spurious faults.
 
  - Use kvm_x86_call() instead of open coding static_call().

Merge tag 'kvm-x86-mmu-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.16:

 - Refine and harden handling of spurious faults.

 - Use kvm_x86_call() instead of open coding static_call().
2025-05-27 12:14:47 -04:00
Paolo Bonzini
ebd38b26ec KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
    SVM and VMX.
 
  - Advertise support to userspace for WRMSRNS and PREFETCHI.
 
  - Rescan I/O APIC routes after handling EOI that needed to be intercepted due
    to the old/previous routing, but not the new/current routing.
 
  - Add a module param to control and enumerate support for device posted
    interrupts.
 
  - Misc cleanups.

Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.16:

 - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
   SVM and VMX.

 - Advertise support to userspace for WRMSRNS and PREFETCHI.

 - Rescan I/O APIC routes after handling EOI that needed to be intercepted due
   to the old/previous routing, but not the new/current routing.

 - Add a module param to control and enumerate support for device posted
   interrupts.

 - Misc cleanups.
2025-05-27 12:14:36 -04:00
Edward Adam Davis
cd1be30b44 KVM: VMX: use __always_inline for is_td_vcpu and is_td
is_td() and is_td_vcpu() are used in no-instrumentation sections; use
__always_inline instead of inline.

vmlinux.o: error: objtool: vmx_handle_nmi+0x47:
        call to is_td_vcpu.isra.0() leaves .noinstr.text section

Fixes: 7172c753c2 ("KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Message-ID: <tencent_1A767567C83C1137829622362E4A72756F09@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:12:49 -04:00
Linus Torvalds
785cdec46e Core x86 updates for v6.16:
Boot code changes:
 
  - A large series of changes to reorganize the x86 boot code into a better isolated
    and easier to maintain base of PIC early startup code in arch/x86/boot/startup/,
    by Ard Biesheuvel.
 
    Motivation & background:
 
 	| Since commit
 	|
 	|    c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
 	|
 	| dated Jun 6 2017, we have been using C code on the boot path in a way
 	| that is not supported by the toolchain, i.e., to execute non-PIC C
 	| code from a mapping of memory that is different from the one provided
 	| to the linker. It should have been obvious at the time that this was a
 	| bad idea, given the need to sprinkle fixup_pointer() calls left and
 	| right to manipulate global variables (including non-pointer variables)
 	| without crashing.
 	|
 	| This C startup code has been expanding, and in particular, the SEV-SNP
 	| startup code has been expanding over the past couple of years, and
 	| grown many of these warts, where the C code needs to use special
 	| annotations or helpers to access global objects.
 
    This tree includes the first phase of this work-in-progress x86 boot code
    reorganization.
 
 Scalability enhancements and micro-optimizations:
 
  - Improve code-patching scalability (Eric Dumazet)
  - Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper)
 
 CPU features enumeration updates:
 
  - Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S. Darwish)
  - Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish, Thomas Gleixner)
  - Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish)
 
 Memory management changes:
 
  - Allow temporary MMs when IRQs are on (Andy Lutomirski)
  - Opt-in to IRQs-off activate_mm() (Andy Lutomirski)
  - Simplify choose_new_asid() and generate better code (Borislav Petkov)
  - Simplify 32-bit PAE page table handling (Dave Hansen)
  - Always use dynamic memory layout (Kirill A. Shutemov)
  - Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov)
  - Make 5-level paging support unconditional (Kirill A. Shutemov)
  - Stop prefetching current->mm->mmap_lock on page faults (Mateusz Guzik)
  - Predict valid_user_address() returning true (Mateusz Guzik)
  - Consolidate initmem_init() (Mike Rapoport)
 
 FPU support and vector computing:
 
  - Enable Intel APX support (Chang S. Bae)
  - Reorganize and clean up the xstate code (Chang S. Bae)
  - Make task_struct::thread constant size (Ingo Molnar)
  - Restore fpu_thread_struct_whitelist() to fix CONFIG_HARDENED_USERCOPY=y
    (Kees Cook)
  - Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg Nesterov)
  - Always preserve non-user xfeatures/flags in __state_perm (Sean Christopherson)
 
 Microcode loader changes:
 
  - Help users notice when running old Intel microcode (Dave Hansen)
  - AMD: Do not return error when microcode update is not necessary (Annie Li)
  - AMD: Clean the cache if update did not load microcode (Boris Ostrovsky)
 
 Code patching (alternatives) changes:
 
  - Simplify, reorganize and clean up the x86 text-patching code (Ingo Molnar)
  - Make smp_text_poke_batch_process() subsume smp_text_poke_batch_finish()
    (Nikolay Borisov)
  - Refactor the {,un}use_temporary_mm() code (Peter Zijlstra)
 
 Debugging support:
 
  - Add early IDT and GDT loading to debug relocate_kernel() bugs (David Woodhouse)
  - Print the reason for the last reset on modern AMD CPUs (Yazen Ghannam)
  - Add AMD Zen debugging document (Mario Limonciello)
  - Fix opcode map (!REX2) superscript tags (Masami Hiramatsu)
  - Stop decoding i64 instructions in x86-64 mode at opcode (Masami Hiramatsu)
 
 CPU bugs and bug mitigations:
 
  - Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov)
  - Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov)
  - Restructure and harmonize the various CPU bug mitigation methods
    (David Kaplan)
  - Fix spectre_v2 mitigation default on Intel (Pawan Gupta)
 
 MSR API:
 
  - Large MSR code and API cleanup (Xin Li)
  - In-kernel MSR API type cleanups and renames (Ingo Molnar)
 
 PKEYS:
 
  - Simplify PKRU update in signal frame (Chang S. Bae)
 
 NMI handling code:
 
  - Clean up, refactor and simplify the NMI handling code (Sohil Mehta)
  - Improve NMI duration console printouts (Sohil Mehta)
 
 Paravirt guests interface:
 
  - Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov)
 
 SEV support:
 
  - Share the sev_secrets_pa value again (Tom Lendacky)
 
 x86 platform changes:
 
  - Introduce the <asm/amd/> header namespace (Ingo Molnar)
  - i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>
    (Mario Limonciello)
 
 Fixes and cleanups:
 
  - x86 assembly code cleanups and fixes (Uros Bizjak)
 
  - Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy Shevchenko,
    Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav Petkov, Chang S. Bae,
    Chao Gao, Dan Williams, Dave Hansen, David Kaplan, David Woodhouse,
    Eric Biggers, Ingo Molnar, Josh Poimboeuf, Juergen Gross, Malaya Kumar Rout,
    Mario Limonciello, Nathan Chancellor, Oleg Nesterov, Pawan Gupta,
    Peter Zijlstra, Shivank Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak,
    Xin Li)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:
 "Boot code changes:

   - A large series of changes to reorganize the x86 boot code into a
     better isolated and easier to maintain base of PIC early startup
     code in arch/x86/boot/startup/, by Ard Biesheuvel.

     Motivation & background:

  	| Since commit
  	|
  	|    c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
  	|
  	| dated Jun 6 2017, we have been using C code on the boot path in a way
  	| that is not supported by the toolchain, i.e., to execute non-PIC C
  	| code from a mapping of memory that is different from the one provided
  	| to the linker. It should have been obvious at the time that this was a
  	| bad idea, given the need to sprinkle fixup_pointer() calls left and
  	| right to manipulate global variables (including non-pointer variables)
  	| without crashing.
  	|
  	| This C startup code has been expanding, and in particular, the SEV-SNP
  	| startup code has been expanding over the past couple of years, and
  	| grown many of these warts, where the C code needs to use special
  	| annotations or helpers to access global objects.

     This tree includes the first phase of this work-in-progress x86
     boot code reorganization.

  Scalability enhancements and micro-optimizations:

   - Improve code-patching scalability (Eric Dumazet)

   - Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper)

  CPU features enumeration updates:

   - Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S.
     Darwish)

   - Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish,
     Thomas Gleixner)

   - Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish)

  Memory management changes:

   - Allow temporary MMs when IRQs are on (Andy Lutomirski)

   - Opt-in to IRQs-off activate_mm() (Andy Lutomirski)

   - Simplify choose_new_asid() and generate better code (Borislav
     Petkov)

   - Simplify 32-bit PAE page table handling (Dave Hansen)

   - Always use dynamic memory layout (Kirill A. Shutemov)

   - Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov)

   - Make 5-level paging support unconditional (Kirill A. Shutemov)

   - Stop prefetching current->mm->mmap_lock on page faults (Mateusz
     Guzik)

   - Predict valid_user_address() returning true (Mateusz Guzik)

   - Consolidate initmem_init() (Mike Rapoport)

  FPU support and vector computing:

   - Enable Intel APX support (Chang S. Bae)

   - Reorganize and clean up the xstate code (Chang S. Bae)

   - Make task_struct::thread constant size (Ingo Molnar)

   - Restore fpu_thread_struct_whitelist() to fix
     CONFIG_HARDENED_USERCOPY=y (Kees Cook)

   - Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg
     Nesterov)

   - Always preserve non-user xfeatures/flags in __state_perm (Sean
     Christopherson)

  Microcode loader changes:

   - Help users notice when running old Intel microcode (Dave Hansen)

   - AMD: Do not return error when microcode update is not necessary
     (Annie Li)

   - AMD: Clean the cache if update did not load microcode (Boris
     Ostrovsky)

  Code patching (alternatives) changes:

   - Simplify, reorganize and clean up the x86 text-patching code (Ingo
     Molnar)

   - Make smp_text_poke_batch_process() subsume
     smp_text_poke_batch_finish() (Nikolay Borisov)

   - Refactor the {,un}use_temporary_mm() code (Peter Zijlstra)

  Debugging support:

   - Add early IDT and GDT loading to debug relocate_kernel() bugs
     (David Woodhouse)

   - Print the reason for the last reset on modern AMD CPUs (Yazen
     Ghannam)

   - Add AMD Zen debugging document (Mario Limonciello)

   - Fix opcode map (!REX2) superscript tags (Masami Hiramatsu)

   - Stop decoding i64 instructions in x86-64 mode at opcode (Masami
     Hiramatsu)

  CPU bugs and bug mitigations:

   - Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov)

   - Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov)

   - Restructure and harmonize the various CPU bug mitigation methods
     (David Kaplan)

   - Fix spectre_v2 mitigation default on Intel (Pawan Gupta)

  MSR API:

   - Large MSR code and API cleanup (Xin Li)

   - In-kernel MSR API type cleanups and renames (Ingo Molnar)

  PKEYS:

   - Simplify PKRU update in signal frame (Chang S. Bae)

  NMI handling code:

   - Clean up, refactor and simplify the NMI handling code (Sohil Mehta)

   - Improve NMI duration console printouts (Sohil Mehta)

  Paravirt guests interface:

   - Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov)

  SEV support:

   - Share the sev_secrets_pa value again (Tom Lendacky)

  x86 platform changes:

   - Introduce the <asm/amd/> header namespace (Ingo Molnar)

   - i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to
     <asm/amd/fch.h> (Mario Limonciello)

  Fixes and cleanups:

   - x86 assembly code cleanups and fixes (Uros Bizjak)

   - Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy
     Shevchenko, Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav
     Petkov, Chang S. Bae, Chao Gao, Dan Williams, Dave Hansen, David
     Kaplan, David Woodhouse, Eric Biggers, Ingo Molnar, Josh Poimboeuf,
     Juergen Gross, Malaya Kumar Rout, Mario Limonciello, Nathan
     Chancellor, Oleg Nesterov, Pawan Gupta, Peter Zijlstra, Shivank
     Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak, Xin Li)"

* tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (331 commits)
  x86/bugs: Fix spectre_v2 mitigation default on Intel
  x86/bugs: Restructure ITS mitigation
  x86/xen/msr: Fix uninitialized variable 'err'
  x86/msr: Remove a superfluous inclusion of <asm/asm.h>
  x86/paravirt: Restrict PARAVIRT_XXL to 64-bit only
  x86/mm/64: Make 5-level paging support unconditional
  x86/mm/64: Make SPARSEMEM_VMEMMAP the only memory model
  x86/mm/64: Always use dynamic memory layout
  x86/bugs: Fix indentation due to ITS merge
  x86/cpuid: Rename hypervisor_cpuid_base()/for_each_possible_hypervisor_cpuid_base() to cpuid_base_hypervisor()/for_each_possible_cpuid_base_hypervisor()
  x86/cpu/intel: Rename CPUID(0x2) descriptors iterator parameter
  x86/cacheinfo: Rename CPUID(0x2) descriptors iterator parameter
  x86/cpuid: Rename cpuid_get_leaf_0x2_regs() to cpuid_leaf_0x2()
  x86/cpuid: Rename have_cpuid_p() to cpuid_feature()
  x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header
  x86/cpuid: Move CPUID(0x2) APIs into <cpuid/api.h>
  x86/msr: Add rdmsrl_on_cpu() compatibility wrapper
  x86/mm: Fix kernel-doc descriptions of various pgtable methods
  x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too
  x86/boot: Defer initialization of VM space related global variables
  ...
2025-05-26 16:04:17 -07:00
Linus Torvalds
14418ddcc2 This update includes the following changes:
API:
 
 - Fix memcpy_sglist to handle partially overlapping SG lists.
 - Use memcpy_sglist to replace null skcipher.
 - Rename CRYPTO_TESTS to CRYPTO_BENCHMARK.
 - Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS.
 - Hide CRYPTO_MANAGER.
 - Add delayed freeing of driver crypto_alg structures.
 
 Compression:
 
 - Allocate large buffers on first use instead of initialisation in scomp.
 - Drop destination linearisation buffer in scomp.
 - Move scomp stream allocation into acomp.
 - Add acomp scatter-gather walker.
 - Remove request chaining.
 - Add optional async request allocation.
 
 Hashing:
 
 - Remove request chaining.
 - Add optional async request allocation.
 - Move partial block handling into API.
 - Add ahash support to hmac.
 - Fix shash documentation to disallow usage in hard IRQs.
 
 Algorithms:
 
 - Remove unnecessary SIMD fallback code on x86 and arm/arm64.
 - Drop avx10_256 xts(aes)/ctr(aes) on x86.
 - Improve avx-512 optimisations for xts(aes).
 - Move chacha arch implementations into lib/crypto.
 - Move poly1305 into lib/crypto and drop unused Crypto API algorithm.
 - Disable powerpc/poly1305 as it has no SIMD fallback.
 - Move sha256 arch implementations into lib/crypto.
 - Convert deflate to acomp.
 - Set block size correctly in cbcmac.
 
 Drivers:
 
 - Do not use sg_dma_len before mapping in sun8i-ss.
 - Fix warm-reboot failure by making shutdown do more work in qat.
 - Add locking in zynqmp-sha.
 - Remove cavium/zip.
 - Add support for PCI device 0x17D8 to ccp.
 - Add qat_6xxx support in qat.
 - Add support for RK3576 in rockchip-rng.
 - Add support for i.MX8QM in caam.
 
 Others:
 
 - Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up.
 - Add new SEV/SNP platform shutdown API in ccp.

Merge tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Fix memcpy_sglist to handle partially overlapping SG lists
   - Use memcpy_sglist to replace null skcipher
   - Rename CRYPTO_TESTS to CRYPTO_BENCHMARK
   - Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS
   - Hide CRYPTO_MANAGER
   - Add delayed freeing of driver crypto_alg structures

  Compression:
   - Allocate large buffers on first use instead of initialisation in scomp
   - Drop destination linearisation buffer in scomp
   - Move scomp stream allocation into acomp
   - Add acomp scatter-gather walker
   - Remove request chaining
   - Add optional async request allocation

  Hashing:
   - Remove request chaining
   - Add optional async request allocation
   - Move partial block handling into API
   - Add ahash support to hmac
   - Fix shash documentation to disallow usage in hard IRQs

  Algorithms:
   - Remove unnecessary SIMD fallback code on x86 and arm/arm64
   - Drop avx10_256 xts(aes)/ctr(aes) on x86
   - Improve avx-512 optimisations for xts(aes)
   - Move chacha arch implementations into lib/crypto
   - Move poly1305 into lib/crypto and drop unused Crypto API algorithm
   - Disable powerpc/poly1305 as it has no SIMD fallback
   - Move sha256 arch implementations into lib/crypto
   - Convert deflate to acomp
   - Set block size correctly in cbcmac

  Drivers:
   - Do not use sg_dma_len before mapping in sun8i-ss
   - Fix warm-reboot failure by making shutdown do more work in qat
   - Add locking in zynqmp-sha
   - Remove cavium/zip
   - Add support for PCI device 0x17D8 to ccp
   - Add qat_6xxx support in qat
   - Add support for RK3576 in rockchip-rng
   - Add support for i.MX8QM in caam

  Others:
   - Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up
   - Add new SEV/SNP platform shutdown API in ccp"

* tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (382 commits)
  x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
  crypto: qat - add missing header inclusion
  crypto: api - Redo lookup on EEXIST
  Revert "crypto: testmgr - Add hash export format testing"
  crypto: marvell/cesa - Do not chain submitted requests
  crypto: powerpc/poly1305 - add depends on BROKEN for now
  Revert "crypto: powerpc/poly1305 - Add SIMD fallback"
  crypto: ccp - Add missing tee info reg for teev2
  crypto: ccp - Add missing bootloader info reg for pspv5
  crypto: sun8i-ce - move fallback ahash_request to the end of the struct
  crypto: octeontx2 - Use dynamic allocated memory region for lmtst
  crypto: octeontx2 - Initialize cptlfs device info once
  crypto: xts - Only add ecb if it is not already there
  crypto: lrw - Only add ecb if it is not already there
  crypto: testmgr - Add hash export format testing
  crypto: testmgr - Use ahash for generic tfm
  crypto: hmac - Add ahash support
  crypto: testmgr - Ignore EEXIST on shash allocation
  crypto: algapi - Add driver template support to crypto_inst_setname
  crypto: shash - Set reqsize in shash_alg
  ...
2025-05-26 13:47:28 -07:00
Paolo Bonzini
85502b2214 LoongArch KVM changes for v6.16
 1. Don't flush the TLB if HW PTW is supported.
 2. Add LoongArch KVM selftests support.

Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.16

1. Don't flush the TLB if HW PTW is supported.
2. Add LoongArch KVM selftests support.
2025-05-26 16:12:13 -04:00
Manali Shukla
89f9edf4c6 KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.

The biggest difference between the two features is that Threshold is
fault-like, whereas Detection is trap-like.  To allow the guest to make
forward progress, Threshold provides a per-VMCB counter which is
decremented every time a bus lock occurs, and a VM-Exit is triggered if
and only if the counter is '0'.

To provide Detection-like semantics, initialize the counter to '0', i.e.
exit on every bus lock, and when re-executing the guilty instruction, set
the counter to '1' to effectively step past the instruction.

Note, in the unlikely scenario that re-executing the instruction doesn't
trigger a bus lock, e.g. because the guest has changed memory types or
patched the guilty instruction, the bus lock counter will be left at '1',
i.e. the guest will be able to do a bus lock on a different instruction.
In a perfect world, KVM would ensure the counter is '0' if the guest has
made forward progress, e.g. if RIP has changed.  But trying to close that
hole would incur non-trivial complexity, for marginal benefit; the intent
of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks,
not to allow for precise detection of problematic guest code.  And, it's
simply not feasible to fully close the hole, e.g. if an interrupt arrives
before the original instruction can re-execute, the guest could step past
a different bus lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com
[sean: fix typo in comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-19 11:05:10 -07:00
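
For illustration, a standalone C model of the counter handling described in the entry above; the struct, field, and function names are hypothetical and do not correspond to the actual KVM/SVM code:

    #include <stdint.h>
    #include <stdbool.h>

    /* Simplified stand-in for the per-VMCB Bus Lock Threshold counter. */
    struct vmcb_model {
        uint16_t bus_lock_counter;      /* A VM-Exit occurs only when this is 0. */
    };

    /* Detection-like setup: start at 0 so every bus lock exits. */
    void bus_lock_threshold_init(struct vmcb_model *vmcb)
    {
        vmcb->bus_lock_counter = 0;
    }

    /* Before re-executing the guilty instruction, permit exactly one bus lock. */
    void bus_lock_exit_complete(struct vmcb_model *vmcb)
    {
        vmcb->bus_lock_counter = 1;
    }

    /* Hardware behavior being modeled: exit at 0, otherwise decrement and continue. */
    bool bus_lock_triggers_exit(struct vmcb_model *vmcb)
    {
        if (vmcb->bus_lock_counter == 0)
            return true;                /* Fault-like exit, reported to userspace. */
        vmcb->bus_lock_counter--;       /* Step past this bus lock. */
        return false;
    }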
Sean Christopherson
6a3d704959 KVM: x86/mmu: Use kvm_x86_call() instead of manual static_call()
Use KVM's preferred kvm_x86_call() wrapper to invoke static calls related
to mirror page tables.

No functional change intended.

Fixes: 77ac7079e6 ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Fixes: 94faba8999 ("KVM: x86/tdp_mmu: Propagate tearing down mirror page tables")
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250331182703.725214-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-16 13:13:58 -07:00
Manali Shukla
e9628b011b KVM: x86: Make kvm_pio_request.linear_rip a common field for user exits
Move and rename kvm_pio_request.linear_rip to
kvm_vcpu_arch.cui_linear_rip so that the field can be used by other
userspace exit completion flows that need to take action if and only
if userspace has not modified RIP.

No functional changes intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-2-manali.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-16 09:40:23 -07:00
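
A rough sketch of the check this enables; the real field stores the linear RIP (CS base plus RIP), and these names are illustrative only:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified per-vCPU state for completing a userspace exit. */
    struct vcpu_arch_model {
        uint64_t linear_rip;            /* Current linear RIP of the vCPU. */
        uint64_t cui_linear_rip;        /* Saved when the userspace exit was queued. */
    };

    /* Take the completion action only if userspace has not modified RIP. */
    bool can_complete_userspace_exit(const struct vcpu_arch_model *arch)
    {
        return arch->linear_rip == arch->cui_linear_rip;
    }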
Ahmed S. Darwish
3bf8ce8284 x86/cpuid: Rename hypervisor_cpuid_base()/for_each_possible_hypervisor_cpuid_base() to cpuid_base_hypervisor()/for_each_possible_cpuid_base_hypervisor()
In order to let all the APIs under <cpuid/api.h> have a shared "cpuid_"
namespace, rename hypervisor_cpuid_base() to cpuid_base_hypervisor().

To align with the new style, also rename:

    for_each_possible_hypervisor_cpuid_base(function)

to:

    for_each_possible_cpuid_base_hypervisor(function)

Adjust call-sites accordingly.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/aCZOi0Oohc7DpgTo@lx-t490
2025-05-16 10:54:47 +02:00
Ahmed S. Darwish
968e300068 x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header
The main CPUID header <asm/cpuid.h> was originally a storefront for the
headers:

    <asm/cpuid/api.h>
    <asm/cpuid/leaf_0x2_api.h>

Now that the latter CPUID(0x2) header has been merged into the former,
there is no practical difference between <asm/cpuid.h> and
<asm/cpuid/api.h>.

Migrate all users to the <asm/cpuid/api.h> header, in preparation of
the removal of <asm/cpuid.h>.

Don't remove <asm/cpuid.h> just yet, in case some new code in -next
started using it.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
2025-05-15 18:23:55 +02:00
Ingo Molnar
c4070e1996 Merge commit 'its-for-linus-20250509-merge' into x86/core, to resolve conflicts
Conflicts:
	Documentation/admin-guide/hw-vuln/index.rst
	arch/x86/include/asm/cpufeatures.h
	arch/x86/kernel/alternative.c
	arch/x86/kernel/cpu/bugs.c
	arch/x86/kernel/cpu/common.c
	drivers/base/cpu.c
	include/linux/cpu.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 10:47:10 +02:00
Ingo Molnar
1f82e8e1ca Merge branch 'x86/msr' into x86/core, to resolve conflicts
Conflicts:
	arch/x86/boot/startup/sme.c
	arch/x86/coco/sev/core.c
	arch/x86/kernel/fpu/core.c
	arch/x86/kernel/fpu/xstate.c

 Semantic conflict:
	arch/x86/include/asm/sev-internal.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 10:42:06 +02:00
Ingo Molnar
206c07d6ab Merge branch 'x86/bugs' into x86/core, to merge dependent commits
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:

  6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 10:35:14 +02:00
Linus Torvalds
6f5bf947ba * Mitigate Indirect Target Selection (ITS) issue

Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 ITS mitigation from Dave Hansen:
 "Mitigate Indirect Target Selection (ITS) issue.

  I'd describe this one as a good old CPU bug where the behavior is
  _obviously_ wrong, but since it just results in bad predictions it
  wasn't wrong enough to notice. Well, the researchers noticed and also
  realized that this bug undermined a bunch of existing indirect branch
  mitigations.

  Thus the unusually wide impact on this one. Details:

  ITS is a bug in some Intel CPUs that affects indirect branches
  including RETs in the first half of a cacheline. Due to ITS such
  branches may get wrongly predicted to the target of a (direct or indirect)
  branch that is located in the second half of a cacheline. Researchers
  at VUSec found this behavior and reported it to Intel.

  Affected processors:

   - Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet
     Lake, Ice Lake, Tiger Lake and Rocket Lake.

  Scope of impact:

   - Guest/host isolation:

     When eIBRS is used for guest/host isolation, the indirect branches
     in the VMM may still be predicted with targets corresponding to
     direct branches in the guest.

   - Intra-mode using cBPF:

     cBPF can be used to poison the branch history to exploit ITS.
     Realigning the indirect branches and RETs mitigates this attack
     vector.

   - User/kernel:

     With eIBRS enabled user/kernel isolation is *not* impacted by ITS.

   - Indirect Branch Prediction Barrier (IBPB):

     Due to this bug indirect branches may be predicted with targets
     corresponding to direct branches which were executed prior to IBPB.
     This will be fixed in the microcode.

  Mitigation:

  As indirect branches in the first half of a cacheline are affected, the
  mitigation is to replace those indirect branches with a call to a thunk that
  is aligned to the second half of the cacheline.

  RETs that take their prediction from the RSB are not affected, but they may
  be affected by an RSB-underflow condition. So, RETs in the first half of a
  cacheline are also patched to a return thunk that executes the RET aligned
  to the second half of the cacheline"

* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftest/x86/bugs: Add selftests for ITS
  x86/its: FineIBT-paranoid vs ITS
  x86/its: Use dynamic thunks for indirect branches
  x86/ibt: Keep IBT disabled during alternative patching
  mm/execmem: Unify early execmem_cache behaviour
  x86/its: Align RETs in BHB clear sequence to avoid thunking
  x86/its: Add support for RSB stuffing mitigation
  x86/its: Add "vmexit" option to skip mitigation on some CPUs
  x86/its: Enable Indirect Target Selection mitigation
  x86/its: Add support for ITS-safe return thunk
  x86/its: Add support for ITS-safe indirect thunk
  x86/its: Enumerate Indirect Target Selection (ITS) bug
  Documentation: x86/bugs/its: Add ITS documentation
2025-05-11 17:23:03 -07:00
Pawan Gupta
159013a7ca x86/its: Enumerate Indirect Target Selection (ITS) bug
The ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the
first half of a cache line to get predicted to a target of a branch located in
the second half of the cache line.

Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09 13:22:04 -07:00
Sean Christopherson
e3417ab75a KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions
Set the magic BP_SPEC_REDUCE bit to mitigate SRSO when running VMs if and
only if KVM has at least one active VM.  Leaving the bit set at all times
unfortunately degrades performance by a wee bit more than expected.

Use a dedicated spinlock and counter instead of hooking virtualization
enablement, as changing the behavior of kvm.enable_virt_at_load based on
SRSO_BP_SPEC_REDUCE is painful, and has its own drawbacks, e.g. could
result in performance issues for flows that are sensitive to VM creation
latency.

Defer setting BP_SPEC_REDUCE until VMRUN is imminent to avoid impacting
performance on CPUs that aren't running VMs, e.g. if a setup is using
housekeeping CPUs.  Setting BP_SPEC_REDUCE in task context, i.e. without
blasting IPIs to all CPUs, also helps avoid serializing 1<=>N transitions
without incurring a gross amount of complexity (see the Link for details
on how ugly coordinating via IPIs gets).

Link: https://lore.kernel.org/all/aBOnzNCngyS_pQIW@google.com
Fixes: 8442df2b49 ("x86/bugs: KVM: Add support for SRSO_MSR_FIX")
Reported-by: Michael Larabel <Michael@michaellarabel.com>
Closes: https://www.phoronix.com/review/linux-615-amd-regression
Cc: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250505180300.973137-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-08 07:17:10 -07:00
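
A standalone sketch of the 0 <=> 1 VM-count transitions described above; a pthread mutex stands in for the dedicated spinlock, the MSR write is a placeholder, and the real code additionally defers the set until VMRUN is imminent:

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t srso_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned int nr_active_vms;
    static bool bp_spec_reduce_set;     /* Models the MSR bit being set. */

    static void toggle_bp_spec_reduce(bool on)
    {
        bp_spec_reduce_set = on;        /* Placeholder for the real MSR write. */
    }

    /* Only the 0 -> 1 transition sets the bit. */
    void vm_created(void)
    {
        pthread_mutex_lock(&srso_lock);
        if (nr_active_vms++ == 0)
            toggle_bp_spec_reduce(true);
        pthread_mutex_unlock(&srso_lock);
    }

    /* Only the 1 -> 0 transition clears it, so hosts with no VMs pay nothing. */
    void vm_destroyed(void)
    {
        pthread_mutex_lock(&srso_lock);
        if (--nr_active_vms == 0)
            toggle_bp_spec_reduce(false);
        pthread_mutex_unlock(&srso_lock);
    }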
Herbert Xu
fba4aafaba Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux v6.15-rc5
Merge mainline to pick up bcachefs poly1305 patch 4bf4b5046d
("bcachefs: use library APIs for ChaCha20 and Poly1305").  This
is a prerequisite for removing the poly1305 shash algorithm.
2025-05-05 13:25:15 +08:00
Sean Christopherson
9129633d56 KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing
When changing memory attributes on a subset of a potential hugepage, add
the hugepage to the invalidation range tracking to prevent installing a
hugepage until the attributes are fully updated.  Like the actual hugepage
tracking updates in kvm_arch_post_set_memory_attributes(), process only
the head and tail pages, as any potential hugepages that are entirely
covered by the range will already be tracked.

Note, only hugepage chunks whose current attributes are NOT mixed need to
be added to the invalidation set, as mixed attributes already prevent
installing a hugepage, and it's perfectly safe to install a smaller
mapping for a gfn whose attributes aren't changing.

Fixes: 8dd2eee9d5 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: stable@vger.kernel.org
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/r/20250430220954.522672-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:39:34 -07:00
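
A sketch of the head/tail handling described above; attrs_mixed() and track_range() are hypothetical placeholders for KVM's mixed-attribute check and invalidation-range tracking, and the range is assumed to be [start, end) in 4K pages:

    #include <stdbool.h>
    #include <stdint.h>

    #define HUGE_NR_PAGES 512ULL        /* One 2M hugepage in 4K pages. */

    static bool attrs_mixed(uint64_t hugepage_gfn)
    {
        (void)hugepage_gfn;
        return false;                   /* Placeholder for the real mixed check. */
    }

    static void track_range(uint64_t start_gfn, uint64_t nr_pages)
    {
        (void)start_gfn;
        (void)nr_pages;                 /* Placeholder for invalidation tracking. */
    }

    /*
     * When an attribute change covers only part of a potential hugepage, track
     * the whole hugepage so no huge mapping is installed until the update is
     * complete.  Fully covered hugepages are already tracked by the range, and
     * hugepages with already-mixed attributes cannot be installed anyway.
     */
    void track_partial_hugepages(uint64_t start, uint64_t end)
    {
        uint64_t head = start & ~(HUGE_NR_PAGES - 1);
        uint64_t tail = end & ~(HUGE_NR_PAGES - 1);

        if (head != start && !attrs_mixed(head))
            track_range(head, HUGE_NR_PAGES);
        if (tail != end && tail != head && !attrs_mixed(tail))
            track_range(tail, HUGE_NR_PAGES);
    }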
Tom Lendacky
5fea0c6c0e KVM: SVM: Update dump_ghcb() to use the GHCB snapshot fields
Commit 4e15a0ddc3 ("KVM: SEV: snapshot the GHCB before accessing it")
updated the SEV code to take a snapshot of the GHCB before using it. But
the dump_ghcb() function wasn't updated to use the snapshot locations.
This results in incorrect output from dump_ghcb() for the "is_valid" and
"valid_bitmap" fields.

Update dump_ghcb() to use the proper locations.

Fixes: 4e15a0ddc3 ("KVM: SEV: snapshot the GHCB before accessing it")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://lore.kernel.org/r/8f03878443681496008b1b37b7c4bf77a342b459.1745866531.git.thomas.lendacky@amd.com
[sean: add comment and snapshot qualifier]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:39:33 -07:00
Vishal Verma
907092bf7c KVM: VMX: Clean up and macrofy x86_ops
Eliminate a lot of stub definitions by using macros to define the TDX vs
non-TDX versions of various x86_ops. Moving the x86_ops wrappers under
CONFIG_KVM_INTEL_TDX also allows nearly all of vmx/main.c to go under a
single #ifdef, eliminating trampolines in the generated code, and almost
all of the stubs.

For example, with CONFIG_KVM_INTEL_TDX=n, before this cleanup,
vt_refresh_apicv_exec_ctrl() would produce:

0000000000036490 <vt_refresh_apicv_exec_ctrl>:
   36490:       f3 0f 1e fa             endbr64
   36494:       e8 00 00 00 00          call   36499 <vt_refresh_apicv_exec_ctrl+0x9>
                        36495: R_X86_64_PLT32   __fentry__-0x4
   36499:       e9 00 00 00 00          jmp    3649e <vt_refresh_apicv_exec_ctrl+0xe>
                        3649a: R_X86_64_PLT32   vmx_refresh_apicv_exec_ctrl-0x4
   3649e:       66 90                   xchg   %ax,%ax

After this patch, this is completely eliminated.

Based on a patch by Sean Christopherson <seanjc@google.com>

Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/
Cc: Sean Christopherson <seanjc@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:26 -07:00
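
To illustrate the trampoline being eliminated, here is a standalone sketch of the vt_* glue pattern and a macro that generates it; the struct, function, and macro names are simplified stand-ins, not the real vmx/main.c code:

    #include <stdbool.h>
    #include <stdio.h>

    struct vcpu_model { bool is_td; };

    static void tdx_refresh_apicv(struct vcpu_model *v) { (void)v; puts("tdx path"); }
    static void vmx_refresh_apicv(struct vcpu_model *v) { (void)v; puts("vmx path"); }

    /*
     * Generate a vt_<name>() wrapper that routes to tdx_<name>() or vmx_<name>()
     * based on the VM type.  With TDX compiled out, the build can instead plug
     * the vmx_* function in directly and skip the wrapper entirely.
     */
    #define VT_GLUE(name)                               \
        static void vt_##name(struct vcpu_model *v)     \
        {                                               \
            if (v->is_td) {                             \
                tdx_##name(v);                          \
                return;                                 \
            }                                           \
            vmx_##name(v);                              \
        }

    VT_GLUE(refresh_apicv)              /* Expands to vt_refresh_apicv(). */

    int main(void)
    {
        struct vcpu_model td = { .is_td = true };
        struct vcpu_model vm = { .is_td = false };

        vt_refresh_apicv(&td);          /* prints "tdx path" */
        vt_refresh_apicv(&vm);          /* prints "vmx path" */
        return 0;
    }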
Vishal Verma
1a81d9d5a1 KVM: VMX: Define a VMX glue macro for kvm_complete_insn_gp()
Define kvm_complete_insn_gp() as vmx_complete_emulated_msr() and use the
glue wrapper in vt_complete_emulated_msr() so that VT's
.complete_emulated_msr() implementation follows the soon-to-be-standard
pattern of:

    vt_abc:
        if (is_td())
            return tdx_abc();
        return vmx_abc();

This will allow generating such wrappers via a macro, which in turn will
make it trivially easy to skip the wrappers entirely when KVM_INTEL_TDX=n.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/
Cc: Sean Christopherson <seanjc@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-3-701e82d6b779@intel.com
[sean: massage shortlog+changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:25 -07:00
Vishal Verma
84ad4d834c KVM: VMX: Move vt_apicv_pre_state_restore() to posted_intr.c and tweak name
In preparation for a cleanup of the kvm_x86_ops struct for TDX, all vt_*
functions are expected to act as glue functions that route to either tdx_*
or vmx_* based on the VM type. Specifically, the pattern is:

vt_abc:
    if (is_td())
        return tdx_abc();
    return vmx_abc();

But vt_apicv_pre_state_restore() does not follow this pattern. To
facilitate that cleanup, rename and move vt_apicv_pre_state_restore() into
posted_intr.c.

Opportunistically turn vcpu_to_pi_desc() back into a static function, as
the only reason it was exposed outside of posted_intr.c was for
vt_apicv_pre_state_restore().

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/
Cc: Sean Christopherson <seanjc@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-2-701e82d6b779@intel.com
[sean: apply Chao's suggestions, massage shortlog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:25 -07:00
Sean Christopherson
f2d7993314 KVM: x86: Revert kvm_x86_ops.mem_enc_ioctl() back to an OPTIONAL hook
Restore KVM's handling of a NULL kvm_x86_ops.mem_enc_ioctl, as the hook is
NULL on SVM when CONFIG_KVM_AMD_SEV=n, and TDX will soon follow suit.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/kvm-x86-ops.h:130 kvm_x86_vendor_init+0x178b/0x18e0
  Modules linked in:
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.15.0-rc2-dc1aead1a985-sink-vm #2 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_x86_vendor_init+0x178b/0x18e0
  Call Trace:
   <TASK>
   svm_init+0x2e/0x60
   do_one_initcall+0x56/0x290
   kernel_init_freeable+0x192/0x1e0
   kernel_init+0x16/0x130
   ret_from_fork+0x30/0x50
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Opportunistically drop the superfluous curly braces.

Link: https://lore.kernel.org/all/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com
Fixes: b2aaf38ced ("KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl")
Link: https://lore.kernel.org/r/20250502203421.865686-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:22 -07:00
Xin Li (Intel)
502ad6e5a6 x86/msr: Change the function type of native_read_msr_safe()
Modify the function type of native_read_msr_safe() to:

    int native_read_msr_safe(u32 msr, u64 *val)

This change makes the function return an error code instead of the
MSR value, aligning it with the type of native_write_msr_safe().
Consequently, their callers can check the results in the same way.

While at it, convert leftover MSR data type "unsigned int" to u32.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-16-xin@zytor.com
2025-05-02 10:36:36 +02:00
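
A small sketch of the new calling convention, where the return value is an error code and the MSR value comes back through a pointer; the fault condition is faked and the names are illustrative:

    #include <stdint.h>
    #include <errno.h>

    /* Models the shape: int native_read_msr_safe(u32 msr, u64 *val) */
    int read_msr_safe_model(uint32_t msr, uint64_t *val)
    {
        if (msr == 0xffffffffu)         /* Pretend this MSR faults. */
            return -EIO;
        *val = 0;                       /* Placeholder for the real RDMSR result. */
        return 0;
    }

    /* Callers can now check read and write results the same way. */
    uint64_t read_msr_or_default(uint32_t msr, uint64_t def)
    {
        uint64_t val;

        if (read_msr_safe_model(msr, &val))
            return def;
        return val;
    }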
Xin Li (Intel)
0c2678efed x86/pvops/msr: Refactor pv_cpu_ops.write_msr{,_safe}()
An MSR value is represented as a 64-bit unsigned integer, with existing
MSR instructions storing it in EDX:EAX as two 32-bit segments.

The new immediate form MSR instructions, however, utilize a 64-bit
general-purpose register to store the MSR value.  To unify the usage of
all MSR instructions, let the default MSR access APIs accept an MSR
value as a single 64-bit argument instead of two 32-bit segments.

The dual 32-bit APIs are still available as convenient wrappers over the
APIs that handle an MSR value as a single 64-bit argument.

The following illustrates the updated derivation of the MSR write APIs:

                 __wrmsrq(u32 msr, u64 val)
                   /                  \
                  /                    \
           native_wrmsrq(msr, val)    native_wrmsr(msr, low, high)
                 |
                 |
           native_write_msr(msr, val)
                /          \
               /            \
       wrmsrq(msr, val)    wrmsr(msr, low, high)

When CONFIG_PARAVIRT is enabled, wrmsrq() and wrmsr() are defined on top
of paravirt_write_msr():

            paravirt_write_msr(u32 msr, u64 val)
               /             \
              /               \
          wrmsrq(msr, val)    wrmsr(msr, low, high)

paravirt_write_msr() invokes cpu.write_msr(msr, val), an indirect layer
of pv_ops MSR write call:

    If on native:

            cpu.write_msr = native_write_msr

    If on Xen:

            cpu.write_msr = xen_write_msr

Therefore, refactor pv_cpu_ops.write_msr{_safe}() to accept an MSR value
in a single u64 argument, replacing the current dual u32 arguments.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-14-xin@zytor.com
2025-05-02 10:36:36 +02:00
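
A minimal sketch of the unified shape described above: the primitive takes the MSR value as a single u64, and the dual 32-bit form is only a wrapper that reassembles EDX:EAX. Names are illustrative and no real WRMSR is issued:

    #include <stdint.h>

    /* Stand-in for the 64-bit write primitive (native or pv_ops). */
    void write_msr_model(uint32_t msr, uint64_t val)
    {
        (void)msr;
        (void)val;                      /* Real code would execute WRMSR here. */
    }

    /* Dual 32-bit convenience wrapper layered on the 64-bit primitive. */
    static inline void write_msr_model32(uint32_t msr, uint32_t low, uint32_t high)
    {
        write_msr_model(msr, ((uint64_t)high << 32) | low);
    }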
Xin Li (Intel)
3204877d05 x86/msr: Convert __rdmsr() uses to native_rdmsrq() uses
__rdmsr() is the lowest level MSR read API, with native_rdmsr()
and native_rdmsrq() serving as higher-level wrappers around it.

  #define native_rdmsr(msr, val1, val2)                   \
  do {                                                    \
          u64 __val = __rdmsr((msr));                     \
          (void)((val1) = (u32)__val);                    \
          (void)((val2) = (u32)(__val >> 32));            \
  } while (0)

  static __always_inline u64 native_rdmsrq(u32 msr)
  {
          return __rdmsr(msr);
  }

However, __rdmsr() continues to be utilized in various locations.

MSR APIs are designed for different scenarios, such as native or
pvops, with or without trace, and safe or non-safe.  Unfortunately,
the current MSR API names do not adequately reflect these factors,
making it challenging to select the most appropriate API for
various situations.

To pave the way for improving MSR API names, convert __rdmsr()
uses to native_rdmsrq() to ensure consistent usage.  Later, these
APIs can be renamed to better reflect their implications, such as
native or pvops, with or without trace, and safe or non-safe.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-10-xin@zytor.com
2025-05-02 10:36:35 +02:00
Xin Li (Intel)
efef7f184f x86/msr: Add explicit includes of <asm/msr.h>
For historic reasons there are some TSC-related functions in the
<asm/msr.h> header, even though there's an <asm/tsc.h> header.

To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h>
to <asm/tsc.h> and to eventually eliminate the inclusion of
<asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency
to the source files that reference definitions from <asm/msr.h>.

[ mingo: Clarified the changelog. ]

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
2025-05-02 10:23:47 +02:00
Ingo Molnar
0c7b20b852 Linux 6.15-rc4

Merge tag 'v6.15-rc4' into x86/msr, to pick up fixes and resolve conflicts

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-02 09:43:44 +02:00
David Kaplan
d43ba2dc8e x86/bugs: Restructure L1TF mitigation
Restructure L1TF to use select/apply functions to create consistent
vulnerability handling.

Define new AUTO mitigation for L1TF.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/20250418161721.1855190-16-david.kaplan@amd.com
2025-04-29 18:57:30 +02:00
Sean Christopherson
54a1a24fea KVM: x86: Unify cross-vCPU IBPB
Both SVM and VMX have similar implementations for executing an IBPB
between running different vCPUs on the same CPU to create separate
prediction domains for different vCPUs.

For VMX, when the currently loaded VMCS is changed in
vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which
is the case on vCPU load. The intention is to execute an IBPB when
switching vCPUs, but not when switching the VMCS within the same vCPU.
Executing an IBPB on nested transitions within the same vCPU is handled
separately and conditionally in nested_vmx_vmexit().

For SVM, the current VMCB is tracked on vCPU load and an IBPB is
executed when it is changed. The intention is also to execute an IBPB
when switching vCPUs, although it is possible that in some cases an IBPB
is executed when switching VMCBs for the same vCPU. Executing an IBPB on
nested transitions should be handled separately, and is proposed at [1].

Unify the logic by tracking the last loaded vCPU and executing the IBPB
on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is
destroyed, make sure all references to it are removed from any CPU. This
is similar to how SVM clears the current_vmcb tracking on vCPU
destruction. Remove the current VMCB tracking in SVM as it is no longer
required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs().

[1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev

Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: tweak comment to stay at/under 80 columns]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-29 08:39:44 -07:00
Yosry Ahmed
1bee4838eb KVM: SVM: Clear current_vmcb during vCPU free for all *possible* CPUs
When freeing a vCPU and thus its VMCB, clear current_vmcb for all possible
CPUs, not just online CPUs, as it's theoretically possible a CPU could go
offline and come back online in conjunction with KVM reusing the page for
a new VMCB.

Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev
Fixes: fd65d3142f ("kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-29 08:39:35 -07:00
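
A standalone sketch combining the behavior of the two entries above: per-CPU tracking of the last vCPU that ran, an IBPB only when that changes, and scrubbing of every slot (all possible CPUs, not just online ones) when a vCPU is freed. The names and fixed CPU count are illustrative:

    #include <stddef.h>

    #define NR_CPUS_MODEL 64

    struct vcpu_model { int id; };

    /* Per-CPU record of the vCPU that last ran there. */
    static struct vcpu_model *last_vcpu[NR_CPUS_MODEL];

    static void issue_ibpb(void)
    {
        /* Placeholder for the barrier separating prediction domains. */
    }

    /* On load, an IBPB is needed only when the CPU switches to a different vCPU. */
    void vcpu_load_model(struct vcpu_model *vcpu, int cpu)
    {
        if (last_vcpu[cpu] != vcpu)
            issue_ibpb();
        last_vcpu[cpu] = vcpu;
    }

    /* On free, clear every slot so a recycled allocation at the same address
     * is not mistaken for the old vCPU. */
    void vcpu_free_model(struct vcpu_model *vcpu)
    {
        for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
            if (last_vcpu[cpu] == vcpu)
                last_vcpu[cpu] = NULL;
    }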
Peng Hao
e0136112e9 x86/sev: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variables
Some variables allocated in sev_send_update_data() are released when
the function exits, so there is no need to allocate them with GFP_KERNEL_ACCOUNT.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20250428063013.62311-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:09:55 -07:00
Yan Zhao
20a6cff3b2 KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in
kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete
root.

Since kvm_mmu_reload() can be called outside the
vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be
invoked after a root has been marked obsolete and before vcpu_enter_guest()
is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to
invalid. This causes kvm_mmu_reload() to fail to load a new root, which
can lead to kvm_arch_vcpu_pre_fault_memory() being stuck in the while
loop in kvm_tdp_map_page() since RET_PF_RETRY is always returned due to
is_page_fault_stale().

Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in
vcpu_enter_guest() since the cost of kvm_check_request() is negligible,
especially a check that's guarded by kvm_request_pending().

Export symbol of kvm_mmu_free_obsolete_roots() as kvm_mmu_reload() is
inline and may be called outside of kvm.ko.

Fixes: 6e01b7601d ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013333.5817-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:06:13 -07:00
Yan Zhao
11d4517511 KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU
Warn if PFN changes on shadow-present SPTE in mmu_set_spte().

KVM should _never_ change the PFN of a shadow-present SPTE. In the
mmu_set_spte() path, the WARN_ON_ONCE() that detects PFN changes of a
shadow-present SPTE lives in mmu_spte_update(). However, that
WARN_ON_ONCE() is not hittable since mmu_set_spte() invokes drop_spte()
before mmu_spte_update(), which clears the SPTE to a !shadow-present
state. So, add a WARN_ON_ONCE() in mmu_set_spte() before invoking
drop_spte() to warn on PFN changes of shadow-present SPTEs.

For spurious prefetch faults, return RET_PF_SPURIOUS directly only when the
PFN is unchanged. When the PFN changes, fall through to the normal sequence
of drop_spte(), the PFN-change warning, make_spte(), TLB flush, and rmap_add().

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013310.5781-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Yan Zhao
988da78202 KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults
Add a WARN() to assert that KVM does _not_ change the PFN of a
shadow-present SPTE during spurious fault handling.

KVM should _never_ change the PFN of a shadow-present SPTE and TDP MMU
already BUG()s on this. However, spurious faults just return early before
the existing BUG() could be hit.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013238.5732-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Yan Zhao
d17cc13cc4 KVM: x86/tdp_mmu: Merge prefetch and access checks for spurious faults
Combine prefetch and is_access_allowed() checks into a unified path to
detect spurious faults, since both cases now share identical logic.

No functional changes.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013210.5701-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Yan Zhao
ea9fcdf76d KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch fault
Instead of simply treating a prefetch fault as spurious when there's a
shadow-present old SPTE, further check if the old SPTE is a leaf to determine
if a prefetch fault is spurious.

It's not reasonable to treat a prefetch fault as spurious when there's a
shadow-present non-leaf SPTE without a corresponding shadow-present leaf
SPTE. e.g., in the following sequence, a prefetch fault should not be
considered spurious:
1. add a memslot with size 4K
2. prefault GPA A in the memslot
3. delete the memslot (zap all disabled)
4. re-add the memslot with size 2M
5. prefault GPA A again.
In step 5, the prefetch fault attempts to install a 2M huge entry.
Since step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE,
the leaf entry will remain empty after step 5 if the prefetch fault is
regarded as spurious due to a shadow-present non-leaf SPTE.
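
A sketch of the tightened check, using existing SPTE helpers (the wrapper
itself is illustrative):

  static bool is_spurious_prefetch_fault(u64 old_spte, int level)
  {
          /*
           * A prefetch fault is spurious only if a shadow-present *leaf*
           * SPTE already maps the GFN; a present non-leaf SPTE says nothing
           * about whether the final translation exists.
           */
          return is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level);
  }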

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013111.5648-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Chao Gao
a0ee1d5faf KVM: VMX: Flush shadow VMCS on emergency reboot
Ensure the shadow VMCS cache is evicted during an emergency reboot to
prevent potential memory corruption if the cache is evicted after reboot.

This issue was identified through code inspection, as __loaded_vmcs_clear()
flushes both the normal VMCS and the shadow VMCS.

Avoid checking the "launched" state during an emergency reboot, unlike the
behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
can interfere with operations like copy_shadow_to_vmcs12(), where shadow
VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
right after the VMCS load, the shadow VMCSes will be active but the
"launched" state may not be set.

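Roughly, the emergency-path flush (a sketch; the list and field names follow
VMX's per-CPU loaded-VMCS tracking and are illustrative here):

  struct loaded_vmcs *v;

  list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
                      loaded_vmcss_on_cpu_link) {
          vmcs_clear(v->vmcs);
          /* Unlike __loaded_vmcs_clear(), don't trust "launched" here. */
          if (v->shadow_vmcs)
                  vmcs_clear(v->shadow_vmcs);
  }
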
Fixes: 16f5b9034b ("KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12")
Cc: stable@vger.kernel.org
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250324140849.2099723-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 10:57:56 -07:00
Sean Christopherson
5ecdb48dd9 KVM: SVM: Treat DEBUGCTL[5:2] as reserved
Stop ignoring DEBUGCTL[5:2] on AMD CPUs and instead treat them as reserved.
KVM has never properly virtualized AMD's legacy PBi bits, but did allow
the guest (and host userspace) to set the bits.  To avoid breaking guests
when running on CPUs with BusLockTrap, which redefined bit 2 to BLCKDB and
made bits 5:3 reserved, a previous KVM change ignored bits 5:3, e.g. so
that legacy guest software wouldn't inadvertently enable BusLockTrap or
hit a VMRUN failure due to setting reserved bits.

To allow for virtualizing BusLockTrap and whatever future features may use
bits 5:3, treat bits 5:2 as reserved (and hope that doing so doesn't break
any existing guests).
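
A sketch of the new treatment (the macro and helper are illustrative;
DEBUGCTLMSR_LBR and DEBUGCTLMSR_BTF are bits 0 and 1):

  /* Only LBR (bit 0) and BTF (bit 1) are writable; everything else is reserved. */
  #define SVM_DEBUGCTL_ALLOWED_BITS	(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)

  static bool svm_is_valid_debugctl(u64 data)
  {
          return !(data & ~SVM_DEBUGCTL_ALLOWED_BITS);
  }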

Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 10:56:35 -07:00
Peng Hao
bb5081f4ab KVM: SVM: avoid frequency indirect calls
When retpoline is enabled, indirect function calls introduce additional
performance overhead. Avoid frequent indirect calls to VMGEXIT when SEV
is enabled.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20250306075425.66693-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:56 -07:00
Kim Phillips
b6bc164f41 KVM: SEV: Configure "ALLOWED_SEV_FEATURES" VMCB Field
AMD EPYC 5th generation processors have introduced a feature that allows
the hypervisor to control the SEV_FEATURES that are set for, or by, a
guest [1].  ALLOWED_SEV_FEATURES can be used by the hypervisor to enforce
that SEV-ES and SEV-SNP guests cannot enable features that the
hypervisor does not want to be enabled.

Always enable ALLOWED_SEV_FEATURES.  A VMRUN will fail if any
non-reserved bits are 1 in SEV_FEATURES but are 0 in
ALLOWED_SEV_FEATURES.

Some SEV_FEATURES - currently PmcVirtualization and SecureAvic
(see Appendix B, Table B-4) - require an opt-in via ALLOWED_SEV_FEATURES,
i.e. are off-by-default, whereas all other features are effectively
on-by-default, but still honor ALLOWED_SEV_FEATURES.

[1] Section 15.36.20 "Allowed SEV Features", AMD64 Architecture
    Programmer's Manual, Pub. 24593 Rev. 3.42 - March 2024:
    https://bugzilla.kernel.org/attachment.cgi?id=306250

Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250310201603.1217954-3-kim.phillips@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:55 -07:00
Tom Lendacky
468c27ae02 KVM: SVM: Add a mutex to dump_vmcb() to prevent concurrent output
If multiple VMRUN instructions fail, resulting in calls to dump_vmcb(),
the output can become interleaved and it is impossible to identify which
line of output belongs to which VMCB. Add a mutex to dump_vmcb() so that
the output is serialized.
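
A minimal sketch of the serialization (contents of the dump elided):

  static DEFINE_MUTEX(vmcb_dump_mutex);

  static void dump_vmcb(struct kvm_vcpu *vcpu)
  {
          /* Serialize output so concurrent dumps don't interleave. */
          mutex_lock(&vmcb_dump_mutex);
          /* ... print control area, save area, etc. ... */
          mutex_unlock(&vmcb_dump_mutex);
  }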

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/a880678afd9488e1dd6017445802712f7c02cc6d.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:54 -07:00
Tom Lendacky
0e6b677de7 KVM: SVM: Include the vCPU ID when dumping a VMCB
Provide the vCPU ID of the VMCB in dump_vmcb().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/ee0af5a6c1a49aebb4a8291071c3f68cacf107b2.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:54 -07:00
Tom Lendacky
db26450961 KVM: SVM: Add the type of VM for which the VMCB/VMSA is being dumped
Add the type of VM (SVM, SEV, SEV-ES, or SEV-SNP) being dumped to the
dump_vmcb() function.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/7a183a8beedf4ee26c42001160e073a884fe466e.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:53 -07:00
Tom Lendacky
22f5c2003a KVM: SVM: Dump guest register state in dump_vmcb()
Guest register state can be useful when debugging, include it as part
of dump_vmcb().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/a4131a10c082a93610cac12b35dca90292e50f50.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:52 -07:00
Tom Lendacky
962e2b6152 KVM: SVM: Decrypt SEV VMSA in dump_vmcb() if debugging is enabled
An SEV-ES/SEV-SNP VM save area (VMSA) can be decrypted if the guest
policy allows debugging. Update the dump_vmcb() routine to output
some of the SEV VMSA contents if possible. This can be useful for
debug purposes.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/ea3b852c295b6f4b200925ed6b6e2c90d9475e71.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:52 -07:00
Uros Bizjak
798b9b1cb0 KVM: VMX: Use LEAVE in vmx_do_interrupt_irqoff()
Micro-optimize vmx_do_interrupt_irqoff() by substituting
MOV %RBP,%RSP; POP %RBP instruction sequence with equivalent
LEAVE instruction. GCC compiler does this by default for
a generic tuning and for all modern processors:

DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_ZHAOXIN
	  | m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM | m_GENERIC)

The new code also saves a couple of bytes, from:

  27:	48 89 ec             	mov    %rbp,%rsp
  2a:	5d                   	pop    %rbp

to:

  27:	c9                   	leave

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250414081131.97374-2-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:23:07 -07:00
Sean Christopherson
17a2c62fbf KVM: nVMX: Check MSR load/store list counts during VM-Enter consistency checks
Explicitly verify the MSR load/store list counts are below the advertised
limit as part of the initial consistency checks on the lists, so that code
that consumes the count doesn't need to worry about extreme edge cases.
Enforcing the limit during the initial checks fixes a flaw on 32-bit KVM
where a sufficiently high @count could lead to overflow:

	arch/x86/kvm/vmx/nested.c:834 nested_vmx_check_msr_switch()
	warn: potential user controlled sizeof overflow 'addr + count * 16' '0-u64max + 16-68719476720'

arch/x86/kvm/vmx/nested.c
    827 static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu,
    828                                        u32 count, u64 addr)
    829 {
    830         if (count == 0)
    831                 return 0;
    832
    833         if (!kvm_vcpu_is_legal_aligned_gpa(vcpu, addr, 16) ||
--> 834             !kvm_vcpu_is_legal_gpa(vcpu, (addr + count * sizeof(struct vmx_msr_entry) - 1)))
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While the SDM doesn't explicitly state an illegal count results in VM-Fail,
the SDM states that exceeding the limit may result in undefined behavior.
I.e. the SDM gives hardware, and thus KVM, carte blanche to do literally
anything in response to a count that exceeds the "recommended" limit.

  If the limit is exceeded, undefined processor behavior may result
  (including a machine check during the VMX transition).

KVM already enforces the limit when processing the MSRs, i.e. already
signals a late VM-Exit Consistency Check for VM-Enter, and generates a
VMX Abort for VM-Exit.  I.e. explicitly checking the limits simply means
KVM will signal VM-Fail instead of VM-Exit or VMX Abort.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/44961459-2759-4164-b604-f6bd43da8ce9@stanley.mountain
Link: https://lore.kernel.org/r/20250315024402.2363098-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:23:06 -07:00
Tom Lendacky
309d28576f KVM: SVM: Fix SNP AP destroy race with VMRUN
An AP destroy request for a target vCPU is typically followed by an
RMPADJUST to remove the VMSA attribute from the page currently being
used as the VMSA for the target vCPU. This can result in a vCPU that
is about to VMRUN to exit with #VMEXIT_INVALID.

This usually does not happen as APs are typically sitting in HLT when
being destroyed and therefore the vCPU thread is not running at the time.
However, if HLT is allowed inside the VM, then the vCPU could be about to
VMRUN when the VMSA attribute is removed from the VMSA page, resulting in
a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the
guest to crash. An RMPADJUST against an in-use (already running) VMSA
results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA
attribute cannot be changed until the VMRUN for the target vCPU exits. The
Qemu command line option '-overcommit cpu-pm=on' is an example of allowing
HLT inside the guest.

Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the
KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for
requests to be honored, so create kvm_make_request_and_kick() that will
add a new event request and honor the KVM_REQUEST_WAIT flag. This will
ensure that the target vCPU sees the AP destroy request before returning
to the initiating vCPU should the target vCPU be in guest mode.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/fe2c885bf35643dd224e91294edb6777d5df23a4.1743097196.git.thomas.lendacky@amd.com
[sean: add a comment explaining the use of smp_send_reschedule()]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:20:08 -07:00
Sean Christopherson
edaf3eded3 x86/irq: KVM: Add helper for harvesting PIR to deduplicate KVM and posted MSIs
Now that posted MSI and KVM harvesting of PIR is identical, extract the
code (and posted MSI's wonderful comment) to a common helper.

No functional change intended.

Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:19:41 -07:00
Sean Christopherson
baf68a0e3b KVM: VMX: Use arch_xchg() when processing PIR to avoid instrumentation
Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid
instrumentation so that KVM is compatible with the needs of posted MSI.
This will allow extracting the core PIR logic to common code and sharing
it between KVM and posted MSI handling.

Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:19:40 -07:00
Sean Christopherson
b41f8638b9 KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR
Rework KVM's processing of the PIR to use the same algorithm as posted
MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4).  Given
KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's
safe to say far more thought and investigation was put into handling the
PIR for posted MSIs, i.e. there's no reason to assume KVM's existing
logic is meaningful, let alone superior.

Matching the processing done by posted MSIs will also allow deduplicating
the code between KVM and posted MSIs.

See the comment for handle_pending_pir() added by commit 1b03d82ba1
("x86/irq: Install posted MSI notification handler") for details on
why isolating loads from XCHG is desirable.
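
A sketch of the READ(x4) => XCHG(x4) pattern (the chunk count assumes a
256-bit PIR split into 64-bit words, and 'pir' is the PIR viewed as an
array of unsigned long; see handle_pending_pir() for the real logic):

  unsigned long pending[4];
  int i;

  /* Pass 1: plain loads, so empty chunks never trigger an atomic op. */
  for (i = 0; i < 4; i++)
          pending[i] = READ_ONCE(pir[i]);

  /* Pass 2: atomically clear only the chunks with pending IRQs. */
  for (i = 0; i < 4; i++) {
          if (!pending[i])
                  continue;
          pending[i] = arch_xchg(&pir[i], 0);
  }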

Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250401163447.846608-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:19:40 -07:00
Sean Christopherson
06b4d0ea22 KVM: VMX: Process PIR using 64-bit accesses on 64-bit kernels
Process the PIR at the natural kernel width, i.e. in 64-bit chunks on
64-bit kernels, so that the worst case of having a posted IRQ in each
chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8.

Deliberately use a "continue" to skip empty entries so that the code is a
carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM
and posted MSI logic.

Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:19:39 -07:00
Sean Christopherson
f1459315f4 x86/irq: KVM: Track PIR bitmap as an "unsigned long" array
Track the PIR bitmap in posted interrupt descriptor structures as an array
of unsigned longs instead of using unionized arrays for KVM (u32s) versus
IRQ management (u64s).  In practice, because the non-KVM usage is (sanely)
restricted to 64-bit kernels, all existing usage of the u64 variant is
already working with unsigned longs.

Using "unsigned long" for the array will allow reworking KVM's processing
of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will
allow optimizing KVM by reducing the number of atomic accesses to PIR.

Opportunistically replace the open coded literals in the posted MSIs code
with the appropriate macro.  Deliberately don't use ARRAY_SIZE() in the
for-loops, even though it would be cleaner from a certain perspective, in
anticipation of decoupling the processing from the array declaration.

No functional change intended.

Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:19:38 -07:00
Sean Christopherson
6433fc01f9 KVM: VMX: Ensure vIRR isn't reloaded at odd times when sync'ing PIR
Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC
to ensure getting the highest priority IRQ from the chunk doesn't reload
from the vIRR.  In practice, a reload is functionally benign as vcpu->mutex
is held and so IRQs can be consumed, i.e. new IRQs can appear, but existing
IRQs can't disappear.

Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:19:38 -07:00
Sean Christopherson
459074cff6 KVM: x86: Add module param to control and enumerate device posted IRQs
Add a module param to each KVM vendor module to allow disabling device
posted interrupts without having to sacrifice all of APICv/AVIC, and to
also effectively enumerate to userspace whether or not KVM may be
utilizing device posted IRQs.  Disabling device posted interrupts is
very desirable for testing, and can even be desirable for production
environments, e.g. if the host kernel wants to interpose on device
interrupts.

Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match
the overall APICv/AVIC controls, and to avoid complications with said
controls.  E.g. if the param is in kvm.ko, KVM needs to be snapshot the
original user-defined value to play nice with a vendor module being
reloaded with different enable_apicv settings.
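
A sketch of the per-vendor-module knob (the parameter name is illustrative):

  static bool enable_device_posted_irqs = true;
  module_param(enable_device_posted_irqs, bool, 0444);

Read-only permissions (0444) let userspace see whether device posted IRQs
may be in use, without allowing the value to change after the module is
loaded.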

Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:38 -07:00
Sean Christopherson
c364baad3e KVM: VMX: Don't send UNBLOCK when starting device assignment without APICv
When starting device assignment, i.e. potential IRQ bypass, don't blast
KVM_REQ_UNBLOCK if APICv is disabled/unsupported.  There is no need to
wake vCPUs if they can never use VT-d posted IRQs (sending UNBLOCK guards
against races between vCPUs blocking and devices starting IRQ bypass).

Opportunistically use kvm_arch_has_irq_bypass() for all relevant checks in
the VMX Posted Interrupt code so that all checks in KVM x86 incorporate
the same information (once AMD/AVIC is given similar treatment).

Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250401161804.842968-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:37 -07:00
weizijie
87e4951e25 KVM: x86: Rescan I/O APIC routes after EOI interception for old routing
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC
EOI for an IRQ that is not targeting said vCPU, i.e. after handling what's
effectively a stale EOI VM-Exit.  If a level-triggered IRQ is in-flight
when IRQ routing changes, e.g. because the guest changes routing from its
IRQ handler, then KVM intercepts EOIs on both the new and old target vCPUs,
so that the in-flight IRQ can be de-asserted when it's EOI'd.

However, only the EOI for the in-flight IRQ needs to be intercepted, as
IRQs on the same vector with the new routing are coincidental, i.e. occur
only if the guest is reusing the vector for multiple interrupt sources.
If the I/O APIC routes aren't rescanned, KVM will unnecessarily intercept
EOIs for the vector and negatively impact the vCPU's interrupt performance.

Note, both commit db2bdcbbbd ("KVM: x86: fix edge EOI and IOAPIC reconfig
race") and commit 0fc5a36dd6 ("KVM: x86: ioapic: Fix level-triggered EOI
and IOAPIC reconfigure race") mentioned this issue, but it was considered
a "rare" occurrence thus was not addressed.  However in real environments,
this issue can happen even in a well-behaved guest.

Cc: Kai Huang <kai.huang@intel.com>
Co-developed-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: weizijie <zijie.wei@linux.alibaba.com>
[sean: massage changelog and comments, use int/-1, reset at scan]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:36 -07:00
Sean Christopherson
c2207bbc0c KVM: x86: Add a helper to deduplicate I/O APIC EOI interception logic
Extract the vCPU specific EOI interception logic for I/O APIC emulation
into a common helper for userspace and in-kernel emulation in anticipation
of optimizing the "pending EOI" case.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:36 -07:00
Sean Christopherson
b1f7723a5a KVM: x86: Isolate edge vs. level check in userspace I/O APIC route scanning
Extract and isolate the trigger mode check in kvm_scan_ioapic_routes() in
anticipation of moving destination matching logic to a common helper (for
userspace vs. in-kernel I/O APIC emulation).

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:35 -07:00
Babu Moger
d88bb2ded2 KVM: x86: Advertise support for AMD's PREFETCHI
The latest AMD platform has introduced a new instruction called PREFETCHI.
This instruction loads a cache line from a specified memory address into
the indicated data or instruction cache level, based on locality reference
hints.

Feature bit definition:
CPUID_Fn80000021_EAX [bit 20] - Indicates support for IC prefetch.

This feature is analogous to Intel's PREFETCHITI (CPUID.(EAX=7,ECX=1):EDX),
though the CPUID bit definitions differ between AMD and Intel.

Advertise support to userspace, as no additional enabling is necessary
(PREFETCHI can't be intercepted as there's no instruction-specific behavior
that needs to be virtualized).

The feature is documented in Processor Programming Reference (PPR)
for AMD Family 1Ah Model 02h, Revision C1 (Link below).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ee1c08fc400bb574a2b8f2c6a0bd9def10a29d35.1744130533.git.babu.moger@amd.com
[sean: rewrite shortlog to highlight the KVM functionality]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:35 -07:00
Borislav Petkov
49c140d5af KVM: x86: Sort CPUID_8000_0021_EAX leaf bits properly
WRMSR_XX_BASE_NS is bit 1 so put it there, add some new bits as
comments only.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250324160617.15379-1-bp@kernel.org
[sean: skip the FSRS/FSRC placeholders to avoid confusion]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:34 -07:00
Dan Carpenter
f804dc6aa2 KVM: x86: clean up a return
Returning a literal X86EMUL_CONTINUE is slightly clearer than returning
rc.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/7604cbbf-15e6-45a8-afec-cf5be46c2924@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:33 -07:00
Sean Christopherson
ead4dac16d KVM: x86: Advertise support for WRMSRNS
Advertise support for WRMSRNS (WRMSR non-serializing) to userspace if the
instruction is supported by the underlying CPU.  From a virtualization
perspective, the only difference between WRMSRNS and WRMSR is that VM-Exits
due to WRMSRNS set EXIT_QUALIFICATION to '1'.  WRMSRNS doesn't require a
new enabling control, shares the same basic exit reason, and behaves the
same as WRMSR with respect to MSR interception.

  WRMSR and WRMSRNS use the same basic exit reason (see Appendix C). For
  WRMSR, the exit qualification is 0, while for WRMSRNS it is 1.

Don't do anything different when emulating WRMSRNS vs. WRMSR, as KVM can't
do anything less, i.e. can't make emulation non-serializing.  The
motivation for the guest to use WRMSRNS instead of WRMSR is to avoid
immediately serializing the CPU when the necessary serialization is
guaranteed by some other mechanism, i.e. WRMSRNS being fully serializing
isn't guest-visible, just less performant.

Suggested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250227010111.3222742-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:33 -07:00
Yosry Ahmed
656d9624bd KVM: x86: Generalize IBRS virtualization on emulated VM-exit
Commit 2e7eab8142 ("KVM: VMX: Execute IBPB on emulated VM-exit when
guest has IBRS") added an IBPB in the emulated VM-exit path on Intel to
properly virtualize IBRS by providing separate predictor modes for L1
and L2.

AMD requires similar handling, except when IbrsSameMode is enumerated by
the host CPU (which is the case on most/all AMD CPUs). With
IbrsSameMode, hardware IBRS is sufficient and no extra handling is
needed from KVM.

Generalize the handling in nested_vmx_vmexit() by moving it into a
generic function, add the AMD handling, and use it in
nested_svm_vmexit() too. The main reason for using a generic function is
to have a single place to park the huge comment about virtualizing IBRS.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-4-yosry.ahmed@linux.dev
[sean: use kvm_nested_vmexit_handle_spec_ctrl() for the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:32 -07:00
Yosry Ahmed
65ca287201 KVM: x86: Propagate AMD's IbrsSameMode to the guest
If IBRS provides same mode (kernel/user or host/guest) protection on the
host, then by definition it also provides same mode protection in the
guest. In fact, all different modes from the guest's perspective are the
same mode from the host's perspective anyway.

Propagate IbrsSameMode to the guests.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-3-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:31 -07:00
Dan Carpenter
a476cadf8e KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run()
The "kvm_run->kvm_valid_regs" and "kvm_run->kvm_dirty_regs" variables are
u64 type.  We are only using the lowest 3 bits but we want to ensure that
the users are not passing invalid bits so that we can use the remaining
bits in the future.

However "sync_valid_fields" and kvm_sync_valid_fields() are u32 type so
the check only ensures that the lower 32 bits are clear.  Fix this by
changing the types to u64.

Fixes: 74c1807f6c ("KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/ec25aad1-113e-4c6e-8941-43d432251398@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:02 -07:00
Mikhail Lobanov
a2620f8932 KVM: SVM: Forcibly leave SMM mode on SHUTDOWN interception
Previously, commit ed129ec905 ("KVM: x86: forcibly leave nested mode
on vCPU reset") addressed an issue where a triple fault occurring in
nested mode could lead to use-after-free scenarios. However, the commit
did not handle the analogous situation for System Management Mode (SMM).

This omission results in triggering a WARN when KVM forces a vCPU INIT
after SHUTDOWN interception while the vCPU is in SMM. This situation was
reproduced using Syzkaller by:

  1) Creating a KVM VM and vCPU
  2) Sending a KVM_SMI ioctl to explicitly enter SMM
  3) Executing invalid instructions causing consecutive exceptions and
     eventually a triple fault

The issue manifests as follows:

  WARNING: CPU: 0 PID: 25506 at arch/x86/kvm/x86.c:12112
  kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
  Modules linked in:
  CPU: 0 PID: 25506 Comm: syz-executor.0 Not tainted
  6.1.130-syzkaller-00157-g164fe5dde9b6 #0
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
  BIOS 1.12.0-1 04/01/2014
  RIP: 0010:kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
  Call Trace:
   <TASK>
   shutdown_interception+0x66/0xb0 arch/x86/kvm/svm/svm.c:2136
   svm_invoke_exit_handler+0x110/0x530 arch/x86/kvm/svm/svm.c:3395
   svm_handle_exit+0x424/0x920 arch/x86/kvm/svm/svm.c:3457
   vcpu_enter_guest arch/x86/kvm/x86.c:10959 [inline]
   vcpu_run+0x2c43/0x5a90 arch/x86/kvm/x86.c:11062
   kvm_arch_vcpu_ioctl_run+0x50f/0x1cf0 arch/x86/kvm/x86.c:11283
   kvm_vcpu_ioctl+0x570/0xf00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4122
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:870 [inline]
   __se_sys_ioctl fs/ioctl.c:856 [inline]
   __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:856
   do_syscall_x64 arch/x86/entry/common.c:51 [inline]
   do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
   entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Architecturally, INIT is blocked when the CPU is in SMM, hence KVM's WARN()
in kvm_vcpu_reset() to guard against KVM bugs, e.g. to detect improper
emulation of INIT.  SHUTDOWN on SVM is a weird edge case where KVM needs to
do _something_ sane with the VMCB, since it's technically undefined, and
INIT is the least awful choice given KVM's ABI.

So, double down on stuffing INIT on SHUTDOWN, and force the vCPU out of
SMM to avoid any weirdness (and the WARN).

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: ed129ec905 ("KVM: x86: forcibly leave nested mode on vCPU reset")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Link: https://lore.kernel.org/r/20250414171207.155121-1-m.lobanov@rosa.ru
[sean: massage changelog, make it clear this isn't architectural behavior]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:17:58 -07:00
Paolo Bonzini
45eb29140e Merge branch 'kvm-fixes-6.15-rc4' into HEAD
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI
  code, adding an open-coded erratum check for Cavium ThunderX

* Bugfixes from a planned posted interrupt rework

* Do not use kvm_rip_read() unconditionally to cater for guests
  with inaccessible register state.
2025-04-24 13:39:34 -04:00
Adrian Hunter
38e93267ca KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILING
Not all VMs allow access to RIP.  Check guest_state_protected before
calling kvm_rip_read().

This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.
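
A sketch of the guarded read (guest_state_protected is the existing per-vCPU
flag):

  unsigned long rip = 0;

  /* RIP is inaccessible for guests with protected state, e.g. TDX. */
  if (!vcpu->arch.guest_state_protected)
          rip = kvm_rip_read(vcpu);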

Fixes: 81bf912b2c ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:32 -04:00
Adrian Hunter
ca4f113b0b KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepoints
Not all VMs allow access to RIP.  Check guest_state_protected before
calling kvm_rip_read().

This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.

Fixes: 81bf912b2c ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
268cbfe65b KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM
attempts to track an AMD IRTE entry without metadata.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
f1fb088d9c KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is running.
The only lock held when a producer is added/removed is irqbypass's mutex.

Fixes: 8727688006 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
bcda70c56f KVM: x86: Explicitly treat routing entry type changes as changes
Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-is.
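
A sketch of the comparison (the helper name is illustrative; the real code
compares additional fields):

  static bool kvm_route_changed(const struct kvm_kernel_irq_routing_entry *old,
                                const struct kvm_kernel_irq_routing_entry *new)
  {
          /* A different type is always a change, regardless of the payload. */
          if (old->type != new->type)
                  return true;

          if (new->type == KVM_IRQ_ROUTING_MSI)
                  return memcmp(&old->msi, &new->msi, sizeof(new->msi));

          return true;
  }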

Fixes: 515a0c79e7 ("kvm: irqfd: avoid update unmodified entries of the routing")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
9bcac97dc4 KVM: x86: Reset IRTE to host control if *new* route isn't postable
Restore an IRTE back to host control (remapped or posted MSI mode) if the
*new* GSI route prevents posting the IRQ directly to a vCPU, regardless of
the GSI routing type.  Updating the IRTE if and only if the new GSI is an
MSI results in KVM leaving an IRTE posting to a vCPU.

The dangling IRTE can result in interrupts being incorrectly delivered to
the guest, and in the worst case scenario can result in use-after-free,
e.g. if the VM is torn down, but the underlying host IRQ isn't freed.

Fixes: efc644048e ("KVM: x86: Update IRTE for posted-interrupts")
Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
7537deda36 KVM: SVM: Allocate IR data using atomic allocation
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as
svm_ir_list_add() is called with IRQs disabled and irqfds.lock held
when kvm_irq_routing_update() reacts to GSI routing changes.

Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
6560aff981 KVM: SVM: Don't update IRTEs if APICv/AVIC is disabled
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE
into remapped mode (kvm_vcpu_apicv_active() will never be true) is
unnecessary and wasteful.  The IOMMU driver is responsible for putting
IRTEs into remapped mode when an IRQ is allocated by a device, long before
that device is assigned to a VM.  I.e. the kernel as a whole has major
issues if the IRTE isn't already in remapped mode.

Opportunistically use kvm_arch_has_irq_bypass() to query for APICv/AVIC,
so that all checks in KVM x86 incorporate the same information.

Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401161804.842968-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Paolo Bonzini
5f9e169814 KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:46:58 -04:00
Pawan Gupta
d9b79111fd x86/bugs: Rename mmio_stale_data_clear to cpu_buf_vm_clear
The static key mmio_stale_data_clear controls the KVM-only mitigation for MMIO
Stale Data vulnerability. Rename it to reflect its purpose.

No functional change.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250416-mmio-rename-v2-1-ad1f5488767c@linux.intel.com
2025-04-16 19:40:01 +02:00
Ingo Molnar
7cbc2ba7c1 x86/msr: Rename 'native_wrmsrl()' to 'native_wrmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:28 +02:00
Ingo Molnar
6fa17efe45 x86/msr: Rename 'wrmsrl_safe()' to 'wrmsrq_safe()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:44 +02:00
Ingo Molnar
6fe22abacd x86/msr: Rename 'rdmsrl_safe()' to 'rdmsrq_safe()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:38 +02:00
Ingo Molnar
78255eb239 x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:33 +02:00
Ingo Molnar
c435e608cf x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:27 +02:00
Linus Torvalds
0e8863244e ARM:
* Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
   stage-1 page tables) to align with the architecture. This avoids
   possibly taking an SEA at EL2 on the page table walk or using an
   architecturally UNKNOWN fault IPA.
 
 * Use acquire/release semantics in the KVM FF-A proxy to avoid reading
   a stale value for the FF-A version.
 
 * Fix KVM guest driver to match PV CPUID hypercall ABI.
 
 * Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
   selftests, which is the only memory type for which atomic
   instructions are architecturally guaranteed to work.
 
 s390:
 
 * Don't use %pK for debug printing and tracepoints.
 
 x86:
 
 * Use a separate subclass when acquiring KVM's per-CPU posted interrupts
   wakeup lock in the scheduled out path, i.e. when adding a vCPU on
   the list of vCPUs to wake, to workaround a false positive deadlock.
   The schedule out code runs with a scheduler lock that the wakeup
   handler takes in the opposite order; but it does so with IRQs disabled
   and cannot run concurrently with a wakeup.
 
 * Explicitly zero-initialize on-stack CPUID unions
 
 * Allow building irqbypass.ko as a module when kvm.ko is a module
 
 * Wrap relatively expensive sanity check with KVM_PROVE_MMU
 
 * Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
 
 selftests:
 
 * Add more scenarios to the MONITOR/MWAIT test.
 
 * Add option to rseq test to override /dev/cpu_dma_latency
 
 * Bring list of exit reasons up to date
 
 * Cleanup Makefile to list once tests that are valid on all architectures
 
 Other:
 
 * Documentation fixes

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
     stage-1 page tables) to align with the architecture. This avoids
     possibly taking an SEA at EL2 on the page table walk or using an
     architecturally UNKNOWN fault IPA

   - Use acquire/release semantics in the KVM FF-A proxy to avoid
     reading a stale value for the FF-A version

   - Fix KVM guest driver to match PV CPUID hypercall ABI

   - Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
     selftests, which is the only memory type for which atomic
     instructions are architecturally guaranteed to work

  s390:

   - Don't use %pK for debug printing and tracepoints

  x86:

   - Use a separate subclass when acquiring KVM's per-CPU posted
     interrupts wakeup lock in the scheduled out path, i.e. when adding
     a vCPU on the list of vCPUs to wake, to workaround a false positive
     deadlock. The schedule out code runs with a scheduler lock that the
     wakeup handler takes in the opposite order; but it does so with
     IRQs disabled and cannot run concurrently with a wakeup

   - Explicitly zero-initialize on-stack CPUID unions

   - Allow building irqbypass.ko as a module when kvm.ko is a module

   - Wrap relatively expensive sanity check with KVM_PROVE_MMU

   - Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses

  selftests:

   - Add more scenarios to the MONITOR/MWAIT test

   - Add option to rseq test to override /dev/cpu_dma_latency

   - Bring list of exit reasons up to date

   - Cleanup Makefile to list once tests that are valid on all
     architectures

  Other:

   - Documentation fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: arm64: Use acquire/release to communicate FF-A version negotiation
  KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
  KVM: arm64: selftests: Introduce and use hardware-definition macros
  KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
  KVM: x86: Explicitly zero-initialize on-stack CPUID unions
  KVM: Allow building irqbypass.ko as as module when kvm.ko is a module
  KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
  KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
  KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
  Documentation: kvm: remove KVM_CAP_MIPS_TE
  Documentation: kvm: organize capabilities in the right section
  Documentation: kvm: fix some definition lists
  Documentation: kvm: drop "Capability" heading from capabilities
  Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
  Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
  selftests: kvm: list once tests that are valid on all architectures
  selftests: kvm: bring list of exit reasons up to date
  selftests: kvm: revamp MONITOR/MWAIT tests
  KVM: arm64: Don't translate FAR if invalid/unsafe
  ...
2025-04-08 13:47:55 -07:00
Ashish Kalra
6f1d5a3513 KVM: SVM: Add support to initialize SEV/SNP functionality in KVM
Move platform initialization of SEV/SNP from CCP driver probe time to
KVM module load time so that KVM can do SEV/SNP platform initialization
explicitly if it actually wants to use SEV/SNP functionality.

Add support for KVM to explicitly call into the CCP driver at load time
to initialize SEV/SNP. If required, this behavior can be altered with KVM
module parameters to not do SEV/SNP platform initialization at module load
time. Additionally, a corresponding SEV/SNP platform shutdown is invoked
during KVM module unload time.

Continue to support SEV deferred initialization as the user may have the
file containing SEV persistent data for SEV INIT_EX available only later
after module load/init.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-08 15:54:37 +08:00
Paolo Bonzini
fd02aa45bd Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM.  All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).

The series is in turn split into multiple sub-series, each with a separate
merge commit:

- Initialization: basic setup for using the TDX module from KVM, plus
  ioctls to create TDX VMs and vCPUs.

- MMU: in TDX, private and shared halves of the address space are mapped by
  different EPT roots, and the private half is managed by the TDX module.
  Using the support that was added to the generic MMU code in 6.14,
  add support for TDX's secure page tables to the Intel side of KVM.
  Generic KVM code takes care of maintaining a mirror of the secure page
  tables so that they can be queried efficiently, and ensuring that changes
  are applied to both the mirror and the secure EPT.

- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
  vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
  of host state.

- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
  userspace.  These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
  but are triggered through a different mechanism, similar to VMGEXIT for
  SEV-ES and SEV-SNP.

- Interrupt handling: support for virtual interrupt injection as well as
  handling VM-Exits that are caused by vectored events.  Exclusive to
  TDX are machine-check SMIs, which the kernel already knows how to
  handle through the kernel machine check handler (commit 7911f145de,
  "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")

- Loose ends: handling of the remaining exits from the TDX module, including
  EPT violation/misconfig and several TDVMCALL leaves that are handled in
  the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
  an error or ignoring operations that are not supported by TDX guests

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-07 07:36:33 -04:00
Paolo Bonzini
7d7685631a Merge branch 'kvm-pi-fix-lockdep' into HEAD 2025-04-07 07:11:03 -04:00
Paolo Bonzini
b6262dd695 Merge branch 'kvm-6.15-rc2-fixes' into HEAD 2025-04-07 07:10:46 -04:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Paolo Bonzini
c77eee50ca Merge branch 'kvm-pi-fix-lockdep' into HEAD 2025-04-04 07:17:04 -04:00
Yan Zhao
c0b8dcabb2 KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
Use a separate subclass when acquiring KVM's per-CPU posted interrupts
wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list
of vCPUs to wake, to workaround a false positive deadlock.

  Chain exists of:
   &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

  Possible unsafe locking scenario:

        CPU0                CPU1
        ----                ----
   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
                            lock(&rq->__lock);
                            lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
   lock(&p->pi_lock);

  *** DEADLOCK ***

In the wakeup handler, the callchain is *always*:

  sysvec_kvm_posted_intr_wakeup_ipi()
  |
  --> pi_wakeup_handler()
      |
      --> kvm_vcpu_wake_up()
          |
          --> try_to_wake_up(),

and the lock order is:

  &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.

For the schedule out path, the callchain is always (for all intents and
purposes; if the kernel is preemptible, kvm_sched_out() can be called from
something other than schedule(), but the beginning of the callchain will
be the same point in vcpu_block()):

  vcpu_block()
  |
  --> schedule()
      |
      --> kvm_sched_out()
          |
          --> vmx_vcpu_put()
              |
              --> vmx_vcpu_pi_put()
                  |
                  --> pi_enable_wakeup_handler()

and the lock order is:

  &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
deadlock can't occur between schedule out and the wakeup handler as they
are mutually exclusive.  The entirety of the schedule out code that runs
with the problematic scheduler locks held does so with IRQs disabled,
i.e. can't run concurrently with the wakeup handler.

Use a subclass instead of disabling lockdep entirely, and tell lockdep that
both subclasses are being acquired when loading a vCPU, as the sched_out
and sched_in paths are NOT mutually exclusive, e.g.

      CPU 0                 CPU 1
  ---------------     ---------------
  vCPU0 sched_out
  vCPU1 sched_in
  vCPU1 sched_out      vCPU 0 sched_in

where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
list+lock.
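
A sketch of the sched_out side (the subclass constant is illustrative; the
lock is the per-CPU wakeup_vcpus_on_cpu_lock named above):

  /*
   * Acquire the wakeup lock under a dedicated lockdep subclass so the
   * (impossible in practice) inversion against the wakeup handler is
   * not reported.
   */
  raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu),
                       PI_LOCK_SCHED_OUT);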

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-ID: <20250401154727.835231-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 07:11:59 -04:00
Sean Christopherson
6bad6ecc63 KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
Assert that IRQs are already disabled when putting a vCPU on a CPU's PI
wakeup list, as opposed to saving/disabling+restoring IRQs.  KVM relies on
IRQs being disabled until the vCPU task is fully scheduled out, i.e. until
the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue),
as attempting to wake the task while it's being scheduled out could lead
to deadlock.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250401154727.835231-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 07:11:59 -04:00
Sean Christopherson
bc52ae0a70 KVM: x86: Explicitly zero-initialize on-stack CPUID unions
Explicitly zero/empty-initialize the unions used for PMU related CPUID
entries, instead of manually zeroing all fields (hopefully), or in the
case of 0x80000022, relying on the compiler to clobber the uninitialized
bitfields.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-ID: <20250315024102.2361628-1-seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 07:07:40 -04:00
Sean Christopherson
81d480fdf8 KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
Wrap the TDP MMU page counter in CONFIG_KVM_PROVE_MMU so that the sanity
check is omitted from production builds, and more importantly to remove
the atomic accesses to account pages.  A one-off memory leak in production
is relatively uninteresting, and a WARN_ON won't help mitigate a systemic
issue; it's as much about helping triage memory leaks as it is about
detecting them in the first place, and doesn't magically stop the leaks.
I.e. production environments will be quite sad if a severe KVM bug escapes,
regardless of whether or not KVM WARNs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250315023448.2358456-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 07:07:40 -04:00
Sean Christopherson
ef01cac401 KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
Acquire a lock on kvm->srcu when userspace is getting MP state to handle a
rather extreme edge case where "accepting" APIC events, i.e. processing
pending INIT or SIPI, can trigger accesses to guest memory.  If the vCPU
is in L2 with INIT *and* a TRIPLE_FAULT request pending, then getting MP
state will trigger a nested VM-Exit by way of ->check_nested_events(), and
emulating the nested VM-Exit can access guest memory.

The splat was originally hit by syzkaller on a Google-internal kernel, and
reproduced on an upstream kernel by hacking the triple_fault_event_test
selftest to stuff a pending INIT, store an MSR on VM-Exit (to generate a
memory access on VMX), and do vcpu_mp_state_get() to trigger the scenario.

  =============================
  WARNING: suspicious RCU usage
  6.14.0-rc3-b112d356288b-vmx/pi_lockdep_false_pos-lock #3 Not tainted
  -----------------------------
  include/linux/kvm_host.h:1058 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:

  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by triple_fault_ev/1256:
   #0: ffff88810df5a330 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x8b/0x9a0 [kvm]

  stack backtrace:
  CPU: 11 UID: 1000 PID: 1256 Comm: triple_fault_ev Not tainted 6.14.0-rc3-b112d356288b-vmx #3
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x7f/0x90
   lockdep_rcu_suspicious+0x144/0x190
   kvm_vcpu_gfn_to_memslot+0x156/0x180 [kvm]
   kvm_vcpu_read_guest+0x3e/0x90 [kvm]
   read_and_check_msr_entry+0x2e/0x180 [kvm_intel]
   __nested_vmx_vmexit+0x550/0xde0 [kvm_intel]
   kvm_check_nested_events+0x1b/0x30 [kvm]
   kvm_apic_accept_events+0x33/0x100 [kvm]
   kvm_arch_vcpu_ioctl_get_mpstate+0x30/0x1d0 [kvm]
   kvm_vcpu_ioctl+0x33e/0x9a0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x6c/0x170
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401150504.829812-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 07:07:39 -04:00
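
A simplified sketch of the shape of the fix, assuming a hypothetical kvm_get_mp_state_locked() helper; the point is that the SRCU read lock covers any path that may end up reading guest memory:

  #include <linux/kvm_host.h>
  #include <linux/srcu.h>

  /* Sketch; kvm_get_mp_state_locked() is a made-up placeholder. */
  static int get_mp_state_protected(struct kvm_vcpu *vcpu,
                                    struct kvm_mp_state *mp_state)
  {
          int idx, r;

          /*
           * Accepting pending APIC events (INIT/SIPI) can trigger a
           * nested VM-Exit, which may read guest memory, so hold the
           * SRCU read lock that protects the memslots.
           */
          idx = srcu_read_lock(&vcpu->kvm->srcu);
          r = kvm_get_mp_state_locked(vcpu, mp_state);
          srcu_read_unlock(&vcpu->kvm->srcu, idx);

          return r;
  }
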
Linus Torvalds
edb0e8f6e2 ARM:
* Nested virtualization support for VGICv3, giving the nested
 hypervisor control of the VGIC hardware when running an L2 VM
 
 * Removal of 'late' nested virtualization feature register masking,
   making the supported feature set directly visible to userspace
 
 * Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
   of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
 
 * Paravirtual interface for discovering the set of CPU implementations
   where a VM may run, addressing a longstanding issue of guest CPU
   errata awareness in big-little systems and cross-implementation VM
   migration
 
 * Userspace control of the registers responsible for identifying a
   particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
   allowing VMs to be migrated cross-implementation
 
 * pKVM updates, including support for tracking stage-2 page table
   allocations in the protected hypervisor in the 'SecPageTable' stat
 
 * Fixes to vPMU, ensuring that userspace updates to the vPMU after
   KVM_RUN are reflected into the backing perf events
 
 LoongArch:
 
 * Remove unnecessary header include path
 
 * Assume constant PGD during VM context switch
 
 * Add perf events support for guest VM
 
 RISC-V:
 
 * Disable the kernel perf counter during configure
 
 * KVM selftests improvements for PMU
 
 * Fix warning at the time of KVM module removal
 
 x86:
 
 * Add support for aging of SPTEs without holding mmu_lock.  Not taking mmu_lock
   allows multiple aging actions to run in parallel, and more importantly avoids
   stalling vCPUs.  This includes an implementation of per-rmap-entry locking;
   aging the gfn is done with only a per-rmap single-bit spinlock taken, whereas
   locking an rmap for write requires taking both the per-rmap spinlock and
   the mmu_lock.
 
   Note that this decreases slightly the accuracy of accessed-page information,
   because changes to the SPTE outside aging might not use atomic operations
   even if they could race against a clear of the Accessed bit.  This is
   deliberate because KVM and mm/ tolerate false positives/negatives for
   accessed information, and testing has shown that reducing the latency of
   aging is far more beneficial to overall system performance than providing
   "perfect" young/old information.
 
 * Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
   coalesce updates when multiple pieces of vCPU state are changing, e.g. as
   part of a nested transition.
 
 * Fix a variety of nested emulation bugs, and add VMX support for synthesizing
   nested VM-Exit on interception (instead of injecting #UD into L2).
 
 * Drop "support" for async page faults for protected guests that do not set
   SEND_ALWAYS (i.e. that only want async page faults at CPL3)
 
 * Bring a bit of sanity to x86's VM teardown code, which has accumulated
   a lot of cruft over the years.  Particularly, destroy vCPUs before
   the MMU, despite the latter being a VM-wide operation.
 
 * Add common secure TSC infrastructure for use within SNP and in the
   future TDX
 
 * Block KVM_CAP_SYNC_REGS if guest state is protected.  It does not make
   sense to use the capability if the relevant registers are not
   available for reading or writing.
 
 * Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
   fix a largely theoretical deadlock.
 
 * Use the vCPU's actual Xen PV clock information when starting the Xen timer,
   as the cached state in arch.hv_clock can be stale/bogus.
 
 * Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
   PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend
   notifier only accounts for kvmclock, and there's no evidence that the
   flag is actually supported by Xen guests.
 
 * Clean up the per-vCPU "cache" of its reference pvclock, and instead only
   track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately
   expensive to compute, and rarely changes for modern setups).
 
 * Don't write to the Xen hypercall page on MSR writes that are initiated by
   the host (userspace or KVM) to fix a class of bugs where KVM can write to
   guest memory at unexpected times, e.g. during vCPU creation if userspace has
   set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
 
 * Restrict the Xen hypercall MSR index to the unofficial synthetic range to
   reduce the set of possible collisions with MSRs that are emulated by KVM
   (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
   in the synthetic range).
 
 * Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
 
 * Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
   entries when updating PV clocks; there is no guarantee PV clocks will be
   updated between TSC frequency changes and CPUID emulation, and guest reads
   of the TSC leaves should be rare, i.e. are not a hot path.
 
 x86 (Intel):
 
 * Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
   modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
 
 * Pass XFD_ERR as the payload when injecting #NM, as a preparatory step
   for upcoming FRED virtualization support.
 
 * Decouple the EPT entry RWX protection bit macros from the EPT Violation
   bits, both as a general cleanup and in anticipation of adding support for
   emulating Mode-Based Execution Control (MBEC).
 
 * Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
   state while KVM is in the middle of emulating nested VM-Enter.
 
 * Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
   in anticipation of adding sanity checks for secondary exit controls (the
   primary field is out of bits).
 
 x86 (AMD):
 
 * Ensure the PSP driver is initialized when both the PSP and KVM modules are
   built-in (the initcall framework doesn't handle dependencies).
 
 * Use long-term pins when registering encrypted memory regions, so that the
   pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
   excessive fragmentation.
 
 * Add macros and helpers for setting GHCB return/error codes.
 
 * Add support for Idle HLT interception, which elides interception if the vCPU
   has a pending, unmasked virtual IRQ when HLT is executed.
 
 * Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
   address.
 
 * Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
   because the vCPU was "destroyed" via SNP's AP Creation hypercall.
 
 * Reject SNP AP Creation if the requested SEV features for the vCPU don't
   match the VM's configured set of features.
 
 Selftests:
 
 * Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data
   instead of executing code.  The theory is that modern Intel CPUs have
   learned new code prefetching tricks that bypass the PMU counters.
 
 * Fix a flaw in the Intel PMU counters test where it asserts that an event is
   counting correctly without actually knowing what the event counts on the
   underlying hardware.
 
 * Fix a variety of flaws, bugs, and false failures/passes in dirty_log_test, and
   improve its coverage by collecting all dirty entries on each iteration.
 
 * Fix a few minor bugs related to handling of stats FDs.
 
 * Add infrastructure to make vCPU and VM stats FDs available to tests by
   default (open the FDs during VM/vCPU creation).
 
 * Relax an assertion on the number of HLT exits in the xAPIC IPI test when
   running on a CPU that supports AMD's Idle HLT (which elides interception of
   HLT if a virtual IRQ is pending and unmasked).
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmfcTkEUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMnQAf/cPx72hJOdNy4Qrm8M33YLXVRVV00
 yEZ8eN8TWdOclr0ltE/w/ELGh/qS4CU8pjURAk0A6lPioU+mdcTn3dPEqMDMVYom
 uOQ2lusEHw0UuSnGZSEjvZJsE/Ro2NSAsHIB6PWRqig1ZBPJzyu0frce34pMpeQH
 diwriJL9lKPAhBWXnUQ9BKoi1R0P5OLW9ahX4SOWk7cAFg4DLlDE66Nqf6nKqViw
 DwEucTiUEg5+a3d93gihdD4JNl+fb3vI2erxrMxjFjkacl0qgqRu3ei3DG0MfdHU
 wNcFSG5B1n0OECKxr80lr1Ip1KTVNNij0Ks+w6Gc6lSg9c4PptnNkfLK3A==
 =nnCN
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Nested virtualization support for VGICv3, giving the nested
     hypervisor control of the VGIC hardware when running an L2 VM

   - Removal of 'late' nested virtualization feature register masking,
     making the supported feature set directly visible to userspace

   - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
     of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers

   - Paravirtual interface for discovering the set of CPU
     implementations where a VM may run, addressing a longstanding issue
     of guest CPU errata awareness in big-little systems and
     cross-implementation VM migration

   - Userspace control of the registers responsible for identifying a
     particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
     allowing VMs to be migrated cross-implementation

   - pKVM updates, including support for tracking stage-2 page table
     allocations in the protected hypervisor in the 'SecPageTable' stat

   - Fixes to vPMU, ensuring that userspace updates to the vPMU after
     KVM_RUN are reflected into the backing perf events

  LoongArch:

   - Remove unnecessary header include path

   - Assume constant PGD during VM context switch

   - Add perf events support for guest VM

  RISC-V:

   - Disable the kernel perf counter during configure

   - KVM selftests improvements for PMU

   - Fix warning at the time of KVM module removal

  x86:

   - Add support for aging of SPTEs without holding mmu_lock.

     Not taking mmu_lock allows multiple aging actions to run in
     parallel, and more importantly avoids stalling vCPUs. This includes
     an implementation of per-rmap-entry locking; aging the gfn is done
     with only a per-rmap single-bit spinlock taken, whereas locking an
     rmap for write requires taking both the per-rmap spinlock and the
     mmu_lock.

     Note that this decreases slightly the accuracy of accessed-page
     information, because changes to the SPTE outside aging might not
     use atomic operations even if they could race against a clear of
     the Accessed bit.

     This is deliberate because KVM and mm/ tolerate false
     positives/negatives for accessed information, and testing has shown
     that reducing the latency of aging is far more beneficial to
     overall system performance than providing "perfect" young/old
     information.

   - Defer runtime CPUID updates until KVM emulates a CPUID instruction,
     to coalesce updates when multiple pieces of vCPU state are
     changing, e.g. as part of a nested transition

   - Fix a variety of nested emulation bugs, and add VMX support for
     synthesizing nested VM-Exit on interception (instead of injecting
     #UD into L2)

   - Drop "support" for async page faults for protected guests that do
     not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)

   - Bring a bit of sanity to x86's VM teardown code, which has
     accumulated a lot of cruft over the years. Particularly, destroy
     vCPUs before the MMU, despite the latter being a VM-wide operation

   - Add common secure TSC infrastructure for use within SNP and in the
     future TDX

   - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
     make sense to use the capability if the relevant registers are not
     available for reading or writing

   - Don't take kvm->lock when iterating over vCPUs in the suspend
     notifier to fix a largely theoretical deadlock

   - Use the vCPU's actual Xen PV clock information when starting the
     Xen timer, as the cached state in arch.hv_clock can be stale/bogus

   - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
     different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
     KVM's suspend notifier only accounts for kvmclock, and there's no
     evidence that the flag is actually supported by Xen guests

   - Clean up the per-vCPU "cache" of its reference pvclock, and instead
     only track the vCPU's TSC scaling (multiplier+shift) metadata (which
     is moderately expensive to compute, and rarely changes for modern
     setups)

   - Don't write to the Xen hypercall page on MSR writes that are
     initiated by the host (userspace or KVM) to fix a class of bugs
     where KVM can write to guest memory at unexpected times, e.g.
     during vCPU creation if userspace has set the Xen hypercall MSR
     index to collide with an MSR that KVM emulates

   - Restrict the Xen hypercall MSR index to the unofficial synthetic
     range to reduce the set of possible collisions with MSRs that are
     emulated by KVM (collisions can still happen as KVM emulates
     Hyper-V MSRs, which also reside in the synthetic range)

   - Clean up and optimize KVM's handling of Xen MSR writes and
     xen_hvm_config

   - Update Xen TSC leaves during CPUID emulation instead of modifying
     the CPUID entries when updating PV clocks; there is no guarantee PV
     clocks will be updated between TSC frequency changes and CPUID
     emulation, and guest reads of the TSC leaves should be rare, i.e.
     are not a hot path

  x86 (Intel):

   - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
     thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1

   - Pass XFD_ERR as the payload when injecting #NM, as a preparatory
     step for upcoming FRED virtualization support

   - Decouple the EPT entry RWX protection bit macros from the EPT
     Violation bits, both as a general cleanup and in anticipation of
     adding support for emulating Mode-Based Execution Control (MBEC)

   - Reject KVM_RUN if userspace manages to gain control and stuff
     invalid guest state while KVM is in the middle of emulating nested
     VM-Enter

   - Add a macro to handle KVM's sanity checks on entry/exit VMCS
     control pairs in anticipation of adding sanity checks for secondary
     exit controls (the primary field is out of bits)

  x86 (AMD):

   - Ensure the PSP driver is initialized when both the PSP and KVM
     modules are built-in (the initcall framework doesn't handle
     dependencies)

   - Use long-term pins when registering encrypted memory regions, so
     that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
     don't lead to excessive fragmentation

   - Add macros and helpers for setting GHCB return/error codes

   - Add support for Idle HLT interception, which elides interception if
     the vCPU has a pending, unmasked virtual IRQ when HLT is executed

   - Fix a bug in INVPCID emulation where KVM fails to check for a
     non-canonical address

   - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
     invalid, e.g. because the vCPU was "destroyed" via SNP's AP
     Creation hypercall

   - Reject SNP AP Creation if the requested SEV features for the vCPU
     don't match the VM's configured set of features

  Selftests:

   - Fix again the Intel PMU counters test; add a data load and do
     CLFLUSH{OPT} on the data instead of executing code. The theory is
     that modern Intel CPUs have learned new code prefetching tricks
     that bypass the PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that an
     event is counting correctly without actually knowing what the event
     counts on the underlying hardware

   - Fix a variety of flaws, bugs, and false failures/passes in
     dirty_log_test, and improve its coverage by collecting all dirty
     entries on each iteration

   - Fix a few minor bugs related to handling of stats FDs

   - Add infrastructure to make vCPU and VM stats FDs available to tests
     by default (open the FDs during VM/vCPU creation)

   - Relax an assertion on the number of HLT exits in the xAPIC IPI test
     when running on a CPU that supports AMD's Idle HLT (which elides
     interception of HLT if a virtual IRQ is pending and unmasked)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
  RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
  RISC-V: KVM: Teardown riscv specific bits after kvm_exit
  LoongArch: KVM: Register perf callbacks for guest
  LoongArch: KVM: Implement arch-specific functions for guest perf
  LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
  LoongArch: KVM: Remove PGD saving during VM context switch
  LoongArch: KVM: Remove unnecessary header include path
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
  KVM: x86: Add infrastructure for secure TSC
  KVM: x86: Push down setting vcpu.arch.user_set_tsc
  ...
2025-03-25 14:22:07 -07:00
Linus Torvalds
906174776c - Some preparatory work to convert the mitigations machinery to mitigating
attack vectors instead of single vulnerabilities
 
 - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag
 
 - Add support for a Zen5-specific SRSO mitigation
 
 - Cleanups and minor improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmfixS0ACgkQEsHwGGHe
 VUpi1xAAgvH2u8Eo8ibT5dABQpD65w3oQiykO+9aDpObG9w9beDVGlld8DJE61Rz
 6tcE0Clp2H/tMcCbn8zXIJ92TQ3wIX/85uZwLi1VEM1Tx7A6VtAbPv8WKfZE3FCX
 9v92HRKnK3ql+A2ZR+oyy+/8RedUmia7y7/bXH1H7Zf2uozoKkmq5cQnwfq5iU4A
 qNiKuvSlQwjZ8Zz6Ax1ugHUkE4R7mlKh8rccLXl4+mVr63/lkPHSY3OFTjcYf4HW
 Ir92N86Spfo0/l0vsOOsWoYKmoaiVP7ouJh7YbKR3B0BGN0pt2MT476mehkEs427
 m4J6XhRKhIrsYmzEkLvvpsg12zO4/PKk8BEYNS7YPYlRaOwjV4ivyFS2aY6e55rh
 yUHyo9s+16f/Mp+/fNFXll3mdMxYBioPWh3M191nJkdfyKMrtf0MdKPRibaJB8wH
 yMF4D1gMx+hFbs0/VOS6dtqD9DKW7VgPg0LW+RysfhnLTuFFb5iBcH6Of7l7Z/Ca
 vVK+JxrhB1EDVI1+MKnESKPF9c6j3DRa2xrQHi/XYje1TGqnQ1v4CmsEObYBuJDN
 9M9t4QLzNuA/DA5tS7cxxtQ3YUthuJjPLcO4EVHOCvnqCAxkzp0i3dVMUr+YISl+
 2yFqaZdTt8s8FjTI21LOyuloCo30ZLlzaorFa0lp2cIyYup+1vg=
 =btX/
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 speculation mitigation updates from Borislav Petkov:

 - Some preparatory work to convert the mitigations machinery to
   mitigating attack vectors instead of single vulnerabilities

 - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag

 - Add support for a Zen5-specific SRSO mitigation

 - Cleanups and minor improvements

* tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2
  x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code
  x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds
  x86/bugs: Relocate mds/taa/mmio/rfds defines
  x86/bugs: Add X86_BUG_SPECTRE_V2_USER
  x86/bugs: Remove X86_FEATURE_USE_IBPB
  KVM: nVMX: Always use IBPB to properly virtualize IBRS
  x86/bugs: Use a static branch to guard IBPB on vCPU switch
  x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set()
  x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation()
  x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
  x86/bugs: KVM: Add support for SRSO_MSR_FIX
2025-03-25 13:30:18 -07:00
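
As one concrete example from the list above, the "static branch to guard IBPB on vCPU switch" item boils down to something like the sketch below; the key name and the call site are illustrative, not the exact upstream code:

  #include <linux/jump_label.h>
  #include <asm/nospec-branch.h>

  /* Sketch; key and function names are illustrative. */
  DEFINE_STATIC_KEY_FALSE(switch_vcpu_ibpb);

  static void maybe_ibpb_on_vcpu_switch(void)
  {
          /*
           * The barrier is patched in only when the selected mitigation
           * requires an IBPB on vCPU switch; otherwise this is a NOP,
           * with no feature-flag test on the hot path.
           */
          if (static_branch_unlikely(&switch_vcpu_ibpb))
                  indirect_branch_prediction_barrier();
  }
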
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
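
A before/after sketch of the treewide conversion, assuming hrtimer_setup() takes the callback together with the clock and mode arguments that hrtimer_init() already took:

  #include <linux/hrtimer.h>

  static enum hrtimer_restart example_timer_fn(struct hrtimer *t)
  {
          return HRTIMER_NORESTART;
  }

  /* Old pattern being removed: init, then an open-coded callback store. */
  static void example_timer_setup_old(struct hrtimer *t)
  {
          hrtimer_init(t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
          t->function = example_timer_fn;
  }

  /* New pattern: one call sets up both the timer and its callback. */
  static void example_timer_setup_new(struct hrtimer *t)
  {
          hrtimer_setup(t, example_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
  }
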
Paolo Bonzini
782f9feaa9 Merge branch 'kvm-pre-tdx' into HEAD
- Add common secure TSC infrastructure for use within SNP and in the
  future TDX

- Block KVM_CAP_SYNC_REGS if guest state is protected.  It does not make
  sense to use the capability if the relevant registers are not
  available for reading or writing.
2025-03-20 13:13:13 -04:00
Paolo Bonzini
361da275e5 Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD
The immediate issue being fixed here is a nVMX bug where KVM fails to
detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI).
However, checking for a pending interrupt accesses the legacy PIC, and
x86's kvm_arch_destroy_vm() currently frees the PIC before destroying
vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results
in a NULL pointer deref; that's a prerequisite for the nVMX fix.

The remaining patches attempt to bring a bit of sanity to x86's VM
teardown code, which has accumulated a lot of cruft over the years.  E.g.
KVM currently unloads each vCPU's MMUs in a separate operation from
destroying vCPUs, all because when guest SMP support was added, KVM had a
kludgy MMU teardown flow that broke when a VM had more than one vCPU.
And that oddity lived on, for 18 years...

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-20 13:13:00 -04:00
Paolo Bonzini
3ecf162a31 KVM Xen changes for 6.15
- Don't write to the Xen hypercall page on MSR writes that are initiated by
    the host (userspace or KVM) to fix a class of bugs where KVM can write to
    guest memory at unexpected times, e.g. during vCPU creation if userspace has
    set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
 
  - Restrict the Xen hypercall MSR index to the unofficial synthetic range to
    reduce the set of possible collisions with MSRs that are emulated by KVM
    (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
    in the synthetic range).
 
  - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
 
  - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
    entries when updating PV clocks, as there is no guarantee PV clocks will be
    updated between TSC frequency changes and CPUID emulation, and guest reads
    of Xen TSC should be rare, i.e. are not a hot path.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZpO4ACgkQOlYIJqCj
 N/3AMQ/+J4+yOslekq4DHYhZaTvJFf0MqhPgTuf2s6I5p449JWn9rebqK2w0M9Xj
 fJy7/rboQA4QflBuhTiWcC3Dl1lYtxUqqtcCH9608eqKhbeay87OfV0/vgMwWBRs
 FhcOcp1587esJj5gz5M5R9i3S5Yq7Q4fp1+DmS23X41Zz5nTb2q80MY5UklMgI9I
 Ydaw1liB8rRHWbdt9yM4UsI8k4fMuj0PE8pEapoTSfsZm8J4cG9qHKrvuWjuFSCF
 l18Hyl11nWq8eZ5Vg2E2UIz0EgtWIHKu1/fi4av20/JTuA8Mon15WC5q4BBmDDdD
 keR9OJLYclVBh8KweiJSTUE6PcD9A8pWmoWyp6aGRiyyUVhbwysYTzT7uytwQz6w
 RH/vVHe0o/m19SnD9rqsRVObc7dOGorFXScMcf4Qxoq9yQm2p0lJDvq6c9uECLMV
 RIfZrXe9HS67RB9INybS+1fVlLcd0bLgGfG7q9lWLEABD45HpM5daQ4Mlf8+MIE0
 V7egx9t69/WALbJka8pWNISeFRKkB1LRjite+XXasqJ0iFeneM8UKFVB4OMtXL9g
 M0m8ovvySySMkoCq3yMlKxXh4rJ1/D556/bAaJBukMPWFWX9FQaP33U3FuzId7jH
 ztZVugViQMNiIbQVgUSAcgpuJvgpttAciACODlaw2u2Bk1Txmn0=
 =c3Wt
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM Xen changes for 6.15

 - Don't write to the Xen hypercall page on MSR writes that are initiated by
   the host (userspace or KVM) to fix a class of bugs where KVM can write to
   guest memory at unexpected times, e.g. during vCPU creation if userspace has
   set the Xen hypercall MSR index to collide with an MSR that KVM emulates.

 - Restrict the Xen hypercall MSR index to the unofficial synthetic range to
   reduce the set of possible collisions with MSRs that are emulated by KVM
   (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
   in the synthetic range).

 - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.

 - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
   entries when updating PV clocks, as there is no guarantee PV clocks will be
   updated between TSC frequency changes and CPUID emulation, and guest reads
   of Xen TSC should be rare, i.e. are not a hot path.
2025-03-19 09:14:59 -04:00
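
A minimal sketch of the kind of range check implied above; the bounds of the unofficial synthetic MSR range (0x40000000-0x4fffffff) are an assumption here, as are the constant and helper names:

  #include <linux/types.h>

  /* Sketch; constants and helper name are illustrative. */
  #define SYNTHETIC_MSR_FIRST  0x40000000u
  #define SYNTHETIC_MSR_LAST   0x4fffffffu

  static bool xen_hypercall_msr_is_valid(u32 msr)
  {
          /*
           * Only accept an index inside the synthetic range, so the Xen
           * hypercall MSR can no longer shadow an architectural MSR that
           * KVM emulates.  Collisions with other synthetic MSRs (e.g.
           * Hyper-V's) remain possible.
           */
          return msr >= SYNTHETIC_MSR_FIRST && msr <= SYNTHETIC_MSR_LAST;
  }
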
Paolo Bonzini
fcce7c1e7d KVM PV clock changes for 6.15:
- Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
    fix a largely theoretical deadlock.
 
  - Use the vCPU's actual Xen PV clock information when starting the Xen timer,
    as the cached state in arch.hv_clock can be stale/bogus.
 
  - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
    PV clocks.
 
  - Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
    accounts for kvmclock, and there's no evidence that the flag is actually
    supported by Xen guests.
 
  - Clean up the per-vCPU "cache" of its reference pvclock, and instead only
    track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately
    expensive to compute, and rarely changes for modern setups).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZnJsACgkQOlYIJqCj
 N/3nAg//TuURhfTm56TB0PZ01DX9Fqxl+9b+fDSllk1F7O5BcfkkEd11Jv4qa/Zb
 eKhSNZzWuDCTMky8izM2Ej4rfTmCg2xF+3hdpVi6yQ7SItgDo2E7e71lm1lNXSMO
 oCkQxEwQk0cW2sxeEqPuREq0Zm/kw7jrEt2co2OX2FKlt9UoZiCy6RDBde50z1ut
 5Z32k6QX9Alhu67kXvBE/+Xv6abx1dbADOnaTgE7s74smHKxS2WXrfpKnPXjy2y0
 pWjX9k2ClSISKdaFbSu4Y0VqeLqE+57ZAWAPT8vndJxjNWOvZK1oBSlaOPchR9CZ
 0VFLDWKV2FjEs0O0AkWCw8XTEmdJ4R1ekHpqbBZJ9TJYwVA/LDWOGgR1jcORkzsS
 WMJkfMOmQeL8bPR6TBuAFXawbhalsXnYUSthZ3sn4kA7c1DTkIC5mzrDZ3ADPyJi
 UpYwVHaWAMOqncEvSQEuUTvSoDeb5P4HMyB4QOAsh1GoKw4vVXpSWUPDy0JKPOnu
 WblztX9h/CRB/ZNt/566s2Jh7sCeBO2qs3ffujI4GDosYDcIRRuQy5U08/oMrPRf
 l3nPStjxLqsCdJe8IXvL5zwt6YOxJvJdG8XcfcvfQsUCPMAZOIv7PKrt8AFfrN6c
 GU5v8x/IBBB46qJw1Jm5eE5S3P/PuaIf235JHpIabGPzJ+H1QGo=
 =yv8C
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM PV clock changes for 6.15:

 - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
   fix a largely theoretical deadlock.

 - Use the vCPU's actual Xen PV clock information when starting the Xen timer,
   as the cached state in arch.hv_clock can be stale/bogus.

 - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
   PV clocks.

 - Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
   accounts for kvmclock, and there's no evidence that the flag is actually
   supported by Xen guests.

 - Clean up the per-vCPU "cache" of its reference pvclock, and instead only
   track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately
   expensive to compute, and rarely changes for modern setups).
2025-03-19 09:11:59 -04:00
Paolo Bonzini
9b093f5b86 KVM SVM changes for 6.15
- Ensure the PSP driver is initialized when both the PSP and KVM modules are
    built-in (the initcall framework doesn't handle dependencies).
 
  - Use long-term pins when registering encrypted memory regions, so that the
    pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
    excessive fragmentation.
 
  - Add macros and helpers for setting GHCB return/error codes.
 
  - Add support for Idle HLT interception, which elides interception if the vCPU
    has a pending, unmasked virtual IRQ when HLT is executed.
 
  - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
    address.
 
  - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
    because the vCPU was "destroyed" via SNP's AP Creation hypercall.
 
  - Reject SNP AP Creation if the requested SEV features for the vCPU don't
    match the VM's configured set of features.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZoO4ACgkQOlYIJqCj
 N/2m1BAAkbn6k0ZzJj0GTYqNh8ejWFBa4Mt+RHXrf9deNXLJuQFPRAQi2En3rQwj
 esISxA0dzsADEYkCXHxDsUfsJ729kQYAIyN5r4h3GftriKlNBmHLZJsXHXnpAZ0F
 2Yjwu/r5zw/c5/mOYkcBjJ0gDgoDWNI0UoA656HTqE88E6v7DoSZlZBZZRSCMXRi
 8jIQSzmQQkKsWi0c/N5LKm8E+6HFLJB1BnASbXbwXChIoi2pFE/wAv6ntC+V5DzN
 Y5oqDtf3evOBrpmMbN7t2I4KJ8VW1+041whANymFK1QARBBevCYY1ezCHg2RIHpc
 cyS8G+wice7IMSnqTNtJvN2IpwBkV2SqRyxwBKS2j1ec1xeoX2JT23tOom1XpPhW
 diqiSUto2xQIPz3x8fddtAHvY0W11jpXt4MUyOzdefbBLGQBB4EsxbnwxY+i6kKh
 0tdw4R1uzvbn1sHW+p2hOvtkgxSLmYFGIrYEUMCxXRxOviHfPWzCBlucEOOceU1D
 2o/SgoBWS6xF8KxMxwnVLE9q8/Baiua8Ak2h2cLapHwWGpRaeJGFbz/TwbcDaKVy
 gW34W8KXc4WNWiFwoD6WRqrSDTRXG3XAtn0vjwvCqD6PBPRleALWsAxq8ztenYIy
 2se051XsKGg+e64zsAZFNdzIDrSGIHWfZb9ec398cF/iuKTGWLQ=
 =iZrd
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.15

 - Ensure the PSP driver is initialized when both the PSP and KVM modules are
   built-in (the initcall framework doesn't handle dependencies).

 - Use long-term pins when registering encrypted memory regions, so that the
   pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
   excessive fragmentation.

 - Add macros and helpers for setting GHCB return/error codes.

 - Add support for Idle HLT interception, which elides interception if the vCPU
   has a pending, unmasked virtual IRQ when HLT is executed.

 - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
   address.

 - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
   because the vCPU was "destroyed" via SNP's AP Creation hypercall.

 - Reject SNP AP Creation if the requested SEV features for the vCPU don't
   match the VM's configured set of features.

 - Misc cleanups
2025-03-19 09:10:44 -04:00
Paolo Bonzini
a24dbf986b KVM VMX changes for 6.15
- Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
    modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
 
  - Pass XFD_ERR as a pseudo-payload when injecting #NM as a preparatory step
    for upcoming FRED virtualization support.
 
  - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits
    as a general cleanup, and in anticipation of adding support for emulating
    Mode-Based Execution Control (MBEC).
 
  - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
    state while KVM is in the middle of emulating nested VM-Enter.
 
  - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
    in anticipation of adding sanity checks for secondary exit controls (the
    primary field is out of bits).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZosAACgkQOlYIJqCj
 N/0pSA/9FTKQ8hzBinHc5ae+FPhbmAmSgX9X1Gge2cDWF8OYD1yR3ySEdPRGOalz
 oRhhO9pnXeNDbGeps8TFqoRhydcXRyp8AiDATFKky5kFk9vyKxDR26bXrM8nWGgI
 WZS/txkae/Bt5P63IViaGBIbWjXlfKP2wtAOrv676e//SGtZ22K0QUOUuBNETD1F
 txs2vHOUC7S9q+hIo95I4WZJGA4Ih7ZMlfGr1hrPGASHMG0AUozaouN8CEHFWecH
 uutpeVAcYtLiZbsl19L+M6wHr0TeRYo0d5Rcw/zH1XeTZ+zLoF27PJ6PHwx7QmVs
 C0mNzW+2cNjEYpzEDFxo+EkiWXagJ0m9pWlne6PKb92WtX+l5+x5abxVSF86lCy9
 X46wtm/FJ3DyfZ8yuLJXX2c2TOHVNNPUOS/dmAKrv3i1t2kG90yKRX+P9m0k9L1i
 f1IcyhIy4hrkSz+OVRG07mf7VeXUjklWfjIOsAYGBiQRyTbD+8PU1pgDRXWC9mPw
 tgvOgnnDKaS8POHak5DGb3kfdISVztNUM5Dg4GV7fZYqy8E37cdjo1PHk71xmksU
 lPN5U1wSX5MkAyqzrFz+LAZlgoIAvDmbPclWVACbwwRIo7IvbAjwsfE+8JyO5DPX
 T8mvd6C+A2K8GqkQTpjx8leXEDRSZiuV63dscSoAMEr+NbSudi4=
 =SAgZ
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.15

 - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
   modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.

 - Pass XFD_ERR as a pseudo-payload when injecting #NM as a preparatory step
   for upcoming FRED virtualization support.

 - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits
   as a general cleanup, and in anticipation of adding support for emulating
   Mode-Based Execution Control (MBEC).

 - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
   state while KVM is in the middle of emulating nested VM-Enter.

 - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
   in anticipation of adding sanity checks for secondary exit controls (the
   primary field is out of bits).
2025-03-19 09:05:52 -04:00
Paolo Bonzini
4d9a677596 KVM x86 misc changes for 6.15:
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
 
  - Add a helper to consolidate handling of mp_state transitions, and use it to
    clear pv_unhalted whenever a vCPU is made RUNNABLE.
 
  - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
    coalesce updates when multiple pieces of vCPU state are changing, e.g. as
    part of a nested transition.
 
  - Fix a variety of nested emulation bugs, and add VMX support for synthesizing
    nested VM-Exit on interception (instead of injecting #UD into L2).
 
  - Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
    as KVM can't get the current CPL.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZhoYACgkQOlYIJqCj
 N/1oSBAAlhsZzv4m6rHtWACjGsSBlAE5WM7HrpnwjOyMW+Desc0WLL4L6qAlxgT7
 4ZkBvQ4zsiyUIdLv1XARvkBYHTUkcgG8fQSSJQ6grZit5OcFcMgWafvQrJYI6428
 TyzF/x6t0gj8CjDluA6jM/eL5HhWZZDOIg0Wma+pyYpW+7y2tkphXyYyOQPBGwv3
 geRUetbDHHXjf042k/8f1j1vjzrNNvAg3YyNyx1YbdU9XKsn5D+SeUW2eVfYk8G7
 5QsCOGvUYcbbjrR8kbCZKexvoH6Np9J6YKDe4R9R2yDzgs/96qz6xkYTGCVkHA1y
 uursKqRHgbXBxzxa+ban073laT7Qt3S01Gd9bJW3IO7hzG89gl4qfX7fap8T9Yc2
 yeBTYIgInpyx+NCdZ2Z/++BzPagBGfa77gFX/eIkmsVA9LWYi9CI3FSjtr/czvWm
 a4tfMPvTVBjsBQQ7t/lNksrq0O51lbb3iqqv3ToQpDOOqCWuMEU5xcihhPRr5NSZ
 dX4o/jIDhCV8EyXdtASyqMlYBXcuC45ojEZn1elh0QogzYAdSGQ2bIDyxuBtA//k
 kSbi+E4GB64jVfBWUyK2QeLOBnBkH7mh6Cg5UYr1Ln9Sm6l8vrcxhcbnchiWxXMI
 WCK7BJwI2HojBVpEZ04jMkjHvg36uSfjOzmMLT5yPXfFNebsGmA=
 =8SGF
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.15:

 - Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.

 - Add a helper to consolidate handling of mp_state transitions, and use it to
   clear pv_unhalted whenever a vCPU is made RUNNABLE.

 - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
   coalesce updates when multiple pieces of vCPU state are changing, e.g. as
   part of a nested transition.

 - Fix a variety of nested emulation bugs, and add VMX support for synthesizing
   nested VM-Exit on interception (instead of injecting #UD into L2).

 - Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
   as KVM can't get the current CPL.

 - Misc cleanups
2025-03-19 09:04:48 -04:00
Paolo Bonzini
4286a3ec25 KVM x86/mmu changes for 6.15
Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where
 "fast" means "without holding mmu_lock".  Not taking mmu_lock allows multiple
 aging actions to run in parallel, and more importantly avoids stalling vCPUs,
 e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting
 in memory.
 
 For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and
 KVM doesn't need to access any metadata to age SPTEs.
 
 For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a
 list of rmaps) to implement a per-rmap single-bit spinlock.  When aging a gfn,
 acquire the rmap's spinlock with read-only permissions, which allows hardening
 and optimizing the locking and aging, e.g. locking an rmap for write requires
 mmu_lock to also be held.  The lock is NOT a true R/W spinlock, i.e. multiple
 concurrent readers aren't supported.
 
 To avoid forcing all SPTE updates to use atomic operations (clearing the
 Accessed bit out of mmu_lock makes it inherently volatile), rework and rename
 spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude
 the Accessed bit.  KVM (and mm/) already tolerates false positives/negatives
 for Accessed information, and all testing has shown that reducing the latency
 of aging is far more beneficial to overall system performance than providing
 "perfect" young/old information.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZizUACgkQOlYIJqCj
 N/1XkBAAwLxK3uKwkIIgzu+V6NPMiqPBsNtGiiQgRYMfvCwMW2vU6ztsBgNgs6zI
 eMnOCCo6fQaxPFvpKue8VN7TD33BcjzKZaPiuHZrzQIa/oYQeOlZ4oaN8lr9F9Ec
 5l1Lg/p2+z1GGDhWc2opNpg48sCtX7IQ0Tx46LkoB3VSDFP33GwW4+Ht2r71rNeL
 ofKB+T0hU5HOry5j0w0nTVwOEoNzlm1sVFqOHzgnK18Lmqw2CfOPm+46K+w8nOh+
 v+rwGuGa//1kcCjNCcGP1OuJdNAMgXBxND/l6LAkWcHfffIRbXlO07O05dAGqPeF
 rRn5JUl02OkI6lq99+935OmtEROe6mt+Bx0dhAzk4Z0CD6JY34ShZSAADSnltQlK
 2a1E95t63v8a7ZM5dwED7os2HBhxODoyeWQAlIHpkVdmeTJstkyvjPhubJc13+Js
 oDL6ehs3hhZ171ePn2aXo0NobA5fe7xbl4wugP3hNmBXjLvu04D+llcDmC095nBk
 ICuzFqFXCXzdjEwgWwPzTseWOCoWTkoRqeJ9lch4UD3mMMcmK0MbK6joocGvCFto
 cL/eZdElnf1MZwWYdo44X+NEc1jItZVvktkRrllpwCtpRSDINO6RYZGcRf/g0Lha
 XmaU7jICfi3AKc4N3S2l4KIkd/AeJQySM+kGArxIOYoaqFCe2Mc=
 =Iy57
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM x86/mmu changes for 6.15

Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where
"fast" means "without holding mmu_lock".  Not taking mmu_lock allows multiple
aging actions to run in parallel, and more importantly avoids stalling vCPUs,
e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting
in memory.

For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and
KVM doesn't need to access any metadata to age SPTEs.

For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a
list of rmaps) to implement a per-rmap single-bit spinlock.  When aging a gfn,
acquire the rmap's spinlock with read-only permissions, which allows hardening
and optimizing the locking and aging, e.g. locking an rmap for write requires
mmu_lock to also be held.  The lock is NOT a true R/W spinlock, i.e. multiple
concurrent readers aren't supported.

To avoid forcing all SPTE updates to use atomic operations (clearing the
Accessed bit out of mmu_lock makes it inherently volatile), rework and rename
spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude
the Accessed bit.  KVM (and mm/) already tolerates false positives/negatives
for Accessed information, and all testing has shown that reducing the latency
of aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
2025-03-19 09:04:33 -04:00
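
The per-rmap lock idea, sketched with generic atomic bitops; KVM's actual implementation differs in detail (and, as noted above, is not a true reader/writer lock):

  #include <linux/bitops.h>
  #include <linux/processor.h>

  /*
   * Sketch; the bit choice mirrors the description above (bit 0 terminates
   * the rmap list, bit 1 is the lock), other details are illustrative.
   */
  #define KVM_RMAP_LOCK_BIT  1

  static void rmap_aging_lock(unsigned long *rmap_val)
  {
          /* Single-bit spinlock: spin until the lock bit is acquired. */
          while (test_and_set_bit_lock(KVM_RMAP_LOCK_BIT, rmap_val))
                  cpu_relax();
  }

  static void rmap_aging_unlock(unsigned long *rmap_val)
  {
          clear_bit_unlock(KVM_RMAP_LOCK_BIT, rmap_val);
  }
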
Ingo Molnar
89771319e0 Linux 6.14-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfXVtUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN/sH/i5423Gt/z51gDjA
 s4v5Z7GaBJ9zOGBahn2RWFe72zytTqKrEJmMnGfguirs0atD1DtQj4WAP7iFKP+e
 WyO663X6HF7i5y37ja0Yd4PZc31hwtqzKH8LjBf8f8tTy8UsEVqumdi5A4sS9KTM
 qm4kTyyVEY9D/s7oRY8ywjDlRJtO6nT0aKMp4kAqNEbrNUYbilT/a0hgXcgSmPyB
 uIjmjL2fZfutxGI5LgfbaSHCa1ElmhvTvivOMpaAmZSGCRVHCKGgT0CTNnHyn/7C
 dB145JkRO4ZOUqirCdO4PE/23id3ajq9fcixJGBzAv7c45y+B3JZ1r2kAfKalE8/
 qrOKLys=
 =8r7a
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc7' into x86/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-19 11:03:06 +01:00
Isaku Yamahata
161d34609f KVM: TDX: Make TDX VM type supported
Now that all the necessary code for TDX is in place, KVM is ready to run
TDX guests.  Advertise the KVM_X86_TDX_VM VM type so that a userspace VMM
like QEMU can start to use it.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" v2:
- No change.

TDX "the rest" v1:
- Move down to the end of patch series.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
90fe64a94d KVM: TDX: Always honor guest PAT on TDX enabled guests
Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by
making self-snoop feature a hard dependency for TDX and making quirk
KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled.

The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of
KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is
always forced to WB now.

Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke
kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices,
which would cause mirrored EPTs for TDs to be zapped, leading to the
TDX-module-managed private EPT being incorrectly zapped.

As a new feature, TDX always comes with support for self-snoop, and does
not have to worry about unmodifiable but buggy guests. So, simply ignore
KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already
does.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224071039.31511-1-yan.y.zhao@intel.com>
[Only apply to TDX guests. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Paolo Bonzini
3fee4837ef KVM: x86: remove shadow_memtype_mask
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled,
if shadow_memtype_mask is zero.  As long as vmx_get_mt_mask is not
called for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
c9c1e20b4c KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.

On AMD platforms, KVM always honors guest PAT.  On Intel however there are
two issues.  First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
not expect the slowness, since KVM previously always forced WB. See [2].

So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.

The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA.  This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
 no-self-snoop cases, as well as to remove the shadow_memtype_mask check
 from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
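
For reference, the way such a quirk is typically consumed is a simple check of the VM's quirk mask; this is a sketch of the shape (assuming KVM's internal x86.h for kvm_check_has_quirk()), not KVM's exact helper:

  /* Sketch; mirrors the shape of a quirk check, not KVM's exact code. */
  static bool mmu_should_ignore_guest_pat(struct kvm *kvm)
  {
          /*
           * Guest PAT is honored once userspace disables the quirk (which
           * only succeeds when self-snoop is supported); while the quirk
           * remains enabled, KVM keeps forcing WB as before.
           */
          return kvm_check_has_quirk(kvm, KVM_X86_QUIRK_IGNORE_GUEST_PAT);
  }
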
Yan Zhao
bd7d5362b4 KVM: x86: Introduce supported_quirks to block disabling quirks
Introduce supported_quirks in kvm_caps to store platform-specific force-enabled
quirks.

No functional changes intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224070832.31394-1-yan.y.zhao@intel.com>
[Remove unsupported quirks at KVM_ENABLE_CAP time. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Paolo Bonzini
a4dae7c7a4 KVM: x86: Allow vendor code to disable quirks
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example,
this will be the case for "ignore guest PAT".  Allow unaffected vendor
modules to disable handling of a quirk for all VMs via a new entry in
kvm_caps.

Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API
tells userspace that KVM *knows* that some of its past behavior was bogus
or just undesirable.  In other words, it's plausible for userspace to
refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.

As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Paolo Bonzini
9966b7822b KVM: x86: do not allow re-enabling quirks
Allowing arbitrary re-enabling of quirks puts a limit on what the
quirks themselves can do, since you cannot assume that the quirk
prevents a particular state.  More importantly, it also prevents
KVM from disabling a quirk at VM creation time, because userspace
can always go back and re-enable that.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Binbin Wu
26eab9ae4b KVM: TDX: Enable guest access to MTRR MSRs
Allow TDX guests to access MTRR MSRs as KVM does for normal VMs, i.e.,
KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR
memory types.

TDX module exposes MTRR feature to TDX guests unconditionally.  KVM needs
to support MTRR MSRs accesses for TDX guests to match the architectural
behavior.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-19-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Isaku Yamahata
0b75889b0c KVM: TDX: Add a method to ignore hypercall patching
Because guest TD memory is protected, the VMM cannot patch the hypercall
instruction in the guest binary.  Add a method to ignore hypercall
patching.  Note: the guest TD kernel needs to be modified to use
TDG.VP.VMCALL for hypercalls.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-18-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Isaku Yamahata
79264ff080 KVM: TDX: Ignore setting up mce
The vmx_set_mce() function is VMX-specific and cannot be used for TDX.
Add a vt stub that ignores MCE setup for TDX.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
cf5f3668c5 KVM: TDX: Add methods to ignore accesses to TSC
TDX protects TDX guest TSC state from VMM.  Implement access methods to
ignore guest TSC.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
a946c71cf8 KVM: TDX: Add methods to ignore VMX preemption timer
TDX doesn't support VMX preemption timer.  Implement access methods for VMM
to ignore VMX preemption timer.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
e6bb397884 KVM: TDX: Add method to ignore guest instruction emulation
Skip instruction emulation and let the TDX guest retry for MMIO emulation
after installing the MMIO SPTE with suppress #VE bit cleared.

TDX protects TDX guest state from VMM, instructions in guest memory cannot
be emulated.  MMIO emulation is the only case that triggers the instruction
emulation code path for TDX guest.

The MMIO emulation handling flow is as follows:
- The TDX guest issues a vMMIO instruction. (The GPA must be shared and is
  not covered by KVM memory slot.)
- The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So
  EPT violation causes TD exit to KVM.
- Trigger KVM page fault handler and install a new SPTE with suppress #VE
  bit cleared.
- Skip instruction emulation and return X86EMU_RETRY_INSTR to let the vCPU
  retry.
- TDX guest re-executes the vMMIO instruction.
- TDX guest gets #VE because KVM has cleared #VE suppress bit.
- TDX guest #VE handler converts MMIO into TDG.VP.VMCALL<MMIO>

Return X86EMU_RETRY_INSTR in the callback check_emulate_instruction() for
TDX guests to retry the MMIO instruction.  Also, the instruction emulation
handling will be skipped, so that the callback check_intercept() will never
be called for TDX guest.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
a141f28d6b KVM: TDX: Add methods to ignore accesses to CPU state
TDX protects TDX guest state from VMM.  Implement access methods for TDX
guest state to ignore them or return zero.  Because those methods can be
called by kvm ioctls to set/get cpu registers, they don't have KVM_BUG_ON.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
04733836fe KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
Implement the TDG.VP.VMCALL<GetTdVmCallInfo> hypercall.  If the input value
is zero, return a success code and zero in the output registers.

TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to
enumerate which TDG.VP.VMCALL sub leaves are supported.  This hypercall is
for future enhancement of the Guest-Host-Communication Interface (GHCI)
specification.  The GHCI version of 344426-001US defines it to require
input R12 to be zero and to return zero in output registers, R11, R12, R13,
and R14 so that guest TD enumerates no enhancement.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
9fc3402a20 KVM: TDX: Enable guest access to LMCE related MSRs
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling
MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL.

MCE and MCA are advertised via cpuid based on the TDX module spec.  Guest
kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the
platform or not.  If LMCE is opted-in by the platform, guest kernel can
access IA32_MCG_EXT_CTL to enable/disable LMCE.

Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid
failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE.  E.g.,
Linux guest will treat the failure as a #GP(0).

A userspace VMM may not opt in to LMCE by default; e.g., QEMU disables it by
default, and "-cpu lmce=on" is needed on the QEMU command line to opt in.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
081385dbc6 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.

For complete_emulated_msr() callback, instead of injecting #GP on error,
implement tdx_complete_emulated_msr() to set return code on error.  Also
set return value on MSR read according to the values from kvm x86
registers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
dd50294f3e KVM: TDX: Implement callbacks for MSR operations
Add functions to implement the MSR-related callbacks, .set_msr(),
.get_msr(), and .has_emulated_msr(), in preparation for handling hypercalls
from the TDX guest for PV RDMSR and WRMSR.  Ignore
KVM_REQ_MSR_FILTER_CHANGED for TDX.

There are three classes of MSR virtualization for TDX.
- Non-configurable: the TDX module directly virtualizes it.  The VMM can't
  configure it, and the value set by KVM_SET_MSRS is ignored.
- Configurable: the TDX module directly virtualizes it.  The VMM can
  configure it at VM creation time.  The value set by KVM_SET_MSRS is used.
- #VE case: the TDX guest issues TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
  and the VMM handles the MSR hypercall.  The value set by KVM_SET_MSRS is
  used.

For the MSRs belonging to the #VE case, the TDX module injects #VE into the
TDX guest upon RDMSR or WRMSR.  The exact list of such MSRs is defined in
the TDX Module ABI Spec.

Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>,
which is defined in the GHCI (Guest-Host Communication Interface) so that
the host VMM (e.g. KVM) can virtualize the MSRs.

TDX doesn't allow the VMM to configure interception of MSR accesses.
Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX guests.  If userspace has set any
MSR filters, they will be applied when handling
TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
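
For example, the filter-changed hook can simply become a no-op for TDs, as
sketched below (wrapper names per the VMX/TDX split are assumptions):

  static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
  {
          /* MSR intercepts are not configurable for TDX guests. */
          if (is_td_vcpu(vcpu))
                  return;

          vmx_msr_filter_changed(vcpu);
  }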

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
7ddf314441 KVM: x86: Move KVM_MAX_MCE_BANKS to header file
Move KVM_MAX_MCE_BANKS to header file so that it can be used for TDX in
a future patch.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
5cf7239b78 KVM: TDX: Handle TDX PV HLT hypercall
Handle TDX PV HLT hypercall and the interrupt status due to it.

TDX guest state is protected, so KVM can't get the interrupt status of a
TDX guest and assumes interrupts are always allowed unless the TDX guest
calls TDVMCALL with HLT, which passes the interrupt-blocked flag.

If the guest halted with interrupts enabled, also query for a pending RVI
by checking bit 0 of the TD_VCPU_STATE_DETAILS_NON_ARCH field via a
SEAMCALL.

Update vt_interrupt_allowed() for TDX based on the interrupt-blocked flag
passed by the HLT TDVMCALL.  Do not wake up the TD vCPU for VT-d PI if
interrupts are blocked.

For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked.  In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup.  The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.
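
A sketch of the RVI query described above; the accessor and field names
below are assumptions, not necessarily the exact in-tree API:

  static bool tdx_has_pending_virtual_interrupt(struct kvm_vcpu *vcpu)
  {
          u64 details;

          /*
           * Bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH reports whether the
           * TDX module sees a pending, unmasked virtual interrupt (RVI)
           * for the halted vCPU.
           */
          details = td_state_non_arch_read64(to_tdx(vcpu),
                                             TD_VCPU_STATE_DETAILS_NON_ARCH);
          return details & BIT_ULL(0);
  }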

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
3bf31b5786 KVM: TDX: Handle TDX PV CPUID hypercall
Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM
according to TDX Guest Host Communication Interface (GHCI).

For TDX, most CPUID leaf/sub-leaf combinations are virtualized by the TDX
module while some trigger #VE.  On #VE, the TDX guest can issue
TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID) to
request the VMM to emulate the CPUID operation.

Morph the PV CPUID hypercall to EXIT_REASON_CPUID and wire it up to the KVM
backend function.
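
A minimal sketch of the emulation path, assuming the GHCI register
convention of R12/R13 carrying the leaf/sub-leaf and R12-R15 carrying the
returned EAX/EBX/ECX/EDX:

  static void tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
  {
          u32 eax, ebx, ecx, edx;

          eax = kvm_r12_read(vcpu);
          ecx = kvm_r13_read(vcpu);

          /* Reuse KVM's common CPUID emulation. */
          kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);

          kvm_r12_write(vcpu, eax);
          kvm_r13_write(vcpu, ebx);
          kvm_r14_write(vcpu, ecx);
          kvm_r15_write(vcpu, edx);
  }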

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Yan Zhao
4b2abc4971 KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect errors and is less
sensitive to the performance penalty caused by kicking off vCPUs.

Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
  tdh_vp_enter() and TDCALLs.

Resources     SHARED users      EXCLUSIVE users
------------------------------------------------------------
TDCS epoch    tdh_vp_enter      tdh_mem_track
------------------------------------------------------------
SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
                                tdh_mem_range_block
------------------------------------------------------------
SEPT entry                      tdh_mem_range_block (Host lock)
                                tdh_mem_page_remove (Host lock)
                                tdg_mem_page_accept (Guest lock)
                                tdg_mem_page_attr_rd (Guest lock)
                                tdg_mem_page_attr_wr (Guest lock)

Use a TDX-specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
the TD, thereby avoiding the potential contention.  Apply the kick-off and
the entry blocking only after each SEAMCALL busy error to minimize the
window with no TD entry, as contention due to 0-step mitigation or TDCALLs
is expected to be rare.
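
A hedged sketch of the "block entries and kick" step; apart from the KVM
request itself, the helper and struct names are assumptions:

  static void tdx_no_vcpus_enter_start(struct kvm *kvm)
  {
          struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

          lockdep_assert_held_write(&kvm->mmu_lock);

          /* New tdh_vp_enter() attempts will see the flag and bail out. */
          WRITE_ONCE(kvm_tdx->wait_for_sept_zap, true);

          /* Kick vCPUs that are already in the guest out of non-root mode. */
          kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
  }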

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Yan Zhao
b0327bb2e7 KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().

TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or
TDCALLs.

Resources    SHARED  users      EXCLUSIVE users
------------------------------------------------------------
SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
           tdh_mem_page_aug
------------------------------------------------------------
SEPT entry                      tdh_mem_sept_add (Host lock)
                                tdh_mem_page_aug (Host lock)
                                tdg_mem_page_accept (Guest lock)
                                tdg_mem_page_attr_rd (Guest lock)
                                tdg_mem_page_attr_wr (Guest lock)

Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in a future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.

The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
   checks if the guest RIP is the same as the last guest RIP recorded on TD
   entry.
   -if yes, it means the EPT violation is caused by the same instruction
            that caused the last VM exit.
            Then, the TDX module increases the guest RIP no-progress count.
            When the count increases from 0 to the threshold (currently 6),
            the TDX module records the faulting GPA into a
            last_epf_gpa_list.
   -if no,  it means the guest RIP has made progress.
            So, the TDX module resets the RIP no-progress count and the
            last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
   TD entry) checks if the last_epf_gpa_list is empty.
   -if yes, TD entry continues without acquiring the lock on the SEPT tree.
   -if no,  it triggers the 0-step mitigation by acquiring the exclusive
            lock on SEPT tree, walking the EPT tree to check if all page
            faults caused by the GPAs in the last_epf_gpa_list have been
            resolved before continuing TD entry.

Since the KVM TDP MMU usually re-enters the guest whenever it exits to
userspace (e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is
possible for tdh_vp_enter() to be called more than the threshold count
before a page fault is addressed, triggering contention when tdh_vp_enter()
attempts to acquire the exclusive lock on the SEPT tree.

Retry locally in the TDX EPT violation handler to reduce the number of
tdh_vp_enter() invocations, hence reducing the possibility of contention
with tdh_mem_sept_add()/tdh_mem_page_aug().  However, the 0-step mitigation
and the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT,
signals/interrupts, and cases where one instruction faults on more GFNs
than the threshold count.
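
A hedged sketch of the local retry loop; the exact termination conditions
and the availability of the helpers used below are assumptions:

  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
                                      unsigned long exit_qual)
  {
          int ret;

          /*
           * Retry in place instead of re-entering the guest so that a fault
           * that temporarily can't make progress doesn't rack up
           * tdh_vp_enter() calls and trip the 0-step mitigation.
           */
          do {
                  ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
          } while (ret == RET_PF_RETRY && !signal_pending(current) &&
                   !kvm_vcpu_has_events(vcpu));

          return ret;
  }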

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Yan Zhao
e6a85781f7 KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs
Detect SEPT violations that occur when an SEPT entry is in PENDING state
while the TD is configured not to receive #VE on SEPT violations.

A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE
to 1 in tdh_mng_init() or by modifying pending_ve_disable to 1 in the TDCS
when flexible_pending_ve is permitted.  In such cases, the TDX module will
not inject #VE into the TD upon encountering an EPT violation caused by an
SEPT entry in the PENDING state.  Instead, the TDX module will exit to the
VMM, set the extended exit qualification type to PENDING_EPT_VIOLATION, and
set exit qualification bits 6:3 to 0.

Since #VE will not be injected into such TDs, they cannot be notified to
accept a GPA.  A TD accessing a private GPA before accepting it is regarded
as an error within the guest.

Detect such guest errors by inspecting the (extended) exit qualification
bits and mark such a VM as dead.

Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
da407fe459 KVM: TDX: Handle EPT violation/misconfig exit
For TDX, on EPT violation, call the common __vmx_handle_ept_violation() to
trigger the x86 MMU code; on EPT misconfiguration, bug the VM since it
shouldn't happen.

An EPT violation due to an instruction fetch should never be triggered from
shared memory in a TDX guest.  If such an EPT violation occurs, treat it as
broken hardware.

EPT misconfiguration shouldn't happen on either shared or secure EPT for
TDX guests.
- The TDX module guarantees no EPT misconfiguration on secure EPT.  Per TDX
  module v1.5 spec section 9.4 "Secure EPT Induced TD Exits":
  "By design, since secure EPT is fully controlled by the TDX module, an
  EPT misconfiguration on a private GPA indicates a TDX module bug and is
  handled as a fatal error."
- For shared EPT, the MMIO caching optimization, which is the only case
  where current KVM configures EPT entries to generate EPT
  misconfiguration, is implemented in a different way for TDX guests.  KVM
  configures EPT entries to a non-present value without suppressing the #VE
  bit.  This causes a #VE in the TDX guest, and the guest will call
  TDG.VP.VMCALL to request MMIO emulation.
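
A sketch of the exit routing described above; the tdexit_*() accessors are
assumed names for reading the TD-exit GPA and exit qualification:

  static int tdx_handle_ept_exit(struct kvm_vcpu *vcpu, u32 exit_reason)
  {
          switch (exit_reason) {
          case EXIT_REASON_EPT_VIOLATION:
                  return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu),
                                                    tdexit_exit_qual(vcpu));
          case EXIT_REASON_EPT_MISCONFIG:
                  /* Impossible for both shared and secure EPT; VM is broken. */
                  KVM_BUG_ON(1, vcpu->kvm);
                  return -EIO;
          default:
                  return 1;
          }
  }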

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
[binbin: rework changelog]
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
6c441e4d6e KVM: TDX: Handle EXIT_REASON_OTHER_SMI
Handle the VM exit caused by an "other SMI" for TDX by returning to
userspace for the Machine Check System Management Interrupt (MSMI) case, or
ignoring it and resuming the vCPU for the non-MSMI case.

For VMX, SMM transitions can happen in both VMX non-root mode and VMX root
mode.  Unlike VMX, in SEAM root mode (TDX module), all interrupts are
blocked.  If an SMI occurs in SEAM non-root mode (TD guest), the SMI causes
a VM exit to the TDX module, then a SEAMRET to KVM.  Once it exits to KVM,
the SMI is delivered and handled by the kernel handler right away.

An SMI can be an "I/O SMI" or an "other SMI".  For TDX, there will be no
I/O SMI because I/O instructions inside a TDX guest trigger #VE and the TDX
guest needs to use TDVMCALL to request the VMM to do I/O emulation.

For "other SMI", there are two cases:
- MSMI case.  When BIOS eMCA MCE-SMI morphing is enabled, a #MC that occurs
  in a TDX guest will be delivered as an MSMI.  It causes an
  EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
  qualification.  On VM exit, the TDX module checks whether the "other SMI"
  is caused by an MSMI or not.  If so, the TDX module marks the TD as
  fatal, preventing further TD entries, and then completes the TD exit flow
  to KVM with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD.
  After TD exit, the MSMI is delivered and eventually handled by the kernel
  machine check handler (7911f145de x86/mce: Implement recovery for
  errors in TDX/SEAM non-root mode), i.e., the memory page is marked as
  poisoned and it won't be freed to the free list when the TDX guest is
  terminated.  Since the TDX guest is dead, follow the other non-recoverable
  cases and exit to userspace.
- For non-MSMI case, KVM doesn't need to do anything, just continue TDX
  vCPU execution.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
f30cb6429f KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX.

NMI Handling: Just like the VMX case, NMI remains blocked after exiting
from TDX guest for NMI-induced exits [*].  Handle NMI-induced exits for
TDX guests in the same way as they are handled for VMX guests, i.e.,
handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi()
helper.

Interrupt and Exception Handling: Similar to the VMX case, external
interrupts and exceptions (machine check is the only exception type
KVM handles for TDX guests) are handled in the .handle_exit_irqoff()
callback.

For other exceptions, because TDX guest state is protected, exceptions in
TDX guests can't be intercepted.  The TDX VMM isn't supposed to handle
these exceptions.  If an unexpected exception occurs, exit to userspace
with KVM_EXIT_EXCEPTION.

For external interrupts, increase the statistics, the same as in the VMX
case.

[*]: Some old TDX modules have a bug which leaves NMIs unblocked after
exiting from the TDX guest for NMI-induced exits.  This could potentially
lead to nested NMIs: a new NMI arrives when KVM is manually calling the
host NMI handler.  This is an architectural violation, but it doesn't do
real harm until FRED is enabled together with TDX (for non-FRED, the host
NMI handler can handle nested NMIs).  Given that this is rare and does no
real harm, ignore it for the initial TDX support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Sean Christopherson
7e548b0d90 KVM: VMX: Add a helper for NMI handling
Add a helper to handle the NMI exit.

TDX handles the NMI exit the same as the VMX case.  Add a helper to share
the code with TDX, and expose the helper in common.h.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Binbin Wu
d5bc91e8e7 KVM: VMX: Move emulation_required to struct vcpu_vt
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that
vmx_handle_exit_irqoff() can be reused by TDX code.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
8dac6b9a97 KVM: TDX: Add methods to ignore virtual apic related operation
TDX protects the TDX guest APIC state from the VMM.  Implement the access
methods for TDX guest vAPIC state to either ignore accesses or return zero.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
f65916ae2d KVM: TDX: Force APICv active for TDX guest
Force APICv active for TDX guests in KVM because APICv is always enabled
by the TDX module.

From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.

After TDX vCPU init, the APIC is set to x2APIC mode.  KVM_SET_{SREGS,SREGS2}
are rejected because has_protected_state is set for TDs and
guest_state_protected is set for TDX vCPUs.  Reject KVM_{GET,SET}_LAPIC from
userspace since migration is not supported yet, so that userspace cannot
disable the APIC.

For various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv
  in tdx_bringup(). If !enable_apicv, TDX support will be disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
  mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note:
  APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory
  allocation fails for KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
  HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is
  checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
  KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
  mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects
  userspace to set APIC base.
- The remaining inhibit reasons are relevant only to AMD's AVIC, including
  APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
  APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
  APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
  (For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
   EOI for TDX guests either, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
   guests, which eliminates the in-kernel PIT.)

Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.
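
For example, the refresh hook could look roughly like the sketch below
(wrapper names per the VMX/TDX split are assumptions):

  static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
  {
          if (is_td_vcpu(vcpu)) {
                  /* APICv can never legitimately be inhibited for a TD. */
                  KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm);
                  return;
          }

          vmx_refresh_apicv_exec_ctrl(vcpu);
  }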

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Binbin Wu
209afc0c42 KVM: TDX: Enforce KVM_IRQCHIP_SPLIT for TDX guests
Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow the in-kernel I/O APIC
while the in-kernel local APIC is still needed.

APICv is always enabled by the TDX module, and the TDX module doesn't allow
the hypervisor to modify the EOI bitmap, i.e. all EOIs are accelerated and
never trigger exits.  Level-triggered interrupts and other things depending
on the EOI VM-Exit can't be faithfully emulated in KVM.  Also, the lazy
check for a pending APIC EOI on RTC edge-triggered interrupts, which was
introduced as a workaround when EOI cannot be intercepted, doesn't work for
TDX either because kvm_apic_pending_eoi() checks vIRR and vISR, but both
values are invisible in KVM.

If the guest induces generation of a level-triggered interrupt, the VMM is
left with the choice of dropping the interrupt, sending it as-is, or
converting it to an edge-triggered interrupt.  Ditto for KVM.  All of those
options will make the guest unhappy. There's no architectural behavior KVM
can provide that's better than sending the interrupt and hoping for the
best.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
4cdf243eb1 KVM: TDX: Always block INIT/SIPI
Always block INIT and SIPI events for the TDX guest because the TDX module
doesn't provide an API for the VMM to inject an INIT IPI or SIPI.

TDX defines its own vCPU creation and initialization sequence, including
multiple SEAMCALLs.  Also, it's only allowed during TD build time.

Given that the TDX guest is para-virtualized to boot BSP/APs, normally
there shouldn't be any INIT/SIPI events for a TDX guest.  If any occur,
there are three options to handle them:
1. Always block the INIT/SIPI request.
2. (Silently) ignore the INIT/SIPI request during delivery.
3. Return an error to the guest TD somehow.

Choose option 1 for simplicity.  Since INIT and SIPI are always blocked,
INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, so
there is no need to add a new interface or helper function for INIT/SIPI
delivery.
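
A minimal sketch of option 1 at the vendor-op level (the vt_*() wrapper
name is an assumption):

  static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
  {
          /* INIT/SIPI can never be delivered to a TD vCPU. */
          if (is_td_vcpu(vcpu))
                  return true;

          return vmx_apic_init_signal_blocked(vcpu);
  }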

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
2b06beb08f KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM
Handle SMI requests the same way KVM does for CONFIG_KVM_SMM=n, i.e. return
-ENOTTY, and add KVM_BUG_ON() to the SMI-related OPs for TDs.

TDX doesn't support system-management mode (SMM) or system-management
interrupts (SMI) in guest TDs.  Because guest state (vCPU state, memory
state) is protected, changing it must go through the TDX module APIs.
However, the TDX module doesn't provide a way for the VMM to inject an SMI
into a guest TD or to switch the guest vCPU mode into SMM.

MSR_IA32_SMBASE will not be emulated for TDX guests; -ENOTTY will be
returned when an SMI is requested.
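
For illustration, one of the SMI-related OPs could be guarded roughly as
below (the vt_*() wrapper names are assumptions):

  static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
  {
          /* SMI is unsupported for TDs; flag any attempt as a KVM bug. */
          if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
                  return 0;

          return vmx_smi_allowed(vcpu, for_injection);
  }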

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
acc64eb4e2 KVM: TDX: Implement methods to inject NMI
Inject NMIs into a TDX guest by setting the PEND_NMI TDVPS field to 1, i.e.
make the NMI pending in the TDX module.  If there is a further pending NMI
in KVM, collapse it into the one pending in the TDX module.

The VMM can request the TDX module to inject an NMI into a TDX vCPU by
setting the PEND_NMI TDVPS field to 1.  Following that, the VMM can call
TDH.VP.ENTER to run the vCPU, and the TDX module will attempt to inject the
NMI as soon as possible.
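
For illustration, the injection hook itself can be tiny; the TDVPS accessor
and field names below are assumptions:

  static void tdx_inject_nmi(struct kvm_vcpu *vcpu)
  {
          ++vcpu->stat.nmi_injections;

          /* Pend the NMI in the TDX module; it is delivered on TD entry. */
          td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
  }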

KVM has the following three cases where it injects two NMIs when handling
simultaneous NMIs, and they need to be injected back-to-back.  Otherwise,
the OS kernel may fire a warning about an unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI pending in KVM.
    KVM requests NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
    KVM injects the first NMI and requests NMI window exit to inject the
    second NMI.
K3. A previous NMI needs to be rejected and one NMI pending in KVM.
    KVM first requests force immediate exit followed by a VM entry to
    complete the NMI rejection.  Then, during the force immediate exit, KVM
    requests NMI window exit to inject the pending NMI.

For TDX, the PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend
one NMI in the TDX module.  Also, the vCPU state is protected and KVM
doesn't know the NMI blocking state of a TDX vCPU, so KVM has to assume
NMIs are always unmasked and allowed.  When KVM sees PEND_NMI is 1 after a
TD exit, it means the previous NMI needs to be re-injected.

Based on KVM's NMI handling flow, there are following 6 cases:
    In NMI handler    TDX module    KVM
T1. No                PEND_NMI=0    1 pending NMI
T2. No                PEND_NMI=0    2 pending NMIs
T3. No                PEND_NMI=1    1 pending NMI
T4. Yes               PEND_NMI=0    1 pending NMI
T5. Yes               PEND_NMI=0    2 pending NMIs
T6. Yes               PEND_NMI=1    1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: KVM doesn't know whether an NMI is blocked by another NMI or not, so
cases T5 and T6 can happen.

When handling a pending NMI in KVM for a TDX guest, all KVM can do is add a
pending NMI in the TDX module when PEND_NMI is 0.  T1 and T4 can be handled
this way.  However, TDX doesn't allow KVM to request an NMI-window exit
directly; if PEND_NMI is already set and there is still a pending NMI in
KVM, the only thing KVM can try is to request a forced immediate exit.  But
for cases T5 and T6, a forced immediate exit will result in an infinite
loop, because the forced immediate exit makes no progress in the NMI
handler, so the pending NMI in the TDX module can never be injected.

Consider that on x86 bare metal, multiple NMIs can collapse into one NMI,
e.g. when NMIs are blocked by an SMI.  It's the OS's responsibility to poll
all NMI sources in the NMI handler to avoid missing some NMI events.

Based on that, of the above three cases (K1-K3), only case K1 must inject
the second NMI: the guest NMI handler may have already polled some of the
NMI sources, which could include the source of the pending NMI, so the
pending NMI must be injected to avoid losing an NMI.  For cases K2 and K3,
the guest OS will poll all NMI sources (including the sources of the second
NMI and any further collapsed NMIs) when the first NMI is delivered, so KVM
doesn't need to inject the second NMI.

To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify KVM's common NMI handling code to collapse the second
  pending NMI for K2 and K3.
- Option 2: Do it in a TDX-specific way.  When the previous NMI is still
  pending in the TDX module, i.e. it has not been delivered to the TDX
  guest yet, collapse the pending NMI in KVM into the previous one.

This patch goes with option 2 because it is simple and doesn't impact other
VM types.  Option 1 may need more discussion.

This is the first time vCPU-scope metadata in the "management" class needs
to be accessed.  Make the needed accessors available.

[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
fc17de9901 KVM: TDX: Wait lapic expire when timer IRQ was injected
Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector
for LVTT is set in PIR before TD entry.

KVM always assumes a timer IRQ was injected if APIC state is protected.
For TDX guests, APIC state is protected and KVM injects the timer IRQ via
posted interrupt.  To avoid unnecessary wait calls, only call
kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON
is set and the vector for LVTT is set in the PIR.

Add a helper to test PIR.
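
A sketch of that test; vcpu_to_pi_desc() and the exact way the PIR bits are
examined are assumptions made for illustration:

  static bool tdx_lvtt_irq_posted(struct kvm_vcpu *vcpu)
  {
          struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
          u32 vec = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTT) &
                    APIC_VECTOR_MASK;

          return pi_test_on(pi) && test_bit(vec, (unsigned long *)pi->pir);
  }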

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Sean Christopherson
14aecf2a5b KVM: x86: Assume timer IRQ was injected if APIC state is protected
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path.  The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.

Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
24c1291116 KVM: TDX: Implement non-NMI interrupt injection
Implement non-NMI interrupt injection for TDX via posted interrupt.

As CPU state is protected and APICv is enabled for the TDX guest, TDX
supports non-NMI interrupt injection only via posted interrupt.  Posted
interrupt descriptors (PIDs) are allocated in shared memory, so KVM can
update them directly.  If the target vCPU is in non-root mode, send a
posted interrupt notification to the vCPU and hardware will sync the PIR to
the vIRR atomically.  Otherwise, kick the vCPU to pick up the interrupt
from the PID.  To post pending interrupts in the PID, KVM can generate a
self-IPI with the notification vector prior to TD entry.

Since the guest status of the TD vCPU is protected, assume interrupts are
always allowed.  Ignore the code paths for the event injection mechanism
and LAPIC emulation for TDX.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
254e5dcd5a KVM: VMX: Move posted interrupt delivery code to common header
Move posted interrupt delivery code to common header so that TDX can
leverage it.

No functional change intended.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014757.897978-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
34d2d1ca1b KVM: TDX: Disable PI wakeup for IPIv
Disable PI wakeup for the IPI virtualization (IPIv) case for TDX.

When a vCPU is being scheduled out, the notification vector is switched,
and pi_wakeup_handler() is enabled when the vCPU has interrupts enabled and
posted interrupts are used to wake up the vCPU.

For VMX, a blocked vCPU can be the target of posted interrupts when using
IPIv or VT-d PI.  TDX doesn't support IPIv, so disable PI wakeup for IPIv.
Also, since the guest status of the TD vCPU is protected, assume interrupts
are always enabled for the TD.  (The PV HLT hypercall is not supported yet;
with it, the TDX guest tells the VMM whether HLT is called with interrupts
disabled or not.)

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Sean Christopherson
90cfe144c8 KVM: TDX: Add support for find pending IRQ in a protected local APIC
Add a flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ.  For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM.  As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM.  To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector.  And
to determine if a TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.

Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.

Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks.  A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.

For the TD exits not due to HLT TDCALL, skip checking RVI pending in
tdx_protected_apic_has_interrupt().  Except for the guest being stupid
(e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to
have an interrupt in RVI that is fully unmasked.  There is no any CPU flows
that modify RVI in the middle of instruction execution.  I.e. if RVI is
non-zero, then either the interrupt has been pending since before the TD
exit, or the instruction caused the TD exit is in an STI/SS shadow.  KVM
doesn't care about STI/SS shadows outside of the HALTED case.  And if the
interrupt was pending before TD exit, then it _must_ be blocked, otherwise
the interrupt would have been serviced at the instruction boundary.

For the HLT TDCALL case, it will be handled in a future patch when HLT
TDCALL is supported.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00