Commit Graph

11000 Commits

Author SHA1 Message Date
Sean Christopherson
6737557442 KVM: VMX: Suppress PI notifications whenever the vCPU is put
Suppress posted interrupt notifications (set PID.SN=1) whenever the vCPU
is put, i.e. unloaded, not just when the vCPU is preempted, as KVM doesn't
do anything in response to a notification IRQ that arrives in the host,
nor does KVM rely on the Outstanding Notification (PID.ON) flag when the
vCPU is unloaded.  And, the cost of scanning the PIR to manually set PID.ON
when loading the vCPU is quite small, especially relative to the cost of
loading (and unloading) a vCPU.

On the flip side, leaving SN clear means a notification for the vCPU will
result in a spurious IRQ for the pCPU, even if vCPU task is scheduled out,
running in userspace, etc.  Even worse, if the pCPU is running a different
vCPU, the spurious IRQ could trigger posted interrupt processing for the
wrong vCPU, which is technically a violation of the architecture, as
setting bits in PIR aren't supposed to be propagated to the vIRR until a
notification IRQ is received.

The saving grace of the current behavior is that hardware sends
notification interrupts if and only if PID.ON=0, i.e. only the first
posted interrupt for a vCPU will trigger a spurious IRQ (for each window
where the vCPU is unloaded).

Ideally, KVM would suppress notifications before enabling IRQs in the
VM-Exit, but KVM relies on PID.ON as an indicator that there is a posted
interrupt pending in PIR, e.g. in vmx_sync_pir_to_irr(), and sadly there
is no way to ask hardware to set PID.ON, but not generate an interrupt.
That could be solved by using pi_has_pending_interrupt() instead of
checking only PID.ON, but it's not at all clear that would be a performance
win, as KVM would end up scanning the entire PIR whenever an interrupt
isn't pending.

And long term, the spurious IRQ window, i.e. where a vCPU is loaded with
IRQs enabled, can effectively be made smaller for hot paths by moving
performance critical VM-Exit handlers into the fastpath, i.e. by never
enabling IRQs for hot path VM-Exits.

Link: https://lore.kernel.org/r/20250611224604.313496-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 08:42:26 -07:00
Maxim Levitsky
8de4a1c816 KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
write emulation, and might fail to VM-Exit on the sending vCPU, if
IsRunning was recently cleared.

The absence of the VM-Exit leads to KVM not waking (or triggering nested
VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs,
unbounded delays in L2 execution, etc.

To workaround the erratum, simply disable IPI virtualization, which
prevents KVM from setting IsRunning and thus eliminates the race where
hardware sees a stale IsRunning=1.  As a result, all ICR writes (except
when "Self" shorthand is used) will VM-Exit and therefore be correctly
emulated by KVM.

Disabling IPI virtualization does carry a performance penalty, but
benchmarkng shows that enabling AVIC without IPI virtualization is still
much better than not using AVIC at all, because AVIC still accelerates
posted interrupts and the receiving end of the IPIs.

Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID
table and updates the vIRR directly (because the vCPU is by definition
actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
is still accelerated by hardware.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: rebase, massage changelog, disallow user override]
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 08:42:22 -07:00
Maxim Levitsky
d921665e01 KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
module param, by never setting IsRunning.  SVM doesn't provide a way to
disable IPI virtualization in hardware, but by ensuring CPUs never see
IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
VM-Exit.

To avoid setting the real IsRunning bit, while still allowing KVM to use
each vCPU's entry to update GA log entries, simply maintain a shadow of
the entry, without propagating IsRunning updates to the real table when
IPI virtualization is disabled.

Providing a way to effectively disable IPI virtualization will allow KVM
to safely enable AVIC on hardware that is susceptible to erratum #1235,
which causes hardware to sometimes fail to detect that the IsRunning bit
has been cleared by software.

Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
entries, i.e. won't generate VM-Exit if every entry is invalid, and so
simply pointing the VMCB at a common dummy table won't work.

Alternatively, KVM could allocate a shadow of the entire table, but that'd
be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
additional 8 bytes of memory (vCPU structures are large enough that they
are backed by order-N pages).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:53:02 -07:00
Sean Christopherson
bafddc7000 KVM: VMX: Move enable_ipiv knob to common x86
Move enable_ipiv to common x86 so that it can be reused by SVM to control
IPI virtualization when AVIC is enabled.  SVM doesn't actually provide a
way to truly disable IPI virtualization, but KVM can get close enough by
skipping the necessary table programming.

Link: https://lore.kernel.org/r/20250611224604.313496-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:53:01 -07:00
Sean Christopherson
d29433336a KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
the table directly.  Caching a pointer address is completely unnecessary
for performance, and while the field technically caches the result of the
pointer calculation, it's all too easy to misinterpret the name and think
that the field somehow caches the _data_ in the table.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:53:01 -07:00
Sean Christopherson
26baab4eea KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"
Allocate and track AVIC's logical and physical tables as u32 and u64
pointers respectively, as managing the pages as "struct page" pointers
adds an almost absurd amount of boilerplate and complexity.  E.g. with
page_address() out of the way, svm->avic_physical_id_cache becomes
completely superfluous, and will be removed in a future cleanup.

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:53:00 -07:00
Sean Christopherson
c24ed209c4 KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
Drop avic_get_physical_id_entry()'s compatibility check on the incoming
ID, as its sole caller, avic_init_backing_page(), performs the exact same
check.  Drop avic_get_physical_id_entry() entirely as the only remaining
functionality is getting the address of the Physical ID table, and
accessing the array without an immediate bounds check is kludgy.

Opportunistically add a compile-time assertion to ensure the vcpu_id can't
result in a bounds overflow, e.g. if KVM (really) messed up a maximum
physical ID #define, as well as run-time assertions so that a NULL pointer
dereference is morphed into a safer WARN().

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:59 -07:00
Sean Christopherson
1aa6e256e4 KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).

Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. would be a breaking change if KVM were to ever try and enable AVIC or
x2AVIC by default.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:59 -07:00
Sean Christopherson
d8527f133c KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
Drop vcpu_svm's avic_backing_page pointer and instead grab the physical
address of KVM's vAPIC page directly from the source.  Getting a physical
address from a kernel virtual address is not an expensive operation, and
getting the physical address from a struct page is *more* expensive for
CONFIG_SPARSEMEM=y kernels.  Regardless, none of the paths that consume
the address are hot paths, i.e. shaving cycles is not a priority.

Eliminating the "cache" means KVM doesn't have to worry about the cache
being invalid, which will simplify a future fix when dealing with vCPU IDs
that are too big.

WARN if KVM attempts to allocate a vCPU's AVIC backing page without an
in-kernel local APIC.  avic_init_vcpu() bails early if the APIC is not
in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have
been created, i.e. it should be impossible to reach
avic_init_backing_page() without the vAPIC being allocated.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:58 -07:00
Sean Christopherson
3338c639da KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
Add a helper to get the physical address of the AVIC backing page, both
to deduplicate code and to prepare for getting the address directly from
apic->regs, at which point it won't be all that obvious that the address
in question is what SVM calls the AVIC backing page.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:57 -07:00
Sean Christopherson
2e002ddc89 KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).

All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.

Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code.  It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry.  And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.

The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:

  In some implementations, the bit used may be a physical address bit

Key word being "may".

Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:57 -07:00
Sean Christopherson
4305795778 KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:56 -07:00
Sean Christopherson
a0ca34bb1a KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
Delete the IRTE link from the previous vCPU irrespective of the new
routing state, i.e. even if the IRTE won't be configured to post IRQs to a
vCPU.  Whether or not the new route is postable as no bearing on the *old*
route.  Failure to delete the link can result in KVM incorrectly updating
the IRTE, e.g. if the "old" vCPU is scheduled in/out.

Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:56 -07:00
Sean Christopherson
1da19c5ce0 iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:55 -07:00
Sean Christopherson
0a917e9d4b KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
Delete the previous per-vCPU IRTE link prior to modifying the IRTE.  If
forcing the IRTE back to remapped mode fails, the IRQ is already broken;
keeping stale metadata won't change that, and the IOMMU should be
sufficiently paranoid to sanitize the IRTE when the IRQ is freed and
reallocated.

This will allow hoisting the vCPU tracking to common x86, which in turn
will allow most of the IRTE update code to be deduplicated.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:54 -07:00
Sean Christopherson
05c5e23657 KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
structure and GSI routing instead of dynamically allocating a separate
data structure.  In addition to eliminating an atomic allocation, this
will allow hoisting much of the IRTE update logic to common x86.

Cc: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:54 -07:00
Sean Christopherson
cb21073767 KVM: Pass new routing entries and irqfd when updating IRTEs
When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.

Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.

Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).

Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:53 -07:00
Sean Christopherson
e76c274513 KVM: x86: Fold irq_comm.c into irq.c
Drop irq_comm.c, a.k.a. common IRQ APIs, as there has been no non-x86 user
since commit 003f7de625 ("KVM: ia64: remove") (at the time, irq_comm.c
lived in virt/kvm, not arch/x86/kvm).

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:52 -07:00
Sean Christopherson
37b1761fe8 KVM: x86: Move IRQ mask notifier infrastructure to I/O APIC emulation
Move the IRQ mask logic to ioapic.c as KVM's only user is its in-kernel
I/O APIC emulation.  In addition to encapsulating more I/O APIC specific
code, trimming down irq_comm.c helps pave the way for removing it entirely.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:52 -07:00
Sean Christopherson
628a27731e KVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APIC
Add a Kconfig to allow building KVM without support for emulating a I/O
APIC, PIC, and PIT, which is desirable for deployments that effectively
don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to
create an in-kernel I/O APIC.  E.g. compiling out support eliminates a few
thousand lines of guest-facing code and gives security folks warm fuzzies.

As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes
it much easier for readers to understand which bits and pieces exist
specifically for fully in-kernel IRQ chips.

Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to
CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub
for kvm_arch_post_irq_routing_update().

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:50 -07:00
Sean Christopherson
2c938850d9 KVM: Move x86-only tracepoints to x86's trace.h
Move the I/O APIC tracepoints and trace_kvm_msi_set_irq() to x86, as
__KVM_HAVE_IOAPIC is just code for "x86", and trace_kvm_msi_set_irq()
isn't unique to I/O APIC emulation.

Opportunistically clean up the absurdly messy #includes in ioapic.c.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:49 -07:00
Sean Christopherson
cd9140ad83 KVM: x86: Explicitly check for in-kernel PIC when getting ExtINT
Explicitly check for an in-kernel PIC when checking for a pending ExtINT
in the PIC.  Effectively swapping the split vs. full irqchip logic will
allow guarding the in-kernel I/O APIC (and PIC) emulation with a Kconfig,
and also makes it more obvious that kvm_pic_read_irq() won't result in a
NULL pointer dereference.

Opportunistically add WARNs in the fallthrough path, mostly to document
that the userspace ExtINT logic is only relevant to split IRQ chips.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:48 -07:00
Sean Christopherson
2c31aa747d KVM: x86: Don't clear PIT's IRQ line status when destroying PIT
Don't bother clearing the PIT's IRQ line status when destroying the PIT,
as userspace can't possibly rely on KVM to lower the IRQ line in any sane
use case, and it's not at all obvious that clearing the PIT's IRQ line is
correct/desirable in kvm_create_pit()'s error path.

When called from kvm_arch_pre_destroy_vm(), the entire VM is being torn
down and thus {kvm_pic,kvm_ioapic}.irq_states are unreachable.

As for the error path in kvm_create_pit(), the only way the PIT's bit in
irq_states can be set is if userspace raises the associated IRQ before
KVM_CREATE_PIT{2} completes.  Forcefully clearing the bit would clobber
userspace's input, nonsensical though that input may be.  Not to mention
that no known VMM will continue on if PIT creation fails.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:48 -07:00
Sean Christopherson
61423c413a KVM: x86: Hardcode the PIT IRQ source ID to '2'
Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2
is always the first available bit in irq_sources_bitmap.  Bits 0 and 1 are
set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can
be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's
impossible for the PIT to find a different free bit (there are no other
users of kvm_request_irq_source_id().

Delete the now-defunct irq_sources_bitmap and all its associated code.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:47 -07:00
Sean Christopherson
77a74b8ff4 KVM: x86: Move kvm_{request,free}_irq_source_id() to i8254.c (PIT)
Move kvm_{request,free}_irq_source_id() to i8254.c, i.e. the dedicated PIT
emulation file, in anticipation of removing them entirely in favor of
hardcoding the PIT's "requested" source ID (the source ID can only ever be
'2', and the request can never fail).

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:47 -07:00
Sean Christopherson
df35135680 KVM: x86: Move kvm_setup_default_irq_routing() into irq.c
Move the default IRQ routing table used for in-kernel I/O APIC and PIC
routing to irq.c, and tweak the name to make it explicitly clear what
routing is being initialized.

In addition to making it more obvious that the so called "default" routing
only applies to an in-kernel I/O APIC, getting it out of irq_comm.c will
allow removing irq_comm.c entirely.  And placing the function alongside
other I/O APIC and PIC code will allow for guarding KVM's in-kernel I/O
APIC and PIC emulation with a Kconfig with minimal #ifdefs.

No functional change intended.

Cc: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:46 -07:00
Sean Christopherson
c5a701955e KVM: x86: Rename irqchip_kernel() to irqchip_full()
Rename irqchip_kernel() to irqchip_full(), as "kernel" is very ambiguous
due to the existence of split IRQ chip support, where only some of the
"irqchip" is in emulated by the kernel/KVM.  E.g. irqchip_kernel() often
gets confused with irqchip_in_kernel().

Opportunistically hoist the definition up in irq.h so that it's
co-located with other "full" irqchip code in anticipation of wrapping it
all with a Kconfig/#ifdef.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:45 -07:00
Sean Christopherson
b771b1616f KVM: x86: Move KVM_{GET,SET}_IRQCHIP ioctl helpers to irq.c
Move the ioctl helpers for getting/setting fully in-kernel IRQ chip state
to irq.c, partly to trim down x86.c, but mostly in preparation for adding
a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT
emulation.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:45 -07:00
Sean Christopherson
00b5ebf8db KVM: x86: Move PIT ioctl helpers to i8254.c
Move the PIT ioctl helpers to i8254.c, i.e. to the file that implements
PIT emulation.  Eliminating PIT code in x86.c will allow adding a Kconfig
to control support for in-kernel I/O APIC, PIC, and PIT emulation with
minimal #ifdefs.

Opportunistically make kvm_pit_set_reinject() and kvm_pit_load_count()
local to i8254.c as they were only publicly visible to make them available
to the ioctl helpers.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:44 -07:00
Sean Christopherson
20218e69e8 KVM: x86: Drop superfluous kvm_hv_set_sint() => kvm_hv_synic_set_irq() wrapper
Drop the superfluous kvm_hv_set_sint() and instead wire up ->set() directly
to its final destination, kvm_hv_synic_set_irq().  Keep hv_synic_set_irq()
instead of kvm_hv_set_sint() to provide some amount of consistency in the
->set() helpers, e.g. to match kvm_pic_set_irq() and kvm_ioapic_set_irq().

kvm_set_msi() is arguably the oddball, e.g. kvm_set_msi_irq() should be
something like kvm_msi_to_lapic_irq() so that kvm_set_msi() can instead be
kvm_set_msi_irq(), but that's a future problem to solve.

No functional change intended.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:43 -07:00
Sean Christopherson
05dc9eab3f KVM: x86: Drop superfluous kvm_set_ioapic_irq() => kvm_ioapic_set_irq() wrapper
Drop the superfluous and confusing kvm_set_ioapic_irq() and instead wire
up ->set() directly to its final destination.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:43 -07:00
Sean Christopherson
8a33b1f246 KVM: x86: Drop superfluous kvm_set_pic_irq() => kvm_pic_set_irq() wrapper
Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq()
wrapper, and instead wire up ->set() directly to its final destination.

Opportunistically move the declaration kvm_pic_set_irq() to irq.h to
start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:42 -07:00
Sean Christopherson
e295d2e7fb KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()
Trigger the I/O APIC route rescan that's performed for a split IRQ chip
after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e.
before dropping kvm->irq_lock.  Calling kvm_make_all_cpus_request() under
a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair
in __kvm_make_request()+kvm_check_request() ensures the new routing is
visible to vCPUs prior to the request being visible to vCPUs.

In all likelihood, commit b053b2aef2 ("KVM: x86: Add EOI exit bitmap
inference") somewhat arbitrarily made the request outside of irq_lock to
avoid holding irq_lock any longer than is strictly necessary.  And then
commit abdb080f7a ("kvm/irqchip: kvm_arch_irq_routing_update renaming
split") took the easy route of adding another arch hook instead of risking
a functional change.

Note, the call to synchronize_srcu_expedited() does NOT provide ordering
guarantees with respect to vCPUs scanning the new routing; as above, the
request infrastructure provides the necessary ordering.  I.e. there's no
need to wait for kvm_scan_ioapic_routes() to complete if it's actively
running, because regardless of whether it grabs the old or new table, the
vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again
and see the new mappings.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:41 -07:00
Sean Christopherson
2b521d86ee irqbypass: Take ownership of producer/consumer token tracking
Move ownership of IRQ bypass token tracking into irqbypass.ko, and
explicitly require callers to pass an eventfd_ctx structure instead of a
completely opaque token.  Relying on producers and consumers to set the
token appropriately is error prone, and hiding the fact that the token must
be an eventfd_ctx pointer (for all intents and purposes) unnecessarily
obfuscates the code and makes it more brittle.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:38 -07:00
Tom Lendacky
24be2b7956 KVM: SVM: Allow SNP guest policy to specify SINGLE_SOCKET
KVM currently returns -EINVAL when it attempts to create an SNP guest if
the SINGLE_SOCKET guest policy bit is set. The reason for this action is
that KVM would need specific support (SNP_ACTIVATE_EX command support) to
achieve this when running on a system with more than one socket. However,
the SEV firmware will make the proper check and return POLICY_FAILURE
during SNP_ACTIVATE if the single socket guest policy bit is set and the
system has more than one socket:

 - System with one socket
   - Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
   - Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE succeeds

 - System with more than one socket
   - Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
   - Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE fails with
     POLICY_FAILURE

Remove the check for the SINGLE_SOCKET policy bit from snp_launch_start()
and allow the firmware to perform the proper checking.

This does have the effect of allowing an SNP guest with the SINGLE_SOCKET
policy bit set to run on a single socket system, but fail when run on a
system with more than one socket. However, this should not affect existing
SNP guests as setting the SINGLE_SOCKET policy bit is not allowed today.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/4c51018dd3e4f2c543935134d2c4f47076f109f6.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:33:45 -07:00
Tom Lendacky
9f4701e05f KVM: SVM: Allow SNP guest policy disallow running with SMT enabled
KVM currently returns -EINVAL when it attempts to create an SNP guest if
the SMT guest policy bit is not set. However, there is no reason to check
this, as there is no specific support in KVM that is required to support
this. The SEV firmware will determine if SMT has been enabled or disabled
in the BIOS and process the policy in the proper way:

 - SMT enabled in BIOS
   - Guest policy SMT == 0 ==> SNP_LAUNCH_START fails with POLICY_FAILURE
   - Guest policy SMT == 1 ==> SNP_LAUNCH_START succeeds

 - SMT disabled in BIOS
   - Guest policy SMT == 0 ==> SNP_LAUNCH_START succeeds
   - Guest policy SMT == 1 ==> SNP_LAUNCH_START succeeds

Remove the check for the SMT policy bit from snp_launch_start() and allow
the firmware to perform the proper checking.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/71043abdd9ef23b6f98fffa9c5c6045ac3a50187.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:33:45 -07:00
Sean Christopherson
1f287a4e7b KVM: TDX: Move TDX hardware setup from main.c to tdx.c
Move TDX hardware setup to tdx.c, as the code is obviously TDX specific,
co-locating the setup with tdx_bringup() makes it easier to see and
document the success_disable_tdx "error" path, and configuring the TDX
specific hooks in tdx.c reduces the number of globally visible TDX symbols.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:11:16 -07:00
Sean Christopherson
ffced89220 KVM: x86/mmu: Exempt nested EPT page tables from !USER, CR0.WP=0 logic
Exempt nested EPT shadow pages tables from the CR0.WP=0 handling of
supervisor writes, as EPT doesn't have a U/S bit and isn't affected by
CR0.WP (or CR4.SMEP in the exception to the exception).

Opportunistically refresh the comment to explain what KVM is doing, as
the only record of why KVM shoves in WRITE and drops USER is buried in
years-old changelogs.

Cc: Jon Kohler <jon@nutanix.com>
Cc: Sergey Dyasli <sergey.dyasli@nutanix.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Reviewed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
Link: https://lore.kernel.org/r/20250602234851.54573-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:08:22 -07:00
Sean Christopherson
58c81bc1e7 KVM: x86: Refactor handling of SIPI_RECEIVED when setting MP_STATE
Convert the incoming mp_state to INIT_RECIEVED instead of manually calling
kvm_set_mp_state() to make it more obvious that the SIPI_RECEIVED logic is
translating the incoming state to KVM's internal tracking, as opposed to
being some entirely unique flow.

Opportunistically add a comment to explain what the code is doing.

No functional change intended.

Link: https://lore.kernel.org/r/20250605195018.539901-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:08:00 -07:00
Sean Christopherson
0fe3e8d804 KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked check to KVM_RUN
Check for the should-be-impossible scenario of a vCPU being in
Wait-For-SIPI with INIT/SIPI blocked during KVM_RUN instead of trying to
detect and prevent illegal combinations in every ioctl that sets relevant
state.  Attempting to handle every possible "set" path is a losing game of
whack-a-mole, and risks breaking userspace.  E.g. INIT/SIPI are blocked on
Intel if the vCPU is in VMX Root mode (post-VMXON), and on AMD if GIF=0.
Handling those scenarios would require potentially breaking changes to
{vmx,svm}_set_nested_state().

Moving the check to KVM_RUN fixes a syzkaller-induced splat due to the
aforementioned VMXON case, and in theory should close the hole once and for
all.

Note, kvm_x86_vcpu_pre_run() already handles SIPI_RECEIVED, only the WFS
case needs additional attention.

Reported-by: syzbot+c1cbaedc2613058d5194@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=490ae63d8d89cb82c5d462d16962cf371df0e476
Link: https://lore.kernel.org/r/20250605195018.539901-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:08:00 -07:00
Sean Christopherson
16777ebded KVM: x86: WARN and reject KVM_RUN if vCPU's MP_STATE is SIPI_RECEIVED
WARN if KVM_RUN is reached with a vCPU's mp_state set to SIPI_RECEIVED, as
KVM no longer uses SIPI_RECEIVED internally, and should morph SIPI_RECEIVED
into INIT_RECEIVED with a pending SIPI if userspace forces SIPI_RECEIVED.

See commit 66450a21f9 ("KVM: x86: Rework INIT and SIPI handling") for
more history and details.

Link: https://lore.kernel.org/r/20250605195018.539901-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:59 -07:00
Sean Christopherson
c4a37acc51 KVM: x86: Drop pending_smi vs. INIT_RECEIVED check when setting MP_STATE
Allow userspace to set a vCPU's mp_state to INIT_RECEIVED in conjunction
with a pending SMI, as rejecting that combination could result in KVM
disallowing reflecting the output from KVM_GET_VCPU_EVENTS back into KVM
via KVM_SET_VCPU_EVENTS.

At the time the check was added, smi_pending could only be set in the
context of KVM_RUN, with the vCPU in the RUNNABLE state.  I.e. it was
impossible for KVM to save vCPU state such that userspace could see a
pending SMI for a vCPU in WFS.

That no longer holds true now that KVM processes requested SMIs during
KVM_GET_VCPU_EVENTS, e.g. if a vCPU receives an SMI while in WFS, and
then userspace saves vCPU state.

Note, this may partially re-open the user-triggerable WARN that was mostly
closed by commit 28bf288879 ("KVM: x86: fix user triggerable warning in
kvm_apic_accept_events()"), but that WARN can already be triggered in
several other ways, e.g. if userspace stuffs VMXON=1 after putting the
vCPU into WFS.  That issue will be addressed in an upcoming commit, in a
more robust fashion (hopefully).

Fixes: 1f7becf1b7 ("KVM: x86: get smi pending status correctly")
Link: https://lore.kernel.org/r/20250605195018.539901-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:59 -07:00
Sean Christopherson
bea44d1992 KVM: x86: Simplify userspace filter logic when disabling MSR interception
Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of
userspace filters that disallow access to an MSR.  The more complicated
logic is no longer needed or justified now that KVM recalculates all MSR
intercepts on a userspace MSR filter change, i.e. now that KVM doesn't
need to also update shadow bitmaps.

No functional change intended.

Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:37 -07:00
Sean Christopherson
73be81b3bb KVM: SVM: Add a helper to allocate and initialize permissions bitmaps
Add a helper to allocate and initialize an MSR or I/O permissions map, as
the logic is identical between the two map types, the only difference is
the size of the bitmap.  Opportunistically add a comment to explain why
the bitmaps are initialized with 0xff, e.g. instead of the more common
zero-initialized behavior, which is the main motivation for deduplicating
the code.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:37 -07:00
Sean Christopherson
54f1c77061 KVM: nSVM: Merge MSRPM in 64-bit chunks on 64-bit kernels
When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access
the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for
64-bit kernels instead of arbitrarily working on 4-byte chunks.

Opportunistically rename local variables in nested_svm_merge_msrpm() to
more precisely/accurately reflect their purpose ("offset" in particular is
extremely ambiguous).

Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:36 -07:00
Sean Christopherson
5904ba5172 KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSR
Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate
that the MSR isn't covered by one of the (currently) three MSRPM ranges,
and delete the MSR_INVALID macro now that all users are gone.

Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:35 -07:00
Sean Christopherson
52f8217742 KVM: nSVM: Access MSRPM in 4-byte chunks only for merging L0 and L1 bitmaps
Access the MSRPM using u32/4-byte chunks (and appropriately adjusted
offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN.
The only reason to batch accesses to MSRPMs is to avoid the overhead of
uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's
bitmap pointed at by vmcb12.  For all other uses, either per-bit accesses
are more than fast enough (no uaccess), or KVM is only accessing a single
bit (nested_svm_exit_handled_msr()) and so there's nothing to batch.

In addition to (hopefully) documenting the uniqueness of the merging code,
restricting chunked access to _just_ the merging code will allow for
increasing the chunk size (to unsigned long) with minimal risk.

Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:34 -07:00
Sean Christopherson
7fe0578041 KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *"
Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against
directly accessing the bitmaps outside of code that is explicitly written
to access the bitmaps with a specific type.

Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of
open coding an equivalent.

Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:34 -07:00
Sean Christopherson
5c9c084763 KVM: SVM: Move svm_msrpm_offset() to nested.c
Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the
u32-index offsets is nested virtualization specific.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:33 -07:00
Sean Christopherson
2f89888434 KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses
Now that msr_write_intercepted() defaults to true, i.e. accurately reflects
hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an
out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset()
check that guarded against calling msr_write_intercepted() with a "bad"
index.

Opportunistically clean up the helper's formatting.

Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:32 -07:00
Sean Christopherson
4880919aaf KVM: SVM: Merge "after set CPUID" intercept recalc helpers
Merge svm_recalc_intercepts_after_set_cpuid() and
svm_recalc_instruction_intercepts() such that the "after set CPUID" helper
simply invokes the type-specific helpers (MSRs vs. instructions), i.e.
make svm_recalc_intercepts_after_set_cpuid() a single entry point for all
intercept updates that need to be performed after a CPUID change.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:32 -07:00
Sean Christopherson
40ba80e4b0 KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller
Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there
is only the one caller (and because the "init" misnomer is even more
misleading than it was in the past).

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:31 -07:00
Sean Christopherson
049dff172b KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific
Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid()
to more precisely describe its role.  Strictly speaking, the name isn't
perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an
intercept, but practically speaking it's close enough.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:30 -07:00
Sean Christopherson
4ceca57e3f KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the
trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc
helper to react to the new userspace filter.

No functional change intended.

Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:30 -07:00
Sean Christopherson
160f143cc1 KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts.  The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.

Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.

Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.

Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:29 -07:00
Sean Christopherson
8a056ece45 KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts.  The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.

Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.

Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.

Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Borislav Petkov <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:28 -07:00
Sean Christopherson
405a63d4d3 KVM: x86: Move definition of X2APIC_MSR() to lapic.h
Dedup the definition of X2APIC_MSR and put it in the local APIC code
where it belongs.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:28 -07:00
Sean Christopherson
cb53d07948 KVM: SVM: Drop "always" flag from list of possible passthrough MSRs
Drop the "always" flag from the array of possible passthrough MSRs, and
instead manually initialize the permissions for the handful of MSRs that
KVM passes through by default.  In addition to cutting down on boilerplate
copy+paste code and eliminating a misleading flag (the MSRs aren't always
passed through, e.g. thanks to MSR filters), this will allow for removing
the direct_access_msrs array entirely.

Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:27 -07:00
Sean Christopherson
3a0f09b361 KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest
Disable interception of the GHCB MSR if and only if the VM is an SEV-ES
guest.  While the exact behavior is completely undocumented in the APM,
common sense and testing on SEV-ES capable CPUs says that accesses to the
GHCB from non-SEV-ES guests will #GP.  I.e. from the guest's perspective,
no functional change intended.

Fixes: 376c6d2850 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:26 -07:00
Sean Christopherson
6b7315fe54 KVM: SVM: Implement and adopt VMX style MSR intercepts APIs
Add and use SVM MSR interception APIs (in most paths) to match VMX's
APIs and nomenclature.  Specifically, add SVM variants of:

        vmx_disable_intercept_for_msr(vcpu, msr, type)
        vmx_enable_intercept_for_msr(vcpu, msr, type)
        vmx_set_intercept_for_msr(vcpu, msr, type, intercept)

to eventually replace SVM's single helper:

        set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write)

which is awkward to use (in all cases, KVM either applies the same logic
for both reads and writes, or intercepts one of read or write), and is
unintuitive due to using '0' to indicate interception should be *set*.

Keep the guts of the old API for the moment to avoid churning the MSR
filter code, as that mess will be overhauled in the near future.  Leave
behind a temporary comment to call out that the shadow bitmaps have
inverted polarity relative to the bitmaps consumed by hardware.

No functional change intended.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:26 -07:00
Sean Christopherson
c38595ad69 KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets
Add macro-built helpers for testing, setting, and clearing MSRPM entries
without relying on precomputed offsets.  This sets the stage for eventually
removing general KVM use of precomputed offsets, which are quite confusing
and rather inefficient for the vast majority of KVM's usage.

Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed
offsets and accesses is at best unnecessary, and at worst introduces extra
operations to retrieve the individual bit from within the offset u32 value.
And simply calling them "offsets" is very confusing, as the "unit" of the
offset isn't immediately obvious.

Use the new helpers in set_msr_interception_bitmap() and
msr_write_intercepted() to verify the math and operations, but keep the
existing offset-based logic in set_msr_interception_bitmap() to sanity
check the "clear" and "set" operations.  Manipulating MSR interceptions
isn't a hot path and no kernel release is ever expected to contain this
specific version of set_msr_interception_bitmap() (it will be removed
entirely in the near future).

Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:25 -07:00
Sean Christopherson
4879dc9469 KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"
Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough"
MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge
vmcb01 with vmcb12 and write the result to vmcb02.  This will eventually
allow for the removal of svm_vcpu_init_msrpm().

Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default
to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the
worst case scenario from dropping the call to svm_vcpu_init_msrpm() should
be that KVM would fail to passthrough MSRs to L2.

Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:24 -07:00
Sean Christopherson
f21ff2c8c9 KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge
Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only
for SEV-ES guests.  KVM doesn't support nested virtualization for SEV-ES,
and likely never will.

Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:24 -07:00
Sean Christopherson
9b72c3d59f KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps
Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to
merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap.  This will eventually
allow for the removal of direct_access_msrs, as the only path where
tracking the offsets is truly justified is the merge for nested SVM, where
merging in chunks is an easy way to batch uaccess reads/writes.

Opportunistically omit the x2APIC MSRs from the merge-specific array
instead of filtering them out at runtime.

Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and
TSC_AUX is mutually exclusive with nested virtualization, as KVM passes
through those MSRs only for SEV-ES guests, and KVM doesn't support nested
virtualization for SEV+ guests.  Defer removing those MSRs to a future
cleanup in order to make this refactoring as benign as possible.

Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:23 -07:00
Sean Christopherson
16e9584cc0 KVM: SVM: Clean up macros related to architectural MSRPM definitions
Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding
helpers that are available to SVM code, and opportunistically replace a
variety of open-coded literals with (hopefully) informative macros.

Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping
it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified
with "SVM_MSRPM".

Deliberately leave the ranges as open coded literals, as using macros to
define the ranges actually introduces more potential failure points, since
both the definitions and the usage have to be careful to use the correct
index.  The lack of clear intent behind the ranges will be addressed in
future patches.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:07:11 -07:00
Sean Christopherson
925149b6d0 KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs
Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better
capture its role, and opportunistically feed it @vcpu instead of @svm, as
grabbing "svm" only to turn around and grab svm->vcpu is rather silly.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
b1bccf7883 KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts
Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores),
as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is
held.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
6353cd685c KVM: SVM: Kill the VM instead of the host if MSR interception is buggy
WARN and kill the VM instead of panicking the host if KVM attempts to set
or query MSR interception for an unsupported MSR.  Accessing the MSR
interception bitmaps only meaningfully affects post-VMRUN behavior, and
KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN,
i.e. there is no need to panic the entire host.

Opportunistically move the sanity checks about their use to index into the
MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to
doing that _and_ generating an out-of-bounds load.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
b241c50c4e KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs
Drop the unnecessary and dangerous value-terminated behavior of
direct_access_msrs, and simply iterate over the actual size of the array.
The use in svm_set_x2apic_msr_interception() is especially sketchy, as it
relies on unused capacity being zero-initialized, and '0' being outside
the range of x2APIC MSRs.

To ensure the array and shadow_msr_intercept stay synchronized, simply
assert that their sizes are identical (note the six 64-bit-only MSRs).

Note, direct_access_msrs will soon be removed entirely; keeping the assert
synchronized with the array isn't expected to be along-term maintenance
burden.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
f886515f9b KVM: SVM: Tag MSR bitmap initialization helpers with __init
Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used
only during hardware setup to map potential passthrough MSRs to offsets in
the bitmap.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
5ebd737308 KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails
WARN and reject module loading if there is a problem with KVM's MSR
interception bitmaps.  Panicking the host in this situation is inexcusable
since it is trivially easy to propagate the error up the stack.

Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:40 -07:00
Sean Christopherson
fb96d5cf0f KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup()
Allocate pages for the IOPM after initial setup has been completed in
svm_hardware_setup(), so that sanity checks can be added in the setup flow
without needing to free the IOPM pages.  The IOPM is only referenced (via
iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need
to allocate it early on.

No functional change intended (beyond the obvious ordering differences,
e.g. if the allocation fails).

Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:39 -07:00
Sean Christopherson
674ffc6503 KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest
Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context
switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's
CPUID model.  Letting the guest access SPEC_CTRL is generally benign, but
the guest would see inconsistent behavior if KVM happened to emulate an
access to the MSR.

Fixes: d00b99c514 ("KVM: SVM: Add support for Virtual SPEC_CTRL")
Reported-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:39 -07:00
Maxim Levitsky
6b1dd26544 KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the
host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting
while running the guest.  When running with the "default treatment of SMIs"
in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that
is visible to host (non-SMM) software, and instead transitions directly
from VMX non-root to SMM.  And critically, DEBUGCTL isn't context switched
by hardware on SMI or RSM, i.e. SMM will run with whatever value was
resident in hardware at the time of the SMI.

Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting
events while the CPU is executing in SMM, which can pollute profiling and
potentially leak information into the guest.

Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner
run loop, as the bit can be toggled in IRQ context via IPI callback (SMP
function call), by way of /sys/devices/cpu/freeze_on_smi.

Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be
preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs,
i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and
at worst could lead to undesirable behavior in the future if AMD CPUs ever
happened to pick up a collision with the bit.

Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module
owns and controls GUEST_IA32_DEBUGCTL.

WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the
lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).

Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed
by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state().
Doing so avoids the need to track host_debugctl on a per-VMCS basis, as
GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and
load_vmcs12_host_state().  For the VM-Fail case, even though KVM won't
have actually entered the guest, vcpu_enter_guest() will have run with
vmcs02 active and thus could result in vmcs01 being run with a stale value.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:24 -07:00
Maxim Levitsky
7d0cce6cbe KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs
Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to
vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into
GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state
into the guest, and without needing to copy+paste the FREEZE_IN_SMM
logic into every patch that accesses GUEST_IA32_DEBUGCTL.

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: massage changelog, make inline, use in all prepare_vmcs02() cases]
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:24 -07:00
Maxim Levitsky
095686e6fc KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter
Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports
a subset of hardware functionality, i.e. KVM can't rely on hardware to
detect illegal/unsupported values.  Failure to check the vmcs12 value
would allow the guest to load any harware-supported value while running L2.

Take care to exempt BTF and LBR from the validity check in order to match
KVM's behavior for writes via WRMSR, but without clobbering vmcs12.  Even
if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect
that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR
are being intercepted.

Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set
*and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but
that would incur non-trivial complexity and wouldn't change the fact that
KVM's handling of DEBUGCTL is blatantly broken.  I.e. the extra complexity
is not worth carrying.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:23 -07:00
Sean Christopherson
8a4351ac30 KVM: VMX: Extract checking of guest's DEBUGCTL into helper
Move VMX's logic to check DEBUGCTL values into a standalone helper so that
the code can be used by nested VM-Enter to apply the same logic to the
value being loaded from vmcs12.

KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested
VM-Enter, as hardware may support features that KVM does not, i.e. relying
on hardware to detect invalid guest state will result in false negatives.
Unfortunately, that means applying KVM's funky suppression of BTF and LBR
to vmcs12 so as not to break existing guests.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:22 -07:00
Sean Christopherson
17ec2f9653 KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the
guest CPUID model, as debug support is supposed to be available if RTM is
supported, and there are no known downsides to letting the guest debug RTM
aborts.

Note, there are no known bug reports related to RTM_DEBUG, the primary
motivation is to reduce the probability of breaking existing guests when a
future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL
(KVM currently lets L2 run with whatever hardware supports; whoops).

Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to
DR7.RTM.

Fixes: 83c529151a ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:24 -07:00
Sean Christopherson
80c64c7afe KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag
Instruct vendor code to load the guest's DR6 into hardware via a new
KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to
load vcpu->arch.dr6 into hardware when DR6 can be read/written directly
by the guest.

Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM
thinks DR6 needs to be reloaded.  TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH
and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:24 -07:00
Sean Christopherson
2478b1b220 KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter
into an a generic bitmap so that similar "take action" information can be
passed to vendor code without creating a pile of boolean parameters.

This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and
will also allow for adding similar functionality for re-loading debugctl
in the active VMCS.

Opportunistically massage the TDX WARN and comment to prepare for adding
more run_flags, all of which are expected to be mutually exclusive with
TDX, i.e. should be WARNed on.

No functional change intended.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:24 -07:00
Sean Christopherson
7d390a9da8 KVM: TDX: Use kvm_arch_vcpu.host_debugctl to restore the host's DEBUGCTL
Use the kvm_arch_vcpu.host_debugctl snapshot to restore DEBUGCTL after
running a TD vCPU.  The final TDX series rebase was mishandled, likely due
to commit fb71c79593 ("KVM: x86: Snapshot the host's DEBUGCTL in common
x86") deleting the same line of code from vmx.h, i.e. creating a semantic
conflict of sorts, but no syntactic conflict.

Using the version in kvm_vcpu_arch picks up the ulong => u64 fix (which
isn't relevant to TDX) as well as the IRQ fix from commit 189ecdb3e1
("KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs").

Link: https://lore.kernel.org/all/20250307212053.2948340-10-pbonzini@redhat.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 8af0990375 ("KVM: TDX: Save and restore IA32_DEBUGCTL")
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:04:23 -07:00
Paolo Bonzini
28224ef02b KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
Allow userspace to advertise TDG.VP.VMCALL subfunctions that the
kernel also supports.  For each output register of GetTdVmCallInfo's
leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported
TDVMCALLs (userspace can set those blindly) and one for user-supported
TDVMCALLs (userspace can set those if it knows how to handle them).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 14:20:20 -04:00
Paolo Bonzini
4580dbef5c KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 14:09:50 -04:00
Binbin Wu
25e8b1dd48 KVM: TDX: Exit to userspace for GetTdVmCallInfo
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX,
to allow userspace to provide information about the support of
TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API.

GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>,
<ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>,
<Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and
<Instruction.WRMSR>. They must be supported by VMM to support TDX guests.

For GetTdVmCallInfo
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0,
  successful execution indicates all GHCI base TDVMCALLs listed above are
  supported.

  Update the KVM TDX document with the set of the GHCI base APIs.

- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it
  indicates the TDX guest is querying the supported TDVMCALLs beyond
  the GHCI base TDVMCALLs.
  Exit to userspace to let userspace set the TDVMCALL sub-function bit(s)
  accordingly to the leaf outputs.  KVM could set the TDVMCALL bit(s)
  supported by itself when the TDVMCALLs don't need support from userspace
  after returning from userspace and before entering guest. Currently, no
  such TDVMCALLs implemented, KVM just sets the values returned from
  userspace.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:55:47 -04:00
Binbin Wu
cf207eac06 KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
Handle TDVMCALL for GetQuote to generate a TD-Quote.

GetQuote is a doorbell-like interface used by TDX guests to request VMM
to generate a TD-Quote signed by a service hosting TD-Quoting Enclave
operating on the host.  A TDX guest passes a TD Report (TDREPORT_STRUCT) in
a shared-memory area as parameter.  Host VMM can access it and queue the
operation for a service hosting TD-Quoting enclave.  When completed, the
Quote is returned via the same shared-memory area.

KVM only checks the GPA from the TDX guest has the shared-bit set and drops
the shared-bit before exiting to userspace to avoid bleeding the shared-bit
into KVM's exit ABI.  KVM forwards the request to userspace VMM (e.g. QEMU)
and userspace VMM queues the operation asynchronously.  KVM sets the return
code according to the 'ret' field set by userspace to notify the TDX guest
whether the request has been queued successfully or not.  When the request
has been queued successfully, the TDX guest can poll the status field in
the shared-memory area to check whether the Quote generation is completed
or not.  When completed, the generated Quote is returned via the same
buffer.

Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is
required to handle the KVM exit reason as the initial support for TDX,
by reentering KVM to ensure that the TDVMCALL is complete.  While at it,
add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:09:32 -04:00
Binbin Wu
b5aafcb4ef KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and
return it for unimplemented TDVMCALL subfunctions.

Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not
implemented is vague because TDX guests can't tell the error is due to
the subfunction is not supported or an invalid input of the subfunction.
New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the
ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.

Before the change, for common guest implementations, when a TDX guest
receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases:
1. Some operand is invalid. It could change the operand to another value
   retry.
2. The subfunction is not supported.

For case 1, an invalid operand usually means the guest implementation bug.
Since the TDX guest can't tell which case is, the best practice for
handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf,
treating the failure as fatal if the TDVMCALL is essential or ignoring
it if the TDVMCALL is optional.

With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to
old TDX guest that do not know about it, but it is expected that the
guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND.
Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND
specifically; for example Linux just checks for success.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:09:31 -04:00
Borislav Petkov (AMD)
31272abd59 KVM: SVM: Advertise TSA CPUID bits to guests
Synthesize the TSA CPUID feature bits for guests. Set TSA_{SQ,L1}_NO on
unaffected machines.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-06-17 17:17:12 +02:00
Borislav Petkov (AMD)
d8010d4ba4 x86/bugs: Add a Transient Scheduler Attacks mitigation
Add the required features detection glue to bugs.c et all in order to
support the TSA mitigation.

Co-developed-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-06-17 17:17:02 +02:00
Borislav Petkov (AMD)
f9af88a3d3 x86/bugs: Rename MDS machinery to something more generic
It will be used by other x86 mitigations.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-06-16 18:45:18 +02:00
Paolo Bonzini
8046d29dde KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY
Only let userspace pass the same addresses that were used in KVM_SET_USER_MEMORY_REGION
(or KVM_SET_USER_MEMORY_REGION2); gpas in the the upper half of the address space
are an implementation detail of TDX and KVM.

Extracted from a patch by Sean Christopherson <seanjc@google.com>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-12 00:51:42 -04:00
Paolo Bonzini
aa2024c01a KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
Bug[*] reported for TDX case when enabling KVM_PRE_FAULT_MEMORY in QEMU.

It turns out that @gpa passed to kvm_mmu_do_page_fault() doesn't have
shared bit set when the memory attribute of it is shared, and it leads
to wrong root in tdp_mmu_get_root_for_fault().

Fix it by embedding the direct bits in the gpa that is passed to
kvm_tdp_map_page(), when the memory of the gpa is not private.

[*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250611001018.2179964-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-12 00:43:39 -04:00
Ashish Kalra
179a8427fc KVM: SEV: Disable SEV-SNP support on initialization failure
During platform init, SNP initialization may fail for several reasons,
such as firmware command failures and incompatible versions. However,
the KVM capability may continue to advertise support for it.

The platform may have SNP enabled but if SNP_INIT fails then SNP is
not supported by KVM.

During KVM module initialization query the SNP platform status to obtain
the SNP initialization state and use it as an additional condition to
determine support for SEV-SNP.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Pratik R. Sampat <prsampat@amd.com>
Signed-off-by: Pratik R. Sampat <prsampat@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Message-ID: <20250512221634.12045-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-11 14:27:06 -04:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
7f9039c524 Generic:
* Clean up locking of all vCPUs for a VM by using the *_nest_lock()
   family of functions, and move duplicated code to virt/kvm/.
   kernel/ patches acked by Peter Zijlstra.
 
 * Add MGLRU support to the access tracking perf test.
 
 ARM fixes:
 
 * Make the irqbypass hooks resilient to changes in the GSI<->MSI
   routing, avoiding behind stale vLPI mappings being left behind. The
   fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
   and nuking the vLPI mapping upon a routing change.
 
 * Close another VGIC race where vCPU creation races with VGIC
   creation, leading to in-flight vCPUs entering the kernel w/o private
   IRQs allocated.
 
 * Fix a build issue triggered by the recently added workaround for
   Ampere's AC04_CPU_23 erratum.
 
 * Correctly sign-extend the VA when emulating a TLBI instruction
   potentially targeting a VNCR mapping.
 
 * Avoid dereferencing a NULL pointer in the VGIC debug code, which can
   happen if the device doesn't have any mapping yet.
 
 s390:
 
 * Fix interaction between some filesystems and Secure Execution
 
 * Some cleanups and refactorings, preparing for an upcoming big series
 
 x86:
 
 * Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
   fix a race between AP destroy and VMRUN.
 
 * Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 
 * Refine and harden handling of spurious faults.
 
 * Add support for ALLOWED_SEV_FEATURES.
 
 * Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 
 * Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
   that utilize those bits.
 
 * Don't account temporary allocations in sev_send_update_data().
 
 * Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
 
 * Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
   SVM and VMX.
 
 * Advertise support to userspace for WRMSRNS and PREFETCHI.
 
 * Rescan I/O APIC routes after handling EOI that needed to be intercepted due
   to the old/previous routing, but not the new/current routing.
 
 * Add a module param to control and enumerate support for device posted
   interrupts.
 
 * Fix a potential overflow with nested virt on Intel systems running 32-bit kernels.
 
 * Flush shadow VMCSes on emergency reboot.
 
 * Add support for SNP to the various SEV selftests.
 
 * Add a selftest to verify fastops instructions via forced emulation.
 
 * Refine and optimize KVM's software processing of the posted interrupt bitmap, and share
   the harvesting code between KVM and the kernel's Posted MSI handler
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg9TjwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOUxQf7B7nnWqIKd7jSkGzSD6YsSX9TXktr
 2tJIOfWM3zNYg5GRCidg+m4Y5+DqQWd3Hi5hH2P9wUw7RNuOjOFsDe+y0VBr8ysE
 ve39t/yp+mYalNmHVFl8s3dBDgrIeGKiz+Wgw3zCQIBZ18rJE1dREhv37RlYZ3a2
 wSvuObe8sVpCTyKIowDs1xUi7qJUBoopMSuqfleSHawRrcgCpV99U8/KNFF5plLH
 7fXOBAHHniVCVc+mqQN2wxtVJDhST+U3TaU4GwlKy9Yevr+iibdOXffveeIgNEU4
 D6q1F2zKp6UdV3+p8hxyaTTbiCVDqsp9WOgY/0I/f+CddYn0WVZgOlR+ow==
 =mYFL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:
  Generic:

   - Clean up locking of all vCPUs for a VM by using the *_nest_lock()
     family of functions, and move duplicated code to virt/kvm/. kernel/
     patches acked by Peter Zijlstra

   - Add MGLRU support to the access tracking perf test

  ARM fixes:

   - Make the irqbypass hooks resilient to changes in the GSI<->MSI
     routing, avoiding behind stale vLPI mappings being left behind. The
     fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
     and nuking the vLPI mapping upon a routing change

   - Close another VGIC race where vCPU creation races with VGIC
     creation, leading to in-flight vCPUs entering the kernel w/o
     private IRQs allocated

   - Fix a build issue triggered by the recently added workaround for
     Ampere's AC04_CPU_23 erratum

   - Correctly sign-extend the VA when emulating a TLBI instruction
     potentially targeting a VNCR mapping

   - Avoid dereferencing a NULL pointer in the VGIC debug code, which
     can happen if the device doesn't have any mapping yet

  s390:

   - Fix interaction between some filesystems and Secure Execution

   - Some cleanups and refactorings, preparing for an upcoming big
     series

  x86:

   - Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
     to fix a race between AP destroy and VMRUN

   - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for
     the VM

   - Refine and harden handling of spurious faults

   - Add support for ALLOWED_SEV_FEATURES

   - Add #VMGEXIT to the set of handlers special cased for
     CONFIG_RETPOLINE=y

   - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
     features that utilize those bits

   - Don't account temporary allocations in sev_send_update_data()

   - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
     Threshold

   - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
     IBPB, between SVM and VMX

   - Advertise support to userspace for WRMSRNS and PREFETCHI

   - Rescan I/O APIC routes after handling EOI that needed to be
     intercepted due to the old/previous routing, but not the
     new/current routing

   - Add a module param to control and enumerate support for device
     posted interrupts

   - Fix a potential overflow with nested virt on Intel systems running
     32-bit kernels

   - Flush shadow VMCSes on emergency reboot

   - Add support for SNP to the various SEV selftests

   - Add a selftest to verify fastops instructions via forced emulation

   - Refine and optimize KVM's software processing of the posted
     interrupt bitmap, and share the harvesting code between KVM and the
     kernel's Posted MSI handler"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
  rtmutex_api: provide correct extern functions
  KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
  KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
  KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
  KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
  KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
  KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
  KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
  arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
  KVM: s390: Simplify and move pv code
  KVM: s390: Refactor and split some gmap helpers
  KVM: s390: Remove unneeded srcu lock
  s390: Remove unneeded includes
  s390/uv: Improve splitting of large folios that cannot be split while dirty
  s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
  s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
  rust: add helper for mutex_trylock
  RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
  KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
  x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
  ...
2025-06-02 12:24:58 -07:00
Linus Torvalds
43db111107 ARM:
* Add large stage-2 mapping (THP) support for non-protected guests when
   pKVM is enabled, clawing back some performance.
 
 * Enable nested virtualisation support on systems that support it,
   though it is disabled by default.
 
 * Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
   protected modes.
 
 * Large rework of the way KVM tracks architecture features and links
   them with the effects of control bits. While this has no functional
   impact, it ensures correctness of emulation (the data is automatically
   extracted from the published JSON files), and helps dealing with the
   evolution of the architecture.
 
 * Significant changes to the way pKVM tracks ownership of pages,
   avoiding page table walks by storing the state in the hypervisor's
   vmemmap. This in turn enables the THP support described above.
 
 * New selftest checking the pKVM ownership transition rules
 
 * Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
   even if the host didn't have it.
 
 * Fixes for the address translation emulation, which happened to be
   rather buggy in some specific contexts.
 
 * Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
   from the number of counters exposed to a guest and addressing a
   number of issues in the process.
 
 * Add a new selftest for the SVE host state being corrupted by a
   guest.
 
 * Keep HCR_EL2.xMO set at all times for systems running with the
   kernel at EL2, ensuring that the window for interrupts is slightly
   bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
 
 * Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
   from a pretty bad case of TLB corruption unless accesses to HCR_EL2
   are heavily synchronised.
 
 * Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
   tables in a human-friendly fashion.
 
 * and the usual random cleanups.
 
 LoongArch:
 
 * Don't flush tlb if the host supports hardware page table walks.
 
 * Add KVM selftests support.
 
 RISC-V:
 
 * Add vector registers to get-reg-list selftest
 
 * VCPU reset related improvements
 
 * Remove scounteren initialization from VCPU reset
 
 * Support VCPU reset from userspace using set_mpstate() ioctl
 
 x86:
 
 * Initial support for TDX in KVM.  This finally makes it possible to use the
   TDX module to run confidential guests on Intel processors.  This is quite a
   large series, including support for private page tables (managed by the
   TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs
   to userspace, and handling several special VM exits from the TDX module.
 
   This has been in the works for literally years and it's not really possible
   to describe everything here, so I'll defer to the various merge commits
   up to and including commit 7bcf7246c4 ("Merge branch 'kvm-tdx-finish-initial'
   into HEAD").
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg02hwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNnkwf/db4xeWKSMseCIvBVR+ObDn3LXhwT
 hAgmTkDkP1zq9RfbfJSbUA1DXRwfP+f1sWySLMWECkFEQW9fGIJF9fOQRDSXKmhX
 158U3+FEt+3jxLRCGFd4zyXAqyY3C8JSkPUyJZxCpUbXtB5tdDNac4rZAXKDULwe
 sUi0OW/kFDM2yt369pBGQAGdN+75/oOrYISGOSvMXHxjccNqvveX8MUhpBjYIuuj
 73iBWmsfv3vCtam56Racz3C3v44ie498PmWFtnB0R+CVfWfrnUAaRiGWx+egLiBW
 dBPDiZywMn++prmphEUFgaStDTQy23JBLJ8+RvHkp+o5GaTISKJB3nedZQ==
 =adZU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "As far as x86 goes this pull request "only" includes TDX host support.

  Quotes are appropriate because (at 6k lines and 100+ commits) it is
  much bigger than the rest, which will come later this week and
  consists mostly of bugfixes and selftests. s390 changes will also come
  in the second batch.

  ARM:

   - Add large stage-2 mapping (THP) support for non-protected guests
     when pKVM is enabled, clawing back some performance.

   - Enable nested virtualisation support on systems that support it,
     though it is disabled by default.

   - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
     and protected modes.

   - Large rework of the way KVM tracks architecture features and links
     them with the effects of control bits. While this has no functional
     impact, it ensures correctness of emulation (the data is
     automatically extracted from the published JSON files), and helps
     dealing with the evolution of the architecture.

   - Significant changes to the way pKVM tracks ownership of pages,
     avoiding page table walks by storing the state in the hypervisor's
     vmemmap. This in turn enables the THP support described above.

   - New selftest checking the pKVM ownership transition rules

   - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
     even if the host didn't have it.

   - Fixes for the address translation emulation, which happened to be
     rather buggy in some specific contexts.

   - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
     from the number of counters exposed to a guest and addressing a
     number of issues in the process.

   - Add a new selftest for the SVE host state being corrupted by a
     guest.

   - Keep HCR_EL2.xMO set at all times for systems running with the
     kernel at EL2, ensuring that the window for interrupts is slightly
     bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

   - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
     from a pretty bad case of TLB corruption unless accesses to HCR_EL2
     are heavily synchronised.

   - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
     tables in a human-friendly fashion.

   - and the usual random cleanups.

  LoongArch:

   - Don't flush tlb if the host supports hardware page table walks.

   - Add KVM selftests support.

  RISC-V:

   - Add vector registers to get-reg-list selftest

   - VCPU reset related improvements

   - Remove scounteren initialization from VCPU reset

   - Support VCPU reset from userspace using set_mpstate() ioctl

  x86:

   - Initial support for TDX in KVM.

     This finally makes it possible to use the TDX module to run
     confidential guests on Intel processors. This is quite a large
     series, including support for private page tables (managed by the
     TDX module and mirrored in KVM for efficiency), forwarding some
     TDVMCALLs to userspace, and handling several special VM exits from
     the TDX module.

     This has been in the works for literally years and it's not really
     possible to describe everything here, so I'll defer to the various
     merge commits up to and including commit 7bcf7246c4 ('Merge
     branch 'kvm-tdx-finish-initial' into HEAD')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
  x86/tdx: mark tdh_vp_enter() as __flatten
  Documentation: virt/kvm: remove unreferenced footnote
  RISC-V: KVM: lock the correct mp_state during reset
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: Remove scounteren initialization
  KVM: RISC-V: remove unnecessary SBI reset state
  ...
2025-05-29 08:10:01 -07:00
Paolo Bonzini
8e86e73626 Merge branch 'kvm-lockdep-common' into HEAD
Introduce new mutex locking functions mutex_trylock_nest_lock() and
mutex_lock_killable_nest_lock() and use them to clean up locking
of all vCPUs for a VM.

For x86, this removes some complex code that was used instead
of lockdep's "nest_lock" feature.

For ARM and RISC-V, this removes a lockdep warning when the VM is
configured to have more than MAX_LOCK_DEPTH vCPUs, and removes a fair
amount of duplicate code by sharing the logic across all architectures.

Signed-off-by: Paolo BOnzini <pbonzini@redhat.com>
2025-05-28 06:29:17 -04:00
Maxim Levitsky
c560bc9286 x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
Use kvm_lock_all_vcpus instead of sev's own implementation.

Because kvm_lock_all_vcpus uses the _nest_lock feature of lockdep, which
ignores subclasses, there is no longer a need to use separate subclasses
for source and target VMs.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:16:41 -04:00
Paolo Bonzini
4e02d4f973 KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
    fix a race between AP destroy and VMRUN.
 
  - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 
  - Add support for ALLOWED_SEV_FEATURES.
 
  - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 
  - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
    that utilize those bits.
 
  - Don't account temporary allocations in sev_send_update_data().
 
  - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj
 N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx
 8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y
 jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5
 RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9
 TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU
 PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb
 7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI
 e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS
 2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ
 hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK
 HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE=
 =k6rK
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.16:

 - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
   fix a race between AP destroy and VMRUN.

 - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.

 - Add support for ALLOWED_SEV_FEATURES.

 - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.

 - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
   that utilize those bits.

 - Don't account temporary allocations in sev_send_update_data().

 - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-27 12:15:49 -04:00
Paolo Bonzini
3e89d5fdc7 KVM VMX changes for 6.16:
- Explicitly check MSR load/store list counts to fix a potential overflow on
    32-bit kernels.
 
  - Flush shadow VMCSes on emergency reboot.
 
  - Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
    SEV or TDX is disabled via Kconfig.
 
  - Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
    needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwnAwACgkQOlYIJqCj
 N/1J6Q//ZxmA443SLutx1z1F6jLxRLIh4tvOUyWPfyTIedEHXgx9btIXC9ws+Iue
 lgsa7Pto+JB0qP4XtoVS+gBrOeF2ZWwp4kyuFM/07tX67q6NKj7mTML/vXcelJNw
 bvWA3lqeNW/KZd0KCByOGeOUdfAvruXFrX0yN9/VKGZladcZaj4EGwxcS7prPCzM
 YNlRlaRsD4jB4IsmaIvWZ4sj/TNVefTybUvEgAsK6rz2bfYbYzu73ITRng7VHrRI
 u+NS3TyIoMbnqFpE8mvGJ/2dvPwvn8u6IzHLsxQRm/TKeEDlvKZxTWMncHqA6i8K
 MT79onhT1QHfaMCx6/0DjLJGGoW/9/Q8k4XNkb80Vh7e+KlqkglPYXhUW1AYs0le
 7dXkCF04xDlj5E3I971HclX5Dbiec2fLKyOiYnEUSJgPIchPJ802GYg1TOqSyHZY
 g7+QS9RKE7iXqQ2+Df7cfXbSTrpmjVmzPK4nTFUZwbcnYhNN8qo1SYo5nv89+Myr
 EOdvsk0GzwgCbIbSPejIsrHFUE65iWGP7NcoBtXa7i34inGCgvElPHhXbDPu9xkU
 9AZJ+iz3oegrDuARYJI/QSl3hciD31fX8yHKnpz7ccIYGBbQ5w3hNhpREM9X7fL7
 pg9dv/oPmTla+NuuV+9fPXdYq/7p8CuwYyFis6DrTWeyKCOQSyE=
 =7W6U
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.16:

 - Explicitly check MSR load/store list counts to fix a potential overflow on
   32-bit kernels.

 - Flush shadow VMCSes on emergency reboot.

 - Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
   SEV or TDX is disabled via Kconfig.

 - Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
   needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
2025-05-27 12:15:38 -04:00
Paolo Bonzini
db44dcbdf8 KVM x86 posted interrupt changes for 6.16:
Refine and optimize KVM's software processing of the PIR, and ultimately share
 PIR harvesting code between KVM and the kernel's Posted MSI handler
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmWcACgkQOlYIJqCj
 N/3mUw/9HN4OLRqFytu+GjEocl8I7JelJdwCsNMsUwZRnNVnYGDqsjvw8rzqeFmx
 RoQ8uNqMd1PqZOgAdN6suLES949ItErbnG2+UlBvZeNgR63K8fyNJaPUzSXh0Kyd
 vNNzGschI0txZXNEtMHcIsCuQknU/arlE6v+HOAokb1jxaIZH2h06vrBAj6pLAHO
 hbcZPkaQEaFoQhqCbYm015ecJQRPv3IZoW7H1cK5nC4q6QdNo3LPfGqUJwgHV3Wq
 hbfS+2J78nTqLhSn7HHE/y5z3R5+ZyPwFQwbqfvjjap5/DW5w8Tltg2Oif597lf2
 klBukBkJyfzSdhjaPKb3V23kCNabNyyX7KUDZnW5HCiEu62Lnl0MexXCvFvSvtmy
 YDSsXMg3KdtlESwUOaxGjd2J81tx36L3ZvWRaopDLzA2A6KVyVQCSANGOGkKrRzq
 Qq3R/frzp1uUVpVDtdyDIO1AujoXkRecdOj1uAIr2XQBg8jx0kveAUyrkXFbQVjK
 oNbfRlOiu6/vnXkWqwZ2w/Q0kRRrK7M+vensOZlculqDqxPH+BLWB+dfPqjGikb/
 cL01KPu6n/GQJpwAxIbGU4eUIQPAVOcHm3iRaIlRqEoDCs7C8fTRIyDx+cD1vW8O
 O9j/r05EV/Ck5XF2ks6bHIK+C3wemNrCvoeFbnO1uicqtdO+Tqw=
 =dU1G
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM x86 posted interrupt changes for 6.16:

Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
2025-05-27 12:15:01 -04:00
Paolo Bonzini
5d816c1399 KVM x86 MMU changes for 6.16:
- Refine and harden handling of spurious faults.
 
  - Use kvm_x86_call() instead of open coding static_call().
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmO0ACgkQOlYIJqCj
 N/050BAAhl1u8bFGhb4SYOjeZ7QBgqhbzm2kvODQ7nnq3lQdBh5F3PYcBqX05dNo
 ntT1w5g8LOoFQWQ0rHrCDxEraBecYI4GWtuvC567AtmvibTn6GKJ1drh3ekNKD0b
 3Dr674PgOzTdXWmD74tVSa3VCQ2c0jvkkueF7aeAAlX20efkjqAiQauF2c0PFBrY
 eqPqICOoecDxlQ6O7wY5jxkxBBurk60a4TBByzp2fCvvkAOdt6cYt42SPjd3dnXu
 Hi3a2nIIYyTfF0jvXcbtvGzzOac1YbGP1wBGA3uiyCkMz7ABZfHoOD5O6+rzUYml
 9j9Um4phbHgvvRRDfL6rv2rBFI6hs9c4NzAMv/gcgblJiNhbg1A4Hu3ykDNsHpUN
 nO1/GrCkHKXITU1za2Z6rf4X/QouBGVHphODB/TzLbyXV2xw1IU1TveWgvWT5yVF
 mZMWvpYJaGtOX/imMNxPwVcrX75HkSD1DU2r1CfV5CUfd87wIp6HkSBvqUneFScN
 v7S5OaWGce6/EOoI3bq2Dp+hJS3ZnNmXzknK5PHB+Nb/JuXVNIW/Meol85d+z7j9
 A/I11shjNV6RwuFfpvj6yGQUtKU5mSkw4D4eUIA57K2HtxLVMuhxis45X2O8plDI
 pMU4KLXmq6nNU0MLTxNWMExXIQ1MWR35Ak/Q3JYdUVLXe6UIM9M=
 =ayRv
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.16:

 - Refine and harden handling of spurious faults.

 - Use kvm_x86_call() instead of open coding static_call().
2025-05-27 12:14:47 -04:00
Paolo Bonzini
ebd38b26ec KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
    SVM and VMX.
 
  - Advertise support to userspace for WRMSRNS and PREFETCHI.
 
  - Rescan I/O APIC routes after handling EOI that needed to be intercepted due
    to the old/previous routing, but not the new/current routing.
 
  - Add a module param to control and enumerate support for device posted
    interrupts.
 
  - Misc cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmHsACgkQOlYIJqCj
 N/2fYhAAiwKkqQpOWLcGjjezDTnpMqDDbHCffroq0Ttmqfg/cuul1oyZax+9fxBO
 203HUi5VKcG7uAGSpLcMFkPUs9hKnaln2lsDaQD+AnGucdj+JKF5p3INCsSYCo9N
 LVRjRZWtZocxJwHSHX9gU8om0pJ5fBCBG2+7+7XhWRaIqCpJe5k944JotiiOkgZ4
 5sXeITkN2kouFVMI8eD4wQGNXxRxs837SYUlwCnoD3VuuBesOZuEhz/CEL9l8vNY
 keXBLPg7bSW53clKfquNKwXDQRephnZaYoexDebUd+OlZphGhTIPh+C75xPQLWSi
 aYg6W9XDu3TChf4LPxHnJLwLg/rjeKNQARcxrnb3XLpPAtx3i2cKU8pDPhnd4qn0
 +YV5H0dato8bbe+oClGv+oIolM01qfI9SJVoaEhTPu3Rdw9cCQSVFn5t32vG3Vab
 FVxX+seV3+XTmVveD4cjiiMbqtNADwZ/PmHNAi9QCl46DgHR++MLfRtjYuGo1koL
 QOmCg2fWOFYtQT6XPJqZxp1SHYxuawrB4qcO9FNyxTuMChoslYoLAr3mBUj0DvwL
 fdXNof74Ccj8OK9o3uCPOXS92pZz92rz/edy/XmYiCmO+VEwBJFR6IdginZMPX11
 UAc4mAC+KkDTvOcKPPIWEXArOsfQMFKefi+bPOeWUx9/nqpcVws=
 =Vs9P
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.16:

 - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
   SVM and VMX.

 - Advertise support to userspace for WRMSRNS and PREFETCHI.

 - Rescan I/O APIC routes after handling EOI that needed to be intercepted due
   to the old/previous routing, but not the new/current routing.

 - Add a module param to control and enumerate support for device posted
   interrupts.

 - Misc cleanups.
2025-05-27 12:14:36 -04:00
Edward Adam Davis
cd1be30b44 KVM: VMX: use __always_inline for is_td_vcpu and is_td
is_td() and is_td_vcpu() are used in no-instrumentation sections; use
__always_inline instead of inline.

vmlinux.o: error: objtool: vmx_handle_nmi+0x47:
        call to is_td_vcpu.isra.0() leaves .noinstr.text section

Fixes: 7172c753c2 ("KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Message-ID: <tencent_1A767567C83C1137829622362E4A72756F09@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:12:49 -04:00
Linus Torvalds
785cdec46e Core x86 updates for v6.16:
Boot code changes:
 
  - A large series of changes to reorganize the x86 boot code into a better isolated
    and easier to maintain base of PIC early startup code in arch/x86/boot/startup/,
    by Ard Biesheuvel.
 
    Motivation & background:
 
 	| Since commit
 	|
 	|    c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
 	|
 	| dated Jun 6 2017, we have been using C code on the boot path in a way
 	| that is not supported by the toolchain, i.e., to execute non-PIC C
 	| code from a mapping of memory that is different from the one provided
 	| to the linker. It should have been obvious at the time that this was a
 	| bad idea, given the need to sprinkle fixup_pointer() calls left and
 	| right to manipulate global variables (including non-pointer variables)
 	| without crashing.
 	|
 	| This C startup code has been expanding, and in particular, the SEV-SNP
 	| startup code has been expanding over the past couple of years, and
 	| grown many of these warts, where the C code needs to use special
 	| annotations or helpers to access global objects.
 
    This tree includes the first phase of this work-in-progress x86 boot code
    reorganization.
 
 Scalability enhancements and micro-optimizations:
 
  - Improve code-patching scalability (Eric Dumazet)
  - Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper)
 
 CPU features enumeration updates:
 
  - Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S. Darwish)
  - Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish, Thomas Gleixner)
  - Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish)
 
 Memory management changes:
 
  - Allow temporary MMs when IRQs are on (Andy Lutomirski)
  - Opt-in to IRQs-off activate_mm() (Andy Lutomirski)
  - Simplify choose_new_asid() and generate better code (Borislav Petkov)
  - Simplify 32-bit PAE page table handling (Dave Hansen)
  - Always use dynamic memory layout (Kirill A. Shutemov)
  - Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov)
  - Make 5-level paging support unconditional (Kirill A. Shutemov)
  - Stop prefetching current->mm->mmap_lock on page faults (Mateusz Guzik)
  - Predict valid_user_address() returning true (Mateusz Guzik)
  - Consolidate initmem_init() (Mike Rapoport)
 
 FPU support and vector computing:
 
  - Enable Intel APX support (Chang S. Bae)
  - Reorgnize and clean up the xstate code (Chang S. Bae)
  - Make task_struct::thread constant size (Ingo Molnar)
  - Restore fpu_thread_struct_whitelist() to fix CONFIG_HARDENED_USERCOPY=y
    (Kees Cook)
  - Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg Nesterov)
  - Always preserve non-user xfeatures/flags in __state_perm (Sean Christopherson)
 
 Microcode loader changes:
 
  - Help users notice when running old Intel microcode (Dave Hansen)
  - AMD: Do not return error when microcode update is not necessary (Annie Li)
  - AMD: Clean the cache if update did not load microcode (Boris Ostrovsky)
 
 Code patching (alternatives) changes:
 
  - Simplify, reorganize and clean up the x86 text-patching code (Ingo Molnar)
  - Make smp_text_poke_batch_process() subsume smp_text_poke_batch_finish()
    (Nikolay Borisov)
  - Refactor the {,un}use_temporary_mm() code (Peter Zijlstra)
 
 Debugging support:
 
  - Add early IDT and GDT loading to debug relocate_kernel() bugs (David Woodhouse)
  - Print the reason for the last reset on modern AMD CPUs (Yazen Ghannam)
  - Add AMD Zen debugging document (Mario Limonciello)
  - Fix opcode map (!REX2) superscript tags (Masami Hiramatsu)
  - Stop decoding i64 instructions in x86-64 mode at opcode (Masami Hiramatsu)
 
 CPU bugs and bug mitigations:
 
  - Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov)
  - Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov)
  - Restructure and harmonize the various CPU bug mitigation methods
    (David Kaplan)
  - Fix spectre_v2 mitigation default on Intel (Pawan Gupta)
 
 MSR API:
 
  - Large MSR code and API cleanup (Xin Li)
  - In-kernel MSR API type cleanups and renames (Ingo Molnar)
 
 PKEYS:
 
  - Simplify PKRU update in signal frame (Chang S. Bae)
 
 NMI handling code:
 
  - Clean up, refactor and simplify the NMI handling code (Sohil Mehta)
  - Improve NMI duration console printouts (Sohil Mehta)
 
 Paravirt guests interface:
 
  - Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov)
 
 SEV support:
 
  - Share the sev_secrets_pa value again (Tom Lendacky)
 
 x86 platform changes:
 
  - Introduce the <asm/amd/> header namespace (Ingo Molnar)
  - i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>
    (Mario Limonciello)
 
 Fixes and cleanups:
 
  - x86 assembly code cleanups and fixes (Uros Bizjak)
 
  - Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy Shevchenko,
    Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav Petkov, Chang S. Bae,
    Chao Gao, Dan Williams, Dave Hansen, David Kaplan, David Woodhouse,
    Eric Biggers, Ingo Molnar, Josh Poimboeuf, Juergen Gross, Malaya Kumar Rout,
    Mario Limonciello, Nathan Chancellor, Oleg Nesterov, Pawan Gupta,
    Peter Zijlstra, Shivank Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak,
    Xin Li)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy9WARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jJSw/+OW2zvAx602doujBIE17vFLU7R10Xwj5H
 lVgomkWCoTNscUZPhdT/iI+/kQF1fG8PtN9oZKUsTAUswldKJsqu7KevobviesiW
 qI+FqH/fhHaIk7GVh9VP65Dgrdki8zsgd7BFxD8pLRBlbZTxTxXNNkuNJrs6LxJh
 SxWp/FVtKo6Wd57qlUcsdo0tilAfcuhlEweFUarX55X2ouhdeHjcGNpxj9dHKOh8
 M7R5yMYFrpfdpSms+WaCnKKahWHaIQtQTsPAyKwoVdtfl1kK+7NgaCF55Gbo3ogp
 r59JwC/CGruDa5QnnDizCwFIwpZw9M52Q1NhP/eLEZbDGB4Yya3b5NW+Ya+6rPvO
 ZZC3e1uUmlxW3lrYflUHurnwrVb2GjkQZOdf0gfnly/7LljIicIS2dk4qIQF9NBd
 sQPpW5hjmIz9CsfeL8QaJW38pQyMsQWznFuz4YVuHcLHvleb3hR+n4fNfV5Lx9bw
 oirVETSIT5hy/msAgShPqTqFUEiVCgp16ow20YstxxzFu/FQ+VG987tkeUyFkPMe
 q1v5yF1hty+TkM4naKendIZ/MJnsrv0AxaegFz9YQrKGL1UPiOajQbSyKbzbto7+
 ozmtN0W80E8n4oQq008j8htpgIhDV91UjF5m33qB82uSqKihHPPTsVcbeg5nZwh2
 ti5g/a1jk94=
 =JgQo
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:
 "Boot code changes:

   - A large series of changes to reorganize the x86 boot code into a
     better isolated and easier to maintain base of PIC early startup
     code in arch/x86/boot/startup/, by Ard Biesheuvel.

     Motivation & background:

  	| Since commit
  	|
  	|    c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
  	|
  	| dated Jun 6 2017, we have been using C code on the boot path in a way
  	| that is not supported by the toolchain, i.e., to execute non-PIC C
  	| code from a mapping of memory that is different from the one provided
  	| to the linker. It should have been obvious at the time that this was a
  	| bad idea, given the need to sprinkle fixup_pointer() calls left and
  	| right to manipulate global variables (including non-pointer variables)
  	| without crashing.
  	|
  	| This C startup code has been expanding, and in particular, the SEV-SNP
  	| startup code has been expanding over the past couple of years, and
  	| grown many of these warts, where the C code needs to use special
  	| annotations or helpers to access global objects.

     This tree includes the first phase of this work-in-progress x86
     boot code reorganization.

  Scalability enhancements and micro-optimizations:

   - Improve code-patching scalability (Eric Dumazet)

   - Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper)

  CPU features enumeration updates:

   - Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S.
     Darwish)

   - Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish,
     Thomas Gleixner)

   - Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish)

  Memory management changes:

   - Allow temporary MMs when IRQs are on (Andy Lutomirski)

   - Opt-in to IRQs-off activate_mm() (Andy Lutomirski)

   - Simplify choose_new_asid() and generate better code (Borislav
     Petkov)

   - Simplify 32-bit PAE page table handling (Dave Hansen)

   - Always use dynamic memory layout (Kirill A. Shutemov)

   - Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov)

   - Make 5-level paging support unconditional (Kirill A. Shutemov)

   - Stop prefetching current->mm->mmap_lock on page faults (Mateusz
     Guzik)

   - Predict valid_user_address() returning true (Mateusz Guzik)

   - Consolidate initmem_init() (Mike Rapoport)

  FPU support and vector computing:

   - Enable Intel APX support (Chang S. Bae)

   - Reorgnize and clean up the xstate code (Chang S. Bae)

   - Make task_struct::thread constant size (Ingo Molnar)

   - Restore fpu_thread_struct_whitelist() to fix
     CONFIG_HARDENED_USERCOPY=y (Kees Cook)

   - Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg
     Nesterov)

   - Always preserve non-user xfeatures/flags in __state_perm (Sean
     Christopherson)

  Microcode loader changes:

   - Help users notice when running old Intel microcode (Dave Hansen)

   - AMD: Do not return error when microcode update is not necessary
     (Annie Li)

   - AMD: Clean the cache if update did not load microcode (Boris
     Ostrovsky)

  Code patching (alternatives) changes:

   - Simplify, reorganize and clean up the x86 text-patching code (Ingo
     Molnar)

   - Make smp_text_poke_batch_process() subsume
     smp_text_poke_batch_finish() (Nikolay Borisov)

   - Refactor the {,un}use_temporary_mm() code (Peter Zijlstra)

  Debugging support:

   - Add early IDT and GDT loading to debug relocate_kernel() bugs
     (David Woodhouse)

   - Print the reason for the last reset on modern AMD CPUs (Yazen
     Ghannam)

   - Add AMD Zen debugging document (Mario Limonciello)

   - Fix opcode map (!REX2) superscript tags (Masami Hiramatsu)

   - Stop decoding i64 instructions in x86-64 mode at opcode (Masami
     Hiramatsu)

  CPU bugs and bug mitigations:

   - Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov)

   - Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov)

   - Restructure and harmonize the various CPU bug mitigation methods
     (David Kaplan)

   - Fix spectre_v2 mitigation default on Intel (Pawan Gupta)

  MSR API:

   - Large MSR code and API cleanup (Xin Li)

   - In-kernel MSR API type cleanups and renames (Ingo Molnar)

  PKEYS:

   - Simplify PKRU update in signal frame (Chang S. Bae)

  NMI handling code:

   - Clean up, refactor and simplify the NMI handling code (Sohil Mehta)

   - Improve NMI duration console printouts (Sohil Mehta)

  Paravirt guests interface:

   - Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov)

  SEV support:

   - Share the sev_secrets_pa value again (Tom Lendacky)

  x86 platform changes:

   - Introduce the <asm/amd/> header namespace (Ingo Molnar)

   - i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to
     <asm/amd/fch.h> (Mario Limonciello)

  Fixes and cleanups:

   - x86 assembly code cleanups and fixes (Uros Bizjak)

   - Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy
     Shevchenko, Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav
     Petkov, Chang S. Bae, Chao Gao, Dan Williams, Dave Hansen, David
     Kaplan, David Woodhouse, Eric Biggers, Ingo Molnar, Josh Poimboeuf,
     Juergen Gross, Malaya Kumar Rout, Mario Limonciello, Nathan
     Chancellor, Oleg Nesterov, Pawan Gupta, Peter Zijlstra, Shivank
     Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak, Xin Li)"

* tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (331 commits)
  x86/bugs: Fix spectre_v2 mitigation default on Intel
  x86/bugs: Restructure ITS mitigation
  x86/xen/msr: Fix uninitialized variable 'err'
  x86/msr: Remove a superfluous inclusion of <asm/asm.h>
  x86/paravirt: Restrict PARAVIRT_XXL to 64-bit only
  x86/mm/64: Make 5-level paging support unconditional
  x86/mm/64: Make SPARSEMEM_VMEMMAP the only memory model
  x86/mm/64: Always use dynamic memory layout
  x86/bugs: Fix indentation due to ITS merge
  x86/cpuid: Rename hypervisor_cpuid_base()/for_each_possible_hypervisor_cpuid_base() to cpuid_base_hypervisor()/for_each_possible_cpuid_base_hypervisor()
  x86/cpu/intel: Rename CPUID(0x2) descriptors iterator parameter
  x86/cacheinfo: Rename CPUID(0x2) descriptors iterator parameter
  x86/cpuid: Rename cpuid_get_leaf_0x2_regs() to cpuid_leaf_0x2()
  x86/cpuid: Rename have_cpuid_p() to cpuid_feature()
  x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header
  x86/cpuid: Move CPUID(0x2) APIs into <cpuid/api.h>
  x86/msr: Add rdmsrl_on_cpu() compatibility wrapper
  x86/mm: Fix kernel-doc descriptions of various pgtable methods
  x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too
  x86/boot: Defer initialization of VM space related global variables
  ...
2025-05-26 16:04:17 -07:00
Linus Torvalds
14418ddcc2 This update includes the following changes:
API:
 
 - Fix memcpy_sglist to handle partially overlapping SG lists.
 - Use memcpy_sglist to replace null skcipher.
 - Rename CRYPTO_TESTS to CRYPTO_BENCHMARK.
 - Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS.
 - Hide CRYPTO_MANAGER.
 - Add delayed freeing of driver crypto_alg structures.
 
 Compression:
 
 - Allocate large buffers on first use instead of initialisation in scomp.
 - Drop destination linearisation buffer in scomp.
 - Move scomp stream allocation into acomp.
 - Add acomp scatter-gather walker.
 - Remove request chaining.
 - Add optional async request allocation.
 
 Hashing:
 
 - Remove request chaining.
 - Add optional async request allocation.
 - Move partial block handling into API.
 - Add ahash support to hmac.
 - Fix shash documentation to disallow usage in hard IRQs.
 
 Algorithms:
 
 - Remove unnecessary SIMD fallback code on x86 and arm/arm64.
 - Drop avx10_256 xts(aes)/ctr(aes) on x86.
 - Improve avx-512 optimisations for xts(aes).
 - Move chacha arch implementations into lib/crypto.
 - Move poly1305 into lib/crypto and drop unused Crypto API algorithm.
 - Disable powerpc/poly1305 as it has no SIMD fallback.
 - Move sha256 arch implementations into lib/crypto.
 - Convert deflate to acomp.
 - Set block size correctly in cbcmac.
 
 Drivers:
 
 - Do not use sg_dma_len before mapping in sun8i-ss.
 - Fix warm-reboot failure by making shutdown do more work in qat.
 - Add locking in zynqmp-sha.
 - Remove cavium/zip.
 - Add support for PCI device 0x17D8 to ccp.
 - Add qat_6xxx support in qat.
 - Add support for RK3576 in rockchip-rng.
 - Add support for i.MX8QM in caam.
 
 Others:
 
 - Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up.
 - Add new SEV/SNP platform shutdown API in ccp.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmgz47AACgkQxycdCkmx
 i6fvKRAAr4Xa903L0r1Q1P1alQqoFFCqimUWeH72m68LiWynHWi0lUo0z/+tKweg
 mnPStz7/Ha9HRHJjdNCMPnlJqXQDkuH3bIOuBJCwduDuhHo9VGOd46XGzmGMv3gb
 HKuZhI0lk7pznK3CSyD/2nHmbDCHD+7feTZSBMoN9mm875+aSoM6fdxgak8uPFcq
 KbB1L+hObTn2kAPSqRrNOR8/xG2N7hdH8eax7Li+LAtqYNVT5HvWVECsB/CKRPfB
 sgAv3UTzcIFapSSHUHaONppSeoqPAIAeV7SdQhJvlT+EUUR/h/B6+D9OUQQqbphQ
 LBalgTnqMKl0ymDEQFQ6QyYCat9ZfNmDft2WcXEsxc8PxImkgJI1W3B8O51sOjbG
 78D8JqVQ96dleo4FsBhM2wfG0b41JM6zU4raC4vS7a3qsUS+Q1MpehvcS1iORicy
 SpGdE8e7DLlxKhzWyW1xJnbrtMZDC7Sa2hUnxrvP0/xOvRhChKscRVtWcf0a5q7X
 8JmuvwVSOJuSbQ3MeFbQvpo5lR9+0WsNjM6e9miiH6Y7vZUKmWcq2yDp377qVzeh
 7NK6+OwGIQZZExrmtPw2BXwssT9Eg+ks6Y7g2Ne7yzvrjVNfEPY7Cws/5w7p8mRS
 qhrcpbJNFlWgD7YYkmGZFTQ8DCN25ipP8lklO/hbcfchqLE/o1o=
 =O8L5
 -----END PGP SIGNATURE-----

Merge tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Fix memcpy_sglist to handle partially overlapping SG lists
   - Use memcpy_sglist to replace null skcipher
   - Rename CRYPTO_TESTS to CRYPTO_BENCHMARK
   - Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS
   - Hide CRYPTO_MANAGER
   - Add delayed freeing of driver crypto_alg structures

  Compression:
   - Allocate large buffers on first use instead of initialisation in scomp
   - Drop destination linearisation buffer in scomp
   - Move scomp stream allocation into acomp
   - Add acomp scatter-gather walker
   - Remove request chaining
   - Add optional async request allocation

  Hashing:
   - Remove request chaining
   - Add optional async request allocation
   - Move partial block handling into API
   - Add ahash support to hmac
   - Fix shash documentation to disallow usage in hard IRQs

  Algorithms:
   - Remove unnecessary SIMD fallback code on x86 and arm/arm64
   - Drop avx10_256 xts(aes)/ctr(aes) on x86
   - Improve avx-512 optimisations for xts(aes)
   - Move chacha arch implementations into lib/crypto
   - Move poly1305 into lib/crypto and drop unused Crypto API algorithm
   - Disable powerpc/poly1305 as it has no SIMD fallback
   - Move sha256 arch implementations into lib/crypto
   - Convert deflate to acomp
   - Set block size correctly in cbcmac

  Drivers:
   - Do not use sg_dma_len before mapping in sun8i-ss
   - Fix warm-reboot failure by making shutdown do more work in qat
   - Add locking in zynqmp-sha
   - Remove cavium/zip
   - Add support for PCI device 0x17D8 to ccp
   - Add qat_6xxx support in qat
   - Add support for RK3576 in rockchip-rng
   - Add support for i.MX8QM in caam

  Others:
   - Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up
   - Add new SEV/SNP platform shutdown API in ccp"

* tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (382 commits)
  x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
  crypto: qat - add missing header inclusion
  crypto: api - Redo lookup on EEXIST
  Revert "crypto: testmgr - Add hash export format testing"
  crypto: marvell/cesa - Do not chain submitted requests
  crypto: powerpc/poly1305 - add depends on BROKEN for now
  Revert "crypto: powerpc/poly1305 - Add SIMD fallback"
  crypto: ccp - Add missing tee info reg for teev2
  crypto: ccp - Add missing bootloader info reg for pspv5
  crypto: sun8i-ce - move fallback ahash_request to the end of the struct
  crypto: octeontx2 - Use dynamic allocated memory region for lmtst
  crypto: octeontx2 - Initialize cptlfs device info once
  crypto: xts - Only add ecb if it is not already there
  crypto: lrw - Only add ecb if it is not already there
  crypto: testmgr - Add hash export format testing
  crypto: testmgr - Use ahash for generic tfm
  crypto: hmac - Add ahash support
  crypto: testmgr - Ignore EEXIST on shash allocation
  crypto: algapi - Add driver template support to crypto_inst_setname
  crypto: shash - Set reqsize in shash_alg
  ...
2025-05-26 13:47:28 -07:00
Paolo Bonzini
85502b2214 LoongArch KVM changes for v6.16
1. Don't flush tlb if HW PTW supported.
 2. Add LoongArch KVM selftests support.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmgsdO8WHGNoZW5odWFj
 YWlAa2VybmVsLm9yZwAKCRAChivD8uImejUWD/9zNaBSqTpqeAptRQ6qTKdrxYtN
 ZKJR9a8AQF5vMPD9dWoRr6iLaNt061GqBOKbhF5RGUVq6uIDfCkZbSbX1h6Ptgcz
 OwkJHbrZAu+Z31NSoYqbgPZnurSJ9oOUGsghqp3ecKEf0LptTVaKw4WDeKkKNOlq
 eJaUC1WJ1P4sSXhTlHKAl79Ds/1pza9iAgtKXothrh09PL48LYaWFpmiS0uQmKOD
 qwYVU/OJIUWn4ZOxyGdk6ZR0B+mJOwwoO1ILptRIeSk4oTKiE1HiIEqnhdnMrJFr
 DkRdwQ/jq7iH/CFkVybNzxgLqAqpGLJgPj5VakYPmp2scuWSESej9/o8wYrmn0y0
 QuDQah0LiRIcbUYRHLDkfKRKCndfJ5KCrpOiD5mZ8bMd2LF9hPnH5toCXb/ZEpsK
 plu9qUrQgXF8rkX/zCIvQrOp0kYdU8DMbhZsXymSVpEs1fwrCNp2/Z2PrMYLKGt+
 JT+65jVRJ67d1OdKw3DrWQA8Au0Ma1rgzX3oLDu8wnqAG7ULAJRVIkDMcxG7a5SQ
 P+DlIbEHC1U8Dw8hW+PFhfOl13M9p5s3EP5fy8q85UJ/5fJ41fCPBUOm/rfMQcci
 6/e+xwBKKkJ7oQ/fFKvJ+n6GbV6suV5FUxicYQjC43iiCKklRhG77uek/DK3c5um
 ZiTu+YapX9qbG54Q9w==
 =YM/u
 -----END PGP SIGNATURE-----

Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.16

1. Don't flush tlb if HW PTW supported.
2. Add LoongArch KVM selftests support.
2025-05-26 16:12:13 -04:00
Manali Shukla
89f9edf4c6 KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.

The biggest difference between the two features is that Threshold is
fault-like, whereas Detection is trap-like.  To allow the guest to make
forward progress, Threshold provides a per-VMCB counter which is
decremented every time a bus lock occurs, and a VM-Exit is triggered if
and only if the counter is '0'.

To provide Detection-like semantics, initialize the counter to '0', i.e.
exit on every bus lock, and when re-executing the guilty instruction, set
the counter to '1' to effectively step past the instruction.

Note, in the unlikely scenario that re-executing the instruction doesn't
trigger a bus lock, e.g. because the guest has changed memory types or
patched the guilty instruction, the bus lock counter will be left at '1',
i.e. the guest will be able to do a bus lock on a different instruction.
In a perfect world, KVM would ensure the counter is '0' if the guest has
made forward progress, e.g. if RIP has changed.  But trying to close that
hole would incur non-trivial complexity, for marginal benefit; the intent
of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks,
not to allow for precise detection of problematic guest code.  And, it's
simply not feasible to fully close the hole, e.g. if an interrupt arrives
before the original instruction can re-execute, the guest could step past
a different bus lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com
[sean: fix typo in comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-19 11:05:10 -07:00
Sean Christopherson
6a3d704959 KVM: x86/mmu: Use kvm_x86_call() instead of manual static_call()
Use KVM's preferred kvm_x86_call() wrapper to invoke static calls related
to mirror page tables.

No functional change intended.

Fixes: 77ac7079e6 ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Fixes: 94faba8999 ("KVM: x86/tdp_mmu: Propagate tearing down mirror page tables")
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250331182703.725214-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-16 13:13:58 -07:00
Manali Shukla
e9628b011b KVM: x86: Make kvm_pio_request.linear_rip a common field for user exits
Move and rename kvm_pio_request.linear_rip to
kvm_vcpu_arch.cui_linear_rip so that the field can be used by other
userspace exit completion flows that need to take action if and only
if userspace has not modified RIP.

No functional changes intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-2-manali.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-16 09:40:23 -07:00
Ahmed S. Darwish
3bf8ce8284 x86/cpuid: Rename hypervisor_cpuid_base()/for_each_possible_hypervisor_cpuid_base() to cpuid_base_hypervisor()/for_each_possible_cpuid_base_hypervisor()
In order to let all the APIs under <cpuid/api.h> have a shared "cpuid_"
namespace, rename hypervisor_cpuid_base() to cpuid_base_hypervisor().

To align with the new style, also rename:

    for_each_possible_hypervisor_cpuid_base(function)

to:

    for_each_possible_cpuid_base_hypervisor(function)

Adjust call-sites accordingly.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/aCZOi0Oohc7DpgTo@lx-t490
2025-05-16 10:54:47 +02:00
Ahmed S. Darwish
968e300068 x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header
The main CPUID header <asm/cpuid.h> was originally a storefront for the
headers:

    <asm/cpuid/api.h>
    <asm/cpuid/leaf_0x2_api.h>

Now that the latter CPUID(0x2) header has been merged into the former,
there is no practical difference between <asm/cpuid.h> and
<asm/cpuid/api.h>.

Migrate all users to the <asm/cpuid/api.h> header, in preparation of
the removal of <asm/cpuid.h>.

Don't remove <asm/cpuid.h> just yet, in case some new code in -next
started using it.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
2025-05-15 18:23:55 +02:00
Ingo Molnar
c4070e1996 Merge commit 'its-for-linus-20250509-merge' into x86/core, to resolve conflicts
Conflicts:
	Documentation/admin-guide/hw-vuln/index.rst
	arch/x86/include/asm/cpufeatures.h
	arch/x86/kernel/alternative.c
	arch/x86/kernel/cpu/bugs.c
	arch/x86/kernel/cpu/common.c
	drivers/base/cpu.c
	include/linux/cpu.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 10:47:10 +02:00
Ingo Molnar
1f82e8e1ca Merge branch 'x86/msr' into x86/core, to resolve conflicts
Conflicts:
	arch/x86/boot/startup/sme.c
	arch/x86/coco/sev/core.c
	arch/x86/kernel/fpu/core.c
	arch/x86/kernel/fpu/xstate.c

 Semantic conflict:
	arch/x86/include/asm/sev-internal.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 10:42:06 +02:00
Ingo Molnar
206c07d6ab Merge branch 'x86/bugs' into x86/core, to merge dependent commits
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:

  6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 10:35:14 +02:00
Linus Torvalds
6f5bf947ba * Mitigate Indirect Target Selection (ITS) issue
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmgebIwACgkQaDWVMHDJ
 krCGSA/+I+W/uqiz58Z2Zu4RrXMYFfKJxacF7My9wnOyRxaJduS3qrz1E5wHqBId
 f6M8wDx9nS24UxDkBbi84NdtlG1zj8nV8djtszGKVeqHG2DcQMMOXBKZSjOmTo2b
 GIZ3a3xEqXaFfnGQxXSZrvtHIwCmv10H2oyGHu0vBp/SJuWXNg72oivOGhbm0uWs
 0/bdIK8+1sW7OAmhhKdvMVpmzL8TQJnkUHSkQilPB2Tsf9wWDfeY7kDkK5YwQpk2
 ZK+hrmwCFXQZELY65F2+y/cFim/F38HiqVdvIkV1wFSVqVVE9hEKJ4BDZl1fXZKB
 p4qpDFgxO27E/eMo9IZfxRH4TdSoK6YLWo9FGWHKBPnciJfAeO9EP/AwAIhEQRdx
 YZlN9sGS6ja7O1Eh423BBw6cFj6ta0ck2T1PoYk32FXc6sgqCphsfvBD3+tJxz8/
 xoZ3BzoErdPqSXbH5cSI972kQW0JLESiMTZa827qnJtT672t6uBcsnnmR0ZbJH1f
 TJCC9qgwpBiEkiGW3gwv00SC7CkXo3o0FJw0pa3MkKHGd7csxBtGBHI1b6Jj+oB0
 yWf1HxSqwrq2Yek8R7lWd4jIxyWfKriEMTu7xCMUUFlprKmR2RufsADvqclNyedQ
 sGBCc4eu1cpZp2no/IFm+IvkuzUHnkS/WNL1LbZ9YI8h8unjZHE=
 =UVgZ
 -----END PGP SIGNATURE-----

Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 ITS mitigation from Dave Hansen:
 "Mitigate Indirect Target Selection (ITS) issue.

  I'd describe this one as a good old CPU bug where the behavior is
  _obviously_ wrong, but since it just results in bad predictions it
  wasn't wrong enough to notice. Well, the researchers noticed and also
  realized that thus bug undermined a bunch of existing indirect branch
  mitigations.

  Thus the unusually wide impact on this one. Details:

  ITS is a bug in some Intel CPUs that affects indirect branches
  including RETs in the first half of a cacheline. Due to ITS such
  branches may get wrongly predicted to a target of (direct or indirect)
  branch that is located in the second half of a cacheline. Researchers
  at VUSec found this behavior and reported to Intel.

  Affected processors:

   - Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet
     Lake, Ice Lake, Tiger Lake and Rocket Lake.

  Scope of impact:

   - Guest/host isolation:

     When eIBRS is used for guest/host isolation, the indirect branches
     in the VMM may still be predicted with targets corresponding to
     direct branches in the guest.

   - Intra-mode using cBPF:

     cBPF can be used to poison the branch history to exploit ITS.
     Realigning the indirect branches and RETs mitigates this attack
     vector.

   - User/kernel:

     With eIBRS enabled user/kernel isolation is *not* impacted by ITS.

   - Indirect Branch Prediction Barrier (IBPB):

     Due to this bug indirect branches may be predicted with targets
     corresponding to direct branches which were executed prior to IBPB.
     This will be fixed in the microcode.

  Mitigation:

  As indirect branches in the first half of cacheline are affected, the
  mitigation is to replace those indirect branches with a call to thunk that
  is aligned to the second half of the cacheline.

  RETs that take prediction from RSB are not affected, but they may be
  affected by RSB-underflow condition. So, RETs in the first half of
  cacheline are also patched to a return thunk that executes the RET aligned
  to second half of cacheline"

* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftest/x86/bugs: Add selftests for ITS
  x86/its: FineIBT-paranoid vs ITS
  x86/its: Use dynamic thunks for indirect branches
  x86/ibt: Keep IBT disabled during alternative patching
  mm/execmem: Unify early execmem_cache behaviour
  x86/its: Align RETs in BHB clear sequence to avoid thunking
  x86/its: Add support for RSB stuffing mitigation
  x86/its: Add "vmexit" option to skip mitigation on some CPUs
  x86/its: Enable Indirect Target Selection mitigation
  x86/its: Add support for ITS-safe return thunk
  x86/its: Add support for ITS-safe indirect thunk
  x86/its: Enumerate Indirect Target Selection (ITS) bug
  Documentation: x86/bugs/its: Add ITS documentation
2025-05-11 17:23:03 -07:00
Pawan Gupta
159013a7ca x86/its: Enumerate Indirect Target Selection (ITS) bug
ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the
first half of a cache line get predicted to a target of a branch located in
the second half of the cache line.

Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09 13:22:04 -07:00
Sean Christopherson
e3417ab75a KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions
Set the magic BP_SPEC_REDUCE bit to mitigate SRSO when running VMs if and
only if KVM has at least one active VM.  Leaving the bit set at all times
unfortunately degrades performance by a wee bit more than expected.

Use a dedicated spinlock and counter instead of hooking virtualization
enablement, as changing the behavior of kvm.enable_virt_at_load based on
SRSO_BP_SPEC_REDUCE is painful, and has its own drawbacks, e.g. could
result in performance issues for flows that are sensitive to VM creation
latency.

Defer setting BP_SPEC_REDUCE until VMRUN is imminent to avoid impacting
performance on CPUs that aren't running VMs, e.g. if a setup is using
housekeeping CPUs.  Setting BP_SPEC_REDUCE in task context, i.e. without
blasting IPIs to all CPUs, also helps avoid serializing 1<=>N transitions
without incurring a gross amount of complexity (see the Link for details
on how ugly coordinating via IPIs gets).

Link: https://lore.kernel.org/all/aBOnzNCngyS_pQIW@google.com
Fixes: 8442df2b49 ("x86/bugs: KVM: Add support for SRSO_MSR_FIX")
Reported-by: Michael Larabel <Michael@michaellarabel.com>
Closes: https://www.phoronix.com/review/linux-615-amd-regression
Cc: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250505180300.973137-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-08 07:17:10 -07:00
Herbert Xu
fba4aafaba Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux v6.15-rc5
Merge mainline to pick up bcachefs poly1305 patch 4bf4b5046d
("bcachefs: use library APIs for ChaCha20 and Poly1305").  This
is a prerequisite for removing the poly1305 shash algorithm.
2025-05-05 13:25:15 +08:00
Sean Christopherson
9129633d56 KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing
When changing memory attributes on a subset of a potential hugepage, add
the hugepage to the invalidation range tracking to prevent installing a
hugepage until the attributes are fully updated.  Like the actual hugepage
tracking updates in kvm_arch_post_set_memory_attributes(), process only
the head and tail pages, as any potential hugepages that are entirely
covered by the range will already be tracked.

Note, only hugepage chunks whose current attributes are NOT mixed need to
be added to the invalidation set, as mixed attributes already prevent
installing a hugepage, and it's perfectly safe to install a smaller
mapping for a gfn whose attributes aren't changing.

Fixes: 8dd2eee9d5 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: stable@vger.kernel.org
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/r/20250430220954.522672-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:39:34 -07:00
Tom Lendacky
5fea0c6c0e KVM: SVM: Update dump_ghcb() to use the GHCB snapshot fields
Commit 4e15a0ddc3 ("KVM: SEV: snapshot the GHCB before accessing it")
updated the SEV code to take a snapshot of the GHCB before using it. But
the dump_ghcb() function wasn't updated to use the snapshot locations.
This results in incorrect output from dump_ghcb() for the "is_valid" and
"valid_bitmap" fields.

Update dump_ghcb() to use the proper locations.

Fixes: 4e15a0ddc3 ("KVM: SEV: snapshot the GHCB before accessing it")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://lore.kernel.org/r/8f03878443681496008b1b37b7c4bf77a342b459.1745866531.git.thomas.lendacky@amd.com
[sean: add comment and snapshot qualifier]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:39:33 -07:00
Vishal Verma
907092bf7c KVM: VMX: Clean up and macrofy x86_ops
Eliminate a lot of stub definitions by using macros to define the TDX vs
non-TDX versions of various x86_ops. Moving the x86_ops wrappers under
CONFIG_KVM_INTEL_TDX also allows nearly all of vmx/main.c to go under a
single #ifdef, eliminating trampolines in the generated code, and almost
all of the stubs.

For example, with CONFIG_KVM_INTEL_TDX=n, before this cleanup,
vt_refresh_apicv_exec_ctrl() would produce:

0000000000036490 <vt_refresh_apicv_exec_ctrl>:
   36490:       f3 0f 1e fa             endbr64
   36494:       e8 00 00 00 00          call   36499 <vt_refresh_apicv_exec_ctrl+0x9>
                        36495: R_X86_64_PLT32   __fentry__-0x4
   36499:       e9 00 00 00 00          jmp    3649e <vt_refresh_apicv_exec_ctrl+0xe>
                        3649a: R_X86_64_PLT32   vmx_refresh_apicv_exec_ctrl-0x4
   3649e:       66 90                   xchg   %ax,%ax

After this patch, this is completely eliminated.

Based on a patch by Sean Christopherson <seanjc@google.com>

Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/
Cc: Sean Christopherson <seanjc@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:26 -07:00
Vishal Verma
1a81d9d5a1 KVM: VMX: Define a VMX glue macro for kvm_complete_insn_gp()
Define kvm_complete_insn_gp() as vmx_complete_emulated_msr() and use the
glue wrapper in vt_complete_emulated_msr() so that VT's
.complete_emulated_msr() implementation follows the soon-to-be-standard
pattern of:

    vt_abc:
        if (is_td())
            return tdx_abc();
        return vmx_abc();

This will allow generating such wrappers via a macro, which in turn will
make it trivially easy to skip the wrappers entirely when KVM_INTEL_TDX=n.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/
Cc: Sean Christopherson <seanjc@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-3-701e82d6b779@intel.com
[sean: massage shortlog+changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:25 -07:00
Vishal Verma
84ad4d834c KVM: VMX: Move vt_apicv_pre_state_restore() to posted_intr.c and tweak name
In preparation for a cleanup of the kvm_x86_ops struct for TDX, all vt_*
functions are expected to act as glue functions that route to either tdx_*
or vmx_* based on the VM type. Specifically, the pattern is:

vt_abc:
    if (is_td())
        return tdx_abc();
    return vmx_abc();

But vt_apicv_pre_state_restore() does not follow this pattern. To
facilitate that cleanup, rename and move vt_apicv_pre_state_restore() into
posted_intr.c.

Opportunistically turn vcpu_to_pi_desc() back into a static function, as
the only reason it was exposed outside of posted_intr.c was for
vt_apicv_pre_state_restore().

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/
Cc: Sean Christopherson <seanjc@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-2-701e82d6b779@intel.com
[sean: apply Chao's suggestions, massage shortlog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:25 -07:00
Sean Christopherson
f2d7993314 KVM: x86: Revert kvm_x86_ops.mem_enc_ioctl() back to an OPTIONAL hook
Restore KVM's handling of a NULL kvm_x86_ops.mem_enc_ioctl, as the hook is
NULL on SVM when CONFIG_KVM_AMD_SEV=n, and TDX will soon follow suit.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/kvm-x86-ops.h:130 kvm_x86_vendor_init+0x178b/0x18e0
  Modules linked in:
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.15.0-rc2-dc1aead1a985-sink-vm #2 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_x86_vendor_init+0x178b/0x18e0
  Call Trace:
   <TASK>
   svm_init+0x2e/0x60
   do_one_initcall+0x56/0x290
   kernel_init_freeable+0x192/0x1e0
   kernel_init+0x16/0x130
   ret_from_fork+0x30/0x50
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Opportunistically drop the superfluous curly braces.

Link: https://lore.kernel.org/all/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com
Fixes: b2aaf38ced ("KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl")
Link: https://lore.kernel.org/r/20250502203421.865686-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02 13:37:22 -07:00
Xin Li (Intel)
502ad6e5a6 x86/msr: Change the function type of native_read_msr_safe()
Modify the function type of native_read_msr_safe() to:

    int native_read_msr_safe(u32 msr, u64 *val)

This change makes the function return an error code instead of the
MSR value, aligning it with the type of native_write_msr_safe().
Consequently, their callers can check the results in the same way.

While at it, convert leftover MSR data type "unsigned int" to u32.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-16-xin@zytor.com
2025-05-02 10:36:36 +02:00
Xin Li (Intel)
0c2678efed x86/pvops/msr: Refactor pv_cpu_ops.write_msr{,_safe}()
An MSR value is represented as a 64-bit unsigned integer, with existing
MSR instructions storing it in EDX:EAX as two 32-bit segments.

The new immediate form MSR instructions, however, utilize a 64-bit
general-purpose register to store the MSR value.  To unify the usage of
all MSR instructions, let the default MSR access APIs accept an MSR
value as a single 64-bit argument instead of two 32-bit segments.

The dual 32-bit APIs are still available as convenient wrappers over the
APIs that handle an MSR value as a single 64-bit argument.

The following illustrates the updated derivation of the MSR write APIs:

                 __wrmsrq(u32 msr, u64 val)
                   /                  \
                  /                    \
           native_wrmsrq(msr, val)    native_wrmsr(msr, low, high)
                 |
                 |
           native_write_msr(msr, val)
                /          \
               /            \
       wrmsrq(msr, val)    wrmsr(msr, low, high)

When CONFIG_PARAVIRT is enabled, wrmsrq() and wrmsr() are defined on top
of paravirt_write_msr():

            paravirt_write_msr(u32 msr, u64 val)
               /             \
              /               \
          wrmsrq(msr, val)    wrmsr(msr, low, high)

paravirt_write_msr() invokes cpu.write_msr(msr, val), an indirect layer
of pv_ops MSR write call:

    If on native:

            cpu.write_msr = native_write_msr

    If on Xen:

            cpu.write_msr = xen_write_msr

Therefore, refactor pv_cpu_ops.write_msr{_safe}() to accept an MSR value
in a single u64 argument, replacing the current dual u32 arguments.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-14-xin@zytor.com
2025-05-02 10:36:36 +02:00
Xin Li (Intel)
3204877d05 x86/msr: Convert __rdmsr() uses to native_rdmsrq() uses
__rdmsr() is the lowest level MSR write API, with native_rdmsr()
and native_rdmsrq() serving as higher-level wrappers around it.

  #define native_rdmsr(msr, val1, val2)                   \
  do {                                                    \
          u64 __val = __rdmsr((msr));                     \
          (void)((val1) = (u32)__val);                    \
          (void)((val2) = (u32)(__val >> 32));            \
  } while (0)

  static __always_inline u64 native_rdmsrq(u32 msr)
  {
          return __rdmsr(msr);
  }

However, __rdmsr() continues to be utilized in various locations.

MSR APIs are designed for different scenarios, such as native or
pvops, with or without trace, and safe or non-safe.  Unfortunately,
the current MSR API names do not adequately reflect these factors,
making it challenging to select the most appropriate API for
various situations.

To pave the way for improving MSR API names, convert __rdmsr()
uses to native_rdmsrq() to ensure consistent usage.  Later, these
APIs can be renamed to better reflect their implications, such as
native or pvops, with or without trace, and safe or non-safe.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-10-xin@zytor.com
2025-05-02 10:36:35 +02:00
Xin Li (Intel)
efef7f184f x86/msr: Add explicit includes of <asm/msr.h>
For historic reasons there are some TSC-related functions in the
<asm/msr.h> header, even though there's an <asm/tsc.h> header.

To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h>
to <asm/tsc.h> and to eventually eliminate the inclusion of
<asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency
to the source files that reference definitions from <asm/msr.h>.

[ mingo: Clarified the changelog. ]

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
2025-05-02 10:23:47 +02:00
Ingo Molnar
0c7b20b852 Linux 6.15-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgOrWseHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGFyIH/AhXcuA8y8rk43mo
 t+0GO7JR4dnr4DIl74GgDjCXlXiKCT7EXMfD/ABdofTxV4Pbyv+pUODlg1E6eO9U
 C1WWM5PPNBGDDEVSQ3Yu756nr0UoiFhvW0R6pVdou5cezCWAtIF9LTN8DEUgis0u
 EUJD9+/cHAMzfkZwabjm/HNsa1SXv2X47MzYv/PdHKr0htEPcNHF4gqBrBRdACGy
 FJtaCKhuPf6TcDNXOFi5IEWMXrugReRQmOvrXqVYGa7rfUFkZgsAzRY6n/rUN5Z9
 FAgle4Vlv9ohVYj9bXX8b6wWgqiKRpoN+t0PpRd6G6ict1AFBobNGo8LH3tYIKqZ
 b/dCGNg=
 =xDGd
 -----END PGP SIGNATURE-----

Merge tag 'v6.15-rc4' into x86/msr, to pick up fixes and resolve conflicts

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-02 09:43:44 +02:00
David Kaplan
d43ba2dc8e x86/bugs: Restructure L1TF mitigation
Restructure L1TF to use select/apply functions to create consistent
vulnerability handling.

Define new AUTO mitigation for L1TF.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/20250418161721.1855190-16-david.kaplan@amd.com
2025-04-29 18:57:30 +02:00
Sean Christopherson
54a1a24fea KVM: x86: Unify cross-vCPU IBPB
Both SVM and VMX have similar implementation for executing an IBPB
between running different vCPUs on the same CPU to create separate
prediction domains for different vCPUs.

For VMX, when the currently loaded VMCS is changed in
vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which
is the case on vCPU load. The intention is to execute an IBPB when
switching vCPUs, but not when switching the VMCS within the same vCPU.
Executing an IBPB on nested transitions within the same vCPU is handled
separately and conditionally in nested_vmx_vmexit().

For SVM, the current VMCB is tracked on vCPU load and an IBPB is
executed when it is changed. The intention is also to execute an IBPB
when switching vCPUs, although it is possible that in some cases an IBBP
is executed when switching VMCBs for the same vCPU. Executing an IBPB on
nested transitions should be handled separately, and is proposed at [1].

Unify the logic by tracking the last loaded vCPU and execuintg the IBPB
on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is
destroyed, make sure all references to it are removed from any CPU. This
is similar to how SVM clears the current_vmcb tracking on vCPU
destruction. Remove the current VMCB tracking in SVM as it is no longer
required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs().

[1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev

Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: tweak comment to stay at/under 80 columns]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-29 08:39:44 -07:00
Yosry Ahmed
1bee4838eb KVM: SVM: Clear current_vmcb during vCPU free for all *possible* CPUs
When freeing a vCPU and thus its VMCB, clear current_vmcb for all possible
CPUs, not just online CPUs, as it's theoretically possible a CPU could go
offline and come back online in conjunction with KVM reusing the page for
a new VMCB.

Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev
Fixes: fd65d3142f ("kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-29 08:39:35 -07:00
Peng Hao
e0136112e9 x86/sev: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variables
Some variables allocated in sev_send_update_data are released when
the function exits, so there is no need to set GFP_KERNEL_ACCOUNT.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20250428063013.62311-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:09:55 -07:00
Yan Zhao
20a6cff3b2 KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in
kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete
root.

Since kvm_mmu_reload() can be called outside the
vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be
invoked after a root has been marked obsolete and before vcpu_enter_guest()
is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to
invalid. This causes kvm_mmu_reload() to fail to load a new root, which
can lead to kvm_arch_vcpu_pre_fault_memory() being stuck in the while
loop in kvm_tdp_map_page() since RET_PF_RETRY is always returned due to
is_page_fault_stale().

Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in
vcpu_enter_guest() since the cost of kvm_check_request() is negligible,
especially a check that's guarded by kvm_request_pending().

Export symbol of kvm_mmu_free_obsolete_roots() as kvm_mmu_reload() is
inline and may be called outside of kvm.ko.

Fixes: 6e01b7601d ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013333.5817-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:06:13 -07:00
Yan Zhao
11d4517511 KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU
Warn if PFN changes on shadow-present SPTE in mmu_set_spte().

KVM should _never_ change the PFN of a shadow-present SPTE. In
mmu_set_spte(), there is a WARN_ON_ONCE() on pfn changes on shadow-present
SPTE in mmu_spte_update() to detect this condition. However, that
WARN_ON_ONCE() is not hittable since mmu_set_spte() invokes drop_spte()
earlier before mmu_spte_update(), which clears SPTE to a !shadow-present
state. So, before invoking drop_spte(), add a WARN_ON_ONCE() in
mmu_set_spte() to warn PFN change of a shadow-present SPTE.

For the spurious prefetch fault, only return RET_PF_SPURIOUS directly when
PFN is not changed. When PFN changes, fall through to follow the sequence
of drop_spte(), warn of PFN change, make_spte(), flush tlb, rmap_add().

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013310.5781-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Yan Zhao
988da78202 KVM: x86/tdp_mmu: WARN if PFN changes for spurious faults
Add a WARN() to assert that KVM does _not_ change the PFN of a
shadow-present SPTE during spurious fault handling.

KVM should _never_ change the PFN of a shadow-present SPTE and TDP MMU
already BUG()s on this. However, spurious faults just return early before
the existing BUG() could be hit.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013238.5732-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Yan Zhao
d17cc13cc4 KVM: x86/tdp_mmu: Merge prefetch and access checks for spurious faults
Combine prefetch and is_access_allowed() checks into a unified path to
detect spurious faults, since both cases now share identical logic.

No functional changes.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013210.5701-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Yan Zhao
ea9fcdf76d KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch fault
Instead of simply treating a prefetch fault as spurious when there's a
shadow-present old SPTE, further check if the old SPTE is leaf to determine
if a prefetch fault is spurious.

It's not reasonable to treat a prefetch fault as spurious when there's a
shadow-present non-leaf SPTE without a corresponding shadow-present leaf
SPTE. e.g., in the following sequence, a prefetch fault should not be
considered spurious:
1. add a memslot with size 4K
2. prefault GPA A in the memslot
3. delete the memslot (zap all disabled)
4. re-add the memslot with size 2M
5. prefault GPA A again.
In step 5, the prefetch fault attempts to install a 2M huge entry.
Since step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE,
the leaf entry will remain empty after step 5 if the fetch fault is
regarded as spurious due to a shadow-present non-leaf SPTE.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013111.5648-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 11:03:06 -07:00
Chao Gao
a0ee1d5faf KVM: VMX: Flush shadow VMCS on emergency reboot
Ensure the shadow VMCS cache is evicted during an emergency reboot to
prevent potential memory corruption if the cache is evicted after reboot.

This issue was identified through code inspection, as __loaded_vmcs_clear()
flushes both the normal VMCS and the shadow VMCS.

Avoid checking the "launched" state during an emergency reboot, unlike the
behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
can interfere with operations like copy_shadow_to_vmcs12(), where shadow
VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
right after the VMCS load, the shadow VMCSes will be active but the
"launched" state may not be set.

Fixes: 16f5b9034b ("KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12")
Cc: stable@vger.kernel.org
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250324140849.2099723-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 10:57:56 -07:00
Sean Christopherson
5ecdb48dd9 KVM: SVM: Treat DEBUGCTL[5:2] as reserved
Stop ignoring DEBUGCTL[5:2] on AMD CPUs and instead treat them as reserved.
KVM has never properly virtualized AMD's legacy PBi bits, but did allow
the guest (and host userspace) to set the bits.  To avoid breaking guests
when running on CPUs with BusLockTrap, which redefined bit 2 to BLCKDB and
made bits 5:3 reserved, a previous KVM change ignored bits 5:3, e.g. so
that legacy guest software wouldn't inadvertently enable BusLockTrap or
hit a VMRUN failure due to setting reserved.

To allow for virtualizing BusLockTrap and whatever future features may use
bits 5:3, treat bits 5:2 as reserved (and hope that doing so doesn't break
any existing guests).

Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28 10:56:35 -07:00
Peng Hao
bb5081f4ab KVM: SVM: avoid frequency indirect calls
When retpoline is enabled, indirect function calls introduce additional
performance overhead. Avoid frequent indirect calls to VMGEXIT when SEV
is enabled.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20250306075425.66693-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:56 -07:00
Kim Phillips
b6bc164f41 KVM: SEV: Configure "ALLOWED_SEV_FEATURES" VMCB Field
AMD EPYC 5th generation processors have introduced a feature that allows
the hypervisor to control the SEV_FEATURES that are set for, or by, a
guest [1].  ALLOWED_SEV_FEATURES can be used by the hypervisor to enforce
that SEV-ES and SEV-SNP guests cannot enable features that the
hypervisor does not want to be enabled.

Always enable ALLOWED_SEV_FEATURES.  A VMRUN will fail if any
non-reserved bits are 1 in SEV_FEATURES but are 0 in
ALLOWED_SEV_FEATURES.

Some SEV_FEATURES - currently PmcVirtualization and SecureAvic
(see Appendix B, Table B-4) - require an opt-in via ALLOWED_SEV_FEATURES,
i.e. are off-by-default, whereas all other features are effectively
on-by-default, but still honor ALLOWED_SEV_FEATURES.

[1] Section 15.36.20 "Allowed SEV Features", AMD64 Architecture
    Programmer's Manual, Pub. 24593 Rev. 3.42 - March 2024:
    https://bugzilla.kernel.org/attachment.cgi?id=306250

Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250310201603.1217954-3-kim.phillips@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:55 -07:00
Tom Lendacky
468c27ae02 KVM: SVM: Add a mutex to dump_vmcb() to prevent concurrent output
If multiple VMRUN instructions fail, resulting in calls to dump_vmcb(),
the output can become interleaved and it is impossible to identify which
line of output belongs to which VMCB. Add a mutex to dump_vmcb() so that
the output is serialized.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/a880678afd9488e1dd6017445802712f7c02cc6d.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:54 -07:00
Tom Lendacky
0e6b677de7 KVM: SVM: Include the vCPU ID when dumping a VMCB
Provide the vCPU ID of the VMCB in dump_vmcb().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/ee0af5a6c1a49aebb4a8291071c3f68cacf107b2.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:54 -07:00
Tom Lendacky
db26450961 KVM: SVM: Add the type of VM for which the VMCB/VMSA is being dumped
Add the type of VM (SVM, SEV, SEV-ES, or SEV-SNP) being dumped to the
dump_vmcb() function.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/7a183a8beedf4ee26c42001160e073a884fe466e.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:53 -07:00
Tom Lendacky
22f5c2003a KVM: SVM: Dump guest register state in dump_vmcb()
Guest register state can be useful when debugging, include it as part
of dump_vmcb().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/a4131a10c082a93610cac12b35dca90292e50f50.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:52 -07:00
Tom Lendacky
962e2b6152 KVM: SVM: Decrypt SEV VMSA in dump_vmcb() if debugging is enabled
An SEV-ES/SEV-SNP VM save area (VMSA) can be decrypted if the guest
policy allows debugging. Update the dump_vmcb() routine to output
some of the SEV VMSA contents if possible. This can be useful for
debug purposes.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lore.kernel.org/r/ea3b852c295b6f4b200925ed6b6e2c90d9475e71.1742477213.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25 16:19:52 -07:00
Uros Bizjak
798b9b1cb0 KVM: VMX: Use LEAVE in vmx_do_interrupt_irqoff()
Micro-optimize vmx_do_interrupt_irqoff() by substituting
MOV %RBP,%RSP; POP %RBP instruction sequence with equivalent
LEAVE instruction. GCC compiler does this by default for
a generic tuning and for all modern processors:

DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_ZHAOXIN
	  | m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM | m_GENERIC)

The new code also saves a couple of bytes, from:

  27:	48 89 ec             	mov    %rbp,%rsp
  2a:	5d                   	pop    %rbp

to:

  27:	c9                   	leave

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250414081131.97374-2-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:23:07 -07:00
Sean Christopherson
17a2c62fbf KVM: nVMX: Check MSR load/store list counts during VM-Enter consistency checks
Explicitly verify the MSR load/store list counts are below the advertised
limit as part of the initial consistency checks on the lists, so that code
that consumes the count doesn't need to worry about extreme edge cases.
Enforcing the limit during the initial checks fixes a flaw on 32-bit KVM
where a sufficiently high @count could lead to overflow:

	arch/x86/kvm/vmx/nested.c:834 nested_vmx_check_msr_switch()
	warn: potential user controlled sizeof overflow 'addr + count * 16' '0-u64max + 16-68719476720'

arch/x86/kvm/vmx/nested.c
    827 static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu,
    828                                        u32 count, u64 addr)
    829 {
    830         if (count == 0)
    831                 return 0;
    832
    833         if (!kvm_vcpu_is_legal_aligned_gpa(vcpu, addr, 16) ||
--> 834             !kvm_vcpu_is_legal_gpa(vcpu, (addr + count * sizeof(struct vmx_msr_entry) - 1)))
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While the SDM doesn't explicitly state an illegal count results in VM-Fail,
the SDM states that exceeding the limit may result in undefined behavior.
I.e. the SDM gives hardware, and thus KVM, carte blanche to do literally
anything in response to a count that exceeds the "recommended" limit.

  If the limit is exceeded, undefined processor behavior may result
  (including a machine check during the VMX transition).

KVM already enforces the limit when processing the MSRs, i.e. already
signals a late VM-Exit Consistency Check for VM-Enter, and generates a
VMX Abort for VM-Exit.  I.e. explicitly checking the limits simply means
KVM will signal VM-Fail instead of VM-Exit or VMX Abort.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/44961459-2759-4164-b604-f6bd43da8ce9@stanley.mountain
Link: https://lore.kernel.org/r/20250315024402.2363098-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:23:06 -07:00
Tom Lendacky
309d28576f KVM: SVM: Fix SNP AP destroy race with VMRUN
An AP destroy request for a target vCPU is typically followed by an
RMPADJUST to remove the VMSA attribute from the page currently being
used as the VMSA for the target vCPU. This can result in a vCPU that
is about to VMRUN to exit with #VMEXIT_INVALID.

This usually does not happen as APs are typically sitting in HLT when
being destroyed and therefore the vCPU thread is not running at the time.
However, if HLT is allowed inside the VM, then the vCPU could be about to
VMRUN when the VMSA attribute is removed from the VMSA page, resulting in
a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the
guest to crash. An RMPADJUST against an in-use (already running) VMSA
results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA
attribute cannot be changed until the VMRUN for target vCPU exits. The
Qemu command line option '-overcommit cpu-pm=on' is an example of allowing
HLT inside the guest.

Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the
KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for
requests to be honored, so create kvm_make_request_and_kick() that will
add a new event request and honor the KVM_REQUEST_WAIT flag. This will
ensure that the target vCPU sees the AP destroy request before returning
to the initiating vCPU should the target vCPU be in guest mode.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/fe2c885bf35643dd224e91294edb6777d5df23a4.1743097196.git.thomas.lendacky@amd.com
[sean: add a comment explaining the use of smp_send_reschedule()]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:20:08 -07:00