Restore KVM's handling of a NULL kvm_x86_ops.mem_enc_ioctl, as the hook is
NULL on SVM when CONFIG_KVM_AMD_SEV=n, and TDX will soon follow suit.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/kvm-x86-ops.h:130 kvm_x86_vendor_init+0x178b/0x18e0
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.15.0-rc2-dc1aead1a985-sink-vm #2 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_x86_vendor_init+0x178b/0x18e0
Call Trace:
<TASK>
svm_init+0x2e/0x60
do_one_initcall+0x56/0x290
kernel_init_freeable+0x192/0x1e0
kernel_init+0x16/0x130
ret_from_fork+0x30/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
Opportunistically drop the superfluous curly braces.
Link: https://lore.kernel.org/all/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com
Fixes: b2aaf38ced ("KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl")
Link: https://lore.kernel.org/r/20250502203421.865686-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Both SVM and VMX have similar implementation for executing an IBPB
between running different vCPUs on the same CPU to create separate
prediction domains for different vCPUs.
For VMX, when the currently loaded VMCS is changed in
vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which
is the case on vCPU load. The intention is to execute an IBPB when
switching vCPUs, but not when switching the VMCS within the same vCPU.
Executing an IBPB on nested transitions within the same vCPU is handled
separately and conditionally in nested_vmx_vmexit().
For SVM, the current VMCB is tracked on vCPU load and an IBPB is
executed when it is changed. The intention is also to execute an IBPB
when switching vCPUs, although it is possible that in some cases an IBBP
is executed when switching VMCBs for the same vCPU. Executing an IBPB on
nested transitions should be handled separately, and is proposed at [1].
Unify the logic by tracking the last loaded vCPU and execuintg the IBPB
on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is
destroyed, make sure all references to it are removed from any CPU. This
is similar to how SVM clears the current_vmcb tracking on vCPU
destruction. Remove the current VMCB tracking in SVM as it is no longer
required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs().
[1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev
Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: tweak comment to stay at/under 80 columns]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a module param to each KVM vendor module to allow disabling device
posted interrupts without having to sacrifice all of APICv/AVIC, and to
also effectively enumerate to userspace whether or not KVM may be
utilizing device posted IRQs. Disabling device posted interrupts is
very desirable for testing, and can even be desirable for production
environments, e.g. if the host kernel wants to interpose on device
interrupts.
Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match
the overall APICv/AVIC controls, and to avoid complications with said
controls. E.g. if the param is in kvm.ko, KVM needs to be snapshot the
original user-defined value to play nice with a vendor module being
reloaded with different enable_apicv settings.
Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC
EOI for an IRQ that is not targeting said vCPU, i.e. after handling what's
effectively a stale EOI VM-Exit. If a level-triggered IRQ is in-flight
when IRQ routing changes, e.g. because the guest changes routing from its
IRQ handler, then KVM intercepts EOIs on both the new and old target vCPUs,
so that the in-flight IRQ can be de-asserted when it's EOI'd.
However, only the EOI for the in-flight IRQ needs to be intercepted, as
IRQs on the same vector with the new routing are coincidental, i.e. occur
only if the guest is reusing the vector for multiple interrupt sources.
If the I/O APIC routes aren't rescanned, KVM will unnecessarily intercept
EOIs for the vector and negative impact the vCPU's interrupt performance.
Note, both commit db2bdcbbbd ("KVM: x86: fix edge EOI and IOAPIC reconfig
race") and commit 0fc5a36dd6 ("KVM: x86: ioapic: Fix level-triggered EOI
and IOAPIC reconfigure race") mentioned this issue, but it was considered
a "rare" occurrence thus was not addressed. However in real environments,
this issue can happen even in a well-behaved guest.
Cc: Kai Huang <kai.huang@intel.com>
Co-developed-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: weizijie <zijie.wei@linux.alibaba.com>
[sean: massage changelog and comments, use int/-1, reset at scan]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The "kvm_run->kvm_valid_regs" and "kvm_run->kvm_dirty_regs" variables are
u64 type. We are only using the lowest 3 bits but we want to ensure that
the users are not passing invalid bits so that we can use the remaining
bits in the future.
However "sync_valid_fields" and kvm_sync_valid_fields() are u32 type so
the check only ensures that the lower 32 bits are clear. Fix this by
changing the types to u64.
Fixes: 74c1807f6c ("KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/ec25aad1-113e-4c6e-8941-43d432251398@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI
code, adding an open-coded erratum check for Cavium ThunderX
* Bugfixes from a planned posted interrupt rework
* Do not use kvm_rip_read() unconditionally to cater for guests
with inaccessible register state.
Not all VMs allow access to RIP. Check guest_state_protected before
calling kvm_rip_read().
This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.
Fixes: 81bf912b2c ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is running.
The only lock held when a producer is added/removed is irqbypass's mutex.
Fixes: 8727688006 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-is.
Fixes: 515a0c79e7 ("kvm: irqfd: avoid update unmodified entries of the routing")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate
merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit 7911f145de,
"x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
- Loose ends: handling of the remaining exits from the TDX module, including
EPT violation/misconfig and several TDVMCALL leaves that are handled in
the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
an error or ignoring operations that are not supported by TDX guests
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acquire a lock on kvm->srcu when userspace is getting MP state to handle a
rather extreme edge case where "accepting" APIC events, i.e. processing
pending INIT or SIPI, can trigger accesses to guest memory. If the vCPU
is in L2 with INIT *and* a TRIPLE_FAULT request pending, then getting MP
state will trigger a nested VM-Exit by way of ->check_nested_events(), and
emuating the nested VM-Exit can access guest memory.
The splat was originally hit by syzkaller on a Google-internal kernel, and
reproduced on an upstream kernel by hacking the triple_fault_event_test
selftest to stuff a pending INIT, store an MSR on VM-Exit (to generate a
memory access on VMX), and do vcpu_mp_state_get() to trigger the scenario.
=============================
WARNING: suspicious RCU usage
6.14.0-rc3-b112d356288b-vmx/pi_lockdep_false_pos-lock #3 Not tainted
-----------------------------
include/linux/kvm_host.h:1058 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by triple_fault_ev/1256:
#0: ffff88810df5a330 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x8b/0x9a0 [kvm]
stack backtrace:
CPU: 11 UID: 1000 PID: 1256 Comm: triple_fault_ev Not tainted 6.14.0-rc3-b112d356288b-vmx #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x144/0x190
kvm_vcpu_gfn_to_memslot+0x156/0x180 [kvm]
kvm_vcpu_read_guest+0x3e/0x90 [kvm]
read_and_check_msr_entry+0x2e/0x180 [kvm_intel]
__nested_vmx_vmexit+0x550/0xde0 [kvm_intel]
kvm_check_nested_events+0x1b/0x30 [kvm]
kvm_apic_accept_events+0x33/0x100 [kvm]
kvm_arch_vcpu_ioctl_get_mpstate+0x30/0x1d0 [kvm]
kvm_vcpu_ioctl+0x33e/0x9a0 [kvm]
__x64_sys_ioctl+0x8b/0xb0
do_syscall_64+0x6c/0x170
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401150504.829812-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Add common secure TSC infrastructure for use within SNP and in the
future TDX
- Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make
sense to use the capability if the relevant registers are not
available for reading or writing.
The immediate issue being fixed here is a nVMX bug where KVM fails to
detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI).
However, checking for a pending interrupt accesses the legacy PIC, and
x86's kvm_arch_destroy_vm() currently frees the PIC before destroying
vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results
in a NULL pointer deref; that's a prerequisite for the nVMX fix.
The remaining patches attempt to bring a bit of sanity to x86's VM
teardown code, which has accumulated a lot of cruft over the years. E.g.
KVM currently unloads each vCPU's MMUs in a separate operation from
destroying vCPUs, all because when guest SMP support was added, KVM had a
kludgy MMU teardown flow that broke when a VM had more than one 1 vCPU.
And that oddity lived on, for 18 years...
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
- Restrict the Xen hypercall MSR indx to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
- Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
- Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks, as there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of Xen TSC should be rare, i.e. are not a hot path.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZpO4ACgkQOlYIJqCj
N/3AMQ/+J4+yOslekq4DHYhZaTvJFf0MqhPgTuf2s6I5p449JWn9rebqK2w0M9Xj
fJy7/rboQA4QflBuhTiWcC3Dl1lYtxUqqtcCH9608eqKhbeay87OfV0/vgMwWBRs
FhcOcp1587esJj5gz5M5R9i3S5Yq7Q4fp1+DmS23X41Zz5nTb2q80MY5UklMgI9I
Ydaw1liB8rRHWbdt9yM4UsI8k4fMuj0PE8pEapoTSfsZm8J4cG9qHKrvuWjuFSCF
l18Hyl11nWq8eZ5Vg2E2UIz0EgtWIHKu1/fi4av20/JTuA8Mon15WC5q4BBmDDdD
keR9OJLYclVBh8KweiJSTUE6PcD9A8pWmoWyp6aGRiyyUVhbwysYTzT7uytwQz6w
RH/vVHe0o/m19SnD9rqsRVObc7dOGorFXScMcf4Qxoq9yQm2p0lJDvq6c9uECLMV
RIfZrXe9HS67RB9INybS+1fVlLcd0bLgGfG7q9lWLEABD45HpM5daQ4Mlf8+MIE0
V7egx9t69/WALbJka8pWNISeFRKkB1LRjite+XXasqJ0iFeneM8UKFVB4OMtXL9g
M0m8ovvySySMkoCq3yMlKxXh4rJ1/D556/bAaJBukMPWFWX9FQaP33U3FuzId7jH
ztZVugViQMNiIbQVgUSAcgpuJvgpttAciACODlaw2u2Bk1Txmn0=
=c3Wt
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM Xen changes for 6.15
- Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
- Restrict the Xen hypercall MSR indx to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
- Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
- Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks, as there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of Xen TSC should be rare, i.e. are not a hot path.
- Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
- Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks.
- Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
accounts for kvmclock, and there's no evidence that the flag is actually
supported by Xen guests.
- Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZnJsACgkQOlYIJqCj
N/3nAg//TuURhfTm56TB0PZ01DX9Fqxl+9b+fDSllk1F7O5BcfkkEd11Jv4qa/Zb
eKhSNZzWuDCTMky8izM2Ej4rfTmCg2xF+3hdpVi6yQ7SItgDo2E7e71lm1lNXSMO
oCkQxEwQk0cW2sxeEqPuREq0Zm/kw7jrEt2co2OX2FKlt9UoZiCy6RDBde50z1ut
5Z32k6QX9Alhu67kXvBE/+Xv6abx1dbADOnaTgE7s74smHKxS2WXrfpKnPXjy2y0
pWjX9k2ClSISKdaFbSu4Y0VqeLqE+57ZAWAPT8vndJxjNWOvZK1oBSlaOPchR9CZ
0VFLDWKV2FjEs0O0AkWCw8XTEmdJ4R1ekHpqbBZJ9TJYwVA/LDWOGgR1jcORkzsS
WMJkfMOmQeL8bPR6TBuAFXawbhalsXnYUSthZ3sn4kA7c1DTkIC5mzrDZ3ADPyJi
UpYwVHaWAMOqncEvSQEuUTvSoDeb5P4HMyB4QOAsh1GoKw4vVXpSWUPDy0JKPOnu
WblztX9h/CRB/ZNt/566s2Jh7sCeBO2qs3ffujI4GDosYDcIRRuQy5U08/oMrPRf
l3nPStjxLqsCdJe8IXvL5zwt6YOxJvJdG8XcfcvfQsUCPMAZOIv7PKrt8AFfrN6c
GU5v8x/IBBB46qJw1Jm5eE5S3P/PuaIf235JHpIabGPzJ+H1QGo=
=yv8C
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM PV clock changes for 6.15:
- Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
- Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks.
- Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
accounts for kvmclock, and there's no evidence that the flag is actually
supported by Xen guests.
- Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
- Add a helper to consolidate handling of mp_state transitions, and use it to
clear pv_unhalted whenever a vCPU is made RUNNABLE.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
- Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
- Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
as KVM can't get the current CPL.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZhoYACgkQOlYIJqCj
N/1oSBAAlhsZzv4m6rHtWACjGsSBlAE5WM7HrpnwjOyMW+Desc0WLL4L6qAlxgT7
4ZkBvQ4zsiyUIdLv1XARvkBYHTUkcgG8fQSSJQ6grZit5OcFcMgWafvQrJYI6428
TyzF/x6t0gj8CjDluA6jM/eL5HhWZZDOIg0Wma+pyYpW+7y2tkphXyYyOQPBGwv3
geRUetbDHHXjf042k/8f1j1vjzrNNvAg3YyNyx1YbdU9XKsn5D+SeUW2eVfYk8G7
5QsCOGvUYcbbjrR8kbCZKexvoH6Np9J6YKDe4R9R2yDzgs/96qz6xkYTGCVkHA1y
uursKqRHgbXBxzxa+ban073laT7Qt3S01Gd9bJW3IO7hzG89gl4qfX7fap8T9Yc2
yeBTYIgInpyx+NCdZ2Z/++BzPagBGfa77gFX/eIkmsVA9LWYi9CI3FSjtr/czvWm
a4tfMPvTVBjsBQQ7t/lNksrq0O51lbb3iqqv3ToQpDOOqCWuMEU5xcihhPRr5NSZ
dX4o/jIDhCV8EyXdtASyqMlYBXcuC45ojEZn1elh0QogzYAdSGQ2bIDyxuBtA//k
kSbi+E4GB64jVfBWUyK2QeLOBnBkH7mh6Cg5UYr1Ln9Sm6l8vrcxhcbnchiWxXMI
WCK7BJwI2HojBVpEZ04jMkjHvg36uSfjOzmMLT5yPXfFNebsGmA=
=8SGF
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.15:
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
- Add a helper to consolidate handling of mp_state transitions, and use it to
clear pv_unhalted whenever a vCPU is made RUNNABLE.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
- Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
- Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
as KVM can't get the current CPL.
- Misc cleanups
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled,
if shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not
called for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.
On AMD platforms, KVM always honors guest PAT. On Intel however there are
two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forces WB previously. See [2].
So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.
The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
no-self-snoop cases, as well as to remove the shadow_memtype_mask check
from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce supported_quirks in kvm_caps to store platform-specific force-enabled
quirks.
No functional changes intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224070832.31394-1-yan.y.zhao@intel.com>
[Remove unsupported quirks at KVM_ENABLE_CAP time. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example,
this will be the case for "ignore guest PAT". Allow unaffected vendor
modules to disable handling of a quirk for all VMs via a new entry in
kvm_caps.
Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API
tells userspace that KVM *knows* that some of its past behavior was bogus
or just undesirable. In other words, it's plausible for userspace to
refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.
As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allowing arbitrary re-enabling of quirks puts a limit on what the
quirks themselves can do, since you cannot assume that the quirk
prevents a particular state. More important, it also prevents
KVM from disabling a quirk at VM creation time, because userspace
can always go back and re-enable that.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move KVM_MAX_MCE_BANKS to header file so that it can be used for TDX in
a future patch.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Force APICv active for TDX guests in KVM because APICv is always enabled
by TDX module.
From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.
After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2} are
rejected due to has_protected_state for TDs and guest_state_protected
for TDX vCPUs are set. Reject KVM_{GET,SET}_LAPIC from userspace since
migration is not supported yet, so that userspace cannot disable APIC.
For various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv
in tdx_bringup(). If !enable_apicv, TDX support will be disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note:
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory
allocation fails for KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is
checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects
userspace to set APIC base.
- The rest inhibit reasons are relevant only to AMD's AVIC, including
APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
(For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
EOI for TDX guests neither, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
guests, which eliminates the in-kernel PIT.)
Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the
leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according
to TDX Guest Host Communication Interface (GHCI) spec.
For TDX guests, VMM is not allowed to access vCPU registers and the private
memory, and the code instructions must be fetched from the private memory.
So MMIO emulation implemented for non-TDX VMs is not possible for TDX
guests.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest. The #VE handling is supposed to emulate the MMIO
instruction inside the guest and convert it into a TDVMCALL with the
leaf #VE.RequestMMIO, which equals to EXIT_REASON_EPT_VIOLATION.
The requested MMIO address must be in shared GPA space. The shared bit
is stripped after check because the existing code for MMIO emulation is
not aware of the shared bit.
The MMIO GPA shouldn't have a valid memslot, also the attribute of the GPA
should be shared. KVM could do the checks before exiting to userspace,
however, even if KVM does the check, there still will be race conditions
between the check in KVM and the emulation of MMIO access in userspace due
to a memslot hotplug, or a memory attribute conversion. If userspace
doesn't check the attribute of the GPA and the attribute happens to be
private, it will not pose a security risk or cause an MCE, but it can lead
to another issue. E.g., in QEMU, treating a GPA with private attribute as
shared when it falls within RAM's range can result in extra memory
consumption during the emulation to the access to the HVA of the GPA.
There are two options: 1) Do the check both in KVM and userspace. 2) Do
the check only in QEMU. This patch chooses option 2, i.e. KVM omits the
memslot and attribute checks, and expects userspace to do the checks.
Similar to normal MMIO emulation, try to handle the MMIO in kernel first,
if kernel can't support it, forward the request to userspace. Export
needed symbols used for MMIO handling.
Fragments handling is not needed for TDX PV MMIO because GPA is provided,
if a MMIO access crosses page boundary, it should be continuous in GPA.
Also, the size is limited to 1, 2, 4, 8 bytes. No further split needed.
Allow cross page access because no extra handling needed after checking
both start and end GPA are shared GPAs.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move pv_unhalted check out of kvm_vcpu_has_events(), check pv_unhalted
explicitly when handling PV unhalt and expose kvm_vcpu_has_events().
kvm_vcpu_has_events() returns true if pv_unhalted is set, and pv_unhalted
is only cleared on transitions to KVM_MP_STATE_RUNNABLE. If the guest
initiates a spurious wakeup, pv_unhalted could be left set in perpetuity.
Currently, this is not problematic because kvm_vcpu_has_events() is only
called when handling PV unhalt. However, if kvm_vcpu_has_events() is used
for other purposes in the future, it could return the unexpected results.
Export kvm_vcpu_has_events() for its usage in broader contexts.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.
When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
____kvm_emulate_hypercall() can just read registers internally based on KVM
hypercall ABI, and those registers can be removed from the
__kvm_emulate_hypercall() macro.
Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
remove it from the __kvm_emulate_hypercall() macro as well.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20250222014225.897298-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.
TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit. Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs. KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.
MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.
Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several MSRs are constant and only used in userspace(ring 3). But VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to
guest's values and leverages user return notifier to restore them when the
kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.
TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit. It makes the cached values in kvm_user_return_msrs are
inconsistent with values in hardware. This inconsistency needs to be
fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values. kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr. So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.
Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.
Making cpu_dirty_log_size a per-VM value allows it to be ther PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).
This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding has no flag KVM_MEM_GUEST_MEMFD.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.
Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TD guest vcpu needs TDX specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/
(Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change to report the KVM_CAP_MAX_VCPUS extension from globally to per-VM
to allow userspace to be able to query maximum vCPUs for TDX guest via
checking the KVM_CAP_MAX_VCPU extension on per-VM basis.
Today KVM x86 reports KVM_MAX_VCPUS as guest's maximum vCPUs for all
guests globally, and userspace, i.e. Qemu, queries the KVM_MAX_VCPUS
extension globally but not on per-VM basis.
TDX has its own limit of maximum vCPUs it can support for all TDX guests
in addition to KVM_MAX_VCPUS. TDX module reports this limit via the
MAX_VCPU_PER_TD global metadata. Different modules may report different
values. In practice, the reported value reflects the maximum logical
CPUs that ALL the platforms that the module supports can possibly have.
Note some old modules may also not support this metadata, in which case
the limit is U16_MAX.
The current way to always report KVM_MAX_VCPUS in the KVM_CAP_MAX_VCPUS
extension is not enough for TDX. To accommodate TDX, change to report
the KVM_CAP_MAX_VCPUS extension on per-VM basis.
Specifically, override kvm->max_vcpus in tdx_vm_init() for TDX guest,
and report kvm->max_vcpus in the KVM_CAP_MAX_VCPUS extension check.
Change to report "the number of logical CPUs the platform has" as the
maximum vCPUs for TDX guest. Simply forwarding the MAX_VCPU_PER_TD
reported by the TDX module would result in an unpredictable ABI because
the reported value to userspace would be depending on whims of TDX
modules.
This works in practice because of the MAX_VCPU_PER_TD reported by the
TDX module will never be smaller than the one reported to userspace.
But to make sure KVM never reports an unsupported value, sanity check
the MAX_VCPU_PER_TD reported by TDX module is not smaller than the
number of logical CPUs the platform has, otherwise refuse to use TDX.
Note, when creating a TDX guest, TDX actually requires the "maximum
vCPUs for _this_ TDX guest" as an input to initialize the TDX guest.
But TDX guest's maximum vCPUs is not part of TDREPORT thus not part of
attestation, thus there's no need to allow userspace to explicitly
_configure_ the maximum vCPUs on per-VM basis. KVM will simply use
kvm->max_vcpus as input when initializing the TDX guest.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement managing the TDX private KeyID to implement, create, destroy
and free for a TDX guest.
When creating at TDX guest, assign a TDX private KeyID for the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).
On destruction, free the allocated pages, and the KeyID.
Before tearing down the private page tables, TDX requires the guest TD to
be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops
hook. The TDR control structures can be freed in the vm_destroy() hook,
which runs last.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix build issue in kvm-coco-queue
- Init ret earlier to fix __tdx_td_init() error handling. (Chao)
- Standardize -EAGAIN for __tdx_td_init() retry errors (Rick)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for
guest state-protected VM. It defined subcommands for technology-specific
operations under KVM_MEMORY_ENCRYPT_OP. Despite its name, the subcommands
are not limited to memory encryption, but various technology-specific
operations are defined. It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
for TDX specific operations and define subcommands.
Add a place holder function for TDX specific VM-scoped ioctl as mem_enc_op.
TDX specific sub-commands will be added to retrieve/pass TDX specific
parameters. Make mem_enc_ioctl non-optional as it's always filled.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Drop the misleading "defined for consistency" line. It's a copy-paste
error introduced in the earlier patches. Earlier there was padding at
the end to match struct kvm_sev_cmd size. (Tony)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest state,
since the register values cannot actually be written. Return 0
when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly
return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero.
However, on exit from KVM_RUN userspace could have placed a nonzero
value into kvm_run->kvm_valid_regs, so check guest_state_protected
again and skip store_regs() in that case.
Cc: stable@vger.kernel.org
Fixes: 517987e3fb ("KVM: x86: add fields to struct kvm_arch for CoCo features")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250306202923.646075-1-pbonzini@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit
changing TSC offset/multiplier when guest_tsc_protected is true.
X86 confidential computing technology defines protected guest TSC so that
the VMM can't change the TSC offset/multiplier once vCPU is initialized.
SEV-SNP defines Secure TSC as optional, whereas TDX mandates it.
KVM has common logic on x86 that tries to guess or adjust TSC
offset/multiplier for better guest TSC and TSC interrupt latency
at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration
over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes
(kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR
(kvm_set_msr_common()).
The current x86 KVM implementation conflicts with protected TSC because the
VMM can't change the TSC offset/multiplier.
Because KVM emulates the TSC timer or the TSC deadline timer with the TSC
offset/multiplier, the TSC timer interrupts is injected to the guest at the
wrong time if the KVM TSC offset is different from what the TDX module
determined.
Originally this issue was found by cyclic test of rt-test [1] as the
latency in TDX case is worse than VMX value + TDX SEAMCALL overhead. It
turned out that the KVM TSC offset is different from what the TDX module
determines.
Disable or ignore the KVM logic to change/adjust the TSC offset/multiplier
somehow, thus keeping the KVM TSC offset/multiplier the same as the
value of the TDX module. Writes to MSR_IA32_TSC are also blocked as
they amount to a change in the TSC offset.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Push down setting vcpu.arch.user_set_tsc to true from kvm_synchronize_tsc()
to __kvm_synchronize_tsc() so that the two callers don't have to modify
user_set_tsc directly as preparation.
Later, prohibit changing TSC synchronization for TDX guests to modify
__kvm_synchornize_tsc() change. We don't want to touch caller sites not to
change user_set_tsc.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX needs to free the TDR control structures last, after all paging structures
have been torn down; move the vm_destroy callback at a suitable place.
The new place is also okay for AMD; the main difference is that the
MMU has been torn down and, if anything, that is better done before
the SNP ASID is released.
Extracted from a patch by Yan Zhao.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
- Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
the host's value, which can lead to unexpected bus lock #DBs.
- Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
emulate BTF. KVM's lack of context switching has meant BTF has always been
broken to some extent.
- Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
can enable DebugSwap without KVM's knowledge.
- Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
memory" phase without actually generating a write-protection fault.
- Fix a printf() goof in the SEV smoke test that causes build failures with
-Werror.
- Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
isn't supported by KVM.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfLlhUACgkQOlYIJqCj
N/0x7w/+MhqJdHbshL7Gzw+rcXwCROiCkqsxFP+YoTXte8uaHS5CEfcMYjE8SuGp
KBpgLo4Lj1dVTXiCjemlY5sn6CDiuSs74X8A88ksuu5hVsFByJUgyWU9iw8J/crZ
B2vj8huhqa8OCEPe5JujWfnfyAkKE5tUA4GFi73vhHMcftTNj+ftxT33/Pfg7y7M
xOvWFWS6ZshKrouRKzI7ZFEYLwp0lr4U3dzO5rCRAd5J4MSBWRx6Dx2um5dyEYKJ
xgwl4ylM4S/+78u1+0nQnToM0UWHJ3e7x8nze6UXYTZIrBr/lSeKlbhOPnEWJcJB
Eemnur9ORI2BRPUReqBKluCZsSK+E5B/HPCVt5cxtuRIuUOD+kW17LPgnPyE4Sso
eVt+XAvQc7EjrpWDSHr3ZQZZM89l9zHhuSAQ0npO6y71s0FzEVZQoDamNmOLAPjH
Qg+qhBV2l6pyfqhqiLzADasYLOl57cJsfiMjM331ALLqAn57jzd+B8c4hdB2Xg4s
KPuy8w8uBaY9zpd9YDBLLr7JJVs35KexNZMjT2vqBYXcScyLgmAuSQXy3hub6Mzn
gI5ZXIKG8eO9v2jejfClI6/OEdtEwgSGEVwuBKB16pMrIxqpguMTMTWLVRn5G+oo
qA8anmKaac62GaB66JE/Wjy069OPIGYnHSU2nal0Tej6kG0xv6E=
=as6u
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.14-rcN #2
- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
- Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
the host's value, which can lead to unexpected bus lock #DBs.
- Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
emulate BTF. KVM's lack of context switching has meant BTF has always been
broken to some extent.
- Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
can enable DebugSwap without KVM's knowledge.
- Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
memory" phase without actually generating a write-protection fault.
- Fix a printf() goof in the SEV smoke test that causes build failures with
-Werror.
- Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
isn't supported by KVM.
When emulating HLT and a wake event is already pending, explicitly mark
the vCPU RUNNABLE (via kvm_set_mp_state()) instead of assuming the vCPU is
already in the appropriate state. Barring a KVM bug, it should be
impossible for the vCPU to be in a non-RUNNABLE state, but there is no
advantage to relying on that to hold true, and ensuring the vCPU is made
RUNNABLE avoids non-deterministic behavior with respect to pv_unhalted.
E.g. if the vCPU is not already RUNNABLE, then depending on when
pv_unhalted is set, KVM could either leave the vCPU in the non-RUNNABLE
state (set before __kvm_emulate_halt()), or transition the vCPU to HALTED
and then RUNNABLE (pv_unhalted set after the kvm_vcpu_has_events() check).
Link: https://lore.kernel.org/r/20250224174156.2362059-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle
debugctl bits from IRQ context, e.g. when enabling/disabling events via
smp_call_function_single(). Taking the snapshot (long) before IRQs are
disabled could result in KVM effectively clobbering DEBUGCTL due to using
a stale snapshot.
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in
common x86, so that SVM can also use the snapshot.
Opportunistically change the field to a u64. While bits 63:32 are reserved
on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned
long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other
arch has ever used it).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20250224235542.2562848-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fold the guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm(), as
the kvmclock and PIT background workers only need to be stopped before
destroying vCPUs (to avoid accessing vCPUs as they are being freed); it's
a-ok for them to be running while the VM is visible on the global vm_list.
Note, the PIT also needs to be stopped before IRQ routing is freed
(because KVM's IRQ routing is garbage and assumes there is always non-NULL
routing).
Opportunistically add comments to explain why KVM stops/frees certain
assets early.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When destroying a VM, unload a vCPU's MMUs as part of normal vCPU freeing,
instead of as a separate prepratory action. Unloading MMUs ahead of time
is a holdover from commit 7b53aa5650 ("KVM: Fix vcpu freeing for guest
smp"), which "fixed" a rather egregious flaw where KVM would attempt to
free *all* MMU pages when destroying a vCPU.
At the time, KVM would spin on all MMU pages in a VM when free a single
vCPU, and so would hang due to the way KVM pins and zaps root pages
(roots are invalidated but not freed if they are pinned by a vCPU).
static void free_mmu_pages(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_page *page;
while (!list_empty(&vcpu->kvm->active_mmu_pages)) {
page = container_of(vcpu->kvm->active_mmu_pages.next,
struct kvm_mmu_page, link);
kvm_mmu_zap_page(vcpu->kvm, page);
}
free_page((unsigned long)vcpu->mmu.pae_root);
}
Now that KVM doesn't try to free all MMU pages when destroying a single
vCPU, there's no need to unpin roots prior to destroying a vCPU.
Note! While KVM mostly destroys all MMUs before calling
kvm_arch_destroy_vm() (see commit f00be0cae4 ("KVM: MMU: do not free
active mmu pages in free_mmu_pages()")), unpinning MMU roots during vCPU
destruction will unfortunately trigger remote TLB flushes, i.e. will try
to send requests to all vCPUs.
Happily, thanks to commit 27592ae8db ("KVM: Move wiping of the kvm->vcpus
array to common code"), that's a non-issue as freed vCPUs are naturally
skipped by xa_for_each_range(), i.e. by kvm_for_each_vcpu(). Prior to that
commit, KVM x86 rather stupidly freed vCPUs one-by-one, and _then_
nullified them, one-by-one. I.e. triggering a VM-wide request would hit a
use-after-free.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't load (and then put) a vCPU when unloading its MMU during VM
destruction, as nothing in kvm_mmu_unload() accesses vCPU state beyond the
root page/address of each MMU, i.e. can't possible need to run with the
vCPU loaded.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Free vCPUs before freeing any VM state, as both SVM and VMX may access
VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs
to be kicked out of nested guest mode.
Commit 6fcee03df6 ("KVM: x86: avoid loading a vCPU after .vm_destroy was
called") partially fixed the issue, but for unknown reasons only moved the
MMU unloading before VM destruction. Complete the change, and free all
vCPU state prior to destroying VM state, as nVMX accesses even more state
than nSVM.
In addition to the AVIC, KVM can hit a use-after-free on MSR filters:
kvm_msr_allowed+0x4c/0xd0
__kvm_set_msr+0x12d/0x1e0
kvm_set_msr+0x19/0x40
load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel]
nested_vmx_vmexit+0x715/0xbd0 [kvm_intel]
nested_vmx_free_vcpu+0x33/0x50 [kvm_intel]
vmx_free_vcpu+0x54/0xc0 [kvm_intel]
kvm_arch_vcpu_destroy+0x28/0xf0
kvm_vcpu_destroy+0x12/0x50
kvm_arch_destroy_vm+0x12c/0x1c0
kvm_put_kvm+0x263/0x3c0
kvm_vm_release+0x21/0x30
and an upcoming fix to process injectable interrupts on nested VM-Exit
will access the PIC:
BUG: kernel NULL pointer dereference, address: 0000000000000090
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re
RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm]
Call Trace:
<TASK>
kvm_cpu_has_injectable_intr+0xe/0x60 [kvm]
nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel]
nested_vmx_free_vcpu+0x40/0x50 [kvm_intel]
vmx_vcpu_free+0x2d/0x80 [kvm_intel]
kvm_arch_vcpu_destroy+0x2d/0x130 [kvm]
kvm_destroy_vcpus+0x8a/0x100 [kvm]
kvm_arch_destroy_vm+0xa7/0x1d0 [kvm]
kvm_destroy_vm+0x172/0x300 [kvm]
kvm_vcpu_release+0x31/0x50 [kvm]
Inarguably, both nSVM and nVMX need to be fixed, but punt on those
cleanups for the moment. Conceptually, vCPUs should be freed before VM
state. Assets like the I/O APIC and PIC _must_ be allocated before vCPUs
are created, so it stands to reason that they must be freed _after_ vCPUs
are destroyed.
Reported-by: Aaron Lewis <aaronlewis@google.com>
Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com
Cc: Jim Mattson <jmattson@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Open code the filling of vcpu->arch.exception in kvm_requeue_exception()
instead of bouncing through kvm_multiple_exception(), as re-injection
doesn't actually share that much code with "normal" injection, e.g. the
VM-Exit interception check, payload delivery, and nested exception code
is all bypassed as those flows only apply during initial injection.
When FRED comes along, the special casing will only get worse, as FRED
explicitly tracks nested exceptions and essentially delivers the payload
on the stack frame, i.e. re-injection will need more inputs, and normal
injection will have yet more code that needs to be bypassed when KVM is
re-injecting an exception.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename send_user_only to avoid "user", because KVM's ABI is to not inject
page faults into CPL0, whereas "user" in x86 is specifically CPL3. Invert
the polarity to keep the naming simple and unambiguous. E.g. while KVM
often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and
"send_kernel" could be misconstrued as "send only to kernel".
Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't inject PV async #PFs into guests with protected register state, i.e.
SEV-ES and SEV-SNP guests, unless the guest has opted-in to receiving #PFs
at CPL0. For protected guests, the actual CPL of the guest is unknown.
Note, no sane CoCo guest should enable PV async #PF, but the current state
of Linux-as-a-CoCo-guest isn't entirely sane.
Fixes: add5e2f045 ("KVM: SVM: Add support for the SEV-ES VMSA")
Link: https://lore.kernel.org/r/20250215010609.1199982-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The Xen emulation in KVM modifies certain CPUID leaves to expose
TSC information to the guest.
Previously, these CPUID leaves were updated whenever guest time changed,
but this conflicts with KVM_SET_CPUID/KVM_SET_CPUID2 ioctls which reject
changes to CPUID entries on running vCPUs.
Fix this by updating the TSC information directly in the CPUID emulation
handler instead of modifying the vCPU's CPUID entries.
Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250124150539.69975-1-fgriffo@amazon.co.uk
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that all KVM usage of the Xen HVM config information is buried behind
CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch
and into kvm_xen.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a helper to detect writes to the Xen hypercall page MSR, and provide a
stub for CONFIG_KVM_XEN=n to optimize out the check for kernels built
without Xen support.
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250215011437.1203084-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When updating PV clocks, handle the Xen-specific UNSTABLE_TSC override in
the main kvm_guest_time_update() by simply clearing PVCLOCK_TSC_STABLE_BIT
in the flags of the reference pvclock structure. Expand the comment to
(hopefully) make it obvious that Xen clocks need to be processed after all
clocks that care about the TSC_STABLE flag.
No functional change intended.
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When updating paravirtual clocks, setup the Hyper-V TSC page before
Xen PV clocks. This will allow dropping xen_pvclock_tsc_unstable in favor
of simply clearing PVCLOCK_TSC_STABLE_BIT in the reference flags.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove the per-vCPU "cache" of the reference pvclock and instead cache
only the TSC shift+multiplier. All other fields in pvclock are fully
recomputed by kvm_guest_time_update(), i.e. aren't actually persisted.
In addition to shaving a few bytes, explicitly tracking the TSC shift/mul
fields makes it easier to see that those fields are tied to hw_tsc_khz
(they exist to avoid having to do expensive math in the common case).
And conversely, not tracking the other fields makes it easier to see that
things like the version number are pulled from the guest's copy, not from
KVM's reference.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Pass the reference pvclock structure that's used to setup each individual
pvclock as a parameter to kvm_setup_guest_pvclock() as a preparatory step
toward removing kvm_vcpu_arch.hv_clock.
No functional change intended.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Handle "guest stopped" propagation only for kvmclock, as the flag is set
if and only if kvmclock is "active", i.e. can only be set for Xen PV clock
if kvmclock *and* Xen PV clock are in-use by the guest, which creates very
bizarre behavior for the guest.
Simply restrict the flag to kvmclock, e.g. instead of trying to handle
Xen PV clock, as propagation of PVCLOCK_GUEST_STOPPED was unintentionally
added during a refactoring, and while Xen proper defines
XEN_PVCLOCK_GUEST_STOPPED, there's no evidence that Xen guests actually
support the flag.
Check and clear pvclock_set_guest_stopped_request if and only if kvmclock
is active to preserve the original behavior, i.e. keep the flag pending
if kvmclock happens to be disabled when KVM processes the initial request.
Fixes: aa096aa0a0 ("KVM: x86/xen: setup pvclock updates")
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When updating a specific PV clock, make a full copy of KVM's reference
copy/cache so that PVCLOCK_GUEST_STOPPED doesn't bleed across clocks.
E.g. in the unlikely scenario the guest has enabled both kvmclock and Xen
PV clock, a dangling GUEST_STOPPED in kvmclock would bleed into Xen PV
clock.
Using a local copy of the pvclock structure also sets the stage for
eliminating the per-vCPU copy/cache (only the TSC frequency information
actually "needs" to be cached/persisted).
Fixes: aa096aa0a0 ("KVM: x86/xen: setup pvclock updates")
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Handle "guest stopped" requests once per guest time update in preparation
of restoring KVM's historical behavior of setting PVCLOCK_GUEST_STOPPED
for kvmclock and only kvmclock. For now, simply move the code to minimize
the probability of an unintentional change in functionally.
Note, in practice, all clocks are guaranteed to see the request (or not)
even though each PV clock processes the request individual, as KVM holds
vcpu->mutex (blocks KVM_KVMCLOCK_CTRL) and it should be impossible for
KVM's suspend notifier to run while KVM is handling requests. And because
the helper updates the reference flags, all subsequent PV clock updates
will pick up PVCLOCK_GUEST_STOPPED.
Note #2, once PVCLOCK_GUEST_STOPPED is restricted to kvmclock, the
horrific #ifdef will go away.
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the local pvclock_flags in kvm_guest_time_update(), the local variable
is immediately shoved into the per-vCPU "cache", i.e. the local variable
serves no purpose.
No functional change intended.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop KVM's handling of kvm_set_guest_paused() failure when reacting to a
SUSPEND notification, as kvm_set_guest_paused() only "fails" if the vCPU
isn't using kvmclock, and KVM's notifier callback pre-checks that kvmclock
is active. I.e. barring some bizarre edge case that shouldn't be treated
as an error in the first place, kvm_arch_suspend_notifier() can't fail.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries. Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).
Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:
SKX 6850 => 6450
ICX 9000 => 8800
EMR 7900 => 7700
Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.
Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case. However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex. Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would requiring detecting changes
to L1 state. KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit. The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.
To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT means that MWAIT is _conditionally_ a dynamic CPUID
feature.
Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait, but there's no obvious advtantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.
Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug. The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled. E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT. KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.
Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rework MONITOR/MWAIT emulation to query X86_FEATURE_MWAIT if and only if
the MISC_ENABLE_NO_MWAIT quirk is enabled, in which case MWAIT is not a
dynamic, KVM-controlled CPUID feature. KVM's funky ABI for that quirk is
to emulate MONITOR/MWAIT as nops if userspace sets MWAIT in guest CPUID.
For the case where KVM owns the MWAIT feature bit, check MISC_ENABLES
itself, i.e. check the actual control, not its reflection in guest CPUID.
Avoiding consumption of dynamic CPUID features will allow KVM to defer
runtime CPUID updates until kvm_emulate_cpuid(), i.e. until the updates
become visible to the guest. Alternatively, KVM could play other games
with runtime CPUID updates, e.g. by precisely specifying which feature
bits to update, but doing so adds non-trivial complexity and doesn't solve
the underlying issue of unnecessary updates causing meaningful overhead
for nested virtualization roundtrips.
Link: https://lore.kernel.org/r/20241211013302.1347853-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In kvm_set_mp_state(), ensure that vcpu->arch.pv.pv_unhalted is always
cleared on a transition to KVM_MP_STATE_RUNNABLE, so that the next HLT
instruction will be respected.
Fixes: 6aef266c6e ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks")
Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Fixes: 38c0b192bd ("KVM: SVM: leave halted state on vmexit")
Fixes: 1a65105a5a ("KVM: x86/xen: handle PV spinlocks slowpath")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-3-jmattson@google.com
[sean: add Xen PV spinlocks to the list of Fixes, tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace all open-coded assignments to vcpu->arch.mp_state with calls
to a new helper, kvm_set_mp_state(), to centralize all changes to
mp_state.
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.
When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit. This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.
For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop. If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.
The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup. E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).
L1's view:
==========
<L1 disables DR interception>
CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec
D: L1 reads DR6, arch.dr6 = 0
CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0
CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0
L2 reads DR6, L1 disables DR interception
CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0
CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0
L2 detects failure
CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT
L1 reads DR6 (confirms failure)
CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0
L0's view:
==========
L2 reads DR6, arch.dr6 = 0
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
L2 => L1 nested VM-Exit
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23
L1 writes DR7, L0 disables DR interception
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23
L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0
A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>
B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23
C: L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0
L1 => L2 nested VM-Enter
CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME
L0 reads DR6, arch.dr6 = 0
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes: 375e28ffc0 ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT")
Fixes: d67668e9dd ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The Xen hypercall page MSR is write-only. When the guest writes an address
to the MSR, the hypervisor populates the referenced page with hypercall
functions.
There is no reason for the host ever to write to the MSR, and it isn't
even readable.
Allowing host writes to trigger the hypercall page allows userspace to
attack the kernel, as kvm_xen_write_hypercall_page() takes multiple
locks and writes to guest memory. E.g. if userspace sets the MSR to
MSR_IA32_XSS, KVM's write to MSR_IA32_XSS during vCPU creation will
trigger an SRCU violation due to writing guest memory:
=============================
WARNING: suspicious RCU usage
6.13.0-rc3
-----------------------------
include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage!
stack backtrace:
CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x176/0x1c0
kvm_vcpu_gfn_to_memslot+0x259/0x280
kvm_vcpu_write_guest+0x3a/0xa0
kvm_xen_write_hypercall_page+0x268/0x300
kvm_set_msr_common+0xc44/0x1940
vmx_set_msr+0x9db/0x1fc0
kvm_vcpu_reset+0x857/0xb50
kvm_arch_vcpu_create+0x37e/0x4d0
kvm_vm_ioctl+0x669/0x2100
__x64_sys_ioctl+0xc1/0xf0
do_syscall_64+0xc5/0x210
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7feda371b539
While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.
Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/de0437379dfab11e431a23c8ce41a29234c06cbf.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
The only statement in a kvm_arch_post_init_vm implementation
can be moved into the x86 kvm_arch_init_vm. Do so and remove all
traces from architecture-independent code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.
Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too. It's similar in concept to pthread_once;
except it is actually usable, because the callback takes a parameter.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes: d96c77bd4e ("KVM: x86: switch hugepage recovery thread to vhost_task")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As part of enabling TDX virtual machines, support support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may different a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and privates. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
with per-vCPU tracking of the vCPU's capabailities for all features. Along
the way, refactor the code to make it easier to add/modify features, and
add a variety of self-documenting macro types to again simplify adding new
features and to help readers understand KVM's handling of existing features.
- Rework KVM's handling of VM-Exits during event vectoring to plug holes where
KVM unintentionally puts the vCPU into infinite loops in some scenarios,
e.g. if emulation is triggered by the exit, and to bring parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
=32mM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
- Add proper lockdep assertions when setting memory regions.
- Add a dedicated API for setting KVM-internal memory regions.
- Explicitly disallow all flags for KVM-internal memory regions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeHH5gACgkQOlYIJqCj
N/2pIBAAm+RLu554cc51EiIemZOI94WphNkl7QegZf+VfPvcPU0pXT3VZiX/hGcl
mk8KJhYMeIjr+rh3kVENkn9YRCmcH334dDmA7CQJwe30EYvKwC9XlOxpRmUZl8h+
DaO6+uf07rs4ZezCBUQ90f9GV7ABpp3TFmwYlPSqYW4TKbWbiwqPZL8X06ZTRgb3
Z5rFI2WpxsHOnlFLQAjmWZ6K3/oMDrdDplwJy9XTEpYV29oX5y06Zp3YeBAkeoxo
uvwv9JYyZcev7j4gamQngud9IJc3Rc5/GzK0NndsPqUOVPGHXuSiul3ajbd3HgLL
zp8Gai9y170DG8KUp+TNdhzaW0E+lLJAeLrWZTkfSwbdxGpA+XEdcwAiHd7X/xqK
0Ty/FS/vfLA9adearWE1DihskCQ1L1ZhJRqjRx5vGIwXBzyhtcKLhYWJL97eOrd+
jmPVUJe9nOZdbdmrQksT0PinQVnsNorSlfU7ODBnymzisIF0Z9NP8HRHLk4PT2C7
VBdOUpmvY1Fae1GUq5HLB1TJml/FoRztTBmtnoczczDlmOmgJ4wuMoiGYIkZaTED
Zhb40/tk4mtBCcY/iXB+0G2uAl+1xw7rrmhsV/JUKexXPJs7U2Yj8xPVpqMoU7aU
HCvUjUujsHZ/SvYXQu3rCZOsFyNfLvX8b+0Fv41gpBPgYqGZXA8=
=Dqql
-----END PGP SIGNATURE-----
Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEAD
KVM kvm_set_memory_region() cleanups and hardening for 6.14:
- Add proper lockdep assertions when setting memory regions.
- Add a dedicated API for setting KVM-internal memory regions.
- Explicitly disallow all flags for KVM-internal memory regions.
Now that there's no outer wrapper for __kvm_set_memory_region() and it's
static, drop its double-underscore prefix.
No functional change intended.
Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a dedicated API for setting internal memslots, and have it explicitly
disallow setting userspace memslots. Setting a userspace memslots without
a direct command from userspace would result in all manner of issues.
No functional change intended.
Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add proper lockdep assertions in __kvm_set_memory_region() and
__x86_set_memory_region() instead of relying comments.
Opportunistically delete __kvm_set_memory_region()'s entire function
comment as the API doesn't allocate memory or select a gfn, and the
"mostly for framebuffers" comment hasn't been true for a very long time.
Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the raw wrpkru() helper when loading the guest/host's PKRU on switch
to/from guest context, as the write_pkru() wrapper incurs an unnecessary
rdpkru(). In both paths, KVM is guaranteed to have performed RDPKRU since
the last possible write, i.e. KVM has a fresh cache of the current value
in hardware.
This effectively restores KVM's behavior to that of KVM prior to commit
c806e88734 ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw
helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a
wrapper. Commit 577ff465f5 ("x86/fpu: Only write PKRU if it is different
from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU,
but completely missed that KVM already optimized away pointless writes.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 577ff465f5 ("x86/fpu: Only write PKRU if it is different from current")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20241221011647.3747448-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a few sanity checks to prevent memslot GFNs from ever having alias bits
set.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly though calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot looks ups and zapping
operations will be provided with a GFN without the shared bit set.
If these memslot GFNs ever had the bit that selects between the two aliases
it could lead to unexpected behavior in the complicated code that directs
faulting or zapping operations between the roots that map the two aliases.
As a safety measure, prevent memslots from being set at a GFN range that
contains the alias bit.
Also, check in the kvm_faultin_pfn() for the fault path. This later check
does less today, as the alias bits are specifically stripped from the GFN
being checked, however future code could possibly call in to the fault
handler in a way that skips this stripping. Since kvm_faultin_pfn() now
has many references to vcpu->kvm, extract it to local variable.
Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework __kvm_emulate_hypercall() into a macro so that completion of
hypercalls that don't exit to userspace use direct function calls to the
completion helper, i.e. don't trigger a retpoline when RETPOLINE=y.
Opportunistically take the names of the input registers, as opposed to
taking the input values, to preemptively dedup more of the calling code
(TDX needs to use different registers). Use the direct GPR accessors to
read values to avoid the pointless marking of the registers as available
(KVM requires GPRs to always be available).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20241128004344.4072099-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Finish "emulation" of KVM hypercalls by function callback, even when the
hypercall is handled entirely within KVM, i.e. doesn't require an exit to
userspace, and refactor __kvm_emulate_hypercall()'s return value to *only*
communicate whether or not KVM should exit to userspace or resume the
guest.
(Ab)Use vcpu->run->hypercall.ret to propagate the return value to the
callback, purely to avoid having to add a trampoline for every completion
callback.
Using the function return value for KVM's control flow eliminates the
multiplexed return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only
that hypercall) means "exit to userspace".
Note, the unnecessary extra indirect call and thus potential retpoline
will be eliminated in the near future by converting the intermediate layer
to a macro.
Suggested-by: Binbin Wu <binbin.wu@linux.intel.com>
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20241128004344.4072099-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Increment the "hypercalls" stat for KVM hypercalls as soon as KVM knows
it will skip the guest instruction, i.e. once KVM is committed to emulating
the hypercall. Waiting until completion adds no known value, and creates a
discrepancy where the stat will be bumped if KVM exits to userspace as a
result of trying to skip the instruction, but not if the hypercall itself
exits.
Handling the stat in common code will also avoid the need for another
helper to dedup code when TDX comes along (TDX needs a separate completion
path due to GPR usage differences).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241128004344.4072099-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add and use user_exit_on_hypercall() to check if userspace wants to handle
a KVM hypercall instead of open-coding the logic everywhere.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
[sean: squash into one patch, keep explicit KVM_HC_MAP_GPA_RANGE check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-ID: <20241128004344.4072099-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
QEMU up to 9.2.0 is assuming that vcpu->run->hypercall.ret is 0 on exit and
it never modifies it when processing KVM_EXIT_HYPERCALL. Make this explicit
in the code, to avoid breakage when KVM starts modifying that field.
This in principle is not a good idea... It would have been much better if
KVM had set the field to -KVM_ENOSYS from the beginning, so that a dumb
userspace that does nothing on KVM_EXIT_HYPERCALL would tell the guest it
does not support KVM_HC_MAP_GPA_RANGE. However, breaking userspace is
a Very Bad Thing, as everybody should know.
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
APIC page, as such hosts will hit unexpected RMP #PFs in the host when
running VMs of any flavor.
- Fix a WARN in the hypercall completion path due to KVM trying to determine
if a guest with protected register state is in 64-bit mode (KVM's ABI is to
assume such guests only make hypercalls in 64-bit mode).
- Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
regression with Windows guests, and because KVM's read-only behavior appears
to be entirely made up.
- Treat TDP MMU faults as spurious if the faulting access is allowed given the
existing SPTE. This fixes a benign WARN (other than the WARN itself) due to
unexpectedly replacing a writable SPTE with a read-only SPTE.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmdkzW0ACgkQOlYIJqCj
N/3VXhAAjfP7w4TdJGX0yOPUmQWXATq6DC+zKW1K3dhfu9QwVXD6TLykdnQddiUg
k0yOgIpwzIEj1U6l1DamjQj2QdoBqeTjTg8jBujiQTsQH/Gc7FvbcFy70ZKZS/js
IQ+g+1av8XXnLVb5xtVMRWwEtca2BeygSVjRA8/QtIWog0rProfS35/wd3uq04Hm
Z4J6hf8N2DSELUrq0uIHvTFr8/a03oQnaNNrPyRmbvBb8m20no3A/ws6tAB4YfyN
HxQao+CJrSQ+4dinZSGa2s6TLBIbbH5QRtwgeSpbLdeKnRYyrWtThIrKB2uDKKYn
1fsn9mdIi3oAzl3mX2JLQNlx18tCOSrnmYIJCzBAQDaeUOlMNYge4CYeKOqYMJyo
vayNFz9VZwDiUcBLMWy81EyVcKlqQE8U2icgKqEN8jVdmKW36bOcAfFiK/97iPrc
7TkS4azHozoTAHdggWqReraDnjRz4PysjIA/yaMh2G/p1ZexL6OX+GwgF1YrD4w3
zVL8UmXkjiPi+VS6ayUiYDYB5R1kJ13Zq3UpYOJ2MG6Zg2we+UXynQkehS/BHwkp
2aiflhq55cVv6oZoNJeOuokyvXRAuXf8t0Got/rqGzjKGGNLPRCxD/7K1PzN1YsZ
NH5zmS2GsqA89L+Qc2l0Ohuz4kX1S2BEB+1pnxFkQc+CdjmIG9A=
=S9jc
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.13:
- Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
APIC page, as such hosts will hit unexpected RMP #PFs in the host when
running VMs of any flavor.
- Fix a WARN in the hypercall completion path due to KVM trying to determine
if a guest with protected register state is in 64-bit mode (KVM's ABI is to
assume such guests only make hypercalls in 64-bit mode).
- Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
regression with Windows guests, and because KVM's read-only behavior appears
to be entirely made up.
- Treat TDP MMU faults as spurious if the faulting access is allowed given the
existing SPTE. This fixes a benign WARN (other than the WARN itself) due to
unexpectedly replacing a writable SPTE with a read-only SPTE.
- Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
APIC page, as such hosts will hit unexpected RMP #PFs in the host when
running VMs of any flavor.
- Fix a WARN in the hypercall completion path due to KVM trying to determine
if a guest with protected register state is in 64-bit mode (KVM's ABI is to
assume such guests only make hypercalls in 64-bit mode).
- Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
regression with Windows guests, and because KVM's read-only behavior appears
to be entirely made up.
- Treat TDP MMU faults as spurious if the faulting access is allowed given the
existing SPTE. This fixes a benign WARN (other than the WARN itself) due to
unexpectedly replacing a writable SPTE with a read-only SPTE.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmdkzW0ACgkQOlYIJqCj
N/3VXhAAjfP7w4TdJGX0yOPUmQWXATq6DC+zKW1K3dhfu9QwVXD6TLykdnQddiUg
k0yOgIpwzIEj1U6l1DamjQj2QdoBqeTjTg8jBujiQTsQH/Gc7FvbcFy70ZKZS/js
IQ+g+1av8XXnLVb5xtVMRWwEtca2BeygSVjRA8/QtIWog0rProfS35/wd3uq04Hm
Z4J6hf8N2DSELUrq0uIHvTFr8/a03oQnaNNrPyRmbvBb8m20no3A/ws6tAB4YfyN
HxQao+CJrSQ+4dinZSGa2s6TLBIbbH5QRtwgeSpbLdeKnRYyrWtThIrKB2uDKKYn
1fsn9mdIi3oAzl3mX2JLQNlx18tCOSrnmYIJCzBAQDaeUOlMNYge4CYeKOqYMJyo
vayNFz9VZwDiUcBLMWy81EyVcKlqQE8U2icgKqEN8jVdmKW36bOcAfFiK/97iPrc
7TkS4azHozoTAHdggWqReraDnjRz4PysjIA/yaMh2G/p1ZexL6OX+GwgF1YrD4w3
zVL8UmXkjiPi+VS6ayUiYDYB5R1kJ13Zq3UpYOJ2MG6Zg2we+UXynQkehS/BHwkp
2aiflhq55cVv6oZoNJeOuokyvXRAuXf8t0Got/rqGzjKGGNLPRCxD/7K1PzN1YsZ
NH5zmS2GsqA89L+Qc2l0Ohuz4kX1S2BEB+1pnxFkQc+CdjmIG9A=
=S9jc
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.13:
- Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
APIC page, as such hosts will hit unexpected RMP #PFs in the host when
running VMs of any flavor.
- Fix a WARN in the hypercall completion path due to KVM trying to determine
if a guest with protected register state is in 64-bit mode (KVM's ABI is to
assume such guests only make hypercalls in 64-bit mode).
- Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
regression with Windows guests, and because KVM's read-only behavior appears
to be entirely made up.
- Treat TDP MMU faults as spurious if the faulting access is allowed given the
existing SPTE. This fixes a benign WARN (other than the WARN itself) due to
unexpectedly replacing a writable SPTE with a read-only SPTE.
When running KVM with ignore_msrs=1 and report_ignored_msrs=0, the user has
no clue that that the guest is being lied to. This may cause bug reports
such as https://gitlab.com/qemu-project/qemu/-/issues/2571, where enabling
a CPUID bit in QEMU caused Linux guests to try reading MSR_CU_DEF_ERR; and
being lied about the existence of MSR_CU_DEF_ERR caused the guest to assume
other things about the local APIC which were not true:
Sep 14 12:02:53 kernel: mce: [Firmware Bug]: Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.
Sep 14 12:02:53 kernel: unchecked MSR access error: RDMSR from 0x852 at rIP: 0xffffffffb548ffa7 (native_read_msr+0x7/0x40)
Sep 14 12:02:53 kernel: Call Trace:
...
Sep 14 12:02:53 kernel: native_apic_msr_read+0x20/0x30
Sep 14 12:02:53 kernel: setup_APIC_eilvt+0x47/0x110
Sep 14 12:02:53 kernel: mce_amd_feature_init+0x485/0x4e0
...
Sep 14 12:02:53 kernel: [Firmware Bug]: cpu 0, try to use APIC520 (LVT offset 2) for vector 0xf4, but the register is already in use for vector 0x0 on this cpu
Without reported_ignored_msrs=0 at least the host kernel log will contain
enough information to avoid going on a wild goose chase. But if reports
about individual MSR accesses are being silenced too, at least complain
loudly the first time a VM is started.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use is_64_bit_hypercall() instead of is_64_bit_mode() to detect a 64-bit
hypercall when completing said hypercall. For guests with protected state,
e.g. SEV-ES and SEV-SNP, KVM must assume the hypercall was made in 64-bit
mode as the vCPU state needed to detect 64-bit mode is unavailable.
Hacking the sev_smoke_test selftest to generate a KVM_HC_MAP_GPA_RANGE
hypercall via VMGEXIT trips the WARN:
------------[ cut here ]------------
WARNING: CPU: 273 PID: 326626 at arch/x86/kvm/x86.h:180 complete_hypercall_exit+0x44/0xe0 [kvm]
Modules linked in: kvm_amd kvm ... [last unloaded: kvm]
CPU: 273 UID: 0 PID: 326626 Comm: sev_smoke_test Not tainted 6.12.0-smp--392e932fa0f3-feat #470
Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
RIP: 0010:complete_hypercall_exit+0x44/0xe0 [kvm]
Call Trace:
<TASK>
kvm_arch_vcpu_ioctl_run+0x2400/0x2720 [kvm]
kvm_vcpu_ioctl+0x54f/0x630 [kvm]
__se_sys_ioctl+0x6b/0xc0
do_syscall_64+0x83/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: b5aead0064 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state")
Cc: stable@vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20241128004344.4072099-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If emulation is "rejected" by check_emulate_instruction(), try to
unprotect and retry instruction execution before reporting the error to
userspace. Currently, check_emulate_instruction() never signals failure
when "unprotect and retry" is possible, but that will change in the
future as both VMX and SVM will reject emulation due to coincident
exception vectoring. E.g. if there is a write to a shadowed page table
when vectoring an event, then unprotecting the gfn and retrying the
instruction will allow the guest to make forward progress in most cases,
i.e. will allow the vCPU to keep running instead of returning an error to
userspace.
This ensures that the subsequent patches won't make KVM exit to
userspace when handling an intercepted #PF during vectoring without
checking whether unprotect and retry is possible.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-4-iorlov@amazon.com
[sean: massage changelog to clarify this is a nop for the current code]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add emulation status for unhandleable vectoring, i.e. when KVM can't
emulate an instruction because emulation was triggered on an exit that
occurred while the CPU was vectoring an event. Such a situation can
occur if guest sets the IDT descriptor base to point to MMIO region,
and triggers an exception after that.
Exit to userspace with event delivery error when KVM can't emulate
an instruction when vectoring an event.
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-3-iorlov@amazon.com
[sean: massage changelog and X86EMUL_UNHANDLEABLE_VECTORING comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Extract VMX code for unhandleable VM-Exit during vectoring into
vendor-agnostic function so that boiler-plate code can be shared by SVM.
To avoid unnecessarily complexity in the helper, unconditionally report a
GPA to userspace instead of having a conditional entry. For exits that
don't report a GPA, i.e. everything except EPT Misconfig, simply report
KVM's "invalid GPA".
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com
[sean: clarify that the INVALID_GPA logic is new]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Switch all queries (except XSAVES) of guest features from guest CPUID to
guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls
to guest_cpu_cap_has().
Keep guest_cpuid_has() around for XSAVES, but subsume its helper
guest_cpuid_get_register() and add a compile-time assertion to prevent
using guest_cpuid_has() for any other feature. Add yet another comment
for XSAVE to explain why KVM is allowed to query its raw guest CPUID.
Opportunistically drop the unused guest_cpuid_clear(), as there should be
no circumstance in which KVM needs to _clear_ a guest CPUID feature now
that everything is tracked via cpu_caps. E.g. KVM may need to _change_
a feature to emulate dynamic CPUID flags, but KVM should never need to
clear a feature in guest CPUID to prevent it from being used by the guest.
Delete the last remnants of the governed features framework, as the lone
holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for
governed vs. ungoverned features.
Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when
computing reserved CR4 bits is a nop when viewed as a whole, as KVM's
capabilities are already incorporated into the calculation, i.e. if a
feature is present in guest CPUID but unsupported by KVM, its CR4 bit
was already being marked as reserved, checking guest_cpu_cap_has() simply
double-stamps that it's a reserved bit.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().
The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.
Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>