linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-04 18:49:41 +00:00

Author	SHA1	Message	Date
Sean Christopherson	e447212593	KVM: x86: Unload MMUs during vCPU destruction, not before When destroying a VM, unload a vCPU's MMUs as part of normal vCPU freeing, instead of as a separate prepratory action. Unloading MMUs ahead of time is a holdover from commit `7b53aa5650` ("KVM: Fix vcpu freeing for guest smp"), which "fixed" a rather egregious flaw where KVM would attempt to free all MMU pages when destroying a vCPU. At the time, KVM would spin on all MMU pages in a VM when free a single vCPU, and so would hang due to the way KVM pins and zaps root pages (roots are invalidated but not freed if they are pinned by a vCPU). static void free_mmu_pages(struct kvm_vcpu vcpu) { struct kvm_mmu_page page; while (!list_empty(&vcpu->kvm->active_mmu_pages)) { page = container_of(vcpu->kvm->active_mmu_pages.next, struct kvm_mmu_page, link); kvm_mmu_zap_page(vcpu->kvm, page); } free_page((unsigned long)vcpu->mmu.pae_root); } Now that KVM doesn't try to free all MMU pages when destroying a single vCPU, there's no need to unpin roots prior to destroying a vCPU. Note! While KVM mostly destroys all MMUs before calling kvm_arch_destroy_vm() (see commit `f00be0cae4` ("KVM: MMU: do not free active mmu pages in free_mmu_pages()")), unpinning MMU roots during vCPU destruction will unfortunately trigger remote TLB flushes, i.e. will try to send requests to all vCPUs. Happily, thanks to commit `27592ae8db` ("KVM: Move wiping of the kvm->vcpus array to common code"), that's a non-issue as freed vCPUs are naturally skipped by xa_for_each_range(), i.e. by kvm_for_each_vcpu(). Prior to that commit, KVM x86 rather stupidly freed vCPUs one-by-one, and _then_ nullified them, one-by-one. I.e. triggering a VM-wide request would hit a use-after-free. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-02-26 13:17:23 -05:00
Sean Christopherson	ed09b50b54	KVM: x86: Don't load/put vCPU when unloading its MMU during teardown Don't load (and then put) a vCPU when unloading its MMU during VM destruction, as nothing in kvm_mmu_unload() accesses vCPU state beyond the root page/address of each MMU, i.e. can't possible need to run with the vCPU loaded. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-02-26 13:17:23 -05:00
Sean Christopherson	17bcd71442	KVM: x86: Free vCPUs before freeing VM state Free vCPUs before freeing any VM state, as both SVM and VMX may access VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs to be kicked out of nested guest mode. Commit `6fcee03df6` ("KVM: x86: avoid loading a vCPU after .vm_destroy was called") partially fixed the issue, but for unknown reasons only moved the MMU unloading before VM destruction. Complete the change, and free all vCPU state prior to destroying VM state, as nVMX accesses even more state than nSVM. In addition to the AVIC, KVM can hit a use-after-free on MSR filters: kvm_msr_allowed+0x4c/0xd0 __kvm_set_msr+0x12d/0x1e0 kvm_set_msr+0x19/0x40 load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel] nested_vmx_vmexit+0x715/0xbd0 [kvm_intel] nested_vmx_free_vcpu+0x33/0x50 [kvm_intel] vmx_free_vcpu+0x54/0xc0 [kvm_intel] kvm_arch_vcpu_destroy+0x28/0xf0 kvm_vcpu_destroy+0x12/0x50 kvm_arch_destroy_vm+0x12c/0x1c0 kvm_put_kvm+0x263/0x3c0 kvm_vm_release+0x21/0x30 and an upcoming fix to process injectable interrupts on nested VM-Exit will access the PIC: BUG: kernel NULL pointer dereference, address: 0000000000000090 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm] Call Trace: <TASK> kvm_cpu_has_injectable_intr+0xe/0x60 [kvm] nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel] nested_vmx_free_vcpu+0x40/0x50 [kvm_intel] vmx_vcpu_free+0x2d/0x80 [kvm_intel] kvm_arch_vcpu_destroy+0x2d/0x130 [kvm] kvm_destroy_vcpus+0x8a/0x100 [kvm] kvm_arch_destroy_vm+0xa7/0x1d0 [kvm] kvm_destroy_vm+0x172/0x300 [kvm] kvm_vcpu_release+0x31/0x50 [kvm] Inarguably, both nSVM and nVMX need to be fixed, but punt on those cleanups for the moment. Conceptually, vCPUs should be freed before VM state. Assets like the I/O APIC and PIC _must_ be allocated before vCPUs are created, so it stands to reason that they must be freed _after_ vCPUs are destroyed. Reported-by: Aaron Lewis <aaronlewis@google.com> Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com Cc: Jim Mattson <jmattson@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-02-26 04:17:15 -05:00
Sean Christopherson	b50cb2b155	KVM: x86: Use a dedicated flow for queueing re-injected exceptions Open code the filling of vcpu->arch.exception in kvm_requeue_exception() instead of bouncing through kvm_multiple_exception(), as re-injection doesn't actually share that much code with "normal" injection, e.g. the VM-Exit interception check, payload delivery, and nested exception code is all bypassed as those flows only apply during initial injection. When FRED comes along, the special casing will only get worse, as FRED explicitly tracks nested exceptions and essentially delivers the payload on the stack frame, i.e. re-injection will need more inputs, and normal injection will have yet more code that needs to be bypassed when KVM is re-injecting an exception. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-25 07:23:07 -08:00
Sean Christopherson	4fa0efb43a	KVM: x86: Rename and invert async #PF's send_user_only flag to send_always Rename send_user_only to avoid "user", because KVM's ABI is to not inject page faults into CPL0, whereas "user" in x86 is specifically CPL3. Invert the polarity to keep the naming simple and unambiguous. E.g. while KVM often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and "send_kernel" could be misconstrued as "send only to kernel". Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-25 07:10:47 -08:00
Sean Christopherson	b9595d1dde	KVM: x86: Don't inject PV async #PF if SEND_ALWAYS=0 and guest state is protected Don't inject PV async #PFs into guests with protected register state, i.e. SEV-ES and SEV-SNP guests, unless the guest has opted-in to receiving #PFs at CPL0. For protected guests, the actual CPL of the guest is unknown. Note, no sane CoCo guest should enable PV async #PF, but the current state of Linux-as-a-CoCo-guest isn't entirely sane. Fixes: `add5e2f045` ("KVM: SVM: Add support for the SEV-ES VMSA") Link: https://lore.kernel.org/r/20250215010609.1199982-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-25 07:10:47 -08:00
Fred Griffoul	a2b00f85d7	KVM: x86: Update Xen TSC leaves during CPUID emulation The Xen emulation in KVM modifies certain CPUID leaves to expose TSC information to the guest. Previously, these CPUID leaves were updated whenever guest time changed, but this conflicts with KVM_SET_CPUID/KVM_SET_CPUID2 ioctls which reject changes to CPUID entries on running vCPUs. Fix this by updating the TSC information directly in the CPUID emulation handler instead of modifying the vCPU's CPUID entries. Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20250124150539.69975-1-fgriffo@amazon.co.uk Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-25 07:09:55 -08:00
Sean Christopherson	26e228ec16	KVM: x86/xen: Move kvm_xen_hvm_config field into kvm_xen Now that all KVM usage of the Xen HVM config information is buried behind CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch and into kvm_xen. No functional change intended. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-24 08:59:59 -08:00
Sean Christopherson	bb0978d95a	KVM: x86/xen: Add an #ifdef'd helper to detect writes to Xen MSR Add a helper to detect writes to the Xen hypercall page MSR, and provide a stub for CONFIG_KVM_XEN=n to optimize out the check for kernels built without Xen support. Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20250215011437.1203084-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-24 08:59:57 -08:00
Sean Christopherson	1b3c38050b	KVM: x86: Override TSC_STABLE flag for Xen PV clocks in kvm_guest_time_update() When updating PV clocks, handle the Xen-specific UNSTABLE_TSC override in the main kvm_guest_time_update() by simply clearing PVCLOCK_TSC_STABLE_BIT in the flags of the reference pvclock structure. Expand the comment to (hopefully) make it obvious that Xen clocks need to be processed after all clocks that care about the TSC_STABLE flag. No functional change intended. Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	847d68abf1	KVM: x86: Setup Hyper-V TSC page before Xen PV clocks (during clock update) When updating paravirtual clocks, setup the Hyper-V TSC page before Xen PV clocks. This will allow dropping xen_pvclock_tsc_unstable in favor of simply clearing PVCLOCK_TSC_STABLE_BIT in the reference flags. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	39d61b46ad	KVM: x86: Remove per-vCPU "cache" of its reference pvclock Remove the per-vCPU "cache" of the reference pvclock and instead cache only the TSC shift+multiplier. All other fields in pvclock are fully recomputed by kvm_guest_time_update(), i.e. aren't actually persisted. In addition to shaving a few bytes, explicitly tracking the TSC shift/mul fields makes it easier to see that those fields are tied to hw_tsc_khz (they exist to avoid having to do expensive math in the common case). And conversely, not tracking the other fields makes it easier to see that things like the version number are pulled from the guest's copy, not from KVM's reference. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	46aed4d4a7	KVM: x86: Pass reference pvclock as a param to kvm_setup_guest_pvclock() Pass the reference pvclock structure that's used to setup each individual pvclock as a parameter to kvm_setup_guest_pvclock() as a preparatory step toward removing kvm_vcpu_arch.hv_clock. No functional change intended. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	93fb0b10e7	KVM: x86: Set PVCLOCK_GUEST_STOPPED only for kvmclock, not for Xen PV clock Handle "guest stopped" propagation only for kvmclock, as the flag is set if and only if kvmclock is "active", i.e. can only be set for Xen PV clock if kvmclock and Xen PV clock are in-use by the guest, which creates very bizarre behavior for the guest. Simply restrict the flag to kvmclock, e.g. instead of trying to handle Xen PV clock, as propagation of PVCLOCK_GUEST_STOPPED was unintentionally added during a refactoring, and while Xen proper defines XEN_PVCLOCK_GUEST_STOPPED, there's no evidence that Xen guests actually support the flag. Check and clear pvclock_set_guest_stopped_request if and only if kvmclock is active to preserve the original behavior, i.e. keep the flag pending if kvmclock happens to be disabled when KVM processes the initial request. Fixes: `aa096aa0a0` ("KVM: x86/xen: setup pvclock updates") Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	24c1663780	KVM: x86: Don't bleed PVCLOCK_GUEST_STOPPED across PV clocks When updating a specific PV clock, make a full copy of KVM's reference copy/cache so that PVCLOCK_GUEST_STOPPED doesn't bleed across clocks. E.g. in the unlikely scenario the guest has enabled both kvmclock and Xen PV clock, a dangling GUEST_STOPPED in kvmclock would bleed into Xen PV clock. Using a local copy of the pvclock structure also sets the stage for eliminating the per-vCPU copy/cache (only the TSC frequency information actually "needs" to be cached/persisted). Fixes: `aa096aa0a0` ("KVM: x86/xen: setup pvclock updates") Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	6c4927a4b7	KVM: x86: Process "guest stopped request" once per guest time update Handle "guest stopped" requests once per guest time update in preparation of restoring KVM's historical behavior of setting PVCLOCK_GUEST_STOPPED for kvmclock and only kvmclock. For now, simply move the code to minimize the probability of an unintentional change in functionally. Note, in practice, all clocks are guaranteed to see the request (or not) even though each PV clock processes the request individual, as KVM holds vcpu->mutex (blocks KVM_KVMCLOCK_CTRL) and it should be impossible for KVM's suspend notifier to run while KVM is handling requests. And because the helper updates the reference flags, all subsequent PV clock updates will pick up PVCLOCK_GUEST_STOPPED. Note #2, once PVCLOCK_GUEST_STOPPED is restricted to kvmclock, the horrific #ifdef will go away. Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:55 -08:00
Sean Christopherson	aceb04f571	KVM: x86: Drop local pvclock_flags variable in kvm_guest_time_update() Drop the local pvclock_flags in kvm_guest_time_update(), the local variable is immediately shoved into the per-vCPU "cache", i.e. the local variable serves no purpose. No functional change intended. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:54 -08:00
Sean Christopherson	4198f38aed	KVM: x86: Eliminate "handling" of impossible errors during SUSPEND Drop KVM's handling of kvm_set_guest_paused() failure when reacting to a SUSPEND notification, as kvm_set_guest_paused() only "fails" if the vCPU isn't using kvmclock, and KVM's notifier callback pre-checks that kvmclock is active. I.e. barring some bizarre edge case that shouldn't be treated as an error in the first place, kvm_arch_suspend_notifier() can't fail. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:54 -08:00
Sean Christopherson	d9c5ed0a9b	KVM: x86: Don't take kvm->lock when iterating over vCPUs in suspend notifier When queueing vCPU PVCLOCK updates in response to SUSPEND or HIBERNATE, don't take kvm->lock as doing so can trigger a largely theoretical deadlock, it is perfectly safe to iterate over the xarray of vCPUs without holding kvm->lock, and kvm->lock doesn't protect kvm_set_guest_paused() in any way (pv_time.active and pvclock_set_guest_stopped_request are protected by vcpu->mutex, not kvm->lock). Reported-by: syzbot+352e553a86e0d75f5120@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/677c0f36.050a0220.3b3668.0014.GAE@google.com Fixes: `7d62874f69` ("kvm: x86: implement KVM PM-notifier") Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:45:54 -08:00
Sean Christopherson	93da6af3ae	KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation Defer runtime CPUID updates until the next non-faulting CPUID emulation or KVM_GET_CPUID2, which are the only paths in KVM that consume the dynamic entries. Deferring the updates is especially beneficial to nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state changes, not to mention the updates don't need to be realized while L2 is active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept on Intel, but not AMD). Deferring CPUID updates shaves several hundred cycles from nested VMX roundtrips, as measured from L2 executing CPUID in a tight loop: SKX 6850 => 6450 ICX 9000 => 8800 EMR 7900 => 7700 Alternatively, KVM could update only the CPUID leaves that are affected by the state change, e.g. update XSAVE info only if XCR0 or XSS changes, but that adds non-trivial complexity and doesn't solve the underlying problem of nested transitions potentially changing both XCR0 and XSS, on both nested VM-Enter and VM-Exit. Skipping updates entirely if L2 is active and CPUID is being intercepted by L1 could work for the common case. However, simply skipping updates if L2 is active is very subtly dangerous and complex. Most KVM updates are triggered by changes to the current vCPU state, which may be L2 state, whereas performing updates only for L1 would requiring detecting changes to L1 state. KVM would need to either track relevant L1 state, or defer runtime CPUID updates until the next nested VM-Exit. The former is ugly and complex, while the latter comes with similar dangers to deferring all CPUID updates, and would only address the nested VM-Enter path. To guard against using stale data, disallow querying dynamic CPUID feature bits, i.e. features that KVM updates at runtime, via a compile-time assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the MISC_ENABLE_NO_MWAIT means that MWAIT is _conditionally_ a dynamic CPUID feature. Note, the rule could be enforced for MWAIT as well, e.g. by querying guest CPUID in kvm_emulate_monitor_mwait, but there's no obvious advtantage to doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition), and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is wrong for other reasons. Beyond the aforementioned feature bits, the only other dynamic CPUID (sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those CPUID entries in KVM is all but guaranteed to be a bug. The layout for an actual XSAVE buffer depends on the format (compacted or not) and potentially the features that are actually enabled. E.g. see the logic in fpstate_clear_xstate_component() needed to poke into the guest's effective XSAVE state to clear MPX state on INIT. KVM does consume CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(), but not EBX, which is the only dynamic output register in the leaf. Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:16:33 -08:00
Sean Christopherson	a487f6797c	KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit Rework MONITOR/MWAIT emulation to query X86_FEATURE_MWAIT if and only if the MISC_ENABLE_NO_MWAIT quirk is enabled, in which case MWAIT is not a dynamic, KVM-controlled CPUID feature. KVM's funky ABI for that quirk is to emulate MONITOR/MWAIT as nops if userspace sets MWAIT in guest CPUID. For the case where KVM owns the MWAIT feature bit, check MISC_ENABLES itself, i.e. check the actual control, not its reflection in guest CPUID. Avoiding consumption of dynamic CPUID features will allow KVM to defer runtime CPUID updates until kvm_emulate_cpuid(), i.e. until the updates become visible to the guest. Alternatively, KVM could play other games with runtime CPUID updates, e.g. by precisely specifying which feature bits to update, but doing so adds non-trivial complexity and doesn't solve the underlying issue of unnecessary updates causing meaningful overhead for nested virtualization roundtrips. Link: https://lore.kernel.org/r/20241211013302.1347853-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:16:32 -08:00
Jim Mattson	e9cb61055f	KVM: x86: Clear pv_unhalted on all transitions to KVM_MP_STATE_RUNNABLE In kvm_set_mp_state(), ensure that vcpu->arch.pv.pv_unhalted is always cleared on a transition to KVM_MP_STATE_RUNNABLE, so that the next HLT instruction will be respected. Fixes: `6aef266c6e` ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks") Fixes: `b6b8a1451f` ("KVM: nVMX: Rework interception of IRQs and NMIs") Fixes: `38c0b192bd` ("KVM: SVM: leave halted state on vmexit") Fixes: `1a65105a5a` ("KVM: x86/xen: handle PV spinlocks slowpath") Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250113200150.487409-3-jmattson@google.com [sean: add Xen PV spinlocks to the list of Fixes, tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:16:28 -08:00
Jim Mattson	c9e5f3fa90	KVM: x86: Introduce kvm_set_mp_state() Replace all open-coded assignments to vcpu->arch.mp_state with calls to a new helper, kvm_set_mp_state(), to centralize all changes to mp_state. No functional change intended. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 10:16:27 -08:00
Sean Christopherson	c2fee09fc1	KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop Move the conditional loading of hardware DR6 with the guest's DR6 value out of the core .vcpu_run() loop to fix a bug where KVM can load hardware with a stale vcpu->arch.dr6. When the guest accesses a DR and host userspace isn't debugging the guest, KVM disables DR interception and loads the guest's values into hardware on VM-Enter and saves them on VM-Exit. This allows the guest to access DRs at will, e.g. so that a sequence of DR accesses to configure a breakpoint only generates one VM-Exit. For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest) and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop. But for DR6, the guest's value doesn't need to be loaded into hardware for KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas VMX requires software to manually load the guest value, and so loading the guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done _inside_ the core run loop. Unfortunately, saving the guest values on VM-Exit is initiated by common x86, again outside of the core run loop. If the guest modifies DR6 (in hardware, when DR interception is disabled), and then the next VM-Exit is a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and clobber the guest's actual value. The bug shows up primarily with nested VMX because KVM handles the VMX preemption timer in the fastpath, and the window between hardware DR6 being modified (in guest context) and DR6 being read by guest software is orders of magnitude larger in a nested setup. E.g. in non-nested, the VMX preemption timer would need to fire precisely between #DB injection and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the window where hardware DR6 is "dirty" extends all the way from L1 writing DR6 to VMRESUME (in L1). L1's view: ========== <L1 disables DR interception> CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0 A: L1 Writes DR6 CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1 B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec D: L1 reads DR6, arch.dr6 = 0 CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0 CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0 L2 reads DR6, L1 disables DR interception CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216 CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0 CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0 L2 detects failure CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT L1 reads DR6 (confirms failure) CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0 L0's view: ========== L2 reads DR6, arch.dr6 = 0 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 L2 => L1 nested VM-Exit CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23 L1 writes DR7, L0 disables DR interception CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007 CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23 L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0 A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'> B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23 C: L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0 L1 => L2 nested VM-Enter CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME L0 reads DR6, arch.dr6 = 0 Reported-by: John Stultz <jstultz@google.com> Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com Fixes: `375e28ffc0` ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT") Fixes: `d67668e9dd` ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 08:59:38 -08:00
David Woodhouse	3617c0ee7d	KVM: x86/xen: Only write Xen hypercall page for guest writes to MSR The Xen hypercall page MSR is write-only. When the guest writes an address to the MSR, the hypervisor populates the referenced page with hypercall functions. There is no reason for the host ever to write to the MSR, and it isn't even readable. Allowing host writes to trigger the hypercall page allows userspace to attack the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and writes to guest memory. E.g. if userspace sets the MSR to MSR_IA32_XSS, KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU violation due to writing guest memory: ============================= WARNING: suspicious RCU usage 6.13.0-rc3 ----------------------------- include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage! stack backtrace: CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x176/0x1c0 kvm_vcpu_gfn_to_memslot+0x259/0x280 kvm_vcpu_write_guest+0x3a/0xa0 kvm_xen_write_hypercall_page+0x268/0x300 kvm_set_msr_common+0xc44/0x1940 vmx_set_msr+0x9db/0x1fc0 kvm_vcpu_reset+0x857/0xb50 kvm_arch_vcpu_create+0x37e/0x4d0 kvm_vm_ioctl+0x669/0x2100 __x64_sys_ioctl+0xc1/0xf0 do_syscall_64+0xc5/0x210 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7feda371b539 While the MSR index isn't strictly ABI, i.e. can theoretically float to any value, in practice no known VMM sets the MSR index to anything other than 0x40000000 or 0x40000200. Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/de0437379dfab11e431a23c8ce41a29234c06cbf.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-11 07:05:01 -08:00
Paolo Bonzini	6f61269495	KVM: remove kvm_arch_post_init_vm The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-02-04 11:27:45 -05:00
Keith Busch	931656b9e2	kvm: defer huge page recovery vhost task to later Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: `d96c77bd4e` ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-01-24 10:53:56 -05:00
Paolo Bonzini	86eb1aef72	Merge branch 'kvm-mirror-page-tables' into HEAD As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. Confidential computing solutions almost invariably have concepts of private and shared memory, but they may different a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs. As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs. There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory: external EPT - Hidden within the TDX module, modified via TDX module calls. mirror EPT - Bookkeeping tree used as an optimization by KVM, not used by the processor. direct EPT - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM. Modifying external EPT ---------------------- Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the ground work for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE cannot anymore be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks. Zapping external EPT -------------------- While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast(). For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and privates. These operations must not zap any private memory that is in use by the guest. This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest. To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example: Memslot deletion - Private and shared MMU notifier based zapping - Shared only Conversion to shared - Private only Conversion to private - Shared only Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.	2025-01-20 07:15:58 -05:00
Paolo Bonzini	3eba032bb7	Merge branch 'kvm-userspace-hypercall' into HEAD Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not.	2025-01-20 07:03:06 -05:00
Paolo Bonzini	4f7ff70c05	KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to replace "governed" features with per-vCPU tracking of the vCPU's capabailities for all features. Along the way, refactor the code to make it easier to add/modify features, and add a variety of self-documenting macro types to again simplify adding new features and to help readers understand KVM's handling of existing features. - Rework KVM's handling of VM-Exits during event vectoring to plug holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios, e.g. if emulation is triggered by the exit, and to bring parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35 8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc= =32mM -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature. - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.	2025-01-20 06:49:39 -05:00
Paolo Bonzini	4e4f38f84e	KVM kvm_set_memory_region() cleanups and hardening for 6.14: - Add proper lockdep assertions when setting memory regions. - Add a dedicated API for setting KVM-internal memory regions. - Explicitly disallow all flags for KVM-internal memory regions. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeHH5gACgkQOlYIJqCj N/2pIBAAm+RLu554cc51EiIemZOI94WphNkl7QegZf+VfPvcPU0pXT3VZiX/hGcl mk8KJhYMeIjr+rh3kVENkn9YRCmcH334dDmA7CQJwe30EYvKwC9XlOxpRmUZl8h+ DaO6+uf07rs4ZezCBUQ90f9GV7ABpp3TFmwYlPSqYW4TKbWbiwqPZL8X06ZTRgb3 Z5rFI2WpxsHOnlFLQAjmWZ6K3/oMDrdDplwJy9XTEpYV29oX5y06Zp3YeBAkeoxo uvwv9JYyZcev7j4gamQngud9IJc3Rc5/GzK0NndsPqUOVPGHXuSiul3ajbd3HgLL zp8Gai9y170DG8KUp+TNdhzaW0E+lLJAeLrWZTkfSwbdxGpA+XEdcwAiHd7X/xqK 0Ty/FS/vfLA9adearWE1DihskCQ1L1ZhJRqjRx5vGIwXBzyhtcKLhYWJL97eOrd+ jmPVUJe9nOZdbdmrQksT0PinQVnsNorSlfU7ODBnymzisIF0Z9NP8HRHLk4PT2C7 VBdOUpmvY1Fae1GUq5HLB1TJml/FoRztTBmtnoczczDlmOmgJ4wuMoiGYIkZaTED Zhb40/tk4mtBCcY/iXB+0G2uAl+1xw7rrmhsV/JUKexXPJs7U2Yj8xPVpqMoU7aU HCvUjUujsHZ/SvYXQu3rCZOsFyNfLvX8b+0Fv41gpBPgYqGZXA8= =Dqql -----END PGP SIGNATURE----- Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEAD KVM kvm_set_memory_region() cleanups and hardening for 6.14: - Add proper lockdep assertions when setting memory regions. - Add a dedicated API for setting KVM-internal memory regions. - Explicitly disallow all flags for KVM-internal memory regions.	2025-01-20 06:36:14 -05:00
Sean Christopherson	344315e93d	KVM: x86: Drop double-underscores from __kvm_set_memory_region() Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-01-14 17:36:16 -08:00
Sean Christopherson	156bffdb2b	KVM: Add a dedicated API for setting KVM-internal memslots Add a dedicated API for setting internal memslots, and have it explicitly disallow setting userspace memslots. Setting a userspace memslots without a direct command from userspace would result in all manner of issues. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-01-14 17:36:15 -08:00
Sean Christopherson	d131f0042f	KVM: Assert slots_lock is held when setting memory regions Add proper lockdep assertions in __kvm_set_memory_region() and __x86_set_memory_region() instead of relying comments. Opportunistically delete __kvm_set_memory_region()'s entire function comment as the API doesn't allocate memory or select a gfn, and the "mostly for framebuffers" comment hasn't been true for a very long time. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-01-14 17:36:15 -08:00
Sean Christopherson	4c20cd4cee	KVM: x86: Avoid double RDPKRU when loading host/guest PKRU Use the raw wrpkru() helper when loading the guest/host's PKRU on switch to/from guest context, as the write_pkru() wrapper incurs an unnecessary rdpkru(). In both paths, KVM is guaranteed to have performed RDPKRU since the last possible write, i.e. KVM has a fresh cache of the current value in hardware. This effectively restores KVM's behavior to that of KVM prior to commit `c806e88734` ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a wrapper. Commit `577ff465f5` ("x86/fpu: Only write PKRU if it is different from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU, but completely missed that KVM already optimized away pointless writes. Reported-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: `577ff465f5` ("x86/fpu: Only write PKRU if it is different from current") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20241221011647.3747448-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-01-08 14:08:25 -08:00
Rick Edgecombe	2c3412e999	KVM: x86/mmu: Prevent aliased memslot GFNs Add a few sanity checks to prevent memslot GFNs from ever having alias bits set. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly though calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot looks ups and zapping operations will be provided with a GFN without the shared bit set. If these memslot GFNs ever had the bit that selects between the two aliases it could lead to unexpected behavior in the complicated code that directs faulting or zapping operations between the roots that map the two aliases. As a safety measure, prevent memslots from being set at a GFN range that contains the alias bit. Also, check in the kvm_faultin_pfn() for the fault path. This later check does less today, as the alias bits are specifically stripped from the GFN being checked, however future code could possibly call in to the fault handler in a way that skips this stripping. Since kvm_faultin_pfn() now has many references to vcpu->kvm, extract it to local variable. Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:31:55 -05:00
Paolo Bonzini	c50be1c945	KVM: x86: Refactor __kvm_emulate_hypercall() into a macro Rework __kvm_emulate_hypercall() into a macro so that completion of hypercalls that don't exit to userspace use direct function calls to the completion helper, i.e. don't trigger a retpoline when RETPOLINE=y. Opportunistically take the names of the input registers, as opposed to taking the input values, to preemptively dedup more of the calling code (TDX needs to use different registers). Use the direct GPR accessors to read values to avoid the pointless marking of the registers as available (KVM requires GPRs to always be available). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20241128004344.4072099-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-22 13:00:25 -05:00
Sean Christopherson	d9eb86a6f4	KVM: x86: Always complete hypercall via function callback Finish "emulation" of KVM hypercalls by function callback, even when the hypercall is handled entirely within KVM, i.e. doesn't require an exit to userspace, and refactor __kvm_emulate_hypercall()'s return value to only communicate whether or not KVM should exit to userspace or resume the guest. (Ab)Use vcpu->run->hypercall.ret to propagate the return value to the callback, purely to avoid having to add a trampoline for every completion callback. Using the function return value for KVM's control flow eliminates the multiplexed return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall) means "exit to userspace". Note, the unnecessary extra indirect call and thus potential retpoline will be eliminated in the near future by converting the intermediate layer to a macro. Suggested-by: Binbin Wu <binbin.wu@linux.intel.com> Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20241128004344.4072099-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-22 13:00:25 -05:00
Sean Christopherson	05a518b49d	KVM: x86: Bump hypercall stat prior to fully completing hypercall Increment the "hypercalls" stat for KVM hypercalls as soon as KVM knows it will skip the guest instruction, i.e. once KVM is committed to emulating the hypercall. Waiting until completion adds no known value, and creates a discrepancy where the stat will be bumped if KVM exits to userspace as a result of trying to skip the instruction, but not if the hypercall itself exits. Handling the stat in common code will also avoid the need for another helper to dedup code when TDX comes along (TDX needs a separate completion path due to GPR usage differences). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20241128004344.4072099-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-22 13:00:25 -05:00
Binbin Wu	c4c083d951	KVM: x86: Add a helper to check for user interception of KVM hypercalls Add and use user_exit_on_hypercall() to check if userspace wants to handle a KVM hypercall instead of open-coding the logic everywhere. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> [sean: squash into one patch, keep explicit KVM_HC_MAP_GPA_RANGE check] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Message-ID: <20241128004344.4072099-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-22 13:00:25 -05:00
Paolo Bonzini	9a1dfeff44	KVM: x86: clear vcpu->run->hypercall.ret before exiting for KVM_EXIT_HYPERCALL QEMU up to 9.2.0 is assuming that vcpu->run->hypercall.ret is 0 on exit and it never modifies it when processing KVM_EXIT_HYPERCALL. Make this explicit in the code, to avoid breakage when KVM starts modifying that field. This in principle is not a good idea... It would have been much better if KVM had set the field to -KVM_ENOSYS from the beginning, so that a dumb userspace that does nothing on KVM_EXIT_HYPERCALL would tell the guest it does not support KVM_HC_MAP_GPA_RANGE. However, breaking userspace is a Very Bad Thing, as everybody should know. Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-22 13:00:25 -05:00
Paolo Bonzini	10b2c8a67c	KVM x86 fixes for 6.13: - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual APIC page, as such hosts will hit unexpected RMP #PFs in the host when running VMs of any flavor. - Fix a WARN in the hypercall completion path due to KVM trying to determine if a guest with protected register state is in 64-bit mode (KVM's ABI is to assume such guests only make hypercalls in 64-bit mode). - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a regression with Windows guests, and because KVM's read-only behavior appears to be entirely made up. - Treat TDP MMU faults as spurious if the faulting access is allowed given the existing SPTE. This fixes a benign WARN (other than the WARN itself) due to unexpectedly replacing a writable SPTE with a read-only SPTE. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmdkzW0ACgkQOlYIJqCj N/3VXhAAjfP7w4TdJGX0yOPUmQWXATq6DC+zKW1K3dhfu9QwVXD6TLykdnQddiUg k0yOgIpwzIEj1U6l1DamjQj2QdoBqeTjTg8jBujiQTsQH/Gc7FvbcFy70ZKZS/js IQ+g+1av8XXnLVb5xtVMRWwEtca2BeygSVjRA8/QtIWog0rProfS35/wd3uq04Hm Z4J6hf8N2DSELUrq0uIHvTFr8/a03oQnaNNrPyRmbvBb8m20no3A/ws6tAB4YfyN HxQao+CJrSQ+4dinZSGa2s6TLBIbbH5QRtwgeSpbLdeKnRYyrWtThIrKB2uDKKYn 1fsn9mdIi3oAzl3mX2JLQNlx18tCOSrnmYIJCzBAQDaeUOlMNYge4CYeKOqYMJyo vayNFz9VZwDiUcBLMWy81EyVcKlqQE8U2icgKqEN8jVdmKW36bOcAfFiK/97iPrc 7TkS4azHozoTAHdggWqReraDnjRz4PysjIA/yaMh2G/p1ZexL6OX+GwgF1YrD4w3 zVL8UmXkjiPi+VS6ayUiYDYB5R1kJ13Zq3UpYOJ2MG6Zg2we+UXynQkehS/BHwkp 2aiflhq55cVv6oZoNJeOuokyvXRAuXf8t0Got/rqGzjKGGNLPRCxD/7K1PzN1YsZ NH5zmS2GsqA89L+Qc2l0Ohuz4kX1S2BEB+1pnxFkQc+CdjmIG9A= =S9jc -----END PGP SIGNATURE----- Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD KVM x86 fixes for 6.13: - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual APIC page, as such hosts will hit unexpected RMP #PFs in the host when running VMs of any flavor. - Fix a WARN in the hypercall completion path due to KVM trying to determine if a guest with protected register state is in 64-bit mode (KVM's ABI is to assume such guests only make hypercalls in 64-bit mode). - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a regression with Windows guests, and because KVM's read-only behavior appears to be entirely made up. - Treat TDP MMU faults as spurious if the faulting access is allowed given the existing SPTE. This fixes a benign WARN (other than the WARN itself) due to unexpectedly replacing a writable SPTE with a read-only SPTE.	2024-12-22 12:59:33 -05:00
Paolo Bonzini	8afa5b10af	KVM x86 fixes for 6.13: - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual APIC page, as such hosts will hit unexpected RMP #PFs in the host when running VMs of any flavor. - Fix a WARN in the hypercall completion path due to KVM trying to determine if a guest with protected register state is in 64-bit mode (KVM's ABI is to assume such guests only make hypercalls in 64-bit mode). - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a regression with Windows guests, and because KVM's read-only behavior appears to be entirely made up. - Treat TDP MMU faults as spurious if the faulting access is allowed given the existing SPTE. This fixes a benign WARN (other than the WARN itself) due to unexpectedly replacing a writable SPTE with a read-only SPTE. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmdkzW0ACgkQOlYIJqCj N/3VXhAAjfP7w4TdJGX0yOPUmQWXATq6DC+zKW1K3dhfu9QwVXD6TLykdnQddiUg k0yOgIpwzIEj1U6l1DamjQj2QdoBqeTjTg8jBujiQTsQH/Gc7FvbcFy70ZKZS/js IQ+g+1av8XXnLVb5xtVMRWwEtca2BeygSVjRA8/QtIWog0rProfS35/wd3uq04Hm Z4J6hf8N2DSELUrq0uIHvTFr8/a03oQnaNNrPyRmbvBb8m20no3A/ws6tAB4YfyN HxQao+CJrSQ+4dinZSGa2s6TLBIbbH5QRtwgeSpbLdeKnRYyrWtThIrKB2uDKKYn 1fsn9mdIi3oAzl3mX2JLQNlx18tCOSrnmYIJCzBAQDaeUOlMNYge4CYeKOqYMJyo vayNFz9VZwDiUcBLMWy81EyVcKlqQE8U2icgKqEN8jVdmKW36bOcAfFiK/97iPrc 7TkS4azHozoTAHdggWqReraDnjRz4PysjIA/yaMh2G/p1ZexL6OX+GwgF1YrD4w3 zVL8UmXkjiPi+VS6ayUiYDYB5R1kJ13Zq3UpYOJ2MG6Zg2we+UXynQkehS/BHwkp 2aiflhq55cVv6oZoNJeOuokyvXRAuXf8t0Got/rqGzjKGGNLPRCxD/7K1PzN1YsZ NH5zmS2GsqA89L+Qc2l0Ohuz4kX1S2BEB+1pnxFkQc+CdjmIG9A= =S9jc -----END PGP SIGNATURE----- Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD KVM x86 fixes for 6.13: - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual APIC page, as such hosts will hit unexpected RMP #PFs in the host when running VMs of any flavor. - Fix a WARN in the hypercall completion path due to KVM trying to determine if a guest with protected register state is in 64-bit mode (KVM's ABI is to assume such guests only make hypercalls in 64-bit mode). - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a regression with Windows guests, and because KVM's read-only behavior appears to be entirely made up. - Treat TDP MMU faults as spurious if the faulting access is allowed given the existing SPTE. This fixes a benign WARN (other than the WARN itself) due to unexpectedly replacing a writable SPTE with a read-only SPTE.	2024-12-22 12:07:16 -05:00
Paolo Bonzini	398b7b6cb9	KVM: x86: let it be known that ignore_msrs is a bad idea When running KVM with ignore_msrs=1 and report_ignored_msrs=0, the user has no clue that that the guest is being lied to. This may cause bug reports such as https://gitlab.com/qemu-project/qemu/-/issues/2571, where enabling a CPUID bit in QEMU caused Linux guests to try reading MSR_CU_DEF_ERR; and being lied about the existence of MSR_CU_DEF_ERR caused the guest to assume other things about the local APIC which were not true: Sep 14 12:02:53 kernel: mce: [Firmware Bug]: Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly. Sep 14 12:02:53 kernel: unchecked MSR access error: RDMSR from 0x852 at rIP: 0xffffffffb548ffa7 (native_read_msr+0x7/0x40) Sep 14 12:02:53 kernel: Call Trace: ... Sep 14 12:02:53 kernel: native_apic_msr_read+0x20/0x30 Sep 14 12:02:53 kernel: setup_APIC_eilvt+0x47/0x110 Sep 14 12:02:53 kernel: mce_amd_feature_init+0x485/0x4e0 ... Sep 14 12:02:53 kernel: [Firmware Bug]: cpu 0, try to use APIC520 (LVT offset 2) for vector 0xf4, but the register is already in use for vector 0x0 on this cpu Without reported_ignored_msrs=0 at least the host kernel log will contain enough information to avoid going on a wild goose chase. But if reports about individual MSR accesses are being silenced too, at least complain loudly the first time a VM is started. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-22 12:06:01 -05:00
Sean Christopherson	9b42d1e8e4	KVM: x86: Play nice with protected guests in complete_hypercall_exit() Use is_64_bit_hypercall() instead of is_64_bit_mode() to detect a 64-bit hypercall when completing said hypercall. For guests with protected state, e.g. SEV-ES and SEV-SNP, KVM must assume the hypercall was made in 64-bit mode as the vCPU state needed to detect 64-bit mode is unavailable. Hacking the sev_smoke_test selftest to generate a KVM_HC_MAP_GPA_RANGE hypercall via VMGEXIT trips the WARN: ------------[ cut here ]------------ WARNING: CPU: 273 PID: 326626 at arch/x86/kvm/x86.h:180 complete_hypercall_exit+0x44/0xe0 [kvm] Modules linked in: kvm_amd kvm ... [last unloaded: kvm] CPU: 273 UID: 0 PID: 326626 Comm: sev_smoke_test Not tainted 6.12.0-smp--392e932fa0f3-feat #470 Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024 RIP: 0010:complete_hypercall_exit+0x44/0xe0 [kvm] Call Trace: <TASK> kvm_arch_vcpu_ioctl_run+0x2400/0x2720 [kvm] kvm_vcpu_ioctl+0x54f/0x630 [kvm] __se_sys_ioctl+0x6b/0xc0 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> ---[ end trace 0000000000000000 ]--- Fixes: `b5aead0064` ("KVM: x86: Assume a 64-bit hypercall for guests with protected state") Cc: stable@vger.kernel.org Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20241128004344.4072099-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-19 17:47:51 -08:00
Ivan Orlov	704fc6021b	KVM: x86: Try to unprotect and retry on unhandleable emulation failure If emulation is "rejected" by check_emulate_instruction(), try to unprotect and retry instruction execution before reporting the error to userspace. Currently, check_emulate_instruction() never signals failure when "unprotect and retry" is possible, but that will change in the future as both VMX and SVM will reject emulation due to coincident exception vectoring. E.g. if there is a write to a shadowed page table when vectoring an event, then unprotecting the gfn and retrying the instruction will allow the guest to make forward progress in most cases, i.e. will allow the vCPU to keep running instead of returning an error to userspace. This ensures that the subsequent patches won't make KVM exit to userspace when handling an intercepted #PF during vectoring without checking whether unprotect and retry is possible. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ivan Orlov <iorlov@amazon.com> Link: https://lore.kernel.org/r/20241217181458.68690-4-iorlov@amazon.com [sean: massage changelog to clarify this is a nop for the current code] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 15:14:43 -08:00
Ivan Orlov	5c9cfc4866	KVM: x86: Add emulation status for unhandleable exception vectoring Add emulation status for unhandleable vectoring, i.e. when KVM can't emulate an instruction because emulation was triggered on an exit that occurred while the CPU was vectoring an event. Such a situation can occur if guest sets the IDT descriptor base to point to MMIO region, and triggers an exception after that. Exit to userspace with event delivery error when KVM can't emulate an instruction when vectoring an event. Signed-off-by: Ivan Orlov <iorlov@amazon.com> Link: https://lore.kernel.org/r/20241217181458.68690-3-iorlov@amazon.com [sean: massage changelog and X86EMUL_UNHANDLEABLE_VECTORING comment] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 15:14:42 -08:00
Ivan Orlov	11c98fa07a	KVM: x86: Add function for vectoring error generation Extract VMX code for unhandleable VM-Exit during vectoring into vendor-agnostic function so that boiler-plate code can be shared by SVM. To avoid unnecessarily complexity in the helper, unconditionally report a GPA to userspace instead of having a conditional entry. For exits that don't report a GPA, i.e. everything except EPT Misconfig, simply report KVM's "invalid GPA". Signed-off-by: Ivan Orlov <iorlov@amazon.com> Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com [sean: clarify that the INVALID_GPA logic is new] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 15:14:41 -08:00
Sean Christopherson	8f2a27752e	KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps Switch all queries (except XSAVES) of guest features from guest CPUID to guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls to guest_cpu_cap_has(). Keep guest_cpuid_has() around for XSAVES, but subsume its helper guest_cpuid_get_register() and add a compile-time assertion to prevent using guest_cpuid_has() for any other feature. Add yet another comment for XSAVE to explain why KVM is allowed to query its raw guest CPUID. Opportunistically drop the unused guest_cpuid_clear(), as there should be no circumstance in which KVM needs to _clear_ a guest CPUID feature now that everything is tracked via cpu_caps. E.g. KVM may need to _change_ a feature to emulate dynamic CPUID flags, but KVM should never need to clear a feature in guest CPUID to prevent it from being used by the guest. Delete the last remnants of the governed features framework, as the lone holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for governed vs. ungoverned features. Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when computing reserved CR4 bits is a nop when viewed as a whole, as KVM's capabilities are already incorporated into the calculation, i.e. if a feature is present in guest CPUID but unsupported by KVM, its CR4 bit was already being marked as reserved, checking guest_cpu_cap_has() simply double-stamps that it's a reserved bit. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:20:15 -08:00
Sean Christopherson	2c5e168e5c	KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap" As the first step toward replacing KVM's so-called "governed features" framework with a more comprehensive, less poorly named implementation, replace the "kvm_governed_feature" function prefix with "guest_cpu_cap" and rename guest_can_use() to guest_cpu_cap_has(). The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and provides a more clear distinction between guest capabilities, which are KVM controlled (heh, or one might say "governed"), and guest CPUID, which with few exceptions is fully userspace controlled. Opportunistically rewrite the comment about XSS passthrough for SEV-ES guests to avoid referencing so many functions, as such comments are prone to becoming stale (case in point...). No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:20:03 -08:00
Sean Christopherson	f21958e328	KVM: x86: Don't update PV features caches when enabling enforcement capability Revert the chunk of commit `01b4f510b9` ("kvm: x86: ensure pv_cpuid.features is initialized when enabling cap") that forced a PV features cache refresh during KVM_CAP_ENFORCE_PV_FEATURE_CPUID, as whatever ioctl() ordering issue it alleged to have fixed never existed upstream, and likely never existed in any kernel. At the time of the commit, there was a tangentially related ioctl() ordering issue, as toggling KVM_X86_DISABLE_EXITS_HLT after KVM_SET_CPUID2 would have resulted in KVM potentially leaving KVM_FEATURE_PV_UNHALT set. But (a) that bug affected the entire guest CPUID, not just the cache, (b) commit `01b4f510b9` didn't address that bug, it only refreshed the cache (with the bad CPUID), and (c) setting KVM_X86_DISABLE_EXITS_HLT after vCPU creation is completely broken as KVM configures HLT-exiting only during vCPU creation, which is why KVM_CAP_X86_DISABLE_EXITS is now disallowed if vCPUs have been created. Another tangentially related bug was KVM's failure to clear the cache when handling KVM_SET_CPUID2, but again commit `01b4f510b9` did nothing to fix that bug. The most plausible explanation for the what commit `01b4f510b9` was trying to fix is a bug that existed in Google's internal kernel that was the source of commit `01b4f510b9`. At the time, Google's internal kernel had not yet picked up commit `0d3b2ba16b` ("KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent"), i.e. KVM would not initialize the PV features cache if KVM_SET_CPUID2 was called without a CPUID.0x1 entry. Of course, no sane real world VMM would omit CPUID.0x1, including the KVM selftest added by commit `ac4a4d6de2` ("selftests: kvm: test enforcement of paravirtual cpuid features"). And the test didn't actually try to verify multiple orderings, nor did the selftest enter the guest without doing KVM_SET_CPUID2, so who knows what motivated the change. Regardless of why commit `01b4f510b9` ("kvm: x86: ensure pv_cpuid.features is initialized when enabling cap") was added, refreshing the cache during KVM_CAP_ENFORCE_PV_FEATURE_CPUID isn't necessary. Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:41 -08:00
Sean Christopherson	c829ccd4d9	KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed Reject KVM_CAP_X86_DISABLE_EXITS if userspace attempts to disable MWAIT or HLT exits and KVM previously reported (via KVM_CHECK_EXTENSION) that disabling the exit(s) is not allowed. E.g. because MWAIT isn't supported or the CPU doesn't have an always-running APIC timer, or because KVM is configured to mitigate cross-thread vulnerabilities. Cc: Kechen Lu <kechenl@nvidia.com> Fixes: `4d5422cea3` ("KVM: X86: Provide a capability to disable MWAIT intercepts") Fixes: `6f0f2d5ef8` ("KVM: x86: Mitigate the cross-thread return address predictions bug") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:36 -08:00
Sean Christopherson	04cd8f8628	KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless, e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after vCPU creation. vCPUs may also end up with an inconsistent configuration if exits are disabled between creation of multiple vCPUs. Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:35 -08:00
Sean Christopherson	21d7f06d1a	KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU creation Drop the manual initialization of maxphyaddr and reserved_gpa_bits during vCPU creation now that kvm_arch_vcpu_create() unconditionally invokes kvm_vcpu_after_set_cpuid(), which handles all such CPUID caching. None of the helpers between the existing code in kvm_arch_vcpu_create() and the call to kvm_vcpu_after_set_cpuid() consume maxphyaddr or reserved_gpa_bits (though auditing vmx_vcpu_create() and svm_vcpu_create() isn't exactly easy). Link: https://lore.kernel.org/r/20241128013424.4096668-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:34 -08:00
Sean Christopherson	b0c3d68717	KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h Let vendor code inline __kvm_is_valid_cr4() now x86.c's cr4_reserved_bits no longer exists, as keeping cr4_reserved_bits local to x86.c was the only reason for "hiding" the definition of __kvm_is_valid_cr4(). No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241128013424.4096668-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:32 -08:00
Sean Christopherson	7520a53b8e	KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX Drop x86.c's local pre-computed cr4_reserved bits and instead fold KVM's reserved bits into the guest's reserved bits. This fixes a bug where VMX's set_cr4_guest_host_mask() fails to account for KVM-reserved bits when deciding which bits can be passed through to the guest. In most cases, letting the guest directly write reserved CR4 bits is ok, i.e. attempting to set the bit(s) will still #GP, but not if a feature is available in hardware but explicitly disabled by the host, e.g. if FSGSBASE support is disabled via "nofsgsbase". Note, the extra overhead of computing host reserved bits every time userspace sets guest CPUID is negligible. The feature bits that are queried are packed nicely into a handful of words, and so checking and setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the total cost will be in the noise even if the number of checked CR4 bits doubles over the next few years. In other words, x86 will run out of CR4 bits long before the overhead becomes problematic. Note #2, __cr4_reserved_bits() starts from CR4_RESERVED_BITS, which is why the existing __kvm_cpu_cap_has() processing doesn't explicitly OR in CR4_RESERVED_BITS (and why the new code doesn't do so either). Fixes: `2ed41aa631` ("KVM: VMX: Intercept guest reserved CR4 bits to inject #GP fault") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:26 -08:00
Sean Christopherson	85e5ba83c0	KVM: x86: Do all post-set CPUID processing during vCPU creation During vCPU creation, process KVM's default, empty CPUID as if userspace set an empty CPUID to ensure consistent and correct behavior with respect to guest CPUID. E.g. if userspace never sets guest CPUID, KVM will never configure cr4_guest_rsvd_bits, and thus create divergent, incorrect, guest- visible behavior due to letting the guest set any KVM-supported CR4 bits despite the features not being allowed per guest CPUID. Note! This changes KVM's ABI, as lack of full CPUID processing allowed userspace to stuff garbage vCPU state, e.g. userspace could set CR4 to a guest-unsupported value via KVM_SET_SREGS. But it's extremely unlikely that this is a breaking change, as KVM already has many flows that require userspace to set guest CPUID before loading vCPU state. E.g. multiple MSR flows consult guest CPUID on host writes, and KVM_SET_SREGS itself already relies on guest CPUID being up-to-date, as KVM's validity check on CR3 consumes CPUID.0x7.1 (for LAM) and CPUID.0x80000008 (for MAXPHYADDR). Furthermore, the plan is to commit to enforcing guest CPUID for userspace writes to MSRs, at which point bypassing sregs CPUID checks is even more nonsensical. Link: https://lore.kernel.org/r/20241128013424.4096668-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:19:24 -08:00
Sean Christopherson	1201f226c8	KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to avoid the overead of CPUID during runtime. The offset, size, and metadata for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e. is constant for a given CPU, and thus can be cached during module load. On Intel's Emerald Rapids, CPUID is wildly expensive, to the point where recomputing XSAVE offsets and sizes results in a 4x increase in latency of nested VM-Enter and VM-Exit (nested transitions can trigger xstate_required_size() multiple times per transition), relative to using cached values. The issue is easily visible by running `perf top` while triggering nested transitions: kvm_update_cpuid_runtime() shows up at a whopping 50%. As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald Rapids (EMR) takes: SKX 11650 ICX 22350 EMR 28850 Using cached values, the latency drops to: SKX 6850 ICX 9000 EMR 7900 The underlying issue is that CPUID itself is slow on ICX, and comically slow on EMR. The problem is exacerbated on CPUs which support XSAVES and/or XSAVEC, as KVM invokes xstate_required_size() twice on each runtime CPUID update, and because there are more supported XSAVE features (CPUID for supported XSAVE feature sub-leafs is significantly slower). SKX: CPUID.0xD.2 = 348 cycles CPUID.0xD.3 = 400 cycles CPUID.0xD.4 = 276 cycles CPUID.0xD.5 = 236 cycles <other sub-leaves are similar> EMR: CPUID.0xD.2 = 1138 cycles CPUID.0xD.3 = 1362 cycles CPUID.0xD.4 = 1068 cycles CPUID.0xD.5 = 910 cycles CPUID.0xD.6 = 914 cycles CPUID.0xD.7 = 1350 cycles CPUID.0xD.8 = 734 cycles CPUID.0xD.9 = 766 cycles CPUID.0xD.10 = 732 cycles CPUID.0xD.11 = 718 cycles CPUID.0xD.12 = 734 cycles CPUID.0xD.13 = 1700 cycles CPUID.0xD.14 = 1126 cycles CPUID.0xD.15 = 898 cycles CPUID.0xD.16 = 716 cycles CPUID.0xD.17 = 748 cycles CPUID.0xD.18 = 776 cycles Note, updating runtime CPUID information multiple times per nested transition is itself a flaw, especially since CPUID is a mandotory intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated CPUID state is up-to-date while running L2. That flaw will be fixed in a future patch, as deferring runtime CPUID updates is more subtle than it appears at first glance, the benefits aren't super critical to have once the XSAVE issue is resolved, and caching CPUID output is desirable even if KVM's updates are deferred. Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241211013302.1347853-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-13 13:58:10 -05:00
Paolo Bonzini	b467ab82a9	KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR For userspace that wants to disable KVM_X86_QUIRK_STUFF_FEATURE_MSRS, it is useful to know what bits can be set to 1 in MSR_PLATFORM_INFO (apart from the TSC ratio). The right way to do that is via /dev/kvm's feature MSR mechanism. In fact, MSR_PLATFORM_INFO is already a feature MSR for the purpose of blocking updates after the vCPU is run, but KVM_GET_MSRS did not return a valid value for it. Just like in a VM that leaves KVM_X86_QUIRK_STUFF_FEATURE_MSRS enabled, the TSC ratio field is left to 0. Only bit 31 is set. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-11-13 14:40:56 -05:00
Paolo Bonzini	bb4409a9e7	KVM x86 misc changes for 6.13 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occuring in the future. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczowUACgkQOlYIJqCj N/3G8w//RYIslQkHXZXovQvhHKM9RBxdg6FjU0Do2KLN/xnR+JdjrBAI3HnG0TVu TEsA6slaTdvAFudSBZ27rORPARJ3XCSnDT+NflDR2UqtmGYFXxixcs4LQRUC3I2L tS/e847Qfp7/+kXFYQuH6YmMftCf7SQNbUPU3EwSXY8seUJB6ZhO89WXgrBtaMH+ 94b7EdkP86L4dqiEMGr/q/46/ewFpPlB4WWFNIAY+SoZebQ+C8gcAsGDzfG6giNV HxdehWNp5nJ34k9Tudt2FseL/IclA3nXZZIl2dU6PWLkjgvDqL2kXLHpAQTpKuXf ZzBM4wjdVqZRQHscLgX6+0/uOJDf9/iJs/dwD9PbzLhAJnHF3SWRAy/grzpChLFR Yil5zagtdhjqSKDf2FsUCJ7lwaST0fhHvmZZx4loIhcmN0/rvvrhhfLsJgCWv/Kf RWMkiSQGhxAZUXjOJDQB+wTHv7ZoJy51ov7/rYjr49jCJrCVO6I6yQ6lNKDCYClR vDS3yK6fiEXbC09iudm74FBl2KO+BwJKhnidekzzKGv34RHRAux5Oo54J5jJMhBA tXFxALGwKr1KAI7vLK8ByZzwZjN5pDmKHUuOAlJNy9XQi+b4zbbREkXlcqZ5VgxA xCibpnLzJFoHS8y7c76wfz4mCkWSWzuC9Rzy4sTkHMzgJESmZiE= =dESL -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.13 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occuring in the future. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups	2024-11-13 06:33:00 -05:00
Sean Christopherson	c9155eb012	KVM: x86: Unpack msr_data structure prior to calling kvm_apic_set_base() Pass in the new value and "host initiated" as separate parameters to kvm_apic_set_base(), as forcing the KVM_SET_SREGS path to declare and fill an msr_data structure is awkward and kludgy, e.g. __set_sregs_common() doesn't even bother to set the proper MSR index. No functional change intended. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20241101183555.1794700-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 20:57:46 -08:00
Sean Christopherson	7d1cb7cee9	KVM: x86: Rename APIC base setters to better capture their relationship Rename kvm_set_apic_base() and kvm_lapic_set_base() to kvm_apic_set_base() and __kvm_apic_set_base() respectively to capture that the underscores version is a "special" variant (it exists purely to avoid recalculating the optimized map multiple times when stuffing the RESET value). Opportunistically add a comment explaining why kvm_lapic_reset() uses the inner helper. Note, KVM deliberately invokes kvm_arch_vcpu_create() while kvm->lock is NOT held so that vCPU setup isn't serialized if userspace is creating multiple/all vCPUs in parallel. I.e. triggering an extra recalculation is not limited to theoretical/rare edge cases, and so is worth avoiding. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-7-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 20:57:46 -08:00
Sean Christopherson	c9c9acfcd5	KVM: x86: Move kvm_set_apic_base() implementation to lapic.c (from x86.c) Move kvm_set_apic_base() to lapic.c so that the bulk of KVM's local APIC code resides in lapic.c, regardless of whether or not KVM is emulating the local APIC in-kernel. This will also allow making various helpers visible only to lapic.c. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-6-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 20:57:46 -08:00
Sean Christopherson	adfec1f459	KVM: x86: Inline kvm_get_apic_mode() in lapic.h Inline kvm_get_apic_mode() in lapic.h to avoid a CALL+RET as well as an export. The underlying kvm_apic_mode() helper is public information, i.e. there is no state/information that needs to be hidden from vendor modules. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-5-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 20:57:46 -08:00
Sean Christopherson	d91060e342	KVM: x86: Get vcpu->arch.apic_base directly and drop kvm_get_apic_base() Access KVM's emulated APIC base MSR value directly instead of bouncing through a helper, as there is no reason to add a layer of indirection, and there are other MSRs with a "set" but no "get", e.g. EFER. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-4-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 20:57:45 -08:00
David Matlack	13e2e4f62a	KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping Recover TDP MMU huge page mappings in-place instead of zapping them when dirty logging is disabled, and rename functions that recover huge page mappings when dirty logging is disabled to move away from the "zap collapsible spte" terminology. Before KVM flushes TLBs, guest accesses may be translated through either the (stale) small SPTE or the (new) huge SPTE. This is already possible when KVM is doing eager page splitting (where TLB flushes are also batched), and when vCPUs are faulting in huge mappings (where TLBs are flushed after the new huge SPTE is installed). Recovering huge pages reduces the number of page faults when dirty logging is disabled: $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g Before: 393,599 kvm:kvm_page_fault After: 262,575 kvm:kvm_page_fault vCPU throughput and the latency of disabling dirty-logging are about equal compared to zapping, but avoiding faults can be beneficial to remove vCPU jitter in extreme scenarios. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 18:37:22 -08:00
Sean Christopherson	1ded7a57b8	KVM: x86: Remove ordering check b/w MSR_PLATFORM_INFO and MISC_FEATURES_ENABLES Drop KVM's odd restriction that disallows clearing CPUID_FAULT in MSR_PLATFORM_INFO if CPL>0 CPUID faulting is enabled in MSR_MISC_FEATURES_ENABLES. KVM generally doesn't require specific ordering when userspace sets MSRs, and the completely arbitrary order of MSRs in emulated_msrs_all means that a userspace that uses KVM's list verbatim could run afoul of the check. Dropping the restriction obviously means that userspace could stuff a nonsensical vCPU model, but that's the case all over KVM. KVM typically restricts userspace MSR writes only when it makes things easier for KVM and/or userspace. Link: https://lore.kernel.org/r/20240802185511.305849-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:34 -07:00
Sean Christopherson	a5d563890b	KVM: x86: Reject userspace attempts to access ARCH_CAPABILITIES w/o support Reject userspace accesses to ARCH_CAPABILITIES if the MSR isn't supposed to exist, according to guest CPUID. However, "reject" accesses with KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0' are ignored if KVM advertised support ARCH_CAPABILITIES. KVM's ABI is that userspace must set guest CPUID prior to setting MSRs, and that setting MSRs that aren't supposed exist is disallowed (modulo the '0' exemption). Link: https://lore.kernel.org/r/20240802185511.305849-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:34 -07:00
Sean Christopherson	d75cac366f	KVM: x86: Reject userspace attempts to access PERF_CAPABILITIES w/o PDCM Reject userspace accesses to PERF_CAPABILITIES if PDCM isn't set in guest CPUID, i.e. if the vCPU doesn't actually have PERF_CAPABILITIES. But! Do so via KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0' are ignored if KVM advertised support PERF_CAPABILITIES. KVM's ABI is that userspace must set guest CPUID prior to setting MSRs, and that setting MSRs that aren't supposed exist is disallowed (modulo the '0' exemption). Link: https://lore.kernel.org/r/20240802185511.305849-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:32 -07:00
Sean Christopherson	dcb988cdac	KVM: x86: Quirk initialization of feature MSRs to KVM's max configuration Add a quirk to control KVM's misguided initialization of select feature MSRs to KVM's max configuration, as enabling features by default violates KVM's approach of letting userspace own the vCPU model, and is actively problematic for MSRs that are conditionally supported, as the vCPU will end up with an MSR value that userspace can't restore. E.g. if the vCPU is configured with PDCM=0, userspace will save and attempt to restore a non-zero PERF_CAPABILITIES, thanks to KVM's meddling. Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:31 -07:00
Sean Christopherson	bc2ca3680b	KVM: x86: Disallow changing MSR_PLATFORM_INFO after vCPU has run Tag MSR_PLATFORM_INFO as a feature MSR (because it is), i.e. disallow it from being modified after the vCPU has run. To make KVM's selftest compliant, simply delete the userspace MSR write that restores KVM's original value at the end of the test. Verifying that userspace can write back what it originally read is uninteresting in this particular case, because KVM doesn't enforce _any_ bits in the MSR, i.e. userspace should be able to write any arbitrary value. Link: https://lore.kernel.org/r/20240802185511.305849-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:30 -07:00
Sean Christopherson	2142ac663a	KVM: x86: Co-locate initialization of feature MSRs in kvm_arch_vcpu_create() Bunch all of the feature MSR initialization in kvm_arch_vcpu_create() so that it can be easily quirked in a future patch. No functional change intended. Link: https://lore.kernel.org/r/20240802185511.305849-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:29 -07:00
Maxim Levitsky	9245fd6b85	KVM: x86: model canonical checks more precisely As a result of a recent investigation, it was determined that x86 CPUs which support 5-level paging, don't always respect CR4.LA57 when doing canonical checks. In particular: 1. MSRs which contain a linear address, allow full 57-bitcanonical address regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE. 2. All hidden segment bases and GDT/IDT bases also behave like MSRs. This means that full 57-bit canonical address can be loaded to them regardless of CR4.LA57, both using MSRS (e.g GS_BASE) and instructions (e.g LGDT). 3. TLB invalidation instructions also allow the user to use full 57-bit address regardless of the CR4.LA57. Finally, it must be noted that the CPU doesn't prevent the user from disabling 5-level paging, even when the full 57-bit canonical address is present in one of the registers mentioned above (e.g GDT base). In fact, this can happen without any userspace help, when the CPU enters SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE are left to contain a non-canonical address in regard to the new mode. Since most of the affected MSRs and all segment bases can be read and written freely by the guest without any KVM intervention, this patch makes the emulator closely follow hardware behavior, which means that the emulator doesn't take in the account the guest CPUID support for 5-level paging, and only takes in the account the host CPU support. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:26 -07:00
Maxim Levitsky	c534b37b75	KVM: x86: Add X86EMUL_F_MSR and X86EMUL_F_DT_LOAD to aid canonical checks Add emulation flags for MSR accesses and Descriptor Tables loads, and pass the new flags as appropriate to emul_is_noncanonical_address(). The flags will be used to perform the correct canonical check, as the type of access affects whether or not CR4.LA57 is consulted when determining the canonical bit. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com [sean: split to separate patch, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:25 -07:00
Maxim Levitsky	16ccadefa2	KVM: x86: Route non-canonical checks in emulator through emulate_ops Add emulate_ops.is_canonical_addr() to perform (non-)canonical checks in the emulator, which will allow extending is_noncanonical_address() to support different flavors of canonical checks, e.g. for descriptor table bases vs. MSRs, without needing duplicate logic in the emulator. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com [sean: separate from additional of flags, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:25 -07:00
Sean Christopherson	eecf398545	KVM: x86: Use '0' for guest RIP if PMI encounters protected guest state Explicitly return '0' for guest RIP when handling a PMI VM-Exit for a vCPU with protected guest state, i.e. when KVM can't read the real RIP. While there is no "right" value, and profiling a protect guest is rather futile, returning the last known RIP is worse than returning obviously "bad" data. E.g. for SEV-ES+, the last known RIP will often point somewhere in the guest's boot flow. Opportunistically add WARNs to effectively assert that the in_kernel() and get_ip() callbacks are restricted to the common PMI handler, as the return values for the protected guest state case are largely arbitrary, i.e. only make any sense whatsoever for PMIs, where the returned values have no functional impact and thus don't truly matter. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:23 -07:00
Sean Christopherson	f0e7012c4b	KVM: x86: Bypass register cache when querying CPL from kvm_sched_out() When querying guest CPL to determine if a vCPU was preempted while in kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from the VMCS on Intel CPUs. If the kernel is running with full preemption enabled, using the register cache in the preemption path can result in stale and/or uninitialized data being cached in the segment cache. In particular the following scenario is currently possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Note, the VM-Enter failure can also be avoided by moving the call to vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all segments. However, while that change is correct and desirable (and will come along shortly), it does not address the underlying problem that accessing KVM's register caches from !task context is generally unsafe. In addition to fixing the immediate bug, bypassing the cache for this particular case will allow hardening KVM register caching log to assert that the caches are accessed only when KVM _knows_ it is safe to do so. Fixes: `de63ad4cf4` ("KVM: X86: implement the logic for spinlock optimization") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:21 -07:00
Sean Christopherson	3ffe874ea3	KVM: x86: Ensure vcpu->mode is loaded from memory in kvm_vcpu_exit_request() Wrap kvm_vcpu_exit_request()'s load of vcpu->mode with READ_ONCE() to ensure the variable is re-loaded from memory, as there is no guarantee the caller provides the necessary annotations to ensure KVM sees a fresh value, e.g. the VM-Exit fastpath could theoretically reuse the pre-VM-Enter value. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240828232013.768446-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:21:46 -07:00
Kai Huang	6e44d2427b	KVM: x86: Fix a comment inside __kvm_set_or_clear_apicv_inhibit() Change svm_vcpu_run() to vcpu_enter_guest() in the comment of __kvm_set_or_clear_apicv_inhibit() to make it reflect the fact. When one thread updates VM's APICv state due to updating the APICv inhibit reasons, it kicks off all vCPUs and makes them wait until the new reason has been updated and can be seen by all vCPUs. There was one WARN() to make sure VM's APICv state is consistent with vCPU's APICv state in the svm_vcpu_run(). Commit `ee49a89329` ("KVM: x86: Move SVM's APICv sanity check to common x86") moved that WARN() to x86 common code vcpu_enter_guest() due to the logic is not unique to SVM, and added comments to both __kvm_set_or_clear_apicv_inhibit() and vcpu_enter_guest() to explain this. However, although the comment in __kvm_set_or_clear_apicv_inhibit() mentioned the WARN(), it seems forgot to reflect that the WARN() had been moved to x86 common, i.e., it still mentioned the svm_vcpu_run() but not vcpu_enter_guest(). Fix it. Note after the change the first line that contains vcpu_enter_guest() exceeds 80 characters, but leave it as is to make the diff clean. Fixes: `ee49a89329` ("KVM: x86: Move SVM's APICv sanity check to common x86") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/e462e7001b8668649347f879c66597d3327dbac2.1728383775.git.kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:21:45 -07:00
Kai Huang	ef86fe036d	KVM: x86: Fix a comment inside kvm_vcpu_update_apicv() The sentence "... so that KVM can the AVIC doorbell to ..." doesn't have a verb. Fix it. After adding the verb 'use', that line exceeds 80 characters. Thus wrap the 'to' to the next line. Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/666e991edf81e1fccfba9466f3fe65965fcba897.1728383775.git.kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:21:44 -07:00
Paolo Bonzini	3f8df62852	Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.12: - Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical. - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2. - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures. - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible). - Minor SGX fix and a cleanup.	2024-09-17 12:41:23 -04:00
Paolo Bonzini	43d97b2ebd	Merge tag 'kvm-x86-pat_vmx_msrs-6.12' of https://github.com/kvm-x86/linux into HEAD KVM VMX and x86 PAT MSR macro cleanup for 6.12: - Add common defines for the x86 architectural memory types, i.e. the types that are shared across PAT, MTRRs, VMCSes, and EPTPs. - Clean up the various VMX MSR macros to make the code self-documenting (inasmuch as possible), and to make it less painful to add new macros.	2024-09-17 12:40:39 -04:00
Paolo Bonzini	5d55a052e3	Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.12: - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop. - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU. - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding MGLRU support in KVM. - Misc cleanups	2024-09-17 12:39:53 -04:00
Paolo Bonzini	41786cc5ea	Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.12 - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon). - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs. This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work). - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value a the ICR offset. - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler. - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit. - Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is supposed to hit L1, not L2).	2024-09-17 11:38:23 -04:00
Paolo Bonzini	c09dd2bb57	Merge branch 'kvm-redo-enable-virt' into HEAD Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exist in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-17 11:38:20 -04:00
Sean Christopherson	2876624e1a	KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() Rename reexecute_instruction() to kvm_unprotect_and_retry_on_failure() to make the intent and purpose of the helper much more obvious. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:33 -07:00
Sean Christopherson	4df685664b	KVM: x86: Update retry protection fields when forcing retry on emulation failure When retrying the faulting instruction after emulation failure, refresh the infinite loop protection fields even if no shadow pages were zapped, i.e. avoid hitting an infinite loop even when retrying the instruction as a last-ditch effort to avoid terminating the guest. Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:32 -07:00
Sean Christopherson	dabc4ff70c	KVM: x86: Apply retry protection to "unprotect on failure" path Use kvm_mmu_unprotect_gfn_and_retry() in reexecute_instruction() to pick up protection against infinite loops, e.g. if KVM somehow manages to encounter an unsupported instruction and unprotecting the gfn doesn't allow the vCPU to make forward progress. Other than that, the retry-on- failure logic is a functionally equivalent, open coded version of kvm_mmu_unprotect_gfn_and_retry(). Note, the emulation failure path still isn't fully protected, as KVM won't update the retry protection fields if no shadow pages are zapped (but this change is still a step forward). That flaw will be addressed in a future patch. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:32 -07:00
Sean Christopherson	19ab2c8be0	KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn Don't bother unprotecting the target gfn if EMULTYPE_WRITE_PF_TO_SP is set, as KVM will simply report the emulation failure to userspace. This will allow converting reexecute_instruction() to use kvm_mmu_unprotect_gfn_instead_retry() instead of kvm_mmu_unprotect_page(). Link: https://lore.kernel.org/r/20240831001538.336683-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:31 -07:00
Sean Christopherson	6205257395	KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation Drop the manual pfn look when retrying an instruction that KVM failed to emulation in response to a #PF due to a write-protected gfn. Now that KVM sets EMULTYPE_ALLOW_RETRY_PF if and only if the page fault hit a write- protected gfn, i.e. if and only if there's a writable memslot, there's no need to redo the lookup to avoid retrying an instruction that failed on emulated MMIO (no slot, or a write to a read-only slot). I.e. KVM will never attempt to retry an instruction that failed on emulated MMIO, whereas that was not the case prior to the introduction of RET_PF_WRITE_PROTECTED. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:30 -07:00
Sean Christopherson	2df354e37c	KVM: x86: Fold retry_instruction() into x86_emulate_instruction() Now that retry_instruction() is reasonably tiny, fold it into its sole caller, x86_emulate_instruction(). In addition to getting rid of the absurdly confusing retry_instruction() name, handling the retry in x86_emulate_instruction() pairs it back up with the code that resets last_retry_{eip,address}. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:26 -07:00
Sean Christopherson	41e6e367d5	KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction() Move the sanity checks for EMULTYPE_ALLOW_RETRY_PF to the top of x86_emulate_instruction(). In addition to deduplicating a small amount of code, this makes the connection between EMULTYPE_ALLOW_RETRY_PF and EMULTYPE_PF even more explicit, and will allow dropping retry_instruction() entirely. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:25 -07:00
Sean Christopherson	01dd4d3192	KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path Move the anti-infinite-loop protection provided by last_retry_{eip,addr} into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that never hits the emulator, as well as reexecute_instruction(), which is the last ditch "might as well try it" logic that kicks in when emulation fails on an instruction that faulted on a write-protected gfn. Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry fields and deduplicate other code (with more to come). Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:23 -07:00
Sean Christopherson	9c19129e53	KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not an unsigned long. This fixes a bug where 32-bit KVM would unprotect and retry the wrong gfn if the gpa had bits 63:32!=0. In practice, this bug is functionally benign, as unprotecting the wrong gfn is purely a performance issue (thanks to the anti-infinite-loop logic). And of course, almost no one runs 32-bit KVM these days. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:22 -07:00
Sean Christopherson	019f3f84a4	KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip Read RIP from vCPU state instead of pulling it from the emulation context when filling last_retry_eip, which is part of the anti-infinite-loop protection used when unprotecting and retrying instructions that hit a write-protected gfn. This will allow reusing the anti-infinite-loop protection in flows that never make it into the emulator. No functional change intended, as ctxt->eip is set to kvm_rip_read() in init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet). Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:21 -07:00
Sean Christopherson	c1edcc41c3	KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped Resume the guest and thus skip emulation of a non-PTE-writing instruction if and only if unprotecting the gfn actually zapped at least one shadow page. If the gfn is write-protected for some reason other than shadow paging, attempting to unprotect the gfn will effectively fail, and thus retrying the instruction is all but guaranteed to be pointless. This bug has existed for a long time, but was effectively fudged around by the retry RIP+address anti-loop detection. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:21 -07:00
Sean Christopherson	3f6821aa14	KVM: x86: Forcibly leave nested if RSM to L2 hits shutdown Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if RSM fails when resuming L2 (a.k.a. guest mode). Architecturally, shutdown on RSM occurs _before_ the transition back to guest mode on both Intel and AMD. On Intel, per the SDM pseudocode, SMRAM state is loaded before critical VMX state: restore state normally from SMRAM; ... CR4.VMXE := value stored internally; IF internal storage indicates that the logical processor had been in VMX operation (root or non-root) THEN enter VMX operation (root or non-root); restore VMX-critical state as defined in Section 32.14.1; ... restore current VMCS pointer; FI; AMD's APM is both less clearcut and more explicit. Because AMD CPUs save VMCB and guest state in SMRAM itself, given the lack of anything in the APM to indicate a shutdown in guest mode is possible, a straightforward reading of the clause on invalid state is that _what_ state is invalid is irrelevant, i.e. all roads lead to shutdown. An RSM causes a processor shutdown if an invalid-state condition is found in the SMRAM state-save area. This fixes a bug found by syzkaller where synthesizing shutdown for L2 led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn caused KVM to complain about trying to cancel a nested VM-Enter (see commit `759cbd5967` ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM"). Note, Paolo pointed out that KVM shouldn't set nested_run_pending until after loading SMRAM state. But as above, that's only half the story, KVM shouldn't transition to guest mode either. Unfortunately, fixing that mess requires rewriting the nVMX and nSVM RSM flows to not piggyback their nested VM-Enter flows, as executing the nested VM-Enter flows after loading state from SMRAM would clobber much of said state. For now, add a FIXME to call out that transitioning to guest mode before loading state from SMRAM is wrong. Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@mail.gmail.com Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com Reported-by: Zheyu Ma <zheyuma97@gmail.com> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com Analyzed-by: Michal Wilczynski <michal.wilczynski@intel.com> Cc: Kishen Maloor <kishen.maloor@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906161337.1118412-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:09:49 -07:00
Linus Torvalds	9d4c304001	KVM: x86: don't fall through case statements without annotations clang warns on this because it has an unannotated fall-through between cases: arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] and while we could annotate it as a fallthrough, the proper fix is to just add the break for this case, instead of falling through to the default case and the break there. gcc also has that warning, but it looks like gcc only warns for the cases where they fall through to "real code", rather than to just a break. Odd. Fixes: `d30d9ee94c` ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-09-06 15:23:33 -07:00
Sean Christopherson	590b09b1d8	KVM: x86: Register "emergency disable" callbacks when virt is enabled Register the "disable virtualization in an emergency" callback just before KVM enables virtualization in hardware, as there is no functional need to keep the callbacks registered while KVM happens to be loaded, but is inactive, i.e. if KVM hasn't enabled virtualization. Note, unregistering the callback every time the last VM is destroyed could have measurable latency due to the synchronize_rcu() needed to ensure all references to the callback are dropped before KVM is unloaded. But the latency should be a small fraction of the total latency of disabling virtualization across all CPUs, and userspace can set enable_virt_at_load to completely eliminate the runtime overhead. Add a pointer in kvm_x86_ops to allow vendor code to provide its callback. There is no reason to force vendor code to do the registration, and either way KVM would need a new kvm_x86_ops hook. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-04 11:02:34 -04:00
Sean Christopherson	0617a769ce	KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM Rename x86's the per-CPU vendor hooks used to enable virtualization in hardware to align with the recently renamed arch hooks. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-04 11:02:33 -04:00
Sean Christopherson	071f24ad28	KVM: Rename arch hooks related to per-CPU virtualization enabling Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-04 11:02:33 -04:00
Tom Dohrmann	d30d9ee94c	KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM Until recently, KVM_CAP_READONLY_MEM was unconditionally supported on x86, but this is no longer the case for SEV-ES and SEV-SNP VMs. When KVM_CHECK_EXTENSION is invoked on a VM, only advertise KVM_CAP_READONLY_MEM when it's actually supported. Fixes: `66155de93b` ("KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)") Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michael Roth <michael.roth@amd.com> Signed-off-by: Tom Dohrmann <erbse.13@gmx.de> Message-ID: <20240902144219.3716974-1-erbse.13@gmx.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-02 10:56:10 -04:00
Sean Christopherson	1876dd69df	KVM: x86: Add fastpath handling of HLT VM-Exits Add a fastpath for HLT VM-Exits by immediately re-entering the guest if it has a pending wake event. When virtual interrupt delivery is enabled, i.e. when KVM doesn't need to manually inject interrupts, this allows KVM to stay in the fastpath run loop when a vIRQ arrives between the guest doing CLI and STI;HLT. Without AMD's Idle HLT-intercept support, the CPU generates a HLT VM-Exit even though KVM will immediately resume the guest. Note, on bare metal, it's relatively uncommon for a modern guest kernel to actually trigger this scenario, as the window between the guest checking for a wake event and committing to HLT is quite small. But in a nested environment, the timings change significantly, e.g. rudimentary testing showed that ~50% of HLT exits where HLT-polling was successful would be serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets a wake event before KVM schedules out the vCPU, the wake event was pending even before the VM-Exit. Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:22 -07:00
Sean Christopherson	70cdd23851	KVM: x86: Reorganize code in x86.c to co-locate vCPU blocking/running helpers Shuffle code around in x86.c so that the various helpers related to vCPU blocking/running logic are (a) located near each other and (b) ordered so that HLT emulation can use kvm_vcpu_has_events() in a future path. No functional change intended. Link: https://lore.kernel.org/r/20240802195120.325560-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	f7f39c50ed	KVM: x86: Exit to userspace if fastpath triggers one on instruction skip Exit to userspace if a fastpath handler triggers such an exit, which can happen when skipping the instruction, e.g. due to userspace single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an emulation failure. Fixes: `404d5d7bff` ("KVM: X86: Introduce more exit_fastpath_completion enum values") Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	ea60229af7	KVM: x86: Dedup fastpath MSR post-handling logic Now that the WRMSR fastpath for x2APIC_ICR and TSC_DEADLINE are identical, ignoring the backend MSR handling, consolidate the common bits of skipping the instruction and setting the return value. No functional change intended. Link: https://lore.kernel.org/r/20240802195120.325560-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	0dd45f2cd8	KVM: x86: Re-enter guest if WRMSR(X2APIC_ICR) fastpath is successful Re-enter the guest in the fastpath if WRMSR emulation for x2APIC's ICR is successful, as no additional work is needed, i.e. there is no code unique for WRMSR exits between the fastpath and the "!= EXIT_FASTPATH_NONE" check in __vmx_handle_exit(). Link: https://lore.kernel.org/r/20240802195120.325560-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	44dd0f5732	KVM: x86: Suppress userspace access failures on unsupported, "emulated" MSRs Extend KVM's suppression of userspace MSR access failures to MSRs that KVM reports as emulated, but are ultimately unsupported, e.g. if the VMX MSRs are emulated by KVM, but are unsupported given the vCPU model. Suggested-by: Weijiang Yang <weijiang.yang@intel.com> Reviewed-by: Weijiang Yang <weijiang.yang@intel.com> Link: https://lore.kernel.org/r/20240802181935.292540-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:39 -07:00
Sean Christopherson	64a5d7a109	KVM: x86: Suppress failures on userspace access to advertised, unsupported MSRs Extend KVM's suppression of failures due to a userspace access to an unsupported, but advertised as a "to save" MSR to all MSRs, not just those that happen to reach the default case statements in kvm_get_msr_common() and kvm_set_msr_common(). KVM's soon-to-be-established ABI is that if an MSR is advertised to userspace, then userspace is allowed to read the MSR, and write back the value that was read, i.e. why an MSR is unsupported doesn't change KVM's ABI. Practically speaking, this is very nearly a nop, as the only other paths that return KVM_MSR_RET_UNSUPPORTED are {svm,vmx}_get_feature_msr(), and it's unlikely, though not impossible, that userspace is using KVM_GET_MSRS on unsupported MSRs. The primary goal of moving the suppression to common code is to allow returning KVM_MSR_RET_UNSUPPORTED as appropriate throughout KVM, without having to manually handle the "is userspace accessing an advertised" waiver. I.e. this will allow formalizing KVM's ABI without incurring a high maintenance cost. Link: https://lore.kernel.org/r/20240802181935.292540-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:38 -07:00
Sean Christopherson	3adef90345	KVM: x86: Hoist x86.c's global msr_* variables up above kvm_do_msr_access() Move the definitions of the various MSR arrays above kvm_do_msr_access() so that kvm_do_msr_access() can query the arrays when handling failures, e.g. to squash errors if userspace tries to read an MSR that isn't fully supported, but that KVM advertised as being an MSR-to-save. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:37 -07:00
Sean Christopherson	1cec203498	KVM: x86: Funnel all fancy MSR return value handling into a common helper Add a common helper, kvm_do_msr_access(), to invoke the "leaf" APIs that are type and access specific, and more importantly to handle errors that are returned from the leaf APIs. I.e. turn kvm_msr_ignored_check() from a a helper that is called on an error, into a trampoline that detects errors and applies relevant side effects, e.g. logging unimplemented accesses. Because the leaf APIs are used for guest accesses, userspace accesses, and KVM accesses, and because KVM supports restricting access to MSRs from userspace via filters, the error handling is subtly non-trivial. E.g. KVM has had at least one bug escape due to making each "outer" function handle errors. See commit `3376ca3f1a` ("KVM: x86: Fix KVM_GET_MSRS stack info leak"). Link: https://lore.kernel.org/r/20240802181935.292540-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:36 -07:00
Sean Christopherson	7075f16361	KVM: x86: Refactor kvm_get_feature_msr() to avoid struct kvm_msr_entry Refactor kvm_get_feature_msr() to take the components of kvm_msr_entry as separate parameters, along with a vCPU pointer, i.e. to give it the same prototype as kvm_{g,s}et_msr_ignored_check(). This will allow using a common inner helper for handling accesses to "regular" and feature MSRs. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:35 -07:00
Sean Christopherson	b848f24bd7	KVM: x86: Rename get_msr_feature() APIs to get_feature_msr() Rename all APIs related to feature MSRs from get_msr_feature() to get_feature_msr(). The APIs get "feature MSRs", not "MSR features". And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't describe the helper itself. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:56 -07:00
Sean Christopherson	74c6c98a59	KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry Refactor get_msr_feature() to take the index and data pointer as distinct parameters in anticipation of eliminating "struct kvm_msr_entry" usage further up the primary callchain. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:30 -07:00
Sean Christopherson	aaecae7b6a	KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTED Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to try and make it more clear that access was denied because the MSR itself is unsupported/unknown. "INVALID" is too ambiguous, as it could just as easily mean the value for WRMSR as invalid. Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that _are_ actually implemented by KVM, e.g. if the MSR is unsupported because an associated feature flag is not present in guest CPUID. Opportunistically beef up the comments for the internal MSR error codes. Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:29 -07:00
Li Chen	e0183a42e3	KVM: x86: Use this_cpu_ptr() in kvm_user_return_msr_cpu_online Use this_cpu_ptr() instead of open coding the equivalent in kvm_user_return_msr_cpu_online. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Link: https://lore.kernel.org/r/87zfp96ojk.wl-me@linux.beauty Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:37:41 -07:00
Sean Christopherson	653ea4489e	KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent. More importantly, the behavior is well-defined _and_ can be communicated the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date. Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace really wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access. This partially reverts commit `ac8d6cad3c` ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr"). Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:16 -07:00
Sean Christopherson	b6717d35d8	KVM: x86: Stuff vCPU's PAT with default value at RESET, not creation Move the stuffing of the vCPU's PAT to the architectural "default" value from kvm_arch_vcpu_create() to kvm_vcpu_reset(), guarded by !init_event, to better capture that the default value is the value "Following Power-up or Reset". E.g. setting PAT only during creation would break if KVM were to expose a RESET ioctl() to userspace (which is unlikely, but that's not a good reason to have unintuitive code). No functional change. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:48 -07:00
Sean Christopherson	4bcdd831d9	KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS. ============================= WARNING: suspicious RCU usage 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted ----------------------------- include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by repro/1071: #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm] stack backtrace: CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x13f/0x1a0 kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel] load_vmcs12_host_state+0x432/0xb40 [kvm_intel] vmx_leave_nested+0x30/0x40 [kvm_intel] kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm] kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm] ? mark_held_locks+0x49/0x70 ? kvm_vcpu_ioctl+0x7d/0x970 [kvm] ? kvm_vcpu_ioctl+0x497/0x970 [kvm] kvm_vcpu_ioctl+0x497/0x970 [kvm] ? lock_acquire+0xba/0x2d0 ? find_held_lock+0x2b/0x80 ? do_user_addr_fault+0x40c/0x6f0 ? lock_release+0xb7/0x270 __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7ff11eb1b539 </TASK> Fixes: `f7e570780e` ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:33 -07:00
Isaku Yamahata	15e1c3d659	KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id()) Use this_cpu_ptr() instead of open coding the equivalent in various user return MSR helpers. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Message-ID: <20240802201630.339306-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-13 10:24:37 -04:00
Paolo Bonzini	7239ed7467	KVM: remove kvm_arch_gmem_prepare_needed() It is enough to return 0 if a guest need not do any preparation. This is in fact how sev_gmem_prepare() works for non-SNP guests, and it extends naturally to Intel hosts: the x86 callback for gmem_prepare is optional and returns 0 if not defined. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Paolo Bonzini	564429a6bd	KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_* Add "ARCH" to the symbols; shortly, the "prepare" phase will include both the arch-independent step to clear out contents left in the page by the host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE. For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Paolo Bonzini	5932ca411e	KVM: x86: disallow pre-fault for SNP VMs before initialization KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Wei Wang	896046474f	KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops Introduces kvm_x86_call(), to streamline the usage of static calls of kvm_x86_ops. The current implementation of these calls is verbose and could lead to alignment challenges. This makes the code susceptible to exceeding the "80 columns per single line of code" limit as defined in the coding-style document. Another issue with the existing implementation is that the addition of kvm_x86_ prefix to hooks at the static_call sites hinders code readability and navigation. kvm_x86_call() is added to improve code readability and maintainability, while adhering to the coding style guidelines. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 12:14:12 -04:00
Wei Wang	f4854bf741	KVM: x86: Replace static_call_cond() with static_call() The use of static_call_cond() is essentially the same as static_call() on x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace it with static_call() to simplify the code. Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-2-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 12:14:11 -04:00
Sean Christopherson	2a1fc7dc36	KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation. Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed. See commit `0dc902267c` ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows. Reported-by: syzbot+2fb9f8ed752c01bc9a3f@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240712144841.1230591-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 09:57:45 -04:00
Paolo Bonzini	208a352a54	KVM VMX changes for 6.11 - Remove an unnecessary EPT TLB flush when enabling hardware. - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1). - Misc cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvX0ACgkQOlYIJqCj N/2Aiw/9Htwy4MfJ2zdTX0ypZx6CUAVY0B7R2q9LVaqlBBL02dLoNWn9ndf7J2pd TJKtp39sHzf342ghti/Za5+mZgRgXA9IjQ5cvcQQjfmjDdDODygEc12otISeSNqq uL2jbUZzzjbcQyUrXkeFptVcNFpaiOG0dFfvnoi1csWzXVf7t+CD+8/3kjVm2Qt7 vQXkV4yN7tNiYOvaukfXP7Og9ALpF8g8ok3YmXVXDPMu7+R7G+P6j3mVWr9ABMPj LOmC+5Z/sscMFw1Io3XHuWoF5socQARXEzJNLCblDaw3GMlSj4LNxif2M/6B7bmR nQVtiegj9K1Fc3OGOqPJcAIRPI4O9nMmf7uOwvXmOlwDSk7rCxF/yPk7Cto2+UXm 6mnLcH1l0/VaidW+a7rUAcDGIlWwgfw0F6tp2j6FdVl2Lx/IThcrkn0teLY1gAW8 CMi/BfTBEXO5583O3+ZCAzVQzeKnWR3yqwJe0oSftB1/rPkPD8PQ39MH8LuJJJxi CN1W4R1/taQdOxMZqggDvS1biz7gwpjNGtnWsO9szAgMEXVjf2M1HOZVcT2e2997 81xDMdZaJSfd26tm7PhWtQnVPqyMZ6vqqIiq7FlIbEEkAE75Kbg4fUn/4y4WRnh9 3Gog6MZPu/MA5TbwvcZ/sy/CRfFu0HKm5q98oArhjSyU8C7oGeQ= =W1/6 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.11 - Remove an unnecessary EPT TLB flush when enabling hardware. - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1). - Misc cleanups	2024-07-16 09:56:41 -04:00
Paolo Bonzini	cda231cd42	KVM x86/pmu changes for 6.11 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. - Update to the newfangled Intel CPU FMS infrastructure. - Use macros instead of open-coded literals to clean up KVM's manipulation of FIXED_CTR_CTRL MSRs. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRu3oACgkQOlYIJqCj N/0f/Q/+PoRFKrr9ENwlVjxmq7DBJOzrEiht5EH89bQdpYL0pcmv6I+n+Z77o08X l49YFO2zVq26dMCe8EFDuQrZpqKjOS/qEc+/zTsLu4lx8NJD1gqYJLJryejgtdQI +GefPVIN11TvlDDjuuxSWgKUCAevk8s3PRe+zbUwlsHmw+GVky8dJoe71QbW27rK hL7Y2pOe5Y8MgRAadxlhm6QmgOnz3RKKYs9t/HMzi2gQP1TuvPxnYtMC3Gz5pVe+ w3Ak7M4fh8Z7FbQsoNY5h3IdigG6eFrssqHX4QpCXr/G5L9vAgUmSR93/M8jLjNv wAkUulLx7vFeTlOXjqcEJSn0U6mX/48pt68vrPB5ES1Rx28RB5s9tzYXCGtCmSxv nHmMDc3YUbg6tp2hvliMqjsN0j5l2GQiX7LJwH2Ma9qQFlTHPmFwJGS4hciki6c5 obCK2vXBoS1jyxrZx8qUhIcJl2oigv3hwihN2YqZ0Q4QDwllv8cw4BeABQByYR9x T91PQ0biiJ9vWCkALbyzYOpy+grdHCblwYW9+FM/qZBGH0ouPzDyZWPrRBLX12pH fEgDMB3vT9JqQ5tyafd0MHuAVlrDVHYEY+lmXplzFGKEFonBkN7HmDzAOKafuCuj GnIe0Sa1JnHVPNomx2dnG6Sku6/tPIfERuHEXrR9zkUJsacnfRY= =pJxP -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86/pmu changes for 6.11 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. - Update to the newfangled Intel CPU FMS infrastructure. - Use macros instead of open-coded literals to clean up KVM's manipulation of FIXED_CTR_CTRL MSRs.	2024-07-16 09:55:15 -04:00
Paolo Bonzini	5c5ddf7107	KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9 Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9 No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h 5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM= =yb10 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop.	2024-07-16 09:54:57 -04:00
Paolo Bonzini	5dcc1e7614	KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9 Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7 OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH +YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk 9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY= =Cf2R -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups	2024-07-16 09:53:05 -04:00
Paolo Bonzini	86014c1e20	KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7 GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+ jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg 2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8 GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o= =BalU -----END PGP SIGNATURE----- Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups	2024-07-16 09:51:36 -04:00
Paolo Bonzini	6e01b7601d	KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest memory. It can be called right after KVM_CREATE_VCPU creates a vCPU, since at that point kvm_mmu_create() and kvm_init_mmu() are called and the vCPU is ready to invoke the KVM page fault handler. The helper function kvm_tdp_map_page() takes care of the logic to process RET_PF_* return values and convert them to success or errno. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:47 -04:00
Peng Hao	dd103407ca	KVM: X86: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variables Some variables allocated in kvm_arch_vcpu_ioctl are released when the function exits, so there is no need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20240624012016.46133-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 09:52:16 -07:00
Dapeng Mi	f287bef6dd	KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max number Refine the macros which define maximum General Purpose (GP) and fixed counter numbers. Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the maximum supported General Purpose (GP) counter number ambiguously across Intel and AMD platforms. This would cause issues if AMD begins to support more GP counters than Intel. Thus a bunch of new macros including vendor specific and vendor independent are introduced to replace the old macros. The vendor independent macros are used in x86 common code to hide vendor difference and eliminate the ambiguity. No logic changes are introduced in this patch. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 09:12:16 -07:00
Sean Christopherson	45405155d8	KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject WARN if a blocking vCPU is awakened by a valid wake event that KVM can't inject, e.g. because KVM needs to complete a nested VM-enter, or needs to re-inject an exception. For the nested VM-Enter case, KVM is supposed to clear "nested_run_pending" if L1 puts L2 into HLT, i.e. entering HLT "completes" the nested VM-Enter. And for already-injected exceptions, it should be impossible for the vCPU to be in a blocking state if a VM-Exit occurred while an exception was being vectored. Link: https://lore.kernel.org/r/20240607172609.3205077-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:07 -07:00
Sean Christopherson	321ef62b0c	KVM: nVMX: Fold requested virtual interrupt check into has_nested_events() Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:06 -07:00
Sean Christopherson	32f55e475c	KVM: nVMX: Request immediate exit iff pending nested event needs injection When requesting an immediate exit from L2 in order to inject a pending event, do so only if the pending event actually requires manual injection, i.e. if and only if KVM actually needs to regain control in order to deliver the event. Avoiding the "immediate exit" isn't simply an optimization, it's necessary to make forward progress, as the "already expired" VMX preemption timer trick that KVM uses to force a VM-Exit has higher priority than events that aren't directly injected. At present time, this is a glorified nop as all events processed by vmx_has_nested_events() require injection, but that will not hold true in the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI. I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired VMX preemption timer will trigger VM-Exit before the virtual interrupt is delivered, and KVM will effectively hang the vCPU in an endless loop of forced immediate VM-Exits (because the pending virtual interrupt never goes away). Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:04 -07:00
Paolo Bonzini	02b0d3b9d4	Merge branch 'kvm-6.10-fixes' into HEAD	2024-06-20 17:31:50 -04:00
Sean Christopherson	f3ced000a2	KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit `fceb3a36c2` ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: stable@vger.kernel.org Cc: Adamos Ttofari <attofari@amazon.de> Cc: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240611014845.82795-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 14:18:02 -04:00
David Matlack	a6816314af	KVM: Introduce vcpu->wants_to_run Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-18 09:20:01 -07:00
Sean Christopherson	d29bf2ca14	KVM: x86: Prevent excluding the BSP on setting max_vcpu_ids If the BSP vCPU ID was already set, ensure it doesn't get excluded when limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID. [mks: provide commit message, code by Sean] Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-18 08:59:50 -07:00
Mathias Krause	7c305d5118	KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_ID Do not accept IDs which are definitely invalid by limit checking the passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was already set. This ensures invalid values, especially on 64-bit systems, don't go unnoticed and lead to a valid id by chance when truncated by the final assignment. Fixes: `73880c80aa` ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-18 08:59:36 -07:00
Sean Christopherson	3dee3b1874	KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run() Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(), drop the redundant store from vcpu_run(). The flag is cleared only when VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's impossible for l1tf_flush_l1d to be cleared between loading the vCPU and calling vcpu_run(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:47 -07:00
Sean Christopherson	ef2e18ef37	KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting it only when the vCPU is being scheduled back in. The flag is processed only when VM-Enter is imminent, and KVM obviously needs to load the vCPU before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no meaningful value. I.e. the flag _will_ be set either way, it's simply a matter of when. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:46 -07:00
Sean Christopherson	2a27c43140	KVM: Delete the now unused kvm_arch_sched_in() Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:45 -07:00
Sean Christopherson	8fbb696a8f	KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load() Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:44 -07:00
Yi Wang	e3c89f5dd1	KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIP Now that KVM sets up empty IRQ routing during VM creation, don't recreate empty routing during KVM_CAP_SPLIT_IRQCHIP. Setting IRQ routes during KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the synchronize_srcu_expedited() call in kvm_set_irq_routing(). Note, the empty routing is guaranteed to be intact as KVM x86 only allows changing the IRQ routing after an in-kernel IRQCHIP has been created, and KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP. Signed-off-by: Yi Wang <foxywang@tencent.com> Link: https://lore.kernel.org/r/20240506101751.3145407-3-foxywang@tencent.com [sean: massage changelog, remove unused empty_routing array] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:40 -07:00
Thomas Prescher	85542adb65	KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI). This causes multiple problems: The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion. Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape. Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace. [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55 Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 09:24:31 -07:00
Sean Christopherson	d99e4cb2ae	KVM: x86: Use "is Intel compatible" helper to emulate SYSCALL in !64-bit Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in 32-bit Protected Mode (including Compatibility Mode) should #UD or succeed. The existing code already does the exact equivalent of guest_cpuid_is_intel_compatible(), just in a rather roundabout way. No functional change intended. Link: https://lore.kernel.org/r/20240405235603.1173076-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00
Sean Christopherson	c092fc879f	KVM: x86: Inhibit code #DBs in MOV-SS shadow for all Intel compat vCPUs Treat code #DBs as inhibited in MOV/POP-SS shadows for vCPU models that are Intel compatible, not just strictly vCPUs with vendor==Intel. The behavior is explicitly called out in the SDM, and thus architectural, i.e. applies to all CPUs that implement Intel's architecture, and isn't a quirk that is unique to CPUs manufactured by Intel: However, if an instruction breakpoint is placed on an instruction located immediately after a POP SS/MOV SS instruction, the breakpoint will be suppressed as if EFLAGS.RF were 1. Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit `baf67ca8e5` ("KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is active"). Link: https://lore.kernel.org/r/20240405235603.1173076-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00

1 2 3 4 5 ...

3134 Commits