Commit Graph

11000 Commits

Author SHA1 Message Date
Linus Torvalds
e991acf1bc Significant patch series in this pull request:
- The 2 patch series "squashfs: Remove page->mapping references" from
   Matthew Wilcox gets us closer to being able to remove page->mapping.
 
 - The 5 patch series "relayfs: misc changes" from Jason Xing does some
   maintenance and minor feature addition work in relayfs.
 
 - The 5 patch series "kdump: crashkernel reservation from CMA" from Jiri
   Bohac switches us from static preallocation of the kdump crashkernel's
   working memory over to dynamic allocation.  So the difficulty of
   a-priori estimation of the second kernel's needs is removed and the
   first kernel obtains extra memory.
 
 - The 5 patch series "generalize panic_print's dump function to be used
   by other kernel parts" from Feng Tang implements some consolidation and
   rationalizatio of the various ways in which a faiing kernel splats
   information at the operator.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+82gAKCRDdBJ7gKXxA
 jj4JAP9xb+w9DrBY6sa+7KTPIb+aTqQ7Zw3o9O2m+riKQJv6jAEA6aEwRnDA0451
 fDT5IqVlCWGvnVikdZHSnvhdD7TGsQ0=
 =rT71
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Significant patch series in this pull request:

   - "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
     us closer to being able to remove page->mapping

   - "relayfs: misc changes" (Jason Xing) does some maintenance and
     minor feature addition work in relayfs

   - "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
     us from static preallocation of the kdump crashkernel's working
     memory over to dynamic allocation. So the difficulty of a-priori
     estimation of the second kernel's needs is removed and the first
     kernel obtains extra memory

   - "generalize panic_print's dump function to be used by other
     kernel parts" (Feng Tang) implements some consolidation and
     rationalization of the various ways in which a failing kernel
     splats information at the operator

* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
  tools/getdelays: add backward compatibility for taskstats version
  kho: add test for kexec handover
  delaytop: enhance error logging and add PSI feature description
  samples: Kconfig: fix spelling mistake "instancess" -> "instances"
  fat: fix too many log in fat_chain_add()
  scripts/spelling.txt: add notifer||notifier to spelling.txt
  xen/xenbus: fix typo "notifer"
  net: mvneta: fix typo "notifer"
  drm/xe: fix typo "notifer"
  cxl: mce: fix typo "notifer"
  KVM: x86: fix typo "notifer"
  MAINTAINERS: add maintainers for delaytop
  ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
  ucount: fix atomic_long_inc_below() argument type
  kexec: enable CMA based contiguous allocation
  stackdepot: make max number of pools boot-time configurable
  lib/xxhash: remove unused functions
  init/Kconfig: restore CONFIG_BROKEN help text
  lib/raid6: update recov_rvv.c zero page usage
  docs: update docs after introducing delaytop
  ...
2025-08-03 16:23:09 -07:00
WangYuli
a30469cac8 KVM: x86: fix typo "notifer"
Patch series "treewide: Fix typo "notifer"", v3.

There are some spelling mistakes of 'notifer' in comments which
should be 'notifier'.

Fix them and add it to scripts/spelling.txt.


This patch (of 8):

There are some spelling mistakes of 'notifer' which should be 'notifier'.

Link: https://lkml.kernel.org/r/576F0D85F6853074+20250722072734.19367-1-wangyuli@uniontech.com
Link: https://lkml.kernel.org/r/7F05778C3A1A9F8B+20250722073431.21983-1-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:01:39 -07:00
Paolo Bonzini
beafd7ecf2 Merge tag 'kvm-x86-sev-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM SEV cache maintenance changes for 6.17

 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM.

 - Use WBNOINVD instead of WBINVD when possible, for SEV cache maintenance,
   e.g. to minimize collateral damage when reclaiming memory from an SEV guest.

 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that
   have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs
   that can't possibly have cache lines with dirty, encrypted data.
2025-07-29 08:36:46 -04:00
Paolo Bonzini
a10accaef4 Merge tag 'x86_core_for_kvm' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
Immutable branch for KVM tree to put the KVM patches from

https://lore.kernel.org/r/20250522233733.3176144-1-seanjc@google.com

ontop.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-07-29 08:36:46 -04:00
Paolo Bonzini
0097f9df99 Merge tag 'kvm-x86-svm-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.17

Drop KVM's rejection of SNP's SMT and single-socket policy restrictions, and
instead rely on firmware to verify that the policy can actually be supported.
Don't bother checking that requested policy(s) can actually be satisfied, as
an incompatible policy doesn't put the kernel at risk in any way, and providing
guarantees with respect to the physical topology is outside of KVM's purview.
2025-07-29 08:36:45 -04:00
Paolo Bonzini
89400f0687 Merge tag 'kvm-x86-apic-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM local APIC changes for 6.17

Extract many of KVM's helpers for accessing architectural local APIC state
to common x86 so that they can be shared by guest-side code for Secure AVIC.
2025-07-29 08:36:44 -04:00
Paolo Bonzini
d7f4aac280 Merge tag 'kvm-x86-mmu-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.17

 - Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact
   with CR0.WP.

 - Move the TDX hardware setup code to tdx.c to better co-locate TDX code
   and eliminate a few global symbols.

 - Dynamically allocation the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU doesn't
   use the list).
2025-07-29 08:36:43 -04:00
Paolo Bonzini
1a14928e2e Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.17

 - Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the
   guest.  Failure to honor FREEZE_IN_SMM can bleed host state into the guest.

 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to
   prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.

 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.

 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.

 - Recalculate all MSR intercepts from the "source" on MSR filter changes, and
   drop the dedicated "shadow" bitmaps (and their awful "max" size defines).

 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR.

 - Advertise support for LKGS (Load Kernel GS base), a new instruction that's
   loosely related to FRED, but is supported and enumerated independently.

 - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED,
   a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON).  Use
   the same approach KVM uses for dealing with "impossible" emulation when
   running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU
   has architecturally impossible state.

 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can "virtualize"
   APERF/MPERF (with many caveats).

 - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default"
   frequency is unsupported for VMs with a "secure" TSC, and there's no known
   use case for changing the default frequency for other VM types.
2025-07-29 08:36:43 -04:00
Paolo Bonzini
9de13951d5 Merge tag 'kvm-x86-no_assignment-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM VFIO device assignment cleanups for 6.17

Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now
that KVM no longer uses assigned_device_count as a bad heuristic for "VM has
an irqbypass producer" or for "VM has access to host MMIO".
2025-07-29 08:36:42 -04:00
Paolo Bonzini
f05efcfe07 Merge tag 'kvm-x86-mmio-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM MMIO Stale Data mitigation cleanup for 6.17

Rework KVM's mitigation for the MMIO State Data vulnerability to track
whether or not a vCPU has access to (host) MMIO based on the MMU that will be
used when running in the guest.  The current approach doesn't actually detect
whether or not a guest has access to MMIO, and is prone to false negatives (and
to a lesser extent, false positives), as KVM_DEV_VFIO_FILE_ADD is optional, and
obviously only covers VFIO devices.
2025-07-29 08:36:28 -04:00
Paolo Bonzini
f02b1bcc73 Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM IRQ changes for 6.17

 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large numbers
   of VMs.  Such use cases typically don't actually use irqbypass, but
   eliminating the pointless registration is a future problem to solve as it
   likely requires new uAPI.

 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.

 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
   and PIT emulation at compile time.

 - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.

 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.

 - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.

 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
   clear in the vCPU's physical ID table entry.

 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.

 - Dedup x86's device posted IRQ code, as the vast majority of functionality
   can be shared verbatime between SVM and VMX.

 - Harden the device posted IRQ code against bugs and runtime errors.

 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).

 - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
   only if KVM needs a notification in order to wake the vCPU.

 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to
   a VFIO group is not a requirement for enabling device posted IRQs.

 - Clean up and document/comment the irqfd assignment code.

 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
   ensure an eventfd is bound to at most one irqfd through the entire host,
   and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-07-29 08:35:46 -04:00
Manuel Andreas
5a53249d14 KVM: x86/xen: Fix cleanup logic in emulation of Xen schedop poll hypercalls
kvm_xen_schedop_poll does a kmalloc_array() when a VM polls the host
for more than one event channel potr (nr_ports > 1).

After the kmalloc_array(), the error paths need to go through the
"out" label, but the call to kvm_read_guest_virt() does not.

Fixes: 92c58965e9 ("KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly")
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
[Adjusted commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-07-23 23:48:54 +02:00
Paolo Bonzini
4b7d440de2 KVM TDX fixes for 6.16
- Fix a formatting goof in the TDX documentation.
 
  - Reject KVM_SET_TSC_KHZ for guests with a protected TSC (currently only TDX).
 
  - Ensure struct kvm_tdx_capabilities fields that are not explicitly set by KVM
    are zeroed.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmh5C9oACgkQOlYIJqCj
 N/1IlRAArAgX2C69qzTWougXFbwl0qiJhY3RpM+jCkiQsSkgICs8kTBSp5OFDRKE
 njtMvdCrhDSXGabRmOlTOjlbauZUU7Ady9Z9GIevVskTu/feknU3I33CQJMBUjCI
 49NG3o3p1vUWluK+WzzclHeojgAFVrPdoWZ3SLhgZf/N9s/hElI46bqpNN/zZOYd
 x2oDC1YJiRDoQmAF/hGDZxvQSZB7ZWvufUEi/lFomjSbWMecn26ynNaFe8jMdjyj
 cBXmsIn41qLMOdVZt+C5h7btcSar1uf0CZ4BfgZbKiAwZn6YH76rzBrAARP15rsI
 iKRfnOTaENx2SkKEX17Xlh1mOFKIMFP5wl2oQbW0pHMEObdUgIcwUlzOtKHmN/Nz
 dCN3W1sYat0fGec4lcx7emNmDDW8EfhmznS9eGo67okGKEbtUnzsDR01Ob335oVN
 F8ORqT1vthn2d3Na1FyAo3a5NKhL2/J67lNThDKzfdiQ7UxYeTjsFW4ys3Li6jDM
 H+LIcXpxLfEwbesbuCHAoyPbAcROFLuNgbTBr1CNm3zsfeqHzxeZAk7aUX6wbBV1
 +F7C7ANfvhae8OBkZoAgJJ3aJEbeloXboQUiijzM12l6Qz+shcpfqmIy5ZrZHQeI
 SKJhHlKDa11vy488sT23Pz1go23sQU83ZbB0x7sdcyOH+1dqCYQ=
 =V3cQ
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.16-rc7' of https://github.com/kvm-x86/linux into HEAD

KVM TDX fixes for 6.16

 - Fix a formatting goof in the TDX documentation.

 - Reject KVM_SET_TSC_KHZ for guests with a protected TSC (currently only TDX).

 - Ensure struct kvm_tdx_capabilities fields that are not explicitly set by KVM
   are zeroed.
2025-07-17 17:06:13 +02:00
Xiaoyao Li
ed302854d0 KVM: TDX: Don't report base TDVMCALLs
Remove TDVMCALLINFO_GET_QUOTE from user_tdvmcallinfo_1_r11 reported to
userspace to align with the direction of the GHCI spec.

Recently, concern was raised about a gap in the GHCI spec that left
ambiguity in how to expose to the guest that only a subset of GHCI
TDVMCalls were supported. During the back and forth on the spec details[0],
<GetQuote> was moved from an individually enumerable TDVMCall, to one that
is part of the 'base spec', meaning it doesn't have a specific bit in the
<GetTDVMCallInfo> return values. Although the spec[1] is still in draft
form, the GetQoute part has been agreed by the major TDX VMMs.

Unfortunately the commits that were upstreamed still treat <GetQuote> as
individually enumerable. They set bit 0 in the user_tdvmcallinfo_1_r11
which is reported to userspace to tell supported optional TDVMCalls,
intending to say that <GetQuote> is supported.

So stop reporting <GetQute> in user_tdvmcallinfo_1_r11 to align with
the direction of the spec, and allow some future TDVMCall to use that bit.

[0] https://lore.kernel.org/all/aEmuKII8FGU4eQZz@google.com/
[1] https://cdrdv2.intel.com/v1/dl/getContent/858626

Fixes: 28224ef02b ("KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250717022010.677645-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-07-17 16:24:23 +02:00
Sean Christopherson
b8be70ec2b KVM: VMX: Ensure unused kvm_tdx_capabilities fields are zeroed out
Zero-allocate the kernel's kvm_tdx_capabilities structure and copy only
the number of CPUID entries from the userspace structure.  As is, KVM
doesn't ensure kernel_tdvmcallinfo_1_{r11,r12} and user_tdvmcallinfo_1_r12
are zeroed, i.e. KVM will reflect whatever happens to be in the userspace
structure back at userspace, and thus may report garbage to userspace.

Zeroing the entire kernel structure also provides better semantics for the
reserved field.  E.g. if KVM extends kvm_tdx_capabilities to enumerate new
information by repurposing bytes from the reserved field, userspace would
be required to zero the new field in order to get useful information back
(because older KVMs without support for the repurposed field would report
garbage, a la the aforementioned tdvmcallinfo bugs).

Fixes: 61bb282796 ("KVM: TDX: Get system-wide info about TDX module on initialization")
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/all/3ef581f1-1ff1-4b99-b216-b316f6415318@intel.com
Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250714221928.1788095-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-15 14:04:39 -07:00
Kai Huang
b24bbb534c KVM: x86: Reject KVM_SET_TSC_KHZ vCPU ioctl for TSC protected guest
Reject KVM_SET_TSC_KHZ vCPU ioctl if guest's TSC is protected and not
changeable by KVM, and update the documentation to reflect it.

For such TSC protected guests, e.g. TDX guests, typically the TSC is
configured once at VM level before any vCPU are created and remains
unchanged during VM's lifetime.  KVM provides the KVM_SET_TSC_KHZ VM
scope ioctl to allow the userspace VMM to configure the TSC of such VM.
After that the userspace VMM is not supposed to call the KVM_SET_TSC_KHZ
vCPU scope ioctl anymore when creating the vCPU.

The de facto userspace VMM Qemu does this for TDX guests.  The upcoming
SEV-SNP guests with Secure TSC should follow.

Note, TDX support hasn't been fully released as of the "buggy" commit,
i.e. there is no established ABI to break.

Fixes: adafea1106 ("KVM: x86: Add infrastructure for secure TSC")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/71bbdf87fdd423e3ba3a45b57642c119ee2dd98c.1752444335.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-15 07:05:13 -07:00
Kai Huang
dcbe5a466c KVM: x86: Reject KVM_SET_TSC_KHZ VM ioctl when vCPUs have been created
Reject the KVM_SET_TSC_KHZ VM ioctl when vCPUs have been created and
update the documentation to reflect it.

The VM scope KVM_SET_TSC_KHZ ioctl is used to set up the default TSC
frequency that all subsequently created vCPUs can use.  It is only
intended to be called before any vCPU is created.  Allowing it to be
called after that only results in confusion but nothing good.

Note this is an ABI change.  But currently in Qemu (the de facto
userspace VMM) only TDX uses this VM ioctl, and it is only called once
before creating any vCPU, therefore the risk of breaking userspace is
pretty low.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/135a35223ce8d01cea06b6cef30bfe494ec85827.1752444335.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-14 15:29:33 -07:00
Zheyun Shen
6f38f8c574 KVM: SVM: Flush cache only on CPUs running SEV guest
On AMD CPUs without ensuring cache consistency, each memory page
reclamation in an SEV guest triggers a call to do WBNOINVD/WBINVD on all
CPUs, thereby affecting the performance of other programs on the host.

Typically, an AMD server may have 128 cores or more, while the SEV guest
might only utilize 8 of these cores. Meanwhile, host can use qemu-affinity
to bind these 8 vCPUs to specific physical CPUs.

Therefore, keeping a record of the physical core numbers each time a vCPU
runs can help avoid flushing the cache for all CPUs every time.

Take care to allocate the cpumask used to track which CPUs have run a
vCPU when copying or moving an "encryption context", as nothing guarantees
memory in a mirror VM is a strict subset of the ASID owner, and the
destination VM for intrahost migration needs to maintain it's own set of
CPUs.  E.g. for intrahost migration, if a CPU was used for the source VM
but not the destination VM, then it can only have cached memory that was
accessible to the source VM.  And a CPU that was run in the source is also
used by the destination is no different than a CPU that was run in the
destination only.

Note, KVM is guaranteed to do flush caches prior to sev_vm_destroy(),
thanks to kvm_arch_guest_memory_reclaimed for SEV and SEV-ES, and
kvm_arch_gmem_invalidate() for SEV-SNP.  I.e. it's safe to free the
cpumask prior to unregistering encrypted regions and freeing the ASID.

Opportunistically clean up sev_vm_destroy()'s comment regarding what is
(implicitly, what isn't) skipped  for mirror VMs.

Cc: Srikanth Aithal <sraithal@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-14 15:14:02 -07:00
Neeraj Upadhyay
17776e6c20 x86/apic: KVM: Move apic_test)vector() to common code
Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC
guest APIC driver in later patches to test vector state in the APIC
backing page.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:43 -07:00
Neeraj Upadhyay
fe954bcd57 x86/apic: KVM: Move lapic set/clear_vector() helpers to common code
Move apic_clear_vector() and apic_set_vector() helper functions to
apic.h in order to reuse them in the Secure AVIC guest APIC driver
in later patches to atomically set/clear vectors in the APIC backing
page.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-13-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:43 -07:00
Neeraj Upadhyay
3d3a9083da x86/apic: KVM: Move lapic get/set helpers to common code
Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and
apic_set_reg64() helper functions to apic.h in order to reuse them in the
Secure AVIC guest APIC driver in later patches to read/write registers
from/to the APIC backing page.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:42 -07:00
Neeraj Upadhyay
39e81633f6 x86/apic: KVM: Move apic_find_highest_vector() to a common header
In preparation for using apic_find_highest_vector() in Secure AVIC
guest APIC driver, move it and associated macros to apic.h.

No functional change intended.

Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:41 -07:00
Neeraj Upadhyay
b5f8980f29 KVM: x86: Rename lapic set/clear vector helpers
In preparation for moving kvm-internal kvm_lapic_set_vector(),
kvm_lapic_clear_vector() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-10-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:41 -07:00
Neeraj Upadhyay
9c23bc4fec KVM: x86: Rename lapic get/set_reg64() helpers
In preparation for moving kvm-internal __kvm_lapic_set_reg64(),
__kvm_lapic_get_reg64() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-9-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:40 -07:00
Neeraj Upadhyay
b9bd231913 KVM: x86: Rename lapic get/set_reg() helpers
In preparation for moving kvm-internal __kvm_lapic_set_reg(),
__kvm_lapic_get_reg() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-8-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:39 -07:00
Neeraj Upadhyay
bdaccfe4e5 KVM: x86: Rename find_highest_vector()
In preparation for moving kvm-internal find_highest_vector() to
apic.h for use in Secure AVIC APIC driver, rename find_highest_vector()
to apic_find_highest_vector() as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-7-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:39 -07:00
Neeraj Upadhyay
e2fa7905b2 KVM: x86: Change lapic regs base address to void pointer
Change APIC base address from "char *" to "void *" in KVM
lapic's set/get helper functions. Pointer arithmetic for "void *"
and "char *" operate identically. With "void *" there is less
of a chance of doing the wrong thing, e.g. neglecting to cast and
reading a byte instead of the desired APIC register size.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-6-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:38 -07:00
Neeraj Upadhyay
9cbb5fd156 KVM: x86: Rename VEC_POS/REG_POS macro usages
In preparation for moving most of the KVM's lapic helpers which
use VEC_POS/REG_POS macros to common APIC header for use in Secure
AVIC APIC driver, rename all VEC_POS/REG_POS macro usages to
APIC_VECTOR_TO_BIT_NUMBER/APIC_VECTOR_TO_REG_OFFSET and remove
VEC_POS/REG_POS.

While at it, clean up line wrap in find_highest_vector().

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-5-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:37 -07:00
Sean Christopherson
dc98e3bd49 x86/apic: KVM: Deduplicate APIC vector => register+bit math
Consolidate KVM's {REG,VEC}_POS() macros and lapic_vector_set_in_irr()'s
open coded equivalent logic in anticipation of the kernel gaining more
usage of vector => reg+bit lookups.

Use lapic_vector_set_in_irr()'s math as using divides for both the bit
number and register offset makes it easier to connect the dots, and for at
least one user, fixup_irqs(), "/ 32 * 0x10" generates ever so slightly
better code with gcc-14 (shaves a whole 3 bytes from the code stream):

((v) >> 5) << 4:
  c1 ef 05           shr    $0x5,%edi
  c1 e7 04           shl    $0x4,%edi
  81 c7 00 02 00 00  add    $0x200,%edi

(v) / 32 * 0x10:
  c1 ef 05           shr    $0x5,%edi
  83 c7 20           add    $0x20,%edi
  c1 e7 04           shl    $0x4,%edi

Keep KVM's tersely named macros as "wrappers" to avoid unnecessary churn
in KVM, and because the shorter names yield more readable code overall in
KVM.

The new macros type cast the vector parameter to "unsigned int". This is
required from better code generation for cases where an "int" is passed
to these macros in KVM code.

int v;

((v) >> 5) << 4:

  c1 f8 05    sar    $0x5,%eax
  c1 e0 04    shl    $0x4,%eax

((v) / 32 * 0x10):

  85 ff       test   %edi,%edi
  8d 47 1f    lea    0x1f(%rdi),%eax
  0f 49 c7    cmovns %edi,%eax
  c1 f8 05    sar    $0x5,%eax
  c1 e0 04    shl    $0x4,%eax

((unsigned int)(v) / 32 * 0x10):

  c1 f8 05    sar    $0x5,%eax
  c1 e0 04    shl    $0x4,%eax

(v) & (32 - 1):

  89 f8       mov    %edi,%eax
  83 e0 1f    and    $0x1f,%eax

(v) % 32

  89 fa       mov    %edi,%edx
  c1 fa 1f    sar    $0x1f,%edx
  c1 ea 1b    shr    $0x1b,%edx
  8d 04 17    lea    (%rdi,%rdx,1),%eax
  83 e0 1f    and    $0x1f,%eax
  29 d0       sub    %edx,%eax

(unsigned int)(v) % 32:

  89 f8       mov    %edi,%eax
  83 e0 1f    and    $0x1f,%eax

Overall kvm.ko text size is impacted if "unsigned int" is not used.

Bin      Orig     New (w/o unsigned int)  New (w/ unsigned int)

lapic.o  28580        28772                 28580
kvm.o    670810       671002                670810
kvm.ko   708079       708271                708079

No functional change intended.

[Neeraj: Type cast vec macro param to "unsigned int", provide data
         in commit log on "unsigned int" requirement]

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-4-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:37 -07:00
Neeraj Upadhyay
3fb7b83e2a KVM: x86: Remove redundant parentheses around 'bitmap'
When doing pointer arithmetic in apic_test_vector() and
kvm_lapic_{set|clear}_vector(), remove the unnecessary
parentheses surrounding the 'bitmap' parameter.

No functional change intended.

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-3-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:36 -07:00
Neeraj Upadhyay
ac48017020 KVM: x86: Open code setting/clearing of bits in the ISR
Remove __apic_test_and_set_vector() and __apic_test_and_clear_vector(),
because the _only_ register that's safe to modify with a non-atomic
operation is ISR, because KVM isn't running the vCPU, i.e. hardware can't
service an IRQ or process an EOI for the relevant (virtual) APIC.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-2-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:36 -07:00
Kevin Loughlin
a77896eea3 KVM: SEV: Prefer WBNOINVD over WBINVD for cache maintenance efficiency
AMD CPUs currently execute WBINVD in the host when unregistering SEV
guest memory or when deactivating SEV guests. Such cache maintenance is
performed to prevent data corruption, wherein the encrypted (C=1)
version of a dirty cache line might otherwise only be written back
after the memory is written in a different context (ex: C=0), yielding
corruption. However, WBINVD is performance-costly, especially because
it invalidates processor caches.

Strictly-speaking, unless the SEV ASID is being recycled (meaning the
SNP firmware requires the use of WBINVD prior to DF_FLUSH), the cache
invalidation triggered by WBINVD is unnecessary; only the writeback is
needed to prevent data corruption in remaining scenarios.

To improve performance in these scenarios, use WBNOINVD when available
instead of WBINVD. WBNOINVD still writes back all dirty lines
(preventing host data corruption by SEV guests) but does *not*
invalidate processor caches. Note that the implementation of wbnoinvd()
ensures fall back to WBINVD if WBNOINVD is unavailable.

In anticipation of forthcoming optimizations to limit the WBNOINVD only
to physical CPUs that have executed SEV guests, place the call to
wbnoinvd_on_all_cpus() in a wrapper function sev_writeback_caches().

Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250201000259.3289143-3-kevinloughlin@google.com
[sean: tweak comment regarding CLFUSH]
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:19 -07:00
Zheyun Shen
7e00013bd3 KVM: SVM: Remove wbinvd in sev_vm_destroy()
Before sev_vm_destroy() is called, kvm_arch_guest_memory_reclaimed()
has been called for SEV and SEV-ES and kvm_arch_gmem_invalidate()
has been called for SEV-SNP. These functions have already handled
flushing the memory. Therefore, this wbinvd_on_all_cpus() can
simply be dropped.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:43:43 -07:00
Sean Christopherson
55aed8c2db KVM: x86: Use wbinvd_on_cpu() instead of an open-coded equivalent
Use wbinvd_on_cpu() to target a single CPU instead of open-coding an
equivalent, and drop KVM's wbinvd_ipi() now that all users have switched
to x86 library versions.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:43:43 -07:00
Linus Torvalds
73d7cf0710 ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
   mapping at EL2 stage-1
 
 - Fix unexpected advertisement to the guest of unimplemented S2 base
   granule sizes
 
 - Gracefully fail initialising pKVM if the interrupt controller isn't
   GICv3
 
 - Also gracefully fail initialising pKVM if the carveout allocation
   fails
 
 - Fix the computing of the minimum MMIO range required for the host on
   stage-2 fault
 
 - Fix the generation of the GICv3 Maintenance Interrupt in nested mode
 
 x86:
 
 - Reject SEV{-ES} intra-host migration if one or more vCPUs are actively
   being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM.
 
 - Use a pre-allocated, per-vCPU buffer for handling de-sparsification of
   vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue.
 
 - Allow out-of-range/invalid Xen event channel ports when configuring IRQ
   routing, to avoid dictating a specific ioctl() ordering to userspace.
 
 - Conditionally reschedule when setting memory attributes to avoid soft
   lockups when userspace converts huge swaths of memory to/from private.
 
 - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest.
 
 - Add a missing field in struct sev_data_snp_launch_start that resulted in
   the guest-visible workarounds field being filled at the wrong offset.
 
 - Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid
   VM-Fail on INVVPID.
 
 - Advertise supported TDX TDVMCALLs to userspace.
 
 - Pass SetupEventNotifyInterrupt arguments to userspace.
 
 - Fix TSC frequency underflow.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmhurKgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNxHggApTP4vw+oOzfN7UoNmgR9XZMI1p2a
 R8AzQ1zDyVbEVWq3xTKvXtld+dKeO0yKB/XeI/1JLck1OiHxY57I3X6k5AnsurEr
 CBzeAhAjXivF8woMgmlP+30aqpomcPACdQm0gRnWkRDDJfXqSUas/iE/s9Ct1dT4
 4w3PtFLsSsU8vX/RttR+CqF1AQ6SeV/NRvA8hzPGMGZoQ2um74j4ZsM/3xh77Kdw
 Z2vOnZOIA4dk0074JjO/Yb9l00Ib4hn+MWG5jVJ+6i2HRRYd2knnB29apVS/ARdL
 X20j+LvtYj/jrPPdYwqjvxbIXyLbJrLCZyjKhfueN+rnisPNvzR+7YE4ZQ==
 =NduO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Many patches, pretty much all of them small, that accumulated while I
  was on vacation.

  ARM:

   - Remove the last leftovers of the ill-fated FPSIMD host state
     mapping at EL2 stage-1

   - Fix unexpected advertisement to the guest of unimplemented S2 base
     granule sizes

   - Gracefully fail initialising pKVM if the interrupt controller isn't
     GICv3

   - Also gracefully fail initialising pKVM if the carveout allocation
     fails

   - Fix the computing of the minimum MMIO range required for the host
     on stage-2 fault

   - Fix the generation of the GICv3 Maintenance Interrupt in nested
     mode

  x86:

   - Reject SEV{-ES} intra-host migration if one or more vCPUs are
     actively being created, so as not to create a non-SEV{-ES} vCPU in
     an SEV{-ES} VM

   - Use a pre-allocated, per-vCPU buffer for handling de-sparsification
     of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
     large" issue

   - Allow out-of-range/invalid Xen event channel ports when configuring
     IRQ routing, to avoid dictating a specific ioctl() ordering to
     userspace

   - Conditionally reschedule when setting memory attributes to avoid
     soft lockups when userspace converts huge swaths of memory to/from
     private

   - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest

   - Add a missing field in struct sev_data_snp_launch_start that
     resulted in the guest-visible workarounds field being filled at the
     wrong offset

   - Skip non-canonical address when processing Hyper-V PV TLB flushes
     to avoid VM-Fail on INVVPID

   - Advertise supported TDX TDVMCALLs to userspace

   - Pass SetupEventNotifyInterrupt arguments to userspace

   - Fix TSC frequency underflow"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: avoid underflow when scaling TSC frequency
  KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
  KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
  KVM: arm64: Don't free hyp pages with pKVM on GICv2
  KVM: arm64: Fix error path in init_hyp_mode()
  KVM: arm64: Adjust range correctly during host stage-2 faults
  KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
  KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
  KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
  Documentation: KVM: Fix unexpected unindent warnings
  KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
  KVM: Allow CPU to reschedule while setting per-page memory attributes
  KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
  KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
  KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
  KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
  KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
  KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
2025-07-10 09:06:53 -07:00
Zheyun Shen
4fdc3431e0 x86/lib: Add WBINVD and WBNOINVD helpers to target multiple CPUs
Extract KVM's open-coded calls to do writeback caches on multiple CPUs to
common library helpers for both WBINVD and WBNOINVD (KVM will use both).
Put the onus on the caller to check for a non-empty mask to simplify the
SMP=n implementation, e.g. so that it doesn't need to check that the one
and only CPU in the system is present in the mask.

  [sean: move to lib, add SMP=n helpers, clarify usage]

Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250128015345.7929-2-szy0127@sjtu.edu.cn
Link: https://lore.kernel.org/20250522233733.3176144-5-seanjc@google.com
2025-07-10 13:30:17 +02:00
Paolo Bonzini
4578a747f3 KVM: x86: avoid underflow when scaling TSC frequency
In function kvm_guest_time_update(), __scale_tsc() is used to calculate
a TSC *frequency* rather than a TSC value.  With low-enough ratios,
a TSC value that is less than 1 would underflow to 0 and to an infinite
while loop in kvm_get_time_scale():

  kvm_guest_time_update(struct kvm_vcpu *v)
    if (kvm_caps.has_tsc_control)
      tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
                                  v->arch.l1_tsc_scaling_ratio);
        __scale_tsc(u64 ratio, u64 tsc)
          ratio=122380531, tsc=2299998, N=48
          ratio*tsc >> N = 0.999... -> 0

Later in the function:

  Call Trace:
   <TASK>
   kvm_get_time_scale arch/x86/kvm/x86.c:2458 [inline]
   kvm_guest_time_update+0x926/0xb00 arch/x86/kvm/x86.c:3268
   vcpu_enter_guest.constprop.0+0x1e70/0x3cf0 arch/x86/kvm/x86.c:10678
   vcpu_run+0x129/0x8d0 arch/x86/kvm/x86.c:11126
   kvm_arch_vcpu_ioctl_run+0x37a/0x13d0 arch/x86/kvm/x86.c:11352
   kvm_vcpu_ioctl+0x56b/0xe60 virt/kvm/kvm_main.c:4188
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:871 [inline]
   __se_sys_ioctl+0x12d/0x190 fs/ioctl.c:857
   do_syscall_x64 arch/x86/entry/common.c:51 [inline]
   do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81
   entry_SYSCALL_64_after_hwframe+0x78/0xe2

This can really happen only when fuzzing, since the TSC frequency
would have to be nonsensically low.

Fixes: 35181e86df ("KVM: x86: Add a common TSC scaling function")
Reported-by: Yuntao Liu <liuyuntao12@huawei.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-07-09 13:52:50 -04:00
Jim Mattson
a7cec20845 KVM: x86: Provide a capability to disable APERF/MPERF read intercepts
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.

The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.

Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:33:37 -07:00
Jim Mattson
6fbef8615d KVM: x86: Replace growing set of *_in_guest bools with a u64
Store each "disabled exit" boolean in a single bit rather than a byte.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:32:32 -07:00
Xin Li
e88cfd50b6 KVM: x86: Advertise support for LKGS
Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace
if the instruction is supported by the underlying CPU.

LKGS is introduced with FRED to completely eliminate the need to swapgs
explicilty.  It behaves like the MOV to GS instruction except that it
loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the
GS segment’s descriptor cache, which is exactly what Linux kernel does
to load a user level GS base.  Thus there is no need to SWAPGS away
from the kernel GS base.

LKGS is an independent CPU feature that works correctly in a KVM guest
without requiring explicit enablement.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:32:25 -07:00
Sean Christopherson
e1ef1c57ff KVM: VMX: Add a macro to track which DEBUGCTL bits are host-owned
Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e.
need to be preserved when running the guest, to dedup the logic without
having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:30:52 -07:00
Linus Torvalds
6e9128ff9d Add the mitigation logic for Transient Scheduler Attacks (TSA)
TSA are new aspeculative side channel attacks related to the execution
 timing of instructions under specific microarchitectural conditions. In
 some cases, an attacker may be able to use this timing information to
 infer data from other contexts, resulting in information leakage.
 
 Add the usual controls of the mitigation and integrate it into the
 existing speculation bugs infrastructure in the kernel.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe
 VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC
 POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G
 CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/
 ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu
 oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx
 bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY
 0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa
 NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ
 mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH
 hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa
 FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo=
 =2cLl
 -----END PGP SIGNATURE-----

Merge tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU speculation fixes from Borislav Petkov:
 "Add the mitigation logic for Transient Scheduler Attacks (TSA)

  TSA are new aspeculative side channel attacks related to the execution
  timing of instructions under specific microarchitectural conditions.
  In some cases, an attacker may be able to use this timing information
  to infer data from other contexts, resulting in information leakage.

  Add the usual controls of the mitigation and integrate it into the
  existing speculation bugs infrastructure in the kernel"

* tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/process: Move the buffer clearing before MONITOR
  x86/microcode/AMD: Add TSA microcode SHAs
  KVM: SVM: Advertise TSA CPUID bits to guests
  x86/bugs: Add a Transient Scheduler Attacks mitigation
  x86/bugs: Rename MDS machinery to something more generic
2025-07-07 17:08:36 -07:00
Linus Torvalds
cc69ac7a65 - Make sure DR6 and DR7 are initialized to their architectural values and not
accidentally cleared, leading to misconfigurations
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhg/UQACgkQEsHwGGHe
 VUrmXBAAtOrpEtR4geeBeZtCEaUxE4DE8Zvj36dr+sAScHTXNTzYK94mAy/AHU22
 V3rF12/kuyyZSrwROXLBD6PgkHEn8u0WLztSeqP/SisnoLjMTV9H9TuGYBUoz1NS
 clkoElQ6DJP5BVzmYpZlJrcofqNjkS/mAxfMRoIAq+LzKkb3iL/Lge+Ox/IDUg8z
 L9wRlKh/IaJ5EETWlqh0gkFeS/M9DXYmfkasDQeVkUxnKFeXBdyUGc2jFzBGX5RA
 rsdnz+C3x3ow2U9N+ZMVr+n06yTZvh+fAiU8emeBQm0q5fZBBHWDbnZZtWf+KG6s
 43tlWyVqic5yzyQbUpRC2sttOkIAtOCMx36XexbGm1eKRNNc6fTz9IlgO/97HkuE
 lYBNq0zd/p5Kb53lXb3uwBVy4sjIEZUyD/K5DO4YfTgamcwXl8BP5xnKtNPqImI5
 aaF3xKKLOUDOTL1CcK5YG0joaU1k+I0F0KO7HYqkDi8Uf5naWZSUNil8nPQn8RX7
 3f3LJx0e3j2o0f60AHI4mjUAUJHsxExmpaTl079k03wt8YVE3ucNaUN6se6nidVz
 H5q0JU4q3C3DCu0I3Ub4wa5QXGA+TOHKuqhJCapKAVAAQDlbV2z8GxA/3B2YbRM+
 eZ6/RVyk++VrRXIyfmfwLPu0CLVoSNhaUhu/hFrYbjA3NX+85qw=
 =fjgT
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure DR6 and DR7 are initialized to their architectural values
   and not accidentally cleared, leading to misconfigurations

* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Initialize DR7 by writing its architectural reset value
  x86/traps: Initialize DR6 by writing its architectural reset value
2025-06-29 08:28:24 -07:00
Sean Christopherson
bbc13ae593 VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment()
Drop kvm_arch_{start,end}_assignment() and all associated code now that
KVM x86 no longer consumes assigned_device_count.  Tracking whether or not
a VFIO-assigned device is formally associated with a VM is fundamentally
flawed, as such an association is optional for general usage, i.e. is prone
to false negatives.  E.g. prior to commit 2edd9cb79f ("kvm: detect
assigned device via irqbypass manager"), device passthrough via VFIO would
fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM
binding.

And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT,
shouldn't be relying on generic x86 tracking infrastructure.

Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:51:33 -07:00
Sean Christopherson
ff845e6a84 Revert "kvm: detect assigned device via irqbypass manager"
Now that KVM explicitly tracks the number of possible bypass IRQs, and
doesn't conflate IRQ bypass with host MMIO access, stop bumping the
assigned device count when adding an IRQ bypass producer.

This reverts commit 2edd9cb79f.

Link: https://lore.kernel.org/r/20250523011756.3243624-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:51:29 -07:00
Sean Christopherson
95d9b5d8d6 Merge branch 'kvm-x86 mmio'
Merge the MMIO stale data branch with the device posted IRQs branch to
provide a common base for removing KVM's tracking of "assigned" devices.

Link: https://lore.kernel.org/all/20250523011756.3243624-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:50:58 -07:00
Chao Gao
05186d7a8e KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR
Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR
interception, ensuring consistency with other cases where MSRs are
intercepted depending on guest caps and CPUIDs.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:47:28 -07:00
Manuel Andreas
fa787ac07b KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
In KVM guests with Hyper-V hypercalls enabled, the hypercalls
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
allow a guest to request invalidation of portions of a virtual TLB.
For this, the hypercall parameter includes a list of GVAs that are supposed
to be invalidated.

However, when non-canonical GVAs are passed, there is currently no
filtering in place and they are eventually passed to checked invocations of
INVVPID on Intel / INVLPGA on AMD.  While AMD's INVLPGA silently ignores
non-canonical addresses (effectively a no-op), Intel's INVVPID explicitly
signals VM-Fail and ultimately triggers the WARN_ONCE in invvpid_error():

  invvpid failed: ext=0x0 vpid=1 gva=0xaaaaaaaaaaaaa000
  WARNING: CPU: 6 PID: 326 at arch/x86/kvm/vmx/vmx.c:482
  invvpid_error+0x91/0xa0 [kvm_intel]
  Modules linked in: kvm_intel kvm 9pnet_virtio irqbypass fuse
  CPU: 6 UID: 0 PID: 326 Comm: kvm-vm Not tainted 6.15.0 #14 PREEMPT(voluntary)
  RIP: 0010:invvpid_error+0x91/0xa0 [kvm_intel]
  Call Trace:
    vmx_flush_tlb_gva+0x320/0x490 [kvm_intel]
    kvm_hv_vcpu_flush_tlb+0x24f/0x4f0 [kvm]
    kvm_arch_vcpu_ioctl_run+0x3013/0x5810 [kvm]

Hyper-V documents that invalid GVAs (those that are beyond a partition's
GVA space) are to be ignored.  While not completely clear whether this
ruling also applies to non-canonical GVAs, it is likely fine to make that
assumption, and manual testing on Azure confirms "real" Hyper-V interprets
the specification in the same way.

Skip non-canonical GVAs when processing the list of address to avoid
tripping the INVVPID failure.  Alternatively, KVM could filter out "bad"
GVAs before inserting into the FIFO, but practically speaking the only
downside of pushing validation to the final processing is that doing so
is suboptimal for the guest, and no well-behaved guest will request TLB
flushes for non-canonical addresses.

Fixes: 260970862c ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
Cc: stable@vger.kernel.org
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/c090efb3-ef82-499f-a5e0-360fc8420fb7@tum.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:15:24 -07:00
Sean Christopherson
83ebe71574 KVM: VMX: Apply MMIO Stale Data mitigation if KVM maps MMIO into the guest
Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO
into the VM, not if the VM has an assigned device.  VFIO is but one of
many ways to map host MMIO into a KVM guest, and even within VFIO,
formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely
optional.

Track whether or not the guest can access host MMIO on a per-MMU basis,
i.e. based on whether or not the vCPU has a mapping to host MMIO.  For
simplicity, track MMIO mappings in "special" rools (those without a
kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so
only legacy 32-bit shadow paging is affected, i.e. lack of precise
tracking is a complete non-issue.

Make the per-MMU and per-VM flags sticky.  Detecting when *all* MMIO
mappings have been removed would be absurdly complex.  And in practice,
removing MMIO from a guest will be done by deleting the associated memslot,
which by default will force KVM to re-allocate all roots.  Special roots
will forever be mitigated, but as above, the affected scenarios are not
expected to be performance sensitive.

Use a VMX_RUN flag to communicate the need for a buffers flush to
vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its
dependencies don't need to be marked __always_inline, e.g. so that KASAN
doesn't trigger a noinstr violation.

Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 8cb861e9e3 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 08:42:51 -07:00
Chao Gao
3f06b8927a KVM: x86: Deduplicate MSR interception enabling and disabling
Extract a common function from MSR interception disabling logic and create
disabling and enabling functions based on it. This removes most of the
duplicated code for MSR interception disabling/enabling.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com
[sean: s/enable/set, inline the wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 15:42:12 -07:00