Commit Graph

338 Commits

Author SHA1 Message Date
Paolo Bonzini
314b40b3b6 KVM/arm64 changes for 6.17, round #1
- Host driver for GICv5, the next generation interrupt controller for
    arm64, including support for interrupt routing, MSIs, interrupt
    translation and wired interrupts.
 
  - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
    GICv5 hardware, leveraging the legacy VGIC interface.
 
  - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
    userspace to disable support for SGIs w/o an active state on hardware
    that previously advertised it unconditionally.
 
  - Map supporting endpoints with cacheable memory attributes on systems
    with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
    maintenance on the address range.
 
  - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
    hypervisor to inject external aborts into an L2 VM and take traps of
    masked external aborts to the hypervisor.
 
  - Convert more system register sanitization to the config-driven
    implementation.
 
  - Fixes to the visibility of EL2 registers, namely making VGICv3 system
    registers accessible through the VGIC device instead of the ONE_REG
    vCPU ioctls.
 
  - Various cleanups and minor fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCaIezbRccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFr/eAQDY5NIG5cR6ZcAWnPQLmGWpz2ou
 pq4Jhn9E/mGR3n5L1AEAsJpfLLpOsmnLBdwfbjmW59gGsa8k3i5tjWEOJ6yzAwk=
 =r+sp
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.17' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 changes for 6.17, round #1

 - Host driver for GICv5, the next generation interrupt controller for
   arm64, including support for interrupt routing, MSIs, interrupt
   translation and wired interrupts.

 - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
   GICv5 hardware, leveraging the legacy VGIC interface.

 - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
   userspace to disable support for SGIs w/o an active state on hardware
   that previously advertised it unconditionally.

 - Map supporting endpoints with cacheable memory attributes on systems
   with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
   maintenance on the address range.

 - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
   hypervisor to inject external aborts into an L2 VM and take traps of
   masked external aborts to the hypervisor.

 - Convert more system register sanitization to the config-driven
   implementation.

 - Fixes to the visibility of EL2 registers, namely making VGICv3 system
   registers accessible through the VGIC device instead of the ONE_REG
   vCPU ioctls.

 - Various cleanups and minor fixes.
2025-07-29 12:27:40 -04:00
Paolo Bonzini
1a14928e2e Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.17

 - Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the
   guest.  Failure to honor FREEZE_IN_SMM can bleed host state into the guest.

 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to
   prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.

 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.

 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.

 - Recalculate all MSR intercepts from the "source" on MSR filter changes, and
   drop the dedicated "shadow" bitmaps (and their awful "max" size defines).

 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR.

 - Advertise support for LKGS (Load Kernel GS base), a new instruction that's
   loosely related to FRED, but is supported and enumerated independently.

 - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED,
   a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON).  Use
   the same approach KVM uses for dealing with "impossible" emulation when
   running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU
   has architecturally impossible state.

 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can "virtualize"
   APERF/MPERF (with many caveats).

 - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default"
   frequency is unsupported for VMs with a "secure" TSC, and there's no known
   use case for changing the default frequency for other VM types.
2025-07-29 08:36:43 -04:00
Paolo Bonzini
65164fd0f6 KVM/riscv changes for 6.17
- Enabled ring-based dirty memory tracking
 - Improved perf kvm stat to report interrupt events
 - Delegate illegal instruction trap to VS-mode
 - MMU related improvements for KVM RISC-V for upcoming
   nested virtualization
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmiIr0QACgkQrUjsVaLH
 LAf4yA/+KkfQCgMFhwpml6tIzFaa9yS+C9oqemnlWVT/wTADg18/+hraomqUnYhY
 HqOPdWo6O3vmH6E0jR6+AQXz5f4whYl8X8m+HdBHz8c1rF6ozoLMF4Qh3lsDDuZ0
 7pdIEzkNLCA3umBxXUzy7yexKWSzcGD3521toFMPADQHaZTwT8Om5/KLblHB+he3
 1DzsvErPWknWW1pynvSUJoD4zB41Qn364sJvyq4tAW6i8DmxLAmM/+Reh7GBP83r
 t7nAYVdnYikFj0oCb60NcFHqOQpk88mZTqCPMeZD1BoazEDXCPkdx0J44NsBRjun
 BhEpgBLIZDIwOF1A/DDIPrNuNOjSeeUAsAY1sK/yVEOkVZ1HPndyCil5SkE1FJHT
 dmsJBXq96dTlXYo9jBfExFUaUCI1mivLbX7uziIT1876IgLr5NlEJSbwk+TQ8VmR
 IS1PISi7yp5LeZcJEh6PBYgo02UE9gQ/C3tvvcaHbxXyQjVacB6Dw7EnaBArYMwv
 dbEPkxXem90Vup90ixgdLBW3nGCDckpogbsmlqoxV5m3MOknE5L8IuWxKXKB56Tp
 pxp7o1JywBV+Ym5w1BkpTvyEL/a2VDGFOfjryq/h8NAVxXPfwk2plyNhrwLt+Naz
 dYF7RprQL3XTtPnVOND6mMHyl2Sz9y0dc2tj6mpGamSBWPxlP/w=
 =ifud
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.17-2' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.17

- Enabled ring-based dirty memory tracking
- Improved perf kvm stat to report interrupt events
- Delegate illegal instruction trap to VS-mode
- MMU related improvements for KVM RISC-V for upcoming
  nested virtualization
2025-07-29 08:33:04 -04:00
Quan Zhou
f55ffaf896 RISC-V: KVM: Enable ring-based dirty memory tracking
Enable ring-based dirty memory tracking on riscv:

- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as riscv is weakly
  ordered.
- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
  offset.
- Add a check to kvm_vcpu_kvm_riscv_check_vcpu_requests for checking
  whether the dirty ring is soft full.

To handle vCPU requests that cause exits to userspace, modified the
`kvm_riscv_check_vcpu_requests` to return a value (currently only
returns 0 or 1).

Signed-off-by: Quan Zhou <zhouquan@iscas.ac.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20e116efb1f7aff211dd8e3cf8990c5521ed5f34.1749810735.git.zhouquan@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:28:22 +05:30
Kai Huang
b24bbb534c KVM: x86: Reject KVM_SET_TSC_KHZ vCPU ioctl for TSC protected guest
Reject KVM_SET_TSC_KHZ vCPU ioctl if guest's TSC is protected and not
changeable by KVM, and update the documentation to reflect it.

For such TSC protected guests, e.g. TDX guests, typically the TSC is
configured once at VM level before any vCPU are created and remains
unchanged during VM's lifetime.  KVM provides the KVM_SET_TSC_KHZ VM
scope ioctl to allow the userspace VMM to configure the TSC of such VM.
After that the userspace VMM is not supposed to call the KVM_SET_TSC_KHZ
vCPU scope ioctl anymore when creating the vCPU.

The de facto userspace VMM Qemu does this for TDX guests.  The upcoming
SEV-SNP guests with Secure TSC should follow.

Note, TDX support hasn't been fully released as of the "buggy" commit,
i.e. there is no established ABI to break.

Fixes: adafea1106 ("KVM: x86: Add infrastructure for secure TSC")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/71bbdf87fdd423e3ba3a45b57642c119ee2dd98c.1752444335.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-15 07:05:13 -07:00
Kai Huang
dcbe5a466c KVM: x86: Reject KVM_SET_TSC_KHZ VM ioctl when vCPUs have been created
Reject the KVM_SET_TSC_KHZ VM ioctl when vCPUs have been created and
update the documentation to reflect it.

The VM scope KVM_SET_TSC_KHZ ioctl is used to set up the default TSC
frequency that all subsequently created vCPUs can use.  It is only
intended to be called before any vCPU is created.  Allowing it to be
called after that only results in confusion but nothing good.

Note this is an ABI change.  But currently in Qemu (the de facto
userspace VMM) only TDX uses this VM ioctl, and it is only called once
before creating any vCPU, therefore the risk of breaking userspace is
pretty low.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/135a35223ce8d01cea06b6cef30bfe494ec85827.1752444335.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-14 15:29:33 -07:00
Binbin Wu
073b3eca08 Documentation: KVM: Fix unexpected unindent warning
Add proper indentations to bullet list item to resolve the warning:
"Bullet list ends without a blank line; unexpected unindent."

Closes:https://lore.kernel.org/kvm/20250623162110.6e2f4241@canb.auug.org.au

Fixes: 4580dbef5c ("KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250701012536.1281367-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:47:23 -07:00
Jim Mattson
a7cec20845 KVM: x86: Provide a capability to disable APERF/MPERF read intercepts
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.

The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.

Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:33:37 -07:00
Ankit Agrawal
f55ce5a6cd KVM: arm64: Expose new KVM cap for cacheable PFNMAP
Introduce a new KVM capability to expose to the userspace whether
cacheable mapping of PFNMAP is supported.

The ability to safely do the cacheable mapping of PFNMAP is contingent
on S2FWB and ARM64_HAS_CACHE_DIC. S2FWB allows KVM to avoid flushing
the D cache, ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. The cap would be false if
those requirements are missing and is checked by making use of
kvm_arch_supports_cacheable_pfnmap.

This capability would allow userspace to discover the support.
It could for instance be used by userspace to prevent live-migration
across FWB and non-FWB hosts.

CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Oliver Upton <oliver.upton@linux.dev>
CC: David Hildenbrand <david@redhat.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250705071717.5062-7-ankita@nvidia.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-07-07 16:54:52 -07:00
Binbin Wu
0c84b53404 Documentation: KVM: Fix unexpected unindent warnings
Add proper indentations to bullet list items to resolve the warning:
"Bullet list ends without a blank line; unexpected unindent."

Closes:https://lore.kernel.org/kvm/20250623162110.6e2f4241@canb.auug.org.au/

Fixes: cf207eac06 ("KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>")
Fixes: 25e8b1dd48 ("KVM: TDX: Exit to userspace for GetTdVmCallInfo")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20250625014829.82289-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 06:29:45 -07:00
Paolo Bonzini
4580dbef5c KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 14:09:50 -04:00
Binbin Wu
25e8b1dd48 KVM: TDX: Exit to userspace for GetTdVmCallInfo
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX,
to allow userspace to provide information about the support of
TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API.

GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>,
<ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>,
<Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and
<Instruction.WRMSR>. They must be supported by VMM to support TDX guests.

For GetTdVmCallInfo
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0,
  successful execution indicates all GHCI base TDVMCALLs listed above are
  supported.

  Update the KVM TDX document with the set of the GHCI base APIs.

- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it
  indicates the TDX guest is querying the supported TDVMCALLs beyond
  the GHCI base TDVMCALLs.
  Exit to userspace to let userspace set the TDVMCALL sub-function bit(s)
  accordingly to the leaf outputs.  KVM could set the TDVMCALL bit(s)
  supported by itself when the TDVMCALLs don't need support from userspace
  after returning from userspace and before entering guest. Currently, no
  such TDVMCALLs implemented, KVM just sets the values returned from
  userspace.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:55:47 -04:00
Binbin Wu
cf207eac06 KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
Handle TDVMCALL for GetQuote to generate a TD-Quote.

GetQuote is a doorbell-like interface used by TDX guests to request VMM
to generate a TD-Quote signed by a service hosting TD-Quoting Enclave
operating on the host.  A TDX guest passes a TD Report (TDREPORT_STRUCT) in
a shared-memory area as parameter.  Host VMM can access it and queue the
operation for a service hosting TD-Quoting enclave.  When completed, the
Quote is returned via the same shared-memory area.

KVM only checks the GPA from the TDX guest has the shared-bit set and drops
the shared-bit before exiting to userspace to avoid bleeding the shared-bit
into KVM's exit ABI.  KVM forwards the request to userspace VMM (e.g. QEMU)
and userspace VMM queues the operation asynchronously.  KVM sets the return
code according to the 'ret' field set by userspace to notify the TDX guest
whether the request has been queued successfully or not.  When the request
has been queued successfully, the TDX guest can poll the status field in
the shared-memory area to check whether the Quote generation is completed
or not.  When completed, the generated Quote is returned via the same
buffer.

Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is
required to handle the KVM exit reason as the initial support for TDX,
by reentering KVM to ensure that the TDVMCALL is complete.  While at it,
add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20 13:09:32 -04:00
Paolo Bonzini
4e02d4f973 KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
    fix a race between AP destroy and VMRUN.
 
  - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 
  - Add support for ALLOWED_SEV_FEATURES.
 
  - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 
  - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
    that utilize those bits.
 
  - Don't account temporary allocations in sev_send_update_data().
 
  - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj
 N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx
 8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y
 jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5
 RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9
 TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU
 PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb
 7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI
 e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS
 2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ
 hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK
 HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE=
 =k6rK
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.16:

 - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
   fix a race between AP destroy and VMRUN.

 - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.

 - Add support for ALLOWED_SEV_FEATURES.

 - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.

 - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
   that utilize those bits.

 - Don't account temporary allocations in sev_send_update_data().

 - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-27 12:15:49 -04:00
Paolo Bonzini
1f7c9d52b1 KVM/riscv changes for 6.16
- Add vector registers to get-reg-list selftest
 - VCPU reset related improvements
 - Remove scounteren initialization from VCPU reset
 - Support VCPU reset from userspace using set_mpstate() ioctl
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmgx8nkACgkQrUjsVaLH
 LAdmYw//aoxlXA8OKFMWFzEC4yZHfq502X1p95Z9h1i8IYKoRcqNxX8O5MJt0y8Z
 dyhgqKLT5q5bpUf/yxJjuJdZQJtCTC0o1bx89sSiM83M2J+AZymcIS5yFVIcxetv
 JRsaWh8GDc/XYUFBnjYCj5zmA5IObbd/QCVnkATjVRcRS56LnRv98P9YPxN7zhHF
 OhDRWELtUowk6FMgiHo75R9vYhOO1ywzzjLFnK5ZHLSYeQ7bhRX/jY5n1BShsWjp
 k+zVRTpJobw9AU76EPQaEKuqiynQz35ZPkxAiuWYac6SDKwmztvGJ1fCnSNAkJLk
 0kN/eAxv61F5nCRJxOkxK0Uy3U94zA6zX53VdnLoRN4rYpA8CYrE4mNAddN31IBC
 NHlBS59w2EsUxIRL1FiCUrEKKgeSWqJY1NuqsmB4ogeo4MKm8n1OhiYSU0l7NnQ6
 3h77ccnHN95cah2C9XDX3GeZ+on6z1t6FjZ4Enki1w3CCfGdKMsfphwUckTNzSvw
 hTeXhYHcP4VKsYkCcitdLR/VFwO4a3HlnjAtHdJLh0qfJ5SergifZ5/eQXhHu7f1
 1uyfk/6nKculr/8yzUVkOR7kerMjBuBx8jir89ceKD8qeA+4MUI1rdwEvzbzaEY9
 9aDwEVbQ1qMjpeONfKUvbyLJ7uN5Lae1/X4Kafmo2TWNAvtLPLo=
 =+S8x
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.16-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.16

- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
2025-05-26 16:27:00 -04:00
Paolo Bonzini
4d526b02df KVM/arm64 updates for 6.16
* New features:
 
   - Add large stage-2 mapping support for non-protected pKVM guests,
     clawing back some performance.
 
   - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
     protected modes.
 
   - Enable nested virtualisation support on systems that support it
     (yes, it has been a long time coming), though it is disabled by
     default.
 
 * Improvements, fixes and cleanups:
 
   - Large rework of the way KVM tracks architecture features and links
     them with the effects of control bits. This ensures correctness of
     emulation (the data is automatically extracted from the published
     JSON files), and helps dealing with the evolution of the
     architecture.
 
   - Significant changes to the way pKVM tracks ownership of pages,
     avoiding page table walks by storing the state in the hypervisor's
     vmemmap. This in turn enables the THP support described above.
 
   - New selftest checking the pKVM ownership transition rules
 
   - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
     even if the host didn't have it.
 
   - Fixes for the address translation emulation, which happened to be
     rather buggy in some specific contexts.
 
   - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
     from the number of counters exposed to a guest and addressing a
     number of issues in the process.
 
   - Add a new selftest for the SVE host state being corrupted by a
     guest.
 
   - Keep HCR_EL2.xMO set at all times for systems running with the
     kernel at EL2, ensuring that the window for interrupts is slightly
     bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
 
   - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
     from a pretty bad case of TLB corruption unless accesses to HCR_EL2
     are heavily synchronised.
 
   - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
     tables in a human-friendly fashion.
 
   - and the usual random cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmgwU7UACgkQI9DQutE9
 ekN93g//fNnejxf01dBFIbuylzYEyHZSEH0iTGLeM+ES9zvntCzciTYVzb27oqNG
 RDLShlQYp3w4rAe6ORzyePyHptOmKXCxfj/VXUFp3A7H9QYOxt1nacD3WxI9fCOo
 LzaSLquvgwFBaeTdDE0KdeTUKQHluId+w1Azh0lnHGeUP+lOHNZ8FqoP1/la0q04
 GvVL+l3wz/IhPP8r1YA0Q1bzJ5SLfSpjIw/0F5H/xgI4lyYdHzgFL8sKuSyFeCyM
 2STQi+ZnTCsAs4bkXkw2Pp9CFYrfQgZi+sf7Om+noAKhbJo3vb7/RHpgjv+QCjJy
 Kx4g9CbxHfaM03cH6uSLBoFzsACR1iAuUz8BCSRvvVNH4RVT6H+34nzjLZXLncrP
 gm1uYs9aMTLr91caeAx0aYIMWGYa1uqV0rum3WxyIHezN9Q/NuQoZyfprUufr8oX
 wCYE+ot4VT3DwG0UFZKKwj0BiCbYcbph9nBLVyZJsg8OKxpvspkCtPriFp1kb6BP
 dTTGSXd9JJqwSgP9qJLxijcv6Nfgp2gT42TWwh/dJRZXhnTCvr9IyclFIhoIIq3G
 Q2BkFCXOoEoNQhBA1tiWzJ9nDHf52P72Z2K1gPyyMZwF49HGa2BZBCJGkqX06wSs
 Riolf1/cjFhDno1ThiHKsHT0sG1D4oc9k/1NLq5dyNAEGcgATIA=
 =Jju3
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.16

* New features:

  - Add large stage-2 mapping support for non-protected pKVM guests,
    clawing back some performance.

  - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
    protected modes.

  - Enable nested virtualisation support on systems that support it
    (yes, it has been a long time coming), though it is disabled by
    default.

* Improvements, fixes and cleanups:

  - Large rework of the way KVM tracks architecture features and links
    them with the effects of control bits. This ensures correctness of
    emulation (the data is automatically extracted from the published
    JSON files), and helps dealing with the evolution of the
    architecture.

  - Significant changes to the way pKVM tracks ownership of pages,
    avoiding page table walks by storing the state in the hypervisor's
    vmemmap. This in turn enables the THP support described above.

  - New selftest checking the pKVM ownership transition rules

  - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
    even if the host didn't have it.

  - Fixes for the address translation emulation, which happened to be
    rather buggy in some specific contexts.

  - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
    from the number of counters exposed to a guest and addressing a
    number of issues in the process.

  - Add a new selftest for the SVE host state being corrupted by a
    guest.

  - Keep HCR_EL2.xMO set at all times for systems running with the
    kernel at EL2, ensuring that the window for interrupts is slightly
    bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

  - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
    from a pretty bad case of TLB corruption unless accesses to HCR_EL2
    are heavily synchronised.

  - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
    tables in a human-friendly fashion.

  - and the usual random cleanups.
2025-05-26 16:19:46 -04:00
Radim Krčmář
5b9db9c16f RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
Add a toggleable VM capability to reset the VCPU from userspace by
setting MP_STATE_INIT_RECEIVED through IOCTL.

Reset through a mp_state to avoid adding a new IOCTL.
Do not reset on a transition from STOPPED to RUNNABLE, because it's
better to avoid side effects that would complicate userspace adoption.
The MP_STATE_INIT_RECEIVED is not a permanent mp_state -- IOCTL resets
the VCPU while preserving the original mp_state -- because we wouldn't
gain much from having a new state it in the rest of KVM, but it's a very
non-standard use of the IOCTL.

Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250515143723.2450630-5-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-05-21 09:34:57 +05:30
Manali Shukla
89f9edf4c6 KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.

The biggest difference between the two features is that Threshold is
fault-like, whereas Detection is trap-like.  To allow the guest to make
forward progress, Threshold provides a per-VMCB counter which is
decremented every time a bus lock occurs, and a VM-Exit is triggered if
and only if the counter is '0'.

To provide Detection-like semantics, initialize the counter to '0', i.e.
exit on every bus lock, and when re-executing the guilty instruction, set
the counter to '1' to effectively step past the instruction.

Note, in the unlikely scenario that re-executing the instruction doesn't
trigger a bus lock, e.g. because the guest has changed memory types or
patched the guilty instruction, the bus lock counter will be left at '1',
i.e. the guest will be able to do a bus lock on a different instruction.
In a perfect world, KVM would ensure the counter is '0' if the guest has
made forward progress, e.g. if RIP has changed.  But trying to close that
hole would incur non-trivial complexity, for marginal benefit; the intent
of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks,
not to allow for precise detection of problematic guest code.  And, it's
simply not feasible to fully close the hole, e.g. if an interrupt arrives
before the original instruction can re-execute, the guest could step past
a different bus lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com
[sean: fix typo in comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-19 11:05:10 -07:00
Marc Zyngier
29d1697c8c KVM: arm64: Document NV caps and vcpu flags
Describe the two new vcpu flags that control NV, together with
the capabilities that advertise them.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250514103501.2225951-18-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Paolo Bonzini
fd02aa45bd Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM.  All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).

The series is in turn split into multiple sub-series, each with a separate
merge commit:

- Initialization: basic setup for using the TDX module from KVM, plus
  ioctls to create TDX VMs and vCPUs.

- MMU: in TDX, private and shared halves of the address space are mapped by
  different EPT roots, and the private half is managed by the TDX module.
  Using the support that was added to the generic MMU code in 6.14,
  add support for TDX's secure page tables to the Intel side of KVM.
  Generic KVM code takes care of maintaining a mirror of the secure page
  tables so that they can be queried efficiently, and ensuring that changes
  are applied to both the mirror and the secure EPT.

- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
  vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
  of host state.

- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
  userspace.  These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
  but are triggered through a different mechanism, similar to VMGEXIT for
  SEV-ES and SEV-SNP.

- Interrupt handling: support for virtual interrupt injection as well as
  handling VM-Exits that are caused by vectored events.  Exclusive to
  TDX are machine-check SMIs, which the kernel already knows how to
  handle through the kernel machine check handler (commit 7911f145de,
  "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")

- Loose ends: handling of the remaining exits from the TDX module, including
  EPT violation/misconfig and several TDVMCALL leaves that are handled in
  the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
  an error or ignoring operations that are not supported by TDX guests

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-07 07:36:33 -04:00
Paolo Bonzini
269a2c3663 Documentation: kvm: remove KVM_CAP_MIPS_TE
Trap and emulate virtualization is not available anymore for MIPS.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 06:32:17 -04:00
Paolo Bonzini
af339282e2 Documentation: kvm: organize capabilities in the right section
Categorize the capabilities correctly.  Section 6 is for enabled vCPU
capabilities; section 7 is for enabled VM capabilities; section 8 is
for informational ones.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 06:32:17 -04:00
Paolo Bonzini
ed7974fd59 Documentation: kvm: fix some definition lists
Ensure that they have a ":" in front of the defined item.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 06:32:17 -04:00
Paolo Bonzini
2f313018de Documentation: kvm: drop "Capability" heading from capabilities
It is redundant, and sometimes wrong.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 06:32:17 -04:00
Paolo Bonzini
26cb30f22f Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
The capability is incorrectly called KVM_CAP_PPC_MULTITCE in the documentation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 06:32:17 -04:00
Paolo Bonzini
f3e555ba45 Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
TSC_DEADLINE is now advertised unconditionally by KVM_GET_SUPPORTED_CPUID,
since commit 9be4ec35d6 ("KVM: x86: Advertise TSC_DEADLINE_TIMER in
KVM_GET_SUPPORTED_CPUID", 2024-12-18).  Adjust the documentation to
reflect the new behavior.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250331150550.510320-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 06:24:05 -04:00
Paolo Bonzini
0afd104fb3 KVM/arm64 updates for 6.15
- Nested virtualization support for VGICv3, giving the nested
    hypervisor control of the VGIC hardware when running an L2 VM
 
  - Removal of 'late' nested virtualization feature register masking,
    making the supported feature set directly visible to userspace
 
  - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
    of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
 
  - Paravirtual interface for discovering the set of CPU implementations
    where a VM may run, addressing a longstanding issue of guest CPU
    errata awareness in big-little systems and cross-implementation VM
    migration
 
  - Userspace control of the registers responsible for identifying a
    particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
    allowing VMs to be migrated cross-implementation
 
  - pKVM updates, including support for tracking stage-2 page table
    allocations in the protected hypervisor in the 'SecPageTable' stat
 
  - Fixes to vPMU, ensuring that userspace updates to the vPMU after
    KVM_RUN are reflected into the backing perf events
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZ9s9gBccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFp6LAQCOQ1Fidp8RT1NdhLLAhW5D4gLe
 MNT619R4qfqu64ZpeQEAidHMAYaGRk5KDNBq6Jn+awcJnwCcMnh2ok0vTOjz3gY=
 =RC6A
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.15

 - Nested virtualization support for VGICv3, giving the nested
   hypervisor control of the VGIC hardware when running an L2 VM

 - Removal of 'late' nested virtualization feature register masking,
   making the supported feature set directly visible to userspace

 - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
   of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers

 - Paravirtual interface for discovering the set of CPU implementations
   where a VM may run, addressing a longstanding issue of guest CPU
   errata awareness in big-little systems and cross-implementation VM
   migration

 - Userspace control of the registers responsible for identifying a
   particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
   allowing VMs to be migrated cross-implementation

 - pKVM updates, including support for tracking stage-2 page table
   allocations in the protected hypervisor in the 'SecPageTable' stat

 - Fixes to vPMU, ensuring that userspace updates to the vPMU after
   KVM_RUN are reflected into the backing perf events
2025-03-20 12:54:12 -04:00
Isaku Yamahata
52f52ea79a Documentation/virt/kvm: Document on Trust Domain Extensions (TDX)
Add documentation to Intel Trusted Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-21-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
c9c1e20b4c KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.

On AMD platforms, KVM always honors guest PAT.  On Intel however there are
two issues.  First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forces WB previously. See [2].

So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.

The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA.  This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
 no-self-snoop cases, as well as to remove the shadow_memtype_mask check
 from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Binbin Wu
79462faa2b KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.

TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal
error it has experienced.  This hypercall is special because TD guest
is requesting a termination with the error information, KVM needs to
forward the hypercall to userspace anyway, KVM doesn't do parsing or
conversion, it just dumps the 16 general-purpose registers to userspace
and let userspace decide what to do.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Oliver Upton
5980a69370 KVM: arm64: Fix documentation for KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
The capability actually fails with EINVAL if vCPUs have already been
created.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305230825.484091-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-05 16:56:58 -08:00
Sebastian Ott
3adaee7830 KVM: arm64: Allow userspace to change the implementation ID registers
KVM's treatment of the ID registers that describe the implementation
(MIDR, REVIDR, and AIDR) is interesting, to say the least. On the
userspace-facing end of it, KVM presents the values of the boot CPU on
all vCPUs and treats them as invariant. On the guest side of things KVM
presents the hardware values of the local CPU, which can change during
CPU migration in a big-little system.

While one may call this fragile, there is at least some degree of
predictability around it. For example, if a VMM wanted to present
big-little to a guest, it could affine vCPUs accordingly to the correct
clusters.

All of this makes a giant mess out of adding support for making these
implementation ID registers writable. Avoid breaking the rather subtle
ABI around the old way of doing things by requiring opt-in from
userspace to make the registers writable.

When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR
to any non-reserved value and present those values consistently across
all vCPUs.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
[oliver: changelog, capability]
Link: https://lore.kernel.org/r/20250225005401.679536-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:32:16 -08:00
Sean Christopherson
5c17848134 KVM: x86/xen: Restrict hypercall MSR to unofficial synthetic range
Reject userspace attempts to set the Xen hypercall page MSR to an index
outside of the "standard" virtualization range [0x40000000, 0x4fffffff],
as KVM is not equipped to handle collisions with real MSRs, e.g. KVM
doesn't update MSR interception, conflicts with VMCS/VMCB fields, special
case writes in KVM, etc.

While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.

Cc: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 08:59:55 -08:00
Claudio Imbrenda
413c98f24c KVM: s390: fake memslot for ucontrol VMs
Create a fake memslot for ucontrol VMs. The fake memslot identity-maps
userspace.

Now memslots will always be present, and ucontrol is not a special case
anymore.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-4-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-4-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Linus Torvalds
0f8e26b38d Loongarch:
* Clear LLBCTL if secondary mmu mapping changes.
 
 * Add hypercall service support for usermode VMM.
 
 x86:
 
 * Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
   direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
 
 * Ensure that all SEV code is compiled out when disabled in Kconfig, even
   if building with less brilliant compilers.
 
 * Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes.
 
 * Use str_enabled_disabled() to replace open coded strings.
 
 * Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
   prior to every VM-Enter.
 
 * Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.
 
 * Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.
 
 * Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.
 
 * Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
 
 * Make the completion of hypercalls go through the complete_hypercall
   function pointer argument, no matter if the hypercall exits to
   userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
   specifically went to userspace, and all the others did not; the new code
   need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
   all whether there was an exit to userspace or not.
 
 * As part of enabling TDX virtual machines, support support separation of
   private/shared EPT into separate roots.  When TDX will be enabled, operations
   on private pages will need to go through the privileged TDX Module via SEAMCALLs;
   as a result, they are limited and relatively slow compared to reading a PTE.
   The patches included in 6.14 allow KVM to keep a mirror of the private EPT in
   host memory, and define entries in kvm_x86_ops to operate on external page
   tables such as the TDX private EPT.
 
 * The recently introduced conversion of the NX-page reclamation kthread to
   vhost_task moved the task under the main process.  The task is created as
   soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that
   didn't expect to see any child task of the VM process until it started
   creating its own userspace threads.  In particular crosvm refuses to fork()
   if procfs shows any child task, so unbreak it by creating the task lazily.
   This is arguably a userspace bug, as there can be other kinds of legitimate
   worker tasks and they wouldn't impede fork(); but it's not like userspace
   has a way to distinguish kernel worker tasks right now.  Should they show
   as "Kthread: 1" in proc/.../status?
 
 x86 - Intel:
 
 * Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
   while L2 is active, while ultimately results in a hardware-accelerated L1
   EOI effectively being lost.
 
 * Honor event priority when emulating Posted Interrupt delivery during nested
   VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
   interrupt.
 
 * Rework KVM's processing of the Page-Modification Logging buffer to reap
   entries in the same order they were created, i.e. to mark gfns dirty in the
   same order that hardware marked the page/PTE dirty.
 
 * Misc cleanups.
 
 Generic:
 
 * Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when
   setting memory regions and add a dedicated API for setting KVM-internal
   memory regions.  The API can then explicitly disallow all flags for
   KVM-internal memory regions.
 
 * Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
   where KVM would return a pointer to a vCPU prior to it being fully online,
   and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
 
 * Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a
   bug where userspace could coerce KVM into handling the ioctl on a vCPU that
   isn't yet onlined.
 
 * Gracefully handle xarray insertion failures; even though such failures are
   impossible in practice after xa_reserve(), reserving an entry is always followed
   by xa_store() which does not know (or differentiate) whether there was an
   xa_reserve() before or not.
 
 RISC-V:
 
 * Zabha, Svvptc, and Ziccrse extension support for guests.  None of them
   require anything in KVM except for detecting them and marking them
   as supported; Zabha adds byte and halfword atomic operations, while the
   others are markers for specific operation of the TLB and of LL/SC
   instructions respectively.
 
 * Virtualize SBI system suspend extension for Guest/VM
 
 * Support firmware counters which can be used by the guests to collect
   statistics about traps that occur in the host.
 
 Selftests:
 
 * Rework vcpu_get_reg() to return a value instead of using an out-param, and
   update all affected arch code accordingly.
 
 * Convert the max_guest_memory_test into a more generic mmu_stress_test.
   The basic gist of the "conversion" is to have the test do mprotect() on
   guest memory while vCPUs are accessing said memory, e.g. to verify KVM
   and mmu_notifiers are working as intended.
 
 * Play nice with treewrite builds of unsupported architectures, e.g. arm
   (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the
   target architecture is actually one KVM selftests supports.
 
 * Use the kernel's $(ARCH) definition instead of the target triple for arch
   specific directories, e.g. arm64 instead of aarch64, mainly so as not to
   be different from the rest of the kernel.
 
 * Ensure that format strings for logging statements are checked by the
   compiler even when the logging statement itself is disabled.
 
 * Attempt to whack the last LLC references/misses mole in the Intel PMU
   counters test by adding a data load and doing CLFLUSH{OPT} on the data
   instead of the code being executed.  It seems that modern Intel CPUs
   have learned new code prefetching tricks that bypass the PMU counters.
 
 * Fix a flaw in the Intel PMU counters test where it asserts that events
   are counting correctly without actually knowing what the events count
   given the underlying hardware; this can happen if Intel reuses a
   formerly microarchitecture-specific event encoding as an architectural
   event, as was the case for Top-Down Slots.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmeTuzoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkBwf8CRNExYaM3j9y2E7mmo6AiL2ug6+J
 Uy5Hai1poY48pPwKC6ke3EWT8WVsgj/Py5pCeHvLojQchWNjCCYNfSQluJdkRxwG
 DgP3QUljSxEJWBeSwyTRcKM+IySi5hZd1IFo3gePFRB829Jpnj05vjbvCyv8gIwU
 y3HXxSYDsViaaFoNg4OlZFsIGis7mtknsZzk++QjuCXmxNa6UCbv3qvE/UkVLhVg
 WH65RTRdjk+EsdwaOMHKuUvQoGa+iM4o39b6bqmw8+ZMK39+y33WeTX/y5RXsp1N
 tUUBRfS+MuuYgC/6LmTr66EkMzoChxk3Dp3kKUaCBcfqRC8PxQag5reZhw==
 =NEaO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Loongarch:

   - Clear LLBCTL if secondary mmu mapping changes

   - Add hypercall service support for usermode VMM

  x86:

   - Add a comment to kvm_mmu_do_page_fault() to explain why KVM
     performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
     enabled

   - Ensure that all SEV code is compiled out when disabled in Kconfig,
     even if building with less brilliant compilers

   - Remove a redundant TLB flush on AMD processors when guest CR4.PGE
     changes

   - Use str_enabled_disabled() to replace open coded strings

   - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
     APICv cache prior to every VM-Enter

   - Overhaul KVM's CPUID feature infrastructure to track all vCPU
     capabilities instead of just those where KVM needs to manage state
     and/or explicitly enable the feature in hardware. Along the way,
     refactor the code to make it easier to add features, and to make it
     more self-documenting how KVM is handling each feature

   - Rework KVM's handling of VM-Exits during event vectoring; this
     plugs holes where KVM unintentionally puts the vCPU into infinite
     loops in some scenarios (e.g. if emulation is triggered by the
     exit), and brings parity between VMX and SVM

   - Add pending request and interrupt injection information to the
     kvm_exit and kvm_entry tracepoints respectively

   - Fix a relatively benign flaw where KVM would end up redoing RDPKRU
     when loading guest/host PKRU, due to a refactoring of the kernel
     helpers that didn't account for KVM's pre-checking of the need to
     do WRPKRU

   - Make the completion of hypercalls go through the complete_hypercall
     function pointer argument, no matter if the hypercall exits to
     userspace or not.

     Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
     went to userspace, and all the others did not; the new code need
     not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
     all whether there was an exit to userspace or not

   - As part of enabling TDX virtual machines, support support
     separation of private/shared EPT into separate roots.

     When TDX will be enabled, operations on private pages will need to
     go through the privileged TDX Module via SEAMCALLs; as a result,
     they are limited and relatively slow compared to reading a PTE.

     The patches included in 6.14 allow KVM to keep a mirror of the
     private EPT in host memory, and define entries in kvm_x86_ops to
     operate on external page tables such as the TDX private EPT

   - The recently introduced conversion of the NX-page reclamation
     kthread to vhost_task moved the task under the main process. The
     task is created as soon as KVM_CREATE_VM was invoked and this, of
     course, broke userspace that didn't expect to see any child task of
     the VM process until it started creating its own userspace threads.

     In particular crosvm refuses to fork() if procfs shows any child
     task, so unbreak it by creating the task lazily. This is arguably a
     userspace bug, as there can be other kinds of legitimate worker
     tasks and they wouldn't impede fork(); but it's not like userspace
     has a way to distinguish kernel worker tasks right now. Should they
     show as "Kthread: 1" in proc/.../status?

  x86 - Intel:

   - Fix a bug where KVM updates hardware's APICv cache of the highest
     ISR bit while L2 is active, while ultimately results in a
     hardware-accelerated L1 EOI effectively being lost

   - Honor event priority when emulating Posted Interrupt delivery
     during nested VM-Enter by queueing KVM_REQ_EVENT instead of
     immediately handling the interrupt

   - Rework KVM's processing of the Page-Modification Logging buffer to
     reap entries in the same order they were created, i.e. to mark gfns
     dirty in the same order that hardware marked the page/PTE dirty

   - Misc cleanups

  Generic:

   - Cleanup and harden kvm_set_memory_region(); add proper lockdep
     assertions when setting memory regions and add a dedicated API for
     setting KVM-internal memory regions. The API can then explicitly
     disallow all flags for KVM-internal memory regions

   - Explicitly verify the target vCPU is online in kvm_get_vcpu() to
     fix a bug where KVM would return a pointer to a vCPU prior to it
     being fully online, and give kvm_for_each_vcpu() similar treatment
     to fix a similar flaw

   - Wait for a vCPU to come online prior to executing a vCPU ioctl, to
     fix a bug where userspace could coerce KVM into handling the ioctl
     on a vCPU that isn't yet onlined

   - Gracefully handle xarray insertion failures; even though such
     failures are impossible in practice after xa_reserve(), reserving
     an entry is always followed by xa_store() which does not know (or
     differentiate) whether there was an xa_reserve() before or not

  RISC-V:

   - Zabha, Svvptc, and Ziccrse extension support for guests. None of
     them require anything in KVM except for detecting them and marking
     them as supported; Zabha adds byte and halfword atomic operations,
     while the others are markers for specific operation of the TLB and
     of LL/SC instructions respectively

   - Virtualize SBI system suspend extension for Guest/VM

   - Support firmware counters which can be used by the guests to
     collect statistics about traps that occur in the host

  Selftests:

   - Rework vcpu_get_reg() to return a value instead of using an
     out-param, and update all affected arch code accordingly

   - Convert the max_guest_memory_test into a more generic
     mmu_stress_test. The basic gist of the "conversion" is to have the
     test do mprotect() on guest memory while vCPUs are accessing said
     memory, e.g. to verify KVM and mmu_notifiers are working as
     intended

   - Play nice with treewrite builds of unsupported architectures, e.g.
     arm (32-bit), as KVM selftests' Makefile doesn't do anything to
     ensure the target architecture is actually one KVM selftests
     supports

   - Use the kernel's $(ARCH) definition instead of the target triple
     for arch specific directories, e.g. arm64 instead of aarch64,
     mainly so as not to be different from the rest of the kernel

   - Ensure that format strings for logging statements are checked by
     the compiler even when the logging statement itself is disabled

   - Attempt to whack the last LLC references/misses mole in the Intel
     PMU counters test by adding a data load and doing CLFLUSH{OPT} on
     the data instead of the code being executed. It seems that modern
     Intel CPUs have learned new code prefetching tricks that bypass the
     PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that
     events are counting correctly without actually knowing what the
     events count given the underlying hardware; this can happen if
     Intel reuses a formerly microarchitecture-specific event encoding
     as an architectural event, as was the case for Top-Down Slots"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
  kvm: defer huge page recovery vhost task to later
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: Disallow all flags for KVM-internal memslots
  KVM: x86: Drop double-underscores from __kvm_set_memory_region()
  KVM: Add a dedicated API for setting KVM-internal memslots
  KVM: Assert slots_lock is held when setting memory regions
  KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
  LoongArch: KVM: Add hypercall service support for usermode VMM
  LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
  KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
  KVM: VMX: read the PML log in the same order as it was written
  KVM: VMX: refactor PML terminology
  KVM: VMX: Fix comment of handle_vmx_instruction()
  KVM: VMX: Reinstate __exit attribute for vmx_exit()
  KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
  KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
  KVM: x86: Use LVT_TIMER instead of an open coded literal
  RISC-V: KVM: Add new exit statstics for redirected traps
  RISC-V: KVM: Update firmware counters for various events
  RISC-V: KVM: Redirect instruction access fault trap to guest
  ...
2025-01-25 09:55:09 -08:00
Linus Torvalds
d0f93ac2c3 Documentation changes this time around include:
- Quite a bit of Chinese and Spanish translation work.
 
 - Clarifying that Git commit IDs >12chars are OK
 
 - A new nvme-multipath document
 
 - A reorganization of the admin-guide top-level page to make it readable
 
 - Clarification of the role of Acked-by and maintainer discretion on their
   acceptance.
 
 - Some reorganization of debugging-oriented docs.
 
 ...and typo fixes, documentation updates, etc. as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmeOp8EPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YipUH/iffvlVYuqoVdPUFWdmsiNjwOCRE2MIfp8qO
 tPTRRHJAny+NlFT0IWlGUbLNoNXtvpN47YlkaeAjdrsjASerfpwzje7t4Z1B+jWT
 0YwGBCvDIGasfRCx7D14+w5aqkEEynfsy+QurwcuDxcHMQGwt7ZCuTNOVO6BULEr
 L++BMwqapUr5IemP4ItQqDVVF9sp6bWEhaOnTTJCLU6oG23uUSSA/59sJmwDJUk7
 6J3VGO1An4Jte9WX7qkVrSBNO5cOOhaFiFXIeNxfOioOPctBwxKiHDJnzVud8ipz
 R+tnUI/8hEvyJ7GZFezyZxmMnFs0P2DEYAkaN+hBs/nUjx0dKUg=
 =YxaS
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.14' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:

 - Quite a bit of Chinese and Spanish translation work

 - Clarifying that Git commit IDs >12chars are OK

 - A new nvme-multipath document

 - A reorganization of the admin-guide top-level page to make it
   readable

 - Clarification of the role of Acked-by and maintainer discretion on
   their acceptance

 - Some reorganization of debugging-oriented docs

... and typo fixes, documentation updates, etc as usual

* tag 'docs-6.14' of git://git.lwn.net/linux: (50 commits)
  Documentation: Fix x86_64 UEFI outdated references to elilo
  Documentation/sysctl: Add timer_migration to kernel.rst
  docs/mm: Physical memory: Remove zone_t
  docs: submitting-patches: clarify that signers may use their discretion on tags
  docs: submitting-patches: clarify difference between Acked-by and Reviewed-by
  docs: submitting-patches: clarify Acked-by and introduce "# Suffix"
  Documentation: bug-hunting.rst: remove odd contact information
  docs/zh_CN: Add sak index Chinese translation
  doc: module: DEFAULT_SYMBOL_NAMESPACE must be defined before #includes
  doc: module: Fix documented type of namespace
  Documentation/kernel-parameters: Fix a reference to vga-softcursor.rst
  docs/zh_CN: Add landlock index Chinese translation
  Documentation: Fix typo localmodonfig -> localmodconfig
  overlayfs.rst: Fix and improve grammar
  docs/zh_CN: Add siphash index Chinese translation
  docs/zh_CN: Add security IMA-templates Chinese translation
  docs/zh_CN: Add security digsig Chinese translation
  Align git commit ID abbreviation guidelines and checks
  docs: process: submitting-patches: split canonical patch format section
  docs/zh_CN: Add security lsm Chinese translation
  ...
2025-01-21 18:00:00 -08:00
Paolo Bonzini
4f7ff70c05 KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
    with per-vCPU tracking of the vCPU's capabailities for all features.  Along
    the way, refactor the code to make it easier to add/modify features, and
    add a variety of self-documenting macro types to again simplify adding new
    features and to help readers understand KVM's handling of existing features.
 
  - Rework KVM's handling of VM-Exits during event vectoring to plug holes where
    KVM unintentionally puts the vCPU into infinite loops in some scenarios,
    e.g. if emulation is triggered by the exit, and to bring parity between VMX
    and SVM.
 
  - Add pending request and interrupt injection information to the kvm_exit and
    kvm_entry tracepoints respectively.
 
  - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
    loading guest/host PKRU due to a refactoring of the kernel helpers that
    didn't account for KVM's pre-checking of the need to do WRPKRU.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
 N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
 zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
 ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
 BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
 iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
 KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
 HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
 8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
 gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
 vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
 QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
 =32mM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.

 - Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20 06:49:39 -05:00
Christoph Schlameuss
5021fd77d6 KVM: s390: Reject KVM_SET_GSI_ROUTING on ucontrol VMs
Prevent null pointer dereference when processing
KVM_IRQ_ROUTING_S390_ADAPTER routing entries.
The ioctl cannot be processed for ucontrol VMs.

Fixes: f65470661f ("KVM: s390/interrupt: do not pin adapter interrupt pages")
Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Hariharan Mari <hari55@linux.ibm.com>
Reviewed-by: Hariharan Mari <hari55@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20241216092140.329196-4-schlameuss@linux.ibm.com
Message-ID: <20241216092140.329196-4-schlameuss@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-07 16:36:11 +01:00
Sean Christopherson
9be4ec35d6 KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID
Unconditionally advertise TSC_DEADLINE_TIMER via KVM_GET_SUPPORTED_CPUID,
as KVM always emulates deadline mode, *if* the VM has an in-kernel local
APIC.  The odds of a VMM emulating the local APIC in userspace, not
emulating the TSC deadline timer, _and_ reflecting
KVM_GET_SUPPORTED_CPUID back into KVM_SET_CPUID2, i.e. the risk of
over-advertising and breaking any setups, is extremely low.

KVM has _unconditionally_ advertised X2APIC via CPUID since commit
0d1de2d901 ("KVM: Always report x2apic as supported feature"), and it
is completely impossible for userspace to emulate X2APIC as KVM doesn't
support forwarding the MSR accesses to userspace.  I.e. KVM has relied on
userspace VMMs to not misreport local APIC capabilities for nearly 13
years.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:01 -08:00
Sean Christopherson
04cd8f8628 KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling
PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless,
e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after
vCPU creation.  vCPUs may also end up with an inconsistent configuration
if exits are disabled between creation of multiple vCPUs.

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:35 -08:00
Gianfranco Trad
311cf7164a Documentation: kvm: fix typo in api.rst
Fix minor typo in api.rst where the word physical was misspelled
as physcial.

Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20241115011831.300705-5-gianf.trad@gmail.com
2024-12-11 09:07:40 -07:00
Paolo Bonzini
7b541d557f KVM/arm64 changes for 6.13, part #1
- Support for stage-1 permission indirection (FEAT_S1PIE) and
    permission overlays (FEAT_S1POE), including nested virt + the
    emulated page table walker
 
  - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
    was introduced in PSCIv1.3 as a mechanism to request hibernation,
    similar to the S4 state in ACPI
 
  - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
    part of it, introduce trivial initialization of the host's MPAM
    context so KVM can use the corresponding traps
 
  - PMU support under nested virtualization, honoring the guest
    hypervisor's trap configuration and event filtering when running a
    nested guest
 
  - Fixes to vgic ITS serialization where stale device/interrupt table
    entries are not zeroed when the mapping is invalidated by the VM
 
  - Avoid emulated MMIO completion if userspace has requested synchronous
    external abort injection
 
  - Various fixes and cleanups affecting pKVM, vCPU initialization, and
    selftests
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZzTZXRccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFioUAP0cs2pYcwuCqLgmeHqfz6L5Xsw3
 hKBCNuvr5mjU0hZfLAEA5ml2eUKD7OnssAOmUZ/K/NoCdJFCe8mJWQDlURvr9g4=
 =u2/3
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 changes for 6.13, part #1

 - Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker

 - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI

 - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps

 - PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest

 - Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM

 - Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection

 - Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
2024-11-14 07:05:36 -05:00
Paolo Bonzini
2e9a2c624e Merge branch 'kvm-docs-6.13' into HEAD
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago.

- Fix incorrect references to non-existing ioctls

- List registers supported by KVM_GET/SET_ONE_REG on s390

- Use rST internal links

- Reorganize the introduction to the API document
2024-11-13 07:18:12 -05:00
Oliver Upton
7ccd615bc6 Merge branch kvm-arm64/psci-1.3 into kvmarm/next
* kvm-arm64/psci-1.3:
  : PSCI v1.3 support, courtesy of David Woodhouse
  :
  : Bump KVM's PSCI implementation up to v1.3, with the added bonus of
  : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
  : this gets relayed to userspace for further processing with a new
  : KVM_SYSTEM_EVENT_SHUTDOWN flag.
  :
  : As an added bonus, implement client-side support for hibernation with
  : the SYSTEM_OFF2 call.
  arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
  KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
  KVM: selftests: Add test for PSCI SYSTEM_OFF2
  KVM: arm64: Add support for PSCI v1.2 and v1.3
  KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
  firmware/psci: Add definitions for PSCI v1.3 specification

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:36:46 +00:00
Paolo Bonzini
5b47f5a725 Documentation: kvm: reorganize introduction
Reorganize the text to mention file descriptors as early as possible.
Also mention capabilities early as they are a central part of KVM's
API.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Paolo Bonzini
badd5372ec Documentation: kvm: replace section numbers with links
In order to simplify further introduction of hyperlinks, replace explicit
section numbers with rST hyperlinks.  The section numbers could actually
be removed now, but I'm not going to do a huge change throughout the file
for an RFC...

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Paolo Bonzini
aae7527ea9 Documentation: kvm: fix a few mistakes
The only occurrence "Capability: none" actually meant the same as "basic".
Fix that and a few more aesthetic or content issues in the document.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Paolo Bonzini
e16e018e82 KVM: powerpc: remove remaining traces of KVM_CAP_PPC_RMA
This was only needed for PPC970 support, which is long gone: the
implementation was removed in 2014.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Sean Christopherson
dcb988cdac KVM: x86: Quirk initialization of feature MSRs to KVM's max configuration
Add a quirk to control KVM's misguided initialization of select feature
MSRs to KVM's max configuration, as enabling features by default violates
KVM's approach of letting userspace own the vCPU model, and is actively
problematic for MSRs that are conditionally supported, as the vCPU will
end up with an MSR value that userspace can't restore.  E.g. if the vCPU
is configured with PDCM=0, userspace will save and attempt to restore a
non-zero PERF_CAPABILITIES, thanks to KVM's meddling.

Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:31 -07:00
David Woodhouse
97413cea1c KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function
which is analogous to ACPI S4 state. This will allow hosting
environments to determine that a guest is hibernated rather than just
powered off, and ensure that they preserve the virtual environment
appropriately to allow the guest to resume safely (or bump the
hardware_signature in the FACS to trigger a clean reboot instead).

This feature is safe to enable unconditionally (in a subsequent commit)
because it is exposed to userspace through the existing
KVM_SYSTEM_EVENT_SHUTDOWN event, just with an additional flag which
userspace can use to know that the instance intended hibernation instead
of a plain power-off.

As with SYSTEM_RESET2, there is only one type available (in this case
HIBERNATE_OFF), and it is not explicitly reported to userspace through
the event; userspace can get it from the registers if it cares).

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-3-dwmw2@infradead.org
[oliver: slight cleanup of comments]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-24 16:36:51 -07:00