Commit Graph

11000 Commits

Author SHA1 Message Date
Paolo Bonzini
782f9feaa9 Merge branch 'kvm-pre-tdx' into HEAD
- Add common secure TSC infrastructure for use within SNP and in the
  future TDX

- Block KVM_CAP_SYNC_REGS if guest state is protected.  It does not make
  sense to use the capability if the relevant registers are not
  available for reading or writing.
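
As a rough illustration (not the actual KVM code; the struct and field names
here are stand-ins), the capability check could be gated like this:

#include <stdbool.h>

/* Hypothetical stand-ins for the relevant KVM state. */
struct kvm_arch { bool has_protected_state; };
struct kvm { struct kvm_arch arch; };

/* Report KVM_CAP_SYNC_REGS as unsupported (0) when guest state is
 * protected, since the synced registers can't be read or written. */
static int sync_regs_cap(struct kvm *kvm)
{
        return kvm->arch.has_protected_state ? 0 : 1;
}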
2025-03-20 13:13:13 -04:00
Paolo Bonzini
361da275e5 Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD
The immediate issue being fixed here is an nVMX bug where KVM fails to
detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI).
However, checking for a pending interrupt accesses the legacy PIC, and
x86's kvm_arch_destroy_vm() currently frees the PIC before destroying
vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results
in a NULL pointer deref; fixing that teardown ordering is a prerequisite
for the nVMX fix.

The remaining patches attempt to bring a bit of sanity to x86's VM
teardown code, which has accumulated a lot of cruft over the years.  E.g.
KVM currently unloads each vCPU's MMUs in a separate operation from
destroying vCPUs, all because when guest SMP support was added, KVM had a
kludgy MMU teardown flow that broke when a VM had more than one vCPU.
And that oddity lived on, for 18 years...

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-20 13:13:00 -04:00
Paolo Bonzini
3ecf162a31 KVM Xen changes for 6.15

Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM Xen changes for 6.15

 - Don't write to the Xen hypercall page on MSR writes that are initiated by
   the host (userspace or KVM) to fix a class of bugs where KVM can write to
   guest memory at unexpected times, e.g. during vCPU creation if userspace has
   set the Xen hypercall MSR index to collide with an MSR that KVM emulates.

 - Restrict the Xen hypercall MSR index to the unofficial synthetic range to
   reduce the set of possible collisions with MSRs that are emulated by KVM
   (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
   in the synthetic range).

 - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.

 - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
   entries when updating PV clocks, as there is no guarantee PV clocks will be
   updated between TSC frequency changes and CPUID emulation, and guest reads
   of Xen TSC should be rare, i.e. are not a hot path.
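
As a sketch of the range restriction described above (assuming the unofficial
synthetic MSR range 0x40000000-0x4fffffff; the helper name and exact bounds
are illustrative, not KVM's actual code):

#include <stdbool.h>
#include <stdint.h>

/* The unofficial "synthetic" MSR range commonly used for paravirt MSRs. */
#define SYNTHETIC_MSR_FIRST 0x40000000u
#define SYNTHETIC_MSR_LAST  0x4fffffffu

/* Only allow the Xen hypercall MSR to be placed in the synthetic range,
 * reducing (but not eliminating) collisions with MSRs KVM emulates,
 * e.g. the Hyper-V MSRs that also live in this range. */
static bool xen_hypercall_msr_ok(uint32_t index)
{
        return index >= SYNTHETIC_MSR_FIRST && index <= SYNTHETIC_MSR_LAST;
}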
2025-03-19 09:14:59 -04:00
Paolo Bonzini
fcce7c1e7d KVM PV clock changes for 6.15:

Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM PV clock changes for 6.15:

 - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
   fix a largely theoretical deadlock.

 - Use the vCPU's actual Xen PV clock information when starting the Xen timer,
   as the cached state in arch.hv_clock can be stale/bogus.

 - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
   PV clocks.

 - Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
   accounts for kvmclock, and there's no evidence that the flag is actually
   supported by Xen guests.

 - Clean up the per-vCPU "cache" of its reference pvclock, and instead only
   track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately
   expensive to compute, and rarely changes for modern setups).
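
For reference, a plain-C sketch of the pvclock-style scaling that this
multiplier+shift metadata feeds (illustrative, not the kernel's
implementation):

#include <stdint.h>

/* Scale a TSC delta to nanoseconds using a 32.32 fixed-point multiplier
 * plus a power-of-two shift, pvclock style.  Computing mul_frac/shift
 * from the TSC frequency is the moderately expensive part that is worth
 * caching per vCPU. */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;

        /* 64x32 -> 96-bit multiply, keep bits 95:32 (i.e. >> 32). */
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}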
2025-03-19 09:11:59 -04:00
Paolo Bonzini
9b093f5b86 KVM SVM changes for 6.15

Merge tag 'kvm-x86-svm-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.15

 - Ensure the PSP driver is initialized when both the PSP and KVM modules are
   built-in (the initcall framework doesn't handle dependencies).

 - Use long-term pins when registering encrypted memory regions, so that the
   pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
   excessive fragmentation.

 - Add macros and helpers for setting GHCB return/error codes.

 - Add support for Idle HLT interception, which elides interception if the vCPU
   has a pending, unmasked virtual IRQ when HLT is executed.

 - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
   address.

 - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
   because the vCPU was "destroyed" via SNP's AP Creation hypercall.

 - Reject SNP AP Creation if the requested SEV features for the vCPU don't
   match the VM's configured set of features.

 - Misc cleanups
2025-03-19 09:10:44 -04:00
Paolo Bonzini
a24dbf986b KVM VMX changes for 6.15

Merge tag 'kvm-x86-vmx-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.15

 - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
   modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.

 - Pass XFD_ERR as a pseudo-payload when injecting #NM as a preparatory step
   for upcoming FRED virtualization support.

 - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits
   as a general cleanup, and in anticipation of adding support for emulating
   Mode-Based Execution (MBEC).

 - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
   state while KVM is in the middle of emulating nested VM-Enter.

 - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
   in anticipation of adding sanity checks for secondary exit controls (the
   primary field is out of bits).
2025-03-19 09:05:52 -04:00
Paolo Bonzini
4d9a677596 KVM x86 misc changes for 6.15:

Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.15:

 - Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.

 - Add a helper to consolidate handling of mp_state transitions, and use it to
   clear pv_unhalted whenever a vCPU is made RUNNABLE.

 - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
   coalesce updates when multiple pieces of vCPU state are changing, e.g. as
   part of a nested transition.

 - Fix a variety of nested emulation bugs, and add VMX support for synthesizing
   nested VM-Exit on interception (instead of injecting #UD into L2).

 - Drop "support" for PV Async #PF with protected guests without SEND_ALWAYS,
   as KVM can't get the current CPL.

 - Misc cleanups
2025-03-19 09:04:48 -04:00
Paolo Bonzini
4286a3ec25 KVM x86/mmu changes for 6.15

Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM x86/mmu changes for 6.15

Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where
"fast" means "without holding mmu_lock".  Not taking mmu_lock allows multiple
aging actions to run in parallel, and more importantly avoids stalling vCPUs,
e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting
in memory.

For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and
KVM doesn't need to access any metadata to age SPTEs.

For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a
list of rmaps) to implement a per-rmap single-bit spinlock.  When aging a gfn,
acquire the rmap's spinlock with read-only permissions, which allows hardening
and optimizing the locking and aging, e.g. locking an rmap for write requires
mmu_lock to also be held.  The lock is NOT a true R/W spinlock, i.e. multiple
concurrent readers aren't supported.

To avoid forcing all SPTE updates to use atomic operations (clearing the
Accessed bit out of mmu_lock makes it inherently volatile), rework and rename
spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude
the Accessed bit.  KVM (and mm/) already tolerates false positives/negatives
for Accessed information, and all testing has shown that reducing the latency
of aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
2025-03-19 09:04:33 -04:00
Ingo Molnar
89771319e0 Linux 6.14-rc7

Merge tag 'v6.14-rc7' into x86/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-19 11:03:06 +01:00
Isaku Yamahata
161d34609f KVM: TDX: Make TDX VM type supported
Now that all the necessary code for TDX is in place, it's ready to run TDX
guests.  Advertise the VM type KVM_X86_TDX_VM so that a userspace VMM like
QEMU can start to use it.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" v2:
- No change.

TDX "the rest" v1:
- Move down to the end of patch series.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
90fe64a94d KVM: TDX: Always honor guest PAT on TDX enabled guests
Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by
making self-snoop feature a hard dependency for TDX and making quirk
KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled.

The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of
KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is
always forced to WB now.

Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke
kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices,
which would cause mirrored EPTs for TDs to be zapped, leading to the
TDX-module-managed private EPT being incorrectly zapped.

As a new feature, TDX always comes with support for self-snoop, and does
not have to worry about unmodifiable but buggy guests. So, simply ignore
KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already
does.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224071039.31511-1-yan.y.zhao@intel.com>
[Only apply to TDX guests. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Paolo Bonzini
3fee4837ef KVM: x86: remove shadow_memtype_mask
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled,
if shadow_memtype_mask is zero.  As long as vmx_get_mt_mask is not
called for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
c9c1e20b4c KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.

On AMD platforms, KVM always honors guest PAT.  On Intel however there are
two issues.  First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forced WB previously. See [2].

So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.

The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA.  This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.
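
A rough sketch of the resulting memory-type decision (the struct and helper
names are illustrative, not KVM's actual code):

#include <stdbool.h>

struct vm_quirk_state {
        bool ignore_guest_pat;        /* KVM_X86_QUIRK_IGNORE_GUEST_PAT enabled */
        bool has_noncoherent_dma;     /* some assigned device has non-coherent DMA */
};

/* Force WB (ignore guest PAT) only when the quirk is enabled and no
 * non-coherent DMA device is assigned; with non-coherent DMA the guest
 * PAT must be honored, which is why the quirk is a no-op there. */
static bool force_wb_memtype(const struct vm_quirk_state *vm)
{
        return vm->ignore_guest_pat && !vm->has_noncoherent_dma;
}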

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
 no-self-snoop cases, as well as to remove the shadow_memtype_mask check
 from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
bd7d5362b4 KVM: x86: Introduce supported_quirks to block disabling quirks
Introduce supported_quirks in kvm_caps to store platform-specific force-enabled
quirks.

No functional changes intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224070832.31394-1-yan.y.zhao@intel.com>
[Remove unsupported quirks at KVM_ENABLE_CAP time. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Paolo Bonzini
a4dae7c7a4 KVM: x86: Allow vendor code to disable quirks
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example,
this will be the case for "ignore guest PAT".  Allow unaffected vendor
modules to disable handling of a quirk for all VMs via a new entry in
kvm_caps.

Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API
tells userspace that KVM *knows* that some of its past behavior was bogus
or just undesirable.  In other words, it's plausible for userspace to
refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.

As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Paolo Bonzini
9966b7822b KVM: x86: do not allow re-enabling quirks
Allowing arbitrary re-enabling of quirks puts a limit on what the
quirks themselves can do, since you cannot assume that the quirk
prevents a particular state.  More important, it also prevents
KVM from disabling a quirk at VM creation time, because userspace
can always go back and re-enable that.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Binbin Wu
26eab9ae4b KVM: TDX: Enable guest access to MTRR MSRs
Allow TDX guests to access MTRR MSRs just as KVM does for normal VMs, i.e.,
KVM emulates accesses to MTRR MSRs but doesn't virtualize guest MTRR
memory types.

TDX module exposes MTRR feature to TDX guests unconditionally.  KVM needs
to support MTRR MSRs accesses for TDX guests to match the architectural
behavior.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-19-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Isaku Yamahata
0b75889b0c KVM: TDX: Add a method to ignore hypercall patching
Because guest TD memory is protected, the VMM cannot patch the guest binary
to insert the hypercall instruction.  Add a method to ignore hypercall
patching.  Note: the guest TD kernel needs to be modified to use
TDG.VP.VMCALL for hypercalls.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-18-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Isaku Yamahata
79264ff080 KVM: TDX: Ignore setting up mce
The vmx_set_mce() function is VMX-specific and cannot be used for TDX.
Add a vt stub that ignores MCE setup for TDX.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
cf5f3668c5 KVM: TDX: Add methods to ignore accesses to TSC
TDX protects TDX guest TSC state from VMM.  Implement access methods to
ignore guest TSC.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
a946c71cf8 KVM: TDX: Add methods to ignore VMX preemption timer
TDX doesn't support VMX preemption timer.  Implement access methods for VMM
to ignore VMX preemption timer.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
e6bb397884 KVM: TDX: Add method to ignore guest instruction emulation
Skip instruction emulation and let the TDX guest retry for MMIO emulation
after installing the MMIO SPTE with suppress #VE bit cleared.

TDX protects TDX guest state from the VMM, so instructions in guest memory
cannot be emulated.  MMIO emulation is the only case that triggers the
instruction emulation code path for TDX guests.

The MMIO emulation handling flow is as follows:
- The TDX guest issues a vMMIO instruction. (The GPA must be shared and is
  not covered by KVM memory slot.)
- The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So
  EPT violation causes TD exit to KVM.
- Trigger KVM page fault handler and install a new SPTE with suppress #VE
  bit cleared.
- Skip instruction emulation and return X86EMUL_RETRY_INSTR to let the vCPU
  retry.
- TDX guest re-executes the vMMIO instruction.
- TDX guest gets #VE because KVM has cleared #VE suppress bit.
- TDX guest #VE handler converts MMIO into TDG.VP.VMCALL<MMIO>

Return X86EMUL_RETRY_INSTR in the callback check_emulate_instruction() for
TDX guests to retry the MMIO instruction.  Also, the instruction emulation
handling will be skipped, so the callback check_intercept() will never be
called for TDX guests.
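
In sketch form (stand-in types and return codes, not the exact KVM
definitions), the hook simply refuses emulation and asks for a retry:

/* Stand-in return codes mirroring "continue emulating" vs "retry the
 * instruction"; the real values come from KVM's emulator interface. */
enum emul_result { EMUL_CONTINUE, EMUL_RETRY_INSTR };

struct tdx_vcpu;

/* Instruction bytes live in protected guest memory, so emulation is
 * never possible: tell the core to re-execute the instruction, which
 * will now take the #VE path inside the guest. */
static enum emul_result tdx_check_emulate_instruction(struct tdx_vcpu *vcpu)
{
        (void)vcpu;
        return EMUL_RETRY_INSTR;
}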

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
a141f28d6b KVM: TDX: Add methods to ignore accesses to CPU state
TDX protects TDX guest state from the VMM.  Implement access methods for TDX
guest state that ignore accesses or return zero.  Because those methods can
be called by KVM ioctls to set/get CPU registers, they don't have KVM_BUG_ON.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
04733836fe KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
Implement the TDG.VP.VMCALL<GetTdVmCallInfo> hypercall.  If the input value is
zero, return a success code and zero in the output registers.

TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to
enumerate which TDG.VP.VMCALL sub leaves are supported.  This hypercall is
for future enhancement of the Guest-Host-Communication Interface (GHCI)
specification.  GHCI version 344426-001US defines it to require input R12 to
be zero and to return zero in output registers R11, R12, R13, and R14, so
that the guest TD enumerates no enhancements.
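
A hedged sketch of that behavior (register accessors and status values are
illustrative stand-ins, not the actual KVM/TDX definitions):

#include <stdint.h>

#define TDVMCALL_SUCCESS          0x0ull
#define TDVMCALL_INVALID_OPERAND  0x8000000000000000ull   /* illustrative */

struct tdvmcall_regs { uint64_t r11, r12, r13, r14; };

/* GetTdVmCallInfo: only leaf 0 (R12 == 0) is defined; report "no
 * enhancements" by zeroing R11-R14. */
static uint64_t handle_get_td_vmcall_info(struct tdvmcall_regs *regs)
{
        if (regs->r12 != 0)
                return TDVMCALL_INVALID_OPERAND;

        regs->r11 = regs->r12 = regs->r13 = regs->r14 = 0;
        return TDVMCALL_SUCCESS;
}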

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
9fc3402a20 KVM: TDX: Enable guest access to LMCE related MSRs
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling
MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL.

MCE and MCA are advertised via cpuid based on the TDX module spec.  Guest
kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the
platform or not.  If LMCE is opted-in by the platform, guest kernel can
access IA32_MCG_EXT_CTL to enable/disable LMCE.

Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid
failures when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE.  E.g., a
Linux guest will treat the failure as a #GP(0).

The userspace VMM may not opt in to LMCE by default; e.g., QEMU disables it
by default, and "-cpu lmce=on" is needed on the QEMU command line to opt in.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
081385dbc6 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.

For the complete_emulated_msr() callback, implement tdx_complete_emulated_msr()
to set the return code on error instead of injecting #GP.  Also set the return
value on MSR reads according to the values from the KVM x86 registers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
dd50294f3e KVM: TDX: Implement callbacks for MSR operations
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(),
and .has_emulated_msr(), for preparation of handling hypercalls from TDX
guest for PV RDMSR and WRMSR.  Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX.

There are three classes of MSR virtualization for TDX.
- Non-configurable: TDX module directly virtualizes it. VMM can't configure
  it, the value set by KVM_SET_MSRS is ignored.
- Configurable: TDX module directly virtualizes it. VMM can configure it at
  VM creation time.  The value set by KVM_SET_MSRS is used.
- #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
  and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.

For the MSRs belonging to the #VE case, the TDX module injects #VE to the
TDX guest upon RDMSR or WRMSR.  The exact list of such MSRs is defined in
TDX Module ABI Spec.

Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>,
which are defined in GHCI (Guest-Host Communication Interface) so that the
host VMM (e.g. KVM) can virtualize the MSRs.

TDX doesn't allow VMM to configure interception of MSR accesses.  Ignore
KVM_REQ_MSR_FILTER_CHANGED for TDX guest.  If the userspace has set any
MSR filters, it will be applied when handling
TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
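
An illustrative classification helper for the three classes above (the example
MSRs are taken from this series; the authoritative list is in the TDX Module
ABI spec):

#include <stdint.h>

enum tdx_msr_class {
        TDX_MSR_FIXED,         /* virtualized by the TDX module, not configurable */
        TDX_MSR_CONFIGURABLE,  /* virtualized by the TDX module, set at VM creation */
        TDX_MSR_VE,            /* #VE -> TDG.VP.VMCALL<RDMSR/WRMSR>, handled by VMM */
};

/* Example classification only. */
static enum tdx_msr_class tdx_classify_msr(uint32_t msr)
{
        switch (msr) {
        case 0x3a:     /* IA32_FEAT_CTL: handled via the #VE hypercall path */
        case 0x4d0:    /* IA32_MCG_EXT_CTL: likewise */
                return TDX_MSR_VE;
        default:
                return TDX_MSR_FIXED;
        }
}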

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
7ddf314441 KVM: x86: Move KVM_MAX_MCE_BANKS to header file
Move KVM_MAX_MCE_BANKS to header file so that it can be used for TDX in
a future patch.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
5cf7239b78 KVM: TDX: Handle TDX PV HLT hypercall
Handle TDX PV HLT hypercall and the interrupt status due to it.

TDX guest state is protected, so KVM can't get the interrupt status of a TDX
guest; KVM assumes interrupts are always allowed unless the TDX guest calls
TDVMCALL with HLT, which passes the interrupt-blocked flag.

If the guest halted with interrupt enabled, also query pending RVI by
checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall.

Update vt_interrupt_allowed() for TDX based on interrupt blocked flag
passed by HLT TDVMCALL.  Do not wakeup TD vCPU if interrupt is blocked
for VT-d PI.

For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked.  In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup.  The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.
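
In sketch form, the interrupt-allowed decision for a halted TDX vCPU relies
solely on the flag the guest passed in the HLT TDVMCALL (field and helper
names are illustrative):

#include <stdbool.h>

struct tdx_vcpu_state {
        bool halted;
        bool hlt_irq_blocked;   /* interrupt-blocked flag from TDVMCALL<HLT> */
};

/* KVM can't read the guest's RFLAGS.IF, so for a halted TDX vCPU the
 * only information available is what the guest reported when halting. */
static bool tdx_interrupt_allowed(const struct tdx_vcpu_state *s)
{
        if (!s->halted)
                return true;    /* assume interrupts are allowed otherwise */
        return !s->hlt_irq_blocked;
}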

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Isaku Yamahata
3bf31b5786 KVM: TDX: Handle TDX PV CPUID hypercall
Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM
according to TDX Guest Host Communication Interface (GHCI).

For TDX, most CPUID leaf/sub-leaf combinations are virtualized by
the TDX module while some trigger #VE.  On #VE, TDX guest can issue
TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID)
to request VMM to emulate CPUID operation.

Morph the PV CPUID hypercall to EXIT_REASON_CPUID and wire it up to the KVM
backend function.
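
A sketch of the morphing, assuming the GHCI register convention of
leaf/sub-leaf in R12/R13 and EAX/EBX/ECX/EDX returned in R12-R15 (the cpuid
helper and accessors are stand-ins):

#include <stdint.h>

struct cpuid_result { uint32_t eax, ebx, ecx, edx; };
struct tdvmcall_regs { uint64_t r12, r13, r14, r15; };

/* Stand-in for KVM's CPUID emulation backend. */
static void emulate_cpuid(uint32_t leaf, uint32_t subleaf, struct cpuid_result *r)
{
        (void)leaf; (void)subleaf;
        r->eax = r->ebx = r->ecx = r->edx = 0;
}

static void handle_tdvmcall_cpuid(struct tdvmcall_regs *regs)
{
        struct cpuid_result r;

        emulate_cpuid((uint32_t)regs->r12, (uint32_t)regs->r13, &r);
        regs->r12 = r.eax;
        regs->r13 = r.ebx;
        regs->r14 = r.ecx;
        regs->r15 = r.edx;
}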

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Yan Zhao
4b2abc4971 KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect error and is less
sensitive to the performance penalty caused by kicking off vCPUs.

Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
  tdh_vp_enter() and TDCALLs.

Resources     SHARED users      EXCLUSIVE users
------------------------------------------------------------
TDCS epoch    tdh_vp_enter      tdh_mem_track
------------------------------------------------------------
SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
                                tdh_mem_range_block
------------------------------------------------------------
SEPT entry                      tdh_mem_range_block (Host lock)
                                tdh_mem_page_remove (Host lock)
                                tdg_mem_page_accept (Guest lock)
                                tdg_mem_page_attr_rd (Guest lock)
                                tdg_mem_page_attr_wr (Guest lock)

Use a TDX specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
TD, thereby avoiding the potential contention. Apply the kick-off and no
vCPU entering only after each SEAMCALL busy error to minimize the window of
no TD entry, as the contention due to 0-step mitigation or TDCALLs is
expected to be rare.
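
A simplified sketch of the kick-and-retry pattern (KVM's real request
machinery, locking, and the TDX_OPERAND_BUSY status value are all abstracted
away here):

#include <stdatomic.h>
#include <stdbool.h>

#define ERR_BUSY (-1)   /* stand-in for TDX_OPERAND_BUSY */

struct td_vm { atomic_bool wait_for_sept_zap; };

static void kick_all_vcpus_out_of_guest_mode(struct td_vm *vm) { (void)vm; }

/* On contention, block TD entries so tdh_vp_enter() can't re-contend the
 * resource, kick the vCPUs, and retry the zap-related SEAMCALL. */
static int zap_seamcall_retry(struct td_vm *vm, int (*seamcall)(void *), void *arg)
{
        int err = seamcall(arg);

        while (err == ERR_BUSY) {
                atomic_store(&vm->wait_for_sept_zap, true);
                kick_all_vcpus_out_of_guest_mode(vm);
                err = seamcall(arg);
        }
        atomic_store(&vm->wait_for_sept_zap, false);
        return err;
}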

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:57 -04:00
Yan Zhao
b0327bb2e7 KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().

TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or
TDCALLs.

Resources    SHARED  users      EXCLUSIVE users
------------------------------------------------------------
SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
           tdh_mem_page_aug
------------------------------------------------------------
SEPT entry                      tdh_mem_sept_add (Host lock)
                                tdh_mem_page_aug (Host lock)
                                tdg_mem_page_accept (Guest lock)
                                tdg_mem_page_attr_rd (Guest lock)
                                tdg_mem_page_attr_wr (Guest lock)

Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.

The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
   checks if the guest RIP is the same as last guest RIP on TD entry.
   -if yes, it means the EPT violation is caused by the same instruction
            that caused the last VM exit.
            Then, the TDX module increases the guest RIP no-progress count.
            When the count increases from 0 to the threshold (currently 6),
            the TDX module records the faulting GPA into a
            last_epf_gpa_list.
   -if no,  it means the guest RIP has made progress.
            So, the TDX module resets the RIP no-progress count and the
            last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
   TD entry) checks if the last_epf_gpa_list is empty.
   -if yes, TD entry continues without acquiring the lock on the SEPT tree.
   -if no,  it triggers the 0-step mitigation by acquiring the exclusive
            lock on SEPT tree, walking the EPT tree to check if all page
            faults caused by the GPAs in the last_epf_gpa_list have been
            resolved before continuing TD entry.

Since KVM TDP MMU usually re-enters guest whenever it exits to userspace
(e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is possible for a
tdh_vp_enter() to be called more than the threshold count before a page
fault is addressed, triggering contention when tdh_vp_enter() attempts to
acquire exclusive lock on SEPT tree.

Retry locally in TDX EPT violation handler to reduce the count of invoking
tdh_vp_enter(), hence reducing the possibility of its contention with
tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation and
the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT,
signals/interrupts, and cases when one instruction faults more GFNs than
the threshold count.
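
The local-retry idea in sketch form (stand-in types; break conditions such as
pending signals are simplified):

#include <stdbool.h>

enum fault_result { RET_PF_RETRY, RET_PF_FIXED, RET_PF_ERROR };

static bool must_break_out(void) { return false; }   /* signal/userspace-exit check */

/* Keep resolving the private EPT violation locally instead of returning
 * to the run loop, so tdh_vp_enter() is not re-invoked (and 0-step
 * mitigation is not tripped) while the fault is still outstanding. */
static enum fault_result tdx_handle_ept_violation(
                enum fault_result (*handle_fault)(void *), void *fault)
{
        enum fault_result ret;

        do {
                ret = handle_fault(fault);
        } while (ret == RET_PF_RETRY && !must_break_out());

        return ret;
}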

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Yan Zhao
e6a85781f7 KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs
Detect SEPT violations that occur when an SEPT entry is in PENDING state
while the TD is configured not to receive #VE on SEPT violations.

A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE
to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when
flexible_pending_ve is permitted. In such cases, the TDX module will not
inject #VE into the TD upon encountering an EPT violation caused by an SEPT
entry in the PENDING state. Instead, the TDX module will exit to the VMM and
set the extended exit qualification type to PENDING_EPT_VIOLATION and exit
qualification bits 6:3 to 0.

Since #VE will not be injected into such TDs, they cannot be notified to
accept a GPA.  A TD accessing a private GPA before accepting it is regarded
as an error within the guest.

Detect such a guest error by inspecting the (extended) exit qualification
bits and mark such a VM as dead.

Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
da407fe459 KVM: TDX: Handle EPT violation/misconfig exit
For TDX, on EPT violation, call common __vmx_handle_ept_violation() to
trigger x86 MMU code; on EPT misconfiguration, bug the VM since it
shouldn't happen.

EPT violation due to instruction fetch should never be triggered from
shared memory in TDX guest.  If such EPT violation occurs, treat it as
broken hardware.

EPT misconfiguration shouldn't happen on either shared or secure EPT for
TDX guests.
- TDX module guarantees no EPT misconfiguration on secure EPT.  Per TDX
  module v1.5 spec section 9.4 "Secure EPT Induced TD Exits":
  "By design, since secure EPT is fully controlled by the TDX module, an
  EPT misconfiguration on a private GPA indicates a TDX module bug and is
  handled as a fatal error."
- For shared EPT, the MMIO caching optimization, which is the only case
  where current KVM configures EPT entries to generate EPT
  misconfiguration, is implemented in a different way for TDX guests.  KVM
  configures EPT entries to non-present value without suppressing #VE bit.
  It causes #VE in the TDX guest and the guest will call TDG.VP.VMCALL to
  request MMIO emulation.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
[binbin: rework changelog]
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
6c441e4d6e KVM: TDX: Handle EXIT_REASON_OTHER_SMI
Handle the VM exit caused by "other SMI" for TDX by returning to userspace
for the Machine Check System Management Interrupt (MSMI) case, or by ignoring
it and resuming the vCPU for the non-MSMI case.

For VMX, SMM transition can happen in both VMX non-root mode and VMX
root mode.  Unlike VMX, in SEAM root mode (TDX module), all interrupts
are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI
causes VM exit to TDX module, then SEAMRET to KVM. Once it exits to KVM,
SMI is delivered and handled by kernel handler right away.

An SMI can be "I/O SMI" or "other SMI".  For TDX, there will be no I/O SMI
because I/O instructions inside TDX guest trigger #VE and TDX guest needs
to use TDVMCALL to request VMM to do I/O emulation.

For "other SMI", there are two cases:
- MSMI case.  When BIOS eMCA MCE-SMI morphing is enabled, a #MC that occurs in
  the TDX guest will be delivered as an MSMI.  It causes an
  EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
  qualification.  On VM exit, TDX module checks whether the "other SMI" is
  caused by an MSMI or not.  If so, TDX module marks TD as fatal,
  preventing further TD entries, and then completes the TD exit flow to KVM
  with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD.  After
  TD exit, the MSMI is delivered and eventually handled by the kernel
  machine check handler (7911f145de x86/mce: Implement recovery for
  errors in TDX/SEAM non-root mode), i.e., the memory page is marked as
  poisoned and it won't be freed to the free list when the TDX guest is
  terminated.  Since the TDX guest is dead, follow the other non-recoverable
  cases and exit to userspace.
- For non-MSMI case, KVM doesn't need to do anything, just continue TDX
  vCPU execution.
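
In sketch form, the exit handler only needs to look at the MSMI bit (bit 0) of
the exit qualification; the return convention (0 = exit to userspace, 1 =
resume the vCPU) mirrors KVM's usual exit-handler style but the names are
stand-ins:

#include <stdint.h>

#define OTHER_SMI_MSMI  (1ull << 0)   /* bit 0 of the exit qualification */

static int handle_other_smi_exit(uint64_t exit_qual)
{
        if (exit_qual & OTHER_SMI_MSMI)
                return 0;   /* TD already marked fatal; let userspace decide */
        return 1;           /* ordinary "other SMI", already handled by the host */
}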

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
f30cb6429f KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX.

NMI Handling: Just like the VMX case, NMI remains blocked after exiting
from TDX guest for NMI-induced exits [*].  Handle NMI-induced exits for
TDX guests in the same way as they are handled for VMX guests, i.e.,
handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi()
helper.

Interrupt and Exception Handling: Similar to the VMX case, external
interrupts and exceptions (machine check is the only exception type
KVM handles for TDX guests) are handled in the .handle_exit_irqoff()
callback.

For other exceptions, because TDX guest state is protected, exceptions in
TDX guests can't be intercepted.  TDX VMM isn't supposed to handle these
exceptions.  If unexpected exception occurs, exit to userspace with
KVM_EXIT_EXCEPTION.

For external interrupt, increase the statistics, same as the VMX case.

[*]: Some old TDX modules have a bug which makes NMI unblocked after
exiting from TDX guest for NMI-induced exits.  This could potentially
lead to nested NMIs: a new NMI arrives when KVM is manually calling the
host NMI handler.  This is an architectural violation, but it doesn't
have real harm until FRED is enabled together with TDX (for non-FRED,
the host NMI handler can handle nested NMIs).  Given this is rare to
happen and has no real harm, ignore this for the initial TDX support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Sean Christopherson
7e548b0d90 KVM: VMX: Add a helper for NMI handling
Add a helper to handle NMI exits.

TDX handles NMI exits the same way as the VMX case.  Add a helper to share the
code with TDX, and expose the helper in common.h.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Binbin Wu
d5bc91e8e7 KVM: VMX: Move emulation_required to struct vcpu_vt
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that
vmx_handle_exit_irqoff() can be reused by TDX code.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
8dac6b9a97 KVM: TDX: Add methods to ignore virtual apic related operation
TDX protects TDX guest APIC state from the VMM.  Implement access methods for
TDX guest vAPIC state that ignore accesses or return zero.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
f65916ae2d KVM: TDX: Force APICv active for TDX guest
Force APICv active for TDX guests in KVM because APICv is always enabled
by TDX module.

From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.

After TDX vCPU init, the APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2}
are rejected because has_protected_state is set for TDs and
guest_state_protected is set for TDX vCPUs.  Reject KVM_{GET,SET}_LAPIC from
userspace since migration is not supported yet, so that userspace cannot
disable the APIC.

For various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv
  in tdx_bringup(). If !enable_apicv, TDX support will be disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
  mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note:
  APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory
  allocation fails for KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
  HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is
  checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
  KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
  mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects
  userspace to set APIC base.
- The remaining inhibit reasons are relevant only to AMD's AVIC, including
  APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
  APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
  APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
  (For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
   EOI for TDX guests either, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
   guests, which eliminates the in-kernel PIT.)

Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Binbin Wu
209afc0c42 KVM: TDX: Enforce KVM_IRQCHIP_SPLIT for TDX guests
Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow in-kernel I/O APIC
while in-kernel local APIC is needed.

APICv is always enabled by TDX module and TDX Module doesn't allow the
hypervisor to modify the EOI-bitmap, i.e. all EOIs are accelerated and
never trigger exits.  Level-triggered interrupts and other things depending
on EOI VM-Exit can't be faithfully emulated in KVM.  Also, the lazy check
of pending APIC EOI for RTC edge-triggered interrupts, which was introduced
as a workaround when EOI cannot be intercepted, doesn't work for TDX either
because kvm_apic_pending_eoi() checks vIRR and vISR, but both values are
invisible in KVM.

If the guest induces generation of a level-triggered interrupt, the VMM is
left with the choice of dropping the interrupt, sending it as-is, or
converting it to an edge-triggered interrupt.  Ditto for KVM.  All of those
options will make the guest unhappy. There's no architectural behavior KVM
can provide that's better than sending the interrupt and hoping for the
best.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
4cdf243eb1 KVM: TDX: Always block INIT/SIPI
Always block INIT and SIPI events for the TDX guest because the TDX module
doesn't provide API for VMM to inject INIT IPI or SIPI.

TDX defines its own vCPU creation and initialization sequence including
multiple seamcalls.  Also, it's only allowed during TD build time.

Given that TDX guest is para-virtualized to boot BSP/APs, normally there
shouldn't be any INIT/SIPI event for TDX guest.  If any, three options to
handle them:
1. Always block INIT/SIPI request.
2. (Silently) ignore INIT/SIPI request during delivery.
3. Return error to guest TDs somehow.

Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
INIT handling and the vcpu_deliver_sipi_vector() op won't be called, so there
is no need to add a new interface or helper function for INIT/SIPI delivery.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
2b06beb08f KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM
Handle SMI requests the same way KVM does for CONFIG_KVM_SMM=n, i.e. return
-ENOTTY, and add KVM_BUG_ON() to SMI-related ops for TDs.

TDX doesn't support system-management mode (SMM) and system-management
interrupt (SMI) in guest TDs.  Because guest state (vCPU state, memory
state) is protected, it must go through the TDX module APIs to change
guest state.  However, the TDX module doesn't provide a way for VMM to
inject SMI into guest TD or a way for VMM to switch guest vCPU mode into
SMM.

MSR_IA32_SMBASE will not be emulated for TDX guests; -ENOTTY will be
returned when SMI is requested.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
acc64eb4e2 KVM: TDX: Implement methods to inject NMI
Inject NMIs into a TDX guest by setting the PEND_NMI TDVPS field to 1, i.e.
make the NMI pending in the TDX module.  If there is a further pending NMI in
KVM, collapse it into the one pending in the TDX module.

The VMM can request the TDX module to inject an NMI into a TDX vCPU by setting
the PEND_NMI TDVPS field to 1.  Following that, the VMM can call TDH.VP.ENTER
to run the vCPU, and the TDX module will attempt to inject the NMI as soon
as possible.

KVM has the following 3 cases in which it injects two NMIs when handling
simultaneous NMIs, and they need to be injected back-to-back; otherwise, the
OS kernel may fire a warning about an unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI is pending in KVM.
    KVM requests an NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
    KVM injects the first NMI and requests an NMI window exit to inject the
    second NMI.
K3. A previous NMI needs to be re-injected and one NMI is pending in KVM.
    KVM first requests a force immediate exit followed by a VM entry to
    complete the NMI re-injection.  Then, during the force immediate exit,
    KVM requests an NMI window exit to inject the pending NMI.

For TDX, the PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend
one NMI in the TDX module.  Also, because the vCPU state is protected, KVM
doesn't know the NMI-blocking state of a TDX vCPU, so KVM has to assume NMIs
are always unmasked and allowed.  When KVM sees PEND_NMI is 1 after a TD exit,
it means the previous NMI needs to be re-injected.

Based on KVM's NMI handling flow, there are following 6 cases:
    In NMI handler    TDX module    KVM
T1. No                PEND_NMI=0    1 pending NMI
T2. No                PEND_NMI=0    2 pending NMIs
T3. No                PEND_NMI=1    1 pending NMI
T4. Yes               PEND_NMI=0    1 pending NMI
T5. Yes               PEND_NMI=0    2 pending NMIs
T6. Yes               PEND_NMI=1    1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: KVM doesn't know whether an NMI is blocked by another NMI or not, so
cases T5 and T6 can happen.

When handling a pending NMI in KVM for a TDX guest, all KVM can do is add a
pending NMI in the TDX module when PEND_NMI is 0; T1 and T4 can be handled
this way.  However, TDX doesn't allow KVM to request an NMI window exit
directly; if PEND_NMI is already set and there is still a pending NMI in KVM,
the only thing KVM could try is to request a force immediate exit.  But for
cases T5 and T6, a force immediate exit would result in an infinite loop,
because the force immediate exit makes no progress in the NMI handler, so the
pending NMI in the TDX module can never be injected.

Consider that on x86 bare metal, multiple NMIs can collapse into one NMI,
e.g. when NMIs are blocked by an SMI.  It's the OS's responsibility to poll
all NMI sources in the NMI handler to avoid missing NMI events.

Based on that, of the above 3 cases (K1-K3), only case K1 must inject the
second NMI: the guest NMI handler may have already polled some of the NMI
sources, which could include the source of the pending NMI, so the pending NMI
must be injected to avoid losing an NMI.  For cases K2 and K3, the guest OS
will poll all NMI sources (including the sources behind the second NMI and any
further collapsed NMIs) when the first NMI is delivered, so KVM has no need to
inject the second NMI.

To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify the KVM's NMI handling common code, to collapse the
  second pending NMI for K2 and K3.
- Option 2: Do it in TDX specific way. When the previous NMI is still
  pending in the TDX module, i.e. it has not been delivered to TDX guest
  yet, collapse the pending NMI in KVM into the previous one.

This patch goes with option 2 because it is simple and doesn't impact other
VM types.  Option 1 may need more discussion.

This is the first case that needs to access vCPU-scope metadata in the
"management" class, so make the needed accessors available.

[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/
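
A minimal, self-contained sketch of the option-2 collapsing logic; the
structure and field names below are illustrative stand-ins for the real vCPU
state and TDVPS accessors:

    #include <stdint.h>

    struct tdx_nmi_state_sketch {
        uint8_t pend_nmi;            /* mirrors the 1-bit PEND_NMI TDVPS field */
        unsigned int nmi_pending;    /* NMIs queued in KVM for this vCPU       */
    };

    static void tdx_inject_nmi_sketch(struct tdx_nmi_state_sketch *s)
    {
        if (!s->nmi_pending)
            return;
        if (s->pend_nmi) {
            /*
             * The previous NMI hasn't been delivered yet; collapse the newly
             * pending NMI into it instead of trying to request an NMI window.
             */
            s->nmi_pending = 0;
            return;
        }
        s->pend_nmi = 1;             /* ask the TDX module to inject an NMI */
        s->nmi_pending--;
    }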

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:56 -04:00
Isaku Yamahata
fc17de9901 KVM: TDX: Wait lapic expire when timer IRQ was injected
Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector
for LVTT is set in PIR before TD entry.

KVM always assumes a timer IRQ was injected if APIC state is protected.
For TDX guest, APIC state is protected and KVM injects timer IRQ via posted
interrupt.  To avoid unnecessary wait calls, only call
kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON
is set and the vector for LVTT is set in PIR.

Add a helper to test PIR.
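
For illustration, testing a vector in the 256-bit PIR boils down to a word/bit
lookup; a self-contained sketch (not the kernel helper) follows:

    #include <stdbool.h>
    #include <stdint.h>

    /* PIR is 256 bits, i.e. four 64-bit words; vector N maps to word N/64, bit N%64. */
    static bool pir_test_vector_sketch(const uint64_t pir[4], uint8_t vector)
    {
        return pir[vector / 64] & (1ULL << (vector % 64));
    }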

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Sean Christopherson
14aecf2a5b KVM: x86: Assume timer IRQ was injected if APIC state is protected
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path.  The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.

Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
24c1291116 KVM: TDX: Implement non-NMI interrupt injection
Implement non-NMI interrupt injection for TDX via posted interrupt.

As CPU state is protected and APICv is enabled for the TDX guest, TDX
supports non-NMI interrupt injection only via posted interrupts.  Posted
interrupt descriptors (PIDs) are allocated in shared memory, so KVM can
update them directly.  If the target vCPU is in non-root mode, send a posted
interrupt notification to the vCPU and hardware will sync PIR to vIRR
atomically.  Otherwise, kick it to pick up the interrupt from the PID.  To
post pending interrupts in the PID, KVM can generate a self-IPI with the
notification vector prior to TD entry.

Since the guest state of a TD vCPU is protected, assume interrupts are
always allowed.  Ignore the code paths for the event injection mechanism and
LAPIC emulation for TDX.
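
A simplified, self-contained sketch of posting a vector into a shared
posted-interrupt descriptor; the field layout and names are illustrative, not
the hardware-defined structure:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct pi_desc_sketch {
        _Atomic uint64_t pir[4];     /* 256-bit posted-interrupt request bitmap */
        _Atomic uint32_t control;    /* bit 0: outstanding notification (ON)    */
    };

    /*
     * Post a vector and report whether a notification (an IPI to the target
     * vCPU, or a self-IPI prior to TD entry) still needs to be sent.
     */
    static bool post_interrupt_sketch(struct pi_desc_sketch *pid, uint8_t vector)
    {
        atomic_fetch_or(&pid->pir[vector / 64], 1ULL << (vector % 64));
        return !(atomic_fetch_or(&pid->control, 1u) & 1u);
    }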

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
254e5dcd5a KVM: VMX: Move posted interrupt delivery code to common header
Move posted interrupt delivery code to common header so that TDX can
leverage it.

No functional change intended.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014757.897978-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
34d2d1ca1b KVM: TDX: Disable PI wakeup for IPIv
Disable PI wakeup for IPI virtualization (IPIv) case for TDX.

When a vCPU is being scheduled out, the notification vector is switched and
pi_wakeup_handler() is enabled when the vCPU has interrupts enabled, so that
a posted interrupt can be used to wake up the vCPU.

For VMX, a blocked vCPU can be the target of posted interrupts when using
IPIv or VT-d PI.  TDX doesn't support IPIv, so disable PI wakeup for the IPIv
case.  Also, since the guest state of a TD vCPU is protected, assume
interrupts are always enabled for the TD.  (The PV HLT hypercall, through
which the TDX guest tells the VMM whether HLT was called with interrupts
disabled, is not supported yet.)

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Sean Christopherson
90cfe144c8 KVM: TDX: Add support for find pending IRQ in a protected local APIC
Add flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ.  For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM.  As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM.  To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector.  And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.

Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.

Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks.  A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.

For the TD exits not due to HLT TDCALL, skip checking RVI pending in
tdx_protected_apic_has_interrupt().  Except for the guest being stupid
(e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to
have an interrupt in RVI that is fully unmasked.  There are no CPU flows
that modify RVI in the middle of instruction execution.  I.e., if RVI is
non-zero, then either the interrupt has been pending since before the TD
exit, or the instruction that caused the TD exit is in an STI/SS shadow.  KVM
doesn't care about STI/SS shadows outside of the HALTED case.  And if the
interrupt was pending before TD exit, then it _must_ be blocked, otherwise
the interrupt would have been serviced at the instruction boundary.

The HLT TDCALL case will be handled in a future patch when HLT TDCALL is
supported.
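
A sketch of the resulting split in behavior; the callbacks and the return
convention are illustrative, not KVM's actual API:

    #include <stdbool.h>

    enum { APIC_VECTOR_UNKNOWN_SK = 256 };  /* "pending, but the vector is not visible" */

    static int apic_has_interrupt_sketch(bool apic_protected,
                                         bool (*has_outstanding_notification)(void),
                                         int (*highest_pending_vector)(void))
    {
        if (apic_protected)
            /* Only "is something pending?" can be answered, never which vector. */
            return has_outstanding_notification() ? APIC_VECTOR_UNKNOWN_SK : -1;
        return highest_pending_vector();        /* normal vAPIC lookup */
    }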

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Sean Christopherson
bb723bebde KVM: TDX: Handle TDX PV MMIO hypercall
Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the
leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according
to TDX Guest Host Communication Interface (GHCI) spec.

For TDX guests, the VMM is not allowed to access vCPU registers or private
memory, and instructions must be fetched from private memory.  So the MMIO
emulation implemented for non-TDX VMs is not possible for TDX guests.

In TDX, the MMIO regions are instead configured by the VMM to trigger a #VE
exception in the guest.  The #VE handler is supposed to emulate the MMIO
instruction inside the guest and convert it into a TDVMCALL with the
leaf #VE.RequestMMIO, which has the same value as EXIT_REASON_EPT_VIOLATION.

The requested MMIO address must be in shared GPA space.  The shared bit
is stripped after check because the existing code for MMIO emulation is
not aware of the shared bit.

The MMIO GPA shouldn't have a valid memslot, and the attribute of the GPA
should be shared.  KVM could do the checks before exiting to userspace;
however, even if KVM does the check, there will still be race conditions
between the check in KVM and the emulation of the MMIO access in userspace
due to a memslot hotplug or a memory attribute conversion.  If userspace
doesn't check the attribute of the GPA and the attribute happens to be
private, it will not pose a security risk or cause an MCE, but it can lead
to another issue.  E.g., in QEMU, treating a GPA with the private attribute
as shared when it falls within RAM's range can result in extra memory
consumption during the emulation of the access to the HVA of the GPA.
There are two options: 1) Do the check both in KVM and userspace.  2) Do
the check only in QEMU.  This patch chooses option 2, i.e. KVM omits the
memslot and attribute checks, and expects userspace to do the checks.

Similar to normal MMIO emulation, try to handle the MMIO in the kernel
first; if the kernel can't support it, forward the request to userspace.
Export the symbols needed for MMIO handling.

Fragment handling is not needed for TDX PV MMIO because the GPA is provided;
if an MMIO access crosses a page boundary, it is contiguous in GPA.  Also,
the size is limited to 1, 2, 4 or 8 bytes, so no further splitting is needed.
Allow cross-page accesses because no extra handling is needed after checking
that both the start and end GPA are shared GPAs.
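
A self-contained sketch of the size and shared-bit checks described above; the
helper name and calling convention are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    static bool tdx_mmio_check_and_strip_sketch(uint64_t gpa, unsigned int size,
                                                uint64_t shared_bit, uint64_t *out_gpa)
    {
        if (size != 1 && size != 2 && size != 4 && size != 8)
            return false;
        /* Both the first and the last byte of the access must be shared GPAs. */
        if (!(gpa & shared_bit) || !((gpa + size - 1) & shared_bit))
            return false;
        *out_gpa = gpa & ~shared_bit;  /* strip the shared bit for the emulation path */
        return true;
    }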

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
33608aaf71 KVM: TDX: Handle TDX PV port I/O hypercall
Emulate port I/O requested by TDX guest via TDVMCALL with leaf
Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to
TDX Guest Host Communication Interface (GHCI).

All port I/O instructions inside the TDX guest trigger the #VE exception.
On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with
leaf Instruction.IO to request VMM to emulate I/O instructions.

Similar to normal port I/O emulation, try to handle the port I/O in the
kernel first; if the kernel can't support it, forward the request to
userspace.

Note that string I/O operations are not supported in TDX.  The guest should
unroll them before calling the TDVMCALL.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Binbin Wu
79462faa2b KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.

A TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal
error it has experienced.  This hypercall is special because the TD guest
is requesting a termination with the error information, so KVM needs to
forward the hypercall to userspace anyway.  KVM doesn't do any parsing or
conversion; it just dumps the 16 general-purpose registers to userspace
and lets userspace decide what to do.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Binbin Wu
2c30488083 KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.

MapGPA is used by TDX guest to request to map a GPA range as private
or shared memory.  It needs to exit to userspace for handling.  KVM has
already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will
exit to userspace with exit reason KVM_EXIT_HYPERCALL.  Do sanity checks,
convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
to userspace.

To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks and pending interrupts are checked between
chunks.  This allows for timely injection of interrupts and prevents issues
with guest lockup detection.  The TDX guest should retry the operation for
the GPA starting at the address specified in R11 when the TDVMCALL returns
TDVMCALL_RETRY as the status code.

Note that userspace needs to enable KVM_CAP_EXIT_HYPERCALL with the
KVM_HC_MAP_GPA_RANGE bit set for a TD VM.
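
The chunking can be pictured with the following self-contained sketch, which
returns the GPA at which the guest should retry; names and callbacks are
illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAP_GPA_CHUNK_SKETCH (2ULL << 20)   /* 2 MiB */

    static uint64_t map_gpa_in_chunks_sketch(uint64_t gpa, uint64_t size,
                                             void (*forward_chunk)(uint64_t gpa, uint64_t len),
                                             bool (*interrupt_pending)(void))
    {
        uint64_t end = gpa + size;

        while (gpa < end) {
            uint64_t len = (end - gpa < MAP_GPA_CHUNK_SKETCH) ? end - gpa
                                                              : MAP_GPA_CHUNK_SKETCH;

            forward_chunk(gpa, len);             /* exit to userspace for this chunk */
            gpa += len;
            if (gpa < end && interrupt_pending())
                break;   /* caller returns TDVMCALL_RETRY with this GPA in R11 */
        }
        return gpa;      /* equals 'end' once the whole range has been handled */
    }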

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
d5998c02bc KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL
Handle KVM hypercall for TDX according to TDX Guest-Host Communication
Interface (GHCI) specification.

The TDX GHCI specification defines the ABI for the guest TD to issue
hypercalls.  When R10 is non-zero, it indicates the TDG.VP.VMCALL is
vendor-specific.  KVM uses R10 as the KVM hypercall number and R11-R14
as the 4 arguments, while the error code is returned in R10.

Morph a TDG.VP.VMCALL carrying a KVM hypercall to EXIT_REASON_VMCALL and
marshal r10-r14 from vp_enter_args to the appropriate x86 registers for
KVM hypercall handling.
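
The marshalling amounts to a fixed register mapping; a self-contained sketch
with illustrative struct names:

    #include <stdint.h>

    struct tdvmcall_args_sketch { uint64_t r10, r11, r12, r13, r14; };
    struct kvm_hypercall_sketch { uint64_t nr, a0, a1, a2, a3; };

    /*
     * R10 carries the hypercall number and R11-R14 the four arguments; the
     * result is written back to R10 after emulation.
     */
    static struct kvm_hypercall_sketch
    tdvmcall_to_hypercall_sketch(const struct tdvmcall_args_sketch *a)
    {
        return (struct kvm_hypercall_sketch){
            .nr = a->r10, .a0 = a->r11, .a1 = a->r12, .a2 = a->r13, .a3 = a->r14,
        };
    }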

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
c42856af8f KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
Add a placeholder and related helper functions in preparation for
TDG.VP.VMCALL handling.

The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM.  When the guest TD
issues a TDVMCALL, the guest TD exits to the VMM with a new exit reason.  The
arguments from the guest TD and the values returned from the VMM are passed
in the guest registers.  The guest RCX register indicates which registers are
used.  Define helper functions to access those registers.

A new VMX exit reason TDCALL is added to indicate the exit is due to
TDVMCALL from the guest TD.  Define the TDCALL exit reason and add a
placeholder to handle such exits.

Some leaves of TDCALL will be morphed to another VMX exit reason instead of
EXIT_REASON_TDCALL; add a helper tdcall_to_vmx_exit_reason() as a placeholder
to do the conversion.

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Isaku Yamahata
095b71a03f KVM: TDX: Add a place holder to handle TDX VM exit
Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally, add error handling during the TDX VM exit flow, and add a
placeholder to handle various exit reasons.

Store VMX exit reason and exit qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. Store extended exit qualification and exit GPA info in
struct vcpu_tdx because they are used by TDX code only.

Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.*
operations due to secure EPT or TD EPOCH. If the contention occurs,
the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to
attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED,
not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during
IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of
KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly.

Error Handling:
- TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the
  CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX
  isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM
  firmware is not loaded or disabled.
- TDX_ERROR: This indicates some check failed in the TDX module, preventing
  the vCPU from running.
- Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it
  separately before handling TDX_NON_RECOVERABLE because when off-TD debug
  is not enabled, TDX_NON_RECOVERABLE is set.
- TDX_NON_RECOVERABLE: Set by the TDX module when the error is
  non-recoverable, indicating that the TDX guest is dead or the vCPU is
  disabled.
  A special case is triple fault, which also sets TDX_NON_RECOVERABLE but
  exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case.
- Any unhandled VM exit reason will also return to userspace with
  KVM_EXIT_INTERNAL_ERROR.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Binbin Wu
44428e4936 KVM: x86: Move pv_unhalted check out of kvm_vcpu_has_events()
Move pv_unhalted check out of kvm_vcpu_has_events(), check pv_unhalted
explicitly when handling PV unhalt and expose kvm_vcpu_has_events().

kvm_vcpu_has_events() returns true if pv_unhalted is set, and pv_unhalted
is only cleared on transitions to KVM_MP_STATE_RUNNABLE.  If the guest
initiates a spurious wakeup, pv_unhalted could be left set in perpetuity.
Currently, this is not problematic because kvm_vcpu_has_events() is only
called when handling PV unhalt.  However, if kvm_vcpu_has_events() is used
for other purposes in the future, it could return unexpected results.

Export kvm_vcpu_has_events() for its usage in broader contexts.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Binbin Wu
6162b37357 KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.

When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
hypercall ABI _before_ ____kvm_emulate_hypercall() gets called.  Therefore,
____kvm_emulate_hypercall() can just read registers internally based on KVM
hypercall ABI, and those registers can be removed from the
__kvm_emulate_hypercall() macro.

Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
remove it from the __kvm_emulate_hypercall() macro as well.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20250222014225.897298-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Isaku Yamahata
484612f1a7 KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.

TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit.  Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs.  KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.

MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set.  Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Adrian Hunter
8af0990375 KVM: TDX: Save and restore IA32_DEBUGCTL
Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it
afterwards.  The TDX Module preserves bits 1, 12, and 14, so if no
other bits are set, no restore is done.
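
A sketch of the conditional restore, assuming a caller-provided MSR write
helper; the constant and function names are illustrative:

    #include <stdint.h>

    #define DEBUGCTL_TDX_PRESERVED_SKETCH ((1ULL << 1) | (1ULL << 12) | (1ULL << 14))

    static void tdx_restore_debugctl_sketch(uint64_t saved_debugctl,
                                            void (*wrmsr_debugctl)(uint64_t val))
    {
        /*
         * Bits 1, 12 and 14 survive TD entry/exit; rewrite the MSR only if the
         * saved value had anything else set.
         */
        if (saved_debugctl & ~DEBUGCTL_TDX_PRESERVED_SKETCH)
            wrmsr_debugctl(saved_debugctl);
    }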

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-12-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Adrian Hunter
6d415778f1 KVM: TDX: Disable support for TSX and WAITPKG
Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not
yet implemented, so disable support for TSX and WAITPKG for now.  Clear the
associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error
if those bits are set in KVM_TDX_INIT_VM.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-11-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Isaku Yamahata
e0b4f31a3c KVM: TDX: restore user ret MSRs
Several MSRs that are not used by Linux while in ring 0 are clobbered on TD
exit.  Ensure the cached values of these MSRs are updated on vcpu_put, and
the MSRs themselves are restored before returning to ring 3.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-10-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Chao Gao
d3a6b6cfb8 KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
Several MSRs are constant and only used in userspace (ring 3), but VMs may
have different values.  KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when
the kernel is about to return to userspace.  To eliminate unnecessary wrmsrs,
KVM also caches the value it last wrote to an MSR.

The TDX module unconditionally resets some of these MSRs to the architectural
INIT state on TD exit, which makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware.  This inconsistency needs to be
fixed; otherwise, it may mislead kvm_on_user_return() into skipping the
restore of some MSRs to the host's values.  kvm_set_user_return_msr() can
help correct this case, but it is not optimal as it always does a wrmsr.  So,
introduce a variation of kvm_set_user_return_msr() that updates the cached
values and skips the wrmsr.
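
A simplified, self-contained sketch of such a cache-only update; structure and
function names are illustrative, not the kernel's:

    #include <stdbool.h>
    #include <stdint.h>

    struct user_return_msr_slot_sketch {
        uint64_t host;    /* value to restore before returning to userspace           */
        uint64_t curr;    /* cached value last written to (or known to be in) the MSR */
    };

    struct user_return_msrs_sketch {
        bool notifier_registered;
        struct user_return_msr_slot_sketch values[16];
    };

    /*
     * Hardware already holds 'value' (e.g. the TDX module reset the MSR on TD
     * exit), so only the cache needs updating; no wrmsr is issued here.
     */
    static void user_return_msr_update_cache_sketch(struct user_return_msrs_sketch *msrs,
                                                    unsigned int slot, uint64_t value)
    {
        msrs->values[slot].curr = value;
        if (!msrs->notifier_registered)
            msrs->notifier_registered = true;  /* arm the notifier to restore host values */
    }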

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Isaku Yamahata
6bfa6d8509 KVM: TDX: restore host xsave state when exit from the guest TD
On exiting from the guest TD, xsave state is clobbered; restore it.
Do not use kvm_load_host_xsave_state(), as it relies on vcpu->arch
to find out whether other KVM_RUN code has loaded guest state into
XCR0/PKRU/XSS or not.  In the case of TDX, the exit values are known
independent of the guest CR0 and CR4, and in fact the latter are not
available.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-8-adrian.hunter@intel.com>
[Rewrite to not use kvm_load_host_xsave_state. - Paolo]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Isaku Yamahata
81bf40d54c KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-7-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Isaku Yamahata
81bf912b2c KVM: TDX: Implement TDX vcpu enter/exit path
Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.

Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead,
marshall state to/from the appropriate x86 registers only when needed,
i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage
the existing code.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-6-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Binbin Wu
7172c753c2 KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible and to make
TDX exit handling more VMX like.

No functional change intended.

[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-5-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:54 -04:00
Yan Zhao
eac0b72fae KVM: TDX: Handle SEPT zap error due to page add error in premap
Move the handling of SEPT zap errors caused by unsuccessful execution of
tdh_mem_page_add() in KVM_TDX_INIT_MEM_REGION from
tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). Introduce a
new helper function tdx_is_sept_zap_err_due_to_premap() to detect this
specific error.

During the IOCTL KVM_TDX_INIT_MEM_REGION, KVM premaps leaf SPTEs in the
mirror page table before the corresponding entry in the private page table
is successfully installed by tdh_mem_page_add(). If an error occurs during
the invocation of tdh_mem_page_add(), a mismatch between the mirror and
private page tables results in SEAMCALLs for SEPT zap returning the error
code TDX_EPT_ENTRY_STATE_INCORRECT.

The error TDX_EPT_WALK_FAILED is not possible because, during
KVM_TDX_INIT_MEM_REGION, KVM only premaps leaf SPTEs after successfully
mapping non-leaf SPTEs. Unlike leaf SPTEs, there is no mismatch in non-leaf
PTEs between the mirror and private page tables. Therefore, during zap,
SEAMCALLs should find an empty leaf entry in the private EPT, leading to
the error TDX_EPT_ENTRY_STATE_INCORRECT instead of TDX_EPT_WALK_FAILED.

Since tdh_mem_range_block() is always invoked before tdh_mem_page_remove(),
move the handling of SEPT zap errors from tdx_sept_drop_private_spte() to
tdx_sept_zap_private_spte(). In tdx_sept_zap_private_spte(), return 0 for
errors due to premap to skip executing other SEAMCALLs for zap, which are
unnecessary. Return 1 to indicate no other errors, allowing the execution
of other zap SEAMCALLs to continue.

The failure of tdh_mem_page_add() is uncommon and has not been observed in
real workloads. Currently, this failure is only hypothetically triggered by
skipping the real SEAMCALL and faking the add error in the SEAMCALL
wrapper. Additionally, without this fix, there will be no host crashes or
other severe issues.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250217085642.19696-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Paolo Bonzini
1f62531bc9 KVM: TDX: Skip updating CPU dirty logging request for TDs
Wrap vmx_update_cpu_dirty_logging so as to ignore requests to update
CPU dirty logging for TDs, as basic TDX does not support the PML feature.
Invoking vmx_update_cpu_dirty_logging() for TDs would cause an incorrect
access to a kvm_vmx struct for a TDX VM, so block that before it happens.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Yan Zhao
fbb4adadea KVM: x86: Make cpu_dirty_log_size a per-VM value
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.

Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.

Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).

This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding memslot has no KVM_MEM_GUEST_MEMFD flag.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.

Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Yan Zhao
fd3252571b KVM: x86/mmu: Add parameter "kvm" to kvm_mmu_page_ad_need_write_protect()
Add a parameter "kvm" to kvm_mmu_page_ad_need_write_protect() and its
caller tdp_mmu_need_write_protect().

This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.

No functional change expected.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Yan Zhao
c4a92f12cf KVM: Add parameter "kvm" to kvm_cpu_dirty_log_size() and its callers
Add a parameter "kvm" to kvm_cpu_dirty_log_size() and down to its callers:
kvm_dirty_ring_get_rsvd_entries(), kvm_dirty_ring_alloc().

This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.

No functional change expected.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Isaku Yamahata
d789fa6efa KVM: TDX: Handle vCPU dissociation
Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
the address translation caches and cached TD VMCS of a TD vCPU in its
associated pCPU.

In TDX, a vCPU can only be associated with one pCPU at a time, and the
association is done by invoking SEAMCALL TDH.VP.ENTER.  For a successful
association, the vCPU must be dissociated from its previously associated
pCPU.

To facilitate vCPU dissociation, introduce a per-pCPU list
associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
pCPU).

vCPU dissociations can happen under the conditions below:
- When the hardware_disable op is called.
  This op is called when virtualization is disabled on a given pCPU, e.g.
  when hot-unplugging a pCPU or during machine shutdown/suspend.
  In this case, dissociate all vCPUs from the pCPU by iterating its
  per-pCPU list associated_tdvcpus.

- On vCPU migration to a new pCPU.
  Before adding a vCPU into associated_tdvcpus list of the new pCPU,
  dissociation from its old pCPU is required, which is performed by issuing
  an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
  On a successful dissociation, the vCPU will be removed from the
  associated_tdvcpus list of its previously associated pCPU.

- When tdx_mmu_release_hkid() is called.
  TDX mandates that all vCPUs must be disassociated prior to the release of
  an hkid. Therefore, dissociation of all vCPUs is a must before executing
  the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the hkid.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073858.22312-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Isaku Yamahata
012426d6f5 KVM: TDX: Finalize VM initialization
Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.

Documentation for the API is added in another patch:
"Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"

For the purpose of attestation, a measurement must be made of the TDX VM
initial state. This is referred to as TD Measurement Finalization, and
uses SEAMCALL TDH.MR.FINALIZE, after which:
1. The VMM adding TD private pages with arbitrary content is no longer
   allowed
2. The TDX VM is runnable

Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240904030751.117579-21-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Isaku Yamahata
c846b451d3 KVM: TDX: Add an ioctl to create initial guest memory
Add a new ioctl for the user space VMM to initialize guest memory with the
specified memory contents.

Because TDX protects the guest's memory, the creation of the initial guest
memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
directly copying the memory contents into the guest's memory in the case of
the default VM type.

Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
the guest page in Secure-EPT, copy the initial memory contents into the
guest memory, and encrypt the guest memory.  Optionally, extend the memory
measurement of the TDX guest.

The ioctl uses the vCPU file descriptor because of the TDX module's
requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD)
prior to initialization (TDH.MEM.PAGE.ADD).  Accessing the MMU in turn
requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY.  In
fact, the post-populate callback is able to reuse the same logic used by
KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single
ioctl.

Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD
is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that
point.  This ensures that there cannot be pages in the S-EPT awaiting
TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by
tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT,
but the corresponding S-EPT entry will be !PRESENT).

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 - KVM_BUG_ON() for kvm_tdx->nr_premapped (Paolo)
 - Use tdx_operand_busy()
 - Merge first patch in SEPT SEAMCALL retry series in to this base
   (Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Rick Edgecombe
b832317b8c KVM: x86/mmu: Export kvm_tdp_map_page()
In future changes, CoCo-specific code will need to call kvm_tdp_map_page()
from within its gmem_post_populate() callbacks. Export it so this can be
done from vendor-specific code. Since kvm_mmu_reload()
will be needed for this operation, export its callee kvm_mmu_load() as
well.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073827.22270-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Yan Zhao
958810a094 KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
fatal errors have occurred.

kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
regardless of the specific error code from tdp_mmu_set_spte_atomic(),
tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
in general cases where the only possible error code from these functions is
-EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
errors.

Since this -EIO error is also a fatal error, check for VM dead in the
kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.

The error -EIO is uncommon and has not been observed in real workloads.
Currently, it is only hypothetically triggered by bypassing the real
SEAMCALL and faking an error in the SEAMCALL wrapper.
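
A self-contained sketch of the retry loop with the added bail-out; the
constants and callbacks are illustrative:

    #include <errno.h>
    #include <stdbool.h>

    enum { MAP_RETRY_SKETCH, MAP_FIXED_SKETCH };

    static int map_page_loop_sketch(bool (*vm_dead)(void), int (*try_map)(void))
    {
        int r;

        do {
            if (vm_dead())
                return -EIO;   /* fatal error: don't spin until a signal arrives */
            r = try_map();
        } while (r == MAP_RETRY_SKETCH);

        return r == MAP_FIXED_SKETCH ? 0 : -EIO;
    }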

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250220102728.24546-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Isaku Yamahata
0036b87a95 KVM: TDX: Implement hook to get max mapping level of private pages
Implement hook private_max_mapping_level for TDX to let TDP MMU core get
max mapping level of private pages.

The value is hard-coded to 4K, as huge pages are not supported for now.
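
The hook itself is trivial for now; a sketch with an illustrative name:

    /*
     * No huge pages for TDX private memory yet, so always report a 4K mapping
     * level (level 1 in KVM's PG_LEVEL_* terms).
     */
    static int tdx_private_max_mapping_level_sketch(void)
    {
        return 1;
    }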

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073816.22256-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Isaku Yamahata
02ab57707b KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
Implement hooks in TDX to propagate changes of mirror page table to private
EPT, including changes for page table page adding/removing, guest page
adding/removing.

TDX invokes corresponding SEAMCALLs in the hooks.

- Hook link_external_spt
  propagates adding page table page into private EPT.

- Hook set_external_spte
  tdx_sept_set_private_spte() in this patch only handles adding of guest
  private page when TD is finalized.
  Later patches will handle the case of adding guest private pages before
  TD finalization.

- Hook free_external_spt
  It is invoked when page table page is removed in mirror page table, which
  currently must occur at TD tear down phase, after hkid is freed.

- Hook remove_external_spte
  It is invoked when guest private page is removed in mirror page table,
  which can occur when TD is active, e.g. during shared <-> private
  conversion and slot move/deletion.
  This hook is ensured to be triggered before hkid is freed, because
  gmem fd is released along with all private leaf mappings zapped before
  freeing hkid at VM destroy.

  TDX invokes below SEAMCALLs sequentially:
  1) TDH.MEM.RANGE.BLOCK (remove RWX bits from a private EPT entry),
  2) TDH.MEM.TRACK (increases TD epoch)
  3) TDH.MEM.PAGE.REMOVE (remove the private EPT entry and untrack the
     guest page).

  TDH.MEM.PAGE.REMOVE can't succeed without TDH.MEM.RANGE.BLOCK and
  TDH.MEM.TRACK being called successfully.
  SEAMCALL TDH.MEM.TRACK is called in function tdx_track() to enforce that
  TLB tracking will be performed by TDX module for private EPT.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 - Remove TDX_ERROR_SEPT_BUSY and Add tdx_operand_busy() helper (Binbin)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:53 -04:00
Isaku Yamahata
22836e1de6 KVM: TDX: Handle TLB tracking for TDX
Handle TLB tracking for TDX by introducing function tdx_track() for private
memory TLB tracking and implementing flush_tlb* hooks to flush TLBs for
shared memory.

Introduce function tdx_track() to do TLB tracking on private memory, which
basically does two things: calling TDH.MEM.TRACK to increase TD epoch and
kicking off all vCPUs. The private EPT will then be flushed when each vCPU
re-enters the TD. This function is unused temporarily in this patch and
will be called on a page-by-page basis on removal of private guest page in
a later patch.

In earlier revisions, tdx_track() relied on an atomic counter to coordinate
the synchronization between the actions of kicking off vCPUs, incrementing
the TD epoch, and the vCPUs waiting for the incremented TD epoch after
being kicked off.

However, the core MMU only actually needs to call tdx_track() while
already holding mmu_lock for write.  So this synchronization can be made
unnecessary.  vCPUs are kicked off only after the successful execution of
TDH.MEM.TRACK, eliminating the need for vCPUs to wait for TDH.MEM.TRACK
completion after being kicked off.  tdx_track() is therefore able to send
KVM_REQ_OUTSIDE_GUEST_MODE requests rather than KVM_REQ_TLB_FLUSH.

Hooks for flush_remote_tlbs and flush_remote_tlbs_range are not necessary
for TDX, as tdx_track() will handle TLB tracking of private memory on a
page-by-page basis when private guest pages are removed.  There is no need
to invoke tdx_track() again in kvm_flush_remote_tlbs() even after changes
to the mirrored page table.

For the hooks flush_tlb_current and flush_tlb_all, which are invoked during
kvm_mmu_load() and vCPU load for normal VMs, let the VMM flush all EPTs in
the two hooks for simplicity, since TDX does not depend on the two
hooks to notify the TDX module to flush the private EPT in those cases.
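
A simplified sketch of the tdx_track() flow under mmu_lock; the SEAMCALL
wrapper and kick helper below are illustrative stand-ins:

    #include <stdbool.h>

    struct td_sketch { int nr_vcpus; };

    static bool tdh_mem_track_sketch(struct td_sketch *td) { (void)td; return true; }  /* bump TD epoch */
    static void kick_vcpu_sketch(struct td_sketch *td, int i) { (void)td; (void)i; }    /* force a TD exit */

    static void tdx_track_sketch(struct td_sketch *td)
    {
        /*
         * Called with mmu_lock held for write, so no extra synchronization
         * with concurrent trackers is needed.
         */
        if (!tdh_mem_track_sketch(td))
            return;
        /*
         * Kick vCPUs only after TDH.MEM.TRACK succeeded; the private EPT is
         * flushed when each vCPU re-enters the TD.
         */
        for (int i = 0; i < td->nr_vcpus; i++)
            kick_vcpu_sketch(td, i);
    }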

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073753.22228-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Isaku Yamahata
7d10ffb1ac KVM: TDX: Set per-VM shadow_mmio_value to 0
Set per-VM shadow_mmio_value to 0 for TDX.

With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly
configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set
to 0. This is necessary to override the default value of the suppress VE
bit in the SPTE, which is 1, and to ensure value 0 in RWX bits.

For MMIO SPTE, the spte value changes as follows:
1. initial value (suppress VE bit is set)
2. Guest issues MMIO and triggers EPT violation
3. KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
4. Guest MMIO resumes.  It triggers VE exception in guest TD
5. Guest VE handler issues TDG.VP.VMCALL<MMIO>
6. KVM handles MMIO
7. Guest VE handler resumes its execution after MMIO instruction

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073743.22214-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Isaku Yamahata
5a46fd48d8 KVM: x86/mmu: Add setter for shadow_mmio_value
Future changes will want to set shadow_mmio_value from TDX code.  Add a
setter helper with a name that makes more sense in that context.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[split into new patch]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073730.22200-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Isaku Yamahata
427a6486c5 KVM: TDX: Require TDP MMU, mmio caching and EPT A/D bits for TDX
Disable TDX support when TDP MMU or mmio caching or EPT A/D bits aren't
supported.

As the TDP MMU is now more mainstream than the legacy MMU, legacy MMU
support for TDX isn't implemented.

TDX requires KVM mmio caching. Without mmio caching, KVM will go to MMIO
emulation without installing SPTEs for MMIOs. However, TDX guest is
protected and KVM would meet errors when trying to emulate MMIOs for TDX
guest during instruction decoding. So, TDX guest relies on SPTEs being
installed for MMIOs, which are with no RWX bits and with VE suppress bit
unset, to inject VE to TDX guest. The TDX guest would then issue TDVMCALL
in the VE handler to perform instruction decoding and have host do MMIO
emulation.

TDX also relies on EPT A/D bits, as EPT A/D bits have been supported in all
CPUs since Haswell. Relying on them avoids RWX bits being masked out in
the mirror page table for prefaulted entries.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
Requested by Sean at [1].
[1] https://lore.kernel.org/kvm/Zva4aORxE9ljlMNe@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Isaku Yamahata
e0fbb3bbb5 KVM: TDX: Set gfn_direct_bits to shared bit
Make the direct root handle memslot GFNs at an alias with the TDX shared
bit set.

For TDX shared memory, the memslot GFNs need to be mapped at an alias with
the shared bit set. These shared mappings will be mapped on the KVM MMU's
"direct" root. The direct root has it's mappings shifted by applying
"gfn_direct_bits" as a mask. The concept of "GPAW" (guest physical address
width) determines the location of the shared bit. So set gfn_direct_bits
based on this, to map shared memory at the proper GPA.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073613.22100-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Sean Christopherson
87e3f45e80 KVM: TDX: Add load_mmu_pgd method for TDX
TDX uses two EPT pointers, one for the private half of the GPA space and
one for the shared half. The private half uses the normal EPT_POINTER vmcs
field, which is managed in a special way by the TDX module. For TDX, KVM is
not allowed to operate on it directly. The shared half uses a new
SHARED_EPT_POINTER field and will be managed by the conventional MMU
management operations that operate directly on the EPT root. This means for
TDX the .load_mmu_pgd() operation will need to know to use the
SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in
x86 ops for load_mmu_pgd() that either directs the write to the existing
vmx implementation or a TDX one.

tdx_load_mmu_pgd() is so much simpler than vmx_load_mmu_pgd() since for the
TDX mode of operation, EPT will always be used and KVM does not need to be
involved in virtualization of CR3 behavior. So tdx_load_mmu_pgd() can
simply write to SHARED_EPT_POINTER.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073601.22084-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Isaku Yamahata
fe1e6d483f KVM: TDX: Add accessors VMX VMCS helpers
TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS.  Introduce helper accessors to hide its SEAMCALL ABI details.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073551.22070-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Rick Edgecombe
3b725e972f KVM: VMX: Teach EPT violation helper about private mem
Teach EPT violation helper to check shared mask of a GPA to find out
whether the GPA is for private memory.

When EPT violation is triggered after TD accessing a private GPA, KVM will
exit to user space if the corresponding GFN's attribute is not private.
User space will then update GFN's attribute during its memory conversion
process. After that, TD will re-access the private GPA and trigger EPT
violation again.  Only when the GFN's attribute matches private will KVM
fault in the private page, map it in the mirrored TDP root, and propagate
the changes to the private EPT to resolve the EPT violation.

Relying on GFN's attribute tracking xarray to determine if a GFN is
private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
violations.
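
For illustration, the private/shared decision reduces to a single mask test; a
sketch, with the mask value depending on GPAW:

    #include <stdbool.h>
    #include <stdint.h>

    /* With TDX, a faulting GPA whose shared bit is clear targets private memory. */
    static bool faulting_gpa_is_private_sketch(uint64_t gpa, uint64_t shared_mask)
    {
        return shared_mask && !(gpa & shared_mask);
    }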

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073539.22056-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Sean Christopherson
c8563d1b69 KVM: VMX: Split out guts of EPT violation to common/exposed function
The difference for TDX EPT violations is how the information, GPA and exit
qualification, is retrieved.  To share the code that handles EPT violations,
split
out the guts of EPT violation handler so that VMX/TDX exit handler can call
it after retrieving GPA and exit qualification.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241112073528.22042-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Yan Zhao
6d15a641fd KVM: x86/mmu: Do not enable page track for TD guest
Fail kvm_page_track_write_tracking_enabled() if the VM type is TDX, so that
external page-track users fail in kvm_page_track_register_notifier(), since
TDX does not support write protection and hence page tracking.

There is no need to fail KVM's internal users of page tracking (i.e. for
shadow pages), because TDX always runs with EPT enabled and the current TDX
module does not emulate and send VMLAUNCH/VMRESUME VM-Exits to the VMM.
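
Roughly (the VM-type name and struct layout below are placeholders for this
sketch, not the upstream definitions), the change amounts to:

  /* Illustrative sketch: refuse external write tracking for TDX VMs. */
  #include <stdbool.h>

  enum vm_type { VM_TYPE_DEFAULT, VM_TYPE_TDX };   /* placeholder enum */

  struct kvm { enum vm_type vm_type; };

  static bool write_tracking_enabled(struct kvm *kvm)
  {
      if (kvm->vm_type == VM_TYPE_TDX)
          return false;   /* no write protection, hence no page track */

      return true;        /* other VM types keep working as before */
  }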

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Message-ID: <20241112073515.22028-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Isaku Yamahata
2608f10576 KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
Export a function to walk down the TDP without modifying it and simply
check if a GPA is mapped.

Future changes will support pre-populating TDX private memory. In order to
implement this KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT. [1]

There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.

Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.

kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.

Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073457.22011-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Rick Edgecombe
ae80c7d66c KVM: x86/mmu: Implement memslot deletion for TDX
Update the attr_filter field to zap both private and shared mappings for
TDX when a memslot is deleted.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073426.21997-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:52 -04:00
Zhiming Hu
7c035bea94 KVM: TDX: Register TDX host key IDs to cgroup misc controller
TDX host key IDs (HKIDs) are a limited resource on a machine, and the misc
cgroup lets the machine owner track their usage and limit the possibility
of them being abused outside the owner's control.

The cgroup v2 miscellaneous subsystem was introduced to control the
resource of AMD SEV & SEV-ES ASIDs.  Likewise introduce HKIDs as a misc
resource.

Signed-off-by: Zhiming Hu <zhiming.hu@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Xiaoyao Li
20d913729c KVM: x86/mmu: Taking guest pa into consideration when calculate tdp level
For TDX, the maxpa (CPUID.0x80000008.EAX[7:0]) is fixed as native and
the max_gpa (CPUID.0x80000008.EAX[23:16]) is configurable and used
to configure the EPT level and GPAW.

Use max_gpa to determine the TDP level.
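
A simplified sketch of the idea (the 48-bit threshold reflects the usual
4-level vs. 5-level EPT split and is an assumption here, not a quote of the
upstream helper):

  /* Illustrative sketch: derive the TDP level from the guest's GPA width. */
  static int tdp_level_for_gpa_bits(int guest_gpa_bits)
  {
      /* 4-level EPT covers a 48-bit GPA; anything wider needs 5 levels. */
      return guest_gpa_bits > 48 ? 5 : 4;
  }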

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Xiaoyao Li
488808e682 KVM: x86: Introduce KVM_TDX_GET_CPUID
Implement an IOCTL to allow userspace to read the CPUID bit values for a
configured TD.

The TDX module doesn't provide the ability to set all CPUID bits. Instead
some are configured indirectly, or have fixed values. But it does allow
for the final resulting CPUID bits to be read. This information will be
useful for userspace to understand the configuration of the TD, and set
KVM's copy via KVM_SET_CPUID2.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 - Fix subleaf mask check (Binbin)
 - Search all possible sub-leafs (Francesco Lavra)
 - Reduce off-by-one error sensitive code (Francesco, Xiaoyao)
 - Handle buffers too small from userspace (Xiaoyao)
 - Read max CPUID from TD instead of using fixed values (Xiaoyao)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Isaku Yamahata
a50f673f25 KVM: TDX: Do TDX specific vcpu initialization
A TD guest vcpu needs TDX specific initialization before running.  Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.
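
From userspace this is the familiar KVM_MEMORY_ENCRYPT_OP ioctl, just issued
on the vCPU file descriptor.  A hedged sketch follows; the command-struct
layout and the subcommand value are assumptions, so consult the uapi headers
(<linux/kvm.h>) rather than this snippet:

  /* Illustrative userspace sketch: vCPU-scoped TDX subcommand. */
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  struct tdx_cmd {          /* assumed layout, modeled on struct kvm_tdx_cmd */
      uint32_t id;          /* subcommand, e.g. KVM_TDX_INIT_VCPU */
      uint32_t flags;
      uint64_t data;        /* subcommand-specific payload */
      uint64_t hw_error;    /* TDX module error reported back, if any */
  };

  static int tdx_vcpu_op(int vcpu_fd, uint32_t subcommand)
  {
      struct tdx_cmd cmd = { .id = subcommand };

      /* Same ioctl as the VM-scoped case, but on the vCPU fd. */
      return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }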

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 - Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/
   (Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Isaku Yamahata
9002f8cf52 KVM: TDX: create/free TDX vcpu structure
Implement vcpu related stubs for TDX for create, reset and free.

For now, create only the features that do not require the TDX SEAMCALL.
The TDX specific vcpu initialization will be handled by KVM_TDX_INIT_VCPU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 - Use lapic_in_kernel() (Nikolay Borisov)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Isaku Yamahata
9934d7e529 KVM: TDX: Don't offline the last cpu of one package when there's TDX guest
Destroying a TDX guest requires at least one cpu to be online for each
package, because reclaiming the guest's TDX KeyID (as part of the
teardown process) requires calling some SEAMCALL (on any cpu) on all
packages.

Do not offline the last cpu of a package when any TDX guest is running,
otherwise KVM may not be able to tear down the TDX guest, resulting in
leaks of the TDX KeyID and other resources such as TDX guest control
structure pages.

Implement the TDX version of 'offline_cpu()' to prevent the cpu from going
offline if it is the last online cpu of its package.
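
Conceptually (a standalone sketch with made-up bookkeeping, not the kernel's
cpumask-based implementation), the guard refuses the offline request when
the CPU is the only online one in its package and TDX guests exist:

  /* Illustrative sketch of the "last online cpu in package" guard. */
  struct pkg_state {
      int online_cpus_in_package;   /* CPUs of this package still online */
      int nr_tdx_guests;            /* TDX guests currently in existence */
  };

  /* Return 0 to allow the offline, or a -EBUSY-style error to refuse it. */
  static int tdx_offline_cpu(const struct pkg_state *pkg)
  {
      if (pkg->nr_tdx_guests && pkg->online_cpus_in_package == 1)
          return -16;   /* -EBUSY: keep KeyID reclaim possible */

      return 0;
  }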

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Isaku Yamahata
ffb6fc8413 KVM: TDX: Make pmu_intel.c ignore guest TD case
TDX KVM doesn't support the PMU yet; it is future work for TDX KVM support
in another patch series.  For now, handle TDX by updating vcpu_to_lbr_desc()
and vcpu_to_lbr_records() to return NULL.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 - Add pragma poison for to_vmx() (Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Isaku Yamahata
0186dd29a2 KVM: TDX: add ioctl to initialize VM with TDX specific parameters
After the crypto-protection key has been configured, TDX requires a
VM-scope initialization as a step of creating the TDX guest.  This
"per-VM" TDX initialization does the global configurations/features that
the TDX guest can support, such as guest's CPUIDs (emulated by the TDX
module), the maximum number of vcpus etc.

Because there is no room in KVM_CREATE_VM to pass all the required
parameters, introduce a new ioctl KVM_TDX_INIT_VM and mark the VM as
TD_STATE_UNINITIALIZED until it is invoked.

This "per-VM" TDX initialization must be done before any "vcpu-scope" TDX
initialization; KVM_TDX_INIT_VM IOCTL must be invoked before the creation
of vCPUs.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:51 -04:00
Paolo Bonzini
a656dac800 KVM: x86: expose cpuid_entry2_find for TDX
CPUID values are provided for TDX virtual machines as part of the
KVM_TDX_INIT_VM ioctl.  Unlike KVM_SET_CPUID2, TDX will need to
examine the leaves, either to validate them against the CPUIDs listed
in the TDX module's configuration or to fill other controls with
matching values.

Since there is an existing function to look up a leaf/index pair
into a given list of CPUID entries, export it as kvm_find_cpuid_entry2().

Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
f94f4a97e6 KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check
Change to report the KVM_CAP_MAX_VCPUS extension per-VM instead of globally,
to allow userspace to query the maximum vCPUs for a TDX guest by checking
the KVM_CAP_MAX_VCPUS extension on a per-VM basis.

Today KVM x86 reports KVM_MAX_VCPUS as the maximum vCPUs for all guests
globally, and userspace, i.e. Qemu, queries the KVM_CAP_MAX_VCPUS extension
globally, not on a per-VM basis.

TDX has its own limit of maximum vCPUs it can support for all TDX guests
in addition to KVM_MAX_VCPUS.  TDX module reports this limit via the
MAX_VCPU_PER_TD global metadata.  Different modules may report different
values.  In practice, the reported value reflects the maximum logical
CPUs that ALL the platforms that the module supports can possibly have.

Note some old modules may also not support this metadata, in which case
the limit is U16_MAX.

The current way to always report KVM_MAX_VCPUS in the KVM_CAP_MAX_VCPUS
extension is not enough for TDX.  To accommodate TDX, change to report
the KVM_CAP_MAX_VCPUS extension on a per-VM basis.

Specifically, override kvm->max_vcpus in tdx_vm_init() for TDX guest,
and report kvm->max_vcpus in the KVM_CAP_MAX_VCPUS extension check.

Change to report "the number of logical CPUs the platform has" as the
maximum vCPUs for TDX guest.  Simply forwarding the MAX_VCPU_PER_TD
reported by the TDX module would result in an unpredictable ABI because
the value reported to userspace would depend on the whims of TDX
modules.

This works in practice because the MAX_VCPU_PER_TD reported by the
TDX module will never be smaller than the value reported to userspace.
But to make sure KVM never reports an unsupported value, sanity check
the MAX_VCPU_PER_TD reported by TDX module is not smaller than the
number of logical CPUs the platform has, otherwise refuse to use TDX.

Note, when creating a TDX guest, TDX actually requires the "maximum
vCPUs for _this_ TDX guest" as an input to initialize the TDX guest.
But TDX guest's maximum vCPUs is not part of TDREPORT thus not part of
attestation, thus there's no need to allow userspace to explicitly
_configure_ the maximum vCPUs on per-VM basis.  KVM will simply use
kvm->max_vcpus as input when initializing the TDX guest.
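
For userspace, the per-VM query is simply KVM_CHECK_EXTENSION issued on the
VM file descriptor instead of on /dev/kvm; a short sketch:

  /* Userspace sketch: query the per-VM vCPU limit rather than the global one. */
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int max_vcpus_for_vm(int vm_fd)
  {
      /* For a TDX VM this reflects kvm->max_vcpus as set in tdx_vm_init(). */
      return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);
  }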

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
8d032b683c KVM: TDX: create/destroy VM structure
Implement management of the TDX private KeyID and the create, destroy and
free operations for a TDX guest.

When creating a TDX guest, assign a TDX private KeyID for the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).

On destruction, free the allocated pages, and the KeyID.

Before tearing down the private page tables, TDX requires the guest TD to
be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops
hook. The TDR control structures can be freed in the vm_destroy() hook,
which runs last.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 - Fix build issue in kvm-coco-queue
 - Init ret earlier to fix __tdx_td_init() error handling. (Chao)
 - Standardize -EAGAIN for __tdx_td_init() retry errors (Rick)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
61bb282796 KVM: TDX: Get system-wide info about TDX module on initialization
TDX KVM needs system-wide information about the TDX module. Generate the
data based on tdx_sysinfo td_conf CPUID data.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
 - Clarify comment about EAX[23:16] in td_init_cpuid_entry2() (Xiaoyao)
 - Add comment for configurable CPUID bits (Xiaoyao)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
b2aaf38ced KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest-state-protected VMs.  It defines subcommands for technology-specific
operations.  Despite its name, the subcommands are not limited to memory
encryption; various technology-specific operations are defined.  It's
natural to repurpose KVM_MEMORY_ENCRYPT_OP for TDX specific operations and
define subcommands.

Add a placeholder function for the TDX specific VM-scoped ioctl as
mem_enc_op.  TDX specific sub-commands will be added to retrieve/pass TDX
specific parameters.  Make mem_enc_ioctl non-optional as it's always
filled.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 - Drop the misleading "defined for consistency" line. It's a copy-paste
   error introduced in the earlier patches. Earlier there was padding at
   the end to match struct kvm_sev_cmd size. (Tony)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
e4aa6f6961 KVM: TDX: Add helper functions to print TDX SEAMCALL error
Add helper functions to print out errors from the TDX module in a uniform
manner.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Sean Christopherson
1001d9886f KVM: TDX: Add TDX "architectural" error codes
Add error codes for the TDX SEAMCALLs, both for the TDX VMM side (TDH
SEAMCALLs) and for the TDX guest side (TDG.VP.VMCALL).  KVM issues the TDX
SEAMCALLs and checks their error codes.  KVM also handles hypercalls from
the TDX guest and may return an error, so error codes for the TDX guest are
needed as well.

A TDX SEAMCALL uses bits 31:0 to return additional information, so these
error codes will only exactly match RAX[63:32].  Error codes for
TDG.VP.VMCALL are defined by the TDX Guest-Host-Communication Interface
spec.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241030190039.77971-14-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
fcae3a3e7c KVM: TDX: Define TDX architectural definitions
Define architectural definitions for KVM to issue the TDX SEAMCALLs.

These are structures and values that are architecturally defined in the ABI
Reference chapter of the TDX module specifications.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 - Drop old duplicate defines, the x86 core exports what's needed (Kai)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Isaku Yamahata
09b3d3c17f KVM: TDX: Add placeholders for TDX VM/vCPU structures
Add TDX's own VM and vCPU structures as placeholders to manage and run
TDX guests.  Also add helper functions to check whether a VM/vCPU is a
TDX one or a normal VMX one, and add helpers to convert between TDX
VM/vCPU and KVM VM/vCPU.

TDX protects guest VMs from malicious host.  Unlike VMX guests, TDX
guests are crypto-protected.  KVM cannot access TDX guests' memory and
vCPU states directly.  Instead, TDX requires KVM to use a set of TDX
architecture-defined firmware APIs (a.k.a TDX module SEAMCALLs) to
manage and run TDX guests.

In fact, the ways to manage and run TDX guests and normal VMX guests are
quite different.  Because of that, the current structures
('struct kvm_vmx' and 'struct vcpu_vmx') to manage VMX guests are not
quite suitable for TDX guests.  E.g., the majority of the members of
'struct vcpu_vmx' don't apply to TDX guests.

Introduce TDX's own VM and vCPU structures ('struct kvm_tdx' and 'struct
vcpu_tdx' respectively) for KVM to manage and run TDX guests.  And
instead of building TDX's VM and vCPU structures based on VMX's, build
them directly based on 'struct kvm'.

As a result, TDX and VMX guests will have different VM size and vCPU
size/alignment.

Currently, kvm_arch_alloc_vm() uses 'kvm_x86_ops::vm_size' to allocate
enough space for the VM structure when creating guest.  With TDX guests,
ideally, KVM should allocate the VM structure based on the VM type so
that the precise size can be allocated for VMX and TDX guests.  But this
requires more extensive code change.  For now, simply choose the maximum
size of 'struct kvm_tdx' and 'struct kvm_vmx' for VM structure
allocation for both VMX and TDX guests.  This would result in small
memory waste for each VM which has smaller VM structure size but this is
acceptable.

For simplicity, use the same way for vCPU allocation too.  Otherwise KVM
would need to maintain a separate 'kvm_vcpu_cache' for each VM type.

Note, updating the 'vt_x86_ops::vm_size' needs to be done before calling
kvm_ops_update(), which copies vt_x86_ops to kvm_x86_ops.  However this
happens before TDX module initialization.  Therefore theoretically it is
possible that 'kvm_x86_ops::vm_size' is set to size of 'struct kvm_tdx'
(when it's larger) but TDX actually fails to initialize at a later time.

Again, the worst case of this is wasting a couple of bytes of memory for
each VM.  KVM could choose to update 'kvm_x86_ops::vm_size' at a later time
depending on TDX's status, but that would require the base KVM module to
export either kvm_x86_ops or kvm_ops_update().
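
The size choice itself is trivial; a minimal sketch (the structure sizes
below are arbitrary stand-ins):

  /* Illustrative sketch: one vm_size large enough for either VM type. */
  #include <stddef.h>

  struct kvm_vmx { char data[4096]; };   /* stand-in sizes only */
  struct kvm_tdx { char data[5120]; };

  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  static const size_t vt_vm_size = MAX(sizeof(struct kvm_vmx),
                                       sizeof(struct kvm_tdx));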

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 - Make to_kvm_tdx() and to_tdx() private to tdx.c (Francesco, Tony)
 - Add pragma poison for to_vmx() (Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Kai Huang
45154fb010 KVM: TDX: Get TDX global information
KVM will need to consult some essential TDX global information to create
and run TDX guests.  Get the global information after initializing TDX.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20241030190039.77971-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Kai Huang
fcdbdf6343 KVM: VMX: Initialize TDX during KVM module load
Before KVM can use TDX to create and run TDX guests, TDX needs to be
initialized from two perspectives: 1) TDX module must be initialized
properly to a working state; 2) A per-cpu TDX initialization, a.k.a the
TDH.SYS.LP.INIT SEAMCALL must be done on any logical cpu before it can
run any other TDX SEAMCALLs.

The TDX host core-kernel provides two functions to do the above two
respectively: tdx_enable() and tdx_cpu_enable().

There are two options in terms of when to initialize TDX: initialize TDX
at KVM module loading time, or when creating the first TDX guest.

Choose to initialize TDX during KVM module loading time:

Initializing the TDX module is both memory and CPU time consuming: 1) the
kernel needs to allocate a non-trivial amount (~1/256) of system memory
as metadata used by the TDX module to track each TDX-usable memory page's
status; 2) the TDX module needs to initialize this metadata, one entry
for each TDX-usable memory page.

Also, the kernel uses alloc_contig_pages() to allocate those metadata
chunks, because they are large and need to be physically contiguous, and
alloc_contig_pages() can fail.  If TDX were initialized when creating the
first TDX guest, there would be a chance that KVM couldn't run any TDX
guests even though KVM _declares_ itself able to support TDX.

This isn't good for the user.

On the other hand, initializing TDX at KVM module loading time can make
sure KVM is providing a consistent view of whether KVM can support TDX
to the user.

Always only try to initialize TDX after VMX has been initialized.  TDX
is based on VMX, and if VMX fails to initialize then TDX is likely to be
broken anyway.  Also, in practice, supporting TDX will require part of
VMX and common x86 infrastructure in working order, so TDX cannot be
enabled alone w/o VMX support.

There are two cases that can result in failure to initialize TDX: 1) TDX
cannot be supported (e.g., because TDX is not supported or enabled by the
hardware, the module is not loaded, or some dependency is missing in KVM's
configuration); 2) any unexpected error during TDX bring-up.  For the first
case, only mark TDX as disabled but still allow the KVM module to be
loaded.  For the second case, just fail to load the KVM module so that the
user is aware.

Because TDX costs additional memory, don't enable TDX by default.  Add a
new module parameter 'enable_tdx' to allow the user to opt-in.

Note, the name tdx_init() has already been taken by the early boot code.
Use tdx_bringup() for initializing TDX (and tdx_cleanup() since KVM
doesn't actually tear down TDX).  They don't match vt_init()/vt_exit(),
vmx_init()/vmx_exit() etc. but it's not the end of the world.

Also, once initialized, the TDX module cannot be disabled and enabled
again w/o the TDX module runtime update, which isn't supported by the
kernel.  After TDX is enabled, nothing needs to be done when KVM
disables hardware virtualization, e.g., when offlining CPU, or during
suspend/resume.  TDX host core-kernel code internally tracks TDX status
and can handle "multiple enabling" scenario.

Similar to KVM_AMD_SEV, add a new KVM_INTEL_TDX Kconfig to guide KVM TDX
code.  Make it depend on INTEL_TDX_HOST but not replace INTEL_TDX_HOST
because in the longer term there's a use case that requires making
SEAMCALLs w/o KVM as mentioned by Dan [1].

Link: https://lore.kernel.org/6723fc2070a96_60c3294dc@dwillia2-mobl3.amr.corp.intel.com.notmuch/ [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-ID: <162f9dee05c729203b9ad6688db1ca2960b4b502.1731664295.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Kai Huang
d6bee78137 KVM: VMX: Refactor VMX module init/exit functions
Add vt_init() and vt_exit() as the new module init/exit functions and
refactor existing vmx_init()/vmx_exit() as helper to make room for TDX
specific initialization and teardown.

To support TDX, KVM will need to enable TDX during KVM module loading
time.  Enabling TDX requires enabling hardware virtualization first so
that all online CPUs (and the new CPU going online) are in post-VMXON
state.

Currently, the vmx_init() flow is:

 1) hv_init_evmcs(),
 2) kvm_x86_vendor_init(),
 3) Other VMX specific initialization,
 4) kvm_init()

kvm_x86_vendor_init() invokes kvm_x86_init_ops::hardware_setup() to
do VMX-specific hardware setup and calls kvm_ops_update() to initialize
kvm_x86_ops to VMX's version.

TDX will have its own version for most of kvm_x86_ops callbacks.  It
would be nice if kvm_x86_init_ops::hardware_setup() could also be used
for TDX, but in practice it cannot.  The reason is, as mentioned above,
TDX initialization requires hardware virtualization to have been enabled,
which must happen after kvm_ops_update(), but hardware_setup() is done
before that.

Also, TDX is based on VMX, and it makes sense to only initialize TDX
after VMX has been initialized.  If VMX fails to initialize, TDX is
likely broken anyway.

So the new flow of KVM module init function will be:

 1) Current VMX initialization code in vmx_init() before kvm_init(),
 2) TDX initialization,
 3) kvm_init()

Split vmx_init() into two parts based on above 1) and 3) so that TDX
initialization can fit in between.  Make part 1) as the new helper
vmx_init().  Introduce vt_init() as the new module init function which
calls vmx_init() and kvm_init().  TDX initialization will be added
later.

Do the same thing for vmx_exit()/vt_exit().
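
A hedged sketch of where the flow is headed (stubbed so it stands alone; the
TDX step is what later patches slot in, and the real vt_init() also updates
vt_x86_ops and passes the vCPU size/alignment to kvm_init()):

  /* Illustrative sketch of the split: vmx_init() + TDX init + kvm_init(). */
  static int vmx_init(void)      { return 0; }  /* pre-kvm_init() VMX setup */
  static int tdx_bringup(void)   { return 0; }  /* TDX init, added later */
  static void tdx_cleanup(void)  { }
  static void vmx_exit(void)     { }
  static int kvm_init_stub(void) { return 0; }  /* stands in for kvm_init() */

  static int vt_init(void)
  {
      int r;

      r = vmx_init();          /* 1) current VMX init before kvm_init() */
      if (r)
          return r;

      r = tdx_bringup();       /* 2) TDX initialization */
      if (r)
          goto err_vmx;

      r = kvm_init_stub();     /* 3) kvm_init() */
      if (r)
          goto err_tdx;
      return 0;

  err_tdx:
      tdx_cleanup();
  err_vmx:
      vmx_exit();
      return r;
  }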

Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-ID: <3f23f24098bdcf42e213798893ffff7cdc7103be.1731664295.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:50 -04:00
Paolo Bonzini
74c1807f6c KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest state,
since the register values cannot actually be written.  Return 0
when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly
return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero.

However, on exit from KVM_RUN userspace could have placed a nonzero
value into kvm_run->kvm_valid_regs, so check guest_state_protected
again and skip store_regs() in that case.
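
Sketched from the userspace-visible side (constants and placement are taken
from the description above, simplified into standalone form):

  /* Illustrative sketch of the new behaviour for protected guests. */
  #include <stdbool.h>
  #include <stdint.h>

  struct run_area { uint64_t kvm_valid_regs, kvm_dirty_regs; };

  /* KVM_CHECK_EXTENSION(KVM_CAP_SYNC_REGS) on the VM: 0 if protected. */
  static int sync_regs_cap(bool guest_state_protected, int full_cap)
  {
      return guest_state_protected ? 0 : full_cap;
  }

  /* KVM_RUN: reject nonzero valid/dirty fields for protected guests. */
  static int pre_kvm_run(bool guest_state_protected, struct run_area *run)
  {
      if (guest_state_protected &&
          (run->kvm_valid_regs || run->kvm_dirty_regs))
          return -22;          /* -EINVAL */

      return 0;
  }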

Cc: stable@vger.kernel.org
Fixes: 517987e3fb ("KVM: x86: add fields to struct kvm_arch for CoCo features")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250306202923.646075-1-pbonzini@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:03 -04:00
Isaku Yamahata
adafea1106 KVM: x86: Add infrastructure for secure TSC
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit
changing TSC offset/multiplier when guest_tsc_protected is true.

X86 confidential computing technology defines protected guest TSC so that
the VMM can't change the TSC offset/multiplier once vCPU is initialized.
SEV-SNP defines Secure TSC as optional, whereas TDX mandates it.

KVM has common logic on x86 that tries to guess or adjust TSC
offset/multiplier for better guest TSC and TSC interrupt latency
at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration
over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes
(kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR
(kvm_set_msr_common()).

The current x86 KVM implementation conflicts with protected TSC because the
VMM can't change the TSC offset/multiplier.
Because KVM emulates the TSC timer or the TSC deadline timer with the TSC
offset/multiplier, the TSC timer interrupt is injected into the guest at the
wrong time if the KVM TSC offset is different from what the TDX module
determined.

Originally this issue was found by the cyclictest of rt-tests [1], as the
latency in the TDX case was worse than the VMX value + TDX SEAMCALL
overhead.  It turned out that the KVM TSC offset was different from what
the TDX module determines.

Disable or ignore the KVM logic to change/adjust the TSC offset/multiplier
somehow, thus keeping the KVM TSC offset/multiplier the same as the
value of the TDX module.  Writes to MSR_IA32_TSC are also blocked as
they amount to a change in the TSC offset.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
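
The guard itself is a one-line early return in the common TSC write paths;
a hedged sketch with shortened names:

  /* Illustrative sketch: refuse TSC offset changes once the TSC is protected. */
  #include <stdbool.h>
  #include <stdint.h>

  struct vcpu { bool guest_tsc_protected; uint64_t tsc_offset; };

  static void set_tsc_offset(struct vcpu *vcpu, uint64_t offset)
  {
      if (vcpu->guest_tsc_protected)
          return;              /* keep whatever the TDX module determined */

      vcpu->tsc_offset = offset;
  }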

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 13:55:44 -04:00
Isaku Yamahata
5f3b30b2b0 KVM: x86: Push down setting vcpu.arch.user_set_tsc
Push down setting vcpu.arch.user_set_tsc to true from kvm_synchronize_tsc()
to __kvm_synchronize_tsc() as preparation, so that the two callers don't
have to modify user_set_tsc directly.

Later, changing TSC synchronization for TDX guests will be prohibited by
modifying __kvm_synchronize_tsc(); we don't want to touch the caller sites
in order not to change user_set_tsc.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 13:55:32 -04:00
Paolo Bonzini
46c49372e1 KVM: x86: move vm_destroy callback at end of kvm_arch_destroy_vm
TDX needs to free the TDR control structures last, after all paging structures
have been torn down; move the vm_destroy callback to a suitable place.
The new place is also okay for AMD; the main difference is that the
MMU has been torn down and, if anything, that is better done before
the SNP ASID is released.

Extracted from a patch by Yan Zhao.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 13:45:36 -04:00
Paolo Bonzini
ea9bd29a9c KVM x86 fixes for 6.14-rcN #2
- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
 
  - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
    the host's value, which can lead to unexpected bus lock #DBs.
 
  - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
    emulate BTF.  KVM's lack of context switching has meant BTF has always been
    broken to some extent.
 
  - Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
    can enable DebugSwap without KVM's knowledge.
 
  - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
    memory" phase without actually generating a write-protection fault.
 
  - Fix a printf() goof in the SEV smoke test that causes build failures with
    -Werror.
 
  - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
    isn't supported by KVM.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfLlhUACgkQOlYIJqCj
 N/0x7w/+MhqJdHbshL7Gzw+rcXwCROiCkqsxFP+YoTXte8uaHS5CEfcMYjE8SuGp
 KBpgLo4Lj1dVTXiCjemlY5sn6CDiuSs74X8A88ksuu5hVsFByJUgyWU9iw8J/crZ
 B2vj8huhqa8OCEPe5JujWfnfyAkKE5tUA4GFi73vhHMcftTNj+ftxT33/Pfg7y7M
 xOvWFWS6ZshKrouRKzI7ZFEYLwp0lr4U3dzO5rCRAd5J4MSBWRx6Dx2um5dyEYKJ
 xgwl4ylM4S/+78u1+0nQnToM0UWHJ3e7x8nze6UXYTZIrBr/lSeKlbhOPnEWJcJB
 Eemnur9ORI2BRPUReqBKluCZsSK+E5B/HPCVt5cxtuRIuUOD+kW17LPgnPyE4Sso
 eVt+XAvQc7EjrpWDSHr3ZQZZM89l9zHhuSAQ0npO6y71s0FzEVZQoDamNmOLAPjH
 Qg+qhBV2l6pyfqhqiLzADasYLOl57cJsfiMjM331ALLqAn57jzd+B8c4hdB2Xg4s
 KPuy8w8uBaY9zpd9YDBLLr7JJVs35KexNZMjT2vqBYXcScyLgmAuSQXy3hub6Mzn
 gI5ZXIKG8eO9v2jejfClI6/OEdtEwgSGEVwuBKB16pMrIxqpguMTMTWLVRn5G+oo
 qA8anmKaac62GaB66JE/Wjy069OPIGYnHSU2nal0Tej6kG0xv6E=
 =as6u
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.14-rcN #2

 - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.

 - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
   the host's value, which can lead to unexpected bus lock #DBs.

 - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
   emulate BTF.  KVM's lack of context switching has meant BTF has always been
   broken to some extent.

 - Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
   can enable DebugSwap without KVM's knowledge.

 - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
   memory" phase without actually generating a write-protection fault.

 - Fix a printf() goof in the SEV smoke test that causes build failures with
   -Werror.

 - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
   isn't supported by KVM.
2025-03-09 03:44:06 -04:00
Xiaoyao Li
e6c8728a8e KVM: x86: Remove the unreachable case for 0x80000022 leaf in __do_cpuid_func()
Remove dead/unreachable (and misguided) code in KVM's processing of
0x80000022.  The case statement breaks early if PERFMON_V2 isn't supported,
i.e. kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2) must be true when KVM reaches
the code to set up EBX.

Note, early versions of the patch that became commit 94cdeebd82 ("KVM:
x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") didn't break
early on lack of PERFMON_V2 support, and instead enumerated the effective
number of counters KVM could emulate.  All of that code was flawed, e.g.
the APM explicitly states EBX is valid only for v2.

  Performance Monitoring Version 2 supported. When set,
  CPUID_Fn8000_0022_EBX reports the number of available performance counters.

When the flaw of not respecting v2 support was addressed, the misguided
stuffing of the number of counters got left behind.

Link: https://lore.kernel.org/all/20220919093453.71737-4-likexu@tencent.com
Fixes: 94cdeebd82 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250304082314.472202-2-xiaoyao.li@intel.com
[sean: elaborate on the situation a bit more, add Fixes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-04 09:19:49 -08:00
Xiaoyao Li
f9dc8fb3af KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVM
Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX
instead of zeroing both when PERFMON_V2 isn't supported by KVM.  In
practice, barring a buggy CPU (or vCPU model when running nested) only the
!enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's
available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2
is unsupported.

For the !enable_pmu case, the bug is relatively benign as KVM will refuse
to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID
into the guest could inadvertently induce #GPs in the guest due to
advertising support for MSRs that KVM refuses to emulate.

Fixes: 94cdeebd82 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com
[sean: massage shortlog and changelog, tag for stable]
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-04 09:19:18 -08:00
Josh Poimboeuf
e1c49eaee5 KVM: VMX: Use named operands in inline asm
Convert the non-asm-goto version of the inline asm in __vmcs_readl() to
use named operands, similar to its asm-goto version.

Do this in preparation of changing the ASM_CALL_CONSTRAINT primitive.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: linux-kernel@vger.kernel.org
2025-03-04 11:21:39 +01:00
Sean Christopherson
0c3566b63d KVM: VMX: Extract checks on entry/exit control pairs to a helper macro
Extract the checking of entry/exit pairs to a helper macro so that the
code can be reused to process the upcoming "secondary" exit controls (the
primary exit controls field is out of bits).  Use a macro instead of a
function to support different sized variables (all secondary exit controls
will be optional and so the MSR doesn't have the fixed-0/fixed-1 split).
Taking the largest size as input is trivial, but handling the modification
of KVM's to-be-used controls is much trickier, e.g. would require bitmap
games to clear bits from a 32-bit bitmap vs. a 64-bit bitmap.

Opportunistically add sanity checks to ensure the size of the controls
match (yay, macro!), e.g. to detect bugs where KVM passes in the pairs for
primary exit controls, but its variable for the secondary exit controls.

To help users triage mismatches, print the control bits that are checked,
not just the actual value.  For the foreseeable future, that provides
enough information for a user to determine which fields mismatched.  E.g.
until secondary entry controls comes along, all entry bits and thus all
error messages are guaranteed to be unique.

To avoid returning from a macro, which can get quite dangerous, simply
process all pairs even if error_on_inconsistent_vmcs_config is set.  The
speed at which KVM rejects module load is not at all interesting.

Keep the error message a "once" printk, even though it would be nice to
print out all mismatching pairs.  In practice, the most likely scenario is
that a single pair will mismatch on all CPUs.  Printing all mismatches
generates redundant messages in that situation, and can be extremely noisy
on systems with large numbers of CPUs.  If a CPU has multiple mismatches,
not printing every bad pair is the least of the user's concerns.
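
A standalone sketch of the macro shape being described (simplified: the real
macro works on KVM's vmcs_config fields, uses pr_warn_once(), and also
sanity-checks that the sizes of the control variables match; clearing the
mismatching bits below is a simplification for this sketch):

  /* Illustrative sketch: validate that entry/exit control bits pair up. */
  #include <stdio.h>

  struct ctl_pair { unsigned int entry_bit, exit_bit; const char *name; };

  /* A macro, not a function, so 32-bit and 64-bit control words both work. */
  #define check_entry_exit_pairs(pairs, n, entry_ctls, exit_ctls)         \
  do {                                                                    \
      for (int i_ = 0; i_ < (n); i_++) {                                  \
          int e_ = !!(*(entry_ctls) & (pairs)[i_].entry_bit);             \
          int x_ = !!(*(exit_ctls)  & (pairs)[i_].exit_bit);              \
                                                                          \
          if (e_ == x_)                                                   \
              continue;                                                   \
                                                                          \
          printf("Inconsistent %s: entry bit 0x%x, exit bit 0x%x\n",      \
                 (pairs)[i_].name, (pairs)[i_].entry_bit,                 \
                 (pairs)[i_].exit_bit);                                   \
          *(entry_ctls) &= ~(pairs)[i_].entry_bit;                        \
          *(exit_ctls)  &= ~(pairs)[i_].exit_bit;                         \
      }                                                                   \
  } while (0)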

Cc: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250227005353.3216123-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:45:54 -08:00
Sean Christopherson
4e96f010af KVM: SVM: Invalidate "next" SNP VMSA GPA even on failure
When processing an SNP AP Creation event, invalidate the "next" VMSA GPA
even if acquiring the page/pfn for the new VMSA fails.  In practice, the
next GPA will never be used regardless of whether or not it's invalidated,
as the entire flow is guarded by snp_ap_waiting_for_reset, and said guard
and snp_vmsa_gpa are always written as a pair.  But that's really hard to
see in the code.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:34:56 -08:00
Sean Christopherson
5279d6f7e4 KVM: SVM: Use guard(mutex) to simplify SNP vCPU state updates
Use guard(mutex) in sev_snp_init_protected_guest_state() and pull in its
lock-protected inner helper.  Without an unlock trampoline (and even with
one), there is no real need for an inner helper.  Eliminating the helper
also avoids having to fixup the open coded "lockdep" WARN_ON().

Opportunistically drop the error message if KVM can't obtain the pfn for
the new target VMSA.  The error message provides zero information that
can't be gleaned from the fact that the vCPU is stuck.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:34:55 -08:00
Sean Christopherson
e268beee4a KVM: SVM: Mark VMCB dirty before processing incoming snp_vmsa_gpa
Mark the VMCB dirty, i.e. zero control.clean, prior to handling the new
VMSA.  Nothing in the VALID_PAGE() case touches control.clean, and
isolating the VALID_PAGE() code will allow simplifying the overall logic.

Note, the VMCB probably doesn't need to be marked dirty when the VMSA is
invalid, as KVM will disallow running the vCPU in such a state.  But it
also doesn't hurt anything.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:34:54 -08:00
Sean Christopherson
46332437e1 KVM: SVM: Use guard(mutex) to simplify SNP AP Creation error handling
Use guard(mutex) in sev_snp_ap_creation() and modify the error paths to
return directly instead of jumping to a common exit point.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:34:53 -08:00
Sean Christopherson
c6e129fb2a KVM: SVM: Simplify request+kick logic in SNP AP Creation handling
Drop the local "kick" variable and the unnecessary "fallthrough" logic
from sev_snp_ap_creation(), and simply pivot on the request when deciding
whether or not to immediately force a state update on the target vCPU.

No functional change intended.

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:34:52 -08:00
Sean Christopherson
745ff82199 KVM: SVM: Require AP's "requested" SEV_FEATURES to match KVM's view
When handling an "AP Create" event, return an error if the "requested" SEV
features for the vCPU don't exactly match KVM's view of the VM-scoped
features.  There is no known use case for heterogeneous SEV features across
vCPUs, and while KVM can't actually enforce an exact match since the value
in RAX isn't guaranteed to match what the guest shoved into the VMSA, KVM
can at least avoid knowingly letting the guest run in an unsupported state.

E.g. if a VM is created with DebugSwap disabled, KVM will intercept #DBs
and DRs for all vCPUs, even if an AP is "created" with DebugSwap enabled in
its VMSA.

Note, the GHCB spec only "requires" that "AP use the same interrupt
injection mechanism as the BSP", but given the disaster that is DebugSwap
and SEV_FEATURES in general, it's safe to say that AMD didn't consider all
possible complications with mismatching features between the BSP and APs.

Opportunistically fold the check into the relevant request flavors; the
"request < AP_DESTROY" check is just a bizarre way of implementing the
AP_CREATE_ON_INIT => AP_CREATE fallthrough.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:34:50 -08:00
Sean Christopherson
d26638bfcd KVM: SVM: Don't change target vCPU state on AP Creation VMGEXIT error
If KVM rejects an AP Creation event, leave the target vCPU state as-is.
Nothing in the GHCB suggests the hypervisor is *allowed* to muck with vCPU
state on failure, let alone required to do so.  Furthermore, kicking only
in the !ON_INIT case leads to divergent behavior, and even the "kick" case
is non-deterministic.

E.g. if an ON_INIT request fails, the guest can successfully retry if the
fixed AP Creation request is made prior to sending INIT.  And if a !ON_INIT
fails, the guest can successfully retry if the fixed AP Creation request is
handled before the target vCPU processes KVM's
KVM_REQ_UPDATE_PROTECTED_GUEST_STATE.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:33:44 -08:00
Sean Christopherson
72d12715ed KVM: SVM: Refuse to attempt VRMUN if an SEV-ES+ guest has an invalid VMSA
Explicitly reject KVM_RUN with KVM_EXIT_FAIL_ENTRY if userspace "coerces"
KVM into running an SEV-ES+ guest with an invalid VMSA, e.g. by modifying
a vCPU's mp_state to be RUNNABLE after an SNP vCPU has undergone a Destroy
event.  On Destroy or failed Create, KVM marks the vCPU HALTED so that
*KVM* doesn't run the vCPU, but nothing prevents a misbehaving VMM from
manually making the vCPU RUNNABLE via KVM_SET_MP_STATE.

Attempting VMRUN with an invalid VMSA should be harmless, but knowingly
executing VMRUN with bad control state is at best dodgy.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:33:43 -08:00
Sean Christopherson
807cb9ce2e KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3
Never rely on the CPU to restore/load host DR0..DR3 values, even if the
CPU supports DebugSwap, as there are no guarantees that SNP guests will
actually enable DebugSwap on APs.  E.g. if KVM were to rely on the CPU to
load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would
run with clobbered-to-zero DRs if an SNP guest created APs without
DebugSwap enabled.

Update the comment to explain the dangers, and hopefully prevent breaking
KVM in the future.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:26:39 -08:00
Sean Christopherson
b2653cd3b7 KVM: SVM: Save host DR masks on CPUs with DebugSwap
When running SEV-SNP guests on a CPU that supports DebugSwap, always save
the host's DR0..DR3 mask MSR values irrespective of whether or not
DebugSwap is enabled, to ensure the host values aren't clobbered by the
CPU.  And for now, also save DR0..DR3, even though doing so isn't
necessary (see below).

SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to
create a VMSA with guest-controlled SEV_FEATURES.  A well behaved guest
can inform the hypervisor, i.e. KVM, of its "requested" features, but on
CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from
lying about which SEV features are being enabled (or not!).

If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the
CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the
MSRs with '0' if KVM doesn't save its desired value.

Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM
restores all DRs in common x86 code as needed via hw_breakpoint_restore().
I.e. there is no risk of host DR0..DR3 being clobbered (when it matters).
However, there is a flaw in the opposite direction; because the guest can
lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's
knowledge, KVM must not rely on the CPU to restore DRs.  Defer fixing
that wart, as it's more of a documentation issue than a bug in the code.

Note, KVM added support for DebugSwap on commit d1f85fbe83 ("KVM: SEV:
Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes,
as the underlying flaw exists in hardware, not in KVM.  I.e. all kernels
that support SEV-SNP need to be patched, not just kernels with KVM's full
support for DebugSwap (ignoring that DebugSwap support landed first).

Opportunistically fix an incorrect statement in the comment; on CPUs
without DebugSwap, the CPU does NOT save or load debug registers, i.e.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Cc: stable@vger.kernel.org
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:26:39 -08:00
Keith Busch
916b7f42b3 kvm: retry nx_huge_page_recovery_thread creation
A VMM may send a non-fatal signal to its threads, including vCPU tasks,
at any time, and thus may signal vCPU tasks during KVM_RUN.  If a vCPU
task receives the signal while it's trying to spawn the huge page recovery
vhost task, then KVM_RUN will fail due to copy_process() returning
-ERESTARTNOINTR.

Rework call_once() to mark the call complete if and only if the called
function succeeds, and plumb the function's true error code back to the
call_once() invoker.  This provides userspace with the correct, non-fatal
error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows
subsequent KVM_RUN calls to succeed by virtue of retrying creation of the NX
huge
page task.
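
The described semantics fit in a few lines; a self-contained sketch that
mirrors the idea (not the kernel's call_once()/struct once implementation):

  /* Illustrative sketch: only mark the call "done" if the callback succeeds. */
  #include <pthread.h>

  struct once {
      pthread_mutex_t lock;
      int done;
  };

  static int call_once(struct once *once, int (*fn)(void))
  {
      int r = 0;

      pthread_mutex_lock(&once->lock);
      if (!once->done) {
          r = fn();
          if (!r)
              once->done = 1;   /* success: never call fn again */
          /* on failure, leave "done" clear so a later call can retry */
      }
      pthread_mutex_unlock(&once->lock);

      return r;                 /* propagate fn()'s error to the invoker */
  }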

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[implemented the kvm user side]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-3-kbusch@meta.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01 02:54:18 -05:00
Keith Busch
cb380909ae vhost: return task creation error instead of NULL
Lets callers distinguish why the vhost task creation failed. No one
currently cares why it failed, so no real runtime change from this
patch, but that will not be the case for long.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-2-kbusch@meta.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01 02:52:52 -05:00
Sean Christopherson
2a289aed3f KVM: x86: Always set mp_state to RUNNABLE on wakeup from HLT
When emulating HLT and a wake event is already pending, explicitly mark
the vCPU RUNNABLE (via kvm_set_mp_state()) instead of assuming the vCPU is
already in the appropriate state.  Barring a KVM bug, it should be
impossible for the vCPU to be in a non-RUNNABLE state, but there is no
advantage to relying on that to hold true, and ensuring the vCPU is made
RUNNABLE avoids non-deterministic behavior with respect to pv_unhalted.

E.g. if the vCPU is not already RUNNABLE, then depending on when
pv_unhalted is set, KVM could either leave the vCPU in the non-RUNNABLE
state (set before __kvm_emulate_halt()), or transition the vCPU to HALTED
and then RUNNABLE (pv_unhalted set after the kvm_vcpu_has_events() check).

Link: https://lore.kernel.org/r/20250224174156.2362059-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 15:43:18 -08:00
Sean Christopherson
189ecdb3e1 KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle
debugctl bits from IRQ context, e.g. when enabling/disabling events via
smp_call_function_single().  Taking the snapshot (long) before IRQs are
disabled could result in KVM effectively clobbering DEBUGCTL due to using
a stale snapshot.

Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
433265870a KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabled
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's
value on #VMEXIT) if it diverges from the host's value and LBR
virtualization is disabled, as hardware only context switches DEBUGCTL if
LBR virtualization is fully enabled.  Running the guest with the host's
value has likely been mildly problematic for quite some time, e.g. it will
result in undesirable behavior if BTF diverges (with the caveat that KVM
now suppresses guest BTF due to lack of support).

But the bug became fatal with the introduction of Bus Lock Trap ("Detect"
in kernel parlance) support for AMD (commit 408eb7417a
("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will
trigger an unexpected #DB.

Note, suppressing the bus lock #DB, i.e. simply resuming the guest without
injecting a #DB, is not an option.  It wouldn't address the general issue
with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible
side effects if BusLockTrap is left enabled.

If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to
clear it by software are ignored.  But if BusLockTrap is enabled, software
can clear DR6.BLD:

  Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2)
  to 1.  When bus lock trap is enabled, ... The processor indicates that
  this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11).  DR6[11]
  previously had been defined to be always 1.

and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by
other #DBs:

  All other #DB exceptions leave DR6[BLD] unmodified

E.g. leaving BusLockTrap enabled can confuse a legacy guest that writes '0'
to reset DR6.

Reported-by: rangemachine@gmail.com
Reported-by: whanos@sergal.fun
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787
Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
fb71c79593 KVM: x86: Snapshot the host's DEBUGCTL in common x86
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in
common x86, so that SVM can also use the snapshot.

Opportunistically change the field to a u64.  While bits 63:32 are reserved
on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned
long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
d0eac42f5c KVM: SVM: Suppress DEBUGCTL.BTF on AMD
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support
BTF, and fully enabling BTF virtualization is non-trivial due to
interactions with the emulator, guest_debug, #DB interception, nested SVM,
etc.

Don't inject #GP if the guest attempts to set BTF, as there's no way to
communicate lack of support to the guest, and instead suppress the flag
and treat the WRMSR as (partially) unsupported.

In short, make KVM behave the same on AMD and Intel (VMX already squashes
BTF).

Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF
has "worked" in any capacity is if the guest simultaneously enables LBRs.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
ee89e80133 KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective value
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed
the architectural behavior of the bits and broke backwards compatibility.
On CPUs without BusLockTrap (or at least, in APMs from before ~2023),
bits 5:2 controlled the behavior of external pins:

  Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write.
  Software uses these bits to control the type of information reported by
  the four external performance-monitoring/breakpoint pins on the
  processor. When a PBi bit is cleared to 0, the corresponding external pin
  (BPi) reports performance-monitor information. When a PBi bit is set to
  1, the corresponding external pin (BPi) reports breakpoint information.

With the introduction of BusLockTrap, presumably to be compatible with
Intel CPUs, AMD redefined bit 2 to be BLCKDB:

  Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to
  enable generation of a #DB trap following successful execution of a bus
  lock when CPL is > 0.

and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ".

Ideally, KVM would treat bits 5:2 as reserved.  Defer that change to a
feature cleanup to avoid breaking existing guests in LTS kernels.  For now,
drop the bits to retain backwards compatibility (of a sort).

Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest
is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits
is visible to the guest, whereas now the guest will always see '0'.
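
As a sketch (hypothetical helper name), the effective value handed to
hardware simply masks off the legacy pin-control bits:

  #include <stdint.h>

  #define DEBUGCTL_PB_BITS   (0xfull << 2)   /* legacy PB3..PB0, bits 5:2 */

  /* Drop bits 5:2 from the guest's written value before loading hardware. */
  static uint64_t debugctl_effective(uint64_t guest_val)
  {
          return guest_val & ~DEBUGCTL_PB_BITS;
  }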

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
d4b69c3d14 KVM: SVM: Inject #GP if memory operand for INVPCID is non-canonical
Inject a #GP if the memory operand received by INVPCID is non-canonical.
The APM clearly states that the intercept takes priority over all #GP
checks except the CPL0 restriction.

Of course, that begs the question of how the CPU generates a linear
address in the first place.  Tracing confirms that EXITINFO1 does hold a
linear address, at least for 64-bit mode guests (hooray GS prefix).
Unfortunately, the APM says absolutely nothing about the EXITINFO fields
for INVPCID intercepts, so it's not at all clear what's supposed to
happen.

Add a FIXME to call out that KVM still does the wrong thing for 32-bit
guests, and if the stack segment is used for the memory operand.
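
For reference, a canonicality check of the form used for such #GP checks,
sketched for 48-bit linear addresses (an assumption; the real check must
honor the vCPU's virtual address width, e.g. LA57):

  #include <stdbool.h>
  #include <stdint.h>

  /* A 48-bit canonical address sign-extends bit 47 through bits 63:48. */
  static bool is_noncanonical_48(uint64_t la)
  {
          return (uint64_t)(((int64_t)la << 16) >> 16) != la;
  }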

Cc: Babu Moger <babu.moger@amd.com>
Cc: Jim Mattson <jmattson@google.com>
Fixes: 4407a797e9 ("KVM: SVM: Enable INVPCID feature on AMD")
Link: https://lore.kernel.org/r/20250224174522.2363400-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:16:21 -08:00
Sean Christopherson
64c947a1cf KVM: VMX: Reject KVM_RUN if userspace forces emulation during nested VM-Enter
Extend KVM's restrictions on userspace forcing "emulation required" at odd
times to cover stuffing invalid guest state while a nested run is pending.
Clobbering guest state while KVM is in the middle of emulating VM-Enter is
nonsensical, as it puts the vCPU into an architecturally impossible state,
and will trip KVM's sanity check that guards against KVM bugs, e.g. helps
detect missed VMX consistency checks.

  WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline]
  WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637
  Modules linked in:
  CPU: 3 UID: 0 PID: 6336 Comm: syz.0.73 Not tainted 6.13.0-rc1-syzkaller-00316-gb5f217084ab3 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  RIP: 0010:__vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline]
  RIP: 0010:vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637
   <TASK>
   vcpu_enter_guest arch/x86/kvm/x86.c:11081 [inline]
   vcpu_run+0x3047/0x4f50 arch/x86/kvm/x86.c:11242
   kvm_arch_vcpu_ioctl_run+0x44a/0x1740 arch/x86/kvm/x86.c:11560
   kvm_vcpu_ioctl+0x6ce/0x1520 virt/kvm/kvm_main.c:4340
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:906 [inline]
   __se_sys_ioctl fs/ioctl.c:892 [inline]
   __x64_sys_ioctl+0x190/0x200 fs/ioctl.c:892
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

Reported-by: syzbot+ac0bc3a70282b4d586cc@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67598fb9.050a0220.17f54a.003b.GAE@google.com
Debugged-by: James Houghton <jthoughton@google.com>
Tested-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250224171409.2348647-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:15:47 -08:00
Sean Christopherson
be45bc4eff KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadow
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common
svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff()
so that VMRUN is not executed in an STI shadow.  AMD CPUs have a quirk
(some would say "bug"), where the STI shadow bleeds into the guest's
intr_state field if a #VMEXIT occurs during injection of an event, i.e. if
the VMRUN doesn't complete before the subsequent #VMEXIT.

The spurious "interrupts masked" state is relatively benign, as it only
occurs during event injection and is transient.  Because KVM is already
injecting an event, the guest can't be in HLT, and if KVM is querying IRQ
blocking for injection, then KVM would need to force an immediate exit
anyways since injecting multiple events is impossible.

However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the
spurious STI shadow is visible to L1 when running a nested VM, which can
trip sanity checks, e.g. in VMware's VMM.

Hoist the STI+CLI all the way to C code, as the aforementioned calls to
guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are
enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already
possible.  I.e. if there's kernel code that is confused by running with
RFLAGS.IF=1, then it's already a problem.  In practice, since GIF=0 also
blocks NMIs, the only change in exposure to non-KVM code (relative to
surrounding VMRUN with STI+CLI) is exception handling code, and except for
the kvm_rebooting=1 case, all exceptions in the core VM-Enter/VM-Exit path
are fatal.

Use the "raw" variants to enable/disable IRQs to avoid tracing in the
"no instrumentation" code; the guest state helpers also take care of
tracing IRQ state.

Opportunistically document why KVM needs to do STI in the first place.
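
The resulting ordering, sketched with stand-in stubs for the real kernel
helpers (an illustration of the flow, not the actual svm_vcpu_enter_exit()):

  static void irq_enable_raw(void)  { /* stand-in for raw_local_irq_enable()  */ }
  static void irq_disable_raw(void) { /* stand-in for raw_local_irq_disable() */ }
  static void vmrun_asm_stub(void)  { /* stand-in for the VMRUN assembly stub */ }

  static void vcpu_enter_exit_sketch(void)
  {
          irq_enable_raw();    /* RFLAGS.IF = 1 in C, so no STI shadow spans VMRUN */
          vmrun_asm_stub();    /* GIF manipulation and VMRUN stay in assembly      */
          irq_disable_raw();   /* RFLAGS.IF = 0 again before returning             */
  }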

Reported-by: Doug Covelli <doug.covelli@broadcom.com>
Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com
Fixes: f14eec0a32 ("KVM: SVM: move more vmentry code to assembly")
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:15:23 -08:00
Nikolay Borisov
0dab791f05 KVM: x86/tdp_mmu: Remove tdp_mmu_for_each_pte()
That macro acts as a different name for for_each_tdp_pte; apart from
adding cognitive load, it doesn't bring any value. Let's remove it.

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250226074131.312565-1-nik.borisov@suse.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:14:20 -08:00
Sean Christopherson
61146f67e4 KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bits
Define independent macros for the RWX protection bits that are enumerated
via EXIT_QUALIFICATION for EPT Violations, and tie them to the RWX bits in
EPT entries via compile-time asserts.  Piggybacking the EPTE defines works
for now, but it creates holes in the EPT_VIOLATION_xxx macros and will
cause headaches if/when KVM emulates Mode-Based Execution (MBEC), or any
other feature that introduces additional protection information.

Opportunistically rename EPT_VIOLATION_RWX_MASK to EPT_VIOLATION_PROT_MASK
so that it doesn't become stale if/when MBEC support is added.
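
A sketch of the pattern (bit positions follow the SDM's EXIT_QUALIFICATION
layout for EPT Violations, where the protection bits are the EPTE RWX bits
shifted left by 3; the macro names are illustrative):

  #include <stdint.h>

  #define VMX_EPT_READABLE_MASK     0x1ull
  #define VMX_EPT_WRITABLE_MASK     0x2ull
  #define VMX_EPT_EXECUTABLE_MASK   0x4ull

  #define EPT_VIOLATION_PROT_READ   (1ull << 3)
  #define EPT_VIOLATION_PROT_WRITE  (1ull << 4)
  #define EPT_VIOLATION_PROT_EXEC   (1ull << 5)

  /* Keep the two sets of defines from silently drifting apart. */
  _Static_assert(EPT_VIOLATION_PROT_READ  == (VMX_EPT_READABLE_MASK   << 3), "");
  _Static_assert(EPT_VIOLATION_PROT_WRITE == (VMX_EPT_WRITABLE_MASK   << 3), "");
  _Static_assert(EPT_VIOLATION_PROT_EXEC  == (VMX_EPT_EXECUTABLE_MASK << 3), "");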

No functional change intended.

Cc: Jon Kohler <jon@nutanix.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250227000705.3199706-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:14:05 -08:00
Yosry Ahmed
8c4f28cd81 KVM: nVMX: Always use IBPB to properly virtualize IBRS
On synthesized nested VM-exits in VMX, an IBPB is performed if IBRS is
advertised to the guest to properly provide separate prediction domains
for L1 and L2. However, this is currently conditional on
X86_FEATURE_USE_IBPB, which depends on the host spectre_v2_user
mitigation.

In short, if spectre_v2_user=no, IBRS is not virtualized correctly and
L1 becomes susceptible to attacks from L2. Fix this by performing the
IBPB regardless of X86_FEATURE_USE_IBPB.

Fixes: 2e7eab8142 ("KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS")
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-6-yosry.ahmed@linux.dev
2025-02-27 10:57:21 +01:00
Yosry Ahmed
80dacb0804 x86/bugs: Use a static branch to guard IBPB on vCPU switch
Instead of using X86_FEATURE_USE_IBPB to guard the IBPB execution in KVM
when a new vCPU is loaded, introduce a static branch, similar to
switch_mm_*_ibpb.

This makes it obvious in spectre_v2_user_select_mitigation() what
exactly is being toggled, instead of the unclear X86_FEATURE_USE_IBPB
(which will be shortly removed). It also provides more fine-grained
control, making it simpler to change/add paths that control the IBPB in
the vCPU switch path without affecting other IBPBs.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-5-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Yosry Ahmed
549435aab4 x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
indirect_branch_prediction_barrier() only performs the MSR write if
X86_FEATURE_USE_IBPB is set, using alternative_msr_write(). In
preparation for removing X86_FEATURE_USE_IBPB, move the feature check
into the callers so that they can be addressed one-by-one, and use
X86_FEATURE_IBPB instead to guard the MSR write.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-2-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Sean Christopherson
b2aba529bf KVM: Drop kvm_arch_sync_events() now that all implementations are nops
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other
arch has ever used it).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20250224235542.2562848-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Sean Christopherson
fd21732682 KVM: x86: Fold guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm()
Fold the guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm(), as
the kvmclock and PIT background workers only need to be stopped before
destroying vCPUs (to avoid accessing vCPUs as they are being freed); it's
a-ok for them to be running while the VM is visible on the global vm_list.

Note, the PIT also needs to be stopped before IRQ routing is freed
(because KVM's IRQ routing is garbage and assumes there is always non-NULL
routing).

Opportunistically add comments to explain why KVM stops/frees certain
assets early.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Sean Christopherson
e447212593 KVM: x86: Unload MMUs during vCPU destruction, not before
When destroying a VM, unload a vCPU's MMUs as part of normal vCPU freeing,
instead of as a separate preparatory action.  Unloading MMUs ahead of time
is a holdover from commit 7b53aa5650 ("KVM: Fix vcpu freeing for guest
smp"), which "fixed" a rather egregious flaw where KVM would attempt to
free *all* MMU pages when destroying a vCPU.

At the time, KVM would spin on all MMU pages in a VM when freeing a single
vCPU, and so would hang due to the way KVM pins and zaps root pages
(roots are invalidated but not freed if they are pinned by a vCPU).

  static void free_mmu_pages(struct kvm_vcpu *vcpu)
  {
        struct kvm_mmu_page *page;

        while (!list_empty(&vcpu->kvm->active_mmu_pages)) {
                page = container_of(vcpu->kvm->active_mmu_pages.next,
                                    struct kvm_mmu_page, link);
                kvm_mmu_zap_page(vcpu->kvm, page);
        }
        free_page((unsigned long)vcpu->mmu.pae_root);
  }

Now that KVM doesn't try to free all MMU pages when destroying a single
vCPU, there's no need to unpin roots prior to destroying a vCPU.

Note!  While KVM mostly destroys all MMUs before calling
kvm_arch_destroy_vm() (see commit f00be0cae4 ("KVM: MMU: do not free
active mmu pages in free_mmu_pages()")), unpinning MMU roots during vCPU
destruction will unfortunately trigger remote TLB flushes, i.e. will try
to send requests to all vCPUs.

Happily, thanks to commit 27592ae8db ("KVM: Move wiping of the kvm->vcpus
array to common code"), that's a non-issue as freed vCPUs are naturally
skipped by xa_for_each_range(), i.e. by kvm_for_each_vcpu().  Prior to that
commit, KVM x86 rather stupidly freed vCPUs one-by-one, and _then_
nullified them, one-by-one.  I.e. triggering a VM-wide request would hit a
use-after-free.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Sean Christopherson
ed09b50b54 KVM: x86: Don't load/put vCPU when unloading its MMU during teardown
Don't load (and then put) a vCPU when unloading its MMU during VM
destruction, as nothing in kvm_mmu_unload() accesses vCPU state beyond the
root page/address of each MMU, i.e. can't possible need to run with the
vCPU loaded.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Borislav Petkov
8442df2b49 x86/bugs: KVM: Add support for SRSO_MSR_FIX
Add support for

  CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it
  indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate
  SRSO.

Enable BpSpecReduce to mitigate SRSO across guest/host boundaries.

Set the bit when virtualization is enabled and clear it when
virtualization is disabled, rather than using an MSR save/restore slot:
a slot would clear the bit whenever the guest exits, and any training
the guest has done could then influence the host kernel between the
exit and the next VMRUN.

More detail on the public thread in Link below.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
2025-02-26 15:13:06 +01:00
Sean Christopherson
982caaa115 KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pending
Process pending events on nested VM-Exit if the vCPU has an injectable IRQ
or NMI, as the event may have become pending while L2 was active, i.e. may
not be tracked in the context of vmcs01.  E.g. if L1 has passed its APIC
through to L2 and an IRQ arrives while L2 is active, then KVM needs to
request an IRQ window prior to running L1, otherwise delivery of the IRQ
will be delayed until KVM happens to process events for some other reason.

The missed failure is detected by vmx_apic_passthrough_tpr_threshold_test
in KVM-Unit-Tests, but has effectively been masked due to a flaw in KVM's
PIC emulation that causes KVM to make spurious KVM_REQ_EVENT requests (and
apparently no one ever ran the test with split IRQ chips).

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 04:17:32 -05:00
Sean Christopherson
17bcd71442 KVM: x86: Free vCPUs before freeing VM state
Free vCPUs before freeing any VM state, as both SVM and VMX may access
VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs
to be kicked out of nested guest mode.

Commit 6fcee03df6 ("KVM: x86: avoid loading a vCPU after .vm_destroy was
called") partially fixed the issue, but for unknown reasons only moved the
MMU unloading before VM destruction.  Complete the change, and free all
vCPU state prior to destroying VM state, as nVMX accesses even more state
than nSVM.

In addition to the AVIC, KVM can hit a use-after-free on MSR filters:

  kvm_msr_allowed+0x4c/0xd0
  __kvm_set_msr+0x12d/0x1e0
  kvm_set_msr+0x19/0x40
  load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel]
  nested_vmx_vmexit+0x715/0xbd0 [kvm_intel]
  nested_vmx_free_vcpu+0x33/0x50 [kvm_intel]
  vmx_free_vcpu+0x54/0xc0 [kvm_intel]
  kvm_arch_vcpu_destroy+0x28/0xf0
  kvm_vcpu_destroy+0x12/0x50
  kvm_arch_destroy_vm+0x12c/0x1c0
  kvm_put_kvm+0x263/0x3c0
  kvm_vm_release+0x21/0x30

and an upcoming fix to process injectable interrupts on nested VM-Exit
will access the PIC:

  BUG: kernel NULL pointer dereference, address: 0000000000000090
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re
  RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm]
  Call Trace:
   <TASK>
   kvm_cpu_has_injectable_intr+0xe/0x60 [kvm]
   nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel]
   nested_vmx_free_vcpu+0x40/0x50 [kvm_intel]
   vmx_vcpu_free+0x2d/0x80 [kvm_intel]
   kvm_arch_vcpu_destroy+0x2d/0x130 [kvm]
   kvm_destroy_vcpus+0x8a/0x100 [kvm]
   kvm_arch_destroy_vm+0xa7/0x1d0 [kvm]
   kvm_destroy_vm+0x172/0x300 [kvm]
   kvm_vcpu_release+0x31/0x50 [kvm]

Inarguably, both nSVM and nVMX need to be fixed, but punt on those
cleanups for the moment.  Conceptually, vCPUs should be freed before VM
state.  Assets like the I/O APIC and PIC _must_ be allocated before vCPUs
are created, so it stands to reason that they must be freed _after_ vCPUs
are destroyed.

Reported-by: Aaron Lewis <aaronlewis@google.com>
Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com
Cc: Jim Mattson <jmattson@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 04:17:15 -05:00
Manali Shukla
fa662c9080 KVM: SVM: Add Idle HLT intercept support
Add support for "Idle HLT" interception on AMD CPUs, and enable Idle HLT
interception instead of "normal" HLT interception for all VMs for which
HLT-exiting is enabled.  Idle HLT provides a mild performance boost for
all VM types, by avoiding a VM-Exit in the scenario where KVM would
immediately "wake" and resume the vCPU.

Idle HLT makes HLT-exiting conditional on the vCPU not having a valid,
unmasked interrupt.  Specifically, a VM-Exit occurs on execution of HLT
if and only if there are no pending V_IRQ or V_NMI events.  Note, Idle HLT
is a replacement for full HLT interception, i.e. enabling HLT interception
would result in all HLT instructions causing unconditional VM-Exits.  Per
the APM:

 When both HLT and Idle HLT intercepts are active at the same time, the
 HLT intercept takes priority. This intercept occurs only if a virtual
 interrupt is not pending (V_INTR or V_NMI).

For KVM's use of V_IRQ (also called V_INTR in the APM) to detect interrupt
windows, the net effect of enabling Idle HLT is that, if a virtual
interrupt is pending and unmasked at the time of HLT, the vCPU will take
a V_IRQ intercept instead of a HLT intercept.

When AVIC is enabled, Idle HLT works as intended: the vCPU continues
unimpeded and services the pending virtual interrupt.

Note, the APM's description of V_IRQ interaction with AVIC is quite
confusing, and requires piecing together implied behavior.  Per the APM,
when AVIC is enabled, V_IRQ *from the VMCB* is ignored:

  When AVIC mode is enabled for a virtual processor, the V_IRQ, V_INTR_PRIO,
  V_INTR_VECTOR, and V_IGN_TPR fields in the VMCB are ignored.

Which seems to contradict the behavior of Idle HLT:

  This intercept occurs only if a virtual interrupt is not pending (V_INTR
  or V_NMI).

What's not explicitly stated is that hardware's internal copy of V_IRQ
(and related fields) *are* still active, i.e. are presumably used to cache
information from the virtual APIC.

Handle Idle HLT exits as if they were normal HLT exits, e.g. don't try to
optimize the handling under the assumption that there isn't a pending IRQ.
Irrespective of AVIC, Idle HLT is inherently racy with respect to the vIRR,
as KVM can set vIRR bits asynchronously.

No changes are required to support KVM's use of Idle HLT while running
L2.  In fact, supporting Idle HLT is actually a bug fix to some extent.
If L1 wants to intercept HLT, recalc_intercepts() will enable HLT
interception in vmcb02 and forward the intercept to L1 as normal.

But if L1 does not want to intercept HLT, then KVM will run L2 with Idle
HLT enabled and HLT interception disabled.  If a V_IRQ or V_NMI for L2
becomes pending and L2 executes HLT, then use of Idle HLT will do the
right thing, i.e. not #VMEXIT and instead deliver the virtual event.  KVM
currently doesn't handle this scenario correctly, e.g. doesn't check V_IRQ
or V_NMI in vmcs02 as part of kvm_vcpu_has_events().

Do not expose Idle HLT to L1 at this time, as supporting nested Idle HLT is
more complex than just enumerating the feature, e.g. requires KVM to handle
the aforementioned scenarios of V_IRQ and V_NMI at the time of exit.
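
A sketch of the resulting intercept selection (hypothetical helper and enum;
the real logic lives in SVM's VMCB intercept setup):

  #include <stdbool.h>

  enum hlt_intercept { HLT_NO_INTERCEPT, HLT_INTERCEPT, IDLE_HLT_INTERCEPT };

  /* Prefer Idle HLT when available so HLT only exits with no pending V_IRQ/V_NMI. */
  static enum hlt_intercept pick_hlt_intercept(bool hlt_exiting, bool has_idle_hlt)
  {
          if (!hlt_exiting)
                  return HLT_NO_INTERCEPT;

          return has_idle_hlt ? IDLE_HLT_INTERCEPT : HLT_INTERCEPT;
  }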

Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://bugzilla.kernel.org/attachment.cgi?id=306250
Link: https://lore.kernel.org/r/20250128124812.7324-3-manali.shukla@amd.com
[sean: rewrite changelog, drop nested "support"]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 16:30:02 -08:00
Melody Wang
c3392d0ab7 KVM: SVM: Provide helpers to set the error code
Provide helpers to set the error code when converting VMGEXIT SW_EXITINFO1 and
SW_EXITINFO2 codes from plain numbers to proper defines. Add comments for
better code readability.

No functionality changed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Melody Wang <huibo.wang@amd.com>
Link: https://lore.kernel.org/r/20250225213937.2471419-3-huibo.wang@amd.com
[sean: tweak comments, fix formatting goofs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 16:30:00 -08:00
Melody Wang
ea4c2f2f5e KVM: SVM: Convert plain error code numbers to defines
Convert VMGEXIT SW_EXITINFO1 codes from plain numbers to proper defines.

Opportunistically update the comment for the malformed input "sub-error"
codes to state that they are defined by the GHCB, and to capture the
relationship to the malformed input response.

No functional change intended.

Signed-off-by: Melody Wang <huibo.wang@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Link: https://lore.kernel.org/r/20250225213937.2471419-2-huibo.wang@amd.com
[sean: update comments]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 16:29:59 -08:00
Sean Christopherson
d62c02af7a KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NM
Pass XFD_ERR via KVM's exception payload mechanism when injecting an #NM
after interception so that XFD_ERR can be propagated to FRED's event_data
field without needing a dedicated field (which would need to be migrated).

For non-FRED vCPUs, this is a glorified NOP as
kvm_deliver_exception_payload() will simply do nothing (which is desirable
and correct).

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20241001050110.3643764-15-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 09:35:44 -08:00
Sean Christopherson
3ef0df3f76 KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1
Don't update the guest's XFD_ERR MSR if CR0.TS is set; per the SDM,
XFD_ERR is not modified if CR0.TS=1.  Although it's not explicitly stated
in the SDM, conceptually it makes sense the CR0.TS check would be done
prior to the XFD_ERR check, e.g. CR0.TS=1 blocks all SIMD state, whereas
XFD blocks only XTILE state.

  Device-not-available exceptions that are not due to XFD - those
  resulting from setting CR0.TS to 1 - do not modify the IA32_XFD_ERR MSR.

Opportunistically update the comment to call out that XFD_ERR is updated
before the VM-Exit check occurs.  Nothing in the SDM explicitly calls out
this behavior, but logically it must be the behavior, otherwise reading
XFD_ERR in handle_nm_fault_irqoff() would return stale data, i.e. the
to-be-delivered XFD_ERR value would need to be saved in EXIT_QUALIFICATION,
a la DR6 for #DB and CR2 for #PF, so that software could capture the guest
value.
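
The ordering boils down to a simple guard, sketched here with a hypothetical
helper name:

  #include <stdbool.h>
  #include <stdint.h>

  #define X86_CR0_TS   (1ull << 3)

  /* Per the quoted SDM text, a TS-induced #NM must not touch IA32_XFD_ERR. */
  static bool nm_may_update_xfd_err(uint64_t guest_cr0)
  {
          return !(guest_cr0 & X86_CR0_TS);
  }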

Fixes: ec5be88ab2 ("kvm: x86: Intercept #NM for saving IA32_XFD_ERR")
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20241001050110.3643764-3-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 09:35:40 -08:00
Sean Christopherson
b50cb2b155 KVM: x86: Use a dedicated flow for queueing re-injected exceptions
Open code the filling of vcpu->arch.exception in kvm_requeue_exception()
instead of bouncing through kvm_multiple_exception(), as re-injection
doesn't actually share that much code with "normal" injection, e.g. the
VM-Exit interception check, payload delivery, and nested exception code
is all bypassed as those flows only apply during initial injection.

When FRED comes along, the special casing will only get worse, as FRED
explicitly tracks nested exceptions and essentially delivers the payload
on the stack frame, i.e. re-injection will need more inputs, and normal
injection will have yet more code that needs to be bypassed when KVM is
re-injecting an exception.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 07:23:07 -08:00
Sean Christopherson
4fa0efb43a KVM: x86: Rename and invert async #PF's send_user_only flag to send_always
Rename send_user_only to avoid "user", because KVM's ABI is to not inject
page faults into CPL0, whereas "user" in x86 is specifically CPL3.  Invert
the polarity to keep the naming simple and unambiguous.  E.g. while KVM
often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and
"send_kernel" could be misconstrued as "send only to kernel".

Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 07:10:47 -08:00
Sean Christopherson
b9595d1dde KVM: x86: Don't inject PV async #PF if SEND_ALWAYS=0 and guest state is protected
Don't inject PV async #PFs into guests with protected register state, i.e.
SEV-ES and SEV-SNP guests, unless the guest has opted-in to receiving #PFs
at CPL0.  For protected guests, the actual CPL of the guest is unknown.

Note, no sane CoCo guest should enable PV async #PF, but the current state
of Linux-as-a-CoCo-guest isn't entirely sane.

Fixes: add5e2f045 ("KVM: SVM: Add support for the SEV-ES VMSA")
Link: https://lore.kernel.org/r/20250215010609.1199982-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 07:10:47 -08:00
Fred Griffoul
a2b00f85d7 KVM: x86: Update Xen TSC leaves during CPUID emulation
The Xen emulation in KVM modifies certain CPUID leaves to expose
TSC information to the guest.

Previously, these CPUID leaves were updated whenever guest time changed,
but this conflicts with KVM_SET_CPUID/KVM_SET_CPUID2 ioctls which reject
changes to CPUID entries on running vCPUs.

Fix this by updating the TSC information directly in the CPUID emulation
handler instead of modifying the vCPU's CPUID entries.

Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250124150539.69975-1-fgriffo@amazon.co.uk
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 07:09:55 -08:00
Sean Christopherson
79a14afc60 KVM: nVMX: Synthesize nested VM-Exit for supported emulation intercepts
When emulating an instruction on behalf of L2 that L1 wants to intercept,
generate a nested VM-Exit instead of injecting a #UD into L2.  Now that
(most of) the necessary information is available, synthesizing a VM-Exit
isn't terribly difficult.

Punt on decoding the ModR/M for descriptor table exits for now.  There is
no evidence that any hypervisor intercepts descriptor table accesses *and*
uses the EXIT_QUALIFICATION to expedite emulation, i.e. it's not worth
delaying basic support for.

To avoid doing more harm than good, e.g. by putting L2 into an infinite
loop or effectively corrupting its code stream, inject #UD if the instruction
length is nonsensical.

Link: https://lore.kernel.org/r/20250201015518.689704-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
fbd1e0f195 KVM: nVMX: Allow the caller to provide instruction length on nested VM-Exit
Rework the nested VM-Exit helper to take the instruction length as a
parameter, and convert nested_vmx_vmexit() into a "default" wrapper that
grabs the length from vmcs02 as appropriate.  This will allow KVM to set
the correct instruction length when synthesizing a nested VM-Exit when
emulating an instruction that L1 wants to intercept.

No functional change intended, as the path to prepare_vmcs12()'s reading
of vmcs02.VM_EXIT_INSTRUCTION_LEN is gated on the same set of conditions
as the VMREAD in the new nested_vmx_vmexit().

Link: https://lore.kernel.org/r/20250201015518.689704-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
d4aea23fd0 KVM: x86: Add a #define for the architectural max instruction length
Add a #define to capture x86's architecturally defined max instruction
length instead of open coding the literal in a variety of places.

No functional change intended.
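
For reference, the architectural limit is 15 bytes, so the define is a
one-liner (the name below is illustrative, not necessarily the one used
in KVM):

  /* x86 instructions are architecturally limited to 15 bytes. */
  #define X86_MAX_INSTRUCTION_LENGTH   15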

Link: https://lore.kernel.org/r/20250201015518.689704-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
9aeb9d8a67 KVM: x86: Plumb the emulator's starting RIP into nested intercept checks
When checking for intercept when emulating an instruction on behalf of L2,
pass the emulator's view of the RIP of the instruction being emulated to
vendor code.  Unlike SVM, which communicates the next RIP on VM-Exit,
VMX communicates the length of the instruction that generated the VM-Exit,
i.e. requires the current and next RIPs.

Note, unless userspace modifies RIP during a userspace exit that requires
completion, kvm_rip_read() will contain the same information.  Pass the
emulator's view largely out of paranoia, and because there is no
meaningful cost in doing so.

Link: https://lore.kernel.org/r/20250201015518.689704-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
407d03fe92 KVM: x86: Plumb the src/dst operand types through to .check_intercept()
When checking for intercept when emulating an instruction on behalf of L2,
forward the source and destination operand types to vendor code so that
VMX can synthesize the correct EXIT_QUALIFICATION for port I/O VM-Exits.

Link: https://lore.kernel.org/r/20250201015518.689704-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
08e3d89eb3 KVM: nVMX: Consolidate missing X86EMUL_INTERCEPTED logic in L2 emulation
Refactor the handling of port I/O interception checks when emulating on
behalf of L2 in anticipation of synthesizing a nested VM-Exit to L1
instead of injecting a #UD into L2.

No functional change intended.

Link: https://lore.kernel.org/r/20250201015518.689704-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
f43f7a215a KVM: nVMX: Emulate HLT in L2 if it's not intercepted
Extend VMX's nested intercept logic for emulated instructions to handle
HLT interception, primarily for testing purposes.  Failure to allow
emulation of HLT isn't all that interesting, as emulating HLT while L2 is
active either requires forced emulation (and no #UD intercept in L1), TLB
games in the guest to coerce KVM into emulating the wrong instruction, or
a bug elsewhere in KVM.

E.g. without commit 47ef3ef843 ("KVM: VMX: Handle event vectoring
error in check_emulate_instruction()"), KVM can end up trying to emulate
HLT if RIP happens to point at a HLT when a vectored event arrives with
L2's IDT pointing at emulated MMIO.

Note, vmx_check_intercept() is still broken when L1 wants to intercept an
instruction, as KVM injects a #UD instead of synthesizing a nested VM-Exit.
That issue extends far beyond HLT, punt on it for now.

Link: https://lore.kernel.org/r/20250201015518.689704-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:07 -08:00
Sean Christopherson
3244616aac KVM: nVMX: Allow emulating RDPID on behalf of L2
Return X86EMUL_CONTINUE instead of X86EMUL_UNHANDLEABLE when emulating RDPID
on behalf of L2 and L1 _does_ expose RDPID/RDTSCP to L2.  When RDPID
emulation was added by commit fb6d4d340e ("KVM: x86: emulate RDPID"),
KVM incorrectly allowed emulation by default.  Commit 07721feee4 ("KVM:
nVMX: Don't emulate instructions in guest mode") fixed that flaw, but
missed that RDPID emulation was relying on the common return path to allow
emulation on behalf of L2.

Fixes: 07721feee4 ("KVM: nVMX: Don't emulate instructions in guest mode")
Link: https://lore.kernel.org/r/20250201015518.689704-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:06 -08:00
Sean Christopherson
c8e612bfed KVM: nSVM: Pass next RIP, not current RIP, for nested VM-Exit on emulation
Set "next_rip" in the emulation interception info passed to vendor code
using the emulator context's "_eip", not "eip".  "eip" holds RIP from the
start of emulation, i.e. the RIP of the instruction that's being emulated,
whereas _eip tracks the context's current position in decoding the code
stream, which at the time of the intercept checks is effectively the RIP
of the next instruction.

Passing the current RIP as next_rip causes SVM to stuff the wrong value
into vmcb12->control.next_rip if a nested VM-Exit is generated, i.e.
if L1 wants to intercept the instruction, and could result in L1 putting
L2 into an infinite loop due to restarting L2 with the same RIP over and
over.

Fixes: 8a76d7f25f ("KVM: x86: Add x86 callback for intercept check")
Link: https://lore.kernel.org/r/20250201015518.689704-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:06 -08:00
Sean Christopherson
f002a97ec8 KVM: nVMX: Check PAUSE_EXITING, not BUS_LOCK_DETECTION, on PAUSE emulation
When emulating PAUSE on behalf of L2, check for interception in vmcs12 by
looking at primary execution controls, not secondary execution controls.
Checking for PAUSE_EXITING in secondary execution controls effectively
results in KVM looking for BUS_LOCK_DETECTION, which KVM doesn't expose to
L1, i.e. is always off in vmcs12, and ultimately results in KVM failing to
"intercept" PAUSE.

Because KVM doesn't handle interception during emulation correctly on VMX,
i.e. the "fixed" code is still quite broken, and not intercepting PAUSE is
relatively benign, for all intents and purposes the bug means that L2 gets
to live when it would otherwise get an unexpected #UD.
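
Sketched, the fixed check consults the primary controls (bit value assumes
the SDM's standard layout; the helper name is hypothetical).  PAUSE exiting
and bus-lock detection happen to share bit position 30, just in different
control fields, which is how the mix-up went unnoticed.

  #include <stdbool.h>
  #include <stdint.h>

  #define CPU_BASED_PAUSE_EXITING   (1u << 30)   /* primary processor-based controls */

  /* PAUSE interception for L2 is governed by vmcs12's *primary* controls. */
  static bool vmcs12_intercepts_pause(uint32_t primary_exec_control)
  {
          return primary_exec_control & CPU_BASED_PAUSE_EXITING;
  }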

Fixes: 4984563823 ("KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted")
Link: https://lore.kernel.org/r/20250201015518.689704-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 09:01:06 -08:00
Sean Christopherson
26e228ec16 KVM: x86/xen: Move kvm_xen_hvm_config field into kvm_xen
Now that all KVM usage of the Xen HVM config information is buried behind
CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch
and into kvm_xen.

No functional change intended.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 08:59:59 -08:00
Sean Christopherson
a5d7700af6 KVM: x86/xen: Consult kvm_xen_enabled when checking for Xen MSR writes
Query kvm_xen_enabled when detecting writes to the Xen hypercall page MSR
so that the check is optimized away in the likely scenario that Xen isn't
enabled for the VM.

Deliberately open code the check instead of using kvm_xen_msr_enabled() in
order to avoid a double load of xen_hvm_config.msr (which is admittedly
rather pointless given the widespread lack of READ_ONCE() usage on the
plethora of vCPU-scoped accesses to kvm->arch.xen state).

No functional change intended.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 08:59:58 -08:00
Sean Christopherson
bb0978d95a KVM: x86/xen: Add an #ifdef'd helper to detect writes to Xen MSR
Add a helper to detect writes to the Xen hypercall page MSR, and provide a
stub for CONFIG_KVM_XEN=n to optimize out the check for kernels built
without Xen support.
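
The shape of such a helper, sketched with hypothetical names:

  #include <stdbool.h>
  #include <stdint.h>

  #ifdef CONFIG_KVM_XEN
  bool kvm_xen_is_hypercall_page_msr_sketch(uint32_t msr);   /* real check, Xen built in */
  #else
  /* Constant false lets the compiler elide the caller's branch entirely. */
  static inline bool kvm_xen_is_hypercall_page_msr_sketch(uint32_t msr)
  {
          return false;
  }
  #endif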

Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250215011437.1203084-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 08:59:57 -08:00
Sean Christopherson
5c17848134 KVM: x86/xen: Restrict hypercall MSR to unofficial synthetic range
Reject userspace attempts to set the Xen hypercall page MSR to an index
outside of the "standard" virtualization range [0x40000000, 0x4fffffff],
as KVM is not equipped to handle collisions with real MSRs, e.g. KVM
doesn't update MSR interception, conflicts with VMCS/VMCB fields, special
case writes in KVM, etc.

While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.
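
The restriction itself is a simple range check on the requested index
(sketch; the helper name is made up):

  #include <stdbool.h>
  #include <stdint.h>

  /* Only accept indices inside the synthetic virtualization MSR range. */
  static bool xen_hypercall_msr_in_synthetic_range(uint32_t msr)
  {
          return msr >= 0x40000000u && msr <= 0x4fffffffu;
  }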

Cc: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24 08:59:55 -08:00
Nam Cao
7764b9dd17 KVM: x86: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Patch was created by using Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/5051cfe7ed48ef9913bf2583eeca6795cb53d6ae.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:31 +01:00
Paolo Bonzini
d3d0b8dfe0 KVM fixes for 6.14 part 1
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being emulated
    by KVM to fix a NULL pointer dereference.
 
  - Enter guest mode (L2) from KVM's perspective before initializing the vCPU's
    nested NPT MMU so that the MMU is properly tagged for L2, not L1.
 
  - Load the guest's DR6 outside of the innermost .vcpu_run() loop, as the
    guest's value may be stale if a VM-Exit is handled in the fastpath.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmev2ekACgkQOlYIJqCj
 N/32Gg/7B2+oV9RaKB1VNv4G4vbQLiA+DxPM91U0sBqytkr9BfU5kciaVs068OVk
 2M3j007HHm51sWlsCB7VLeTmiNNi/RcJzh6mOCpJVGa70imNZl3/1cvbzx1hjOAn
 DbZSIqBfLpPnAmNUp4c++WsDPZR2vVVMXriVNWM+RLFRT8E2GavCKxGppoNf+FIS
 8aYYikiqIx+E6iYsZjEm4TXqOQ2CSLM+auq2/L24bFgkn/v6I5m70QfnnYgs7Y7R
 uZhv+x2O8DXuW2RxabiC4q302PDdNKtHYpEh/5+vmG34mouZEEPTVlSRU720frqU
 SnOwtiTKwDmAwMDSRXUAP4jc9FsD4JHSUUM7Sk0J/YaI55X3xV+YrJUBZ07bwunT
 TkKPr6TvlJW9s2bi+CEc0HHoMHqmejjKhq8fOeDgVkGYH1nhjrLQAFpxjI4iVmPQ
 vZLmCZXEMzJaqySMNVIPdSFJLLsKnD7mJT3XfbXG7dV5zmde2qYd7+TiRVb5dmst
 xTgSvhA1jLXpSYA4rmMjhweLEfQyljaPgb1GEZCQCBrV9clP0cb091rOWNbrcieG
 aMXFwHEyPjGDvlXlhjdfkNeHdP6Dq8y0aBoyeSnvdwvpAN256jswrzpYjBHWQqfv
 jsD3QHcbImUr+kH2CHFsZuXxsjh+woL+4crR1eQkL8oZWHEykzs=
 =aFcV
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.14-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM fixes for 6.14 part 1

 - Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being emulated
   by KVM to fix a NULL pointer dereference.

 - Enter guest mode (L2) from KVM's perspective before initializing the vCPU's
   nested NPT MMU so that the MMU is properly tagged for L2, not L1.

 - Load the guest's DR6 outside of the innermost .vcpu_run() loop, as the
   guest's value may be stale if a VM-Exit is handled in the fastpath.
2025-02-14 19:08:35 -05:00
Sean Christopherson
44e70718df KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading.  Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-14 18:39:19 -05:00
Sean Christopherson
af3b6a9eba KVM: x86/mmu: Walk rmaps (shadow MMU) without holding mmu_lock when aging gfns
Convert the shadow MMU to use per-rmap locking instead of the per-VM
mmu_lock to protect rmaps when aging SPTEs.  When A/D bits are enabled, it
is safe to simply clear the Accessed bits, i.e. KVM just needs to ensure
the parent page table isn't freed.

The less obvious case is marking SPTEs for access tracking in the
non-A/D case (for EPT only).  Because aging a gfn means making the SPTE
not-present, KVM needs to play nice with the case where the CPU has TLB
entries for a SPTE that is not-present in memory.  For example, when
doing dirty tracking, if KVM encounters a non-present shadow accessed SPTE,
KVM must know to do a TLB invalidation.

Fortunately, KVM already provides (and relies upon) the necessary
functionality.  E.g. KVM doesn't flush TLBs when aging pages (even in the
clear_flush_young() case), and when harvesting dirty bitmaps, KVM flushes
based on the dirty bitmaps, not on SPTEs.

Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-12-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:47 -08:00
Sean Christopherson
bb6c7749cc KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
Add a lockless version of for_each_rmap_spte(), which is pretty much the
same as the normal version, except that it doesn't BUG() the host if a
non-present SPTE is encountered.  When mmu_lock is held, it should be
impossible for a different task to zap a SPTE, _and_ zapped SPTEs must
be removed from their rmap chain prior to dropping mmu_lock.  Thus, the
normal walker BUG()s if a non-present SPTE is encountered as something is
wildly broken.

When walking rmaps without holding mmu_lock, the SPTEs pointed at by the
rmap chain can be zapped/dropped, and so a lockless walk can observe a
non-present SPTE if it runs concurrently with a different operation that
is zapping SPTEs.

Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-11-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:44 -08:00
Sean Christopherson
4834eaded9 KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock.  The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
necessary.

Note, the lock only protects the rmap structure itself, SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries)
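
A toy version of the idea, using C11 atomics rather than the kernel's
primitives (assumption: the stored value is word aligned, so bit 0 is free
to steal; names are made up):

  #include <stdatomic.h>

  #define RMAP_LOCKED_BIT   1ul

  /* Spin until the stolen low bit can be set; acquire orders later reads. */
  static void rmap_lock(_Atomic unsigned long *val)
  {
          unsigned long old;

          do {
                  old = atomic_load_explicit(val, memory_order_relaxed) & ~RMAP_LOCKED_BIT;
          } while (!atomic_compare_exchange_weak_explicit(val, &old, old | RMAP_LOCKED_BIT,
                                                          memory_order_acquire,
                                                          memory_order_relaxed));
  }

  static void rmap_unlock(_Atomic unsigned long *val)
  {
          atomic_fetch_and_explicit(val, ~RMAP_LOCKED_BIT, memory_order_release);
  }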

Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:40 -08:00
Sean Christopherson
9fb13ba6b5 KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
Refactor the pte_list and rmap code to always read and write rmap_head->val
exactly once, e.g. by collecting changes in a local variable and then
propagating those changes back to rmap_head->val as appropriate.  This will
allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into
the rmap value alongside the MANY bit.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-9-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:33 -08:00
James Houghton
8c403cf231 KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and
only if the VM has at least one shadow page, as opposed to checking if the
VM has rmaps.  Checking for rmaps will effectively yield a false positive
if the VM ran nested TDP VMs in the past, but is not currently doing so.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-8-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:28 -08:00
James Houghton
e25c233234 KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
Reorder the processing of the TDP MMU versus the shadow MMU when aging
SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP
MMU reports that the page is young, i.e. completely avoid taking mmu_lock
if the TDP MMU SPTE is young.  Swap the order for the test-and-age helper
as well for consistency.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-7-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:23 -08:00
Sean Christopherson
b146a9b34a KVM: x86/mmu: Age TDP MMU SPTEs without holding mmu_lock
Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
TDP MMU SPTEs.  Add a new macro to do RCU-safe walking of TDP MMU roots,
and do all SPTE aging with atomic updates; while clobbering Accessed
information is ok, KVM must not corrupt other bits, e.g. must not drop
a Dirty or Writable bit when making a SPTE young.

If updating a SPTE to mark it for access tracking fails, leave it as is
and treat it as if it were young.  If the spte is being actively modified,
it is most likely young.

Acquire and release mmu_lock for write when harvesting age information
from the shadow MMU, as the shadow MMU doesn't yet support aging outside
of mmu_lock.
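
A sketch of the "best effort" aging (stdatomic stand-in for the kernel's
SPTE accessors; the Accessed bit position is a placeholder):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define SPTE_ACCESSED_BIT   (1ull << 8)   /* placeholder bit position */

  /* Returns true if the SPTE was young; clears the bit only if no one races us. */
  static bool test_and_age_spte(_Atomic uint64_t *sptep)
  {
          uint64_t old = atomic_load_explicit(sptep, memory_order_relaxed);

          if (!(old & SPTE_ACCESSED_BIT))
                  return false;

          /* A racing update means the SPTE is most likely young anyway; don't retry. */
          atomic_compare_exchange_strong_explicit(sptep, &old, old & ~SPTE_ACCESSED_BIT,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed);
          return true;
  }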

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:17:17 -08:00
Sean Christopherson
928c54b1c4 KVM: x86/mmu: Always update A/D-disabled SPTEs atomically
In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled
SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them
for access-tracking outside of mmu_lock.  Coupled with restoring access-
tracked SPTEs in the fast page fault handler, the end result is that
A/D-disabled SPTEs will be volatile at all times.

Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/all/Z60bhK96JnKIgqZQ@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:16:51 -08:00
James Houghton
61d65f2dc7 KVM: x86/mmu: Don't force atomic update if only the Accessed bit is volatile
Don't force SPTE modifications to be done atomically if the only volatile
bit in the SPTE is the Accessed bit.  KVM and the primary MMU tolerate
stale aging state, and the probability of an Accessed bit A/D assist being
clobbered *and* affecting aging is likely far lower than the probability
of consuming stale information due to not flushing TLBs when aging.

Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better
capture the nature of the helper.

Opportunistically do s/write/update on the TDP MMU wrapper, as it's not
simply the "write" that needs to be done atomically, it's the entire
update, i.e. the entire read-modify-write operation needs to be done
atomically so that KVM has an accurate view of the old SPTE.

Leave kvm_tdp_mmu_write_spte_atomic() as is.  While the name is imperfect,
it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with
kvm_tdp_mmu_read_spte().  And renaming all of those isn't obviously a net
positive, and would require significant churn.

Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:16:45 -08:00
James Houghton
e29b74920e KVM: x86/mmu: Factor out spte atomic bit clearing routine
This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
follow-up patch to enable lockless Accessed bit clearing.

Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-4-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:16:37 -08:00
Ge Yang
7e066cb9b7 KVM: SEV: Use long-term pin when registering encrypted memory regions
When registering an encrypted memory region for SEV-MEM/SEV-ES guests,
pin the pages with FOLL_LONGTERM so that the pages are migrated out of
MIGRATE_CMA/ZONE_MOVABLE.  Failure to do so violates the CMA/MOVABLE
mechanisms and can result in fragmentation due to unmovable pages, e.g.
can make CMA allocations fail.

Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/1739241423-14326-1-git-send-email-yangge1116@126.com
[sean: massage changelog, make @flags an unsigned int]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:50:42 -08:00
Sean Christopherson
1b3c38050b KVM: x86: Override TSC_STABLE flag for Xen PV clocks in kvm_guest_time_update()
When updating PV clocks, handle the Xen-specific UNSTABLE_TSC override in
the main kvm_guest_time_update() by simply clearing PVCLOCK_TSC_STABLE_BIT
in the flags of the reference pvclock structure.  Expand the comment to
(hopefully) make it obvious that Xen clocks need to be processed after all
clocks that care about the TSC_STABLE flag.

No functional change intended.

Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
847d68abf1 KVM: x86: Setup Hyper-V TSC page before Xen PV clocks (during clock update)
When updating paravirtual clocks, setup the Hyper-V TSC page before
Xen PV clocks.  This will allow dropping xen_pvclock_tsc_unstable in favor
of simply clearing PVCLOCK_TSC_STABLE_BIT in the reference flags.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
39d61b46ad KVM: x86: Remove per-vCPU "cache" of its reference pvclock
Remove the per-vCPU "cache" of the reference pvclock and instead cache
only the TSC shift+multiplier.  All other fields in pvclock are fully
recomputed by kvm_guest_time_update(), i.e. aren't actually persisted.

In addition to shaving a few bytes, explicitly tracking the TSC shift/mul
fields makes it easier to see that those fields are tied to hw_tsc_khz
(they exist to avoid having to do expensive math in the common case).
And conversely, not tracking the other fields makes it easier to see that
things like the version number are pulled from the guest's copy, not from
KVM's reference.

Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
46aed4d4a7 KVM: x86: Pass reference pvclock as a param to kvm_setup_guest_pvclock()
Pass the reference pvclock structure that's used to setup each individual
pvclock as a parameter to kvm_setup_guest_pvclock() as a preparatory step
toward removing kvm_vcpu_arch.hv_clock.

No functional change intended.

Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
93fb0b10e7 KVM: x86: Set PVCLOCK_GUEST_STOPPED only for kvmclock, not for Xen PV clock
Handle "guest stopped" propagation only for kvmclock, as the flag is set
if and only if kvmclock is "active", i.e. can only be set for Xen PV clock
if kvmclock *and* Xen PV clock are in-use by the guest, which creates very
bizarre behavior for the guest.

Simply restrict the flag to kvmclock, e.g. instead of trying to handle
Xen PV clock, as propagation of PVCLOCK_GUEST_STOPPED was unintentionally
added during a refactoring, and while Xen proper defines
XEN_PVCLOCK_GUEST_STOPPED, there's no evidence that Xen guests actually
support the flag.

Check and clear pvclock_set_guest_stopped_request if and only if kvmclock
is active to preserve the original behavior, i.e. keep the flag pending
if kvmclock happens to be disabled when KVM processes the initial request.
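
A rough sketch of the resulting check (the helper name is hypothetical; the
field names follow this series):

  static bool kvm_vcpu_is_guest_stopped(struct kvm_vcpu *vcpu)
  {
          /*
           * Consume the request only when kvmclock is active; otherwise keep
           * it pending so it is neither lost nor leaked to other clocks.
           */
          if (!vcpu->arch.pv_time.active ||
              !vcpu->arch.pvclock_set_guest_stopped_request)
                  return false;

          vcpu->arch.pvclock_set_guest_stopped_request = false;
          return true;
  }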

Fixes: aa096aa0a0 ("KVM: x86/xen: setup pvclock updates")
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
24c1663780 KVM: x86: Don't bleed PVCLOCK_GUEST_STOPPED across PV clocks
When updating a specific PV clock, make a full copy of KVM's reference
copy/cache so that PVCLOCK_GUEST_STOPPED doesn't bleed across clocks.
E.g. in the unlikely scenario the guest has enabled both kvmclock and Xen
PV clock, a dangling GUEST_STOPPED in kvmclock would bleed into Xen PV
clock.

Using a local copy of the pvclock structure also sets the stage for
eliminating the per-vCPU copy/cache (only the TSC frequency information
actually "needs" to be cached/persisted).
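
E.g. conceptually (a sketch, not the literal diff):

  /* Full local copy: flag tweaks for this clock can't touch the reference. */
  struct pvclock_vcpu_time_info hv_clock = vcpu->arch.hv_clock;

  if (vcpu->arch.pvclock_set_guest_stopped_request)
          hv_clock.flags |= PVCLOCK_GUEST_STOPPED;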

Fixes: aa096aa0a0 ("KVM: x86/xen: setup pvclock updates")
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
ca28aa6391 KVM: x86/xen: Use guest's copy of pvclock when starting timer
Use the guest's copy of its pvclock when starting a Xen timer, as KVM's
reference copy may not be up-to-date, i.e. may yield a false positive of
sorts.  In the unlikely scenario that the guest is starting a Xen timer
and has used a Xen pvclock in the past, but has since turned it "off",
then vcpu->arch.hv_clock may be stale, as KVM's reference copy is updated
if and only if at least one pvclock is enabled.

Furthermore, vcpu->arch.hv_clock is currently used by three different
pvclocks: kvmclock, Xen, and Xen compat.  While it's extremely unlikely a
guest would ever enable multiple pvclocks, effectively sharing KVM's
reference clock could yield very weird behavior.  Using the guest's active
Xen pvclock instead of KVM's reference will allow dropping KVM's
reference copy.

Fixes: 451a707813 ("KVM: x86/xen: improve accuracy of Xen timers")
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
6c4927a4b7 KVM: x86: Process "guest stopped request" once per guest time update
Handle "guest stopped" requests once per guest time update in preparation
of restoring KVM's historical behavior of setting PVCLOCK_GUEST_STOPPED
for kvmclock and only kvmclock.  For now, simply move the code to minimize
the probability of an unintentional change in functionally.

Note, in practice, all clocks are guaranteed to see the request (or not)
even though each PV clock processes the request individually, as KVM holds
vcpu->mutex (blocks KVM_KVMCLOCK_CTRL) and it should be impossible for
KVM's suspend notifier to run while KVM is handling requests.  And because
the helper updates the reference flags, all subsequent PV clock updates
will pick up PVCLOCK_GUEST_STOPPED.

Note #2, once PVCLOCK_GUEST_STOPPED is restricted to kvmclock, the
horrific #ifdef will go away.

Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:55 -08:00
Sean Christopherson
aceb04f571 KVM: x86: Drop local pvclock_flags variable in kvm_guest_time_update()
Drop the local pvclock_flags variable in kvm_guest_time_update(); its value
is immediately shoved into the per-vCPU "cache", i.e. the local variable
serves no purpose.

No functional change intended.

Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:54 -08:00
Sean Christopherson
4198f38aed KVM: x86: Eliminate "handling" of impossible errors during SUSPEND
Drop KVM's handling of kvm_set_guest_paused() failure when reacting to a
SUSPEND notification, as kvm_set_guest_paused() only "fails" if the vCPU
isn't using kvmclock, and KVM's notifier callback pre-checks that kvmclock
is active.  I.e. barring some bizarre edge case that shouldn't be treated
as an error in the first place, kvm_arch_suspend_notifier() can't fail.

Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:54 -08:00
Sean Christopherson
d9c5ed0a9b KVM: x86: Don't take kvm->lock when iterating over vCPUs in suspend notifier
When queueing vCPU PVCLOCK updates in response to SUSPEND or HIBERNATE,
don't take kvm->lock as doing so can trigger a largely theoretical
deadlock, it is perfectly safe to iterate over the xarray of vCPUs without
holding kvm->lock, and kvm->lock doesn't protect kvm_set_guest_paused() in
any way (pv_time.active and pvclock_set_guest_stopped_request are
protected by vcpu->mutex, not kvm->lock).
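
A minimal sketch of the resulting loop (error handling and the notifier
plumbing elided; the kvmclock pre-check mirrors the existing callback):

  struct kvm_vcpu *vcpu;
  unsigned long i;

  /* No kvm->lock: the vCPU xarray is safe to walk without it. */
  kvm_for_each_vcpu(i, vcpu, kvm) {
          if (!vcpu->arch.pv_time.active)
                  continue;
          kvm_set_guest_paused(vcpu);
  }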

Reported-by: syzbot+352e553a86e0d75f5120@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/677c0f36.050a0220.3b3668.0014.GAE@google.com
Fixes: 7d62874f69 ("kvm: x86: implement KVM PM-notifier")
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:45:54 -08:00
Sean Christopherson
93da6af3ae KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries.  Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).

Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:

  SKX 6850 => 6450
  ICX 9000 => 8800
  EMR 7900 => 7700

Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.

Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case.  However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex.  Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would require detecting changes
to L1 state.  KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit.  The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.

To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has().  Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic CPUID
feature.

Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait, but there's no obvious advantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms.  MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.

Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug.  The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled.  E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT.  KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.
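
As a rough sketch of the deferral (the flag name is illustrative;
kvm_update_cpuid_runtime() is KVM's existing runtime-update helper):

  /* On a state change that affects dynamic CPUID bits (XCR0, XSS, ...): */
  vcpu->arch.cpuid_dynamic_bits_dirty = true;     /* defer, don't recompute */

  /* In CPUID emulation / KVM_GET_CPUID2, before the entries are consumed: */
  if (vcpu->arch.cpuid_dynamic_bits_dirty)
          kvm_update_cpuid_runtime(vcpu);         /* realize pending updates */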

Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:33 -08:00
Sean Christopherson
a487f6797c KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit
Rework MONITOR/MWAIT emulation to query X86_FEATURE_MWAIT if and only if
the MISC_ENABLE_NO_MWAIT quirk is enabled, in which case MWAIT is not a
dynamic, KVM-controlled CPUID feature.  KVM's funky ABI for that quirk is
to emulate MONITOR/MWAIT as nops if userspace sets MWAIT in guest CPUID.

For the case where KVM owns the MWAIT feature bit, check MISC_ENABLES
itself, i.e. check the actual control, not its reflection in guest CPUID.

Avoiding consumption of dynamic CPUID features will allow KVM to defer
runtime CPUID updates until kvm_emulate_cpuid(), i.e. until the updates
become visible to the guest.  Alternatively, KVM could play other games
with runtime CPUID updates, e.g. by precisely specifying which feature
bits to update, but doing so adds non-trivial complexity and doesn't solve
the underlying issue of unnecessary updates causing meaningful overhead
for nested virtualization roundtrips.

Link: https://lore.kernel.org/r/20241211013302.1347853-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:32 -08:00
Sean Christopherson
7e9f735e7a KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE
When emulating CPUID, retrieve MSR_IA32_TSX_CTRL.TSX_CTRL_CPUID_CLEAR if
and only if RTM and/or HLE feature bits need to be cleared.  Getting the
MSR value is unnecessary if neither bit is set, and avoiding the lookup
saves ~80 cycles for vCPUs without RTM or HLE.

Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20241211013302.1347853-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:31 -08:00
Sean Christopherson
aa93b6f96f KVM: x86: Use for-loop to iterate over XSTATE size entries
Rework xstate_required_size() to use a for-loop and continue, to make it
more obvious that the xstate_sizes[] lookups are indeed correctly bounded,
and to make it (hopefully) easier to understand that the loop is iterating
over supported XSAVE features.

Link: https://lore.kernel.org/r/20241211013302.1347853-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:30 -08:00
Ethan Zhao
a11128ce16 KVM: x86/cpuid: add type suffix to decimal const 48 fix building warning
The default type of a decimal constant is determined by the magnitude of
its value. If the value falls within the range of int, its type is int;
otherwise, if it falls within the range of unsigned int, its type is
unsigned int. This results in the constant 48 being of type int. In the
following min call,

g_phys_as = min(g_phys_as, 48);

This leads to a build warning/error (CONFIG_KVM_WERROR=y) caused by
the mismatch between the types of the two arguments to macro min. By
adding the suffix U to explicitly declare the type of the constant, this
issue is fixed.
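
I.e. the one-character fix, assuming g_phys_as is an unsigned int as the
warning implies:

g_phys_as = min(g_phys_as, 48U);

min() performs strict type checking of its arguments, so matching the
constant's type to g_phys_as silences the warning without changing behavior.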

Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Link: https://lore.kernel.org/r/20250127013837.12983-1-haifeng.zhao@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:29 -08:00
Jim Mattson
e9cb61055f KVM: x86: Clear pv_unhalted on all transitions to KVM_MP_STATE_RUNNABLE
In kvm_set_mp_state(), ensure that vcpu->arch.pv.pv_unhalted is always
cleared on a transition to KVM_MP_STATE_RUNNABLE, so that the next HLT
instruction will be respected.
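
Combined with the kvm_set_mp_state() helper introduced in the next entry,
the net effect is roughly (a sketch; the in-tree helper may differ in
detail):

  static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state)
  {
          vcpu->arch.mp_state = mp_state;

          /* Any transition to RUNNABLE invalidates a pending PV kick. */
          if (mp_state == KVM_MP_STATE_RUNNABLE)
                  vcpu->arch.pv.pv_unhalted = false;
  }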

Fixes: 6aef266c6e ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks")
Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Fixes: 38c0b192bd ("KVM: SVM: leave halted state on vmexit")
Fixes: 1a65105a5a ("KVM: x86/xen: handle PV spinlocks slowpath")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-3-jmattson@google.com
[sean: add Xen PV spinlocks to the list of Fixes, tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:28 -08:00
Jim Mattson
c9e5f3fa90 KVM: x86: Introduce kvm_set_mp_state()
Replace all open-coded assignments to vcpu->arch.mp_state with calls
to a new helper, kvm_set_mp_state(), to centralize all changes to
mp_state.

No functional change intended.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:27 -08:00
Li RongQing
82c470121c KVM: x86: Use kvfree_rcu() to free old optimized APIC map
Use kvfree_rcu() to free the old optimized APIC instead of open coding a
rough equivalent via call_rcu() and a callback function.

Note, there is a subtle functional change as rcu_barrier() doesn't wait on
kvfree_rcu(), but does wait on call_rcu().  Not forcing rcu_barrier() to
wait is safe and desirable in this case, as KVM doesn't care when an old
map is actually freed.  In fact, using kvfree_rcu() fixes a largely
theoretical use-after-free.  Because KVM _doesn't_ do rcu_barrier() to
wait for kvm_apic_map_free() to complete, if KVM-the-module is unloaded in
the RCU grace period before kvm_apic_map_free() is invoked, KVM's callback
could run after module unload.
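
Schematically (the struct/field names follow the existing APIC map code):

  /* Before: open-coded grace period handling via a KVM-owned callback. */
  call_rcu(&old->rcu, kvm_apic_map_free);

  /* After: core code frees the map after a grace period, no KVM callback. */
  kvfree_rcu(old, rcu);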

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250122073456.2950-1-lirongqing@baidu.com
[sean: rework changelog, call out rcu_barrier() interaction]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:26 -08:00
Liam Ni
4cad9f8787 KVM: x86: Wake vCPU for PIC interrupt injection iff a valid IRQ was found
When updating the emulated PIC IRQ status, set "wakeup_needed" if and only
if a new interrupt was found, i.e. if the incoming level is non-zero and
an IRQ is being raised.  The bug is relatively benign, as KVM will signal
a spurious wakeup, e.g. set KVM_REQ_EVENT and kick target vCPUs, but KVM
will never actually inject a spurious IRQ as kvm_cpu_has_extint() cares
only about the "output" field.

Fixes: 7049467b53 ("KVM: remove isr_ack logic from PIC")
Signed-off-by: Liam Ni <zhiguangni01@gmail.com>
Link: https://lore.kernel.org/r/CACZJ9cX2R_=qgvLdaqbB_DUJhv08c674b67Ln_Qb9yyVwgE16w@mail.gmail.com
[sean: reconstruct patch, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:05:01 -08:00
Sean Christopherson
c2fee09fc1 KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.

When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit.  This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.

For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.

But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.

Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop.  If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.

The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup.  E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).

    L1's view:
    ==========
    <L1 disables DR interception>
           CPU 0/KVM-7289    [023] d....  2925.640961: kvm_entry: vcpu 0
 A:  L1 Writes DR6
           CPU 0/KVM-7289    [023] d....  2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1

 B:        CPU 0/KVM-7289    [023] d....  2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec

 D: L1 reads DR6, arch.dr6 = 0
           CPU 0/KVM-7289    [023] d....  2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0

           CPU 0/KVM-7289    [023] d....  2925.640976: kvm_entry: vcpu 0
    L2 reads DR6, L1 disables DR interception
           CPU 0/KVM-7289    [023] d....  2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
           CPU 0/KVM-7289    [023] d....  2925.640983: kvm_entry: vcpu 0

           CPU 0/KVM-7289    [023] d....  2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0

    L2 detects failure
           CPU 0/KVM-7289    [023] d....  2925.640987: kvm_exit: vcpu 0 reason HLT
    L1 reads DR6 (confirms failure)
           CPU 0/KVM-7289    [023] d....  2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0

    L0's view:
    ==========
    L2 reads DR6, arch.dr6 = 0
          CPU 23/KVM-5046    [001] d....  3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
          CPU 23/KVM-5046    [001] .....  3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216

    L2 => L1 nested VM-Exit
          CPU 23/KVM-5046    [001] .....  3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216

          CPU 23/KVM-5046    [001] d....  3410.005610: kvm_entry: vcpu 23
          CPU 23/KVM-5046    [001] d....  3410.005611: kvm_exit: vcpu 23 reason VMREAD
          CPU 23/KVM-5046    [001] d....  3410.005611: kvm_entry: vcpu 23
          CPU 23/KVM-5046    [001] d....  3410.005612: kvm_exit: vcpu 23 reason VMREAD
          CPU 23/KVM-5046    [001] d....  3410.005612: kvm_entry: vcpu 23

    L1 writes DR7, L0 disables DR interception
          CPU 23/KVM-5046    [001] d....  3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
          CPU 23/KVM-5046    [001] d....  3410.005613: kvm_entry: vcpu 23

    L0 writes DR6 = 0 (arch.dr6)
          CPU 23/KVM-5046    [001] d....  3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0

 A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>

 B:       CPU 23/KVM-5046    [001] d....  3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
          CPU 23/KVM-5046    [001] d....  3410.005614: kvm_entry: vcpu 23

 C: L0 writes DR6 = 0 (arch.dr6)
          CPU 23/KVM-5046    [001] d....  3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0

    L1 => L2 nested VM-Enter
          CPU 23/KVM-5046    [001] d....  3410.005616: kvm_exit: vcpu 23 reason VMRESUME

    L0 reads DR6, arch.dr6 = 0

Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes: 375e28ffc0 ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT")
Fixes: d67668e9dd ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 08:59:38 -08:00
Sean Christopherson
46d6c6f3ef KVM: nSVM: Enter guest mode before initializing nested NPT MMU
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role.  KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.

Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).

And while the role is incorporated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.

Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.

Fixes: 69cb877487 ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control")
Cc: stable@vger.kernel.org
Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 08:57:55 -08:00
Sean Christopherson
a8de7f100b KVM: x86: Reject Hyper-V's SEND_IPI hypercalls if local APIC isn't in-kernel
Advertise support for Hyper-V's SEND_IPI and SEND_IPI_EX hypercalls if and
only if the local APIC is emulated/virtualized by KVM, and explicitly reject
said hypercalls if the local APIC is emulated in userspace, i.e. don't rely
on userspace to opt-in to KVM_CAP_HYPERV_ENFORCE_CPUID.

Rejecting SEND_IPI and SEND_IPI_EX fixes a NULL-pointer dereference if
Hyper-V enlightenments are exposed to the guest without an in-kernel local
APIC:

  dump_stack+0xbe/0xfd
  __kasan_report.cold+0x34/0x84
  kasan_report+0x3a/0x50
  __apic_accept_irq+0x3a/0x5c0
  kvm_hv_send_ipi.isra.0+0x34e/0x820
  kvm_hv_hypercall+0x8d9/0x9d0
  kvm_emulate_hypercall+0x506/0x7e0
  __vmx_handle_exit+0x283/0xb60
  vmx_handle_exit+0x1d/0xd0
  vcpu_enter_guest+0x16b0/0x24c0
  vcpu_run+0xc0/0x550
  kvm_arch_vcpu_ioctl_run+0x170/0x6d0
  kvm_vcpu_ioctl+0x413/0xb20
  __se_sys_ioctl+0x111/0x160
  do_syscall_64+0x30/0x40
  entry_SYSCALL_64_after_hwframe+0x67/0xd1

Note, checking the sending vCPU is sufficient, as the per-VM irqchip_mode
can't be modified after vCPUs are created, i.e. if one vCPU has an
in-kernel local APIC, then all vCPUs have an in-kernel local APIC.
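
A sketch of the guard in the hypercall dispatch (the exact status code
returned may differ):

  case HVCALL_SEND_IPI:
  case HVCALL_SEND_IPI_EX:
          if (unlikely(!lapic_in_kernel(vcpu))) {
                  ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
                  break;
          }
          /* ... normal SEND_IPI emulation ... */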

Reported-by: Dongjie Zou <zoudongjie@huawei.com>
Fixes: 214ff83d44 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Fixes: 2bc39970e9 ("x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250118003454.2619573-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 08:53:56 -08:00
Sean Christopherson
a6136669da KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading.  Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11 13:29:08 -08:00
Nikunj A Dadhania
8a01902a01 KVM: SEV: Use to_kvm_sev_info() for fetching kvm_sev_info struct
Simplify code by replacing &to_kvm_svm(kvm)->sev_info with
the to_kvm_sev_info() helper. Wherever possible, drop the local
variable declaration and directly use the helper instead.

No functional changes.
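
The helper boils down to (sketch):

  static __always_inline struct kvm_sev_info *to_kvm_sev_info(struct kvm *kvm)
  {
          return &to_kvm_svm(kvm)->sev_info;
  }

so call sites become, e.g., to_kvm_sev_info(kvm)->active instead of
&to_kvm_svm(kvm)->sev_info plus a local variable.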

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250123055140.144378-1-nikunj@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11 10:32:46 -08:00
David Woodhouse
3617c0ee7d KVM: x86/xen: Only write Xen hypercall page for guest writes to MSR
The Xen hypercall page MSR is write-only. When the guest writes an address
to the MSR, the hypervisor populates the referenced page with hypercall
functions.

There is no reason for the host ever to write to the MSR, and it isn't
even readable.

Allowing host writes to trigger the hypercall page allows userspace to
attack the kernel, as kvm_xen_write_hypercall_page() takes multiple
locks and writes to guest memory.  E.g. if userspace sets the MSR to
MSR_IA32_XSS, KVM's write to MSR_IA32_XSS during vCPU creation will
trigger an SRCU violation due to writing guest memory:

  =============================
  WARNING: suspicious RCU usage
  6.13.0-rc3
  -----------------------------
  include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage!

  stack backtrace:
  CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x7f/0x90
   lockdep_rcu_suspicious+0x176/0x1c0
   kvm_vcpu_gfn_to_memslot+0x259/0x280
   kvm_vcpu_write_guest+0x3a/0xa0
   kvm_xen_write_hypercall_page+0x268/0x300
   kvm_set_msr_common+0xc44/0x1940
   vmx_set_msr+0x9db/0x1fc0
   kvm_vcpu_reset+0x857/0xb50
   kvm_arch_vcpu_create+0x37e/0x4d0
   kvm_vm_ioctl+0x669/0x2100
   __x64_sys_ioctl+0xc1/0xf0
   do_syscall_64+0xc5/0x210
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7feda371b539

While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.
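
A sketch of the resulting gate in the MSR write path (the helper name is
illustrative; struct msr_data's host_initiated flag is what distinguishes
the two cases):

  if (kvm_xen_is_hypercall_page_msr(vcpu->kvm, msr)) {
          /* Only writes originating from the guest populate the page. */
          if (!msr_info->host_initiated)
                  return kvm_xen_write_hypercall_page(vcpu, data);
          return 0;
  }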

Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/de0437379dfab11e431a23c8ce41a29234c06cbf.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11 07:05:01 -08:00
Sean Christopherson
43fb96ae78 KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking
When waking a VM's NX huge page recovery thread, ensure the thread is
actually alive before trying to wake it.  Now that the thread is spawned
on-demand during KVM_RUN, a VM without a recovery thread is reachable via
the related module params.

  BUG: kernel NULL pointer dereference, address: 0000000000000040
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:vhost_task_wake+0x5/0x10
  Call Trace:
   <TASK>
   set_nx_huge_pages+0xcc/0x1e0 [kvm]
   param_attr_store+0x8a/0xd0
   module_attr_store+0x1a/0x30
   kernfs_fop_write_iter+0x12f/0x1e0
   vfs_write+0x233/0x3e0
   ksys_write+0x60/0xd0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f3b52710104
   </TASK>
  Modules linked in: kvm_intel kvm
  CR2: 0000000000000040
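
The fix is essentially a NULL check before the wake, e.g. (a sketch; the
field name follows KVM's NX huge page recovery code):

  static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
  {
          /* The vhost_task is created on first KVM_RUN and may not exist. */
          struct vhost_task *nx_thread =
                  READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);

          if (nx_thread)
                  vhost_task_wake(nx_thread);
  }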

Fixes: 931656b9e2 ("kvm: defer huge page recovery vhost task to later")
Cc: stable@vger.kernel.org
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250124234623.3609069-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04 11:28:21 -05:00
Paolo Bonzini
6f61269495 KVM: remove kvm_arch_post_init_vm
The only statement in a kvm_arch_post_init_vm implementation
can be moved into the x86 kvm_arch_init_vm.  Do so and remove all
traces from architecture-independent code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04 11:27:45 -05:00
Paolo Bonzini
ee3a66f431 kvm: x86: SRSO_USER_KERNEL_NO is not synthesized
SYNTHESIZED_F() generally is used together with setup_force_cpu_cap(),
i.e. when it makes sense to present the feature even if cpuid does not
have it *and* the VM is not able to see the difference.  For example,
it can be used when mitigations on the host automatically protect
the guest as well.

The "SYNTHESIZED_F(SRSO_USER_KERNEL_NO)" line came in as a conflict
resolution between the CPUID overhaul from the KVM tree and support
for the feature in the x86 tree.  Using it right now does not hurt,
or make a difference for that matter, because there is no
setup_force_cpu_cap(X86_FEATURE_SRSO_USER_KERNEL_NO).  However, it
is a little less future proof in case such a setup_force_cpu_cap()
appears later, for a case where the kernel somehow is not vulnerable
but the guest would have to apply the mitigation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04 11:13:24 -05:00
Linus Torvalds
0f8e26b38d Loongarch:
* Clear LLBCTL if secondary mmu mapping changes.
 
 * Add hypercall service support for usermode VMM.
 
 x86:
 
 * Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
   direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
 
 * Ensure that all SEV code is compiled out when disabled in Kconfig, even
   if building with less brilliant compilers.
 
 * Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes.
 
 * Use str_enabled_disabled() to replace open coded strings.
 
 * Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
   prior to every VM-Enter.
 
 * Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.
 
 * Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.
 
 * Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.
 
 * Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
 
 * Make the completion of hypercalls go through the complete_hypercall
   function pointer argument, no matter if the hypercall exits to
   userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
   specifically went to userspace, and all the others did not; the new code
   need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
   all whether there was an exit to userspace or not.
 
 * As part of enabling TDX virtual machines, support separation of
   private/shared EPT into separate roots.  When TDX will be enabled, operations
   on private pages will need to go through the privileged TDX Module via SEAMCALLs;
   as a result, they are limited and relatively slow compared to reading a PTE.
   The patches included in 6.14 allow KVM to keep a mirror of the private EPT in
   host memory, and define entries in kvm_x86_ops to operate on external page
   tables such as the TDX private EPT.
 
 * The recently introduced conversion of the NX-page reclamation kthread to
   vhost_task moved the task under the main process.  The task is created as
   soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that
   didn't expect to see any child task of the VM process until it started
   creating its own userspace threads.  In particular crosvm refuses to fork()
   if procfs shows any child task, so unbreak it by creating the task lazily.
   This is arguably a userspace bug, as there can be other kinds of legitimate
   worker tasks and they wouldn't impede fork(); but it's not like userspace
   has a way to distinguish kernel worker tasks right now.  Should they show
   as "Kthread: 1" in proc/.../status?
 
 x86 - Intel:
 
 * Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
   while L2 is active, which ultimately results in a hardware-accelerated L1
   EOI effectively being lost.
 
 * Honor event priority when emulating Posted Interrupt delivery during nested
   VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
   interrupt.
 
 * Rework KVM's processing of the Page-Modification Logging buffer to reap
   entries in the same order they were created, i.e. to mark gfns dirty in the
   same order that hardware marked the page/PTE dirty.
 
 * Misc cleanups.
 
 Generic:
 
 * Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when
   setting memory regions and add a dedicated API for setting KVM-internal
   memory regions.  The API can then explicitly disallow all flags for
   KVM-internal memory regions.
 
 * Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
   where KVM would return a pointer to a vCPU prior to it being fully online,
   and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
 
 * Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a
   bug where userspace could coerce KVM into handling the ioctl on a vCPU that
   isn't yet onlined.
 
 * Gracefully handle xarray insertion failures; even though such failures are
   impossible in practice after xa_reserve(), reserving an entry is always followed
   by xa_store() which does not know (or differentiate) whether there was an
   xa_reserve() before or not.
 
 RISC-V:
 
 * Zabha, Svvptc, and Ziccrse extension support for guests.  None of them
   require anything in KVM except for detecting them and marking them
   as supported; Zabha adds byte and halfword atomic operations, while the
   others are markers for specific operation of the TLB and of LL/SC
   instructions respectively.
 
 * Virtualize SBI system suspend extension for Guest/VM
 
 * Support firmware counters which can be used by the guests to collect
   statistics about traps that occur in the host.
 
 Selftests:
 
 * Rework vcpu_get_reg() to return a value instead of using an out-param, and
   update all affected arch code accordingly.
 
 * Convert the max_guest_memory_test into a more generic mmu_stress_test.
   The basic gist of the "conversion" is to have the test do mprotect() on
   guest memory while vCPUs are accessing said memory, e.g. to verify KVM
   and mmu_notifiers are working as intended.
 
 * Play nice with treewide builds of unsupported architectures, e.g. arm
   (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the
   target architecture is actually one KVM selftests supports.
 
 * Use the kernel's $(ARCH) definition instead of the target triple for arch
   specific directories, e.g. arm64 instead of aarch64, mainly so as not to
   be different from the rest of the kernel.
 
 * Ensure that format strings for logging statements are checked by the
   compiler even when the logging statement itself is disabled.
 
 * Attempt to whack the last LLC references/misses mole in the Intel PMU
   counters test by adding a data load and doing CLFLUSH{OPT} on the data
   instead of the code being executed.  It seems that modern Intel CPUs
   have learned new code prefetching tricks that bypass the PMU counters.
 
 * Fix a flaw in the Intel PMU counters test where it asserts that events
   are counting correctly without actually knowing what the events count
   given the underlying hardware; this can happen if Intel reuses a
   formerly microarchitecture-specific event encoding as an architectural
   event, as was the case for Top-Down Slots.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmeTuzoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkBwf8CRNExYaM3j9y2E7mmo6AiL2ug6+J
 Uy5Hai1poY48pPwKC6ke3EWT8WVsgj/Py5pCeHvLojQchWNjCCYNfSQluJdkRxwG
 DgP3QUljSxEJWBeSwyTRcKM+IySi5hZd1IFo3gePFRB829Jpnj05vjbvCyv8gIwU
 y3HXxSYDsViaaFoNg4OlZFsIGis7mtknsZzk++QjuCXmxNa6UCbv3qvE/UkVLhVg
 WH65RTRdjk+EsdwaOMHKuUvQoGa+iM4o39b6bqmw8+ZMK39+y33WeTX/y5RXsp1N
 tUUBRfS+MuuYgC/6LmTr66EkMzoChxk3Dp3kKUaCBcfqRC8PxQag5reZhw==
 =NEaO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Loongarch:

   - Clear LLBCTL if secondary mmu mapping changes

   - Add hypercall service support for usermode VMM

  x86:

   - Add a comment to kvm_mmu_do_page_fault() to explain why KVM
     performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
     enabled

   - Ensure that all SEV code is compiled out when disabled in Kconfig,
     even if building with less brilliant compilers

   - Remove a redundant TLB flush on AMD processors when guest CR4.PGE
     changes

   - Use str_enabled_disabled() to replace open coded strings

   - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
     APICv cache prior to every VM-Enter

   - Overhaul KVM's CPUID feature infrastructure to track all vCPU
     capabilities instead of just those where KVM needs to manage state
     and/or explicitly enable the feature in hardware. Along the way,
     refactor the code to make it easier to add features, and to make it
     more self-documenting how KVM is handling each feature

   - Rework KVM's handling of VM-Exits during event vectoring; this
     plugs holes where KVM unintentionally puts the vCPU into infinite
     loops in some scenarios (e.g. if emulation is triggered by the
     exit), and brings parity between VMX and SVM

   - Add pending request and interrupt injection information to the
     kvm_exit and kvm_entry tracepoints respectively

   - Fix a relatively benign flaw where KVM would end up redoing RDPKRU
     when loading guest/host PKRU, due to a refactoring of the kernel
     helpers that didn't account for KVM's pre-checking of the need to
     do WRPKRU

   - Make the completion of hypercalls go through the complete_hypercall
     function pointer argument, no matter if the hypercall exits to
     userspace or not.

     Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
     went to userspace, and all the others did not; the new code need
     not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
     all whether there was an exit to userspace or not

   - As part of enabling TDX virtual machines, support
     separation of private/shared EPT into separate roots.

     When TDX will be enabled, operations on private pages will need to
     go through the privileged TDX Module via SEAMCALLs; as a result,
     they are limited and relatively slow compared to reading a PTE.

     The patches included in 6.14 allow KVM to keep a mirror of the
     private EPT in host memory, and define entries in kvm_x86_ops to
     operate on external page tables such as the TDX private EPT

   - The recently introduced conversion of the NX-page reclamation
     kthread to vhost_task moved the task under the main process. The
     task is created as soon as KVM_CREATE_VM was invoked and this, of
     course, broke userspace that didn't expect to see any child task of
     the VM process until it started creating its own userspace threads.

     In particular crosvm refuses to fork() if procfs shows any child
     task, so unbreak it by creating the task lazily. This is arguably a
     userspace bug, as there can be other kinds of legitimate worker
     tasks and they wouldn't impede fork(); but it's not like userspace
     has a way to distinguish kernel worker tasks right now. Should they
     show as "Kthread: 1" in proc/.../status?

  x86 - Intel:

   - Fix a bug where KVM updates hardware's APICv cache of the highest
     ISR bit while L2 is active, which ultimately results in a
     hardware-accelerated L1 EOI effectively being lost

   - Honor event priority when emulating Posted Interrupt delivery
     during nested VM-Enter by queueing KVM_REQ_EVENT instead of
     immediately handling the interrupt

   - Rework KVM's processing of the Page-Modification Logging buffer to
     reap entries in the same order they were created, i.e. to mark gfns
     dirty in the same order that hardware marked the page/PTE dirty

   - Misc cleanups

  Generic:

   - Cleanup and harden kvm_set_memory_region(); add proper lockdep
     assertions when setting memory regions and add a dedicated API for
     setting KVM-internal memory regions. The API can then explicitly
     disallow all flags for KVM-internal memory regions

   - Explicitly verify the target vCPU is online in kvm_get_vcpu() to
     fix a bug where KVM would return a pointer to a vCPU prior to it
     being fully online, and give kvm_for_each_vcpu() similar treatment
     to fix a similar flaw

   - Wait for a vCPU to come online prior to executing a vCPU ioctl, to
     fix a bug where userspace could coerce KVM into handling the ioctl
     on a vCPU that isn't yet onlined

   - Gracefully handle xarray insertion failures; even though such
     failures are impossible in practice after xa_reserve(), reserving
     an entry is always followed by xa_store() which does not know (or
     differentiate) whether there was an xa_reserve() before or not

  RISC-V:

   - Zabha, Svvptc, and Ziccrse extension support for guests. None of
     them require anything in KVM except for detecting them and marking
     them as supported; Zabha adds byte and halfword atomic operations,
     while the others are markers for specific operation of the TLB and
     of LL/SC instructions respectively

   - Virtualize SBI system suspend extension for Guest/VM

   - Support firmware counters which can be used by the guests to
     collect statistics about traps that occur in the host

  Selftests:

   - Rework vcpu_get_reg() to return a value instead of using an
     out-param, and update all affected arch code accordingly

   - Convert the max_guest_memory_test into a more generic
     mmu_stress_test. The basic gist of the "conversion" is to have the
     test do mprotect() on guest memory while vCPUs are accessing said
     memory, e.g. to verify KVM and mmu_notifiers are working as
     intended

   - Play nice with treewide builds of unsupported architectures, e.g.
     arm (32-bit), as KVM selftests' Makefile doesn't do anything to
     ensure the target architecture is actually one KVM selftests
     supports

   - Use the kernel's $(ARCH) definition instead of the target triple
     for arch specific directories, e.g. arm64 instead of aarch64,
     mainly so as not to be different from the rest of the kernel

   - Ensure that format strings for logging statements are checked by
     the compiler even when the logging statement itself is disabled

   - Attempt to whack the last LLC references/misses mole in the Intel
     PMU counters test by adding a data load and doing CLFLUSH{OPT} on
     the data instead of the code being executed. It seems that modern
     Intel CPUs have learned new code prefetching tricks that bypass the
     PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that
     events are counting correctly without actually knowing what the
     events count given the underlying hardware; this can happen if
     Intel reuses a formerly microarchitecture-specific event encoding
     as an architectural event, as was the case for Top-Down Slots"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
  kvm: defer huge page recovery vhost task to later
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: Disallow all flags for KVM-internal memslots
  KVM: x86: Drop double-underscores from __kvm_set_memory_region()
  KVM: Add a dedicated API for setting KVM-internal memslots
  KVM: Assert slots_lock is held when setting memory regions
  KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
  LoongArch: KVM: Add hypercall service support for usermode VMM
  LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
  KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
  KVM: VMX: read the PML log in the same order as it was written
  KVM: VMX: refactor PML terminology
  KVM: VMX: Fix comment of handle_vmx_instruction()
  KVM: VMX: Reinstate __exit attribute for vmx_exit()
  KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
  KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
  KVM: x86: Use LVT_TIMER instead of an open coded literal
  RISC-V: KVM: Add new exit statstics for redirected traps
  RISC-V: KVM: Update firmware counters for various events
  RISC-V: KVM: Redirect instruction access fault trap to guest
  ...
2025-01-25 09:55:09 -08:00
Linus Torvalds
382e391365 hyperv-next for v6.14
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmeTFQ4THHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXqMWB/4uHjnu50u+m00OwXAKQr6i92zh50BZ
 RQragd9s9C8tuUNwPDmS/ct2BNAhoy43KJ0ClegdZjKxT1Ys8cLv4Wr5CaGckqWq
 +WCHqTgt+cPe0vUofqahB5wiAZMsnBgzFkV/OfFwBx0wkub9y5T3qVq5KapYlaDI
 7Gftb+wg1AAsrdZ/HuLRy5ZVvkM/73rU2uoi8WXjr/T14E1krCFR/qirLd1OXo6Q
 Jb97qhnCt/N9JPwIq5/VnYWde5Mpqz6UgtA2rFLDXgNGz+h9/ND6ecWFHjZWNVdc
 AKWZTO5t+fRVBOSyahoyRoYSntPw3wlxyL7A2/54h6j4Dex7wLt6NQBj
 =empO
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Introduce a new set of Hyper-V headers in include/hyperv and replace
   the old hyperv-tlfs.h with the new headers (Nuno Das Neves)

 - Fixes for the Hyper-V VTL mode (Roman Kisel)

 - Fixes for cpu mask usage in Hyper-V code (Michael Kelley)

 - Document the guest VM hibernation behaviour (Michael Kelley)

 - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)

* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Documentation: hyperv: Add overview of guest VM hibernation
  hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
  hyperv: Do not overlap the hvcall IO areas in get_vtl()
  hyperv: Enable the hypercall output page for the VTL mode
  hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
  Drivers: hv: vmbus: Log on missing offers if any
  Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
  uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
  iommu/hyper-v: Don't assume cpu_possible_mask is dense
  Drivers: hv: Don't assume cpu_possible_mask is dense
  x86/hyperv: Don't assume cpu_possible_mask is dense
  hyperv: Remove the now unused hyperv-tlfs.h files
  hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
  hyperv: Add new Hyper-V headers in include/hyperv
  hyperv: Clean up unnecessary #includes
  hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-25 09:22:55 -08:00
Keith Busch
931656b9e2 kvm: defer huge page recovery vhost task to later
Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.

Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too.  It's similar in concept to pthread_once;
except it is actually usable, because the callback takes a parameter.
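
Conceptually, the helper behaves like this sketch (the actual API added
under include/linux/ may differ in detail):

  struct once {
          struct mutex lock;
          bool done;
  };

  static inline void call_once(struct once *once, void (*cb)(struct once *))
  {
          guard(mutex)(&once->lock);
          if (!once->done) {
                  cb(once);            /* unlike pthread_once, takes an argument */
                  once->done = true;
          }
  }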

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes: d96c77bd4e ("KVM: x86: switch hugepage recovery thread to vhost_task")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-24 10:53:56 -05:00
Linus Torvalds
1d6d399223 Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
    relevant code on any other CPU. This is currently handled by smpboot
    code which takes care of CPU-hotplug operations. Affinity here is
    a correctness constraint.
 
 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
    run anywhere else. The affinity is set through kthread_bind_mask()
    and the subsystem takes care by itself to handle CPU-hotplug
    operations. Affinity here is assumed to be a correctness constraint.
 
 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
    is not a correctness constraint but merely a preference in terms of
    memory locality. kswapd and kcompactd both fall into this category.
    The affinity is set manually like for any other task and CPU-hotplug
    is supposed to be handled by the relevant subsystem so that the task
    is properly reaffined whenever a given CPU from the node comes up.
    Also care should be taken so that the node affinity doesn't cross
    isolated (nohz_full) cpumask boundaries.
 
 4) Similar to the previous point except kthreads have a _preferred_
    affinity different than a node. Both RCU boost kthreads and RCU
    exp kworkers fall into this category as they refer to "RCU nodes"
    from a distinctly distributed tree.
 
 Currently the preferred affinity patterns (3 and 4) have at least 4
 identified users, with more or less success when it comes to handling
 CPU-hotplug operations and CPU isolation. Each of them does it in its own
 ad-hoc way.
 
 This is an infrastructure proposal to handle this with the following API
 changes:
 
 - kthread_create_on_node() automatically affines the created kthread to
   its target node unless it has been set as per-cpu or bound with
   kthread_bind[_mask]() before the first wake-up.
 
 - kthread_affine_preferred() is a new function that can be called right
   after kthread_create_on_node() to specify a preferred affinity
   different than the specified node.
 
 When the preferred affinity can't be applied because the possible
 targets are offline or isolated (nohz_full), the kthread is affine
 to the housekeeping CPUs (which means to all online CPUs most of the
 time or only the non-nohz_full CPUs when nohz_full= is set).
 
 kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
 converted, along with a few old drivers.
 
 Summary of the changes:
 
 * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
 
 * Introduce task_cpu_fallback_mask() that defines the default last
   resort affinity of a task to become nohz_full aware
 
 * Add some correctness check to ensure kthread_bind() is always called
   before the first kthread wake up.
 
 * Default affine kthread to its preferred node.
 
 * Convert kswapd / kcompactd and remove their halfway working ad-hoc
   affinity implementation
 
 * Implement kthreads preferred affinity
 
 * Unify kthread worker and kthread API's style
 
 * Convert RCU kthreads to the new API and remove the ad-hoc affinity
   implementation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
 jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
 Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
 u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
 v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
 ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
 NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
 f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
 IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
 AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
 =g8In
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handling
  CPU-hotplug operations and CPU isolation. Each of them does it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness check to ensure kthread_bind() is always
     called before the first kthread wake up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
Linus Torvalds
a6640c8c2f Objtool changes for v6.14:
- Introduce the generic section-based annotation
    infrastructure a.k.a. ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)
 
  - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
 
     - ANNOTATE_NOENDBR
     - ANNOTATE_RETPOLINE_SAFE
     - instrumentation_{begin,end}()
     - VALIDATE_UNRET_BEGIN
     - ANNOTATE_IGNORE_ALTERNATIVE
     - ANNOTATE_INTRA_FUNCTION_CALL
     - {.UN}REACHABLE
 
  - Optimize the annotation-sections parsing code (Peter Zijlstra)
 
  - Centralize annotation definitions in <linux/objtool.h>
 
  - Unify & simplify the barrier_before_unreachable()/unreachable()
    definitions (Peter Zijlstra)
 
  - Convert unreachable() calls to BUG() in x86 code, as
    unreachable() has unreliable code generation (Peter Zijlstra)
 
  - Remove annotate_reachable() and annotate_unreachable(), as they're
    unreliable against compiler optimizations (Peter Zijlstra)
 
  - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)
 
  - Robustify the annotation code by warning about unknown annotation
    types (Peter Zijlstra)
 
  - Allow arch code to discover jump table size, in preparation of
    annotated jump table support (Ard Biesheuvel)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOHiARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gATw/7Bn4A+Isqk9bKo6QgYEnKRoyf760ALQl6
 av/toEy1qCHT/CXCiEn1Hut1JEy4YyD6lIarC1scRl5xy7amRDEcCL0i2CKz3orn
 pf6Fk8/Pi68G2K50o4LTiq8t3uPBJXPlGyDlngh2hFTYRfPRT4m+cig784hmJEXG
 Xq2YzzUNG++U/4Uwe3JH7bX/vcZTYkZfM62FWfp3I4V0OqKU4c+Pkiv4u3Rs7L7b
 c3xk5/PktKZWV5TDsz0wU4SAGxYFGV47hhYM6cxdSYD3la7RVO+qZcqxsJByjpcL
 bvOmGKQ1SAXr08rV7TB+Fh8icaNE8Rbbmxf6slB0hdXBQb8STAZ810mZJFey6pnm
 kXgfhhfBOK5Sq+UbTfzF2JgquCGAbKK75bmNGgf2HaLnVLkFIw3AyMsuFqnxhI4X
 vXRHGnHCYpYUHTxzRYTFYR8XL8twA2kgjWkSe7hYrX/RQZV3XfyKOc2jyoJFMXeX
 LecfGJCE/pziZyj60SXT9WaUTvKc8gjWOEuAnW1pJQRM0zJqB9kjLh1cDYUseuwv
 gGkH59KEu0kcfOb5t/jWoqW3PTENJjEAhOmjun6Jv8wgbOxU88TMmSCWppj54O2X
 c2ibO407535u1SKBWZuaKFBLYftS2GM4WaGsdyTyh+ta48C8An90HMfYNKTHM9Nz
 F61Q7Zbn65E=
 =9nGt
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Introduce the generic section-based annotation infrastructure a.k.a.
   ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)

 - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
    - ANNOTATE_NOENDBR
    - ANNOTATE_RETPOLINE_SAFE
    - instrumentation_{begin,end}()
    - VALIDATE_UNRET_BEGIN
    - ANNOTATE_IGNORE_ALTERNATIVE
    - ANNOTATE_INTRA_FUNCTION_CALL
    - {.UN}REACHABLE

 - Optimize the annotation-sections parsing code (Peter Zijlstra)

 - Centralize annotation definitions in <linux/objtool.h>

 - Unify & simplify the barrier_before_unreachable()/unreachable()
   definitions (Peter Zijlstra)

 - Convert unreachable() calls to BUG() in x86 code, as unreachable()
   has unreliable code generation (Peter Zijlstra)

 - Remove annotate_reachable() and annotate_unreachable(), as they're
   unreliable against compiler optimizations (Peter Zijlstra)

 - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)

 - Robustify the annotation code by warning about unknown annotation
   types (Peter Zijlstra)

 - Allow arch code to discover jump table size, in preparation of
   annotated jump table support (Ard Biesheuvel)

* tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Convert unreachable() to BUG()
  objtool: Allow arch code to discover jump table size
  objtool: Warn about unknown annotation types
  objtool: Fix ANNOTATE_REACHABLE to be a normal annotation
  objtool: Convert {.UN}REACHABLE to ANNOTATE
  objtool: Remove annotate_{,un}reachable()
  loongarch: Use ASM_REACHABLE
  x86: Convert unreachable() to BUG()
  unreachable: Unify
  objtool: Collect more annotations in objtool.h
  objtool: Collapse annotate sequences
  objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE
  objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE
  objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE
  objtool: Convert instrumentation_{begin,end}() to ANNOTATE
  objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE
  objtool: Convert ANNOTATE_NOENDBR to ANNOTATE
  objtool: Generic annotation infrastructure
2025-01-21 10:13:11 -08:00
Paolo Bonzini
86eb1aef72 Merge branch 'kvm-mirror-page-tables' into HEAD
As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots.

Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may differ a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.

For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another.  Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.

As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory.  This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.

There are thus three sets of EPT page tables: external, mirror and
direct.  In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:

  external EPT - Hidden within the TDX module, modified via TDX module
                 calls.

  mirror EPT   - Bookkeeping tree used as an optimization by KVM, not
                 used by the processor.

  direct EPT   - Normal EPT that maps unencrypted shared memory.
                 Managed like the EPT of a normal VM.

Modifying external EPT
----------------------

Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops.  Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.

In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE can no longer
be done atomically with a single CMPXCHG.  For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT.  FROZEN_SPTE acts basically as a spinlock on a PTE.  Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA.  Confidential computing
adds to these the conversion of memory between shared and private. These
operations must not zap any private memory that is in use by the guest.

This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations.  Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.

To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT.  For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
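
A rough sketch of the kind of targeting this implies (the enum and field
names below are illustrative recollections of the series, not guaranteed
to match the merged code exactly):

    #include <linux/bits.h>

    /* Each zap request carries a filter saying which alias, and therefore
     * which EPT root (direct vs. mirror), it is allowed to touch. */
    enum kvm_gfn_range_filter {
            KVM_FILTER_SHARED  = BIT(0),
            KVM_FILTER_PRIVATE = BIT(1),
    };

    /* Per the table above: memslot deletion passes both bits, MMU-notifier
     * zaps and conversion to private pass only KVM_FILTER_SHARED, and
     * conversion to shared passes only KVM_FILTER_PRIVATE. */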
2025-01-20 07:15:58 -05:00
Paolo Bonzini
3eba032bb7 Merge branch 'kvm-userspace-hypercall' into HEAD
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
2025-01-20 07:03:06 -05:00
Paolo Bonzini
4f7ff70c05 KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
    with per-vCPU tracking of the vCPU's capabilities for all features.  Along
    the way, refactor the code to make it easier to add/modify features, and
    add a variety of self-documenting macro types to again simplify adding new
    features and to help readers understand KVM's handling of existing features.
 
  - Rework KVM's handling of VM-Exits during event vectoring to plug holes where
    KVM unintentionally puts the vCPU into infinite loops in some scenarios,
    e.g. if emulation is triggered by the exit, and to bring parity between VMX
    and SVM.
 
  - Add pending request and interrupt injection information to the kvm_exit and
    kvm_entry tracepoints respectively.
 
  - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
    loading guest/host PKRU due to a refactoring of the kernel helpers that
    didn't account for KVM's pre-checking of the need to do WRPKRU.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
 N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
 zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
 ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
 BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
 iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
 KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
 HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
 8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
 gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
 vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
 QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
 =32mM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.

 - Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20 06:49:39 -05:00
Paolo Bonzini
892e7b8c27 KVM VMX changes for 6.14:
- Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
    while L2 is active, which ultimately results in a hardware-accelerated L1
    EOI effectively being lost.
 
  - Honor event priority when emulating Posted Interrupt delivery during nested
    VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
    interrupt.
 
  - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
    prior to every VM-Enter.
 
  - Rework KVM's processing of the Page-Modification Logging buffer to reap
    entries in the same order they were created, i.e. to mark gfns dirty in the
    same order that hardware marked the page/PTE dirty.
 
  - Misc cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJqKwACgkQOlYIJqCj
 N/3bQA//e0Z/BWuRlg/CvYESgR/wxZ5ejK9/0wNDOCJAmim7C5ULMTQsT/PhwkLO
 Hn8uHJkP5r3JsKlifR2ks9cq1HK+YJLrOyvzTYJtOcqgpNXaFXuY/I2IvSAgm8g2
 MABOIaKETyFosPaLoGY+tupDDj+alXE+zL1UN1bzgj9Fqp5BRj2BuFGwP3YolHsl
 Ry2+scpQQ4vAFp4G6Ixjzr75EBvbQgyw3dv6BBQYFKrlxCU3EocrhmygssJf8zqd
 ODsXXMqL3TL5M69cb4Htko4PovhsaoD2x2V0F8OhdgNE2Y7O7oiUuFZqARF9eKf1
 bWOZSzsSWrL6cO7rtLSwFQULGsfLSQP0PnBWnjKbillkQ7GqUW4ufsh5qaKMr3fL
 B6f9eQYHqSa5wYU+ufVdnRraTX4EXvN8kaoJaLV6FdoV8D4gSHt0WedZvyy/WkaY
 Vsf/rCNMNTZnZfHDXFl3kZB9qCZJfZGfrHY7dUGwvs9XbXX5l3d9RPKESeKKSDmc
 s/cYMi1o1+D2GaGzdpJAJ8/WyI4w1k8kuqequgHx8XwtOZuUV9aNO3Hi8IYCgJSB
 sWHXghAloj7no6HQwkMkjkGI1EdDQTyLRlkuB0yybnuwvWF3Ix0OMUhSxx+zV4Gq
 smCJizz+duoaJOZa3tfAX4hIqZjFm0cmp8OqWETNK/bcWv3A5jM=
 =PSfy
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.14:

 - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
   while L2 is active, which ultimately results in a hardware-accelerated L1
   EOI effectively being lost.

 - Honor event priority when emulating Posted Interrupt delivery during nested
   VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
   interrupt.

 - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
   prior to every VM-Enter.

 - Rework KVM's processing of the Page-Modification Logging buffer to reap
   entries in the same order they were created, i.e. to mark gfns dirty in the
   same order that hardware marked the page/PTE dirty.

 - Misc cleanups.
2025-01-20 06:48:53 -05:00
Paolo Bonzini
672162a0d3 KVM SVM changes for 6.14:
- Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is
    optimized away when building with less than brilliant compilers.
 
  - Remove a now-redundant TLB flush when guest CR4.PGE changes.
 
  - Use str_enabled_disabled() to replace open coded strings.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJoesACgkQOlYIJqCj
 N/1FuQ//bKhTrPVvMhCZB7cxS7wcpb82LA15Gc73hXBTjsik896O6fAN0MyvYWvp
 eLmvhRWloVspIAkLj4pYF7PB6Tl0VAtpGOIdZp9TXOkbGcWg7jFFBrF3Udr2kDig
 3YyinqGhwPN6oSkrJMGuk79EKhuRVGcb7utE/T8MCV5ct9kayQAogGYMLwfnpYvX
 02SRYj++PUz5rk/gVUA2RA/EFSe57JE2VrzQlT34ttqqMZ2JKtysl5ytONGpEW/u
 ZJQ4naztY+vRCLHm+n8ZWE5TCQ3ZZVR5WVcnMqx1MQQEVSZJT5SxveLGVu9tVV4c
 Fv5fTvxa6iKfsUYl4o6tX1E0s388mwEast6uY40q4GysgkpsRRQR1KOLMGTQTMiz
 7hagYbc9gGY3Pr6pn1+qf3hgVogcDX0wnQlFe6PYkHiMc7woIhUCGJ3LND7Y11O+
 cPnbdK+0ZZhsOzYQ6SGeXIv326wvZacqKzXkkWt9pf8abbwaUXVqjYe+Y0yEHsuJ
 e1gPnTdtzzim3Jq5qDRhPEHthHLm+na4O+6vjYlFVoUN4XnbNGhVzUJvQX70WX8a
 6vrQ6gM768dwkyzdpRAGuWv+OWXHBe8UYKp4HQKpWGjRcKiwcXWDwQeIbfE7JkS+
 L/Ck0oQ72sWoBoH4noM/5ehVB4ZY/bEp+GbVbd/ny6nDP1fmApM=
 =MhtD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.14:

 - Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is
   optimized away when building with less than brilliant compilers.

 - Remove a now-redundant TLB flush when guest CR4.PGE changes.

 - Use str_enabled_disabled() to replace open coded strings.
2025-01-20 06:48:35 -05:00
Paolo Bonzini
cae083c4e7 KVM x86 MMU changes for 6.14:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
    direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJnmoACgkQOlYIJqCj
 N/0QTA/+OlCwJJn8og22PwkYzeUtaoptInXlWZHWa732ZaMDDaFMbtlxARFGqu2P
 OunqoZDpSWUqEmBg0X79UgUwpnFs3TRjwJecB0VZr3y7mKmU05npChUxXtv9L+64
 Bsm4t+2AeXyj+08nbgpFAZ5MQwltSuBUEnXhfjelizF3+pvr3IvTioxRoH1tWIQ1
 o/Qi7+0yT19ck37a3EKE2N+beTjNQTHNSst+mrVty099m54ydiTAtzIbtBZRzjz3
 +qxAGhAm+9EUnX8IWg/glfMF+gwvXgxMK2GxCfuVeHAbr9aNFd/JFP96iQ52D79v
 gHd8ykok0vrJWDxNihYX3VftAkn2Avy0hPGhKodYNIzu+RWYkoTdw/sC9Coa/xKx
 NF0ZJQ2LX9FLvLr/wg/97oNF4AapDtgGNlJxXGfh96H/nKoCOKoVmlvUeer+AkBT
 G4HeIffCSWhtEOfQRrE8zo2IG/UNA/zfcsEIhxu/ocm3qtTZXLZNTkwfhc5EFRMl
 m80JXnX0GC9YhWTHXdHYj7sN2yf0hUiHTJoKOwL16g56OIzR6ia1qcwZ4dVhIYSO
 RgPCvxHrOvFe1ISPG7xojRSFaHnbPASlKoqwTvbT9n4QsHSKto1yVMWxZY8lmneh
 no7CuheUOR8M0/xaavsZzcbKXbFU9EIidakYAFwlNxhUfwiPUbI=
 =XaOP
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.14:

 - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
   direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
2025-01-20 06:36:55 -05:00
Paolo Bonzini
4e4f38f84e KVM kvm_set_memory_region() cleanups and hardening for 6.14:
- Add proper lockdep assertions when setting memory regions.
 
  - Add a dedicated API for setting KVM-internal memory regions.
 
  - Explicitly disallow all flags for KVM-internal memory regions.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeHH5gACgkQOlYIJqCj
 N/2pIBAAm+RLu554cc51EiIemZOI94WphNkl7QegZf+VfPvcPU0pXT3VZiX/hGcl
 mk8KJhYMeIjr+rh3kVENkn9YRCmcH334dDmA7CQJwe30EYvKwC9XlOxpRmUZl8h+
 DaO6+uf07rs4ZezCBUQ90f9GV7ABpp3TFmwYlPSqYW4TKbWbiwqPZL8X06ZTRgb3
 Z5rFI2WpxsHOnlFLQAjmWZ6K3/oMDrdDplwJy9XTEpYV29oX5y06Zp3YeBAkeoxo
 uvwv9JYyZcev7j4gamQngud9IJc3Rc5/GzK0NndsPqUOVPGHXuSiul3ajbd3HgLL
 zp8Gai9y170DG8KUp+TNdhzaW0E+lLJAeLrWZTkfSwbdxGpA+XEdcwAiHd7X/xqK
 0Ty/FS/vfLA9adearWE1DihskCQ1L1ZhJRqjRx5vGIwXBzyhtcKLhYWJL97eOrd+
 jmPVUJe9nOZdbdmrQksT0PinQVnsNorSlfU7ODBnymzisIF0Z9NP8HRHLk4PT2C7
 VBdOUpmvY1Fae1GUq5HLB1TJml/FoRztTBmtnoczczDlmOmgJ4wuMoiGYIkZaTED
 Zhb40/tk4mtBCcY/iXB+0G2uAl+1xw7rrmhsV/JUKexXPJs7U2Yj8xPVpqMoU7aU
 HCvUjUujsHZ/SvYXQu3rCZOsFyNfLvX8b+0Fv41gpBPgYqGZXA8=
 =Dqql
 -----END PGP SIGNATURE-----

Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM kvm_set_memory_region() cleanups and hardening for 6.14:

 - Add proper lockdep assertions when setting memory regions.

 - Add a dedicated API for setting KVM-internal memory regions.

 - Explicitly disallow all flags for KVM-internal memory regions.
2025-01-20 06:36:14 -05:00
Yan Zhao
7c54803863 KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
instead of 1 in kvm_mmu_page_fault().

The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
kvm_handle_page_fault()). They either check if the return value is > 0 (as
in npf_interception()) or pass it further to vcpu_run() to decide whether
to break out of the kernel loop and return to the user when r <= 0.
Therefore, returning any positive value is equivalent to returning 1.

Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure
a positive return value.

This is a preparation to allow TDX's EPT violation handler to check the
RET_PF* value and retry internally for RET_PF_RETRY.

No functional changes are intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-15 11:45:26 -05:00
Sean Christopherson
344315e93d KVM: x86: Drop double-underscores from __kvm_set_memory_region()
Now that there's no outer wrapper for __kvm_set_memory_region() and it's
static, drop its double-underscore prefix.

No functional change intended.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:16 -08:00
Sean Christopherson
156bffdb2b KVM: Add a dedicated API for setting KVM-internal memslots
Add a dedicated API for setting internal memslots, and have it explicitly
disallow setting userspace memslots.  Setting a userspace memslot without
a direct command from userspace would result in all manner of issues.

No functional change intended.

Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:15 -08:00
Sean Christopherson
d131f0042f KVM: Assert slots_lock is held when setting memory regions
Add proper lockdep assertions in __kvm_set_memory_region() and
__x86_set_memory_region() instead of relying on comments.

Opportunistically delete __kvm_set_memory_region()'s entire function
comment as the API doesn't allocate memory or select a gfn, and the
"mostly for framebuffers" comment hasn't been true for a very long time.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:15 -08:00
Thorsten Blum
4c334c6880 KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.
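
For reference, a minimal sketch of the effect (str_enabled_disabled() comes
from <linux/string_choices.h>; the call site below is illustrative, not the
exact line touched by the patch):

    #include <linux/printk.h>
    #include <linux/string_choices.h>

    static void report_feature(bool enabled)
    {
            /* Instead of: enabled ? "enabled" : "disabled" */
            pr_info("Nested Paging %s\n", str_enabled_disabled(enabled));
    }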

Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20250110101100.272312-2-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-10 06:56:20 -08:00
Nuno Das Neves
ef5a3c92a8 hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.

This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).

To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.

Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-01-10 00:54:21 +00:00
Maxim Levitsky
37c3ddfe52 KVM: VMX: read the PML log in the same order as it was written
Intel's PRM specifies that the CPU writes to the PML log 'backwards'
or in other words, it first writes entry 511, then entry 510 and so on.

I also confirmed on the bare metal that the CPU indeed writes the entries
in this order.

KVM on the other hand, reads the entries in the opposite order, from the
last written entry and towards entry 511 and dumps them in this order to
the dirty ring.

Usually this doesn't matter, except for one complex nesting case:

KVM retries the instructions that cause MMU faults.
This might cause an emulated PML log entry to be visible to L1's hypervisor
before the actual memory write was committed.

This happens when the L0 MMU fault is followed directly by the VM exit to
L1, for example due to a pending L1 interrupt or due to the L1's
'PML log full' event.

This problem doesn't have a noticeable real-world impact because this
write retry is not much different from the guest writing to the same page
multiple times, which is also not reflected in the dirty log. The users of
the dirty logging only rely on correct reporting of the clean pages, or
in other words they assume that if a page is clean, then no writes were
committed to it since the moment it was marked clean.

However KVM has a kvm_dirty_log_test selftest, a test that checks both
the clean and the dirty pages against the memory contents, and it can fail
if it detects a dirty page which still has an old value at offset 0, which
is where the test writes.

To avoid failure, the test has a workaround for this specific problem:

The test skips checking memory that belongs to the last dirty ring entry
it has seen, relying on the fact that as long as memory writes are
committed in-order, only the last entry can belong to a not yet committed
memory write.

However, since L1's KVM is reading the PML log in the opposite direction
from the one L0 wrote it in, the last dirty ring entry often will not be
the last entry written by L0.

To fix this, switch the order in which KVM reads the PML log.

Note that this issue is not present on the bare metal, because on the
bare metal, an update of the A/D bits of a present entry, PML logging and
the actual memory write are all done by the CPU without any hypervisor
intervention and pending interrupt evaluation, thus once a PML log and/or
vCPU kick happens, all memory writes that are in the PML log are
committed to memory.

The only exception to this rule is when the guest hits a not present EPT
entry, in which case KVM first reads (backward) the PML log, dumps it to
the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
this to the dirty ring, thus making the entry be the last one in the
dirty ring.
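
A minimal sketch of the new reading direction (the constants follow the
PML_LOG_NR_ENTRIES/PML_HEAD_INDEX naming from the refactor below; buffer
handling is simplified for illustration):

    #include <linux/types.h>

    #define PML_LOG_NR_ENTRIES  512
    #define PML_HEAD_INDEX      (PML_LOG_NR_ENTRIES - 1)

    /* pml_idx is the index of the next entry the CPU would write, i.e.
     * PML_HEAD_INDEX for an empty log, wrapped past zero for a full one.
     * The CPU fills the log from entry 511 downwards, so read it back in
     * that same order. */
    static void sketch_flush_pml_buffer(const u64 *pml_buf, u16 pml_idx,
                                        void (*mark_gpa_dirty)(u64 gpa))
    {
            int tail, i;

            tail = (pml_idx < PML_LOG_NR_ENTRIES) ? pml_idx + 1 : 0;

            for (i = PML_HEAD_INDEX; i >= tail; i--)
                    mark_gpa_dirty(pml_buf[i]);
    }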

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241219221034.903927-3-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:31:25 -08:00
Maxim Levitsky
ae81ce936f KVM: VMX: refactor PML terminology
Rename PML_ENTITY_NUM to PML_LOG_NR_ENTRIES
Add PML_HEAD_INDEX to specify the first entry that CPU writes.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241219221034.903927-2-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:19:25 -08:00
Gao Shiyuan
4d141e444e KVM: VMX: Fix comment of handle_vmx_instruction()
Fix a goof in handle_vmx_instruction()'s comment where it references the
non-existent nested_vmx_setup(); the function that overwrites the exit
handlers is nested_vmx_hardware_setup().

Note, this isn't a case of a stale comment, e.g. due to the function being
renamed.  The comment has always been wrong.

Fixes: e4027cfafd ("KVM: nVMX: Set callbacks for nested functions during hardware setup")
Signed-off-by: Gao Shiyuan <gaoshiyuan@baidu.com>
Link: https://lore.kernel.org/r/20250103153814.73903-1-gaoshiyuan@baidu.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:12:31 -08:00
Costas Argyris
b5fd068473 KVM: VMX: Reinstate __exit attribute for vmx_exit()
Tag vmx_exit() with __exit now that it's no longer used by vmx_init().

Commit a7b9020b06 ("x86/l1tf: Handle EPT disabled state proper") dropped
the "__exit" attribute from vmx_exit() because vmx_init() was changed to
call vmx_exit().

However, commit e32b120071 ("KVM: VMX: Do _all_ initialization before
exposing /dev/kvm to userspace") changed vmx_init() to call __vmx_exit()
instead of the "full" vmx_exit().  This made it possible to mark vmx_exit()
as "__exit" again, as it originally was, and enjoy the benefits that it
provides (the function can be discarded from memory in situations where it
cannot be called, like the module being built-in or module unloading being
disabled in the kernel).

Signed-off-by: Costas Argyris <costas.argyris@amd.com>
Link: https://lore.kernel.org/r/20250102154050.2403-1-costas.argyris@amd.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:10:05 -08:00
Thorsten Blum
800173cf75 KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20241227094450.674104-2-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:08:41 -08:00
Sean Christopherson
4c20cd4cee KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
Use the raw wrpkru() helper when loading the guest/host's PKRU on switch
to/from guest context, as the write_pkru() wrapper incurs an unnecessary
rdpkru().  In both paths, KVM is guaranteed to have performed RDPKRU since
the last possible write, i.e. KVM has a fresh cache of the current value
in hardware.

This effectively restores KVM's behavior to that of KVM prior to commit
c806e88734 ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw
helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a
wrapper.  Commit 577ff465f5 ("x86/fpu: Only write PKRU if it is different
from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU,
but completely missed that KVM already optimized away pointless writes.
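
A minimal sketch of the load path after the change (field names are
simplified; wrpkru() is the raw helper from asm/special_insns.h, while
write_pkru() would re-read PKRU before deciding to write):

    #include <linux/types.h>
    #include <asm/special_insns.h>

    static void sketch_load_guest_pkru(u32 guest_pkru, u32 host_pkru)
    {
            /* KVM already holds a fresh copy of the hardware value, so skip
             * write_pkru()'s extra RDPKRU and only write on change. */
            if (guest_pkru != host_pkru)
                    wrpkru(guest_pkru);
    }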

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 577ff465f5 ("x86/fpu: Only write PKRU if it is different from current")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20241221011647.3747448-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:08:25 -08:00
Liam Ni
d6470627f5 KVM: x86: Use LVT_TIMER instead of an open coded literal
Use LVT_TIMER instead of the literal '0' to clean up the apic_lvt_mask
lookup when handling emulated writes to APIC_LVTT.

No functional change intended.

Signed-off-by: Liam Ni <zhiguangni01@gmail.com>
[sean: manually regenerate patch (whitespace damaged), massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:05:19 -08:00
Frederic Weisbecker
b04e317b52 treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.

On the other hand, kthread_create_worker() creates a kthread worker and
runs it.

This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.

Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
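
A minimal sketch of the resulting split (worker name and work item are
illustrative):

    #include <linux/kthread.h>
    #include <linux/err.h>

    static void demo_work_fn(struct kthread_work *work) { }

    static int demo_start_worker(void)
    {
            struct kthread_worker *worker;
            DEFINE_KTHREAD_WORK(demo_work, demo_work_fn);

            /* Creates the worker thread and runs it, mirroring kthread_run().
             * kthread_create_worker() now only creates it, which leaves room
             * to affine the worker before its first wake-up. */
            worker = kthread_run_worker(0, "demo_worker");
            if (IS_ERR(worker))
                    return PTR_ERR(worker);

            kthread_queue_work(worker, &demo_work);
            kthread_flush_worker(worker);
            kthread_destroy_worker(worker);
            return 0;
    }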

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08 18:15:03 +01:00
Borislav Petkov (AMD)
716f86b523 KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspace
SRSO_USER_KERNEL_NO denotes whether the CPU is affected by SRSO across
user/kernel boundaries. Advertise it to guest userspace.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20241202120416.6054-3-bp@kernel.org
2024-12-30 17:56:00 +01:00
Rick Edgecombe
2c3412e999 KVM: x86/mmu: Prevent aliased memslot GFNs
Add a few sanity checks to prevent memslot GFNs from ever having alias bits
set.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.

For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.

If these memslot GFNs ever had the bit that selects between the two aliases
it could lead to unexpected behavior in the complicated code that directs
faulting or zapping operations between the roots that map the two aliases.

As a safety measure, prevent memslots from being set at a GFN range that
contains the alias bit.

Also, check in the kvm_faultin_pfn() for the fault path. This later check
does less today, as the alias bits are specifically stripped from the GFN
being checked, however future code could possibly call in to the fault
handler in a way that skips this stripping. Since kvm_faultin_pfn() now
has many references to vcpu->kvm, extract it to a local variable.
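
A rough sketch of the memslot-side check being described (the helper name
kvm_gfn_direct_bits() is an assumption based on the gfn_direct_bits field
introduced later in this series; the surrounding function is illustrative):

    #include <linux/kvm_host.h>

    static bool sketch_memslot_overlaps_alias(struct kvm *kvm,
                                              u64 base_gfn, u64 npages)
    {
            u64 alias_bits = kvm_gfn_direct_bits(kvm);

            /* Reject any slot whose GFN range would set the bit that selects
             * between the shared and private aliases. */
            return (base_gfn & alias_bits) ||
                   ((base_gfn + npages - 1) & alias_bits);
    }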

Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:55 -05:00
Rick Edgecombe
df4af9f89c KVM: x86/tdp_mmu: Don't zap valid mirror roots in kvm_tdp_mmu_zap_all()
Don't zap valid mirror roots in kvm_tdp_mmu_zap_all(), which in effect
is only direct roots (invalid and valid).

For TDX, kvm_tdp_mmu_zap_all() is only called during MMU notifier
release. Since mirrored EPT comes from guest mem, it will never be mapped
to userspace, so the notifier won't apply. But in addition to being unnecessary,
mirrored EPT is cleaned up in a special way during VM destruction.

Pass the KVM_INVALID_ROOTS bit into __for_each_tdp_mmu_root_yield_safe()
as well, to clean up invalid direct roots, as is the current behavior.

While at it, remove an obsolete reference to work item-based zapping.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-18-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:55 -05:00
Isaku Yamahata
a89ecbb56b KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and make it take enum kvm_tdp_mmu_root_types
as an argument.

kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a VM it needs to invalidate all roots. Have the callers only
invalidate the required roots instead of all roots.

Within kvm_tdp_mmu_invalidate_roots(), respect the root type
passed by checking the root type in the root iterator.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-17-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
94faba8999 KVM: x86/tdp_mmu: Propagate tearing down mirror page tables
Integrate hooks for mirroring page table operations for cases where TDX
will zap PTEs or free page tables.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from the guest. Now add
plumbing to propagate changes to the "external" page tables being
mirrored. Just create the x86_ops for now, leave plumbing the operations
into the TDX module for future patches.

Add two operations for tearing down page tables, one for freeing page
tables (free_external_spt) and one for zapping PTEs (remove_external_spte).
Define them such that remove_external_spte will perform a TLB flush as
well. (in TDX terms "ensure there are no active translations").

TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For zapping/freeing, only
hook tdp_mmu_iter_set_spte() which is used for mapping and linking PTs.
Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for
zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and
kvm_mmu_zap_all_fast().

In previous changes to address races around concurrent populating using
tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
FROZEN_SPTE in the mirrored page tables while performing the external
operations. Such a solution is not needed for the tear down paths in TDX
as these will always be performed with the mmu_lock held for write.
Sprinkle some KVM_BUG_ON()s to reflect this.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-16-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
77ac7079e6 KVM: x86/tdp_mmu: Propagate building mirror page tables
Integrate hooks for mirroring page table operations for cases where TDX
will set PTEs or link page tables.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from any actual guest
mapping. Now add plumbing to propagate changes to the "external" page
tables being mirrored. Just create the x86_ops for now, leave plumbing the
operations into the TDX module for future patches.

Add two operations for setting up external page tables, one for linking
new page tables and one for setting leaf PTEs. Don't add any op for
configuring the root PFN, as TDX handles this itself. Don't provide a
way to set permissions on the PTEs also, as TDX doesn't support it.

This results in MMU "mirroring" support that is very targeted towards TDX.
Since it is likely there will be no other user, the main benefit of making
the support generic is to keep TDX specific *looking* code outside of the
MMU. As a generic feature it will make enough sense from TDX's
perspective. For developers unfamiliar with TDX arch it can express the
general concepts such that they can continue to work in the code.

TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For setting/linking, only
hook tdp_mmu_set_spte_atomic() which is used for mapping and linking
PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
setting PTEs in operations unsupported by TDX: splitting huge pages and
write protecting. Sprinkle KVM_BUG_ON()s to document as code that these
paths are not supported for mirrored page tables. For zapping operations,
leave those for near future changes.

Many operations in the TDP MMU depend on atomicity of the PTE update.
While the mirror PTE on KVM's side can be updated atomically, the update
that happens inside the external operations (S-EPT updates via TDX module
call) can't happen atomically with the mirror update. The following race
could result during two vCPU's populating private memory:

* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: set_external_spte() to update 4K secure EPT entry => error
          because 2M secure EPT entry is not populated yet
* vcpu 1: link_external_spt() to update 2M secure EPT entry

Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect
operations are performed. Only write the actual mirror PTE value once the
reflect operations have completed. When trying to set a PTE to present and
encountering a frozen SPTE, retry the fault.

By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
* vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
* vcpu 2: find that the EPT entry is frozen
          abandon page table walk to resume guest execution
* vcpu 1: link_external_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution
          Depending on vcpu 1 state, vcpu 2 may result in EPT violation
          again or make progress on guest execution
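
A minimal sketch of the freeze/unfreeze ordering described above
(FROZEN_SPTE is KVM-internal, set_external_spte() stands in for the TDX
module call, and the error handling is simplified for illustration):

    #include <linux/types.h>
    #include <linux/atomic.h>
    #include <linux/errno.h>

    static int sketch_set_private_spte(u64 *sptep, u64 old_spte, u64 new_spte,
                                       int (*set_external_spte)(u64 new_spte))
    {
            /* Freeze the mirror PTE; a concurrent faulter that observes
             * FROZEN_SPTE abandons its walk and retries the fault later. */
            if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
                    return -EBUSY;

            /* Update the external (secure) EPT while the mirror entry is
             * frozen, so the two halves can't be observed out of sync. */
            if (set_external_spte(new_spte)) {
                    /* Roll back so the fault can be retried cleanly. */
                    WRITE_ONCE(*sptep, old_spte);
                    return -EIO;
            }

            /* Publish the real value only after the external update, which
             * also unfreezes the PTE. */
            WRITE_ONCE(*sptep, new_spte);
            return 0;
    }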

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-15-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Paolo Bonzini
de1bf90488 KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks
Teach the MMU notifier callbacks how to check kvm_gfn_range.process to
filter which KVM MMU root types to operate on.

The private GPAs are backed by guest memfd. Such memory is not subjected
to MMU notifier callbacks because it can't be mapped into the host user
address space. Now kvm_gfn_range conveys info about which root to operate
on. Enhance the callback to filter the root page table type.

The KVM MMU notifier comes down to two functions.
kvm_tdp_mmu_unmap_gfn_range() and __kvm_tdp_mmu_age_gfn_range():
- invalidate_range_start() calls kvm_tdp_mmu_unmap_gfn_range()
- invalidate_range_end() doesn't call into arch code
- the other callbacks call __kvm_tdp_mmu_age_gfn_range()

For VMs without a private/shared split in the EPT, all operations
should target the normal (direct) root.

With the switch from for_each_tdp_mmu_root() to
__for_each_tdp_mmu_root() in kvm_tdp_mmu_handle_gfn(), there are no
longer any users of for_each_tdp_mmu_root(). Remove it.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-14-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
fabaa76501 KVM: x86/tdp_mmu: Support mirror root for TDP MMU
Add the ability for the TDP MMU to maintain a mirror of a separate
mapping.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.

In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
   accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently
   and it will make the code confusing when being read from their
   perspective. Special operations added for TDX that set private or zap
   private memory will have nothing to do with these other private memory
   technologies. (SEV, etc).

To handle these, instead teach the TDP MMU about a new concept "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid-level page
tables need to be installed. When the memory being mirrored needs to actually
be changed, calls can be made via x86_ops.

  private KVM page fault   |
      |                    |
      V                    |
 private GPA               |     CPU protected EPTP
      |                    |           |
      V                    |           V
 mirror PT root            |     external PT root
      |                    |           |
      V                    |           V
   mirror PT   --hook to propagate-->external PT
      |                    |           |
      \--------------------+------\    |
                           |      |    |
                           |      V    V
                           |    private guest page
                           |
                           |
     non-encrypted memory  |    encrypted memory
                           |

Leave calling out to actually update the private page tables that are being
mirrored for later changes. Just implement the handling of MMU operations
on mirrored roots.

In order to direct operations to correct root, add root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct
roots to private/shared with conditionals. It could also be implemented by
making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line
up such that conversion could be a direct assignment with a case. Don't do
this because the mapping of private to mirrored is confusing enough. So it
is worth not hiding the logic in type casting.

Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root cannot be
rebuilt like a normal root. It needs to persist for the lifetime of the VM.

The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirrored page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_external_spt().

Don't support 2M page for now. This is avoided by forcing 4k pages in the
fault. Add a KVM_BUG_ON() to verify.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
00d98dd4a8 KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte()
Take the root as an argument of tdp_mmu_for_each_pte() instead of looking
it up in the mmu. Since there is then no other reason to pass the mmu, drop it.

Future changes will want to change which root is used based on the context
of the MMU operation. So change the callers to pass in the root currently
used, mmu->root.hpa in a preparatory patch to make the later one smaller
and easier to review.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-12-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
de86ef7bf5 KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type
Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1]
so that the iterator on the root page table can consistently filter the
root page table type instead of only_valid.

TDX KVM will operate on KVM page tables with specified types: shared page
table, private page table, or both.  Introduce an enum instead of bool
only_valid so that we can easily enhance page table types applicable to
shared, private, or both in addition to valid or not.  Replace
only_valid=false with KVM_ANY_ROOTS and only_valid=true with
KVM_ANY_VALID_ROOTS.  Use KVM_ANY_ROOTS and KVM_ANY_VALID_ROOTS to wrap
KVM_VALID_ROOTS to avoid further code churn when direct vs mirror root
concepts are introduced in future patches.
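
A rough sketch of the enum shape being described (values are illustrative;
the direct/mirror root types only appear in later patches):

    #include <linux/bits.h>

    enum kvm_tdp_mmu_root_types {
            KVM_INVALID_ROOTS = BIT(0),
            KVM_VALID_ROOTS   = BIT(1),

            /* Wrappers around KVM_VALID_ROOTS so that adding direct/mirror
             * variants later doesn't churn every caller. */
            KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
            KVM_ANY_ROOTS       = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
    };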

Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ [1]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-11-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
e84b8e4e44 KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
Extract tdp_mmu_root_match() to check if the root has given types and use
it for the root page table iterator.  It checks only_invalid now.

TDX KVM operates on a shared page table only (Shared-EPT), a mirrored page
table only (Secure-EPT), or both based on the operation.  KVM MMU notifier
operations operate only on the shared page table, KVM guest_memfd invalidation
operations only on the mirrored page table, and so on.  Introduce a centralized
matching function instead of open coding matching logic in the iterator.
The next step is to extend the function to check whether the page is shared
or private.

Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-10-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
3fc3f71851 KVM: x86/mmu: Support GFN direct bits
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid
in implementing TDX shared memory.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.

For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.

So KVM will either need to apply or strip the shared bit before mapping or
zapping the shared EPT. Having GFNs sometimes have the shared bit and
sometimes not would make the code confusing.

So instead arrange the code such that GFNs never have shared bit set.
Create a concept of "direct bits", that is stripped from the fault
address when setting fault->gfn, and applied within the TDP MMU iterator.
Calling code will behave as if it is operating on the PTE mapping the GFN
(without shared bits) but within the iterator, the actual mappings will be
shifted using bits specific for the root. SPs will have the GFN set
without the shared bit. In the end the TDP MMU will behave like it is
mapping things at the GFN without the shared bit but with a strange page
table format where everything is offset by the shared bit.

Since TDX only needs to shift the mapping like this for the shared bit,
which is mapped as the normal TDP root, add a "gfn_direct_bits" field to
the kvm_arch structure for each VM with a default value of 0. It will
have the bit set at the position of the GPA shared bit in GFN through TD
specific initialization code. Keep TDX specific concepts out of the MMU
code by not naming it "shared".

Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN
ranges. In convention established above, these would need to target the
shifted GFN range. It won't matter functionally, since the actual
implementation will always result in a full flush for the only planned
user (TDX). For correctness reasons, future changes can provide a TDX
x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and
force the full flush for TDs.

This leaves one problem. Some operations use a concept of max GFN (i.e.
kvm_mmu_max_gfn()), to iterate over the whole TDP range. When applying the
direct mask to the start of the range, the iterator would end up skipping
iterating over the range not covered by the direct mask bit. For safety,
make sure the __tdp_mmu_zap_root() operation iterates over the full GFN
range supported by the underlying TDP format. Add a new iterator helper,
for_each_tdp_pte_min_level_all(), that iterates the entire TDP GFN range,
regardless of root.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
e23186da66 KVM: x86/tdp_mmu: Take struct kvm in iter loops
Add a struct kvm argument to the TDP MMU iterators.

Future changes will want to change how the iterator behaves based on a
member of struct kvm. Change the signature and callers of the iterator
loop helpers in a separate patch to make the future one easier to review.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-8-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Rick Edgecombe
243e13e810 KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void
The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
allows for the caller, mmu_alloc_direct_roots(), to call
kvm_tdp_mmu_alloc_root() and also return 0 in one line:
   return kvm_tdp_mmu_alloc_root(vcpu);

So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
will be called twice in mmu_alloc_direct_roots(). This will force the
first call to either awkwardly handle the return value that will always
be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.
Do it in a separate change so the future change will be cleaner.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240718211230.1492011-7-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:31:54 -05:00
Isaku Yamahata
6961ab0bae KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role
Introduce a "is_mirror" member to the kvm_mmu_page_role union to identify
SPTEs associated with the mirrored EPT.

The TDX module maintains the private half of the EPT mapped in the TD in
its protected memory. KVM keeps a copy of the private GPAs in a mirrored
EPT tree within host memory. This "is_mirror" attribute enables vCPUs to
find and get the root page of mirrored EPT from the MMU root list for a
guest TD. This also allows KVM MMU code to detect changes in mirrored EPT
according to the "is_mirror" mmu page role and propagate the changes to
the private EPT managed by TDX module.
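
For reference, a sketch of the new role bit (the surrounding role bits are
elided; only is_mirror is the addition described here):

  union kvm_mmu_page_role {
          u32 word;
          struct {
                  /* ...existing role bits... */
                  unsigned is_mirror:1;   /* SP is part of the mirrored EPT */
          };
  };

Roots and SPs with the bit set are the ones whose updates must be propagated
to the external EPT owned by the TDX module.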

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-6-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:29:45 -05:00
Isaku Yamahata
3a4eb364a4 KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page
Add an external pointer to struct kvm_mmu_page for TDX's private page table
and add helper functions to allocate/initialize/free a private page table
page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU
doesn't use unsync_children and write_flooding_count, pack them to have
room for a pointer and use a union to avoid memory overhead.

For private GPAs, the CPU refers to a private page table whose contents are
encrypted. Dedicated APIs must be used to operate on it (e.g. updating/reading
its PTE entries), and they are expensive.

When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate two sets of page tables for the private
half of the GPA space.

For the page tables that KVM will walk, allocate them like normal and refer
to them as mirror page tables. Additionally allocate one more page for the
page tables the CPU will walk, and call them external page tables. Resolve
the KVM page fault with the existing code, and do additional operations
necessary for modifying the external page table in future patches.

The relationship of the types of page tables in this scheme is depicted
below:

              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
     shared GPA           private GPA            |
        |                      |                 |
        V                      V                 |
    shared PT root      mirror PT root           |    private PT root
        |                      |                 |           |
        V                      V                 |           V
     shared PT           mirror PT        --propagate--> external PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        |                                        |      |    |
        V                                        |      V    V
  shared guest page                              |    private guest page
                                                 |
                           non-encrypted memory  |    encrypted memory
                                                 |
PT          - Page table
Shared PT   - Visible to KVM, and the CPU uses it for shared mappings.
External PT - The CPU uses it, but it is invisible to KVM. TDX module
              updates this table to map private guest pages.
Mirror PT   - It is visible to KVM, but the CPU doesn't use it. KVM uses
              it to propagate PT change to the actual private PT.

Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.
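
As a sketch of the layout described above (the exact union packing is
illustrative; external_spt is the new pointer, and kvm_has_mirrored_tdp() is
keyed off the TDX VM type as stated):

  struct kvm_mmu_page {
          /* ... */
          union {
                  /* Unused by the TDP MMU, so they can share space... */
                  struct {
                          unsigned int unsync_children;
                          atomic_t write_flooding_count;
                  };
                  /* ...with the page table the CPU walks for private GPAs. */
                  void *external_spt;
          };
          /* ... */
  };

  static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
  {
          return kvm->arch.vm_type == KVM_X86_TDX_VM;
  }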

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:28:55 -05:00
Isaku Yamahata
dca6c88532 KVM: Add member to struct kvm_gfn_range to indicate private/shared
Add new members to struct kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_gfn_range_filter
attr_filter. Update the core zapping operations to set them appropriately.

TDX utilizes two GPA aliases for the same memslots, one for private memory
and one for shared memory. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add fields in struct kvm_gfn_range to allow
callers to specify which aliases so they can only target the aliases
appropriate for their specific operation.

There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered. Several variations of separate bools were tried, defined such
that the default behavior was to process both aliases; they either allowed
nonsensical configurations or were confusing for the caller. A simple
enum was also explored and was close, but was hard to process in the
caller. Instead, use an enum with the default value (0) reserved as a
disallowed value. Catch ranges that didn't have the target aliases
specified by looking for that specific value.

Set target alias with enum appropriately for these MMU operations:
 - For KVM's mmu notifier callbacks, zap shared pages only because private
   pages won't have a userspace mapping
 - For setting memory attributes, kvm_arch_pre_set_memory_attributes()
   chooses the aliases based on the attribute.
 - For guest_memfd invalidations, zap private only.
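
A sketch of the resulting shape (struct kvm_gfn_range is abbreviated here;
the filter names follow the bitmask-with-disallowed-zero scheme described
above and should be treated as illustrative):

  enum kvm_gfn_range_filter {
          KVM_FILTER_SHARED       = BIT(0),
          KVM_FILTER_PRIVATE      = BIT(1),
  };

  struct kvm_gfn_range {
          struct kvm_memory_slot *slot;
          gfn_t start;
          gfn_t end;
          /* Which GPA alias(es) to operate on; 0 is deliberately invalid. */
          enum kvm_gfn_range_filter attr_filter;
          bool may_block;
  };

  /* e.g. mmu_notifier invalidations only ever touch the shared alias: */
  range.attr_filter = KVM_FILTER_SHARED;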

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:28:55 -05:00
Rick Edgecombe
35be969d1e KVM: x86/mmu: Zap invalid roots with mmu_lock holding for write at uninit
Prepare for a future TDX patch which asserts that atomic zapping
(i.e. zapping with mmu_lock taken for read) doesn't operate on mirror roots.
When tearing down a VM, all roots have to be zapped (including mirror
roots once they're in place) so do that with the mmu_lock taken for write.

kvm_mmu_uninit_tdp_mmu() is invoked either before or after executing any
atomic operations on SPTEs by vCPU threads. Therefore, it will not impact
vCPU threads' performance if kvm_tdp_mmu_zap_invalidated_roots() acquires
mmu_lock for write to zap invalid roots.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:28:49 -05:00
Paolo Bonzini
c50be1c945 KVM: x86: Refactor __kvm_emulate_hypercall() into a macro
Rework __kvm_emulate_hypercall() into a macro so that completion of
hypercalls that don't exit to userspace use direct function calls to the
completion helper, i.e. don't trigger a retpoline when RETPOLINE=y.

Opportunistically take the names of the input registers, as opposed to
taking the input values, to preemptively dedup more of the calling code
(TDX needs to use different registers).  Use the direct GPR accessors to
read values to avoid the pointless marking of the registers as available
(KVM requires GPRs to always be available).
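
Roughly, the macro shape this enables looks like the following (a simplified
sketch; the real macro carries additional arguments, e.g. for 64-bit
detection, and the inner helper name is illustrative):

  #define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, cpl, complete_hypercall) \
  ({                                                                      \
          int __ret = ____kvm_emulate_hypercall(_vcpu,                    \
                          kvm_##nr##_read(_vcpu), kvm_##a0##_read(_vcpu), \
                          kvm_##a1##_read(_vcpu), kvm_##a2##_read(_vcpu), \
                          kvm_##a3##_read(_vcpu), cpl);                   \
          if (__ret > 0)                                                  \
                  __ret = complete_hypercall(_vcpu);                      \
          __ret;                                                          \
  })

  /* Common x86 code passes the legacy registers; TDX can pass r12-r15: */
  ret = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi, cpl,
                                complete_hypercall_exit);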

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20241128004344.4072099-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Sean Christopherson
d9eb86a6f4 KVM: x86: Always complete hypercall via function callback
Finish "emulation" of KVM hypercalls by function callback, even when the
hypercall is handled entirely within KVM, i.e. doesn't require an exit to
userspace, and refactor __kvm_emulate_hypercall()'s return value to *only*
communicate whether or not KVM should exit to userspace or resume the
guest.

(Ab)Use vcpu->run->hypercall.ret to propagate the return value to the
callback, purely to avoid having to add a trampoline for every completion
callback.

Using the function return value for KVM's control flow eliminates the
multiplexed return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only
that hypercall) means "exit to userspace".

Note, the unnecessary extra indirect call and thus potential retpoline
will be eliminated in the near future by converting the intermediate layer
to a macro.

Suggested-by: Binbin Wu <binbin.wu@linux.intel.com>
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20241128004344.4072099-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Sean Christopherson
05a518b49d KVM: x86: Bump hypercall stat prior to fully completing hypercall
Increment the "hypercalls" stat for KVM hypercalls as soon as KVM knows
it will skip the guest instruction, i.e. once KVM is committed to emulating
the hypercall.  Waiting until completion adds no known value, and creates a
discrepancy where the stat will be bumped if KVM exits to userspace as a
result of trying to skip the instruction, but not if the hypercall itself
exits.

Handling the stat in common code will also avoid the need for another
helper to dedup code when TDX comes along (TDX needs a separate completion
path due to GPR usage differences).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241128004344.4072099-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Sean Christopherson
13b64ce1b6 KVM: x86: Move "emulate hypercall" function declarations to x86.h
Move the declarations for the hypercall emulation APIs to x86.h.  While the
helpers are exported, they are intended to be consumed only by KVM vendor
modules, i.e. don't need to be exposed to the kernel at-large.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241128004344.4072099-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Binbin Wu
c4c083d951 KVM: x86: Add a helper to check for user interception of KVM hypercalls
Add and use user_exit_on_hypercall() to check if userspace wants to handle
a KVM hypercall instead of open-coding the logic everywhere.
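
A sketch of the helper, assuming the existing per-VM bitmap of hypercalls
that userspace has enabled exits for:

  static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr)
  {
          return kvm->arch.hypercall_exit_enabled & BIT(hc_nr);
  }

  /* e.g. in kvm_emulate_hypercall(), instead of open-coding the bit test: */
  if (user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE))
          /* ...set up KVM_EXIT_HYPERCALL and return to userspace... */;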

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
[sean: squash into one patch, keep explicit KVM_HC_MAP_GPA_RANGE check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-ID: <20241128004344.4072099-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Paolo Bonzini
9a1dfeff44 KVM: x86: clear vcpu->run->hypercall.ret before exiting for KVM_EXIT_HYPERCALL
QEMU up to 9.2.0 is assuming that vcpu->run->hypercall.ret is 0 on exit and
it never modifies it when processing KVM_EXIT_HYPERCALL.  Make this explicit
in the code, to avoid breakage when KVM starts modifying that field.

This in principle is not a good idea... It would have been much better if
KVM had set the field to -KVM_ENOSYS from the beginning, so that a dumb
userspace that does nothing on KVM_EXIT_HYPERCALL would tell the guest it
does not support KVM_HC_MAP_GPA_RANGE.  However, breaking userspace is
a Very Bad Thing, as everybody should know.
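
In practice this amounts to explicitly zeroing the field when the exit is
set up, e.g. (sketch):

  vcpu->run->exit_reason   = KVM_EXIT_HYPERCALL;
  vcpu->run->hypercall.nr  = KVM_HC_MAP_GPA_RANGE;
  /* Old userspace never touches this; don't leak stale values into RAX. */
  vcpu->run->hypercall.ret = 0;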

Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Paolo Bonzini
10b2c8a67c KVM x86 fixes for 6.13:
- Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
    APIC page, as such hosts will hit unexpected RMP #PFs in the host when
    running VMs of any flavor.
 
  - Fix a WARN in the hypercall completion path due to KVM trying to determine
    if a guest with protected register state is in 64-bit mode (KVM's ABI is to
    assume such guests only make hypercalls in 64-bit mode).
 
  - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
    regression with Windows guests, and because KVM's read-only behavior appears
    to be entirely made up.
 
  - Treat TDP MMU faults as spurious if the faulting access is allowed given the
    existing SPTE.  This fixes a benign WARN (other than the WARN itself) due to
    unexpectedly replacing a writable SPTE with a read-only SPTE.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmdkzW0ACgkQOlYIJqCj
 N/3VXhAAjfP7w4TdJGX0yOPUmQWXATq6DC+zKW1K3dhfu9QwVXD6TLykdnQddiUg
 k0yOgIpwzIEj1U6l1DamjQj2QdoBqeTjTg8jBujiQTsQH/Gc7FvbcFy70ZKZS/js
 IQ+g+1av8XXnLVb5xtVMRWwEtca2BeygSVjRA8/QtIWog0rProfS35/wd3uq04Hm
 Z4J6hf8N2DSELUrq0uIHvTFr8/a03oQnaNNrPyRmbvBb8m20no3A/ws6tAB4YfyN
 HxQao+CJrSQ+4dinZSGa2s6TLBIbbH5QRtwgeSpbLdeKnRYyrWtThIrKB2uDKKYn
 1fsn9mdIi3oAzl3mX2JLQNlx18tCOSrnmYIJCzBAQDaeUOlMNYge4CYeKOqYMJyo
 vayNFz9VZwDiUcBLMWy81EyVcKlqQE8U2icgKqEN8jVdmKW36bOcAfFiK/97iPrc
 7TkS4azHozoTAHdggWqReraDnjRz4PysjIA/yaMh2G/p1ZexL6OX+GwgF1YrD4w3
 zVL8UmXkjiPi+VS6ayUiYDYB5R1kJ13Zq3UpYOJ2MG6Zg2we+UXynQkehS/BHwkp
 2aiflhq55cVv6oZoNJeOuokyvXRAuXf8t0Got/rqGzjKGGNLPRCxD/7K1PzN1YsZ
 NH5zmS2GsqA89L+Qc2l0Ohuz4kX1S2BEB+1pnxFkQc+CdjmIG9A=
 =S9jc
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.13:

 - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
   APIC page, as such hosts will hit unexpected RMP #PFs in the host when
   running VMs of any flavor.

 - Fix a WARN in the hypercall completion path due to KVM trying to determine
   if a guest with protected register state is in 64-bit mode (KVM's ABI is to
   assume such guests only make hypercalls in 64-bit mode).

 - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
   regression with Windows guests, and because KVM's read-only behavior appears
   to be entirely made up.

 - Treat TDP MMU faults as spurious if the faulting access is allowed given the
   existing SPTE.  This fixes a benign WARN (other than the WARN itself) due to
   unexpectedly replacing a writable SPTE with a read-only SPTE.
2024-12-22 12:59:33 -05:00
Paolo Bonzini
8afa5b10af KVM x86 fixes for 6.13:
- Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
    APIC page, as such hosts will hit unexpected RMP #PFs in the host when
    running VMs of any flavor.
 
  - Fix a WARN in the hypercall completion path due to KVM trying to determine
    if a guest with protected register state is in 64-bit mode (KVM's ABI is to
    assume such guests only make hypercalls in 64-bit mode).
 
  - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
    regression with Windows guests, and because KVM's read-only behavior appears
    to be entirely made up.
 
  - Treat TDP MMU faults as spurious if the faulting access is allowed given the
    existing SPTE.  This fixes a benign WARN (other than the WARN itself) due to
    unexpectedly replacing a writable SPTE with a read-only SPTE.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmdkzW0ACgkQOlYIJqCj
 N/3VXhAAjfP7w4TdJGX0yOPUmQWXATq6DC+zKW1K3dhfu9QwVXD6TLykdnQddiUg
 k0yOgIpwzIEj1U6l1DamjQj2QdoBqeTjTg8jBujiQTsQH/Gc7FvbcFy70ZKZS/js
 IQ+g+1av8XXnLVb5xtVMRWwEtca2BeygSVjRA8/QtIWog0rProfS35/wd3uq04Hm
 Z4J6hf8N2DSELUrq0uIHvTFr8/a03oQnaNNrPyRmbvBb8m20no3A/ws6tAB4YfyN
 HxQao+CJrSQ+4dinZSGa2s6TLBIbbH5QRtwgeSpbLdeKnRYyrWtThIrKB2uDKKYn
 1fsn9mdIi3oAzl3mX2JLQNlx18tCOSrnmYIJCzBAQDaeUOlMNYge4CYeKOqYMJyo
 vayNFz9VZwDiUcBLMWy81EyVcKlqQE8U2icgKqEN8jVdmKW36bOcAfFiK/97iPrc
 7TkS4azHozoTAHdggWqReraDnjRz4PysjIA/yaMh2G/p1ZexL6OX+GwgF1YrD4w3
 zVL8UmXkjiPi+VS6ayUiYDYB5R1kJ13Zq3UpYOJ2MG6Zg2we+UXynQkehS/BHwkp
 2aiflhq55cVv6oZoNJeOuokyvXRAuXf8t0Got/rqGzjKGGNLPRCxD/7K1PzN1YsZ
 NH5zmS2GsqA89L+Qc2l0Ohuz4kX1S2BEB+1pnxFkQc+CdjmIG9A=
 =S9jc
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.13:

 - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual
   APIC page, as such hosts will hit unexpected RMP #PFs in the host when
   running VMs of any flavor.

 - Fix a WARN in the hypercall completion path due to KVM trying to determine
   if a guest with protected register state is in 64-bit mode (KVM's ABI is to
   assume such guests only make hypercalls in 64-bit mode).

 - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a
   regression with Windows guests, and because KVM's read-only behavior appears
   to be entirely made up.

 - Treat TDP MMU faults as spurious if the faulting access is allowed given the
   existing SPTE.  This fixes a benign WARN (other than the WARN itself) due to
   unexpectedly replacing a writable SPTE with a read-only SPTE.
2024-12-22 12:07:16 -05:00
Paolo Bonzini
398b7b6cb9 KVM: x86: let it be known that ignore_msrs is a bad idea
When running KVM with ignore_msrs=1 and report_ignored_msrs=0, the user has
no clue that the guest is being lied to.  This may cause bug reports
such as https://gitlab.com/qemu-project/qemu/-/issues/2571, where enabling
a CPUID bit in QEMU caused Linux guests to try reading MSR_CU_DEF_ERR; and
being lied about the existence of MSR_CU_DEF_ERR caused the guest to assume
other things about the local APIC which were not true:

  Sep 14 12:02:53 kernel: mce: [Firmware Bug]: Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.
  Sep 14 12:02:53 kernel: unchecked MSR access error: RDMSR from 0x852 at rIP: 0xffffffffb548ffa7 (native_read_msr+0x7/0x40)
  Sep 14 12:02:53 kernel: Call Trace:
  ...
  Sep 14 12:02:53 kernel:  native_apic_msr_read+0x20/0x30
  Sep 14 12:02:53 kernel:  setup_APIC_eilvt+0x47/0x110
  Sep 14 12:02:53 kernel:  mce_amd_feature_init+0x485/0x4e0
  ...
  Sep 14 12:02:53 kernel: [Firmware Bug]: cpu 0, try to use APIC520 (LVT offset 2) for vector 0xf4, but the register is already in use for vector 0x0 on this cpu

Without report_ignored_msrs=0 at least the host kernel log will contain
enough information to avoid going on a wild goose chase.  But if reports
about individual MSR accesses are being silenced too, at least complain
loudly the first time a VM is started.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 12:06:01 -05:00
Wolfram Sang
37d1d99b88 KVM: VMX: don't include '<linux/find.h>' directly
The header clearly states that it does not want to be included directly,
only via '<linux/bitmap.h>'. Replace the include accordingly.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Message-ID: <20241217070539.2433-2-wsa+renesas@sang-engineering.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 12:04:57 -05:00
Sean Christopherson
386d69f9f2 KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed
Treat slow-path TDP MMU faults as spurious if the access is allowed given
the existing SPTE to fix a benign warning (other than the WARN itself)
due to replacing a writable SPTE with a read-only SPTE, and to avoid the
unnecessary LOCK CMPXCHG and subsequent TLB flush.

If a read fault races with a write fault, fast GUP fails for any reason
when trying to "promote" the read fault to a writable mapping, and KVM
resolves the write fault first, then KVM will end up trying to install a
read-only SPTE (for a !map_writable fault) overtop a writable SPTE.

Note, it's not entirely clear why fast GUP fails, or if that's even how
KVM ends up with a !map_writable fault with a writable SPTE.  If something
else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
faults as spurious in this scenario could effectively mask the underlying
problem.

However, retrying the faulting access instead of overwriting an existing
SPTE is functionally correct and desirable irrespective of the WARN, and
fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
bit in primary MMU's PTE is toggled and causes a PTE value mismatch.  The
WARN was also recently added, specifically to track down scenarios where
KVM unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
doesn't regress KVM's bug-finding capabilities in any way.  In short,
letting the WARN linger because there's a tiny chance it's due to a bug
elsewhere would be excessively paranoid.
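
The gist of the fix, simplified from the actual placement and guards in the
TDP MMU map path:

  if (is_shadow_present_pte(iter->old_spte) &&
      is_access_allowed(fault, iter->old_spte))
          return RET_PF_SPURIOUS;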

Fixes: 1a175082b1 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable")
Reported-by: Lei Yang <leiyang@redhat.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://lore.kernel.org/r/20241218213611.3181643-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 17:47:52 -08:00
Sean Christopherson
4d5163cba4 KVM: SVM: Allow guest writes to set MSR_AMD64_DE_CFG bits
Drop KVM's arbitrary behavior of making DE_CFG.LFENCE_SERIALIZE read-only
for the guest, as rejecting writes can lead to guest crashes, e.g. Windows
in particular doesn't gracefully handle unexpected #GPs on the WRMSR, and
nothing in the AMD manuals suggests that LFENCE_SERIALIZE is read-only _if
it exists_.

KVM only allows LFENCE_SERIALIZE to be set, by the guest or host, if the
underlying CPU has X86_FEATURE_LFENCE_RDTSC, i.e. if LFENCE is guaranteed
to be serializing.  So if the guest sets LFENCE_SERIALIZE, KVM will provide
the desired/correct behavior without any additional action (the guest's
value is never stuffed into hardware).  And having LFENCE be serializing
even when it's not _required_ to be is a-ok from a functional perspective.

Fixes: 74a0e79df6 ("KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value")
Fixes: d1d93fa90f ("KVM: SVM: Add MSR-based feature support for serializing LFENCE")
Reported-by: Simon Pilkington <simonp.git@mailbox.org>
Closes: https://lore.kernel.org/all/52914da7-a97b-45ad-86a0-affdf8266c61@mailbox.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241211172952.1477605-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 17:47:52 -08:00
Sean Christopherson
9b42d1e8e4 KVM: x86: Play nice with protected guests in complete_hypercall_exit()
Use is_64_bit_hypercall() instead of is_64_bit_mode() to detect a 64-bit
hypercall when completing said hypercall.  For guests with protected state,
e.g. SEV-ES and SEV-SNP, KVM must assume the hypercall was made in 64-bit
mode as the vCPU state needed to detect 64-bit mode is unavailable.
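
A sketch of the resulting completion helper (abbreviated):

  static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
  {
          u64 ret = vcpu->run->hypercall.ret;

          if (!is_64_bit_hypercall(vcpu))   /* was: is_64_bit_mode(vcpu) */
                  ret = (u32)ret;
          kvm_rax_write(vcpu, ret);

          return kvm_skip_emulated_instruction(vcpu);
  }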

Hacking the sev_smoke_test selftest to generate a KVM_HC_MAP_GPA_RANGE
hypercall via VMGEXIT trips the WARN:

  ------------[ cut here ]------------
  WARNING: CPU: 273 PID: 326626 at arch/x86/kvm/x86.h:180 complete_hypercall_exit+0x44/0xe0 [kvm]
  Modules linked in: kvm_amd kvm ... [last unloaded: kvm]
  CPU: 273 UID: 0 PID: 326626 Comm: sev_smoke_test Not tainted 6.12.0-smp--392e932fa0f3-feat #470
  Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
  RIP: 0010:complete_hypercall_exit+0x44/0xe0 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_vcpu_ioctl_run+0x2400/0x2720 [kvm]
   kvm_vcpu_ioctl+0x54f/0x630 [kvm]
   __se_sys_ioctl+0x6b/0xc0
   do_syscall_64+0x83/0x160
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: b5aead0064 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state")
Cc: stable@vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20241128004344.4072099-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 17:47:51 -08:00
Suravee Suthikulpanit
d81cadbe16 KVM: SVM: Disable AVIC on SNP-enabled system without HvInUseWrAllowed feature
On SNP-enabled systems, VMRUN marks the AVIC Backing Page as in-use while
the guest is running, for both secure and non-secure guests. Any hypervisor
write to the in-use vCPU's AVIC backing page (e.g. to inject an interrupt)
will generate an unexpected #PF in the host.

Currently, an attempt to run an AVIC guest results in the following error:

    BUG: unable to handle page fault for address: ff3a442e549cc270
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x80000003) - RMP violation
    PGD b6ee01067 P4D b6ee02067 PUD 10096d063 PMD 11c540063 PTE 80000001149cc163
    SEV-SNP: PFN 0x1149cc unassigned, dumping non-zero entries in 2M PFN region: [0x114800 - 0x114a00]
    ...

Newer AMD systems are enhanced to allow the hypervisor to modify the backing
page for non-secure guests on SNP-enabled systems. This enhancement is available
when the CPUID Fn8000_001F_EAX bit 30 is set (HvInUseWrAllowed).

This table describes AVIC support matrix w.r.t. SNP enablement:

               | Non-SNP system |     SNP system
-----------------------------------------------------
 Non-SNP guest |  AVIC Activate | AVIC Activate iff
               |                | HvInUseWrAllowed=1
-----------------------------------------------------
     SNP guest |      N/A       |    Secure AVIC

Therefore, check and disable AVIC in the kvm_amd driver when the feature is not
available on SNP-enabled systems.
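
Conceptually, the setup-time gate looks like this (a sketch; the feature-flag
spelling and message wording are illustrative):

  /* in avic_hardware_setup() */
  if (cc_platform_has(CC_ATTR_HOST_SEV_SNP) &&
      !boot_cpu_has(X86_FEATURE_HV_INUSE_WR_ALLOWED)) {
          pr_warn("AVIC disabled: SNP enabled without HvInUseWrAllowed\n");
          return false;
  }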

See the AMD64 Architecture Programmer’s Manual (APM) Volume 2 for detail.
(https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/
programmer-references/40332.pdf)

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241104075845.7583-1-suravee.suthikulpanit@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 17:47:51 -08:00
Chao Gao
ca0245d131 KVM: x86: Remove hwapic_irr_update() from kvm_x86_ops
Remove the redundant .hwapic_irr_update() ops.

If a vCPU has APICv enabled, KVM updates its RVI before VM-enter to L1
in vmx_sync_pir_to_irr(). This guarantees RVI is up-to-date and aligned
with the vIRR in the virtual APIC. So, no need to update RVI every time
the vIRR changes.

Note that KVM never updates vmcs02 RVI in .hwapic_irr_update() or
vmx_sync_pir_to_irr(). So, removing .hwapic_irr_update() has no
impact on the nested case.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241111085947.432645-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:15 -08:00
Sean Christopherson
3d91521e57 KVM: nVMX: Honor event priority when emulating PI delivery during VM-Enter
Move the handling of a nested posted interrupt notification that is
unblocked by nested VM-Enter (unblocks L1 IRQs when ack-on-exit is enabled
by L1) from VM-Enter emulation to vmx_check_nested_events().  To avoid a
pointless forced immediate exit, i.e. to not regress IRQ delivery latency
when a nested posted interrupt is pending at VM-Enter, block processing of
the notification IRQ if and only if KVM must block _all_ events.  Unlike
injected events, KVM doesn't need to actually enter L2 before updating the
vIRR and vmcs02.GUEST_INTR_STATUS, as the resulting L2 IRQ will be blocked
by hardware itself, until VM-Enter to L2 completes.

Note, very strictly speaking, moving the IRQ from L2's PIR to IRR before
entering L2 is still technically wrong.  But, practically speaking, only
an L1 hypervisor or an L0 userspace that is deliberately checking event
priority against PIR=>IRR processing can even notice; L2 will see
architecturally correct behavior, as KVM ensures the VM-Enter is finished
before doing anything that would effectively preempt the PIR=>IRR movement.

Reported-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241101191447.1807602-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:14 -08:00
Sean Christopherson
c829d2c356 KVM: nVMX: Use vmcs01's controls shadow to check for IRQ/NMI windows at VM-Enter
Use vmcs01's execution controls shadow to check for IRQ/NMI windows after
a successful nested VM-Enter, instead of snapshotting the information prior
to emulating VM-Enter.  It's quite difficult to see that the entire reason
controls are snapshotted prior to nested VM-Enter is to read them from vmcs01
(vmcs02 is loaded if nested VM-Enter is successful).

That could be solved with a comment, but explicitly using vmcs01's shadow
makes the code self-documenting to a certain extent.

No functional change intended (vmcs01's execution controls must not be
modified during emulation of nested VM-Enter).

Link: https://lore.kernel.org/r/20241101191447.1807602-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:13 -08:00
Sean Christopherson
b2868b55cf KVM: nVMX: Drop manual vmcs01.GUEST_INTERRUPT_STATUS.RVI check at VM-Enter
Drop the manual check for a pending IRQ in vmcs01's RVI field during
nested VM-Enter, as the recently added call to kvm_apic_has_interrupt()
when checking for pending events after successful VM-Enter is a superset
of the RVI check (IRQs that are pending in RVI are also pending in L1's
IRR).

Link: https://lore.kernel.org/r/20241101191447.1807602-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:12 -08:00
Sean Christopherson
4f09ebd0c8 KVM: nVMX: Check for pending INIT/SIPI after entering non-root mode
Explicitly check for a pending INIT or SIPI after entering non-root mode
during nested VM-Enter emulation, as no VMCS information is queried as part
of the check, i.e. there is no need to check for INIT/SIPI while vmcs01 is
still loaded.

Link: https://lore.kernel.org/r/20241101191447.1807602-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:11 -08:00
Sean Christopherson
cda3960fbc KVM: nVMX: Explicitly update vPPR on successful nested VM-Enter
Always request pending event evaluation after successful nested VM-Enter
if L1 has a pending IRQ, as KVM will effectively do so anyways when APICv
is enabled, by way of vmx_has_apicv_interrupt().  This will allow dropping
the aforementioned APICv check, and will also allow handling nested Posted
Interrupt processing entirely within vmx_check_nested_events(), which is
necessary to honor priority between concurrent events.

Note, checking for pending IRQs has a subtle side effect, as it results in
a PPR update for L1's vAPIC (PPR virtualization does happen at VM-Enter,
but for nested VM-Enter that affects L2's vAPIC, not L1's vAPIC).  However,
KVM updates PPR _constantly_, even when PPR technically shouldn't be
refreshed, e.g. kvm_vcpu_has_events() re-evaluates PPR if IRQs are
unblocked, by way of the same kvm_apic_has_interrupt() check.  Ditto for
nested VM-Enter itself, when nested posted interrupts are enabled.  Thus,
trying to avoid a PPR update on VM-Enter just to be pedantically accurate
is ridiculous, given the behavior elsewhere in KVM.

Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com
Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@gmail.com
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241101191447.1807602-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:11 -08:00
Adrian Hunter
fdd2db5126 KVM: VMX: Allow toggling bits in MSR_IA32_RTIT_CTL when enable bit is cleared
Allow toggling other bits in MSR_IA32_RTIT_CTL if the enable bit is being
cleared; the existing logic simply ignores the enable bit.  E.g. KVM will
incorrectly reject a write of '0' to stop tracing.

Fixes: bf8c55d8dc ("KVM: x86: Implement Intel PT MSRs read/write emulation")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
[sean: rework changelog, drop stable@]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241101185031.1799556-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:10 -08:00
Chao Gao
04bc93cf49 KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VID
If KVM emulates an EOI for L1's virtual APIC while L2 is active, defer
updating GUEST_INTERRUPT_STATUS.SVI, i.e. the VMCS's cache of the highest
in-service IRQ, until L1 is active, as vmcs01, not vmcs02, needs to track
vISR.  The missed SVI update for vmcs01 can result in L1 interrupts being
incorrectly blocked, e.g. if there is a pending interrupt with lower
priority than the interrupt that was EOI'd.

This bug only affects use cases where L1's vAPIC is effectively passed
through to L2, e.g. in a pKVM scenario where L2 is L1's deprivileged host,
as KVM will only emulate an EOI for L1's vAPIC if Virtual Interrupt
Delivery (VID) is disabled in vmcs12, and L1 isn't intercepting L2 accesses
to its (virtual) APIC page (or if x2APIC is enabled, the EOI MSR).

WARN() if KVM updates L1's ISR while L2 is active with VID enabled, as an
EOI from L2 is supposed to affect L2's vAPIC, but still defer the update,
to try to keep L1 alive.  Specifically, KVM forwards all APICv-related
VM-Exits to L1 via nested_vmx_l1_wants_exit():

	case EXIT_REASON_APIC_ACCESS:
	case EXIT_REASON_APIC_WRITE:
	case EXIT_REASON_EOI_INDUCED:
		/*
		 * The controls for "virtualize APIC accesses," "APIC-
		 * register virtualization," and "virtual-interrupt
		 * delivery" only come from vmcs12.
		 */
		return true;

Fixes: c7c9c56ca2 ("x86, apicv: add virtual interrupt delivery support")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com
Reported-by: Markku Ahvenjärvi <mankku@gmail.com>
Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@gmail.com
Cc: Janne Karhunen <janne.karhunen@gmail.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop request, handle in VMX, write changelog]
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:33:58 -08:00
Maxim Levitsky
0e77b32411 KVM: x86: Add information about pending requests to kvm_exit tracepoint
Print pending requests in the kvm_exit tracepoint, which allows userspace
to gather information on how often KVM interrupts vCPUs due to specific
requests.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240910200350.264245-3-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:49 -08:00
Maxim Levitsky
3e633e7e7d KVM: x86: Add interrupt injection information to the kvm_entry tracepoint
Add VMX/SVM specific interrupt injection info the kvm_entry tracepoint.
As is done with kvm_exit, gather the information via a kvm_x86_ops hook
to avoid the moderately costly VMREADs on VMX when the tracepoint isn't
enabled.

Opportunistically rename the parameters in the get_exit_info()
declaration to match the names used by both SVM and VMX.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240910200350.264245-2-mlevitsk@redhat.com
[sean: drop is_guest_mode() change, use intr_info/error_code for names]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:48 -08:00
Ivan Orlov
7bd7ff9911 KVM: SVM: Handle event vectoring error in check_emulate_instruction()
Detect unhandleable vectoring in check_emulate_instruction() to prevent
infinite retry loops on SVM, and to eliminate the main differences in how
VM-Exits during event vectoring are handled on SVM versus VMX.  E.g. if
the vCPU puts its IDT in emulated MMIO memory and generates an event,
without the check_emulate_instruction() change, SVM will re-inject the
event and resume the guest, and effectively put the vCPU into an infinite
loop.

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-6-iorlov@amazon.com
[sean: grab "svm" locally, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:45 -08:00
Ivan Orlov
47ef3ef843 KVM: VMX: Handle event vectoring error in check_emulate_instruction()
Move handling of emulation during event vectoring, which KVM doesn't
support, into VMX's check_emulate_instruction(), so that KVM detects
all unsupported emulation, not just cached emulated MMIO (EPT misconfig).
E.g. on emulated MMIO that isn't cached (EPT Violation) or occurs with
legacy shadow paging (#PF).

Rejecting emulation on other sources of emulation also fixes a largely
theoretical flaw (thanks to the "unprotect and retry" logic), where KVM
could incorrectly inject a #DF:

  1. CPU executes an instruction and hits a #GP
  2. While vectoring the #GP, a shadow #PF occurs
  3. On the #PF VM-Exit, KVM re-injects #GP
  4. KVM emulates because of the write-protected page
  5. KVM "successfully" emulates and also detects the #GP
  6. KVM synthesizes a #GP, and since #GP has already been injected,
     incorrectly escalates to a #DF.

Fix the comment about EMULTYPE_PF as this flag doesn't necessarily
mean MMIO anymore: it can also be set due to the write protection
violation.

Note, handle_ept_misconfig() checks vmx_check_emulate_instruction() before
attempting emulation of any kind.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-5-iorlov@amazon.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:44 -08:00
Ivan Orlov
704fc6021b KVM: x86: Try to unprotect and retry on unhandleable emulation failure
If emulation is "rejected" by check_emulate_instruction(), try to
unprotect and retry instruction execution before reporting the error to
userspace.  Currently, check_emulate_instruction() never signals failure
when "unprotect and retry" is possible, but that will change in the
future as both VMX and SVM will reject emulation due to coincident
exception vectoring.  E.g. if there is a write to a shadowed page table
when vectoring an event, then unprotecting the gfn and retrying the
instruction will allow the guest to make forward progress in most cases,
i.e. will allow the vCPU to keep running instead of returning an error to
userspace.

This ensures that the subsequent patches won't make KVM exit to
userspace when handling an intercepted #PF during vectoring without
checking whether unprotect and retry is possible.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-4-iorlov@amazon.com
[sean: massage changelog to clarify this is a nop for the current code]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:43 -08:00
Ivan Orlov
5c9cfc4866 KVM: x86: Add emulation status for unhandleable exception vectoring
Add emulation status for unhandleable vectoring, i.e. when KVM can't
emulate an instruction because emulation was triggered on an exit that
occurred while the CPU was vectoring an event.  Such a situation can
occur if the guest sets the IDT descriptor base to point to an MMIO region,
and then triggers an exception.

Exit to userspace with event delivery error when KVM can't emulate
an instruction when vectoring an event.

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-3-iorlov@amazon.com
[sean: massage changelog and X86EMUL_UNHANDLEABLE_VECTORING comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:42 -08:00
Ivan Orlov
11c98fa07a KVM: x86: Add function for vectoring error generation
Extract VMX code for unhandleable VM-Exit during vectoring into
vendor-agnostic function so that boiler-plate code can be shared by SVM.

To avoid unnecessary complexity in the helper, unconditionally report a
GPA to userspace instead of having a conditional entry.  For exits that
don't report a GPA, i.e. everything except EPT Misconfig, simply report
KVM's "invalid GPA".

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com
[sean: clarify that the INVALID_GPA logic is new]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:41 -08:00
Sean Christopherson
871ac338ef KVM: x86: Use only local variables (no bitmask) to init kvm_cpu_caps
Refactor the kvm_cpu_cap_init() macro magic to collect supported features
in a local variable instead of passing them to the macro as a "mask".  As
pointed out by Maxim, relying on macros to "return" a value and set local
variables is surprising, as the bitwise-OR logic suggests the macros are
pure, i.e. have no side effects.

Ideally, the feature initializers would have zero side effects, e.g. would
take local variables as params, but there isn't a sane way to do so
without either sacrificing the various compile-time assertions (basically
a non-starter), or passing at least one variable, e.g. a struct, to each
macro usage (adds a lot of noise and boilerplate code).

Opportunistically force callers to emit a trailing comma by intentionally
omitting a semicolon after invoking the feature initializers.  Forcing a
trailing comma isolates future changes to a single line, i.e. doesn't
cause churn for unrelated features/lines when adding/removing/modifying a
feature.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-58-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:34 -08:00
Sean Christopherson
ac9d1b7591 KVM: x86: Explicitly track feature flags that are enabled at runtime
Add one last (hopefully) CPUID feature macro, RUNTIME_F(), and use it
to track features that KVM supports, but that are only set at runtime
(in response to other state), and aren't advertised to userspace via
KVM_GET_SUPPORTED_CPUID.

Currently, RUNTIME_F() is mostly just documentation, but tracking all
KVM-supported features will allow for asserting, at build time,
that all features that are set, cleared, *or* checked by KVM are known to
kvm_set_cpu_caps().

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-57-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:22 -08:00
Sean Christopherson
0fea7aa2dc KVM: x86: Explicitly track feature flags that require vendor enabling
Add another CPUID feature macro, VENDOR_F(), and use it to track features
that KVM supports, but that need additional vendor support and so are
conditionally enabled in vendor code.

Currently, VENDOR_F() is mostly just documentation, but tracking all
KVM-supported features will allow for asserting, at build time,
that all features that are set, cleared, *or* checked by KVM are known to
kvm_set_cpu_caps().

To fudge around a macro collision on 32-bit kernels, #undef DS to be able
to get at X86_FEATURE_DS.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-56-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:21 -08:00
Sean Christopherson
9b2776c7cf KVM: x86: Rename "SF" macro to "SCATTERED_F"
Now that each feature flag is on its own line, i.e. brevity isn't a major
concern, drop the "SF" acronym and use the (almost) full name, SCATTERED_F.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-55-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:20 -08:00
Sean Christopherson
3fd55b5227 KVM: x86: Pull CPUID capabilities from boot_cpu_data only as needed
Don't memcpy() all of boot_cpu_data.x86_capability, and instead explicitly
fill each kvm_cpu_cap_init leaf during kvm_cpu_cap_init().  While clever,
copying all kernel capabilities risks over-reporting KVM capabilities,
e.g. if KVM added support in __do_cpuid_func(), but neglected to init the
supported set of capabilities.

Note, explicitly grabbing leafs deliberately keeps Linux-defined leafs as
0!  KVM should never advertise Linux-defined leafs; any relevant features
that are "real", but scattered, must be gathered in their correct hardware-
defined leaf.

Link: https://lore.kernel.org/r/20241128013424.4096668-54-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:19 -08:00
Sean Christopherson
75c489e12d KVM: x86: Add a macro for features that are synthesized into boot_cpu_data
Add yet another CPUID macro, this time for features that the host kernel
synthesizes into boot_cpu_data, i.e. that the kernel force sets even in
situations where the feature isn't reported by CPUID.  Thanks to the
macro shenanigans of kvm_cpu_cap_init(), such features can now be handled
in the core CPUID framework, i.e. don't need to be handled out-of-band and
thus without as many guardrails.

Adding a dedicated macro also helps document what's going on, e.g. the
calls to kvm_cpu_cap_check_and_set() are very confusing unless the reader
knows exactly how kvm_cpu_cap_init() generates kvm_cpu_caps (and even
then, it's far from obvious).

Link: https://lore.kernel.org/r/20241128013424.4096668-53-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:18 -08:00
Sean Christopherson
cbdeea032b KVM: x86: Drop superfluous host XSAVE check when adjusting guest XSAVES caps
Drop the manual boot_cpu_has() checks on XSAVE when adjusting the guest's
XSAVES capabilities now that guest cpu_caps incorporates KVM's support.
The guest's cpu_caps are initialized from kvm_cpu_caps, which are in turn
initialized from boot_cpu_data, i.e. checking guest_cpu_cap_has() also
checks host/KVM capabilities (which is the entire point of cpu_caps).

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-52-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:17 -08:00
Sean Christopherson
8f2a27752e KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps
Switch all queries (except XSAVES) of guest features from guest CPUID to
guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls
to guest_cpu_cap_has().

Keep guest_cpuid_has() around for XSAVES, but subsume its helper
guest_cpuid_get_register() and add a compile-time assertion to prevent
using guest_cpuid_has() for any other feature.  Add yet another comment
for XSAVE to explain why KVM is allowed to query its raw guest CPUID.

Opportunistically drop the unused guest_cpuid_clear(), as there should be
no circumstance in which KVM needs to _clear_ a guest CPUID feature now
that everything is tracked via cpu_caps.  E.g. KVM may need to _change_
a feature to emulate dynamic CPUID flags, but KVM should never need to
clear a feature in guest CPUID to prevent it from being used by the guest.

Delete the last remnants of the governed features framework, as the lone
holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for
governed vs. ungoverned features.

Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when
computing reserved CR4 bits is a nop when viewed as a whole, as KVM's
capabilities are already incorporated into the calculation, i.e. if a
feature is present in guest CPUID but unsupported by KVM, its CR4 bit
was already being marked as reserved, checking guest_cpu_cap_has() simply
double-stamps that it's a reserved bit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:15 -08:00
Sean Christopherson
820545bdfe KVM: x86: Shuffle code to prepare for dropping guest_cpuid_has()
Move the implementations of guest_has_{spec_ctrl,pred_cmd}_msr() down
below guest_cpu_cap_has() so that their use of guest_cpuid_has() can be
replaced with calls to guest_cpu_cap_has().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-50-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:14 -08:00
Sean Christopherson
75d4642fce KVM: x86: Update guest cpu_caps at runtime for dynamic CPUID-based features
When updating guest CPUID entries to emulate runtime behavior, e.g. when
the guest enables a CR4-based feature that is tied to a CPUID flag, also
update the vCPU's cpu_caps accordingly.  This will allow replacing all
usage of guest_cpuid_has() with guest_cpu_cap_has().

Note, this relies on kvm_set_cpuid() taking a snapshot of cpu_caps before
invoking kvm_update_cpuid_runtime(), i.e. when KVM is updating CPUID
entries that *may* become the vCPU's CPUID, so that unwinding to the old
cpu_caps is possible if userspace tries to set bogus CPUID information.

Note #2, none of the features in question use guest_cpu_cap_has() at this
time, i.e. aside from setting bits in cpu_caps, this is a glorified nop.
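
E.g. the runtime update for the CR4-driven bits roughly becomes the
following (a sketch; guest_cpu_cap_change() follows the series' change()
nomenclature and is assumed here):

  /* kvm_update_cpuid_runtime(): keep cpu_caps in sync with the CPUID bits */
  guest_cpu_cap_change(vcpu, X86_FEATURE_OSXSAVE,
                       kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE));
  guest_cpu_cap_change(vcpu, X86_FEATURE_OSPKE,
                       kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE));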

Cc: Yang Weijiang <weijiang.yang@intel.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-49-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:13 -08:00
Sean Christopherson
1f66590d7f KVM: x86: Update OS{XSAVE,PKE} bits in guest CPUID irrespective of host support
When making runtime CPUID updates, change OSXSAVE and OSPKE even if their
respective base features (XSAVE, PKU) are not supported by the host.  KVM
already incorporates host support in the vCPU's effective reserved CR4 bits.
I.e. OSXSAVE and OSPKE can be set if and only if the host supports them.

And conversely, since KVM's ABI is that KVM owns the dynamic OS feature
flags, clearing them when they obviously aren't supported and thus can't
be enabled is arguably a fix.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-48-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:12 -08:00
Sean Christopherson
cfd1574526 KVM: x86: Drop unnecessary check that cpuid_entry2_find() returns right leaf
Drop an unnecessary check that kvm_find_cpuid_entry_index(), i.e.
cpuid_entry2_find(), returns the correct leaf when getting CPUID.0x7.0x0
to update X86_FEATURE_OSPKE.  cpuid_entry2_find() never returns an entry
for the wrong function.  And not that it matters, but cpuid_entry2_find()
will always return a precise match for CPUID.0x7.0x0 since the index is
significant.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-47-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:11 -08:00
Sean Christopherson
963180ae06 KVM: x86: Avoid double CPUID lookup when updating MWAIT at runtime
Move the handling of X86_FEATURE_MWAIT during CPUID runtime updates to
utilize the lookup done for other CPUID.0x1 features.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-46-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:10 -08:00
Sean Christopherson
e592ec657d KVM: x86: Initialize guest cpu_caps based on KVM support
Constrain all guest cpu_caps based on KVM support instead of constraining
only the few features that KVM _currently_ needs to verify are actually
supported by KVM.  The intent of cpu_caps is to track what the guest is
actually capable of using, not the raw, unfiltered CPUID values that the
guest sees.

I.e. KVM should always consult its own support when making decisions
based on guest CPUID, and the only reason KVM has historically made the
checks opt-in was due to lack of centralized tracking.
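
I.e. the initialization boils down to masking every guest capability word by
KVM's own support (sketch):

  for (i = 0; i < NR_KVM_CPU_CAPS; i++)
          vcpu->arch.cpu_caps[i] &= kvm_cpu_caps[i];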

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:09 -08:00
Sean Christopherson
d4b9ff3d55 KVM: x86: Treat MONITOR/MWAIT as a "partially emulated" feature
Enumerate MWAIT in cpuid_func_emulated(), but only if the caller wants to
include "partially emulated" features, i.e. features that KVM kinda sorta
emulates, but with major caveats.  This will allow initializing the guest
cpu_caps based on the set of features that KVM virtualizes and/or emulates,
without needing to handle things like MONITOR/MWAIT as one-off exceptions.

Adding one-off handling for individual features is quite painful,
especially when considering future hardening.  It's very doable to verify,
at compile time, that every CPUID-based feature that KVM queries when
emulating guest behavior is actually known to KVM, e.g. to prevent KVM
bugs where KVM emulates some feature but fails to advertise support to
userspace.  In other words, any features that are special cased, i.e. not
handled generically in the CPUID framework, would also need to be special
cased for any hardening efforts that build on said framework.

Link: https://lore.kernel.org/r/20241128013424.4096668-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:08 -08:00
Sean Christopherson
ff402f56e8 KVM: x86: Extract code for generating per-entry emulated CPUID information
Extract the meat of __do_cpuid_func_emulated() into a separate helper,
cpuid_func_emulated(), so that cpuid_func_emulated() can be used with a
single CPUID entry.  This will allow marking emulated features as fully
supported in the guest cpu_caps without needing to hardcode the set of
emulated features in multiple locations.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:07 -08:00
Sean Christopherson
a7a308f863 KVM: x86: Initialize guest cpu_caps based on guest CPUID
Initialize a vCPU's capabilities based on the guest CPUID provided by
userspace instead of simply zeroing the entire array.  This is the first
step toward using cpu_caps to query *all* CPUID-based guest capabilities,
i.e. will allow converting all usage of guest_cpuid_has() to
guest_cpu_cap_has().

Zeroing the array was the logical choice when using cpu_caps was opt-in,
e.g. "unsupported" was generally a safer default, and the whole point of
governed features is that KVM would need to check host and guest support,
i.e. making everything unsupported by default didn't require more code.

But requiring KVM to manually "enable" every CPUID-based feature in
cpu_caps would require an absurd amount of boilerplate code.

Follow existing CPUID/kvm_cpu_caps nomenclature where possible, e.g. for
the change() and clear() APIs.  Replace check_and_set() with constrain()
to try and capture that KVM is constraining userspace's desired guest
feature set based on KVM's capabilities.

This is intended to be a gigantic nop, i.e. it should not have any impact on
guest or KVM functionality.

This is also an intermediate step; a future commit will also incorporate
KVM support into the vCPU's cpu_caps before converting guest_cpuid_has()
to guest_cpu_cap_has().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:06 -08:00
Sean Christopherson
7ea34578ae KVM: x86: Replace guts of "governed" features with comprehensive cpu_caps
Replace the internals of the governed features framework with a more
comprehensive "guest CPU capabilities" implementation, i.e. with a guest
version of kvm_cpu_caps.  Keep the skeleton of governed features around
for now as vmx_adjust_sec_exec_control() relies on detecting governed
features to do the right thing for XSAVES, and switching all guest feature
queries to guest_cpu_cap_has() requires subtle and non-trivial changes,
i.e. is best done as a standalone change.

Tracking *all* guest capabilities that KVM cares about will allow excising the
poorly named "governed features" framework, and effectively optimizes all
KVM queries of guest capabilities, i.e. doesn't require making a
subjective decision as to whether or not a feature is worth "governing",
and doesn't require adding the code to do so.

The cost of tracking all features is currently 92 bytes per vCPU on 64-bit
kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features.
That cost is well worth paying even if the only benefit was eliminating
the "governed features" terminology.  And practically speaking, the real
cost is zero unless those 92 bytes push the size of vcpu_vmx or vcpu_svm
into a new order-N allocation, and if that happens there are better ways
to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR
state separate allocations.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:05 -08:00
Sean Christopherson
2c5e168e5c KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().

The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.
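
An illustrative call-site conversion (the XSAVES check is just an
example, not necessarily a line from the patch):

  -       if (guest_can_use(vcpu, X86_FEATURE_XSAVES))
  +       if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES))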

Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:03 -08:00
Sean Christopherson
9aa470f5dd KVM: x86: Advertise HYPERVISOR in KVM_GET_SUPPORTED_CPUID
Unconditionally advertise "support" for the HYPERVISOR feature in CPUID,
as the flag simply communicates to the guest that it's running under a
hypervisor.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:02 -08:00
Sean Christopherson
9be4ec35d6 KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID
Unconditionally advertise TSC_DEADLINE_TIMER via KVM_GET_SUPPORTED_CPUID,
as KVM always emulates deadline mode, *if* the VM has an in-kernel local
APIC.  The odds of a VMM emulating the local APIC in userspace, not
emulating the TSC deadline timer, _and_ reflecting
KVM_GET_SUPPORTED_CPUID back into KVM_SET_CPUID2, i.e. the risk of
over-advertising and breaking any setups, is extremely low.

KVM has _unconditionally_ advertised X2APIC via CPUID since commit
0d1de2d901 ("KVM: Always report x2apic as supported feature"), and it
is completely impossible for userspace to emulate X2APIC as KVM doesn't
support forwarding the MSR accesses to userspace.  I.e. KVM has relied on
userspace VMMs to not misreport local APIC capabilities for nearly 13
years.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:01 -08:00
Sean Christopherson
136d605b43 KVM: x86: Remove all direct usage of cpuid_entry2_find()
Convert all use of cpuid_entry2_find() to kvm_find_cpuid_entry{,index}()
now that cpuid_entry2_find() operates on the vCPU state, i.e. now that
there is no need to use cpuid_entry2_find() directly in order to pass in
non-vCPU state.

To help prevent unwanted usage of cpuid_entry2_find(), #undef
KVM_CPUID_INDEX_NOT_SIGNIFICANT, i.e. force KVM to use
kvm_find_cpuid_entry().
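
Illustrative call sites (not from the patch), showing the
index-significant and index-insignificant flavors that remain:

  entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 0);   /* index matters  */
  entry = kvm_find_cpuid_entry(vcpu, 0x80000008);     /* index ignored  */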

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:00 -08:00
Sean Christopherson
8b30cb367c KVM: x86: Move kvm_find_cpuid_entry{,_index}() up near cpuid_entry2_find()
Move kvm_find_cpuid_entry{,_index}() "up" in cpuid.c so that they are
colocated with cpuid_entry2_find(), e.g. to make it easier to see the
effective guts of the helpers without having to bounce around cpuid.c.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:59 -08:00
Sean Christopherson
285185f8e4 KVM: x86: Always operate on kvm_vcpu data in cpuid_entry2_find()
Now that KVM sets vcpu->arch.cpuid_{entries,nent} before processing the
incoming CPUID entries during KVM_SET_CPUID{,2}, drop the @entries and
@nent params from cpuid_entry2_find() and unconditionally operate on the
vCPU state.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-35-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:58 -08:00
Sean Christopherson
a5b3271808 KVM: x86: Remove unnecessary caching of KVM's PV CPUID base
Now that KVM only searches for KVM's PV CPUID base when userspace sets
guest CPUID, drop the cache and simply do the search every time.

Practically speaking, this is a nop except for situations where userspace
sets CPUID _after_ running the vCPU, which is anything but a hot path,
e.g. QEMU does so only when hotplugging a vCPU.  And on the flip side,
caching guest CPUID information, especially information that is used to
query/modify _other_ CPUID state, is inherently dangerous as it's all too
easy to use stale information, i.e. KVM should only cache CPUID state when
the performance and/or programming benefits justify it.

Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:56 -08:00
Sean Christopherson
63d8c702c2 KVM: x86: Clear PV_UNHALT for !HLT-exiting only when userspace sets CPUID
Now that KVM disallows disabling HLT-exiting after vCPUs have been created,
i.e. now that it's impossible for kvm_hlt_in_guest() to change while vCPUs
are running, apply KVM's PV_UNHALT quirk only when userspace is setting
guest CPUID.

Opportunistically rename the helper to make it clear that KVM's behavior
is a quirk that should never have been added.  KVM's documentation
explicitly states that userspace should not advertise PV_UNHALT if
HLT-exiting is disabled, but for unknown reasons, commit caa057a2ca
("KVM: X86: Provide a capability to disable HLT intercepts") didn't stop
at documenting the requirement and also massaged the incoming guest CPUID.

Unfortunately, it's quite likely that userspace has come to rely on KVM's
behavior, i.e. the code can't simply be deleted.  The only reason KVM
doesn't have an "official" quirk is that there is no known use case where
disabling the quirk would make sense, i.e. letting userspace disable the
quirk would further increase KVM's burden without any benefit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:55 -08:00
Sean Christopherson
8c01290bda KVM: x86: Swap incoming guest CPUID into vCPU before massaging in KVM_SET_CPUID2
When handling KVM_SET_CPUID{,2}, swap the old and new CPUID arrays and
lengths before processing the new CPUID, and simply undo the swap if
setting the new CPUID fails for whatever reason.

To keep the diff reasonable, continue passing the entry array and length
to most helpers, and defer the more complete cleanup to future commits.

For any sane VMM, setting "bad" CPUID state is not a hot path (or even
something that is survivable), and setting guest CPUID before it's known
good will allow removing all of KVM's infrastructure for processing CPUID
entries directly (as opposed to operating on vcpu->arch.cpuid_entries).
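
The shape of the change, as a sketch (the processing call below is a
hypothetical placeholder, not the real function name):

  swap(vcpu->arch.cpuid_entries, e2);
  swap(vcpu->arch.cpuid_nent, nent);

  r = kvm_process_new_cpuid(vcpu);    /* hypothetical processing step */
  if (r) {
          /* Setting the new CPUID failed, restore the old entries. */
          swap(vcpu->arch.cpuid_entries, e2);
          swap(vcpu->arch.cpuid_nent, nent);
  }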

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:54 -08:00
Sean Christopherson
6174004ebd KVM: x86: Add a macro to init CPUID features that KVM emulates in software
Now that kvm_cpu_cap_init() is a macro with its own scope, add EMUL_F() to
OR-in features that KVM emulates in software, i.e. that don't depend on
the feature being available in hardware.  The contained scope
of kvm_cpu_cap_init() allows using a local variable to track the set of
emulated leaves, which in addition to avoiding confusing and/or
unnecessary variables, helps prevent misuse of EMUL_F().
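
A heavily simplified sketch of the shape (the real macros differ;
raw_cpuid_word() and the accumulator's name are made up for
illustration):

  #define kvm_cpu_cap_init(leaf, mask)                            \
  do {                                                            \
          u32 kvm_cpu_cap_emulated = 0;                           \
                                                                  \
          kvm_cpu_caps[leaf] = (mask) & raw_cpuid_word(leaf);     \
          kvm_cpu_caps[leaf] |= kvm_cpu_cap_emulated;             \
  } while (0)

  /* Expands inside kvm_cpu_cap_init()'s scope, so it sees the local. */
  #define EMUL_F(name)                                            \
          ({ kvm_cpu_cap_emulated |= feature_bit(name); feature_bit(name); })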

Link: https://lore.kernel.org/r/20241128013424.4096668-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:53 -08:00
Sean Christopherson
5c8de4b3a5 KVM: x86: Add a macro to init CPUID features that ignore host kernel support
Add a macro for use in kvm_set_cpu_caps() to automagically initialize
features that KVM wants to support based solely on the CPU's capabilities,
e.g. KVM advertises LA57 support if it's available in hardware, even if
the host kernel isn't utilizing 57-bit virtual addresses.

Track the features that are passed through to userspace (from hardware) in
a local variable, and simply OR them in *after* adjusting the capabilities
that came from boot_cpu_data.

Note, eliminating the open-coded call to cpuid_ecx() also fixes a largely
benign bug where KVM could incorrectly report LA57 support on Intel CPUs
whose max supported CPUID is less than 7, i.e. if the max supported leaf
(<7) happened to have bit 16 set.  In practice, barring a funky virtual
machine setup, the bug is benign as all known CPUs that support VMX also
support leaf 7.

Link: https://lore.kernel.org/r/20241128013424.4096668-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:52 -08:00
Sean Christopherson
3d142340d7 KVM: x86: Harden CPU capabilities processing against out-of-scope features
Add compile-time assertions to verify that usage of F() and friends in
kvm_set_cpu_caps() is scoped to the correct CPUID word, e.g. to detect
bugs where KVM passes a feature bit from word X into word Y.

Add a one-off assertion in the aliased feature macro to ensure that only
word 0x8000_0001.EDX aliases the features defined for 0x1.EDX.

To do so, convert kvm_cpu_cap_init() to a macro and have it define a
local variable to track which CPUID word is being initialized that is
then used to validate usage of F() (all of the inputs are compile-time
constants and thus can be fed into BUILD_BUG_ON()).

Redefine KVM_VALIDATE_CPU_CAP_USAGE after kvm_set_cpu_caps() to be a nop
so that F() can be used in other flows that aren't as easily hardened,
e.g. __do_cpuid_func_emulated() and __do_cpuid_func().

Invoke KVM_VALIDATE_CPU_CAP_USAGE() in SF() and X86_64_F() to ensure the
validation occurs, e.g. if the usage of F() is completely compiled out
(which shouldn't happen for boot_cpu_has(), but could happen in the future,
e.g. if KVM were to use cpu_feature_enabled()).
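
A sketch of the kind of assertion involved (the name of the local that
tracks the in-progress word is illustrative):

  /* (Re)defined by kvm_cpu_cap_init() for the word being initialized. */
  #define KVM_VALIDATE_CPU_CAP_USAGE(name)                        \
          BUILD_BUG_ON(__feature_leaf(X86_FEATURE_##name) !=      \
                       kvm_cpu_cap_init_in_progress)

  #define F(name)                                                 \
  ({                                                              \
          KVM_VALIDATE_CPU_CAP_USAGE(name);                       \
          feature_bit(name);                                      \
  })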

Link: https://lore.kernel.org/r/20241128013424.4096668-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:51 -08:00
Sean Christopherson
8d862c270b KVM: x86: #undef SPEC_CTRL_SSBD in cpuid.c to avoid macro collisions
Undefine SPEC_CTRL_SSBD, which is #defined by msr-index.h to represent the
enable flag in MSR_IA32_SPEC_CTRL, to avoid issues with the macro being
unpacked into its raw value when passed to KVM's F() macro.  This will
allow using multiple layers of macros in F() and friends, e.g. to harden
against incorrect usage of F().

No functional change intended (cpuid.c doesn't consume SPEC_CTRL_SSBD).

Link: https://lore.kernel.org/r/20241128013424.4096668-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:50 -08:00
Sean Christopherson
46505c0f69 KVM: x86: Handle kernel- and KVM-defined CPUID words in a single helper
Merge kvm_cpu_cap_init() and kvm_cpu_cap_init_kvm_defined() into a single
helper.  The only advantage of separating the two was to make it somewhat
obvious that KVM directly initializes the KVM-defined words, whereas using
a common helper will allow for hardening both kernel- and KVM-defined
CPUID words without needing copy+paste.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:49 -08:00
Sean Christopherson
264969b48a KVM: x86: Add a macro to precisely handle aliased 0x1.EDX CPUID features
Add a macro to precisely handle CPUID features that AMD duplicated from
CPUID.0x1.EDX into CPUID.0x8000_0001.EDX.  This will allow adding an
assert that all features passed to kvm_cpu_cap_init() match the word being
processed, e.g. to prevent passing a feature from CPUID 0x7 to CPUID 0x1.

Because the kernel simply reuses the X86_FEATURE_* definitions from
CPUID.0x1.EDX, KVM's use of the aliased features would result in false
positives from such an assert.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:48 -08:00
Sean Christopherson
6eac4d99a9 KVM: x86: Add a macro to init CPUID features that are 64-bit only
Add a macro to mask-in feature flags that are supported only on 64-bit
kernels/KVM.  In addition to reducing overall #ifdeffery, using a macro
will allow hardening the kvm_cpu_cap initialization sequences to assert
that the features being advertised are indeed included in the word being
initialized.  And arguably, using *F() macros throughout is more readable.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:47 -08:00
Sean Christopherson
3cc359ca29 KVM: x86: Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init()
Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init() in anticipation of merging
it with kvm_cpu_cap_init_kvm_defined(), and in anticipation of _setting_
bits in the helper (a future commit will play macro games to set emulated
feature flags via kvm_cpu_cap_init()).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:46 -08:00
Sean Christopherson
ccf93de484 KVM: x86: Unpack F() CPUID feature flag macros to one flag per line of code
Refactor kvm_set_cpu_caps() to express each supported (or not) feature
flag on a separate line, modulo a handful of cases where KVM does not, and
likely will not, support a sequence of flags.  This will allow adding
fancier macros with longer, more descriptive names without resulting in
absurd line lengths and/or weird code.  Isolating each flag also makes it
far easier to review changes, reduces code conflicts, and generally makes
it easier to resolve conflicts.  Lastly, it allows co-locating comments
for notable flags, e.g. MONITOR, precisely with the relevant flag.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:45 -08:00
Sean Christopherson
96cbc766ba KVM: x86: Account for max supported CPUID leaf when getting raw host CPUID
Explicitly zero out the feature word in kvm_cpu_caps if the word's
associated CPUID function is greater than the max leaf supported by the
CPU.  For such unsupported functions, Intel CPUs return the output from
the last supported leaf, not all zeros.
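
Conceptually (names are illustrative, not the actual patch), the raw
read becomes:

  static u32 raw_cpuid_reg(u32 function, u32 index, int reg)
  {
          u32 base = function & 0x80000000;

          /*
           * Out-of-range functions echo the last supported leaf on Intel,
           * so treat anything above the max reported leaf as all zeros.
           */
          if (function > cpuid_eax(base))
                  return 0;

          return read_cpuid_reg(function, index, reg);    /* hypothetical */
  }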

Practically speaking, this is likely a benign bug, as KVM uses the raw
host CPUID to mask the kernel's computed capabilities, and the kernel does
perform max leaf checks when populating boot_cpu_data.  The only way KVM's
goof could be problematic is if the kernel force-set a feature in a leaf
that is completely unsupported, _and_ the max supported leaf happened to
return a value with '1' in the same bit position.  Which is theoretically
possible, but extremely unlikely.  And even if that did happen, it's
entirely possible that KVM would still provide the correct functionality;
the kernel did set the capability after all.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:43 -08:00
Sean Christopherson
6416b0fb16 KVM: x86: Do reverse CPUID sanity checks in __feature_leaf()
Do the compile-time sanity checks on reverse_cpuid in __feature_leaf() so
that higher level APIs don't need to "manually" perform the sanity checks.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:42 -08:00
Sean Christopherson
f21958e328 KVM: x86: Don't update PV features caches when enabling enforcement capability
Revert the chunk of commit 01b4f510b9 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") that forced a PV features cache refresh
during KVM_CAP_ENFORCE_PV_FEATURE_CPUID, as whatever ioctl() ordering
issue it alleged to have fixed never existed upstream, and likely never
existed in any kernel.

At the time of the commit, there was a tangentially related ioctl()
ordering issue, as toggling KVM_X86_DISABLE_EXITS_HLT after KVM_SET_CPUID2
would have resulted in KVM potentially leaving KVM_FEATURE_PV_UNHALT set.
But (a) that bug affected the entire guest CPUID, not just the cache, (b)
commit 01b4f510b9 didn't address that bug, it only refreshed the cache
(with the bad CPUID), and (c) setting KVM_X86_DISABLE_EXITS_HLT after vCPU
creation is completely broken as KVM configures HLT-exiting only during
vCPU creation, which is why KVM_CAP_X86_DISABLE_EXITS is now disallowed if
vCPUs have been created.

Another tangentially related bug was KVM's failure to clear the cache when
handling KVM_SET_CPUID2, but again commit 01b4f510b9 did nothing to fix
that bug.

The most plausible explanation for what commit 01b4f510b9 was trying
to fix is a bug that existed in Google's internal kernel that was the
source of commit 01b4f510b9.  At the time, Google's internal kernel had
not yet picked up commit 0d3b2ba16b ("KVM: X86: Go on updating other
CPUID leaves when leaf 1 is absent"), i.e. KVM would not initialize the
PV features cache if KVM_SET_CPUID2 was called without a CPUID.0x1 entry.

Of course, no sane real world VMM would omit CPUID.0x1, including the KVM
selftest added by commit ac4a4d6de2 ("selftests: kvm: test enforcement
of paravirtual cpuid features").  And the test didn't actually try to
verify multiple orderings, nor did the selftest enter the guest without
doing KVM_SET_CPUID2, so who knows what motivated the change.

Regardless of why commit 01b4f510b9 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") was added, refreshing the cache during
KVM_CAP_ENFORCE_PV_FEATURE_CPUID isn't necessary.

Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:41 -08:00
Sean Christopherson
01d1059d63 KVM: x86: Zero out PV features cache when the CPUID leaf is not present
Clear KVM's PV feature cache prior to processing a new guest CPUID so
that KVM doesn't keep a stale cache entry if userspace does KVM_SET_CPUID2
multiple times, once with a PV features entry, and a second time without.
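
Roughly (the helper name is illustrative), the fix boils down to always
writing the cache, with zero as the "not present" value:

  static void kvm_update_pv_features_cache(struct kvm_vcpu *vcpu)
  {
          struct kvm_cpuid_entry2 *best;

          best = kvm_find_kvm_cpuid_features(vcpu);

          /* Zero the cache if the new CPUID has no KVM PV features leaf. */
          vcpu->arch.pv_cpuid.features = best ? best->eax : 0;
  }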

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:40 -08:00
Sean Christopherson
c829ccd4d9 KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
Reject KVM_CAP_X86_DISABLE_EXITS if userspace attempts to disable MWAIT or
HLT exits and KVM previously reported (via KVM_CHECK_EXTENSION) that
disabling the exit(s) is not allowed.  E.g. because MWAIT isn't supported
or the CPU doesn't have an always-running APIC timer, or because KVM is
configured to mitigate cross-thread vulnerabilities.

Cc: Kechen Lu <kechenl@nvidia.com>
Fixes: 4d5422cea3 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Fixes: 6f0f2d5ef8 ("KVM: x86: Mitigate the cross-thread return address predictions bug")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:36 -08:00
Sean Christopherson
04cd8f8628 KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling
PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless,
e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after
vCPU creation.  vCPUs may also end up with an inconsistent configuration
if exits are disabled between creation of multiple vCPUs.
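
A sketch of the enforcement (simplified; the real handler also validates
which exits are allowed to be disabled):

  mutex_lock(&kvm->lock);
  if (kvm->created_vcpus) {
          r = -EINVAL;
  } else {
          kvm->arch.mwait_in_guest = !!(cap->args[0] & KVM_X86_DISABLE_EXITS_MWAIT);
          kvm->arch.hlt_in_guest   = !!(cap->args[0] & KVM_X86_DISABLE_EXITS_HLT);
          kvm->arch.pause_in_guest = !!(cap->args[0] & KVM_X86_DISABLE_EXITS_PAUSE);
          r = 0;
  }
  mutex_unlock(&kvm->lock);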

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:35 -08:00
Sean Christopherson
21d7f06d1a KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU creation
Drop the manual initialization of maxphyaddr and reserved_gpa_bits during
vCPU creation now that kvm_arch_vcpu_create() unconditionally invokes
kvm_vcpu_after_set_cpuid(), which handles all such CPUID caching.

None of the helpers between the existing code in kvm_arch_vcpu_create()
and the call to kvm_vcpu_after_set_cpuid() consume maxphyaddr or
reserved_gpa_bits (though auditing vmx_vcpu_create() and svm_vcpu_create()
isn't exactly easy).

Link: https://lore.kernel.org/r/20241128013424.4096668-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:34 -08:00
Sean Christopherson
ac32cbd4df KVM: x86/pmu: Drop now-redundant refresh() during init()
Drop the manual kvm_pmu_refresh() from kvm_pmu_init() now that
kvm_arch_vcpu_create() performs the refresh via kvm_vcpu_after_set_cpuid().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:33 -08:00
Sean Christopherson
b0c3d68717 KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h
Let vendor code inline __kvm_is_valid_cr4() now that x86.c's cr4_reserved_bits
no longer exists, as keeping cr4_reserved_bits local to x86.c was the only
reason for "hiding" the definition of __kvm_is_valid_cr4().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:32 -08:00
Sean Christopherson
7520a53b8e KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX
Drop x86.c's local pre-computed cr4_reserved bits and instead fold KVM's
reserved bits into the guest's reserved bits.  This fixes a bug where VMX's
set_cr4_guest_host_mask() fails to account for KVM-reserved bits when
deciding which bits can be passed through to the guest.  In most cases,
letting the guest directly write reserved CR4 bits is ok, i.e. attempting
to set the bit(s) will still #GP, but not if a feature is available in
hardware but explicitly disabled by the host, e.g. if FSGSBASE support is
disabled via "nofsgsbase".

Note, the extra overhead of computing host reserved bits every time
userspace sets guest CPUID is negligible.  The feature bits that are
queried are packed nicely into a handful of words, and so checking and
setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the
total cost will be in the noise even if the number of checked CR4 bits
doubles over the next few years.  In other words, x86 will run out of CR4
bits long before the overhead becomes problematic.

Note #2, __cr4_reserved_bits() starts from CR4_RESERVED_BITS, which is
why the existing __kvm_cpu_cap_has() processing doesn't explicitly OR in
CR4_RESERVED_BITS (and why the new code doesn't do so either).

Fixes: 2ed41aa631 ("KVM: VMX: Intercept guest reserved CR4 bits to inject #GP fault")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:26 -08:00
Sean Christopherson
ec3d4440b2 KVM: x86: Explicitly do runtime CPUID updates "after" initial setup
Explicitly perform runtime CPUID adjustments as part of the "after set
CPUID" flow to guard against bugs where KVM consumes stale vCPU/CPUID
state during kvm_update_cpuid_runtime().  E.g. see commit 4736d85f0d
("KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT").

Whacking each mole individually is not sustainable or robust, e.g. while
the aforementioned commit fixed KVM's PV features, the same issue lurks for
Xen and Hyper-V features, Xen and Hyper-V simply don't have any runtime
features (though spoiler alert, neither should KVM).

Updating runtime features in the "full" path will also simplify adding a
snapshot of the guest's capabilities, i.e. of caching the intersection of
guest CPUID and kvm_cpu_caps (modulo a few edge cases).

Link: https://lore.kernel.org/r/20241128013424.4096668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:25 -08:00
Sean Christopherson
85e5ba83c0 KVM: x86: Do all post-set CPUID processing during vCPU creation
During vCPU creation, process KVM's default, empty CPUID as if userspace
set an empty CPUID to ensure consistent and correct behavior with respect
to guest CPUID.  E.g. if userspace never sets guest CPUID, KVM will never
configure cr4_guest_rsvd_bits, and thus create divergent, incorrect, guest-
visible behavior due to letting the guest set any KVM-supported CR4 bits
despite the features not being allowed per guest CPUID.

Note!  This changes KVM's ABI, as lack of full CPUID processing allowed
userspace to stuff garbage vCPU state, e.g. userspace could set CR4 to a
guest-unsupported value via KVM_SET_SREGS.  But it's extremely unlikely
that this is a breaking change, as KVM already has many flows that require
userspace to set guest CPUID before loading vCPU state.  E.g. multiple MSR
flows consult guest CPUID on host writes, and KVM_SET_SREGS itself already
relies on guest CPUID being up-to-date, as KVM's validity check on CR3
consumes CPUID.0x7.1 (for LAM) and CPUID.0x80000008 (for MAXPHYADDR).

Furthermore, the plan is to commit to enforcing guest CPUID for userspace
writes to MSRs, at which point bypassing sregs CPUID checks is even more
nonsensical.

Link: https://lore.kernel.org/r/20241128013424.4096668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:24 -08:00
Sean Christopherson
4b027f5af9 KVM: x86: Limit use of F() and SF() to kvm_cpu_cap_{mask,init_kvm_defined}()
Define and undefine the F() and SF() macros precisely around
kvm_set_cpu_caps() to make it all but impossible to use the macros outside
of kvm_cpu_cap_{mask,init_kvm_defined}().  Currently, F() is a simple
passthrough, but SF() is actively dangerous as it checks that the scattered
feature is supported by the host kernel.

And usage outside of the aforementioned helpers will run afoul of future
changes to harden KVM's CPUID management.

Opportunistically switch to feature_bit() when stuffing LA57 based on raw
hardware support.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:23 -08:00
Sean Christopherson
ccf4c1d15d KVM: x86: Use feature_bit() to clear CONSTANT_TSC when emulating CPUID
When clearing CONSTANT_TSC during CPUID emulation due to a Hyper-V quirk,
use feature_bit() instead of SF() to ensure the bit is actually cleared.
SF() evaluates to zero if the _host_ doesn't support the feature.  I.e.
KVM could keep the bit set if userspace advertised CONSTANT_TSC despite
it not being supported in hardware.

Note, translating from a scattered feature to the hardware version is
done by __feature_translate(), not SF().  The sole purpose of SF() is to
check kernel support for the scattered feature, *before* translation.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:22 -08:00
Sean Christopherson
036e78a942 KVM: SVM: Remove redundant TLB flush on guest CR4.PGE change
Drop SVM's direct TLB flush when CR4.PGE is toggled and NPT is enabled, as
KVM already guarantees TLBs are flushed appropriately.

For the call from cr_trap(), kvm_post_set_cr4() requests TLB_FLUSH_GUEST
(which is a superset of TLB_FLUSH_CURRENT) when CR4.PGE is toggled,
regardless of whether or not KVM is using TDP.

The calls from nested_vmcb02_prepare_save() and nested_svm_vmexit() are
checking guest (L2) vs. host (L1) CR4, and so a flush is unnecessary as L2
is defined to use a different ASID (from L1's perspective).

Lastly, the call from svm_set_cr0() passes in the current CR4 value, i.e.
can't toggle PGE.

Link: https://lore.kernel.org/r/20241127235312.4048445-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 16:15:28 -08:00
Sean Christopherson
45d522d3ee KVM: SVM: Macrofy SEV=n versions of sev_xxx_guest()
Define sev_{,es_,snp_}guest() as "false" when SEV is disabled via Kconfig,
i.e. when CONFIG_KVM_AMD_SEV=n.  Despite the helpers being __always_inline,
gcc-12 is somehow incapable of realizing that the return value is a
compile-time constant and generates sub-optimal code.

Opportunistically clump the paths together to reduce the amount of
ifdeffery.
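
The gist of the change, sketched (the CONFIG_KVM_AMD_SEV=y side is shown
as bare prototypes for brevity):

  #ifdef CONFIG_KVM_AMD_SEV
  bool sev_guest(struct kvm *kvm);
  bool sev_es_guest(struct kvm *kvm);
  bool sev_snp_guest(struct kvm *kvm);
  #else
  #define sev_guest(kvm)          false
  #define sev_es_guest(kvm)       false
  #define sev_snp_guest(kvm)      false
  #endif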

No functional change intended.

Link: https://lore.kernel.org/r/20241127234659.4046347-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 16:15:27 -08:00
Juergen Gross
2d5faa6a84 KVM/x86: add comment to kvm_mmu_do_page_fault()
At first glance it isn't obvious why calling kvm_tdp_page_fault() in
kvm_mmu_do_page_fault() is special cased, as the general case of an
indirect call would result in calling kvm_tdp_page_fault() anyway.

Add a comment to explain the reason.

Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20241108161416.28552-1-jgross@suse.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 15:27:34 -08:00
Sean Christopherson
76bce9f101 KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX
can defer the update until after nested VM-Exit if an EOI for L1's vAPIC
occurs while L2 is active.

Note, commit d39850f57d ("KVM: x86: Drop @vcpu parameter from
kvm_x86_ops.hwapic_isr_update()") removed the parameter with the
justification that doing so "allows for a decent amount of (future)
cleanup in the APIC code", but it's not at all clear what cleanup was
intended, or if it was ever realized.

No functional change intended.

Cc: stable@vger.kernel.org
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 15:18:30 -08:00
Sean Christopherson
1201f226c8 KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initialization to
avoid the overhead of CPUID during runtime.  The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves do not depend on XCR0 or XSS values, i.e.
are constant for a given CPU, and thus can be cached during module load.
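
A sketch of the caching (structure and array names follow the
description above, but are not guaranteed to match the patch exactly):

  static struct cpuid_xstate_sizes {
          u32 eax, ebx, ecx;
  } xstate_sizes[XFEATURE_MAX] __ro_after_init;

  void __init kvm_init_xstate_sizes(void)
  {
          u32 ign;
          int i;

          /* Sub-leaf output is constant for a given CPU, snapshot it once. */
          for (i = XFEATURE_YMM; i < XFEATURE_MAX; i++)
                  cpuid_count(0xD, i, &xstate_sizes[i].eax,
                              &xstate_sizes[i].ebx, &xstate_sizes[i].ecx, &ign);
  }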

On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values.  The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.

As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:

  SKX 11650
  ICX 22350
  EMR 28850

Using cached values, the latency drops to:

  SKX 6850
  ICX 9000
  EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR.  The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).

 SKX:
  CPUID.0xD.2  = 348 cycles
  CPUID.0xD.3  = 400 cycles
  CPUID.0xD.4  = 276 cycles
  CPUID.0xD.5  = 236 cycles
  <other sub-leaves are similar>

 EMR:
  CPUID.0xD.2  = 1138 cycles
  CPUID.0xD.3  = 1362 cycles
  CPUID.0xD.4  = 1068 cycles
  CPUID.0xD.5  = 910 cycles
  CPUID.0xD.6  = 914 cycles
  CPUID.0xD.7  = 1350 cycles
  CPUID.0xD.8  = 734 cycles
  CPUID.0xD.9  = 766 cycles
  CPUID.0xD.10 = 732 cycles
  CPUID.0xD.11 = 718 cycles
  CPUID.0xD.12 = 734 cycles
  CPUID.0xD.13 = 1700 cycles
  CPUID.0xD.14 = 1126 cycles
  CPUID.0xD.15 = 898 cycles
  CPUID.0xD.16 = 716 cycles
  CPUID.0xD.17 = 748 cycles
  CPUID.0xD.18 = 776 cycles

Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandatory
intercept on both Intel and AMD.  E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2.  That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-13 13:58:10 -05:00
Peter Zijlstra
2190966fbc x86: Convert unreachable() to BUG()
Avoid unreachable() as it can (and will in the absence of UBSAN)
generate fallthrough code. Use BUG() so we get a UD2 trap (with
unreachable annotation).
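
I.e. in the affected impossible/default paths, the pattern is simply:

  -       unreachable();
  +       BUG();      /* guaranteed UD2, still annotated as unreachable */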

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org
2024-12-02 12:01:43 +01:00
Linus Torvalds
9f16d5e6f2 The biggest change here is eliminating the awful idea that KVM had, of
essentially guessing which pfns are refcounted pages.  The reason to
 do so was that KVM needs to map both non-refcounted pages (for example
 BARs of VFIO devices) and VM_PFNMAP/VM_MIXEDMAP VMAs that contain
 refcounted pages.  However, the result was security issues in the past,
 and more recently the inability to map VM_IO and VM_PFNMAP memory
 that _is_ backed by struct page but is not refcounted.  In particular
 this broke virtio-gpu blob resources (which directly map host graphics
 buffers into the guest as "vram" for the virtio-gpu device) with the
 amdgpu driver, because amdgpu allocates non-compound higher order pages
 and the tail pages could not be mapped into KVM.
 
 This requires adjusting all uses of struct page in the per-architecture
 code, to always work on the pfn whenever possible.  The large series that
 did this, from David Stevens and Sean Christopherson, also cleaned up
 substantially the set of functions that provided arch code with the
 pfn for a host virtual address.  The previous maze of twisty little
 passages, all different, is replaced by five functions (__gfn_to_page,
 __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages)
 saving almost 200 lines of code.
 
 ARM:
 
 * Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker
 
 * Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI
 
 * Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps
 
 * PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest
 
 * Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM
 
 * Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection
 
 * Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
 
 LoongArch:
 
 * Add iocsr and mmio bus simulation in kernel.
 
 * Add in-kernel interrupt controller emulation.
 
 * Add support for virtualization extensions to the eiointc irqchip.
 
 PPC:
 
 * Drop lingering and utterly obsolete references to PPC970 KVM, which was
   removed 10 years ago.
 
 * Fix incorrect documentation references to non-existing ioctls
 
 RISC-V:
 
 * Accelerate KVM RISC-V when running as a guest
 
 * Perf support to collect KVM guest statistics from host side
 
 s390:
 
 * New selftests: more ucontrol selftests and CPU model sanity checks
 
 * Support for the gen17 CPU model
 
 * List registers supported by KVM_GET/SET_ONE_REG in the documentation
 
 x86:
 
 * Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes.  Even if the hardware
   A/D tracking is disabled, it is possible to use the hardware-defined A/D
   bits to track if a PFN is Accessed and/or Dirty, and that removes a lot
   of special cases.
 
 * Elide TLB flushes when aging secondary PTEs, as has been done in x86's
   primary MMU for over 10 years.
 
 * Recover huge pages in-place in the TDP MMU when dirty page logging is
   toggled off, instead of zapping them and waiting until the page is
   re-accessed to create a huge mapping.  This reduces vCPU jitter.
 
 * Batch TLB flushes when dirty page logging is toggled off.  This reduces
   the time it takes to disable dirty logging by ~3x.
 
 * Remove the shrinker that was (poorly) attempting to reclaim shadow page
   tables in low-memory situations.
 
 * Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
 * Advertise CPUIDs for new instructions in Clearwater Forest
 
 * Quirk KVM's misguided behavior of initializing certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which in turn can lead to save/restore failures.
 
 * Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.
 
 * Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe; harden the
   cache accessors to try to prevent similar issues from occurring in the
   future.  The issue that triggered this change was already fixed in 6.12,
   but was still kinda latent.
 
 * Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
 * Minor cleanups
 
 * Switch hugepage recovery thread to use vhost_task.  These kthreads can
   consume significant amounts of CPU time on behalf of a VM or in response
   to how the VM behaves (for example how it accesses its memory); therefore
   KVM tried to place the thread in the VM's cgroups and charge the CPU
   time consumed by that work to the VM's container.  However the kthreads
   did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM
   instances inside could not complete freezing.  Fix this by replacing the
   kthread with a PF_USER_WORKER thread, via the vhost_task abstraction.
   Another 100+ lines removed, with generally better behavior too like
   having these threads properly parented in the process tree.
 
 * Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't
   really work; there was really nothing to work around anyway: the broken
   patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL
   MSR is virtualized and therefore unaffected by the erratum.
 
 * Fix 6.12 regression where CONFIG_KVM will be built as a module even
   if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.
 
 x86 selftests:
 
 * x86 selftests can now use AVX.
 
 Documentation:
 
 * Use rST internal links
 
 * Reorganize the introduction to the API document
 
 Generic:
 
 * Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter long
   delays due to having to wait for all CPUs to become quiescent.  In general both reads
   and writes are rare, but userspace that supports confidential computing is
   introducing the use of "helper" vCPUs that may jump from one host processor
   to another.  Those will be very happy to trigger a synchronize_rcu(), and
   the effect on performance is quite the disaster.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmc9MRYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP00QgArxqxBIGLCW5t7bw7vtNq63QYRyh4
 dTiDguLiYQJ+AXmnRu11R6aPC7HgMAvlFCCmH+GEce4WEgt26hxCmncJr/aJOSwS
 letCS7TrME16PeZvh25A1nhPBUw6mTF1qqzgcdHMrqXG8LuHoGcKYGSRVbkf3kfI
 1ZoMq1r8ChXbVVmCx9DQ3gw1TVr5Dpjs2voLh8rDSE9Xpw0tVVabHu3/NhQEz/F+
 t8/nRaqH777icCHIf9PCk5HnarHxLAOvhM2M0Yj09PuBcE5fFQxpxltw/qiKQqqW
 ep4oquojGl87kZnhlDaac2UNtK90Ws+WxxvCwUmbvGN0ZJVaQwf4FvTwig==
 =lWpE
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The biggest change here is eliminating the awful idea that KVM had of
  essentially guessing which pfns are refcounted pages.

  The reason to do so was that KVM needs to map both non-refcounted
  pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXEDMAP
  VMAs that contain refcounted pages.

  However, the result was security issues in the past, and more recently
  the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
  struct page but is not refcounted. In particular this broke virtio-gpu
  blob resources (which directly map host graphics buffers into the
  guest as "vram" for the virtio-gpu device) with the amdgpu driver,
  because amdgpu allocates non-compound higher order pages and the tail
  pages could not be mapped into KVM.

  This requires adjusting all uses of struct page in the
  per-architecture code, to always work on the pfn whenever possible.
  The large series that did this, from David Stevens and Sean
  Christopherson, also cleaned up substantially the set of functions
  that provided arch code with the pfn for a host virtual address.

  The previous maze of twisty little passages, all different, is
  replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
  non-__ versions of these two, and kvm_prefetch_pages) saving almost
  200 lines of code.

  ARM:

   - Support for stage-1 permission indirection (FEAT_S1PIE) and
     permission overlays (FEAT_S1POE), including nested virt + the
     emulated page table walker

   - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
     call was introduced in PSCIv1.3 as a mechanism to request
     hibernation, similar to the S4 state in ACPI

   - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
     part of it, introduce trivial initialization of the host's MPAM
     context so KVM can use the corresponding traps

   - PMU support under nested virtualization, honoring the guest
     hypervisor's trap configuration and event filtering when running a
     nested guest

   - Fixes to vgic ITS serialization where stale device/interrupt table
     entries are not zeroed when the mapping is invalidated by the VM

   - Avoid emulated MMIO completion if userspace has requested
     synchronous external abort injection

   - Various fixes and cleanups affecting pKVM, vCPU initialization, and
     selftests

  LoongArch:

   - Add iocsr and mmio bus simulation in kernel.

   - Add in-kernel interrupt controller emulation.

   - Add support for virtualization extensions to the eiointc irqchip.

  PPC:

   - Drop lingering and utterly obsolete references to PPC970 KVM, which
     was removed 10 years ago.

   - Fix incorrect documentation references to non-existing ioctls

  RISC-V:

   - Accelerate KVM RISC-V when running as a guest

   - Perf support to collect KVM guest statistics from host side

  s390:

   - New selftests: more ucontrol selftests and CPU model sanity checks

   - Support for the gen17 CPU model

   - List registers supported by KVM_GET/SET_ONE_REG in the
     documentation

  x86:

   - Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
     improve documentation, harden against unexpected changes.

     Even if the hardware A/D tracking is disabled, it is possible to
     use the hardware-defined A/D bits to track if a PFN is Accessed
     and/or Dirty, and that removes a lot of special cases.

   - Elide TLB flushes when aging secondary PTEs, as has been done in
     x86's primary MMU for over 10 years.

   - Recover huge pages in-place in the TDP MMU when dirty page logging
     is toggled off, instead of zapping them and waiting until the page
     is re-accessed to create a huge mapping. This reduces vCPU jitter.

   - Batch TLB flushes when dirty page logging is toggled off. This
     reduces the time it takes to disable dirty logging by ~3x.

   - Remove the shrinker that was (poorly) attempting to reclaim shadow
     page tables in low-memory situations.

   - Clean up and optimize KVM's handling of writes to
     MSR_IA32_APICBASE.

   - Advertise CPUIDs for new instructions in Clearwater Forest

   - Quirk KVM's misguided behavior of initializing certain feature MSRs
     to their maximum supported feature set, which can result in KVM
     creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
     a non-zero value results in the vCPU having invalid state if
     userspace hides PDCM from the guest, which in turn can lead to
     save/restore failures.

   - Fix KVM's handling of non-canonical checks for vCPUs that support
     LA57 to better follow the "architecture", in quotes because the
     actual behavior is poorly documented. E.g. most MSR writes and
     descriptor table loads ignore CR4.LA57 and operate purely on
     whether the CPU supports LA57.

   - Bypass the register cache when querying CPL from kvm_sched_out(),
     as filling the cache from IRQ context is generally unsafe; harden
     the cache accessors to try to prevent similar issues from occurring
     in the future. The issue that triggered this change was already
     fixed in 6.12, but was still kinda latent.

   - Advertise AMD_IBPB_RET to userspace, and fix a related bug where
     KVM over-advertises SPEC_CTRL when trying to support cross-vendor
     VMs.

   - Minor cleanups

   - Switch hugepage recovery thread to use vhost_task.

     These kthreads can consume significant amounts of CPU time on
     behalf of a VM or in response to how the VM behaves (for example
     how it accesses its memory); therefore KVM tried to place the
     thread in the VM's cgroups and charge the CPU time consumed by that
     work to the VM's container.

     However the kthreads did not process SIGSTOP/SIGCONT, and therefore
     cgroups which had KVM instances inside could not complete freezing.

     Fix this by replacing the kthread with a PF_USER_WORKER thread, via
     the vhost_task abstraction. Another 100+ lines removed, with
     generally better behavior too like having these threads properly
     parented in the process tree.

   - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
     didn't really work; there was really nothing to work around anyway:
     the broken patch was meant to fix nested virtualization, but the
     PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
     erratum.

   - Fix 6.12 regression where CONFIG_KVM will be built as a module even
     if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
     'y'.

  x86 selftests:

   - x86 selftests can now use AVX.

  Documentation:

   - Use rST internal links

   - Reorganize the introduction to the API document

  Generic:

   - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
     instead of RCU, so that running a vCPU on a different task doesn't
     encounter long delays due to having to wait for all CPUs to become quiescent.

     In general both reads and writes are rare, but userspace that
     supports confidential computing is introducing the use of "helper"
     vCPUs that may jump from one host processor to another. Those will
     be very happy to trigger a synchronize_rcu(), and the effect on
     performance is quite the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...
2024-11-23 16:00:50 -08:00