When retrying the faulting instruction after emulation failure, refresh
the infinite loop protection fields even if no shadow pages were zapped,
i.e. avoid hitting an infinite loop even when retrying the instruction as
a last-ditch effort to avoid terminating the guest.
Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the anti-infinite-loop protection provided by last_retry_{eip,addr}
into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that
never hits the emulator, as well as reexecute_instruction(), which is the
last ditch "might as well try it" logic that kicks in when emulation fails
on an instruction that faulted on a write-protected gfn.
Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry
fields and deduplicate other code (with more to come).
Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a
more appropriately named is_write_to_guest_page_table(). The macro name
is misleading, because while all nNPT walks match PAGE|WRITE|PRESENT, the
reverse is not true.
No functional change intended.
Link: https://lore.kernel.org/r/20240831001538.336683-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from
nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one
and only path where KVM invokes nested_vmx_vmexit() with
EXIT_REASON_EXTERNAL_INTERRUPT. A future fix will perform a last-minute
check on L2's nested posted interrupt notification vector, just before
injecting a nested VM-Exit. To handle that scenario correctly, KVM needs
to get the interrupt _before_ injecting VM-Exit, as simply querying the
highest priority interrupt, via kvm_cpu_has_interrupt(), would result in
TOCTOU bug, as a new, higher priority interrupt could arrive between
kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().
Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't
suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in
kvm_get_apic_interrupt(), and acknowledging the interrupt before nested
VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01.
Open code a rough equivalent to kvm_cpu_get_interrupt() so that the IRQ
is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU
issue described above.
Opportunistically convert the WARN_ON() to a WARN_ON_ONCE(). If KVM has
a bug that results in a false positive from kvm_cpu_has_interrupt(),
spamming dmesg won't help the situation.
Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as
they are always "wanted" by L0.
Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Register the "disable virtualization in an emergency" callback just
before KVM enables virtualization in hardware, as there is no functional
need to keep the callbacks registered while KVM happens to be loaded, but
is inactive, i.e. if KVM hasn't enabled virtualization.
Note, unregistering the callback every time the last VM is destroyed could
have measurable latency due to the synchronize_rcu() needed to ensure all
references to the callback are dropped before KVM is unloaded. But the
latency should be a small fraction of the total latency of disabling
virtualization across all CPUs, and userspace can set enable_virt_at_load
to completely eliminate the runtime overhead.
Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
There is no reason to force vendor code to do the registration, and either
way KVM would need a new kvm_x86_ops hook.
Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename x86's the per-CPU vendor hooks used to enable virtualization in
hardware to align with the recently renamed arch hooks.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exit to userspace if a fastpath handler triggers such an exit, which can
happen when skipping the instruction, e.g. due to userspace
single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an
emulation failure.
Fixes: 404d5d7bff ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's
IPI virtualization support, but only for AMD. While not stated anywhere
in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs
store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When
IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled,
KVM needs to match CPU behavior as some ICR ICR writes will be handled by
the CPU, not by KVM.
Add a kvm_x86_ops knob to control the underlying format used by the CPU to
store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether
or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage
format for x2APIC mode doesn't matter, and having the behavior follow AMD
versus Intel will provide better test coverage and ease debugging.
Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename all APIs related to feature MSRs from get_msr_feature() to
get_feature_msr(). The APIs get "feature MSRs", not "MSR features".
And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't
describe the helper itself.
No functional change intended.
Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refactor get_msr_feature() to take the index and data pointer as distinct
parameters in anticipation of eliminating "struct kvm_msr_entry" usage
further up the primary callchain.
No functional change intended.
Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter. Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.
More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM. On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.
Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter. And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM
is blindly processing MSRs; the MSR filters are the _only_ way for
userspace to deny access.
This partially reverts commit ac8d6cad3c ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
VMs. Make sure KVM behave as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs.
The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
- when enabled: Invalidate/zap all SPTEs ("zap-all"),
- when disabled: Precisely zap only the leaf SPTEs within the range of the
moving/deleting memory slot ("zap-slot-leafs-only").
"zap-all" is today's KVM behavior to work around a bug [1] where the
changing the zapping behavior of memslot move/deletion would cause VM
instability for VMs with an Nvidia GPU assigned; while
"zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
memory slot range, improving performance in certain scenarios [2], and
meeting the functional requirements for TDX.
Previous attempts to select "zap-slot-leafs-only" include a per-VM
capability approach [3] (which was not preferred because the root cause of
the bug remained unidentified) and a per-memslot flag approach [4]. Sean
and Paolo finally recommended the implementation of this quirk and
explained that it's the least bad option [5].
By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
"zap-all". Users have the option to disable the quirk to select
"zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
unaffected by this bug.
For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
always selected without user's opt-in, regardless of if the user opts for
"zap-all".
This is because it is assumed until proven otherwise that non-
KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
importantly, it's because TDX must have "zap-slot-leafs-only" always
selected. In TDX's case a memslot's GPA range can be a mixture of "private"
or "shared" memory. Shared is roughly analogous to how EPT is handled for
normal VMs, but private GPAs need lots of special treatment:
1) "zap-all" would require to zap private root page or non-leaf entries or
at least leaf-entries beyond the deleting memslot scope. However, TDX
demands that the root page of the private page table remains unchanged,
with leaf entries being zapped before non-leaf entries, and any dropped
private guest pages must be re-accepted by the guest.
2) if "zap-all" zaps only shared page tables, it would result in private
pages still being mapped when the memslot is gone. This may affect even
other processes if later the gmem fd was whole punched, causing the
pages being freed on the host while still mapped in the TD, because
there's no pgoff to the gfn information to zap the private page table
after memslot is gone.
So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
complicating quirk disabling interface (current quirk disabling interface
is limited, no way to query quirks, or force them to be disabled).
Add a new function kvm_mmu_zap_memslot_leafs() to implement
"zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as
1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
it be moved. It is only deleted by KVM when APICv is permanently
inhibited.
2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on
costly IPIs.
Suggested-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation. Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes. Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC. And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.
Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0. But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.
Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.
Fixes: 26c44aa9e0 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12 ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:
thread A, sev_gmem_post_populate: called
thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
RMP #PF
Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.
Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:
"KVM maps memory as if the vCPU generated a stage-2 read page fault"
which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to kvm_x86_call(), kvm_pmu_call() is added to streamline the usage
of static calls of kvm_pmu_ops, which improves code readability.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-4-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduces kvm_x86_call(), to streamline the usage of static calls of
kvm_x86_ops. The current implementation of these calls is verbose and
could lead to alignment challenges. This makes the code susceptible to
exceeding the "80 columns per single line of code" limit as defined in
the coding-style document. Another issue with the existing implementation
is that the addition of kvm_x86_ prefix to hooks at the static_call sites
hinders code readability and navigation. kvm_x86_call() is added to
improve code readability and maintainability, while adhering to the coding
style guidelines.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Remove an unnecessary EPT TLB flush when enabling hardware.
- Fix a series of bugs that cause KVM to fail to detect nested pending posted
interrupts as valid wake eents for a vCPU executing HLT in L2 (with
HLT-exiting disable by L1).
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvX0ACgkQOlYIJqCj
N/2Aiw/9Htwy4MfJ2zdTX0ypZx6CUAVY0B7R2q9LVaqlBBL02dLoNWn9ndf7J2pd
TJKtp39sHzf342ghti/Za5+mZgRgXA9IjQ5cvcQQjfmjDdDODygEc12otISeSNqq
uL2jbUZzzjbcQyUrXkeFptVcNFpaiOG0dFfvnoi1csWzXVf7t+CD+8/3kjVm2Qt7
vQXkV4yN7tNiYOvaukfXP7Og9ALpF8g8ok3YmXVXDPMu7+R7G+P6j3mVWr9ABMPj
LOmC+5Z/sscMFw1Io3XHuWoF5socQARXEzJNLCblDaw3GMlSj4LNxif2M/6B7bmR
nQVtiegj9K1Fc3OGOqPJcAIRPI4O9nMmf7uOwvXmOlwDSk7rCxF/yPk7Cto2+UXm
6mnLcH1l0/VaidW+a7rUAcDGIlWwgfw0F6tp2j6FdVl2Lx/IThcrkn0teLY1gAW8
CMi/BfTBEXO5583O3+ZCAzVQzeKnWR3yqwJe0oSftB1/rPkPD8PQ39MH8LuJJJxi
CN1W4R1/taQdOxMZqggDvS1biz7gwpjNGtnWsO9szAgMEXVjf2M1HOZVcT2e2997
81xDMdZaJSfd26tm7PhWtQnVPqyMZ6vqqIiq7FlIbEEkAE75Kbg4fUn/4y4WRnh9
3Gog6MZPu/MA5TbwvcZ/sy/CRfFu0HKm5q98oArhjSyU8C7oGeQ=
=W1/6
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.11
- Remove an unnecessary EPT TLB flush when enabling hardware.
- Fix a series of bugs that cause KVM to fail to detect nested pending posted
interrupts as valid wake eents for a vCPU executing HLT in L2 (with
HLT-exiting disable by L1).
- Misc cleanups
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
'0' and writes from userspace are ignored.
- Update to the newfangled Intel CPU FMS infrastructure.
- Use macros instead of open-coded literals to clean up KVM's manipulation of
FIXED_CTR_CTRL MSRs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRu3oACgkQOlYIJqCj
N/0f/Q/+PoRFKrr9ENwlVjxmq7DBJOzrEiht5EH89bQdpYL0pcmv6I+n+Z77o08X
l49YFO2zVq26dMCe8EFDuQrZpqKjOS/qEc+/zTsLu4lx8NJD1gqYJLJryejgtdQI
+GefPVIN11TvlDDjuuxSWgKUCAevk8s3PRe+zbUwlsHmw+GVky8dJoe71QbW27rK
hL7Y2pOe5Y8MgRAadxlhm6QmgOnz3RKKYs9t/HMzi2gQP1TuvPxnYtMC3Gz5pVe+
w3Ak7M4fh8Z7FbQsoNY5h3IdigG6eFrssqHX4QpCXr/G5L9vAgUmSR93/M8jLjNv
wAkUulLx7vFeTlOXjqcEJSn0U6mX/48pt68vrPB5ES1Rx28RB5s9tzYXCGtCmSxv
nHmMDc3YUbg6tp2hvliMqjsN0j5l2GQiX7LJwH2Ma9qQFlTHPmFwJGS4hciki6c5
obCK2vXBoS1jyxrZx8qUhIcJl2oigv3hwihN2YqZ0Q4QDwllv8cw4BeABQByYR9x
T91PQ0biiJ9vWCkALbyzYOpy+grdHCblwYW9+FM/qZBGH0ouPzDyZWPrRBLX12pH
fEgDMB3vT9JqQ5tyafd0MHuAVlrDVHYEY+lmXplzFGKEFonBkN7HmDzAOKafuCuj
GnIe0Sa1JnHVPNomx2dnG6Sku6/tPIfERuHEXrR9zkUJsacnfRY=
=pJxP
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86/pmu changes for 6.11
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
'0' and writes from userspace are ignored.
- Update to the newfangled Intel CPU FMS infrastructure.
- Use macros instead of open-coded literals to clean up KVM's manipulation of
FIXED_CTR_CTRL MSRs.
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj
N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9
Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9
No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi
V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K
Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf
KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn
qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi
yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR
KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h
5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn
aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM=
=yb10
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MTRR virtualization removal
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj
N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc
l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t
XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn
c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9
Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7
OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm
T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH
+YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc
lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk
9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA
xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY=
=Cf2R
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
- Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
- Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
- A few minor cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj
N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha
vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe
o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV
PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT
QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7
GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+
jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg
2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe
b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8
GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO
d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o=
=BalU
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
- Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
- Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
- A few minor cleanups
Refine the macros which define maximum General Purpose (GP) and fixed
counter numbers.
Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the
maximum supported General Purpose (GP) counter number ambiguously across
Intel and AMD platforms. This would cause issues if AMD begins to support
more GP counters than Intel.
Thus a bunch of new macros including vendor specific and vendor
independent are introduced to replace the old macros. The vendor
independent macros are used in x86 common code to hide vendor difference
and eliminate the ambiguity.
No logic changes are introduced in this patch.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is
pending delivery, in vmx_has_nested_events() and drop the one-off
kvm_x86_ops.guest_apic_has_interrupt() hook.
In addition to dropping a superfluous hook, this fixes a bug where KVM
would incorrectly treat virtual interrupts _for L2_ as always enabled due
to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(),
treating IRQs as enabled if L2 is active and vmcs12 is configured to exit
on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake
event based on L1's IRQ blocking status.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.
Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.
At present time, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying
off the recently added kvm_vcpu.scheduled_out as appropriate.
Note, there is a very slight functional change, as PLE shrink updates will
now happen after blasting WBINVD, but that is quite uninteresting as the
two operations do not interact in any way.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The check_apicv_inhibit_reasons() callback implementation was dropped in
the commit b3f257a846 ("KVM: x86: Track required APICv inhibits with
variable, not callback"), but the definition removal was missed in the
final version patch (it was removed in the v4). Therefore, it should be
dropped, and the vmx_check_apicv_inhibit_reasons() function declaration
should also be removed.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
adds no value, negatively impacts guest performance, and is a maintenance
burden due to it's complexity and oddities.
KVM's approach to virtualizating MTRRs make no sense, at all. KVM *only*
honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
that may perform non-coherent DMA access. From a hardware virtualization
perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy
shadowing paging doesn't magically account for guest MTRRs, nor does NPT.
Unwinding and deciphering KVM's murky history, the MTRR virtualization
code appears to be the result of misdiagnosed issues when EPT + VT-d with
passthrough devices was enabled years and years ago. And importantly, the
underlying bugs that were fudged around by honoring guest MTRR memtypes
have since been fixed (though rather poorly in some cases).
The zapping GFNs logic in the MTRR virtualization code came from:
commit efdfe536d8
Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Date: Wed May 13 14:42:27 2015 +0800
KVM: MMU: fix MTRR update
Currently, whenever guest MTRR registers are changed
kvm_mmu_reset_context is called to switch to the new root shadow page
table, however, it's useless since:
1) the cache type is not cached into shadow page's attribute so that
the original root shadow page will be reused
2) the cache type is set on the last spte, that means we should sync
the last sptes when MTRR is changed
This patch fixs this issue by drop all the spte in the gfn range which
is being updated by MTRR
which was a fix for:
commit 0bed3b568b
Author: Sheng Yang <sheng@linux.intel.com>
AuthorDate: Thu Oct 9 16:01:54 2008 +0800
Commit: Avi Kivity <avi@redhat.com>
CommitDate: Wed Dec 31 16:51:44 2008 +0200
KVM: Improve MTRR structure
As well as reset mmu context when set MTRR.
which was part of a "MTRR/PAT support for EPT" series that also added:
+ if (mt_mask) {
+ mt_mask = get_memory_type(vcpu, gfn) <<
+ kvm_x86_ops->get_mt_mask_shift();
+ spte |= mt_mask;
+ }
where get_memory_type() was a truly gnarly helper to retrieve the guest
MTRR memtype for a given memtype. And *very* subtly, at the time of that
change, KVM *always* set VMX_EPT_IGMT_BIT,
kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
VMX_EPT_WRITABLE_MASK |
VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
VMX_EPT_IGMT_BIT);
which came in via:
commit 928d4bf747
Author: Sheng Yang <sheng@linux.intel.com>
AuthorDate: Thu Nov 6 14:55:45 2008 +0800
Commit: Avi Kivity <avi@redhat.com>
CommitDate: Tue Nov 11 21:00:37 2008 +0200
KVM: VMX: Set IGMT bit in EPT entry
There is a potential issue that, when guest using pagetable without vmexit when
EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
which would be inconsistent with host side and would cause host MCE due to
inconsistent cache attribute.
The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
memory type to protect host (notice that all memory mapped by KVM should be WB).
Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added
the whole "ignoreIGMT things as a bug fix for issues that were detected
during EPT + VT-d + passthrough enabling, but it was applied earlier
because it was a generic fix.
Jumping back to 0bed3b568b ("KVM: Improve MTRR structure"), the other
relevant code, or rather lack thereof, is the handling of *host* MMIO.
That fix came in a bit later, but given the author and timing, it's safe
to say it was all part of the same EPT+VT-d enabling mess.
commit 2aaf69dcee
Author: Sheng Yang <sheng@linux.intel.com>
AuthorDate: Wed Jan 21 16:52:16 2009 +0800
Commit: Avi Kivity <avi@redhat.com>
CommitDate: Sun Feb 15 02:47:37 2009 +0200
KVM: MMU: Map device MMIO as UC in EPT
Software are not allow to access device MMIO using cacheable memory type, the
patch limit MMIO region with UC and WC(guest can select WC using PAT and
PCD/PWT).
In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization
was obviously never tested on NPT until much later, which lends further
credence to the theory/argument that this was all the result of
misdiagnosed issues.
Discussion from the EPT+MTRR enabling thread[*] more or less confirms that
Sheng Yang was trying to resolve issues with passthrough MMIO.
* Sheng Yang
: Do you mean host(qemu) would access this memory and if we set it to guest
: MTRR, host access would be broken? We would cover this in our shadow MTRR
: patch, for we encountered this in video ram when doing some experiment with
: VGA assignment.
And in the same thread, there's also what appears to be confirmation of
Intel running into issues with Windows XP related to a guest device driver
mapping DMA with WC in the PAT.
* Avi Kavity
: Sheng Yang wrote:
: > Yes... But it's easy to do with assigned devices' mmio, but what if guest
: > specific some non-mmio memory's memory type? E.g. we have met one issue in
: > Xen, that a assigned-device's XP driver specific one memory region as buffer,
: > and modify the memory type then do DMA.
: >
: > Only map MMIO space can be first step, but I guess we can modify assigned
: > memory region memory type follow guest's?
: >
:
: With ept/npt, we can't, since the memory type is in the guest's
: pagetable entries, and these are not accessible.
[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com
So, for the most part, what likely happened is that 15 years ago, a few
engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially
"fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for
the below case, everything since then has been a result of those two
intertwined changes.
The one exception, which is actually yet more confirmation of all of the
above, is the revert of Paolo's attempt at "full" virtualization of guest
MTRRs:
commit 606decd670
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Oct 1 13:12:47 2015 +0200
Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
This reverts commit fd717f1101.
It was reported to cause Machine Check Exceptions (bug 104091).
...
commit fd717f1101
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Tue Jul 7 14:38:13 2015 +0200
KVM: x86: apply guest MTRR virtualization on host reserved pages
Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
However, the guest could prefer a different page type than UC for
such pages. A good example is that pass-throughed VGA frame buffer is
not always UC as host expected.
This patch enables full use of virtual guest MTRRs.
I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC
in EPT" and got the same result: machine checks, likely due to the guest
MTRRs not being trustworthy/sane at all times.
Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that
too got reverted. Unfortunately, it doesn't appear that anyone ever found
a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused
extremely slow boot times doesn't appear to have a definitive root cause.
commit fc07e76ac7
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Oct 1 13:20:22 2015 +0200
Revert "KVM: SVM: use NPT page attributes"
This reverts commit 3c2e7f7de3.
Initializing the mapping from MTRR to PAT values was reported to
fail nondeterministically, and it also caused extremely slow boot
(due to caching getting disabled---bug 103321) with assigned devices.
...
commit 3c2e7f7de3
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Tue Jul 7 14:32:17 2015 +0200
KVM: SVM: use NPT page attributes
Right now, NPT page attributes are not used, and the final page
attribute depends solely on gPAT (which however is not synced
correctly), the guest MTRRs and the guest page attributes.
However, we can do better by mimicking what is done for VMX.
In the absence of PCI passthrough, the guest PAT can be ignored
and the page attributes can be just WB. If passthrough is being
used, instead, keep respecting the guest PAT, and emulate the guest
MTRRs through the PAT field of the nested page tables.
The only snag is that WP memory cannot be emulated correctly,
because Linux's default PAT setting only includes the other types.
In short, honoring guest MTRRs for VMX was initially a workaround of
sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host
MMIO. And while there *are* known cases where honoring guest MTRRs is
desirable, e.g. passthrough VGA frame buffers, the desired behavior in
that case is to get WC instead of UC, i.e. at this point it's for
performance, not correctness.
Furthermore, the complete absence of MTRR virtualization on NPT and
shadow paging proves that, while KVM theoretically can do better, it's
by no means necessary for correctnesss.
Lastly, since kernels mostly rely on firmware to do MTRR setup, and the
host typically provides guest firmware, honoring guest MTRRs is effectively
honoring *host* userspace memtypes, which is also backwards. I.e. it
would be far better for host userspace to communicate its desired memtype
directly to KVM (or perhaps indirectly via VMAs in the host kernel), not
through guest MTRRs.
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by
renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to
APICV_INHIBIT_REASON_DISABLED.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the tracing infrastructure helper __print_flags() for printing flag
bitfields, to enhance the trace output by displaying a string describing
each of the inhibit reasons set.
The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap
value, requiring the user to consult the source file where the inhibit
reasons are defined to decode the trace output.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce the VM variable "nanoseconds per APIC bus cycle" in
preparation to make the APIC bus frequency configurable.
The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.
In addition, per Intel SDM:
"The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is
enumerated in CPUID leaf 0x15) divided by the value specified in
the divide configuration register."
The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.
Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare
for allowing userspace to tell KVM to use the frequency that TDX mandates
instead of the default 1Ghz. Doing so ensures that the guest doesn't have
a conflicting view of the APIC bus frequency.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[reinette: rework changelog]
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Several '_mask' suffixed variables such as, global_ctrl_mask, are
defined in kvm_pmu structure. However the _mask suffix is ambiguous and
misleading since it's not a real mask with positive logic. On the contrary
it represents the reserved bits of corresponding MSRs and these bits
should not be accessed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240430005239.13527-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth:
* add some basic infrastructure and introduces a new KVM_X86_SNP_VM
vm_type to handle differences versus the existing KVM_X86_SEV_VM and
KVM_X86_SEV_ES_VM types.
* implement the KVM API to handle the creation of a cryptographic
launch context, encrypt/measure the initial image into guest memory,
and finalize it before launching it.
* implement handling for various guest-generated events such as page
state changes, onlining of additional vCPUs, etc.
* implement the gmem/mmu hooks needed to prepare gmem-allocated pages
before mapping them into guest private memory ranges as well as
cleaning them up prior to returning them to the host for use as
normal memory. Because those cleanup hooks supplant certain
activities like issuing WBINVDs during KVM MMU invalidations, avoid
duplicating that work to avoid unecessary overhead.
This merge leaves out support support for attestation guest requests
and for loading the signing keys to be used for attestation requests.
Add "struct kvm_host_values kvm_host" to hold the various host values
that KVM snapshots during initialization. Bundling the host values into
a single struct simplifies adding new MSRs and other features with host
state/values that KVM cares about, and provides a one-stop shop. E.g.
adding a new value requires one line, whereas tracking each value
individual often requires three: declaration, definition, and export.
No functional change intended.
Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT
Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE
didn't have SUPPRESS_VE set.
Opportunistically assert that the underlying exit reason was indeed an EPT
Violation, as the CPU has *really* gone off the rails if a #VE occurs due
to a completely unexpected exit reason.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240518000430.1118488-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to alter the register state of the APs on their own. This allows
the guest a way of simulating INIT-SIPI.
A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
so as to avoid updating the VMSA pointer while the vCPU is running.
For CREATE
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID. The GPA is saved in the svm struct of the
target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
to the vCPU and then the vCPU is kicked.
For CREATE_ON_INIT:
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID the next time an INIT is performed. The GPA is
saved in the svm struct of the target vCPU.
For DESTROY:
The guest indicates it wishes to stop the vCPU. The GPA is cleared
from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
added to vCPU and then the vCPU is kicked.
The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
as a result of the event or as a result of an INIT. If a new VMSA is to
be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB
and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not
to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state
is set to KVM_MP_STATE_HALTED to prevent it from being run.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-13-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters RMP check failure caused by the guest memory access
it raises the #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.
When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-12-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
L1, as per the SDM.
- Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
"real" exit_qualification, which is tracked elsewhere.
- Add a sanity check to assert that EPT Violations are the only sources of
nested PML Full VM-Exits.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+qzEACgkQOlYIJqCj
N/3O0Q/9HZruiL9vzMrLBKgFgWCxQHO2fy+EixuwzVBHunQGOsVnDCO2p+PWnF0p
kuW/MEZhZfLYnXoDi5/AP12G9qtDhlSNnfSl2gn+BMXqyGSYpcoXuM/zTjM24wLd
PXKkPirYMpVR2+lHsD7l8YK2I+qc7UfbRkCyJegBgGwUBs13/TBD6Rum3Aa9Q+dX
IcwjomH+MdHDFPnpfHjksA+G79Ckkqmu/DbOAlCqw1dUSC8oyV9tE/EKStSBzjZ+
OGMSm7Kl0T+km1JyH60H1ivbUbT3gJxpezoYL9EbO25VPrdldKP+ohqbtew/8ttk
UP/oW3mL79I7L06ZqqxZKDDj4JGvz53UhhAylZcBPw0P3v9TQF3wm59K4eM9btNt
eyIaT0SAbcigHAniM+3FPkq443hRxDvLNF5E66Ez03HhhkEz3ZsyNH1oPnQK0Crq
N1e+NGuKsTAPBzc3sSSrxOHnCajTUQ9WYjOpfdSgWsL6TQOmXIvHl0tE2ILrvDc/
f+VG62veqa9CCmX5B2lUT0yX9nXvyXKwVpJY9RSQIhB46sA8zjSZsZRCQFkDI5Gx
pzjxjcXtydAMWpn5qUvpD0B6agMlP6WUJHlu+ezmBQuSUHr+2PHY5dEj9442SusF
98VGJy8APxDhidK5TaJJXWmDfKNhEaWboMcTnWM1TwY/qLfDsVU=
=0ncM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.10:
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
L1, as per the SDM.
- Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is
used only when synthesizing nested EPT violation, i.e. it's not the vCPU's
"real" exit_qualification, which is tracked elsewhere.
- Add a sanity check to assert that EPT Violations are the only sources of
nested PML Full VM-Exits.
A combination of prep work for TDX and SNP, and a clean up of the
page fault path to (hopefully) make it easier to follow the rules for
private memory, noslot faults, writes to read-only slots, etc.
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:
- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest later converts that shared page back to private
At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.
Add a hook to determine the max NPT mapping size in situations like
this.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240501085210.2213060-3-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231230172351.574091-6-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move these pages into a 'private' state by updating the
corresponding entries the RMP table.
Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.
The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).
In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned'. Add an argument
(always true for now) to kvm_gmem_get_folio() that allows for the
preparation callback to be bypassed. To detect possible issues in
the way userspace initializes memory, it is only possible to add an
unprepared page if it is not already included in the filemap.
Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231230172351.574091-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now the error code is not used when an async page fault is completed.
This is not a problem in the current code, but it is untidy. For protected
VMs, we will also need to check that the page attributes match the current
state of the page, because asynchronous page faults can only occur on
shared pages (private pages go through kvm_faultin_pfn_private() instead of
__gfn_to_pfn_memslot()).
Start by piping the error code from kvm_arch_setup_async_pf() to
kvm_arch_async_page_ready() via the architecture-specific async page
fault data. For now, it can be used to assert that there are no
async page faults on private memory.
Extracted from a patch by Isaku Yamahata.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add and use a synthetic, KVM-defined page fault error code to indicate
whether a fault is to private vs. shared memory. TDX and SNP have
different mechanisms for reporting private vs. shared, and KVM's
software-protected VMs have no mechanism at all. Usurp an error code
flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
and friends.
Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
for TDX and software-protected VMs as appropriate, but that would require
*clearing* the flag for SEV and SEV-ES VMs, which support encrypted
memory at the hardware layer, but don't utilize private memory at the
KVM layer.
Opportunistically add a comment to call out that the logic for software-
protected VMs is (and was before this commit) broken for nested MMUs, i.e.
for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with
nested MMUs as there is a _lot_ of functionality that simply doesn't work
for software-protected VMs, e.g. all of the paths where KVM accesses guest
memory need to be updated to be aware of private vs. shared memory.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20240228024147.41573-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the sanity check that hardware never sets bits that collide with KVM-
define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
i.e. make the sanity check #NPF specific. The legacy #PF path already
WARNs if _any_ of bits 63:32 are set, and the error code that comes from
VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
EXIT_QUALIFICATION into error code flags).
Add a compile-time assert in the legacy #PF handler to make sure that KVM-
define flags are covered by its existing sanity check on the upper bits.
Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
are removing the comment that defined it.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240228024147.41573-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
guests, as specified by the APM:
* Bit 31 (RMP): Set to 1 if the fault was caused due to an RMP check or a
VMPL check failure, 0 otherwise.
* Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
* Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
* Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission
check failure, 0 otherwise.
Note, the APM is *extremely* misleading, and strongly implies that the
above flags can _only_ be set for #NPF exits from SNP guests. That is a
lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
flavor of SEV guest on SNP capable hardware.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240228024147.41573-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Open code the bit number directly in the PFERR_* masks and drop the
intermediate PFERR_*_BIT defines, as having to bounce through two macros
just to see which flag corresponds to which bit is quite annoying, as is
having to define two macros just to add recognition of a new flag.
Use ternary operator to derive the bit in permission_fault(), the one
function that actually needs the bit number as part of clever shifting
to avoid conditional branches. Generally the compiler is able to turn
it into a conditional move, and if not it's not really a big deal.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240228024147.41573-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX will use a different shadow PTE entry value for MMIO from VMX. Add a
member to kvm_arch and track value for MMIO per-VM instead of a global
variable. By using the per-VM EPT entry value for MMIO, the existing VMX
logic is kept working. Introduce a separate setter function so that guest
TD can use a different value later.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.
We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected. For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.
For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that when using VM types that will become an error.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function. You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.
Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor. One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.
KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior. And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel. And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.
Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior. The Intel code will be
cleaned up in the future.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the exit_qualification field that is used to track information about
in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception",
i.e. associate the information with the actual nEPT violation instead of
the vCPU. To handle bits that are pulled from vmcs.EXIT_QUALIFICATION,
i.e. that are propagated from the "original" EPT violation VM-Exit, simply
grab them from the VMCS on-demand when injecting a nEPT Violation or a PML
Full VM-exit.
Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch
is outright dangerous, e.g. see commit d7f0a00e43 ("KVM: VMX: Report
up-to-date exit qualification to userspace").
Opportunstically add a comment to call out that PML Full and EPT Violation
VM-Exits use the same bit to report NMI blocking information.
Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
* Changes to FPU handling came in via the main s390 pull request
* Only deliver to the guest the SCLP events that userspace has
requested.
* More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same).
* Fix selftests undefined behavior.
x86:
* Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can be
programmed *using the architectural encoding*. The enumeration does
NOT say anything about the encoding when the CPU doesn't report support
the event *in general*. It might support it, and it might support it
using the same encoding that made it into the architectural PMU spec.
* Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly emulates
RDMPC, counter availability, and a variety of other PMC-related
behaviors that depend on guest CPUID and therefore are easier to
validate with selftests than with custom guests (aka kvm-unit-tests).
* Zero out PMU state on AMD if the virtual PMU is disabled, it does not
cause any bug but it wastes time in various cases where KVM would check
if a PMC event needs to be synthesized.
* Optimize triggering of emulated events, with a nice ~10% performance
improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
guest.
* Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
* Fix a bug where KVM would report stale/bogus exit qualification information
when exiting to userspace with an internal error exit code.
* Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
* Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
* Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM
doesn't support yielding in the middle of processing a zap, and 1GiB
granularity resulted in multi-millisecond lags that are quite impolite
for CONFIG_PREEMPT kernels.
* Allocate write-tracking metadata on-demand to avoid the memory overhead when
a kernel is built with i915 virtualization support but the workloads use
neither shadow paging nor i915 virtualization.
* Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives.
* Fix the debugregs ABI for 32-bit KVM.
* Rework the "force immediate exit" code so that vendor code ultimately decides
how and when to force the exit, which allowed some optimization for both
Intel and AMD.
* Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, causing extra unnecessary work.
* Cleanup the logic for checking if the currently loaded vCPU is in-kernel.
* Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere in the
kernel) are detected earlier and are less likely to hang the kernel.
x86 Xen emulation:
* Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the
gpa but the underlying host virtual address remains the same.
* When possible, use a single host TSC value when computing the deadline for
Xen timers in order to improve the accuracy of the timer emulation.
* Inject pending upcall events when the vCPU software-enables its APIC to fix
a bug where an upcall can be lost (and to follow Xen's behavior).
* Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
events fails, e.g. if the guest has aliased xAPIC IDs.
RISC-V:
* Support exception and interrupt handling in selftests
* New self test for RISC-V architectural timer (Sstc extension)
* New extension support (Ztso, Zacas)
* Support userspace emulation of random number seed CSRs.
ARM:
* Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
* Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
* Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
* Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
* Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
* Misc cleanups and fixes as usual.
Generic:
* cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always
true on all architectures except MIPS (where Kconfig determines the
available depending on CPU capabilities). It is replaced either by
an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM)
everywhere else.
* Factor common "select" statements in common code instead of requiring
each architecture to specify it
* Remove thoroughly obsolete APIs from the uapi headers.
* Move architecture-dependent stuff to uapi/asm/kvm.h
* Always flush the async page fault workqueue when a work item is being
removed, especially during vCPU destruction, to ensure that there are no
workers running in KVM code when all references to KVM-the-module are gone,
i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
* Grab a reference to the VM's mm_struct in the async #PF worker itself instead
of gifting the worker a reference, so that there's no need to remember
to *conditionally* clean up after the worker.
Selftests:
* Reduce boilerplate especially when utilize selftest TAP infrastructure.
* Add basic smoke tests for SEV and SEV-ES, along with a pile of library
support for handling private/encrypted/protected memory.
* Fix benign bugs where tests neglect to close() guest_memfd files.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z
eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP
6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd
j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK
Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y
5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA==
=mqOV
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"S390:
- Changes to FPU handling came in via the main s390 pull request
- Only deliver to the guest the SCLP events that userspace has
requested
- More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same)
- Fix selftests undefined behavior
x86:
- Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can
be programmed *using the architectural encoding*. The enumeration
does NOT say anything about the encoding when the CPU doesn't
report support the event *in general*. It might support it, and it
might support it using the same encoding that made it into the
architectural PMU spec
- Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly
emulates RDMPC, counter availability, and a variety of other
PMC-related behaviors that depend on guest CPUID and therefore are
easier to validate with selftests than with custom guests (aka
kvm-unit-tests)
- Zero out PMU state on AMD if the virtual PMU is disabled, it does
not cause any bug but it wastes time in various cases where KVM
would check if a PMC event needs to be synthesized
- Optimize triggering of emulated events, with a nice ~10%
performance improvement in VM-Exit microbenchmarks when a vPMU is
exposed to the guest
- Tighten the check for "PMI in guest" to reduce false positives if
an NMI arrives in the host while KVM is handling an IRQ VM-Exit
- Fix a bug where KVM would report stale/bogus exit qualification
information when exiting to userspace with an internal error exit
code
- Add a VMX flag in /proc/cpuinfo to report 5-level EPT support
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock
held for read, e.g. to avoid serializing vCPUs when userspace
deletes a memslot
- Tear down TDP MMU page tables at 4KiB granularity (used to be
1GiB). KVM doesn't support yielding in the middle of processing a
zap, and 1GiB granularity resulted in multi-millisecond lags that
are quite impolite for CONFIG_PREEMPT kernels
- Allocate write-tracking metadata on-demand to avoid the memory
overhead when a kernel is built with i915 virtualization support
but the workloads use neither shadow paging nor i915 virtualization
- Explicitly initialize a variety of on-stack variables in the
emulator that triggered KMSAN false positives
- Fix the debugregs ABI for 32-bit KVM
- Rework the "force immediate exit" code so that vendor code
ultimately decides how and when to force the exit, which allowed
some optimization for both Intel and AMD
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left
elevated if vCPU creation ultimately failed, causing extra
unnecessary work
- Cleanup the logic for checking if the currently loaded vCPU is
in-kernel
- Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere
in the kernel) are detected earlier and are less likely to hang the
kernel
x86 Xen emulation:
- Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the gpa
but the underlying host virtual address remains the same
- When possible, use a single host TSC value when computing the
deadline for Xen timers in order to improve the accuracy of the
timer emulation
- Inject pending upcall events when the vCPU software-enables its
APIC to fix a bug where an upcall can be lost (and to follow Xen's
behavior)
- Fall back to the slow path instead of warning if "fast" IRQ
delivery of Xen events fails, e.g. if the guest has aliased xAPIC
IDs
RISC-V:
- Support exception and interrupt handling in selftests
- New self test for RISC-V architectural timer (Sstc extension)
- New extension support (Ztso, Zacas)
- Support userspace emulation of random number seed CSRs
ARM:
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized
to address serialization some of the serialization on the LPI
injection path
- Support for _architectural_ VHE-only systems, advertised through
the absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
- Set reserved bits as zero in CPUCFG
- Start SW timer only when vcpu is blocking
- Do not restart SW timer when it is expired
- Remove unnecessary CSR register saving during enter guest
- Misc cleanups and fixes as usual
Generic:
- Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
always true on all architectures except MIPS (where Kconfig
determines the available depending on CPU capabilities). It is
replaced either by an architecture-dependent symbol for MIPS, and
IS_ENABLED(CONFIG_KVM) everywhere else
- Factor common "select" statements in common code instead of
requiring each architecture to specify it
- Remove thoroughly obsolete APIs from the uapi headers
- Move architecture-dependent stuff to uapi/asm/kvm.h
- Always flush the async page fault workqueue when a work item is
being removed, especially during vCPU destruction, to ensure that
there are no workers running in KVM code when all references to
KVM-the-module are gone, i.e. to prevent a very unlikely
use-after-free if kvm.ko is unloaded
- Grab a reference to the VM's mm_struct in the async #PF worker
itself instead of gifting the worker a reference, so that there's
no need to remember to *conditionally* clean up after the worker
Selftests:
- Reduce boilerplate especially when utilize selftest TAP
infrastructure
- Add basic smoke tests for SEV and SEV-ES, along with a pile of
library support for handling private/encrypted/protected memory
- Fix benign bugs where tests neglect to close() guest_memfd files"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
selftests: kvm: remove meaningless assignments in Makefiles
KVM: riscv: selftests: Add Zacas extension to get-reg-list test
RISC-V: KVM: Allow Zacas extension for Guest/VM
KVM: riscv: selftests: Add Ztso extension to get-reg-list test
RISC-V: KVM: Allow Ztso extension for Guest/VM
RISC-V: KVM: Forward SEED CSR access to user space
KVM: riscv: selftests: Add sstc timer test
KVM: riscv: selftests: Change vcpu_has_ext to a common function
KVM: riscv: selftests: Add guest helper to get vcpu id
KVM: riscv: selftests: Add exception handling support
LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
LoongArch: KVM: Do not restart SW timer when it is expired
LoongArch: KVM: Start SW timer only when vcpu is blocking
LoongArch: KVM: Set reserved bits as zero in CPUCFG
KVM: selftests: Explicitly close guest_memfd files in some gmem tests
KVM: x86/xen: fix recursive deadlock in timer injection
KVM: pfncache: simplify locking and make more self-contained
KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
KVM: x86/xen: improve accuracy of Xen timers
...
kernel to be used as a KVM hypervisor capable of running SNP (Secure
Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal
of the AMD confidential computing side, providing the most
comprehensive confidential computing environment up to date.
This is the x86 part and there is a KVM part which did not get ready
in time for the merge window so latter will be forthcoming in the next
cycle.
- Rework the early code's position-dependent SEV variable references in
order to allow building the kernel with clang and -fPIE/-fPIC and
-mcmodel=kernel
- The usual set of fixes, cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXvH0wACgkQEsHwGGHe
VUrzmA//VS/n6dhHRnm/nAGngr4PeegkgV1OhyKYFfiZ272rT6P9QvblQrgcY0dc
Ij1DOhEKlke51pTHvMOQ33B3P4Fuc0mx3dpCLY0up5V26kzQiKCjRKEkC4U1bcw8
W4GqMejaR89bE14bYibmwpSib9T/uVsV65eM3xf1iF5UvsnoUaTziymDoy+nb43a
B1pdd5vcl4mBNqXeEvt0qjg+xkMLpWUI9tJDB8mbMl/cnIFGgMZzBaY8oktHSROK
QpuUnKegOgp1RXpfLbNjmZ2Q4Rkk4MNazzDzWq3EIxaRjXL3Qp507ePK7yeA2qa0
J3jCBQc9E2j7lfrIkUgNIzOWhMAXM2YH5bvH6UrIcMi1qsWJYDmkp2MF1nUedjdf
Wj16/pJbeEw1aKKIywJGwsmViSQju158vY3SzXG83U/A/Iz7zZRHFmC/ALoxZptY
Bi7VhfcOSpz98PE3axnG8CvvxRDWMfzBr2FY1VmQbg6VBNo1Xl1aP/IH1I8iQNKg
/laBYl/qP+1286TygF1lthYROb1lfEIJprgi2xfO6jVYUqPb7/zq2sm78qZRfm7l
25PN/oHnuidfVfI/H3hzcGubjOG9Zwra8WWYBB2EEmelf21rT0OLqq+eS4T6pxFb
GNVfc0AzG77UmqbrpkAMuPqL7LrGaSee4NdU3hkEdSphlx1/YTo=
=c1ps
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add the x86 part of the SEV-SNP host support.
This will allow the kernel to be used as a KVM hypervisor capable of
running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP
is the ultimate goal of the AMD confidential computing side,
providing the most comprehensive confidential computing environment
up to date.
This is the x86 part and there is a KVM part which did not get ready
in time for the merge window so latter will be forthcoming in the
next cycle.
- Rework the early code's position-dependent SEV variable references in
order to allow building the kernel with clang and -fPIE/-fPIC and
-mcmodel=kernel
- The usual set of fixes, cleanups and improvements all over the place
* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/sev: Disable KMSAN for memory encryption TUs
x86/sev: Dump SEV_STATUS
crypto: ccp - Have it depend on AMD_IOMMU
iommu/amd: Fix failure return from snp_lookup_rmpentry()
x86/sev: Fix position dependent variable references in startup code
crypto: ccp: Make snp_range_list static
x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
Documentation: virt: Fix up pre-formatted text block for SEV ioctls
crypto: ccp: Add the SNP_SET_CONFIG command
crypto: ccp: Add the SNP_COMMIT command
crypto: ccp: Add the SNP_PLATFORM_STATUS command
x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
crypto: ccp: Handle legacy SEV commands when SNP is enabled
crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
x86/sev: Introduce an SNP leaked pages list
crypto: ccp: Provide an API to issue SEV and SNP commands
...
- Fix several bugs where KVM speciously prevents the guest from utilizing
fixed counters and architectural event encodings based on whether or not
guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
and a variety of other PMC-related behaviors that depend on guest CPUID,
i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
cycles, e.g. when checking if a PMC event needs to be synthesized when
skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events
when skipping an instruction, which yields a ~10% performance improvement in
VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUFQACgkQOlYIJqCj
N/11dhAAnr9e6mPmXvaH4YKcvOGgTmwIQdi5W4IBzGm27ErEb0Vyskx3UATRhRm+
gZyp3wNgEA9LeifICDNu4ypn7HZcl2VtRql6FYcB8Bcu8OiHfU8PhWL0/qrpY20e
zffUj2tDweq2ft9Iks1SQJD0sxFkcXIcSKOffP7pRZJHFTKLltGORXwxzd9HJHPY
nc4nERKegK2yH4A4gY6nZ0oV5L3OMUNHx815db5Y+HxXOIjBCjTQiNNd6mUdyX1N
C5sIiElXLdvRTSDvirHfA32LqNwnajDGox4QKZkB3wszCxJ3kRd4OCkTEKMYKHxd
KoKCJQnAdJFFW9xqbT8nNKXZ+hg2+ZQuoSaBuwKryf7jWi0e6a7jcV0OH+cQSZw7
UNudKhs3r4ambfvnFp2IVZlZREMDB+LAjo2So48Jn/JGCAzqte3XqwVKskn9pS9S
qeauXCdOLioZALYtTBl8RM1rEY5mbwQrpPv9CzbeU09qQ/hpXV14W9GmbyeOZcI1
T1cYgEqlLuifRluwT/hxrY321+4noF116gSK1yb07x/sJU8/lhRooEk9V562066E
qo6nIvc7Bv9gTGLwo6VReKSPcTT/6t3HwgPsRjqe+evso3EFN9f9hG+uPxtO6TUj
pdPm3mkj2KfxDdJLf+Ys16gyGdiwI0ZImIkA0uLdM0zftNsrb4Y=
=vayI
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing
fixed counters and architectural event encodings based on whether or not
guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
and a variety of other PMC-related behaviors that depend on guest CPUID,
i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
cycles, e.g. when checking if a PMC event needs to be synthesized when
skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events
when skipping an instruction, which yields a ~10% performance improvement in
VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
- Clean up code related to unprotecting shadow pages when retrying a guest
instruction after failed #PF-induced emulation.
- Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
a reschedule is needed, e.g. if a high priority task needs to run. Because
KVM doesn't support yielding in the middle of processing a zapped non-leaf
SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
attempting to schedule in a high priority.
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
- Allocate write-tracking metadata on-demand to avoid the memory overhead when
running kernels built with KVMGT support (external write-tracking enabled),
but for workloads that don't use nested virtualization (shadow paging) or
KVMGT.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrTH4ACgkQOlYIJqCj
N/1q3xAAh3wpUDzRfkNkgGUbulhuJmQ72PiaW3NRoMo/3Rowegsdgt1N3/ec+fcJ
Awx0KUM8Cju8O2Zqp6NzKwUkddCni8dHmOa55NJQuK2M1OpnE0RjBB94n+AFJZki
mm8wKSKNgjlVeJDG87+RLPnbaeEvqYPp22oNKJyAPsimTbxvmhIqtg8qdyujGPXA
Jke7LXgtVGav+nEzXiLh86VU/agoBJc/zt+hiuLvamU5Y8so+zReqFbrDtvsgtpV
ryvMbDZxcPXKrsBP+B7syqUAbODcmh/wkzOCZ4Tby5yurEaw1rwpZIH0BRKRgGx2
F2JqWayYsCOsrJ4DwQre8RfLMtbEKB2BBWkZlYyblAy0++1LcTP9pSk5YC5lSL71
5Oszql9DKi10Vq5IfR/ehsr6mHXFr3AB7C7QefiXpytGbObQs8/f/OxinxaEajcs
ERBgh+rcQ5p3kfdiHzuQjn7y45J7z21CKVhka4iKJtTxypBK4ZvkDOVqHuHppb5O
aw6rC5HR1EKhSW4jz7QWrDExtDZ2X5HeYl8TgfHncSSJRc7urKYcSCHhXJsB6BPs
iQf0xbHaIOyH9jmoqLZjz0QZmXB9fydQ/zAlFVXZsrNHvomayVjqrpl8UFTMdhuI
zll9ynfRRHMUkIi1YubUlmFMgBeqOXGkfBFh8QUH3+YiI7Cwzh4=
=SgFo
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.9:
- Clean up code related to unprotecting shadow pages when retrying a guest
instruction after failed #PF-induced emulation.
- Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
a reschedule is needed, e.g. if a high priority task needs to run. Because
KVM doesn't support yielding in the middle of processing a zapped non-leaf
SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
attempting to schedule in a high priority.
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
- Allocate write-tracking metadata on-demand to avoid the memory overhead when
running kernels built with KVMGT support (external write-tracking enabled),
but for workloads that don't use nested virtualization (shadow paging) or
KVMGT.
- Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives (though in fairness in KMSAN, it's comically
difficult to see that the uninitialized memory is never truly consumed).
- Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
DR6 and DR7.
- Rework the "force immediate exit" code so that vendor code ultimately
decides how and when to force the exit. This allows VMX to further optimize
handling preemption timer exits, and allows SVM to avoid sending a duplicate
IPI (SVM also has a need to force an exit).
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, and add WARN to guard against similar bugs.
- Provide a dedicated arch hook for checking if a different vCPU was in-kernel
(for directed yield), and simplify the logic for checking if the currently
loaded vCPU is in-kernel.
- Misc cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrRjQACgkQOlYIJqCj
N/2Dzw//b+ptSBAl1kGBRmk/DqsX7J9ZkQYCQOTeh1vXiUM+XRTSQoArN0Oo1roy
3wcEnQ0beVw7jMuzZ8UUuTfU8WUMja/kwltnqXYNHwLnb6yH0I/BIengXWdUdAMc
FmgPZ4qJR2IzKYzvDsc3eEQ515O8UHWakyVDnmLBtiakAeBcUTYceHpEEPpzE5y5
ODASTQKM9o/h8R8JwKFTJ8/mrOLNcsu5SycwFdnmubLJCrNWtJWTijA6y1lh6shn
hbEJex+ESoC2v8p7IP53u1SGJubVlPajt+RkYJtlEI3WVsevp024eYcF4nb1OjXi
qS2Y3W7DQGWvyCBoSzoMY+9nRMgyOOpHYetdiz+9oZOmnjiYWY0ku59U7Gv+Aotj
AUbCn4Ry/OpqsuZ7Oo7i3IT8R7uzsTeNNdxhYBn1OQquBEZ0KBYXlZkGfTk9K0t0
Fhka/5Zu6fBlg5J+zCyaXUGmsGWBo/9HxsC5z1JuKo8fatro5qyqYE5KiM01dkqc
6FET6gL+fFprC5c67JGRPdEtk6F9Emb+6oiTTA8/8q8JQQAKiJKk95Nlq7KzPfVS
A5RQPTuTJ7acE/5CY4zB1DdxCjqgnonBEA2ULnA/J10Rk8orHJRnGJcEwKEyDrZh
HpsxIIqt++i8KffORpCym6zSAVYuQjn1mu7MGth+zuCqhcEpBfc=
=GX0O
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives (though in fairness in KMSAN, it's comically
difficult to see that the uninitialized memory is never truly consumed).
- Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
DR6 and DR7.
- Rework the "force immediate exit" code so that vendor code ultimately
decides how and when to force the exit. This allows VMX to further optimize
handling preemption timer exits, and allows SVM to avoid sending a duplicate
IPI (SVM also has a need to force an exit).
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, and add WARN to guard against similar bugs.
- Provide a dedicated arch hook for checking if a different vCPU was in-kernel
(for directed yield), and simplify the logic for checking if the currently
loaded vCPU is in-kernel.
- Misc cleanups and fixes.
The write-track is used externally only by the gpu/drm/i915 driver.
Currently, it is always enabled, if a kernel has been compiled with this
driver.
Enabling the write-track mechanism adds a two-byte overhead per page across
all memory slots. It isn't significant for regular VMs. However in gVisor,
where the entire process virtual address space is mapped into the VM, even
with a 39-bit address space, the overhead amounts to 256MB.
Rework the write-tracking mechanism to enable it on-demand in
kvm_page_track_register_notifier.
Here is Sean's comment about the locking scheme:
The only potential hiccup would be if taking slots_arch_lock would
deadlock, but it should be impossible for slots_arch_lock to be taken in
any other path that involves VFIO and/or KVMGT *and* can be coincident.
Except for kvm_arch_destroy_vm() (which deletes KVM's internal
memslots), slots_arch_lock is taken only through KVM ioctls(), and the
caller of kvm_page_track_register_notifier() *must* hold a reference to
the VM.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://lore.kernel.org/r/20240213192340.2023366-1-avagin@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly check that the source of external interrupt is indeed an NMI
in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples
(host samples labelled as guest samples) generated by perf/core NMI mode
if an NMI arrives after VM-Exit, but before kvm_after_interrupt():
# test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
# Symbol Overhead sys usr guest sys guest usr
# ....................................... ........ ........ ........ ......... .........
#
# Before:
[g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00%
[g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00%
[g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00%
# After:
[k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00%
[k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00%
[k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00%
In the above case, perf records the samples labelled '[g]', the RIPs behind
the weird samples are actually being queried by perf_instruction_pointer()
after determining whether it's in GUEST state or not, and here's the issue:
If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().
During this window, if a PMI occurs on host (since the KVM instructions on
host are being executed), the control flow, with the help of the host NMI
context, will be transferred to perf/core to generate performance samples,
thus perf_instruction_pointer() and perf_guest_get_ip() is called.
Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
cause perf/core to mistakenly assume that the source RIP of the host NMI
belongs to the guest world and use perf_guest_get_ip() to get the RIP of
a vCPU that has already exited by a non-NMI interrupt.
Error samples are recorded and presented to the end-user via perf-report.
Such false positive samples could be eliminated by explicitly determining
if the exit reason is KVM_HANDLING_NMI.
Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
is cleared, it's also still possible that another PMI is generated on host.
Also for perf/core timer mode, the false positives are still possible since
those non-NMI sources of interrupts are not always being used by perf/core.
For events that are host-only, perf/core can and should eliminate false
positives by checking event->attr.exclude_guest, i.e. events that are
configured to exclude KVM guests should never fire in the guest.
Events that are configured to count host and guest are trickier, perhaps
impossible to handle with 100% accuracy? And regardless of what accuracy
is provided by perf/core, improving KVM's accuracy is cheap and easy, with
no real downsides.
Fixes: dd60d21706 ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com
[sean: massage changelog, squash !!in_nmi() fixup from Like]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that vmx->req_immediate_exit is used only in the scope of
vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp
the VMX preemption to force a VM-Exit and let vendor code fully handle
forcing a VM-Exit.
Opportunsitically drop __kvm_request_immediate_exit() and just have
vendor code call smp_send_reschedule() directly. SVM already does this
when injecting an event while also trying to single-step an IRET, i.e.
it's not exactly secret knowledge that KVM uses a reschedule IPI to force
an exit.
Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is
forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to
inject an event but needs to first complete some other operation.
Knowing that KVM is (or isn't) forcing an exit is useful information when
debugging issues related to event injection.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Convert kvm_get_dr()'s output parameter to a return value, and clean up
most of the mess that was created by forcing callers to provide a pointer.
No functional change intended.
Acked-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refactor the handling of the reprogramming bitmap to snapshot and clear
to-be-processed bits before doing the reprogramming, and then explicitly
set bits for PMCs that need to be reprogrammed (again). This will allow
adding a macro to iterate over all valid PMCs without having to add
special handling for the reprogramming bit, which (a) can have bits set
for non-existent PMCs and (b) needs to clear such bits to avoid wasting
cycles in perpetuity.
Note, the existing behavior of clearing bits after reprogramming does NOT
have a race with kvm_vm_ioctl_set_pmu_event_filter(). Setting a new PMU
filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that
the vCPU isn't in the middle of reprogramming with a stale filter prior to
setting the bitmap.
Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Since commit b0563468ee ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
Because KVM CPU caps are initialized from the kernel boot CPU features this
makes the XSAVES feature also unavailable for KVM guests in this case.
At the same time the XSAVEC feature is left enabled.
Unfortunately, having XSAVEC but no XSAVES in CPUID breaks Hyper-V enabled
Windows Server 2016 VMs that have more than one vCPU.
Let's at least give users hint in the kernel log what could be wrong since
these VMs currently simply hang at boot with a black screen - giving no
clue what suddenly broke them and how to make them work again.
Trigger the kernel message hint based on the particular guest ID written to
the Guest OS Identity Hyper-V MSR implemented by KVM.
Defer this check to when the L1 Hyper-V hypervisor enables SVM in EFER
since we want to limit this message to Hyper-V enabled Windows guests only
(Windows session running nested as L2) but the actual Guest OS Identity MSR
write is done by L1 and happens before it enables SVM.
Fixes: b0563468ee ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <b83ab45c5e239e5d148b0ae7750133a67ac9575c.1706127425.git.maciej.szmigiero@oracle.com>
[Move some checks before mutex_lock(), rename function. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement a workaround for an SNP erratum where the CPU will incorrectly
signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of a VMCB, VMSA or AVIC backing page.
When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC
backing pages as "in-use" via a reserved bit in the corresponding RMP
entry after a successful VMRUN. This is done for _all_ VMs, not just
SNP-Active VMs.
If the hypervisor accesses an in-use page through a writable
translation, the CPU will throw an RMP violation #PF. On early SNP
hardware, if an in-use page is 2MB-aligned and software accesses any
part of the associated 2MB region with a hugepage, the CPU will
incorrectly treat the entire 2MB region as in-use and signal a an RMP
violation #PF.
To avoid this, the recommendation is to not use a 2MB-aligned page for
the VMCB, VMSA or AVIC pages. Add a generic allocator that will ensure
that the page returned is not 2MB-aligned and is safe to be used when
SEV-SNP is enabled. Also implement similar handling for the VMCB/VMSA
pages of nested guests.
[ mdr: Squash in nested guest handling from Ashish, commit msg fixups. ]
Reported-by: Alper Gun <alpergun@google.com> # for nested VMSA case
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240126041126.1927228-22-michael.roth@amd.com
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
switch a memory area between guest_memfd and regular anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees
confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new guest_memfd
and page attributes infrastructure. This is mostly useful for testing,
since there is no pKVM-like infrastructure to provide a meaningfully
reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually care
about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a stable TSC",
because some of them don't expect the "TSC stable" bit (added to the pvclock
ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
flushes on nested transitions, i.e. always satisfies flush requests. This
allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- On AMD machines with vNMI, always rely on hardware instead of intercepting
IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
base granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with
a prefix branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV
support to that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing flag
in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix the
various bugs that were lurking due to lack of said annotation.
There are two non-KVM patches buried in the middle of guest_memfd support:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmWcMWkUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroO15gf/WLmmg3SET6Uzw9iEq2xo28831ZA+
6kpILfIDGKozV5safDmMvcInlc/PTnqOFrsKyyN4kDZ+rIJiafJdg/loE0kPXBML
wdR+2ix5kYI1FucCDaGTahskBDz8Lb/xTpwGg9BFLYFNmuUeHc74o6GoNvr1uliE
4kLZL2K6w0cSMPybUD+HqGaET80ZqPwecv+s1JL+Ia0kYZJONJifoHnvOUJ7DpEi
rgudVdgzt3EPjG0y1z6MjvDBXTCOLDjXajErlYuZD3Ej8N8s59Dh2TxOiDNTLdP4
a4zjRvDmgyr6H6sz+upvwc7f4M4p+DBvf+TkWF54mbeObHUYliStqURIoA==
=66Ws
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read
versus write so that KVM doesn't pass "bool shared" all over the place just
to have precise assertions in paths that don't actually care about whether
the caller is a reader or a writer.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/W4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5pAQP/27OuHVhtcao9m8GAsm55tMEIDh2n6Ih
Yb74SOQhprbaGWyx/J6QdKABzcQX86L3CvCMSy6Ssoz2FeDM8Qs2a80/0jkK0qtI
TPXnOsvxh/MgEsC55LOSJZeWM+xwEqaTA2wuPAbvAcJLdAZsSLQQV1XwH6nl4l0l
LhcdkUeAKqQ8DbeBvTDyz80zuWaKhQMm5DJOBM6HbqaAdmMpw/hPDtu9b654jC2R
Z6Rs5590OFrR13cBAvgCBayvb54pjW9dbm8lfZ4Grcq1I+C0I7Y5jF+DYFjKuTUi
N7t+3HrpAM54DWwD95qkTO73g8tMSvwKGKq+OhylZrbPjeqgyYApcUvbxgKmhgGu
hBRoAS9D2goJQtFv4TehpAiJIoR6YGX5cCyIIFI50EfsWkifED/b7zUBf2jg0UnV
Z7IzvGLQtMAHj6daDcgXxaLVAXQ3v4aQK7DiPuCIiFGcXvedgcGIFjZH4NaVvK2D
Y9ygO8PSbN0MbGUSkaS8Urrxmx7l1tv05cardxjSVxnKEx01nYugB0vTM0/o7gSq
Bs2gnf+UJrazeabqFuVBYzzuHH7QyjsiQQ0HWoQiq04v8+C5VldwQKk+g8vxLXG1
iYNXW7qgueLxYxRNleLiyfw1HIyspR09d6e67n2729Xxtay0cvL8dC68AhEenb+N
fZ2lKFhnZ2Vs
=3f2t
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.8:
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read
versus write so that KVM doesn't pass "bool shared" all over the place just
to have precise assertions in paths that don't actually care about whether
the caller is a reader or a writer.
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit. This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.
LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking
- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
The modified LAM canonicality checks:
- LAM_S48 : [ 1 ][ metadata ][ 1 ]
63 47
- LAM_U48 : [ 0 ][ metadata ][ 0 ]
63 47
- LAM_S57 : [ 1 ][ metadata ][ 1 ]
63 56
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
63 56
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
63 56..47
The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks. The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows
Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+k4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5KygQAKTSEmfdox6MSYzGVzAVHBD/8oSTZAGf
4l96Np3sZiX0ujWP7aW1GaIdGL27Yf1bQrKIrODR4xepaosVPpoZZbnLFQ4Jm16D
OuwEQL06LV91Lv5XuPkNdq3nMVi1X3wjiKLvP451oCGv8JdxsjXSlFr8ZmDoCfmS
NCjkPyitdK+/xOMY5WcrkHD/6VMMiM+5A+CrG7DkaTaqBJQSUXG1NvTKhhxey6Rq
OZv0GPv7QVMhHv1NX0Y3LyoiGyWXAoFRnbk/N3yVBOnXcpJ+HBwWiNLRpxmZOQj/
CTo0VvUH/ZkN6zGvAb75/9puFHNliA/QCW1hp+ShXnNdn1eNdS7nhhPrzVqtCTy2
QeNWM/z5v9Wa1norPqDxzqWlh2bWW8JU0soX7Q+quN0d7YjVvmmUluL3Lw/V2zmb
gFM2ZY43QHlmLVic4sSraK1LEcYFzjexzpTLhee2gNp+l2y0D0c1/hXukCk6YNUM
gad9DH8P9d7By7Eyr0ZaPHSJbuBW1PqZhot5gCg9nCn4pnT2/y7wXsLj6VAw8gdr
dWNu2MZWDuH0/d4aKfw2veAECbHUK2daok4ufPDj5nYLVVWCs4HU0U7HlYL2CX7/
TdWOCwtpFtKoN1NHz8mpET7xldxLPnFkByL+SxypTZurAZXoSnEG71IbO5pJ2iIf
wHQkXgM+XimA
=qUZ2
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit. This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.
LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking
- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
The modified LAM canonicality checks:
- LAM_S48 : [ 1 ][ metadata ][ 1 ]
63 47
- LAM_U48 : [ 0 ][ metadata ][ 0 ]
63 47
- LAM_S57 : [ 1 ][ metadata ][ 1 ]
63 56
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
63 56
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
63 56..47
The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks. The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows
Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
- Fix a variety of bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/rsSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Q8gQAJc4y9NOd09kYXpI+DhkTVe6v07dmYds
NzBI2uViqxXFwA5pTs5VTVVYAl1FEmK6NvIVnJdc3epSYRSqyaeN/Z2NoulNxekj
/jLA/aA4+dTeJf2lfMFeH65IIuSJhuhyGeZV31RfW3NzEmlglcsb74QkHnJB8rLQ
RFJXZcOxSSap72AWxKmxk0alRaI6ONZ9NyqOWFWjZdQuAE7id9Ae5OixKUrlJkmR
6CbY8ra51MFIXQEsomVlcl5b1DNiv0drPPf5YaC9T4CERtt5yZxpvZeTPhq70evm
OutoZpzfi69cF1fFCxqN5cWZSt1C/Bu3xp8+ILI1+bZkMCV/ty85DU6hfMZQZzcV
JeJkRg/AAgOrG4dtHskwg9LDMs867kgbaqZ8l8K7Dt8rGmcLc5/rZ1ZdjTStFj6V
ukmVKMAVgkmh88u62wQ5HjrN1IE1oE6nmDp3zivfPuohEr49A8mAT02A2x9AVxAr
HvmwfDMA92xOGSRAN9Gt0mbOA+G0WZe4A36XgPEXloYeskYZgHzgW2hT6VWTd86O
ydU9s4L8g+Fy4jcObAiKsT8YwFgAMfVXZKTXvuTME4m/WUNBCrYCwqEOp/NM5qrk
qYWVXxOMMjZo71tQfvSPu1TWCtW/4ckvmqMrdQosgwLFy5pSqgXEwTruDvbJ1KWU
KhIWVbUfmgFA
=+Emh
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.
Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.
Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.
While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Saving a few bytes of memory per KVM VM is certainly great but what's more
important is the ability to see where the code accesses Xen emulation
context while CONFIG_KVM_XEN is not enabled. Currently, kvm_cpu_get_extint()
is the only such place and it is harmless: kvm_xen_has_interrupt() always
returns '0' when !CONFIG_KVM_XEN.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections.
We already do it all the time when zapping with read_lock(), so it is not
a problem to do it from the kvm_tdp_mmu_zap_all() path (aka
kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment value
in order to detect overflow from emulation, and the snapshot approach is
inherently flawed.
Snapshotting the previous counter at every increment assumes that there is
at most one emulated counter event per emulated instruction (or rather,
between checks for KVM_REQ_PMU). That's mostly holds true today because
KVM only emulates (branch) instructions retired, but the approach will
fall apart if KVM ever supports event types that don't have a 1:1
relationship with instructions.
And KVM already has a relevant bug, as handle_invalid_guest_state()
emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
miss an overflow event due to clobbering pmc->prev_counter. Not checking
KVM_REQ_PMU is problematic in both cases, but at least with the emulated
counter approach, the resulting behavior is delayed overflow detection,
as opposed to completely lost detection.
Tracking the emulated count fixes another bug where the snapshot approach
can signal spurious overflow due to incorporating both the emulated count
and perf's count in the check, i.e. if overflow is detected by perf, then
KVM's emulation will also incorrectly signal overflow. Add a comment in
the related code to call out the need to process emulated events *after*
pausing the perf event (big kudos to Mingwei for figuring out that
particular wrinkle).
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when
running under Hyper-V if and only if CONFIG_HYPERV!=n. Wrapping yet more
code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional
branches, and makes it super obvious why the hooks *might* be valid.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support to allow guests to set the new CR4 control bit for LAM and add
implementation to get untagged address for supervisor pointers.
LAM modifies the canonicality check applied to 64-bit linear addresses for
data accesses, allowing software to use of the untranslated address bits for
metadata and masks the metadata bits before using them as linear addresses
to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM
for supervisor pointers. It also changes VMENTER to allow the bit to be set
in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP
is allowed to be set even not in 64-bit mode, but it will not take effect
since LAM only applies to 64-bit linear addresses.
Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu
supporting LAM or not. Leave it intercepted to prevent guest from setting
the bit if LAM is not exposed to guest as well as to avoid vmread every time
when KVM fetches its value, with the expectation that guest won't toggle the
bit frequently.
Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to
allow guests to enable LAM for supervisor pointers in nested VMX operation.
Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM
doesn't need to emulate TLB flush based on it. There's no other features
or vmx_exec_controls connection, and no other code needed in
{kvm,vmx}_set_cr4().
Skip address untag for instruction fetches (which includes branch targets),
operand of INVLPG instructions, and implicit system accesses, all of which
are not subject to untagging. Note, get_untagged_addr() isn't invoked for
implicit system accesses as there is no reason to do so, but check the
flag anyways for documentation purposes.
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce a new interface get_untagged_addr() to kvm_x86_ops to untag
the metadata from linear address. Call the interface in linearization
of instruction emulator for 64-bit mode.
When enabled feature like Intel Linear Address Masking (LAM) or AMD Upper
Address Ignore (UAI), linear addresses may be tagged with metadata that
needs to be dropped prior to canonicality checks, i.e. the metadata is
ignored.
Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor specific
code, as sadly LAM and UAI have different semantics. Pass the emulator
flags to allow vendor specific implementation to precisely identify the
access type (LAM doesn't untag certain accesses).
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.
The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.
At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidentials VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.
Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.
No functional change intended.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.
Tracking whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead using the normal
refcounting. Whether or not attributes are mixed is binary; either they
are or they aren't. Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes. Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior. Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.
Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared. PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.
Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip"
action for an instruction that has already been partially emulated. Drop a
hack in the SVM code that was fudging around the emulator code not giving
SVM enough information to do the right thing.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8GHYSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5hwkQAIR8l1gWz/caz29biBzmRnDS+aZOXcYM
8V8WBJqJgMKE9egibF4sADAlhInXzg19Xr7bQs6VfuvmdXrCn0UJ/nLorX+H85A2
pph6iNlWO6tyQAjvk/AieaeUyZOqpCFmKOgxfN2Fr/Lrn7u3AdjXC20qPeFJSLXr
YOTCQ704yvjjJp4yVA8JlclAQu38hanKiO5SZdlLzbuhUgWwQk4DVP2ZsYnhX+RO
F6exxORvMnYF/LJe/kR2/DMLf2JWvyUmjRrGWoeRoksOw5BlXMc5HyTPHSJ2jDac
lJaNtmZkTY1bDVWZk7N03ze5aFJa4DaqJdIFLtgujrFW8thog0P48aH6vmKi4UAA
bXme9GFYbmJTkemaGRnrzidFV12uPNvvanS+1PDOw4sn4HpscoMSpZw5PeH2kBwV
6uKNCJCwLtk8oe50yroKD7rJ/ASB7CeoqzbIL9s2TA0HSAskIf65T4eZp01uniyd
Q98yCdrG2mudsg5aU5yMfe0LwZby5BB5kUCqIe4hyRC68GJR8wkAzhaFRgCn4aJE
yaTyjnT2V3PGMEEJOPFdSF3VQGztljzQiXlEvBVj3zvMGQNTo2NhmS3ka4W+wW5G
avRYv8dITlGRs6J2gV1vp8Eb5LzDrwRpRURSmzeP5rR58saKdljTZgNfOzfLeFr1
WhLzonLz52IS
=U0fq
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip"
action for an instruction that has already been partially emulated. Drop a
hack in the SVM code that was fudging around the emulator code not giving
SVM enough information to do the right thing.
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8He8SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5KyQP+wUH3n6hhJGScsSCpWXK6r8q+Y2ZBftY
ecXuoTfeBJmsoTbnExF7K600DtbxHY5jjxt3ROmoUCertCFRCoq6pi5v4rbRDDQ1
fmGkht43A6zAuHQ0Ntvkq4rNEmISAbzLP4EXOxZJ/Hxld91T8IutMFo7NN/YfOSx
nb+qgb7B25T7ODGvzahRjxnoevCHBN/TdKeDrvsoWeMpVw+CDYqquQOcLfHMaBAN
DqGwZzpdVqRQqg3TOuBGCiv5IcvskjkFUh0y6cEYkCR/MruLoT6CygoLImEV2naW
RU0ZU9Y4cjf+BV/faQEdP6mDQwwCUHWLxDpXUVn03KQYQHlA7q6UgRKxy35ixZ5w
Euxvg4m2ZGgJjsVLqTTMUlbLSNxD6wWZAVxGH7w8XghKrNmoj1IoajPZS+1rwyO2
5rUynMKf3HMT6oeqqZH95aChlUMiAvaPYPc+ogku8Bt1zJQVv/xnk/6T95Vw6C/t
KfYsV80rmJd/EL/fUXYX3mCMcZGHyv80QlOEc0uR4f25HGszCG8qHiSaUtnvQUjQ
xaguSuO1Cf7sdhHPWj4p/US+Jerrgd8nzoQGvKUOkdLsQzU71xwjvTZNlmmBYKKO
zgGIXZfaXa4JibAqnRrC+V8UdDPOwKvOEzmH0joLEzkTISnIG2LycvZ6tG7sTcMU
0sIg2dvhJx/G
=Z2eM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-xen-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8EugSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5xS0P+gPTDO81CUZO70LrO2W4E7toRBf/F9x1
/v5D/76p9hG32Z6+BJs/xxDxJFagw75MtoR5oKivtXiip3TxbfOyDOlaQkIRo85E
/d95il/LRidL3Mv3TXRj1lykXnxSSz9tigAGEZti1Y9Fn9fXEIwurJH7dU5cBI1E
fin5bsDaTNRjG4jjTiEUbnKPRTlD/S7CQJn4CaYvZhMv/eJkYDLyBBVy4VLoLzvD
ctL6VJQLGPVxbxr9mEmulaqMrSuDIQQLkRVQJAViKyerBInTEc5d/GPCHuE8O3zi
0r/QSJbMS9titWLz07NhJ1UH4VJNyaEhRlyJPSFhBW4h6dzUb3EXdUe0Hwa+JH/S
H2cVqsANItTCIhvDtuEGIRDahu0eD+63h90InJ0gEVL1kSJS+UWZHB71PkUEQgAV
2OsuT1D26fuxrv+0b9ioBZURycqKw++zGsrwyVhe77eBgqBJ12tbL4TAD+QNjaQ5
HZTCe6YV83gZoOMeVkoTGSf96s9lGORgxsaAIXmFuLB9RVCVXhVh0ph2HZsnV8Hw
ZXEXpBEFo7GUhb0NIvsk2W73QL87A3fLv15yITWc8KuC7/dXP9z6KpSKjFySS69X
uWD1MVx6shhvbg97UzoJlXc3/z0aVzmdZJudE5d0gcFvAjIItqp6ICPOoKxfj8pT
tqRZu3kVHd61
=sfp8
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in
lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
state, i.e. can immediately pend an IRQ if the expiry is in the past.
- Clear the ICR.BUSY bit when handling trap-like x2APIC writes to suppress a
WARN due to KVM expecting the BUSY bit to be cleared when sending IPIs.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8DrkSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Z54P/jQbjmwbvAVDa1g2HGr16ouFb3v+rCgV
eXiN2/8tDlQl9JkvZsjOY1AtJauS3RZu/SncT3IxCCvHVrKKt0xgnhWT//rAMX3N
zrrFxXJmilFdd2ijsuK8flHVFO5i06Kx1NRjSmAMb76et1et1EI0I62F1rom8ocn
l0tryTkV0qhWKlgxzLBNzzhzxxA2Mvb6vFhut+nfa0IisnQ+roh5dV4JS7Cdhy+5
qIQ4J86JdDR6kmraZLCv8r7vIQn5EQPVCAC2WWalds7iWUW3ixy01NLweVLaVtA3
xWRHYK7azvu3SJ9HvmSMRZzzTElQBYnE2d37ycyYewHbDUJ6FCXxh7H+BehYaVBW
CK6GLaMHY8vyXLcB2cRAzZ7GgBSSDevL11/CMc2ueGc51AOsDCxFh6v69TU+/PiQ
dMOYiNTuwV9QC3raIXPCc2Q9rqGp20xBqXIv5XLEyIxqcFQESvtRggg6zJdBZ3zB
T5/xAXDf++rKFiu1wskFAbwsk3M4T36u1WlK9iayC9dPvCDGRzsle3tt9npmoj2X
3e0Uz7fkRLki1YZW1qLmvqoYJ9bX9B9BjuKkORIqSqXuwcgPYEzHt1GcRByI2FSk
KSTnNwgs/Bs498LsDdU1Z740ZkYFGmu2nK+CUvwxP+uejPsnlhik5kGqLSIeA7zA
9kDpcdV+tZKJ
=atZ2
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-apic-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 APIC changes for 6.7:
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in
lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
state, i.e. can immediately pend an IRQ if the expiry is in the past.
- Clear the ICR.BUSY bit when handling trap-like x2APIC writes. This avoids a
WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
Update the variable with name 'kvm' in kvm_x86_ops.sched_in() to 'vcpu' to
avoid confusions. Variable naming in KVM has a clear convention that 'kvm'
refers to pointer of type 'struct kvm *', while 'vcpu' refers to pointer of
type 'struct kvm_vcpu *'.
Fix this 9-year old naming issue for fun.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231017232610.4008690-1-mizhang@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The legacy API for setting the TSC is fundamentally broken, and only
allows userspace to set a TSC "now", without any way to account for
time lost between the calculation of the value, and the kernel eventually
handling the ioctl.
To work around this, KVM has a hack which, if a TSC is set with a value
which is within a second's worth of the last TSC "written" to any vCPU in
the VM, assumes that userspace actually intended the two TSC values to be
in sync and adjusts the newly-written TSC value accordingly.
Thus, when a VMM restores a guest after suspend or migration using the
legacy API, the TSCs aren't necessarily *right*, but at least they're
in sync.
This trick falls down when restoring a guest which genuinely has been
running for less time than the 1 second of imprecision KVM allows for in
in the legacy API. On *creation*, the first vCPU starts its TSC counting
from zero, and the subsequent vCPUs synchronize to that. But then when
the VMM tries to restore a vCPU's intended TSC, because the VM has been
alive for less than 1 second and KVM's default TSC value for new vCPU's is
'0', the intended TSC is within a second of the last "written" TSC and KVM
incorrectly adjusts the intended TSC in an attempt to synchronize.
But further hacks can be piled onto KVM's existing hackish ABI, and
declare that the *first* value written by *userspace* (on any vCPU)
should not be subject to this "correction", i.e. KVM can assume that the
first write from userspace is not an attempt to sync up with TSC values
that only come from the kernel's default vCPU creation.
To that end: Add a flag, kvm->arch.user_set_tsc, protected by
kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
the VM *has* been set by userspace, and make the 1-second slop hack only
trigger if user_set_tsc is already set.
Note that userspace can explicitly request a *synchronization* of the
TSC by writing zero. For the purpose of user_set_tsc, an explicit
synchronization counts as "setting" the TSC, i.e. if userspace then
subsequently writes an explicit non-zero value which happens to be within
1 second of the previous value, the new value will be "corrected". This
behavior is deliberate, as treating explicit synchronization as "setting"
the TSC preserves KVM's existing behaviour inasmuch as possible (KVM
always applied the 1-second "correction" regardless of whether the write
came from userspace vs. the kernel).
Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Tested-by: Yong He <alexyonghe@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refactor and rename can_emulate_instruction() to allow vendor code to
return more than true/false, e.g. to explicitly differentiate between
"retry", "fault", and "unhandleable". For now, just do the plumbing, a
future patch will expand SVM's implementation to signal outright failure
if KVM attempts EMULTYPE_SKIP on an SEV guest.
No functional change intended (or rather, none that are visible to the
guest or userspace).
Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When running android emulator (which is based on QEMU 2.12) on
certain Intel hosts with kernel version 6.3-rc1 or above, guest
will freeze after loading a snapshot. This is almost 100%
reproducible. By default, the android emulator will use snapshot
to speed up the next launching of the same android guest. So
this breaks the android emulator badly.
I tested QEMU 8.0.4 from Debian 12 with an Ubuntu 22.04 guest by
running command "loadvm" after "savevm". The same issue is
observed. At the same time, none of our AMD platforms is impacted.
More experiments show that loading the KVM module with
"enable_apicv=false" can workaround it.
The issue started to show up after commit 8e6ed96cdd ("KVM: x86:
fire timer when it is migrated and expired, and in oneshot mode").
However, as is pointed out by Sean Christopherson, it is introduced
by commit 967235d320 ("KVM: vmx: clear pending interrupts on
KVM_SET_LAPIC"). commit 8e6ed96cdd ("KVM: x86: fire timer when
it is migrated and expired, and in oneshot mode") just makes it
easier to hit the issue.
Having both commits, the oneshot lapic timer gets fired immediately
inside the KVM_SET_LAPIC call when loading the snapshot. On Intel
platforms with APIC virtualization and posted interrupt processing,
this eventually leads to setting the corresponding PIR bit. However,
the whole PIR bits get cleared later in the same KVM_SET_LAPIC call
by apicv_post_state_restore. This leads to timer interrupt lost.
The fix is to move vmx_apicv_post_state_restore to the beginning of
the KVM_SET_LAPIC call and rename to vmx_apicv_pre_state_restore.
What vmx_apicv_post_state_restore does is actually clearing any
former apicv state and this behavior is more suitable to carry out
in the beginning.
Fixes: 967235d320 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Haitao Shan <hshan@google.com>
Link: https://lore.kernel.org/r/20230913000215.478387-1-hshan@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a Kconfig entry to set the maximum number of vCPUs per KVM guest and
set the default value to 4096 when MAXSMP is enabled, as there are use
cases that want to create more than the currently allowed 1024 vCPUs and
are more than happy to eat the memory overhead.
The Hyper-V TLFS doesn't allow more than 64 sparse banks, i.e. allows a
maximum of 4096 virtual CPUs. Cap KVM's maximum number of virtual CPUs
to 4096 to avoid exceeding Hyper-V's limit as KVM support for Hyper-V is
unconditional, and alternatives like dynamically disabling Hyper-V
enlightenments that rely on sparse banks would require non-trivial code
changes.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lore.kernel.org/r/20230824215244.3897419-1-kyle.meyer@hpe.com
[sean: massage changelog with --verbose, document #ifdef mess]
Signed-off-by: Sean Christopherson <seanjc@google.com>
When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.
Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI. But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.
E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.
KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack. The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.
Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
Fixes: 9cd803d496 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20230925173448.3518223-2-mizhang@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Stop zapping invalidate TDP MMU roots via work queue now that KVM
preserves TDP MMU roots until they are explicitly invalidated. Zapping
roots asynchronously was effectively a workaround to avoid stalling a vCPU
for an extended during if a vCPU unloaded a root, which at the time
happened whenever the guest toggled CR0.WP (a frequent operation for some
guest kernels).
While a clever hack, zapping roots via an unbound worker had subtle,
unintended consequences on host scheduling, especially when zapping
multiple roots, e.g. as part of a memslot. Because the work of zapping a
root is no longer bound to the task that initiated the zap, things like
the CPU affinity and priority of the original task get lost. Losing the
affinity and priority can be especially problematic if unbound workqueues
aren't affined to a small number of CPUs, as zapping multiple roots can
cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
the CPUs KVM is already using to run vCPUs.
When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
zap can result in KVM occupying all logical CPUs for ~8ms, and result in
high priority tasks not being scheduled in in a timely manner. In v5.15,
which doesn't preserve unloaded roots, the issues were even more noticeable
as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
Consuming all CPUs for an extended duration can lead to significant jitter
throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
is a semi-frequent operation as memslots are deleted and recreated with
different host virtual addresses to react to host GPU drivers allocating
and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips
during games due to the audio server's tasks not getting scheduled in
promptly, despite the tasks having a high realtime priority.
Deleting memslots isn't exactly a fast path and should be avoided when
possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
memslot shenanigans, but KVM is squarely in the wrong. Not to mention
that removing the async zapping eliminates a non-trivial amount of
complexity.
Note, one of the subtle behaviors hidden behind the async zapping is that
KVM would zap invalidated roots only once (ignoring partial zaps from
things like mmu_notifier events). Preserve this behavior by adding a flag
to identify roots that are scheduled to be zapped versus roots that have
already been zapped but not yet freed.
Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
encounter invalid roots, as it's not at all obvious why zapping
invalidated roots shouldn't simply zap all invalid roots.
Reported-by: Pattara Teerapong <pteerapong@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Yiwei Zhang<zzyiwei@google.com>
Cc: Paul Hsia <paulhsia@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230916003916.2545000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop "support" for multiple page-track modes, as there is no evidence
that array-based and refcounted metadata is the optimal solution for
other modes, nor is there any evidence that other use cases, e.g. for
access-tracking, will be a good fit for the page-track machinery in
general.
E.g. one potential use case of access-tracking would be to prevent guest
access to poisoned memory (from the guest's perspective). In that case,
the number of poisoned pages is likely to be a very small percentage of
the guest memory, and there is no need to reference count the number of
access-tracking users, i.e. expanding gfn_track[] for a new mode would be
grossly inefficient. And for poisoned memory, host userspace would also
likely want to trap accesses, e.g. to inject #MC into the guest, and that
isn't currently supported by the page-track framework.
A better alternative for that poisoned page use case is likely a
variation of the proposed per-gfn attributes overlay (linked), which
would allow efficiently tracking the sparse set of poisoned pages, and by
default would exit to userspace on access.
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Ben Gardon <bgardon@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable the page-track notifier code at compile time if there are no
external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n. KVM itself
now hooks emulated writes directly instead of relying on the page-track
mechanism.
Provide a stub for "struct kvm_page_track_notifier_node" so that including
headers directly from the command line, e.g. for testing include guards,
doesn't fail due to a struct having an incomplete type.
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't use the generic page-track mechanism to handle writes to guest PTEs
in KVM's MMU. KVM's MMU needs access to information that should not be
exposed to external page-track users, e.g. KVM needs (for some definitions
of "need") the vCPU to query the current paging mode, whereas external
users, i.e. KVMGT, have no ties to the current vCPU and so should never
need the vCPU.
Moving away from the page-track mechanism will allow dropping use of the
page-track mechanism for KVM's own MMU, and will also allow simplifying
and cleaning up the page-track APIs.
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into
mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only
for kvm_arch_flush_shadow_all(). This will allow refactoring
kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without
having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping
everything in mmu.c will also likely simplify supporting TDX, which
intends to do zap only relevant SPTEs on memslot updates.
No functional change intended.
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS
Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis
KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas
yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7
wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058
5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t
vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT
DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8
2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl
wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp
pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J
gLdtzs8M9K9t
=yGM1
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.6:
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
- Add support for TLB range invalidation of Stage-2 page tables,
avoiding unnecessary invalidations. Systems that do not implement
range invalidation still rely on a full invalidation when dealing
with large ranges.
- Add infrastructure for forwarding traps taken from a L2 guest to
the L1 guest, with L0 acting as the dispatcher, another baby step
towards the full nested support.
- Simplify the way we deal with the (long deprecated) 'CPU target',
resulting in a much needed cleanup.
- Fix another set of PMU bugs, both on the guest and host sides,
as we seem to never have any shortage of those...
- Relax the alignment requirements of EL2 VA allocations for
non-stack allocations, as we were otherwise wasting a lot of that
precious VA space.
- The usual set of non-functional cleanups, although I note the lack
of spelling fixes...
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmTsXrUPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDZpIQAJUM1rNEOJ8ExYRfoG1LaTfcOm5TD6D1IWlO
uCUx4xLMBudw/55HusmUSdiomQ3Xg5UdRaU7vX5OYwPbdoWebjEUfgdP3jCA/TiW
mZTMv3x9hOvp+EOS/UnS469cERvg1/KfwcdOQsWL0HsCFZnu2XmQHWPD++vovLNp
F1892ij875mC6C6mOR60H2nyjIiCuqWh/8eKBkp65CARCbFDYxWhqBnmcmTvoquh
E87pQDPdtgXc0KlOWCABh5bYOu1WGVEXE5f3ixtdY9cQakkSI3NkFKw27/mIWS4q
TCsagByNnPFDXTglb1dJopNdluLMFi1iXhRJX78R/PYaHxf4uFafWcQk1U7eDdLg
1kPANggwYe4KNAQZUvRhH7lIPWHCH0r4c1qHV+FsiOZVoDOSKHo4RW1ZFtirJSNW
LNJMdk+8xyae0S7z164EpZB/tpFttX4gl3YvUT/T+4gH8+CRFAaoAlK39CoGDPpk
f+P2GE1Z5YupF16YjpZtBnan55KkU1b6eORl5zpnAtoaz5WGXqj1t4qo0Q6e9WB9
X4rdDVhH7vRUmhjmSP6PuEygb84hnITLdGpkH2BmWj/4uYuCN+p+U2B2o/QdMJoo
cPxdflLOU/+1gfAFYPtHVjVKCqzhwbw3iLXQpO12gzRYqE13rUnAr7RuGDf5fBVC
LW7Pv81o
=DKhx
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 6.6
- Add support for TLB range invalidation of Stage-2 page tables,
avoiding unnecessary invalidations. Systems that do not implement
range invalidation still rely on a full invalidation when dealing
with large ranges.
- Add infrastructure for forwarding traps taken from a L2 guest to
the L1 guest, with L0 acting as the dispatcher, another baby step
towards the full nested support.
- Simplify the way we deal with the (long deprecated) 'CPU target',
resulting in a much needed cleanup.
- Fix another set of PMU bugs, both on the guest and host sides,
as we seem to never have any shortage of those...
- Relax the alignment requirements of EL2 VA allocations for
non-stack allocations, as we were otherwise wasting a lot of that
precious VA space.
- The usual set of non-functional cleanups, although I note the lack
of spelling fixes...
Use the governed feature framework to track if XSAVES is "enabled", i.e.
if XSAVES can be used by the guest. Add a comment in the SVM code to
explain the very unintuitive logic of deliberately NOT checking if XSAVES
is enumerated in the guest CPUID model.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce yet another X86_FEATURE flag framework to manage and cache KVM
governed features (for lack of a better name). "Governed" in this case
means that KVM has some level of involvement and/or vested interest in
whether or not an X86_FEATURE can be used by the guest. The intent of the
framework is twofold: to simplify caching of guest CPUID flags that KVM
needs to frequently query, and to add clarity to such caching, e.g. it
isn't immediately obvious that SVM's bundle of flags for "optional nested
SVM features" track whether or not a flag is exposed to L1.
Begrudgingly define KVM_MAX_NR_GOVERNED_FEATURES for the size of the
bitmap to avoid exposing governed_features.h in arch/x86/include/asm/, but
add a FIXME to call out that it can and should be cleaned up once
"struct kvm_vcpu_arch" is no longer expose to the kernel at large.
Cc: Zeng Guang <guang.zeng@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Make kvm_flush_remote_tlbs_range() visible in common code and create a
default implementation that just invalidates the whole TLB.
This paves the way for several future features/cleanups:
- Introduction of range-based TLBI on ARM.
- Eliminating kvm_arch_flush_remote_tlbs_memslot()
- Moving the KVM/x86 TDP MMU to common code.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
Rename kvm_arch_flush_remote_tlb() and the associated macro
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively.
Making the name plural matches kvm_flush_remote_tlbs() and makes it more
clear that this function can affect more than one remote TLB.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure. The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.
Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies. Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.
This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.
Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode. The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).
Opportunistically fix a benign typo in the prototype for is_valid_cr4().
Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename global_ovf_ctrl_mask to global_status_mask to avoid confusion now
that Intel has renamed GLOBAL_OVF_CTRL to GLOBAL_STATUS_RESET in PMU v4.
GLOBAL_OVF_CTRL and GLOBAL_STATUS_RESET are the same MSR index, i.e. are
just different names for the same thing, but the SDM provides different
entries in the IA-32 Architectural MSRs table, which gets really confusing
when looking at PMU v4 definitions since it *looks* like GLOBAL_STATUS has
bits that don't exist in GLOBAL_OVF_CTRL, but in reality the bits are
simply defined in the GLOBAL_STATUS_RESET entry.
No functional change intended.
Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGuLISHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5NOMQAKy1Od54yzQsIKyAZZJVfOEm7N5VLQgz
+jLilXgHd8dm/g0g/KVCDPFoZ/ut2Tf5Dn4WwyoPWOpgGsOyTwdDIJabf9rustkA
goZFcfUXz+P1nangTidrj6CFYgGmVS13Uu//H19X4bSzT+YifVevJ4QkRVElj9Mh
VBUeXppC/gMGBZ9tKEzl+AU3FwJ58cB88q4boovBFYiDdciv/fF86t02Lc+dCIX1
6hTcOAnjAcp3eJY0wPQJUAEScufDKcMf6tSrsB/yWXv9KB9ANXFNXry8/+lW/Ux/
oOUmUVdRXrrsRUqtYk9+KuMoIN7CL1SBV0RCm5ApqwqwnTVdHS+odHU3c2s7E/uU
QXIW4vwSne3W9Y4YApDgFjwDwmzY85dvblWlWBnR2LW2I3Or48xK+S8LpWG+lj6l
EDf7RzeqAipJ1qUq6qDYJlyg/YsyYlcoErtra423skg38HBWxQXdqkVIz3SYdKjA
0OcBQIRI28KzJDn1gU6P3Q0Wr/cKsx9EGy6+jWBhf4Yf3eHP7+3WUTrg/Up0q8ny
0j/+cbe5kBb6k2T9y2X6jm6TVbPV5FyMBOF/UxmqEbRLmxXjBe8tMnFwV+qN871I
gk5HTSIkX39GU9kNA3h5HoWjdNeRfhazKR9ZVrELVc1zjHnGLthXBPZbIAUsPPMx
vgM6jf8NwLXZ
=9xNX
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.4:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- Misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGtd4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Z9kP/i3WZ40hevvQvB/5cEpxxmxYDwCYnnjM
hiQgK5jT4SrMTmVjLgkNdI2PogQoS4CX+GC7lcA9bvse84hjuPvgOflb2B+p2UQi
Ytbr9g/tfKNIpnKIk9mcPcSObN9vm2Kgt7n28rtPrHWj89eQzgc66eijqdpKBLxA
c3crVR8krwYAQK0tmzHq1+H6hB369YbHAHyTTRRI/bNWnqKblnvUbt0NL2aBusa9
rNMaOdRtinLpy2dmuX/b3japRB8QTnlf7zpPIF4cBEhbYXy5woClZpf1D2fCA6Er
XFbEoYawMVd9UeJYbW4z5yErLT83eYoGp4U0eFXWp6fvh8nZlgCGvBKE9g4mmqwj
aSLaTR5eVN2qlw6jXVeg3unCo8Eyl36AwYwve2L6sFmBvZvNV5iz2eQ7rrOe4oE3
dnTUaLQ8I2SVg04MbYmCq5W+frTL/I7kqNpbccL1Z3R5WO4y5gz63mug6NfLIvhR
t45TAIaifxBfcXQsBZM3v2KUK/xQrD3AbJmFKh54L2CKqiGaNWsMLX+6NZ7LZWgf
8rEqsVkkQDgF7z8eXai4TR26nYfSX6g9gDqtOH73L87aJ7PJk5cRoDWQ1sWs1e/l
4HA/L0Bo/3pnKAa0ZWxJOixmzqY49gNQf3dj8gt3jk3y2ijbAivshiSpPBmIxn0u
QLeOf/LGvipl
=m18F
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.4:
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- Misc cleanups and fixes
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to share the .sync_page()
implementation, i.e. utilize .sync_page()'s optimizations when emulating
invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGsvASHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5XnoP/0D8rQmrA0xPHK81zYS1E71tsR/itO/T
CQMSB4PhEqvcRUaWOuhLBRUW+noWzaOkjkMYK2uoPTdtme7v9+Ar7EtfrWYHrBWD
IxHCAymo3a5dQPUc3Nb77u6HjRAOokPSqSz5jE4qAjlniW09feruro2Phi+BTme4
JjxTc/7Oh0Fu26+mK7mJHiw3fV1x3YznnnRPrKGrVQes5L6ozNICkUZ6nvuJUVMk
lTNHNQbG8PqJZnfWG7VIKRn1vdfXwEfnvyucGVEqFfPLkOXqJHyqMVmIOtvsH7C5
l8j36+lBZwtFh2jk2EsXOTb6sS7l1MSvyHLlbaJaqqffP+77Hf1n0fROur0k9Yse
jJJejJWxZ/SvjMt/bOA+4ybGafZH0lt20DsDWnat5GSQ1EVT1CInN2p8OY8pdecR
QOJBqnNUOykC7/Pyad+IxTxwrOSNCYh+5aYG8AdGquZvNUEwjffVJqrmxDvklY8Z
DTYwGKgNY7NsP/dV0WYYElsAuHiKwiDZL15KftiQebO1fPcZDpTzDo83/8UMfGxh
yegngcNX9Qi7lWtLkUMy8A99UvejM0QrS/Zt8v1zjlQ8PjreZLLBWsNpe0ufIMRk
31ZAC2OS4Koi3wZ54tA7Z1Kh11meGhAk5Ti7sNke0rDqB9UMmj6UKw121cSRvW7q
W6O4U3YeGpKx
=zb4u
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.4:
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to share the .sync_page()
implementation, i.e. utilize .sync_page()'s optimizations when emulating
invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
- Misc cleanups
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2
rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1
HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL
vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE
61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf
RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je
+5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e
0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB
bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1
1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT
8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK
DHT7RBE0
=Bhta
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages
pair instead of a struct, and bury said struct in Hyper-V specific code.
Passing along two params generates much better code for the common case
where KVM is _not_ running on Hyper-V, as forwarding the flush on to
Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range()
becomes a tail call.
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename the Hyper-V hooks for TLB flushing to match the naming scheme used
by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code,
arch hooks from common code, etc.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.
Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
Move the 'version' member to the beginning of the structure to reuse an
existing hole instead of introducing another one.
This allows us to save 8 bytes for 64 bit builds.
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230217193336.15278-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support for SVM's Virtual NMIs implementation, which adds proper
tracking of virtual NMI blocking, and an intr_ctrl flag that software can
set to mark a virtual NMI as pending. Pending virtual NMIs are serviced
by hardware if/when virtual NMIs become unblocked, i.e. act more or less
like real NMIs.
Introduce two new kvm_x86_ops callbacks so to support SVM's vNMI, as KVM
needs to treat a pending vNMI as partially injected. Specifically, if
two NMIs (for L1) arrive concurrently in KVM's software model, KVM's ABI
is to inject one and pend the other. Without vNMI, KVM manually tracks
the pending NMI and uses NMI windows to detect when the NMI should be
injected.
With vNMI, the pending NMI is simply stuffed into the VMCB and handed
off to hardware. This means that KVM needs to be able to set a vNMI
pending on-demand, and also query if a vNMI is pending, e.g. to honor the
"at most one NMI pending" rule and to preserve all NMIs across save and
restore.
Warn if KVM attempts to open an NMI window when vNMI is fully enabled,
as the above logic should prevent KVM from ever getting to
kvm_check_and_inject_events() with two NMIs pending _in software_, and
the "at most one NMI pending" logic should prevent having an NMI pending
in hardware and an NMI pending in software if NMIs are also blocked, i.e.
if KVM can't immediately inject the second NMI.
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230227084016.3368-11-santosh.shukla@amd.com
[sean: rewrite shortlog and changelog, massage code comments]
Signed-off-by: Sean Christopherson <seanjc@google.com>
In hardware TLB, invalidating TLB entries means the translations are
removed from the TLB.
In KVM shadowed vTLB, the translations (combinations of shadow paging
and hardware TLB) are generally maintained as long as they remain "clean"
when the TLB of an address space (i.e. a PCID or all) is flushed with
the help of write-protections, sp->unsync, and kvm_sync_page(), where
"clean" in this context means that no updates to KVM's SPTEs are needed.
However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is
unsync, and thus triggers a remote flush even if the original vTLB entry
is clean, i.e. is usable as-is.
Besides this, FNAME(invlpg) is largely is a duplicate implementation of
FNAME(sync_spte) to invalidate a vTLB entry.
To address both issues, reuse FNAME(sync_spte) to share the code and
slightly modify the semantics, i.e. keep the vTLB entry if it's "clean"
and avoid remote TLB flush.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The @root_hpa for kvm_mmu_invalidate_addr() is called with @mmu->root.hpa
or INVALID_PAGE where @mmu->root.hpa is to invalidate gva for the current
root (the same meaning as KVM_MMU_ROOT_CURRENT) and INVALID_PAGE is to
invalidate gva for all roots (the same meaning as KVM_MMU_ROOTS_ALL).
Change the argument type of kvm_mmu_invalidate_addr() and use
KVM_MMU_ROOT_XXX instead so that we can reuse the function for
kvm_mmu_invpcid_gva() and nested_ept_invalidate_addr() for invalidating
gva for different set of roots.
No fuctionalities changed.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-9-jiangshanlai@gmail.com
[sean: massage comment slightly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tweak KVM_MMU_ROOTS_ALL to precisely cover all current+previous root
flags, and add a sanity in kvm_mmu_free_roots() to verify that the set
of roots to free doesn't stray outside KVM_MMU_ROOTS_ALL.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-8-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename mmu->sync_page to mmu->sync_spte and move the code out
of FNAME(sync_page)'s loop body into mmu.c.
No functionalities change intended.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-6-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned
long, as the type of the address to invalidate. On 32-bit kernels, the
upper 32 bits of the GPA will get dropped when an L2 GPA address is
invalidated in the shadowed nested TDP MMU.
Convert it to u64 to fix the problem.
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults
on self-changing writes to shadowed page tables instead of propagating
that information to the emulator via a semi-persistent vCPU flag. Using
a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented,
as it's not at all obvious that clearing the flag only when emulation
actually occurs is correct.
E.g. if KVM sets the flag and then retries the fault without ever getting
to the emulator, the flag will be left set for future calls into the
emulator. But because the flag is consumed if and only if both
EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because
EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated
MMIO, or while L2 is active, KVM avoids false positives on a stale flag
since FNAME(page_fault) is guaranteed to be run and refresh the flag
before it's ultimately consumed by the tail end of reexecute_instruction().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230202182817.407394-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place.
- Add support for taking stage-2 access faults in parallel. This was an
accidental omission in the original parallel faults implementation,
but should provide a marginal improvement to machines w/o FEAT_HAFDBS
(such as hardware from the fruit company).
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception handling
and masking unsupported features for nested guests.
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM.
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at reducing
the trap overhead of running nested.
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems.
- Avoid VM-wide stop-the-world operations when a vCPU accesses its own
redistributor.
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
in the host.
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add myself as
co-maintainer
This also drags in a couple of branches to avoid conflicts:
- The shared 'kvm-hw-enable-refactor' branch that reworks
initialization, as it conflicted with the virtual cache topology
changes.
- arm64's 'for-next/sme2' branch, as the PSCI relay changes, as both
touched the EL2 initialization code.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmPw29cPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpD9doQAIJyMW0odT6JBe15uGCxTuTnJbb8mniajJdX
CuSxPl85WyKLtZbIJLRTQgyt6Nzbu0N38zM0y/qBZT5BvAnWYI8etvnJhYZjooAy
jrf0Me/GM5hnORXN+1dByCmlV+DSuBkax86tgIC7HhU71a2SWpjlmWQi/mYvQmIK
PBAqpFF+w2cWHi0ZvCq96c5EXBdN4FLEA5cdZhekCbgw1oX8+x+HxdpBuGW5lTEr
9oWOzOzJQC1uFnjP3unFuIaG94QIo+NA4aGLMzfb7wm2wdQUnKebtdj/RxsDZOKe
43Q1+MDFWMsxxFu4FULH8fPMwidIm5rfz3pw3JJloqaZp8vk/vjDLID7AYucMIX8
1G/mjqz6E9lYvv57WBmBhT/+apSDAmeHlAT97piH73Nemga91esDKuHSdtA8uB5j
mmzcUYajuB2GH9rsaXJhVKt/HW7l9fbGliCkI99ckq/oOTO9VsKLsnwS/rMRIsPn
y2Y8Lyoe4eqokd1DNn5/bo+3qDnfmzm6iDmZOo+JYuJv9KS95zuw17Wu7la9UAPV
e13+btoijHDvu8RnTecuXljWfAAKVtEjpEIoS5aP2R2iDvhr0d8POlMPaJ40YuRq
D2fKr18b6ngt+aI0TY63/ksEIFexx67HuwQsUZ2lRjyjq5/x+u3YIqUPbKrU4Rnl
uxXjSvyr
=r4s/
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.3
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place.
- Add support for taking stage-2 access faults in parallel. This was an
accidental omission in the original parallel faults implementation,
but should provide a marginal improvement to machines w/o FEAT_HAFDBS
(such as hardware from the fruit company).
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception handling
and masking unsupported features for nested guests.
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM.
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at reducing
the trap overhead of running nested.
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems.
- Avoid VM-wide stop-the-world operations when a vCPU accesses its own
redistributor.
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
in the host.
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
This also drags in arm64's 'for-next/sme2' branch, because both it and
the PSCI relay changes touch the EL2 initialization code.
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move the SVM-specific "host flags" into vcpu_svm (extracted from the
vNMI enabling series)
- A handful for fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsH7kSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5/dQQAJSVCYA7F7LcfJf+c1ULG0XCd8rHnXdR
EGTygTnWrzwTCRaunOBPE4AJRxKdkwTKy+yfnnQVRdYYfRe1SZKpUQ2XNEDGn0+v
zVfOmSFFcCWXmJeY8y5n1GlDH4ENO3G7nD1ncDQ0I9PazmOsmxChoVZ9afFJ5bpo
73hjcYVfUDYxGkeRLWSSSFtWIGguE8BkpRH3wZ8MGZi+ueoFUJPPBKHeDtxCV2/T
KcJLne8tQVTiWCdMO3EFwxgIvsjQDoT0gZYLYNHJ6KqD9Szc3jA9v2ryTm5IYlpb
akYUqePaD0SGrrfDBrwz3bLu3fehDu7eduXESRlzzb8S4xP7/qXeo9KeVN+b4MBb
nmBBFncvMWbC8Po5wB5OVAfAa7ACmGiXeBV8pfgGI6FTq1fpc4VNm2PevKkDvlqN
O2eZ1KuNkwBnbIPj3JVPPnsJcUjYXFjZyzfpMV1T+ExmL/IYceatX4S7zfgL5nUg
3qFi5mX2Cufk2EBvBu+Dkpt/H4lze+ysZRciMC+v7Q4LWAYZ8HW1a44pnBVUJMPM
bWiJ1/O8RIWM1tWIrlO38+ZZalbu3spIVMBXKzqEGXvpUwJ4UgZM1tFiWvISTVFe
2X6N3d7aT/DQ1PzZU6BsyVZWAFaodHBauMcr9FUkWqqGu3HOhqC4rSJZ9eRR7V5O
WSp1gTVY1JXy
=AVpx
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.3:
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move the SVM-specific "host flags" into vcpu_svm (extracted from the
vNMI enabling series)
- A handful for fixes and cleanups
- Add support for created masked events for the PMU filter to allow
userspace to heavily restrict what events the guest can use without
needing to create an absurd number of events
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel SPR
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsFZ4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5eKEP/0qeZsOQot53wkf+wNiGh1X6qDacBPFP
A8GAPC70fEisxAt776DeKEBwikHpARPglCt1Il9dFvkG+0jgYpvPu8UGF1LpouKX
cD/7itr2k8GZlXZBg2Rgu3TRyFBJEGHT6tAu7PBhZyL6yWQDUxao8FPFrRGfmJ7O
Z6eFMo1cERNHICQm+W/2TBd1xguiF+m4CXKlA70R4wzM37aPF9o5HvmIwAvPzyhU
w4WzcIQbjVPs1VpBTzwPqRmyZ8omSlDYo7VqmsDiRtJbucqgbhFI2wR+nyImFCa9
D2pI5TV3CFTt0fvd8SZpH19nR3S6cMLCXONOsijmvR2BmS3PhJMP4dMm5m4R06nF
RBtnTj9fkbeL1ghFEkMxHBZVTG3bBlO4ySOxIqNHCvPjqQ37mJ+xP4C8kcIC9p5F
+xL3AvZ7zenPv3A29SY9YH+QvZLBwyDJzAsveLeYkLFoJxoDT4glOY/Wpi1rkZ17
/zHDZWoF49l1Eu3Bql0hFetkCreUNFGpa4moUmEC0evYOvV2WCb+39TDXZ8CPCGD
+cDiRnD8MFQpBw47F03EnFheFHxiJoL0Clv5vvM3C+xOq2J9WVG9mqQWCk+4ta2B
Um4D++0a9lwvJhOImaR7uyiV3K7oVm+rU8+46x+nTNGaIP2bnE+vronY+b6KGeUx
7+xzTKlYygGe
=ev5v
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.3:
- Add support for created masked events for the PMU filter to allow
userspace to heavily restrict what events the guest can use without
needing to create an absurd number of events
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel SPR
* kvm-arm64/misc:
: Miscellaneous updates
:
: - Convert CPACR_EL1_TTA to the new, generated system register
: definitions.
:
: - Serialize toggling CPACR_EL1.SMEN to avoid unexpected exceptions when
: accessing SVCR in the host.
:
: - Avoid quiescing the guest if a vCPU accesses its own redistributor's
: SGIs/PPIs, eliminating the need to IPI. Largely an optimization for
: nested virtualization, as the L1 accesses the affected registers
: rather often.
:
: - Conversion to kstrtobool()
:
: - Common definition of INVALID_GPA across architectures
:
: - Enable CONFIG_USERFAULTFD for CI runs of KVM selftests
KVM: arm64: Fix non-kerneldoc comments
KVM: selftests: Enable USERFAULTFD
KVM: selftests: Remove redundant setbuf()
arm64/sysreg: clean up some inconsistent indenting
KVM: MMU: Make the definition of 'INVALID_GPA' common
KVM: arm64: vgic-v3: Use kstrtobool() instead of strtobool()
KVM: arm64: vgic-v3: Limit IPI-ing when accessing GICR_{C,S}ACTIVER0
KVM: arm64: Synchronize SMEN on vcpu schedule out
KVM: arm64: Kill CPACR_EL1_TTA definition
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Instead of re-defining the "host flags" bits, just expose dedicated
helpers for each of the two remaining flags that are consumed by the
emulator. The emulator never consumes both "is guest" and "is SMM" in
close proximity, so there is no motivation to avoid additional indirect
branches.
Also while at it, garbage collect the recently removed host flags.
No functional change is intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-6-mlevitsk@redhat.com
[sean: fix CONFIG_KVM_SMM=n builds, tweak names of wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move HF_NMI_MASK and HF_IRET_MASK (a.k.a. "waiting for IRET") out of the
common "hflags" and into dedicated flags in "struct vcpu_svm". The flags
are used only for the SVM and thus should not be in hflags.
Tracking NMI masking in software isn't SVM specific, e.g. VMX has a
similar flag (soft_vnmi_blocked), but that's much more of a hack as VMX
can't intercept IRET, is useful only for ancient CPUs, i.e. will
hopefully be removed at some point, and again the exact behavior is
vendor specific and shouldn't ever be referenced in common code.
converting VMX
No functional change is intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com
[sean: split from HF_GIF_MASK patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move HF_GIF_MASK out of the common "hflags" and into vcpu_svm.guest_gif.
GIF is an SVM-only concept and has should never be consulted outside of
SVM-specific code.
No functional change is intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com
[sean: split to separate patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Limit the set of MSRs for fixed PMU counters based on the number of fixed
counters actually supported by the host so that userspace doesn't waste
time saving and restoring dummy values.
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: split for !enable_pmu logic, drop min(), write changelog]
Link: https://lore.kernel.org/r/20230124234905.3774678-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When building a list of filter events, it can sometimes be a challenge
to fit all the events needed to adequately restrict the guest into the
limited space available in the pmu event filter. This stems from the
fact that the pmu event filter requires each event (i.e. event select +
unit mask) be listed, when the intention might be to restrict the
event select all together, regardless of it's unit mask. Instead of
increasing the number of filter events in the pmu event filter, add a
new encoding that is able to do a more generalized match on the unit mask.
Introduce masked events as another encoding the pmu event filter
understands. Masked events has the fields: mask, match, and exclude.
When filtering based on these events, the mask is applied to the guest's
unit mask to see if it matches the match value (i.e. umask & mask ==
match). The exclude bit can then be used to exclude events from that
match. E.g. for a given event select, if it's easier to say which unit
mask values shouldn't be filtered, a masked event can be set up to match
all possible unit mask values, then another masked event can be set up to
match the unit mask values that shouldn't be filtered.
Userspace can query to see if this feature exists by looking for the
capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS.
This feature is enabled by setting the flags field in the pmu event
filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS.
Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY().
It is an error to have a bit set outside the valid bits for a masked
event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in
such cases, including the high bits of the event select (35:32) if
called on Intel.
With these updates the filter matching code has been updated to match on
a common event. Masked events were flexible enough to handle both event
types, so they were used as the common event. This changes how guest
events get filtered because regardless of the type of event used in the
uAPI, they will be converted to masked events. Because of this there
could be a slight performance hit because instead of matching the filter
event with a lookup on event select + unit mask, it does a lookup on event
select then walks the unit masks to find the match. This shouldn't be a
big problem because I would expect the set of common event selects to be
small, and if they aren't the set can likely be reduced by using masked
events to generalize the unit mask. Using one type of event when
filtering guest events allows for a common code path to be used.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The scaling information in subleaf 1 should match the values set by KVM in
the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info)
which is shared with the guest, but is not directly available to the VMM.
The offset values are not set since a TSC offset is already applied.
The TSC frequency should also be set in sub-leaf 2.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
A subsequent patch will need to acquire the CPUID leaf range for emulated
Xen so explicitly pass the signature of the hypervisor we're interested in
to the new function. Also introduce a new kvm_hypervisor_cpuid structure
so we can neatly store both the base and limit leaf indices.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230106103600.528-2-pdurrant@amazon.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging.
Both fields count the number of memslots that have dirty-logging enabled,
with the only difference being that cpu_dirty_logging_count is only
incremented when using PML. So while nr_memslots_dirty_logging is not a
direct replacement for cpu_dirty_logging_count, it can be combined with
enable_pml to get the same information.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The first half or so patches fix semi-urgent, real-world relevant APICv
and AVIC bugs.
The second half fixes a variety of AVIC and optimized APIC map bugs
where KVM doesn't play nice with various edge cases that are
architecturally legal(ish), but are unlikely to occur in most real world
scenarios
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ARM:
* Fix the PMCR_EL0 reset value after the PMU rework
* Correctly handle S2 fault triggered by a S1 page table walk
by not always classifying it as a write, as this breaks on
R/O memslots
* Document why we cannot exit with KVM_EXIT_MMIO when taking
a write fault from a S1 PTW on a R/O memslot
* Put the Apple M2 on the naughty list for not being able to
correctly implement the vgic SEIS feature, just like the M1
before it
* Reviewer updates: Alex is stepping down, replaced by Zenghui
x86:
* Fix various rare locking issues in Xen emulation and teach lockdep
to detect them
* Documentation improvements
* Do not return host topology information from KVM_GET_SUPPORTED_CPUID
KVM already has a 'GPA_INVALID' defined as (~(gpa_t)0) in kvm_types.h,
and it is used by ARM code. We do not need another definition of
'INVALID_GPA' for X86 specifically.
Instead of using the common 'GPA_INVALID' for X86, replace it with
'INVALID_GPA', and change the users of 'GPA_INVALID' so that the diff
can be smaller. Also because the name 'INVALID_GPA' tells the user we
are using an invalid GPA, while the name 'GPA_INVALID' is emphasizing
the GPA is an invalid one.
No functional change intended.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230105130127.866171-1-yu.c.zhang@linux.intel.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Track the per-vendor required APICv inhibits with a variable instead of
calling into vendor code every time KVM wants to query the set of
required inhibits. The required inhibits are a property of the vendor's
virtualization architecture, i.e. are 100% static.
Using a variable allows the compiler to inline the check, e.g. generate
a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking
inhibits for performance reasons.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inhibit SVM's AVIC if multiple vCPUs are aliased to the same logical ID.
Architecturally, all CPUs whose logical ID matches the MDA are supposed
to receive the interrupt; overwriting existing entries in AVIC's
logical=>physical map can result in missed IPIs.
Fixes: 18f40c53e1 ("svm: Add VMEXIT handlers for AVIC")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inhibit APICv/AVIC if the optimized physical map is disabled so that KVM
KVM provides consistent APIC behavior if xAPIC IDs are aliased due to
vcpu_id being truncated and the x2APIC hotplug hack isn't enabled. If
the hotplug hack is disabled, events that are emulated by KVM will follow
architectural behavior (all matching vCPUs receive events, even if the
"match" is due to truncation), whereas APICv and AVIC will deliver events
only to the first matching vCPU, i.e. the vCPU that matches without
truncation.
Note, the "extra" inhibit is needed because KVM deliberately ignores
mismatches due to truncation when applying the APIC_ID_MODIFIED inhibit
so that large VMs (>255 vCPUs) can run with APICv/AVIC.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Track all possibilities for the optimized APIC map's logical modes
instead of overloading the pseudo-bitmap and treating any "unknown" value
as "invalid".
As documented by the now-stale comment above the mode values, the values
did have meaning when the optimized map was originally added. That
dependent logical was removed by commit e45115b62f ("KVM: x86: use
physical LAPIC array for logical x2APIC"), but the obfuscated behavior
and its comment were left behind.
Opportunistically rename "mode" to "logical_mode", partly to make it
clear that the "disabled" case applies only to the logical map, but also
to prove that there is no lurking code that expects "mode" to be a bitmap.
Functionally, this is a glorified nop.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Free the APIC access page memslot if any vCPU enables x2APIC and SVM's
AVIC is enabled to prevent accesses to the virtual APIC on vCPUs with
x2APIC enabled. On AMD, if its "hybrid" mode is enabled (AVIC is enabled
when x2APIC is enabled even without x2AVIC support), keeping the APIC
access page memslot results in the guest being able to access the virtual
APIC page as x2APIC is fully emulated by KVM. I.e. hardware isn't aware
that the guest is operating in x2APIC mode.
Exempt nested SVM's update of APICv state from the new logic as x2APIC
can't be toggled on VM-Exit. In practice, invoking the x2APIC logic
should be harmless precisely because it should be a glorified nop, but
play it safe to avoid latent bugs, e.g. with dropping the vCPU's SRCU
lock.
Intel doesn't suffer from the same issue as APICv has fully independent
VMCS controls for xAPIC vs. x2APIC virtualization. Technically, KVM
should provide bus error semantics and not memory semantics for the APIC
page when x2APIC is enabled, but KVM already provides memory semantics in
other scenarios, e.g. if APICv/AVIC is enabled and the APIC is hardware
disabled (via APIC_BASE MSR).
Note, checking apic_access_memslot_enabled without taking locks relies
it being set during vCPU creation (before kvm_vcpu_reset()). vCPUs can
race to set the inhibit and delete the memslot, i.e. can get false
positives, but can't get false negatives as apic_access_memslot_enabled
can't be toggled "on" once any vCPU reaches KVM_RUN.
Opportunistically drop the "can" while updating avic_activate_vmcb()'s
comment, i.e. to state that KVM _does_ support the hybrid mode. Move
the "Note:" down a line to conform to preferred kernel/KVM multi-line
comment style.
Opportunistically update the apicv_update_lock comment, as it isn't
actually used to protect apic_access_memslot_enabled (which is protected
by slots_lock).
Fixes: 0e311d33bf ("KVM: SVM: Introduce hybrid-AVIC mode")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In commit 14243b3871 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN
and event channel delivery") the clever version of me left some helpful
notes for those who would come after him:
/*
* For the irqfd workqueue, using the main kvm->lock mutex is
* fine since this function is invoked from kvm_set_irq() with
* no other lock held, no srcu. In future if it will be called
* directly from a vCPU thread (e.g. on hypercall for an IPI)
* then it may need to switch to using a leaf-node mutex for
* serializing the shared_info mapping.
*/
mutex_lock(&kvm->lock);
In commit 2fd6df2f2b ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
the other version of me ran straight past that comment without reading it,
and introduced a potential deadlock by taking vcpu->mutex and kvm->lock
in the wrong order.
Solve this as originally suggested, by adding a leaf-node lock in the Xen
state rather than using kvm->lock for it.
Fixes: 2fd6df2f2b ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230111180651.14394-4-dwmw2@infradead.org>
[Rebase, add docs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>