Revert the hack to allow host-initiated accesses to all "PMU" MSRs,
as intel_is_valid_msr() returns true for _all_ MSRs, regardless of whether
or not it has a snowball's chance in hell of actually being a PMU MSR.
That mostly gets papered over by the actual get/set helpers only handling
MSRs that they knows about, except there's the minor detail that
kvm_pmu_{g,s}et_msr() eat reads and writes when the PMU is disabled.
I.e. KVM will happy allow reads and writes to _any_ MSR if the PMU is
disabled, either via module param or capability.
This reverts commit d1c88a4020.
Fixes: d1c88a4020 ("KVM: x86: always allow host-initiated writes to PMU MSRs")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eating reads and writes to all "PMU" MSRs when there is no PMU is wildly
broken as it results in allowing accesses to _any_ MSR on Intel CPUs
as intel_is_valid_msr() returns true for all host_initiated accesses.
A revert of commit d1c88a4020 ("KVM: x86: always allow host-initiated
writes to PMU MSRs") will soon follow.
This reverts commit 8e6a58e28b.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the per-vCPU apicv_active flag into KVM's local APIC instance.
APICv is fully dependent on an in-kernel local APIC, but that's not at
all clear when reading the current code due to the flag being stored in
the generic kvm_vcpu_arch struct.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop a handful of unused AVIC function declarations whose implementations
were removed during the conversion to optional static calls.
No functional change intended.
Fixes: abb6d479e2 ("KVM: x86: make several APIC virtualization callbacks optional")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use root_level in svm_load_mmu_pg() rather that looking up the root
level in vcpu->arch.mmu->root_role.level. svm_load_mmu_pgd() has only
one caller, kvm_mmu_load_pgd(), which always passes
vcpu->arch.mmu->root_role.level as root_level.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wrap the manipulation of @role and the manual mutex_{release,acquire}()
invocations in CONFIG_PROVE_LOCKING=y to squash a clang-15 warning. When
building with -Wunused-but-set-parameter and CONFIG_DEBUG_LOCK_ALLOC=n,
clang-15 seees there's no usage of @role in mutex_lock_killable_nested()
and yells. PROVE_LOCKING selects DEBUG_LOCK_ALLOC, and the only reason
KVM manipulates @role is to make PROVE_LOCKING happy.
To avoid true ugliness, use "i" and "j" to detect the first pass in the
loops; the "idx" field that's used by kvm_for_each_vcpu() is guaranteed
to be '0' on the first pass as it's simply the first entry in the vCPUs
XArray, which is fully KVM controlled. kvm_for_each_vcpu() passes '0'
for xa_for_each_range()'s "start", and xa_for_each_range() will not enter
the loop if there's no entry at '0'.
Fixes: 0c2c7c0692 ("KVM: SEV: Mark nested locking of vcpu->lock")
Reported-by: kernel test robot <lkp@intel.com>
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220613214237.2538266-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This caused a warning on 32-bit systems, but undoubtedly would have acted
funny on 64-bit as well.
The fix was applied directly on merge in 5.19, see commit 24625f7d91 ("Merge
tag for-linus of git://git.kernel.org/pub/scm/virt/kvm/kvm").
Fixes: 3743c2f025 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to show tests
x86:
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Rewrite gfn-pfn cache refresh
* Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit
Commit 74fd41ed16 ("KVM: x86: nSVM: support PAUSE filtering when L0
doesn't intercept PAUSE") introduced passthrough support for nested pause
filtering, (when the host doesn't intercept PAUSE) (either disabled with
kvm module param, or disabled with '-overcommit cpu-pm=on')
Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards,
the feature was exposed as supported by KVM cpuid unconditionally, thus
if L1 could try to use it even when the L0 KVM can't really support it.
In this case the fallback caused KVM to intercept each PAUSE instruction;
in some cases, such intercept can slow down the nested guest so much
that it can fail to boot. Instead, before the problematic commit KVM
was already setting both thresholds to 0 in vmcb02, but after the first
userspace VM exit shrink_ple_window was called and would reset the
pause_filter_count to the default value.
To fix this, change the fallback strategy - ignore the guest threshold
values, but use/update the host threshold values unless the guest
specifically requests disabling PAUSE filtering (either simple or
advanced).
Also fix a minor bug: on nested VM exit, when PAUSE filter counter
were copied back to vmcb01, a dirty bit was not set.
Thanks a lot to Suravee Suthikulpanit for debugging this!
Fixes: 74fd41ed16 ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that these functions are always called with preemption disabled,
remove the preempt_disable()/preempt_enable() pair inside them.
No functional change intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two issues in avic_kick_target_vcpus_fast
1. It is legal to issue an IPI request with APIC_DEST_NOSHORT
and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic),
which must be treated as a broadcast destination.
Fix this by explicitly checking for it.
Also don’t use ‘index’ in this case as it gives no new information.
2. It is legal to issue a logical IPI request to more than one target.
Index field only provides index in physical id table of first
such target and therefore can't be used before we are sure
that only a single target was addressed.
Instead, parse the ICRL/ICRH, double check that a unicast interrupt
was requested, and use that info to figure out the physical id
of the target vCPU.
At that point there is no need to use the index field as well.
In addition to fixing the above issues, also skip the call to
kvm_apic_match_dest.
It is possible to do this now, because now as long as AVIC is not
inhibited, it is guaranteed that none of the vCPUs changed their
apic id from its default value.
This fixes boot of windows guest with AVIC enabled because it uses
IPI with 0xFF destination and no destination shorthand.
Fixes: 7223fd2d53 ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AVIC is now inhibited if the guest changes the apic id,
and therefore this code is no longer needed.
There are several ways this code was broken, including:
1. a vCPU was only allowed to change its apic id to an apic id
of an existing vCPU.
2. After such change, the vCPU whose apic id entry was overwritten,
could not correctly change its own apic id, because its own
entry is already overwritten.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Neither of these settings should be changed by the guest and it is
a burden to support it in the acceleration code, so just inhibit
this code instead.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, as is the case for
MSR_K7_EVNTSEL0 or MSR_F15H_PERF_CTL0, it has to be always retrievable
and settable with KVM_GET_MSR and KVM_SET_MSR.
Accept a zero value for these MSRs to obey the contract.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220601031925.59693-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add kvm_caps to hold a variety of capabilites and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature. The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW,
which means that the table that maps perf_hw_id to event select values is
no longer useful, at least for AMD.
For Intel, the logic to check if the pmu event reported by Intel cpuid is
not available is still required, in which case pmc_perf_hw_id() could be
renamed to hw_event_is_unavail() and a bool value is returned to replace
the semantics of "PERF_COUNT_HW_MAX+1".
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220518132512.37864-12-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since reprogram_counter(), reprogram_{gp, fixed}_counter() currently have
the same incoming parameter "struct kvm_pmc *pmc", the callers can simplify
the conetxt by using uniformly exported interface, which makes reprogram_
{gp, fixed}_counter() static and eliminates EXPORT_SYMBOL_GPL.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220518132512.37864-8-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because inside reprogram_gp_counter() it is bound to assign the requested
eventel to pmc->eventsel, this assignment step can be moved forward, thus
simplifying the passing of parameters to "struct kvm_pmc *pmc" only.
No functional change intended.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220518132512.37864-6-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it has to be always
retrievable and settable with KVM_GET_MSR and KVM_SET_MSR. Accept
the PMU MSRs unconditionally in intel_is_valid_msr, if the access was
host-initiated.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A NMI that L1 wants to inject into its L2 should be directly re-injected,
without causing L0 side effects like engaging NMI blocking for L1.
It's also worth noting that in this case it is L1 responsibility
to track the NMI window status for its L2 guest.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
"IRQs", i.e. interrupts that are reinjected after incomplete delivery of
a software interrupt from an INTn instruction. Tag reinjected interrupts
as such, even though the information is usually redundant since soft
interrupts are only ever reinjected by KVM. Though rare in practice, a
hard IRQ can be reinjected.
Signed-off-by: Sean Christopherson <seanjc@google.com>
[MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Re-inject INTn software interrupts instead of retrying the instruction if
the CPU encountered an intercepted exception while vectoring the INTn,
e.g. if KVM intercepted a #PF when utilizing shadow paging. Retrying the
instruction is architecturally wrong e.g. will result in a spurious #DB
if there's a code breakpoint on the INT3/O, and lack of re-injection also
breaks nested virtualization, e.g. if L1 injects a software interrupt and
vectoring the injected interrupt encounters an exception that is
intercepted by L0 but not L1.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Re-inject INT3/INTO instead of retrying the instruction if the CPU
encountered an intercepted exception while vectoring the software
exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
shadow paging. Retrying the instruction is architecturally wrong, e.g.
will result in a spurious #DB if there's a code breakpoint on the INT3/O,
and lack of re-injection also breaks nested virtualization, e.g. if L1
injects a software exception and vectoring the injected exception
encounters an exception that is intercepted by L0 but not L1.
Due to, ahem, deficiencies in the SVM architecture, acquiring the next
RIP may require flowing through the emulator even if NRIPS is supported,
as the CPU clears next_rip if the VM-Exit is due to an exception other
than "exceptions caused by the INT3, INTO, and BOUND instructions". To
deal with this, "skip" the instruction to calculate next_rip (if it's
not already known), and then unwind the RIP write and any side effects
(RFLAGS updates).
Save the computed next_rip and use it to re-stuff next_rip if injection
doesn't complete. This allows KVM to do the right thing if next_rip was
known prior to injection, e.g. if L1 injects a soft event into L2, and
there is no backing INTn instruction, e.g. if L1 is injecting an
arbitrary event.
Note, it's impossible to guarantee architectural correctness given SVM's
architectural flaws. E.g. if the guest executes INTn (no KVM injection),
an exit occurs while vectoring the INTn, and the guest modifies the code
stream while the exit is being handled, KVM will compute the incorrect
next_rip due to "skipping" the wrong instruction. A future enhancement
to make this less awful would be for KVM to detect that the decoded
instruction is not the correct INTn and drop the to-be-injected soft
event (retrying is a lesser evil compared to shoving the wrong RIP on the
exception stack).
Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If NRIPS is supported in hardware but disabled in KVM, set next_rip to
the next RIP when advancing RIP as part of emulating INT3 injection.
There is no flag to tell the CPU that KVM isn't using next_rip, and so
leaving next_rip is left as is will result in the CPU pushing garbage
onto the stack when vectoring the injected event.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: 66b7138f91 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <cd328309a3b88604daa2359ad56f36cb565ce2d4.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unwind the RIP advancement done by svm_queue_exception() when injecting
an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
vectoring the injected event, even if the exception reported by the CPU
isn't the same event that was injected. If vectoring INT3 encounters an
exception, e.g. #NP, and vectoring the #NP encounters an intercepted
exception, e.g. #PF when KVM is using shadow paging, then the #NP will
be reported as the event that was in-progress.
Note, this is still imperfect, as it will get a false positive if the
INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
handler in the guest, the instruction following the INT3 generates an
exception (directly or indirectly), _and_ vectoring that exception
encounters an exception that is intercepted by KVM. The false positives
could theoretically be solved by further analyzing the vectoring event,
e.g. by comparing the error code against the expected error code were an
exception to occur when vectoring the original injected exception, but
SVM without NRIPS is a complete disaster, trying to make it 100% correct
is a waste of time.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: 66b7138f91 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <450133cf0a026cb9825a2ff55d02cb136a1cb111.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field value (instead of the saved guest RIP) in used by the CPU for
the return address pushed on stack when injecting a software interrupt or
INT3 or INTO exception.
Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
saves RIP on the stack as-is).
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access. A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.
To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary. A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.
It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed. However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses. Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SVM uses a per-cpu variable to cache the current value of the
tsc scaling multiplier msr on each cpu.
Commit 1ab9287add
("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
broke this caching logic.
Refactor the code so that all TSC scaling multiplier writes go through
a single function which checks and updates the cache.
This fixes the following scenario:
1. A CPU runs a guest with some tsc scaling ratio.
2. New guest with different tsc scaling ratio starts on this CPU
and terminates almost immediately.
This ensures that the short running guest had set the tsc scaling ratio just
once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
the per-cpu cache is not updated.
3. The original guest continues to run, it doesn't restore the msr
value back to its own value, because the cache matches,
and thus continues to run with a wrong tsc scaling ratio.
Fixes: 1ab9287add ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* ultravisor communication device driver
* fix TEID on terminating storage key ops
RISC-V:
* Added Sv57x4 support for G-stage page table
* Added range based local HFENCE functions
* Added remote HFENCE functions based on VCPU requests
* Added ISA extension registers in ONE_REG interface
* Updated KVM RISC-V maintainers entry to cover selftests support
ARM:
* Add support for the ARMv8.6 WFxT extension
* Guard pages for the EL2 stacks
* Trap and emulate AArch32 ID registers to hide unsupported features
* Ability to select and save/restore the set of hypercalls exposed
to the guest
* Support for PSCI-initiated suspend in collaboration with userspace
* GICv3 register-based LPI invalidation support
* Move host PMU event merging into the vcpu data structure
* GICv3 ITS save/restore fixes
* The usual set of small-scale cleanups and fixes
x86:
* New ioctls to get/set TSC frequency for a whole VM
* Allow userspace to opt out of hypercall patching
* Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
AMD SEV improvements:
* Add KVM_EXIT_SHUTDOWN metadata for SEV-ES
* V_TSC_AUX support
Nested virtualization improvements for AMD:
* Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)
* Allow AVIC to co-exist with a nested guest running
* Fixes for LBR virtualizations when a nested guest is running,
and nested LBR virtualization support
* PAUSE filtering for nested hypervisors
Guest support:
* Decoupling of vcpu_is_preempted from PV spinlocks
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKN9M4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNLeAf+KizAlQwxEehHHeNyTkZuKyMawrD6
zsqAENR6i1TxiXe7fDfPFbO2NR0ZulQopHbD9mwnHJ+nNw0J4UT7g3ii1IAVcXPu
rQNRGMVWiu54jt+lep8/gDg0JvPGKVVKLhxUaU1kdWT9PhIOC6lwpP3vmeWkUfRi
PFL/TMT0M8Nfryi0zHB0tXeqg41BiXfqO8wMySfBAHUbpv8D53D2eXQL6YlMM0pL
2quB1HxHnpueE5vj3WEPQ3PCdy1M2MTfCDBJAbZGG78Ljx45FxSGoQcmiBpPnhJr
C6UGP4ZDWpml5YULUoA70k5ylCbP+vI61U4vUtzEiOjHugpPV5wFKtx5nw==
=ozWx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"S390:
- ultravisor communication device driver
- fix TEID on terminating storage key ops
RISC-V:
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
ARM:
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed to
the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
x86:
- New ioctls to get/set TSC frequency for a whole VM
- Allow userspace to opt out of hypercall patching
- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
AMD SEV improvements:
- Add KVM_EXIT_SHUTDOWN metadata for SEV-ES
- V_TSC_AUX support
Nested virtualization improvements for AMD:
- Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)
- Allow AVIC to co-exist with a nested guest running
- Fixes for LBR virtualizations when a nested guest is running, and
nested LBR virtualization support
- PAUSE filtering for nested hypervisors
Guest support:
- Decoupling of vcpu_is_preempted from PV spinlocks"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits)
KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
KVM: selftests: x86: Sync the new name of the test case to .gitignore
Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
x86, kvm: use correct GFP flags for preemption disabled
KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
x86/kvm: Alloc dummy async #PF token outside of raw spinlock
KVM: x86: avoid calling x86 emulator without a decoded instruction
KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
s390/uv_uapi: depend on CONFIG_S390
KVM: selftests: x86: Fix test failure on arch lbr capable platforms
KVM: LAPIC: Trace LAPIC timer expiration on every vmentry
KVM: s390: selftest: Test suppression indication on key prot exception
KVM: s390: Don't indicate suppression on dirtying, failing memop
selftests: drivers/s390x: Add uvdevice tests
drivers/s390/char: Add Ultravisor io device
MAINTAINERS: Update KVM RISC-V entry to cover selftests support
RISC-V: KVM: Introduce ISA extension register
RISC-V: KVM: Cleanup stale TLB entries when host CPU changes
RISC-V: KVM: Add remote HFENCE functions based on VCPU requests
...
For some sev ioctl interfaces, the length parameter that is passed maybe
less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data
that PSP firmware returns. In this case, kmalloc will allocate memory
that is the size of the input rather than the size of the data.
Since PSP firmware doesn't fully overwrite the allocated buffer, these
sev ioctl interface may return uninitialized kernel slab memory.
Reported-by: Andy Nguyen <theflow@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Peter Gonda <pgonda@google.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: eaf78265a4 ("KVM: SVM: Move SEV code to separate file")
Fixes: 2c07ded064 ("KVM: SVM: add support for SEV attestation command")
Fixes: 4cfdd47d6d ("KVM: SVM: Add KVM_SEV SEND_START command")
Fixes: d3d1af85e2 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: eba04b20e4 ("KVM: x86: Account a variety of miscellaneous allocations")
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmKHGu8ACgkQrUjsVaLH
LAe1sQ/40ltbl/v0cW+zkuUOem+apmJMhtoCfh2Pv00yUYftUNw01Uu+NN04T70x
PYwbu0O8j4dgIFNRPU7VQBVI+fJydkgEr3kpk8UOCCGKiE0NAcFoQv70ngPObc4W
L425i2RviZuQUXLTFsoLOb246p8V8lkfbEQKqWksFEROYWFbdNKmaLpfVqq3Bia2
+G8L2OyAHGjUXgIdOnflZHxowJg4ueGob3iH+4AhZNUpIQYtlKSfi/eo0vmzf5Uz
bD35o6y4G7NnZJyZoKb3QAEt0WQ55YDsNN62XrULQ7GEuWnpez+Jhw3jtrAr59Q7
m8n93NMKKJ9CbnsspFJ+4nHCd2Gb4i99Py70IW6Ro22DL8KRrLDv2ZQi3dJCGrAT
MtER+12coglkgjhDmLn6MMEjWkgbXXxQCEs4OQ8VMORtHAsOQEszu5TCEnihXr2q
+uUZ5O0G6eDowctOVMTdqVMtj1u1AT7fZ68evvk4omNnoFWjkQzd4sVPNDJtK+nC
7mA9IUyC2LSvr/oNNpcuIZsKU6OzQUQ5ISTMpbP/HJInFcvYbJTl0I8UcvjzlImo
81CZTUQOY9kQE+VUTHcGqPr0TjN/YlfF//koiCfeTycN0jbRZZ9rpcRQ38R8sDsS
yy7JQqwpi/x8me9ldt5r19ky5zMlCKpnQfGX6ws+umhqVEHBKw==
=Xznv
-----END PGP SIGNATURE-----
Merge tag 'kvm-riscv-5.19-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv changes for 5.19
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed
to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmKGAGsPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDB/gQAMhyZ+wCG0OMEZhwFF6iDfxVEX2Kw8L41NtD
a/e6LDWuIOGihItpRkYROc5myG74D7XckF2Bz3G7HJoU4vhwHOV/XulE26GFizoC
O1GVRekeSUY81wgS1yfo0jojLupBkTjiq3SjTHoDP7GmCM0qDPBtA0QlMRzd2bMs
Kx0+UUXZUHFSTXc7Lp4vqNH+tMp7se+yRx7hxm6PCM5zG+XYJjLxnsZ0qpchObgU
7f6YFojsLUs1SexgiUqJ1RChVQ+FkgICh5HyzORvGtHNNzK6D2sIbsW6nqMGAMql
Kr3A5O/VOkCztSYnLxaa76/HqD21mvUrXvr3grhabNc7rOmuzWV0dDgr6c6wHKHb
uNCtH4d7Ra06gUrEOrfsgLOLn0Zqik89y6aIlMsnTudMg9gMNgFHy1jz4LM7vMkY
FS5AVj059heg2uJcfgTvzzcqneyuBLBmF3dS4coowO6oaj8SycpaEmP5e89zkPMI
1kk8d0e6RmXuCh/2AJ8GxxnKvBPgqp2mMKXOCJ8j4AmHEDX/CKpEBBqIWLKkplUU
8DGiOWJUtRZJg398dUeIpiVLoXJthMODjAnkKkuhiFcQbXomlwgg7YSnNAz6TRED
Z7KR2leC247kapHnnagf02q2wED8pBeyrxbQPNdrHtSJ9Usm4nTkY443HgVTJW3s
aTwPZAQ7
=mh7W
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.19
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed
to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
from 4 to 6. - Paolo]
Add to confidential guests the necessary memory integrity protection
against malicious hypervisor-based attacks like data replay, memory
remapping and others, thus achieving a stronger isolation from the
hypervisor.
At the core of the functionality is a new structure called a reverse
map table (RMP) with which the guest has a say in which pages get
assigned to it and gets notified when a page which it owns, gets
accessed/modified under the covers so that the guest can take an
appropriate action.
In addition, add support for the whole machinery needed to launch a SNP
guest, details of which is properly explained in each patch.
And last but not least, the series refactors and improves parts of the
previous SEV support so that the new code is accomodated properly and
not just bolted on.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLU2AACgkQEsHwGGHe
VUpb/Q//f4LGiJf4nw1flzpe90uIsHNwAafng3NOjeXmhI/EcOlqPf23WHPCgg3Z
2umfa4sRZyj4aZubDd7tYAoq4qWrQ7pO7viWCNTh0InxBAILOoMPMuq2jSAbq0zV
ASUJXeQ2bqjYxX4JV4N5f3HT2l+k68M0mpGLN0H+O+LV9pFS7dz7Jnsg+gW4ZP25
PMPLf6FNzO/1tU1aoYu80YDP1ne4eReLrNzA7Y/rx+S2NAetNwPn21AALVgoD4Nu
vFdKh4MHgtVbwaQuh0csb/+4vD+tDXAhc8lbIl+Abl9ZxJaDWtAJW5D9e2CnsHk1
NOkHwnrzizzhtGK1g56YPUVRFAWhZYMOI1hR0zGPLQaVqBnN4b+iahPeRiV0XnGE
PSbIHSfJdeiCkvLMCdIAmpE5mRshhRSUfl1CXTCdetMn8xV/qz/vG6bXssf8yhTV
cfLGPHU7gfVmsbR9nk5a8KZ78PaytxOxfIDXvCy8JfQwlIWtieaCcjncrj+sdMJy
0fdOuwvi4jma0cyYuPolKiS1Hn4ldeibvxXT7CZQlIx6jZShMbpfpTTJs11XdtHm
PdDAc1TY3AqI33mpy9DhDQmx/+EhOGxY3HNLT7evRhv4CfdQeK3cPVUWgo4bGNVv
ZnFz7nvmwpyufltW9K8mhEZV267174jXGl6/idxybnlVE7ESr2Y=
=Y8kW
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull AMD SEV-SNP support from Borislav Petkov:
"The third AMD confidential computing feature called Secure Nested
Paging.
Add to confidential guests the necessary memory integrity protection
against malicious hypervisor-based attacks like data replay, memory
remapping and others, thus achieving a stronger isolation from the
hypervisor.
At the core of the functionality is a new structure called a reverse
map table (RMP) with which the guest has a say in which pages get
assigned to it and gets notified when a page which it owns, gets
accessed/modified under the covers so that the guest can take an
appropriate action.
In addition, add support for the whole machinery needed to launch a
SNP guest, details of which is properly explained in each patch.
And last but not least, the series refactors and improves parts of the
previous SEV support so that the new code is accomodated properly and
not just bolted on"
* tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/entry: Fixup objtool/ibt validation
x86/sev: Mark the code returning to user space as syscall gap
x86/sev: Annotate stack change in the #VC handler
x86/sev: Remove duplicated assignment to variable info
x86/sev: Fix address space sparse warning
x86/sev: Get the AP jump table address from secrets page
x86/sev: Add missing __init annotations to SEV init routines
virt: sevguest: Rename the sevguest dir and files to sev-guest
virt: sevguest: Change driver name to reflect generic SEV support
x86/boot: Put globals that are accessed early into the .data section
x86/boot: Add an efi.h header for the decompressor
virt: sevguest: Fix bool function returning negative value
virt: sevguest: Fix return value check in alloc_shared_pages()
x86/sev-es: Replace open-coded hlt-loop with sev_es_terminate()
virt: sevguest: Add documentation for SEV-SNP CPUID Enforcement
virt: sevguest: Add support to get extended report
virt: sevguest: Add support to derive key
virt: Add SEV-SNP guest driver
x86/sev: Register SEV-SNP guest request platform device
x86/sev: Provide support for SNP guest request NAEs
...
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of
high bits of physical address bits as 'KeyID' bits. Intel Trust Domain
Extentions (TDX) further steals part of MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set
any legacy MKTME KeyID bits to any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in SPTE which maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for SPTE which maps guest memory. MKTME KeyID bits should be set
to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
to shadow_zero_check).
Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:
- shadow_me_value: the memory encryption bit(s) that will be set to the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a super
set of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is a constant during KVM's life time. This
perhaps doesn't fit MKTME but for now host kernel doesn't support it
(and perhaps will never do).
- Bits in shadow_me_mask are set to shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed. This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all
source and target vcpu->locks. Unfortunately there is an 8 subclass
limit, so a new subclass cannot be used for each vCPU. Instead maintain
ownership of the first vcpu's mutex.dep_map using a role specific
subclass: source vs target. Release the other vcpu's mutex.dep_maps.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reported-by: John Sperbeck<jsperbeck@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220502165807.529624-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zen renumbered some of the performance counters that correspond to the
well known events in perf_hw_id. This code in KVM was never updated for
that, so guest that attempt to use counters on Zen that correspond to the
pre-Zen perf_hw_id values will silently receive the wrong values.
This has been observed in the wild with rr[0] when running in Zen 3
guests. rr uses the retired conditional branch counter 00d1 which is
incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
[0] https://rr-project.org/
Signed-off-by: Kyle Huey <me@kylehuey.com>
Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
Cc: stable@vger.kernel.org
[Check guest family, not host. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This can help identify potential performance issues when handles
AVIC incomplete IPI due vCPU not running.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, an AVIC-enabled VM suffers from performance bottleneck
when scaling to large number of vCPUs for I/O intensive workloads.
In such case, a vCPU often executes halt instruction to get into idle state
waiting for interrupts, in which KVM would de-schedule the vCPU from
physical CPU.
When AVIC HW tries to deliver interrupt to the halting vCPU, it would
result in AVIC incomplete IPI #vmexit to notify KVM to reschedule
the target vCPU into running state.
Investigation has shown the main hotspot is in the kvm_apic_match_dest()
in the following call stack where it tries to find target vCPUs
corresponding to the information in the ICRH/ICRL registers.
- handle_exit
- svm_invoke_exit_handler
- avic_incomplete_ipi_interception
- kvm_apic_match_dest
However, AVIC provides hints in the #vmexit info, which can be used to
retrieve the destination guest physical APIC ID.
In addition, since QEMU defines guest physical APIC ID to be the same as
vCPU ID, it can be used to quickly identify the target vCPU to deliver IPI,
and avoid the overhead from searching through all vCPUs to match the target
vCPU.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220420154954.19305-2-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
root_role.level is always the same value as shadow_level:
- it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
- it's the level argument when going through kvm_init_shadow_ept_mmu
- it's assigned directly from new_role.base.level when going
through shadow_mmu_init_context
Remove the duplication and get the level directly from the role.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
with an explicit, common workaround in kvm_inject_emulated_page_fault().
Aside from being a hack, the current approach is brittle and incomplete,
e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
and nVMX fails to apply the workaround when VMX is intercepting #PF due
to allow_smaller_maxphyaddr=1.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use
TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID
instructions. The TSC_AUX value is set using the WRMSR instruction to the
TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID
instructions. If the read/write of the TSC_AUX MSR is intercepted, then
RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization
is present. However, the RDPID instruction can't be intercepted. This means
that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR
read/write must not be intercepted for SEV-ES (or SEV-SNP) guests.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes for (relatively) old bugs, to be merged in both the -rc and next
development trees.
The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between
5.18 and commit c24a950ec7 ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata
for SEV-ES", 2022-04-13).
Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots). Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.
Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by VM and reallocated
to another user, the cachelines may corrupt the new user at a later time.
KVM takes a shortcut by assuming all confidential memory remain pinned
until the end of VM lifetime. Therefore, KVM does not flush cache at
mmu_notifier invalidation events. Because of this incorrect assumption and
the lack of cache flushing, malicous userspace can crash the host kernel:
creating a malicious VM and continuously allocates/releases unpinned
confidential memory pages when the VM is running.
Add cache flush operations to mmu_notifier operations to ensure that any
physical memory leaving the guest VM get flushed. In particular, hook
mmu_notifier_invalidate_range_start and mmu_notifier_release events and
flush cache accordingly. The hook after releasing the mmu lock to avoid
contention with other vCPUs.
Cc: stable@vger.kernel.org
Suggested-by: Sean Christpherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use clflush_cache_range() to flush the confidential memory when
SME_COHERENT is supported in AMD CPU. Cache flush is still needed since
SME_COHERENT only support cache invalidation at CPU side. All confidential
cache lines are still incoherent with DMA devices.
Cc: stable@vger.kerel.org
Fixes: add5e2f045 ("KVM: SVM: Add support for the SEV-ES VMSA")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-3-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework sev_flush_guest_memory() to explicitly handle only a single page,
and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page
flushing is currently used only to flush the VMSA, and in its current
form, the helper is completely broken with respect to flushing actual
guest memory, i.e. won't work correctly for an arbitrary memory range.
VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
walks, i.e. will fault if the address is not present in the host page
tables or does not have the correct permissions. Current AMD CPUs also
do not honor SMAP overrides (undocumented in kernel versions of the APM),
so passing in a userspace address is completely out of the question. In
other words, KVM would need to manually walk the host page tables to get
the pfn, ensure the pfn is stable, and then use the direct map to invoke
VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is
particularly evil/clever and backs the guest with Secret Memory (which
unmaps memory from the direct map).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: add5e2f045 ("KVM: SVM: Add support for the SEV-ES VMSA")
Reported-by: Mingwei Zhang <mizhang@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-2-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>