We already deal with CNTPCT_EL0 accesses in non-HYP context.
Let's add CNTVCT_EL0 for good measure.
This is also an opportunity to simplify things and make it
plain that this code is only for non-HYP context handling.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Similarly to handling the physical timer accesses early when FEAT_ECV
causes a trap, we try to handle the physical counter without returning
to the general sysreg handling.
More surprisingly, we introduce something similar for the virtual
counter. Although this isn't necessary yet, it will prove useful on
systems that have a broken CNTVOFF_EL2 implementation. Yes, they exist.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Although FEAT_ECV allows us to correctly emulate the timers, it also
reduces performance pretty badly.
Mitigate this by emulating the CTL/CVAL register reads in the
inner run loop, without returning to the general kernel.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
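As an illustration of what such in-loop handling can look like (the function and helper composition below is a sketch, not the exact KVM code), the trapped encoding is decoded, the read is served from the vcpu's emulated register state, and the instruction is skipped without ever leaving the hypervisor:
    static bool handle_timer_read_early(struct kvm_vcpu *vcpu)
    {
            u64 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
            u64 val;
            switch (sysreg) {
            case SYS_CNTP_CTL_EL02:         /* trapped thanks to FEAT_ECV */
                    val = __vcpu_sys_reg(vcpu, CNTP_CTL_EL0);
                    break;
            case SYS_CNTP_CVAL_EL02:
                    val = __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0);
                    break;
            default:
                    return false;           /* let the slow path handle it */
            }
            vcpu_set_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu), val);
            __kvm_skip_instr(vcpu);
            return true;                    /* handled, stay in the run loop */
    }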
Although FEAT_NV2 makes most things fast, it also makes it impossible
to correctly emulate the timers, as the sysreg accesses are redirected
to memory.
FEAT_ECV addresses this by giving a hypervisor the ability to trap
the EL02 sysregs as well as the virtual timer.
Add the required trap setting to make use of the feature, allowing
us to elide the ugly resync in the middle of the run loop.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
With FEAT_NV2, the EL0 timer state is entirely stored in memory,
meaning that the hypervisor can only provide a very poor emulation.
The only thing we can really do is to publish the interrupt state
in the guest view of CNT{P,V}_CTL_EL0, and defer everything else
to the next exit.
Only FEAT_ECV will allow us to fix it, at the cost of extra trapping.
Suggested-by: Chase Conklin <chase.conklin@arm.com>
Suggested-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
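To make the deferred behaviour concrete, here is a minimal sketch of the "publish" step; timer_ctl_reg()/timer_cval_reg() are hypothetical accessors for the memory-backed registers, and the real code is structured differently:
    static void publish_timer_istatus(struct kvm_vcpu *vcpu, int timer, u64 cnt)
    {
            u64 ctl  = __vcpu_sys_reg(vcpu, timer_ctl_reg(timer));  /* hypothetical */
            u64 cval = __vcpu_sys_reg(vcpu, timer_cval_reg(timer)); /* hypothetical */
            /* ISTATUS: timer enabled and the timer condition is met */
            if ((ctl & ARCH_TIMER_CTRL_ENABLE) && cnt >= cval)
                    ctl |= ARCH_TIMER_CTRL_IT_STAT;
            else
                    ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
            __vcpu_sys_reg(vcpu, timer_ctl_reg(timer)) = ctl;
    }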
Emulating the timers with FEAT_NV2 is a bit odd, as the timers
can be reconfigured behind our back without the hypervisor even
noticing. In the VHE case, that's an actual regression in the
architecture...
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add the required handling for EL2 and EL02 registers, as
well as EL1 registers used in the E2H context. This includes
handling the virtual timer accesses when CNTHCTL_EL2.EL1TVT
or CNTHCTL_EL2.EL1TVCT are set.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Although we never supported 32bit anywhere in NV, we fail to
advertise as much for EL0, probably owing to the relative lack of
hardware supporting both NV2 and 32bit EL0.
Add some sanitising to ID_AA64PFR0_EL1.EL0, and reaffirm that
"in 64bit-only we trust".
Reported-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that we have introduced kvm_vcpu_has_feature(), use it in the
remaining code that checks for features in struct kvm, instead of
using the __vcpu_has_feature() helper.
No functional change intended.
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-18-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
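The conversion follows a simple pattern, sketched here under the assumption that kvm_vcpu_has_feature() takes the struct kvm pointer directly:
    /* before: reaching into kvm->arch with the low-level helper */
    bool vm_has_sve_old(struct kvm *kvm)
    {
            return __vcpu_has_feature(&kvm->arch, KVM_ARM_VCPU_SVE);
    }
    /* after: the struct kvm based helper does the digging for us */
    bool vm_has_sve_new(struct kvm *kvm)
    {
            return kvm_vcpu_has_feature(kvm, KVM_ARM_VCPU_SVE);
    }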
The vcpu flag GUEST_HAS_SVE is per-vcpu, but it is based on what
is now a per-vm feature. Make the flag per-vm.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-17-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The vcpu flag GUEST_HAS_PTRAUTH is always associated with the
vcpu PtrAuth features, which are defined per vm rather than per
vcpu.
Remove the flag, and replace it with checks for the features
instead.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-16-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Similar to VHE, calculate the value of cptr_el2 from scratch when
activating traps. This removes the need to store cptr_el2 in every
vcpu structure. Moreover, some trap bits, such as those depending on
whether the guest owns the FP registers, need to be set on every vcpu run.
Reported-by: James Clark <james.clark@linaro.org>
Fixes: 5294afdbf4 ("KVM: arm64: Exclude FP ownership from kvm_vcpu_arch")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-13-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
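A condensed sketch of the recomputation, using the nVHE CPTR_EL2 bit names; guest_owns_fp_regs() and vcpu_has_sve() are existing helpers, but the exact composition below is illustrative:
    static u64 compute_cptr_el2(struct kvm_vcpu *vcpu)
    {
            u64 cptr = CPTR_NVHE_EL2_RES1;
            /* trap FP/SIMD and SVE unless the guest owns the FP regs */
            if (!guest_owns_fp_regs())
                    cptr |= CPTR_EL2_TFP | CPTR_EL2_TZ;
            else if (!vcpu_has_sve(vcpu))
                    cptr |= CPTR_EL2_TZ;    /* SVE remains trapped regardless */
            return cptr;
    }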
In hVHE mode, HCR_E2H should be set for both protected and
non-protected VMs. Since commit b56680de9c ("KVM: arm64:
Initialize trap register values in hyp in pKVM"), this has been
fixed, and the setting of the flag here is redundant.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-12-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The few remaining items needed in fixed_config.h are better
suited for pkvm.h. Move them there and delete fixed_config.h.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-11-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The existing code didn't properly distinguish between signed and
unsigned features, and was difficult to read and to maintain.
Rework it using the same method used in other parts of KVM when
handling vcpu features.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-10-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that the VM's feature id registers are initialized with the
values of the supported features, use those values to determine
which traps to set using kvm_has_feature().
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-9-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Trap RAS in pKVM if it isn't supported at all for protected VMs. The
RAS version doesn't matter in this case.
Fixes: 2a0c343386 ("KVM: arm64: Initialize trap registers for protected VMs")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The hypervisor maintains the state of protected VMs. Initialize
the values for feature ID registers for protected VMs, to be used
when setting traps and when advertising features to protected
VMs.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-7-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Use KVM extension checks as the source for determining which
capabilities are allowed for protected VMs. KVM extension checks
are the natural place for this, since they are also the interface
exposed to users.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The hypervisor is responsible for the power state of protected
VMs in pKVM. Therefore, remove KVM_ARM_VCPU_POWER_OFF from the
list of allowed features for protected VMs.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
At the moment, checks for supported vcpu features for protected
VMs are enforced at build time. In the following patch, they will become
runtime checks based on the vcpu's feature registers. Therefore,
consolidate them into one function that returns an error if
it encounters an unsupported feature.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Group setting protected VM traps by control register rather than
feature id register, since some trap values (e.g., PAuth) depend
on more than one feature id register.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The features allowed, or allowed with restrictions, for protected
guests are based on feature registers, but were defined and
checked for separately, even though they are handled in the same
way. This could result in missing
checks for certain features, e.g., pointer authentication,
causing traps for allowed features.
Consolidate the definitions into one. Use that new definition to
construct the guest view of the feature registers for
consistency.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Introduce the KVM_PGT_CALL() helper macro to allow switching from the
traditional pgtable code to the pKVM version easily in mmu.c. The cost
of this 'indirection' is expected to be very minimal due to
is_protected_kvm_enabled() being backed by a static key.
With this, everything is in place to allow the delegation of
non-protected guest stage-2 page-tables to pKVM, so let's stop using the
host's kvm_s2_mmu from EL2 and enjoy the ride.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-19-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Introduce a set of helper functions for manipulating the pKVM
guest stage-2 page-tables from EL1 using pKVM's HVC interface.
Each helper has an exact one-to-one correspondence with the traditional
kvm_pgtable_stage2_*() functions from pgtable.c, with a strictly
matching prototype. This will ease plumbing later on in mmu.c.
These callbacks track the gfn->pfn mappings in a simple rb_tree indexed
by IPA in lieu of a page-table. This rb-tree is kept in sync with pKVM's
state and is protected by the mmu_lock like a traditional stage-2
page-table.
Signed-off-by: Quentin Perret <qperret@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-18-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
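The tracking structure can be pictured as follows; the node layout matches the description above, while the insertion helper is a generic rb-tree sketch (names are illustrative), called with the mmu_lock held:
    struct pkvm_mapping {
            struct rb_node node;
            u64 gfn;
            u64 pfn;
    };
    static int pkvm_mapping_insert(struct rb_root *root, struct pkvm_mapping *new)
    {
            struct rb_node **p = &root->rb_node, *parent = NULL;
            while (*p) {
                    struct pkvm_mapping *m = rb_entry(*p, struct pkvm_mapping, node);
                    parent = *p;
                    if (new->gfn < m->gfn)
                            p = &(*p)->rb_left;
                    else if (new->gfn > m->gfn)
                            p = &(*p)->rb_right;
                    else
                            return -EEXIST; /* IPA already mapped */
            }
            rb_link_node(&new->node, parent, p);
            rb_insert_color(&new->node, root);
            return 0;
    }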
Introduce a new hypercall to flush the TLBs of non-protected guests. The
host kernel will be responsible for issuing this hypercall after changing
stage-2 permissions using the __pkvm_host_relax_guest_perms() or
__pkvm_host_wrprotect_guest() paths. This is left under the host's
responsibility for performance reasons.
Note however that the TLB maintenance for all *unmap* operations still
remains entirely under the hypervisor's responsibility for security
reasons -- an unmapped page may be donated to another entity, so a stale
TLB entry could be used to leak private data.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-17-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Plumb the kvm_pgtable_stage2_mkyoung() callback into pKVM for
non-protected guests. It will be called later from the fault handling
path.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-16-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Plumb the kvm_stage2_test_clear_young() callback into pKVM for
non-protected guests. It will later be called from MMU notifiers.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-15-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Introduce a new hypercall to remove the write permission from a
non-protected guest stage-2 mapping. This will be used for e.g. enabling
dirty logging.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-14-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Introduce a new hypercall allowing the host to relax the stage-2
permissions of mappings in a non-protected guest page-table. It will be
used later once we start allowing RO memslots and dirty logging.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-13-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
In preparation for letting the host unmap pages from non-protected
guests, introduce a new hypercall implementing the host-unshare-guest
transition.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-12-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
In preparation for handling guest stage-2 mappings at EL2, introduce a
new pKVM hypercall allowing the host to share pages with non-protected guests.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-11-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Rather than looking up the hyp vCPU on every run hypercall at EL2,
introduce a per-CPU 'loaded_hyp_vcpu' tracking variable which is updated
by a pair of load/put hypercalls called directly from
kvm_arch_vcpu_{load,put}() when pKVM is enabled.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-10-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
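The mechanism boils down to a per-CPU pointer updated by the two hypercalls; variable and handler names below are assumptions:
    static DEFINE_PER_CPU(struct pkvm_hyp_vcpu *, loaded_hyp_vcpu);
    static void handle___pkvm_vcpu_load(struct pkvm_hyp_vcpu *hyp_vcpu)
    {
            /* issued from kvm_arch_vcpu_load() when pKVM is enabled */
            *this_cpu_ptr(&loaded_hyp_vcpu) = hyp_vcpu;
    }
    static void handle___pkvm_vcpu_put(void)
    {
            /* issued from kvm_arch_vcpu_put() */
            *this_cpu_ptr(&loaded_hyp_vcpu) = NULL;
    }
    static struct pkvm_hyp_vcpu *pkvm_get_loaded_hyp_vcpu(void)
    {
            return *this_cpu_ptr(&loaded_hyp_vcpu);
    }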
In preparation for accessing pkvm_hyp_vm structures at EL2 in a context
where we can't always expect a vCPU to be loaded (e.g. MMU notifiers),
introduce get/put helpers to get temporary references to hyp VMs from
any context.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-9-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
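One plausible shape for the pair, assuming a per-page refcount on the VM structure and a lock-protected handle lookup (get_vm_by_handle() is hypothetical here):
    static struct pkvm_hyp_vm *get_pkvm_hyp_vm(pkvm_handle_t handle)
    {
            struct pkvm_hyp_vm *vm;
            hyp_spin_lock(&vm_table_lock);
            vm = get_vm_by_handle(handle);          /* hypothetical lookup */
            if (vm)
                    hyp_page_ref_inc(hyp_virt_to_page(vm));
            hyp_spin_unlock(&vm_table_lock);
            return vm;
    }
    static void put_pkvm_hyp_vm(struct pkvm_hyp_vm *vm)
    {
            hyp_spin_lock(&vm_table_lock);
            hyp_page_ref_dec(hyp_virt_to_page(vm));
            hyp_spin_unlock(&vm_table_lock);
    }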
kvm_pgtable_stage2_relax_perms currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM. To
allow for the re-use of that function, make the walk flags one of its
parameters.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-7-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
kvm_pgtable_stage2_mkyoung currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM.
To allow for the re-use of that function, make the walk flags one of
its parameters.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-6-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
We currently store part of the page-tracking state in PTE software bits
for the host, guests and the hypervisor. This is sub-optimal when e.g.
sharing pages, as this forces us to break block mappings purely to support
this software tracking. This causes an unnecessarily fragmented stage-2
page-table for the host in particular when it shares pages with Secure,
which can lead to measurable regressions. Moreover, having this state
stored in the page-table forces us to do multiple costly walks on the
page transition path, hence causing overhead.
In order to work around these problems, move the host-side page-tracking
logic from SW bits in its stage-2 PTEs to the hypervisor's vmemmap.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-5-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
We don't need 16 bits to store the hyp page order, and we'll need some
bits to store page ownership data soon, so let's reduce the order
member.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-4-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
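Illustratively (field sizes and names below are assumptions, not the actual struct):
    struct hyp_page {
            u16 refcount;
            u8 order;       /* was 16 bits; 8 bits cover any realistic order */
            u8 reserved;    /* freed-up space for upcoming ownership data */
    };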
In order to prepare the way for storing page-tracking information in
pKVM's vmemmap, move the enum pkvm_page_state definition to
nvhe/memory.h.
No functional changes intended.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The 'concrete' (a.k.a non-meta) page states are currently encoded using
software bits in PTEs. For performance reasons, the abstract
pkvm_page_state enum uses the same bits to encode these states as that
makes conversions from and to PTEs easy.
In order to prepare the ground for moving the 'concrete' state storage
to the hyp vmemmap, re-arrange the enum to use bits 0 and 1 for this
purpose.
No functional changes intended.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
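A sketch of the re-arranged encoding, following the description above (exact enumerator values are the commit's intent, not verified constants):
    enum pkvm_page_state {
            /* 'concrete' states, fit in bits 0-1 and thus in the vmemmap */
            PKVM_PAGE_OWNED                 = 0,
            PKVM_PAGE_SHARED_OWNED          = BIT(0),
            PKVM_PAGE_SHARED_BORROWED       = BIT(1),
            __PKVM_PAGE_RESERVED            = BIT(0) | BIT(1),
            /* meta-state, never stored in a PTE or the vmemmap */
            PKVM_NOPAGE                     = BIT(2),
    };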
Only yielding control of the debug registers for writes is a bit silly,
unless of course you're a fan of pointless traps. Give control of the
debug registers to the guest upon the first access, regardless of
direction.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-20-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
There is a nauseating amount of boilerplate for accessing the
breakpoint and watchpoint registers. Fold everything together into a
single set of accessors and select the right storage based on the sysreg
encoding.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-19-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Similar to other per-CPU profiling/debug features we handle, store the
number of breakpoints/watchpoints in kvm_host_data to avoid reading the
ID register 4 times on every guest entry/exit. And if you're in the
nested virt business that's quite a few avoidable exits to the L0
hypervisor.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-18-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
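The caching itself is a one-off read at init time; the storage fields below are assumptions, but the ID register layout (BRPs at bits [15:12], WRPs at [23:20], both encoding count minus one) is architectural:
    struct kvm_host_debug {         /* hypothetical container in kvm_host_data */
            u8 nr_brps;
            u8 nr_wrps;
    };
    static void init_debug_counts(struct kvm_host_debug *d)
    {
            u64 dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
            d->nr_brps = FIELD_GET(GENMASK_ULL(15, 12), dfr0) + 1;
            d->nr_wrps = FIELD_GET(GENMASK_ULL(23, 20), dfr0) + 1;
    }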
KVM takes over the guest's software step state machine if the VMM is
debugging the guest, but it does the save/restore fiddling for every
guest entry.
Note that the only constraint on host usage of software step is that the
guest's configuration remains visible to userspace via the ONE_REG
ioctls. So, we can cut down on the amount of fiddling by doing this at
load/put instead.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-16-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Stealing MDSCR_EL1 in the guest's kvm_cpu_context for external debugging
is rather gross. Just add a field for this instead and let the context
switch code pick the correct one based on the debug owner.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-15-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
KVM has picked up several hacks to cope with vcpu->arch.mdcr_el2 needing
to be prepared before vcpu_load(), which is when it gets programmed
into hardware on VHE.
Now that the flows for reprogramming MDCR_EL2 have been simplified, move
that computation to vcpu_load().
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-14-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
KVM takes ownership of the debug regs if the guest enables the OS lock,
as it needs to use MDSCR_EL1 to mask debug exceptions. Just reload the
vCPU if the guest toggles the OS lock, relying on kvm_vcpu_load_debug()
to update the debug owner and get the right trap configuration in place.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Use the debug owner to determine if the debug regs are in use instead of
keeping around the DEBUG_DIRTY flag. Debug registers are now
saved/restored after the first trap, regardless of whether it was a read
or a write. This also shifts the point at which KVM becomes lazy to
vcpu_put() rather than the next exception taken from the guest.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-12-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Delete the remnants of debug_ptr now that debug registers are selected
based on the debug owner instead.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-11-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The debug tracepoints are a useless firehose of information that track
implementation detail rather than well-defined events. These are going
to be rather difficult to uphold now that the implementation is getting
redone, so throw them out instead of bending over backwards.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-10-oliver.upton@linux.dev
[maz: fixed compilation after trace-ectomy]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Select the set of debug registers to use based on the owner rather than
relying on debug_ptr. Besides the code cleanup, this allows us to
eliminate a couple of instances of kern_hyp_va() as well.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
In preparation for tossing the debug_ptr mess, introduce an enumeration
to track the ownership of the debug registers while in the guest. Update
the owner at vcpu_load() based on whether the host needs to steal the
guest's debug context or if breakpoints/watchpoints are actively in use.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
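A plausible shape for the enumeration (enumerator names are an assumption):
    enum vcpu_debug_owner {
            VCPU_DEBUG_FREE,        /* nobody is using the HW debug regs */
            VCPU_DEBUG_HOST_OWNED,  /* host/VMM is debugging the guest */
            VCPU_DEBUG_GUEST_OWNED, /* guest actively uses BPs/WPs */
    };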
Expecting the callee to know when MDCR_EL2 needs to be written to
hardware is asking for trouble. Do the deed from kvm_arm_setup_mdcr_el2()
instead.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The SME/SVE state tracking flags have no business in the vCPU. Move them
to kvm_host_data.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add flags to kvm_host_data to track if SPE/TRBE is present +
programmable on a per-CPU basis. Set the flags up at init rather than
vcpu_load() as the programmability of these buffers is unlikely to
change.
Reviewed-by: James Clark <james.clark@linaro.org>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
KVM caches MDCR_EL2 on a per-CPU basis in order to preserve the
configuration of MDCR_EL2.HPMN while running a guest. This is a bit
gross, since we're relying on some baked configuration rather than the
hardware definition of implemented counters.
Discover the number of implemented counters by reading PMCR_EL0.N
instead. This works because:
- In VHE the kernel runs at EL2, and N always returns the number of
counters implemented in hardware
- In {n,h}VHE, the EL2 setup code programs MDCR_EL2.HPMN with the EL2
view of PMCR_EL0.N for the host
Lastly, avoid traps under nested virtualization by saving PMCR_EL0.N in
host data.
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
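Extracting the counter count is a simple field read; PMCR_EL0.N lives in bits [15:11], and stashing the result in host data is the commit's stated approach (the helper name is illustrative):
    static u8 read_pmcr_n(void)
    {
            u64 pmcr = read_sysreg(pmcr_el0);
            return FIELD_GET(GENMASK_ULL(15, 11), pmcr);
    }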
There is no such thing as CPACR_ELx in the architecture.
What we have is CPACR_EL1, for which CPACR_EL12 is an accessor.
Rename CPACR_ELx_* to CPACR_EL1_*, and fix the bit of code using
these names.
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241219173351.1123087-5-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
TCR2_EL1x is a pretty bizarre construct, as it is shared between
TCR2_EL1 and TCR2_EL12. But the latter is obviously only an
accessor to the former.
In order to make things more consistent, upgrade TCR2_EL1x to
a full-blown sysreg definition for TCR2_EL1, and describe TCR2_EL12
as a mapping to TCR2_EL1.
This results in a couple of minor changes to the actual code.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241219173351.1123087-3-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
The pKVM stage2 mapping code relies on an invalid physical address to
signal to the internal API that only the annotations of descriptors
should be updated. These annotations are stored in the high bits of
invalid descriptors covering memory that has been donated to protected
guests, and which is therefore unmapped from the host stage-2 page tables.
Given that these invalid PAs are never stored into the descriptors, it
is better to rely on an explicit flag, to clarify the API and to avoid
confusion regarding whether or not the output address of a descriptor
can ever be invalid to begin with (which is not the case with LPA2).
That removes a dependency on the logic that reasons about the maximum PA
range, which differs on LPA2 capable CPUs based on whether LPA2 is
enabled or not, and will be further clarified in subsequent patches.
Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-12-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
When the host stage1 is configured for LPA2, the value currently being
programmed into TCR_EL2.T0SZ may be invalid unless LPA2 is configured
at HYP as well. This means kvm_lpa2_is_enabled() is not the right
condition to test when setting TCR_EL2.DS, as it will return false if
LPA2 is only available for stage 1 but not for stage 2.
Similarly, programming TCR_EL2.PS based on a limited IPA range due to
lack of stage2 LPA2 support could potentially result in problems.
So use lpa2_is_enabled() instead, and set the PS field according to the
host's IPS, which is capped at 48 bits if LPA2 support is absent or
disabled. Whether or not we can make meaningful use of such a
configuration is a different question.
Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-11-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
An important distinction from other registers affected by HPMN is that
PMCR_EL0 only affects the guest range of counters, regardless of the EL
from which it is accessed. Ensure that PMCR_EL0.P is always applied to
'guest' counters by manually computing the mask rather than deriving it
from the current context.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175611.3658290-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
MDCR_EL2.HPME is the 'global' enable bit for event counters reserved for
EL2. Give the PMU a kick when it's changed to ensure events are
reprogrammed before returning to the guest.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175550.3658212-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Nested virt introduces yet another set of 'global' knobs for controlling
event counters that are reserved for EL2 (i.e. >= HPMN). Get ready to
share some plumbing with the NV controls by offloading counter
reprogramming to KVM_REQ_RELOAD_PMU.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175532.3658134-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Having separate helpers for enabling/disabling counters provides the
wrong abstraction, as the state of each counter needs to be evaluated
independently and, in some cases, use a different global enable bit.
Collapse the enable/disable accessors into a single, common helper that
reconfigures every counter set in @mask, leaving the complexity of
determining if an event is actually enabled in
kvm_pmu_counter_is_enabled().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175513.3658056-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
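The resulting shape is roughly the following sketch; kvm_pmu_counter_is_enabled() and kvm_vcpu_idx_to_pmc() follow the text above, while the perf-event helpers are placeholders:
    static void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu,
                                               unsigned long mask)
    {
            int i;
            for_each_set_bit(i, &mask, 32) {
                    struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i);
                    if (kvm_pmu_counter_is_enabled(pmc))
                            create_backing_perf_event(pmc);  /* placeholder */
                    else
                            release_backing_perf_event(pmc); /* placeholder */
            }
    }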
There are multiple pKVM memory transitions where the state of a page is
not cross-checked from the completer's PoV for performance reasons.
For example, if a page is PKVM_PAGE_OWNED from the initiator's PoV,
we should be guaranteed by construction that it is PKVM_NOPAGE for
everybody else, hence allowing us to save a page-table lookup.
When it was introduced, hyp_ack_unshare() followed that logic and bailed
out without checking the PKVM_PAGE_SHARED_BORROWED state in the
hypervisor's stage-1. This was correct as we could safely assume that
all host-initiated shares were directed at the hypervisor at the time.
But with the introduction of other types of shares (e.g. for FF-A or
non-protected guests), it is now very much required to cross check this
state to prevent the host from running __pkvm_host_unshare_hyp() on a
page shared with TZ or a non-protected guest.
Thankfully, if an attacker were to try this, the hyp_unmap() call from
hyp_complete_unshare() would fail, hence causing a WARN() from
__do_unshare() with the host lock held, which is fatal. But this is
fragile at best, and can hardly be considered a security measure.
Let's just do the right thing and always check the state from
hyp_ack_unshare().
Signed-off-by: Quentin Perret <qperret@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241128154406.602875-1-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Merge tag 'kvmarm-fixes-6.13-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.13, part #2
- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
SPE/TRBE initialization
- Align nested page table walker with the intended memory attribute
combining rules of the architecture
- Prevent userspace from constraining the advertised ASID width,
avoiding horrors of guest TLBIs not matching the intended context in
hardware
- Don't leak references on LPIs when insertion into the translation
cache fails
The return value of xa_store() needs to be checked. This fix adds an
error handling path that resolves the kref inconsistency on failure. As
suggested by Oliver Upton, this function does not return the error code
intentionally because the translation cache is best effort.
Fixes: 8201d1028c ("KVM: arm64: vgic-its: Maintain a translation cache per ITS")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241130144952.23729-1-keisuke.nishimura@inria.fr
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Catalin reports that a hypervisor lying to a guest about the size
of the ASID field may result in unexpected issues:
- if the underlying HW only supports 8 bit ASIDs, the ASID
field in a TLBI VAE1* operation is only 8 bits, and the HW will
ignore the other 8 bits
- if on the contrary the HW is 16 bit capable, the ASID field
in the same TLBI operation is always 16 bits, irrespective of
the value of TCR_ELx.AS.
This could lead to missed invalidations if the guest was led to
assume that the HW had 8 bit ASIDs while they really are 16 bits wide.
In order to avoid any potential disaster that would be hard to debug,
prevent migration from a host with 8 bit ASIDs to one with
wider ASIDs (the converse was obviously always forbidden). This is
also consistent with what we already do for VMIDs.
If it becomes absolutely mandatory to support such a migration path
in the future, we will have to trap and emulate all TLBIs, something
that nobody should look forward to.
Fixes: d5a32b60dc ("KVM: arm64: Allow userspace to change ID_AA64MMFR{0-2}_EL1")
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241203190236.505759-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* Fixes.
RISC-V:
* Svade and Svadu (accessed and dirty bit) extension support for host and
guest. This was acked on the mailing list by the RISC-V maintainer, see
https://patchew.org/linux/20240726084931.28924-1-yongxuan.wang@sifive.com/.
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
- ARM fixes
- RISC-V Svade and Svadu (accessed and dirty bit) extension support for
host and guest
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: riscv: selftests: Add Svade and Svadu Extension to get-reg-list test
RISC-V: KVM: Add Svade and Svadu Extensions Support for Guest/VM
dt-bindings: riscv: Add Svade and Svadu Entries
RISC-V: Add Svade and Svadu Extensions Support
KVM: arm64: Use MDCR_EL2.HPME to evaluate overflow of hyp counters
KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
KVM: arm64: Mark set_sysreg_masks() as inline to avoid build failure
KVM: arm64: vgic-its: Add stronger type-checking to the ITS entry sizes
KVM: arm64: vgic: Kill VGIC_MAX_PRIVATE definition
KVM: arm64: vgic: Make vgic_get_irq() more robust
KVM: arm64: vgic-v3: Sanitise guest writes to GICR_INVLPIR
The G.a revision of the ARM ARM had it pretty clear that HCR_EL2.FWB
had no influence on "The way that stage 1 memory types and attributes
are combined with stage 2 Device type and attributes." (D5.5.5).
However, this wording was lost in further revisions of the architecture.
Restore the intended behaviour, which is to take the strongest memory
type of S1 and S2 in this case, as if FWB was 0. The specification is
being fixed accordingly.
Fixes: be04cebf3e ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241125094756.609590-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Since the linked fixes commit, these masks are already shifted so remove
the shifts. One issue that this fixes is SPE and TRBE not being
available anymore:
    arm_spe_pmu arm,spe-v1: profiling buffer owned by higher exception level
Fixes: 641630313e ("arm64: sysreg: Migrate MDCR_EL2 definition to table")
Signed-off-by: James Clark <james.clark@linaro.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241122164636.2944180-1-james.clark@linaro.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"The biggest change here is eliminating the awful idea that KVM had of
essentially guessing which pfns are refcounted pages.
The reason to do so was that KVM needs to map both non-refcounted
pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXEDMAP
VMAs that contain refcounted pages.
However, the result was security issues in the past, and more recently
the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
struct page but is not refcounted. In particular this broke virtio-gpu
blob resources (which directly map host graphics buffers into the
guest as "vram" for the virtio-gpu device) with the amdgpu driver,
because amdgpu allocates non-compound higher order pages and the tail
pages could not be mapped into KVM.
This requires adjusting all uses of struct page in the
per-architecture code, to always work on the pfn whenever possible.
The large series that did this, from David Stevens and Sean
Christopherson, also cleaned up substantially the set of functions
that provided arch code with the pfn for a host virtual address.
The previous maze of twisty little passages, all different, is
replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
non-__ versions of these two, and kvm_prefetch_pages) saving almost
200 lines of code.
ARM:
- Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
call was introduced in PSCIv1.3 as a mechanism to request
hibernation, similar to the S4 state in ACPI
- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
- PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
- Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
- Avoid emulated MMIO completion if userspace has requested
synchronous external abort injection
- Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
LoongArch:
- Add iocsr and mmio bus simulation in kernel.
- Add in-kernel interrupt controller emulation.
- Add support for virtualization extensions to the eiointc irqchip.
PPC:
- Drop lingering and utterly obsolete references to PPC970 KVM, which
was removed 10 years ago.
- Fix incorrect documentation references to non-existing ioctls
RISC-V:
- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
s390:
- New selftests: more ucontrol selftests and CPU model sanity checks
- Support for the gen17 CPU model
- List registers supported by KVM_GET/SET_ONE_REG in the
documentation
x86:
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
improve documentation, harden against unexpected changes.
Even if the hardware A/D tracking is disabled, it is possible to
use the hardware-defined A/D bits to track if a PFN is Accessed
and/or Dirty, and that removes a lot of special cases.
- Elide TLB flushes when aging secondary PTEs, as has been done in
x86's primary MMU for over 10 years.
- Recover huge pages in-place in the TDP MMU when dirty page logging
is toggled off, instead of zapping them and waiting until the page
is re-accessed to create a huge mapping. This reduces vCPU jitter.
- Batch TLB flushes when dirty page logging is toggled off. This
reduces the time it takes to disable dirty logging by ~3x.
- Remove the shrinker that was (poorly) attempting to reclaim shadow
page tables in low-memory situations.
- Clean up and optimize KVM's handling of writes to
MSR_IA32_APICBASE.
- Advertise CPUIDs for new instructions in Clearwater Forest
- Quirk KVM's misguided behavior of initializing certain feature MSRs
to their maximum supported feature set, which can result in KVM
creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
a non-zero value results in the vCPU having invalid state if
userspace hides PDCM from the guest, which in turn can lead to
save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support
LA57 to better follow the "architecture", in quotes because the
actual behavior is poorly documented. E.g. most MSR writes and
descriptor table loads ignore CR4.LA57 and operate purely on
whether the CPU supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(),
as filling the cache from IRQ context is generally unsafe; harden
the cache accessors to try to prevent similar issues from occurring
in the future. The issue that triggered this change was already
fixed in 6.12, but was still kinda latent.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where
KVM over-advertises SPEC_CTRL when trying to support cross-vendor
VMs.
- Minor cleanups
- Switch hugepage recovery thread to use vhost_task.
These kthreads can consume significant amounts of CPU time on
behalf of a VM or in response to how the VM behaves (for example
how it accesses its memory); therefore KVM tried to place the
thread in the VM's cgroups and charge the CPU time consumed by that
work to the VM's container.
However the kthreads did not process SIGSTOP/SIGCONT, and therefore
cgroups which had KVM instances inside could not complete freezing.
Fix this by replacing the kthread with a PF_USER_WORKER thread, via
the vhost_task abstraction. Another 100+ lines removed, with
generally better behavior too like having these threads properly
parented in the process tree.
- Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
didn't really work; there was really nothing to work around anyway:
the broken patch was meant to fix nested virtualization, but the
PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
erratum.
- Fix 6.12 regression where CONFIG_KVM will be built as a module even
if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
'y'.
x86 selftests:
- x86 selftests can now use AVX.
Documentation:
- Use rST internal links
- Reorganize the introduction to the API document
Generic:
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
instead of RCU, so that running a vCPU on a different task doesn't
encounter long delays waiting for all CPUs to become quiescent.
In general both reads and writes are rare, but userspace that
supports confidential computing is introducing the use of "helper"
vCPUs that may jump from one host processor to another. Those will
be very happy to trigger a synchronize_rcu(), and the effect on
performance is quite the disaster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
KVM: x86: add back X86_LOCAL_APIC dependency
Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
KVM: x86: switch hugepage recovery thread to vhost_task
KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
Documentation: KVM: fix malformed table
irqchip/loongson-eiointc: Add virt extension support
LoongArch: KVM: Add irqfd support
LoongArch: KVM: Add PCHPIC user mode read and write functions
LoongArch: KVM: Add PCHPIC read and write functions
LoongArch: KVM: Add PCHPIC device support
LoongArch: KVM: Add EIOINTC user mode read and write functions
LoongArch: KVM: Add EIOINTC read and write functions
LoongArch: KVM: Add EIOINTC device support
LoongArch: KVM: Add IPI user mode read and write function
LoongArch: KVM: Add IPI read and write function
LoongArch: KVM: Add IPI device support
LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
KVM: arm64: Pass on SVE mapping failures
...
Merge tag 'kvmarm-fixes-6.13-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.13, part #2
- Constrain invalidations from GICR_INVLPIR to only affect the LPI
INTID space
- Set of robustness improvements to the management of vgic irqs and GIC
ITS table entries
- Fix compilation issue w/ CONFIG_CC_OPTIMIZE_FOR_SIZE=y where
set_sysreg_masks() wasn't getting inlined, breaking check for a
constant sysreg index
- Correct KVM's vPMU overflow condition to match the architecture for
hyp and non-hyp counters
The 'global enable control' (as it is termed in the architecture) for
counters reserved by EL2 is MDCR_EL2.HPME. Use that instead of
PMCR_EL0.E when evaluating the overflow state for hyp counters.
Change the return value to a bool while at it, which better reflects the
fact that the overflow state is a shared signal and not a per-counter
property.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241120005230.2335682-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
DDI0487K.a D13.3.1 describes the PMU overflow condition, which evaluates
to true if any counter's global enable (PMCR_EL0.E), overflow flag
(PMOVSSET_EL0[n]), and interrupt enable (PMINTENSET_EL1[n]) are all 1.
Of note, this does not require a counter to be enabled
(i.e. PMCNTENSET_EL0[n] = 1) to generate an overflow.
Align kvm_pmu_overflow_status() with the reality of the architecture
and stop using PMCNTENSET_EL0 as part of the overflow condition. The
bug was discovered while running an SBSA PMU test [*], which only sets
PMCR.E, PMOVSSET<0>, PMINTENSET<0>, and expects an overflow interrupt.
Cc: stable@vger.kernel.org
Fixes: 76d883c4e6 ("arm64: KVM: Add access handler for PMOVSSET and PMOVSCLR register")
Link: https://github.com/ARM-software/sbsa-acs/blob/master/test_pool/pmu/operating_system/test_pmu001.c
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
[ oliver: massaged changelog ]
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241120005230.2335682-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When compiling with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, set_sysreg_masks()
fails to compile thanks to:
BUILD_BUG_ON(!__builtin_constant_p(sr));
as the compiler doesn't identify sr as a constant, despite all the
callers passing constants.
Fix the issue by always inlining this function, which allows GCC to
do the right thing.
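A reduced, standalone model of the failure mode and the fix, assuming a
GCC-style compiler (all names are made up; the real function lives in
the NV sysreg code):

  #include <stdint.h>

  /* GCC-style compile-time assert: the call must be proven dead by
   * the optimizer, mirroring the kernel's compiletime_assert(). */
  extern void __not_constant(void)
          __attribute__((__error__("sr is not a compile-time constant")));
  #define BUILD_BUG_ON(cond)      do { if (cond) __not_constant(); } while (0)
  #define __always_inline         inline __attribute__((__always_inline__))

  /* At -Os, a plain 'static inline' may be emitted out of line, at
   * which point 'sr' is no longer a constant inside the body and the
   * check fires. Forcing inlining lets constant propagation do its
   * job at every call site. */
  static __always_inline void set_masks_model(unsigned int sr,
                                              uint64_t res0, uint64_t res1)
  {
          BUILD_BUG_ON(!__builtin_constant_p(sr));
          (void)res0;
          (void)res1;     /* real body elided */
  }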
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411201857.ZNudtGJl-lkp@intel.com/
Fixes: a016202009 ("KVM: arm64: Extend masking facility to arbitrary registers")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241120111516.304250-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The ITS ABI infrastructure allows for some pretty lax code, where
the size of the data doesn't have to match the size of the entry,
potentially leading to a collection of interesting bugs.
Commit 7fe28d7e68 ("KVM: arm64: vgic-its: Add a data length check
in vgic_its_save_*") added some checks, but starts by implicitly
casting all writes to a 64bit value, hiding some of the issues.
Instead, introduce macros that will check the data type actually used
for dealing with the table entries. The macros take a symbolic
entry type, which is used to fetch the size of the entry for the
current ABI. This immediately catches a couple of low-impact gotchas
(zero values that are implicitly 32bit), easy enough to fix.
Given that we currently only have a single ABI, hardcode a couple of
BUILD_BUG_ON()s that will fire if we use anything but a 64bit quantity,
and some (currently unreachable) fallback code that may become useful
one day.
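A sketch of the shape such a macro can take (kernel context assumed;
the macro name and exact plumbing are hypothetical):

  /* Check the size of the object the caller actually used, *without*
   * first casting it to u64 (the implicit cast is what hid the
   * earlier bugs). With a single ABI, the expected size can be a
   * hardcoded BUILD_BUG_ON(). */
  #define its_write_entry(its, gpa, val)                                \
  ({                                                                    \
          BUILD_BUG_ON(sizeof(val) != sizeof(u64));                     \
          vgic_write_guest_lock((its)->dev->kvm, (gpa), &(val),         \
                                sizeof(val));                           \
  })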
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
VGIC_MAX_PRIVATE is a pretty useless definition, and is better
replaced with VGIC_NR_PRIVATE_IRQS.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_get_irq() has an awkward signature, as it takes both a kvm
*and* a vcpu, where the vcpu is allowed to be NULL if the INTID
being looked up is a global interrupt (SPI or LPI).
This leads to potentially problematic situations where the INTID
passed is a private interrupt, but there is no vcpu.
In order to make things less ambiguous, let's have *two* helpers
instead:
- vgic_get_irq(struct kvm *kvm, u32 intid), which is only concerned
with *global* interrupts, as indicated by the lack of vcpu.
- vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid), which can
return *any* interrupt class, but must of course have a non-NULL
vcpu.
Most of the code nicely falls under one or the other situation,
except for a couple of cases (close to the UABI or in the debug code)
where we have to distinguish between the two cases.
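In prototype form (bodies elided), the resulting pair is:

  /* Global interrupts (SPIs/LPIs) only; no vcpu required: */
  struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid);

  /* Any interrupt class, but the vcpu must be non-NULL: */
  struct vgic_irq *vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid);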
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Make sure we filter out non-LPI invalidation when handling writes
to GICR_INVLPIR.
Fixes: 4645d11f4a ("KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241117165757.247686-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Support for running Linux in a protected VM under the Arm
Confidential Compute Architecture (CCA)
- Guarded Control Stack user-space support. Current patches follow the
x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
patches (already on the list) will add support for clone3() allowing
finer-grained control of the shadow stack size and placement from
libc
- AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
getting close with the upcoming dpISA support)
- Other arch features:
- In-kernel use of the memcpy instructions, FEAT_MOPS (previously
only exposed to user; uaccess support not merged yet)
- MTE: hugetlbfs support and the corresponding kselftests
- Optimise CRC32 using the PMULL instructions
- Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
- Optimise the kernel TLB flushing to use the range operations
- POE/pkey (permission overlays): further cleanups after bringing
the signal handler in line with the x86 behaviour for 6.12
- arm64 perf updates:
- Support for the NXP i.MX91 PMU in the existing IMX driver
- Support for Ampere SoCs in the Designware PCIe PMU driver
- Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
- Support for Samsung's 'Mongoose' CPU PMU
- Support for PMUv3.9 finer-grained userspace counter access
control
- Switch back to platform_driver::remove() now that it returns
'void'
- Add some missing events for the CXL PMU driver
- Miscellaneous arm64 fixes/cleanups:
- Page table accessors cleanup: type updates, drop unused macros,
reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
check addresses before runtime P4D/PUD folding
- Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
FEAT_ECV for the generic timers) allowing Linux to boot with
firmware deployments that don't set SCTLR_EL3.ECVEn
- ACPI/arm64: tighten the check for the array of platform timer
structures and adjust the error handling procedure in
gtdt_parse_timer_block()
- Optimise the cache flush for the uprobes xol slot (skip if no
change) and other uprobes/kprobes cleanups
- Fix the context switching of tpidrro_el0 when kpti is enabled
- Dynamic shadow call stack fixes
- Sysreg updates
- Various arm64 kselftest improvements
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
kselftest/arm64: Try harder to generate different keys during PAC tests
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
arm64/ptrace: Clarify documentation of VL configuration via ptrace
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
acpi/arm64: remove unnecessary cast
arm64/mm: Change protval as 'pteval_t' in map_range()
kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
kselftest/arm64: Add FPMR coverage to fp-ptrace
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Enable build of PAC tests with LLVM=1
kselftest/arm64: Check that SVCR is 0 in signal handlers
selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix build with stricter assemblers
arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
arm64/scs: Deal with 64-bit relative offsets in FDE frames
...
* arm64/for-next/perf:
perf: Switch back to struct platform_driver::remove()
perf: arm_pmuv3: Add support for Samsung Mongoose PMU
dt-bindings: arm: pmu: Add Samsung Mongoose core compatible
perf/dwc_pcie: Fix typos in event names
perf/dwc_pcie: Add support for Ampere SoCs
ARM: pmuv3: Add missing write_pmuacr()
perf/marvell: Marvell PEM performance monitor support
perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
perf/dwc_pcie: Convert the events with mixed case to lowercase
perf/cxlpmu: Support missing events in 3.1 spec
perf: imx_perf: add support for i.MX91 platform
dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible
drivers perf: remove unused field pmu_node
* for-next/gcs: (42 commits)
: arm64 Guarded Control Stack user-space support
kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
arm64/gcs: Fix outdated ptrace documentation
kselftest/arm64: Ensure stable names for GCS stress test results
kselftest/arm64: Validate that GCS push and write permissions work
kselftest/arm64: Enable GCS for the FP stress tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Add GCS signal tests
kselftest/arm64: Add test coverage for GCS mode locking
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Verify the GCS hwcap
arm64: Add Kconfig for Guarded Control Stack (GCS)
arm64/ptrace: Expose GCS via ptrace and core files
arm64/signal: Expose GCS state in signal frames
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/mm: Implement map_shadow_stack()
...
* for-next/probes:
: Various arm64 uprobes/kprobes cleanups
arm64: insn: Simulate nop instruction for better uprobe performance
arm64: probes: Remove probe_opcode_t
arm64: probes: Cleanup kprobes endianness conversions
arm64: probes: Move kprobes-specific fields
arm64: probes: Fix uprobes for big-endian kernels
arm64: probes: Fix simulate_ldr*_literal()
arm64: probes: Remove broken LDR (literal) uprobe support
* for-next/asm-offsets:
: arm64 asm-offsets.c cleanup (remove unused offsets)
arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET
arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE
arm64: asm-offsets: remove VM_EXEC and PAGE_SZ
arm64: asm-offsets: remove MM_CONTEXT_ID
arm64: asm-offsets: remove COMPAT_{RT_,}SIGFRAME_REGS_OFFSET
arm64: asm-offsets: remove VMA_VM_*
arm64: asm-offsets: remove TSK_ACTIVE_MM
* for-next/tlb:
: TLB flushing optimisations
arm64: optimize flush tlb kernel range
arm64: tlbflush: add __flush_tlb_range_limit_excess()
* for-next/misc:
: Miscellaneous patches
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
arm64/ptrace: Clarify documentation of VL configuration via ptrace
acpi/arm64: remove unnecessary cast
arm64/mm: Change protval as 'pteval_t' in map_range()
arm64: uprobes: Optimize cache flushes for xol slot
acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG
arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings
arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
ACPI: GTDT: Tighten the check for the array of platform timer structures
arm64/fpsimd: Fix a typo
arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers
arm64: Return early when break handler is found on linked-list
arm64/mm: Re-organize arch_make_huge_pte()
arm64/mm: Drop _PROT_SECT_DEFAULT
arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV
arm64: head: Drop SWAPPER_TABLE_SHIFT
arm64: cpufeature: add POE to cpucap_is_possible()
arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t
* for-next/mte:
: Various MTE improvements
selftests: arm64: add hugetlb mte tests
hugetlb: arm64: add mte support
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09
* for-next/stacktrace:
: arm64 stacktrace improvements
arm64: preserve pt_regs::stackframe during exec*()
arm64: stacktrace: unwind exception boundaries
arm64: stacktrace: split unwind_consume_stack()
arm64: stacktrace: report recovered PCs
arm64: stacktrace: report source of unwind data
arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()
arm64: use a common struct frame_record
arm64: pt_regs: swap 'unused' and 'pmr' fields
arm64: pt_regs: rename "pmr_save" -> "pmr"
arm64: pt_regs: remove stale big-endian layout
arm64: pt_regs: assert pt_regs is a multiple of 16 bytes
* for-next/hwcap3:
: Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4)
arm64: Support AT_HWCAP3
binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4
* for-next/kselftest: (30 commits)
: arm64 kselftest fixes/cleanups
kselftest/arm64: Try harder to generate different keys during PAC tests
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
kselftest/arm64: Add FPMR coverage to fp-ptrace
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Enable build of PAC tests with LLVM=1
kselftest/arm64: Check that SVCR is 0 in signal handlers
kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix build with stricter assemblers
kselftest/arm64: Test signal handler state modification in fp-stress
kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
kselftest/arm64: Implement irritators for ZA and ZT
kselftest/arm64: Remove unused ADRs from irritator handlers
kselftest/arm64: Correct misleading comments on fp-stress irritators
kselftest/arm64: Poll less often while waiting for fp-stress children
kselftest/arm64: Increase frequency of signal delivery in fp-stress
kselftest/arm64: Fix encoding for SVE B16B16 test
...
* for-next/crc32:
: Optimise CRC32 using PMULL instructions
arm64/crc32: Implement 4-way interleave using PMULL
arm64/crc32: Reorganize bit/byte ordering macros
arm64/lib: Handle CRC-32 alternative in C code
* for-next/guest-cca:
: Support for running Linux as a guest in Arm CCA
arm64: Document Arm Confidential Compute
virt: arm-cca-guest: TSM_REPORT support for realms
arm64: Enable memory encrypt for Realms
arm64: mm: Avoid TLBI when marking pages as valid
arm64: Enforce bounce buffers for realm DMA
efi: arm64: Map Device with Prot Shared
arm64: rsi: Map unprotected MMIO as decrypted
arm64: rsi: Add support for checking whether an MMIO is protected
arm64: realm: Query IPA size from the RMM
arm64: Detect if in a realm and set RIPAS RAM
arm64: rsi: Add RSI definitions
* for-next/haft:
: Support for arm64 FEAT_HAFT
arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
arm64: Add support for FEAT_HAFT
arm64: setup: name 'tcr2' register
arm64/sysreg: Update ID_AA64MMFR1_EL1 register
* for-next/scs:
: Dynamic shadow call stack fixes
arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
arm64/scs: Deal with 64-bit relative offsets in FDE frames
arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
Merge tag 'kvmarm-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.13, part #1
- Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
was introduced in PSCIv1.3 as a mechanism to request hibernation,
similar to the S4 state in ACPI
- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
- PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
- Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
- Avoid emulated MMIO completion if userspace has requested synchronous
external abort injection
- Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
This function can fail but its return value isn't passed on to the
caller. Presumably this could result in a broken state.
Fixes: 66d5b53e20 ("KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM")
Signed-off-by: James Clark <james.clark@linaro.org>
Reviewed-by: Fuad Tabba <tabba@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241112105604.795809-1-james.clark@linaro.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/vgic-its-fixes:
: Fixes for vgic-its save/restore, courtesy of Kunkun Jiang and Jing Zhang
:
: Address bugs where restoring an ITS consumes a stale DTE/ITE, which
: may lead to either garbage mappings in the ITS or the overall restore
: ioctl failing. The fix in both cases is to zero a DTE/ITE when its
: translation has been invalidated by the guest.
KVM: arm64: vgic-its: Clear ITE when DISCARD frees an ITE
KVM: arm64: vgic-its: Clear DTE when MAPD unmaps a device
KVM: arm64: vgic-its: Add a data length check in vgic_its_save_*
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When DISCARD frees an ITE, it does not invalidate the
corresponding entry in guest memory. Across repeated saves and
restores, an ITE that was never saved may therefore be
restored. This is unreasonable and may cause the restore to
fail. Clear the corresponding ITE when DISCARD frees it.
Cc: stable@vger.kernel.org
Fixes: eff484e029 ("KVM: arm64: vgic-its: ITT save and restore")
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-6-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_its_save_device_tables() traverses its->device_list to
save a DTE for each device. vgic_its_restore_device_tables()
traverses each entry of the device table, checks whether it is
valid, and restores it if so.
However, when MAPD unmaps a device, it does not invalidate the
corresponding DTE in guest memory. Across repeated saves and
restores, a device's stale DTE may therefore be restored even
though it was never saved. This is unreasonable and may cause
the restore to fail. Clear the corresponding DTE when MAPD
unmaps a device.
Cc: stable@vger.kernel.org
Fixes: 57a9a11715 ("KVM: arm64: vgic-its: Device table save/restore")
Co-developed-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-5-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
None of the vgic_its_save_*() functions check whether the data length
is 8 bytes before calling vgic_write_guest_lock().
This patch adds the check. To prevent the kernel from being blown up
when the fault occurs, KVM_BUG_ON() is used, and the other BUG_ON()s
are replaced along with it.
Cc: stable@vger.kernel.org
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with the new entry read/write helpers]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-4-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/nv-pmu:
: Support for vEL2 PMU controls
:
: Align the vEL2 PMU support with the current state of non-nested KVM,
: including:
:
: - Trap routing, with the annoying complication of EL2 traps that apply
: in Host EL0
:
: - PMU emulation, using the correct configuration bits depending on
: whether a counter falls in the hypervisor or guest range of PMCs
:
: - Perf event swizzling across nested boundaries, as the event filtering
: needs to be remapped to cope with vEL2
KVM: arm64: nv: Reprogram PMU events affected by nested transition
KVM: arm64: nv: Apply EL2 event filtering when in hyp context
KVM: arm64: nv: Honor MDCR_EL2.HLP
KVM: arm64: nv: Honor MDCR_EL2.HPME
KVM: arm64: Add helpers to determine if PMC counts at a given EL
KVM: arm64: nv: Adjust range of accessible PMCs according to HPMN
KVM: arm64: Rename kvm_pmu_valid_counter_mask()
KVM: arm64: nv: Advertise support for FEAT_HPMN0
KVM: arm64: nv: Describe trap behaviour of MDCR_EL2.HPMN
KVM: arm64: nv: Honor MDCR_EL2.{TPM, TPMCR} in Host EL0
KVM: arm64: nv: Reinject traps that take effect in Host EL0
KVM: arm64: nv: Rename BEHAVE_FORWARD_ANY
KVM: arm64: nv: Allow coarse-grained trap combos to use complex traps
KVM: arm64: Describe RES0/RES1 bits of MDCR_EL2
arm64: sysreg: Add new definitions for ID_AA64DFR0_EL1
arm64: sysreg: Migrate MDCR_EL2 definition to table
arm64: sysreg: Describe ID_AA64DFR2_EL1 fields
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/mmio-sea:
: Fix for SEA injection in response to MMIO
:
: Fix + test coverage for SEA injection in response to an unhandled MMIO
: exit to userspace. Naturally, if userspace decides to abort an MMIO
: instruction KVM shouldn't continue with instruction emulation...
KVM: arm64: selftests: Add tests for MMIO external abort injection
KVM: arm64: selftests: Convert to kernel's ESR terminology
tools: arm64: Grab a copy of esr.h from kernel
KVM: arm64: Don't retire aborted MMIO instruction
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/misc:
: Miscellaneous updates
:
: - Drop useless check against vgic state in ICC_CTLR_EL1.SEIS read
: emulation
:
: - Fix trap configuration for pKVM
:
: - Close the door on initialization bugs surrounding userspace irqchip
: static key by removing it.
KVM: selftests: Don't bother deleting memslots in KVM when freeing VMs
KVM: arm64: Get rid of userspace_irqchip_in_use
KVM: arm64: Initialize trap register values in hyp in pKVM
KVM: arm64: Initialize the hypervisor's VM state at EL2
KVM: arm64: Refactor kvm_vcpu_enable_ptrauth() for hyp use
KVM: arm64: Move pkvm_vcpu_init_traps() to init_pkvm_hyp_vcpu()
KVM: arm64: Don't map 'kvm_vgic_global_state' at EL2 with pKVM
KVM: arm64: Just advertise SEIS as 0 when emulating ICC_CTLR_EL1
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/mpam-ni:
: Hiding FEAT_MPAM from KVM guests, courtesy of James Morse + Joey Gouly
:
: Fix a longstanding bug where FEAT_MPAM was accidentally exposed to KVM
: guests + the EL2 trap configuration was not explicitly configured. As
: part of this, bring in skeletal support for initialising the MPAM CPU
: context so KVM can actually set traps for its guests.
:
: Be warned -- if this series leads to boot failures on your system,
: you're running on turd firmware.
:
: As an added bonus (that builds upon the infrastructure added by the MPAM
: series), allow userspace to configure CTR_EL0.L1Ip, courtesy of Shameer
: Kolothum.
KVM: arm64: Make L1Ip feature in CTR_EL0 writable from userspace
KVM: arm64: selftests: Test ID_AA64PFR0.MPAM isn't completely ignored
KVM: arm64: Disable MPAM visibility by default and ignore VMM writes
KVM: arm64: Add a macro for creating filtered sys_reg_descs entries
KVM: arm64: Fix missing traps of guest accesses to the MPAM registers
arm64: cpufeature: discover CPU support for MPAM
arm64: head.S: Initialise MPAM EL2 registers and disable traps
arm64/sysreg: Convert existing MPAM sysregs and add the remaining entries
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/psci-1.3:
: PSCI v1.3 support, courtesy of David Woodhouse
:
: Bump KVM's PSCI implementation up to v1.3, with the added bonus of
: implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
: this gets relayed to userspace for further processing with a new
: KVM_SYSTEM_EVENT_SHUTDOWN flag.
:
: As an added bonus, implement client-side support for hibernation with
: the SYSTEM_OFF2 call.
arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
KVM: selftests: Add test for PSCI SYSTEM_OFF2
KVM: arm64: Add support for PSCI v1.2 and v1.3
KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
firmware/psci: Add definitions for PSCI v1.3 specification
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Only allow userspace to set VIPT (0b10) or PIPT (0b11) for L1Ip based
on what hardware reports, as both AIVIVT (0b01) and VPIPT (0b00) are
documented as reserved.
Using VIPT for the guest where hardware reports PIPT may lead to
over-invalidation, but is still correct. Hence, we can allow
downgrading PIPT to VIPT, but not the other way around.
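A self-contained sketch of that policy (the helper is hypothetical;
encodings as listed above):

  #include <errno.h>

  /* CTR_EL0.L1Ip encodings: 0b00 VPIPT, 0b01 AIVIVT (both reserved),
   * 0b10 VIPT, 0b11 PIPT. */
  static int validate_l1ip(unsigned int hw, unsigned int user)
  {
          if (user < 2)           /* reserved encodings */
                  return -EINVAL;
          if (user > hw)          /* VIPT hardware can't claim PIPT */
                  return -EINVAL;
          return 0;
  }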
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241022073943.35764-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Improper use of userspace_irqchip_in_use led to syzbot hitting the
following WARN_ON() in kvm_timer_update_irq():
WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
The following sequence led to the scenario:
- Userspace creates a VM and a vCPU.
- The vCPU is initialized with KVM_ARM_VCPU_PMU_V3 during
KVM_ARM_VCPU_INIT.
- Without any other setup, such as vGIC or vPMU, userspace issues
KVM_RUN on the vCPU. Since the vPMU is requested, but not setup,
kvm_arm_pmu_v3_enable() fails in kvm_arch_vcpu_run_pid_change().
As a result, KVM_RUN returns after enabling the timer, but before
incrementing 'userspace_irqchip_in_use':
kvm_arch_vcpu_run_pid_change()
ret = kvm_arm_pmu_v3_enable()
if (!vcpu->arch.pmu.created)
return -EINVAL;
if (ret)
return ret;
[...]
if (!irqchip_in_kernel(kvm))
static_branch_inc(&userspace_irqchip_in_use);
- Userspace ignores the error and issues KVM_ARM_VCPU_INIT again.
Since the timer is already enabled, control moves through the
following flow, ultimately hitting the WARN_ON():
kvm_timer_vcpu_reset()
if (timer->enabled)
kvm_timer_update_irq()
if (!userspace_irqchip())
ret = kvm_vgic_inject_irq()
ret = vgic_lazy_init()
if (unlikely(!vgic_initialized(kvm)))
if (kvm->arch.vgic.vgic_model !=
KVM_DEV_TYPE_ARM_VGIC_V2)
return -EBUSY;
WARN_ON(ret);
Theoretically, since userspace_irqchip_in_use's functionality can be
simply replaced by '!irqchip_in_kernel()', get rid of the static key
to avoid the mismanagement, which also helps with the syzbot issue.
Cc: <stable@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Start reprogramming PMU events at nested boundaries now that everything
is in place to handle the EL2 event filter. Only repaint events where
the filter differs between EL1 and EL2 as a slight optimization.
PMU now 'works' for nested VMs, albeit slowly.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182559.3364829-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
It hopefully comes as no surprise when I say that vEL2 actually runs at
EL1. So, the guest hypervisor's EL2 event filter (NSH) needs to actually
be applied to EL1 in the perf event. In addition to this, the disable
bit for the guest counter range (HPMD) needs to have the effect of
stopping the affected counters.
Do exactly that by stuffing ::exclude_kernel with the combined effect of
these controls. This isn't quite enough yet, as the backing perf events
need to be reprogrammed upon nested ERET/exception entry to remap the
effective filter onto ::exclude_kernel.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-18-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Counters that fall in the hypervisor range (i.e. N >= HPMN) have a
separate control for enabling 64 bit overflow. Take it into account.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-17-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Checking the exception level filters for a PMC is a minor annoyance to
open code. Add helpers to check if an event counts at EL0 and EL1, which
will prove useful in a subsequent change.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-15-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The value of MDCR_EL2.HPMN controls the number of event counters made
visible to EL0 and EL1. This means it is possible for the guest
hypervisor to allow direct access to event counters to the L2.
Rework KVM's PMU register emulation to take the effects of HPMN into
account when handling a trap. For bitmask-style registers, writes only
affect accessible registers.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-14-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Nested PMU support requires dynamically changing the visible range of
PMU counters based on the exception level and value of MDCR_EL2.HPMN. At
the same time, the PMU emulation code needs to know the absolute number
of implemented counters, regardless of context.
Rename the existing helper to make it obvious that it returns the number
of implemented counters and not anything else.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-13-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Everything is in place now for KVM to actually handle MDCR_EL2.HPMN. Not
only that, the emulation is capable of doing FEAT_HPMN0. Advertise
support for the feature in the VM's ID registers. It is possible to
emulate FEAT_HPMN0 on hardware that doesn't support it since KVM
currently traps all PMU registers. Having said that, let's only
advertise the feature on supporting hardware in case KVM ever provides
'direct' PMU support to VMs w/o involving host perf.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-12-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
MDCR_EL2.HPMN splits the PMU event counters into two ranges: the first
range is accessible from all ELs, and the second range is accessible
only to EL2/3. Supposing the guest hypervisor allows direct access to
the PMU counters from the L2, KVM needs to locally handle those
accesses.
Add a new complex trap configuration for HPMN that checks if the counter
index is accessible to the current context. As written, the architecture
suggests HPMN only causes PMEVCNTR<n>_EL0 to trap, though intuition (and
the pseudocode) suggest that the trap applies to PMEVTYPER<n>_EL0 as
well.
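The accessibility rule itself is simple; as a standalone sketch
(parameter names made up):

  #include <stdbool.h>

  /* Counters [0, HPMN) are visible at all ELs, counters [HPMN, N)
   * only to EL2/EL3; per the note above, the same rule is applied to
   * PMEVTYPER<n>_EL0 as to PMEVCNTR<n>_EL0. */
  static bool pmc_accessible(unsigned int idx, unsigned int hpmn,
                             unsigned int current_el)
  {
          return idx < hpmn || current_el >= 2;
  }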
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-11-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
TPM and TPMCR trap bits also affect Host EL0. How fun.
Mark these two trap bits as such and take advantage of the new
infrastructure for dealing w/ EL0 traps.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-10-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Wire up the other end of traps that affect host EL0 by actually
injecting them into the guest hypervisor. Skip over FGT entirely, as a
cursory glance suggests no FGT is effective in host EL0.
Note that kvm_inject_nested() is already equipped for handling
exceptions while the VM is already in a host context.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
BEHAVE_FORWARD_ANY is slightly ambiguous, especially since we're about
to cram some more information into the enum. Rephrase it.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
KVM uses a sanity-check to avoid infinite recursion in trap combinations
that could potentially depend on themselves. Narrow the scope of this sanity
check to the exact CGT IDs that correspond w/ trap combos, opening the
door to using 'complex' traps as part of a combination.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Handle the initialization of trap registers at the hypervisor in
pKVM, even for non-protected guests. The host is not trusted with
the values of the trap registers, regardless of the VM type.
Therefore, when switching between the host and the guests, only
flush the HCR_EL2 TWI and TWE bits. The host is allowed to
configure these for opportunistic scheduling, as neither affects
the protection of VMs or the hypervisor.
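The bit flush described above amounts to something like this sketch
(the function name is hypothetical; TWI/TWE bit positions per the
architecture):

  #include <stdint.h>

  #define HCR_TWI         (1ULL << 13)    /* trap WFI */
  #define HCR_TWE         (1ULL << 14)    /* trap WFE */

  /* Only the host's WFx trapping choices survive; every other trap
   * bit comes from the hypervisor's own trusted defaults. */
  static uint64_t compute_vcpu_hcr(uint64_t hyp_default, uint64_t host_hcr)
  {
          const uint64_t host_owned = HCR_TWI | HCR_TWE;

          return (hyp_default & ~host_owned) | (host_hcr & host_owned);
  }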
Reported-by: Will Deacon <will@kernel.org>
Fixes: 814ad8f96e ("KVM: arm64: Drop trapping of PAuth instructions/keys")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241018074833.2563674-5-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Do not trust the state of the VM as provided by the host when
initializing the hypervisor's view of the VM state. Initialize it
instead at EL2 to a known good and safe state, as pKVM already
does with hypervisor VCPU states.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241018074833.2563674-4-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Move kvm_vcpu_enable_ptrauth() to a shared header to be used by
hypervisor code in protected mode.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241018074833.2563674-3-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Move pkvm_vcpu_init_traps() to the initialization of the
hypervisor's vcpu state in init_pkvm_hyp_vcpu(), and remove the
associated hypercall.
In protected mode, traps need to be initialized whenever a VCPU
is initialized anyway, and not only for protected VMs. This also
saves an unnecessary hypercall.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241018074833.2563674-2-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
commit 011e5f5bf5 ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling. A previous patch supplied the missing trap
handling.
Existing VMs that have the MPAM field of ID_AA64PFR0_EL1 set need to
be migratable, but there is little point enabling the MPAM CPU
interface on new VMs until there is something a guest can do with it.
Clear the MPAM field from the guest's ID_AA64PFR0_EL1 and, on hardware
that supports MPAM, politely ignore the VMM's attempts to set this bit.
Guests exposed to this bug have the sanitised value of the MPAM field,
so only the correct value needs to be ignored. This means the field
can continue to be used to block migration to incompatible hardware
(between MPAM=1 and MPAM=5), and the VMM can't rely on the field
being ignored.
Signed-off-by: James Morse <james.morse@arm.com>
Co-developed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-7-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The sys_reg_descs array holds function pointers and reset values for
managing the user-space and guest view of system registers. These
are mostly created by a set of macros, as only some combinations
of behaviour are needed.
If a register needs special treatment, its sys_reg_descs entry is
open-coded. This is true of some id registers where the value provided
by user-space is validated by some helpers.
Before adding another one of these, add a helper that covers the
existing special cases. 'ID_FILTERED' expects helpers to set the
user-space value, and retrieve the modified reset value.
Like ID_WRITABLE() this uses id_visibility(), which should have no
functional change for the registers converted to use ID_FILTERED().
read_sanitised_id_aa64dfr0_el1() and read_sanitised_id_aa64pfr0_el1()
have been refactored to be called from kvm_read_sanitised_id_reg(), to
try to be consistent with ID_WRITABLE().
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-6-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
commit 011e5f5bf5 ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling.
If you are unlucky, this results in an MPAM aware guest being delivered
an undef during boot. The host prints:
| kvm [97]: Unsupported guest sys_reg access at: ffff800080024c64 [00000005]
| { Op0( 3), Op1( 0), CRn(10), CRm( 5), Op2( 0), func_read },
Which results in:
| Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc7-00559-gd89c186d50b2 #14616
| Hardware name: linux,dummy-virt (DT)
| pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : test_has_mpam+0x18/0x30
| lr : test_has_mpam+0x10/0x30
| sp : ffff80008000bd90
...
| Call trace:
| test_has_mpam+0x18/0x30
| update_cpu_capabilities+0x7c/0x11c
| setup_cpu_features+0x14/0xd8
| smp_cpus_done+0x24/0xb8
| smp_init+0x7c/0x8c
| kernel_init_freeable+0xf8/0x280
| kernel_init+0x24/0x1e0
| ret_from_fork+0x10/0x20
| Code: 910003fd 97ffffde 72001c00 54000080 (d538a500)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
| ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Add the support to enable the traps, and handle the three guest accessible
registers by injecting an UNDEF. This stops KVM from spamming the host
log, but doesn't yet hide the feature from the id registers.
With MPAM v1.0 we can trap the MPAMIDR_EL1 register only if
ARM64_HAS_MPAM_HCR; with v1.1, an additional MPAM2_EL2.TIDR bit traps
MPAMIDR_EL1 on platforms that don't have MPAMHCR_EL2. Enable one of
these if either is supported. If neither is supported, the guest can
discover that the CPU has MPAM support, and how many PARTIDs etc. the
host has ... but it can't influence anything, so it's harmless.
Fixes: 011e5f5bf5 ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register")
CC: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/linux-arm-kernel/20200925160102.118858-1-james.morse@arm.com/
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-5-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Until now, we didn't really care about WXN as it didn't have an
effect on the R/W permissions (only the execution could be dropped),
and therefore wasn't of interest for AT.
However, with S1POE, WXN can revoke the Write permission if an
overlay is active and that execution is allowed. This *is* relevant
to AT.
Add full handling of WXN so that we correctly handle this case.
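Expressed on a permission triple (the structure is made up), the case
that now matters to AT is:

  #include <stdbool.h>

  struct s1_perms { bool r, w, x; };      /* illustrative only */

  /* If WXN is set and the location remains executable once overlays
   * are applied, the Write permission is removed. Previously this
   * never changed an AT result; with an active overlay it can. */
  static void apply_wxn(struct s1_perms *p, bool wxn)
  {
          if (wxn && p->x)
                  p->w = false;
  }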
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-38-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We now have the infrastructure in place to emulate S1POE:
- direct permissions are always overlay-capable
- indirect permissions are overlay-capable if the permissions are
in the 0b0xxx range
- the overlays are strictly subtractive
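The subtractive rule in the last point boils down to an AND, as in this
sketch (structure made up):

  #include <stdbool.h>

  struct s1_perms { bool r, w, x; };      /* illustrative only */

  /* Strictly subtractive: the result is the AND of the base (direct
   * or indirect) permissions and the overlay, so an overlay can never
   * grant something the base didn't. */
  static struct s1_perms apply_overlay(struct s1_perms base,
                                       struct s1_perms ovl)
  {
          return (struct s1_perms){ base.r && ovl.r,
                                    base.w && ovl.w,
                                    base.x && ovl.x };
  }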
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-37-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Move the conditions describing PAN into the s1_walk_info
structure, in an effort to declutter the permission processing.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-36-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The hierarchical permissions must be disabled when POE is enabled
in the translation regime used for a given table walk.
We store the two enable bits in the s1_walk_info structure so that
they can be retrieved down the line, as they will be useful.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-35-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Just like the other extensions affecting address translation,
we must save/restore POE so that an out-of-context translation
context can be restored and used with the AT instructions.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-34-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
S1POE support implies support for POR_EL2, which we provide by
- adding it to the vcpu_sysreg enum
- advertising it as mapped to its EL1 counterpart in get_el2_to_el1_mapping
- wiring it in the sys_reg_desc table with the correct visibility
- handling POR_EL1 in __vcpu_{read,write}_sys_reg_from_cpu()
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-32-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Just like we have kvm_has_s1pie(), add its S1POE counterpart,
making the code slightly more readable.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-31-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
All the EL0/EL1 S1PIE/S1POE system registers are caught by the HCR_EL2
TVM and TRVM bits. Reflect this in the coarse grained trap table.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-30-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
It took me some time to realise it, but CPTR_EL2.E0POE does not
apply to a guest, only to EL0 when InHost(). And when InHost(),
CPTR_EL2 is mapped to CPACR_EL1, meaning that the E0POE bit naturally
takes effect without any trap.
To sum it up, this trap bit is better left ignored; we will never
have to handle it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-29-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
With a visibility defined for these registers, there is no need
to check again for S1PIE or TCRX being implemented, as
perform_access() already handles it.
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-27-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When the guest does not support S1PIE we should not allow any access
to the system registers it adds in order to ensure that we do not create
spurious issues with guest migration. Add a visibility operation for these
registers.
Fixes: 86f9de9db1 ("KVM: arm64: Save/restore PIE registers")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-3-376624fa829c@kernel.org
[maz: simplify by using __el2_visibility(), kvm_has_s1pie() throughout]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-26-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We are starting to have a bunch of visibility helpers checking
for EL2 + something else, and we are going to add more.
Simplify things somewhat by introducing a helper that implements
exactly that, taking a visibility helper as a parameter, and
convert the existing ones to it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-23-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
It doesn't take much effort to implement S1PIE support in AT.
It is only a matter of using the AArch64.S1IndirectBasePermissions()
encodings for the permissions, ignoring GCS (which has no impact on
AT), and enforcing FEAT_PAN3 being enabled, as this is a requirement
of FEAT_S1PIE.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-22-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
S1PIE implicitly disables hierarchical permissions, as specified in
R_JHSVW, by making TCR_ELx.HPDn RES1.
Add a predicate for S1PIE being enabled for a given translation regime,
and emulate this behaviour by forcing the hpd field to true if S1PIE
is enabled for that translation regime.
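In the walker this is essentially a one-liner; something like the
following (field and predicate names per the description above, the
exact form is illustrative):

  /* R_JHSVW: S1PIE makes TCR_ELx.HPDn RES1, i.e. hierarchical
   * permission disable is unconditionally in effect when indirect
   * permissions are enabled for the regime being walked. */
  wi->hpd |= s1pie_enabled(vcpu, wi->regime);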
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-21-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The AArch64.S1DirectBasePermissions() pseudocode deals with both
direct and hierarchical S1 permission evaluation. While this is
probably convenient in the pseudocode, we would like a bit more
flexibility to slot in things like indirect permissions.
To that effect, split the two permission check parts out of
handle_at_slow() and into their own functions. The permissions
are passed around as part of the walk_result structure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-20-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Emulating AT using AT instructions requires that the live state
matches the translation regime the AT instruction targets.
If targeting the EL1&0 translation regime and S1PIE is
supported, we also need to restore that state (covering TCR2_EL1,
PIR_EL1, and PIRE0_EL1).
Add the required system register switcheroo.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-19-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add the FEAT_S1PIE EL2 registers to the sysreg descriptor array so that
they can be handled as a trap.
Access to these registers is conditional based on ID_AA64MMFR3_EL1.S1PIE
being advertised.
Similarly to other changes, PIRE0_EL2 is guaranteed to trap
thanks to the D22677 update to the architecture.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-17-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Like their EL1 equivalents, the EL2-specific FEAT_S1PIE registers
are context-switched. This is made conditional on both FEAT_TCRX
and FEAT_S1PIE being advertised.
Note that this change only makes sense if read together with the
issue D22677 contained in 102105_K.a_04_en.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We currently only use the masking (RES0/RES1) facility for VNCR
registers, as they are memory-based and thus easy to sanitise.
But we could apply the same thing to other registers if we:
- split the sanitisation from __VNCR_START__
- apply the sanitisation when reading from a HW register
This involves a new "marker" in the vcpu_sysreg enum, which
defines the point at which the sanitisation applies (the VNCR
registers being of course after this marker).
While we are at it, rename kvm_vcpu_sanitise_vncr_reg() to
kvm_vcpu_apply_reg_masks(), which is vaguely more explicit,
and harden set_sysreg_masks() against setting masks for
random registers...
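As a model of the resulting read path (layout and names are
illustrative, not the actual enum):

  #include <stdint.h>

  #define NR_REGS 8
  #define MARKER  4       /* sanitisation applies from here on */

  struct reg_masks { uint64_t res0, res1; };

  struct vcpu_model {                     /* illustrative only */
          uint64_t regs[NR_REGS];
          struct reg_masks masks[NR_REGS];
  };

  /* Registers before the marker are read raw; from the marker onwards
   * (which covers all the VNCR registers), clear the RES0 bits and
   * set the RES1 bits on read. */
  static uint64_t vcpu_read_reg(const struct vcpu_model *v, int reg)
  {
          uint64_t val = v->regs[reg];

          if (reg >= MARKER)
                  val = (val & ~v->masks[reg].res0) | v->masks[reg].res1;
          return val;
  }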
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Like its EL1 equivalent, TCR2_EL2 gets context-switched.
This is made conditional on FEAT_TCRX being advertised.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-14-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
TCR2_EL2 is a bag of control bits, all of which are only valid if
certain features are present, and RES0 otherwise.
Describe these constraints and register them with the masking
infrastructure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Whenever we need to restore the guest's system registers to the CPU, we
now need to take care of the EL2 system registers as well. Most of them
are accessed via traps only, but some have an immediate effect and also
a guest running in VHE mode would expect them to be accessible via their
EL1 encoding, which we do not trap.
For vEL2 we write the virtual EL2 registers with an identical format directly
into their EL1 counterpart, and translate the few registers that have a
different format for the same effect on the execution when running a
non-VHE guest hypervisor.
Based on an initial patch from Andre Przywara, rewritten many times
since.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-8-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add the TCR2_EL2 register to the per-vcpu sysreg register array,
the sysreg descriptor array, and advertise it as mapped to TCR2_EL1
for NV purposes.
Access to this register is conditional based on ID_AA64MMFR3_EL1.TCRX
being advertised.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Accessing CNTHCTL_EL2 is fraught with danger if running with
HCR_EL2.E2H=1: half of the bits are held in CNTKCTL_EL1, and
thus can be changed behind our back, while the rest lives
in the CNTHCTL_EL2 shadow copy that is memory-based.
Yes, this is a lot of fun!
Make sure that we merge the two on read access, while we can
write to CNTKCTL_EL1 in a more straightforward manner.
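The read side reduces to a merge of the two halves, roughly as follows
(the mask is a stand-in for the architected bit split; names are
illustrative):

  #include <stdint.h>

  /* With E2H=1, the bits of CNTHCTL_EL2 that alias CNTKCTL_EL1 are
   * live (and can change behind our back); the rest only exists in
   * the memory-based shadow. Merge the two on read. */
  static uint64_t read_cnthctl(uint64_t shadow, uint64_t cntkctl_el1,
                               uint64_t live_mask)
  {
          return (shadow & ~live_mask) | (cntkctl_el1 & live_mask);
  }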
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As KVM has grown a bunch of new system registers for NV, it appears
that we are missing them in the get_el2_to_el1_mapping() list.
Most of them are not crucial as they don't tend to be accessed via
vcpu_read_sys_reg() and vcpu_write_sys_reg().
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
__kvm_at_s1e2() contains the definition of an s2_mmu for the
current context, but doesn't make any use of it. Drop it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Returning an abort to the guest for an unsupported MMIO access is a
documented feature of the KVM UAPI. Nevertheless, it's clear that this
plumbing has seen limited testing, since userspace can trivially cause a
WARN in the MMIO return:
WARNING: CPU: 0 PID: 30558 at arch/arm64/include/asm/kvm_emulate.h:536 kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
Call trace:
kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
kvm_arch_vcpu_ioctl_run+0x98/0x15b4 arch/arm64/kvm/arm.c:1133
kvm_vcpu_ioctl+0x75c/0xa78 virt/kvm/kvm_main.c:4487
__do_sys_ioctl fs/ioctl.c:51 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:893
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x1e0/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x38/0x68 arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
The splat is complaining that KVM is advancing PC while an exception is
pending, i.e. that KVM is retiring the MMIO instruction despite a
pending synchronous external abort. Womp womp.
Fix the glaring UAPI bug by skipping over all the MMIO emulation in
case there is a pending synchronous exception. Note that while userspace
is capable of pending an asynchronous exception (SError, IRQ, or FIQ),
it is still safe to retire the MMIO instruction in this case as (1) they
are by definition asynchronous, and (2) KVM relies on hardware support
for pending/delivering these exceptions instead of the software state
machine for advancing PC.
Cc: stable@vger.kernel.org
Fixes: da345174ce ("KVM: arm/arm64: Allow user injection of external data aborts")
Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025203106.3529261-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
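A condensed sketch of the fixed flow, with hypothetical helper names standing in for KVM's internals:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for KVM's pending-exception bookkeeping. */
static bool pending_sync_exception;     /* e.g. userspace-injected abort */

static void complete_mmio_load(void)    { /* write data back to the reg */ }
static void skip_mmio_instruction(void) { /* advance PC past the access */ }

static int handle_mmio_return(void)
{
        /*
         * A pending synchronous exception means the MMIO access must not
         * retire: skip all the emulation and let the abort be delivered.
         * Pending *asynchronous* exceptions (SError/IRQ/FIQ) are fine,
         * since hardware delivers those regardless of where PC points.
         */
        if (pending_sync_exception)
                return 1;

        complete_mmio_load();
        skip_mmio_instruction();
        return 1;
}

int main(void)
{
        pending_sync_exception = true;
        printf("%d\n", handle_mmio_return());
        return 0;
}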
Don't mark pages/folios as accessed in the primary MMU when making a SPTE
young in KVM's secondary MMU, as doing so relies on
kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and
wasteful. KVM participates in page aging via mmu_notifiers, so there's no
need to push "accessed" updates to the primary MMU.
Dropping use of kvm_set_pfn_accessed() also paves the way for removing
kvm_pfn_to_refcounted_page() and all its users.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-84-seanjc@google.com>
Use __gfn_to_page() instead when copying MTE tags between guest and
userspace. This will eventually allow removing gfn_to_pfn_prot(),
gfn_to_pfn(), kvm_pfn_to_refcounted_page(), and related APIs.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-78-seanjc@google.com>
Convert arm64 to use __kvm_faultin_pfn()+kvm_release_faultin_page().
Three down, six to go.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-57-seanjc@google.com>
Mark pages/folios accessed+dirty prior to dropping mmu_lock, as marking a
page/folio dirty after it has been written back can make some filesystems
unhappy (backing KVM guests with such filesystem files is uncommon, and
the race is minuscule, hence the lack of complaints).
While scary sounding, practically speaking the worst case scenario is that
KVM would trigger this WARN in filemap_unaccount_folio():
/*
* At this point folio must be either written or cleaned by
* truncate. Dirty folio here signals a bug and loss of
* unwritten data - on ordinary filesystems.
*
* But it's harmless on in-memory filesystems like tmpfs; and can
* occur when a driver which did get_user_pages() sets page dirty
* before putting it, while the inode is being finally evicted.
*
* Below fixes dirty accounting after removing the folio entirely
* but leaves the dirty flag set: it has no effect for truncated
* folio and anyway will be cleared before returning folio to
* buddy allocator.
*/
if (WARN_ON_ONCE(folio_test_dirty(folio) &&
mapping_can_writeback(mapping)))
folio_account_cleaned(folio, inode_to_wb(mapping->host));
KVM won't actually write memory because the stage-2 mappings are protected
by the mmu_notifier, i.e. there is no risk of loss of data, even if the
VM were backed by memory that needs writeback.
See the link below for additional details.
This will also allow converting arm64 to kvm_release_faultin_page(), which
requires that mmu_lock be held (for the aforementioned reason).
Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-56-seanjc@google.com>
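Reduced to a skeleton (every name below is illustrative pseudocode, not a kvm/arm64 symbol), the ordering constraint looks like this:

/* Skeleton of the ordering fix: release the page while mmu_lock is
 * still held, so the dirty bit cannot land after writeback/truncate. */
struct page_ref { int accessed, dirty; };

static void lock_mmu(void)   { /* mmu_lock */ }
static void unlock_mmu(void) { }

static void release_faultin_page(struct page_ref *p, int writable)
{
        p->accessed = 1;
        if (writable)
                p->dirty = 1;   /* safe: mapping is still notifier-protected */
}

static void stage2_fault(struct page_ref *p, int writable)
{
        lock_mmu();
        /* ... install the stage-2 mapping ... */
        release_faultin_page(p, writable);      /* before dropping mmu_lock */
        unlock_mmu();
}

int main(void)
{
        struct page_ref p = {0};
        stage2_fault(&p, 1);
        return p.dirty ? 0 : 1;
}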
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL.
No functional change intended.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-19-seanjc@google.com>
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass
"false", and remove a comment blurb about KVM running only the "GUP fast"
part in atomic context.
No functional change intended.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-13-seanjc@google.com>
Pass through the SYSTEM_OFF2 function for hibernation, just like SYSTEM_OFF.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-6-dwmw2@infradead.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As with PSCI v1.1 in commit 512865d83f ("KVM: arm64: Bump guest PSCI
version to 1.1"), expose v1.3 to the guest by default. The SYSTEM_OFF2
call which is exposed by doing so is compatible for userspace because
it's just a new flag in the event that KVM raises, in precisely the same
way that SYSTEM_RESET2 was compatible when v1.1 was enabled by default.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-4-dwmw2@infradead.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function
which is analogous to ACPI S4 state. This will allow hosting
environments to determine that a guest is hibernated rather than just
powered off, and ensure that they preserve the virtual environment
appropriately to allow the guest to resume safely (or bump the
hardware_signature in the FACS to trigger a clean reboot instead).
This feature is safe to enable unconditionally (in a subsequent commit)
because it is exposed to userspace through the existing
KVM_SYSTEM_EVENT_SHUTDOWN event, just with an additional flag which
userspace can use to know that the instance intended hibernation instead
of a plain power-off.
As with SYSTEM_RESET2, there is only one type available (in this case
HIBERNATE_OFF), and it is not explicitly reported to userspace through
the event; userspace can get it from the registers if it cares.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-3-dwmw2@infradead.org
[oliver: slight cleanup of comments]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
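In outline, the handling looks something like the following; the constants mirror the shape of the UAPI/PSCI names, but treat the exact identifiers and values as placeholders.

#include <stdint.h>
#include <stdio.h>

#define KVM_SYSTEM_EVENT_SHUTDOWN       2u              /* placeholder */
#define SHUTDOWN_FLAG_PSCI_OFF2         (1u << 0)       /* placeholder */
#define PSCI_OFF2_TYPE_HIBERNATE_OFF    0u              /* placeholder */

struct run_event { uint32_t type; uint64_t flags; };

static int psci_system_off2(uint32_t off_type, struct run_event *ev)
{
        /* Only one type exists (HIBERNATE_OFF); reject anything else. */
        if (off_type != PSCI_OFF2_TYPE_HIBERNATE_OFF)
                return -1;

        /* Same exit userspace already knows, plus a flag saying the
         * guest intended hibernation rather than a plain power-off. */
        ev->type = KVM_SYSTEM_EVENT_SHUTDOWN;
        ev->flags = SHUTDOWN_FLAG_PSCI_OFF2;
        return 0;
}

int main(void)
{
        struct run_event ev = {0};
        psci_system_off2(PSCI_OFF2_TYPE_HIBERNATE_OFF, &ev);
        printf("type=%u flags=%#llx\n", ev.type,
               (unsigned long long)ev.flags);
        return 0;
}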
Now that 'kvm_vgic_global_state' is no longer needed for ICC_CTLR_EL1
emulation on machines with a broken SEIS implementation, drop the
pKVM hypervisor mapping of the page.
Note that kvm_vgic_global_state is still mapped in non-protected
hypervisor configurations (i.e. {n,h}VHE) through the rodata section
mapping.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241022144016.27350-3-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
ICC_CTLR_EL1 accesses from a guest are trapped and emulated on systems
with broken SEIS support and without FEAT_GICv3_TDIR. On such systems,
we mask SEIS support in 'kvm_vgic_global_state.ich_vtr_el2' and so the
value of ICC_CTLR_EL1.SEIS visible to the guest is always zero.
Simplify the ICC_CTLR_EL1 read emulation to return 0 for the SEIS field,
rather than reading an always-zero value from the global state.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241022144016.27350-2-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
kvm_vgic_map_resources() prematurely marks the distributor as 'ready',
potentially allowing vCPUs to enter the guest before the distributor's
MMIO registration has been made visible.
Plug the race by marking the distributor as ready only after MMIO
registration is completed. Rely on the implied ordering of
synchronize_srcu() to ensure the MMIO registration is visible before
vgic_dist::ready. This also means that writers to vgic_dist::ready are
now serialized by the slots_lock, which was effectively the case already
as all writers held the slots_lock in addition to the config_lock.
Fixes: 59112e9c39 ("KVM: arm64: vgic: Fix a circular locking issue")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241017001947.2707312-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
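Schematically (with stand-in names), the fix moves the 'ready' store after both the registration and the srcu synchronisation that makes it visible:

struct dist { int ready; };

static void lock_slots(void)   { /* slots_lock serialises 'ready' writers */ }
static void unlock_slots(void) { }
static int  register_dist_mmio(void)    { return 0; }
static void wait_for_srcu_readers(void) { /* models synchronize_srcu() */ }

static int map_resources(struct dist *d)
{
        int ret;

        lock_slots();
        ret = register_dist_mmio();      /* MMIO registration first... */
        if (!ret) {
                wait_for_srcu_readers(); /* ...made visible to readers... */
                d->ready = 1;            /* ...and only then publish ready */
        }
        unlock_slots();
        return ret;
}

int main(void)
{
        struct dist d = {0};
        return map_resources(&d);
}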
KVM commits to a particular sizing of SPIs when the vgic is initialized,
which is before the point a vgic becomes ready. On top of that, KVM
supplies a default amount of SPIs should userspace not explicitly
configure this.
As such, the check for vgic_ready() in the handling of
KVM_DEV_ARM_VGIC_GRP_NR_IRQS is completely wrong, and testing if nr_spis
is nonzero is sufficient for preventing userspace from playing games
with us.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241017001947.2707312-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Our idmap is becoming too big, to the point where it doesn't fit in
a 4kB page anymore.
There is some low-hanging fruit though, such as the el2_init_state
horror that is expanded 3 times in the kernel. Let's at least limit
ourselves to two copies, which makes the kernel link again.
At some point, we'll have to have a better way of doing this.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
Enable MTE support for hugetlb.
The MTE page flags will be set on the folio only. When copying
hugetlb folio (for example, CoW), the tags for all subpages will be copied
when copying the first subpage.
When freeing hugetlb folio, the MTE flags will be cleared.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20241001225220.271178-1-yang@os.amperecomputing.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
As there is very little ordering in the KVM API, userspace can
instantiate a half-baked GIC (missing its memory map, for example)
at almost any time.
This means that, with the right timing, a thread running vcpu-0
can enter the kernel without a GIC configured and get a GIC created
behind its back by another thread. Amusingly, it will pick up
that GIC and start messing with the data structures without the
GIC having been fully initialised.
Similarly, a thread running vcpu-1 can enter the kernel, and try
to init the GIC that was previously created. Since this GIC isn't
properly configured (no memory map), it fails to correctly initialise.
And that's the point where we decide to teardown the GIC, freeing all
its resources. Behind vcpu-0's back. Things stop pretty abruptly,
with a variety of symptoms. Clearly, this isn't good; we should be
a bit more careful about this.
It is obvious that this guest is not viable, as it is missing some
important part of its configuration. So instead of trying to tear
bits of it down, let's just mark it as *dead*. It means that any
further interaction from userspace will result in -EIO. The memory
will be released on the "normal" path, when userspace gives up.
Cc: stable@vger.kernel.org
Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241009183603.3221824-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
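The gist, with invented scaffolding: once the VM is known to be non-viable, every further interaction fails with -EIO, and the memory is only freed on the normal path when userspace gives up.

#include <errno.h>
#include <stdbool.h>

struct vm { bool dead; };

static void mark_vm_dead(struct vm *vm)
{
        vm->dead = true;        /* no partial teardown behind vCPUs' backs */
}

static int vm_interact(struct vm *vm)
{
        if (vm->dead)
                return -EIO;
        /* ... normal handling ... */
        return 0;
}

int main(void)
{
        struct vm vm = { .dead = false };
        mark_vm_dead(&vm);
        return vm_interact(&vm) == -EIO ? 0 : 1;
}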
Prior to commit 70ed723829 ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1")
we just exposed the sanitised view of ID_AA64MMFR3_EL1 to guests, meaning
that they saw both TCRX and S1PIE if present on the host machine. That
commit added VMM control over the contents of the register and exposed
S1POE but removed S1PIE, meaning that the extension is no longer visible
to guests. Reenable support for S1PIE with VMM control.
Fixes: 70ed723829 ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1")
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241005-kvm-arm64-fix-s1pie-v1-1-5901f02de749@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
There's been a decent amount of attention around unmaps of nested MMUs,
and TLBI handling is no exception to this. Add a comment clarifying why
it is safe to reschedule during a TLBI unmap, even without a reference
on the MMU in progress.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Currently, when a nested MMU is repurposed for some other MMU context,
KVM unmaps everything during vcpu_load() while holding the MMU lock for
write. This is quite a performance bottleneck for large nested VMs, as
all vCPU scheduling will spin until the unmap completes.
Start punting the MMU cleanup to a vCPU request, where it is then
possible to periodically release the MMU lock and CPU in the presence of
contention.
Ensure that no vCPU winds up using a stale MMU by tracking the pending
unmap on the S2 MMU itself and requesting an unmap on every vCPU that
finds it.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
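The shape of the scheme, in stand-in code: the pending unmap is tracked on the S2 MMU itself and honoured by every vCPU that picks the MMU up, mimicking the vCPU request machinery described above.

#include <stdbool.h>

struct s2_mmu { bool pending_unmap; };

static void stage2_unmap_yielding(struct s2_mmu *mmu)
{
        /* Walks the tables, periodically releasing the MMU lock and the
         * CPU under contention instead of stalling all vCPU scheduling. */
        mmu->pending_unmap = false;
}

static void mmu_repurpose(struct s2_mmu *mmu)
{
        mmu->pending_unmap = true;      /* punt, don't unmap synchronously */
}

static void vcpu_load_mmu(struct s2_mmu *mmu)
{
        if (mmu->pending_unmap)         /* no vCPU may use a stale MMU */
                stage2_unmap_yielding(mmu);
}

int main(void)
{
        struct s2_mmu mmu = { false };
        mmu_repurpose(&mmu);
        vcpu_load_mmu(&mmu);
        return mmu.pending_unmap ? 1 : 0;
}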
Right now the nested code allows unmap operations on a shadow stage-2 to
block unconditionally. This is wrong in a couple places, such as a
non-blocking MMU notifier or on the back of a sched_in() notifier as
part of shadow MMU recycling.
Carry through whether or not blocking is allowed to
kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU
reclaim would precipitate a stack overflow from a pile of kvm_sched_in()
callbacks, all trying to recycle a stage-2 MMU.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
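Plumbing a blocking hint through an unmap path looks roughly like this (signatures invented for illustration):

#include <stdbool.h>
#include <stdint.h>

static void stage2_unmap_range(uint64_t addr, uint64_t size, bool may_block)
{
        for (uint64_t off = 0; off < size; off += 0x1000) {
                /* ... zap one entry at addr + off ... */
                if (may_block) {
                        /* a cond_resched()-style yield is only legal here */
                }
        }
        (void)addr;
}

/* Non-blocking MMU notifiers and sched_in() callbacks must not sleep,
 * so the caller's context decides whether blocking is allowed. */
static void notifier_unmap(uint64_t addr, uint64_t size, bool blockable)
{
        stage2_unmap_range(addr, size, blockable);
}

int main(void)
{
        notifier_unmap(0, 0x10000, false);
        return 0;
}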
If a vCPU is scheduling out and not in WFI emulation, it is highly
likely it will get scheduled again soon and reuse the MMU it had before.
Dropping the MMU at vcpu_put() can have some unfortunate consequences,
as the MMU could get reclaimed and used in a different context, forcing
another 'cold start' on an otherwise active MMU.
Avoid that altogether by keeping a reference on the MMU if the vCPU is
scheduling out, ensuring that another vCPU cannot reclaim it while the
current vCPU is away. Since there are more MMUs than vCPUs, this does
not affect the guarantee that an unused MMU is available at any time.
Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible
code, at least for where it matters in the stage-2 abort path. Yes, the
MMU can change across WFI emulation, but there isn't even a use case
where this would matter.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
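As a toy refcount model of the idea (names invented): a reference held across a sched-out prevents another vCPU from reclaiming this MMU for a different context, and only the WFI path drops it.

struct s2_mmu { int refcnt; };

static void mmu_get(struct s2_mmu *mmu) { mmu->refcnt++; }
static void mmu_put(struct s2_mmu *mmu) { mmu->refcnt--; }

static void vcpu_sched_out(struct s2_mmu *hw_mmu, int in_wfi_emulation)
{
        /* Only drop the reference when the vCPU is expected to be away
         * for a while (WFI); plain preemption keeps the MMU warm. */
        if (in_wfi_emulation)
                mmu_put(hw_mmu);
}

static int mmu_reclaimable(struct s2_mmu *mmu)
{
        return mmu->refcnt == 0;
}

int main(void)
{
        struct s2_mmu mmu = { 0 };
        mmu_get(&mmu);                  /* taken at vcpu_load() */
        vcpu_sched_out(&mmu, 0);        /* preempted, not in WFI */
        return mmu_reclaimable(&mmu);   /* 0: still referenced */
}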
Alex reports that syzkaller has managed to trigger a use-after-free when
tearing down a VM:
BUG: KASAN: slab-use-after-free in kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769
Read of size 8 at addr ffffff801c6890d0 by task syz.3.2219/10758
CPU: 3 UID: 0 PID: 10758 Comm: syz.3.2219 Not tainted 6.11.0-rc6-dirty #64
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x17c/0x1a8 arch/arm64/kernel/stacktrace.c:317
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:324
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x94/0xc0 lib/dump_stack.c:119
print_report+0x144/0x7a4 mm/kasan/report.c:377
kasan_report+0xcc/0x128 mm/kasan/report.c:601
__asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381
kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769
kvm_vm_release+0x4c/0x60 virt/kvm/kvm_main.c:1409
__fput+0x198/0x71c fs/file_table.c:422
____fput+0x20/0x30 fs/file_table.c:450
task_work_run+0x1cc/0x23c kernel/task_work.c:228
do_notify_resume+0x144/0x1a0 include/linux/resume_user_mode.h:50
el0_svc+0x64/0x68 arch/arm64/kernel/entry-common.c:169
el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
Upon closer inspection, it appears that we do not properly tear down the
MMIO registration for a vCPU that fails creation late in the game, e.g.
a vCPU w/ the same ID already exists in the VM.
It is important to consider the context of the commit that introduced this bug
by moving the unregistration out of __kvm_vgic_vcpu_destroy(). That
change correctly sought to avoid an srcu v. config_lock inversion by
breaking up the vCPU teardown into two parts, one guarded by the
config_lock.
Fix the use-after-free while avoiding lock inversion by adding a
special-cased unregistration to __kvm_vgic_vcpu_destroy(). This is safe
because failed vCPUs are torn down outside of the config_lock.
Cc: stable@vger.kernel.org
Fixes: f616506754 ("KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007223909.2157336-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/idregs-6.12:
: .
: Make some fields of ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1
: writable from userspace, so that a VMM can influence the
: set of guest-visible features.
:
: - for ID_AA64DFR0_EL1: DoubleLock, WRPs, PMUVer and DebugVer
: are writable (courtesy of Shameer Kolothum)
:
: - for ID_AA64PFR1_EL1: BT, SSBS, CSV2_frac are writable
: (courtesy of Shaoqin Huang)
: .
KVM: selftests: aarch64: Add writable test for ID_AA64PFR1_EL1
KVM: arm64: Allow userspace to change ID_AA64PFR1_EL1
KVM: arm64: Use kvm_has_feat() to check if FEAT_SSBS is advertised to the guest
KVM: arm64: Disable fields that KVM doesn't know how to handle in ID_AA64PFR1_EL1
KVM: arm64: Make the exposed feature bits in AA64DFR0_EL1 writable from userspace
Signed-off-by: Marc Zyngier <maz@kernel.org>
When pKVM saves and restores the host floating point state on a SVE system,
it programs the vector length in ZCR_EL2.LEN to be whatever the maximum VL
for the PE is. But it uses a buffer allocated with kvm_host_sve_max_vl, the
maximum VL shared by all PEs in the system. This means that if we run on a
system where the maximum VLs are not consistent, we will overflow the buffer
on PEs which support larger VLs.
Since the host will not currently attempt to make use of non-shared VLs, fix
this by explicitly setting the EL2 VL to be the maximum shared VL when we
save and restore. This will enforce the limit on host VL usage. Should we
wish to support asymmetric VLs, this code will need to be updated along with
the required changes for the host:
https://lore.kernel.org/r/20240730-kvm-arm64-fix-pkvm-sve-vl-v6-0-cae8a2e0bd66@kernel.org
Fixes: b5b9955617 ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM")
Signed-off-by: Mark Brown <broonie@kernel.org>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240912-kvm-arm64-limit-guest-vl-v2-1-dd2c29cb2ac9@kernel.org
[maz: added punctuation to the commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
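The invariant is simply that the VL programmed into ZCR_EL2 must never exceed the VL the save buffer was sized for; schematically (the LEN encoding below is the usual (LEN+1) x 128-bit rule, everything else is simplified):

#include <stdint.h>

static unsigned int kvm_host_sve_max_vl = 32;   /* bytes; shared by all PEs */

static uint64_t vl_to_zcr_len(unsigned int vl_bytes)
{
        return vl_bytes / 16 - 1;       /* ZCR_ELx.LEN: 128-bit granules */
}

static uint64_t host_save_restore_zcr_el2(void)
{
        /*
         * Use the *shared* maximum, not this PE's own maximum: the buffer
         * was allocated for kvm_host_sve_max_vl, so programming a larger
         * per-PE VL would overflow it on asymmetric systems.
         */
        return vl_to_zcr_len(kvm_host_sve_max_vl);
}

int main(void)
{
        return host_save_restore_zcr_el2() == 1 ? 0 : 1;
}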
On an error, hyp_vcpu will be accessed while this memory has already
been relinquished to the host and unmapped from the hypervisor. Protect
the CPTR assignment with an early return.
Fixes: b5b9955617 ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM")
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20240919110500.2345927-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini:
"x86:
- KVM currently invalidates the entirety of the page tables, not just
those for the memslot being touched, when a memslot is moved or
deleted.
This does not traditionally have particularly noticeable overhead,
but Intel's TDX will require the guest to re-accept private pages
if they are dropped from the secure EPT, which is a non-starter.
Actually, the only reason why this is not already being done is a
bug which was never fully investigated and caused VM instability
with assigned GeForce GPUs, so allow userspace to opt into the new
behavior.
- Advertise AVX10.1 to userspace (effectively prep work for the
"real" AVX10 functionality that is on the horizon)
- Rework common MSR handling code to suppress errors on userspace
accesses to unsupported-but-advertised MSRs
This will allow removing (almost?) all of KVM's exemptions for
userspace access to MSRs that shouldn't exist based on the vCPU
model (the actual cleanup is non-trivial future work)
- Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC)
splits the 64-bit value into the legacy ICR and ICR2 storage,
whereas Intel (APICv) stores the entire 64-bit value at the ICR
offset
- Fix a bug where KVM would fail to exit to userspace if one was
triggered by a fastpath exit handler
- Add fastpath handling of HLT VM-Exit to expedite re-entering the
guest when there's already a pending wake event at the time of the
exit
- Fix a WARN caused by RSM entering a nested guest from SMM with
invalid guest state, by forcing the vCPU out of guest mode prior to
signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the
nested guest)
- Overhaul the "unprotect and retry" logic to more precisely identify
cases where retrying is actually helpful, and to harden all retry
paths against putting the guest into an infinite retry loop
- Add support for yielding, e.g. to honor NEED_RESCHED, when zapping
rmaps in the shadow MMU
- Refactor pieces of the shadow MMU related to aging SPTEs in
preparation for adding multi-generation LRU support in KVM
- Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is
enabled, i.e. when the CPU has already flushed the RSB
- Trace the per-CPU host save area as a VMCB pointer to improve
readability and cleanup the retrieval of the SEV-ES host save area
- Remove unnecessary accounting of temporary nested VMCB related
allocations
- Set FINAL/PAGE in the page fault error code for EPT violations if
and only if the GVA is valid. If the GVA is NOT valid, there is no
guest-side page table walk and so stuffing paging related metadata
is nonsensical
- Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
instead of emulating posted interrupt delivery to L2
- Add a lockdep assertion to detect unsafe accesses of vmcs12
structures
- Harden eVMCS loading against an impossible NULL pointer deref
(really truly should be impossible)
- Minor SGX fix and a cleanup
- Misc cleanups
Generic:
- Register KVM's cpuhp and syscore callbacks when enabling
virtualization in hardware, as the sole purpose of said callbacks
is to disable and re-enable virtualization as needed
- Enable virtualization when KVM is loaded, not right before the
first VM is created
Together with the previous change, this greatly simplifies the logic
of the callbacks, because their very existence implies
virtualization is enabled
- Fix a bug that results in KVM prematurely exiting to userspace for
coalesced MMIO/PIO in many cases, clean up the related code, and
add a testcase
- Fix a bug in kvm_clear_guest() where it would trigger a buffer
overflow _if_ the gpa+len crosses a page boundary, which thankfully
is guaranteed to not happen in the current code base. Add WARNs in
more helpers that read/write guest memory to detect similar bugs
Selftests:
- Fix a goof that caused some Hyper-V tests to be skipped when run on
bare metal, i.e. NOT in a VM
- Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES
guest
- Explicitly include one-off assets in .gitignore. Past Sean was
completely wrong about not being able to detect missing .gitignore
entries
- Verify userspace single-stepping works when KVM happens to handle a
VM-Exit in its fastpath
- Misc cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
Documentation: KVM: fix warning in "make htmldocs"
s390: Enable KVM_S390_UCONTROL config in debug_defconfig
selftests: kvm: s390: Add VM run test case
KVM: SVM: let alternatives handle the cases when RSB filling is required
KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
KVM: x86: Update retry protection fields when forcing retry on emulation failure
KVM: x86: Apply retry protection to "unprotect on failure" path
KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
...
Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
more code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature: it adds new fields to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
particularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leakage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalization and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature.
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have noticed
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
single-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operation is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making it cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memory.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
Register KVM's cpuhp and syscore callbacks when enabling virtualization in
hardware, as the sole purpose of said callbacks is to disable and re-enable
virtualization as needed.
The primary motivation for this series is to simplify dealing with enabling
virtualization for Intel's TDX, which needs to enable virtualization
when kvm-intel.ko is loaded, i.e. long before the first VM is created.
That said, this is a nice cleanup on its own. By registering the callbacks
on-demand, the callbacks themselves don't need to check kvm_usage_count,
because their very existence implies a non-zero count.
Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a
lock ordering issue between cpus_read_lock() and kvm_lock. The lock
ordering issue still exists in very rare cases, and will be fixed for
good by switching vm_list to an (S)RCU-protected list.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"These are the non-x86 changes (mostly ARM, as is usually the case).
The generic and x86 changes will come later"
ARM:
- New Stage-2 page table dumper, reusing the main ptdump
infrastructure
- FP8 support
- Nested virtualization now supports the address translation
(FEAT_ATS1A) family of instructions
- Add selftest checks for a bunch of timer emulation corner cases
- Fix multiple cases where KVM/arm64 doesn't correctly handle the
guest trying to use a GICv3 that wasn't advertised
- Remove REG_HIDDEN_USER from the sysreg infrastructure, making
things a little simpler
- Prevent MTE tags being restored by userspace if we are actively
logging writes, as that's a recipe for disaster
- Correct the refcount on a page that is not considered for MTE tag
copying (such as a device)
- When walking a page table to split block mappings, synchronize only
at the end the walk rather than on every store
- Fix boundary check when transferring memory using FFA
- Fix pKVM TLB invalidation, only affecting currently out of tree
code but worth addressing for peace of mind
LoongArch:
- Revert qspinlock to test-and-set simple lock on VM.
- Add Loongson Binary Translation extension support.
- Add PMU support for guest.
- Enable paravirt feature control from VMM.
- Implement function kvm_para_has_feature().
RISC-V:
- Fix sbiret init before forwarding to userspace
- Don't zero-out PMU snapshot area before freeing data
- Allow legacy PMU access from guest
- Fix to allow hpmcounter31 from the guest"
* tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (64 commits)
LoongArch: KVM: Implement function kvm_para_has_feature()
LoongArch: KVM: Enable paravirt feature control from VMM
LoongArch: KVM: Add PMU support for guest
KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifier
KVM: arm64: Simplify visibility handling of AArch32 SPSR_*
KVM: arm64: Simplify handling of CNTKCTL_EL12
LoongArch: KVM: Add vm migration support for LBT registers
LoongArch: KVM: Add Binary Translation extension support
LoongArch: KVM: Add VM feature detection function
LoongArch: Revert qspinlock to test-and-set simple lock on VM
KVM: arm64: Register ptdump with debugfs on guest creation
arm64: ptdump: Don't override the level when operating on the stage-2 tables
arm64: ptdump: Use the ptdump description from a local context
arm64: ptdump: Expose the attribute parsing functionality
KVM: arm64: Add memory length checks and remove inline in do_ffa_mem_xfer
KVM: arm64: Move pagetable definitions to common header
KVM: arm64: nv: Add support for FEAT_ATS1A
KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3
KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration
...
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The highlights are support for Arm's "Permission Overlay Extension"
using memory protection keys, support for running as a protected guest
on Android as well as perf support for a bunch of new interconnect
PMUs.
Summary:
ACPI:
- Enable PMCG erratum workaround for HiSilicon HIP10 and 11
platforms.
- Ensure arm64-specific IORT header is covered by MAINTAINERS.
CPU Errata:
- Enable workaround for hardware access/dirty issue on Ampere-1A
cores.
Memory management:
- Define PHYSMEM_END to fix a crash in the amdgpu driver.
- Avoid tripping over invalid kernel mappings on the kexec() path.
- Userspace support for the Permission Overlay Extension (POE) using
protection keys.
Perf and PMUs:
- Add support for the "fixed instruction counter" extension in the
CPU PMU architecture.
- Extend and fix the event encodings for Apple's M1 CPU PMU.
- Allow LSM hooks to decide on SPE permissions for physical
profiling.
- Add support for the CMN S3 and NI-700 PMUs.
Confidential Computing:
- Add support for booting an arm64 kernel as a protected guest under
Android's "Protected KVM" (pKVM) hypervisor.
Selftests:
- Fix vector length issues in the SVE/SME sigreturn tests
- Fix build warning in the ptrace tests.
Timers:
- Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
non-determinism arising from the architected counter.
Miscellaneous:
- Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
don't succeed.
- Minor fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
perf: arm-ni: Fix an NULL vs IS_ERR() bug
arm64: hibernate: Fix warning for cast from restricted gfp_t
arm64: esr: Define ESR_ELx_EC_* constants as UL
arm64: pkeys: remove redundant WARN
perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
MAINTAINERS: List Arm interconnect PMUs as supported
perf: Add driver for Arm NI-700 interconnect PMU
dt-bindings/perf: Add Arm NI-700 PMU
perf/arm-cmn: Improve format attr printing
perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
arm64/mm: use lm_alias() with addresses passed to memblock_free()
mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
arm64: Expose the end of the linear map in PHYSMEM_END
arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
arm64/mm: Delete __init region from memblock.reserved
perf/arm-cmn: Support CMN S3
dt-bindings: perf: arm-cmn: Add CMN S3
perf/arm-cmn: Refactor DTC PMU register access
perf/arm-cmn: Make cycle counts less surprising
perf/arm-cmn: Improve build-time assertion
...
* for-next/poe: (31 commits)
arm64: pkeys: remove redundant WARN
kselftest/arm64: Add test case for POR_EL0 signal frame records
kselftest/arm64: parse POE_MAGIC in a signal frame
kselftest/arm64: add HWCAP test for FEAT_S1POE
selftests: mm: make protection_keys test work on arm64
selftests: mm: move fpregs printing
kselftest/arm64: move get_header()
arm64: add Permission Overlay Extension Kconfig
arm64: enable PKEY support for CPUs with S1POE
arm64: enable POE and PIE to coexist
arm64/ptrace: add support for FEAT_POE
arm64: add POE signal support
arm64: implement PKEYS support
arm64: add pte_access_permitted_no_overlay()
arm64: handle PKEY/POE faults
arm64: mask out POIndex when modifying a PTE
arm64: convert protection key into vm_flags and pgprot values
arm64: add POIndex defines
arm64: re-order MTE VM_ flags
arm64: enable the Permission Overlay Extension for EL0
...
* kvm-arm64/visibility-cleanups:
: .
: Remove REG_HIDDEN_USER from the sysreg infrastructure, making things
: a little more simple. From the cover letter:
:
: "Since 4d4f52052b ("KVM: arm64: nv: Drop EL12 register traps that are
: redirected to VNCR") and the admission that KVM would never be supporting
: the original FEAT_NV, REG_HIDDEN_USER only had a few users, all of which
: could either be replaced by a more ad-hoc mechanism, or removed altogether."
: .
KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifier
KVM: arm64: Simplify visibility handling of AArch32 SPSR_*
KVM: arm64: Simplify handling of CNTKCTL_EL12
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/s2-ptdump:
: .
: Stage-2 page table dumper, reusing the main ptdump infrastructure,
: courtesy of Sebastian Ene. From the cover letter:
:
: "This series extends the ptdump support to allow dumping the guest
: stage-2 pagetables. When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump
: registers the following new files under debugfs:
: - /sys/debug/kvm/<guest_id>/stage2_page_tables
: - /sys/debug/kvm/<guest_id>/stage2_levels
: - /sys/debug/kvm/<guest_id>/ipa_range
:
: This allows userspace tools (eg. cat) to dump the stage-2 pagetables by
: reading the 'stage2_page_tables' file.
: [...]"
: .
KVM: arm64: Register ptdump with debugfs on guest creation
arm64: ptdump: Don't override the level when operating on the stage-2 tables
arm64: ptdump: Use the ptdump description from a local context
arm64: ptdump: Expose the attribute parsing functionality
KVM: arm64: Move pagetable definitions to common header
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/nv-at-pan:
: .
: Add NV support for the AT family of instructions, which mostly results
: in adding a page table walker that deals with most of the complexity
: of the architecture.
:
: From the cover letter:
:
: "Another task that a hypervisor supporting NV on arm64 has to deal with
: is to emulate the AT instruction, because we multiplex all the S1
: translations on a single set of registers, and the guest S2 is never
: truly resident on the CPU.
:
: So given that we lie about page tables, we also have to lie about
: translation instructions, hence the emulation. Things are made
: complicated by the fact that guest S1 page tables can be swapped out,
: and that our shadow S2 is likely to be incomplete. So while using AT
: to emulate AT is tempting (and useful), it is not going to always
: work, and we thus need a fallback in the shape of a SW S1 walker."
: .
KVM: arm64: nv: Add support for FEAT_ATS1A
KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3
KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration
KVM: arm64: nv: Add SW walker for AT S1 emulation
KVM: arm64: nv: Make ps_to_output_size() generally available
KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}
KVM: arm64: nv: Add basic emulation of AT S1E1{R,W}P
KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}
KVM: arm64: nv: Honor absence of FEAT_PAN2
KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor
KVM: arm64: nv: Enforce S2 alignment when contiguous bit is set
arm64: Add ESR_ELx_FSC_ADDRSZ_L() helper
arm64: Add system register encoding for PSTATE.PAN
arm64: Add PAR_EL1 field description
arm64: Add missing APTable and TCR_ELx.HPD masks
KVM: arm64: Make kvm_at() take an OP_AT_*
Signed-off-by: Marc Zyngier <maz@kernel.org>
# Conflicts:
# arch/arm64/kvm/nested.c
* kvm-arm64/vgic-sre-traps:
: .
: Fix the multiple cases where KVM/arm64 doesn't correctly
: handle the guest trying to use a GICv3 that isn't advertised.
:
: From the cover letter:
:
: "It recently appeared that, when running on a GICv3-equipped platform
: (which is what non-ancient arm64 HW has), *not* configuring a GICv3
: for the guest could result in less than desirable outcomes.
:
: We have multiple issues to fix:
:
: - for registers that *always* trap (the SGI registers) or that *may*
: trap (the SRE register), we need to check whether a GICv3 has been
: instantiated before acting upon the trap.
:
: - for registers that only conditionally trap, we must actively trap
: them even in the absence of a GICv3 being instantiated, and handle
: those traps accordingly.
:
: - finally, ID registers must reflect the absence of a GICv3, so that
: we are consistent.
:
: This series goes through all these requirements. The main complexity
: here is to apply a GICv3 configuration on the host in the absence of a
: GICv3 in the guest. This is pretty hackish, but I don't have a much
: better solution so far.
:
: As part of making wider use of the trap bits, we fully define the
: trap routing as per the architecture, something that we eventually
: need for NV anyway."
: .
KVM: arm64: selftests: Cope with lack of GICv3 in set_id_regs
KVM: arm64: Add selftest checking how the absence of GICv3 is handled
KVM: arm64: Unify UNDEF injection helpers
KVM: arm64: Make most GICv3 accesses UNDEF if they trap
KVM: arm64: Honor guest requested traps in GICv3 emulation
KVM: arm64: Add trap routing information for ICH_HCR_EL2
KVM: arm64: Add ICH_HCR_EL2 to the vcpu state
KVM: arm64: Zero ID_AA64PFR0_EL1.GIC when no GICv3 is presented to the guest
KVM: arm64: Add helper for last ditch idreg adjustments
KVM: arm64: Force GICv3 trap activation when no irqchip is configured on VHE
KVM: arm64: Force SRE traps when SRE access is not enabled
KVM: arm64: Move GICv3 trap configuration to kvm_calculate_traps()
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/fpmr:
: .
: Add FP8 support to the KVM/arm64 floating point handling.
:
: This includes new ID registers (ID_AA64PFR2_EL1 and ID_AA64FPFR0_EL1)
: being made visible to guests, as well as a new control register
: (FPMR) which gets context-switched.
: .
KVM: arm64: Expose ID_AA64PFR2_EL1 to userspace and guests
KVM: arm64: Enable FP8 support when available and configured
KVM: arm64: Expose ID_AA64FPFR0_EL1 as a writable ID reg
KVM: arm64: Honor trap routing for FPMR
KVM: arm64: Add save/restore support for FPMR
KVM: arm64: Move FPMR into the sysreg array
KVM: arm64: Add predicate for FPMR support in a VM
KVM: arm64: Move SVCR into the sysreg array
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/mmu-misc-6.12:
: .
: Various minor MMU improvements and bug-fixes:
:
: - Prevent MTE tags being restored by userspace if we are actively
: logging writes, as that's a recipe for disaster
:
: - Correct the refcount on a page that is not considered for MTE
: tag copying (such as a device)
:
: - When walking a page table to split blocks, keep the DSB at the end of
: the walk, as there is no need to perform it on every store.
:
: - Fix boundary check when transferring memory using FFA
: .
KVM: arm64: Add memory length checks and remove inline in do_ffa_mem_xfer
KVM: arm64: Disallow copying MTE to guest memory while KVM is dirty logging
KVM: arm64: Release pfn, i.e. put page, if copying MTE tags hits ZONE_DEVICE
KVM: arm64: Move data barrier to end of split walk
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that REG_HIDDEN_USER has no direct user anymore, remove it
entirely and update all users of sysreg_hidden_user() to call
sysreg_hidden() instead.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240904082419.1982402-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since SPSR_* are not associated with any register in the sysreg array,
nor do they have .get_user()/.set_user() helpers, they are invisible to
userspace with that encoding.
Therefore hidden_user_visibility() serves no purpose here, and can be
safely removed.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240904082419.1982402-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We go through a great deal of effort to map CNTKCTL_EL12 to CNTKCTL_EL1
while hiding this mapping from userspace via a special visibility helper.
However, it would be far simpler to just provide an accessor doing the
mapping job, removing the need for a visibility helper.
With that done, we can also remove the EL12_REG() macro which serves
no purpose.
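Such an accessor could look roughly like this (a sketch; the function
name is illustrative, not the actual patch):
static bool access_cntkctl_el12(struct kvm_vcpu *vcpu,
                                struct sys_reg_params *p,
                                const struct sys_reg_desc *r)
{
        /* Redirect the EL12 access to the EL1 register storage */
        if (p->is_write)
                __vcpu_sys_reg(vcpu, CNTKCTL_EL1) = p->regval;
        else
                p->regval = __vcpu_sys_reg(vcpu, CNTKCTL_EL1);
        return true;
}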
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240904082419.1982402-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
While arch/*/mm/ptdump handles the kernel pagetable dumping code,
introduce KVM/ptdump to show the guest stage-2 pagetables. The
separation is necessary because most of the definitions from the
stage-2 pagetable reside in the KVM path and we will be invoking
functionality specific to KVM. Introduce the PTDUMP_STAGE2_DEBUGFS config.
When a guest is created, register a new file entry under the guest
debugfs dir which allows userspace to show the contents of the guest
stage-2 pagetables when accessed.
[maz: moved function prototypes from kvm_host.h to kvm_mmu.h]
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20240909124721.1672199-6-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
When we share memory through FF-A and the description of the buffers
exceeds the size of the mapped buffer, the fragmentation API is used.
The fragmentation API allows specifying chunks of descriptors in subsequent
FF-A fragment calls and no upper limit has been established for this.
The entire memory region transferred is identified by a handle which can be
used to reclaim the transferred memory.
To be able to reclaim the memory, the description of the buffers has to fit
in the ffa_desc_buf.
Add a bounds check on the FF-A sharing path to prevent the memory reclaim
from failing.
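The check is of this shape (a sketch; naming follows the ffa_desc_buf
mentioned above, and the exact error value is an assumption):
/* Reject transfers whose description cannot fit the reclaim buffer */
if (len > ffa_desc_buf.len) {
        ret = FFA_RET_NO_MEMORY;
        goto out_unlock;
}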
Also do_ffa_mem_xfer() does not need __always_inline, except for the
BUILD_BUG_ON() aspect, which gets moved to a macro.
[maz: fixed the BUILD_BUG_ON() breakage with LLVM, thanks to Wei-Lin Chang
for the timely report]
Fixes: 634d90cf0a ("KVM: arm64: Handle FFA_MEM_LEND calls from the host")
Cc: stable@vger.kernel.org
Reviewed-by: Sebastian Ene <sebastianene@google.com>
Signed-off-by: Snehal Koukuntla <snehalreddy@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240909180154.3267939-1-snehalreddy@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Rename the per-CPU hooks used to enable virtualization in hardware to
align with the KVM-wide helpers in kvm_main.c, and to better capture that
the callbacks are invoked on every online CPU.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the missing sanitisation of ID_AA64MMFR3_EL1, making sure we
solely expose S1POE and TCRX (we currently don't support anything
else).
[joey: Took Marc's patch for S1PIE, and changed it for S1POE]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-11-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
FEAT_ATS1A introduces a new instruction: `at s1e1a`.
This is an address translation, without permission checks.
POE allows read permissions to be removed from S1 by the guest. This means
that an `at` instruction could fail, and not get the IPA.
Switch to using `at s1e1a` so that KVM can get the IPA regardless of S1
permissions.
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240822151113.1479789-10-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Define the new system registers that POE introduces and context switch them.
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240822151113.1479789-8-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Besides the obvious (and desired) difference between krealloc() and
kvrealloc(), there is some inconsistency in their function signatures and
behavior:
- krealloc() frees the memory when the requested size is zero, whereas
kvrealloc() simply returns a pointer to the existing allocation.
- krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas
kvrealloc() does not accept a NULL pointer at all and, if passed,
would fault instead.
- krealloc() is self-contained, whereas kvrealloc() relies on the caller
to provide the size of the previous allocation.
Inconsistent behavior throughout allocation APIs is error prone, hence
make kvrealloc() behave like krealloc(), which seems superior in all
mentioned aspects.
Besides that, implementing kvrealloc() by making use of krealloc() and
vrealloc() provides opportunities to grow (and shrink) allocations more
efficiently. For instance, vrealloc() can be optimized to allocate and
map additional pages to grow the allocation or unmap and free unused pages
to shrink the allocation.
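To illustrate the unified, krealloc()-like semantics (a sketch of
caller-side usage under the new API):
void *buf = NULL, *tmp;
buf = kvrealloc(buf, 64, GFP_KERNEL);   /* NULL input: acts like kvmalloc(64) */
tmp = kvrealloc(buf, 4096, GFP_KERNEL); /* grow: no old-size argument needed */
if (!tmp)
        kvfree(buf);                    /* on failure the old buffer survives */
else
        buf = tmp;
buf = kvrealloc(buf, 0, GFP_KERNEL);    /* zero size: frees the allocation */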
[dakr@kernel.org: document concurrency restrictions]
Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org
[dakr@kernel.org: disable KASAN when switching to vmalloc]
Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org
[dakr@kernel.org: properly document __GFP_ZERO behavior]
Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org
Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Handling FEAT_ATS1A (which provides the AT S1E{1,2}A instructions)
is pretty easy, as it is just the usual AT without the permission
check.
This basically amounts to plumbing the instructions in the various
dispatch tables, and handling FEAT_ATS1A being disabled in the
ID registers.
Signed-off-by: Marc Zyngier <maz@kernel.org>
FEAT_PAN3 added a check for executable permissions to FEAT_PAN2.
Add the required SCTLR_ELx.EPAN and descriptor checks to handle
this correctly.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to plug the brokenness of our current AT implementation,
we need a SW walker that is going to... err.. walk the S1 tables
and tell us what it finds.
Of course, it builds on top of our S2 walker, and shares similar
concepts. The beauty of it is that since it uses kvm_read_guest(),
it is able to bring back pages that have been otherwise evicted.
This is then plugged in the two AT S1 emulation functions as
a "slow path" fallback. I'm not sure it is that slow, but hey.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the
combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk.
However, there is a great deal of complexity coming from combining
the S1 and S2 attributes to report something consistent in PAR_EL1.
This is an absolute mine field, and I have a splitting headache.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Similar to our AT S1E{0,1} emulation, we implement the AT S1E2
handling.
This emulation of course suffers from the same problems, but is
somehow simpler due to the lack of PAN2 and the fact that we are
guaranteed to execute it from the correct context.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Building on top of our primitive AT S1E{0,1}{R,W} emulation,
add minimal support for the FEAT_PAN2 instructions, momentarily
context-switching PSTATE.PAN so that it takes effect in the
context of the guest.
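A sketch of the shape this takes (illustrative, not the actual patch;
guest_pan stands in for the guest's PSTATE.PAN value):
u64 par;
if (guest_pan)
        asm(SET_PSTATE_PAN(1));         /* adopt the guest's PAN=1 */
kvm_at(OP_AT_S1E1RP, vaddr);            /* translation honours PAN */
isb();                                  /* make PAR_EL1 observable */
par = read_sysreg_par();
if (guest_pan)
        asm(SET_PSTATE_PAN(0));         /* back to the host's PAN=0 */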
Signed-off-by: Marc Zyngier <maz@kernel.org>
Emulating AT instructions is one of the tasks devolved to the host
hypervisor when NV is on.
Here, we take the basic approach of emulating AT S1E{0,1}{R,W}
using the AT instructions themselves. While this mostly works,
it doesn't *always* work:
- S1 page tables can be swapped out
- shadow S2 can be incomplete and not contain mappings for
the S1 page tables
We are not trying to handle these cases here, and defer them to
a later patch. Suitable comments indicate where we are in dire
need of better handling.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
If our guest has been configured without PAN2, make sure that
AT S1E1{R,W}P will generate an UNDEF.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
The upper_attr attribute has been badly named, as it most of the
time carries the full "last walked descriptor".
Rename it to "desc" and make ti contain the full 64bit descriptor.
This will be used by the S1 PTW.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Despite KVM not using the contiguous bit for anything related to
TLBs, the spec does require that the alignment defined by the
contiguous bit for the page size and the level is enforced.
Add the required checks to offset the point where PA and VA merge.
Fixes: 61e30b9eef ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Reported-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
To allow using newer instructions that current assemblers don't know about,
replace the `at` instruction with the underlying SYS instruction.
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
We currently have two helpers (undef_access() and trap_undef()) that
do exactly the same thing: inject an UNDEF and return 'false' (as an
indication that PC should not be incremented).
We definitely could do with one less. Given that undef_access() is
used 80ish times, while trap_undef() is only used 30 times, the
latter loses the battle and is immediately sacrificed.
We also have a large number of instances where undef_access() is
open-coded. Let's also convert those.
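For reference, the surviving helper is essentially this (a sketch
matching the description above):
static bool undef_access(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
                         const struct sys_reg_desc *r)
{
        kvm_inject_undefined(vcpu);
        /* 'false' tells the caller not to skip the faulting instruction */
        return false;
}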
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We don't expect to trap any GICv3 register for host handling,
apart from ICC_SRE_EL1 and the SGI registers. If they trap,
that's because the guest is playing with us despite being
told it doesn't have a GICv3.
If it does, UNDEF is what it will get.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
On platforms that require emulation of the CPU interface, we still
need to honor the traps requested by the guest (ICH_HCR_EL2 as well
as the FGTs for ICC_IGRPEN{0,1}_EL1).
Check for these bits early and bail out if any trap applies.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The usual song and dance: for each trap bit, handle every register
it traps. Note that we don't handle the registers added by
FEAT_NMI for now.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
As we are about to describe the trap routing for ICH_HCR_EL2, add
the register to the vcpu state in its VNCR form, as well as its reset value.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to be consistent, we shouldn't advertise a GICv3 when none
is actually usable by the guest.
Wipe the feature when these conditions apply, and allow the field
to be written from userspace.
This now allows us to rewrite the kvm_has_gicv3() helper in terms
of kvm_has_feat(), given that it is always evaluated at runtime.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We already have to perform a set of last-chance adjustments for
NV purposes. We will soon have to do the same for the GIC, so
introduce a helper for that exact purpose.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
On a VHE system, no GICv3 traps get configured when no irqchip is
present. This is not quite matching the "no GICv3" semantics that
we want to present.
Force such traps to be configured in this case.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We so far only write the ICH_HCR_EL2 config in two situations:
- when we need to emulate the GICv3 CPU interface due to HW bugs
- when we do direct injection, as the virtual CPU interface needs
to be enabled
This is all good. But it also means that we don't do anything special
when we emulate a GICv2, or that there is no GIC at all.
What happens in this case when the guest uses the GICv3 system
registers? The *guest* gets a trap for a sysreg access (EC=0x18)
while we'd really like it to get an UNDEF.
Fixing this is a bit involved:
- we need to set all the required trap bits (TC, TALL0, TALL1, TDIR)
- for these traps to take effect, we need to (counter-intuitively)
set ICC_SRE_EL1.SRE to 1 so that the above traps take priority.
Note that this doesn't fully work when GICv2 emulation is enabled, as
we cannot set ICC_SRE_EL1.SRE to 1 (it breaks Group0 delivery as
an IRQ).
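As a rough sketch of the resulting configuration (illustrative; the
real code lives in the vgic world-switch paths):
/* Trap all GICv3 sysreg accesses: common, Group0, Group1 and DIR */
u64 traps = ICH_HCR_TC | ICH_HCR_TALL0 | ICH_HCR_TALL1 | ICH_HCR_TDIR;
write_gicreg(traps, ICH_HCR_EL2);
/* Counter-intuitively, SRE=1 so the ICH_HCR_EL2 traps take priority */
write_gicreg(ICC_SRE_EL1_SRE, ICC_SRE_EL1);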
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Follow the pattern introduced with vcpu_set_hcr(), and introduce
vcpu_set_ich_hcr(), which configures the GICv3 traps at the same
point.
This will allow future changes to introduce trap configuration on
a per-VM basis.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/tlbi-fixes-6.12:
: .
: A couple of TLB invalidation fixes, only affecting pKVM
: out of tree, courtesy of Will Deacon.
: .
KVM: arm64: Ensure TLBI uses correct VMID after changing context
KVM: arm64: Invalidate EL1&0 TLB entries for all VMIDs in nvhe hyp init
Signed-off-by: Marc Zyngier <maz@kernel.org>
Everything is now in place for a guest to "enjoy" FP8 support.
Expose ID_AA64PFR2_EL1 to both userspace and guests, with the
explicit restriction of only being able to clear FPMR.
All other features (MTE* at the time of writing) are hidden
and not writable.
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
If userspace has enabled FP8 support (by setting ID_AA64PFR2_EL1.FPMR
to 1), let's enable the feature by setting HCRX_EL2.EnFPM for the vcpu.
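A minimal sketch of that configuration step (assuming the
kvm_has_fpmr() predicate added earlier in this series):
/* FPMR is advertised to the guest: clear the HCRX_EL2 trap */
if (kvm_has_fpmr(kvm))
        vcpu->arch.hcrx_el2 |= HCRX_EL2_EnFPM;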
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
ID_AA64FPFR0_EL1 contains all sorts of bits describing which FP8
subfeatures are implemented.
We don't really care about them, so let's just expose that register
and allow userspace to disable subfeatures at will.
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
HCRX_EL2.EnFPM controls the trapping of FPMR (as well as the validity
of any FP8 instruction, but we don't really care about this last part).
Describe the trap bit so that the exception can be reinjected in a
NV guest.
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Just like the rest of the FP/SIMD state, FPMR needs to be context
switched.
The only interesting thing here is that we need to treat the pKVM
part a bit differently, as the host FP state is never written back
to the vcpu thread, but instead stored locally and eagerly restored.
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Just like SVCR, FPMR is currently stored at the wrong location.
Let's move it where it belongs.
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
SVCR is just a system register, and has no purpose being outside
of the sysreg array. If anything, it only makes it more difficult
to eventually support SME one day. If ever.
Move it into the array with its little friends, and associate it
with a visibility predicate.
Although this is dead code, it at least paves the way for the
next set of FP-related extensions.
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240820131802.3547589-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Allow userspace to change the guest-visible value of the register, with
different handling depending on the field:
- Since RAS and MPAM are not writable in the ID_AA64PFR0_EL1
register, RAS_frac and MPAM_frac are not writable in the
ID_AA64PFR1_EL1 register either.
- MTE is controlled by a separate UAPI (KVM_CAP_ARM_MTE) with an
internal flag (KVM_ARCH_FLAG_MTE_ENABLED), so it's not writable.
- Fields which KVM doesn't know how to handle are not exposed to
the guest (being disabled in the register read accessor), so their
value will always read as 0. These fields don't have a well-defined
behaviour yet, so don't advertise them to userspace either; they
remain non-writable. They include SME, RNDR_trap, NMI, GCS, THE,
DF2, PFAR, MTE_frac and MTEX.
- BT, SSBS and CSV2_frac don't introduce any new registers that KVM
doesn't know how to handle; they can be written without ill effect,
so make them writable.
Note that KVM doesn't cross-check CSV2_frac against CSV2 even though
the former depends on the latter; keeping them consistent is the
VMM's job, not KVM's.
Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20240723072004.1470688-4-shahuang@redhat.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Currently KVM uses cpus_have_final_cap() to check if FEAT_SSBS is
advertised to the guest. But if FEAT_SSBS is writable and isn't
advertised to the guest, this is wrong.
Update it to use kvm_has_feat() to check if FEAT_SSBS is advertised
to the guest, so that KVM does the right thing when FEAT_SSBS isn't
advertised to the guest.
Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20240723072004.1470688-3-shahuang@redhat.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
For some of the fields in the ID_AA64PFR1_EL1 register, KVM doesn't know
how to handle them right now. So explicitly disable them in the register
read accessor; those fields will then read as 0 even if the hardware
value is 1. This is safe because, from a UAPI point of view,
read_sanitised_ftr_reg() doesn't yet return a nonzero value for any of
those fields.
This will benefit migration in case the host and VM have different
values when restoring a VM.
Those fields include RNDR_trap, NMI, MTE_frac, GCS, THE, MTEX, DF2, PFAR.
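A sketch of what the read accessor does (illustrative, showing only two
of the fields):
u64 val = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
/* Hide the fields KVM doesn't know how to handle yet */
val &= ~ID_AA64PFR1_EL1_NMI_MASK;
val &= ~ID_AA64PFR1_EL1_GCS_MASK;
/* ...and likewise for RNDR_trap, MTE_frac, THE, MTEX, DF2, PFAR */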
Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20240723072004.1470688-2-shahuang@redhat.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
KVM exposes the OS double lock feature bit to guests but returns
RAZ/WI on guest OSDLR_EL1 access. This breaks guest migration between
systems where this feature differs. Add support to make this feature
writable from userspace by setting the mask bit. While at it, set the
mask bits for the exposed WRPs (number of watchpoints) as well.
Also update the selftest to cover these fields.
However, we still can't make the BRPs and CTX_CMPs fields writable,
because as per ARM ARM DDI 0487K.a, section D2.8.3 "Breakpoint types
and linking of breakpoints", the highest numbered breakpoints (BRPs)
must be context aware breakpoints (CTX_CMPs). KVM does not trap +
emulate the breakpoint registers, and as such cannot support a layout
that misaligns with the underlying hardware.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240816132819.34316-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
On a system with a GICv3, if a guest hasn't been configured with
GICv3 and the host is not capable of GICv2 emulation,
a write to any of the ICC_*SGI*_EL1 registers is trapped to EL2.
We therefore try to emulate the SGI access, only to hit a NULL
pointer as no private interrupt is allocated (no GIC, remember?).
The obvious fix is to give the guest what it deserves, in the
shape of a UNDEF exception.
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240820100349.3544850-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Zenghui reports that VMs backed by hugetlb pages are no longer booting
after commit fd276e71d1 ("KVM: arm64: nv: Handle shadow stage 2 page
faults").
Support for shadow stage-2 MMUs introduced the concept of a fault IPA
and canonical IPA to stage-2 fault handling. These are identical in the
non-nested case, as the hardware stage-2 context is always that of the
canonical IPA space.
Both addresses need to be hugepage-aligned when preparing to install a
hugepage mapping to ensure that KVM uses the correct GFN->PFN translation
and installs that at the correct IPA for the current stage-2.
And now I'm feeling thirsty after all this talk of IPAs...
Fixes: fd276e71d1 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240822071710.2291690-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We recently moved the teardown of the vgic part of a vcpu inside
a critical section guarded by the config_lock. This teardown phase
involves calling into kvm_io_bus_unregister_dev(), which takes the
kvm->srcu lock.
However, this violates the established order where kvm->srcu is
taken on a memory fault (such as an MMIO access), possibly
followed by taking the config_lock if the GIC emulation requires
mutual exclusion from the other vcpus.
It therefore results in a bad lockdep splat, as reported by Zenghui.
Fix this by moving the call to kvm_io_bus_unregister_dev() outside
of the config_lock critical section. At this stage, there shouldn't
be any need to hold the config_lock.
As an additional bonus, document the ordering between kvm->slots_lock,
kvm->srcu and kvm->arch.config_lock so that I cannot pretend I didn't
know about those anymore.
Fixes: 9eb18136af ("KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface")
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20240819125045.3474845-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
If there were LPIs being mapped behind our back (i.e., between .start() and
.stop()), we would put them at iter_unmark_lpis() without checking if they
were actually *marked*, which is obviously not good.
Switch to use the xa_for_each_marked() iterator to fix it.
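The fix boils down to an iteration of this shape (a sketch;
LPI_XA_MARK_DEBUG_ITER is the mark introduced by the commit being
fixed):
struct vgic_irq *irq;
unsigned long intid;
/* Only visit (and unmark) the LPIs the iterator actually marked */
xa_for_each_marked(&dist->lpi_xa, intid, irq, LPI_XA_MARK_DEBUG_ITER)
        xa_clear_mark(&dist->lpi_xa, intid, LPI_XA_MARK_DEBUG_ITER);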
Cc: stable@vger.kernel.org
Fixes: 85d3ccc8b7 ("KVM: arm64: vgic-debug: Use an xarray mark for debug iterator")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240817101541.1664-1-yuzenghui@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Armv9.4/8.9 PMU adds optional support for a fixed instruction counter
similar to the fixed cycle counter. Support for the feature is indicated
in the ID_AA64DFR1_EL1 register PMICNTR field. The counter is not
accessible in AArch32.
Existing userspace using direct counter access won't know how to handle
the fixed instruction counter, so we have to avoid using the counter
when user access is requested.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-7-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
There are 2 defines for the number of PMU counters:
ARMV8_PMU_MAX_COUNTERS and ARMPMU_MAX_HWEVENTS. Both are the same
currently, but Armv9.4/8.9 increases the number of possible counters
from 32 to 33. With this change, the maximum number of counters will
differ for KVM's PMU emulation which is PMUv3.4. Give KVM PMU emulation
its own define to decouple it from the rest of the kernel's number of PMU
counters.
The VHE PMU code needs to match the PMU driver, so switch it to use
ARMPMU_MAX_HWEVENTS instead.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-6-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
The PMUv3 and KVM code each have a define for the PMU cycle counter
index. Move KVM's define to a shared location and use it for the PMUv3
driver.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-5-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
ARMV8_PMU_COUNTER_MASK is really a mask for the PMSELR_EL0.SEL register
field. Make that clear by adding a standard sysreg definition for the
register, and using it instead.
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-4-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Commit df29ddf4f0 ("arm64: perf: Abstract system register accesses
away") split off PMU register accessor functions to a standalone header.
Let's use it for KVM PMU code and get rid of one copy of the ugly switch
macro.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-3-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Various PMUv3 registers which are a mask of counters are 64-bit
registers, but the accessor functions take a u32. This has been fine as
the upper 32-bits have been RES0 as there has been a maximum of 32
counters prior to Armv9.4/8.9. With Armv9.4/8.9, a 33rd counter is
added. Update the accessor functions to use a u64 instead.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-2-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Xscale and Armv6 PMUs defined the cycle counter at 0 and event counters
starting at 1 and had 1:1 event index to counter numbering. On Armv7 and
later, this changed the cycle counter to 31 and event counters start at
0. The drivers for Armv7 and PMUv3 kept the old event index numbering
and introduced an event index to counter conversion. The conversion uses
masking to convert from event index to a counter number. This operation
relies on having at most 32 counters so that the cycle counter index 0
can be transformed to counter number 31.
Armv9.4 adds support for an additional fixed function counter
(instructions) which increases possible counters to more than 32, and
the conversion won't work anymore as a simple subtract and mask. The
primary reason for the translation (other than history) seems to be to
have a contiguous mask of counters 0-N. Keeping that would result in
more complicated index to counter conversions. Instead, store a mask of
available counters rather than just the number of events. That provides more
information in addition to the number of events.
No (intended) functional changes.
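As an illustration of the new scheme (a sketch; the cntr_mask field
name and the per-counter helper are assumptions, not the exact code):
int idx;
/* Visit every implemented counter, fixed-function ones included */
for_each_set_bit(idx, cpu_pmu->cntr_mask, ARMPMU_MAX_HWEVENTS)
        armv8pmu_disable_counter(idx);  /* illustrative per-counter op */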
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240731-arm-pmu-3-9-icntr-v3-1-280a8d7ff465@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
When the target context passed to enter_vmid_context() matches the
current running context, the function returns early without manipulating
the registers of the stage-2 MMU. This can result in a stale VMID due to
the lack of an ISB instruction in exit_vmid_context() after writing the
VTTBR when ARM64_WORKAROUND_SPECULATIVE_AT is not enabled.
For example, with pKVM enabled:
// Initially running in host context
enter_vmid_context(guest);
-> __load_stage2(guest); isb // Writes VTCR & VTTBR
exit_vmid_context(guest);
-> __load_stage2(host); // Restores VTCR & VTTBR
enter_vmid_context(host);
-> Returns early as we're already in host context
tlbi vmalls12e1is // !!! Can use the stale VMID as we
// haven't performed context
// synchronisation since restoring
// VTTBR.VMID
Add an unconditional ISB instruction to exit_vmid_context() after
restoring the VTTBR. This already existed for the
ARM64_WORKAROUND_SPECULATIVE_AT path, so we can simply hoist that onto
the common path.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Fuad Tabba <tabba@google.com>
Fixes: 58f3b0fc3b ("KVM: arm64: Support TLB invalidation in guest context")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240814123429.20457-3-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
When initialising the nVHE hypervisor, we invalidate potentially stale
TLB entries for the EL1&0 regime using a 'vmalls12e1' invalidation.
However, this invalidation operation applies only to the active VMID
and therefore we could proceed with stale TLB entries for other VMIDs.
Replace the operation with an 'alle1' which applies to all entries for
the EL1&0 regime, regardless of the VMID.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Fixes: 1025c8c0c6 ("KVM: arm64: Wrap the host with a stage 2")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240814123429.20457-2-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Disallow copying MTE tags to guest memory while KVM is dirty logging, as
writing guest memory without marking the gfn as dirty in the memslot could
result in userspace failing to migrate the updated page. Ideally (maybe?),
KVM would simply mark the gfn as dirty, but there is no vCPU to work with,
and presumably the only use case for copy MTE tags _to_ the guest is when
restoring state on the target.
Fixes: f0376edb1d ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240726235234.228822-3-seanjc@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Put the page reference acquired by gfn_to_pfn_prot() if
kvm_vm_ioctl_mte_copy_tags() runs into ZONE_DEVICE memory. KVM's less-
than-stellar heuristics for dealing with pfn-mapped memory means that KVM
can get a page reference to ZONE_DEVICE memory.
Fixes: f0376edb1d ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240726235234.228822-2-seanjc@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
This DSB guarantees page table updates have been made visible to the
hardware table walker. Moving the DSB from stage2_split_walker() to
after the walk is finished in kvm_pgtable_stage2_split() results in a
roughly 70% reduction in Clear Dirty Log Time in
dirty_log_perf_test (modified to use eager page splitting) when using
huge pages. This gain holds steady through a range of vcpus
used (tested 1-64) and memory used (tested 1-64GB).
This is safe to do because nothing else is using the page tables while
they are still being mapped and this is how other page table walkers
already function. None of them have a data barrier in the walker
itself because relative ordering of table PTEs to table contents comes
from the release semantics of stage2_make_pte().
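The resulting shape is roughly (a sketch of the tail of
kvm_pgtable_stage2_split(); walker setup omitted):
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
/* One barrier for the whole walk, instead of one per visited PTE */
dsb(ishst);
return ret;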
Signed-off-by: Colton Lewis <coltonlewis@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240808174243.2836363-1-coltonlewis@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tearing down a vcpu CPU interface involves freeing the private interrupt
array. If we don't hold the lock, we may race against another thread
trying to configure it. Yeah, fuzzers do wonderful things...
Taking the lock early solves this particular problem.
Fixes: 03b3d00a70 ("KVM: arm64: vgic: Allocate private interrupts on demand")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240808091546.3262111-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tidy up some of the PAuth trapping code to clear up some comments
and avoid clang/checkpatch warnings. Also, don't bother setting
PAuth HCR_EL2 bits in pKVM, since it's handled by the hypervisor.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240722163311.1493879-1-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
In case the guest doesn't have any LPI, we previously relied on the
iterator setting
'intid = nr_spis + VGIC_NR_PRIVATE_IRQS' && 'lpi_idx = 1'
to exit the iterator. But it was broken with commit 85d3ccc8b7 ("KVM:
arm64: vgic-debug: Use an xarray mark for debug iterator") -- the intid
remains at 'nr_spis + VGIC_NR_PRIVATE_IRQS - 1', and we end up endlessly
printing the last SPI's state.
Consider that it's meaningless to search the LPI xarray and populate
lpi_idx when there is no LPI, let's just skip the process for that case.
The result is that
* If there's no LPI, we focus on the intid and exit the iterator when it
runs out of the valid SPI range.
* Otherwise we keep the current logic and let the xarray drive the
iterator.
Fixes: 85d3ccc8b7 ("KVM: arm64: vgic-debug: Use an xarray mark for debug iterator")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240807052024.2084-1-yuzenghui@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
With the NV support of TLBI-range operations, KVM makes use of
instructions that are only supported by binutils versions >= 2.30.
This breaks the build for very old toolchains.
Make KVM support conditional on having ARMv8.4 support in the
assembler, side-stepping the issue.
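In Kconfig terms, this is a dependency of the following shape (a
sketch; AS_HAS_ARMV8_4 is the existing assembler-capability symbol
assumed here):
menuconfig KVM
        bool "Kernel-based Virtual Machine (KVM) support"
        depends on AS_HAS_ARMV8_4
        # ... the rest of the entry is unchanged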
Fixes: 5d476ca57d ("KVM: arm64: nv: Add handling of range-based TLBI operations")
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Suggested-by: Arnd Bergmann <arnd@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240807115144.3237260-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Get rid of unexpected unlock sparse warnings in vgic code
by adding an annotation to vgic_queue_irq_unlock().
arch/arm64/kvm/vgic/vgic.c:334:17: warning: context imbalance in 'vgic_queue_irq_unlock' - unexpected unlock
arch/arm64/kvm/vgic/vgic.c:419:5: warning: context imbalance in 'kvm_vgic_inject_irq' - different lock contexts for basic block
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723101204.7356-4-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Fix kdoc warnings by adding missing function parameter
descriptions or by conversion to a normal comment.
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723101204.7356-3-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add -Wno-override-init to the build flags for sys_regs.c,
handle_exit.c, and switch.c to fix warnings like the following:
arch/arm64/kvm/hyp/vhe/switch.c:271:43: warning: initialized field overwritten [-Woverride-init]
271 | [ESR_ELx_EC_CP15_32] = kvm_hyp_handle_cp15_32,
|
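In kbuild terms this is the usual per-object flag, roughly (a sketch):
# arch/arm64/kvm/Makefile (and hyp/vhe/Makefile for switch.o)
CFLAGS_sys_regs.o       += -Wno-override-init
CFLAGS_handle_exit.o    += -Wno-override-init
CFLAGS_switch.o         += -Wno-override-init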
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723101204.7356-2-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
kvm->arch.nested_mmus is allocated with kvrealloc(), hence free it with
kvfree() instead of kfree().
Fixes: 4f128f8e1a ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723142204.758796-1-dakr@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
ARM:
* Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
* Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogeneous hardware
* Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of
the protocol
* FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
* New command-line parameter to control the WFx trap behavior under KVM
* Introduce kCFI hardening in the EL2 hypervisor
* Fixes + cleanups for handling presence/absence of FEAT_TCRX
* Miscellaneous fixes + documentation updates
LoongArch:
* Add paravirt steal time support.
* Add support for KVM_DIRTY_LOG_INITIALLY_SET.
* Add perf kvm-stat support for loongarch.
RISC-V:
* Redirect AMO load/store access fault traps to guest
* perf kvm stat support
* Use guest files for IMSIC virtualization, when available
ONE_REG support for the Zimop, Zcmop, Zca, Zcf, Zcd, Zcb and Zawrs ISA
extensions is coming through the RISC-V tree.
s390:
* Assortment of tiny fixes which are not time critical
x86:
* Fixes for Xen emulation.
* Add a global struct to consolidate tracking of host values, e.g. EFER
* Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
* Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
* Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
* Drop MTRR virtualization, and instead always honor guest PAT on CPUs
that support self-snoop.
* Update to the newfangled Intel CPU FMS infrastructure.
* Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
'0' and writes from userspace are ignored.
* Misc cleanups
x86 - MMU:
* Small cleanups, renames and refactoring extracted from the upcoming
Intel TDX support.
* Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't
hold leaf SPTEs.
* Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager
page splitting, to avoid stalling vCPUs when splitting huge pages.
* Bug the VM instead of simply warning if KVM tries to split a SPTE that is
non-present or not-huge. KVM is guaranteed to end up in a broken state
because the callers fully expect a valid SPTE, it's all but dangerous
to let more MMU changes happen afterwards.
x86 - AMD:
* Make per-CPU save_area allocations NUMA-aware.
* Force sev_es_host_save_area() to be inlined to avoid calling into an
instrumentable function from noinstr code.
* Base support for running SEV-SNP guests. API-wise, this includes
a new KVM_X86_SNP_VM type, encrypting/measuring the initial image into
guest memory, and finalizing it before launching it. Internally,
there are some gmem/mmu hooks needed to prepare gmem-allocated pages
before mapping them into guest private memory ranges.
This includes basic support for attestation guest requests, enough to
say that KVM supports the GHCB 2.0 specification.
There is no support yet for loading into the firmware those signing
keys to be used for attestation requests, and therefore no need yet
for the host to provide certificate data for those keys. To support
fetching certificate data from userspace, a new KVM exit type will be
needed to handle fetching the certificate from userspace. An attempt to
define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle
this was introduced in v1 of this patchset, but is still being discussed
by community, so for now this patchset only implements a stub version
of SNP Extended Guest Requests that does not provide certificate data.
x86 - Intel:
* Remove an unnecessary EPT TLB flush when enabling hardware.
* Fix a series of bugs that cause KVM to fail to detect nested pending posted
interrupts as valid wake events for a vCPU executing HLT in L2 (with
HLT-exiting disabled by L1).
* KVM: x86: Suppress MMIO that is triggered during task switch emulation
Explicitly suppress userspace emulated MMIO exits that are triggered when
emulating a task switch as KVM doesn't support userspace MMIO during
complex (multi-step) emulation. Silently ignoring the exit request can
result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to
userspace for some other reason prior to purging mmio_needed.
See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write exits if
emulator detects exception") for more details on KVM's limitations with
respect to emulated MMIO during complex emulator flows.
Generic:
* Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE,
because the special casing needed by these pages is not due to just
unmovability (and in fact they are only unmovable because the CPU cannot
access them).
* New ioctl to populate the KVM page tables in advance, which is useful to
mitigate KVM page faults during guest boot or after live migration.
The code will also be used by TDX, but (probably) not through the ioctl.
* Enable halt poll shrinking by default, as Intel found it to be a clear win.
* Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
* Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
* Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
* Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
Selftests:
* Remove dead code in the memslot modification stress test.
* Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs.
* Print the guest pseudo-RNG seed only when it changes, to avoid spamming the
log for tests that create lots of VMs.
* Make the PMU counters test less flaky when counting LLC cache misses by
doing CLFLUSH{OPT} in every loop iteration.
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
- Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogeneous hardware
- Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
of the protocol
- FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
- New command-line parameter to control the WFx trap behavior under
KVM
- Introduce kCFI hardening in the EL2 hypervisor
- Fixes + cleanups for handling presence/absence of FEAT_TCRX
- Miscellaneous fixes + documentation updates
LoongArch:
- Add paravirt steal time support
- Add support for KVM_DIRTY_LOG_INITIALLY_SET
- Add perf kvm-stat support for loongarch
RISC-V:
- Redirect AMO load/store access fault traps to guest
- perf kvm stat support
- Use guest files for IMSIC virtualization, when available
s390:
- Assortment of tiny fixes which are not time critical
x86:
- Fixes for Xen emulation
- Add a global struct to consolidate tracking of host values, e.g.
EFER
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
effective APIC bus frequency, because TDX
- Print the name of the APICv/AVIC inhibits in the relevant
tracepoint
- Clean up KVM's handling of vendor specific emulation to
consistently act on "compatible with Intel/AMD", versus checking
for a specific vendor
- Drop MTRR virtualization, and instead always honor guest PAT on
CPUs that support self-snoop
- Update to the newfangled Intel CPU FMS infrastructure
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
it reads '0' and writes from userspace are ignored
- Misc cleanups
x86 - MMU:
- Small cleanups, renames and refactoring extracted from the upcoming
Intel TDX support
- Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
that can't hold leaf SPTEs
- Unconditionally drop mmu_lock when allocating TDP MMU page tables
for eager page splitting, to avoid stalling vCPUs when splitting
huge pages
- Bug the VM instead of simply warning if KVM tries to split a SPTE
that is non-present or not-huge. KVM is guaranteed to end up in a
broken state because the callers fully expect a valid SPTE; it's
downright dangerous to let more MMU changes happen afterwards
x86 - AMD:
- Make per-CPU save_area allocations NUMA-aware
- Force sev_es_host_save_area() to be inlined to avoid calling into
an instrumentable function from noinstr code
- Base support for running SEV-SNP guests. API-wise, this includes a
new KVM_X86_SNP_VM type, encrypting/measuring the initial image into
guest memory, and finalizing it before launching it. Internally,
there are some gmem/mmu hooks needed to prepare gmem-allocated
pages before mapping them into guest private memory ranges
This includes basic support for attestation guest requests, enough
to say that KVM supports the GHCB 2.0 specification
There is no support yet for loading into the firmware those signing
keys to be used for attestation requests, and therefore no need yet
for the host to provide certificate data for those keys.
To support fetching certificate data from userspace, a new KVM exit
type will be needed to handle fetching the certificate from
userspace.
An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
exit type to handle this was introduced in v1 of this patchset, but
is still being discussed by the community, so for now this patchset
only implements a stub version of SNP Extended Guest Requests that
does not provide certificate data
x86 - Intel:
- Remove an unnecessary EPT TLB flush when enabling hardware
- Fix a series of bugs that cause KVM to fail to detect nested
pending posted interrupts as valid wake events for a vCPU executing
HLT in L2 (with HLT-exiting disabled by L1)
- KVM: x86: Suppress MMIO that is triggered during task switch
emulation
Explicitly suppress userspace emulated MMIO exits that are
triggered when emulating a task switch as KVM doesn't support
userspace MMIO during complex (multi-step) emulation
Silently ignoring the exit request can result in the
WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
for some other reason prior to purging mmio_needed
See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write
exits if emulator detects exception") for more details on KVM's
limitations with respect to emulated MMIO during complex emulator
flows
Generic:
- Rename the AS_UNMOVABLE flag that was introduced for KVM to
AS_INACCESSIBLE, because the special casing needed by these pages
is not due to just unmovability (and in fact they are only
unmovable because the CPU cannot access them)
- New ioctl to populate the KVM page tables in advance, which is
useful to mitigate KVM page faults during guest boot or after live
migration. The code will also be used by TDX, but (probably) not
through the ioctl
- Enable halt poll shrinking by default, as Intel found it to be a
clear win
- Setup empty IRQ routing when creating a VM to avoid having to
synchronize SRCU when creating a split IRQCHIP on x86
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with
a flag that arch code can use for hooking both sched_in() and
sched_out()
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace
detect bugs
- Mark a vCPU as preempted if and only if it's scheduled out while in
the KVM_RUN loop, e.g. to avoid marking it preempted and thus
writing guest memory when retrieving guest state during live
migration blackout (a sketch of the idea follows this list)
Selftests:
- Remove dead code in the memslot modification stress test
- Treat "branch instructions retired" as supported on all AMD Family
17h+ CPUs
- Print the guest pseudo-RNG seed only when it changes, to avoid
spamming the log for tests that create lots of VMs
- Make the PMU counters test less flaky when counting LLC cache
misses by doing CLFLUSH{OPT} in every loop iteration"
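A sketch of the "preempted if and only if in KVM_RUN" change from the
generic list above; the names approximate the generic KVM code and the
body is abridged, so treat this as illustrative rather than the exact
diff:

  static void kvm_sched_out(struct preempt_notifier *pn,
                            struct task_struct *next)
  {
          struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

          /* Only mark preemption when inside the KVM_RUN loop. */
          if (current->on_rq && vcpu->wants_to_run)
                  WRITE_ONCE(vcpu->preempted, true);

          kvm_arch_vcpu_put(vcpu);
          vcpu->scheduled_out = true;
  }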
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
crypto: ccp: Add the SNP_VLEK_LOAD command
KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
KVM: x86: Replace static_call_cond() with static_call()
KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
x86/sev: Move sev_guest.h into common SEV header
KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
KVM: x86: Suppress MMIO that is triggered during task switch emulation
KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
KVM: Document KVM_PRE_FAULT_MEMORY ioctl
mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
perf kvm: Add kvm-stat for loongarch64
LoongArch: KVM: Add PV steal time support in guest side
...
Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
- Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
- Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
- A few minor cleanups
Merge tag 'kvmarm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.11
- Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
- Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogeneous hardware
- Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of
the protocol
- FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
- New command-line parameter to control the WFx trap behavior under KVM
- Introduce kCFI hardening in the EL2 hypervisor
- Fixes + cleanups for handling presence/absence of FEAT_TCRX
- Miscellaneous fixes + documentation updates
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The biggest part is the virtual CPU hotplug that touches ACPI,
irqchip. We also have some GICv3 optimisation for pseudo-NMIs that has
been queued via the arm64 tree. Otherwise the usual perf updates,
kselftest, various small cleanups.
Core:
- Virtual CPU hotplug support for arm64 ACPI systems
- cpufeature infrastructure cleanups and making the FEAT_ECBHB ID
bits visible to guests
- CPU errata: expand the speculative SSBS workaround to more CPUs
- GICv3, use compile-time PMR values: optimise the way regular IRQs
are masked/unmasked when GICv3 pseudo-NMIs are used, removing the
need for a static key in fast paths by using a priority value
chosen dynamically at boot time
ACPI:
- 'acpi=nospcr' option to disable SPCR as default console for arm64
- Move some ACPI code (cpuidle, FFH) to drivers/acpi/arm64/
Perf updates:
- Rework of the IMX PMU driver to enable support for I.MX95
- Enable support for tertiary match groups in the CMN PMU driver
- Initial refactoring of the CPU PMU code to prepare for the fixed
instruction counter introduced by Arm v9.4
- Add missing PMU driver MODULE_DESCRIPTION() strings
- Hook up DT compatibles for recent CPU PMUs
Kselftest updates:
- Kernel mode NEON fp-stress
- Cleanups, spelling mistakes
Miscellaneous:
- arm64 Documentation update with a minor clarification on TBI
- Fix missing IPI statistics
- Implement raw_smp_processor_id() using thread_info rather than a
per-CPU variable (better code generation)
- Make MTE checking of in-kernel asynchronous tag faults conditional
on KASAN being enabled
- Minor cleanups, typos"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (69 commits)
selftests: arm64: tags: remove the result script
selftests: arm64: tags_test: conform test to TAP output
perf: add missing MODULE_DESCRIPTION() macros
arm64: smp: Fix missing IPI statistics
irqchip/gic-v3: Fix 'broken_rdists' unused warning when !SMP and !ACPI
ACPI: Add acpi=nospcr to disable ACPI SPCR as default console on ARM64
Documentation: arm64: Update memory.rst for TBI
arm64/cpufeature: Replace custom macros with fields from ID_AA64PFR0_EL1
KVM: arm64: Replace custom macros with fields from ID_AA64PFR0_EL1
perf: arm_pmuv3: Include asm/arm_pmuv3.h from linux/perf/arm_pmuv3.h
perf: arm_v6/7_pmu: Drop non-DT probe support
perf/arm: Move 32-bit PMU drivers to drivers/perf/
perf: arm_pmuv3: Drop unnecessary IS_ENABLED(CONFIG_ARM64) check
perf: arm_pmuv3: Avoid assigning fixed cycle counter with threshold
arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU
perf: imx_perf: add support for i.MX95 platform
perf: imx_perf: fix counter start and config sequence
perf: imx_perf: refactor driver for imx93
perf: imx_perf: let the driver manage the counter usage rather the user
perf: imx_perf: add macro definitions for parsing config attr
...
* kvm-arm64/nv-tcr2:
: Fixes to the handling of TCR_EL1, courtesy of Marc Zyngier
:
: Series addresses a couple of gaps that are present in KVM (from cover
: letter):
:
: - VM configuration: HCRX_EL2.TCR2En is forced to 1, and we blindly
: save/restore stuff.
:
: - trap bit description and routing: none, obviously, since we make a
: point in not trapping.
KVM: arm64: Honor trap routing for TCR2_EL1
KVM: arm64: Make PIR{,E0}_EL1 save/restore conditional on FEAT_TCRX
KVM: arm64: Make TCR2_EL1 save/restore dependent on the VM features
KVM: arm64: Get rid of HCRX_GUEST_FLAGS
KVM: arm64: Correctly honor the presence of FEAT_TCRX
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/nv-sve:
: CPTR_EL2, FPSIMD/SVE support for nested
:
: This series brings support for honoring the guest hypervisor's CPTR_EL2
: trap configuration when running a nested guest, along with support for
: FPSIMD/SVE usage at L1 and L2.
KVM: arm64: Allow the use of SVE+NV
KVM: arm64: nv: Add additional trap setup for CPTR_EL2
KVM: arm64: nv: Add trap description for CPTR_EL2
KVM: arm64: nv: Add TCPAC/TTA to CPTR->CPACR conversion helper
KVM: arm64: nv: Honor guest hypervisor's FP/SVE traps in CPTR_EL2
KVM: arm64: nv: Load guest FP state for ZCR_EL2 trap
KVM: arm64: nv: Handle CPACR_EL1 traps
KVM: arm64: Spin off helper for programming CPTR traps
KVM: arm64: nv: Ensure correct VL is loaded before saving SVE state
KVM: arm64: nv: Use guest hypervisor's max VL when running nested guest
KVM: arm64: nv: Save guest's ZCR_EL2 when in hyp context
KVM: arm64: nv: Load guest hyp's ZCR into EL1 state
KVM: arm64: nv: Handle ZCR_EL2 traps
KVM: arm64: nv: Forward SVE traps to guest hypervisor
KVM: arm64: nv: Forward FP/ASIMD traps to guest hypervisor
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/el2-kcfi:
: kCFI support in the EL2 hypervisor, courtesy of Pierre-Clément Tosi
:
: Enable the usage of CONFIG_CFI_CLANG (kCFI) for hardening indirect
: branches in the EL2 hypervisor. Unlike kernel support for the feature,
: CFI failures at EL2 are always fatal.
KVM: arm64: nVHE: Support CONFIG_CFI_CLANG at EL2
KVM: arm64: Introduce print_nvhe_hyp_panic helper
arm64: Introduce esr_brk_comment, esr_is_cfi_brk
KVM: arm64: VHE: Mark __hyp_call_panic __noreturn
KVM: arm64: nVHE: gen-hyprel: Skip R_AARCH64_ABS32
KVM: arm64: nVHE: Simplify invalid_host_el2_vect
KVM: arm64: Fix __pkvm_init_switch_pgd call ABI
KVM: arm64: Fix clobbered ELR in sync abort/SError
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/ctr-el0:
: Support for user changes to CTR_EL0, courtesy of Sebastian Ott
:
: Allow userspace to change the guest-visible value of CTR_EL0 for a VM,
: so long as the requested value represents a subset of features supported
: by hardware. In other words, prevent the VMM from over-promising the
: capabilities of hardware.
:
: Make this happen by fitting CTR_EL0 into the existing infrastructure for
: feature ID registers.
KVM: selftests: Assert that MPIDR_EL1 is unchanged across vCPU reset
KVM: arm64: nv: Unfudge ID_AA64PFR0_EL1 masking
KVM: selftests: arm64: Test writes to CTR_EL0
KVM: arm64: rename functions for invariant sys regs
KVM: arm64: show writable masks for feature registers
KVM: arm64: Treat CTR_EL0 as a VM feature ID register
KVM: arm64: unify code to prepare traps
KVM: arm64: nv: Use accessors for modifying ID registers
KVM: arm64: Add helper for writing ID regs
KVM: arm64: Use read-only helper for reading VM ID registers
KVM: arm64: Make idregs debugfs iterator search sysreg table directly
KVM: arm64: Get sys_reg encoding from descriptor in idregs_debug_show()
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/shadow-mmu:
: Shadow stage-2 MMU support for NV, courtesy of Marc Zyngier
:
: Initial implementation of shadow stage-2 page tables to support a guest
: hypervisor. In the author's words:
:
: So here's the 10000m (approximately 30000ft for those of you stuck
: with the wrong units) view of what this is doing:
:
: - for each {VMID,VTTBR,VTCR} tuple the guest uses, we use a
: separate shadow s2_mmu context. This context has its own "real"
: VMID and a set of page tables that are the combination of the
: guest's S2 and the host S2, built dynamically one fault at a time.
:
: - these shadow S2 contexts are ephemeral, and behave exactly as
: TLBs. For all intents and purposes, they *are* TLBs, and we discard
: them pretty often.
:
: - TLB invalidation takes three possible paths:
:
: * either this is an EL2 S1 invalidation, and we directly emulate
: it as early as possible
:
: * or this is an EL1 S1 invalidation, and we need to apply it to
: the shadow S2s (plural!) that match the VMID set by the L1 guest
:
: * or finally, this is affecting S2, and we need to teardown the
: corresponding part of the shadow S2s, which invalidates the TLBs
KVM: arm64: nv: Truely enable nXS TLBI operations
KVM: arm64: nv: Add handling of NXS-flavoured TLBI operations
KVM: arm64: nv: Add handling of range-based TLBI operations
KVM: arm64: nv: Add handling of outer-shareable TLBI operations
KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like information
KVM: arm64: nv: Tag shadow S2 entries with guest's leaf S2 level
KVM: arm64: nv: Handle FEAT_TTL hinted TLB operations
KVM: arm64: nv: Handle TLBI IPAS2E1{,IS} operations
KVM: arm64: nv: Handle TLBI ALLE1{,IS} operations
KVM: arm64: nv: Handle TLBI VMALLS12E1{,IS} operations
KVM: arm64: nv: Handle TLB invalidation targeting L2 stage-1
KVM: arm64: nv: Handle EL2 Stage-1 TLB invalidation
KVM: arm64: nv: Add Stage-1 EL2 invalidation primitives
KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
KVM: arm64: nv: Handle shadow stage 2 page faults
KVM: arm64: nv: Implement nested Stage-2 page table walk logic
KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/ffa-1p1:
: Improvements to the pKVM FF-A Proxy, courtesy of Sebastian Ene
:
: Various minor improvements to how host FF-A calls are proxied with the
: TEE, along with support for v1.1 of the protocol.
KVM: arm64: Use FF-A 1.1 with pKVM
KVM: arm64: Update the identification range for the FF-A smcs
KVM: arm64: Add support for FFA_PARTITION_INFO_GET
KVM: arm64: Trap FFA_VERSION host call in pKVM
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
This reverts commit eb9d53d4a9.
As Marc pointed out on the list [*], this patch is wrong, and those who
find themselves in the SOB chain should have their heads checked.
Annoyingly, the architecture has some FGT trap bits that are negative
(i.e. 0 implies trap), and there was some confusion how KVM handles
this for nested guests. However, it is clear now that KVM honors the
RES0-ness of FGT traps already, meaning traps for features never exposed
to the guest hypervisor get handled at L0. As they should.
Link: https://lore.kernel.org/kvmarm/86bk3c3uss.wl-maz@kernel.org/T/#mb9abb3dd79f6a4544a91cb35676bd637c3a5e836
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Although we now have support for nXS-flavoured TLBI instructions,
we still don't expose the feature to the guest thanks to a mixture
of a misleading comment and the use of a bunch of magic values.
Fix the comment and correctly express the masking of LS64, which
is enough to expose nXS to the world. Not that anyone cares...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240703154743.824824-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The arm64 asm/arm_pmuv3.h depends on defines from
linux/perf/arm_pmuv3.h. Rather than depend on include order, follow the
usual pattern of "linux" headers including "asm" headers of the same
name.
With this change, the include of linux/kvm_host.h is problematic due to
circular includes:
In file included from ../arch/arm64/include/asm/arm_pmuv3.h:9,
from ../include/linux/perf/arm_pmuv3.h:312,
from ../include/kvm/arm_pmu.h:11,
from ../arch/arm64/include/asm/kvm_host.h:38,
from ../arch/arm64/mm/init.c:41:
../include/linux/kvm_host.h:383:30: error: field 'arch' has incomplete type
Switching to asm/kvm_host.h solves the issue.
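A sketch of the resulting layout (paths are real, contents abridged):

  /* include/linux/perf/arm_pmuv3.h: the "linux" header now pulls in
   * the "asm" header of the same name, so users no longer depend on
   * include order. */
  #include <asm/arm_pmuv3.h>

  /* arch/arm64/include/asm/arm_pmuv3.h: break the include cycle by
   * using the arch-level KVM header. */
  #include <asm/kvm_host.h>   /* was: <linux/kvm_host.h> */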
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20240626-arm-pmu-3-9-icntr-v2-5-c9784b4f4065@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
TCR2_EL1 handling is missing the handling of its trap configuration:
- HCRX_EL2.TCR2En must be handled in conjunction with HCR_EL2.{TVM,TRVM}
- HFG{R,W}TR_EL2.TCR_EL1 does apply to TCR2_EL1 as well
Without these two controls being implemented, it is impossible to
correctly route TCR2_EL1 traps.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240625130042.259175-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As per the architecture, if FEAT_S1PIE is implemented, then FEAT_TCRX
must be implemented as well.
Take advantage of this to avoid checking for S1PIE when TCRX isn't
implemented.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240625130042.259175-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As for other registers, save/restore of TCR2_EL1 should be gated
on the feature being actually present.
In the case of a nVHE hypervisor, it is perfectly fine to leave
the host value in the register, as HCRX_EL2.TCR2En==0 imposes that
TCR2_EL1 is treated as 0.
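A sketch of the gated save path (helper names as used by this series,
body abridged):

  static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
  {
          /* ... other EL1 sysregs ... */
          if (ctxt_has_tcr2(ctxt))  /* only when FEAT_TCRX is in the VM */
                  ctxt_sys_reg(ctxt, TCR2_EL1) =
                          read_sysreg_el1(SYS_TCR2);
  }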
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240625130042.259175-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
HCRX_GUEST_FLAGS gives random KVM hackers the impression that
they can stuff bits in this macro and unconditionally enable
features in the guest.
In general, this is wrong (we have been there with FEAT_MOPS,
and again with FEAT_TCRX).
Document that HCRX_EL2.SMPME is an exception rather than the rule,
and get rid of HCRX_GUEST_FLAGS.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20240625130042.259175-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We currently blindly enable TCR2_EL1 use in a guest, irrespective
of the feature set. This is obviously wrong, and we should actually
honor the guest configuration and handle the possible trap resulting
from the guest being buggy.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20240625130042.259175-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Marc reports that L1 VMs aren't booting with the NV series applied to
today's kvmarm/next. After bisecting the issue, it appears that
44241f34fa ("KVM: arm64: nv: Use accessors for modifying ID
registers") is to blame.
Poking around at the issue a bit further, it'd appear that the value for
ID_AA64PFR0_EL1 is complete garbage, as 'val' still contains the value
we set ID_AA64ISAR1_EL1 to.
Fix the read-modify-write pattern to actually use ID_AA64PFR0_EL1 as the
starting point. Excuse me as I return to my shame cube.
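The bug, in sketch form (accessor names from the offending series, the
surrounding logic abridged):

  /* Buggy: 'val' still holds the value computed for ID_AA64ISAR1_EL1,
   * so ID_AA64PFR0_EL1 ends up with complete garbage. */
  val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64ISAR1_EL1);
  /* ... modify and write back ID_AA64ISAR1_EL1 ... */
  kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, val);

  /* Fixed: start the read-modify-write from the right register. */
  val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1);
  /* ... apply the intended masking, then write it back ... */
  kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, val);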
Reported-by: Marc Zyngier <maz@kernel.org>
Fixes: 44241f34fa ("KVM: arm64: nv: Use accessors for modifying ID registers")
Acked-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240621224044.2465901-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We need to teach KVM a couple of new tricks. CPTR_EL2 and its
VHE accessor CPACR_EL1 need to be handled specially:
- CPACR_EL1 is trapped on VHE so that we can track the TCPAC
and TTA bits
- CPTR_EL2.{TCPAC,E0POE} are propagated from L1 to L2
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-15-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add trap description for CPTR_EL2.{TCPAC,TAM,E0POE,TTA}.
TTA is a bit annoying as it changes location depending on E2H.
This forces us to add yet another "complex" trap condition.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-14-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Start folding the guest hypervisor's FP/SVE traps into the value
programmed in hardware. Note that as of writing this is dead code, since
KVM does a full put() / load() for every nested exception boundary which
saves + flushes the FP/SVE state.
However, this will become useful when we can keep the guest's FP/SVE
state alive across a nested exception boundary and the host no longer
needs to conservatively program traps.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-12-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Round out the ZCR_EL2 gymnastics by loading SVE state in the fast path
when the guest hypervisor tries to access SVE state.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-11-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Handle CPACR_EL1 accesses when running a VHE guest. In order to
limit the cost of the emulation, implement it as a shallow exit.
In the other cases:
- this is a nVHE L1 which will write to memory, and we don't trap
- this is a L2 guest:
* the L1 has CPTR_EL2.TCPAC==0, and the L2 has direct register
access
* the L1 has CPTR_EL2.TCPAC==1, and the L2 will trap, but the
handling is deferred to the general handling for forwarding
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-10-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
A subsequent change to KVM will add preliminary support for merging a
guest hypervisor's CPTR traps with that of KVM. Prepare by spinning off
a new helper for managing CPTR traps.
Avoid reading CPACR_EL1 for the baseline trap config, and start off with
the most restrictive set of traps that is subsequently relaxed.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
It is possible that the guest hypervisor has selected a smaller VL than
the maximum for its nested guest. As such, ZCR_EL2 may be configured for
a different VL when exiting a nested guest.
Set ZCR_EL2 (via the EL1 alias) to the maximum VL for the VM before
saving SVE state as the SVE save area is dimensioned by the max VL.
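In sketch form (sve_cond_update_zcr_vq() and vcpu_sve_max_vq() are
existing helpers; the exact call site is abridged):

  /* ZCR_EL2 may still hold the guest hypervisor's smaller VL; the
   * save area is dimensioned by the max VL, so restore that first. */
  if (vcpu_has_nv(vcpu))
          sve_cond_update_zcr_vq(vcpu_sve_max_vq(vcpu) - 1, SYS_ZCR_EL1);
  /* ... now dump the SVE registers into the save area ... */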
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The max VL for nested guests is additionally constrained by the max VL
selected by the guest hypervisor. Use that instead of KVM's max VL when
running a nested guest.
Note that the guest hypervisor's ZCR_EL2 is sanitised against the VM's
max VL at the time of access, so there's no additional handling required
at the time of use.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Load the guest hypervisor's ZCR_EL2 into the corresponding EL1 register
when restoring SVE state, as ZCR_EL2 affects the VL in the hypervisor
context.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Unlike other SVE-related registers, ZCR_EL2 takes a sysreg trap to EL2
when HCR_EL2.NV = 1. KVM still needs to honor the guest hypervisor's
trap configuration, which expects an SVE trap (i.e. ESR_EL2.EC = 0x19)
when CPTR traps are enabled for the vCPU's current context.
Otherwise, if the guest hypervisor has traps disabled, emulate the
access by mapping the requested VL into ZCR_EL1.
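A sketch of the resulting trap flow; guest_hyp_sve_traps_enabled() is
introduced by this series, while the other helper names here are
illustrative:

  static bool handle_zcr_el2(struct kvm_vcpu *vcpu)
  {
          if (guest_hyp_sve_traps_enabled(vcpu)) {
                  /* Reflect it as an SVE trap (EC = 0x19) to the L1. */
                  inject_nested_sve_trap(vcpu);
                  return true;
          }

          /* No L1 traps: emulate by folding the VL into ZCR_EL1. */
          /* ... sanitise the requested VL, write SYS_ZCR_EL1 ... */
          return true;
  }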
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Similar to FPSIMD traps, don't load SVE state if the guest hypervisor
has SVE traps enabled and forward the trap instead. Note that ZCR_EL2
will require some special handling, as it takes a sysreg trap to EL2
when HCR_EL2.NV = 1.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Give precedence to the guest hypervisor's trap configuration when
routing an FP/ASIMD trap taken to EL2. Take advantage of the
infrastructure for translating CPTR_EL2 into the VHE (i.e. EL1) format
and base the trap decision solely on the VHE view of the register. The
in-memory value of CPTR_EL2 will always be up to date for the guest
hypervisor (more on that later), so just read it directly from memory.
Bury all of this behind a macro keyed off of the CPTR bitfield in
anticipation of supporting other traps (e.g. SVE).
[maz: account for HCR_EL2.E2H when testing for TFP/FPEN, with
all the hard work actually being done by Chase Conklin]
[ oliver: translate nVHE->VHE format for testing traps; macro for reuse
in other CPTR_EL2.xEN fields ]
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240620164653.1130714-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The compiler implements kCFI by adding type information (u32) above
every function that might be indirectly called and, whenever a function
pointer is called, injects a read-and-compare of that u32 against the
value corresponding to the expected type. In case of a mismatch, a BRK
instruction gets executed. When the hypervisor triggers such an
exception in nVHE, it panics and triggers an exception return to EL1.
Therefore, teach nvhe_hyp_panic_handler() to detect kCFI errors from the
ESR and report them. If necessary, remind the user that EL2 kCFI is not
affected by CONFIG_CFI_PERMISSIVE.
Pass $(CC_FLAGS_CFI) to the compiler when building the nVHE hyp code.
Use SYM_TYPED_FUNC_START() for __pkvm_init_switch_pgd, as nVHE can't
call it directly and must use a PA function pointer from C (because it
is part of the idmap page), which would trigger a kCFI failure if the
type ID wasn't present.
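Conceptually, the instrumentation looks like this (pseudo-C for what
the compiler emits; the offset and BRK immediate are illustrative):

  u32 expected = __kcfi_typeid_of(callee_type); /* stored above callee */

  if (*(u32 *)((unsigned long)fn - 4) != expected)
          asm volatile("brk #0x8220");          /* kCFI failure */
  fn(args);                                     /* indirect call */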
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-9-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add a helper to display a panic banner, which will soon also be used
for kCFI failures, to ensure that we remain consistent.
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-8-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
As it is already used in two places, move esr_comment() to a header for
re-use, with a clearer name.
Introduce esr_is_cfi_brk() to detect kCFI BRK syndromes, currently used
by early_brk64() but soon to also be used by hypervisor code.
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-7-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Given that the sole purpose of __hyp_call_panic() is to call panic(), a
__noreturn function, give it the __noreturn attribute, removing the need
for its caller to use unreachable().
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-6-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Ignore R_AARCH64_ABS32 relocations, instead of panicking, when emitting
the relocation table of the hypervisor. The toolchain might produce them
when generating function calls with kCFI to represent the 32-bit type ID
which can then be resolved across compilation units at link time. These
are NOT actual 32-bit addresses and are therefore not needed in the
final (runtime) relocation table (which is unlikely to use 32-bit
absolute addresses for arm64 anyway).
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-5-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The invalid_host_el2_vect macro is used by EL2{t,h} handlers in nVHE
*host* context, which should never run with a guest context loaded.
Therefore, remove the superfluous vCPU context check and branch
unconditionally to hyp_panic.
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-4-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Fix the mismatch between the (incorrect) C signature, C call site, and
asm implementation by aligning all three on an API passing the
parameters (pgd and SP) separately, instead of as a bundled struct.
Remove the now unnecessary memory accesses while the MMU is off from the
asm, which simplifies the C caller (as it does not need to convert a VA
struct pointer to PA) and makes the code slightly more robust by
offsetting the struct fields from C and properly expressing the call to
the C compiler (e.g. type checker and kCFI).
Fixes: f320bc742b ("KVM: arm64: Prepare the creation of s1 mappings at EL2")
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-3-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When the hypervisor receives a SError or synchronous exception (EL2h)
while running with the __kvm_hyp_vector and if ELR_EL2 doesn't point to
an extable entry, it panics indirectly by overwriting ELR with the
address of a panic handler, so that the asm routine it returns to will
ERET into the handler.
However, this clobbers ELR_EL2 for the handler itself. As a result,
hyp_panic(), when retrieving what it believes to be the PC where the
exception happened, actually ends up reading the address of the panic
handler that called it! This results in an erroneous and confusing panic
message where the source of any synchronous exception (e.g. BUG() or
kCFI) appears to be __guest_exit_panic, making it hard to locate the
actual BRK instruction.
Therefore, store the original ELR_EL2 in the per-CPU kvm_hyp_ctxt and
point the sysreg to a routine that first restores it to its previous
value before running __guest_exit_panic.
Fixes: 7db2153047 ("KVM: arm64: Restore hyp when panicking in guest context")
Signed-off-by: Pierre-Clément Tosi <ptosi@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240610063244.2828978-2-ptosi@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Invariant system id registers are populated with host values
at initialization time using their .reset function cb.
These are currently named get_*, a prefix usually used by
the functions implementing the .get_user callback.
Change their function names to reset_* to reflect what they
are used for.
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-10-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Instead of using ~0UL provide the actual writable mask for
non-id feature registers in the output of the
KVM_ARM_GET_REG_WRITABLE_MASKS ioctl.
This changes the mask for the CTR_EL0 and CLIDR_EL1 registers.
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
CTR_EL0 is currently handled as an invariant register, thus
guests will be presented with the host value of that register.
Add emulation for CTR_EL0 based on a per VM value. Userspace can
switch off DIC and IDC bits and reduce DminLine and IminLine sizes.
Naturally, ensure CTR_EL0 is trapped (HCR_EL2.TID2=1) any time that a
VM's CTR_EL0 differs from hardware.
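In sketch form (a minimal rendering of the trap computation; field and
helper names follow the arm64 KVM code):

  /* Only pay for the trap when emulation is actually needed. */
  if (kvm_read_vm_id_reg(kvm, SYS_CTR_EL0) !=
      read_sanitised_ftr_reg(SYS_CTR_EL0))
          vcpu->arch.hcr_el2 |= HCR_TID2;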
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
There are 2 functions to calculate traps via HCR_EL2:
* kvm_init_sysreg() called via KVM_RUN (before the 1st run or when
the pid changes)
* vcpu_reset_hcr() called via KVM_ARM_VCPU_INIT
To unify these 2 and to support traps that are dependent on the
ID register configuration, move the code from vcpu_reset_hcr()
to sys_regs.c and call it via kvm_init_sysreg().
We still have to keep the non-FWB handling stuff in vcpu_reset_hcr().
Also the initialization with HCR_GUEST_FLAGS is kept there but guarded
by !vcpu_has_run_once() to ensure that previous calculated values
don't get overwritten.
While at it rename kvm_init_sysreg() to kvm_calculate_traps() to
better reflect what it's doing.
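The guard, in sketch form:

  /* Don't clobber traps computed on a previous KVM_RUN. */
  if (!vcpu_has_run_once(vcpu))
          vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS;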
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
In the interest of abstracting away the underlying storage of feature
ID registers, rework the nested code to go through the accessors instead
of directly iterating the id_regs array.
This means we now lose the property that ID registers unknown to the
nested code get zeroed, but we really ought to be handling those
explicitly going forward.
Link: https://lore.kernel.org/r/20240619174036.483943-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Replace the remaining usage of IDREG() with a new helper for setting the
value of a feature ID register, with the benefit of cramming in some
extra sanity checks.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
IDREG() expands to the storage of a particular ID reg, which can be
useful for handling both reads and writes. However, outside of a select
few situations, the ID registers should be considered read only.
Replace current readers with a new macro that expands to the value of
the field rather than the field itself.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
CTR_EL0 complicates the existing scheme for iterating feature ID
registers, as it is not in the contiguous range that we presently
support. Just search the sysreg table for the Nth feature ID register in
anticipation of this. Yes, the debugfs interface has quadratic time
complexity now. Boo hoo.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
KVM is about to add support for more VM-scoped feature ID regs that
live outside of the id_regs[] array, which means the index of the
debugfs iterator may not actually be an index into the array.
Prepare by getting the sys_reg encoding from the descriptor itself.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20240619174036.483943-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Of course, userspace is in the driver's seat for struct kvm and
associated allocations. Make sure the sysreg_masks allocation
participates in kmem accounting.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240617181018.2054332-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Latest kid on the block: NXS (Non-eXtra-Slow) TLBI operations.
Let's add those in bulk (NSH, ISH, OSH, both normal and range)
as they directly map to their XS (the standard ones) counterparts.
Not a lot to say about them, they are basically useless.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-17-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We already support some form of range operation by handling FEAT_TTL,
but so far the "arbitrary" range operations are unsupported.
Let's fix that.
For EL2 S1, this is simple enough: we just map the NSH, ISH and OSH
instructions onto the ISH version for EL1.
For TLBI instructions affecting EL1 S1, we use the same model as
their non-range counterpart to invalidate in the context of the
correct VMID.
For TLBI instructions affecting S2, we interpret the data passed
by the guest to compute the range and use that to tear-down part
of the shadow S2 range and invalidate the TLBs.
Finally, we advertise FEAT_TLBIRANGE if the host supports it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Our handling of outer-shareable TLBIs is pretty basic: we just
map them to the existing inner-shareable ones, because we really
don't have anything else.
The only significant change is that we can now advertise FEAT_TLBIOS
support if the host supports it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
In order to be able to make S2 TLB invalidations more performant on NV,
let's use a scheme derived from the FEAT_TTL extension.
If bits [56:55] in the leaf descriptor translating the address in the
corresponding shadow S2 are non-zero, they indicate a level which can
be used as an invalidation range. This allows further reduction of the
systematic over-invalidation that takes place otherwise.
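In sketch form (FIELD_GET()/GENMASK_ULL() are the generic kernel
helpers; the invalidation helpers and ttl_to_size() are named
illustratively here):

  /* Recover the level hint from the software bits of the leaf. */
  u8 level = FIELD_GET(GENMASK_ULL(56, 55), desc);

  if (level)
          /* Known mapping size: invalidate just that IPA range. */
          invalidate_shadow_s2_range(mmu, ipa, ttl_to_size(level));
  else
          /* No hint: fall back to the largest possible mapping. */
          invalidate_shadow_s2_range(mmu, ipa, max_map_size);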
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-14-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Populate bits [56:55] of the leaf entry with the level provided
by the guest's S2 translation. This will allow us to better scope
the invalidation by remembering the mapping size.
Of course, this assumes that the guest will issue an invalidation
with an address that falls into the same leaf. If the guest doesn't,
we'll over-invalidate.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Support guest-provided information to size the range of
required invalidation. This helps with reducing over-invalidation,
provided that the guest actually provides accurate information.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
TLBI IPAS2E1* are the last class of TLBI instructions we need
to handle. For each matching S2 MMU context, we invalidate a
range corresponding to the largest possible mapping for that
context.
At this stage, we don't handle TTL, which means we are likely
over-invalidating. Further patches will aim at making this
a bit better.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
TLBI ALLE1* is a pretty big hammer that invalidates all S1/S2 TLBs.
This translates into the unmapping of all our shadow S2 PTs, itself
resulting in the corresponding TLB invalidations.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Emulating TLBI VMALLS12E1* results in tearing down all the shadow
S2 PTs that match the current VMID, since our shadow S2s are just
some form of SW-managed TLBs. That teardown itself results in a
full TLB invalidation for both S1 and S2.
This can result in over-invalidation if two vcpus use the same VMID
to tag private S2 PTs, but this is still correct from an architecture
perspective.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
While dealing with TLB invalidation targeting the guest hypervisor's
own stage-1 was easy, doing the same thing for its own guests is
a bit more involved.
Since such an invalidation is scoped by VMID, it needs to apply to
all s2_mmu contexts that have been tagged by that VMID, irrespective
of the value of VTTBR_EL2.BADDR.
So for each s2_mmu context matching that VMID, we invalidate the
corresponding TLBs, each context having its own "physical" VMID.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-8-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Due to the way FEAT_NV2 suppresses traps when accessing EL2
system registers, we can't track when the guest changes its
HCR_EL2.TGE setting. This means we always trap EL1 TLBIs,
even if they don't affect any L2 guest.
Given that invalidating the EL2 TLBs doesn't require any messing
with the shadow stage-2 page-tables, we can simply emulate the
instructions early and return directly to the guest.
This is conditioned on the instruction being an EL1 one and
the guest's HCR_EL2.{E2H,TGE} being {1,1} (indicating that
the instruction targets the EL2 S1 TLBs), or the instruction
being one of the EL2 ones (which are not ambiguous).
EL1 TLBIs issued with HCR_EL2.{E2H,TGE}={1,0} are not handled
here, and cause a full exit so that they can be handled in
the context of a VMID.
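The routing condition, in sketch form (the E2H/TGE predicates are
existing helpers; the TLBI decode and emulation helpers here are
illustrative):

  /* EL2 instructions are unambiguous; EL1 ones only target the EL2
   * S1 TLBs when the guest's HCR_EL2.{E2H,TGE} == {1,1}. */
  if (is_el2_tlbi(sys_encoding) ||
      (vcpu_el2_e2h_is_set(vcpu) && vcpu_el2_tge_is_set(vcpu))) {
          /* Emulate early and return straight to the guest. */
          return emulate_el2_s1_tlbi(vcpu, sys_encoding);
  }

  return false;   /* full exit: must be handled in a VMID context */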
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Provide the primitives required to handle TLB invalidation for
Stage-1 EL2 TLBs, which by definition do not require messing
with the Stage-2 page tables.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Unmap/flush shadow stage 2 page tables for the nested VMs as well as the
stage 2 page table for the guest hypervisor.
Note: A bunch of the code in mmu.c relating to MMU notifiers is
currently dealt with in an extremely abrupt way, for example by clearing
out an entire shadow stage-2 table. This will be handled in a more
efficient way using the reverse mapping feature in a later version of
the patch series.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
If we are faulting on a shadow stage 2 translation, we first walk the
guest hypervisor's stage 2 page table to see if it has a mapping. If
not, we inject a stage 2 page fault to the virtual EL2. Otherwise, we
create a mapping in the shadow stage 2 page table.
Note that we have to deal with two IPAs when we get a shadow stage 2
page fault. One is the address we faulted on, and is in the L2 guest
phys space. The other is from the guest stage-2 page table walk, and is
in the L1 guest phys space. To differentiate them, we rename variables
so that fault_ipa is used for the former and ipa is used for the latter.
When mapping a page in a shadow stage-2, special care must be taken not
to be more permissive than the guest is.
Co-developed-by: Christoffer Dall <christoffer.dall@linaro.org>
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Based on the pseudo-code in the ARM ARM, implement a stage 2 software
page table walker.
Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add Stage-2 mmu data structures for virtual EL2 and for nested guests.
We don't yet populate shadow Stage-2 page tables, but we now have a
framework for getting to a shadow Stage-2 pgd.
We allocate twice as many Stage-2 mmu structures as vcpus, because
that's sufficient for each vcpu to run two translation regimes without
having to flush the Stage-2 page tables.
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614144552.2773592-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run
loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit
was not set.
Replace all references to vcpu->run->immediate_exit with
!vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a
malicious userspace could invoke KVM_RUN with immediate_exit=true and
then after KVM reads it to set wants_to_run=false, flip it to false.
This would result in the vCPU running in KVM_RUN with
wants_to_run=false. This wouldn't cause any real bugs today but is a
dangerous landmine.
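The resulting KVM_RUN sequence, in sketch form:

  /* Read immediate_exit exactly once and latch it for the whole run,
   * closing the TOCTOU window described above. */
  vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);
  r = kvm_arch_vcpu_ioctl_run(vcpu);
  vcpu->wants_to_run = false;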
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that the layout of the structures is compatible with 1.1, it is time
to probe the 1.1 version of the FF-A protocol inside the hypervisor. If
the TEE doesn't support it, it should return the minimum supported
version.
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240613132035.1070360-5-sebastianene@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The FF-A spec 1.2 reserves the following ranges for identifying FF-A
calls:
0x84000060-0x840000FF: FF-A 32-bit calls
0xC4000060-0xC40000FF: FF-A 64-bit calls.
Use the range identification according to the spec and allow calls that
are currently out of the range (e.g. FFA_MSG_SEND_DIRECT_REQ2) to be
identified correctly.
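A sketch of the widened check, following the SMCCC encoding (top byte
0x84 for 32-bit fast standard-service calls, 0xC4 for the 64-bit ones);
the kernel's macro names may differ:

#include <stdbool.h>
#include <stdint.h>

#define FFA_MIN_FUNC_NUM	0x60
#define FFA_MAX_FUNC_NUM	0xFF

static bool is_ffa_call(uint32_t func_id)
{
	uint32_t top = func_id >> 24;		/* 0x84: SMC32, 0xC4: SMC64 */
	uint32_t num = func_id & 0xffff;	/* SMCCC function number */

	return (top == 0x84 || top == 0xc4) &&
	       num >= FFA_MIN_FUNC_NUM && num <= FFA_MAX_FUNC_NUM;
}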
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240613132035.1070360-4-sebastianene@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Handle the FFA_PARTITION_INFO_GET host call inside the pKVM hypervisor
and copy the response message back to the host buffers.
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240613132035.1070360-3-sebastianene@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The pKVM hypervisor initializes with FF-A version 1.0. The spec requires
that no other FF-A calls be issued before the version negotiation
phase is complete. Split the hypervisor proxy initialization code in two
parts so that we can move the later one after the host negotiates its
version.
Without trapping the call, the host drivers can negotiate a higher
version number with TEE which can result in a different memory layout
described during the memory sharing calls.
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240613132035.1070360-2-sebastianene@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The Fine Grained Trap extension is pretty messy as it doesn't
consistently use the same polarity for all trap bits. A bunch
of them, added later in the life of the architecture, have a
*negative* priority.
So if these bits are disabled, they must be RES1 and not RES0.
But that's not what the code implements, leaving the traps for
these negative trap bits always enabled instead of disabled.
Fix the relevant bits, and stick a brown paper bag on my head
for the rest of the day...
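The fix amounts to honoring per-bit polarity when composing a trap
register value. An illustrative model, not the kernel's actual FGT
code:

#include <stdint.h>

/* pos_*: bits where 1 == trap; neg_*: bits where 0 == trap, i.e. they
 * are RES1 whenever their trap is disabled. */
static uint64_t fgt_compose(uint64_t pos_traps, uint64_t neg_bits,
			    uint64_t neg_traps)
{
	uint64_t val = neg_bits;	/* all negative bits RES1: no traps */

	val &= ~neg_traps;		/* enable wanted negative traps */
	val |= pos_traps;		/* enable wanted positive traps */

	return val;
}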
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240614125858.78361-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Add a pair of early_params to control WFI and WFE trapping. This
controls the degree to which guests can wait for interrupts on their
own without being trapped by KVM. Each param accepts one of two
options: "trap" enables the trap, "notrap" disables it. Note that even
when enabled, traps are permitted but not guaranteed by the CPU
architecture. Absent an explicitly set policy, default to the current
behavior: disable the trap if only a single task is running, and
enable it otherwise.
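A sketch of what one of the two params could look like; the parameter
name and the policy enum are assumptions rather than the merged
interface:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/string.h>

enum trap_policy {
	TRAP_POLICY_DEFAULT,	/* heuristic: trap iff the CPU is contended */
	TRAP_POLICY_TRAP,
	TRAP_POLICY_NOTRAP,
};

static enum trap_policy kvm_wfi_policy = TRAP_POLICY_DEFAULT;

static int __init early_kvm_wfi_trap_policy(char *arg)
{
	if (!arg)
		return -EINVAL;

	if (!strcmp(arg, "trap"))
		kvm_wfi_policy = TRAP_POLICY_TRAP;
	else if (!strcmp(arg, "notrap"))
		kvm_wfi_policy = TRAP_POLICY_NOTRAP;
	else
		return -EINVAL;

	return 0;
}
early_param("kvm-arm.wfi_trap_policy", early_kvm_wfi_trap_policy);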
Signed-off-by: Colton Lewis <coltonlewis@google.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20240523174056.1565133-1-coltonlewis@google.com
[ oliver: rework kvm_vcpu_should_clear_tw*() for readability ]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
According to the FF-A spec (Buffer states and ownership), after a
producer has written into a buffer, it is "full" and now owned by the
consumer. The producer won't be able to use that buffer, until the
consumer hands it over with an invocation such as RX_RELEASE.
It is clear in the following paragraph (Transfer of buffer ownership),
that MEM_RETRIEVE_RESP is transferring the ownership from producer (in
our case SPM) to consumer (hypervisor). RX_RELEASE is therefore
mandatory here.
It is less clear though what is happening with MEM_FRAG_TX. But this
invocation, as a response to MEM_FRAG_RX, writes into the same hypervisor
RX buffer (see paragraph "Transmission of transaction descriptor in
fragments"). This also matches the TF-A implementation, where the RX
buffer is marked "full" during a MEM_FRAG_RX.
Release the hypervisor RX buffer in those two cases. This will unblock
later invocations using this buffer (RETRIEVE_REQ, MEM_FRAG_RX and
PARTITION_INFO_GET), which would otherwise fail.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20240611175317.1220842-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
When tearing down a redistributor region, make sure we don't have
any dangling pointer to that region stored in a vcpu.
Fixes: e5a3563546 ("kvm: arm64: vgic-v3: Introduce vgic_v3_free_redist_region()")
Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240605175637.1635653-1-maz@kernel.org
Cc: stable@vger.kernel.org
KVM (and pKVM) do not support SME guests. Therefore KVM ensures
that the host's SME state is flushed and that SME controls for
enabling access to ZA storage and for streaming are disabled.
pKVM needs to protect against a buggy/malicious host. Ensure that,
when protected mode is enabled, it refuses to run a guest if any of
the SME controls are enabled.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-10-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
When setting/clearing CPACR bits for EL0 and EL1, use the ELx
format of the bits, which covers both. This makes the code
clearer, and reduces the chances of accidentally missing a bit.
No functional change intended.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-9-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that we have introduced finalize_init_hyp_mode(), let's
consolidate the initialization of the host_data fpsimd_state and
sve state.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240603122852.3923848-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
When running in protected mode we don't want to leak protected
guest state to the host, including whether a guest has used
fpsimd/sve. Therefore, eagerly restore the host state on guest
exit when running in protected mode, which happens only if the
guest has used fpsimd/sve.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-7-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Protected mode needs to maintain (save/restore) the host's sve
state, rather than relying on the host kernel to do that. This is
to avoid leaking information to the host about guests and the
type of operations they are performing.
As a first step towards that, allocate memory mapped at hyp, per
cpu, for the host sve state. The following patch will use this
memory to save/restore the host state.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
In subsequent patches, n/vhe will diverge on saving the host
fpsimd/sve state when taking a guest fpsimd/sve trap. Add a
specialized helper to handle it.
No functional change intended.
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The same traps controlled by CPTR_EL2 or CPACR_EL1 need to be
toggled in different parts of the code, but the exact bits and
their polarity differ between these two formats and the mode
(vhe/nvhe/hvhe).
To reduce the amount of duplicated code and the chance of getting
the wrong bit/polarity or missing a field, abstract the set/clear
of CPTR_EL2 bits behind a helper.
Since (h)VHE is the way of the future, use the CPACR_EL1 format,
which is a subset of the VHE CPTR_EL2, as a reference.
No functional change intended.
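A simplified model of such a helper; the kernel's version is named
differently and checks the runtime mode instead of taking a flag:

#include <stdint.h>

/* CPACR_EL1 layout (the ELx reference format): 0b11 == no trapping. */
#define CPACR_ELX_ZEN	(3ULL << 16)
#define CPACR_ELX_FPEN	(3ULL << 20)

/* nVHE CPTR_EL2 equivalents use the opposite polarity: 1 == trap. */
#define CPTR_EL2_TZ	(1ULL << 8)
#define CPTR_EL2_TFP	(1ULL << 10)

static uint64_t cptr_clear_set(uint64_t cptr, uint64_t clr, uint64_t set,
			       int vhe)
{
	if (vhe)	/* VHE CPTR_EL2 is CPACR_EL1-shaped: apply as-is */
		return (cptr & ~clr) | set;

	/* nVHE: translate, flipping the polarity along the way. */
	if (set & CPACR_ELX_FPEN)
		cptr &= ~CPTR_EL2_TFP;
	if (clr & CPACR_ELX_FPEN)
		cptr |= CPTR_EL2_TFP;
	if (set & CPACR_ELX_ZEN)
		cptr &= ~CPTR_EL2_TZ;
	if (clr & CPACR_ELX_ZEN)
		cptr |= CPTR_EL2_TZ;

	return cptr;
}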
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since the prototypes for __sve_save_state/__sve_restore_state at
hyp were added, the underlying macro has acquired a third
parameter for saving/restoring ffr.
Fix the prototypes to account for the third parameter, and
restore the ffr for the guest since it is saved.
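For reference, the fixed prototypes have the shape below (recalled
from the hyp code, so treat the exact types as approximate); the third
parameter selects whether FFR is transferred along with the Z/P
registers:

void __sve_save_state(void *sve_pffr, u32 *fpsr, int save_ffr);
void __sve_restore_state(void *sve_pffr, u32 *fpsr, int restore_ffr);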
Suggested-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240603122852.3923848-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that the hypervisor is handling the host sve state in
protected mode, it needs to be able to save it.
This reverts commit e66425fc9b ("KVM: arm64: Remove unused
__sve_save_state").
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20240603122852.3923848-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that we expose PAC to NV guests, we can also expose BTI (as
the two are joined at the hip, due to some of the PAC instructions
being landing pads).
While we're at it, also propagate CSV_frac, which requires no
particular emulation.
Fixes: f4f6a95bac ("KVM: arm64: nv: Advertise support for PAuth")
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240528100632.1831995-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
ERETAx can fail in multiple ways:
(1) ELR_EL2 points to lalaland
(2) we get a PAC failure
(3) SPSR_EL2 has the wrong mode
(1) is easy, as we just let the CPU do its thing and deliver an
Instruction Abort. However, (2) and (3) are interesting, because
the PAC failure priority is way below that of the Illegal Execution
State exception.
Which means that if we have detected a PAC failure (and that we have
FPACCOMBINE), we must be careful to give priority to the Illegal
Execution State exception, should one be pending.
Solving this involves hoisting the SPSR calculation earlier and
testing for the IL bit before injecting the FPAC exception.
In the extreme case of an ERETAx returning to an invalid mode *and*
failing its PAC check, we end up with an Instruction Abort (due
to the new PC being mangled by the failed Auth) *and* PSTATE.IL
being set. Which matches the requirements of the architecture.
Whilst we're at it, remove a stale comment that states the obvious
and only confuses the reader.
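A sketch of the resulting ordering; the helpers are illustrative
stand-ins:

#include <stdbool.h>
#include <stdint.h>

#define PSR_IL_BIT	(1ULL << 20)

/* Stand-ins for the real helpers. */
bool psr_mode_is_illegal(uint64_t spsr);
void inject_fpac_exception(void);
void set_pc_and_pstate(uint64_t pc, uint64_t spsr);

void emulate_eretax(uint64_t new_pc, uint64_t spsr_el2, bool pac_failed,
		    bool have_fpaccombine)
{
	uint64_t spsr = spsr_el2;

	/* Compute SPSR (and thus IL) *before* deciding about FPAC. */
	if (psr_mode_is_illegal(spsr))
		spsr |= PSR_IL_BIT;

	/* FPAC loses against a pending Illegal Execution State. */
	if (pac_failed && have_fpaccombine && !(spsr & PSR_IL_BIT)) {
		inject_fpac_exception();
		return;
	}

	/* A failed Auth leaves the PC mangled; the Instruction Abort
	 * is then delivered naturally on the next fetch. */
	set_pc_and_pstate(new_pc, spsr);
}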
Fixes: 213b3d1ea1 ("KVM: arm64: nv: Handle ERETA[AB] instructions")
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240528100632.1831995-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We recently upgraded the view of ESR_EL2 to 64bit, in keeping with
the requirements of the architecture.
However, the AArch32 emulation code was left unaudited, and the
(already dodgy) code that triages whether a trap is spurious or not
(because the condition code failed) broke in a subtle way:
If ESR_EL2.ISS2 is ever non-zero (unlikely, but hey, this is the ARM
architecture we're talking about), the hack that tests the top bits
of ESR_EL2.EC will break in an interesting way.
Instead, use kvm_vcpu_trap_get_class() to obtain the EC, and list
all the possible ECs that can fail a condition code check.
While we're at it, add SMC32 to the list, as it is explicitly listed
as being allowed to trap despite failing a condition code check (as
described in the HCR_EL2.TSC documentation).
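An illustrative version of the triage, with EC values taken from the
Arm ARM (subset shown):

#include <stdbool.h>
#include <stdint.h>

#define ESR_ELX_EC_SHIFT	26
#define ESR_ELX_EC_MASK		(0x3fULL << ESR_ELX_EC_SHIFT)

static uint8_t esr_get_class(uint64_t esr)
{
	/* Extract EC properly instead of shifting the whole register,
	 * so a non-zero ISS2 cannot pollute the result. */
	return (esr & ESR_ELX_EC_MASK) >> ESR_ELX_EC_SHIFT;
}

static bool ec_may_fail_cond_check(uint64_t esr)
{
	switch (esr_get_class(esr)) {
	case 0x03:	/* MCR/MRC to CP15 */
	case 0x04:	/* MCRR/MRRC to CP15 */
	case 0x05:	/* MCR/MRC to CP14 */
	case 0x06:	/* LDC/STC to CP14 */
	case 0x0c:	/* MRRC to CP14 */
	case 0x12:	/* SMC32: may trap despite a failing condition */
		return true;
	default:
		return false;
	}
}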
Fixes: 0b12620fdd ("KVM: arm64: Treat ESR_EL2 as a 64-bit register")
Cc: stable@vger.kernel.org
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240524141956.1450304-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
It appears that we don't allow a vcpu to be restored in AArch32
System mode, as we *never* included it in the list of valid modes.
Just add it to the list of allowed modes.
Fixes: 0d854a60b1 ("arm64: KVM: enable initialization of a 32bit vcpu")
Cc: stable@vger.kernel.org
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240524141956.1450304-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
When userspace writes to one of the core registers, we make
sure to narrow the corresponding GPRs if PSTATE indicates
an AArch32 context.
The code tries to check whether the context is EL0 or EL1 so
that it narrows the correct registers. But it does so by checking
the full PSTATE instead of PSTATE.M.
As a consequence, if we are restoring an AArch32 EL0 context
in a 64bit guest and PSTATE has *any* bit set outside of
PSTATE.M, we narrow *all* registers instead of only the first 15,
destroying the 64bit state.
Obviously, this is not something the guest is likely to enjoy.
Correctly masking PSTATE to only evaluate PSTATE.M fixes it.
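An illustrative model of the corrected check; the narrowing counts are
simplified, the point being that only PSTATE.M is consulted:

#include <stdint.h>

#define PSR_AA32_MODE_MASK	0x0000001fULL	/* PSTATE.M: nRW + M[3:0] */
#define PSR_AA32_MODE_USR	0x00000010ULL

/* Assumes the context was already identified as AArch32. NZCV, DIT
 * and friends must not influence the decision. */
static int gprs_to_narrow(uint64_t pstate)
{
	if ((pstate & PSR_AA32_MODE_MASK) == PSR_AA32_MODE_USR)
		return 15;	/* AArch32 EL0: r0-r14 only */

	return 31;	/* privileged AArch32 modes (simplified) */
}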
Fixes: 90c1f934ed ("KVM: arm64: Get rid of the AArch32 register mapping code")
Reported-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Cc: stable@vger.kernel.org
Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240524141956.1450304-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent code
generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
kconfig: use sym_get_choice_menu() in sym_check_prop()
rapidio: remove choice for enumeration
kconfig: lxdialog: remove initialization with A_NORMAL
kconfig: m/nconf: merge two item_add_str() calls
kconfig: m/nconf: remove dead code to display value of bool choice
kconfig: m/nconf: remove dead code to display children of choice members
kconfig: gconf: show checkbox for choice correctly
kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
Makefile: remove redundant tool coverage variables
kbuild: provide reasonable defaults for tool coverage
modules: Drop the .export_symbol section from the final modules
kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
kconfig: use sym_get_choice_menu() in conf_write_defconfig()
kconfig: add sym_get_choice_menu() helper
kconfig: turn defaults and additional prompt for choice members into error
kconfig: turn missing prompt for choice members into error
kconfig: turn conf_choice() into void function
kconfig: use linked list in sym_set_changed()
kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
kconfig: gconf: remove debug code
...
Now Kbuild provides reasonable defaults for objtool, sanitizers, and
profilers.
Remove redundant variables.
Note:
This commit changes the coverage for some objects:
- include arch/mips/vdso/vdso-image.o into UBSAN, GCOV, KCOV
- include arch/sparc/vdso/vdso-image-*.o into UBSAN
- include arch/sparc/vdso/vma.o into UBSAN
- include arch/x86/entry/vdso/extable.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vdso-image-*.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vdso32-setup.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vma.o into GCOV, KCOV
- include arch/x86/um/vdso/vma.o into KASAN, GCOV, KCOV
I believe these are positive effects because all of them are kernel
space objects.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Roberto Sassu <roberto.sassu@huawei.com>
Merge tag 'kvmarm-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 6.10
- Move a lot of state that was previously stored on a per vcpu
basis into a per-CPU area, because it is only pertinent to the
host while the vcpu is loaded. This results in better state
tracking, and a smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in
nested virtualisation. The last two instructions also require
emulating part of the pointer authentication extension.
As a result, the trap handling of pointer authentication has
been greatly simplified.
- Turn the global (and not very scalable) LPI translation cache
into a per-ITS, scalable cache, making non-directly-injected
LPIs much cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing
for smaller EL2 mapping and some flexibility in implementing
more or less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR
map has been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:
src := $(obj)
When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.
This introduces inconsistency between upstream and downstream Makefiles
because $(src) is used instead of $(srctree)/$(src) for the latter.
To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.
Going forward, the variables used in Makefiles will have the following
meanings:
$(obj) - directory in the object tree
$(src) - directory in the source tree (changed by this commit)
$(objtree) - the top of the kernel object tree
$(srctree) - the top of the kernel source tree
Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
with $(src).
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
* kvm-arm64/mpidr-reset:
: .
: Fixes for CLIDR_EL1 and MPIDR_EL1 being accidentally mutable across
: a vcpu reset, courtesy of Oliver. From the cover letter:
:
: "For VM-wide feature ID registers we ensure they get initialized once for
: the lifetime of a VM. On the other hand, vCPU-local feature ID registers
: get re-initialized on every vCPU reset, potentially clobbering the
: values userspace set up.
:
: MPIDR_EL1 and CLIDR_EL1 are the only registers in this space that we
: allow userspace to modify for now. Clobbering the value of MPIDR_EL1 has
: some disastrous side effects as the compressed index used by the
: MPIDR-to-vCPU lookup table assumes MPIDR_EL1 is immutable after KVM_RUN.
:
: Series + reproducer test case to address the problem of KVM wiping out
: userspace changes to these registers. Note that there are still some
: differences between VM and vCPU scoped feature ID registers from the
: perspective of userspace. We do not allow the value of VM-scope
: registers to change after KVM_RUN, but vCPU registers remain mutable."
: .
KVM: selftests: arm64: Test vCPU-scoped feature ID registers
KVM: selftests: arm64: Test that feature ID regs survive a reset
KVM: selftests: arm64: Store expected register value in set_id_regs
KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
KVM: arm64: Only reset vCPU-scoped feature ID regs once
KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
KVM: arm64: Rename is_id_reg() to imply VM scope
Signed-off-by: Marc Zyngier <maz@kernel.org>
The general expectation with feature ID registers is that they're 'reset'
exactly once by KVM for the lifetime of a vCPU/VM, such that any
userspace changes to the CPU features / identity are honored after a
vCPU gets reset (e.g. PSCI_ON).
KVM handles what it calls VM-scoped feature ID registers correctly, but
feature ID registers local to a vCPU (CLIDR_EL1, MPIDR_EL1) get wiped
after every reset. What's especially concerning is that a
potentially-changing MPIDR_EL1 breaks MPIDR compression for indexing
mpidr_data, as the mask of useful bits to build the index could change.
This is absolutely no good. Avoid resetting vCPU feature ID registers
more than once.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502233529.1958459-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
A subsequent change to KVM will expand the range of feature ID registers
that get special treatment at reset. Fold the existing ones back in to
kvm_reset_sys_regs() to avoid the need for an additional table walk.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502233529.1958459-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The naming of some of the feature ID checks is ambiguous. Rephrase the
is_id_reg() helper to make its purpose slightly clearer.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502233529.1958459-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/misc-6.10:
: .
: Misc fixes and updates targeting 6.10
:
: - Improve boot-time diagnostics when the sysreg tables
: are not correctly sorted
:
: - Allow FFA_MSG_SEND_DIRECT_REQ in the FFA proxy
:
: - Fix duplicate XNX field in the ID_AA64MMFR1_EL1
: writeable mask
:
: - Allocate PPIs and SGIs outside of the vcpu structure, allowing
: for smaller EL2 mapping and some flexibility in implementing
: more or less than 32 private IRQs.
:
: - Use bitmap_gather() instead of its open-coded equivalent
:
: - Make protected mode use hVHE if available
:
: - Purge stale mpidr_data if a vcpu is created after the MPIDR
: map has been created
: .
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: arm64: Convert kvm_mpidr_index() to bitmap_gather()
KVM: arm64: vgic: Allocate private interrupts on demand
KVM: arm64: Remove duplicated AA64MMFR1_EL1 XNX
KVM: arm64: Remove FFA_MSG_SEND_DIRECT_REQ from the denylist
KVM: arm64: Improve out-of-order sysreg table diagnostics
Signed-off-by: Marc Zyngier <maz@kernel.org>
A particularly annoying userspace could create a vCPU after KVM has
computed mpidr_data for the VM, either by racing against VGIC
initialization or having a userspace irqchip.
In any case, this means mpidr_data no longer fully describes the VM, and
attempts to find the new vCPU with kvm_mpidr_to_vcpu() will fail. The
fix is to discard mpidr_data altogether, as it is only a performance
optimization and not required for correctness. In all likelihood KVM
will recompute the mappings when KVM_RUN is called on the new vCPU.
Note that reads of mpidr_data are not guarded by a lock; promote to RCU
to cope with the possibility of mpidr_data being invalidated at runtime.
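The read side then looks roughly like this (helper and field names
recalled from the arm64 code; treat the details as approximate):

#include <linux/kvm_host.h>

static struct kvm_vcpu *mpidr_fast_lookup(struct kvm *kvm,
					  unsigned long mpidr)
{
	struct kvm_mpidr_data *data;
	struct kvm_vcpu *vcpu = NULL;

	rcu_read_lock();
	data = rcu_dereference(kvm->arch.mpidr_data);
	if (data) {
		u16 idx = kvm_mpidr_index(data, mpidr);

		vcpu = kvm_get_vcpu(kvm, data->cmpidr_to_idx[idx]);
	}
	rcu_read_unlock();

	return vcpu;
}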
Fixes: 54a8006d0b ("KVM: arm64: Fast-track kvm_mpidr_to_vcpu() when mpidr_data is available")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240508071952.2035422-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pkvm-6.10: (25 commits)
: .
: At last, a bunch of pKVM patches, courtesy of Fuad Tabba.
: From the cover letter:
:
: "This series is a bit of a bombay-mix of patches we've been
: carrying. There's no one overarching theme, but they do improve
: the code by fixing existing bugs in pKVM, refactoring code to
: make it more readable and easier to re-use for pKVM, or adding
: functionality to the existing pKVM code upstream."
: .
KVM: arm64: Force injection of a data abort on NISV MMIO exit
KVM: arm64: Restrict supported capabilities for protected VMs
KVM: arm64: Refactor setting the return value in kvm_vm_ioctl_enable_cap()
KVM: arm64: Document the KVM/arm64-specific calls in hypercalls.rst
KVM: arm64: Rename firmware pseudo-register documentation file
KVM: arm64: Reformat/beautify PTP hypercall documentation
KVM: arm64: Clarify rationale for ZCR_EL1 value restored on guest exit
KVM: arm64: Introduce and use predicates that check for protected VMs
KVM: arm64: Add is_pkvm_initialized() helper
KVM: arm64: Simplify vgic-v3 hypercalls
KVM: arm64: Move setting the page as dirty out of the critical section
KVM: arm64: Change kvm_handle_mmio_return() return polarity
KVM: arm64: Fix comment for __pkvm_vcpu_init_traps()
KVM: arm64: Prevent kmemleak from accessing .hyp.data
KVM: arm64: Do not map the host fpsimd state to hyp in pKVM
KVM: arm64: Rename __tlb_switch_to_{guest,host}() in VHE
KVM: arm64: Support TLB invalidation in guest context
KVM: arm64: Avoid BBM when changing only s/w bits in Stage-2 PTE
KVM: arm64: Check for PTE validity when checking for executable/cacheable
KVM: arm64: Avoid BUG-ing from the host abort path
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/lpi-xa-cache:
: .
: New and improved LPI translation cache from Oliver Upton.
:
: From the cover letter:
:
: "As discussed [*], here is the new take on the LPI translation cache,
: migrating to an xarray indexed by (devid, eventid) per ITS.
:
: The end result is quite satisfying, as it becomes possible to rip out
: other nasties such as the lpi_list_lock. To that end, patches 2-6 aren't
: _directly_ related to the translation cache cleanup, but instead are
: done to enable the cleanups at the end of the series.
:
: I changed out my test machine from the last time so the baseline has
: moved a bit, but here are the results from the vgic_lpi_stress test:
:
: +----------------------------+------------+-------------------+
: | Configuration | v6.8-rc1 | v6.8-rc1 + series |
: +----------------------------+------------+-------------------+
: | -v 1 -d 1 -e 1 -i 1000000 | 2063296.81 | 1362602.35 |
: | -v 16 -d 16 -e 16 -i 10000 | 610678.33 | 5200910.01 |
: | -v 16 -d 16 -e 17 -i 10000 | 678361.53 | 5890675.51 |
: | -v 32 -d 32 -e 1 -i 100000 | 580918.96 | 8304552.67 |
: | -v 1 -d 1 -e 17 -i 1000 | 1512443.94 | 1425953.8 |
: +----------------------------+------------+-------------------+
:
: Unlike last time, no dramatic regressions at any performance point. The
: regression on a single interrupt stream is to be expected, as the
: overheads of SRCU and two tree traversals (kvm_io_bus_get_dev(),
: translation cache xarray) are likely greater than that of a linked-list
: with a single node."
: .
KVM: selftests: Add stress test for LPI injection
KVM: selftests: Use MPIDR_HWID_BITMASK from cputype.h
KVM: selftests: Add helper for enabling LPIs on a redistributor
KVM: selftests: Add a minimal library for interacting with an ITS
KVM: selftests: Add quadword MMIO accessors
KVM: selftests: Standardise layout of GIC frames
KVM: selftests: Align with kernel's GIC definitions
KVM: arm64: vgic-its: Get rid of the lpi_list_lock
KVM: arm64: vgic-its: Rip out the global translation cache
KVM: arm64: vgic-its: Use the per-ITS translation cache for injection
KVM: arm64: vgic-its: Spin off helper for finding ITS by doorbell addr
KVM: arm64: vgic-its: Maintain a translation cache per ITS
KVM: arm64: vgic-its: Scope translation cache invalidations to an ITS
KVM: arm64: vgic-its: Get rid of vgic_copy_lpi_list()
KVM: arm64: vgic-debug: Use an xarray mark for debug iterator
KVM: arm64: vgic-its: Walk LPI xarray in vgic_its_cmd_handle_movall()
KVM: arm64: vgic-its: Walk LPI xarray in vgic_its_invall()
KVM: arm64: vgic-its: Walk LPI xarray in its_sync_lpi_pending_table()
KVM: Treat the device list as an rculist
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/nv-eret-pauth:
: .
: Add NV support for the ERETAA/ERETAB instructions. From the cover letter:
:
: "Although the current upstream NV support has *some* support for
: correctly emulating ERET, that support is only partial as it doesn't
: support the ERETAA and ERETAB variants.
:
: Supporting these instructions was cast aside for a long time as it
: involves implementing some form of PAuth emulation, something I wasn't
: overly keen on. But I have reached a point where enough of the
: infrastructure is there that it actually makes sense. So here it is!"
: .
KVM: arm64: nv: Work around lack of pauth support in old toolchains
KVM: arm64: Drop trapping of PAuth instructions/keys
KVM: arm64: nv: Advertise support for PAuth
KVM: arm64: nv: Handle ERETA[AB] instructions
KVM: arm64: nv: Add emulation for ERETAx instructions
KVM: arm64: nv: Add kvm_has_pauth() helper
KVM: arm64: nv: Reinject PAC exceptions caused by HCR_EL2.API==0
KVM: arm64: nv: Handle HCR_EL2.{API,APK} independently
KVM: arm64: nv: Honor HFGITR_EL2.ERET being set
KVM: arm64: nv: Fast-track 'InHost' exception returns
KVM: arm64: nv: Add trap forwarding for ERET and SMC
KVM: arm64: nv: Configure HCR_EL2 for FEAT_NV2
KVM: arm64: nv: Drop VCPU_HYP_CONTEXT flag
KVM: arm64: Constraint PAuth support to consistent implementations
KVM: arm64: Add helpers for ESR_ELx_ERET_ISS_ERET*
KVM: arm64: Harden __ctxt_sys_reg() against out-of-range values
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/host_data:
: .
: Rationalise the host-specific data to live as part of the per-CPU state.
:
: From the cover letter:
:
: "It appears that over the years, we have accumulated a lot of cruft in
: the kvm_vcpu_arch structure. Part of the gunk is data that is strictly
: host CPU specific, and this results in two main problems:
:
: - the structure itself is stupidly large, over 8kB. With the
: arch-agnostic kvm_vcpu, we're above 10kB, which is insane. This has
: some ripple effects, as we need physically contiguous allocation to
: be able to map it at EL2 for !VHE. There is more to it though, as
: some data structures, although per-vcpu, could be allocated
: separately.
:
: - We lose track of the life-cycle of this data, because we're
: guaranteed that it will be around forever and we start relying on
: wrong assumptions. This is becoming a maintenance burden.
:
: This series rectifies some of these things, starting with the two main
: offenders: debug and FP, a lot of which gets pushed out to the per-CPU
: host structure. Indeed, their lifetime really isn't that of the vcpu,
: but tied to the physical CPU the vcpu runs on.
:
: This results in a small reduction of the vcpu size, but mainly a much
: clearer understanding of the life-cycle of these structures."
: .
KVM: arm64: Move management of __hyp_running_vcpu to load/put on VHE
KVM: arm64: Exclude FP ownership from kvm_vcpu_arch
KVM: arm64: Exclude host_fpsimd_state pointer from kvm_vcpu_arch
KVM: arm64: Exclude mdcr_el2_host from kvm_vcpu_arch
KVM: arm64: Exclude host_debug_data from vcpu_arch
KVM: arm64: Add accessor for per-CPU state
Signed-off-by: Marc Zyngier <maz@kernel.org>
The per-CPU host context structure contains a __hyp_running_vcpu that
serves as a replacement for kvm_get_running_vcpu() in contexts where
we cannot make direct use of it (such as in the nVHE hypervisor).
Since there is a lot of common code between nVHE and VHE, the latter
also populates this field even if kvm_get_running_vcpu() always works.
We are currently pretty inconsistent when populating __hyp_running_vcpu
to point to the currently running vcpu:
- on {n,h}VHE, we set __hyp_running_vcpu on entry to __kvm_vcpu_run
and clear it on exit.
- on VHE, we set __hyp_running_vcpu on entry to __kvm_vcpu_run_vhe
and never clear it, effectively leaving a dangling pointer...
VHE is obviously the odd one here. Although we could make it behave
just like nVHE, this wouldn't match the behaviour of KVM with VHE,
where the load phase is where most of the context-switch gets done.
So move all the __hyp_running_vcpu management to the VHE-specific
load/put phases, giving us a bit more sanity and matching the
behaviour of kvm_get_running_vcpu().
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502154030.3011995-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Private interrupts are currently part of the CPU interface structure
that is part of each and every vcpu we create.
Currently, we have 32 of them per vcpu, resulting in a per-vcpu array
that is just shy of 4kB. On its own, that's no big deal, but it gets
in the way of other things:
- each vcpu gets mapped at EL2 on nVHE/hVHE configurations. This
requires memory that is physically contiguous. However, the EL2
code has no purpose looking at the interrupt structures and
could do without them being mapped.
- supporting features such as EPPIs, which extend the number of
private interrupts past the 32 limit, would make the array
even larger, even for VMs that do not use the EPPI feature.
Address these issues by moving the private interrupt array outside
of the vcpu, and replace it with a simple pointer. We take this
opportunity to make it obvious what gets initialised when, as
that path was remarkably opaque, and tighten the locking.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502154545.3012089-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
If a vcpu exits for a data abort with an invalid syndrome, the
expectation is that userspace has a chance to save the day if
it has requested to see such exits.
However, this is completely futile in the case of a protected VM,
as none of the state is available. In this particular case, inject
a data abort directly into the vcpu, consistent with what userspace
could do.
This also helps with pKVM, which discards all syndrome information when
forwarding data aborts that are not known to be MMIO.
Finally, document this tweak to the API.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-31-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
For practical reasons as well as security-related ones, not all
capabilities are supported for protected VMs in pKVM.
Add a function that restricts the capabilities for protected VMs.
This behaves as an allow-list to ensure that future capabilities
are checked for compatibility and security before being allowed
for protected VMs.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-30-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Initialize r = -EINVAL to get rid of the error-path
initializations in kvm_vm_ioctl_enable_cap().
No functional change intended.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-29-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Expand the comment clarifying why the host value of the SVE vector
length restored to ZCR_EL1 on guest exit isn't the same as it was on
guest entry.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-21-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to determine whether or not a VM or vcpu are protected,
introduce helpers to query this state. While at it, use the vcpu
helper to check a vcpu's protected state instead of the kvm one.
Co-authored-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-19-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Consolidate the GICv3 VMCR accessor hypercalls into the APR save/restore
hypercalls so that all of the EL2 GICv3 state is covered by a single pair
of hypercalls.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-17-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Move the unlock earlier in user_mem_abort() to shorten the
critical section. This also helps for future refactoring and
reuse of similar code.
This moves marking the page as dirty out of the critical
section. That code does not interact with the stage-2 page
tables, which the read lock in the critical section protects.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-16-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Most exit handlers return <= 0 to indicate that the host needs to
handle the exit. Make kvm_handle_mmio_return() consistent with
the exit handlers in handle_exit(). This makes the code easier to
reason about, and makes it easier to add other handlers in future
patches.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-15-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fix the comment to clarify that __pkvm_vcpu_init_traps()
initializes traps for all VMs in protected mode, and not only
for protected VMs.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-14-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
We've added a .data section for the hypervisor, which kmemleak is
eager to parse. This clearly doesn't go well, so add the section
to kmemleak's block list.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-13-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
pKVM maintains its own state at EL2 for tracking the host fpsimd
state. Therefore, there is no need to map and share the host's view
with it.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-12-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Rename __tlb_switch_to_{guest,host}() to
{enter,exit}_vmid_context() in VHE code to maintain symmetry
between the nVHE and VHE TLB invalidations.
No functional change intended.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-11-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Typically, TLB invalidation of guest stage-2 mappings using nVHE is
performed by a hypercall originating from the host. For the invalidation
instruction to be effective, therefore, __tlb_switch_to_{guest,host}()
swizzle the active stage-2 context around the TLBI instruction.
With guest-to-host memory sharing and unsharing hypercalls
originating from the guest under pKVM, there is a need to support
both guest and host VMID invalidations issued from guest context.
Replace the __tlb_switch_to_{guest,host}() functions with a more general
{enter,exit}_vmid_context() implementation which supports being invoked
from guest context and acts as a no-op if the target context matches the
running context.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-10-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Break-before-make (BBM) can be expensive, as transitioning via an
invalid mapping (i.e. the "break" step) requires the completion of TLB
invalidation and can also cause other agents to fault concurrently on
the invalid mapping.
Since BBM is not required when changing only the software bits of a PTE,
avoid the sequence in this case and just update the PTE directly.
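A sketch of the check enabling that shortcut; the bit positions follow
the arm64 convention of software-defined PTE bits [58:55]:

#include <stdbool.h>
#include <stdint.h>

#define KVM_PTE_VALID		(1ULL << 0)
#define KVM_PTE_SW_BITS		(0xfULL << 55)	/* bits [58:55] */

/* BBM is only needed when a *hardware-visible* attribute changes on a
 * valid PTE; a diff confined to software bits can be written directly. */
static bool stage2_pte_needs_bbm(uint64_t old, uint64_t new)
{
	if (!(old & KVM_PTE_VALID))
		return false;	/* nothing to break */

	return ((old ^ new) & ~KVM_PTE_SW_BITS) != 0;
}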
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-9-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Don't just assume that the PTE is valid when checking whether it
describes an executable or cacheable mapping.
This makes sure that we don't issue CMOs for invalid mappings.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Under certain circumstances __get_fault_info() may resolve the faulting
address using the AT instruction. Given that this is being done outside
of the host lock critical section, it is racy and the resolution via AT
may fail. We currently BUG() in this situation, which is obviously less
than ideal. Moving the address resolution to the critical section may
have a performance impact, so let's keep it where it is, but bail out
and return to the host to try a second time.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-7-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
On the guest teardown path, pKVM will zero the pages used to back
the guest data structures before returning them to the host as
they may contain secrets (e.g. in the vCPU registers). However,
the zeroing is done using a cacheable alias, and CMOs are
missing, hence giving the host a potential opportunity to read
the original content of the guest structs from memory.
Fix this by issuing CMOs after zeroing the pages.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The lock is already initialized in core KVM code at
kvm_create_vm().
Fixes: 9d0c063a4d ("KVM: arm64: Instantiate pKVM hypervisor VM and vCPU structures from EL1")
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
To avoid direct comparison against the fp_owner enum, add a new
function that performs the check, host_owns_fp_regs(), to
complement the existing guest_owns_fp_regs().
To check for fpsimd state ownership, use the helpers instead of
directly using the enums.
No functional change intended.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
guest_owns_fp_regs() will be used to check fpsimd state ownership
across kvm/arm64. Therefore, move it to kvm_host.h to widen its
scope.
Moreover, since the host state is not per-vcpu anymore, the vcpu
parameter isn't used; remove it as well.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since the host_fpsimd_state has been removed from kvm_vcpu_arch,
it isn't pointing to the hyp's version of the host fp_regs in
protected mode.
Initialize the host_data fpsimd_state to point to the host_data's
context fp_regs on pKVM initialization.
Fixes: 51e09b5572 ("KVM: arm64: Exclude host_fpsimd_state pointer from kvm_vcpu_arch")
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Commit d5a32b60dc ("KVM: arm64: Allow userspace to change
ID_AA64MMFR{0-2}_EL1") made certain fields in these registers writable,
but in doing so, ID_AA64MMFR1_EL1_XNX was listed twice. Remove the
duplication.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Zenghui Yu <zenghui.yu@linux.dev>
Link: https://lore.kernel.org/r/E1s2AxF-00AWLv-03@rmk-PC.armlinux.org.uk
Signed-off-by: Marc Zyngier <maz@kernel.org>
The last genuine use case for the lpi_list_lock was the global LPI
translation cache, which has been removed in favor of a per-ITS xarray.
Remove a layer from the locking puzzle by getting rid of it.
vgic_add_lpi() still has a critical section that needs to protect
against the insertion of other LPIs; change it to take the LPI xarray's
xa_lock to retain this property.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Everything is in place to switch to per-ITS translation caches. Start
using the per-ITS cache to avoid the lock serialization related to the
global translation cache. Explicitly check for out-of-range device and
event IDs as the cache index is packed based on the range the ITS
actually supports.
Take the RCU read lock to protect against the returned descriptor being
freed while trying to take a reference on it, as it is no longer
necessary to acquire the lpi_list_lock.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-11-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The fast path will soon need to find an ITS by doorbell address, as the
translation caches will become local to an ITS. Spin off a helper to do
just that.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-10-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Within the context of a single ITS, it is possible to use an xarray to
cache the device ID & event ID translation to a particular irq
descriptor. Take advantage of this to build a translation cache capable
of fitting all valid translations for a given ITS.
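A sketch of the index packing: with the event ID in the low bits,
every (devid, eventid) pair the ITS can generate maps to a unique
xarray index, provided out-of-range IDs are rejected first:

#include <stdint.h>

/* devid/eventid widths come from the ITS's configuration; here they
 * are plain parameters for illustration. */
static uint64_t cache_index(uint32_t devid, uint32_t eventid,
			    unsigned int event_id_bits)
{
	return ((uint64_t)devid << event_id_bits) | eventid;
}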
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
As the current LPI translation cache is global, the corresponding
invalidation helpers are also globally-scoped. In anticipation of
constructing a translation cache per ITS, add a helper for scoped cache
invalidations.
We still need to support global invalidations when LPIs are toggled on
a redistributor, as a property of the translation cache is that all
stored LPIs are known to be deliverable.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-8-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The last user has been transitioned to walking the LPI xarray directly.
Cut the wart off, and get rid of the now unneeded lpi_count while doing
so.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The vgic debug iterator is the final user of vgic_copy_lpi_list(), but
is a bit more complicated to transition to something else. Use a mark
in the LPI xarray to record the indices 'known' to the debug iterator.
Protect against the LPIs from being freed by associating an additional
reference with the xarray mark.
Rework iter_next() to let the xarray walk 'drive' the iteration after
visiting all of the SGIs, PPIs, and SPIs.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in vgic_its_cmd_handle_movall().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in vgic_its_invall().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>