Currently AMD does not support
IOMMU_HWPT_ALLOC_PASID | IOMMU_HWPT_ALLOC_DIRTY_TRACKING
It should be rejected. Instead it creates a V1 domain without dirty
tracking support.
Use a switch to fully decode the flags.
Fixes: ce2cd17546 ("iommu/amd: Enhance amd_iommu_domain_alloc_user()")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The only thing that uses the nid is the io_pgtable code, and it should be
set before calling alloc_io_pgtable_ops() to ensure that the top levels
are allocated on the correct nid.
Since dev is never NULL now we can just do this trivially and remove the
other uses of nid. SVA and identity code paths never use it since they
don't use io_pgtable.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently it uses enum io_pgtable_fmt which is from the io pagetable code
and most of the enum values are invalid. protection_domain_mode is
internal the driver and has the only two valid values.
Fix some signatures and variables to use the right type as well.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
do_iommu_domain_alloc() is only called from
amd_iommu_domain_alloc_paging_flags() so type is always
IOMMU_DOMAIN_UNMANAGED. Remove type and all the dead conditionals checking
it.
IOMMU_DOMAIN_IDENTITY checks are similarly obsolete as the conversion to
the global static identity domain removed those call paths.
The caller of protection_domain_alloc() should set the type, fix the miss
in the SVA code.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This is no longer possible, amd_iommu_domain_alloc_paging_flags() is never
called with dev = NULL from the core code. Similarly
get_amd_iommu_from_dev() can never be NULL either.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
IOMMU drivers should not be sensitive to the domain type, a paging domain
should be created based only on the flags passed in, the same for all
callers.
AMD was using the domain_alloc() path to force VFIO into a v1 domain type,
because v1 gives higher performance. However now that
IOMMU_HWPT_ALLOC_PASID is present, and a NULL device is not possible,
domain_alloc_paging_flags() will do the right thing for VFIO.
When invoked from VFIO flags will be 0 and the amd_iommu_pgtable type of
domain will be selected. This is v1 by default unless the kernel command
line has overridden it to v2.
If the admin is forcing v2 assume they know what they are doing so force
it everywhere, including for VFIO.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
All the callers have been removed by the below commit, remove the
implementation and prototypes.
Fixes: 322d889ae7 ("iommu/amd: Remove amd_iommu_domain_update() from page table freeing")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When __BITS_PER_LONG == 32, size_t is defined as unsigned int rather
than unsigned long. Therefore, we should use size_t to avoid
type-checking errors.
Fixes: 488ffbf181 ("iommu/riscv: Paging domain support")
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Tomasz Jeznach <tjeznach@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Link: https://lore.kernel.org/r/20250103024616.3359159-1-guoren@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Instead of marking individual interrupts as safe to be migrated in
arbitrary contexts, mark the interrupt chips, which require the interrupt
to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED
flag. This makes more sense because this is a per interrupt chip property
and not restricted to individual interrupts.
That flips the logic from the historical opt-out to a opt-in model. This is
simpler to handle for other architectures, which default to unrestricted
affinity setting. It also allows to cleanup the redundant core logic
significantly.
All interrupt chips, which belong to a top-level domain sitting directly on
top of the x86 vector domain are marked accordingly, unless the related
setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
Reorder the existing OBJ/IOCTL lists.
Also run clang-format for the same coding style at line wrappings.
No functional change.
Link: https://patch.msgid.link/r/c5e6d6e0a0bb7abc92ad26937fde19c9426bee96.1736237481.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Resolve a UBSAN shift-out-of-bounds issue in iova_bitmap_offset_to_index()
where shifting the constant "1" (of type int) by bitmap->mapped.pgshift
(an unsigned long value) could result in undefined behavior.
The constant "1" defaults to a 32-bit "int", and when "pgshift" exceeds
31 (e.g., pgshift = 63) the shift operation overflows, as the result
cannot be represented in a 32-bit type.
To resolve this, the constant is updated to "1UL", promoting it to an
unsigned long type to match the operand's type.
Fixes: 58ccf0190d ("vfio: Add an IOVA bitmap support")
Link: https://patch.msgid.link/r/20250113223820.10713-1-qasdev00@gmail.com
Reported-by: syzbot <syzbot+85992ace37d5b7b51635@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=85992ace37d5b7b51635
Signed-off-by: Qasim Ijaz <qasdev00@gmail.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Fix an issue detected by syzbot:
WARNING in iommufd_device_unbind iommufd: Time out waiting for iommufd object to become free
Resolve a warning in iommufd_device_unbind caused by a timeout while
waiting for the shortterm_users reference count to reach zero. The
existing 10-second timeout is insufficient in some scenarios, resulting in
failures the above warning.
Increase the timeout in iommufd_object_dec_wait_shortterm from 10 seconds
to 60 seconds to allow sufficient time for the reference count to drop to
zero. This change prevents premature timeouts and reduces the likelihood
of warnings during iommufd_device_unbind.
Fixes: 6f9c4d8c46 ("iommufd: Do not UAF during iommufd_put_object()")
Link: https://patch.msgid.link/r/20241123195900.3176-1-surajsonawane0215@gmail.com
Reported-by: syzbot+c92878e123785b1fa2db@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c92878e123785b1fa2db
Tested-by: syzbot+c92878e123785b1fa2db@syzkaller.appspotmail.com
Signed-off-by: Suraj Sonawane <surajsonawane0215@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Commit f2c77f6e41 ("iommu/arm-smmu-v3: Use str_read_write helper w/
logs") introduced a call to str_read_write() in the SMMUv3 driver but
without an explicit #include of <linux/string_choices.h>. This breaks
the build for custom configurations where CONFIG_ACPI=n:
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1909:4: error: call to
undeclared function 'str_read_write'; ISO C99 and later do not
support implicit function declarations [-Wimplicit-function-declaration]
1909 | str_read_write(evt->read),
| ^
Add the missing #include.
Link: https://lore.kernel.org/r/d07e82a4-2880-4ae3-961b-471bfa7ac6c4@samsung.com
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Fixes: f2c77f6e41 ("iommu/arm-smmu-v3: Use str_read_write helper w/ logs")
Signed-off-by: Will Deacon <will@kernel.org>
Current code gets the APIC IDs for CPUs numbered 255 and lower.
This code assumes cpu_possible_mask is dense, which is not true in
the general case per [1]. If cpu_possible_mask contains holes,
num_possible_cpus() is less than nr_cpu_ids, so some CPUs might get
skipped. Furthermore, getting the APIC ID of a CPU that isn't in
cpu_possible_mask is invalid.
However, the configurations that Hyper-V provides to guest VMs on x86
hardware, in combination with how x86 code assigns Linux CPU numbers,
*does* always produce a dense cpu_possible_mask. So the dense assumption
is not currently causing failures. But for robustness against future
changes in how cpu_possible_mask is populated, update the code to no
longer assume dense.
The correct approach is to determine the range to scan based on
nr_cpu_ids, and skip any CPUs that are not in the cpu_possible_mask.
[1] https://lore.kernel.org/lkml/SN6PR02MB4157210CC36B2593F8572E5ED4692@SN6PR02MB4157.namprd02.prod.outlook.com/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241003035333.49261-4-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241003035333.49261-4-mhklinux@outlook.com>
Add an io-pgtable method to walk the pgtable returning the raw PTEs that
would be traversed for a given iova access.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241210165127.600817-4-robdclark@gmail.com
[will: Removed 'arm_lpae_io_pgtable_walk_data::level' per Mostafa]
Signed-off-by: Will Deacon <will@kernel.org>
We can re-use this basic pgtable walk logic in a few places.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241210165127.600817-2-robdclark@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
Add ACTLR data table for qcom_smmu_500 including corresponding data
entry and set prefetch value by way of a list of compatible strings.
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Bibek Kumar Patro <quic_bibekkum@quicinc.com>
Link: https://lore.kernel.org/r/20241212151402.159102-6-quic_bibekkum@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Currently in Qualcomm SoCs the default prefetch is set to 1 which allows
the TLB to fetch just the next page table. MMU-500 features ACTLR
register which is implementation defined and is used for Qualcomm SoCs
to have a custom prefetch setting enabling TLB to prefetch the next set
of page tables accordingly allowing for faster translations.
ACTLR value is unique for each SMR (Stream matching register) and stored
in a pre-populated table. This value is set to the register during
context bank initialisation.
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Bibek Kumar Patro <quic_bibekkum@quicinc.com>
Link: https://lore.kernel.org/r/20241212151402.159102-5-quic_bibekkum@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Add an adreno-smmu-priv interface for drm/msm to call into arm-smmu-qcom
and initiate the "Partially Resident Region" (PRR) bit setup or reset
sequence as per request.
This will be used by GPU to setup the PRR bit and related configuration
registers through adreno-smmu private interface instead of directly
poking the smmu hardware.
Suggested-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Bibek Kumar Patro <quic_bibekkum@quicinc.com>
Link: https://lore.kernel.org/r/20241212151402.159102-4-quic_bibekkum@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
qcom_smmu_match_data is static and constant so refactor qcom_smmu to
store single pointer to qcom_smmu_match_data instead of replicating
multiple child members of the same and handle the further dereferences
in the places that want them.
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Reviewed-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Bibek Kumar Patro <quic_bibekkum@quicinc.com>
Link: https://lore.kernel.org/r/20241212151402.159102-3-quic_bibekkum@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Default MMU-500 reset operation disables context caching in prefetch
buffer. It is however expected for context banks using the ACTLR
register to retain their prefetch value during reset and runtime
suspend.
Add config 'ARM_SMMU_MMU_500_CPRE_ERRATA' to gate this errata workaround
in default MMU-500 reset operation which defaults to 'Y' and provide
option to disable workaround for context caching in prefetch buffer as
and when needed.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Bibek Kumar Patro <quic_bibekkum@quicinc.com>
Link: https://lore.kernel.org/r/20241212151402.159102-2-quic_bibekkum@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Cache tag invalidation requests for a domain are accumulated until a
different iommu unit is found when traversing the cache_tags linked list.
But cache tags of same iommu unit can be distributed in the linked list,
this make batched flush less efficient. E.g., one device backed by iommu0
is attached to a domain in between two devices attaching backed by iommu1.
Group cache tags together for same iommu unit in cache_tag_assign() to
maximize the performance of batched flush.
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/r/20241219054358.8654-1-zhenzhong.duan@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When a device uses a PASID for SVA (Shared Virtual Address), it's possible
that the PASID entry is marked as non-present and FPD bit set before the
device flushes all ongoing DMA requests and removes the SVA domain. This
can occur when an exception happens and the process terminates before the
device driver stops DMA and calls the iommu driver to unbind the PASID.
There's no need to drain the PRQ in the mm release path. Instead, the PRQ
will be drained in the SVA unbind path. But in such case,
intel_pasid_tear_down_entry() only checks the presence of the pasid entry
and returns directly.
Add the code to clear the FPD bit and drain the PRQ.
Fixes: c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain removed from RID")
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241217024240.139615-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The capability audit code was introduced by commit <ad3d19029979>
"iommu/vt-d: Audit IOMMU Capabilities and add helper functions", aiming
to verify the consistency of capabilities across all IOMMUs for supported
features.
Nowadays, all the kAPIs of the iommu subsystem have evolved to be device
oriented, in preparation for supporting heterogeneous IOMMU architectures.
There is no longer a need to require capability consistence among IOMMUs
for any feature.
Remove the iommu cap audit code to make the driver align with the design
in the iommu core.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241216071828.22962-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This is duplicated by intel_iommu_domain_alloc_paging_flags(), just remove
it.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/0-v1-b101d00c5ee5+17645-vtd_paging_flags_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is a WARN_ON_ONCE to catch an unlikely situation when
domain_remove_dev_pasid can't find the `pasid`. In case it nevertheless
happens we must avoid using a NULL pointer.
Signed-off-by: Kees Bakker <kees@ijzerbout.nl>
Link: https://lore.kernel.org/r/20241218201048.E544818E57E@bout3.ijzerbout.nl
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The return value of amd_iommu_detect is not used, so remove it and
is consistent with other iommu detect functions.
Signed-off-by: Gao Shiyuan <gaoshiyuan@baidu.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250103165808.80939-1-gaoshiyuan@baidu.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Since commit 7ef9651e97 ("clk: Provide new devm_clk helpers for prepared
and enabled clocks"), devm_clk_get() and clk_prepare() can now be replaced
by devm_clk_get_prepared() when driver prepares the clocks for the whole
lifetime of the device. Moreover, it is no longer necessary to unprepare
the clocks explicitly.
Signed-off-by: Zhang Heng <zhangheng@kylinos.cn>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20250103113059.463033-1-zhangheng@kylinos.cn
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This commit supplies shutdown callback for iommu driver. The shutdown
callback resets necessary registers so that newly booted kernel can pass
riscv_iommu_init_check() after kexec. Also, the shutdown callback resets
iommu mode to bare instead of off so that new kernel can still use PCIE
devices even when CONFIG_RISCV_IOMMU is not enabled.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
Link: https://lore.kernel.org/r/20250103093220.38106-3-luxu.kernel@bytedance.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Changing cqen/fqen/pqen from 0 to 1 sets the cqh/fqt/pqt registers to 0.
But the cqt/fqh/pqh registers are left unmodified. This commit resets
cqt/fqh/pqh registers to ensure corresponding queues are empty before
being enabled during initialization.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
Link: https://lore.kernel.org/r/20250103093220.38106-2-luxu.kernel@bytedance.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The hardware limitation "max=19" actually comes from SMMU Command Queue.
So, it'd be more natural for tegra241-cmdqv driver to read it out rather
than hardcoding it itself.
This is not an issue yet for a kernel on a baremetal system, but a guest
kernel setting the queue base/size in form of IPA/gPA might result in a
noncontiguous queue in the physical address space, if underlying physical
pages backing up the guest RAM aren't contiguous entirely: e.g. 2MB-page
backed guest RAM cannot guarantee a contiguous queue if it is 8MB (capped
to VCMDQ_LOG2SIZE_MAX=19). This might lead to command errors when HW does
linear-read from a noncontiguous queue memory.
Adding this extra IDR1.CMDQS cap (in the guest kernel) allows VMM to set
SMMU's IDR1.CMDQS=17 for the case mentioned above, so a guest-level queue
will be capped to maximum 2MB, ensuring a contiguous queue memory.
Fixes: a3799717b8 ("iommu/tegra241-cmdqv: Fix alignment failure at max_n_shift")
Reported-by: Ian Kalinowski <ikalinowski@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20241219051421.1850267-1-nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The newly introduced arm_lpae_concat_mandatory() function reads the
ias/oas fields from the 'io_pgtable_cfg' copy embedded inside the
'arm_lpae_io_pgtable' structure. However, this copy is not set until
later in alloc_io_pgtable_ops() after the alloc() function has been
called.
Use the address sizes passed in the 'io_pgtable_cfg' structure when
deciding whether or not to concatenate the PGD.
Fixes: 4dcac8407f ("iommu/io-pgtable-arm: Fix stage-2 concatenation with 16K")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241215200412.561400-1-smostafa@google.com
Signed-off-by: Will Deacon <will@kernel.org>
The iommu drivers that supports PASID have supported attaching pasid to the
blocked_domain, hence remove the remove_dev_pasid op from the iommu_ops.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-8-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The blocked domain can be extended to park PASID of a device to be the
DMA blocking state. By this the remove_dev_pasid() op is dropped.
Remove PASID from old domain and device GCR3 table. No need to attach
PASID to the blocked domain as clearing PASID from GCR3 table will make
sure all DMAs for that PASID are blocked.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-7-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The blocked domain can be extended to park PASID of a device to be the
DMA blocking state. By this the remove_dev_pasid() op is dropped.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-6-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The blocked domain is used to park RID to be blocking DMA state. This
can be extended to PASID as well. By this, the remove_dev_pasid() op
of ARM SMMUv3 can be dropped.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-5-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu drivers are on the way to detach pasid by attaching to the blocked
domain. However, this cannot be done in one shot. During the transition, iommu
core would select between the remove_dev_pasid op and the blocked domain.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-4-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add a wrapper for the ops->remove_dev_pasid, this consolidates the iommu_ops
fetching and callback invoking. It is also a preparation for starting the
transition from using remove_dev_pasid op to detach pasid to the way using
blocked_domain to detach pasid.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-3-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
driver should implement both set_dev_pasid and remove_dev_pasid op, otherwise
it is a problem how to detach pasid. In reality, it is impossible that an
iommu driver implements set_dev_pasid() but no remove_dev_pasid() op. However,
it is better to check it.
Move the group check to be the first as dev_iommu_ops() may fail when there
is no valid group. Also take the chance to remove the dev_has_iommu() check
as it is duplicated to the group check.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-2-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When updating only within a 64-bit tuple of a DTE, just lock the DTE and
use WRITE_ONCE() because it is writing to memory read back by HW.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-9-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
By reusing the make_clear_dte() and update_dte256().
Also, there is no need to set TV bit for non-SNP system when clearing DTE
for blocked domain, and no longer need to apply erratum 63 in clear_dte()
since it is already stored in struct ivhd_dte_flags and apply in
set_dte_entry().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-8-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
And use it in clone_alias() along with update_dte256().
Also use get_dte256() in dump_dte_entry().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-7-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Also, the set_dte_entry() is used to program several DTE fields (e.g.
stage1 table, stage2 table, domain id, and etc.), which is difficult
to keep track with current implementation.
Therefore, separate logic for clearing DTE (i.e. make_clear_dte) and
another function for setting up the GCR3 Table Root Pointer, GIOV, GV,
GLX, and GuestPagingMode into another function set_dte_gcr3_table().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-6-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current implementation does not follow 128-bit write requirement
to update DTE as specified in the AMD I/O Virtualization Techonology
(IOMMU) Specification.
Therefore, modify the struct dev_table_entry to contain union of u128 data
array, and introduce a helper functions update_dte256() to update DTE using
two 128-bit cmpxchg operations to update 256-bit DTE with the modified
structure, and take into account the DTE[V, GV] bits when programming
the DTE to ensure proper order of DTE programming and flushing.
In addition, introduce a per-DTE spin_lock struct dev_data.dte_lock to
provide synchronization when updating the DTE to prevent cmpxchg128
failure.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-5-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
During early initialization, the driver parses IVRS IVHD block to get list
of downstream devices along with their DTE flags (i.e INITPass, EIntPass,
NMIPass, SysMgt, Lint0Pass, Lint1Pass). This information is currently
store in the device DTE, and needs to be preserved when clearing
and configuring each DTE, which makes it difficult to manage.
Introduce struct ivhd_dte_flags to store IVHD DTE settings for a device or
range of devices, which are stored in the amd_ivhd_dev_flags_list during
initial IVHD parsing.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-4-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
According to the AMD IOMMU spec, IOMMU hardware reads the entire DTE
in a single 256-bit transaction. It is recommended to update DTE using
128-bit operation followed by an INVALIDATE_DEVTAB_ENTYRY command when
the IV=1b or V=1b before the change.
According to the AMD BIOS and Kernel Developer's Guide (BDKG) dated back
to family 10h Processor [1], which is the first introduction of AMD IOMMU,
AMD processor always has CPUID Fn0000_0001_ECX[CMPXCHG16B]=1.
Therefore, it is safe to assume cmpxchg128 is available with all AMD
processor w/ IOMMU.
In addition, the CMPXCHG16B feature has already been checked separately
before enabling the GA, XT, and GAM modes. Consolidate the detection logic,
and fail the IOMMU initialization if the feature is not supported.
[1] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/31116.pdf
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241118054937.5203-3-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When a PASID is used for SVA by a device, it's possible that the PASID
entry is cleared before the device flushes all ongoing DMA requests and
removes the SVA domain. This can occur when an exception happens and the
process terminates before the device driver stops DMA and calls the
iommu driver to unbind the PASID.
There's no need to drain the PRQ in the mm release path. Instead, the PRQ
will be drained in the SVA unbind path.
Unfortunately, commit c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain
removed from RID") changed this behavior by unconditionally draining the
PRQ in intel_pasid_tear_down_entry(). This can lead to a potential
sleeping-in-atomic-context issue.
Smatch static checker warning:
drivers/iommu/intel/prq.c:95 intel_iommu_drain_pasid_prq()
warn: sleeping in atomic context
To avoid this issue, prevent draining the PRQ in the SVA mm release path
and restore the previous behavior.
Fixes: c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain removed from RID")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-iommu/c5187676-2fa2-4e29-94e0-4a279dc88b49@stanley.mountain/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241212021529.1104745-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The qi_batch is allocated when assigning cache tag for a domain. While
for nested parent domain, it is missed. Hence, when trying to map pages
to the nested parent, NULL dereference occurred. Also, there is potential
memleak since there is no lock around domain->qi_batch allocation.
To solve it, add a helper for qi_batch allocation, and call it in both
the __cache_tag_assign_domain() and __cache_tag_assign_parent_domain().
BUG: kernel NULL pointer dereference, address: 0000000000000200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 8104795067 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 223 UID: 0 PID: 4357 Comm: qemu-system-x86 Not tainted 6.13.0-rc1-00028-g4b50c3c3b998-dirty #2632
Call Trace:
? __die+0x24/0x70
? page_fault_oops+0x80/0x150
? do_user_addr_fault+0x63/0x7b0
? exc_page_fault+0x7c/0x220
? asm_exc_page_fault+0x26/0x30
? cache_tag_flush_range_np+0x13c/0x260
intel_iommu_iotlb_sync_map+0x1a/0x30
iommu_map+0x61/0xf0
batch_to_domain+0x188/0x250
iopt_area_fill_domains+0x125/0x320
? rcu_is_watching+0x11/0x50
iopt_map_pages+0x63/0x100
iopt_map_common.isra.0+0xa7/0x190
iopt_map_user_pages+0x6a/0x80
iommufd_ioas_map+0xcd/0x1d0
iommufd_fops_ioctl+0x118/0x1c0
__x64_sys_ioctl+0x93/0xc0
do_syscall_64+0x71/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 705c1cdf1e ("iommu/vt-d: Introduce batched cache invalidation")
Cc: stable@vger.kernel.org
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241210130322.17175-1-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.
This issue really shows up when multiple VFs from different PFs
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:
BUG: kernel NULL pointer dereference, address: 0000000000000014
PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2
RIP: 0010:cache_tag_flush_range+0x9b/0x250
Call Trace:
<TASK>
? __die+0x1f/0x60
? page_fault_oops+0x163/0x590
? exc_page_fault+0x72/0x190
? asm_exc_page_fault+0x22/0x30
? cache_tag_flush_range+0x9b/0x250
? cache_tag_flush_range+0x5d/0x250
intel_iommu_tlb_sync+0x29/0x40
intel_iommu_unmap_pages+0xfe/0x160
__iommu_unmap+0xd8/0x1a0
vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1]
vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1]
Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix
it.
Fixes: 3b1d9e2b2d ("iommu/vt-d: Add cache tag assignment interface")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241129020506.576413-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
- Use raw_smp_processor_id() when balancing traffic for NVIDIA's custom
command queue implementation.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmdaJS8QHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNBkkB/97jVNBdS6nZ+LYZ8Tt7hGLnoO5Wo9BPcxA
lKfqrHz/WanPrRobiF4M5UNN01ITnjcWeVsjxWqox8liE9AwXftbpbnP6qQ6bEgI
ZpUNcAhF/CBRcAmJpCwHfoDvpOg5dm9PmuUEhAitD4R3w+I0Yygdm+fLmrsrG7/e
fx6M52b1J++BK7MXAUAijIi1TG5yVVuHIku+HkOdMurSAwxiH9Shwtib1RK8w/63
O2DVjJNfbM35xXFduIPSzbOqTComC/j1dd3mvbdCsumPKr3UzSbkWMZ3q3OiwOyU
BNVXpvx1ymjv3iI7kFuEW1v1ykUiimZhzbdAKaT3Dyp46fkhX1pu
=7lXv
-----END PGP SIGNATURE-----
Merge tag 'arm-smmu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into fixes
Arm SMMU fixes for 6.13-rc
- Use raw_smp_processor_id() when balancing traffic for NVIDIA's custom
command queue implementation.
IOMMU_HWPT_FAULT_ID_VALID is used to mark if the fault_id field of
iommu_hwp_alloc is valid or not. As the fault_id field is handled in
the iommufd core, so it makes sense to sanitize the
IOMMU_HWPT_FAULT_ID_VALID flag in the iommufd core, and mask it out
before passing the user flags to the iommu drivers.
Link: https://patch.msgid.link/r/20241207120108.5640-1-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since this implements domain_alloc_paging_flags() it only needs one
op. Fold mock_domain_alloc_paging() into mock_domain_alloc_paging_flags().
Link: https://patch.msgid.link/r/0-v1-8a3e7e21ff6a+1745d-iommufd_paging_flags_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add an assertion to all the iteration points that don't obviously
have the lock held already. These all take the locker higher in their
call chains.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/2-v1-3b9edcf8067d+3975-amd_dev_list_locking_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The list domain->dev_list is protected by the domain->lock spinlock.
Any iteration, addition or removal must be under the lock.
Move the list_del() up into the critical section. pdom_is_sva_capable(),
and destroy_gcr3_table() do not interact with the list element.
Wrap the list_add() in a lock, it would make more sense if this was under
the same critical section as adjusting the refcounts earlier, but that
requires more complications.
Fixes: d6b47dec36 ("iommu/amd: Reduce domain lock scope in attach device path")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/1-v1-3b9edcf8067d+3975-amd_dev_list_locking_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Run selftests with different OAS values intead of hardcoding it to 48
bits.
We always keep OAS >= IAS to make the config valid for stage-2.
This can be further improved, if we split IAS/OAS configuration for
stage-1 and stage-2 (to use input sizes compatible with VA_BITS as
SMMUv3 does, or IAS > OAS which is valid for stage-1).
However, that adds more complexity, and the current change improves
coverage and makes it possible to test all concatenation cases.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241202140604.422235-3-smostafa@google.com
Signed-off-by: Will Deacon <will@kernel.org>
At the moment, io-pgtable-arm uses concatenation only if it is
possible at level 0, which misses a case where concatenation is
mandatory at level 1 according to R_SRKBC in Arm spec DDI0487 K.a.
Also, that means concatenation can be used when not mandated,
contradicting the comment on the code. However, these cases can only
happen if the SMMUv3 driver is changed to use ias != oas for stage-2.
This patch re-writes the code to use concatenation only if mandatory,
fixing the missing case for level-1 and granule 16K with PA = 40 bits.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241202140604.422235-2-smostafa@google.com
Signed-off-by: Will Deacon <will@kernel.org>
During boot some of the calls to tegra241_cmdqv_get_cmdq() will happen
in preemptible context. As this function calls smp_processor_id(), if
CONFIG_DEBUG_PREEMPT is enabled, these calls will trigger a series of
"BUG: using smp_processor_id() in preemptible" backtraces.
As tegra241_cmdqv_get_cmdq() only calls smp_processor_id() to use the
CPU number as a factor to balance out traffic on cmdq usage, it is safe
to use raw_smp_processor_id() here.
Cc: <stable@vger.kernel.org>
Fixes: 918eb5c856 ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/Z1L1mja3nXzsJ0Pk@uudg.org
Signed-off-by: Will Deacon <will@kernel.org>
arm_smmu_domain_alloc_paging_flags() with a flags = 0 now does the same
thing as arm_smmu_domain_alloc_paging(), remove
arm_smmu_domain_alloc_paging().
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v1-0bb8d5313a27+27b-smmuv3_paging_flags_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The selection of S1/S2 is a bit indirect today, make
domain_alloc_paging_flags() directly decode the flags and select the
correct S1/S2 type.
Directly reject flag combinations the HW doesn't support when processing
the flags.
Fix missing rejection of some flag combinations that are not supported
today (ie NEST_PARENT | DIRTY_TRACKING) by using a switch statement to
list out exactly the combinations that are currently supported.
Move the determination of the stage out of arm_smmu_domain_finalise() and
into both callers. As today the default stage is S1 if supported in HW.
This makes arm_smmu_domain_alloc_paging_flags() self contained and no
longer calling arm_smmu_domain_alloc_paging().
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v1-0bb8d5313a27+27b-smmuv3_paging_flags_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Domains are now always finalized during allocation because the core code
no longer permits a NULL dev argument to domain_alloc_paging/_flags().
Remove the late finalize during attach that supported domains that were
not fully initialized.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v1-0bb8d5313a27+27b-smmuv3_paging_flags_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Process pagetables may now be using new permission-indirection-based
features which an SMMU may not understand when given such a table for
SVA. Although SMMUv3.4 does add its own S1PIE feature, realistically
we're still going to have to cope with feature mismatches between CPUs
and SMMUs, so let's start simple and essentially just document the
expectations for what falls out as-is. Although it seems unlikely for
SVA applications to also depend on memory-hardening features, or
vice-versa, the relative lifecycles make it tricky to enforce mutual
exclusivity. Thankfully our PIE index allocation makes it relatively
benign for an SMMU to keep interpreting them as direct permissions, the
only real implication is that an SVA application cannot harden itself
against its own devices with these features. Thus, inform the user about
that just in case they have other expectations.
Also we don't (yet) support LPA2, so deny SVA entirely if we're going to
misunderstand the pagetable format altogether.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/68a37b00a720f0827cac0e4f40e4d3a688924054.1733406275.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Since iommu_fwspec_init() absorbed the basic driver probe deferral
check to wait for an IOMMU to register, we may as well handle the probe
deferral timeout there as well. The current inconsistency of callers
results in client devices deferring forever on an arm64 ACPI system
where an SMMU has failed its own driver probe.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/41fa59f156ef8d196d08fa75c4901e6d4b12e6c4.1733406914.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
kmemleak noticed that the iopf queue allocated deep down within
arm_smmu_init_structures() can be leaked by a subsequent error return
from arm_smmu_device_probe(). Furthermore, after arm_smmu_device_reset()
we will also leave the SMMU enabled with an empty Stream Table, silently
blocking all DMA. This proves rather annoying for debugging said probe
failure, so let's handle it a bit better by putting the SMMU back into
(more or less) the same state as if it hadn't probed at all.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/5137901958471cf67f2fad5c2229f8a8f1ae901a.1733406914.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Relying on the driver list was a cute idea for minimising the scope of
our SMMU device lookups, however it turns out to have a subtle flaw. The
SMMU device only gets added to that list after arm_smmu_device_probe()
returns success, so there's actually no way the iommu_device_register()
call from there could ever work as intended, even if it wasn't already
hampered by the fwspec setup not happening early enough.
Switch both arm_smmu_get_by_fwnode() implementations to use a platform
bus lookup instead, which *will* reliably work. Also make sure that we
don't register SMMUv2 instances until we've fully initialised them, to
avoid similar consequences of the lookup now finding a device with no
drvdata. Moving the error returns is also a perfect excuse to streamline
them with dev_err_probe() in the process.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/6d7ce1dc31873abdb75c895fb8bd2097cce098b4.1733406914.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
The SMMUv3 spec has a note that BYPASS and ATS don't work together under
the STE EATS field definition. However there is another section "13.6.4
Full ATS skipping stage 1" that explains under certain conditions BYPASS
and ATS do work together if the STE is using S1DSS to select BYPASS and
the CD table has the possibility for a substream.
When these comments were written the understanding was that all forms of
BYPASS just didn't work and this was to be a future problem to solve.
It turns out that ATS and IDENTITY will always work just fine:
- If STE.Config = BYPASS then the PCI ATS is disabled
- If a PASID domain is attached then S1DSS = BYPASS and ATS will be
enabled. This meets the requirements of 13.6.4 to automatically
generate 1:1 ATS replies on the RID.
Update the comments to reflect this.
Fixes: 7497f4211f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-f27174f44f39+27a33-smmuv3_ats_note_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Currently, the driver dumps the raw hex for a received event record.
Improve this by leveraging `struct arm_smmu_event` for event fields
and log human-readable event records with meaningful information.
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20241203184906.2264528-3-praan@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Introduce `struct arm_smmu_event` to represent event records.
Parse out relevant fields from raw event records for ease and
use the new `struct arm_smmu_event` instead.
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20241203184906.2264528-2-praan@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Add the compatible for the separate IOMMU on SDM670 for the Adreno GPU.
This IOMMU has the compatible strings:
"qcom,sdm670-smmu-v2", "qcom,adreno-smmu", "qcom,smmu-v2"
While the SMMU 500 doesn't need an entry for this specific SoC, the
SMMU v2 compatible should have its own entry, as the fallback entry in
arm-smmu.c handles "qcom,smmu-v2" without per-process page table support
unless there is an entry here. This entry can't be the
"qcom,adreno-smmu" compatible because dedicated GPU IOMMUs can also be
SMMU 500 with different handling.
Signed-off-by: Richard Acayan <mailingradian@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20241114004713.42404-6-mailingradian@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
- Correct typos in comments
- Elaborate a comment about how the uAPI works for
IOMMU_HW_INFO_TYPE_ARM_SMMUV3
- Fix a double free on error path and add test coverage for the bug
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ1Hx1QAKCRCFwuHvBreF
YRA6AQCtirCoWoMgq1mSsFjtzmr5XCffS1g6T7KsKs54sDZZXAD/doa4NpE3LTPo
huOPMfH2O0HufLpj+FTzew4q0RbAJAc=
=z0Ak
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
"One bug fix and some documentation updates:
- Correct typos in comments
- Elaborate a comment about how the uAPI works for
IOMMU_HW_INFO_TYPE_ARM_SMMUV3
- Fix a double free on error path and add test coverage for the bug"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommu/arm-smmu-v3: Improve uAPI comment for IOMMU_HW_INFO_TYPE_ARM_SMMUV3
iommufd/selftest: Cover IOMMU_FAULT_QUEUE_ALLOC in iommufd_fail_nth
iommufd: Fix out_fput in iommufd_fault_alloc()
iommufd: Fix typos in kernel-doc comments
As fput() calls the file->f_op->release op, where fault obj and ictx are
getting released, there is no need to release these two after fput() one
more time, which would result in imbalanced refcounts:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 48 PID: 2369 at lib/refcount.c:31 refcount_warn_saturate+0x60/0x230
Call trace:
refcount_warn_saturate+0x60/0x230 (P)
refcount_warn_saturate+0x60/0x230 (L)
iommufd_fault_fops_release+0x9c/0xe0 [iommufd]
...
VFS: Close: file count is 0 (f_op=iommufd_fops [iommufd])
WARNING: CPU: 48 PID: 2369 at fs/open.c:1507 filp_flush+0x3c/0xf0
Call trace:
filp_flush+0x3c/0xf0 (P)
filp_flush+0x3c/0xf0 (L)
__arm64_sys_close+0x34/0x98
...
imbalanced put on file reference count
WARNING: CPU: 48 PID: 2369 at fs/file.c:74 __file_ref_put+0x100/0x138
Call trace:
__file_ref_put+0x100/0x138 (P)
__file_ref_put+0x100/0x138 (L)
__fput_sync+0x4c/0xd0
Drop those two lines to fix the warnings above.
Cc: stable@vger.kernel.org
Fixes: 07838f7fd5 ("iommufd: Add iommufd fault object")
Link: https://patch.msgid.link/r/b5651beb3a6b1adeef26fffac24607353bf67ba1.1733212723.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The continual trickle of small conversion patches is grating on me, and
is really not helping. Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:
/*
* .remove_new() is a relic from a prototype conversion of .remove().
* New drivers are supposed to implement .remove(). Once all drivers are
* converted to not use .remove_new any more, it will be dropped.
*/
This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.
I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.
Then I just removed the old (sic) .remove_new member function, and this
is the end result. No more unnecessary conversion noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the driver callback op domain_alloc_user() into two ops
domain_alloc_paging_flags() and domain_alloc_nesting() that better
describe what the ops are expected to do. There will be per-driver cleanup
based on this going into the next cycle via the driver trees.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ0c+GQAKCRCFwuHvBreF
YVJXAP97zriu5x9AVNzG6mLOOYYROuhCTXGieKPqNg/apGOyogD/cMuMql3w86h+
oBIMai6GaTBUU+Ie8S07PM0849TpLgA=
=yyYk
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull more iommufd updates from Jason Gunthorpe:
"Change the driver callback op domain_alloc_user() into two ops:
domain_alloc_paging_flags() and domain_alloc_nesting() that better
describe what the ops are expected to do.
There will be per-driver cleanup based on this going into the next
cycle via the driver trees"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommu: Rename ops->domain_alloc_user() to domain_alloc_paging_flags()
iommu: Add ops->domain_alloc_nested()
Sergey Senozhatsky improves zram's post-processing selection algorithm.
This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of shadow
entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in the
hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page into
small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio size
rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations" from
Mike Rapoport teaches x86 to use large pages for read-only-execute
module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a single
VMA, rather than requiring that multiple VMAs be created for this.
Improved efficiencies for userspace memory allocators are expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP from
the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep is
enabled.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
=JfWR
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection
algorithm. This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping
code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of
shadow entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in
the hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page
into small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to
do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio
size rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel
Butt removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations"
from Mike Rapoport teaches x86 to use large pages for
read-only-execute module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
tests over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a
single VMA, rather than requiring that multiple VMAs be created for
this. Improved efficiencies for userspace memory allocators are
expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP
from the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep
is enabled.
* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
zram: fix NULL pointer in comp_algorithm_show()
memcg/hugetlb: add hugeTLB counters to memcg
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
zram: ZRAM_DEF_COMP should depend on ZRAM
MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
Docs/mm/damon: recommend academic papers to read and/or cite
mm: define general function pXd_init()
kmemleak: iommu/iova: fix transient kmemleak false positive
mm/list_lru: simplify the list_lru walk callback function
mm/list_lru: split the lock to per-cgroup scope
mm/list_lru: simplify reparenting and initial allocation
mm/list_lru: code clean up for reparenting
mm/list_lru: don't export list_lru_add
mm/list_lru: don't pass unnecessary key parameters
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
...
Including:
- Core Updates:
- Convert call-sites using iommu_domain_alloc() to more specific
versions and remove function.
- Introduce iommu_paging_domain_alloc_flags().
- Extend support for allocating PASID-capable domains to more
drivers.
- Remove iommu_present().
- Some smaller improvements.
- New IOMMU driver for RISC-V.
- Intel VT-d Updates:
- Add domain_alloc_paging support.
- Enable user space IOPFs in non-PASID and non-svm cases.
- Small code refactoring and cleanups.
- Add domain replacement support for pasid.
- AMD-Vi Updates:
- Adapt to iommu_paging_domain_alloc_flags() interface and alloc V2
page-tables by default.
- Replace custom domain ID allocator with IDA allocator.
- Add ops->release_domain() support.
- Other improvements to device attach and domain allocation code
paths.
- ARM-SMMU Updates:
- SMMUv2:
- Return -EPROBE_DEFER for client devices probing before their SMMU.
- Devicetree binding updates for Qualcomm MMU-500 implementations.
- SMMUv3:
- Minor fixes and cleanup for NVIDIA's virtual command queue driver.
- IO-PGTable:
- Fix indexing of concatenated PGDs and extend selftest coverage.
- Remove unused block-splitting support.
- S390 IOMMU:
- Implement support for blocking domain.
- Mediatek IOMMU:
- Enable 35-bit physical address support for mt8186.
- OMAP IOMMU driver:
- Adapt to recent IOMMU core changes and unbreak driver.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmdAPOoACgkQK/BELZcB
GuOs1w/+PoLbOYUjmJiOfpI6YNSEfF2tE4z2al/YYIBcNoAmTTRauuhv6+S0gVRy
NTfSucw7OuLlbE9vGsdY02UL1PK58NGfUF8Z2rZSf+RRgLACc47cjZWh0vzDlNbP
4LTdqJXmIWiYcmDtY7LmHtwTSiB900YFZwZOHmTSfNyJt8UC4tBPRh8k2YD3vuxc
QZlxSihEf+F+vm8GtW40Ia9BiG3YhCYAcHq6Y4dKxI0JWN+7oRiPN8CF+z/vcdjV
VpCDBcbHjvqqpXJvddQHA0SrGDBMHz1AXYhRXnfe7Ogh6SbaSWDSsdaIS27DsOzC
L6fxW3+sNmfEOO1RmJoizkHzAtkLWCLNjBvjOb1hUCpwLcKf5nhgE3wOQSwzqumn
KbxpoQpHFJutikDBGRsKJCsNqS8ZNWd4Z8rHhTnq2ctuYUFvurkcwX4WXOSRpsoA
iJ+x1ezk9FxObHj/B+1nIAwKoeaLyFEwJe7Etom/E2m/2mq2oQOrq1bvfIGCms5h
mqLYJ9L9MDanhEiOshHooy6ROPD842XmWILfq3HUi9JcrB/BvILPRsESQnNAn3Zl
8ImbR5VijGGDy50KBE8I9abRwDTIn9c2JJVDSh3tAz1aicGnRLcIeqNeuJ4IEQZf
IQb7qcZQge17ie/Pwr24GlwrKG7DhOg5NXvl3DiVUum2NFGjuBc=
=V9hb
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core Updates:
- Convert call-sites using iommu_domain_alloc() to more specific
versions and remove function
- Introduce iommu_paging_domain_alloc_flags()
- Extend support for allocating PASID-capable domains to more drivers
- Remove iommu_present()
- Some smaller improvements
New IOMMU driver for RISC-V
Intel VT-d Updates:
- Add domain_alloc_paging support
- Enable user space IOPFs in non-PASID and non-svm cases
- Small code refactoring and cleanups
- Add domain replacement support for pasid
AMD-Vi Updates:
- Adapt to iommu_paging_domain_alloc_flags() interface and alloc V2
page-tables by default
- Replace custom domain ID allocator with IDA allocator
- Add ops->release_domain() support
- Other improvements to device attach and domain allocation code
paths
ARM-SMMU Updates:
- SMMUv2:
- Return -EPROBE_DEFER for client devices probing before their
SMMU
- Devicetree binding updates for Qualcomm MMU-500 implementations
- SMMUv3:
- Minor fixes and cleanup for NVIDIA's virtual command queue
driver
- IO-PGTable:
- Fix indexing of concatenated PGDs and extend selftest coverage
- Remove unused block-splitting support
S390 IOMMU:
- Implement support for blocking domain
Mediatek IOMMU:
- Enable 35-bit physical address support for mt8186
OMAP IOMMU driver:
- Adapt to recent IOMMU core changes and unbreak driver"
* tag 'iommu-updates-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (92 commits)
iommu/tegra241-cmdqv: Fix alignment failure at max_n_shift
iommu: Make set_dev_pasid op support domain replacement
iommu/arm-smmu-v3: Make set_dev_pasid() op support replace
iommu/vt-d: Add set_dev_pasid callback for nested domain
iommu/vt-d: Make identity_domain_set_dev_pasid() to handle domain replacement
iommu/vt-d: Make intel_svm_set_dev_pasid() support domain replacement
iommu/vt-d: Limit intel_iommu_set_dev_pasid() for paging domain
iommu/vt-d: Make intel_iommu_set_dev_pasid() to handle domain replacement
iommu/vt-d: Add iommu_domain_did() to get did
iommu/vt-d: Consolidate the struct dev_pasid_info add/remove
iommu/vt-d: Add pasid replace helpers
iommu/vt-d: Refactor the pasid setup helpers
iommu/vt-d: Add a helper to flush cache for updating present pasid entry
iommu: Pass old domain to set_dev_pasid op
iommu/iova: Fix typo 'adderss'
iommu: Add a kdoc to iommu_unmap()
iommu/io-pgtable-arm-v7s: Remove split on unmap behavior
iommu/io-pgtable-arm: Remove split on unmap behavior
iommu/vt-d: Drain PRQs when domain removed from RID
iommu/vt-d: Drop pasid requirement for prq initialization
...
Now that the main domain allocating path is calling this function it
doesn't make sense to leave it named _user. Change the name to
alloc_paging_flags() to mirror the new iommu_paging_domain_alloc_flags()
function.
A driver should implement only one of ops->domain_alloc_paging() or
ops->domain_alloc_paging_flags(). The former is a simpler interface with
less boiler plate that the majority of drivers use. The latter is for
drivers with a greater feature set (PASID, multiple page table support,
advanced iommufd support, nesting, etc). Additional patches will be needed
to achieve this.
Link: https://patch.msgid.link/r/2-v1-c252ebdeb57b+329-iommu_paging_flags_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It turns out all the drivers that are using this immediately call into
another function, so just make that function directly into the op. This
makes paging=NULL for domain_alloc_user and we can remove the argument in
the next patch.
The function mirrors the similar op in the viommu that allocates a nested
domain on top of the viommu's nesting parent. This version supports cases
where a viommu is not being used.
Link: https://patch.msgid.link/r/1-v1-c252ebdeb57b+329-iommu_paging_flags_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Including:
- Core Updates:
- Convert call-sites using iommu_domain_alloc() to more specific
versions and remove function.
- Introduce iommu_paging_domain_alloc_flags().
- Extend support for allocating PASID-capable domains to more
drivers.
- Remove iommu_present().
- Some smaller improvements.
- New IOMMU driver for RISC-V.
- Intel VT-d Updates:
- Add domain_alloc_paging support.
- Enable user space IOPFs in non-PASID and non-svm cases.
- Small code refactoring and cleanups.
- Add domain replacement support for pasid.
- AMD-Vi Updates:
- Adapt to iommu_paging_domain_alloc_flags() interface and alloc V2
page-tables by default.
- Replace custom domain ID allocator with IDA allocator.
- Add ops->release_domain() support.
- Other improvements to device attach and domain allocation code
paths.
- ARM-SMMU Updates:
- SMMUv2:
- Return -EPROBE_DEFER for client devices probing before their SMMU.
- Devicetree binding updates for Qualcomm MMU-500 implementations.
- SMMUv3:
- Minor fixes and cleanup for NVIDIA's virtual command queue driver.
- IO-PGTable:
- Fix indexing of concatenated PGDs and extend selftest coverage.
- Remove unused block-splitting support.
- S390 IOMMU:
- Implement support for blocking domain.
- Mediatek IOMMU:
- Enable 35-bit physical address support for mt8186.
- OMAP IOMMU driver:
- Adapt to recent IOMMU core changes and unbreak driver.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmdAPOoACgkQK/BELZcB
GuOs1w/+PoLbOYUjmJiOfpI6YNSEfF2tE4z2al/YYIBcNoAmTTRauuhv6+S0gVRy
NTfSucw7OuLlbE9vGsdY02UL1PK58NGfUF8Z2rZSf+RRgLACc47cjZWh0vzDlNbP
4LTdqJXmIWiYcmDtY7LmHtwTSiB900YFZwZOHmTSfNyJt8UC4tBPRh8k2YD3vuxc
QZlxSihEf+F+vm8GtW40Ia9BiG3YhCYAcHq6Y4dKxI0JWN+7oRiPN8CF+z/vcdjV
VpCDBcbHjvqqpXJvddQHA0SrGDBMHz1AXYhRXnfe7Ogh6SbaSWDSsdaIS27DsOzC
L6fxW3+sNmfEOO1RmJoizkHzAtkLWCLNjBvjOb1hUCpwLcKf5nhgE3wOQSwzqumn
KbxpoQpHFJutikDBGRsKJCsNqS8ZNWd4Z8rHhTnq2ctuYUFvurkcwX4WXOSRpsoA
iJ+x1ezk9FxObHj/B+1nIAwKoeaLyFEwJe7Etom/E2m/2mq2oQOrq1bvfIGCms5h
mqLYJ9L9MDanhEiOshHooy6ROPD842XmWILfq3HUi9JcrB/BvILPRsESQnNAn3Zl
8ImbR5VijGGDy50KBE8I9abRwDTIn9c2JJVDSh3tAz1aicGnRLcIeqNeuJ4IEQZf
IQb7qcZQge17ie/Pwr24GlwrKG7DhOg5NXvl3DiVUum2NFGjuBc=
=V9hb
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.13' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/iommu/linux into iommufd.git
Merge with Joerg's tree for dependencies on the next patches.
======================================
IOMMU Updates for Linux v6.13:
Including:
- Core Updates:
- Convert call-sites using iommu_domain_alloc() to more specific
versions and remove function.
- Introduce iommu_paging_domain_alloc_flags().
- Extend support for allocating PASID-capable domains to more
drivers.
- Remove iommu_present().
- Some smaller improvements.
- New IOMMU driver for RISC-V.
- Intel VT-d Updates:
- Add domain_alloc_paging support.
- Enable user space IOPFs in non-PASID and non-svm cases.
- Small code refactoring and cleanups.
- Add domain replacement support for pasid.
- AMD-Vi Updates:
- Adapt to iommu_paging_domain_alloc_flags() interface and alloc V2
page-tables by default.
- Replace custom domain ID allocator with IDA allocator.
- Add ops->release_domain() support.
- Other improvements to device attach and domain allocation code
paths.
- ARM-SMMU Updates:
- SMMUv2:
- Return -EPROBE_DEFER for client devices probing before their SMMU.
- Devicetree binding updates for Qualcomm MMU-500 implementations.
- SMMUv3:
- Minor fixes and cleanup for NVIDIA's virtual command queue driver.
- IO-PGTable:
- Fix indexing of concatenated PGDs and extend selftest coverage.
- Remove unused block-splitting support.
- S390 IOMMU:
- Implement support for blocking domain.
- Mediatek IOMMU:
- Enable 35-bit physical address support for mt8186.
- OMAP IOMMU driver:
- Adapt to recent IOMMU core changes and unbreak driver.
======================================
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the backing
memory for an iommu mapping. To date VFIO/iommufd have used VMA's and
pin_user_pages(), this now allows using memfds and memfd_pin_folios().
Notably this creates a pure folio path from the memfd to the iommu page
table where memory is never broken down to PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between two
processes. Combined with the above this allows iommufd to support a VMM
re-start using exec() where something like qemu would exec() a new
version of itself and fd pass the memfds/iommufd/etc to the new
process. The memfd allows DMA access to the memory to continue while
the new process is getting setup, and the CHANGE_PROCESS updates all
the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed virtual
iommu. This will be used by VMMs to access hardware features that are
contained with in a VM. The first use is to inform the kernel of the
virtual SID to physical SID mapping when issuing SID based invalidation
on ARM. Further uses will tie HW features that are directly accessed by
the VM, such as invalidation queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used to
translate VM issued cache invalidation commands from virtual to physical
device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work with
the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and VDEVICE
the driver can model this HW's behavior for nested translation. This
includes a shared branch from Will.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZzzKKwAKCRCFwuHvBreF
YaCMAQDOQAgw87eUYKnY7vFodlsTUA2E8uSxDmk6nPWySd0NKwD/flOP85MdEs9O
Ot+RoL4/J3IyNH+eg5kN68odmx4mAw8=
=ec8x
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the
backing memory for an iommu mapping. To date VFIO/iommufd have used
VMA's and pin_user_pages(), this now allows using memfds and
memfd_pin_folios(). Notably this creates a pure folio path from the
memfd to the iommu page table where memory is never broken down to
PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between
two processes. Combined with the above this allows iommufd to
support a VMM re-start using exec() where something like qemu would
exec() a new version of itself and fd pass the memfds/iommufd/etc
to the new process. The memfd allows DMA access to the memory to
continue while the new process is getting setup, and the
CHANGE_PROCESS updates all the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed
virtual iommu. This will be used by VMMs to access hardware
features that are contained with in a VM. The first use is to
inform the kernel of the virtual SID to physical SID mapping when
issuing SID based invalidation on ARM. Further uses will tie HW
features that are directly accessed by the VM, such as invalidation
queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used
to translate VM issued cache invalidation commands from virtual to
physical device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work
with the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and
VDEVICE the driver can model this HW's behavior for nested
translation. This includes a shared branch from Will"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (51 commits)
iommu/arm-smmu-v3: Import IOMMUFD module namespace
iommufd: IOMMU_IOAS_CHANGE_PROCESS selftest
iommufd: Add IOMMU_IOAS_CHANGE_PROCESS
iommufd: Lock all IOAS objects
iommufd: Export do_update_pinned
iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Use S2FWB for NESTED domains
iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
Documentation: userspace-api: iommufd: Update vDEVICE
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommu: Add iommu_copy_struct_from_full_user_array helper
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
...
Commit 69d9b312f3 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
started using _iommufd_object_alloc() without importing the IOMMUFD
module namespace, resulting in a modpost warning:
WARNING: modpost: module arm_smmu_v3 uses symbol _iommufd_object_alloc from namespace IOMMUFD, but does not import it.
Commit d68beb276b ("iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE
using a VIOMMU object") added another warning by using
iommufd_viommu_find_dev():
WARNING: modpost: module arm_smmu_v3 uses symbol iommufd_viommu_find_dev from namespace IOMMUFD, but does not import it.
Import the IOMMUFD module namespace to resolve the warnings.
Fixes: 69d9b312f3 ("iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC")
Link: https://patch.msgid.link/r/20241114-arm-smmu-v3-import-iommufd-module-ns-v1-1-c551e7b972e9@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add an ioctl that updates all DMA mappings to reflect the current process,
Change the mm and transfer locked memory accounting from old to current mm.
This will be used for live update, allowing an old process to hand the
iommufd device descriptor to a new process. The new process calls the
ioctl.
IOMMU_IOAS_CHANGE_PROCESS only supports DMA mappings created with
IOMMU_IOAS_MAP_FILE, because the kernel metadata for such mappings does
not depend on the userland VA of the pages (which is different in the new
process).
IOMMU_IOAS_CHANGE_PROCESS fails if other types of mappings are present.
This is a revised version of code originally provided by Jason.
Link: https://patch.msgid.link/r/1731527497-16091-4-git-send-email-steven.sistare@oracle.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Define helpers to lock and unlock all IOAS objects.
This will allow DMA mappings to be updated atomically during live update.
This code is substantially the same as an initial version provided by
Jason, plus fixes.
Link: https://patch.msgid.link/r/1731527497-16091-3-git-send-email-steven.sistare@oracle.com
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Implement the vIOMMU's cache_invalidate op for user space to invalidate
the IOTLB entries, Device ATS and CD entries that are cached by hardware.
Add struct iommu_viommu_arm_smmuv3_invalidate defining invalidation
entries that are simply in the native format of a 128-bit TLBI
command. Scan those commands against the permitted command list and fix
their VMID/SID fields to match what is stored in the vIOMMU.
Link: https://patch.msgid.link/r/12-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Co-developed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The EATS flag needs to flow through the vSTE and into the pSTE, and ensure
physical ATS is enabled on the PCI device.
The physical ATS state must match the VM's idea of EATS as we rely on the
VM to issue the ATS invalidation commands. Thus ATS must remain off at the
device until EATS on a nesting domain turns it on. Attaching a nesting
domain is the point where the invalidation responsibility transfers to
userspace.
Update the ATS logic to track EATS for nesting domains and flush the
ATC whenever the S2 nesting parent changes.
Link: https://patch.msgid.link/r/11-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
works. When S2FWB is supported and enabled the IOPTE will force cachable
access to IOMMU_CACHE memory when nesting with a S1 and deny cachable
access when !IOMMU_CACHE.
When using a single stage of translation, a simple S2 domain, it doesn't
change things for PCI devices as it is just a different encoding for the
existing mapping of the IOMMU protection flags to cachability attributes.
For non-PCI it also changes the combining rules when incoming transactions
have inconsistent attributes.
However, when used with a nested S1, FWB has the effect of preventing the
guest from choosing a MemAttr in it's S1 that would cause ordinary DMA to
bypass the cache. Consistent with KVM we wish to deny the guest the
ability to become incoherent with cached memory the hypervisor believes is
cachable so we don't have to flush it.
Allow NESTED domains to be created if the SMMU has S2FWB support and use
S2FWB for NESTING_PARENTS. This is an additional option to CANWBS.
Link: https://patch.msgid.link/r/10-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
as the parent and a user provided STE fragment that defines the CD table
and related data with addresses translated by the S2 iommu_domain.
The kernel only permits userspace to control certain allowed bits of the
STE that are safe for user/guest control.
IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
translation, but there is no way of knowing which S1 entries refer to a
range of S2.
For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
flush all ASIDs from the VMID after flushing the S2 on any change to the
S2.
The IOMMU_DOMAIN_NESTED can only be created from inside a VIOMMU as the
invalidation path relies on the VIOMMU to translate virtual stream ID used
in the invalidation commands for the CD table and ATS.
Link: https://patch.msgid.link/r/9-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new driver-type for ARM SMMUv3 to enum iommu_viommu_type. Implement
an arm_vsmmu_alloc().
As an initial step, copy the VMID from s2_parent. A followup series is
required to give the VIOMMU object it's own VMID that will be used in all
nesting configurations.
Link: https://patch.msgid.link/r/8-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Common SMMUv3 patches for the following patches adding nesting, shared
branch with the iommu tree.
* 'iommufd/arm-smmuv3-nested' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
ACPI/IORT: Support CANWBS memory access flag
ACPICA: IORT: Update for revision E.f
vfio: Remove VFIO_TYPE1_NESTING_IOMMU
...
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Similar to IOMMU_TEST_OP_MD_CHECK_IOTLB verifying a mock_domain's iotlb,
IOMMU_TEST_OP_DEV_CHECK_CACHE will be used to verify a mock_dev's cache.
Link: https://patch.msgid.link/r/cd4082079d75427bd67ed90c3c825e15b5720a5f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This avoids a bigger trouble of exposing struct iommufd_device and struct
iommufd_vdevice in the public header.
Link: https://patch.msgid.link/r/84fa7c624db4d4508067ccfdf42059533950180a.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
With a vIOMMU object, use space can flush any IOMMU related cache that can
be directed via a vIOMMU object. It is similar to the IOMMU_HWPT_INVALIDATE
uAPI, but can cover a wider range than IOTLB, e.g. device/desciprtor cache.
Allow hwpt_id of the iommu_hwpt_invalidate structure to carry a viommu_id,
and reuse the IOMMU_HWPT_INVALIDATE uAPI for vIOMMU invalidations. Drivers
can define different structures for vIOMMU invalidations v.s. HWPT ones.
Since both the HWPT-based and vIOMMU-based invalidation pathways check own
cache invalidation op, remove the WARN_ON_ONCE in the allocator.
Update the uAPI, kdoc, and selftest case accordingly.
Link: https://patch.msgid.link/r/b411e2245e303b8a964f39f49453a5dff280968f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce a new IOMMUFD_OBJ_VDEVICE to represent a physical device (struct
device) against a vIOMMU (struct iommufd_viommu) object in a VM.
This vDEVICE object (and its structure) holds all the infos and attributes
in the VM, regarding the device related to the vIOMMU.
As an initial patch, add a per-vIOMMU virtual ID. This can be:
- Virtual StreamID on a nested ARM SMMUv3, an index to a Stream Table
- Virtual DeviceID on a nested AMD IOMMU, an index to a Device Table
- Virtual RID on a nested Intel VT-D IOMMU, an index to a Context Table
Potentially, this vDEVICE structure would hold some vData for Confidential
Compute Architecture (CCA). Use this virtual ID to index an "vdevs" xarray
that belongs to a vIOMMU object.
Add a new ioctl for vDEVICE allocations. Since a vDEVICE is a connection
of a device object and an iommufd_viommu object, take two refcounts in the
ioctl handler.
Link: https://patch.msgid.link/r/cda8fd2263166e61b8191a3b3207e0d2b08545bf.1730836308.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Implement the viommu alloc/free functions to increase/reduce refcount of
its dependent mock iommu device. User space can verify this loop via the
IOMMU_VIOMMU_TYPE_SELFTEST.
Link: https://patch.msgid.link/r/9d755a215a3007d4d8d1c2513846830332db62aa.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For an iommu_dev that can unplug (so far only this selftest does so), the
viommu->iommu_dev pointer has no guarantee of its life cycle after it is
copied from the idev->dev->iommu->iommu_dev.
Track the user count of the iommu_dev. Postpone the exit routine using a
completion, if refcount is unbalanced. The refcount inc/dec will be added
in the following patch.
Link: https://patch.msgid.link/r/33f28d64841b497eebef11b49a571e03103c5d24.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A nested domain now can be allocated for a parent domain or for a vIOMMU
object. Rework the existing allocators to prepare for the latter case.
Link: https://patch.msgid.link/r/f62894ad8ccae28a8a616845947fe4b76135d79b.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use these inline helpers to shorten those container_of lines.
Note that one of them goes back and forth between iommu_domain and
mock_iommu_domain, which isn't necessary. So drop its container_of.
Link: https://patch.msgid.link/r/518ec64dae2e814eb29fd9f170f58a3aad56c81c.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now a vIOMMU holds a shareable nesting parent HWPT. So, it can act like
that nesting parent HWPT to allocate a nested HWPT.
Support that in the IOMMU_HWPT_ALLOC ioctl handler, and update its kdoc.
Also, add an iommufd_viommu_alloc_hwpt_nested helper to allocate a nested
HWPT for a vIOMMU object. Since a vIOMMU object holds the parent hwpt's
refcount already, increase the refcount of the vIOMMU only.
Link: https://patch.msgid.link/r/a0f24f32bfada8b448d17587adcaedeeb50a67ed.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new ioctl for user space to do a vIOMMU allocation. It must be based
on a nesting parent HWPT, so take its refcount.
IOMMU driver wanting to support vIOMMUs must define its IOMMU_VIOMMU_TYPE_
in the uAPI header and implement a viommu_alloc op in its iommu_ops.
Link: https://patch.msgid.link/r/dc2b8ba9ac935007beff07c1761c31cd097ed780.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To support driver-allocated vIOMMU objects, it's required for IOMMU driver
to call the provided iommufd_viommu_alloc helper to embed the core struct.
However, there is no guarantee that every driver will call it and allocate
objects properly.
Make the iommufd_object_finalize/abort functions more robust to verify if
the xarray slot indexed by the input obj->id is having an XA_ZERO_ENTRY,
which is the reserved value stored by xa_alloc via iommufd_object_alloc.
Link: https://patch.msgid.link/r/334bd4dde8e0a88eb30fa67eeef61827cdb546f9.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The following patch will add a new vIOMMU allocator that will require this
_iommufd_object_alloc to be sharable with IOMMU drivers (and iommufd too).
Add a new driver.c file that will be built with CONFIG_IOMMUFD_DRIVER_CORE
selected by CONFIG_IOMMUFD, and put the CONFIG_DRIVER under that remaining
to be selectable for drivers to build the existing iova_bitmap.c file.
Link: https://patch.msgid.link/r/2f4f6e116dc49ffb67ff6c5e8a7a8e789ab9e98e.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When configuring a kernel with PAGE_SIZE=4KB, depending on its setting of
CONFIG_CMA_ALIGNMENT, VCMDQ_LOG2SIZE_MAX=19 could fail the alignment test
and trigger a WARN_ON:
WARNING: at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3646
Call trace:
arm_smmu_init_one_queue+0x15c/0x210
tegra241_cmdqv_init_structures+0x114/0x338
arm_smmu_device_probe+0xb48/0x1d90
Fix it by capping max_n_shift to CMDQ_MAX_SZ_SHIFT as SMMUv3 CMDQ does.
Fixes: 918eb5c856 ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20241111030226.1940737-1-nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The introduction of iova_depot_pop() in 911aa1245d ("iommu/iova: Make
the rcache depot scale better") confused kmemleak by moving a struct
iova_magazine object from a singly linked list to rcache->depot and
resetting the 'next' pointer referencing it. Unlike doubly linked lists,
the content of the object being referred is never changed on removal from
a singly linked list and the kmemleak checksum heuristics do not detect
such scenario. This leads to false positives like:
unreferenced object 0xffff8881a5301000 (size 1024):
comm "softirq", pid 0, jiffies 4306297099 (age 462.991s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 e7 7d 05 00 00 00 00 00 .........}......
0f b4 05 00 00 00 00 00 b4 96 05 00 00 00 00 00 ................
backtrace:
[<ffffffff819f5f08>] __kmem_cache_alloc_node+0x1e8/0x320
[<ffffffff818a239a>] kmalloc_trace+0x2a/0x60
[<ffffffff8231d31e>] free_iova_fast+0x28e/0x4e0
[<ffffffff82310860>] fq_ring_free_locked+0x1b0/0x310
[<ffffffff8231225d>] fq_flush_timeout+0x19d/0x2e0
[<ffffffff813e95ba>] call_timer_fn+0x19a/0x5c0
[<ffffffff813ea16b>] __run_timers+0x78b/0xb80
[<ffffffff813ea5bd>] run_timer_softirq+0x5d/0xd0
[<ffffffff82f1d915>] __do_softirq+0x205/0x8b5
Introduce kmemleak_transient_leak() which resets the object checksum
requiring another scan pass before it is reported (if still unreferenced).
Call this new API in iova_depot_pop().
Link: https://lkml.kernel.org/r/20241104111944.2207155-1-catalin.marinas@arm.com
Link: https://lore.kernel.org/r/ZY1osaGLyT-sdKE8@shredder/
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Ido Schimmel <idosch@idosch.org>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prepare for an embedded structure design for driver-level iommufd_viommu
objects:
// include/linux/iommufd.h
struct iommufd_viommu {
struct iommufd_object obj;
....
};
// Some IOMMU driver
struct iommu_driver_viommu {
struct iommufd_viommu core;
....
};
It has to expose struct iommufd_object and enum iommufd_object_type from
the core-level private header to the public iommufd header.
Link: https://patch.msgid.link/r/54a43b0768089d690104530754f499ca05ce0074.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The iommu core is going to support domain replacement for pasid, it needs
to make the set_dev_pasid op support replacing domain and keep the old
domain config in the failure case.
AMD iommu driver does not support domain replacement for pasid yet, so it
would fail the set_dev_pasid op to keep the old config if the input @old
is non-NULL. Till now, all the set_dev_pasid callbacks can handle the old
parameter and can keep the old config when failed, so update the kdoc of
set_dev_pasid op.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-14-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
set_dev_pasid() op is going to be enhanced to support domain replacement
of a pasid. This prepares for this op definition.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-13-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add intel_nested_set_dev_pasid() to set a nested type domain to a PASID
of a device.
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-12-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Let identity_domain_set_dev_pasid() call the pasid replace helpers hence
be able to do domain replacement.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-11-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
intel_iommu_set_dev_pasid() is only supposed to be used by paging domain,
so limit it.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-9-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Let intel_iommu_set_dev_pasid() call the pasid replace helpers hence be
able to do domain replacement.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-8-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
domain_id_iommu() does not support SVA type and identity type domains.
Add iommu_domain_did() to support all domain types.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-7-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The domain_add_dev_pasid() and domain_remove_dev_pasid() are added to
consolidate the adding/removing of the struct dev_pasid_info. Besides,
it includes the cache tag assign/unassign as well.
This also prepares for adding domain replacement for pasid. The
set_dev_pasid callbacks need to deal with the dev_pasid_info for both old
and new domain. These two helpers make the life easier.
intel_iommu_set_dev_pasid() and intel_svm_set_dev_pasid() are updated to
use the helpers.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-6-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
pasid replacement allows converting a present pasid entry to be FS, SS,
PT or nested, hence add helpers for such operations.
Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-5-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
It is clearer to have a new set of pasid replacement helpers other than
extending the existing ones to cover both initial setup and replacement.
Then abstract out the common code for manipulating the pasid entry as
preparation.
No functional change is intended.
Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-4-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Generalize the logic for flushing pasid-related cache upon changes to
bits other than SSADE and P which requires a different flow according
to VT-d spec.
No functional change is intended.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-3-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
To support domain replacement for pasid, the underlying iommu driver needs
to know the old domain hence be able to clean up the existing attachment.
It would be much convenient for iommu layer to pass down the old domain.
Otherwise, iommu drivers would need to track domain for pasids by
themselves, this would duplicate code among the iommu drivers. Or iommu
drivers would rely group->pasid_array to get domain, which may not always
the correct one.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241107122234.7424-2-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
iommufd_fault_iopf_enable has limitation to PRI on PCI/SRIOV VFs because
the PRI might be a shared resource and current iommu subsystem is not
ready to support enabling/disabling PRI on a VF without any impact on
others.
However, we have devices that appear as PCI but are actually on the AMBA
bus. These fake PCI devices have PASID capability, support stall as well
as SRIOV, so remove the limitation for these devices.
Link: https://patch.msgid.link/r/20241107043711.116-1-zhangfei.gao@linaro.org
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Describe the most conservative version of the driver implementations.
All drivers should support this.
Many drivers support extending the range if a large page is hit, but
let's not make that officially approved API. The main point is to
document explicitly that split is not supported.
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-b3a5b5937f56+7bb-arm_no_split_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
A minority of page table implementations (arm_lpae, armv7) are unique in
how they handle partial unmap of large IOPTEs.
Other implementations will unmap the large IOPTE and return it's
length. For example if a 2M IOPTE is present and the first 4K is requested
to be unmapped then unmap will remove the whole 2M and report 2M as the
result.
armv7 instead will break up contiguous entries and replace an entry with a
whole table so it can unmap the requested 4k.
This seems copied from the arm_lpae implementation, which was analyzed
here:
https://lore.kernel.org/all/20241024134411.GA6956@nvidia.com/
Bring consistency to the implementations and remove this unused
functionality.
There are no uses outside iommu, this effects the ARM_V7S drivers
msm_iommu, mtk_iommu, and arm-smmmu.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-b3a5b5937f56+7bb-arm_no_split_jgg@nvidia.com
[will: Remove unused 'loopnr' variable]
Signed-off-by: Will Deacon <will@kernel.org>
A minority of page table implementations (arm_lpae, armv7) are unique in
how they handle partial unmap of large IOPTEs.
Other implementations will unmap the large IOPTE and return it's
length. For example if a 2M IOPTE is present and the first 4K is requested
to be unmapped then unmap will remove the whole 2M and report 2M as the
result.
arm_lpae instead replaces the IOPTE with a table of smaller IOPTEs, unmaps
the 4K and returns 4k. This is actually an illegal/non-hitless operation
on at least SMMUv3 because of the BBM level 0 rules.
Will says this was done to support VFIO, but upon deeper analysis this was
never strictly necessary:
https://lore.kernel.org/all/20241024134411.GA6956@nvidia.com/
In summary, historical VFIO supported the AMD behavior of unmapping the
whole large IOPTE and returning the size, even if asked to unmap a
portion. The driver would see this as a request to split a large IOPTE.
Modern VFIO always unmaps entire large IOPTEs (except on AMD) and drivers
don't see an IOPTE split.
Given it doesn't work fully correctly on SMMUv3 and relying on ARM unique
behavior would create portability problems across IOMMU drivers, retire
this functionality.
Outside the iommu users, this will potentially effect io_pgtable users of
ARM_32_LPAE_S1, ARM_32_LPAE_S2, ARM_64_LPAE_S1, ARM_64_LPAE_S2, and
ARM_MALI_LPAE formats.
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: dri-devel@lists.freedesktop.org
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-b3a5b5937f56+7bb-arm_no_split_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
As this iommu driver now supports page faults for requests without
PASID, page requests should be drained when a domain is removed from
the RID2PASID entry.
This results in the intel_iommu_drain_pasid_prq() call being moved to
intel_pasid_tear_down_entry(). This indicates that when a translation
is removed from any PASID entry and the PRI has been enabled on the
device, page requests are drained in the domain detachment path.
The intel_iommu_drain_pasid_prq() helper has been modified to support
sending device TLB invalidation requests for both PASID and non-PASID
cases.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241101045543.70086-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
PASID support within the IOMMU is not required to enable the Page
Request Queue, only the PRS capability.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20241015-jag-iopfv8-v4-5-b696ca89ba29@kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add IOMMU_HWPT_FAULT_ID_VALID as part of the valid flags when doing an
iommufd_hwpt_alloc allowing the use of an iommu fault allocation
(iommu_fault_alloc) with the IOMMU_HWPT_ALLOC ioctl.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20241015-jag-iopfv8-v4-4-b696ca89ba29@kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Move IOMMU_IOPF from under INTEL_IOMMU_SVM into INTEL_IOMMU. This
certifies that the core intel iommu utilizes the IOPF library
functions, independent of the INTEL_IOMMU_SVM config.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20241015-jag-iopfv8-v4-3-b696ca89ba29@kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
PASID is not strictly needed when handling a PRQ event; remove the check
for the pasid present bit in the request. This change was not included
in the creation of prq.c to emphasize the change in capability checks
when handing PRQ events.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20241015-jag-iopfv8-v4-2-b696ca89ba29@kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
IO page faults are no longer dependent on CONFIG_INTEL_IOMMU_SVM. Move
all Page Request Queue (PRQ) functions that handle prq events to a new
file in drivers/iommu/intel/prq.c. The page_req_des struct is now
declared in drivers/iommu/intel/prq.c.
No functional changes are intended. This is a preparation patch to
enable the use of IO page faults outside the SVM/PASID use cases.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20241015-jag-iopfv8-v4-1-b696ca89ba29@kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are some issues in pgtable_walk():
1. Super page is dumped as non-present page
2. dma_pte_superpage() should not check against leaf page table entries
3. Pointer pte is never NULL so checking it is meaningless
4. When an entry is not present, it still makes sense to dump the entry
content.
Fix 1,2 by checking dma_pte_superpage()'s returned value after level check.
Fix 3 by removing pte check.
Fix 4 by checking present bit after printing.
By this chance, change to print "page table not present" instead of "PTE
not present" to be clearer.
Fixes: 914ff7719e ("iommu/vt-d: Dump DMAR translation structure when DMA fault occurs")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/r/20241024092146.715063-3-zhenzhong.duan@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are some issues in dmar_fault_dump_ptes():
1. return value of phys_to_virt() is used for checking if an entry is
present.
2. dump is confusing, e.g., "pasid table entry is not present", confusing
by unpresent pasid table vs. unpresent pasid table entry. Current code
means the former.
3. pgtable_walk() is called without checking if page table is present.
Fix 1 by checking present bit of an entry before dump a lower level entry.
Fix 2 by removing "entry" string, e.g., "pasid table is not present".
Fix 3 by checking page table present before walk.
Take issue 3 for example, before fix:
[ 442.240357] DMAR: pasid dir entry: 0x000000012c83e001
[ 442.246661] DMAR: pasid table entry[0]: 0x0000000000000000
[ 442.253429] DMAR: pasid table entry[1]: 0x0000000000000000
[ 442.260203] DMAR: pasid table entry[2]: 0x0000000000000000
[ 442.266969] DMAR: pasid table entry[3]: 0x0000000000000000
[ 442.273733] DMAR: pasid table entry[4]: 0x0000000000000000
[ 442.280479] DMAR: pasid table entry[5]: 0x0000000000000000
[ 442.287234] DMAR: pasid table entry[6]: 0x0000000000000000
[ 442.293989] DMAR: pasid table entry[7]: 0x0000000000000000
[ 442.300742] DMAR: PTE not present at level 2
After fix:
...
[ 357.241214] DMAR: pasid table entry[6]: 0x0000000000000000
[ 357.248022] DMAR: pasid table entry[7]: 0x0000000000000000
[ 357.254824] DMAR: scalable mode page table is not present
Fixes: 914ff7719e ("iommu/vt-d: Dump DMAR translation structure when DMA fault occurs")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/r/20241024092146.715063-2-zhenzhong.duan@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
dmar_domian has stored the s1_cfg which includes the s1_pgtbl info, so
no need to store s1_pgtbl, hence drop it.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241025143339.2328991-1-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
dmar_msi_read() has been unused since 2022 in
commit cf8e865810 ("arch: Remove Itanium (IA-64) architecture")
Remove it.
(dmar_msi_write still exists and is used once).
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20241022002702.302728-1-linux@treblig.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
GCC is not happy with the current code, e.g.:
.../iommu/intel/dmar.c:1063:9: note: ‘sprintf’ output between 6 and 15 bytes into a destination of size 13
1063 | sprintf(iommu->name, "dmar%d", iommu->seq_id);
When `make W=1` is supplied, this prevents kernel building. Fix it by
increasing the buffer size for device name and use sizeoF() instead of
hard coded constants.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20241014104529.4025937-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The macro PCI_DEVID() can be used instead of compose it manually.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240829021011.4135618-1-ruanjinjie@huawei.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The domain_alloc_user ops should always allocate a guest-compatible page
table unless specific allocation flags are specified.
Currently, IOMMU_HWPT_ALLOC_NEST_PARENT and IOMMU_HWPT_ALLOC_DIRTY_TRACKING
require special handling, as both require hardware support for scalable
mode and second-stage translation. In such cases, the driver should select
a second-stage page table for the paging domain.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The first stage page table is compatible across host and guest kernels.
Therefore, this driver uses the first stage page table as the default for
paging domains.
The helper first_level_by_default() determines the feasibility of using
the first stage page table based on a global policy. This policy requires
consistency in scalable mode and first stage translation capability among
all iommu units. However, this is unnecessary as domain allocation,
attachment, and removal operations are performed on a per-device basis.
The domain type (IOMMU_DOMAIN_DMA vs. IOMMU_DOMAIN_UNMANAGED) should not
be a factor in determining the first stage page table usage. Both types
are for paging domains, and there's no fundamental difference between them.
The driver should not be aware of this distinction unless the core
specifies allocation flags that require special handling.
Convert first_level_by_default() from global to per-iommu and remove the
'type' input.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The requirement for consistent super page support across all the IOMMU
hardware in the system has been removed. In the past, if a new IOMMU
was hot-added and lacked consistent super page capability, the hot-add
process would be aborted. However, with the updated attachment semantics,
it is now permissible for the super page capability to vary among
different IOMMU hardware units.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The attributes of a paging domain are initialized during the allocation
process, and any attempt to attach a domain that is not compatible will
result in a failure. Therefore, there is no need to update the domain
attributes at the time of domain attachment.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The driver now supports domain_alloc_paging, ensuring that a valid device
pointer is provided whenever a paging domain is allocated. Additionally,
the dmar_domain attributes are set up at the time of allocation.
Consistent with the established semantics in the IOMMU core, if a domain is
attached to a device and found to be incompatible with the IOMMU hardware
capabilities, the operation will return an -EINVAL error. This implicitly
advises the caller to allocate a new domain for the device and attempt the
domain attachment again.
Rename prepare_domain_attach_device() to a more meaningful name.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
With domain_alloc_paging callback supported, the legacy domain_alloc
callback will never be used anymore. Remove it to avoid dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add the domain_alloc_paging callback for domain allocation using the
iommu_paging_domain_alloc() interface.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241021085125.192333-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The arm-smmuv3-iommufd.c file will need to call these functions too.
Remove statics and put them in the header file. Remove the kunit
visibility protections from arm_smmu_make_abort_ste() and
arm_smmu_make_s2_domain_ste().
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
For SMMUv3 the parent must be a S2 domain, which can be composed
into a IOMMU_DOMAIN_NESTED.
In future the S2 parent will also need a VMID linked to the VIOMMU and
even to KVM.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
instance need to be available to the VMM so it can construct an
appropriate vSMMUv3 that reflects the correct HW capabilities.
For userspace page tables these values are required to constrain the valid
values within the CD table and the IOPTEs.
The kernel does not sanitize these values. If building a VMM then
userspace is required to only forward bits into a VM that it knows it can
implement. Some bits will also require a VMM to detect if appropriate
kernel support is available such as for ATS and BTM.
Start a new file and kconfig for the advanced iommufd support. This lets
it be compiled out for kernels that are not intended to support
virtualization, and allows distros to leave it disabled until they are
shipping a matching qemu too.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
HW with CANWBS is always cache coherent and ignores PCI No Snoop requests
as well. This meets the requirement for IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
so let's return it.
Implement the enforce_cache_coherency() op to reject attaching devices
that don't have CANWBS.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the stage 1 usual default),
however this choice has no significant visible impact to the VFIO
user. Further qemu never implemented this and no other userspace user is
known.
The original description in commit f5c9ecebaf ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system" however the rest
of the API to set the guest table pointer for the stage 1 and manage
invalidation was never completed, or at least never upstreamed, rendering
this part useless dead code.
Upstream has now settled on iommufd as the uAPI for controlling nested
translation. Choosing the stage 2 implementation should be done by through
the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.
Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.
Just in-case there is some userspace using this continue to treat
requesting it as a NOP, but do not advertise support any more.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Previous patch added ops->release_domain support. Core will attach
devices to release_domain->attach_dev() before calling this function.
Devices are already detached their current domain and attached to
blocked domain.
This is mostly dummy function now. Just throw warning if device is still
attached to domain.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-13-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
In release path, remove device from existing domain and attach it to
blocked domain. So that all DMAs from device is blocked.
Note that soon blocked_domain will support other ops like
set_dev_pasid() but release_domain supports only attach_dev ops.
Hence added separate 'release_domain' variable.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-12-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Ideally in attach device path, it should take dev_data lock before
making changes to device data including IOPF enablement. So far dev_data
was using spinlock and it was hitting lock order issue when it tries to
enable IOPF. Hence Commit 526606b0a1 ("iommu/amd: Fix Invalid wait
context issue") moved IOPF enablement outside dev_data->lock.
Previous patch converted dev_data lock to mutex. Now its safe to call
amd_iommu_iopf_add_device() with dev_data->mutex. Hence move back PCI
device capability enablement (ATS, PRI, PASID) and IOPF enablement code
inside the lock. Also in attach_device(), update 'dev_data->domain' at
the end so that error handling becomes simple.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-11-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently in attach device path it takes dev_data->spinlock. But as per
design attach device path can sleep. Also if device is PRI capable then
it adds device to IOMMU fault handler queue which takes mutex. Hence
currently PRI enablement is done outside dev_data lock.
Covert dev_data lock from spinlock to mutex so that it follows the
design and also PRI enablement can be done properly.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-10-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
attach_device() is just holding lock and calling do_attach(). There is
not need to have another function. Just move do_attach() code to
attach_device(). Similarly move do_detach() code to detach_device().
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-9-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently attach device path takes protection domain lock followed by
dev_data lock. Most of the operations in this function is specific to
device data except pdom_attach_iommu() where it updates protection
domain structure. Hence reduce the scope of protection domain lock.
Note that this changes the locking order. Now it takes device lock
before taking doamin lock (group->mutex -> dev_data->lock ->
pdom->lock). dev_data->lock is used only in device attachment path.
So changing order is fine. It will not create any issue.
Finally move numa node assignment to pdom_attach_iommu().
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-8-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
All devices attached to a protection domain must be freed before
calling domain free. Hence do not try to free devices in domain
free path. Continue to throw warning if pdom->dev_list is not empty
so that any potential issues can be fixed.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-7-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
protection_domain structure is updated to use xarray to track the IOMMUs
attached to the domain. Now domain flush code is not using amd_iommus.
Hence remove this unused variable.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-6-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
protection_domain->dev_list tracks list of attached devices to
domain. We can use list_* functions on dev_list to get device count.
Hence remove 'dev_cnt' variable.
No functional change intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241030063556.6104-4-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
With the last external caller of bus_iommu_probe() now gone, make it
internal as it really should be.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Beleswar Padhi <b-padhi@ti.com>
Link: https://lore.kernel.org/r/a7511a034a27259aff4e14d80a861d3c40fbff1e.1730136799.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The OMAP driver uses the generic "iommus" DT binding but is the final
holdout not implementing a corresponding .of_xlate method. Unfortunately
this now results in __iommu_probe_device() failing to find ops due to
client devices missing the expected IOMMU fwnode association. The legacy
DT parsing in omap_iommu_probe_device() could probably all be delegated
to generic code now, but for the sake of an immediate fix, just add a
minimal .of_xlate implementation to allow client fwspecs to be created
appropriately, and so the ops lookup to work again.
This means we also need to register the additional instances on DRA7 so
that of_iommu_xlate() doesn't defer indefinitely waiting for their ops
either, but we'll continue to hide them from sysfs just in case. This
also renders the bus_iommu_probe() call entirely redundant.
Reported-by: Beleswar Padhi <b-padhi@ti.com>
Link: https://lore.kernel.org/linux-iommu/0dbde87b-593f-4b14-8929-b78e189549ad@ti.com/
Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Link: https://lore.kernel.org/linux-media/A7C284A9-33A5-4E21-9B57-9C4C213CC13F@goldelico.com/
Fixes: 17de3f5fdd ("iommu: Retire bus ops")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Beleswar Padhi <b-padhi@ti.com>
Link: https://lore.kernel.org/r/cfd766f96bc799e32b97f4664707adbcf99097b0.1730136799.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
While testing some io-pgtable changes, I ran into a compiler warning
from the Tegra CMDQ driver:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:803:23: warning: unused variable 'cmdqv_debugfs_dir' [-Wunused-variable]
803 | static struct dentry *cmdqv_debugfs_dir;
| ^~~~~~~~~~~~~~~~~
1 warning generated.
Guard the variable declaration with CONFIG_IOMMU_DEBUGFS to silence the
warning.
Signed-off-by: Will Deacon <will@kernel.org>
Null pointer dereference occurs due to a race between smmu
driver probe and client driver probe, when of_dma_configure()
for client is called after the iommu_device_register() for smmu driver
probe has executed but before the driver_bound() for smmu driver
has been called.
Following is how the race occurs:
T1:Smmu device probe T2: Client device probe
really_probe()
arm_smmu_device_probe()
iommu_device_register()
really_probe()
platform_dma_configure()
of_dma_configure()
of_dma_configure_id()
of_iommu_configure()
iommu_probe_device()
iommu_init_device()
arm_smmu_probe_device()
arm_smmu_get_by_fwnode()
driver_find_device_by_fwnode()
driver_find_device()
next_device()
klist_next()
/* null ptr
assigned to smmu */
/* null ptr dereference
while smmu->streamid_mask */
driver_bound()
klist_add_tail()
When this null smmu pointer is dereferenced later in
arm_smmu_probe_device, the device crashes.
Fix this by deferring the probe of the client device
until the smmu device has bound to the arm smmu driver.
Fixes: 021bb8420d ("iommu/arm-smmu: Wire up generic configuration support")
Cc: stable@vger.kernel.org
Co-developed-by: Prakash Gupta <quic_guptap@quicinc.com>
Signed-off-by: Prakash Gupta <quic_guptap@quicinc.com>
Signed-off-by: Pratyush Brahma <quic_pbrahma@quicinc.com>
Link: https://lore.kernel.org/r/20241004090428.2035-1-quic_pbrahma@quicinc.com
[will: Add comment]
Signed-off-by: Will Deacon <will@kernel.org>
Add a case in the selftests that can detect some bugs with concatenated
page tables, where it maps the biggest supported page size at the end of
the IAS, this test would fail without the previous fix.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241024162516.2005652-3-smostafa@google.com
Signed-off-by: Will Deacon <will@kernel.org>
ARM_LPAE_LVL_IDX() takes into account concatenated PGDs and can return
an index spanning multiple page-table pages given a sufficiently large
input address. However, when the resulting index is used to calculate
the number of remaining entries in the page, the possibility of
concatenation is ignored and we end up computing a negative upper bound:
max_entries = ARM_LPAE_PTES_PER_TABLE(data) - map_idx_start;
On the map path, this results in a negative 'mapped' value being
returned but on the unmap path we can leak child tables if they are
skipped in __arm_lpae_free_pgtable().
Introduce an arm_lpae_max_entries() helper to convert a table index into
the remaining number of entries within a single page-table page.
Cc: <stable@vger.kernel.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20241024162516.2005652-2-smostafa@google.com
[will: Tweaked comment and commit message]
Signed-off-by: Will Deacon <will@kernel.org>
Consolidate all the code to create an IDENTITY domain into one
function. This removes the legacy __iommu_domain_alloc() path from all
paths, and preps it for final removal.
BLOCKED/IDENTITY/PAGING are now always allocated via a type specific
function.
[Joerg: Actually remove __iommu_domain_alloc()]
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-13-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is no longer a reason to call __iommu_domain_alloc() to allocate
the blocking domain. All drivers that support a native blocking domain
provide it via the ops, for other drivers we should call
iommu_paging_domain_alloc().
__iommu_group_alloc_blocking_domain() is the only place that allocates
an BLOCKED domain, so move the ops->blocked_domain logic there.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-12-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Implement global identity domain. All device groups in identity domain
will share this domain.
In attach device path, based on device capability it will allocate per
device domain ID and GCR3 table. So that it can support SVA.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-11-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Previous patch enhanced core layer to check device PASID capability and
pass right flags to ops->domain_alloc_user().
Enhance amd_iommu_domain_alloc_user() to allocate domain with
appropriate page table based on flags parameter.
- If flags is empty then allocate domain with default page table type.
This will eventually replace ops->domain_alloc().
For UNMANAGED domain, core will call this interface with flags=0. So
AMD driver will continue to allocate V1 page table.
- If IOMMU_HWPT_ALLOC_PASID flags is passed then allocate domain with v2
page table.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241028093810.5901-10-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Current code forces v1 page table for UNMANAGED domain and global page
table type (amd_iommu_pgtable) for rest of paging domain.
Following patch series adds support for domain_alloc_paging() ops. Also
enhances domain_alloc_user() to allocate page table based on 'flags.
Hence pass page table type as parameter to pdomain_setup_pgtable(). So
that caller can decide right page table type. Also update
dma_max_address() to take pgtable as parameter.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jacob Pan <jacob.pan@linux.microsoft.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241028093810.5901-9-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently protection_domain_alloc() allocates domain and also sets up
page table. Page table setup is required for PAGING domain only. Domain
type like SVA doesn't need page table. Hence move page table setup code
to separate function.
Also SVA domain allocation path does not call pdom_setup_pgtable().
Hence remove IOMMU_DOMAIN_SVA type check.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jacob Pan <jacob.pan@linux.microsoft.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241028093810.5901-8-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
amd_iommu_pgtable validation has to be done before calling
iommu_snp_enable(). It can be done immediately after reading IOMMU
features. Hence move this check to early_amd_iommu_init().
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241028093810.5901-7-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
amd_iommu_gt_ppr_supported() only checks for GTSUP. To support PASID
with V2 page table we need GIOSUP as well. Hence add new helper function
to check GIOSUP/GTSUP.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241028093810.5901-6-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Core layer is modified to call domain_alloc_user() to allocate PASID
capable domain. Enhance arm_smmu_domain_alloc_user() to allocate
PASID capable domain based on the 'flags' parameter.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-5-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduce new flag (IOMMU_HWPT_ALLOC_PASID) to domain_alloc_users() ops.
If IOMMU supports PASID it will allocate domain. Otherwise return error.
In error path check for -EOPNOTSUPP and try to allocate non-PASID
domain so that DMA-API mode work fine for drivers which does not support
PASID as well.
Also modify __iommu_group_alloc_default_domain() to call
iommu_paging_domain_alloc_flags() with appropriate flag when allocating
paging domain.
Signed-off-by: Jason Gunthorpe <jgg@ziepe.ca>
Co-developed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-4-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently drivers calls iommu_paging_domain_alloc(dev) to get an
UNMANAGED domain. This is not sufficient to support PASID with
UNMANAGED domain as some HW like AMD requires certain page table type
to support PASIDs.
Also the domain_alloc_paging op only passes device as param for domain
allocation. This is not sufficient for AMD driver to decide the right
page table.
Instead of extending ops->domain_alloc_paging() it was decided to
enhance ops->domain_alloc_user() so that caller can pass various
additional flags.
Hence add iommu_paging_domain_alloc_flags() API which takes flags as
parameter. Caller can pass additional parameter to indicate type of
domain required, etc. iommu_paging_domain_alloc_flags() internally calls
appropriate callback function to allocate a domain.
Signed-off-by: Jason Gunthorpe <jgg@ziepe.ca>
[Added description - Vasant]
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-3-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Following patch will introduce iommu_paging_domain_alloc_flags() API.
Hence move domain init code to separate function so that it can be
reused.
Also move iommu_get_dma_cookie() setup iommu_setup_default_domain() as
it is required in DMA API mode only.
Signed-off-by: Jason Gunthorpe <jgg@ziepe.ca>
[Split the patch and added description - Vasant]
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241028093810.5901-2-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu_domain_alloc() interface is no longer used in the tree anymore.
Remove it to avoid dead code.
There is increasing demand for supporting multiple IOMMU drivers, and this
is the last bus-based thing standing in the way of that.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241009041147.28391-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If a page is mapped starting at 0 that is equal to or larger than can fit
in the current mode (number of table levels) it results in corrupting the
mapping as the following logic assumes the mode is correct for the page
size being requested.
There are two issues here, the check if the address fits within the table
uses the start address, it should use the last address to ensure that last
byte of the mapping fits within the current table mode.
The second is if the mapping is exactly the size of the full page table it
has to add another level to instead hold a single IOPTE for the large
size.
Since both corner cases require a 0 IOVA to be hit and doesn't start until
a page size of 2^48 it is unlikely to ever hit in a real system.
Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-27ab08d646a1+29-amd_0map_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
These days iommu_map() does not require external flushing, it always
internally handles any required flushes. Since
iommu_create_device_direct_mappings() only calls iommu_map(), remove
the extra call.
Since this is the last call site for iommu_flush_iotlb_all() remove it
too.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v1-bb6c694e1b07+a29e1-iommu_no_flush_all_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
All functions that take the class address as argument expect a const
pointer so we can make the iommu class constant.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241018121725.61128-1-brgl@bgdev.pl
Signed-off-by: Joerg Roedel <jroedel@suse.de>
In suspend/resume path, no need to copy old DTE (early_enable_iommus()).
Just need to reload IOMMU hardware.
This is the side effect of commit 3ac3e5ee5e ("iommu/amd: Copy old
trans table from old kernel") which changed early_enable_iommus() but
missed to fix enable_iommus().
Resume path continue to work as 'amd_iommu_pre_enabled' is set to false
and copy_device_table() will fail. It will just re-loaded IOMMU. Hence I
think we don't need to backport this to stable tree.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20241016084958.99727-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduce first-stage address translation support.
Page table configured by the IOMMU driver will use the highest mode
implemented by the hardware, unless not known at the domain allocation
time falling back to the CPU’s MMU page mode.
This change introduces IOTINVAL.VMA command, required to invalidate
any cached IOATC entries after mapping is updated and/or removed from
the paging domain. Invalidations for the non-leaf page entries use
IOTINVAL for all addresses assigned to the protection domain for
hardware not supporting more granular non-leaf page table cache
invalidations.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/1109202d389f51c7121cb1460eb2f21429b9bd5d.1729059707.git.tjeznach@rivosinc.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduce device command submission and fault reporting queues,
as described in Chapter 3.1 and 3.2 of the RISC-V IOMMU Architecture
Specification.
Command and fault queues are instantiated in contiguous system memory
local to IOMMU device domain, or mapped from fixed I/O space provided
by the hardware implementation. Detection of the location and maximum
allowed size of the queue utilize WARL properties of queue base control
register. Driver implementation will try to allocate up to 128KB of
system memory, while respecting hardware supported maximum queue size.
Interrupts allocation is based on interrupt vectors availability and
distributed to all queues in simple round-robin fashion. For hardware
Implementation with fixed event type to interrupt vector assignment
IVEC WARL property is used to discover such mappings.
Address translation, command and queue fault handling in this change
is limited to simple fault reporting without taking any action.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/c4735fb6829053eff37ce1bcca4906192afd743c.1729059707.git.tjeznach@rivosinc.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduce device context allocation and device directory tree
management including capabilities discovery sequence, as described
in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.
Device directory mode will be auto detected using DDTP WARL property,
using highest mode supported by the driver and hardware. If none
supported can be configured, driver will fall back to global pass-through.
First level DDTP page can be located in I/O (detected using DDTP WARL)
and system memory.
Only simple identity and blocking protection domains are supported by
this implementation.
Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/e1c763aeccd2c05fd4ad3a32f6f2ff3b3148d907.1729059707.git.tjeznach@rivosinc.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Advertise IOMMU device and its core API.
Only minimal implementation for single identity domain type, without
per-group domain protection.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/ba79c8eb9c7f1cd9a8961a1b048e3991ee9a2b05.1729059707.git.tjeznach@rivosinc.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduce device driver for PCIe implementation
of RISC-V IOMMU architected hardware.
IOMMU hardware and system support for MSI or MSI-X is
required by this implementation.
Vendor and device identifiers used in this patch
matches QEMU implementation of the RISC-V IOMMU PCIe
device, from Rivos VID (0x1efd) range allocated by the PCI-SIG.
MAINTAINERS | added iommu-pci.c already covered by matching pattern.
Link: https://lore.kernel.org/qemu-devel/20240307160319.675044-1-dbarboza@ventanamicro.com/
Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/12f3bdbe519ebb7ca482191e7334d38b25b8ae8f.1729059707.git.tjeznach@rivosinc.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduce platform device driver for implementation of RISC-V IOMMU
architected hardware.
Hardware interface definition located in file iommu-bits.h is based on
ratified RISC-V IOMMU Architecture Specification version 1.0.0.
This patch implements platform device initialization, early check and
configuration of the IOMMU interfaces and enables global pass-through
address translation mode (iommu_mode == BARE), without registering
hardware instance in the IOMMU subsystem.
Link: https://github.com/riscv-non-isa/riscv-iommu
Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/2f2e4530c0ee4a81385efa90f1da932f5179f3fb.1729059707.git.tjeznach@rivosinc.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Support file mappings for mediated devices, aka mdevs. Access is
initiated by the vfio_pin_pages() and vfio_dma_rw() kernel interfaces.
Link: https://patch.msgid.link/r/1729861919-234514-9-git-send-email-steven.sistare@oracle.com
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
register memory by passing a memfd plus offset and length. Implement it
using the memfd_pin_folios() kAPI.
Link: https://patch.msgid.link/r/1729861919-234514-8-git-send-email-steven.sistare@oracle.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend pfn_reader_user() to pin file mappings, by calling
memfd_pin_folios(). Repin at small page granularity, and fill the batch
from folios. Expand folios to upages for the iopt_pages_fill() path.
Link: https://patch.msgid.link/r/1729861919-234514-7-git-send-email-steven.sistare@oracle.com
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add local variables for common sub-expressions needed by a subsequent
patch. No functional change.
Link: https://patch.msgid.link/r/1729861919-234514-5-git-send-email-steven.sistare@oracle.com
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The starting address in iopt_pages is currently a __user *uptr.
Generalize to allow other types of addresses. Refactor iopt_alloc_pages()
and iopt_map_user_pages() into address-type specific and common functions.
Link: https://patch.msgid.link/r/1729861919-234514-4-git-send-email-steven.sistare@oracle.com
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iopt_alloc_iova() takes a uptr argument but only checks for its alignment.
Generalize this to an unsigned address, which can be the offset from the
start of a file in a subsequent patch. No functional change.
Link: https://patch.msgid.link/r/1729861919-234514-3-git-send-email-steven.sistare@oracle.com
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The last callsite of iommu_present() is removed by commit <45c690aea8ee>
("drm/tegra: Use iommu_paging_domain_alloc()"). Remove it to avoid dead
code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241009051808.29455-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref)
to use optimized implementation and ease register pressure around
the primitive for targets that implement optimized variant.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241007084356.47799-1-ubizjak@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reorganize kerneldoc parameter names to match the parameter
order in the function header.
Problems identified using Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20240930112121.95324-20-Julia.Lawall@inria.fr
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This fixes a crash when surprise hot-unplugging a PCI device. This crash
happens because during hot-unplug __iommu_group_set_domain_nofail()
attaching the default domain fails when the platform no longer
recognizes the device as it has already been removed and we end up with
a NULL domain pointer and UAF. This is exactly the case referred to in
the second comment in __iommu_device_set_domain() and just as stated
there if we can instead attach the blocking domain the UAF is prevented
as this can handle the already removed device. Implement the blocking
domain to use this handling. With this change, the crash is fixed but
we still hit a warning attempting to change DMA ownership on a blocked
device.
Fixes: c76c067e48 ("s390/pci: Use dma-iommu layer")
Co-developed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240910211516.137933-1-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Replace comma between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it is seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Fixes: e3b1be2e73 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240923021557.3432068-1-nichen@iscas.ac.cn
Signed-off-by: Will Deacon <will@kernel.org>
The function arm_smmu_init_strtab_2lvl uses the expression
((1 << smmu->sid_bits) - 1)
to calculate the largest StreamID value. However, this fails for the
maximum allowed value of SMMU_IDR1.SIDSIZE which is 32. The C standard
states:
"If the value of the right operand is negative or is greater than or
equal to the width of the promoted left operand, the behavior is
undefined."
With smmu->sid_bits being 32, the prerequisites for undefined behavior
are met. We observed that the value of (1 << 32) is 1 and not 0 as we
initially expected.
Similar bit shift operations in arm_smmu_init_strtab_linear seem to not
be affected, because it appears to be unlikely for an SMMU to have
SMMU_IDR1.SIDSIZE set to 32 but then not support 2-level Stream tables
This issue was found by Ryan Huang <tzukui@google.com> on our team.
Fixes: ce410410f1 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Link: https://lore.kernel.org/r/20241002015357.1766934-1-danielmentz@google.com
Signed-off-by: Will Deacon <will@kernel.org>
CPRE workarounds are implicated in at least 5 MMU-500 errata, some of
which remain unfixed. The comment and warning message have proven to be
unhelpfully misleading about this scope, so reword them to get the point
across with less risk of going out of date or confusing users.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/dfa82171b5248ad7cf1f25592101a6eec36b8c9a.1728400877.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- sort out a few issues with the direct calls to iommu-dma
(Christoph Hellwig, Leon Romanovsky)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmbyWtkLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPToRAAlChDftSXmz9Isxds5VyzFy8El6hcutPiRfB8mUze
9aq7qgCeLVq6L0tPvNTgl2iYhS8KjTxVKSooYSQ0eAfb3FgNHKggmfV9GO8ddqBe
ojf9s9EJzJ3BatkVPBDPin88Sv0Vfwkb6n78gEeuTGPReScMdR5WZfmyl5QfMWpt
5W16gSA666DcIpxIDPo6zEPZT/lXsPcmvXrZJhFZe7hITNp/CihVxdxy9spLRG2Z
Ntei8AtXUvjn3ziU4ymcKOfP6fmERlVAiq65CqlXMUELs3Eh4knH9eEiXHtu0JAD
qMhuHSB5b2Dd2gjwvJDoISRPPtpy0sGnx81lAZJPIGnvVKKJSI+Lp7wDP/w66Rxx
7vF7tU9uyAb4Qmb5h0s9BqJ3mCNN9WXsAKrNX3rLr2ZLCjSuQFDne0oAYgVl10J8
ezALiv+mG4C4Q5lxzmZm4PU8f1CYjLMe0KppDMLsuwdWpliOlGigCKYddqkOj0ct
Cvg4lMHNcBo8xADyaqm/9iJYIkvLl5o6JGmVBbSCikUT3GMssBf/0KdITS4D+/cj
150LGxiZOQaOx1Ef8eACYslKJq8hIz0OmAjgDGpdLM7sBxa62FuZJrcujkKW9UA6
8DrYp+XqKd5R/ZOcjAIMbNAlZEa5CytgB9r4dXvrAC1dk//Dd23G873KunAE1DJJ
cy0=
=3hPK
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.12-2024-09-24' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- sort out a few issues with the direct calls to iommu-dma (Christoph
Hellwig, Leon Romanovsky)
* tag 'dma-mapping-6.12-2024-09-24' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: report unlimited DMA addressing in IOMMU DMA path
iommu/dma: remove most stubs in iommu-dma.h
dma-mapping: fix vmap and mmap of noncontiougs allocations
Collection of small cleanup and one fix:
- Sort headers and struct forward declarations
- Fix random selftest failures in some cases due to dirty tracking tests
- Have the reserved IOVA regions mechanism work when a HWPT is used as a
nesting parent. This updates the nesting parent's IOAS with the reserved
regions of the device and will also install the ITS doorbell page on
ARM.
- Add missed validation of parent domain ops against the current iommu
- Fix a syzkaller bug related to integer overflow during ALIGN()
- Tidy two iommu_domain attach paths
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZvGksgAKCRCFwuHvBreF
YbSfAP931gRT85t0r7z6tH1GJVIviX2mg5TYGsb9SkrxVKcKAwD9H65T7tJRzTyP
K1oYBY7wtpHbR38hjFbnRPD7ZM+k8A4=
=r7jm
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Collection of small cleanup and one fix:
- Sort headers and struct forward declarations
- Fix random selftest failures in some cases due to dirty tracking
tests
- Have the reserved IOVA regions mechanism work when a HWPT is used
as a nesting parent. This updates the nesting parent's IOAS with
the reserved regions of the device and will also install the ITS
doorbell page on ARM.
- Add missed validation of parent domain ops against the current
iommu
- Fix a syzkaller bug related to integer overflow during ALIGN()
- Tidy two iommu_domain attach paths"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommu: Set iommu_attach_handle->domain in core
iommufd: Avoid duplicated __iommu_group_set_core_domain() call
iommufd: Protect against overflow of ALIGN() during iova allocation
iommufd: Reorder struct forward declarations
iommufd: Check the domain owner of the parent before creating a nesting domain
iommufd/device: Enforce reserved IOVA also when attached to hwpt_nested
iommufd/selftest: Fix buffer read overrrun in the dirty test
iommufd: Reorder include files
Commit b5c58b2fdc ("dma-mapping: direct calls for dma-iommu") switched
to use direct calls to dma-iommu, but missed the dma_vmap_noncontiguous,
dma_vunmap_noncontiguous and dma_mmap_noncontiguous behavior keyed off the
presence of the alloc_noncontiguous method.
Fix this by removing the now unused alloc_noncontiguous and
free_noncontiguous methods and moving the vmapping and mmaping of the
noncontiguous allocations into the iommu code, as it is the only provider
of actually noncontiguous allocations.
Fixes: b5c58b2fdc ("dma-mapping: direct calls for dma-iommu")
Reported-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Xi Ruoyao <xry111@xry111.site>
Quite a lot of nilfs2 work this time around.
Notable patch series in this pull request are:
"mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64() to
provide (much) more accurate results. The current implementation was
causing Uwe some issues in the PWM drivers.
"xz: Updates to license, filters, and compression options" from Lasse
Collin. Miscellaneous maintenance and kinor feature work to the xz
decompressor.
"Fix some GDB command error and add some GDB commands" from Kuan-Ying Lee.
Fixes and enhancements to the gdb scripts.
"treewide: add missing MODULE_DESCRIPTION() macros" from Jeff Johnson.
Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of warnings about this.
"nilfs2: add support for some common ioctls" from Ryusuke Konishi. Adds
various commonly-available ioctls to nilfs2.
"This series fixes a number of formatting issues in kernel doc comments"
from Ryusuke Konishi does that.
"nilfs2: prevent unexpected ENOENT propagation" from Ryusuke Konishi. Fix
issues where -ENOENT was being unintentionally and inappropriately
returned to userspace.
"nilfs2: assorted cleanups" from Huang Xiaojia.
"nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
Konishi fixes some issues which can occur on corrupted nilfs2 filesystems.
"scripts/decode_stacktrace.sh: improve error reporting and usability" from
Luca Ceresoli does those things.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu7dpAAKCRDdBJ7gKXxA
jsPqAPwMDEZyKlfSw7QioEHNHDkmkbP7VYCYR0CbUnppbztwpAD8D37aVbWQ+UzM
3nnOq3W2Pc2o/20zqi8Upf1mnvUrygQ=
=/NWE
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Many singleton patches - please see the various changelogs for
details.
Quite a lot of nilfs2 work this time around.
Notable patch series in this pull request are:
- "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64()
to provide (much) more accurate results. The current implementation
was causing Uwe some issues in the PWM drivers.
- "xz: Updates to license, filters, and compression options" from
Lasse Collin. Miscellaneous maintenance and kinor feature work to
the xz decompressor.
- "Fix some GDB command error and add some GDB commands" from
Kuan-Ying Lee. Fixes and enhancements to the gdb scripts.
- "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff
Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of
warnings about this.
- "nilfs2: add support for some common ioctls" from Ryusuke Konishi.
Adds various commonly-available ioctls to nilfs2.
- "This series fixes a number of formatting issues in kernel doc
comments" from Ryusuke Konishi does that.
- "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke
Konishi. Fix issues where -ENOENT was being unintentionally and
inappropriately returned to userspace.
- "nilfs2: assorted cleanups" from Huang Xiaojia.
- "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
Konishi fixes some issues which can occur on corrupted nilfs2
filesystems.
- "scripts/decode_stacktrace.sh: improve error reporting and
usability" from Luca Ceresoli does those things"
* tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits)
list: test: increase coverage of list_test_list_replace*()
list: test: fix tests for list_cut_position()
proc: use __auto_type more
treewide: correct the typo 'retun'
ocfs2: cleanup return value and mlog in ocfs2_global_read_info()
nilfs2: remove duplicate 'unlikely()' usage
nilfs2: fix potential oob read in nilfs_btree_check_delete()
nilfs2: determine empty node blocks as corrupted
nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()
user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
tools/mm: rm thp_swap_allocator_test when make clean
squashfs: fix percpu address space issues in decompressor_multi_percpu.c
lib: glob.c: added null check for character class
nilfs2: refactor nilfs_segctor_thread()
nilfs2: use kthread_create and kthread_stop for the log writer thread
nilfs2: remove sc_timer_task
nilfs2: do not repair reserved inode bitmap in nilfs_new_inode()
nilfs2: eliminate the shared counter and spinlock for i_generation
nilfs2: separate inode type information from i_state field
nilfs2: use the BITS_PER_LONG macro
...
- support DMA zones for arm64 systems where memory starts at > 4GB
(Baruch Siach, Catalin Marinas)
- support direct calls into dma-iommu and thus obsolete dma_map_ops for
many common configurations (Leon Romanovsky)
- add DMA-API tracing (Sean Anderson)
- remove the not very useful return value from various dma_set_* APIs
(Christoph Hellwig)
- misc cleanups and minor optimizations (Chen Y, Yosry Ahmed,
Christoph Hellwig)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmbr2BALHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNheA/6A453SQy2kFvspFRvEp8ztEqtvxwxGLAUMIyvmU+a
9b37KlMwUnpbMsXK5+KtYdTLRoIvtl89uIkdZq7pYYKj0uoPZvF9QVnKtrJWAvqK
fFuauokZznuD3ZSd6v6uY4ijb29ImGfx5kZopQf1zWoYLENxM7mWqRU+eqxDozev
FbyfYhJzMBhpHveen9+Q7PEfi/90ZdEqtJhSK2AOzuV9ZvbYiSFCrcnT/4wM30DS
2OxjGa8tKcGYZ9ah0rF2V5hboaRuYedTFgXoKfUSJINJkzmBlTXdxVx5Xr3kQtyC
7S/xv2y79CXkDKck2+IY7xkhwwBsXPrTAyTzWAIJqOEmaMJ4KqEW54JOsK+VHfmO
29UKBnASOK0xvfCzakm2631iOzEZF743RgpQiOGeMcnph789Mwu8EUCcqeEW/fJy
Xh7B0z3/XgJz8BtTG/64IhmqO63Cwa/o7DSQdLr9dh5F/mPBzqrnRov97KL7mH1q
VSO0Z7+8J0x9ALcYutpth/IzG/lXtXn/pfR1sj6dBHvjf5SwjuT8MKUHgh0l6N+C
BWZn8swwrZaJ2Li2Gv3CpnCzVQZCkL6ns9VqAWiWq7VfGhDLndMqfi/jHCyGH83i
E3dMtqf81XaQ7JRDPCs7Jx/4Zkn/iNkkZe8IQsByMc1BY4oeD7/Z2s8mkK8MbNla
/CA=
=DZVc
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- support DMA zones for arm64 systems where memory starts at > 4GB
(Baruch Siach, Catalin Marinas)
- support direct calls into dma-iommu and thus obsolete dma_map_ops for
many common configurations (Leon Romanovsky)
- add DMA-API tracing (Sean Anderson)
- remove the not very useful return value from various dma_set_* APIs
(Christoph Hellwig)
- misc cleanups and minor optimizations (Chen Y, Yosry Ahmed, Christoph
Hellwig)
* tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: reflow dma_supported
dma-mapping: reliably inform about DMA support for IOMMU
dma-mapping: add tracing for dma-mapping API calls
dma-mapping: use IOMMU DMA calls for common alloc/free page calls
dma-direct: optimize page freeing when it is not addressable
dma-mapping: clearly mark DMA ops as an architecture feature
vdpa_sim: don't select DMA_OPS
arm64: mm: keep low RAM dma zone
dma-mapping: don't return errors from dma_set_max_seg_size
dma-mapping: don't return errors from dma_set_seg_boundary
dma-mapping: don't return errors from dma_set_min_align_mask
scsi: check that busses support the DMA API before setting dma parameters
arm64: mm: fix DMA zone when dma-ranges is missing
dma-mapping: direct calls for dma-iommu
dma-mapping: call ->unmap_page and ->unmap_sg unconditionally
arm64: support DMA zone above 4GB
dma-mapping: replace zone_dma_bits by zone_dma_limit
dma-mapping: use bit masking to check VM_DMA_COHERENT
- Implement per-PMU context rescheduling to significantly improve single-PMU
performance, and related cleanups/fixes. (by Peter Zijlstra and Namhyung Kim)
- Fix ancient bug resulting in a lot of events being dropped erroneously
at higher sampling frequencies. (by Luo Gengkun)
- uprobes enhancements:
- Implement RCU-protected hot path optimizations for better performance:
"For baseline vs SRCU, peak througput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest with bump from 2.4 M/s to 5 M/s.
For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from
8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%),
as we have more work to do on uretprobes side.
Even single-thread (no contention) performance is slightly better: 3.276 M/s to
3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%)
for uretprobes."
(by Andrii Nakryiko et al)
- Document mmap_lock, don't abuse get_user_pages_remote(). (by Oleg Nesterov)
- Cleanups & fixes to prepare for future work:
- Remove uprobe_register_refctr()
- Simplify error handling for alloc_uprobe()
- Make uprobe_register() return struct uprobe *
- Fold __uprobe_unregister() into uprobe_unregister()
- Shift put_uprobe() from delete_uprobe() to uprobe_unregister()
- BPF: Fix use-after-free in bpf_uprobe_multi_link_attach()
(by Oleg Nesterov)
- New feature & ABI extension: allow events to use PERF_SAMPLE READ with
inheritance, enabling sample based profiling of a group of counters over
a hierarchy of processes or threads. (by Ben Gainey)
- Intel uncore & power events updates:
- Add Arrow Lake and Lunar Lake support
- Add PERF_EV_CAP_READ_SCOPE
- Clean up and enhance cpumask and hotplug support
(by Kan Liang)
- Add LNL uncore iMC freerunning support
- Use D0:F0 as a default device
(by Zhenyu Wang)
- Intel PT: fix AUX snapshot handling race. (by Adrian Hunter)
- Misc fixes and cleanups. (by James Clark, Jiri Olsa, Oleg Nesterov and Peter Zijlstra)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbqxEwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iusw/43UAcAZVof6Qs+j6bVAxSabF66fFfE9Wh
jc+F4yZ2MGl9x6a1f392+CPcTdVsYp6G2QtRGMipD+trmi/lhDhmRrhxxD1KWIwP
zVGSBx9CSFl0UpCXdGiVrGzT5xpIpJ4qqW2XUVr32n8SxTT5X/vM5ySm6KUXsIrD
2/KXwucT9a7grkl3pvy/A/FUHxaF7oAMJjcIPSvLBveQjQSHUrZoCZdHsRGT9rjS
HjzxG6gDy97172z5XV1ej3HJOfFlFTQ1RcoxNqdLfiZ6n3hD4hfmtsXWB5zTzRjT
xHaCOmWLhEp5v+fK2+RCFiWUbDBsmW/mecZdrjGb3C1RIDWQhLCXXc95XtrobTvk
BkW9QEC/XRB+vU6Ssdv3ugN7yRWxih0BsLU5sy4nlzmwoYt9qOy8fgjRvSBKHr5K
Mu1RIFu+KXq++sa7+ZJjUMY70PHQCp2m4AHprG/Y98t93CQMhDXzGVpPzWyQuW/V
lqYFjd/CAoCIVGF4Jxq7sqOdZ1emDN+P0WSnnFWssJ0ZJFvxN9ZDPH2AaMk4lwo7
NFW6u3+0Vx9P0m/H6xRQj00Iye2JLMqJNCIA8QtjnB7L6upgVvcIPjgcG58fpV1o
xfJekOR1A7T2aQUDlX5t9Cu36ZUImDRmwHj2m1p84s5AANlbD7/fOmffR1Hn9uFj
wCTqSpi8Hg==
=E3s3
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Implement per-PMU context rescheduling to significantly improve
single-PMU performance, and related cleanups/fixes (Peter Zijlstra
and Namhyung Kim)
- Fix ancient bug resulting in a lot of events being dropped
erroneously at higher sampling frequencies (Luo Gengkun)
- uprobes enhancements:
- Implement RCU-protected hot path optimizations for better
performance:
"For baseline vs SRCU, peak througput increased from 3.7 M/s
(million uprobe triggerings per second) up to about 8 M/s. For
uretprobes it's a bit more modest with bump from 2.4 M/s to
5 M/s.
For SRCU vs RCU Tasks Trace, peak throughput for uprobes
increases further from 8 M/s to 10.3 M/s (+28%!), and for
uretprobes from 5.3 M/s to 5.8 M/s (+11%), as we have more
work to do on uretprobes side.
Even single-thread (no contention) performance is slightly
better: 3.276 M/s to 3.396 M/s (+3.5%) for uprobes, and 2.055
M/s to 2.174 M/s (+5.8%) for uretprobes."
(Andrii Nakryiko et al)
- Document mmap_lock, don't abuse get_user_pages_remote() (Oleg
Nesterov)
- Cleanups & fixes to prepare for future work:
- Remove uprobe_register_refctr()
- Simplify error handling for alloc_uprobe()
- Make uprobe_register() return struct uprobe *
- Fold __uprobe_unregister() into uprobe_unregister()
- Shift put_uprobe() from delete_uprobe() to uprobe_unregister()
- BPF: Fix use-after-free in bpf_uprobe_multi_link_attach()
(Oleg Nesterov)
- New feature & ABI extension: allow events to use PERF_SAMPLE READ
with inheritance, enabling sample based profiling of a group of
counters over a hierarchy of processes or threads (Ben Gainey)
- Intel uncore & power events updates:
- Add Arrow Lake and Lunar Lake support
- Add PERF_EV_CAP_READ_SCOPE
- Clean up and enhance cpumask and hotplug support
(Kan Liang)
- Add LNL uncore iMC freerunning support
- Use D0:F0 as a default device
(Zhenyu Wang)
- Intel PT: fix AUX snapshot handling race (Adrian Hunter)
- Misc fixes and cleanups (James Clark, Jiri Olsa, Oleg Nesterov and
Peter Zijlstra)
* tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
dmaengine: idxd: Clean up cpumask and hotplug for perfmon
iommu/vt-d: Clean up cpumask and hotplug for perfmon
perf/x86/intel/cstate: Clean up cpumask and hotplug
perf: Add PERF_EV_CAP_READ_SCOPE
perf: Generic hotplug support for a PMU with a scope
uprobes: perform lockless SRCU-protected uprobes_tree lookup
rbtree: provide rb_find_rcu() / rb_find_add_rcu()
perf/uprobe: split uprobe_unregister()
uprobes: travers uprobe's consumer list locklessly under SRCU protection
uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
uprobes: protected uprobe lifetime with SRCU
uprobes: revamp uprobe refcounting and lifetime management
bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()
perf/core: Fix small negative period being ignored
perf: Really fix event_function_call() locking
perf: Optimize __pmu_ctx_sched_out()
perf: Add context time freeze
perf: Fix event_function_call() locking
perf: Extract a few helpers
perf: Optimize context reschedule for single PMU cases
...
Including:
- Core changes:
- Allow ATS on VF when parent device is identity mapped.
- Optimize unmap path on ARM io-pagetable implementation.
- Use of_property_present().
- ARM-SMMU changes:
- SMMUv2:
- Devicetree binding updates for Qualcomm MMU-500 implementations.
- Extend workarounds for broken Qualcomm hypervisor to avoid
touching features that are not available (e.g. 16KiB page
support, reserved context banks).
- SMMUv3:
- Support for NVIDIA's custom virtual command queue hardware.
- Fix Stage-2 stall configuration and extend tests to cover this
area.
- A bunch of driver cleanups, including simplification of the
master rbtree code.
- Plus minor cleanups and fixes across both drivers.
- Intel VT-d changes:
- Retire si_domain and convert to use static identity domain.
- Batched IOTLB/dev-IOTLB invalidation.
- Small code refactoring and cleanups.
- AMD-Vi changes:
- Cleanup and refactoring of io-pagetable code.
- Add parameter to limit the used io-pagesizes.
- Other cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmboAtoACgkQK/BELZcB
GuNidQ//WOhwVQZmdS6vnU2vu//LwFE7Q7PsRYPW2QhFri1eurKo6jxMNBtgUsXu
fPTSEBM7/lhagRgb29ycrbOYoavkEnUiIMX7vRsjl9tVkqd/GKNTrMuUC+QPiBYQ
ASkStmEUW6Zvye4rWyUxiCJIFIA5wm74wSOOQ6X2Wg3WMo51njrj1DK/k2H5JenJ
RTmIA9Ynef2py38xWDd0UE/psvKrzA5uug4IP0E0v014i36cSEVrH7hjztMfd8Sc
2dUuJ8eUUtLTo1ffTcmxoTvUBjBzJOzeSQrFfaDZDgyqayt6JoSKeX1DV/nCI8kc
ftg0pe37Zr3mndgQC7wNyUO1GOmkJl+GpMFyJTG8wpnBc0tr+TDn1o6QERymcRxA
kn62n4vxxjWoRSKt3di7hNM0Uuwj8/z/cIbDSTNbSov4fDuuz0xppdcA/ewKATv0
VgmpP5OyIFZXM+mR4Vem2hZQQ3wPOsJAFVWS1ROtYQFgiimrGf+w9et8rEU4pmp5
Ve4rSmka60NLdE6i1JNqx4sRrRsdJJ55knI77nHrt0TZkbMzA/JG1UT3TbbMJTtd
v5dviMMOXLpcKQLgqlde8QWOEjT6VUw/fbU640iyzhrWAm8fWDBefrSv6JLhevQ4
fBajoaej89cd9DkkEJiSTiyGig8QkY3HFaqDo3u5g/sBBrMBZas=
=1QvI
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core changes:
- Allow ATS on VF when parent device is identity mapped
- Optimize unmap path on ARM io-pagetable implementation
- Use of_property_present()
ARM-SMMU changes:
- SMMUv2:
- Devicetree binding updates for Qualcomm MMU-500 implementations
- Extend workarounds for broken Qualcomm hypervisor to avoid
touching features that are not available (e.g. 16KiB page
support, reserved context banks)
- SMMUv3:
- Support for NVIDIA's custom virtual command queue hardware
- Fix Stage-2 stall configuration and extend tests to cover this
area
- A bunch of driver cleanups, including simplification of the
master rbtree code
- Minor cleanups and fixes across both drivers
Intel VT-d changes:
- Retire si_domain and convert to use static identity domain
- Batched IOTLB/dev-IOTLB invalidation
- Small code refactoring and cleanups
AMD-Vi changes:
- Cleanup and refactoring of io-pagetable code
- Add parameter to limit the used io-pagesizes
- Other cleanups and fixes"
* tag 'iommu-updates-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (77 commits)
dt-bindings: arm-smmu: Add compatible for QCS8300 SoC
iommu/amd: Test for PAGING domains before freeing a domain
iommu/amd: Fix argument order in amd_iommu_dev_flush_pasid_all()
iommu/amd: Add kernel parameters to limit V1 page-sizes
iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg
iommu/arm-smmu-v3: Add types for each level of the CD table
iommu/arm-smmu-v3: Shrink the cdtab l1_desc array
iommu/arm-smmu-v3: Do not use devm for the cd table allocations
iommu/arm-smmu-v3: Remove strtab_base/cfg
iommu/arm-smmu-v3: Reorganize struct arm_smmu_strtab_cfg
iommu/arm-smmu-v3: Add types for each level of the 2 level stream table
iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()
iommu/arm-smmu-qcom: apply num_context_bank fixes for SDM630 / SDM660
iommu/arm-smmu-v3: Use the new rb tree helpers
dt-bindings: arm-smmu: document the support on SA8255p
iommu/tegra241-cmdqv: Do not allocate vcmdq until dma_set_mask_and_coherent
iommu/tegra241-cmdqv: Drop static at local variable
iommu/tegra241-cmdqv: Fix ioremap() error handling in probe()
iommu/amd: Do not set the D bit on AMD v2 table entries
iommu/amd: Correct the reported page sizes from the V1 table
...
- Handle an allocation failure in the IO/APIC code gracefully instead of
crashing the machine.
- Remove support for APIC local destination mode on 64bit
Logical destination mode of the local APIC is used for systems with up
to 8 CPUs. It has an advantage over physical destination mode as it
allows to target multiple CPUs at once with IPIs. That advantage was
definitely worth it when systems with up to 8 CPUs were state of the
art for servers and workstations, but that's history.
In the recent past there were quite some reports of new laptops failing
to boot with logical destination mode, but they work fine with physical
destination mode. That's not a suprise because physical destination
mode is guaranteed to work as it's the only way to get a CPU up and
running via the INIT/INIT/STARTUP sequence. Some of the affected
systems were cured by BIOS updates, but not all OEMs provide them.
As the number of CPUs keep increasing, logical destination mode becomes
less used and the benefit for small systems, like laptops, is not
really worth the trouble. So just remove logical destination mode
support for 64bit and be done with it.
- Code and comment cleanups in the APIC area.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpL0gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYob/VD/984H4Ku5/Djq9HkhBO11hfRTIVz/uf
1/b5ogd3eN0dK5nAv79/Gj7E/zntVsvCjuCYckXz51xPxkQH2LxUhDKqeUwg5lmz
xQV0mKK4fIS/g5yymQGplKc7FfjRAnVL9ZZRRvMkvtqbr1+dA665XrfjFAPkp929
zLaBUbNC6YxYfSddsV+fE8711QP6NzCYdeEBIdZ3NuBrlGfiLy1g1OWCk8za7zjM
cLJfGnU63MNXI4smrZWrQwJDBOiQl1wPbJYWL216OPHofLzLNGNZFXm4y8OJcyN0
WPWn1TliAwpRYx18Z/cEPgkoES8mXqqpPcoo0yBjOmPLl31J6QYU7QQhDb3HOnM/
ALgnnuhoWll5YjNBPJkONAa7lpnmfTbEg82WxaipEscz9CyEBoeOLvYBGPl/YqV+
B8wMOZHDH+BchJ6rYXDA1AmkD+9q86F+ddbiVOKj09dVm/QeLrGjwox1O7yGALGZ
hZPQx9MsTOJqQIh40PsqFko6OiMKuMBIebacFb4NqmVA2/WbRbcmkzRyxk+kkBFv
UMZX5O6sQhat615WZkxTnjmdnXETTIlv4nRQURBd/LF6ECRkXXG11dWaZfTXZ9iW
8NNlHw8mIbGmzn7wWXHlhk7N7vuhWCikAf7V2y+eZUVtE56qGM2volJNCmTZacP2
rrjmltwEGR+5gg==
=Y3a/
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
- Handle an allocation failure in the IO/APIC code gracefully instead
of crashing the machine.
- Remove support for APIC local destination mode on 64bit
Logical destination mode of the local APIC is used for systems with
up to 8 CPUs. It has an advantage over physical destination mode as
it allows to target multiple CPUs at once with IPIs. That advantage
was definitely worth it when systems with up to 8 CPUs were state of
the art for servers and workstations, but that's history.
In the recent past there were quite some reports of new laptops
failing to boot with logical destination mode, but they work fine
with physical destination mode. That's not a suprise because physical
destination mode is guaranteed to work as it's the only way to get a
CPU up and running via the INIT/INIT/STARTUP sequence. Some of the
affected systems were cured by BIOS updates, but not all OEMs provide
them.
As the number of CPUs keep increasing, logical destination mode
becomes less used and the benefit for small systems, like laptops, is
not really worth the trouble. So just remove logical destination mode
support for 64bit and be done with it.
- Code and comment cleanups in the APIC area.
* tag 'x86-apic-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Fix comment on IRQ vector layout
x86/apic: Remove unused extern declarations
x86/apic: Remove logical destination mode for 64-bit
x86/apic: Remove unused inline function apic_set_eoi_cb()
x86/ioapic: Cleanup remaining coding style issues
x86/ioapic: Cleanup line breaks
x86/ioapic: Cleanup bracket usage
x86/ioapic: Cleanup comments
x86/ioapic: Move replace_pin_at_irq_node() to the call site
iommu/vt-d: Cleanup apic_printk()
x86/mpparse: Cleanup apic_printk()s
x86/ioapic: Cleanup guarded debug printk()s
x86/ioapic: Cleanup apic_printk()s
x86/apic: Cleanup apic_printk()s
x86/apic: Provide apic_printk() helpers
x86/ioapic: Use guard() for locking where applicable
x86/ioapic: Cleanup structs
x86/ioapic: Mark mp_alloc_timer_irq() __init
x86/ioapic: Handle allocation failures gracefully
- clean up TTBCR magic numbers and use u32 for this register
- fix clang issue in VFP code leading to kernel oops, caused by
compiler instruction scheduling.
- switch 32-bit Arm to use GENERIC_CPU_DEVICES and use the
arch_cpu_is_hotpluggable() hook.
- pass struct device to arm_iommu_create_mapping() and move over
to use iommu_paging_domain_alloc() rather than iommu_domain_alloc()
- make amba_bustype constant
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEuNNh8scc2k/wOAE+9OeQG+StrGQFAmbZziAACgkQ9OeQG+St
rGRC4g//WolL3+djg71a1uDj4X3XjuDStlCoj2WVrOqmoQ57BcNR+yR/H89tXtNa
kWypAr6GZP9aFIS1Liprce2ypnsbNTKZjOnU4zqTiG9XTq5ddBvGw71RiA74R8Gt
hDt27XpvzwAmkoLiTRK9KPrHUFRAL6jj++lq9HJBnHzIV2VD4wlUHt4P535Wn26p
IO2vBWIHxsE/OBprO92nMeEskDHj3bwHGMBMdogz4mQvoet55m9q1JLtWlc5Fb5j
yrHPoZ/4kYyu8ReIZMvudULMvg9XwF8LT4ferwk7qrKj/aua0xglK3h6FBzKj0EU
Tsp9ic2TwVoDFg9xoV9US7+axVikxN6ZgVCqlGUZ+0jDArHlskV++cncsqfAxLw1
ga4Tb9JG3oPAj+SThLgsk8PzKK+RWXpW+n4DQsgJaI7iJhSyB6mLFwr5fud3j3mZ
pMVMeXAh3GVjcdfwDtWmrUmwSLh9KMaOLIKCQm4ooL3ELVsa+UUbEHLb71ZKMa8T
/H9bNkABhJHqTk7T2tBIVKV/hSnHddjg2lGLOppuo5z3DKYi6oIlbBcqBSqz69PK
afkSrhWk8u7tvHI31uY9qncyjY8mQPkvXu1mYxz9t0D3w/+YM5ts45y4bLUg5wMr
8eSkmFPVBQWqeQurs490YwVcmGNL9v4gSsArI7RJzvxnj9UUVVY=
=hhvp
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux
Pull ARM updates from Russell King:
- clean up TTBCR magic numbers and use u32 for this register
- fix clang issue in VFP code leading to kernel oops, caused by
compiler instruction scheduling.
- switch 32-bit Arm to use GENERIC_CPU_DEVICES and use the
arch_cpu_is_hotpluggable() hook.
- pass struct device to arm_iommu_create_mapping() and move over to use
iommu_paging_domain_alloc() rather than iommu_domain_alloc()
- make amba_bustype constant
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9418/1: dma-mapping: Use iommu_paging_domain_alloc()
ARM: 9417/1: dma-mapping: Pass device to arm_iommu_create_mapping()
ARM: 9416/1: amba: make amba_bustype constant
ARM: 9412/1: Convert to arch_cpu_is_hotpluggable()
ARM: 9411/1: Switch over to GENERIC_CPU_DEVICES using arch_register_cpu()
ARM: 9410/1: vfp: Use asm volatile in fmrx/fmxr macros
ARM: 9409/1: mmu: Do not use magic number for TTBCR settings
This domain free function can be called for IDENTITY and SVA domains too,
and they don't have page tables. For now protect against this by checking
the type. Eventually the different types should have their own free
functions.
Fixes: 485534bfcc ("iommu/amd: Remove conditions from domain free paths")
Reported-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/0-v1-ad9884ee5f5b+da-amd_iopgtbl_fix_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
An incorrect argument order calling amd_iommu_dev_flush_pasid_pages()
causes improper flushing of the IOMMU, leaving the old value of GCR3 from
a previous process attached to the same PASID.
The function has the signature:
void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
ioasid_t pasid, u64 address, size_t size)
Correct the argument order.
Cc: stable@vger.kernel.org
Fixes: 474bf01ed9 ("iommu/amd: Add support for device based TLB invalidation")
Signed-off-by: Eliav Bar-ilan <eliavb@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/0-v1-fc6bc37d8208+250b-amd_pasid_flush_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The IOMMU core sets the iommu_attach_handle->domain for the
iommu_attach_group_handle() path, while the iommu_replace_group_handle()
sets it on the caller side. Make the two paths aligned on it.
Link: https://patch.msgid.link/r/20240908114256.979518-3-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For the fault-capable hwpts, the iommufd_hwpt_detach_device() calls both
iommufd_fault_domain_detach_dev() and iommu_detach_group(). This would have
duplicated __iommu_group_set_core_domain() call since both functions call
it in the end. This looks no harm as the __iommu_group_set_core_domain()
returns if the new domain equals to the existing one. But it makes sense to
avoid such duplicated calls in caller side.
Link: https://patch.msgid.link/r/20240908114256.979518-2-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add two new kernel command line parameters to limit the page-sizes
used for v1 page-tables:
nohugepages - Limits page-sizes to 4KiB
v2_pgsizes_only - Limits page-sizes to 4Kib/2Mib/1GiB; The
same as the sizes used with v2 page-tables
This is needed for multiple scenarios. When assigning devices to
SEV-SNP guests the IOMMU page-sizes need to match the sizes in the RMP
table, otherwise the device will not be able to access all shared
memory.
Also, some ATS devices do not work properly with arbitrary IO
page-sizes as supported by AMD-Vi, so limiting the sizes used by the
driver is a suitable workaround.
All-in-all, these parameters are only workarounds until the IOMMU core
and related APIs gather the ability to negotiate the page-sizes in a
better way.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240905072240.253313-1-joro@8bytes.org
The iommu PMU is system-wide scope, which is supported by the generic
perf_event subsystem now.
Set the scope for the iommu PMU and remove all the cpumask and hotplug
codes.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240802151643.1691631-5-kan.liang@linux.intel.com
The members here are being used for both the linear and the 2 level case,
with the meaning of each item slightly different in the two cases.
Split it into a clean union where both cases have their own struct with
their own logical names and correct types.
Adjust all the users to detect linear/2lvl and use the right sub structure
and types consistently.
Remove CTXDESC_CD_DWORDS by changing the last places to use
sizeof(struct arm_smmu_cd).
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
As well as indexing helpers arm_smmu_cdtab_l1/2_idx().
Remove CTXDESC_L1_DESC_DWORDS and CTXDESC_CD_DWORDS replacing them all
with type specific calculations.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The top of the 2 level CD table is (at most) 1024 entries big, and two
high order allocations are required. One of __le64 which is programmed
into the HW (8k) and one of struct arm_smmu_l1_ctx_desc which holds the
CPU pointer (16k).
There are two copies of the l2ptr_dma, one is stored in the struct
arm_smmu_l1_ctx_desc, and another is encoded in the __le64 for the HW to
use. Instead of storing two copies just decode the value from the __le64.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The master->cd_table is entirely contained within the struct
arm_smmu_master which is guaranteed to be freed by the core code under
arm_smmu_release_device().
There is no reason to use devm here, arm_smmu_free_cd_tables() is reliably
called to free the CD related memory. Remove it and save some memory.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
These values can be computed from the other values already stored in the
config. Move the calculation to arm_smmu_write_strtab() and do it directly
before writing the registers.
This moves all the logic to calculate the two registers into one function
from three and saves an unimportant 16 bytes from the arm_smmu_device.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The members here are being used for both the linear and the 2 level case,
with the meaning of each item slightly different in the two cases.
Split it into a clean union where both cases have their own struct with
their own logical names and correct types.
Adjust all the users to detect linear/2lvl and use the right sub structure
and types consistently.
Remove STRTAB_STE_DWORDS by changing the last places to use
sizeof(struct arm_smmu_ste).
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Add types struct arm_smmu_strtab_l1 and l2 to represent the HW layout of
the descriptors, and use them in most places, following patches will get
the remaing places. The size of the l1 and l2 HW allocations are
sizeof(struct arm_smmu_strtab_l1/2).
This provides some more clarity than having raw __le64 *'s and sizes
computed via macros.
Remove STRTAB_L1_DESC_DWORDS.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Don't open code the calculations of the indexes for each level, provide
two functions to do that math and call them in all the places. Update all
the places computing indexes.
Calculate the L1 table size directly based on the max required index from
the cap. Remove STRTAB_L1_SZ_SHIFT in favour of STRTAB_NUM_L2_STES.
Use STRTAB_NUM_L2_STES to replace remaining open coded 1 << STRTAB_SPLIT.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-6416877274e1+1af-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The Qualcomm SDM630 / SDM660 platform requires the same kind of
workaround as MSM8998: some IOMMUs have context banks reserved by
firmware / TZ, touching those banks resets the board.
Apply the num_context_bank workaround to those two SMMU devices in order
to allow them to be used by Linux.
Fixes: b812834b53 ("iommu: arm-smmu-qcom: Add sdm630/msm8998 compatibles for qcom quirks")
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20240907-sdm660-wifi-v1-1-e316055142f8@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
Since v5.12 the rbtree has gained some simplifying helpers aimed at making
rb tree users write less convoluted boiler plate code. Instead the caller
provides a single comparison function and the helpers generate the prior
open-coded stuff.
Update smmu->streams to use rb_find_add() and rb_find().
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-9fef8cdc2ff6+150d1-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
It's observed that, when the first 4GB of system memory was reserved, all
VCMDQ allocations failed (even with the smallest qsz in the last attempt):
arm-smmu-v3: found companion CMDQV device: NVDA200C:00
arm-smmu-v3: option mask 0x10
arm-smmu-v3: failed to allocate queue (0x8000 bytes) for vcmdq0
acpi NVDA200C:00: tegra241_cmdqv: Falling back to standard SMMU CMDQ
arm-smmu-v3: ias 48-bit, oas 48-bit (features 0x001e1fbf)
arm-smmu-v3: allocated 524288 entries for cmdq
arm-smmu-v3: allocated 524288 entries for evtq
arm-smmu-v3: allocated 524288 entries for priq
This is because the 4GB reserved memory shifted the entire DMA zone from a
lower 32-bit range (on a system without the 4GB carveout) to higher range,
while the dev->coherent_dma_mask was set to DMA_BIT_MASK(32) by default.
The dma_set_mask_and_coherent() call is done in arm_smmu_device_hw_probe()
of the SMMU driver. So any DMA allocation from tegra241_cmdqv_probe() must
wait until the coherent_dma_mask is correctly set.
Move the vintf/vcmdq structure initialization routine into a different op,
"init_structures". Call it at the end of arm_smmu_init_structures(), where
standard SMMU queues get allocated.
Most of the impl_ops aren't ready until vintf/vcmdq structure are init-ed.
So replace the full impl_ops with an init_ops in __tegra241_cmdqv_probe().
And switch to tegra241_cmdqv_impl_ops later in arm_smmu_init_structures().
Note that tegra241_cmdqv_impl_ops does not link to the new init_structures
op after this switch, since there is no point in having it once it's done.
Fixes: 918eb5c856 ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: Matt Ochs <mochs@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/530993c3aafa1b0fc3d879b8119e13c629d12e2b.1725503154.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
This check was missed, before we can pass a struct iommu_domain to a
driver callback we need to validate that the domain was created by that
driver.
Fixes: bd529dbb66 ("iommufd: Add a nested HW pagetable object")
Link: https://patch.msgid.link/r/0-v1-c8770519edde+1a-iommufd_nesting_ops_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The ioremap() function doesn't return error pointers, it returns NULL
on error so update the error handling. Also just return directly
instead of calling iounmap() on the NULL pointer. Calling
iounmap(NULL) doesn't cause a problem on ARM but on other architectures
it can trigger a warning so it'a bad habbit.
Fixes: 918eb5c856 ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/5a6c1e9a-0724-41b1-86d4-36335d3768ea@stanley.mountain
Signed-off-by: Will Deacon <will@kernel.org>
All users of ARM IOMMU mappings create them for a particular device, so
change the interface to accept the device rather than forcing a vague
indirection through a bus type. This prepares for making a similar
change to iommu_domain_alloc() itself.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
The manual says that bit 6 is IGN for all Page-Table Base Address
pointers, don't set it.
Fixes: aaac38f614 ("iommu/amd: Initial support for AMD IOMMU v2 page table")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The HW only has 52 bits of physical address support, the supported page
sizes should not have bits set beyond this. Further the spec says that the
6th level does not support any "default page size for translation entries"
meaning leafs in the 6th level are not allowed too.
Rework the definition to use GENMASK to build the range of supported pages
from the top of physical to 4k.
Nothing ever uses such large pages, so this is a cosmetic/documentation
improvement only.
Reported-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu driver is supposed to provide these ops to its io_pgtable
implementation so that it can hook the invalidations and do the right
thing.
They are called by wrapper functions like io_pgtable_tlb_add_page() etc,
which the AMD code never calls.
Instead it directly calls the AMD IOMMU invalidation functions by casting
to the struct protection_domain. Remove it all.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/12-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Generates the same code, but is not the expected C style.
Fixes: aaac38f614 ("iommu/amd: Initial support for AMD IOMMU v2 page table")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Don't use tlb as some flag to indicate if protection_domain_alloc()
completed. Have protection_domain_alloc() unwind itself in the normal
kernel style and require protection_domain_free() only be called on
successful results of protection_domain_alloc().
Also, the amd_iommu_domain_free() op is never called by the core code with
a NULL argument, so remove all the NULL tests as well.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The AMD io_pgtable stuff doesn't implement the tlb ops callbacks, instead
it invokes the invalidation ops directly on the struct protection_domain.
Narrow the use of struct protection_domain to only those few code paths.
Make everything else properly use struct amd_io_pgtable through the call
chains, which is the correct modular type for an io-pgtable module.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/9-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We already have memory in the union here that is being wasted in AMD's
case, use it to store the nid.
Putting the nid here further isolates the io_pgtable code from the struct
protection_domain.
Fixup protection_domain_alloc so that the NID from the device is provided,
at this point dev is never NULL for AMD so this will now allocate the
first table pointer on the correct NUMA node.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/8-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This struct is already in iop.cfg, we don't need two.
AMD is using this API sort of wrong, the cfg is supposed to be passed in
and then the allocation function will allocate ops memory and copy the
passed config into the new memory. Keep it kind of wrong and pass in the
cfg memory that is already part of the pagetable struct.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is struct protection_domain iopt and struct amd_io_pgtable iopt.
Next patches are going to want to write domain.iopt.iopt.xx which is quite
unnatural to read.
Give one of them a different name, amd_io_pgtable has fewer references so
call it pgtbl, to match pgtbl_cfg, instead.
Suggested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Looks like many refactorings here have left this confused. There is only
one storage of the root/mode, it is in the iop struct.
increase_address_space() calls amd_iommu_domain_set_pgtable() with values
that it already stored in iop a few lines above.
amd_iommu_domain_clr_pt_root() is zero'ing memory we are about to free. It
used to protect against a double free of root, but that is gone now.
Remove amd_iommu_domain_set_pgtable(), amd_iommu_domain_set_pt_root(),
amd_iommu_domain_clr_pt_root() as they are all pointless.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
It is a serious bug if the domain is still mapped to any DTEs when it is
freed as we immediately start freeing page table memory, so any remaining
HW touch will UAF.
If it is not mapped then dev_list is empty and amd_iommu_domain_update()
does nothing.
Remove it and add a WARN_ON() to catch this class of bug.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When using io_pgtable the correct pgsize_bitmap is stored in the cfg, both
v1_alloc_pgtable() and v2_alloc_pgtable() set it correctly.
This fixes a bug where the v2 pgtable had the wrong pgsize as
protection_domain_init_v2() would set it and then do_iommu_domain_alloc()
immediately resets it.
Remove the confusing ops.pgsize_bitmap since that is not used if the
driver sets domain.pgsize_bitmap.
Fixes: 134288158a ("iommu/amd: Add domain_alloc_user based domain allocation")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Domain allocation is always done under a sleepable context, the v1 path
and other drivers use GFP_KERNEL already. Fix the v2 path to also use
GFP_KERNEL.
Fixes: 0d571dcbe7 ("iommu/amd: Allocate page table using numa locality info")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
All the page table memory should be allocated/free within the io_pgtable
struct. The v2 path is already doing this, make it consistent.
It is hard to see but the free of the root in protection_domain_free() is
a NOP on the success path because v1_free_pgtable() does
amd_iommu_domain_clr_pt_root().
The root memory is already freed because free_sub_pt() put it on the
freelist. The free path in protection_domain_free() is only used during
error unwind of protection_domain_alloc().
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-831cdc4d00f3+1a315-amd_iopgtbl_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
As its used inside iommu.c only. Also rename function to dev_update_dte()
as its static function.
No functional changes intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-9-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Remove separate function to update and flush the device table as only
amd_iommu_update_and_flush_device_table() calls these functions.
No functional changes intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-8-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
AMD driver uses amd_iommu_domain_flush_complete() function to make sure
IOMMU processed invalidation commands before proceeding. Ideally this
should be called from functions which updates DTE/invalidates caches.
There is no need to call this function explicitly. This patches makes
below changes :
- Rename amd_iommu_domain_flush_complete() -> domain_flush_complete()
and make it as static function.
- Rearrage domain_flush_complete() to avoid forward declaration.
- Update amd_iommu_update_and_flush_device_table() to call
domain_flush_complete().
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-7-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
As its not used outside iommu.c. Also rename it as dev_flush_pasid_all().
No functional change intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-6-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Do not try to set max_pasids in error path as dev_data is not allocated.
Fixes: a0c47f233e ("iommu/amd: Introduce iommu_dev_data.max_pasids")
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-5-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
It was added in commit 52815b7568 ("iommu/amd: Add support for
IOMMUv2 domain mode"), but never used it. Hence remove these unused
macros.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-4-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
amd_iommu_is_attach_deferred() is a callback function called by
iommu_ops. Make it as static.
No functional changes intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-3-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Update event buffer head pointer once driver completes processing. So
that IOMMU can write new log without waiting for driver to complete
processing all event logs.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-2-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Use of_property_present() to test for property presence rather than
of_(find|get)_property(). This is part of a larger effort to remove
callers of of_find_property() and similar functions. of_find_property()
leaks the DT struct property and data pointers which is a problem for
dynamically allocated nodes which may be freed.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20240731191312.1710417-6-robh@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Converts IOTLB and Dev-IOTLB invalidation to a batched model. Cache tag
invalidation requests for a domain are now accumulated in a qi_batch
structure before being flushed in bulk. It replaces the previous per-
request qi_flush approach with a more efficient batching mechanism.
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Link: https://lore.kernel.org/r/20240815065221.50328-5-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Introduces a qi_batch structure to hold batched cache invalidation
descriptors on a per-dmar_domain basis. A fixed-size descriptor
array is used for simplicity. The qi_batch is allocated when the
first cache tag is added to the domain and freed during
iommu_free_domain().
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Link: https://lore.kernel.org/r/20240815065221.50328-4-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Extracts IOTLB and Dev-IOTLB invalidation logic from cache tag flush
interfaces into dedicated helper functions. It prepares the codebase
for upcoming changes to support batched cache invalidations.
To enable direct use of qi_flush helpers in the new functions,
iommu->flush.flush_iotlb and quirk_extra_dev_tlb_flush() are opened up.
No functional changes are intended.
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Link: https://lore.kernel.org/r/20240815065221.50328-3-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Separate the logic for constructing IOTLB and device TLB invalidation
descriptors from the qi_flush interfaces. New helpers, qi_desc(), are
introduced to encapsulate this common functionality.
Moving descriptor composition code to new helpers enables its reuse in
the upcoming qi_batch interfaces.
No functional changes are intended.
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Link: https://lore.kernel.org/r/20240815065221.50328-2-tina.zhang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The caching mode of an IOMMU is irrelevant to the behavior of the device
TLB. Previously, commit <304b3bde24b5> ("iommu/vt-d: Remove caching mode
check before device TLB flush") removed this redundant check in the
domain unmap path.
Checking the caching mode before flushing the device TLB after a pasid
table entry is updated is unnecessary and can lead to inconsistent
behavior.
Extends this consistency by removing the caching mode check in the pasid
table update path.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240820030208.20020-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently, PCI PASID is enabled alongside PCI ATS when an iommu domain is
attached to the device and disabled when the device transitions to block
translation mode. This approach is inappropriate as PCI PASID is a device
feature independent of the type of the attached domain.
Enable PCI PASID during the IOMMU device probe and disables it during the
release path.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240819051805.116936-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If qi_submit_sync() is invoked with 0 invalidation descriptors (for
instance, for DMA draining purposes), we can run into a bug where a
submitting thread fails to detect the completion of invalidation_wait.
Subsequently, this led to a soft lockup. Currently, there is no impact
by this bug on the existing users because no callers are submitting
invalidations with 0 descriptors. This fix will enable future users
(such as DMA drain) calling qi_submit_sync() with 0 count.
Suppose thread T1 invokes qi_submit_sync() with non-zero descriptors, while
concurrently, thread T2 calls qi_submit_sync() with zero descriptors. Both
threads then enter a while loop, waiting for their respective descriptors
to complete. T1 detects its completion (i.e., T1's invalidation_wait status
changes to QI_DONE by HW) and proceeds to call reclaim_free_desc() to
reclaim all descriptors, potentially including adjacent ones of other
threads that are also marked as QI_DONE.
During this time, while T2 is waiting to acquire the qi->q_lock, the IOMMU
hardware may complete the invalidation for T2, setting its status to
QI_DONE. However, if T1's execution of reclaim_free_desc() frees T2's
invalidation_wait descriptor and changes its status to QI_FREE, T2 will
not observe the QI_DONE status for its invalidation_wait and will
indefinitely remain stuck.
This soft lockup does not occur when only non-zero descriptors are
submitted.In such cases, invalidation descriptors are interspersed among
wait descriptors with the status QI_IN_USE, acting as barriers. These
barriers prevent the reclaim code from mistakenly freeing descriptors
belonging to other submitters.
Considered the following example timeline:
T1 T2
========================================
ID1
WD1
while(WD1!=QI_DONE)
unlock
lock
WD1=QI_DONE* WD2
while(WD2!=QI_DONE)
unlock
lock
WD1==QI_DONE?
ID1=QI_DONE WD2=DONE*
reclaim()
ID1=FREE
WD1=FREE
WD2=FREE
unlock
soft lockup! T2 never sees QI_DONE in WD2
Where:
ID = invalidation descriptor
WD = wait descriptor
* Written by hardware
The root of the problem is that the descriptor status QI_DONE flag is used
for two conflicting purposes:
1. signal a descriptor is ready for reclaim (to be freed)
2. signal by the hardware that a wait descriptor is complete
The solution (in this patch) is state separation by using QI_FREE flag
for #1.
Once a thread's invalidation descriptors are complete, their status would
be set to QI_FREE. The reclaim_free_desc() function would then only
free descriptors marked as QI_FREE instead of those marked as
QI_DONE. This change ensures that T2 (from the previous example) will
correctly observe the completion of its invalidation_wait (marked as
QI_DONE).
Signed-off-by: Sanjay K Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240728210059.1964602-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The static identity domain has been introduced, rendering the si_domain
obsolete. Remove si_domain and cleanup the code accordingly.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240809055431.36513-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Software determines VT-d hardware support for passthrough translation by
inspecting the capability register. If passthrough translation is not
supported, the device is instructed to use DMA domain for its default
domain.
Add a global static identity domain with guaranteed attach semantics for
IOMMUs that support passthrough translation mode.
The global static identity domain is a dummy domain without corresponding
dmar_domain structure. Consequently, the device's info->domain will be
NULL with the identity domain is attached. Refactor the code accordingly.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240809055431.36513-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Extract common code from domain_context_mapping_one() into new helpers,
making it reusable by other functions such as the upcoming identity domain
implementation. No intentional functional changes.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20240809055431.36513-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The has_iotlb_device flag was used to indicate if a domain had attached
devices with ATS enabled. Domains without this flag didn't require device
TLB invalidation during unmap operations, optimizing performance by
avoiding unnecessary device iteration.
With the introduction of cache tags, this flag is no longer needed. The
code to iterate over attached devices was removed by commit 06792d0679
("iommu/vt-d: Cleanup use of iommu_flush_iotlb_psi()").
Remove has_iotlb_device to avoid unnecessary code.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240809055431.36513-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We will use a global static identity domain. Reserve a static domain ID
for it.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20240809055431.36513-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
As the driver has enforced DMA domains for devices managed by an IOMMU
hardware that doesn't support passthrough translation mode, there is no
need for static identity mappings in the si_domain. Remove the identity
mapping code to avoid dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240809055431.36513-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu core defines the def_domain_type callback to query the iommu
driver about hardware capability and quirks. The iommu driver should
declare IOMMU_DOMAIN_DMA requirement for hardware lacking pass-through
capability.
Earlier VT-d hardware implementations did not support pass-through
translation mode. The iommu driver relied on a paging domain with all
physical system memory addresses identically mapped to the same IOVA
to simulate pass-through translation before the def_domain_type was
introduced and it has been kept until now. It's time to adjust it now
to make the Intel iommu driver follow the def_domain_type semantics.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20240809055431.36513-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Kernel test robot reported a few trucation warnings at the snprintf:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:
In function ‘tegra241_vintf_free_lvcmdq’:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:56:
warning: ‘%u’ directive output may be truncated writing between 1 and
5 bytes into a region of size between 3 and 11 [-Wformat-truncation=]
239 | snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
| ^~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:32: note: directive argument
in the range [0, 65535]
239 | snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:239:9: note: ‘snprintf’ output
between 25 and 37 bytes into a destination of size 32
239 | snprintf(header, hlen, "VINTF%u: VCMDQ%u/LVCMDQ%u: ",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
240 | vcmdq->vintf->idx, vcmdq->idx, vcmdq->lidx);
Fix by bumping up the size of the header to hold more characters.
Fixes: 918eb5c856 ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409020406.7ed5uojF-lkp@intel.com/
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20240902055745.629456-1-nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The fault-inject.h users across the kernel need to add a lot of #ifdef
CONFIG_FAULT_INJECTION to cater for shortcomings in the header. Make
fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs
for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and
should_fail() to allow removal of conditional compilation.
[akpm@linux-foundation.org: repair fallout from no longer including debugfs.h into fault-inject.h]
[akpm@linux-foundation.org: fix drivers/misc/xilinx_tmr_inject.c]
[akpm@linux-foundation.org: Add debugfs.h inclusion to more files, per Stephen]
Link: https://lkml.kernel.org/r/20240813121237.2382534-1-jani.nikula@intel.com
Fixes: 6ff1cb355e ("[PATCH] fault-injection capabilities infrastructure")
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Including:
- Fix a device-stall problem in bad io-page-fault setups (faults
received from devices with no supporting domain attached).
- Context flush fix for Intel VT-d.
- Do not allow non-read+non-write mapping through iommufd as most
implementations can not handle that.
- Fix a possible infinite-loop issue in map_pages() path.
- Add Jean-Philippe as reviewer for SMMUv3 SVA support
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmbRvfEACgkQK/BELZcB
GuOB8w//WLapQpxMw9w+4l3Z3SqxB5gSPF6pdCJwYRrpFGBX1yNZ0vWtF2TpKtOC
NaMa/EC1C2FWjcArCP21uFtDvN04FgXSVl6sjFUHsUf+YALrUfljQk/XFI4SenTq
PtvPv8PVGbhqLtdJDXMlQWBN3RX0qK/PIFmuUX5ySBk7J7k5QyBi2HEuK2DbPM7j
+LMnyTHj5Aa2jRz/NSCDIRKbSFJKgvd8apval2VX0zljjpyqk5KmHHjkLtiOiTTI
G6ZJlRYCn98eTLU2ww8b7/y0vVYop7C1Q7Cyds/72xvW+a3jbSRIGf6yqtmdbMYd
faxRng5rWHWsq3XMZC+Ts9k2FA3pUIvOmfptCFfrQYYXvZI6dD6o7uMko6SF82n4
xEy+H6AEWZXF70xaJDp1cn1PpURJgJly/l/6qAIB746qNT7j/CcOOha1bpbCy81x
EIOl0B4wyJGjQnxjKsH01K9ec3uT6rugbpFEE9PL8l25khhyweBwuQWc2EVxRZgH
ICH4pCmvU9Wy6mpXL2R/SyzECWjgg0oJr+pq3Yxv7xufSGQswWJ/StFozSBHnH01
OGGA/2xMrNeRzlm4PZfRzdAiCfYX9kEodiF1jGLA4B1V5Tx/y1LSX7W/nCeZmlRz
/OhEC07DWZumeSCTe5I+BmZwiXh/DEAlUypDQkVKaaeGltlyvl8=
=8XuD
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Fix a device-stall problem in bad io-page-fault setups (faults
received from devices with no supporting domain attached).
- Context flush fix for Intel VT-d.
- Do not allow non-read+non-write mapping through iommufd as most
implementations can not handle that.
- Fix a possible infinite-loop issue in map_pages() path.
- Add Jean-Philippe as reviewer for SMMUv3 SVA support
* tag 'iommu-fixes-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
MAINTAINERS: Add Jean-Philippe as SMMUv3 SVA reviewer
iommu: Do not return 0 from map_pages if it doesn't do anything
iommufd: Do not allow creating areas without READ or WRITE
iommu/vt-d: Fix incorrect domain ID in context flush helper
iommu: Handle iommu faults for a bad iopf setup
At the moment, the SMMUv3 unit tests assume ATS is always enabled,
although this is sufficient to test hitless/non-hitless transitions,
but exercising other features is useful to check ste/cd population
logic (for example the .get_used logic).
Add an enum where bits define features per-master, at the moment there
is only ATS and STALLs which are mutually exclusive, but this would
make it easier to extend with other features in the future.
Also, Add 2 more tests for s1 <-> s2 transitions with stalls enabled.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20240830110349.797399-3-smostafa@google.com
Signed-off-by: Will Deacon <will@kernel.org>
According to the spec (ARM IHI 0070 F.b), in
"5.5 Fault configuration (A, R, S bits)":
A STE with stage 2 translation enabled and STE.S2S == 0 is
considered ILLEGAL if SMMU_IDR0.STALL_MODEL == 0b10.
Also described in the pseudocode “SteIllegal()”
if STE.Config == '11x' then
[..]
if eff_idr0_stall_model == '10' && STE.S2S == '0' then
// stall_model forcing stall, but S2S == 0
return TRUE;
Which means, S2S must be set when stall model is
"ARM_SMMU_FEAT_STALL_FORCE", but currently the driver ignores that.
Although, the driver can do the minimum and only set S2S for
“ARM_SMMU_FEAT_STALL_FORCE”, it is more consistent to match S1
behaviour, which also sets it for “ARM_SMMU_FEAT_STALL” if the
master has requested stalls.
Also, since S2 stalls are enabled now, report them to the IOMMU layer
and for VFIO devices it will fail anyway as VFIO doesn’t register an
iopf handler.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20240830110349.797399-2-smostafa@google.com
Signed-off-by: Will Deacon <will@kernel.org>
When VCMDQs are assigned to a VINTF owned by a guest (HYP_OWN bit unset),
only TLB and ATC invalidation commands are supported by the VCMDQ HW. So,
implement the new cmdq->supports_cmd op to scan the input cmd in order to
make sure that it is supported by the selected queue.
Note that the guest VM shouldn't have HYP_OWN bit being set regardless of
guest kernel driver writing it or not, i.e. the hypervisor running in the
host OS should wire this bit to zero when trapping a write access to this
VINTF_CONFIG register from a guest kernel.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/8160292337059b91271045800e5c62f7295e2c24.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The VCMDQ in the tegra241-cmdqv driver has a guest mode that supports only
a few invalidation commands. A batch is initialized with a cmdq, so it has
to confirm whether a new command is supported or not.
Add a supports_cmd function pointer to the cmdq structure, where the vcmdq
driver should hook a command scan function. Add an inline helper too so it
can be used by both sides.
If a new command is not supported, simply issue the existing batch and re-
init it as a new batch.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/aafb24b881504f18c5d0c7c15f2134e40ad2c486.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
NVIDIA's Tegra241 Soc has a CMDQ-Virtualization (CMDQV) hardware, extending
the standard ARM SMMU v3 IP to support multiple VCMDQs with virtualization
capabilities. In terms of command queue, they are very like a standard SMMU
CMDQ (or ECMDQs), but only support CS_NONE in the CS field of CMD_SYNC.
Add a new tegra241-cmdqv driver, and insert its structure pointer into the
existing arm_smmu_device, and then add related function calls in the SMMUv3
driver to interact with the CMDQV driver.
In the CMDQV driver, add a minimal part for the in-kernel support: reserve
VINTF0 for in-kernel use, and assign some of the VCMDQs to the VINTF0, and
select one VCMDQ based on the current CPU ID to execute supported commands.
This multi-queue design for in-kernel use gives some limited improvements:
up to 20% reduction of invalidation time was measured by a multi-threaded
DMA unmap benchmark, compared to a single queue.
The other part of the CMDQV driver will be user-space support that gives a
hypervisor running on the host OS to talk to the driver for virtualization
use cases, allowing VMs to use VCMDQs without trappings, i.e. no VM Exits.
This is designed based on IOMMUFD, and its RFC series is also under review.
It will provide a guest OS a bigger improvement: 70% to 90% reductions of
TLB invalidation time were measured by DMA unmap tests running in a guest,
compared to nested SMMU CMDQ (with trappings).
As the initial version, the CMDQV driver only supports ACPI configurations.
Signed-off-by: Nate Watterson <nwatterson@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/dce50490b2c10b7254fb36aa73ed7ffd812b283a.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
For model-specific implementation, repurpose the acpi_smmu_get_options()
to a wider acpi_smmu_acpi_probe_model(). A new model can add to the list
in this new function.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/79716299829aeab2e55b8c7932f2634b209bb4d5.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The CMDQV extension in NVIDIA Tegra241 SoC only supports CS_NONE in the
CS field of CMD_SYNC. Add a new SMMU option to accommodate that.
Suggested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/a3cb9bb2429fbae4a59f7ef517614d226763d717.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The symbols __arm_smmu_cmdq_skip_err(), arm_smmu_init_one_queue(), and
arm_smmu_cmdq_init() need to be used by the tegra241-cmdqv compilation
unit in a following patch.
Remove the static and put prototypes in the header.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/c4f2aa5f5f40a2e7c68b132c6d3171d6403de57a.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The CMDQV extension on NVIDIA Tegra241 SoC only supports CS_NONE in the
CS field of CMD_SYNC, v.s. standard SMMU CMDQ. Pass in the cmdq pointer
directly, so the function can identify a different cmdq implementation.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/723288287997b6dfbcd2a904d2c11e9b23f82250.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The driver calls in different places the arm_smmu_get_cmdq() helper, and
it's fine to do so since the helper always returns the single SMMU CMDQ.
However, with NVIDIA CMDQV extension or SMMU ECMDQ, there can be multiple
cmdqs in the system to select one from. And either case requires a batch
of commands to be issued to the same cmdq. Thus, a cmdq has to be decided
in the higher-level callers.
Add a cmdq pointer in arm_smmu_cmdq_batch structure, and decide the cmdq
when initializing the batch. Pass its pointer down to the bottom function.
Update __arm_smmu_cmdq_issue_cmd() accordingly for single command issuers.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/2cbf5ddefb6ea611e48d67c642271bd24421eb21.1724970714.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The current __arm_lpae_unmap() function calls dma_sync() on individual
PTEs after clearing them. Overall unmap performance can be improved by
around 25% for large buffer sizes by combining the syncs for adjacent
leaf entries.
Optimize the unmap time by clearing all the leaf entries and issuing a
single dma_sync() for them.
Below is detailed analysis of average unmap latency(in us) with and
without this optimization obtained by running dma_map_benchmark for
different buffer sizes.
UnMap Latency(us)
Size Without With % gain with
optimiztion optimization optimization
4KB 3 3 0
8KB 4 3.8 5
16KB 6.1 5.4 11.48
32KB 10.2 8.5 16.67
64KB 18.5 14.9 19.46
128KB 35 27.5 21.43
256KB 67.5 52.2 22.67
512KB 127.9 97.2 24.00
1MB 248.6 187.4 24.62
2MB 65.5 65.5 0
4MB 119.2 119 0.17
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240806105135.218089-1-amhetre@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
PCI ATS has a global Smallest Translation Unit field that is located in
the PF but shared by all of the VFs.
The expectation is that the STU will be set to the root port's global STU
capability which is driven by the IO page table configuration of the iommu
HW. Today it becomes set when the iommu driver first enables ATS.
Thus, to enable ATS on the VF, the PF must have already had the correct
STU programmed, even if ATS is off on the PF.
Unfortunately the PF only programs the STU when the PF enables ATS. The
iommu drivers tend to leave ATS disabled when IDENTITY translation is
being used.
Thus we can get into a state where the PF is setup to use IDENTITY with
the DMA API while the VF would like to use VFIO with a PAGING domain and
have ATS turned on. This fails because the PF never loaded a PAGING domain
and so it never setup the STU, and the VF can't do it.
The simplest solution is to have the iommu driver set the ATS STU when it
probes the device. This way the ATS STU is loaded immediately at boot time
to all PFs and there is no issue when a VF comes to use it.
Add a new call pci_prepare_ats() which should be called by iommu drivers
in their probe_device() op for every PCI device if the iommu driver
supports ATS. This will setup the STU based on whatever page size
capability the iommu HW has.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v1-0fb4d2ab6770+7e706-ats_vf_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Nicolin Chen says:
=========
IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
is eventually used by a device for address translation to an MSI location
(including nested cases), practically it is a universal region across all
domains allocated for the IOMMU that defines it.
Currently IOMMUFD core fetches and reserves the region during an attach to
an hwpt_paging. It works with a hwpt_paging-only case, but might not work
with a nested case where a device could directly attach to a hwpt_nested,
bypassing the hwpt_paging attachment.
Move the enforcement forward, to the hwpt_paging allocation function. Then
clean up all the SW_MSI related things in the attach/replace routine.
=========
Based on v6.11-rc5 for dependencies.
* nesting_reserved_regions: (562 commits)
iommufd/device: Enforce reserved IOVA also when attached to hwpt_nested
Linux 6.11-rc5
...
Currently, device reserved regions are only enforced when the device is
attached to an hwpt_paging. In other words, if the device gets attached to
an hwpt_nested directly, the parent hwpt_paging of the hwpt_nested's would
not enforce those reserved IOVAs. This works for most of reserved region
types, but not for IOMMU_RESV_SW_MSI, which is a unique software defined
window, required by a nesting case too to setup an MSI doorbell on the
parent stage-2 hwpt/domain.
Kevin pointed out in 1 that:
1) there is no usage using up closely the entire IOVA space yet,
2) guest may change the viommu mode to switch between nested and paging
then VMM has to take all devices' reserved regions into consideration
anyway, when composing the GPA space.
So it would be actually convenient for us to also enforce reserved IOVA
onto the parent hwpt_paging, when attaching a device to an hwpt_nested.
Repurpose the existing attach/replace_paging helpers to attach device's
reserved IOVAs exclusively.
Add a new find_hwpt_paging helper, which is only used by these reserved
IOVA functions, to allow an IOMMUFD_OBJ_HWPT_NESTED hwpt to redirect to
its parent hwpt_paging. Return a NULL in these two helpers for any new
HWPT type in the future.
Link: https://patch.msgid.link/r/20240807003446.3740368-1-nicolinc@nvidia.com
Link: https://lore.kernel.org/all/BN9PR11MB5276497781C96415272E6FED8CB12@BN9PR11MB5276.namprd11.prod.outlook.com/#1
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
test_bit() is used to read the memory storing the bitmap, however
test_bit() always uses a unsigned long 8 byte access.
If the bitmap is not an aligned size of 64 bits this will now trigger a
KASAN warning reading past the end of the buffer.
Properly round the buffer allocation to an unsigned long size. Continue to
copy_from_user() using a byte granularity.
Fixes: 9560393b83 ("iommufd/selftest: Fix iommufd_test_dirty() to handle <u8 bitmaps")
Link: https://patch.msgid.link/r/0-v1-113e8d9e7861+5ae-iommufd_kasan_jgg@nvidia.com
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
SDM845's Adreno SMMU is unique in that it actually advertizes support
for 16K (and 32M) pages, which doesn't hold for newer SoCs.
This however, seems either broken in the hardware implementation, the
hypervisor middleware that abstracts the SMMU, or there's a bug in the
Linux kernel somewhere down the line that nobody managed to track down.
Booting SDM845 with 16K page sizes and drm/msm results in:
*** gpu fault: ttbr0=0000000000000000 iova=000100000000c000 dir=READ
type=TRANSLATION source=CP (0,0,0,0)
right after loading the firmware. The GPU then starts spitting out
illegal intstruction errors, as it's quite obvious that it got a
bogus pointer.
Moreover, it seems like this issue also concerns other implementations
of SMMUv2 on Qualcomm SoCs, such as the one on SC7180.
Hide 16K support on such instances to work around this.
Reported-by: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Link: https://lore.kernel.org/r/20240824-topic-845_gpu_smmu-v2-1-a302b8acc052@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
These three implementations of map_pages() all succeed if a mapping is
requested with no read or write. Since they return back to __iommu_map()
leaving the mapped output as 0 it triggers an infinite loop. Therefore
nothing is using no-access protection bits.
Further, VFIO and iommufd rely on iommu_iova_to_phys() to get back PFNs
stored by map, if iommu_map() succeeds but iommu_iova_to_phys() fails that
will create serious bugs.
Thus remove this never used "nothing to do" concept and just fail map
immediately.
Fixes: e5fc9753b1 ("iommu/io-pgtable: Add ARMv7 short descriptor support")
Fixes: e1d3c0fd70 ("iommu: add ARM LPAE page table allocator")
Fixes: 745ef1092b ("iommu/io-pgtable: Move Apple DART support to its own file")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/2-v1-1211e1294c27+4b1-iommu_no_prot_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This results in passing 0 or just IOMMU_CACHE to iommu_map(). Most of
the page table formats don't like this:
amdv1 - -EINVAL
armv7s - returns 0, doesn't update mapped
arm-lpae - returns 0 doesn't update mapped
dart - returns 0, doesn't update mapped
VT-D - returns -EINVAL
Unfortunately the three formats that return 0 cause serious problems:
- Returning ret = but not uppdating mapped from domain->map_pages()
causes an infinite loop in __iommu_map()
- Not writing ioptes means that VFIO/iommufd have no way to recover them
and we will have memory leaks and worse during unmap
Since almost nothing can support this, and it is a useless thing to do,
block it early in iommufd.
Cc: stable@kernel.org
Fixes: aad37e71d5 ("iommufd: IOCTLs for the io_pagetable")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/1-v1-1211e1294c27+4b1-iommu_no_prot_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The helper intel_context_flush_present() is designed to flush all related
caches when a context entry with the present bit set is modified. It
currently retrieves the domain ID from the context entry and uses it to
flush the IOTLB and context caches. This is incorrect when the context
entry transitions from present to non-present, as the domain ID field is
cleared before calling the helper.
Fix it by passing the domain ID programmed in the context entry before the
change to intel_context_flush_present(). This ensures that the correct
domain ID is used for cache invalidation.
Fixes: f90584f4be ("iommu/vt-d: Add helper to flush caches for context change")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Closes: https://lore.kernel.org/linux-iommu/20240814162726.5efe1a6e.alex.williamson@redhat.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Jacob Pan <jacob.pan@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240815124857.70038-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
On qcom msm8998, writing to the last context bank of lpass_q6_smmu
(base address 0x05100000) produces a system freeze & reboot.
The hardware/hypervisor reports 13 context banks for the LPASS SMMU
on msm8998, but only the first 12 are accessible...
Override the number of context banks
[ 2.546101] arm-smmu 5100000.iommu: probing hardware configuration...
[ 2.552439] arm-smmu 5100000.iommu: SMMUv2 with:
[ 2.558945] arm-smmu 5100000.iommu: stage 1 translation
[ 2.563627] arm-smmu 5100000.iommu: address translation ops
[ 2.568923] arm-smmu 5100000.iommu: non-coherent table walk
[ 2.574566] arm-smmu 5100000.iommu: (IDR0.CTTW overridden by FW configuration)
[ 2.580220] arm-smmu 5100000.iommu: stream matching with 12 register groups
[ 2.587263] arm-smmu 5100000.iommu: 13 context banks (0 stage-2 only)
[ 2.614447] arm-smmu 5100000.iommu: Supported page sizes: 0x63315000
[ 2.621358] arm-smmu 5100000.iommu: Stage-1: 36-bit VA -> 36-bit IPA
[ 2.627772] arm-smmu 5100000.iommu: preserved 0 boot mappings
Specifically, the crashes occur here:
qsmmu->bypass_cbndx = smmu->num_context_banks - 1;
arm_smmu_cb_write(smmu, qsmmu->bypass_cbndx, ARM_SMMU_CB_SCTLR, 0);
and here:
arm_smmu_write_context_bank(smmu, i);
arm_smmu_cb_write(smmu, i, ARM_SMMU_CB_FSR, ARM_SMMU_CB_FSR_FAULT);
It is likely that FW reserves the last context bank for its own use,
thus a simple work-around is: DON'T USE IT in Linux.
If we decrease the number of context banks, last one will be "hidden".
Signed-off-by: Marc Gonzalez <mgonzalez@freebox.fr>
Reviewed-by: Caleb Connolly <caleb.connolly@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20240820-smmu-v3-1-2f71483b00ec@freebox.fr
Signed-off-by: Will Deacon <will@kernel.org>
Clean up and reorder them according to the bit index. There is no
functional change.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240816221650.62295-1-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu_report_device_fault function was updated to return void while
assuming that drivers only need to call iommu_report_device_fault() for
reporting an iopf. This implementation causes following problems:
1. The drivers rely on the core code to call it's page_reponse,
however, when a fault is received and no fault capable domain is
attached / iopf_param is NULL, the ops->page_response is NOT called
causing the device to stall in case the fault type was PAGE_REQ.
2. The arm_smmu_v3 driver relies on the returned value to log errors
returning void from iommu_report_device_fault causes these events to
be missed while logging.
Modify the iommu_report_device_fault function to return -EINVAL for
cases where no fault capable domain is attached or iopf_param was NULL
and calls back to the driver (ops->page_response) in case the fault type
was IOMMU_FAULT_PAGE_REQ. The returned value can be used by the drivers
to log the fault/event as needed.
Reported-by: Kunkun Jiang <jiangkunkun@huawei.com>
Closes: https://lore.kernel.org/all/6147caf0-b9a0-30ca-795e-a1aa502a5c51@huawei.com/
Fixes: 3dfa64aecb ("iommu: Make iommu_report_device_fault() return void")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240816104906.1010626-1-praan@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Directly call into dma-iommu just like we have been doing for dma-direct
for a while. This avoids the indirect call overhead for IOMMU ops and
removes the need to have DMA ops entirely for many common configurations.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Two small fixes:
- Incorrect error unwind in iommufd_device_do_replace()
- Correct a sparse warning missing static
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZsUc4QAKCRCFwuHvBreF
YZwsAPwONmbC7pXeXfuT69SgebRZTGTQTYK7d5jTH6AW8w+KKwD+LwYWI6N9E1f0
7I/5jO5QYHURDzjv0DKJVBWrU5jUaQE=
=QqJc
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
- Incorrect error unwind in iommufd_device_do_replace()
- Correct a sparse warning missing static
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd/selftest: Make dirty_ops static
iommufd/device: Fix hwpt at err_unresv in iommufd_device_do_replace()
The sparse tool complains as follows:
drivers/iommu/iommufd/selftest.c:277:30: warning:
symbol 'dirty_ops' was not declared. Should it be static?
This symbol is not used outside of selftest.c, so marks it static.
Fixes: 266ce58989 ("iommufd/selftest: Test IOMMU_HWPT_ALLOC_DIRTY_TRACKING")
Link: https://patch.msgid.link/r/20240819120007.3884868-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The arm_smmu_domain_alloc() function returns error pointers on error. It
doesn't return NULL. Update the error checking to match.
Fixes: 52acd7d8a4 ("iommu/arm-smmu-v3: Add support for domain_alloc_user fn")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9208cd0d-8105-40df-93e9-bdcdf0d55eec@stanley.mountain
Signed-off-by: Will Deacon <will@kernel.org>
arm_smmu_sva_remove_dev_pasid() has been removed since commit d38c28dbef
("iommu/arm-smmu-v3: Put the SVA mmu notifier in the smmu_domain"), remain
the empty definition untouched in header file, which is used when
CONFIG_ARM_SMMU_V3_SVA is not set. So, let's remove the unused definition.
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240815111504.48810-1-zhangzekun11@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Previously this was dev_err_ratelimited() but it got changed to a
ratelimited dev_dbg(). Change it back to dev_err().
Fixes: d525b0af0c ("iommu/arm-smmu: Pretty-print context fault related regs")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20240809172716.10275-1-robdclark@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
Create global blocked domain with attach device ops. It will clear the
DTE so that all DMA from device will be aborted.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240722115452.5976-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When iommu_report_device_fault gets called with a partial fault it is
supposed to collect the fault into the group and then return.
Instead the return was accidently deleted which results in trying to
process the fault and an eventual crash.
Deleting the return was a typo, put it back.
Fixes: 3dfa64aecb ("iommu: Make iommu_report_device_fault() return void")
Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v1-e7153d9c8cee+1c6-iommu_fault_fix_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
- Fix NULL dereference when freeing domain in Unisoc SPRD driver.
- Separate assignment statements with semicolons in AMD page-table code.
- Fix Tegra erratum workaround when the CPU is using 16KiB pages.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmaiQhUQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNI3SB/9m78weKYVHUc87Vwzs/3aEmMzjNiCuIIax
nFKIFKZIeL7UqiUK1dehoVU+S/JYYjjPyR/0yeb+l4w4IaTmkcAu48r9rDssuoLe
I6rkLbIfwGIw5pKtf21UtkRmMhnoW2ptx8qrCFbvIkFYVWyfayP+c7hkbKBh2bMD
cal8bYrgWuax3TTgxu5V6cBBSK+EDmdN2ZVGArY1x7VZImFWZuZ1s5OowjHCX4Q1
il2ETdL4M8OmlIsitO96I0zvne2Ci+12i5TQOl8Ao5d7BhM5LfjdQjeNvT6YhdX9
X3u+p+GfVYQNZI1HYvwNvWTEuF+PIj5Oem9hfKDo/8vAYERe2M1A
=7kvS
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Will Deacon:
"We're still resolving a regression with the handling of unexpected
page faults on SMMUv3, but we're not quite there with a fix yet.
- Fix NULL dereference when freeing domain in Unisoc SPRD driver
- Separate assignment statements with semicolons in AMD page-table
code
- Fix Tegra erratum workaround when the CPU is using 16KiB pages"
* tag 'iommu-fixes-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu: arm-smmu: Fix Tegra workaround for PAGE_SIZE mappings
iommu/amd: Convert comma to semicolon
iommu: sprd: Avoid NULL deref in sprd_iommu_hw_en
PAGE_SIZE can be 16KB for Tegra which is not supported by MMU-500 on
both Tegra194 and Tegra234. Retain only valid granularities from
pgsize_bitmap which would either be 4KB or 64KB.
Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
Link: https://lore.kernel.org/r/20240724173132.219978-1-amhetre@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Replace a comma between expression statements by a semicolon.
Fixes: c9b258c6be ("iommu/amd: Prepare for generic IO page table framework")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240716072545.968690-1-nichen@iscas.ac.cn
Signed-off-by: Will Deacon <will@kernel.org>
In sprd_iommu_cleanup() before calling function sprd_iommu_hw_en()
dom->sdev is equal to NULL, which leads to null dereference.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 9afea57384 ("iommu/sprd: Release dma buffer to avoid memory leak")
Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Reviewed-by: Chunyan Zhang <zhang.lyra@gmail.com>
Link: https://lore.kernel.org/r/20240716125522.3690358-1-artem.chernyshev@red-soft.ru
Signed-off-by: Will Deacon <will@kernel.org>
Several new features here:
- Virtio find vqs API has been reworked
(required to fix the scalability issue we have with
adminq, which I hope to merge later in the cycle)
- vDPA driver for Marvell OCTEON
- virtio fs performance improvement
- mlx5 migration speedups
Fixes, cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmaXjQQPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpnIsH/jVNqAQbe/vaBQdNMdnsA+P9A9unLbYRxYCQ
tN73mQRIXKtnZHBRAEbMGq52HPYg8HlN2HJSgyNo6I6t8VD+PiOco7m+3GpmqEcW
aXPOPl0BAbVoDgyutxRuuodP8Z61lBx0mG6iOxpzTXOPGlpQqtPCFHO8YnodqnPf
tMix/5uAqgZKV2siCbw5DtzwEc0gDHU8qsD0/nyoS5nBDF9yh/ardr5P/qiyFDQH
atCNYTOhIFU83pLAaw0fpCGbkt7gxf+5RpWVx3wkYww+/MwvYhsveRvQyaGbBz3n
WDtET3SOtVTta98OAGIKCq/2z8f6mYXBP7vXapBgnJG3vwS/poQ=
=LYua
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Several new features here:
- Virtio find vqs API has been reworked (required to fix the
scalability issue we have with adminq, which I hope to merge later
in the cycle)
- vDPA driver for Marvell OCTEON
- virtio fs performance improvement
- mlx5 migration speedups
Fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (56 commits)
virtio: rename virtio_find_vqs_info() to virtio_find_vqs()
virtio: remove unused virtio_find_vqs() and virtio_find_vqs_ctx() helpers
virtio: convert the rest virtio_find_vqs() users to virtio_find_vqs_info()
virtio_balloon: convert to use virtio_find_vqs_info()
virtiofs: convert to use virtio_find_vqs_info()
scsi: virtio_scsi: convert to use virtio_find_vqs_info()
virtio_net: convert to use virtio_find_vqs_info()
virtio_crypto: convert to use virtio_find_vqs_info()
virtio_console: convert to use virtio_find_vqs_info()
virtio_blk: convert to use virtio_find_vqs_info()
virtio: rename find_vqs_info() op to find_vqs()
virtio: remove the original find_vqs() op
virtio: call virtio_find_vqs_info() from virtio_find_single_vq() directly
virtio: convert find_vqs() op implementations to find_vqs_info()
virtio_pci: convert vp_*find_vqs() ops to find_vqs_info()
virtio: introduce virtio_queue_info struct and find_vqs_info() config op
virtio: make virtio_find_single_vq() call virtio_find_vqs()
virtio: make virtio_find_vqs() call virtio_find_vqs_ctx()
caif_virtio: use virtio_find_single_vq() for single virtqueue finding
vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
...
- reduce duplicate swiotlb pool lookups (Michael Kelley)
- minor small fixes (Yicong Yang, Yang Li)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmaZ+C0LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPtUw/9FZr0azAzx1E5nqDNfa5sisqoPZUFKcHYBYmjhcgL
9NvDKbKqriRBt8fAMtjGpQqUfpIytfa4p0BGLYnzYKJ8Im+MqC9KYq7lhoNVQGO3
iz8T8clT9TPoVDkiMn/LXlf6dqE0SQtcSh4TJ1c+aTFFlez5jXClFu376rcbdewM
n+nyGL93uaCeRpeQlywyBsUdpu9NnWAJUYi/OszWSTlX/1ucG5kD9kNQ4sOFL0S/
WszylOCIuur9EweIvMgPa5X5yCUsr/n95EthbIQRiZq/TSkLpJ3DkTk3VYaaKLsJ
imq5Q6dqOnPplL22X6GgvXEDYd1OYzYSvbB93HcxuGnIIYLvdFMV+McCTkwL7XJV
v9fDNsx8Wn7cS+6OpT0YxfTFvzHXFAzRTqVoqA/lCWo8eaclzUKM3qOoIQxjl1Wh
1IDoHUhiH0m0ecS+ZzljjDrn39RzU6+5XJiJzCmzYlqL/9f0SumTCsNrEmF3W6l9
GmuP2Q49Rbh+Ru+oyx3KGlBS9DQxTB+heBnjkwigtAiwoO8Z2V9Vwg/a2Lzl3RkN
vDaHqEo6f8gAc/QlonOjN9UfuR5U8bxGUps1zw9O0CeTDqSDXmJKvkddfkBSAY7s
bIW+sObA5yRrPuwz4ALykw3h1UJXdZNJIEYjin3CJIHr3K0DZADWXXSXCSao16kC
MXU=
=PzrU
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.11-2024-07-19' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- reduce duplicate swiotlb pool lookups (Michael Kelley)
- minor small fixes (Yicong Yang, Yang Li)
* tag 'dma-mapping-6.11-2024-07-19' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix kernel-doc description for swiotlb_del_transient
swiotlb: reduce swiotlb pool lookups
dma-mapping: benchmark: Don't starve others when doing the test
- Core:
* Support for the "ats-supported" device-tree property.
* Removal of the 'ops' field from 'struct iommu_fwspec'.
* Introduction of iommu_paging_domain_alloc() and partial conversion
of existing users.
* Introduce 'struct iommu_attach_handle' and provide corresponding
IOMMU interfaces which will be used by the IOMMUFD subsystem.
* Remove stale documentation.
* Add missing MODULE_DESCRIPTION() macro.
* Misc cleanups.
- Allwinner Sun50i:
* Ensure bypass mode is disabled on H616 SoCs.
* Ensure page-tables are allocated below 4GiB for the 32-bit
page-table walker.
* Add new device-tree compatible strings.
- AMD Vi:
* Use try_cmpxchg64() instead of cmpxchg64() when updating pte.
- Arm SMMUv2:
* Print much more useful information on context faults.
* Fix Qualcomm TBU probing when CONFIG_ARM_SMMU_QCOM_DEBUG=n.
* Add new Qualcomm device-tree bindings.
- Arm SMMUv3:
* Support for hardware update of access/dirty bits and reporting via
IOMMUFD.
* More driver rework from Jason, this time updating the PASID/SVA support
to prepare for full IOMMUFD support.
* Add missing MODULE_DESCRIPTION() macro.
* Minor fixes and cleanups.
- NVIDIA Tegra:
* Fix for benign fwspec initialisation issue exposed by rework on the
core branch.
- Intel VT-d:
* Use try_cmpxchg64() instead of cmpxchg64() when updating pte.
* Use READ_ONCE() to read volatile descriptor status.
* Remove support for handling Execute-Requested requests.
* Avoid calling iommu_domain_alloc().
* Minor fixes and refactoring.
- Qualcomm MSM:
* Updates to the device-tree bindings.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmaZTqMQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNApdB/wL2gW7ANJN3KDrOiWdq06P9fuzxbuiAegI
aKGH+aT05kJjLBXpAE5K9Bas0RbgN8iIB4TITDR9jyLnMOlTP3poy0fvB8y27q00
/WkQ7yVPkZc58ySdEOGH/EbuQkiXcD1YTjTGWP9071xzbWTDbsYN0smfbvvB9LgI
56KhdcUtB0QsqhqBzyyznHJLFdpVvDpbkiAFDXJfor7SNOOtV9a4Ect6IYteaYKz
S6+DWDEfUs+fHTEKEZ9sZVA745f2zPkT/YHY8vjLOEukWN07+3/2AKTra19DIgqF
HCGitRyZjOut1fg8sLn0SUliCKe/G/bHlwSbHnxJQ73b91YDvpzD
=xvLD
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
"Core:
- Support for the "ats-supported" device-tree property
- Removal of the 'ops' field from 'struct iommu_fwspec'
- Introduction of iommu_paging_domain_alloc() and partial conversion
of existing users
- Introduce 'struct iommu_attach_handle' and provide corresponding
IOMMU interfaces which will be used by the IOMMUFD subsystem
- Remove stale documentation
- Add missing MODULE_DESCRIPTION() macro
- Misc cleanups
Allwinner Sun50i:
- Ensure bypass mode is disabled on H616 SoCs
- Ensure page-tables are allocated below 4GiB for the 32-bit
page-table walker
- Add new device-tree compatible strings
AMD Vi:
- Use try_cmpxchg64() instead of cmpxchg64() when updating pte
Arm SMMUv2:
- Print much more useful information on context faults
- Fix Qualcomm TBU probing when CONFIG_ARM_SMMU_QCOM_DEBUG=n
- Add new Qualcomm device-tree bindings
Arm SMMUv3:
- Support for hardware update of access/dirty bits and reporting via
IOMMUFD
- More driver rework from Jason, this time updating the PASID/SVA
support to prepare for full IOMMUFD support
- Add missing MODULE_DESCRIPTION() macro
- Minor fixes and cleanups
NVIDIA Tegra:
- Fix for benign fwspec initialisation issue exposed by rework on the
core branch
Intel VT-d:
- Use try_cmpxchg64() instead of cmpxchg64() when updating pte
- Use READ_ONCE() to read volatile descriptor status
- Remove support for handling Execute-Requested requests
- Avoid calling iommu_domain_alloc()
- Minor fixes and refactoring
Qualcomm MSM:
- Updates to the device-tree bindings"
* tag 'iommu-updates-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (72 commits)
iommu/tegra-smmu: Pass correct fwnode to iommu_fwspec_init()
iommu/vt-d: Fix identity map bounds in si_domain_init()
iommu: Move IOMMU_DIRTY_NO_CLEAR define
dt-bindings: iommu: Convert msm,iommu-v0 to yaml
iommu/vt-d: Fix aligned pages in calculate_psi_aligned_address()
iommu/vt-d: Limit max address mask to MAX_AGAW_PFN_WIDTH
docs: iommu: Remove outdated Documentation/userspace-api/iommu.rst
arm64: dts: fvp: Enable PCIe ATS for Base RevC FVP
iommu/of: Support ats-supported device-tree property
dt-bindings: PCI: generic: Add ats-supported property
iommu: Remove iommu_fwspec ops
OF: Simplify of_iommu_configure()
ACPI: Retire acpi_iommu_fwspec_ops()
iommu: Resolve fwspec ops automatically
iommu/mediatek-v1: Clean up redundant fwspec checks
RDMA/usnic: Use iommu_paging_domain_alloc()
wifi: ath11k: Use iommu_paging_domain_alloc()
wifi: ath10k: Use iommu_paging_domain_alloc()
drm/msm: Use iommu_paging_domain_alloc()
vhost-vdpa: Use iommu_paging_domain_alloc()
...
Major changes:
- The iova_bitmap logic for efficiently reporting dirty pages back to
userspace has a few more tricky corner case bugs that have been resolved
and backed with new tests. The revised version has simpler logic.
- Shared branch with iommu for handle support when doing domain
attach. Handles allow the domain owner to include additional private data
on a per-device basis.
- IO Page Fault Reporting to userspace via iommufd. Page faults can be
generated on fault capable HWPTs when a translation is not present.
Routing them to userspace would allow a VMM to be able to virtualize them
into an emulated vIOMMU. This is the next step to fully enabling vSVA
support.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZpfo4AAKCRCFwuHvBreF
YTO6APwMLxeWmHbE1H+7ZPuXP7B1aDuwRLczZOo3i816pIj+bQD+OywEA/NcljK6
6NLeqyUe7tECtVrFPSiRT9lWVuzZSQs=
=rnN/
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
- The iova_bitmap logic for efficiently reporting dirty pages back to
userspace has a few more tricky corner case bugs that have been
resolved and backed with new tests.
The revised version has simpler logic.
- Shared branch with iommu for handle support when doing domain attach.
Handles allow the domain owner to include additional private data on
a per-device basis.
- IO Page Fault Reporting to userspace via iommufd. Page faults can be
generated on fault capable HWPTs when a translation is not present.
Routing them to userspace would allow a VMM to be able to virtualize
them into an emulated vIOMMU. This is the next step to fully enabling
vSVA support.
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (26 commits)
iommufd: Put constants for all the uAPI enums
iommufd: Fix error pointer checking
iommufd: Add check on user response code
iommufd: Remove IOMMUFD_PAGE_RESP_FAILURE
iommufd: Require drivers to supply the cache_invalidate_user ops
iommufd/selftest: Add coverage for IOPF test
iommufd/selftest: Add IOPF support for mock device
iommufd: Associate fault object with iommufd_hw_pgtable
iommufd: Fault-capable hwpt attach/detach/replace
iommufd: Add iommufd fault object
iommufd: Add fault and response message definitions
iommu: Extend domain attach group with handle support
iommu: Add attach handle to struct iopf_group
iommu: Remove sva handle list
iommu: Introduce domain attachment handle
iommufd/iova_bitmap: Remove iterator logic
iommufd/iova_bitmap: Dynamic pinning on iova_bitmap_set()
iommufd/iova_bitmap: Consolidate iova_bitmap_set exit conditionals
iommufd/iova_bitmap: Move initial pinning to iova_bitmap_for_each()
iommufd/iova_bitmap: Cache mapped length in iova_bitmap_map struct
...
Since the original virtio_find_vqs() is no longer present, rename
virtio_find_vqs_info() back to virtio_find_vqs().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240708074814.1739223-20-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Instead of passing separate names and callbacks arrays
to virtio_find_vqs(), have one of virtual_queue_info structs and
pass it to virtio_find_vqs_info().
Suggested-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240708074814.1739223-18-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* iommu/iommufd/paging-domain-alloc:
RDMA/usnic: Use iommu_paging_domain_alloc()
wifi: ath11k: Use iommu_paging_domain_alloc()
wifi: ath10k: Use iommu_paging_domain_alloc()
drm/msm: Use iommu_paging_domain_alloc()
vhost-vdpa: Use iommu_paging_domain_alloc()
vfio/type1: Use iommu_paging_domain_alloc()
iommufd: Use iommu_paging_domain_alloc()
iommu: Add iommu_paging_domain_alloc() interface
* iommu/iommufd/attach-handles:
iommu: Extend domain attach group with handle support
iommu: Add attach handle to struct iopf_group
iommu: Remove sva handle list
iommu: Introduce domain attachment handle
* iommu/intel/vt-d:
iommu/vt-d: Fix identity map bounds in si_domain_init()
iommu/vt-d: Fix aligned pages in calculate_psi_aligned_address()
iommu/vt-d: Limit max address mask to MAX_AGAW_PFN_WIDTH
iommu/vt-d: Refactor PCI PRI enabling/disabling callbacks
iommu/vt-d: Add helper to flush caches for context change
iommu/vt-d: Add helper to allocate paging domain
iommu/vt-d: Downgrade warning for pre-enabled IR
iommu/vt-d: Remove control over Execute-Requested requests
iommu/vt-d: Remove comment for def_domain_type
iommu/vt-d: Handle volatile descriptor status read
iommu/vt-d: Use try_cmpxchg64() in intel_pasid_get_entry()
* iommu/arm/smmu: (32 commits)
iommu: Move IOMMU_DIRTY_NO_CLEAR define
iommu/arm-smmu-qcom: Register the TBU driver in qcom_smmu_impl_init
iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
iommu/arm-smmu-v3: Add support for dirty tracking in domain alloc
iommu/io-pgtable-arm: Add read_and_clear_dirty() support
iommu/arm-smmu-v3: Add feature detection for HTTU
iommu/arm-smmu-v3: Add support for domain_alloc_user fn
iommu/arm-smmu-qcom: record reason for deferring probe
iommu/arm-smmu: Pretty-print context fault related regs
iommu/arm-smmu-qcom-debug: Do not print for handled faults
iommu/arm-smmu: Add CB prefix to register bitfields
dt-bindings: arm-smmu: Add X1E80100 GPU SMMU
iommu/arm-smmu-v3: add missing MODULE_DESCRIPTION() macro
iommu/arm-smmu-v3: Shrink the strtab l1_desc array
iommu/arm-smmu-v3: Do not zero the strtab twice
iommu/arm-smmu-v3: Allow setting a S1 domain to a PASID
iommu/arm-smmu-v3: Allow a PASID to be set when RID is IDENTITY/BLOCKED
iommu/arm-smmu-v3: Test the STE S1DSS functionality
iommu/arm-smmu-v3: Allow IDENTITY/BLOCKED to be set while PASID is used
iommu/arm-smmu-v3: Put the SVA mmu notifier in the smmu_domain
...
iommu_fwspec_init() expects to receive the fwnode corresponding to the
IOMMU device, not the fwnode corresponding to the client device being
probed.
Fix arm_smmu_configure() to pass the correct fwnode to
iommu_fwspec_init().
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/0eec5f84-6b39-43ba-ab2f-914688a5cf45@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Intel IOMMU operates on inclusive bounds (both generally aas well as
iommu_domain_identity_map()). Meanwhile, for_each_mem_pfn_range() uses
exclusive bounds for end_pfn. This creates an off-by-one error when
switching between the two.
Fixes: c5395d5c4a ("intel-iommu: Clean up iommu_domain_identity_map()")
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Suggested-by: Gary Zibrat <gzibrat@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240709234913.2749386-1-pandoh@google.com
Signed-off-by: Will Deacon <will@kernel.org>
The response code from user space is only allowed to be SUCCESS or
INVALID. All other values are treated by the device as a response code of
Response Failure according to PCI spec, section 10.4.2.1. This response
disables the Page Request Interface for the Function.
Add a check in iommufd_fault_fops_write() to avoid invalid response
code.
Fixes: 07838f7fd5 ("iommufd: Add iommufd fault object")
Link: https://lore.kernel.org/r/20240710083341.44617-3-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The helper calculate_psi_aligned_address() is used to convert an arbitrary
range into a size-aligned one.
The aligned_pages variable is calculated from input start and end, but is
not adjusted when the start pfn is not aligned and the mask is adjusted,
which results in an incorrect number of pages returned.
The number of pages is used by qi_flush_piotlb() to flush caches for the
first-stage translation. With the wrong number of pages, the cache is not
synchronized, leading to inconsistencies in some cases.
Fixes: c4d27ffaa8 ("iommu/vt-d: Add cache tag invalidation helpers")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240709152643.28109-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Address mask specifies the number of low order bits of the address field
that must be masked for the invalidation operation.
Since address bits masked start from bit 12, the max address mask should
be MAX_AGAW_PFN_WIDTH, as defined in Table 19 ("Invalidate Descriptor
Address Mask Encodings") of the spec.
Limit the max address mask returned from calculate_psi_aligned_address()
to MAX_AGAW_PFN_WIDTH to prevent potential integer overflow in the
following code:
qi_flush_dev_iotlb():
...
addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1;
...
Fixes: c4d27ffaa8 ("iommu/vt-d: Add cache tag invalidation helpers")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240709152643.28109-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair
in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple
places, the pool is found and used in one function, and then must
be found again in the next function that is called because only the
tlb_addr is passed as an argument. These are the six call sites:
dma_direct_map_page:
1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce
dma_direct_unmap_page:
2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer
3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu ->
swiotlb_bounce
4. is_swiotlb_buffer
5. swiotlb_tbl_unmap_single -> swiotlb_del_transient
6. swiotlb_tbl_unmap_single -> swiotlb_release_slots
Reduce the number of calls by finding the pool at a higher level, and
passing it as an argument instead of searching again. A key change is
for is_swiotlb_buffer() to return a pool pointer instead of a boolean,
and then pass this pool pointer to subsequent swiotlb functions.
There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer
is a swiotlb buffer before calling a swiotlb function. To reduce code
duplication in getting the pool pointer and passing it as an argument,
introduce inline wrappers for this pattern. The generated code is
essentially unchanged.
Since is_swiotlb_buffer() no longer returns a boolean, rename some
functions to reflect the change:
* swiotlb_find_pool() becomes __swiotlb_find_pool()
* is_swiotlb_buffer() becomes swiotlb_find_pool()
* is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool()
With these changes, a round-trip map/unmap pair requires only 2 pool
lookups (listed using the new names and wrappers):
dma_direct_unmap_page:
1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool
2. swiotlb_tbl_unmap_single -> swiotlb_find_pool
These changes come from noticing the inefficiencies in a code review,
not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC,
__swiotlb_find_pool() is not trivial, and it uses an RCU read lock,
so avoiding the redundant calls helps performance in a hot path.
When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction
is minimal and the perf benefits are likely negligible, but no
harm is done.
No functional change is intended.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Lu Baolu says:
====================
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. One feasible use case is the
nested translation. Nested translation is a hardware feature that supports
two-stage translation tables for IOMMU. The second-stage translation table
is managed by the host VMM, while the first-stage translation table is
owned by user space. This allows user space to control the IOMMU mappings
for its devices.
When an IO page fault occurs on the first-stage translation table, the
IOMMU hardware can deliver the page fault to user space through the
IOMMUFD framework. User space can then handle the page fault and respond
to the device top-down through the IOMMUFD. This allows user space to
implement its own IO page fault handling policies.
User space application that is capable of handling IO page faults should
allocate a fault object, and bind the fault object to any domain that it
is willing to handle the fault generatd for them. On a successful return
of fault object allocation, the user can retrieve and respond to page
faults by reading or writing to the file descriptor (FD) returned.
The iommu selftest framework has been updated to test the IO page fault
delivery and response functionality.
====================
* iommufd_pri:
iommufd/selftest: Add coverage for IOPF test
iommufd/selftest: Add IOPF support for mock device
iommufd: Associate fault object with iommufd_hw_pgtable
iommufd: Fault-capable hwpt attach/detach/replace
iommufd: Add iommufd fault object
iommufd: Add fault and response message definitions
iommu: Extend domain attach group with handle support
iommu: Add attach handle to struct iopf_group
iommu: Remove sva handle list
iommu: Introduce domain attachment handle
Link: https://lore.kernel.org/all/20240702063444.105814-1-baolu.lu@linux.intel.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend the selftest mock device to support generating and responding to
an IOPF. Also add an ioctl interface to userspace applications to trigger
the IOPF on the mock device. This would allow userspace applications to
test the IOMMUFD's handling of IOPFs without having to rely on any real
hardware.
Link: https://lore.kernel.org/r/20240702063444.105814-10-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When allocating a user iommufd_hw_pagetable, the user space is allowed to
associate a fault object with the hw_pagetable by specifying the fault
object ID in the page table allocation data and setting the
IOMMU_HWPT_FAULT_ID_VALID flag bit.
On a successful return of hwpt allocation, the user can retrieve and
respond to page faults by reading and writing the file interface of the
fault object.
Once a fault object has been associated with a hwpt, the hwpt is
iopf-capable, indicated by hwpt->fault is non NULL. Attaching,
detaching, or replacing an iopf-capable hwpt to an RID or PASID will
differ from those that are not iopf-capable.
Link: https://lore.kernel.org/r/20240702063444.105814-9-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add iopf-capable hw page table attach/detach/replace helpers. The pointer
to iommufd_device is stored in the domain attachment handle, so that it
can be echo'ed back in the iopf_group.
The iopf-capable hw page tables can only be attached to devices that
support the IOMMU_DEV_FEAT_IOPF feature. On the first attachment of an
iopf-capable hw_pagetable to the device, the IOPF feature is enabled on
the device. Similarly, after the last iopf-capable hwpt is detached from
the device, the IOPF feature is disabled on the device.
The current implementation allows a replacement between iopf-capable and
non-iopf-capable hw page tables. This matches the nested translation use
case, where a parent domain is attached by default and can then be
replaced with a nested user domain with iopf support.
Link: https://lore.kernel.org/r/20240702063444.105814-8-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
An iommufd fault object provides an interface for delivering I/O page
faults to user space. These objects are created and destroyed by user
space, and they can be associated with or dissociated from hardware page
table objects during page table allocation or destruction.
User space interacts with the fault object through a file interface. This
interface offers a straightforward and efficient way for user space to
handle page faults. It allows user space to read fault messages
sequentially and respond to them by writing to the same file. The file
interface supports reading messages in poll mode, so it's recommended that
user space applications use io_uring to enhance read and write efficiency.
A fault object can be associated with any iopf-capable iommufd_hw_pgtable
during the pgtable's allocation. All I/O page faults triggered by devices
when accessing the I/O addresses of an iommufd_hw_pgtable are routed
through the fault object to user space. Similarly, user space's responses
to these page faults are routed back to the iommu device driver through
the same fault object.
Link: https://lore.kernel.org/r/20240702063444.105814-7-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Device-tree declares whether a PCI root-complex supports ATS by setting
the "ats-supported" property. Copy this flag into device fwspec to let
IOMMU drivers quickly check if they can enable ATS for a device.
Tested-by: Ketan Patil <ketanp@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20240607105415.2501934-4-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
The ops in iommu_fwspec are only needed for the early configuration and
probe process, and by now are easy enough to derive on-demand in those
couple of places which need them, so remove the redundant stored copy.
Tested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/55c1410b2cd09531eab4f8e2f18f92a0faa0ea75.1719919669.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
We no longer have a notion of partially-initialised fwspecs existing,
and we also no longer need to use an iommu_ops pointer to return status
to of_dma_configure(). Clean up the remains of those, which lends itself
to clarifying the logic around the dma_range_map allocation as well.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/61972f88e31a6eda8bf5852f0853951164279a3c.1719919669.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
There's no real need for callers to resolve ops from a fwnode in order
to then pass both to iommu_fwspec_init() - it's simpler and more sensible
for that to resolve the ops itself. This in turn means we can centralise
the notion of checking for a present driver, and enforce that fwspecs
aren't allocated unless and until we know they will be usable.
Also use this opportunity to modernise with some "new" helpers that
arrived shortly after this code was first written; the generic
fwnode_handle_get() clears up that ugly get/put mismatch, while
of_fwnode_handle() can now abstract those open-coded dereferences.
Tested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/0e2727adeb8cd73274425322f2f793561bdc927e.1719919669.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
If the iommu driver doesn't implement its domain_alloc_user callback,
iommufd_hwpt_paging_alloc() rolls back to allocate an iommu paging domain.
Replace iommu_domain_alloc() with iommu_user_domain_alloc() to pass the
device pointer along the path.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240610085555.88197-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Commit <17de3f5fdd35> ("iommu: Retire bus ops") removes iommu ops from
bus. The iommu subsystem no longer relies on bus for operations. So the
bus parameter in iommu_domain_alloc() is no longer relevant.
Add a new interface named iommu_paging_domain_alloc(), which explicitly
indicates the allocation of a paging domain for DMA managed by a kernel
driver. The new interface takes a device pointer as its parameter, that
better aligns with the current iommu subsystem.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240610085555.88197-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Unlike the SVA case where each PASID of a device has an SVA domain
attached to it, the I/O page faults are handled by the fault handler
of the SVA domain. The I/O page faults for a user page table might
be handled by the domain attached to RID or the domain attached to
the PASID, depending on whether the PASID table is managed by user
space or kernel. As a result, there is a need for the domain attach
group interfaces to have attach handle support. The attach handle
will be forwarded to the fault handler of the user domain.
Add some variants of the domain attaching group interfaces so that they
could support the attach handle and export them for use in IOMMUFD.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240702063444.105814-5-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Previously, the domain that a page fault targets is stored in an
iopf_group, which represents a minimal set of page faults. With the
introduction of attach handle, replace the domain with the handle
so that the fault handler can obtain more information as needed
when handling the faults.
iommu_report_device_fault() is currently used for SVA page faults,
which handles the page fault in an internal cycle. The domain is retrieved
with iommu_get_domain_for_dev_pasid() if the pasid in the fault message
is valid. This doesn't work in IOMMUFD case, where if the pasid table of
a device is wholly managed by user space, there is no domain attached to
the PASID of the device, and all page faults are forwarded through a
NESTING domain attaching to RID.
Add a static flag in iommu ops, which indicates if the IOMMU driver
supports user-managed PASID tables. In the iopf deliver path, if no
attach handle found for the iopf PASID, roll back to RID domain when
the IOMMU driver supports this capability.
iommu_get_domain_for_dev_pasid() is no longer used and can be removed.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240702063444.105814-4-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
The struct sva_iommu represents an association between an SVA domain and
a PASID of a device. It's stored in the iommu group's pasid array and also
tracked by a list in the per-mm data structure. Removes duplicate tracking
of sva_iommu by eliminating the list.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240702063444.105814-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Currently, when attaching a domain to a device or its PASID, domain is
stored within the iommu group. It could be retrieved for use during the
window between attachment and detachment.
With new features introduced, there's a need to store more information
than just a domain pointer. This information essentially represents the
association between a domain and a device. For example, the SVA code
already has a custom struct iommu_sva which represents a bond between
sva domain and a PASID of a device. Looking forward, the IOMMUFD needs
a place to store the iommufd_device pointer in the core, so that the
device object ID could be quickly retrieved in the critical fault handling
path.
Introduce domain attachment handle that explicitly represents the
attachment relationship between a domain and a device or its PASID.
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240702063444.105814-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Currently the TBU driver will only probe when CONFIG_ARM_SMMU_QCOM_DEBUG
is enabled. The driver not probing would prevent the platform to reach
sync_state and the system will remain in sub-optimal power consumption
mode while waiting for all consumer drivers to probe. To address this,
let's register the TBU driver in qcom_smmu_impl_init(), so that it can
probe, but still enable its functionality only when the debug option in
Kconfig is enabled.
Reported-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Closes: https://lore.kernel.org/r/CAA8EJppcXVu72OSo+OiYEiC1HQjP3qCwKMumOsUhcn6Czj0URg@mail.gmail.com
Fixes: 414ecb0308 ("iommu/arm-smmu-qcom-debug: Add support for TBUs")
Signed-off-by: Georgi Djakov <quic_c_gdjako@quicinc.com>
Link: https://lore.kernel.org/r/20240704010759.507798-1-quic_c_gdjako@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
This helper is used to flush the related caches following a change in a
context table entry that was previously present. The VT-d specification
provides guidance for such invalidations in section 6.5.3.3.
This helper replaces the existing open code in the code paths where a
present context entry is being torn down.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240701112317.94022-2-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20240702130839.108139-7-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Emitting a warning is overkill in intel_setup_irq_remapping() since the
interrupt remapping is pre-enabled. For example, there's no guarantee
that kexec will explicitly disable interrupt remapping before booting a
new kernel. As a result, users are seeing warning messages like below
when they kexec boot a kernel, though there is nothing wrong:
DMAR-IR: IRQ remapping was enabled on dmar18 but we are not in kdump mode
DMAR-IR: IRQ remapping was enabled on dmar17 but we are not in kdump mode
DMAR-IR: IRQ remapping was enabled on dmar16 but we are not in kdump mode
... ...
Downgrade the severity of this message to avoid user confusion.
CC: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-iommu/5517f76a-94ad-452c-bae6-34ecc0ec4831@molgen.mpg.de/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240625043912.258036-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20240702130839.108139-5-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
The VT-d specification has removed architectural support of the requests
with pasid with a value of 1 for Execute-Requested (ER). And the NXE bit
in the pasid table entry and XD bit in the first-stage paging Entries are
deprecated accordingly.
Remove the programming of these bits to make it consistent with the spec.
Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240624032351.249858-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20240702130839.108139-4-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Queued invalidation wait descriptor status is volatile in that IOMMU
hardware writes the data upon completion.
Use READ_ONCE() to prevent compiler optimizations which ensures memory
reads every time. As a side effect, READ_ONCE() also enforces strict
types and may add an extra instruction. But it should not have negative
performance impact since we use cpu_relax anyway and the extra time(by
adding an instruction) may allow IOMMU HW request cacheline ownership
easier.
e.g. gcc 12.3
BEFORE:
81 38 ad de 00 00 cmpl $0x2,(%rax)
AFTER (with READ_ONCE())
772f: 8b 00 mov (%rax),%eax
7731: 3d ad de 00 00 cmp $0x2,%eax
//status data is 32 bit
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20240607173817.3914600-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240702130839.108139-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
If io-pgtable quirk flag indicates support for hardware update of
dirty state, enable HA/HD bits in the SMMU CD and also set the DBM
bit in the page descriptor.
Now report the dirty page tracking capability of SMMUv3 and
select IOMMUFD_DRIVER for ARM_SMMU_V3 if IOMMUFD is enabled.
Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240703101604.2576-6-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
This provides all the infrastructure to enable dirty tracking if the
hardware has the capability and domain alloc request for it.
Also, add a device_iommu_capable() check in iommufd core for
IOMMU_CAP_DIRTY_TRACKING before we request a user domain with dirty
tracking support.
Please note, we still report no support for IOMMU_CAP_DIRTY_TRACKING
as it will finally be enabled in a subsequent patch.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240703101604.2576-5-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
.read_and_clear_dirty() IOMMU domain op takes care of reading the dirty
bits (i.e. PTE has DBM set and AP[2] clear) and marshalling into a
bitmap of a given page size.
While reading the dirty bits we also set the PTE AP[2] bit to mark it
as writeable-clean depending on read_and_clear_dirty() flags.
PTE states with respect to DBM bit:
DBM bit AP[2]("RDONLY" bit)
1. writable_clean 1 1
2. writable_dirty 1 0
3. read-only 0 1
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240703101604.2576-4-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
If the SMMU supports it and the kernel was built with HTTU support,
Probe support for Hardware Translation Table Update (HTTU) which is
essentially to enable hardware update of access and dirty flags.
Probe and set the smmu::features for Hardware Dirty and Hardware Access
bits. This is in preparation, to enable it on the context descriptors of
stage 1 format.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240703101604.2576-3-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
This will be used by iommufd for allocating usr managed domains and is
also required when we add support for iommufd based dirty tracking
support.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20240703101604.2576-2-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
To avoid deferring probe smmu driver silently, record reason for it.
It can be checked through ../debugfs/devices_deferred as well:
/sys/kernel/debug# cat devices_deferred
15000000.iommu arm-smmu: qcom_scm not ready
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/1719910870-25079-1-git-send-email-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Parse out the bitfields for easier-to-read fault messages.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20240701162025.375134-4-robdclark@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
Handled faults can be "normal", don't spam dmesg about them.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20240701162025.375134-3-robdclark@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
For consistency, add the "CB" prefix to the bitfield defines for context
registers.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20240701162025.375134-2-robdclark@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
With ARCH=arm64, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Fixes: da55da5a42 ("iommu/arm-smmu-v3: Make the kunit into a module")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240613-md-arm64-drivers-iommu-arm-arm-smmu-v3-v1-1-0e9f7584a5c8@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
The top of the 2 level stream table is (at most) 128k entries big, and two
high order allocations are required. One of __le64 which is programmed
into the HW (1M), and one of struct arm_smmu_strtab_l1_desc which holds
the CPU pointer (3M).
There is no reason to store the l2ptr_dma as nothing reads it. devm stores
a copy of it and the DMA memory will be freed via devm mechanisms. span is
a constant of 8+1. Remove both.
This removes 16 bytes from each arm_smmu_l1_ctx_desc and saves up to 2M of
memory per iommu instance.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-318ed5f6983b+198f-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
dmam_alloc_coherent() already returns zero'd memory so cfg->strtab.l1_desc
(the list of DMA addresses for the L2 entries) is already zero'd.
arm_smmu_init_l1_strtab() goes through and calls
arm_smmu_write_strtab_l1_desc() on the newly allocated (and zero'd) struct
arm_smmu_strtab_l1_desc, which ends up computing 'val = 0' and zeroing it
again.
Remove arm_smmu_init_l1_strtab() and just call devm_kcalloc() from
arm_smmu_init_strtab_2lvl to allocate the companion struct.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-318ed5f6983b+198f-smmuv3_tidy_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The SVA cleanup made the SSID logic entirely general so all we need to do
is call it with the correct cd table entry for a S1 domain.
This is slightly tricky because of the ASID and how the locking works, the
simple fix is to just update the ASID once we get the right locks.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
If the STE doesn't point to the CD table we can upgrade it by
reprogramming the STE with the appropriate S1DSS. We may also need to turn
on ATS at the same time.
Keep track if the installed STE is pointing at the cd_table and the ATS
state to trigger this path.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
S1DSS brings in quite a few new transition pairs that are
interesting. Test to/from S1DSS_BYPASS <-> S1DSS_SSID0, and
BYPASS <-> S1DSS_SSID0.
Test a contrived non-hitless flow to make sure that the logic works.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/12-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The HW supports this, use the S1DSS bits to configure the behavior
of SSID=0 which is the RID's translation.
If SSID's are currently being used in the CD table then just update the
S1DSS bits in the STE, remove the master_domain and leave ATS alone.
For iommufd the driver design has a small problem that all the unused CD
table entries are set with V=0 which will generate an event if VFIO
userspace tries to use the CD entry. This patch extends this problem to
include the RID as well if PASID is being used.
For BLOCKED with used PASIDs the
F_STREAM_DISABLED (STRTAB_STE_1_S1DSS_TERMINATE) event is generated on
untagged traffic and a substream CD table entry with V=0 (removed pasid)
will generate C_BAD_CD. Arguably there is no advantage to using S1DSS over
the CD entry 0 with V=0.
As we don't yet support PASID in iommufd this is a problem to resolve
later, possibly by using EPD0 for unused CD table entries instead of V=0,
and not using S1DSS for BLOCKED.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
This removes all the notifier de-duplication logic in the driver and
relies on the core code to de-duplicate and allocate only one SVA domain
per mm per smmu instance. This naturally gives a 1:1 relationship between
SVA domain and mmu notifier.
It is a significant simplication of the flow, as we end up with a single
struct arm_smmu_domain for each MM and the invalidation can then be
shifted to properly use the masters list like S1/S2 do.
Remove all of the previous mmu_notifier, bond, shared cd, and cd refcount
logic entirely.
The logic here is tightly wound together with the unusued BTM
support. Since the BTM logic requires holding all the iommu_domains in a
global ASID xarray it conflicts with the design to have a single SVA
domain per PASID, as multiple SMMU instances will need to have different
domains.
Following patches resolve this by making the ASID xarray per-instance
instead of global. However, converting the BTM code over to this
methodology requires many changes.
Thus, since ARM_SMMU_FEAT_BTM is never enabled, remove the parts of the
BTM support for ASID sharing that interact with SVA as well.
A followup series is already working on fully enabling the BTM support,
that requires iommufd's VIOMMU feature to bring in the KVM's VMID as
well. It will come with an already written patch to bring back the ASID
sharing using a per-instance ASID xarray.
https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/https://lore.kernel.org/linux-iommu/26-v6-228e7adf25eb+4155-smmuv3_newapi_p2_jgg@nvidia.com/
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Fill in the smmu_domain->devices list in the new struct arm_smmu_domain
that SVA allocates. Keep track of every SSID and master that is using the
domain reusing the logic for the RID attach.
This is the first step to making the SVA invalidation follow the same
design as S1/S2 invalidation. At present nothing will read this list.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Currently the SVA domain is a naked struct iommu_domain, allocate a struct
arm_smmu_domain instead.
This is necessary to be able to use the struct arm_master_domain
mechanism.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Allow creating and managing arm_smmu_mater_domain's with a non-zero SSID
through the arm_smmu_attach_*() family of functions. This triggers ATC
invalidation for the correct SSID in PASID cases and tracks the
per-attachment SSID in the struct arm_smmu_master_domain.
Generalize arm_smmu_attach_remove() to be able to remove SSID's as well by
ensuring the ATC for the PASID is flushed properly.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
We no longer need a master->sva_enable to control what attaches are
allowed. Instead we can tell if the attach is legal based on the current
configuration of the master.
Keep track of the number of valid CD entries for SSID's in the cd_table
and if the cd_table has been installed in the STE directly so we know what
the configuration is.
The attach logic is then made into:
- SVA bind, check if the CD is installed
- RID attach of S2, block if SSIDs are used
- RID attach of IDENTITY/BLOCKING, block if SSIDs are used
arm_smmu_set_pasid() is already checking if it is possible to setup a CD
entry, at this patch it means the RID path already set a STE pointing at
the CD table.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Prepare to allow a S1 domain to be attached to a PASID as well. Keep track
of the SSID the domain is using on each master in the
arm_smmu_master_domain.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The core code allows the domain to be changed on the fly without a forced
stop in BLOCKED/IDENTITY. In this flow the driver should just continually
maintain the ATS with no change while the STE is updated.
ATS relies on a linked list smmu_domain->devices to keep track of which
masters have the domain programmed, but this list is also used by
arm_smmu_share_asid(), unrelated to ats.
Create two new functions to encapsulate this combined logic:
arm_smmu_attach_prepare()
<caller generates and sets the STE>
arm_smmu_attach_commit()
The two functions can sequence both enabling ATS and disabling across
the STE store. Have every update of the STE use this sequence.
Installing a S1/S2 domain always enables the ATS if the PCIe device
supports it.
The enable flow is now ordered differently to allow it to be hitless:
1) Add the master to the new smmu_domain->devices list
2) Program the STE
3) Enable ATS at PCIe
4) Remove the master from the old smmu_domain
This flow ensures that invalidations to either domain will generate an ATC
invalidation to the device while the STE is being switched. Thus we don't
need to turn off the ATS anymore for correctness.
The disable flow is the reverse:
1) Disable ATS at PCIe
2) Program the STE
3) Invalidate the ATC
4) Remove the master from the old smmu_domain
Move the nr_ats_masters adjustments to be close to the list
manipulations. It is a count of the number of ATS enabled masters
currently in the list. This is stricly before and after the STE/CD are
revised, and done under the list's spin_lock.
This is part of the bigger picture to allow changing the RID domain while
a PASID is in use. If a SVA PASID is relying on ATS to function then
changing the RID domain cannot just temporarily toggle ATS off without
also wrecking the SVA PASID. The new infrastructure here is organized so
that the PASID attach/detach flows will make use of it as well in
following patches.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
The next patch will need to store the same master twice (with different
SSIDs), so allocate memory for each list element.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Add arm_smmu_set_pasid()/arm_smmu_remove_pasid() which are to be used by
callers that already constructed the arm_smmu_cd they wish to program.
These functions will encapsulate the shared logic to setup a CD entry that
will be shared by SVA and S1 domain cases.
Prior fixes had already moved most of this logic up into
__arm_smmu_sva_bind(), move it to it's final home.
Following patches will relieve some of the remaining SVA restrictions:
- The RID domain is a S1 domain and has already setup the STE to point to
the CD table
- The programmed PASID is the mm_get_enqcmd_pasid()
- Nothing changes while SVA is running (sva_enable)
SVA invalidation will still iterate over the S1 domain's master list,
later patches will resolve that.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
This allows the driver the receive the mm and always a device during
allocation. Later patches need this to properly setup the notifier when
the domain is first allocated.
Remove ops->domain_alloc() as SVA was the only remaining purpose.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>