Nothing in there is active if CONFIG_AMD_IOMMU is not enabled, so the whole
directory can depend on that switch as well.
Fixes: cbe94c6e1a ("iommu/amd: Move Kconfig and Makefile bits down into amd directory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1894970.atdPhlSkOF@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Nothing in there is active if CONFIG_INTEL_IOMMU is not enabled, so the whole
directory can depend on that switch as well.
Fixes: ab65ba57e3 ("iommu/vt-d: Move Kconfig and Makefile bits down into intel directory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/3818749.MHq7AAxBmi@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is an extraenous space after \n in a pr_debug message. Remove it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430151853.923614-1-colin.i.king@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit 991de2e590 ("PCI, x86: Implement pcibios_alloc_irq() and
pcibios_free_irq()") changed IRQ handling on PCI driver probing.
It inadvertently broke resume from system sleep on AMD platforms:
https://lore.kernel.org/r/20150926164651.GA3640@pd.tnic/
This was fixed by two independent commits:
* 8affb487d4 ("x86/PCI: Don't alloc pcibios-irq when MSI is enabled")
* cbbc00be2c ("iommu/amd: Prevent binding other PCI drivers to IOMMU PCI devices")
The breaking change and one of these two fixes were subsequently reverted:
* fe25d07887 ("Revert "x86/PCI: Don't alloc pcibios-irq when MSI is enabled"")
* 6c777e8799 ("Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"")
This rendered the second fix unnecessary, so revert it as well. It used
the match_driver flag in struct pci_dev, which is internal to the PCI core
and not supposed to be touched by arbitrary drivers.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://patch.msgid.link/9a3ddff5cc49512044f963ba0904347bd404094d.1745572340.git.lukas@wunner.de
After commit 48e7b8e284 ("iommu/arm-smmu-v3: Remove
arm_smmu_domain_finalise() during attach"), an error code is not
returned on the attach path when the smmu does not match with the
domain. This causes problems with VFIO because
vfio_iommu_type1_attach_group() relies on this check to determine domain
compatability.
Re-instate the -EINVAL return value when the SMMU doesn't match on the
device attach path.
Fixes: 48e7b8e284 ("iommu/arm-smmu-v3: Remove arm_smmu_domain_finalise() during attach")
Signed-off-by: Qinxin Xia <xiaqinxin@huawei.com>
Link: https://lore.kernel.org/r/20250422112951.2027969-1-xiaqinxin@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than DMA.
In proposed API, the callers will perform the following steps.
In map path:
if (dma_can_use_iova(...))
dma_iova_alloc()
for (page in range)
dma_iova_link_next(...)
dma_iova_sync(...)
else
/* Fallback to legacy map pages */
for (all pages)
dma_map_page(...)
In unmap path:
if (dma_can_use_iova(...))
dma_iova_destroy()
else
for (all pages)
dma_unmap_page(...)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Split the iommu logic from iommu_dma_map_page into a separate helper.
This not only keeps the code neatly separated, but will also allow for
reuse in another caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
The existing .map_pages() callback provides both allocating of IOVA
and linking DMA pages. That combination works great for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().
These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving range
linkage operation to be in fast path.
Provide an interface to allocate/deallocate IOVA and next patch
link/unlink DMA ranges to that specific IOVA.
In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some recomputed information
for the transaction which does not change for each page being
mapped, so add a check if IOVA can be used for the specific
transaction.
The API is exported from dma-iommu as it is the only implementation
supported, the namespace is clearly different from iommu_* functions
which are not allowed to be used. This code layout allows us to save
function call per API call used in datapath as well as a lot of boilerplate
code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Add kernel-doc section for iommu_unmap_fast to document existing
limitation of underlying functions which can't split individual ranges.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
For the upcoming IOVA-based DMA API we want to batch the
ops->iotlb_sync_map() call after mapping multiple IOVAs from
dma-iommu without having a scatterlist. Improve the API.
Add a wrapper for the map_sync as iommu_sync_map() so that callers
don't need to poke into the methods directly.
Formalize __iommu_map() into iommu_map_nosync() which requires the
caller to call iommu_sync_map() after all maps are completed.
Refactor the existing sanity checks from all the different layers
into iommu_map_nosync().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
To support the upcoming non-scatterlist mapping helpers, we need to go
back to have them called outside of the DMA API. Thus move them out of
dma-map-ops.h, which is only for DMA API implementations to pci-p2pdma.h,
which is for driver use.
Note that the core helper is still not exported as the mapping is
expected to be done only by very highlevel subsystem code at least for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
The current scheme with a single helper to determine the P2P status
and map a scatterlist segment force users to always use the map_sg
helper to DMA map, which we're trying to get away from because they
are very cache inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
AMD IOMMU reserves the address range 0xfd00000000-0xffffffffff for
the hypertransport protocol (HT) and has special meaning. Hence devices
cannot use this address range for the DMA. However on some AMD platforms
this HT range is shifted to the very top of the address space and new
feature bit `HTRangeIgnore` is introduced. When this feature bit is on,
IOMMU treats the GPA access to the legacy HT range as regular GPA access.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250317055020.25214-1-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Synchronize RCU when unregistering KVM's GA log notifier to ensure all
in-flight interrupt handlers complete before KVM-the module is unloaded.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315031048.2374109-1-seanjc@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
On a 32 bit system calling:
iommu_map(0, 0x40000000)
When using the AMD V1 page table type with a domain->pgsize of 0xfffff000
causes iommu_pgsize() to miscalculate a result of:
size=0x40000000 count=2
count should be 1. This completely corrupts the mapping process.
This is because the final test to adjust the pagesize malfunctions when
the addition overflows. Use check_add_overflow() to prevent this.
Fixes: b1d99dc5f9 ("iommu: Hook up '->unmap_pages' driver callback")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v1-3ad28fc2e3a3+163327-iommu_overflow_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Next up on our list of race windows to close is another one during
iommu_device_register() - it's now OK again for multiple instances to
run their bus_iommu_probe() in parallel, but an iommu_probe_device() can
still also race against a running bus_iommu_probe(). As Johan has
managed to prove, this has now become a lot more visible on DT platforms
wth driver_async_probe where a client driver is attempting to probe in
parallel with its IOMMU driver - although commit b46064a188 ("iommu:
Handle race with default domain setup") resolves this from the client
driver's point of view, this isn't before of_iommu_configure() has had
the chance to attempt to "replay" a probe that the bus walk hasn't even
tried yet, and so still cause the out-of-order group allocation
behaviour that we're trying to clean up (and now warning about).
The most reliable thing to do here is to explicitly keep track of the
"iommu_device_register() is still running" state, so we can then
special-case the ops lookup for the replay path (based on dev->iommu
again) to let that think it's still waiting for the IOMMU driver to
appear at all. This still leaves the longstanding theoretical case of
iommu_bus_notifier() being triggered during bus_iommu_probe(), but it's
not so simple to defer a notifier, and nobody's ever reported that being
a visible issue, so let's quietly kick that can down the road for now...
Reported-by: Johan Hovold <johan@kernel.org>
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/88d54c1b48fed8279aa47d30f3d75173685bb26a.1745516488.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The idxd driver attaches the default domain to a PASID of the device to
perform kernel DMA using that PASID. The domain is attached to the
device's PASID through iommu_attach_device_pasid(), which checks if the
domain->owner matches the iommu_ops retrieved from the device. If they
do not match, it returns a failure.
if (ops != domain->owner || pasid == IOMMU_NO_PASID)
return -EINVAL;
The static identity domain implemented by the intel iommu driver doesn't
specify the domain owner. Therefore, kernel DMA with PASID doesn't work
for the idxd driver if the device translation mode is set to passthrough.
Generally the owner field of static domains are not set because they are
already part of iommu ops. Add a helper domain_iommu_ops_compatible()
that checks if a domain is compatible with the device's iommu ops. This
helper explicitly allows the static blocked and identity domains associated
with the device's iommu_ops to be considered compatible.
Fixes: 2031c469f8 ("iommu/vt-d: Add support for static identity domain")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220031
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/linux-iommu/20250422191554.GC1213339@ziepe.ca/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250424034123.2311362-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
In general a 'struct device' is way too large to be put on the kernel
stack. Apparently something just caused it to grow a slightly larger,
which pushed the arm_lpae_do_selftests() function over the warning
limit in some configurations:
drivers/iommu/io-pgtable-arm.c:1423:19: error: stack frame size (1032) exceeds limit (1024) in 'arm_lpae_do_selftests' [-Werror,-Wframe-larger-than]
1423 | static int __init arm_lpae_do_selftests(void)
| ^
Change the function to use a dynamically allocated faux_device
instead of the on-stack device structure.
Fixes: ca25ec247a ("iommu/io-pgtable-arm: Remove iommu_dev==NULL special case")
Link: https://lore.kernel.org/all/ab75a444-22a1-47f5-b3c0-253660395b5a@arm.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250423164826.2931382-1-arnd@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
iommu_device_sysfs_add() requires a constant format string, otherwise
a W=1 build produces a warning:
drivers/iommu/ipmmu-vmsa.c:1093:62: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security]
1093 | ret = iommu_device_sysfs_add(&mmu->iommu, &pdev->dev, NULL, dev_name(&pdev->dev));
| ^~~~~~~~~~~~~~~~~~~~
drivers/iommu/ipmmu-vmsa.c:1093:62: note: treat the string as an argument to avoid this
1093 | ret = iommu_device_sysfs_add(&mmu->iommu, &pdev->dev, NULL, dev_name(&pdev->dev));
| ^
| "%s",
This was an old bug but I saw it now because the code was changed as part
of commit d9d3cede41 ("iommu/ipmmu-vmsa: Register in a sensible order").
Fixes: 7af9a5fdb9 ("iommu/ipmmu-vmsa: Use iommu_device_sysfs_add()/remove()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250423164006.2661372-1-arnd@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
fsl_pamu is the last user of domain_alloc(), and it is using it to create
something weird that doesn't really fit into the iommu subsystem
architecture. It is a not a paging domain since it doesn't have any
map/unmap ops. It may be some special kind of identity domain.
For now just leave it as is. Wrap it's definition in CONFIG_FSL_PAMU to
discourage any new drivers from attempting to use it.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/5-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
No driver implements SVA under domain_alloc() anymore, this is dead
code.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/4-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
virtio has the complication that it sometimes wants to return a paging
domain for IDENTITY which makes this conversion a little different than
other drivers.
Add a viommu_domain_alloc_paging() that combines viommu_domain_alloc() and
viommu_domain_finalise() to always return a fully initialized and
finalized paging domain.
Use viommu_domain_alloc_identity() to implement the special non-bypass
IDENTITY flow by calling viommu_domain_alloc_paging() then
viommu_domain_map_identity().
Remove support for deferred finalize and the vdomain->mutex.
Remove core support for domain_alloc() IDENTITY as virtio was the last
driver using it.
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/3-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
virtio-iommu has a mode where the IDENTITY domain is actually a paging
domain with an identity mapping covering some of the system address
space manually created.
To support this add a new domain_alloc_identity() op that accepts
the struct device so that virtio can allocate and fully finalize a
paging domain to return.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/2-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
To make way for a domain_alloc_paging conversion add the typical global
static IDENTITY domain. This supports VMMs that have a
VIRTIO_IOMMU_F_BYPASS_CONFIG config.
If the VMM does not have support then the domain_alloc path is still used,
which creates an IDENTITY domain out of a paging domain.
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/1-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
No external drivers use these interfaces anymore. Furthermore, no existing
iommu drivers implement anything in the callbacks. Remove them to avoid
dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iopf enablement has been moved to the iommu drivers. It is unnecessary
for iommufd to handle iopf enablement. Remove the iopf enablement logic to
avoid duplication.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Update iopf enablement in the iommufd mock device driver to use the new
method, similar to the arm-smmu-v3 driver. Enable iopf support when any
domain with an iopf_handler is attached, and disable it when the domain
is removed.
Add a refcount in the mock device state structure to keep track of the
number of domains set to the device and PASIDs that require iopf.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Update iopf enablement in the driver to use the new method, similar to
the arm-smmu-v3 driver. Enable iopf support when any domain with an
iopf_handler is attached, and disable it when the domain is removed.
Place all the logic for controlling the PRI and iopf queue in the domain
set/remove/replace paths. Keep track of the number of domains set to the
device and PASIDs that require iopf. When the first domain requiring iopf
is attached, add the device to the iopf queue and enable PRI. When the
last domain is removed, remove it from the iopf queue and disable PRI.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
None of the drivers implement anything here anymore, remove the dead code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
SMMUv3 co-mingles FEAT_IOPF and FEAT_SVA behaviors so that fault reporting
doesn't work unless both are enabled. This is not correct and causes
problems for iommufd which does not enable FEAT_SVA for it's fault capable
domains.
These APIs are both obsolete, update SMMUv3 to use the new method like AMD
implements.
A driver should enable iopf support when a domain with an iopf_handler is
attached, and disable iopf support when the domain is removed.
Move the fault support logic to sva domain allocation and to domain
attach, refusing to create or attach fault capable domains if the HW
doesn't support it.
Move all the logic for controlling the iopf queue under
arm_smmu_attach_prepare(). Keep track of the number of domains on the
master (over all the SSIDs) that require iopf. When the first domain
requiring iopf is attached create the iopf queue, when the last domain is
detached destroy it.
Turn FEAT_IOPF and FEAT_SVA into no ops.
Remove the sva_lock, this is all protected by the group mutex.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
On the Lenovo ThinkPad X201, when Intel VT-d is enabled in the BIOS, the
kernel boots with errors related to DMAR, the graphical interface appeared
quite choppy, and the system resets erratically within a minute after it
booted:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xb97ff000
[fault reason 0x05] PTE Write access is not set
Upon comparing boot logs with VT-d on/off, I found that the Intel Calpella
quirk (`quirk_calpella_no_shadow_gtt()') correctly applied the igfx IOMMU
disable/quirk correctly:
pci 0000:00:00.0: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU
for graphics
Whereas with VT-d on, it went into the "else" branch, which then
triggered the DMAR handling fault above:
... else if (!disable_igfx_iommu) {
/* we have to ensure the gfx device is idle before we flush */
pci_info(dev, "Disabling batched IOTLB flush on Ironlake\n");
iommu_set_dma_strict();
}
Now, this is not exactly scientific, but moving 0x0044 to quirk_iommu_igfx
seems to have fixed the aforementioned issue. Running a few `git blame'
runs on the function, I have found that the quirk was originally
introduced as a fix specific to ThinkPad X201:
commit 9eecabcb9a ("intel-iommu: Abort IOMMU setup for igfx if BIOS gave
no shadow GTT space")
Which was later revised twice to the "else" branch we saw above:
- 2011: commit 6fbcfb3e46 ("intel-iommu: Workaround IOTLB hang on
Ironlake GPU")
- 2024: commit ba00196ca4 ("iommu/vt-d: Decouple igfx_off from graphic
identity mapping")
I'm uncertain whether further testings on this particular laptops were
done in 2011 and (honestly I'm not sure) 2024, but I would be happy to do
some distro-specific testing if that's what would be required to verify
this patch.
P.S., I also see IDs 0x0040, 0x0062, and 0x006a listed under the same
`quirk_calpella_no_shadow_gtt()' quirk, but I'm not sure how similar these
chipsets are (if they share the same issue with VT-d or even, indeed, if
this issue is specific to a bug in the Lenovo BIOS). With regards to
0x0062, it seems to be a Centrino wireless card, but not a chipset?
I have also listed a couple (distro and kernel) bug reports below as
references (some of them are from 7-8 years ago!), as they seem to be
similar issue found on different Westmere/Ironlake, Haswell, and Broadwell
hardware setups.
Cc: stable@vger.kernel.org
Fixes: 6fbcfb3e46 ("intel-iommu: Workaround IOTLB hang on Ironlake GPU")
Fixes: ba00196ca4 ("iommu/vt-d: Decouple igfx_off from graphic identity mapping")
Link: https://groups.google.com/g/qubes-users/c/4NP4goUds2c?pli=1
Link: https://bugs.archlinux.org/task/65362
Link: https://bbs.archlinux.org/viewtopic.php?id=230323
Reported-by: Wenhao Sun <weiguangtwk@outlook.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=197029
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Link: https://lore.kernel.org/r/20250415133330.12528-1-jeffbai@aosc.io
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
enabled, as KVM shouldn't try to enable posting when they're unsupported,
and the IOMMU driver darn well should only advertise posting support when
AMD_IOMMU_GUEST_IR_VAPIC() is true.
Note, KVM consumes is_guest_mode only on success.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
invoked without use_vapic; lying to KVM about whether or not the IRTE was
configured to post IRQs is all kinds of bad.
Fixes: d98de49a53 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend the aperture calculation to consider sizes beyond the maximum
size of a region third table. Attempt to always use the smallest
table size possible to avoid unnecessary extra steps during translation.
Update reserved region calculations to use the appropriate table size.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-6-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Map and unmap ops use the shared dma_walk_cpu_trans routine, update
this using the origin_type of the dma_table to determine how many
table levels must be walked.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-5-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The origin_type of the dma_table is used to determine how many table
levels must be traversed for the translation.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-4-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Extend the existing dma_cleanup_tables to also handle region second and
region first tables.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-3-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When registering the I/O Translation Anchor, use the current table type
stored in the s390_domain to set the appropriate region type
indication. For the moment, the table type will always be stored as
region third.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-2-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add support for the two MM IOMMUs found in the MediaTek Dimensity
1200 (MT6893) SoC, used for display, camera, imgsys and vpu.
Since the MT6893 SoC is almost fully compatible with MT8192, also
move the mt8192_larb_region_msk before the newly introduced mt6893
platform data, as the larb regions from mt8192 are fully reused.
Note that the only difference with MT8192 is that MT6893 needs the
SHARE_PGTABLE flag.
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Link: https://lore.kernel.org/r/20250410144008.475888-3-angelogioacchino.delregno@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is a string parsing logic error which can lead to an overflow of hid
or uid buffers. Comparing ACPIID_LEN against a total string length doesn't
take into account the lengths of individual hid and uid buffers so the
check is insufficient in some cases. For example if the length of hid
string is 4 and the length of the uid string is 260, the length of str
will be equal to ACPIID_LEN + 1 but uid string will overflow uid buffer
which size is 256.
The same applies to the hid string with length 13 and uid string with
length 250.
Check the length of hid and uid strings separately to prevent
buffer overflow.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: ca3bf5d47c ("iommu/amd: Introduces ivrs_acpihid kernel parameter")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Paklov <Pavel.Paklov@cyberprotect.ru>
Link: https://lore.kernel.org/r/20250325092259.392844-1-Pavel.Paklov@cyberprotect.ru
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are quite a lot of options for the Arm drivers, still all buried
in the top-level Kconfig. For ease of use and consistency with all the
other subdirectories, break these out into drivers/arm. For similar
clarity and self-consistency, also tweak the ARM_SMMU sub-options to use
"if" instead of "depends", to match ARM_SMMU_V3. Lastly also clean up
the slightly messy description of ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT as
highlighted by Geert - by now we really shouldn't need commentary on
v4.x kernel behaviour anyway - and downgrade it to EXPERT as the first
step in the 6-year-old threat to remove it entirely.
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/a614ec86ba78c09cd16e348f633f6bb38793391f.1742480488.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Although the lock-juggling is only a temporary workaround, we don't want
it to make things avoidably worse. Jason was right to be nervous, since
bus_iommu_probe() doesn't care *which* IOMMU instance it's probing for,
so it probably is possible for one walk to finish a probe which a
different walk started, thus we do want to check for that.
Also there's no need to drop the lock just to have of_iommu_configure()
do nothing when a fwspec already exists; check that directly and avoid
opening a window at all in that (still somewhat likely) case.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/09d901ad11b3a410fbb6e27f7d04ad4609c3fe4a.1741706365.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Intel is the only thing that uses this now, convert to the size versions,
trying to avoid PAGE_SHIFT.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/23-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Use iommu_alloc_pages_node_sz() instead.
AMD and Intel are both using 4K pages for these structures since those
drivers only work on 4K PAGE_SIZE.
riscv is also spec'd to use SZ_4K.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A few small changes to the remaining drivers using these will allow
them to be removed:
- Exynos wants to allocate fixed 16K/8K allocations
- Rockchip already has a define SPAGE_SIZE which is used by the
dma_map immediately following, using SPAGE_ORDER which is a lg2size
- tegra has size constants already for its two allocations
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Convert most of the places calling get_order() as an argument to the
iommu-pages allocator into order_base_2() or the _sz flavour
instead. These places already have an exact size, there is no particular
reason to use order here.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
One part of RISCV already has a computed size, however the queue
allocation must be aligned to 4k. The other objects are 4k by spec.
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/18-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If x >= PAGE_SIZE then:
1 << (get_order(x) + PAGE_SHIFT) == roundup_pow_two()
Inline this into the only caller, compute the size of the HW device table
in terms of 4K pages which matches the HW definition.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/17-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This is just CPU memory used by the driver to track things, it doesn't
need to use iommu-pages. All of them are indexed by devid and devid is
bounded by pci_seg->last_bdf or we are already out of bounds on the page
allocation.
Switch them to use some version of kvmalloc_array() and drop the now
unused constants and remove the tbl_size() round up to PAGE_SIZE multiples
logic.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/16-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Generally drivers have a specific idea what their HW structure size should
be. In a lot of cases this is related to PAGE_SIZE, but not always. ARM64,
for example, allows a 4K IO page table size on a 64K CPU page table
system.
Currently we don't have any good support for sub page allocations, but
make the API accommodate this by accepting a sub page size from the caller
and rounding up internally.
This is done by moving away from order as the size input and using size:
size == 1 << (order + PAGE_SHIFT)
Following patches convert drivers away from using order and try to specify
allocation sizes independent of PAGE_SIZE.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The entire allocator API is built around using the kernel virtual address,
it is illegal to pass GFP_HIGHMEM in as a GFP flag. Block it in the common
code. Remove the duplicated checks from drivers.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This brings the iommu page table allocator into the modern world of having
its own private page descriptor and not re-using fields from struct page
for its own purpose. It follows the basic pattern of struct ptdesc which
did this transformation for the CPU page table allocator.
Currently iommu-pages is pretty basic so this isn't a huge benefit,
however I see a coming need for features that CPU allocator has, like sub
PAGE_SIZE allocations, and RCU freeing. This provides the base
infrastructure to implement those cleanly.
Remove numa_node_id() calls from the inlines and instead use NUMA_NO_NODE
which will get switched to numa_mem_id(), which seems to be the right ID
to use for memory allocations.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This converts the remaining places using list of pages to the new API.
The Intel free path was shared with its gather path, so it is converted at
the same time.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Change the internal freelist to use struct iommu_pages_list.
AMD uses the freelist to batch free the entire table during domain
destruction, and to replace table levels with leafs during map.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Change the internal freelist to use struct iommu_pages_list.
riscv uses this page list to free page table levels that are replaced
with leaf ptes.
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We want to get rid of struct page references outside the internal
allocator implementation. The free list has the driver open code something
like:
list_add_tail(&virt_to_page(ptr)->lru, freelist);
Move the above into a small inline and make the freelist into a wrapper
type 'struct iommu_pages_list' so that the compiler can help check all the
conversion.
This struct has also proven helpful in some future ideas to convert to a
singly linked list to get an extra pointer in the struct page, and to
signal that the pages should be freed with RCU.
Use a temporary _Generic so we don't need to rename the free function as
the patches progress.
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
These are called in a lot of places and are not trivial. Move them to the
core module.
Tidy some of the comments and function arguments, fold
__iommu_alloc_account() into its only caller, change
__iommu_free_account() into __iommu_free_page() to remove some
duplication.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Now that we have a folio under the allocation iommu_free_pages() can know
the order of the original allocation and do the correct thing to free it.
The next patch will rename iommu_free_page() to iommu_free_pages() so we
have naming consistency with iommu_alloc_pages_node().
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
alloc_pages_node(, order) needs to be paired with __free_pages(, order) to
free all the allocated pages. For order != 0 the return from
alloc_pages_node() is just a page list, it hasn't been formed into a
folio.
However iommu_put_pages_list() just calls put_page() on the head page of
an allocation, which will end up leaking the tail pages if order != 0.
Fix this by using __GFP_COMP to create a high order folio and then always
use put_page() to free the full high order folio.
__iommu_free_account() can get the order of the allocation via
folio_order(), which corrects the accounting of high order allocations in
iommu_put_pages_list(). This is the same technique slub uses.
As far as I can tell, none of the places using high order allocations are
also using the free list, so this not a current bug.
Fixes: 06c375053c ("iommu/vt-d: add wrapper functions for page allocations")
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
These were only used by tegra-smmu and leaked the struct page out of the
API. Delete them since tega-smmu has been converted to the other APIs.
In the process flatten the call tree so we have fewer one line functions
calling other one line functions.. iommu_alloc_pages_node() is the real
allocator and everything else can just call it directly.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Instead use the virtual address and dma_map_single() like as->pd
uses. Introduce a small struct tegra_pt instead of void * to have some
clarity what is using this API and add compile safety during the
conversion.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Instead use the virtual address. Change from dma_map_page() to
dma_map_single() which works directly on a KVA. Add a type for the pd
table level for clarity.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We've never supported StreamID aliasing between devices, and as such
they will never have had functioning DMA, but this is not fatal to the
SMMU itself. Although aliasing between hard-wired platform device
StreamIDs would tend to raise questions about the whole system, in
practice it's far more likely to occur relatively innocently due to
legacy PCI bridges, where the underlying StreamID mappings are still
perfectly reasonable.
As such, return a more benign -ENODEV when failing probe for such an
unsupported device (and log a more obvious error message), so that it
doesn't break the entire SMMU probe now that bus_iommu_probe() runs in
the right order and can propagate that error back. The end result is
still that the device doesn't get an IOMMU group and probably won't
work, same as before.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/39d54e49c8476efc4653e352150d44b185d6d50f.1744380554.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
ASPEED VGA card has two built-in devices:
0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
Its toplogy looks like this:
+-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0 Sandisk Corp Device 5017
| +-01.0-[04]--
| +-02.0-[05]----00.0 NVIDIA Corporation Device
| +-03.0-[06-07]----00.0-[07]----00.0 ASPEED Technology, Inc. ASPEED Graphics Family
| +-04.0-[08]----00.0 Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
| \-05.0-[09]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
\-00.1 PMC-Sierra Inc. Device 4028
The IORT logic populaties two identical IDs into the fwspec->ids array via
DMA aliasing in iort_pci_iommu_init() called by pci_for_each_dma_alias().
Though the SMMU driver had been able to handle this situation since commit
563b5cbe33 ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"), that
got broken by the later commit cdf315f907 ("iommu/arm-smmu-v3: Maintain
a SID->device structure"), which ended up with allocating separate streams
with the same stuffing.
On a kernel prior to v6.15-rc1, there has been an overlooked warning:
pci 0008:07:00.0: vgaarb: setting as boot VGA device
pci 0008:07:00.0: vgaarb: bridge control possible
pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
pcieport 0008:06:00.0: Adding to iommu group 14
ast 0008:07:00.0: stream 67328 already in tree <===== WARNING
ast 0008:07:00.0: enabling device (0002 -> 0003)
ast 0008:07:00.0: Using default configuration
ast 0008:07:00.0: AST 2600 detected
ast 0008:07:00.0: [drm] Using analog VGA
ast 0008:07:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[drm] Initialized ast 0.1.0 for 0008:07:00.0 on minor 0
ast 0008:07:00.0: [drm] fb0: astdrmfb frame buffer device
With v6.15-rc, since the commit bcb81ac6ae ("iommu: Get DT/ACPI parsing
into the proper probe path"), the error returned with the warning is moved
to the SMMU device probe flow:
arm_smmu_probe_device+0x15c/0x4c0
__iommu_probe_device+0x150/0x4f8
probe_iommu_group+0x44/0x80
bus_for_each_dev+0x7c/0x100
bus_iommu_probe+0x48/0x1a8
iommu_device_register+0xb8/0x178
arm_smmu_device_probe+0x1350/0x1db0
which then fails the entire SMMU driver probe:
pci 0008:06:00.0: Adding to iommu group 21
pci 0008:07:00.0: stream 67328 already in tree
arm-smmu-v3 arm-smmu-v3.9.auto: Failed to register iommu
arm-smmu-v3 arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22
Since SMMU driver had been already expecting a potential duplicated Stream
ID in arm_smmu_install_ste_for_dev(), change the arm_smmu_insert_master()
routine to ignore a duplicated ID from the fwspec->sids array as well.
Note: this has been failing the iommu_device_probe() since 2021, although a
recent iommu commit in v6.15-rc1 that moves iommu_device_probe() started to
fail the SMMU driver probe. Since nobody has cared about DMA Alias support,
leave that as it was but fix the fundamental iommu_device_probe() breakage.
Fixes: cdf315f907 ("iommu/arm-smmu-v3: Maintain a SID->device structure")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250415185620.504299-1-nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
UBSan caught a bug with IOMMU SVA domains, where the reported exponent
value in __arm_smmu_tlb_inv_range() was >= 64.
__arm_smmu_tlb_inv_range() uses the domain's pgsize_bitmap to compute
the number of pages to invalidate and the invalidation range. Currently
arm_smmu_sva_domain_alloc() does not setup the iommu domain's
pgsize_bitmap. This leads to __ffs() on the value returning 64 and that
leads to undefined behaviour w.r.t. shift operations
Fix this by initializing the iommu_domain's pgsize_bitmap to PAGE_SIZE.
Effectively the code needs to use the smallest page size for
invalidation
Cc: stable@vger.kernel.org
Fixes: eb6c97647b ("iommu/arm-smmu-v3: Avoid constructing invalid range commands")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250412002354.3071449-1-balbirs@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Commit 67e4fe3985 ("iommu/arm-smmu-v3: Use S2FWB for NESTED domains")
introduced S2FWB usage but omitted the corresponding feature detection.
As a result, vIOMMU allocation fails on FVP in arm_vsmmu_alloc(), due to
the following check:
if (!arm_smmu_master_canwbs(master) &&
!(smmu->features & ARM_SMMU_FEAT_S2FWB))
return ERR_PTR(-EOPNOTSUPP);
This patch adds the missing detection logic to prevent allocation
failure when S2FWB is supported.
Fixes: 67e4fe3985 ("iommu/arm-smmu-v3: Use S2FWB for NESTED domains")
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250408033351.1012411-1-aneesh.kumar@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Two WARNINGs are observed when SMMU driver rolls back upon failure:
arm-smmu-v3.9.auto: Failed to register iommu
arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22
------------[ cut here ]------------
WARNING: CPU: 5 PID: 1 at kernel/dma/mapping.c:74 dmam_free_coherent+0xc0/0xd8
Call trace:
dmam_free_coherent+0xc0/0xd8 (P)
tegra241_vintf_free_lvcmdq+0x74/0x188
tegra241_cmdqv_remove_vintf+0x60/0x148
tegra241_cmdqv_remove+0x48/0xc8
arm_smmu_impl_remove+0x28/0x60
devm_action_release+0x1c/0x40
------------[ cut here ]------------
128 pages are still in use!
WARNING: CPU: 16 PID: 1 at mm/page_alloc.c:6902 free_contig_range+0x18c/0x1c8
Call trace:
free_contig_range+0x18c/0x1c8 (P)
cma_release+0x154/0x2f0
dma_free_contiguous+0x38/0xa0
dma_direct_free+0x10c/0x248
dma_free_attrs+0x100/0x290
dmam_free_coherent+0x78/0xd8
tegra241_vintf_free_lvcmdq+0x74/0x160
tegra241_cmdqv_remove+0x98/0x198
arm_smmu_impl_remove+0x28/0x60
devm_action_release+0x1c/0x40
This is because the LVCMDQ queue memory are managed by devres, while that
dmam_free_coherent() is called in the context of devm_action_release().
Jason pointed out that "arm_smmu_impl_probe() has mis-ordered the devres
callbacks if ops->device_remove() is going to be manually freeing things
that probe allocated":
https://lore.kernel.org/linux-iommu/20250407174408.GB1722458@nvidia.com/
In fact, tegra241_cmdqv_init_structures() only allocates memory resources
which means any failure that it generates would be similar to -ENOMEM, so
there is no point in having that "falling back to standard SMMU" routine,
as the standard SMMU would likely fail to allocate memory too.
Remove the unwind part in tegra241_cmdqv_init_structures(), and return a
proper error code to ask SMMU driver to call tegra241_cmdqv_remove() via
impl_ops->device_remove(). Then, drop tegra241_vintf_free_lvcmdq() since
devres will take care of that.
Fixes: 483e0bd888 ("iommu/tegra241-cmdqv: Do not allocate vcmdq until dma_set_mask_and_coherent")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250407201908.172225-1-nicolinc@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently, mtk_iommu calls during probe iommu_device_register before
the hw_list from driver data is initialized. Since iommu probing issue
fix, it leads to NULL pointer dereference in mtk_iommu_device_group when
hw_list is accessed with list_first_entry (not null safe).
So, change the call order to ensure iommu_device_register is called
after the driver data are initialized.
Fixes: 9e3a2a6436 ("iommu/mediatek: Adapt sharing and non-sharing pgtable case")
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org> # MT8183 Juniper, MT8186 Tentacruel
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Link: https://lore.kernel.org/r/20250403-fix-mtk-iommu-error-v2-1-fe8b18f8b0a8@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe
path") changed the sequence of probing the SYSMMU controller devices and
calls to arm_iommu_attach_device(), what results in resuming SYSMMU
controller earlier, when it is still set to IDENTITY mapping. Such change
revealed the bug in IDENTITY handling in the exynos-iommu driver. When
SYSMMU controller is set to IDENTITY mapping, data->domain is NULL, so
adjust checks in suspend & resume callbacks to handle this case
correctly.
Fixes: b3d14960e6 ("iommu/exynos: Implement an IDENTITY domain")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250401202731.2810474-1-m.szyprowski@samsung.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
IPMMU registers almost-initialised instances, but misses assigning the
drvdata to make them fully functional, so initial calls back into
ipmmu_probe_device() are likely to fail unnecessarily. Reorder this to
work as it should, also pruning the long-out-of-date comment and adding
the missing sysfs cleanup on error for good measure.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/53be6667544de65a15415b699e38a9a965692e45.1742481687.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If iommu_device_register() encounters an error, it can end up tearing
down already-configured groups and default domains, however this
currently still leaves devices hooked up to iommu-dma (and even
historically the behaviour in this area was at best inconsistent across
architectures/drivers...) Although in the case that an IOMMU is present
whose driver has failed to probe, users cannot necessarily expect DMA to
work anyway, it's still arguable that we should do our best to put
things back as if the IOMMU driver was never there at all, and certainly
the potential for crashing in iommu-dma itself is undesirable. Make sure
we clean up the dev->dma_iommu flag along with everything else.
Reported-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Closes: https://lore.kernel.org/all/CAGXv+5HJpTYmQ2h-GD7GjyeYT7bL9EBCvu0mz5LgpzJZtzfW0w@mail.gmail.com/
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/e788aa927f6d827dd4ea1ed608fada79f2bab030.1744284228.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Do not touch per-device DMA ops when the driver has been converted to use
the dma-iommu API.
Fixes: c588072bba ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://lore.kernel.org/r/20250403165605.278541-1-ptesarik@suse.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Set the posted MSI irq_chip's irq_ack() hook to irq_move_irq() instead of
a dummy/empty callback so that posted MSIs process pending changes to the
IRQ's SMP affinity. Failure to honor a pending set-affinity results in
userspace being unable to change the effective affinity of the IRQ, as
IRQD_SETAFFINITY_PENDING is never cleared and so irq_set_affinity_locked()
always defers moving the IRQ.
The issue is most easily reproducible by setting /proc/irq/xx/smp_affinity
multiple times in quick succession, as only the first update is likely to
be handled in process context.
Fixes: ed1e48ea43 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: Robert Lippert <rlippert@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Wentao Yang <wentaoyang@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250321194249.1217961-1-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The following crash is observed while handling an IOMMU fault with a
recent kernel:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle page fault for address: ffff8c708299f700
PGD 19ee01067 P4D 19ee01067 PUD 101c10063 PMD 80000001028001e3
Oops: Oops: 0011 [#1] SMP NOPTI
CPU: 4 UID: 0 PID: 139 Comm: irq/25-AMD-Vi Not tainted 6.15.0-rc1+ #20 PREEMPT(lazy)
Hardware name: LENOVO 21D0/LNVNB161216, BIOS J6CN50WW 09/27/2024
RIP: 0010:0xffff8c708299f700
Call Trace:
<TASK>
? report_iommu_fault+0x78/0xd3
? amd_iommu_report_page_fault+0x91/0x150
? amd_iommu_int_thread+0x77/0x180
? __pfx_irq_thread_fn+0x10/0x10
? irq_thread_fn+0x23/0x60
? irq_thread+0xf9/0x1e0
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
? kthread+0xfc/0x240
? __pfx_kthread+0x10/0x10
? ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
? ret_from_fork_asm+0x1a/0x30
</TASK>
report_iommu_fault() checks for an installed handler comparing the
corresponding field to NULL. It can (and could before) be called for a
domain with a different cookie type - IOMMU_COOKIE_DMA_IOVA, specifically.
Cookie is represented as a union so we may end up with a garbage value
treated there if this happens for a domain with another cookie type.
Formerly there were two exclusive cookie types in the union.
IOMMU_DOMAIN_SVA has a dedicated iommu_report_device_fault().
Call the fault handler only if the passed domain has a required cookie
type.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 6aa63a4ec9 ("iommu: Sort out domain user data")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250408213342.285955-1-pchelkin@ispras.ru
Signed-off-by: Joerg Roedel <jroedel@suse.de>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are clearly
linked to a device. This is linked to the VIOMMU object and is intended to
be used by a VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered through
a simple FD returning event records on read().
- PASID support in VFIO. "Process Address Space ID" is a PCI feature that
allows the device to tag all PCI DMA operations with an ID. The IOMMU
will then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the hypervisor to
manage each PASID entry. The support is generic so any VFIO user could
attach any translation to a PASID, and the support should work on ARM
SMMUv3 as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate what
owns the various opaque owner fields
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+q6NgAKCRCFwuHvBreF
YZ3zAQDbl4/Z0O+CLN2AXq4Zeiyq1HTSoF94hzqmm7lQ17zTIwD8CCdyLXHvupaq
tkBIv5IovpaxlrSk6M0kh2K8vPCk9Qk=
=CIM3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
iommufd_veventq_fops_read() decrements veventq->num_events when a vevent
is read out. However, the report path ony increments veventq->num_events
for normal events. To be balanced, make the read path decrement num_events
only for normal vevents.
Fixes: e36ba5ab80 ("iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC")
Link: https://patch.msgid.link/r/20250324120034.5940-3-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The vevent->header.flags is not initialized per allocation, hence the
vevent read path may treat the vevent as lost_events_header wrongly.
Use kzalloc() to alloc memory for new vevent.
Fixes: e8e1ef9b77 ("iommufd/viommu: Add iommufd_viommu_report_event helper")
Link: https://patch.msgid.link/r/20250324120034.5940-2-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
PASID usage requires PASID support in both device and IOMMU. Since the
iommu drivers always enable the PASID capability for the device if it
is supported, this extends the IOMMU_GET_HW_INFO to report the PASID
capability to userspace. Also, enhances the selftest accordingly.
Link: https://patch.msgid.link/r/20250321180143.8468-5-yi.l.liu@intel.com
Cc: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> #aarch64 platform
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Including:
- Core: IOMMUFD dependencies from Jason:
- Change the iommufd fault handle into an always present hwpt handle in
the domain
- Give iommufd its own SW_MSI implementation along with some IRQ layer
rework
- Improvements to the handle attach API
- Core: Fixes for probe-issues from Robin
- Intel VT-d changes:
- Checking for SVA support in domain allocation and attach paths
- Move PCI ATS and PRI configuration into probe paths
- Fix a pentential hang on reboot -f
- Miscellaneous cleanups
- AMD-Vi changes:
- Support for up to 2k IRQs per PCI device function
- Set of smaller fixes
- ARM-SMMU changes:
- SMMUv2 devicetree binding updates for Qualcomm implementations
(QCS8300 GPU and MSM8937)
- Clean up SMMUv2 runtime PM implementation to help with wider rework of
pm_runtime_put_autosuspend()
- Rockchip driver changes:
- Driver adjustments for recent DT probing changes
- S390 IOMMU changes:
- Support for IOMMU passthrough
- Apple Dart changes:
- Driver adjustments to meet ISP device requirements
- Null-ptr deref fix
- Disable subpage protection for DART 1
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmfkFSkACgkQK/BELZcB
GuOVUg/9En4QlF0eeg4E1stszGcOOXn9IkKiGP9iqKpL/WqVVitoagn7nwq0rBFF
PdqqcGbSPNHctiUursB18DyENOAmRUE8mGG29po9rAfq5wxZ2yA6eLmpFAf9QDCW
vY3nz2oen1lUrzbPwVI1IR67L6BrtqEUz/1syf295RMr2yXkdmBXb12lbL1rdjv0
7YfWwlDtMOVsewlZmCuDAAfWHtd7cO/ulDcYwq+GOiSXojVmKw8624EFlFbM9E9f
tkceE4mjGQnN/CpZzUc48f1DgIW82PVVTek95Cznbk9ACDJQ4VYYwDMx2WszfMzu
D1CXLiQu0vrxQgFdHn02bw3T4yvXQ4HeawIaTF+1J1gEHbrj6MoVQ88zaO3GnKFk
yxYI5sfOs5iL6zSMkig4OytXuRmRqDVJhgrF1i4XIM3GXrRKf9EiRZ/1eXb+1gab
3Q5k7+u4+7vWbaKLsxOdIBNzz02hb7LVndPIFlWmSnr1YY0LFBfwmgbfTAsSINZo
22Ni8aE2C42lYOZxJ67m3khLpqTZaTERtQIab/gUd5FcuEjxKuHBIzj6SZeC/0Q9
Y6rzklHxJxxtQLrUJp16IzjpiMJbWH0H2jKstRberOpxk3L5aLrJNd8M1B+CokmP
LZeKgqgxV5F+VlR2Bap5AXL9ZEuqjCVDIEy79j8XzAzxobDVnVM=
=a8Pd
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core iommufd dependencies from Jason:
- Change the iommufd fault handle into an always present hwpt handle
in the domain
- Give iommufd its own SW_MSI implementation along with some IRQ
layer rework
- Improvements to the handle attach API
Core fixes for probe-issues from Robin
Intel VT-d changes:
- Checking for SVA support in domain allocation and attach paths
- Move PCI ATS and PRI configuration into probe paths
- Fix a pentential hang on reboot -f
- Miscellaneous cleanups
AMD-Vi changes:
- Support for up to 2k IRQs per PCI device function
- Set of smaller fixes
ARM-SMMU changes:
- SMMUv2 devicetree binding updates for Qualcomm implementations
(QCS8300 GPU and MSM8937)
- Clean up SMMUv2 runtime PM implementation to help with wider rework
of pm_runtime_put_autosuspend()
Rockchip driver changes:
- Driver adjustments for recent DT probing changes
S390 IOMMU changes:
- Support for IOMMU passthrough
Apple Dart changes:
- Driver adjustments to meet ISP device requirements
- Null-ptr deref fix
- Disable subpage protection for DART 1"
* tag 'iommu-updates-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
iommu/vt-d: Fix possible circular locking dependency
iommu/vt-d: Don't clobber posted vCPU IRTE when host IRQ affinity changes
iommu/vt-d: Put IRTE back into posted MSI mode if vCPU posting is disabled
iommu: apple-dart: fix potential null pointer deref
iommu/rockchip: Retire global dma_dev workaround
iommu/rockchip: Register in a sensible order
iommu/rockchip: Allocate per-device data sensibly
iommu/mediatek-v1: Support COMPILE_TEST
iommu/amd: Enable support for up to 2K interrupts per function
iommu/amd: Rename DTE_INTTABLEN* and MAX_IRQS_PER_TABLE macro
iommu/amd: Replace slab cache allocator with page allocator
iommu/amd: Introduce generic function to set multibit feature value
iommu: Don't warn prematurely about dodgy probes
iommu/arm-smmu: Set rpm auto_suspend once during probe
dt-bindings: arm-smmu: Document QCS8300 GPU SMMU
iommu: Get DT/ACPI parsing into the proper probe path
iommu: Keep dev->iommu state consistent
iommu: Resolve ops in iommu_init_device()
iommu: Handle race with default domain setup
iommu: Unexport iommu_fwspec_free()
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmfhlLATHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXgchCADOz33rSm4G4w4r0qT05dTDi/lZkEdK
64dQq322XXP/C9FfR66d30243gsAmuM5a0SvzFHLXAOu6yqM270Xehd/Rud+Um2s
lSVnc0Ux0AWBgksqFd0t577aN7zmJEukosEYO5lBNop+zOcadrm3S6Th/AoL2h/D
yphPkhH13bsCK+Wll/eBOQLIhC9iA0konYbBLuEQ5MqvUbrzc6Rmb5gxsHHZKOqg
vLjkrYR/d3s2gIpKxiFp0RwvzGyffZEHxvU/YF3hTenPMlTlnXWbyspBSTVmWggP
13IFLzqxDdW9RgUnGB4xRc424AC1LKqEr42QPQE7zGvl2jdJriA2Q1LT
=BXqj
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Add support for running as the root partition in Hyper-V (Microsoft
Hypervisor) by exposing /dev/mshv (Nuno and various people)
- Add support for CPU offlining in Hyper-V (Hamza Mahfooz)
- Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael
Kelley, Thorsten Blum)
* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: fix an indentation issue in mshyperv.h
x86/hyperv: Add comments about hv_vpset and var size hypercall input args
Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
hyperv: Add definitions for root partition driver to hv headers
x86: hyperv: Add mshv_handler() irq handler and setup function
Drivers: hv: Introduce per-cpu event ring tail
Drivers: hv: Export some functions for use by root partition module
acpi: numa: Export node_to_pxm()
hyperv: Introduce hv_recommend_using_aeoi()
arm64/hyperv: Add some missing functions to arm64
x86/mshyperv: Add support for extended Hyper-V features
hyperv: Log hypercall status codes as strings
x86/hyperv: Fix check of return value from snp_set_vmsa()
x86/hyperv: Add VTL mode callback for restarting the system
x86/hyperv: Add VTL mode emergency restart callback
hyperv: Remove unused union and structs
hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
hyperv: Change hv_root_partition into a function
hyperv: Convert hypercall statuses to linux error codes
drivers/hv: add CPU offlining support
...
This adds 4 test ops for pasid attach/replace/detach testing. There are
ops to attach/detach pasid, and also op to check the attached hwpt of a
pasid.
Link: https://patch.msgid.link/r/20250321171940.7213-18-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is need to get the selftest device (sobj->type == TYPE_IDEV) in
multiple places, so have a helper to for it.
Link: https://patch.msgid.link/r/20250321171940.7213-17-yi.l.liu@intel.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The callback is needed to make pasid_attach/detach path complete for mock
device. A nop is enough for set_dev_pasid.
A MOCK_FLAGS_DEVICE_PASID is added to indicate a pasid-capable mock device
for the pasid test cases. Other test cases will still create a non-pasid
mock device. While the mock iommu always pretends to be pasid-capable.
Link: https://patch.msgid.link/r/20250321171940.7213-16-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The underlying infrastructure has supported the PASID attach and related
enforcement per the requirement of the IOMMU_HWPT_ALLOC_PASID flag. This
extends iommufd to support PASID compatible domain requested by userspace.
Link: https://patch.msgid.link/r/20250321171940.7213-15-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Intel iommu driver just treats it as a nop since Intel VT-d does not have
special requirement on domains attached to either the PASID or RID of a
PASID-capable device.
Link: https://patch.msgid.link/r/20250321171940.7213-14-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Per the definition of IOMMU_HWPT_ALLOC_PASID, iommufd needs to enforce
the RID to use PASID-compatible domain if PASID has been attached, and
vice versa. The PASID path has already enforced it. This adds the
enforcement in the RID path.
This enforcement requires a lock across the RID and PASID attach path,
the idev->igroup->lock is used as both the RID and the PASID path holds
it.
Link: https://patch.msgid.link/r/20250321171940.7213-13-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This extends the below APIs to support PASID. Device drivers to manage pasid
attach/replace/detach.
int iommufd_device_attach(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
int iommufd_device_replace(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
void iommufd_device_detach(struct iommufd_device *idev,
ioasid_t pasid);
The pasid operations share underlying attach/replace/detach infrastructure
with the device operations, but still have some different implications:
- no reserved region per pasid otherwise SVA architecture is already
broken (CPU address space doesn't count device reserved regions);
- accordingly no sw_msi trick;
Cache coherency enforcement is still applied to pasid operations since
it is about memory accesses post page table walking (no matter the walk
is per RID or per PASID).
Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
AMD IOMMU requires attaching PASID-compatible domains to PASID-capable
devices. This includes the domains attached to RID and PASIDs. Related
discussions in link [1] and [2]. ARM also has such a requirement, Intel
does not need it, but can live up with it. Hence, iommufd is going to
enforce this requirement as it is not harmful to vendors that do not
need it.
Mark the PASID-compatible domains and enforce it in the PASID path.
[1] https://lore.kernel.org/linux-iommu/20240709182303.GK14050@ziepe.ca/
[2] https://lore.kernel.org/linux-iommu/20240822124433.GD3468552@ziepe.ca/
Link: https://patch.msgid.link/r/20250321171940.7213-11-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
PASIDs of PASID-capable device can be attached to hwpt separately, hence
a pasid array to track per-PASID attachment is necessary. The index
IOMMU_NO_PASID is used by the RID path. Hence drop the igroup->attach.
Link: https://patch.msgid.link/r/20250321171940.7213-10-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
igroup->attach->device_list is used to track attached device of a group
in the RID path. Such tracking is also needed in the PASID path in order
to share path with the RID path.
While there is only one list_head in the iommufd_device. It cannot work
if the device has been attached in both RID path and PASID path. To solve
it, replacing the device_list with an xarray. The attached iommufd_device
is stored in the entry indexed by the idev->obj.id.
Link: https://patch.msgid.link/r/20250321171940.7213-9-yi.l.liu@intel.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The igroup->hwpt and igroup->device_list are used to track the hwpt attach
of a group in the RID path. While the coming PASID path also needs such
tracking. To be prepared, wrap igroup->hwpt and igroup->device_list into
attach struct which is allocated per attaching the first device of the
group and freed per detaching the last device of the group.
Link: https://patch.msgid.link/r/20250321171940.7213-8-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The existing code detects the first attach by checking the
igroup->device_list. However, the igroup->hwpt can also be used to detect
the first attach. In future modifications, it is better to check the
igroup->hwpt instead of the device_list. To improve readbility and also
prepare for further modifications on this part, this adds a helper for it.
Link: https://patch.msgid.link/r/20250321171940.7213-7-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
With more use of the fields of igroup, use a local vairable instead of
using the idev->igroup heavily.
No functional change expected.
Link: https://patch.msgid.link/r/20250321171940.7213-6-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
As the pasid is passed through the attach/replace/detach helpers, it is
necessary to ensure only the non-pasid path adds reserved_iova.
Link: https://patch.msgid.link/r/20250321171940.7213-5-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Most of the core logic before conducting the actual device attach/
replace operation can be shared with pasid attach/replace. So pass
@pasid through the device attach/replace helpers to prepare adding
pasid attach/replace.
So far the @pasid should only be IOMMU_NO_PASID. No functional change.
Link: https://patch.msgid.link/r/20250321171940.7213-4-yi.l.liu@intel.com
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Provide a high-level API to allow replacements of one domain with another
for specific pasid of a device. This is similar to
iommu_replace_group_handle() and it is expected to be used only by IOMMUFD.
Link: https://patch.msgid.link/r/20250321171940.7213-3-yi.l.liu@intel.com
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add kdoc to highligt the caller of iommu_[attach|replace]_group_handle()
and iommu_attach_device_pasid() should always provide a new handle. This
can avoid race with lockless reference to the handle. e.g. the
find_fault_handler() and iommu_report_device_fault() in the PRI path.
Link: https://patch.msgid.link/r/20250321171940.7213-2-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are only two sw_msi implementations in the entire system, thus it's
not very necessary to have an sw_msi pointer.
Instead, check domain->cookie_type to call the two sw_msi implementations
directly from the core code.
Link: https://patch.msgid.link/r/7ded87c871afcbaac665b71354de0a335087bf0f.1742871535.git.nicolinc@nvidia.com
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To provide the iommufd_sw_msi() to the iommu core that is under a different
Kconfig, move it and its related functions to driver.c. Then, stub it into
the iommu-priv header. The iommufd_sw_msi_install() continues to be used by
iommufd internal, so put it in the private header.
Note that iommufd_sw_msi() will be called in the iommu core, replacing the
sw_msi function pointer. Given that IOMMU_API is "bool" in Kconfig, change
IOMMUFD_DRIVER_CORE to "bool" as well.
Since this affects the module size, here is before-n-after size comparison:
[Before]
text data bss dec hex filename
18797 848 56 19701 4cf5 drivers/iommu/iommufd/device.o
722 44 0 766 2fe drivers/iommu/iommufd/driver.o
[After]
text data bss dec hex filename
17735 808 56 18599 48a7 drivers/iommu/iommufd/device.o
3020 180 0 3200 c80 drivers/iommu/iommufd/driver.o
Link: https://patch.msgid.link/r/374c159592dba7852bee20968f3f66fa0ee8ca93.1742871535.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When DMA/MSI cookies were made first-class citizens back in commit
46983fcd67 ("iommu: Pull IOVA cookie management into the core"), there
was no real need to further expose the two different cookie types.
However, now that IOMMUFD wants to add a third type of MSI-mapping
cookie, we do have a nicely compelling reason to properly dismabiguate
things at the domain level beyond just vaguely guessing from the domain
type.
Meanwhile, we also effectively have another "cookie" in the form of the
anonymous union for other user data, which isn't much better in terms of
being vague and unenforced. The fact is that all these cookie types are
mutually exclusive, in the sense that combining them makes zero sense
and/or would be catastrophic (iommu_set_fault_handler() on an SVA
domain, anyone?) - the only combination which *might* be reasonable is
perhaps a fault handler and an MSI cookie, but nobody's doing that at
the moment, so let's rule it out as well for the sake of being clear and
robust. To that end, we pull DMA and MSI cookies apart a little more,
mostly to clear up the ambiguity at domain teardown, then for clarity
(and to save a little space), move them into the union, whose ownership
we can then properly describe and enforce entirely unambiguously.
[nicolinc: rebase on latest tree; use prefix IOMMU_COOKIE_; merge unions
in iommu_domain; add IOMMU_COOKIE_IOMMUFD for iommufd_hwpt]
Link: https://patch.msgid.link/r/1ace9076c95204bbe193ee77499d395f15f44b23.1742871535.git.nicolinc@nvidia.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce hv_status_printk() macros as a convenience to log hypercall
errors, formatting them with the status code (HV_STATUS_*) as a raw hex
value and also as a string, which saves some time while debugging.
Create a table of HV_STATUS_ codes with strings and mapped errnos, and
use it for hv_result_to_string() and hv_result_to_errno().
Use the new hv_status_printk()s in hv_proc.c, hyperv-iommu.c, and
irqdomain.c hypercalls to aid debugging in the root partition.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com>
We have recently seen report of lockdep circular lock dependency warnings
on platforms like Skylake and Kabylake:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
------------------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
at: iommu_probe_device+0x1d/0x70
but task is already holding lock:
ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
at: intel_iommu_init+0xe75/0x11f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&device->physical_node_lock){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
intel_iommu_init+0xe75/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #5 (dmar_global_lock){++++}-{3:3}:
down_read+0x43/0x1d0
enable_drhd_fault_handling+0x21/0x110
cpuhp_invoke_callback+0x4c6/0x870
cpuhp_issue_call+0xbf/0x1f0
__cpuhp_setup_state_cpuslocked+0x111/0x320
__cpuhp_setup_state+0xb0/0x220
irq_remap_enable_fault_handling+0x3f/0xa0
apic_intr_mode_init+0x5c/0x110
x86_late_time_init+0x24/0x40
start_kernel+0x895/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__cpuhp_setup_state_cpuslocked+0x67/0x320
__cpuhp_setup_state+0xb0/0x220
page_alloc_init_cpuhp+0x2d/0x60
mm_core_init+0x18/0x2c0
start_kernel+0x576/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
__cpuhp_state_add_instance+0x4f/0x220
iova_domain_init_rcaches+0x214/0x280
iommu_setup_dma_ops+0x1a4/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_setup_dma_ops+0x16b/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&group->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__iommu_probe_device+0x24c/0x4e0
probe_iommu_group+0x2b/0x50
bus_for_each_dev+0x7d/0xe0
iommu_device_register+0xe1/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
__lock_acquire+0x1637/0x2810
lock_acquire+0xc9/0x300
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_probe_device+0x1d/0x70
intel_iommu_init+0xe90/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Chain exists of:
iommu_probe_device_lock --> dmar_global_lock -->
&device->physical_node_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&device->physical_node_lock);
lock(dmar_global_lock);
lock(&device->physical_node_lock);
lock(iommu_probe_device_lock);
*** DEADLOCK ***
This driver uses a global lock to protect the list of enumerated DMA
remapping units. It is necessary due to the driver's support for dynamic
addition and removal of remapping units at runtime.
Two distinct code paths require iteration over this remapping unit list:
- Device registration and probing: the driver iterates the list to
register each remapping unit with the upper layer IOMMU framework
and subsequently probe the devices managed by that unit.
- Global configuration: Upper layer components may also iterate the list
to apply configuration changes.
The lock acquisition order between these two code paths was reversed. This
caused lockdep warnings, indicating a risk of deadlock. Fix this warning
by releasing the global lock before invoking upper layer interfaces for
device registration.
Fixes: b150654f74 ("iommu/vt-d: Fix suspicious RCU usage")
Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@SJ1PR11MB6129.namprd11.prod.outlook.com/
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250317035714.1041549-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Don't overwrite an IRTE that is posting IRQs to a vCPU with a posted MSI
entry if the host IRQ affinity happens to change. If/when the IRTE is
reverted back to "host mode", it will be reconfigured as a posted MSI or
remapped entry as appropriate.
Drop the "mode" field, which doesn't differentiate between posted MSIs and
posted vCPUs, in favor of a dedicated posted_vcpu flag. Note! The two
posted_{msi,vcpu} flags are intentionally not mutually exclusive; an IRTE
can transition between posted MSI and posted vCPU.
Fixes: ed1e48ea43 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-3-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add a helper to take care of reconfiguring an IRTE to deliver IRQs to the
host, i.e. not to a vCPU, and use the helper when an IRTE's vCPU affinity
is nullified, i.e. when KVM puts an IRTE back into "host" mode. Because
posted MSIs use an ephemeral IRTE, using modify_irte() puts the IRTE into
full remapped mode, i.e. unintentionally disables posted MSIs on the IRQ.
Fixes: ed1e48ea43 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-2-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The global dma_dev trick was mostly because the old domain_alloc op
provided no context, so no way to know which IOMMU was to own the
pagetable, or if a suitable one even existed at all. In the new
multi-instance world with domain_alloc_paging this is no longer a
concern - now we know that the given device must be associated with a
valid IOMMU instance which provided the op to call in the first place,
and therefore that instance can and should be the pagetable owner. To
avoid worrying about the lifetime and stability of the rk_domain->iommus
list, and keep the lookups simple and efficient, we'll still stash a
dma_dev pointer, but now it's accurately per-domain.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Quentin Schulz <quentin.schulz@cherry.de>
Tested-by: Dang Huynh <danct12@riseup.net>
Reviewed-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Link: https://lore.kernel.org/r/25dc948a7d35c8142c5719ac22bc523f8524d006.1741886382.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently Rockchip calls iommu_device_register() before it's finished
setting up the hardware and driver state, and as such it now gets
unhappy in various ways when registration starts working the way it was
always intended to, and probing client devices straight away. Reorder
the operations to ensure that what we're registering is a prepared and
functional IOMMU instance.
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Quentin Schulz <quentin.schulz@cherry.de>
Tested-by: Dang Huynh <danct12@riseup.net>
Reviewed-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Link: https://lore.kernel.org/r/e69532f00bf49d98322b96788edb7e2e305e4006.1741886382.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Now that DT-based probing is finally happening in the right order again,
it reveals an issue in Rockchip's of_xlate, which can now be called
during registration, but is using the global dma_dev which is only
assigned later. However, this makes little sense when we're already
looking up the correct IOMMU device, who should logically be the owner
of the devm allocation anyway.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Quentin Schulz <quentin.schulz@cherry.de>
Tested-by: Dang Huynh <danct12@riseup.net>
Reviewed-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Link: https://lore.kernel.org/r/771e91cf16b3048e93f657153b76905665878fa2.1741886382.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is a DoS concern on the shared hardware event queue among devices
passed through to VMs, that too many translation failures that belong to
VMs could overflow the shared hardware event queue if those VMs or their
VMMs don't handle/recover the devices properly.
The MEV bit in the STE allows to configure the SMMU HW to merge similar
event records, though there is no guarantee. Set it in a nested STE for
DoS mitigations.
In the future, we might want to enable the MEV for non-nested cases too
such as domain->type == IOMMU_DOMAIN_UNMANAGED or even IOMMU_DOMAIN_DMA.
Link: https://patch.msgid.link/r/8ed12feef67fc65273d0f5925f401a81f56acebe.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Aside from the IOPF framework, iommufd provides an additional pathway to
report hardware events, via the vEVENTQ of vIOMMU infrastructure.
Define an iommu_vevent_arm_smmuv3 uAPI structure, and report stage-1 events
in the threaded IRQ handler. Also, add another four event record types that
can be forwarded to a VM.
Link: https://patch.msgid.link/r/5cf6719682fdfdabffdb08374cdf31ad2466d75a.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use it to store all vSMMU-related data. The vsid (Virtual Stream ID) will
be the first use case. Since the vsid reader will be the eventq handler
that already holds a streams_mutex, reuse that to fence the vmaster too.
Also add a pair of arm_smmu_attach_prepare/commit_vmaster helpers to set
or unset the master->vmaster pointer. Put the helpers inside the existing
arm_smmu_attach_prepare/commit().
For identity/blocked ops that don't call arm_smmu_attach_prepare/commit(),
add a simpler arm_smmu_master_clear_vmaster helper to unset the vmaster.
Link: https://patch.msgid.link/r/a7f282e1a531279e25f06c651e95d56f6b120886.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When attaching a device to a vIOMMU-based nested domain, vdev_id must be
present. Add a piece of code hard-requesting it, preparing for a vEVENTQ
support in the following patch. Then, update the TEST_F.
A HWPT-based nested domain will return a NULL new_viommu, thus no such a
vDEVICE requirement.
Link: https://patch.msgid.link/r/4051ca8a819e51cb30de6b4fe9e4d94d956afe3d.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Similar to iommu_report_device_fault, this allows IOMMU drivers to report
vIOMMU events from threaded IRQ handlers to user space hypervisors.
Link: https://patch.msgid.link/r/44be825042c8255e75d0151b338ffd8ba0e4920b.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is a reverse search v.s. iommufd_viommu_find_dev, as drivers may want
to convert a struct device pointer (physical) to its virtual device ID for
an event injection to the user space VM.
Again, this avoids exposing more core structures to the drivers, than the
iommufd_viommu alone.
Link: https://patch.msgid.link/r/18b8e8bc1b8104d43b205d21602c036fd0804e56.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce a new IOMMUFD_OBJ_VEVENTQ object for vIOMMU Event Queue that
provides user space (VMM) another FD to read the vIOMMU Events.
Allow a vIOMMU object to allocate vEVENTQs, with a condition that each
vIOMMU can only have one single vEVENTQ per type.
Add iommufd_veventq_alloc() with iommufd_veventq_ops for the new ioctl.
Link: https://patch.msgid.link/r/21acf0751dd5c93846935ee06f93b9c65eff5e04.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The fault object was designed exclusively for hwpt's IO page faults (PRI).
But its queue implementation can be reused for other purposes too, such as
hardware IRQ and event injections to user space.
Meanwhile, a fault object holds a list of faults. So it's more accurate to
call it a "fault queue". Combining the reusing idea above, abstract a new
iommufd_eventq as a common structure embedded into struct iommufd_fault,
similar to hwpt_paging holding a common hwpt.
Add a common iommufd_eventq_ops and iommufd_eventq_init to prepare for an
IOMMUFD_OBJ_VEVENTQ (vIOMMU Event Queue).
Link: https://patch.msgid.link/r/e7336a857954209aabb466e0694aab323da95d90.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The infrastructure of a fault object will be shared with a new vEVENTQ
object in a following change. Add an iommufd_fault_init helper and an
INIT_EVENTQ_FOPS marco for a vEVENTQ allocator to use too.
Reorder the iommufd_ctx_get and refcount_inc, to keep them symmetrical
with the iommufd_fault_fops_release().
Since the new vEVENTQ doesn't need "response" and its "mutex", so keep
the xa_init_flags and mutex_init in their original locations.
Link: https://patch.msgid.link/r/a9522c521909baeb1bd843950b2490478f3d06e0.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is no need to keep them in the header. The vEVENTQ version of these
two functions will turn out to be a different implementation and will not
share with this fault version. Thus, move them out of the header.
Link: https://patch.msgid.link/r/7eebe32f3d354799f5e28128c693c3c284740b21.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
AMD IOMMU optionally supports up to 2K interrupts per function on newer
platforms. Support for this feature is indicated through Extended
Feature 2 Register (MMIO Offset 01A0h[NumIntRemapSup]). Allocate 2K IRTEs
per device when this support is available.
Co-developed-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-5-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
AMD iommu can support both 512 and 2K interrupts on newer platform. Hence
add suffix "512" to the existing macros.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-4-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit 05152a0494 ("iommu/amd: Add slab-cache for irq remapping tables")
introduces slab cache allocator. But slab cache allocator provides benefit
only when the allocation and deallocation of many identical objects is
frequent. The AMD IOMMU driver allocates Interrupt remapping table (IRT)
when device driver requests IRQ for the first time and never frees it.
Hence the slab allocator does not provide any benefit here.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-3-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Define generic function `iommu_feature_set()` to set the values
in the feature control register and replace `iommu_set_inv_tlb_timeout()`
with it.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-2-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The warning for suspect probe conditions inadvertently got moved too
early in a prior respin - it happened to work out OK for fwspecs, but in
general still needs to be after the ops->probe_device call so drivers
which filter devices for themselves have a chance to do that.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Link: https://lore.kernel.org/r/72a4853e7ef36e7c1c4ca171ed4ed8e1a463a61a.1741791691.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current code calls arm_smmu_rpm_use_autosuspend() during device
attach, which seems unusual as it sets the autosuspend delay and the
'use_autosuspend' flag for the smmu device. These parameters can be
simply set once during the smmu probe and in order to avoid bouncing
rpm states, we can simply mark_last_busy() during a client dev attach
as discussed in [1].
Move the handling of arm_smmu_rpm_use_autosuspend() to the SMMU probe
and modify the arm_smmu_rpm_put() function to mark_last_busy() before
calling __pm_runtime_put_autosuspend(). Additionally,
s/pm_runtime_put_autosuspend/__pm_runtime_put_autosuspend/ to help with
the refactor of the pm_runtime_put_autosuspend() API [2].
Link: https://lore.kernel.org/r/20241023164835.GF29251@willie-the-truck [1]
Link: https://git.kernel.org/linus/b7d46644e554 [2]
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250123195636.4182099-1-praan@google.com
Signed-off-by: Will Deacon <will@kernel.org>
In hindsight, there were some crucial subtleties overlooked when moving
{of,acpi}_dma_configure() to driver probe time to allow waiting for
IOMMU drivers with -EPROBE_DEFER, and these have become an
ever-increasing source of problems. The IOMMU API has some fundamental
assumptions that iommu_probe_device() is called for every device added
to the system, in the order in which they are added. Calling it in a
random order or not at all dependent on driver binding leads to
malformed groups, a potential lack of isolation for devices with no
driver, and all manner of unexpected concurrency and race conditions.
We've attempted to mitigate the latter with point-fix bodges like
iommu_probe_device_lock, but it's a losing battle and the time has come
to bite the bullet and address the true source of the problem instead.
The crux of the matter is that the firmware parsing actually serves two
distinct purposes; one is identifying the IOMMU instance associated with
a device so we can check its availability, the second is actually
telling that instance about the relevant firmware-provided data for the
device. However the latter also depends on the former, and at the time
there was no good place to defer and retry that separately from the
availability check we also wanted for client driver probe.
Nowadays, though, we have a proper notion of multiple IOMMU instances in
the core API itself, and each one gets a chance to probe its own devices
upon registration, so we can finally make that work as intended for
DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
be able to run the iommu_fwspec machinery currently buried deep in the
wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
surprisingly straightforward to bootstrap this transformation by pretty
much just calling the same path twice. At client driver probe time,
dev->driver is obviously set; conversely at device_add(), or a
subsequent bus_iommu_probe(), any device waiting for an IOMMU really
should *not* have a driver already, so we can use that as a condition to
disambiguate the two cases, and avoid recursing back into the IOMMU core
at the wrong times.
Obviously this isn't the nicest thing, but for now it gives us a
functional baseline to then unpick the layers in between without many
more awkward cross-subsystem patches. There are some minor side-effects
like dma_range_map potentially being created earlier, and some debug
prints being repeated, but these aren't significantly detrimental. Let's
make things work first, then deal with making them nice.
With the basic flow finally in the right order again, the next step is
probably turning the bus->dma_configure paths inside-out, since all we
really need from bus code is its notion of which device and input ID(s)
to parse the common firmware properties with...
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
At the moment, if of_iommu_configure() allocates dev->iommu itself via
iommu_fwspec_init(), then suffers a DT parsing failure, it cleans up the
fwspec but leaves the empty dev_iommu hanging around. So far this is
benign (if a tiny bit wasteful), but we'd like to be able to reason
about dev->iommu having a consistent and unambiguous lifecycle. Thus
make sure that the of_iommu cleanup undoes precisely whatever it did.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/d219663a3f23001f23d520a883ac622d70b4e642.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Since iommu_init_device() was factored out, it is in fact the only
consumer of the ops which __iommu_probe_device() is resolving, so let it
do that itself rather than passing them in. This also puts the ops
lookup at a more logical point relative to the rest of the flow through
__iommu_probe_device().
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/fa4b6cfc67a352488b7f4e0b736008307ce9ac2e.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
It turns out that deferred default domain creation leaves a subtle
race window during iommu_device_register() wherein a client driver may
asynchronously probe in parallel and get as far as performing DMA API
operations with dma-direct, only to be switched to iommu-dma underfoot
once the default domain attachment finally happens, with obviously
disastrous consequences. Even the wonky of_iommu_configure() path is at
risk, since iommu_fwspec_init() will no longer defer client probe as the
instance ops are (necessarily) already registered, and the "replay"
iommu_probe_device() call can see dev->iommu_group already set and so
think there's nothing to do either.
Fortunately we already have the right tool in the right place in the
form of iommu_device_use_default_domain(), which just needs to ensure
that said default domain is actually ready to *be* used. Deferring the
client probe shouldn't have too much impact, given that this only
happens while the IOMMU driver is probing, and thus due to kick the
deferred probe list again once it finishes.
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Fixes: 98ac73f99b ("iommu: Require a default_domain for all iommu drivers")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/e88b94c9b575034a2c98a48b3d383654cbda7902.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The drivers doing their own fwspec parsing have no need to call
iommu_fwspec_free() since fwspecs were moved into dev_iommu, as
returning an error from .probe_device will tear down the whole lot
anyway. Move it into the private interface now that it only serves
for of_iommu to clean up in an error case.
I have no idea what mtk_v1 was doing in effectively guaranteeing
a NULL fwspec would be dereferenced if no "iommus" DT property was
found, so add a check for that to at least make the code look sane.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/36e245489361de2d13db22a510fa5c79e7126278.1740667667.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The intel_context_flush_present() is called in places where either the
scalable mode is disabled, or scalable mode is enabled but all PASID
entries are known to be non-present. In these cases, the flush_domains
path within intel_context_flush_present() will never execute. This dead
code is therefore removed.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Update PRI enablement to use the new method, similar to the amd iommu
driver. Enable PRI in the device probe path and disable it when the device
is released. PRI is enabled throughout the device's iommu lifecycle. The
infrastructure for the iommu subsystem to handle iopf requests is created
during iopf enablement and released during iopf disablement. All invalid
page requests from the device are automatically handled by the iommu
subsystem if iopf is not enabled. Add iopf_refcount to track the iopf
enablement.
Convert the return type of intel_iommu_disable_iopf() to void, as there
is no way to handle a failure when disabling this feature. Make
intel_iommu_enable/disable_iopf() helpers global, as they will be used
beyond the current file in the subsequent patch.
The iopf_refcount is not protected by any lock. This is acceptable, as
there is no concurrent access to it in the current code. The following
patch will address this by moving it to the domain attach/detach paths,
which are protected by the iommu group mutex.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Device ATS is currently enabled when a domain is attached to the device
and disabled when the domain is detached. This creates a limitation:
when the IOMMU is operating in scalable mode and IOPF is enabled, the
device's domain cannot be changed.
The previous code enables ATS when a domain is set to a device's RID and
disables it during RID domain switch. So, if a PASID is set with a
domain requiring PRI, ATS should remain enabled until the domain is
removed. During the PASID domain's lifecycle, if the RID's domain
changes, PRI will be disrupted because it depends on ATS, which is
disabled when the blocking domain is set for the device's RID.
Remove this limitation by moving ATS enablement to the device probe path.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Attach of a SVA domain should fail if SVA is not supported, move the check
for SVA support out of IOMMU_DEV_FEAT_SVA and into attach.
Also check when allocating a SVA domain to match other drivers.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If all the inlines are unwound virt_to_dma_pfn() is simply:
return page_to_pfn(virt_to_page(p)) << (PAGE_SHIFT - VTD_PAGE_SHIFT);
Which can be re-arranged to:
(page_to_pfn(virt_to_page(p)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT
The only caller is:
((uint64_t)virt_to_dma_pfn(tmp_page) << VTD_PAGE_SHIFT)
re-arranged to:
((page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT)
<< VTD_PAGE_SHIFT
Which simplifies to:
page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT
That is the same as virt_to_phys(tmp_page), so just remove all of this.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-e797f4dc6918+93057-iommu_pages_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We found that executing the command ./a.out &;reboot -f (where a.out is a
program that only executes a while(1) infinite loop) can probabilistically
cause the system to hang in the intel_iommu_shutdown() function, rendering
it unresponsive. Through analysis, we identified that the factors
contributing to this issue are as follows:
1. The reboot -f command does not prompt the kernel to notify the
application layer to perform cleanup actions, allowing the application to
continue running.
2. When the kernel reaches the intel_iommu_shutdown() function, only the
BSP (Bootstrap Processor) CPU is operational in the system.
3. During the execution of intel_iommu_shutdown(), the function down_write
(&dmar_global_lock) causes the process to sleep and be scheduled out.
4. At this point, though the processor's interrupt flag is not cleared,
allowing interrupts to be accepted. However, only legacy devices and NMI
(Non-Maskable Interrupt) interrupts could come in, as other interrupts
routing have already been disabled. If no legacy or NMI interrupts occur
at this stage, the scheduler will not be able to run.
5. If the application got scheduled at this time is executing a while(1)-
type loop, it will be unable to be preempted, leading to an infinite loop
and causing the system to become unresponsive.
To resolve this issue, the intel_iommu_shutdown() function should not
execute down_write(), which can potentially cause the process to be
scheduled out. Furthermore, since only the BSP is running during the later
stages of the reboot, there is no need for protection against parallel
access to the DMAR (DMA Remapping) unit. Therefore, the following lines
could be removed:
down_write(&dmar_global_lock);
up_write(&dmar_global_lock);
After testing, the issue has been resolved.
Fixes: 6c3a44ed3c ("iommu/vt-d: Turn off translations at shutdown")
Co-developed-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20250303062421.17929-1-cuiyunhui@bytedance.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This is needed by ISP, which has DART0 with bypass and DART1/2 without.
Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20250307-isp-dart-v3-2-684fe4489591@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Useful for debugging ILLEGAL_DEV_TABLE_ENTRY events as some of the
DTE settings depends on Control register settings.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-2-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This reverts commit ac9a5d522b.
iommu_dma_init_domain() is now only called under the group mutex, as it
should be given that that the default domain belongs to the group, so
the additional internal locking is no longer needed.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/a943d4c198e6a1fffe998337d577dc3aa7f660a9.1740585469.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current implementation of iommufd_device_do_replace() implicitly
assumes that the input device has already been attached. However, there
is no explicit check to verify this assumption. If another device within
the same group has been attached, the replace operation might succeed,
but the input device itself may not have been attached yet.
As a result, the input device might not be tracked in the
igroup->device_list, and its reserved IOVA might not be added. Despite
this, the caller might incorrectly assume that the device has been
successfully replaced, which could lead to unexpected behavior or errors.
To address this issue, add a check to ensure that the input device has
been attached before proceeding with the replace operation. This check
will help maintain the integrity of the device tracking system and prevent
potential issues arising from incorrect assumptions about the device's
attachment status.
Fixes: e88d4ec154 ("iommufd: Add iommufd_device_replace()")
Link: https://patch.msgid.link/r/20250306034842.5950-1-yi.l.liu@intel.com
Cc: stable@vger.kernel.org
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Setting domain->iommufd_hwpt in iommufd_hwpt_alloc() only covers the HWPT
allocations from user space, but not for an auto domain. This resulted in
a NULL pointer access in the auto domain pathway:
Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000008
pc : iommufd_sw_msi+0x54/0x2b0
lr : iommufd_sw_msi+0x40/0x2b0
Call trace:
iommufd_sw_msi+0x54/0x2b0 (P)
iommu_dma_prepare_msi+0x64/0xa8
its_irq_domain_alloc+0xf0/0x2c0
irq_domain_alloc_irqs_parent+0x2c/0xa8
msi_domain_alloc+0xa0/0x1a8
Since iommufd_sw_msi() requires to access the domain->iommufd_hwpt, it is
better to set that explicitly prior to calling iommu_domain_set_sw_msi().
Fixes: 748706d7ca ("iommu: Turn fault_data to iommufd private pointer")
Link: https://patch.msgid.link/r/20250305211800.229465-1-nicolinc@nvidia.com
Reported-by: Ankit Agrawal <ankita@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Ankit Agrawal <ankita@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allocating a domain with a fault ID indicates that the domain is faultable.
However, there is a gap for the nested parent domain to support PRI. Some
hardware lacks the capability to distinguish whether PRI occurs at stage 1
or stage 2. This limitation may require software-based page table walking
to resolve. Since no in-tree IOMMU driver currently supports this
functionality, it is disallowed. For more details, refer to the related
discussion at [1].
[1] https://lore.kernel.org/linux-iommu/bd1655c6-8b2f-4cfa-adb1-badc00d01811@intel.com/
Link: https://patch.msgid.link/r/20250226104012.82079-1-yi.l.liu@intel.com
Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The three dependent series on a shared branch:
- Change the iommufd fault handle into an always present hwpt handle in
the domain
- Give iommufd its own SW_MSI implementation along with some IRQ layer
rework
- Improvements to the handle attach API
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ8HEaQAKCRCFwuHvBreF
YfEHAP9khiTmkfsx5JGBtKMqFbF/TShyZfgcLTCWw/VGXLcdswD/fnxA8L+y6zD4
9P73iOqSfHRV69dXu9CU6duZZifWawI=
=YZh7
-----END PGP SIGNATURE-----
Merge tag 'for-joerg' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd into core
iommu shared branch with iommufd
The three dependent series on a shared branch:
- Change the iommufd fault handle into an always present hwpt handle in
the domain
- Give iommufd its own SW_MSI implementation along with some IRQ layer
rework
- Improvements to the handle attach API
The current implementation stores entry to the group->pasid_array before
the underlying iommu driver has successfully set the new domain. This can
lead to issues where PRIs are received on the new domain before the attach
operation is completed.
This patch swaps the order of operations to ensure that the domain is set
in the underlying iommu driver before updating the group->pasid_array.
Link: https://patch.msgid.link/r/20250226011849.5102-5-yi.l.liu@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iommu_attach_device_pasid() only stores handle to group->pasid_array
when there is a valid handle input. However, it makes the
iommu_attach_device_pasid() unable to detect if the pasid has been
attached or not previously.
To be complete, let the iommu_attach_device_pasid() store the domain
to group->pasid_array if no valid handle. The other users of the
group->pasid_array should be updated to be consistent. e.g. the
iommu_attach_group_handle() and iommu_replace_group_handle().
Link: https://patch.msgid.link/r/20250226011849.5102-4-yi.l.liu@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iommufd does not use it now, so drop it.
Link: https://patch.msgid.link/r/20250226011849.5102-3-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Caller of the two APIs always provide a valid handle, make @handle as
mandatory parameter. Take this chance incoporate the handle->domain
set under the protection of group->mutex in iommu_attach_group_handle().
Link: https://patch.msgid.link/r/20250226011849.5102-2-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Commit <d74169ceb0d2> ("iommu/vt-d: Allocate DMAR fault interrupts
locally") moved the call to enable_drhd_fault_handling() to a code
path that does not hold any lock while traversing the drhd list. Fix
it by ensuring the dmar_global_lock lock is held when traversing the
drhd list.
Without this fix, the following warning is triggered:
=============================
WARNING: suspicious RCU usage
6.14.0-rc3 #55 Not tainted
-----------------------------
drivers/iommu/intel/dmar.c:2046 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
2 locks held by cpuhp/1/23:
#0: ffffffff84a67c50 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0
#1: ffffffff84a6a380 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0
stack backtrace:
CPU: 1 UID: 0 PID: 23 Comm: cpuhp/1 Not tainted 6.14.0-rc3 #55
Call Trace:
<TASK>
dump_stack_lvl+0xb7/0xd0
lockdep_rcu_suspicious+0x159/0x1f0
? __pfx_enable_drhd_fault_handling+0x10/0x10
enable_drhd_fault_handling+0x151/0x180
cpuhp_invoke_callback+0x1df/0x990
cpuhp_thread_fun+0x1ea/0x2c0
smpboot_thread_fn+0x1f5/0x2e0
? __pfx_smpboot_thread_fn+0x10/0x10
kthread+0x12a/0x2d0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x4a/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Holding the lock in enable_drhd_fault_handling() triggers a lockdep splat
about a possible deadlock between dmar_global_lock and cpu_hotplug_lock.
This is avoided by not holding dmar_global_lock when calling
iommu_device_register(), which initiates the device probe process.
Fixes: d74169ceb0 ("iommu/vt-d: Allocate DMAR fault interrupts locally")
Reported-and-tested-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://lore.kernel.org/linux-iommu/Zx9OwdLIc_VoQ0-a@shredder.mtl.com/
Tested-by: Breno Leitao <leitao@debian.org>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250218022422.2315082-1-baolu.lu@linux.intel.com
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When updating the page table root field on the DTE, avoid overwriting any
bits that are already set. The earlier call to make_clear_dte() writes
default values that all DTEs must have set (currently DTE[V]), and those
must be preserved.
Currently this doesn't cause problems since the page table root update is
the first field that is set after make_clear_dte() is called, and
DTE_FLAG_V is set again later along with the permission bits (IR/IW).
Remove this redundant assignment too.
Fixes: fd5dff9de4 ("iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250106191413.3107140-1-alejandro.j.jimenez@oracle.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
iommufd has a model where the iommu_domain can be changed while the VFIO
device is attached. In this case, the MSI should continue to work. This
corner case has not worked because the dma-iommu implementation of sw_msi
is tied to a single domain.
Implement the sw_msi mapping directly and use a global per-fd table to
associate assigned IOVA to the MSI pages. This allows the MSI pages to
be loaded into a domain before it is attached ensuring that MSI is not
disrupted.
Link: https://patch.msgid.link/r/e13d23eeacd67c0a692fc468c85b483f4dd51c57.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce hv_curr_partition_type to store the partition type
as an enum.
Right now this is limited to guest or root partition, but there will
be other kinds in future and the enum is easily extensible.
Set up hv_curr_partition_type early in Hyper-V initialization with
hv_identify_partition_type(). hv_root_partition() just queries this
value, and shouldn't be called before that.
Making this check into a function sets the stage for adding a config
option to gate the compilation of root partition code. In particular,
hv_root_partition() can be stubbed out always be false if root
partition support isn't desired.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com>
A "fault_data" was added exclusively for the iommufd_fault_iopf_handler()
used by IOPF/PRI use cases, along with the attach_handle. Now, the iommufd
version of the sw_msi function will reuse the attach_handle and fault_data
for a non-fault case.
Rename "fault_data" to "iommufd_hwpt" so as not to confine it to a "fault"
case. Move it into a union to be the iommufd private pointer. A following
patch will move the iova_cookie to the union for dma-iommu too after the
iommufd_sw_msi implementation is added.
Since we have two unions now, add some simple comments for readability.
Link: https://patch.msgid.link/r/ee5039503f28a16590916e9eef28b917e2d1607a.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, IRQ_MSI_IOMMU is selected if DMA_IOMMU is available to provide
an implementation for iommu_dma_prepare/compose_msi_msg(). However, it'll
make more sense for irqchips that call prepare/compose to select it, and
that will trigger all the additional code and data to be compiled into
the kernel.
If IRQ_MSI_IOMMU is selected with no IOMMU side implementation, then the
prepare/compose() will be NOP stubs.
If IRQ_MSI_IOMMU is not selected by an irqchip, then the related code on
the iommu side is compiled out.
Link: https://patch.msgid.link/r/a2620f67002c5cdf974e89ca3bf905f5c0817be6.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
SW_MSI supports IOMMU to translate an MSI message before the MSI message
is delivered to the interrupt controller. On such systems, an iommu_domain
must have a translation for the MSI message for interrupts to work.
The IRQ subsystem will call into IOMMU to request that a physical page be
set up to receive MSI messages, and the IOMMU then sets an IOVA that maps
to that physical page. Ultimately the IOVA is programmed into the device
via the msi_msg.
Generalize this by allowing iommu_domain owners to provide implementations
of this mapping. Add a function pointer in struct iommu_domain to allow a
domain owner to provide its own implementation.
Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
the iommu_get_msi_cookie() path.
Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
doesn't change or become freed while running. Races with IRQ operations
from VFIO and domain changes from iommufd are possible here.
Replace the msi_prepare_lock with a lockdep assertion for the group mutex
as documentation. For the dmau_iommu.c each iommu_domain is unique to a
group.
Link: https://patch.msgid.link/r/4ca696150d2baee03af27c4ddefdb7b0b0280e7b.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The two-step process to translate the MSI address involves two functions,
iommu_dma_prepare_msi() and iommu_dma_compose_msi_msg().
Previously iommu_dma_compose_msi_msg() needed to be in the iommu layer as
it had to dereference the opaque cookie pointer. Now, the previous patch
changed the cookie pointer into an integer so there is no longer any need
for the iommu layer to be involved.
Further, the call sites of iommu_dma_compose_msi_msg() all follow the same
pattern of setting an MSI message address_hi/lo to non-translated and then
immediately calling iommu_dma_compose_msi_msg().
Refactor iommu_dma_compose_msi_msg() into msi_msg_set_addr() that directly
accepts the u64 version of the address and simplifies all the callers.
Move the new helper to linux/msi.h since it has nothing to do with iommu.
Aside from refactoring, this logically prepares for the next patch, which
allows multiple implementation options for iommu_dma_prepare_msi(). So, it
does not make sense to have the iommu_dma_compose_msi_msg() in dma-iommu.c
as it no longer provides the only iommu_dma_prepare_msi() implementation.
Link: https://patch.msgid.link/r/eda62a9bafa825e9cdabd7ddc61ad5a21c32af24.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The IOMMU translation for MSI message addresses has been a 2-step process,
separated in time:
1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA address
is stored in the MSI descriptor when an MSI interrupt is allocated.
2) iommu_dma_compose_msi_msg(): this cookie pointer is used to compute a
translated message address.
This has an inherent lifetime problem for the pointer stored in the cookie
that must remain valid between the two steps. However, there is no locking
at the irq layer that helps protect the lifetime. Today, this works under
the assumption that the iommu domain is not changed while MSI interrupts
being programmed. This is true for normal DMA API users within the kernel,
as the iommu domain is attached before the driver is probed and cannot be
changed while a driver is attached.
Classic VFIO type1 also prevented changing the iommu domain while VFIO was
running as it does not support changing the "container" after starting up.
However, iommufd has improved this so that the iommu domain can be changed
during VFIO operation. This potentially allows userspace to directly race
VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and
VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()).
This potentially causes both the cookie pointer and the unlocked call to
iommu_get_domain_for_dev() on the MSI translation path to become UAFs.
Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA
address is already known during iommu_dma_prepare_msi() and cannot change.
Thus, it can simply be stored as an integer in the MSI descriptor.
The other UAF related to iommu_get_domain_for_dev() will be addressed in
patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by
using the IOMMU group mutex.
Link: https://patch.msgid.link/r/a4f2cd76b9dc1833ee6c1cf325cba57def22231c.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Subpage protection can't be disabled on t6000-style darts,
as such the disable flag no longer applies, and probably
even affects something else.
Fixes: dc09fe1c5e ("iommu/io-pgtable-dart: Add DART PTE support for t6000")
Signed-off-by: Asahi Lina <lina@asahilina.net>
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20250219-dart2-no-sp-disable-v1-1-9f324cfa4e70@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Enabled via the kernel command-line 'iommu.passthrough=1' option.
Introduce the concept of identity domains to s390-iommu, which relies on
the bus_dma_region to offset identity mappings to the start of the DMA
aperture advertized by CLP.
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250212213418.182902-5-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
At this point, the dma_table is really a property of the s390-iommu
domain. Rather than checking its contents elsewhere in the codebase,
move the code that registers the table with firmware into
s390-iommu and make a decision what to register with firmware based
upon the type of domain in use for the device in question.
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/20250212213418.182902-4-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
- Large set of fixes for vector handling, specially in the interactions
between host and guest state. This fixes a number of bugs affecting
actual deployments, and greatly simplifies the FP/SIMD/SVE handling.
Thanks to Mark Rutland for dealing with this thankless task.
- Fix an ugly race between vcpu and vgic creation/init, resulting in
unexpected behaviours.
- Fix use of kernel VAs at EL2 when emulating timers with nVHE.
- Small set of pKVM improvements and cleanups.
x86:
- Fix broken SNP support with KVM module built-in, ensuring the PSP
module is initialized before KVM even when the module infrastructure
cannot be used to order initcalls
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being emulated
by KVM to fix a NULL pointer dereference.
- Enter guest mode (L2) from KVM's perspective before initializing the vCPU's
nested NPT MMU so that the MMU is properly tagged for L2, not L1.
- Load the guest's DR6 outside of the innermost .vcpu_run() loop, as the
guest's value may be stale if a VM-Exit is handled in the fastpath.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmev2ykUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMvxwf/bw2u08moAYWAjJLROFvfiKXnznLS
iqJ2+jcw0lJ7wDqm4Zw8M5t74Rd+y5yzkLkZOyjav9yBB09zRkItiTHljCNMOQnt
2QptBa3CUN8N+rNnvVRt6dMkhw7z6n7eoFRSIDY2Y9PgiTapbFXPV1gFkMPO6+0f
SyF4LCr0iuDkJdvGAZJAH/Mp8nG6dv/A6a+Q+R1RkbKn9c2OdWw4VMfhIzimFGN6
0RFjbfXXvyO0aU/W/VHwvvuhcjGkAZWfHDdaTXqbvSMhayW562UPVMVBwXdVBmDj
Dk1gCKcbm4WyktbXYW6iOYj3MgdK96eI24ozps4R0aDexsrTRY4IfH4KEg==
=20Ql
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Large set of fixes for vector handling, especially in the
interactions between host and guest state.
This fixes a number of bugs affecting actual deployments, and
greatly simplifies the FP/SIMD/SVE handling. Thanks to Mark Rutland
for dealing with this thankless task.
- Fix an ugly race between vcpu and vgic creation/init, resulting in
unexpected behaviours
- Fix use of kernel VAs at EL2 when emulating timers with nVHE
- Small set of pKVM improvements and cleanups
x86:
- Fix broken SNP support with KVM module built-in, ensuring the PSP
module is initialized before KVM even when the module
infrastructure cannot be used to order initcalls
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being
emulated by KVM to fix a NULL pointer dereference
- Enter guest mode (L2) from KVM's perspective before initializing
the vCPU's nested NPT MMU so that the MMU is properly tagged for
L2, not L1
- Load the guest's DR6 outside of the innermost .vcpu_run() loop, as
the guest's value may be stale if a VM-Exit is handled in the
fastpath"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
x86/sev: Fix broken SNP support with KVM module built-in
KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
crypto: ccp: Add external API interface for PSP module initialization
KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()
KVM: arm64: timer: Drop warning on failed interrupt signalling
KVM: arm64: Fix alignment of kvm_hyp_memcache allocations
KVM: arm64: Convert timer offset VA when accessed in HYP code
KVM: arm64: Simplify warning in kvm_arch_vcpu_load_fp()
KVM: arm64: Eagerly switch ZCR_EL{1,2}
KVM: arm64: Mark some header functions as inline
KVM: arm64: Refactor exit handlers
KVM: arm64: Refactor CPTR trap deactivation
KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
KVM: arm64: Remove host FPSIMD saving for non-protected KVM
KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
KVM: nSVM: Enter guest mode before initializing nested NPT MMU
KVM: selftests: Add CPUID tests for Hyper-V features that need in-kernel APIC
KVM: selftests: Manage CPUID array in Hyper-V CPUID test's core helper
...
Fix issues with enabling SNP host support and effectively SNP support
which is broken with respect to the KVM module being built-in.
SNP host support is enabled in snp_rmptable_init() which is invoked as
device_initcall(). SNP check on IOMMU is done during IOMMU PCI init
(IOMMU_PCI_INIT stage). And for that reason snp_rmptable_init() is
currently invoked via device_initcall() and cannot be invoked via
subsys_initcall() as core IOMMU subsystem gets initialized via
subsys_initcall().
Now, if kvm_amd module is built-in, it gets initialized before SNP host
support is enabled in snp_rmptable_init() :
[ 10.131811] kvm_amd: TSC scaling supported
[ 10.136384] kvm_amd: Nested Virtualization enabled
[ 10.141734] kvm_amd: Nested Paging enabled
[ 10.146304] kvm_amd: LBR virtualization supported
[ 10.151557] kvm_amd: SEV enabled (ASIDs 100 - 509)
[ 10.156905] kvm_amd: SEV-ES enabled (ASIDs 1 - 99)
[ 10.162256] kvm_amd: SEV-SNP enabled (ASIDs 1 - 99)
[ 10.171508] kvm_amd: Virtual VMLOAD VMSAVE supported
[ 10.177052] kvm_amd: Virtual GIF supported
...
...
[ 10.201648] kvm_amd: in svm_enable_virtualization_cpu
And then svm_x86_ops->enable_virtualization_cpu()
(svm_enable_virtualization_cpu) programs MSR_VM_HSAVE_PA as following:
wrmsrl(MSR_VM_HSAVE_PA, sd->save_area_pa);
So VM_HSAVE_PA is non-zero before SNP support is enabled on all CPUs.
snp_rmptable_init() gets invoked after svm_enable_virtualization_cpu()
as following :
...
[ 11.256138] kvm_amd: in svm_enable_virtualization_cpu
...
[ 11.264918] SEV-SNP: in snp_rmptable_init
This triggers a #GP exception in snp_rmptable_init() when snp_enable()
is invoked to set SNP_EN in SYSCFG MSR:
[ 11.294289] unchecked MSR access error: WRMSR to 0xc0010010 (tried to write 0x0000000003fc0000) at rIP: 0xffffffffaf5d5c28 (native_write_msr+0x8/0x30)
...
[ 11.294404] Call Trace:
[ 11.294482] <IRQ>
[ 11.294513] ? show_stack_regs+0x26/0x30
[ 11.294522] ? ex_handler_msr+0x10f/0x180
[ 11.294529] ? search_extable+0x2b/0x40
[ 11.294538] ? fixup_exception+0x2dd/0x340
[ 11.294542] ? exc_general_protection+0x14f/0x440
[ 11.294550] ? asm_exc_general_protection+0x2b/0x30
[ 11.294557] ? __pfx_snp_enable+0x10/0x10
[ 11.294567] ? native_write_msr+0x8/0x30
[ 11.294570] ? __snp_enable+0x5d/0x70
[ 11.294575] snp_enable+0x19/0x20
[ 11.294578] __flush_smp_call_function_queue+0x9c/0x3a0
[ 11.294586] generic_smp_call_function_single_interrupt+0x17/0x20
[ 11.294589] __sysvec_call_function+0x20/0x90
[ 11.294596] sysvec_call_function+0x80/0xb0
[ 11.294601] </IRQ>
[ 11.294603] <TASK>
[ 11.294605] asm_sysvec_call_function+0x1f/0x30
...
[ 11.294631] arch_cpu_idle+0xd/0x20
[ 11.294633] default_idle_call+0x34/0xd0
[ 11.294636] do_idle+0x1f1/0x230
[ 11.294643] ? complete+0x71/0x80
[ 11.294649] cpu_startup_entry+0x30/0x40
[ 11.294652] start_secondary+0x12d/0x160
[ 11.294655] common_startup_64+0x13e/0x141
[ 11.294662] </TASK>
This #GP exception is getting triggered due to the following errata for
AMD family 19h Models 10h-1Fh Processors:
Processor may generate spurious #GP(0) Exception on WRMSR instruction:
Description:
The Processor will generate a spurious #GP(0) Exception on a WRMSR
instruction if the following conditions are all met:
- the target of the WRMSR is a SYSCFG register.
- the write changes the value of SYSCFG.SNPEn from 0 to 1.
- One of the threads that share the physical core has a non-zero
value in the VM_HSAVE_PA MSR.
The document being referred to above:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/revision-guides/57095-PUB_1_01.pdf
To summarize, with kvm_amd module being built-in, KVM/SVM initialization
happens before host SNP is enabled and this SVM initialization
sets VM_HSAVE_PA to non-zero, which then triggers a #GP when
SYSCFG.SNPEn is being set and this will subsequently cause
SNP_INIT(_EX) to fail with INVALID_CONFIG error as SYSCFG[SnpEn] is not
set on all CPUs.
Essentially SNP host enabling code should be invoked before KVM
initialization, which is currently not the case when KVM is built-in.
Add fix to call snp_rmptable_init() early from iommu_snp_enable()
directly and not invoked via device_initcall() which enables SNP host
support before KVM initialization with kvm_amd module built-in.
Add additional handling for `iommu=off` or `amd_iommu=off` options.
Note that IOMMUs need to be enabled for SNP initialization, therefore,
if host SNP support is enabled but late IOMMU initialization fails
then that will cause PSP driver's SNP_INIT to fail as IOMMU SNP sanity
checks in SNP firmware will fail with invalid configuration error as
below:
[ 9.723114] ccp 0000:23:00.1: sev enabled
[ 9.727602] ccp 0000:23:00.1: psp enabled
[ 9.732527] ccp 0000:a2:00.1: enabling device (0000 -> 0002)
[ 9.739098] ccp 0000:a2:00.1: no command queues available
[ 9.745167] ccp 0000:a2:00.1: psp enabled
[ 9.805337] ccp 0000:23:00.1: SEV-SNP: failed to INIT rc -5, error 0x3
[ 9.866426] ccp 0000:23:00.1: SEV API:1.53 build:5
Fixes: c3b86e61b7 ("x86/cpufeatures: Enable/unmask SEV-SNP CPU feature")
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Acked-by: Joerg Roedel <jroedel@suse.de>
Message-ID: <138b520fb83964782303b43ade4369cd181fdd9c.1739226950.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This driver supports page faults on PCI RID since commit <9f831c16c69e>
("iommu/vt-d: Remove the pasid present check in prq_event_thread") by
allowing the reporting of page faults with the pasid_present field cleared
to the upper layer for further handling. The fundamental assumption here
is that the detach or replace operations act as a fence for page faults.
This implies that all pending page faults associated with a specific RID
or PASID are flushed when a domain is detached or replaced from a device
RID or PASID.
However, the intel_iommu_drain_pasid_prq() helper does not correctly
handle faults for RID. This leads to faults potentially remaining pending
in the iommu hardware queue even after the domain is detached, thereby
violating the aforementioned assumption.
Fix this issue by extending intel_iommu_drain_pasid_prq() to cover faults
for RID.
Fixes: 9f831c16c6 ("iommu/vt-d: Remove the pasid present check in prq_event_thread")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250121023150.815972-1-baolu.lu@linux.intel.com
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250211005512.985563-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are some typos in comments/messages:
- modyfying -> modifying
- Unabled -> Unable
Fix them via codespell.
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Link: https://lore.kernel.org/r/20250210112027.29791-1-algonell@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
With recent kernel, AMDGPU failed to resume after suspend on certain laptop.
Sample log:
-----------
Nov 14 11:52:19 Thinkbook kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:06:00.0 pasid=0x00000 address=0x135300000 flags=0x0080]
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[0]: 7d90000000000003
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[1]: 0000100103fc0009
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[2]: 2000000117840013
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[3]: 0000000000000000
This is because in resume path, CNTRL[EPHEn] is not set. Fix this by
setting CNTRL[EPHEn] to 1 in resume path if EFR[EPHSUP] is set.
Note
May be better approach is to save the control register in suspend path
and restore it in resume path instead of trying to set indivisual
bits. We will have separate patch for that.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219499
Fixes: c4cb231111 ("iommu/amd: Add support for enable/disable IOPF")
Tested-by: Hamish McIntyre-Bhatty <kernel-bugzilla@regd.hamishmb.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250127094411.5931-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are new attach/detach/replace helpers in device.c taking care of both
the attach_handle and the fault specific routines for iopf_enable/disable()
and auto response.
Clean up these redundant functions in the fault.c file.
Link: https://patch.msgid.link/r/3ca94625e9d78270d9a715fa0809414fddd57e58.1738645017.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
"attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
used by IOPF/PRI use cases. Now, both the MSI and PASID series require to
reuse the attach_handle for non-fault cases.
Add a set of new attach/detach/replace helpers that does the attach_handle
allocation/releasing/replacement in the common path and also handles those
fault specific routines such as iopf enabling/disabling and auto response.
This covers both non-fault and fault cases in a clean way, replacing those
inline helpers in the header. The following patch will clean up those old
helpers in the fault.c file.
Link: https://patch.msgid.link/r/32687df01c02291d89986a9fca897bbbe2b10987.1738645017.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The iopf_queue_remove_device() helper removes a device from the per-iommu
iopf queue when PRI is disabled on the device. It responds to all
outstanding iopf's with an IOMMU_PAGE_RESP_INVALID code and detaches the
device from the queue.
However, it fails to release the group structure that represents a group
of iopf's awaiting for a response after responding to the hardware. This
can cause a memory leak if iopf_queue_remove_device() is called with
pending iopf's.
Fix it by calling iopf_free_group() after the iopf group is responded.
Fixes: 1991123271 ("iommu: Track iopf group instead of last fault")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250117055800.782462-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmeTFQ4THHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXqMWB/4uHjnu50u+m00OwXAKQr6i92zh50BZ
RQragd9s9C8tuUNwPDmS/ct2BNAhoy43KJ0ClegdZjKxT1Ys8cLv4Wr5CaGckqWq
+WCHqTgt+cPe0vUofqahB5wiAZMsnBgzFkV/OfFwBx0wkub9y5T3qVq5KapYlaDI
7Gftb+wg1AAsrdZ/HuLRy5ZVvkM/73rU2uoi8WXjr/T14E1krCFR/qirLd1OXo6Q
Jb97qhnCt/N9JPwIq5/VnYWde5Mpqz6UgtA2rFLDXgNGz+h9/ND6ecWFHjZWNVdc
AKWZTO5t+fRVBOSyahoyRoYSntPw3wlxyL7A2/54h6j4Dex7wLt6NQBj
=empO
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Introduce a new set of Hyper-V headers in include/hyperv and replace
the old hyperv-tlfs.h with the new headers (Nuno Das Neves)
- Fixes for the Hyper-V VTL mode (Roman Kisel)
- Fixes for cpu mask usage in Hyper-V code (Michael Kelley)
- Document the guest VM hibernation behaviour (Michael Kelley)
- Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)
* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Documentation: hyperv: Add overview of guest VM hibernation
hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
hyperv: Do not overlap the hvcall IO areas in get_vtl()
hyperv: Enable the hypercall output page for the VTL mode
hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
Drivers: hv: vmbus: Log on missing offers if any
Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
iommu/hyper-v: Don't assume cpu_possible_mask is dense
Drivers: hv: Don't assume cpu_possible_mask is dense
x86/hyperv: Don't assume cpu_possible_mask is dense
hyperv: Remove the now unused hyperv-tlfs.h files
hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
hyperv: Add new Hyper-V headers in include/hyperv
hyperv: Clean up unnecessary #includes
hyperv: Move hv_connection_id to hyperv-tlfs.h
No major functionality this cycle:
- iommufd part of the domain_alloc_paging_flags() conversion
- Move IOMMU_HWPT_FAULT_ID_VALID processing out of drivers
- Increase a timeout waiting for other threads to drop transient refcounts
that syzkaller was hitting
- Fix a UBSAN hit in iova_bitmap due to shift out of bounds
- Add missing cleanup of fault events during FD shutdown, fixing a memory leak
- Improve the fault delivery flow to have a smaller locking critical
region that does not include copy_to_user()
- Fix 32 bit ABI breakage due to missed implicit padding, and fix the
stack memory leakage
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ5JgnAAKCRCFwuHvBreF
YQWrAP9ItOsbeOxYIEVQK1E90HbMWVHF8RhWDPnpChgzyorjjQEA/OOou6uTAcVC
fJybLk3T7JrutMmBFvVLsRg+yCJPlgE=
=ldhT
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"No major functionality this cycle:
- iommufd part of the domain_alloc_paging_flags() conversion
- Move IOMMU_HWPT_FAULT_ID_VALID processing out of drivers
- Increase a timeout waiting for other threads to drop transient
refcounts that syzkaller was hitting
- Fix a UBSAN hit in iova_bitmap due to shift out of bounds
- Add missing cleanup of fault events during FD shutdown, fixing a
memory leak
- Improve the fault delivery flow to have a smaller locking critical
region that does not include copy_to_user()
- Fix 32 bit ABI breakage due to missed implicit padding, and fix the
stack memory leakage"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd: Fix struct iommu_hwpt_pgfault init and padding
iommufd/fault: Use a separate spinlock to protect fault->deliver list
iommufd/fault: Destroy response and mutex in iommufd_fault_destroy()
iommufd: Keep OBJ/IOCTL lists in an alphabetical order
iommufd/iova_bitmap: Fix shift-out-of-bounds in iova_bitmap_offset_to_index()
iommu: iommufd: fix WARNING in iommufd_device_unbind
iommufd: Deal with IOMMU_HWPT_FAULT_ID_VALID in iommufd core
iommufd/selftest: Remove domain_alloc_paging()
Including:
- Core changes:
- PASID support for the blocked_domain.
- ARM-SMMU Updates:
- SMMUv2:
* Implement per-client prefetcher configuration on Qualcomm SoCs.
* Support for the Adreno SMMU on Qualcomm's SDM670 SOC.
- SMMUv3:
* Pretty-printing of event records.
* Drop the ->domain_alloc_paging implementation in favour of
->domain_alloc_paging_flags(flags==0).
- IO-PGTable:
* Generalisation of the page-table walker to enable external walkers
(e.g. for debugging unexpected page-faults from the GPU).
* Minor fix for handling concatenated PGDs at stage-2 with 16KiB pages.
- Misc:
* Clean-up device probing and replace the crufty probe-deferral hack
with a more robust implementation of arm_smmu_get_by_fwnode().
* Device-tree binding updates for a bunch of Qualcomm platforms.
- Intel VT-d Updates:
- Remove domain_alloc_paging().
- Remove capability audit code.
- Draining PRQ in sva unbind path when FPD bit set.
- Link cache tags of same iommu unit together.
- AMD-Vi Updates:
- Use CMPXCHG128 to update DTE.
- Cleanups of the domain_alloc_paging() path.
- RiscV IOMMU:
- Platform MSI support.
- Shutdown support.
- Rockchip IOMMU:
- Add DT bindings for Rockchip RK3576.
- More smaller fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmeQsnQACgkQK/BELZcB
GuOHug/8DDIuKlUHU2+3U2kKxMb8o1kDxGPkfKgXBaxxpprvY9DARRv7N4mnF6Vu
Db8P7QihXfQKnZHXEiqHE7TlA5B+IVQd3Kz96P4sY3OlVWGZYqyKv2GEHyG5CjN/
bay7bfgeo2EVEiAio6VToFFWTm+oxFZzhoYFIlAyZAuIQUp17gHXf7YyhUwk4rOz
8g0XMH6uldidID6BVpArxHh/bN9MOTdHzkyhwPF3FL8E94ziX6rWILH9ADYxBn2o
wqHR1STxv398k62toPpWb78c2RdANI8treDXsYpCyDF87dygdP+SA0qkK3G6kAVA
/IiPothAj6oNm+Gvwd04tEkuqVVrVqrmWE3VXSps33Tk+apYtcmLCtdpwY/F93D1
EZwTVqveBKk2hSWYVDlEyj9XKmZ9dYWDGvg2Fx844gltQHoHtEgYL+azCMU/ry7k
3+KlkUqFZZxUDbVBbbDdMF+NpxyZTtfKsaLB8f5laOP1R8Os3+dC69Gw6bhoxaMc
xfL3v245V5kWSRy+w02TyZGmZSzRx0FKUbFLKpLOvZD6pfx8t8oqJTG9FwN1KzpL
lvcBpPB5AUeNJQpKcVSjs/ne8WiOqdABt7ge4E9J5TwrDI8sXk0mwJaPrlDgK1V/
0xkkLmxWnqsu04CyDay+Uv3Bls/rRBpikR/iGt9P3BbZs7Cyryw=
=Xei0
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core changes:
- PASID support for the blocked_domain
ARM-SMMU Updates:
- SMMUv2:
- Implement per-client prefetcher configuration on Qualcomm SoCs
- Support for the Adreno SMMU on Qualcomm's SDM670 SOC
- SMMUv3:
- Pretty-printing of event records
- Drop the ->domain_alloc_paging implementation in favour of
domain_alloc_paging_flags(flags==0)
- IO-PGTable:
- Generalisation of the page-table walker to enable external
walkers (e.g. for debugging unexpected page-faults from the GPU)
- Minor fix for handling concatenated PGDs at stage-2 with 16KiB
pages
- Misc:
- Clean-up device probing and replace the crufty probe-deferral
hack with a more robust implementation of
arm_smmu_get_by_fwnode()
- Device-tree binding updates for a bunch of Qualcomm platforms
Intel VT-d Updates:
- Remove domain_alloc_paging()
- Remove capability audit code
- Draining PRQ in sva unbind path when FPD bit set
- Link cache tags of same iommu unit together
AMD-Vi Updates:
- Use CMPXCHG128 to update DTE
- Cleanups of the domain_alloc_paging() path
RiscV IOMMU:
- Platform MSI support
- Shutdown support
Rockchip IOMMU:
- Add DT bindings for Rockchip RK3576
More smaller fixes and cleanups"
* tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (66 commits)
iommu: Use str_enable_disable-like helpers
iommu/amd: Fully decode all combinations of alloc_paging_flags
iommu/amd: Move the nid to pdom_setup_pgtable()
iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
iommu/amd: Remove dev == NULL checks
iommu/amd: Remove domain_alloc()
iommu/amd: Remove unused amd_iommu_domain_update()
iommu/riscv: Fixup compile warning
iommu/arm-smmu-v3: Add missing #include of linux/string_choices.h
iommu/arm-smmu-v3: Use str_read_write helper w/ logs
iommu/io-pgtable-arm: Add way to debug pgtable walk
iommu/io-pgtable-arm: Re-use the pgtable walk for iova_to_phys
iommu/io-pgtable-arm: Make pgtable walker more generic
iommu/arm-smmu: Add ACTLR data and support for qcom_smmu_500
iommu/arm-smmu: Introduce ACTLR custom prefetcher settings
iommu/arm-smmu: Add support for PRR bit setup
iommu/arm-smmu: Refactor qcom_smmu structure to include single pointer
iommu/arm-smmu: Re-enable context caching in smmu reset operation
iommu/vt-d: Link cache tags of same iommu unit together
...
- Consolidation of the machine_kexec_mask_interrupts() by providing a
generic implementation and replacing the copy & pasta orgy in the
relevant architectures.
- Prevent unconditional operations on interrupt chips during kexec
shutdown, which can trigger warnings in certain cases when the
underlying interrupt has been shut down before.
- Make the enforcement of interrupt handling in interrupt context
unconditionally available, so that it actually works for non x86
related interrupt chips. The earlier enablement for ARM GIC chips set
the required chip flag, but did not notice that the check was hidden
behind a config switch which is not selected by ARM[64].
- Decrapify the handling of deferred interrupt affinity setting. Some
interrupt chips require that affinity changes are made from the context
of handling an interrupt to avoid certain race conditions. For x86 this
was the default, but with interrupt remapping this requirement was
lifted and a flag was introduced which tells the core code that
affinity changes can be done in any context. Unrestricted affinity
changes are the default for the majority of interrupt chips. RISCV has
the requirement to add the deferred mode to one of it's interrupt
controllers, but with the original implementation this would require to
add the any context flag to all other RISC-V interrupt chips. That's
backwards, so reverse the logic and require that chips, which need the
deferred mode have to be marked accordingly. That avoids chasing the
'sane' chips and marking them.
- Add multi-node support to the Loongarch AVEC interrupt controller
driver.
- The usual tiny cleanups, fixes and improvements all over the place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmePkVITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRbQD/9bHVph/V9Ekl7JAX3aY4gG4JbRhOc7
dp1VAcHRhktRfoTztYRbjsbMu2nvZ58GKA8bkOS2jHSF/m3PbkIJfOhwk0YdIAoa
+kdy5yDgqCGfkqW43DN4Cr+CnzGjWMitw67tFp3fhwehMDpDjdt2L28IjtanSS0f
hO6FV7o65MWeJwxk4Isb2/nvkO+X23Lrp6RrWS8SXBnF9FFXxiPIg/fiOPTizhCh
1W/bSGxLLb9WwsVzmlGAKVFlXDij0QGaIUug2fdVZ63OsELXD7tJrLSPG133yk92
ppIa0s6BT4IBsfM00us4hG15PkLuJmP3yWWcoquG0rP8Wq58VOXiN6+rcJIyvB+5
mWceTH6IKfZGoRQKwXC7BxeBAIb147reiJtb06meq1/8ADIvzafiNy0c8x9i/UaV
QiyhPVENjaGCGDomZmJQqN7Yb02Wge1k8InQnodDrHxZNl/bX/B1Z8Bxd0n6hPHg
NSJXYif2AxgaddpohsdygqRDbT6SNyQdj7YjJFY5qAGJ3yFyJ4JB6WTqkWW4o1vH
3FVqdAnJmejAmmYSkah0Hkem2T5QASQmTWb93PLxiV6q+d0NM8stWAujjyVdIV/B
W4Uj9mQ20cz54TjLtxqX+A1k6KcqOWRgh1l2QbUlFsgsOP3V8yz47yqYdR9qMWlO
9kNEjI3sw+G/IQ==
=q4rj
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:
- Consolidate the machine_kexec_mask_interrupts() by providing a
generic implementation and replacing the copy & pasta orgy in the
relevant architectures.
- Prevent unconditional operations on interrupt chips during kexec
shutdown, which can trigger warnings in certain cases when the
underlying interrupt has been shut down before.
- Make the enforcement of interrupt handling in interrupt context
unconditionally available, so that it actually works for non x86
related interrupt chips. The earlier enablement for ARM GIC chips set
the required chip flag, but did not notice that the check was hidden
behind a config switch which is not selected by ARM[64].
- Decrapify the handling of deferred interrupt affinity setting.
Some interrupt chips require that affinity changes are made from the
context of handling an interrupt to avoid certain race conditions.
For x86 this was the default, but with interrupt remapping this
requirement was lifted and a flag was introduced which tells the core
code that affinity changes can be done in any context. Unrestricted
affinity changes are the default for the majority of interrupt chips.
RISCV has the requirement to add the deferred mode to one of it's
interrupt controllers, but with the original implementation this
would require to add the any context flag to all other RISC-V
interrupt chips. That's backwards, so reverse the logic and require
that chips, which need the deferred mode have to be marked
accordingly. That avoids chasing the 'sane' chips and marking them.
- Add multi-node support to the Loongarch AVEC interrupt controller
driver.
- The usual tiny cleanups, fixes and improvements all over the place.
* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
genirq/timings: Add kernel-doc for a function parameter
genirq: Remove IRQ_MOVE_PCNTXT and related code
x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
genirq: Provide IRQCHIP_MOVE_DEFERRED
hexagon: Remove GENERIC_PENDING_IRQ leftover
ARC: Remove GENERIC_PENDING_IRQ
genirq: Remove handle_enforce_irqctx() wrapper
genirq: Make handle_enforce_irqctx() unconditionally available
irqchip/loongarch-avec: Add multi-nodes topology support
irqchip/ts4800: Replace seq_printf() by seq_puts()
irqchip/ti-sci-inta : Add module build support
irqchip/ti-sci-intr: Add module build support
irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
kexec: Consolidate machine_kexec_mask_interrupts() implementation
genirq: Reuse irq_thread_fn() for forced thread case
genirq: Move irq_thread_fn() further up in the code
The fault->mutex serializes the fault read()/write() fops and the
iommufd_fault_auto_response_faults(), mainly for fault->response. Also, it
was conveniently used to fence the fault->deliver in poll() fop and
iommufd_fault_iopf_handler().
However, copy_from/to_user() may sleep if pagefaults are enabled. Thus,
they could take a long time to wait for user pages to swap in, blocking
iommufd_fault_iopf_handler() and its caller that is typically a shared IRQ
handler of an IOMMU driver, resulting in a potential global DOS.
Instead of reusing the mutex to protect the fault->deliver list, add a
separate spinlock, nested under the mutex, to do the job.
iommufd_fault_iopf_handler() would no longer be blocked by
copy_from/to_user().
Add a free_list in iommufd_auto_response_faults(), so the spinlock can
simply fence a fast list_for_each_entry_safe routine.
Provide two deliver list helpers for iommufd_fault_fops_read() to use:
- Fetch the first iopf_group out of the fault->deliver list
- Restore an iopf_group back to the head of the fault->deliver list
Lastly, move the mutex closer to the response in the fault structure,
and update its kdoc accordingly.
Fixes: 07838f7fd5 ("iommufd: Add iommufd fault object")
Link: https://patch.msgid.link/r/20250117192901.79491-1-nicolinc@nvidia.com
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Replace ternary (condition ? "enable" : "disable") syntax with helpers
from string_choices.h because:
1. Simple function call with one argument is easier to read. Ternary
operator has three arguments and with wrapping might lead to quite
long code.
2. Is slightly shorter thus also easier to read.
3. It brings uniformity in the text - same string.
4. Allows deduping by the linker, which results in a smaller binary
file.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Heiko Stuebner <heiko@sntech.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250114192642.912331-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>