The driver never reads this value; arm_smmu_init_domain_context() always
sets domain.pgsize_bitmap to smmu->pgsize_bitmap, the per-instance value.
Remove the ops version entirely, along with the related dead code, and
make arm_smmu_ops const.
Since this driver does not yet finalize the domain under
arm_smmu_domain_alloc_paging(), add a page size initialization to alloc so
that the page size is still set up prior to attach.
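A minimal sketch of the alloc-time initialization described above, assuming
the usual arm-smmu structure layout (the dev_to_smmu() accessor and the
field names here are illustrative, not lifted from the patch):

```
static struct iommu_domain *arm_smmu_domain_alloc_paging(struct device *dev)
{
	/* Hypothetical accessor: obtain the per-instance SMMU from dev. */
	struct arm_smmu_device *smmu = dev_to_smmu(dev);
	struct arm_smmu_domain *smmu_domain;

	smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
	if (!smmu_domain)
		return NULL;

	/* ... existing initialization ... */

	/*
	 * The domain is not finalised here, so seed pgsize_bitmap from the
	 * per-instance value so it is valid before attach.
	 */
	smmu_domain->domain.pgsize_bitmap = smmu->pgsize_bitmap;
	return &smmu_domain->domain;
}
```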
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The driver never reads this value; arm_smmu_domain_finalise() always sets
domain.pgsize_bitmap from the pgtbl_cfg, which comes from the per-smmu
calculated value.
Remove the ops version entirely, along with the related dead code, and
make arm_smmu_ops const.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The EFR[HATS] bits indicate the maximum host translation level supported
by the IOMMU. Add support for setting the maximum host page table level as
indicated by EFR[HATS]. If HATS=11b (reserved), the driver will attempt to
use the guest page table for the DMA API.
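A hedged sketch of the level selection (FEATURE_HATS and the PAGE_MODE_*
constants are assumed names for illustration, not necessarily the patch's):

```
/* Illustrative only: map the 2-bit EFR[HATS] field to a maximum host
 * page-table level; 11b is reserved, so fall back to the guest (v2)
 * page table for the DMA API. */
switch (FIELD_GET(FEATURE_HATS, amd_iommu_efr)) {
case 0:
	amd_iommu_hpt_level = PAGE_MODE_4_LEVEL;
	break;
case 1:
	amd_iommu_hpt_level = PAGE_MODE_5_LEVEL;
	break;
case 2:
	amd_iommu_hpt_level = PAGE_MODE_6_LEVEL;
	break;
default:	/* 11b: reserved */
	amd_iommu_pgtable = PD_MODE_V2;
	break;
}
```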
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/df0f8562c2a20895cc185c86f1a02c4d826fd597.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The current AMD IOMMU driver assumes Host Address Translation (HAT) is
always supported, and the Linux kernel enables this capability by default.
However, in the case of an emulated or virtualized IOMMU, this might not
be true. For example, the current QEMU-emulated AMD vIOMMU does not
support host translation for VFIO pass-through devices, but interrupt
remapping support is required for x2APIC (i.e. kvm-msi-ext-dest-id is also
not supported by the guest OS). This requires the guest kernel to boot
with the kernel option iommu=pt to bypass the initialization of the
host (v1) table.
The AMD I/O Virtualization Technology (IOMMU) Specification Rev 3.10 [1]
introduces a new flag 'HATDis' in the IVHD 11h IOMMU attributes to indicate
that HAT is not supported on a particular IOMMU instance.
Therefore, modify the AMD IOMMU driver to detect the new HATDis
attribute, disable host translation, and switch to guest translation if it
is available. Otherwise, the driver will disable DMA translation.
[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf
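A hedged sketch of the detection flow (the flag and mode names below are
assumptions, including PD_MODE_NONE; the real identifiers may differ):

```
/* Illustrative: when IVHD marks this IOMMU instance with HATDis,
 * never use the host (v1) page table. */
if (ivhd_flags & IVHD_ATTR_HATDIS) {
	if (amd_iommu_v2_pgtbl_supported())
		amd_iommu_pgtable = PD_MODE_V2;   /* guest translation */
	else
		amd_iommu_pgtable = PD_MODE_NONE; /* no DMA translation */
}
```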
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/8109b208f87b80e400c2abd24a2e44fcbc0763a5.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts. KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.
Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.
Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that they match amd_iommu_activate_guest_mode().
Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.
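A sketch of the resulting prototypes (exact types are assumptions; the
parameter order follows the description above):

```
/* After the swap the two helpers line up, and both take the new GA log
 * control so KVM can request wake-up notifications only when the target
 * vCPU is blocking. */
int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr);
int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr);
```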
Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully
supported, as KVM should guard all such calls on IRQ posting being enabled.
Link: https://lore.kernel.org/r/20250611224604.313496-58-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode. KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.
Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity. Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.
This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.
Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Split the guts of amd_iommu_update_ga() to a dedicated helper so that the
logic can be shared with flows that put the IRTE into posted mode.
Opportunistically move amd_iommu_update_ga() and its new helper above
amd_iommu_activate_guest_mode() so that it's all co-located.
Link: https://lore.kernel.org/r/20250611224604.313496-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running. amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregious KVM bug.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a comment to amd_iommu_update_ga() to document what fields it can
safely modify without issuing an invalidation of the IRTE, and to explain
its role in keeping GA IRTEs up-to-date.
Per page 93 of the IOMMU spec dated Feb 2025:
When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
are not cached by the IOMMU. Modifications to these fields do not require an
invalidation of the Interrupt Remapping Table.
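The added comment might read roughly as follows (a hedged sketch based on
the spec quote above, not the literal patch text):

```
/*
 * Update the pCPU and running state for an IRTE that posts IRQs to a
 * vCPU. Per the IOMMU spec, when GuestMode=1 the IsRun, Destination,
 * and GATag fields are not cached, so they can be modified without
 * invalidating the Interrupt Remapping Table.
 */
```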
Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures. Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
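A hedged sketch of the split (field names are illustrative, not the exact
vendor headers):

```
/* Before: one overloaded handoff structure for every vendor. */
struct vcpu_data {
	u64 pi_desc_addr;	/* bastardized for AMD's GA root pointer */
	u32 vector;
};

/* After: each IOMMU driver consumes its own vendor-specific type. */
struct intel_iommu_pi_data {
	u64 pi_desc_addr;
	u32 vector;
};

struct amd_iommu_pi_data {
	u64 vapic_addr;		/* GA root pointer */
	u32 ga_tag;
	bool is_guest_mode;
};
```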
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that the new ucmd-based object allocator eases the finalize/abort
routine, apply it to all existing allocators that aren't protected by any
lock.
Upgrade the for-driver vIOMMU allocator too, and pass it down to all
existing viommu_alloc ops accordingly.
Note that __iommufd_object_alloc_ucmd() builds in static tests that cover
both static_asserts in iommufd_viommu_alloc(), so drop them.
Link: https://patch.msgid.link/r/107b24a3b791091bb09c92ffb0081c56c413b26d.1749882255.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
An object allocator needs to call either iommufd_object_finalize() upon a
success or iommufd_object_abort_and_destroy() upon an error code.
To reduce duplication, store a new_obj in the struct iommufd_ucmd and call
iommufd_object_finalize/iommufd_object_abort_and_destroy() accordingly in
the main function.
Similar to iommufd_object_alloc() and __iommufd_object_alloc(), add a pair
of helpers: __iommufd_object_alloc_ucmd() and iommufd_object_alloc_ucmd().
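A hedged sketch of what an allocator looks like with the new helper (the
object type and IOCTL names are illustrative):

```
/* Sketch: the helper stashes the object in ucmd->new_obj; the main
 * ioctl dispatcher then calls iommufd_object_finalize() on success or
 * iommufd_object_abort_and_destroy() on error, so the allocator itself
 * carries no cleanup boilerplate. */
static int iommufd_example_alloc_ioctl(struct iommufd_ucmd *ucmd)
{
	struct iommufd_example *example;

	example = iommufd_object_alloc_ucmd(ucmd, example,
					    IOMMUFD_OBJ_EXAMPLE);
	if (IS_ERR(example))
		return PTR_ERR(example);

	/* ... initialize the object; just return on error ... */
	return 0;
}
```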
Link: https://patch.msgid.link/r/e7206d4227844887cc8dbf0cc7b0242580fafd9d.1749882255.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now, all driver structures are allocated by the core, i.e. drivers no
longer need to call _iommufd_object_alloc. Thus, move it back.
Before:
text data bss dec hex filename
3024 180 0 3204 c84 drivers/iommu/iommufd/driver.o
9074 610 64 9748 2614 drivers/iommu/iommufd/main.o
After:
text data bss dec hex filename
2665 164 0 2829 b0d drivers/iommu/iommufd/driver.o
9410 618 64 10092 276c drivers/iommu/iommufd/main.o
Link: https://patch.msgid.link/r/79e630c7b911930cf36e3c8a775a04e66c528d65.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To ease the for-driver iommufd APIs, get_viommu_size and viommu_init ops
are introduced. Now that the existing vIOMMU-capable drivers implement
these two ops, replacing the viommu_alloc one, it has no users. Remove it
from the headers and the viommu core.
Link: https://patch.msgid.link/r/5b32d4499d7ed02a63e57a293c11b642d226ef8d.1749882255.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To ease the for-driver iommufd APIs, get_viommu_size and viommu_init ops
are introduced.
Sanitize the inputs and report the size of struct arm_vsmmu on success, in
arm_smmu_get_viommu_size().
Place the type sanity check last, because there will soon be an impl-level
get_viommu_size op that requires the same sanity tests beforehand; it can
simply insert a piece of code in front of the
IOMMU_VIOMMU_TYPE_ARM_SMMUV3 check.
The core will ensure the viommu_type is set on the core vIOMMU object and
pass in the same dev pointer, so arm_vsmmu_init() won't need to repeat the
same sanity tests and can simply init the arm_vsmmu struct.
Remove the arm_vsmmu_alloc, completing the replacement.
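A hedged sketch of the two-op pattern (prototypes and the
VIOMMU_STRUCT_SIZE macro are assumed from the description):

```
/* Sketch: the core queries the size, allocates the object itself, and
 * then calls init on the embedded core structure. */
static size_t arm_smmu_get_viommu_size(struct device *dev,
				       enum iommu_viommu_type viommu_type)
{
	/* input sanity tests live here; the type check is placed last */
	if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
		return 0;
	return VIOMMU_STRUCT_SIZE(struct arm_vsmmu, core);
}

static int arm_vsmmu_init(struct iommufd_viommu *viommu,
			  struct iommu_domain *parent_domain,
			  const struct iommu_user_data *user_data)
{
	struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);

	/* the core already validated viommu_type and the dev pointer */
	/* ... init the driver fields of vsmmu ... */
	return 0;
}
```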
Link: https://patch.msgid.link/r/64e4b4c33acd26e1bd676e077be80e00fb63f17c.1749882255.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To ease the for-driver iommufd APIs, get_viommu_size and viommu_init ops
are introduced.
Sanitize the inputs and report the size of struct mock_viommu on success,
in mock_get_viommu_size().
The core will ensure the viommu_type is set to the core vIOMMU object, so
simply init the driver part in mock_viommu_init().
Remove the mock_viommu_alloc, completing the replacement.
Link: https://patch.msgid.link/r/993beabbb0bc9705d979a92801ea5ed5996a34eb.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is no use of this parent domain. Delete the dead code.
Note that the s2_parent in struct mock_viommu will be dead code too. Yet,
keep it, because it will soon be used by HW queue objects, i.e. there is
no point in removing it and adding it back within such a short window.
Besides, keeping it covers the majority of vIOMMU use cases, where a
driver-level structure is larger than the core structure.
Link: https://patch.msgid.link/r/0f155a7cd71034a498448fe4828fb4aaacdabf95.1749882255.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To ease the for-driver iommufd APIs, get_viommu_size and viommu_init ops
are introduced to replace the viommu_alloc op.
Let the new viommu_init pathway coexist with the old viommu_alloc one.
Since the viommu_alloc op and its pathway will be soon deprecated, try to
minimize the code difference between them by adding a tentative jump tag.
Note that a !viommu->ops case now fails with a WARN_ON_ONCE, since a
vIOMMU is expected to support an alloc_domain_nested op for now, or some
sort of viommu op in the foreseeable future. This WARN_ON_ONCE can be
lifted if, some day, there is a use case wanting !viommu->ops.
Link: https://patch.msgid.link/r/35c5fa5926be45bda82f5fc87545cd3180ad4c9c.1749882255.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It's more accurate to report EOPNOTSUPP when an ioctl fails due to a
driver bug, since there is nothing wrong on the userspace side.
Link: https://patch.msgid.link/r/623bb6f0e8fdd7b9c5745a2f99f280163f9f1f5a.1749882255.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Replace unsigned int to make the type explicit. No functional changes.
Link: https://patch.msgid.link/r/208a260c100a00667d3799feaad1260745f96c6b.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The core code can always get the ictx pointer via vdev->viommu->ictx, thus
drop this unused one.
Link: https://patch.msgid.link/r/6cbb65e8df433de45b6c3a4bb2c5df09faca8a7c.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Run clang-format but exclude those not so obvious ones, which leaves us:
- Align indentations
- Add missing spaces
- Remove unnecessary spaces
- Remove unnecessary line wrappings
Link: https://patch.msgid.link/r/9132e1ab45690ab1959c66bbb51ac5536a635388.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This driver uses a mixture of ways to get the size of a PTE.
tegra_smmu_set_pde() did it as sizeof(*pd), which became wrong when pd
switched to a struct tegra_pd.
Switch pd back to a u32 * in tegra_smmu_set_pde() so that sizeof(*pd)
returns 4.
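A standalone illustration of the sizeof pitfall (the struct layout is
assumed; only the relative sizes matter):

```
#include <stdio.h>

typedef unsigned int u32;
struct tegra_pd { u32 val[1024]; };	/* assumed layout for illustration */

int main(void)
{
	u32 *entry = NULL;
	struct tegra_pd *pd = NULL;

	/* one 4-byte PTE vs. the whole 4 KiB directory */
	printf("%zu %zu\n", sizeof(*entry), sizeof(*pd));	/* 4 4096 */
	return 0;
}
```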
Fixes: 50568f87d1 ("iommu/terga: Do not use struct page as the handle for as->pd memory")
Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Closes: https://lore.kernel.org/all/62e7f7fe-6200-4e4f-ad42-d58ad272baa6@tecnico.ulisboa.pt/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Tested-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Link: https://lore.kernel.org/r/0-v1-da7b8b3d57eb+ce-iommu_terga_sizeof_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
Merge tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Print the actual delay time in pci_bridge_wait_for_secondary_bus()
instead of assuming it was 1000ms (Wilfred Mallawa)
- Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI
devices', which broke resume from system sleep on AMD platforms and
has been fixed by other commits (Lukas Wunner)
Resource management:
- Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated
and unnecessary (Philipp Stanner)
- Remove pcim_iounmap_regions() and pcim_request_region_exclusive()
and related flags since all uses have been removed (Philipp
Stanner)
- Rework devres 'request' functions so they are no longer 'hybrid',
i.e., their behavior no longer depends on whether
pcim_enable_device() or pci_enable_device() was used, and remove
related code (Philipp Stanner)
- Warn (not BUG()) about failure to assign optional resources (Ilpo
Järvinen)
Error handling:
- Log the DPC Error Source ID only when it's actually valid (when
ERR_FATAL or ERR_NONFATAL was received from a downstream device)
and decode into bus/device/function (Bjorn Helgaas)
- Determine AER log level once and save it so all related messages
use the same level (Karolina Stolarek)
- Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable
Errors (Karolina Stolarek)
- Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs
controls on interval and burst count, to avoid flooding logs and
RCU stall warnings (Jon Pan-Doh)
Power management:
- Increment PM usage counter when probing reset methods so we don't
try to read config space of a powered-off device (Alex Williamson)
- Set all devices to D0 during enumeration to ensure ACPI opregion is
connected via _REG (Mario Limonciello)
Power control:
- Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match
the filename paths. Retain old deprecated symbols for
compatibility, except for the pwrctrl slot driver
(PCI_PWRCTRL_SLOT) (Johan Hovold)
- When unregistering pwrctrl, cancel outstanding rescan work before
cleaning up data structures to avoid use-after-free issues (Brian
Norris)
Bandwidth control:
- Simplify link bandwidth controller by replacing the count of Link
Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN
flag (Ilpo Järvinen)
- Update the Link Speed after retraining, since the Link Speed may
have changed (Ilpo Järvinen)
PCIe native device hotplug:
- Ignore Presence Detect Changed caused by DPC.
pciehp already ignores Link Down/Up events caused by DPC, but on
slots using in-band presence detect, DPC causes a spurious Presence
Detect Changed event (Lukas Wunner)
- Ignore Link Down/Up caused by Secondary Bus Reset.
On hotplug ports using in-band presence detect, the reset causes a
Presence Detect Changed event, which mistakenly caused teardown and
re-enumeration of the device. Drivers may need to annotate code
that resets their device (Lukas Wunner)
Virtualization:
- Add an ACS quirk for Loongson Root Ports that don't advertise ACS
but don't allow peer-to-peer transactions between Root Ports; the
quirk allows each Root Port to be in a separate IOMMU group (Huacai
Chen)
Endpoint framework:
- For fixed-size BARs, retain both the actual size and the possibly
larger size allocated to accommodate iATU alignment requirements
(Jerome Brunet)
- Simplify ctrl/SPAD space allocation and avoid allocating more space
than needed (Jerome Brunet)
- Correct MSI-X PBA offset calculations for DesignWare and Cadence
endpoint controllers (Niklas Cassel)
- Align the return value (number of interrupts) encoding for
pci_epc_get_msi()/pci_epc_ops::get_msi() and
pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel)
- Align the nr_irqs parameter encoding for
pci_epc_set_msi()/pci_epc_ops::set_msi() and
pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel)
Common host controller library:
- Convert pci-host-common to a library so platforms that don't need
native host controller drivers don't need to include these helper
functions (Manivannan Sadhasivam)
Apple PCIe controller driver:
- Extract ECAM bridge creation helper from pci_host_common_probe() to
separate driver-specific things like MSI from PCI things (Marc
Zyngier)
- Dynamically allocate RID-to-SID bitmap to prepare for SoCs with
varying capabilities (Marc Zyngier)
- Skip ports disabled in DT when setting up ports (Janne Grunau)
- Add t6020 compatible string (Alyssa Rosenzweig)
- Add T602x PCIe support (Hector Martin)
- Directly set/clear INTx mask bits because T602x dropped the
accessors that could do this without locking (Marc Zyngier)
- Move port PHY registers to their own reg items to accommodate
T602x, which moves them around; retain default offsets for existing
DTs that lack phy%d entries with the reg offsets (Hector Martin)
- Stop polling for core refclk, which doesn't work on T602x and the
bootloader has already done anyway (Hector Martin)
- Use gpiod_set_value_cansleep() when asserting PERST# in probe
because we're allowed to sleep there (Hector Martin)
Cadence PCIe controller driver:
- Drop a runtime PM 'put' to resolve a runtime atomic count underflow
(Hans Zhang)
- Make the cadence core buildable as a module (Kishon Vijay Abraham I)
- Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by
loadable drivers when they are removed (Siddharth Vadapalli)
Freescale i.MX6 PCIe controller driver:
- Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP
(Richard Zhu)
- Remove redundant dw_pcie_wait_for_link() from
imx_pcie_start_link(); since the DWC core does this, imx6 only
needs it when retraining for a faster link speed (Richard Zhu)
- Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu)
- Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in
some cases, the controller can't exit 'L23 Ready' through Beacon or
PERST# deassertion (Richard Zhu)
- Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum:
controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8
GT/s, causing timeouts in L1 (Richard Zhu)
- Wait for i.MX95 PLL lock before enabling controller (Richard Zhu)
- Save/restore i.MX95 LUT for suspend/resume (Richard Zhu)
Mobiveil PCIe controller driver:
- Return bool (not int) for link-up check in
mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans
Zhang)
NVIDIA Tegra194 PCIe controller driver:
- Create debugfs directory for 'aspm_state_cnt' only when
CONFIG_PCIEASPM is enabled, since there are no other entries (Hans
Zhang)
Qualcomm PCIe controller driver:
- Add OF support for parsing DT 'eq-presets-<N>gts' property for lane
equalization presets (Krishna Chaitanya Chundru)
- Read Maximum Link Width from the Link Capabilities register if DT
lacks 'num-lanes' property (Krishna Chaitanya Chundru)
- Add Physical Layer 64 GT/s Capability ID and register offsets for
8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya
Chundru)
- Add generic dwc support for configuring lane equalization presets
(Krishna Chaitanya Chundru)
- Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar)
Renesas R-Car PCIe controller driver:
- Describe endpoint BAR 4 as being fixed size (Jerome Brunet)
- Document how to obtain R-Car V4H (r8a779g0) controller firmware
(Yoshihiro Shimoda)
Rockchip PCIe controller driver:
- Reorder rockchip_pci_core_rsts because
reset_control_bulk_deassert() deasserts in reverse order, to fix a
link training regression (Jensen Huang)
- Mark RK3399 as being capable of raising INTx interrupts (Niklas
Cassel)
Rockchip DesignWare PCIe controller driver:
- Check only PCIE_LINKUP, not LTSSM status, to determine whether the
link is up (Shawn Lin)
- Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s
for Root Complex and Endpoint modes (Shawn Lin)
- Hide the broken ATS Capability in rockchip_pcie_ep_init() instead
of rockchip_pcie_ep_pre_init() so it stays hidden after PERST#
resets non-sticky registers (Shawn Lin)
- Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit()
(Diederik de Haas)
Synopsys DesignWare PCIe controller driver:
- Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training
more robust; this will not affect the intended link width if all
lanes are functional (Wenbin Yao)
- Return bool (not int) for link-up check in dw_pcie_ops.link_up()
and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay,
keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx,
tegra194, uniphier, visconti (Hans Zhang)
- Add debugfs support for exposing DWC device-specific PTM context
(Manivannan Sadhasivam)
TI J721E PCIe driver:
- Make j721e buildable as a loadable and removable module (Siddharth
Vadapalli)
- Fix j721e host/endpoint dependencies that result in link failures
in some configs (Arnd Bergmann)
Device tree bindings:
- Add qcom DT binding for 'global' interrupt (PCIe controller and
link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p,
sc7280, sc8180x, sdm845, sm8150, sm8250, sm8350 (Manivannan
Sadhasivam)
- Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074,
ipq8074-gen3, ipq6018 (Manivannan Sadhasivam)
- Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang)
- Correct indentation and style of examples in brcm,stb-pcie,
cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie,
microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm
(Krzysztof Kozlowski)
- Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and
armada8k from text to schema DT bindings (Rob Herring)
- Remove obsolete .txt DT bindings for content that has been moved to
schemas (Rob Herring)
- Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074
and IPQ9574 (Varadarajan Narayanan)
- Convert v3,v360epc-pci from text to DT schema binding (Rob Herring)
- Change microchip,pcie-host DT binding to be 'dma-noncoherent' since
PolarFire may be configured that way (Conor Dooley)
Miscellaneous:
- Drop 'pci' suffix from intel_mid_pci.c filename to match similar
files (Andy Shevchenko)
- All platforms with PCI have an MMU, so add PCI Kconfig dependency
on MMU to simplify build testing and avoid inadvertent build
regressions (Arnd Bergmann)
- Update Krzysztof Wilczyński's email address in MAINTAINERS
(Krzysztof Wilczyński)
- Update Manivannan Sadhasivam's email address in MAINTAINERS
(Manivannan Sadhasivam)"
* tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits)
MAINTAINERS: Update Manivannan Sadhasivam email address
PCI: j721e: Fix host/endpoint dependencies
PCI: j721e: Add support to build as a loadable module
PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup
PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup
PCI: cadence: Add support to build pcie-cadence library as a kernel module
MAINTAINERS: Update Krzysztof Wilczyński email address
PCI: Remove unnecessary linesplit in __pci_setup_bridge()
PCI: WARN (not BUG()) when we fail to assign optional resources
PCI: Remove unused pci_printk()
PCI: qcom: Replace PERST# sleep time with proper macro
PCI: dw-rockchip: Replace PERST# sleep time with proper macro
PCI: host-common: Convert to library for host controller drivers
PCI/ERR: Remove misleading TODO regarding kernel panic
PCI: cadence: Remove duplicate message code definitions
PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding
PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding
PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding
PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding
PCI: cadence-ep: Correct PBA offset in .set_msix() callback
...
This reverts commit e436576b02.
That commit is very broken, and seems to have missed the fact that
CONFIG_ARM_SMMU_V3 is not just a yes-or-no thing, but also can be
modular.
So it caused build errors on arm64 allmodconfig setups:
ERROR: modpost: "arm_smmu_make_cdtable_ste" [drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.ko] undefined!
ERROR: modpost: "arm_smmu_make_s2_domain_ste" [drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.ko] undefined!
ERROR: modpost: "arm_smmu_make_s1_cd" [drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.ko] undefined!
...
(and six more symbols just the same).
Link: https://lore.kernel.org/all/CAHk-=wh4qRwm7AQ8sBmQj7qECzgAhj4r73RtCDfmHo5SdcN0Jw@mail.gmail.com/
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Rolf Eike Beer <eb@emlix.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Including:
- Core:
- Introduction of iommu-pages infrastructure to consolidate page-table
allocation code among hardware drivers. This is ground-work for more
generalization in the future.
- Remove IOMMU_DEV_FEAT_SVA and IOMMU_DEV_FEAT_IOPF feature flags.
- Convert virtio-iommu to domain_alloc_paging().
- KConfig cleanups.
- Some small fixes for possible overflows and race conditions.
- Intel VT-d driver:
- Restore WO permissions on second-level paging entries.
- Use ida to manage domain id.
- Miscellaneous cleanups.
- AMD-Vi:
- Make sure notifiers finish running before module unload.
- Add support for HTRangeIgnore feature.
- Allow matching ACPI HID devices without matching UIDs.
- ARM-SMMU:
- SMMUv2:
- Recognise the compatible string for SAR2130P MDSS in the Qualcomm
driver, as this device requires an identity domain.
- Fix Adreno stall handling so that GPU debugging is more robust and
doesn't e.g. result in deadlock.
- SMMUv3:
- Fix ->attach_dev() error reporting for unrecognised domains.
- IO-pgtable:
- Allow clients (notably, drivers that process requests from
userspace) to silence warnings when mapping an already-mapped IOVA.
- S390:
- Add support for additional table regions.
- Mediatek:
- Add support for MT6893 MM IOMMU.
- Some smaller fixes and improvements in various other drivers.
Merge tag 'iommu-updates-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core:
- Introduction of iommu-pages infrastructure to consolidate
page-table allocation code among hardware drivers. This is
ground-work for more generalization in the future
- Remove IOMMU_DEV_FEAT_SVA and IOMMU_DEV_FEAT_IOPF feature flags
- Convert virtio-iommu to domain_alloc_paging()
- KConfig cleanups
- Some small fixes for possible overflows and race conditions
Intel VT-d driver:
- Restore WO permissions on second-level paging entries
- Use ida to manage domain id
- Miscellaneous cleanups
AMD-Vi:
- Make sure notifiers finish running before module unload
- Add support for HTRangeIgnore feature
- Allow matching ACPI HID devices without matching UIDs
ARM-SMMU:
- SMMUv2:
- Recognise the compatible string for SAR2130P MDSS in the
Qualcomm driver, as this device requires an identity domain
- Fix Adreno stall handling so that GPU debugging is more robust
and doesn't e.g. result in deadlock
- SMMUv3:
- Fix ->attach_dev() error reporting for unrecognised domains
- IO-pgtable:
- Allow clients (notably, drivers that process requests from
userspace) to silence warnings when mapping an already-mapped
IOVA
S390:
- Add support for additional table regions
Mediatek:
- Add support for MT6893 MM IOMMU
And some smaller fixes and improvements in various other drivers"
* tag 'iommu-updates-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (75 commits)
iommu/vt-d: Restore context entry setup order for aliased devices
iommu/mediatek: Fix compatible typo for mediatek,mt6893-iommu-mm
iommu/arm-smmu-qcom: Make set_stall work when the device is on
iommu/arm-smmu: Move handing of RESUME to the context fault handler
iommu/arm-smmu-qcom: Enable threaded IRQ for Adreno SMMUv2/MMU500
iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()
iommu: Clear the freelist after iommu_put_pages_list()
iommu/vt-d: Change dmar_ats_supported() to return boolean
iommu/vt-d: Eliminate pci_physfn() in dmar_find_matched_satc_unit()
iommu/vt-d: Replace spin_lock with mutex to protect domain ida
iommu/vt-d: Use ida to manage domain id
iommu/vt-d: Restore WO permissions on second-level paging entries
iommu/amd: Allow matching ACPI HID devices without matching UIDs
iommu: make inclusion of arm/arm-smmu-v3 directory conditional
iommu: make inclusion of riscv directory conditional
iommu: make inclusion of amd directory conditional
iommu: make inclusion of intel directory conditional
iommu: remove duplicate selection of DMAR_TABLE
iommu/fsl_pamu: remove trailing space after \n
iommu/arm-smmu-qcom: Add SAR2130P MDSS compatible
...
- new two-step DMA mapping API, which is a first step on a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists; this new approach optimizes
some calls to DMA-IOMMU layer and cache maintenance by batching them,
reduces memory usage as it is no need to store mapped DMA addresses to
unmap them, and reduces some function call overhead; it is a combination
effort of many people, lead and developed by Christoph Hellwig and Leon
Romanovsky
Merge tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
"New two step DMA mapping API, which is is a first step to a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists.
This new approach optimizes some calls to DMA-IOMMU layer and cache
maintenance by batching them, reduces memory usage as it is no need to
store mapped DMA addresses to unmap them, and reduces some function
call overhead. It is a combination effort of many people, lead and
developed by Christoph Hellwig and Leon Romanovsky"
* tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
docs: core-api: document the IOVA-based API
dma-mapping: add a dma_need_unmap helper
dma-mapping: Implement link/unlink ranges API
iommu/dma: Factor out a iommu_dma_map_swiotlb helper
dma-mapping: Provide an interface to allow allocate IOVA
iommu: add kernel-doc for iommu_unmap_fast
iommu: generalize the batched sync after map interface
dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
PCI/P2PDMA: Refactor the p2pdma mapping helpers
Fix the "mediatek.mt6893-iommu-mm" compatible string typo, as the
dot was actually meant to be a comma: "mediatek,mt6893-iommu-mm".
Fixes: f6a1e89ab6 ("iommu/mediatek: Add support for Dimensity 1200 MT6893 MM IOMMU")
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20250521151548.185910-1-angelogioacchino.delregno@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Generally PASID support requires ACS settings that usually create
single device groups, but there are some niche cases where we can get
multi-device groups and still have working PASID support. The primary
issue is that PCI switches are not required to treat PASID tagged TLPs
specially so appropriate ACS settings are required to route all TLPs to
the host bridge if PASID is going to work properly.
pci_enable_pasid() does check that each device that will use PASID has
the proper ACS settings to achieve this routing.
However, no-PASID devices can be combined with PASID-capable devices
within the same topology using non-uniform ACS settings. In this case the
no-PASID devices may not have a strict route-to-host ACS configuration and
end up being grouped with the PASID devices.
This configuration fails to allow use of PASID within the iommu core
code, which wrongly checks whether the no-PASID device supports PASID.
Fix this by ignoring no-PASID devices during the PASID validation. They
will never issue a PASID TLP anyhow so they can be ignored.
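A hedged sketch of the validation change (the group-iteration helper and
the max_pasids accessor are assumptions about the iommu core internals):

```
/* Sketch: skip no-PASID devices during validation; they can never
 * issue a PASID-tagged TLP, so grouping them with PASID-capable
 * devices must not block PASID attach. */
for_each_group_device(group, device) {
	if (!device->dev->iommu->max_pasids)
		continue;	/* no-PASID device: ignore */
	if (pasid >= device->dev->iommu->max_pasids)
		return -EINVAL;
}
```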
Fixes: c404f55c26 ("iommu: Validate the PASID in iommu_attach_device_pasid()")
Cc: stable@vger.kernel.org
Signed-off-by: Tushar Dave <tdave@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250520011937.3230557-1-tdave@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Up until now we have only called the set_stall callback during
initialization when the device is off. But we will soon start calling it
to temporarily disable stall-on-fault when the device is on, so handle
that by checking if the device is on and writing SCTLR.
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
Link: https://lore.kernel.org/r/20250520-msm-gpu-fault-fixes-next-v8-3-fce6ee218787@gmail.com
[will: Fix "mixed declarations and code" warning from sparse]
Signed-off-by: Will Deacon <will@kernel.org>
The upper layer fault handler is now expected to handle everything
required to retry the transaction or dump state related to it, since we
enable threaded IRQs. This means that we can take charge of writing
RESUME, making sure that we always write it after writing FSR as
recommended by the specification.
The iommu fault handler should return -EAGAIN if a transaction needs to be
retried. This avoids tricky cross-tree changes in drm/msm, since it
never wants to retry the transaction and it already returns 0 from its
fault handler. Therefore it will continue to correctly terminate the
transaction without any changes required.
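A hedged sketch of the resulting sequence in the context fault handler
(register and constant names follow the arm-smmu driver's conventions but
are assumptions here, as is writing 0 for a retry):

```
/* Sketch: consult the upper layer, clear FSR first, then write RESUME
 * to retry (-EAGAIN) or terminate the stalled transaction. */
ret = report_iommu_fault(domain, dev, iova, flags);

arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);	/* clear first */
if (ret == -EAGAIN)
	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_RESUME, 0); /* retry */
else
	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_RESUME,
			  ARM_SMMU_RESUME_TERMINATE);
```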
devcoredumps from drm/msm will temporarily be broken until it is fixed
to collect devcoredumps inside its fault handler, but fixing that first
would actually be worse because MMU-500 ignores writes to RESUME unless
all fields of FSR (except SS of course) are clear and raises an
interrupt when only SS is asserted. Right now, things happen to work
most of the time if we collect a devcoredump, because RESUME is written
asynchronously in the fault worker after the fault handler clears FSR
and finishes, although there will be some spurious faults, but if this
is changed before this commit fixes the FSR/RESUME write order then SS
will never be cleared, the interrupt will never be cleared, and the
whole system will hang every time a fault happens. It will therefore
help bisectability if this commit goes first.
I've changed the TBU path to also accept -EAGAIN and do the same thing,
while keeping the old -EBUSY behavior. That said, the old path was broken,
because you'd get a storm of interrupts due to returning IRQ_NONE that
would eventually result in the interrupt being disabled, and I think it
was dead code anyway, so it should eventually be deleted. Note that
drm/msm never uses TBU, so this is untested.
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Link: https://lore.kernel.org/r/20250520-msm-gpu-fault-fixes-next-v8-2-fce6ee218787@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
The recommended flow for stall-on-fault in SMMUv2 is the following:
1. Resolve the fault.
2. Write to FSR to clear the fault bits.
3. Write RESUME to retry or fail the transaction.
MMU500 is designed with this sequence in mind. For example,
experimentally we have seen on MMU500 that writing RESUME does not clear
FSR.SS unless the original fault is cleared in FSR, so 2 must come
before 3. FSR.SS is allowed to signal a fault (and does on MMU500) so
that if we try to do 2 -> 1 -> 3 (while exiting from the fault handler
after 2) we can get duplicate faults without hacks to disable
interrupts.
However, resolving the fault typically requires lengthy operations that
can stall, like bringing in pages from disk. The only current user,
drm/msm, dumps GPU state before failing the transaction which indeed can
stall. Therefore, from now on we will require implementations that want
to use stall-on-fault to also enable threaded IRQs. Do that with the
Adreno MMU implementations.
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Link: https://lore.kernel.org/r/20250520-msm-gpu-fault-fixes-next-v8-1-fce6ee218787@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
In situations where the mapping/unmapping sequence can be controlled by
userspace, attempting to map over a region that has not yet been unmapped
is an error, but not something that should spam dmesg.
Now that there is a quirk, we can also drop the selftest_running
flag, and use the quirk instead for selftests.
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Rob Clark <robdclark@chromium.org>
Link: https://lore.kernel.org/r/20250519175348.11924-6-robdclark@gmail.com
[will: Rename quirk to IO_PGTABLE_QUIRK_NO_WARN per Robin's suggestion]
Signed-off-by: Will Deacon <will@kernel.org>
The commit below reworked iommu_put_pages_list() to not do list_del() on
every entry. This was done expecting all the callers to already re-init
the list, since per-item deletion is not efficient.
It was missed that fq_ring_free_locked() re-uses its list after calling
iommu_put_pages_list() and so the leftover list reaches free'd struct
pages and will crash or WARN/BUG/etc.
Reinit the list to empty in fq_ring_free_locked() after calling
iommu_put_pages_list().
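A hedged sketch of the fix (the flush-queue field names are assumptions):

```
/* Sketch: after handing the freelist to iommu_put_pages_list(), the
 * list head still references freed pages, so reset it before the
 * entry is reused. */
iommu_put_pages_list(&fq->entries[idx].freelist);
INIT_LIST_HEAD(&fq->entries[idx].freelist);
```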
Audit to see if any other callers of iommu_put_pages_list() need the list
to be empty:
- iommu_dma_free_fq_single() and iommu_dma_free_fq_percpu() immediately
frees the memory
- iommu_v1_map_pages(), v1_free_pgtable(), domain_exit(),
riscv_iommu_map_pages() uses a stack variable which goes out of scope
- intel_iommu_tlb_sync() uses a gather in a iotlb_sync() callback, the
caller re-inits the gather
Fixes: 13f43d7cf3 ("iommu/pages: Formalize the freelist API")
Reported-by: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/SJ1PR11MB61292CE72D7BE06B8810021CB997A@SJ1PR11MB6129.namprd11.prod.outlook.com
Tested-by: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-7d4dfa6140f7+11f04-iommu_freelist_init_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
According to "Function return values and names" in coding-style.rst, the
dmar_ats_supported() function should return a boolean instead of an
integer. Also, rename "ret" to "supported" to be more straightforward.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-3-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The function dmar_find_matched_satc_unit() contains a duplicate call to
pci_physfn(). This call is unnecessary as pci_physfn() has already been
invoked by the caller. Removing the redundant call simplifies the code
and improves efficiency a bit.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-2-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The domain ID allocator is currently protected by a spin_lock. However,
ida_alloc_range can potentially block if it needs to allocate memory to
grow its internal structures.
Replace the spin_lock with a mutex, which allows sleeping. Thus,
the memory allocation flags can be updated from GFP_ATOMIC to GFP_KERNEL
to allow blocking memory allocations if necessary.
Introduce a new mutex, did_lock, specifically for protecting the domain
ida. The existing spinlock will remain for protecting other intel_iommu
fields.
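A hedged sketch of the resulting allocation path (field names per the
description above; the ID range bounds are illustrative):

```
/* Sketch: under a mutex, GFP_KERNEL is legal, so ida_alloc_range()
 * may block to grow its internal structures. */
mutex_lock(&iommu->did_lock);
domain->id = ida_alloc_range(&iommu->domain_ida, min_did,
			     cap_ndoms(iommu->cap) - 1, GFP_KERNEL);
mutex_unlock(&iommu->did_lock);
```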
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Switch the intel iommu driver to use the ida mechanism for managing domain
IDs, replacing the previous fixed-size bitmap.
The previous approach allocated a bitmap large enough to cover the maximum
number of domain IDs supported by the hardware, regardless of the actual
number of domains in use. This led to unnecessary memory consumption,
especially on systems supporting a large number of iommu units but only
utilizing a small number of domain IDs.
The ida allocator dynamically manages the allocation and freeing of integer
IDs, only consuming memory for the IDs that are currently in use. This
significantly optimizes memory usage compared to the fixed-size bitmap.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
VT-D HW can do WO permissions on the second-stage but not the first-stage
page table formats. The commit eea53c5816 ("iommu/vt-d: Remove WO
permissions on second-level paging entries") wanted to make this uniform
for VT-D by disabling the support for WO permissions in the second-stage.
This isn't consistent with how other drivers work. Instead, if the
underlying HW can support WO, it should. For instance, AMD already
supports WO on its second stage (v1) format and not its first (v2).
If WO support needs to be discoverable it should be done through an
iommu_domain capability flag.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-c26553717e90+65f-iommu_vtd_ss_wo_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A BIOS upgrade has changed the IVRS DTE UID for a device such that it no
longer matches the UID in the SSDT. In this case there is only one ACPI
device on the system with that _HID, but the _UID does not match.
IVRS:
```
Subtable Type : F0 [Device Entry: ACPI HID Named Device]
Device ID : 0060
Data Setting (decoded below) : 40
INITPass : 0
EIntPass : 0
NMIPass : 0
Reserved : 0
System MGMT : 0
LINT0 Pass : 1
LINT1 Pass : 0
ACPI HID : "MSFT0201"
ACPI CID : 0000000000000000
UID Format : 02
UID Length : 09
UID : "\_SB.MHSP"
```
SSDT:
```
Device (MHSP)
{
Name (_ADR, Zero) // _ADR: Address
Name (_HID, "MSFT0201") // _HID: Hardware ID
Name (_UID, One) // _UID: Unique ID
```
To handle this case, while enumerating ACPI devices in
get_acpihid_device_id(), count the number of ACPI devices with a matching
_HID. If there is exactly one _HID match, then accept it even if the UID
doesn't match. Other operating systems allow this, but the current IVRS
spec leaves some ambiguity as to whether to allow or disallow it. This
should be clarified in future revisions of the spec. Output a 'Firmware
Bug' message for this case to encourage it to be solved in the BIOS.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250512173129.1274275-1-superm1@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Nothing in there is active if CONFIG_ARM_SMMU_V3 is not enabled, so the whole
directory can depend on that switch as well.
Fixes: e86d1aa8b6 ("iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/2434059.NG923GbCHz@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Nothing in there is active if CONFIG_RISCV_IOMMU is not enabled, so the whole
directory can depend on that switch as well.
Fixes: 5c0ebbd3c6 ("iommu/riscv: Add RISC-V IOMMU platform device driver")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/2235451.Icojqenx9y@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Nothing in there is active if CONFIG_AMD_IOMMU is not enabled, so the whole
directory can depend on that switch as well.
Fixes: cbe94c6e1a ("iommu/amd: Move Kconfig and Makefile bits down into amd directory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1894970.atdPhlSkOF@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Nothing in there is active if CONFIG_INTEL_IOMMU is not enabled, so the whole
directory can depend on that switch as well.
Fixes: ab65ba57e3 ("iommu/vt-d: Move Kconfig and Makefile bits down into intel directory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/3818749.MHq7AAxBmi@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is an extraneous space after \n in a pr_debug message. Remove it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430151853.923614-1-colin.i.king@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit 991de2e590 ("PCI, x86: Implement pcibios_alloc_irq() and
pcibios_free_irq()") changed IRQ handling on PCI driver probing.
It inadvertently broke resume from system sleep on AMD platforms:
https://lore.kernel.org/r/20150926164651.GA3640@pd.tnic/
This was fixed by two independent commits:
* 8affb487d4 ("x86/PCI: Don't alloc pcibios-irq when MSI is enabled")
* cbbc00be2c ("iommu/amd: Prevent binding other PCI drivers to IOMMU PCI devices")
The breaking change and one of these two fixes were subsequently reverted:
* fe25d07887 ("Revert "x86/PCI: Don't alloc pcibios-irq when MSI is enabled"")
* 6c777e8799 ("Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"")
This rendered the second fix unnecessary, so revert it as well. It used
the match_driver flag in struct pci_dev, which is internal to the PCI core
and not supposed to be touched by arbitrary drivers.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://patch.msgid.link/9a3ddff5cc49512044f963ba0904347bd404094d.1745572340.git.lukas@wunner.de
After commit 48e7b8e284 ("iommu/arm-smmu-v3: Remove
arm_smmu_domain_finalise() during attach"), an error code is not
returned on the attach path when the smmu does not match with the
domain. This causes problems with VFIO because
vfio_iommu_type1_attach_group() relies on this check to determine domain
compatibility.
Reinstate the -EINVAL return value when the SMMU doesn't match on the
device attach path.
Fixes: 48e7b8e284 ("iommu/arm-smmu-v3: Remove arm_smmu_domain_finalise() during attach")
Signed-off-by: Qinxin Xia <xiaqinxin@huawei.com>
Link: https://lore.kernel.org/r/20250422112951.2027969-1-xiaqinxin@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than DMA.
In the proposed API, callers will perform the following steps.
In map path:

	if (dma_can_use_iova(...))
		dma_iova_alloc()
		for (page in range)
			dma_iova_link_next(...)
		dma_iova_sync(...)
	else
		/* Fallback to legacy map pages */
		for (all pages)
			dma_map_page(...)

In unmap path:

	if (dma_can_use_iova(...))
		dma_iova_destroy()
	else
		for (all pages)
			dma_unmap_page(...)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Split the iommu logic from iommu_dma_map_page into a separate helper.
This not only keeps the code neatly separated, but will also allow for
reuse in another caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
The existing .map_pages() callback provides both allocating of IOVA
and linking DMA pages. That combination works great for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().
These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving range
linkage operation to be in fast path.
Provide an interface to allocate/deallocate the IOVA and, in the next
patch, link/unlink DMA ranges to that specific IOVA.
In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some recomputed information
for the transaction which does not change for each page being
mapped, so add a check if IOVA can be used for the specific
transaction.
The API is exported from dma-iommu as it is the only implementation
supported; the namespace is clearly different from the iommu_* functions,
which are not allowed to be used. This code layout allows us to save a
function call per API call used in the datapath, as well as a lot of
boilerplate code.
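As a loose illustration of where the state lives (a sketch only; the names
follow the pseudocode in the earlier patch description and the exact
upstream signatures may differ):
	struct my_dma_xfer {			/* hypothetical caller object */
		struct dma_iova_state state;	/* computed once per transaction */
	};
	/* fast path: one allocation, many links, one sync */
	if (dma_can_use_iova(&xfer->state)) {
		/* dma_iova_alloc(), then dma_iova_link_next() per page,
		 * then a single dma_iova_sync() */
	}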
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Add kernel-doc section for iommu_unmap_fast to document existing
limitation of underlying functions which can't split individual ranges.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
For the upcoming IOVA-based DMA API we want to batch the
ops->iotlb_sync_map() call after mapping multiple IOVAs from
dma-iommu without having a scatterlist. Improve the API.
Add a wrapper for the map_sync as iommu_sync_map() so that callers
don't need to poke into the methods directly.
Formalize __iommu_map() into iommu_map_nosync() which requires the
caller to call iommu_sync_map() after all maps are completed.
Refactor the existing sanity checks from all the different layers
into iommu_map_nosync().
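A minimal sketch of the batched pattern this enables, assuming
iommu_map_nosync() mirrors iommu_map()'s argument list:
	int ret;
	size_t i;
	for (i = 0; i != npages; i++) {
		ret = iommu_map_nosync(domain, iova + i * SZ_4K, paddrs[i],
				       SZ_4K, prot, GFP_KERNEL);
		if (ret)
			return ret;	/* caller unmaps what was mapped */
	}
	/* a single IOTLB sync covers the whole batch */
	return iommu_sync_map(domain, iova, npages * SZ_4K);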
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
To support the upcoming non-scatterlist mapping helpers, we need to go
back to have them called outside of the DMA API. Thus move them out of
dma-map-ops.h, which is only for DMA API implementations, to pci-p2pdma.h,
which is for driver use.
Note that the core helper is still not exported as the mapping is
expected to be done only by very highlevel subsystem code at least for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
The current scheme with a single helper to determine the P2P status
and map a scatterlist segment forces users to always use the map_sg
helper to DMA map, which we're trying to get away from because they
are very cache inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
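The resulting calling convention looks roughly like this (indicative names
only; see pci-p2pdma.h for the real ones):
	switch (pci_p2pdma_state(&p2p_state, dev, page)) {
	case PCI_P2PDMA_MAP_NONE:
		/* an ordinary page, use the normal DMA mapping path */
		break;
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/* P2P via the bus address, no IOMMU translation needed */
		dma_addr = pci_p2pdma_bus_addr_map(&p2p_state,
						   page_to_phys(page));
		break;
	default:
		/* P2P through the host bridge, or not supported */
		break;
	}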
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
AMD IOMMU reserves the address range 0xfd00000000-0xffffffffff for
the HyperTransport protocol (HT); this range has special meaning, hence
devices cannot use it for DMA. However, on some AMD platforms this HT
range is shifted to the very top of the address space, and a new feature
bit `HTRangeIgnore` is introduced. When this feature bit is on, the IOMMU
treats GPA accesses to the legacy HT range as regular GPA accesses.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250317055020.25214-1-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Synchronize RCU when unregistering KVM's GA log notifier to ensure all
in-flight interrupt handlers complete before the KVM module is unloaded.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315031048.2374109-1-seanjc@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
On a 32 bit system calling:
	iommu_map(0, 0x40000000)
when using the AMD V1 page table type with a domain->pgsize of 0xfffff000
causes iommu_pgsize() to miscalculate a result of:
	size=0x40000000 count=2
count should be 1. This completely corrupts the mapping process.
This is because the final test to adjust the pagesize malfunctions when
the addition overflows. Use check_add_overflow() to prevent this.
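The shape of the guard, sketched with hypothetical variable names rather
than the literal patch:
	unsigned long addr_end;
	/* check_add_overflow() returns true if the sum wraps */
	if (check_add_overflow(iova, size, &addr_end))
		return pgsize;	/* keep count == 1, don't merge upward */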
Fixes: b1d99dc5f9 ("iommu: Hook up '->unmap_pages' driver callback")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v1-3ad28fc2e3a3+163327-iommu_overflow_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Next up on our list of race windows to close is another one during
iommu_device_register() - it's now OK again for multiple instances to
run their bus_iommu_probe() in parallel, but an iommu_probe_device() can
still also race against a running bus_iommu_probe(). As Johan has
managed to prove, this has now become a lot more visible on DT platforms
with driver_async_probe where a client driver is attempting to probe in
parallel with its IOMMU driver - although commit b46064a188 ("iommu:
Handle race with default domain setup") resolves this from the client
driver's point of view, this isn't before of_iommu_configure() has had
the chance to attempt to "replay" a probe that the bus walk hasn't even
tried yet, and so still cause the out-of-order group allocation
behaviour that we're trying to clean up (and now warning about).
The most reliable thing to do here is to explicitly keep track of the
"iommu_device_register() is still running" state, so we can then
special-case the ops lookup for the replay path (based on dev->iommu
again) to let that think it's still waiting for the IOMMU driver to
appear at all. This still leaves the longstanding theoretical case of
iommu_bus_notifier() being triggered during bus_iommu_probe(), but it's
not so simple to defer a notifier, and nobody's ever reported that being
a visible issue, so let's quietly kick that can down the road for now...
Reported-by: Johan Hovold <johan@kernel.org>
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/88d54c1b48fed8279aa47d30f3d75173685bb26a.1745516488.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The idxd driver attaches the default domain to a PASID of the device to
perform kernel DMA using that PASID. The domain is attached to the
device's PASID through iommu_attach_device_pasid(), which checks if the
domain->owner matches the iommu_ops retrieved from the device. If they
do not match, it returns a failure.
	if (ops != domain->owner || pasid == IOMMU_NO_PASID)
		return -EINVAL;
The static identity domain implemented by the intel iommu driver doesn't
specify the domain owner. Therefore, kernel DMA with PASID doesn't work
for the idxd driver if the device translation mode is set to passthrough.
Generally the owner field of static domains is not set because they are
already part of the iommu ops. Add a helper domain_iommu_ops_compatible()
that checks if a domain is compatible with the device's iommu ops. This
helper explicitly allows the static blocked and identity domains associated
with the device's iommu_ops to be considered compatible.
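A sketch of what such a helper can look like (the upstream version may
differ in detail):
	static bool domain_iommu_ops_compatible(const struct iommu_ops *ops,
						struct iommu_domain *domain)
	{
		if (domain->owner == ops)
			return true;
		/* Static domains don't set owner; match them via the ops */
		if (domain == ops->blocked_domain ||
		    domain == ops->identity_domain)
			return true;
		return false;
	}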
Fixes: 2031c469f8 ("iommu/vt-d: Add support for static identity domain")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220031
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/linux-iommu/20250422191554.GC1213339@ziepe.ca/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250424034123.2311362-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
In general a 'struct device' is way too large to be put on the kernel
stack. Apparently something just caused it to grow slightly larger,
which pushed the arm_lpae_do_selftests() function over the warning
limit in some configurations:
drivers/iommu/io-pgtable-arm.c:1423:19: error: stack frame size (1032) exceeds limit (1024) in 'arm_lpae_do_selftests' [-Werror,-Wframe-larger-than]
1423 | static int __init arm_lpae_do_selftests(void)
| ^
Change the function to use a dynamically allocated faux_device
instead of the on-stack device structure.
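A minimal sketch, assuming the driver-core faux device API (device name
chosen for illustration):
	struct faux_device *fdev;
	fdev = faux_device_create("io-pgtable-selftest", NULL, NULL);
	if (!fdev)
		return -ENOMEM;
	/* use &fdev->dev wherever the on-stack struct device was used */
	faux_device_destroy(fdev);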
Fixes: ca25ec247a ("iommu/io-pgtable-arm: Remove iommu_dev==NULL special case")
Link: https://lore.kernel.org/all/ab75a444-22a1-47f5-b3c0-253660395b5a@arm.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250423164826.2931382-1-arnd@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
iommu_device_sysfs_add() requires a constant format string, otherwise
a W=1 build produces a warning:
drivers/iommu/ipmmu-vmsa.c:1093:62: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security]
1093 | ret = iommu_device_sysfs_add(&mmu->iommu, &pdev->dev, NULL, dev_name(&pdev->dev));
| ^~~~~~~~~~~~~~~~~~~~
drivers/iommu/ipmmu-vmsa.c:1093:62: note: treat the string as an argument to avoid this
1093 | ret = iommu_device_sysfs_add(&mmu->iommu, &pdev->dev, NULL, dev_name(&pdev->dev));
| ^
| "%s",
This was an old bug but I saw it now because the code was changed as part
of commit d9d3cede41 ("iommu/ipmmu-vmsa: Register in a sensible order").
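In other words, the fix is simply to pass the name as an argument to a
constant "%s" format:
	ret = iommu_device_sysfs_add(&mmu->iommu, &pdev->dev, NULL,
				     "%s", dev_name(&pdev->dev));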
Fixes: 7af9a5fdb9 ("iommu/ipmmu-vmsa: Use iommu_device_sysfs_add()/remove()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250423164006.2661372-1-arnd@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
fsl_pamu is the last user of domain_alloc(), and it is using it to create
something weird that doesn't really fit into the iommu subsystem
architecture. It is not a paging domain since it doesn't have any
map/unmap ops. It may be some special kind of identity domain.
For now just leave it as is. Wrap its definition in CONFIG_FSL_PAMU to
discourage any new drivers from attempting to use it.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/5-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
No driver implements SVA under domain_alloc() anymore, this is dead
code.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/4-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
virtio has the complication that it sometimes wants to return a paging
domain for IDENTITY which makes this conversion a little different than
other drivers.
Add a viommu_domain_alloc_paging() that combines viommu_domain_alloc() and
viommu_domain_finalise() to always return a fully initialized and
finalized paging domain.
Use viommu_domain_alloc_identity() to implement the special non-bypass
IDENTITY flow by calling viommu_domain_alloc_paging() then
viommu_domain_map_identity().
Remove support for deferred finalize and the vdomain->mutex.
Remove core support for domain_alloc() IDENTITY as virtio was the last
driver using it.
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/3-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
virtio-iommu has a mode where the IDENTITY domain is actually a paging
domain with a manually created identity mapping covering some of the
system address space.
To support this add a new domain_alloc_identity() op that accepts
the struct device so that virtio can allocate and fully finalize a
paging domain to return.
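Roughly, the new op is a per-device variant of identity domain allocation
(a sketch of the struct iommu_ops member described above):
	struct iommu_domain *(*domain_alloc_identity)(struct device *dev);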
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/2-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
To make way for a domain_alloc_paging conversion add the typical global
static IDENTITY domain. This supports VMMs that have a
VIRTIO_IOMMU_F_BYPASS_CONFIG config.
If the VMM does not have support then the domain_alloc path is still used,
which creates an IDENTITY domain out of a paging domain.
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/1-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
No external drivers use these interfaces anymore. Furthermore, no existing
iommu drivers implement anything in the callbacks. Remove them to avoid
dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iopf enablement has been moved to the iommu drivers. It is unnecessary
for iommufd to handle iopf enablement. Remove the iopf enablement logic to
avoid duplication.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Update iopf enablement in the iommufd mock device driver to use the new
method, similar to the arm-smmu-v3 driver. Enable iopf support when any
domain with an iopf_handler is attached, and disable it when the domain
is removed.
Add a refcount in the mock device state structure to keep track of the
number of domains set to the device and PASIDs that require iopf.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Update iopf enablement in the driver to use the new method, similar to
the arm-smmu-v3 driver. Enable iopf support when any domain with an
iopf_handler is attached, and disable it when the domain is removed.
Place all the logic for controlling the PRI and iopf queue in the domain
set/remove/replace paths. Keep track of the number of domains set to the
device and PASIDs that require iopf. When the first domain requiring iopf
is attached, add the device to the iopf queue and enable PRI. When the
last domain is removed, remove it from the iopf queue and disable PRI.
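The refcounting pattern, sketched with hypothetical field names:
	/* attach: first iopf-capable domain on this device */
	if (info->iopf_refcount++ == 0) {
		iopf_queue_add_device(iommu->iopf_queue, dev);
		pci_enable_pri(to_pci_dev(dev), prq_depth);
	}
	/* detach: last iopf-capable domain went away */
	if (--info->iopf_refcount == 0) {
		pci_disable_pri(to_pci_dev(dev));
		iopf_queue_remove_device(iommu->iopf_queue, dev);
	}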
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
None of the drivers implement anything here anymore, remove the dead code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
SMMUv3 co-mingles FEAT_IOPF and FEAT_SVA behaviors so that fault reporting
doesn't work unless both are enabled. This is not correct and causes
problems for iommufd which does not enable FEAT_SVA for its fault capable
domains.
These APIs are both obsolete, update SMMUv3 to use the new method like AMD
implements.
A driver should enable iopf support when a domain with an iopf_handler is
attached, and disable iopf support when the domain is removed.
Move the fault support logic to sva domain allocation and to domain
attach, refusing to create or attach fault capable domains if the HW
doesn't support it.
Move all the logic for controlling the iopf queue under
arm_smmu_attach_prepare(). Keep track of the number of domains on the
master (over all the SSIDs) that require iopf. When the first domain
requiring iopf is attached create the iopf queue, when the last domain is
detached destroy it.
Turn FEAT_IOPF and FEAT_SVA into no ops.
Remove the sva_lock, this is all protected by the group mutex.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
On the Lenovo ThinkPad X201, when Intel VT-d is enabled in the BIOS, the
kernel boots with errors related to DMAR, the graphical interface appeared
quite choppy, and the system resets erratically within a minute after it
booted:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xb97ff000
[fault reason 0x05] PTE Write access is not set
Upon comparing boot logs with VT-d on/off, I found that the Intel Calpella
quirk (`quirk_calpella_no_shadow_gtt()') correctly applied the igfx IOMMU
disable/quirk:
pci 0000:00:00.0: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU
for graphics
Whereas with VT-d on, it went into the "else" branch, which then
triggered the DMAR handling fault above:
	... else if (!disable_igfx_iommu) {
		/* we have to ensure the gfx device is idle before we flush */
		pci_info(dev, "Disabling batched IOTLB flush on Ironlake\n");
		iommu_set_dma_strict();
	}
Now, this is not exactly scientific, but moving 0x0044 to quirk_iommu_igfx
seems to have fixed the aforementioned issue. Running a few `git blame'
runs on the function, I have found that the quirk was originally
introduced as a fix specific to ThinkPad X201:
commit 9eecabcb9a ("intel-iommu: Abort IOMMU setup for igfx if BIOS gave
no shadow GTT space")
Which was later revised twice to the "else" branch we saw above:
- 2011: commit 6fbcfb3e46 ("intel-iommu: Workaround IOTLB hang on
Ironlake GPU")
- 2024: commit ba00196ca4 ("iommu/vt-d: Decouple igfx_off from graphic
identity mapping")
I'm uncertain whether further testing on this particular laptop was
done in 2011 and (honestly I'm not sure) 2024, but I would be happy to do
some distro-specific testing if that's what would be required to verify
this patch.
P.S., I also see IDs 0x0040, 0x0062, and 0x006a listed under the same
`quirk_calpella_no_shadow_gtt()' quirk, but I'm not sure how similar these
chipsets are (if they share the same issue with VT-d or even, indeed, if
this issue is specific to a bug in the Lenovo BIOS). With regards to
0x0062, it seems to be a Centrino wireless card, but not a chipset?
I have also listed a couple (distro and kernel) bug reports below as
references (some of them are from 7-8 years ago!), as they seem to be
similar issue found on different Westmere/Ironlake, Haswell, and Broadwell
hardware setups.
Cc: stable@vger.kernel.org
Fixes: 6fbcfb3e46 ("intel-iommu: Workaround IOTLB hang on Ironlake GPU")
Fixes: ba00196ca4 ("iommu/vt-d: Decouple igfx_off from graphic identity mapping")
Link: https://groups.google.com/g/qubes-users/c/4NP4goUds2c?pli=1
Link: https://bugs.archlinux.org/task/65362
Link: https://bbs.archlinux.org/viewtopic.php?id=230323
Reported-by: Wenhao Sun <weiguangtwk@outlook.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=197029
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Link: https://lore.kernel.org/r/20250415133330.12528-1-jeffbai@aosc.io
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
enabled, as KVM shouldn't try to enable posting when they're unsupported,
and the IOMMU driver darn well should only advertise posting support when
AMD_IOMMU_GUEST_IR_VAPIC() is true.
Note, KVM consumes is_guest_mode only on success.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
invoked without use_vapic; lying to KVM about whether or not the IRTE was
configured to post IRQs is all kinds of bad.
Fixes: d98de49a53 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend the aperture calculation to consider sizes beyond the maximum
size of a region third table. Attempt to always use the smallest
table size possible to avoid unnecessary extra steps during translation.
Update reserved region calculations to use the appropriate table size.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-6-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Map and unmap ops use the shared dma_walk_cpu_trans routine; update it to
use the origin_type of the dma_table to determine how many table levels
must be walked.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-5-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The origin_type of the dma_table is used to determine how many table
levels must be traversed for the translation.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-4-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Extend the existing dma_cleanup_tables to also handle region second and
region first tables.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-3-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When registering the I/O Translation Anchor, use the current table type
stored in the s390_domain to set the appropriate region type
indication. For the moment, the table type will always be stored as
region third.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250411202433.181683-2-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add support for the two MM IOMMUs found in the MediaTek Dimensity
1200 (MT6893) SoC, used for display, camera, imgsys and vpu.
Since the MT6893 SoC is almost fully compatible with MT8192, also
move the mt8192_larb_region_msk before the newly introduced mt6893
platform data, as the larb regions from mt8192 are fully reused.
Note that the only difference with MT8192 is that MT6893 needs the
SHARE_PGTABLE flag.
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Link: https://lore.kernel.org/r/20250410144008.475888-3-angelogioacchino.delregno@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There is a string parsing logic error which can lead to an overflow of hid
or uid buffers. Comparing ACPIID_LEN against a total string length doesn't
take into account the lengths of individual hid and uid buffers so the
check is insufficient in some cases. For example if the length of hid
string is 4 and the length of the uid string is 260, the length of str
will be equal to ACPIID_LEN + 1 but the uid string will overflow the uid
buffer, whose size is 256.
The same applies to the hid string with length 13 and uid string with
length 250.
Check the length of hid and uid strings separately to prevent
buffer overflow.
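The shape of the fix, sketched with hypothetical names:
	/* validate each component against its own buffer before copying */
	if (strlen(hid_str) >= sizeof(hid) || strlen(uid_str) >= sizeof(uid))
		return -EINVAL;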
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: ca3bf5d47c ("iommu/amd: Introduces ivrs_acpihid kernel parameter")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Paklov <Pavel.Paklov@cyberprotect.ru>
Link: https://lore.kernel.org/r/20250325092259.392844-1-Pavel.Paklov@cyberprotect.ru
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are quite a lot of options for the Arm drivers, still all buried
in the top-level Kconfig. For ease of use and consistency with all the
other subdirectories, break these out into drivers/arm. For similar
clarity and self-consistency, also tweak the ARM_SMMU sub-options to use
"if" instead of "depends", to match ARM_SMMU_V3. Lastly also clean up
the slightly messy description of ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT as
highlighted by Geert - by now we really shouldn't need commentary on
v4.x kernel behaviour anyway - and downgrade it to EXPERT as the first
step in the 6-year-old threat to remove it entirely.
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/a614ec86ba78c09cd16e348f633f6bb38793391f.1742480488.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Although the lock-juggling is only a temporary workaround, we don't want
it to make things avoidably worse. Jason was right to be nervous, since
bus_iommu_probe() doesn't care *which* IOMMU instance it's probing for,
so it probably is possible for one walk to finish a probe which a
different walk started, thus we do want to check for that.
Also there's no need to drop the lock just to have of_iommu_configure()
do nothing when a fwspec already exists; check that directly and avoid
opening a window at all in that (still somewhat likely) case.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/09d901ad11b3a410fbb6e27f7d04ad4609c3fe4a.1741706365.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Intel is the only thing that uses this now, convert to the size versions,
trying to avoid PAGE_SHIFT.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/23-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Use iommu_alloc_pages_node_sz() instead.
AMD and Intel are both using 4K pages for these structures since those
drivers only work on 4K PAGE_SIZE.
riscv is also spec'd to use SZ_4K.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A few small changes to the remaining drivers using these will allow
them to be removed:
- Exynos wants to allocate fixed 16K/8K allocations
- Rockchip already has a define SPAGE_SIZE which is used by the
dma_map immediately following, using SPAGE_ORDER which is a lg2size
- tegra has size constants already for its two allocations
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Convert most of the places calling get_order() as an argument to the
iommu-pages allocator into order_base_2() or the _sz flavour
instead. These places already have an exact size, there is no particular
reason to use order here.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
One part of RISCV already has a computed size, however the queue
allocation must be aligned to 4k. The other objects are 4k by spec.
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/18-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If x >= PAGE_SIZE then:
	1 << (get_order(x) + PAGE_SHIFT) == roundup_pow_of_two(x)
Inline this into the only caller, compute the size of the HW device table
in terms of 4K pages which matches the HW definition.
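For example, with 4K pages (PAGE_SHIFT == 12) and x == 0x3000:
get_order(0x3000) == 2, so 1 << (2 + 12) == 0x4000, which equals
roundup_pow_of_two(0x3000).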
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/17-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This is just CPU memory used by the driver to track things; it doesn't
need to use iommu-pages. All of them are indexed by devid, and devid is
bounded by pci_seg->last_bdf or we are already out of bounds on the page
allocation.
Switch them to use some version of kvmalloc_array() and drop the now
unused constants and remove the tbl_size() round up to PAGE_SIZE multiples
logic.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/16-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Generally drivers have a specific idea what their HW structure size should
be. In a lot of cases this is related to PAGE_SIZE, but not always. ARM64,
for example, allows a 4K IO page table size on a 64K CPU page table
system.
Currently we don't have any good support for sub page allocations, but
make the API accommodate this by accepting a sub page size from the caller
and rounding up internally.
This is done by moving away from order as the size input and using size:
size == 1 << (order + PAGE_SHIFT)
Following patches convert drivers away from using order and try to specify
allocation sizes independent of PAGE_SIZE.
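For instance, an allocation that used to be expressed as an order
(illustrative only):
	table = iommu_alloc_pages_node(nid, GFP_KERNEL, get_order(SZ_4K));
can instead state its size in bytes directly:
	table = iommu_alloc_pages_node_sz(nid, GFP_KERNEL, SZ_4K);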
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The entire allocator API is built around using the kernel virtual address,
it is illegal to pass GFP_HIGHMEM in as a GFP flag. Block it in the common
code. Remove the duplicated checks from drivers.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This brings the iommu page table allocator into the modern world of having
its own private page descriptor and not re-using fields from struct page
for its own purpose. It follows the basic pattern of struct ptdesc which
did this transformation for the CPU page table allocator.
Currently iommu-pages is pretty basic so this isn't a huge benefit,
however I see a coming need for features that CPU allocator has, like sub
PAGE_SIZE allocations, and RCU freeing. This provides the base
infrastructure to implement those cleanly.
Remove numa_node_id() calls from the inlines and instead use NUMA_NO_NODE
which will get switched to numa_mem_id(), which seems to be the right ID
to use for memory allocations.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This converts the remaining places using list of pages to the new API.
The Intel free path was shared with its gather path, so it is converted at
the same time.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Change the internal freelist to use struct iommu_pages_list.
AMD uses the freelist to batch free the entire table during domain
destruction, and to replace table levels with leafs during map.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Change the internal freelist to use struct iommu_pages_list.
riscv uses this page list to free page table levels that are replaced
with leaf ptes.
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We want to get rid of struct page references outside the internal
allocator implementation. The free list has the driver open code something
like:
	list_add_tail(&virt_to_page(ptr)->lru, freelist);
Move the above into a small inline and make the freelist into a wrapper
type 'struct iommu_pages_list' so that the compiler can help check all the
conversion.
This struct has also proven helpful in some future ideas to convert to a
singly linked list to get an extra pointer in the struct page, and to
signal that the pages should be freed with RCU.
Use a temporary _Generic so we don't need to rename the free function as
the patches progress.
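A sketch of the wrapper type and helper (close to, though not necessarily
identical with, the final form):
	struct iommu_pages_list {
		struct list_head pages;
	};
	static inline void iommu_pages_list_add(struct iommu_pages_list *list,
						void *virt)
	{
		list_add_tail(&virt_to_page(virt)->lru, &list->pages);
	}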
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
These are called in a lot of places and are not trivial. Move them to the
core module.
Tidy some of the comments and function arguments, fold
__iommu_alloc_account() into its only caller, change
__iommu_free_account() into __iommu_free_page() to remove some
duplication.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Now that we have a folio under the allocation, iommu_free_pages() can know
the order of the original allocation and do the correct thing to free it.
The next patch will rename iommu_free_page() to iommu_free_pages() so we
have naming consistency with iommu_alloc_pages_node().
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
alloc_pages_node(, order) needs to be paired with __free_pages(, order) to
free all the allocated pages. For order != 0 the return from
alloc_pages_node() is just a page list, it hasn't been formed into a
folio.
However iommu_put_pages_list() just calls put_page() on the head page of
an allocation, which will end up leaking the tail pages if order != 0.
Fix this by using __GFP_COMP to create a high order folio and then always
use put_page() to free the full high order folio.
__iommu_free_account() can get the order of the allocation via
folio_order(), which corrects the accounting of high order allocations in
iommu_put_pages_list(). This is the same technique slub uses.
As far as I can tell, none of the places using high order allocations are
also using the free list, so this is not a current bug.
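The core of the fix, sketched:
	struct page *page;
	/* __GFP_COMP makes a high order allocation a compound page (folio) */
	page = alloc_pages_node(nid, gfp | __GFP_COMP, order);
	if (!page)
		return NULL;
	/* later: a single put_page() frees the whole high order folio,
	 * and folio_order() recovers the order for accounting */
	put_page(page);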
Fixes: 06c375053c ("iommu/vt-d: add wrapper functions for page allocations")
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
These were only used by tegra-smmu and leaked the struct page out of the
API. Delete them since tegra-smmu has been converted to the other APIs.
In the process flatten the call tree so we have fewer one-line functions
calling other one-line functions. iommu_alloc_pages_node() is the real
allocator and everything else can just call it directly.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Instead use the virtual address and dma_map_single() like as->pd
uses. Introduce a small struct tegra_pt instead of void * to have some
clarity about what is using this API and add compile safety during the
conversion.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Instead use the virtual address. Change from dma_map_page() to
dma_map_single() which works directly on a KVA. Add a type for the pd
table level for clarity.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We've never supported StreamID aliasing between devices, and as such
they will never have had functioning DMA, but this is not fatal to the
SMMU itself. Although aliasing between hard-wired platform device
StreamIDs would tend to raise questions about the whole system, in
practice it's far more likely to occur relatively innocently due to
legacy PCI bridges, where the underlying StreamID mappings are still
perfectly reasonable.
As such, return a more benign -ENODEV when failing probe for such an
unsupported device (and log a more obvious error message), so that it
doesn't break the entire SMMU probe now that bus_iommu_probe() runs in
the right order and can propagate that error back. The end result is
still that the device doesn't get an IOMMU group and probably won't
work, same as before.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/39d54e49c8476efc4653e352150d44b185d6d50f.1744380554.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
ASPEED VGA card has two built-in devices:
0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
Its topology looks like this:
+-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0 Sandisk Corp Device 5017
| +-01.0-[04]--
| +-02.0-[05]----00.0 NVIDIA Corporation Device
| +-03.0-[06-07]----00.0-[07]----00.0 ASPEED Technology, Inc. ASPEED Graphics Family
| +-04.0-[08]----00.0 Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
| \-05.0-[09]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
\-00.1 PMC-Sierra Inc. Device 4028
The IORT logic populates two identical IDs into the fwspec->ids array via
DMA aliasing in iort_pci_iommu_init() called by pci_for_each_dma_alias().
Though the SMMU driver had been able to handle this situation since commit
563b5cbe33 ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"), that
got broken by the later commit cdf315f907 ("iommu/arm-smmu-v3: Maintain
a SID->device structure"), which ended up with allocating separate streams
with the same stuffing.
On a kernel prior to v6.15-rc1, there has been an overlooked warning:
pci 0008:07:00.0: vgaarb: setting as boot VGA device
pci 0008:07:00.0: vgaarb: bridge control possible
pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
pcieport 0008:06:00.0: Adding to iommu group 14
ast 0008:07:00.0: stream 67328 already in tree <===== WARNING
ast 0008:07:00.0: enabling device (0002 -> 0003)
ast 0008:07:00.0: Using default configuration
ast 0008:07:00.0: AST 2600 detected
ast 0008:07:00.0: [drm] Using analog VGA
ast 0008:07:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[drm] Initialized ast 0.1.0 for 0008:07:00.0 on minor 0
ast 0008:07:00.0: [drm] fb0: astdrmfb frame buffer device
With v6.15-rc, since the commit bcb81ac6ae ("iommu: Get DT/ACPI parsing
into the proper probe path"), the error returned with the warning is moved
to the SMMU device probe flow:
arm_smmu_probe_device+0x15c/0x4c0
__iommu_probe_device+0x150/0x4f8
probe_iommu_group+0x44/0x80
bus_for_each_dev+0x7c/0x100
bus_iommu_probe+0x48/0x1a8
iommu_device_register+0xb8/0x178
arm_smmu_device_probe+0x1350/0x1db0
which then fails the entire SMMU driver probe:
pci 0008:06:00.0: Adding to iommu group 21
pci 0008:07:00.0: stream 67328 already in tree
arm-smmu-v3 arm-smmu-v3.9.auto: Failed to register iommu
arm-smmu-v3 arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22
Since the SMMU driver had already been expecting a potential duplicated Stream
ID in arm_smmu_install_ste_for_dev(), change the arm_smmu_insert_master()
routine to ignore a duplicated ID from the fwspec->sids array as well.
Note: this has been failing iommu_device_probe() since 2021, although the
recent iommu commit in v6.15-rc1 that moves iommu_device_probe() made it
start failing the entire SMMU driver probe. Since nobody has cared about
DMA Alias support, leave that as it was but fix the fundamental
iommu_device_probe() breakage.
Fixes: cdf315f907 ("iommu/arm-smmu-v3: Maintain a SID->device structure")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250415185620.504299-1-nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
UBSan caught a bug with IOMMU SVA domains, where the reported exponent
value in __arm_smmu_tlb_inv_range() was >= 64.
__arm_smmu_tlb_inv_range() uses the domain's pgsize_bitmap to compute
the number of pages to invalidate and the invalidation range. Currently
arm_smmu_sva_domain_alloc() does not set up the iommu domain's
pgsize_bitmap. This leads to __ffs() on the value returning 64, which
leads to undefined behaviour w.r.t. shift operations.
Fix this by initializing the iommu_domain's pgsize_bitmap to PAGE_SIZE.
Effectively the code needs to use the smallest page size for
invalidation.
Cc: stable@vger.kernel.org
Fixes: eb6c97647b ("iommu/arm-smmu-v3: Avoid constructing invalid range commands")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250412002354.3071449-1-balbirs@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Commit 67e4fe3985 ("iommu/arm-smmu-v3: Use S2FWB for NESTED domains")
introduced S2FWB usage but omitted the corresponding feature detection.
As a result, vIOMMU allocation fails on FVP in arm_vsmmu_alloc(), due to
the following check:
	if (!arm_smmu_master_canwbs(master) &&
	    !(smmu->features & ARM_SMMU_FEAT_S2FWB))
		return ERR_PTR(-EOPNOTSUPP);
This patch adds the missing detection logic to prevent allocation
failure when S2FWB is supported.
Fixes: 67e4fe3985 ("iommu/arm-smmu-v3: Use S2FWB for NESTED domains")
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250408033351.1012411-1-aneesh.kumar@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Two WARNINGs are observed when SMMU driver rolls back upon failure:
arm-smmu-v3.9.auto: Failed to register iommu
arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22
------------[ cut here ]------------
WARNING: CPU: 5 PID: 1 at kernel/dma/mapping.c:74 dmam_free_coherent+0xc0/0xd8
Call trace:
dmam_free_coherent+0xc0/0xd8 (P)
tegra241_vintf_free_lvcmdq+0x74/0x188
tegra241_cmdqv_remove_vintf+0x60/0x148
tegra241_cmdqv_remove+0x48/0xc8
arm_smmu_impl_remove+0x28/0x60
devm_action_release+0x1c/0x40
------------[ cut here ]------------
128 pages are still in use!
WARNING: CPU: 16 PID: 1 at mm/page_alloc.c:6902 free_contig_range+0x18c/0x1c8
Call trace:
free_contig_range+0x18c/0x1c8 (P)
cma_release+0x154/0x2f0
dma_free_contiguous+0x38/0xa0
dma_direct_free+0x10c/0x248
dma_free_attrs+0x100/0x290
dmam_free_coherent+0x78/0xd8
tegra241_vintf_free_lvcmdq+0x74/0x160
tegra241_cmdqv_remove+0x98/0x198
arm_smmu_impl_remove+0x28/0x60
devm_action_release+0x1c/0x40
This is because the LVCMDQ queue memory is managed by devres, while
dmam_free_coherent() is called in the context of devm_action_release().
Jason pointed out that "arm_smmu_impl_probe() has mis-ordered the devres
callbacks if ops->device_remove() is going to be manually freeing things
that probe allocated":
https://lore.kernel.org/linux-iommu/20250407174408.GB1722458@nvidia.com/
In fact, tegra241_cmdqv_init_structures() only allocates memory resources
which means any failure that it generates would be similar to -ENOMEM, so
there is no point in having that "falling back to standard SMMU" routine,
as the standard SMMU would likely fail to allocate memory too.
Remove the unwind part in tegra241_cmdqv_init_structures(), and return a
proper error code to ask SMMU driver to call tegra241_cmdqv_remove() via
impl_ops->device_remove(). Then, drop tegra241_vintf_free_lvcmdq() since
devres will take care of that.
Fixes: 483e0bd888 ("iommu/tegra241-cmdqv: Do not allocate vcmdq until dma_set_mask_and_coherent")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250407201908.172225-1-nicolinc@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently, mtk_iommu calls iommu_device_register during probe before
the hw_list from driver data is initialized. Since the iommu probing issue
fix, this leads to a NULL pointer dereference in mtk_iommu_device_group
when hw_list is accessed with list_first_entry (which is not NULL-safe).
So, change the call order to ensure iommu_device_register is called
after the driver data are initialized.
Fixes: 9e3a2a6436 ("iommu/mediatek: Adapt sharing and non-sharing pgtable case")
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org> # MT8183 Juniper, MT8186 Tentacruel
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Link: https://lore.kernel.org/r/20250403-fix-mtk-iommu-error-v2-1-fe8b18f8b0a8@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe
path") changed the sequence of probing the SYSMMU controller devices and
calls to arm_iommu_attach_device(), which results in resuming the SYSMMU
controller earlier, when it is still set to IDENTITY mapping. This change
revealed a bug in the IDENTITY handling in the exynos-iommu driver. When
SYSMMU controller is set to IDENTITY mapping, data->domain is NULL, so
adjust checks in suspend & resume callbacks to handle this case
correctly.
Fixes: b3d14960e6 ("iommu/exynos: Implement an IDENTITY domain")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250401202731.2810474-1-m.szyprowski@samsung.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
IPMMU registers almost-initialised instances, but misses assigning the
drvdata to make them fully functional, so initial calls back into
ipmmu_probe_device() are likely to fail unnecessarily. Reorder this to
work as it should, also pruning the long-out-of-date comment and adding
the missing sysfs cleanup on error for good measure.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/53be6667544de65a15415b699e38a9a965692e45.1742481687.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If iommu_device_register() encounters an error, it can end up tearing
down already-configured groups and default domains, however this
currently still leaves devices hooked up to iommu-dma (and even
historically the behaviour in this area was at best inconsistent across
architectures/drivers...) Although in the case that an IOMMU is present
whose driver has failed to probe, users cannot necessarily expect DMA to
work anyway, it's still arguable that we should do our best to put
things back as if the IOMMU driver was never there at all, and certainly
the potential for crashing in iommu-dma itself is undesirable. Make sure
we clean up the dev->dma_iommu flag along with everything else.
Reported-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Closes: https://lore.kernel.org/all/CAGXv+5HJpTYmQ2h-GD7GjyeYT7bL9EBCvu0mz5LgpzJZtzfW0w@mail.gmail.com/
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/e788aa927f6d827dd4ea1ed608fada79f2bab030.1744284228.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Do not touch per-device DMA ops when the driver has been converted to use
the dma-iommu API.
Fixes: c588072bba ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://lore.kernel.org/r/20250403165605.278541-1-ptesarik@suse.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Set the posted MSI irq_chip's irq_ack() hook to irq_move_irq() instead of
a dummy/empty callback so that posted MSIs process pending changes to the
IRQ's SMP affinity. Failure to honor a pending set-affinity results in
userspace being unable to change the effective affinity of the IRQ, as
IRQD_SETAFFINITY_PENDING is never cleared and so irq_set_affinity_locked()
always defers moving the IRQ.
The issue is most easily reproducible by setting /proc/irq/xx/smp_affinity
multiple times in quick succession, as only the first update is likely to
be handled in process context.
Fixes: ed1e48ea43 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: Robert Lippert <rlippert@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Wentao Yang <wentaoyang@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250321194249.1217961-1-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The following crash is observed while handling an IOMMU fault with a
recent kernel:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle page fault for address: ffff8c708299f700
PGD 19ee01067 P4D 19ee01067 PUD 101c10063 PMD 80000001028001e3
Oops: Oops: 0011 [#1] SMP NOPTI
CPU: 4 UID: 0 PID: 139 Comm: irq/25-AMD-Vi Not tainted 6.15.0-rc1+ #20 PREEMPT(lazy)
Hardware name: LENOVO 21D0/LNVNB161216, BIOS J6CN50WW 09/27/2024
RIP: 0010:0xffff8c708299f700
Call Trace:
<TASK>
? report_iommu_fault+0x78/0xd3
? amd_iommu_report_page_fault+0x91/0x150
? amd_iommu_int_thread+0x77/0x180
? __pfx_irq_thread_fn+0x10/0x10
? irq_thread_fn+0x23/0x60
? irq_thread+0xf9/0x1e0
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
? kthread+0xfc/0x240
? __pfx_kthread+0x10/0x10
? ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
? ret_from_fork_asm+0x1a/0x30
</TASK>
report_iommu_fault() checks for an installed handler comparing the
corresponding field to NULL. It can (and could before) be called for a
domain with a different cookie type - IOMMU_COOKIE_DMA_IOVA, specifically.
The cookie is represented as a union, so we may end up treating a garbage
value as a handler if this happens for a domain with another cookie type.
Formerly there were two exclusive cookie types in the union.
IOMMU_DOMAIN_SVA has a dedicated iommu_report_device_fault().
Call the fault handler only if the passed domain has a required cookie
type.
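The fix therefore boils down to checking the cookie type before
dereferencing the handler, roughly:
	/* in report_iommu_fault() (sketch) */
	if (domain->cookie_type == IOMMU_COOKIE_FAULT_HANDLER &&
	    domain->handler)
		ret = domain->handler(domain, dev, iova, flags,
				      domain->handler_token);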
Found by Linux Verification Center (linuxtesting.org).
Fixes: 6aa63a4ec9 ("iommu: Sort out domain user data")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250408213342.285955-1-pchelkin@ispras.ru
Signed-off-by: Joerg Roedel <jroedel@suse.de>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
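The mechanical replacement is simply:
	/* before */
	del_timer(&timer);
	del_timer_sync(&timer);
	/* after */
	timer_delete(&timer);
	timer_delete_sync(&timer);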
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with faults, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
iommufd_veventq_fops_read() decrements veventq->num_events when a vevent
is read out. However, the report path only increments veventq->num_events
for normal events. To be balanced, make the read path decrement num_events
only for normal vevents.
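An illustrative read-side sketch (member names such as node are
assumptions, not the verbatim hunk): the embedded lost_events_header was
never counted on the report side, so skip it when decrementing too.

static void veventq_consume_one(struct iommufd_veventq *veventq,
				struct iommufd_vevent *cur)
{
	list_del(&cur->node);
	if (cur != &veventq->lost_events_header) {
		veventq->num_events--;	/* balance the report-side increment */
		kfree(cur);		/* the header is embedded, never freed */
	}
}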
Fixes: e36ba5ab80 ("iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC")
Link: https://patch.msgid.link/r/20250324120034.5940-3-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The vevent->header.flags is not initialized per allocation, hence the
vevent read path may wrongly treat a vevent as the lost_events_header.
Use kzalloc() to allocate memory for new vevents.
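Compactly, the before/after shape (the struct_size() expression is an
assumption about the allocation, not a quote of the driver):

/* before: header.flags (and the rest) inherited kmalloc garbage */
vevent = kmalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);

/* after: zero-initialized, so a fresh vevent can never be mistaken for
 * the lost-events marker on the read side */
vevent = kzalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);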
Fixes: e8e1ef9b77 ("iommufd/viommu: Add iommufd_viommu_report_event helper")
Link: https://patch.msgid.link/r/20250324120034.5940-2-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
PASID usage requires PASID support in both device and IOMMU. Since the
iommu drivers always enable the PASID capability for the device if it
is supported, this extends IOMMU_GET_HW_INFO to report the PASID
capability to userspace. The selftest is enhanced accordingly.
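A hedged userspace sketch of querying it; the out_max_pasid_log2 field
name follows the selftest coverage in this series and should be checked
against the current uapi header:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static void show_pasid_width(int iommufd, __u32 dev_id)
{
	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
	};

	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return;
	if (info.out_max_pasid_log2)	/* 0 means no PASID support */
		printf("PASID width: %u bits\n", info.out_max_pasid_log2);
}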
Link: https://patch.msgid.link/r/20250321180143.8468-5-yi.l.liu@intel.com
Cc: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> #aarch64 platform
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Merge tag 'iommu-updates-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core iommufd dependencies from Jason:
- Change the iommufd fault handle into an always present hwpt handle
in the domain
- Give iommufd its own SW_MSI implementation along with some IRQ
layer rework
- Improvements to the handle attach API
Core fixes for probe-issues from Robin
Intel VT-d changes:
- Checking for SVA support in domain allocation and attach paths
- Move PCI ATS and PRI configuration into probe paths
- Fix a potential hang on reboot -f
- Miscellaneous cleanups
AMD-Vi changes:
- Support for up to 2k IRQs per PCI device function
- Set of smaller fixes
ARM-SMMU changes:
- SMMUv2 devicetree binding updates for Qualcomm implementations
(QCS8300 GPU and MSM8937)
- Clean up SMMUv2 runtime PM implementation to help with wider rework
of pm_runtime_put_autosuspend()
Rockchip driver changes:
- Driver adjustments for recent DT probing changes
S390 IOMMU changes:
- Support for IOMMU passthrough
Apple Dart changes:
- Driver adjustments to meet ISP device requirements
- Null-ptr deref fix
- Disable subpage protection for DART 1"
* tag 'iommu-updates-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
iommu/vt-d: Fix possible circular locking dependency
iommu/vt-d: Don't clobber posted vCPU IRTE when host IRQ affinity changes
iommu/vt-d: Put IRTE back into posted MSI mode if vCPU posting is disabled
iommu: apple-dart: fix potential null pointer deref
iommu/rockchip: Retire global dma_dev workaround
iommu/rockchip: Register in a sensible order
iommu/rockchip: Allocate per-device data sensibly
iommu/mediatek-v1: Support COMPILE_TEST
iommu/amd: Enable support for up to 2K interrupts per function
iommu/amd: Rename DTE_INTTABLEN* and MAX_IRQS_PER_TABLE macro
iommu/amd: Replace slab cache allocator with page allocator
iommu/amd: Introduce generic function to set multibit feature value
iommu: Don't warn prematurely about dodgy probes
iommu/arm-smmu: Set rpm auto_suspend once during probe
dt-bindings: arm-smmu: Document QCS8300 GPU SMMU
iommu: Get DT/ACPI parsing into the proper probe path
iommu: Keep dev->iommu state consistent
iommu: Resolve ops in iommu_init_device()
iommu: Handle race with default domain setup
iommu: Unexport iommu_fwspec_free()
...
Merge tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Add support for running as the root partition in Hyper-V (Microsoft
Hypervisor) by exposing /dev/mshv (Nuno and various people)
- Add support for CPU offlining in Hyper-V (Hamza Mahfooz)
- Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael
Kelley, Thorsten Blum)
* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: fix an indentation issue in mshyperv.h
x86/hyperv: Add comments about hv_vpset and var size hypercall input args
Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
hyperv: Add definitions for root partition driver to hv headers
x86: hyperv: Add mshv_handler() irq handler and setup function
Drivers: hv: Introduce per-cpu event ring tail
Drivers: hv: Export some functions for use by root partition module
acpi: numa: Export node_to_pxm()
hyperv: Introduce hv_recommend_using_aeoi()
arm64/hyperv: Add some missing functions to arm64
x86/mshyperv: Add support for extended Hyper-V features
hyperv: Log hypercall status codes as strings
x86/hyperv: Fix check of return value from snp_set_vmsa()
x86/hyperv: Add VTL mode callback for restarting the system
x86/hyperv: Add VTL mode emergency restart callback
hyperv: Remove unused union and structs
hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
hyperv: Change hv_root_partition into a function
hyperv: Convert hypercall statuses to linux error codes
drivers/hv: add CPU offlining support
...
This adds 4 test ops for pasid attach/replace/detach testing. There are
ops to attach/detach a pasid, and also an op to check the attached hwpt of
a pasid.
Link: https://patch.msgid.link/r/20250321171940.7213-18-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The selftest device (sobj->type == TYPE_IDEV) is needed in multiple
places, so add a helper for it.
Link: https://patch.msgid.link/r/20250321171940.7213-17-yi.l.liu@intel.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The callback is needed to make pasid_attach/detach path complete for mock
device. A nop is enough for set_dev_pasid.
A MOCK_FLAGS_DEVICE_PASID is added to indicate a pasid-capable mock device
for the pasid test cases. Other test cases will still create a non-pasid
mock device, while the mock iommu itself always pretends to be pasid-capable.
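The nop can be as simple as the following sketch; the 'old' argument
follows the current set_dev_pasid signature and the details are
illustrative:

static int mock_domain_set_dev_pasid_nop(struct iommu_domain *domain,
					 struct device *dev, ioasid_t pasid,
					 struct iommu_domain *old)
{
	return 0;
}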
Link: https://patch.msgid.link/r/20250321171940.7213-16-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The underlying infrastructure already supports PASID attach and the
related enforcement required by the IOMMU_HWPT_ALLOC_PASID flag. This
extends iommufd to support PASID-compatible domains requested by userspace.
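A hedged userspace sketch of requesting such a domain, assuming the
IOMMU_HWPT_ALLOC_PASID flag and the iommu_hwpt_alloc layout in the
current uapi header:

#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int alloc_pasid_compat_hwpt(int iommufd, __u32 dev_id, __u32 pt_id,
				   __u32 *out_hwpt_id)
{
	struct iommu_hwpt_alloc alloc = {
		.size = sizeof(alloc),
		.flags = IOMMU_HWPT_ALLOC_PASID,
		.dev_id = dev_id,
		.pt_id = pt_id,		/* e.g. an IOAS or parent hwpt */
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
		return -1;
	*out_hwpt_id = alloc.out_hwpt_id;
	return 0;
}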
Link: https://patch.msgid.link/r/20250321171940.7213-15-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The Intel iommu driver just treats it as a nop since Intel VT-d does not
have special requirements on domains attached to either the PASID or RID
of a PASID-capable device.
Link: https://patch.msgid.link/r/20250321171940.7213-14-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Per the definition of IOMMU_HWPT_ALLOC_PASID, iommufd needs to enforce
that the RID uses a PASID-compatible domain if a PASID has been attached,
and vice versa. The PASID path already enforces this. This adds the
enforcement in the RID path.
This enforcement requires a lock across the RID and PASID attach paths;
the idev->igroup->lock is used since both paths already hold it.
Link: https://patch.msgid.link/r/20250321171940.7213-13-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This extends the below APIs to support PASID, so that device drivers can
manage pasid attach/replace/detach:
int iommufd_device_attach(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
int iommufd_device_replace(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
void iommufd_device_detach(struct iommufd_device *idev,
ioasid_t pasid);
The pasid operations share underlying attach/replace/detach infrastructure
with the device operations, but still have some different implications:
- no reserved region per pasid otherwise SVA architecture is already
broken (CPU address space doesn't count device reserved regions);
- accordingly no sw_msi trick;
Cache coherency enforcement is still applied to pasid operations since
it is about memory accesses post page table walking (no matter the walk
is per RID or per PASID).
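A hedged usage sketch for an in-kernel caller such as VFIO; error
handling is minimal and the PASID value 5 is arbitrary:

static int example_pasid_flow(struct iommufd_device *idev, u32 hwpt_id,
			      u32 new_hwpt_id)
{
	u32 pt_id = hwpt_id;
	int rc;

	/* the plain RID attach goes through the same API via IOMMU_NO_PASID */
	rc = iommufd_device_attach(idev, IOMMU_NO_PASID, &pt_id);
	if (rc)
		return rc;

	/* attach, replace and detach a translation for PASID 5 */
	rc = iommufd_device_attach(idev, 5, &pt_id);
	if (rc)
		return rc;
	pt_id = new_hwpt_id;
	rc = iommufd_device_replace(idev, 5, &pt_id);
	if (rc)
		return rc;
	iommufd_device_detach(idev, 5);
	iommufd_device_detach(idev, IOMMU_NO_PASID);
	return 0;
}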
Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
AMD IOMMU requires attaching PASID-compatible domains to PASID-capable
devices. This includes the domains attached to RID and PASIDs. Related
discussions are in links [1] and [2]. ARM also has such a requirement;
Intel does not need it but can live with it. Hence, iommufd is going to
enforce this requirement as it is not harmful to vendors that do not
need it.
Mark the PASID-compatible domains and enforce it in the PASID path.
[1] https://lore.kernel.org/linux-iommu/20240709182303.GK14050@ziepe.ca/
[2] https://lore.kernel.org/linux-iommu/20240822124433.GD3468552@ziepe.ca/
Link: https://patch.msgid.link/r/20250321171940.7213-11-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>