Commit Graph

342 Commits

Author SHA1 Message Date
Linus Torvalds
63eb28bb14 ARM:
- Host driver for GICv5, the next generation interrupt controller for
   arm64, including support for interrupt routing, MSIs, interrupt
   translation and wired interrupts.
 
 - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
   GICv5 hardware, leveraging the legacy VGIC interface.
 
 - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
   userspace to disable support for SGIs w/o an active state on hardware
   that previously advertised it unconditionally.
 
 - Map supporting endpoints with cacheable memory attributes on systems
   with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
   maintenance on the address range.
 
 - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
   hypervisor to inject external aborts into an L2 VM and take traps of
   masked external aborts to the hypervisor.
 
 - Convert more system register sanitization to the config-driven
   implementation.
 
 - Fixes to the visibility of EL2 registers, namely making VGICv3 system
   registers accessible through the VGIC device instead of the ONE_REG
   vCPU ioctls.
 
 - Various cleanups and minor fixes.
 
 LoongArch:
 
 - Add stat information for in-kernel irqchip
 
 - Add tracepoints for CPUCFG and CSR emulation exits
 
 - Enhance in-kernel irqchip emulation
 
 - Various cleanups.
 
 RISC-V:
 
 - Enable ring-based dirty memory tracking
 
 - Improve perf kvm stat to report interrupt events
 
 - Delegate illegal instruction trap to VS-mode
 
 - MMU improvements related to upcoming nested virtualization
 
 s390x:
 
 - Fixes
 
 x86:
 
 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC,
   PIC, and PIT emulation at compile time.
 
 - Share device posted IRQ code between SVM and VMX and
   harden it against bugs and runtime errors.
 
 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).
 
 - For MMIO stale data mitigation, track whether or not a vCPU has access to
   (host) MMIO based on whether the page tables have MMIO pfns mapped; using
   VFIO is prone to false negatives
 
 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.
 
 - Recalculate all MSR intercepts from scratch on MSR filter changes,
   instead of maintaining shadow bitmaps.
 
 - Advertise support for LKGS (Load Kernel GS base), a new instruction
   that's loosely related to FRED, but is supported and enumerated
   independently.
 
 - Fix a user-triggerable WARN that syzkaller found by setting the vCPU
   in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU
   into VMX Root Mode (post-VMXON).  Trying to detect every possible path
   leading to architecturally forbidden states is hard and even risks
   breaking userspace (if it goes from valid to valid state but passes
   through invalid states), so just wait until KVM_RUN to detect that
   the vCPU state isn't allowed.
 
 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can access
   APERF/MPERF.  This has many caveats (APERF/MPERF cannot be zeroed
   on vCPU creation or saved/restored on suspend and resume, or preserved
   over thread migration let alone VM migration) but can be useful whenever
   you're interested in letting Linux guests see the effective physical CPU
   frequency in /proc/cpuinfo.
 
 - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
   created, as there's no known use case for changing the default
   frequency for other VM types and it goes counter to the very reason
   why the ioctl was added to the vm file descriptor.  And also, there
   would be no way to make it work for confidential VMs with a "secure"
   TSC, so kill two birds with one stone.
 
 - Dynamically allocate the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU
   doesn't use the list).
 
 - Extract many of KVM's helpers for accessing architectural local APIC
   state to common x86 so that they can be shared by guest-side code for
   Secure AVIC.
 
 - Various cleanups and fixes.
 
 x86 (Intel):
 
 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
   Failure to honor FREEZE_IN_SMM can leak host state into guests.
 
 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent
   L1 from running L2 with features that KVM doesn't support, e.g. BTF.
 
 x86 (AMD):
 
 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty
   much a static condition and therefore should never happen, but still).
 
 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
 
 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.
 
 - Extend enable_ipiv module param support to SVM, by simply leaving
   IsRunning clear in the vCPU's physical ID table entry.
 
 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.
 
 - Request GA Log interrupts if and only if the target vCPU is blocking,
   i.e. only if KVM needs a notification in order to wake the vCPU.
 
 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.
 
 - Accept any SNP policy that is accepted by the firmware with respect to
   SMT and single-socket restrictions.  An incompatible policy doesn't put
   the kernel at risk in any way, so there's no reason for KVM to care.
 
 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
   use WBNOINVD instead of WBINVD when possible for SEV cache maintenance.
 
 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs
   that have ever run a vCPU for the guest, i.e. don't flush the caches for
   CPUs that can't possibly have cache lines with dirty, encrypted data.
 
 Generic:
 
 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large
   numbers of VMs.  Such use cases typically don't actually use irqbypass,
   but eliminating the pointless registration is a future problem to
   solve as it likely requires new uAPI.
 
 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.
 
 - Decouple device posted IRQs from VFIO device assignment, as binding a VM
   to a VFIO group is not a requirement for enabling device posted IRQs.
 
 - Clean up and document/comment the irqfd assignment code.
 
 - Disallow binding multiple irqfds to an eventfd with a priority waiter,
   i.e.  ensure an eventfd is bound to at most one irqfd through the entire
   host, and add a selftest to verify eventfd:irqfd bindings are globally
   unique.
 
 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
   related to private <=> shared memory conversions.
 
 - Drop guest_memfd's .getattr() implementation as the VFS layer will call
   generic_fillattr() if inode_operations.getattr is NULL.
 
 - Fix issues with dirty ring harvesting where KVM doesn't bound the
   processing of entries in any way, which allows userspace to keep KVM
   in a tight loop indefinitely.
 
 - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking,
   now that KVM no longer uses assigned_device_count as a heuristic for
   either irqbypass usage or MDS mitigation.
 
 Selftests:
 
 - Fix a comment typo.
 
 - Verify KVM is loaded when getting any KVM module param so that attempting
   to run a selftest without kvm.ko loaded results in a SKIP message about
   KVM not being loaded/enabled (versus some random parameter not existing).
 
 - Skip tests that hit EACCES when attempting to access a file, and print
   a "Root required?" help message.  In most cases, the test just needs to
   be run with elevated permissions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmiKXMgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMhMQf/QDhC/CP1aGXph2whuyeD2NMqPKiU
 9KdnDNST+ftPwjg9QxZ9mTaa8zeVz/wly6XlxD9OQHy+opM1wcys3k0GZAFFEEQm
 YrThgURdzEZ3nwJZgb+m0t4wjJQtpiFIBwAf7qq6z1VrqQBEmHXJ/8QxGuqO+BNC
 j5q/X+q6KZwehKI6lgFBrrOKWFaxqhnRAYfW6rGBxRXxzTJuna37fvDpodQnNceN
 zOiq+avfriUMArTXTqOteJNKU0229HjiPSnjILLnFQ+B3akBlwNG0jk7TMaAKR6q
 IZWG1EIS9q1BAkGXaw6DE1y6d/YwtXCR5qgAIkiGwaPt5yj9Oj6kRN2Ytw==
 =j2At
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Host driver for GICv5, the next generation interrupt controller for
     arm64, including support for interrupt routing, MSIs, interrupt
     translation and wired interrupts

   - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
     GICv5 hardware, leveraging the legacy VGIC interface

   - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
     userspace to disable support for SGIs w/o an active state on
     hardware that previously advertised it unconditionally

   - Map supporting endpoints with cacheable memory attributes on
     systems with FEAT_S2FWB and DIC where KVM no longer needs to
     perform cache maintenance on the address range

   - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
     guest hypervisor to inject external aborts into an L2 VM and take
     traps of masked external aborts to the hypervisor

   - Convert more system register sanitization to the config-driven
     implementation

   - Fixes to the visibility of EL2 registers, namely making VGICv3
     system registers accessible through the VGIC device instead of the
     ONE_REG vCPU ioctls

   - Various cleanups and minor fixes

  LoongArch:

   - Add stat information for in-kernel irqchip

   - Add tracepoints for CPUCFG and CSR emulation exits

   - Enhance in-kernel irqchip emulation

   - Various cleanups

  RISC-V:

   - Enable ring-based dirty memory tracking

   - Improve perf kvm stat to report interrupt events

   - Delegate illegal instruction trap to VS-mode

   - MMU improvements related to upcoming nested virtualization

  s390x:

   - Fixes

  x86:

   - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
     APIC, PIC, and PIT emulation at compile time

   - Share device posted IRQ code between SVM and VMX and harden it
     against bugs and runtime errors

   - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
     O(1) instead of O(n)

   - For MMIO stale data mitigation, track whether or not a vCPU has
     access to (host) MMIO based on whether the page tables have MMIO
     pfns mapped; using VFIO is prone to false negatives

   - Rework the MSR interception code so that the SVM and VMX APIs are
     more or less identical

   - Recalculate all MSR intercepts from scratch on MSR filter changes,
     instead of maintaining shadow bitmaps

   - Advertise support for LKGS (Load Kernel GS base), a new instruction
     that's loosely related to FRED, but is supported and enumerated
     independently

   - Fix a user-triggerable WARN that syzkaller found by setting the
     vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
     the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
     possible path leading to architecturally forbidden states is hard
     and even risks breaking userspace (if it goes from valid to valid
     state but passes through invalid states), so just wait until
     KVM_RUN to detect that the vCPU state isn't allowed

   - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
     interception of APERF/MPERF reads, so that a "properly" configured
     VM can access APERF/MPERF. This has many caveats (APERF/MPERF
     cannot be zeroed on vCPU creation or saved/restored on suspend and
     resume, or preserved over thread migration let alone VM migration)
     but can be useful whenever you're interested in letting Linux
     guests see the effective physical CPU frequency in /proc/cpuinfo

   - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
     created, as there's no known use case for changing the default
     frequency for other VM types and it goes counter to the very reason
     why the ioctl was added to the vm file descriptor. And also, there
     would be no way to make it work for confidential VMs with a
     "secure" TSC, so kill two birds with one stone

   - Dynamically allocate the shadow MMU's hashed page list, and defer
     allocating the hashed list until it's actually needed (the TDP MMU
     doesn't use the list)

   - Extract many of KVM's helpers for accessing architectural local
     APIC state to common x86 so that they can be shared by guest-side
     code for Secure AVIC

   - Various cleanups and fixes

  x86 (Intel):

   - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
     Failure to honor FREEZE_IN_SMM can leak host state into guests

   - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
     prevent L1 from running L2 with features that KVM doesn't support,
     e.g. BTF

  x86 (AMD):

   - WARN and reject loading kvm-amd.ko instead of panicking the kernel
     if the nested SVM MSRPM offsets tracker can't handle an MSR (which
     is pretty much a static condition and therefore should never
     happen, but still)

   - Fix a variety of flaws and bugs in the AVIC device posted IRQ code

   - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
     supports) instead of rejecting vCPU creation

   - Extend enable_ipiv module param support to SVM, by simply leaving
     IsRunning clear in the vCPU's physical ID table entry

   - Disable IPI virtualization, via enable_ipiv, if the CPU is affected
     by erratum #1235, to allow (safely) enabling AVIC on such CPUs

   - Request GA Log interrupts if and only if the target vCPU is
     blocking, i.e. only if KVM needs a notification in order to wake
     the vCPU

   - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
     the vCPU's CPUID model

   - Accept any SNP policy that is accepted by the firmware with respect
     to SMT and single-socket restrictions. An incompatible policy
     doesn't put the kernel at risk in any way, so there's no reason for
     KVM to care

   - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
     use WBNOINVD instead of WBINVD when possible for SEV cache
     maintenance

   - When reclaiming memory from an SEV guest, only do cache flushes on
     CPUs that have ever run a vCPU for the guest, i.e. don't flush the
     caches for CPUs that can't possibly have cache lines with dirty,
     encrypted data

  Generic:

   - Rework irqbypass to track/match producers and consumers via an
     xarray instead of a linked list. Using a linked list leads to
     O(n^2) insertion times, which is hugely problematic for use cases
     that create large numbers of VMs. Such use cases typically don't
     actually use irqbypass, but eliminating the pointless registration
     is a future problem to solve as it likely requires new uAPI

   - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
     "void *", to avoid making a simple concept unnecessarily difficult
     to understand

   - Decouple device posted IRQs from VFIO device assignment, as binding
     a VM to a VFIO group is not a requirement for enabling device
     posted IRQs

   - Clean up and document/comment the irqfd assignment code

   - Disallow binding multiple irqfds to an eventfd with a priority
     waiter, i.e. ensure an eventfd is bound to at most one irqfd
     through the entire host, and add a selftest to verify eventfd:irqfd
     bindings are globally unique

   - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
     related to private <=> shared memory conversions

   - Drop guest_memfd's .getattr() implementation as the VFS layer will
     call generic_fillattr() if inode_operations.getattr is NULL

   - Fix issues with dirty ring harvesting where KVM doesn't bound the
     processing of entries in any way, which allows userspace to keep
     KVM in a tight loop indefinitely

   - Kill off kvm_arch_{start,end}_assignment() and x86's associated
     tracking, now that KVM no longer uses assigned_device_count as a
     heuristic for either irqbypass usage or MDS mitigation

  Selftests:

   - Fix a comment typo

   - Verify KVM is loaded when getting any KVM module param so that
     attempting to run a selftest without kvm.ko loaded results in a
     SKIP message about KVM not being loaded/enabled (versus some random
     parameter not existing)

   - Skip tests that hit EACCES when attempting to access a file, and
     print a "Root required?" help message. In most cases, the test just
     needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30 17:14:01 -07:00
Will Deacon
6ae1477fd3 Merge branch 'amd/amd-vi' into next
* amd/amd-vi:
  iommu/amd: Fix geometry.aperture_end for V2 tables
  iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
  iommu/amd: Add documentation for AMD IOMMU debugfs support
  iommu/amd: Add debugfs support to dump IRT Table
  iommu/amd: Add debugfs support to dump device table
  iommu/amd: Add support for device id user input
  iommu/amd: Add debugfs support to dump IOMMU command buffer
  iommu/amd: Add debugfs support to dump IOMMU Capability registers
  iommu/amd: Add debugfs support to dump IOMMU MMIO registers
  iommu/amd: Refactor AMD IOMMU debugfs initial setup
  iommu/amd: Enable PASID and ATS capabilities in the correct order
  iommu/amd: Add efr[HATS] max v1 page table level
  iommu/amd: Add HATDis feature support
2025-07-24 11:17:59 +01:00
Jason Gunthorpe
8637afa79c iommu/amd: Fix geometry.aperture_end for V2 tables
The AMD IOMMU documentation seems pretty clear that the V2 table follows
the normal CPU expectation of sign extension. This is shown in

  Figure 25: AMD64 Long Mode 4-Kbyte Page Address Translation

Where bits Sign-Extend [63:57] == [56]. This is typical for x86 which
would have three regions in the page table: lower, non-canonical, upper.

The manual describes that the V1 table does not sign extend in section
2.2.4 Sharing AMD64 Processor and IOMMU Page Tables GPA-to-SPA

Further, Vasant has checked this and indicates the HW has an additional
behavior that the manual does not yet describe. The AMDv2 table does not
have the sign extended behavior when attached to PASID 0, which may
explain why this has gone unnoticed.

The iommu domain geometry does not directly support sign extended page
tables. The driver should report only one of the lower/upper spaces. Solve
this by removing the top VA bit from the geometry to use only the lower
space.

This will also make the iommu_domain work consistently for both PASID 0
and PASID != 0.

Adjust dma_max_address() to remove the top VA bit. It now returns:

5 Level:
  Before 0x1ffffffffffffff
  After  0x0ffffffffffffff
4 Level:
  Before 0xffffffffffff
  After  0x7fffffffffff
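
As a rough illustration of where these limits come from, here is a small
userspace sketch; lower_space_max() is an illustrative helper, not the
driver's dma_max_address(), and it simply assumes the standard 48-bit
(4-level) and 57-bit (5-level) VA widths:

```
#include <stdint.h>
#include <stdio.h>

/* With the sign-extension bit removed from the geometry, only the lower
 * half of the table's VA range is usable: one bit less than full width. */
static uint64_t lower_space_max(unsigned int va_bits)
{
	return (UINT64_C(1) << (va_bits - 1)) - 1;
}

int main(void)
{
	printf("4-level: %#llx\n", (unsigned long long)lower_space_max(48)); /* 0x7fffffffffff   */
	printf("5-level: %#llx\n", (unsigned long long)lower_space_max(57)); /* 0xffffffffffffff */
	return 0;
}
```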

Fixes: 11c439a194 ("iommu/amd/pgtbl_v2: Fix domain max address")
Link: https://lore.kernel.org/all/8858d4d6-d360-4ef0-935c-bfd13ea54f42@amd.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v2-0615cc99b88a+1ce-amdv2_geo_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-17 10:18:57 +01:00
Dheeraj Kumar Srivastava
fb3af1f4fe iommu/amd: Add debugfs support to dump IOMMU command buffer
The IOMMU driver sends commands to the IOMMU hardware via a command
buffer. In cases where the IOMMU hardware fails to process commands in
the command buffer, dumping it is valuable input for debugging the issue.

The IOMMU hardware processes the command buffer entry at the offset equal
to the head pointer. Dumping just the entry at the head pointer may not
always be useful: the current head may not be pointing to the entry that
is causing the issue, since the hardware may have already processed the
entry and updated the head pointer. So dumping the entire command buffer
gives a broad understanding of what the hardware was/is doing. The command
buffer dump will have all entries from the start to the end of the command
buffer. Along with that, it will include a dump of the command buffer head
and tail pointer registers to show where the IOMMU driver and hardware are
in the command buffer for injecting and processing entries, respectively.

The command buffer is a per-IOMMU data structure, so it is dumped on a
per-IOMMU basis.
e.g.
-> To get the command buffer dump for iommu<x> (say, iommu00)
   # cat /sys/kernel/debug/iommu/amd/iommu00/cmdbuf

Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250702093804.849-5-dheerajkumar.srivastava@amd.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-15 11:41:52 +01:00
Easwar Hariharan
c694bc8b61 iommu/amd: Enable PASID and ATS capabilities in the correct order
Per the PCIe spec, behavior of the PASID capability is undefined if the
value of the PASID Enable bit changes while the Enable bit of the
function's ATS control register is Set. Unfortunately,
pdev_enable_caps() does exactly that by enabling ATS for the device
before enabling PASID.
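
A minimal kernel-context sketch of the intended ordering; the helper name
is illustrative and error handling is elided (the real fix reorders
pdev_enable_caps()):

```
#include <linux/pci.h>
#include <linux/pci-ats.h>

/* Enable PASID while ATS Enable is still clear, then enable ATS last. */
static void enable_caps_in_order(struct pci_dev *pdev)
{
	if (pci_enable_pasid(pdev, 0))
		pci_info(pdev, "PASID could not be enabled\n");

	/* Only flip ATS Enable once the PASID state is settled. */
	if (pci_enable_ats(pdev, PAGE_SHIFT))
		pci_info(pdev, "ATS could not be enabled\n");
}
```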

Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Fixes: eda8c2860a ("iommu/amd: Enable device ATS/PASID/PRI capabilities independently")
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250703155433.6221-1-eahariha@linux.microsoft.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-07-11 09:13:58 +02:00
Marc Zyngier
0eaa67ad3a iommu/amd: Convert to msi_create_parent_irq_domain() helper
Now that we have a concise helper to create an MSI parent domain,
switch the AMD IOMMU remapping over to that.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/all/20241204124549.607054-9-maz@kernel.org
Link: https://lore.kernel.org/r/92e5ae97a03e4ffc272349d0863cd2cc8f904c44.1750858125.git.namcao@linutronix.de
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-07-04 10:28:10 +02:00
Ankit Soni
025d1371cc iommu/amd: Add efr[HATS] max v1 page table level
The EFR[HATS] bits indicate the maximum host translation level supported
by the IOMMU. Add support to set the maximum host page table level as
indicated by EFR[HATS]. If HATS=11b (reserved), the driver will attempt
to use the guest page table for the DMA API.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/df0f8562c2a20895cc185c86f1a02c4d826fd597.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-06-27 08:51:20 +02:00
Ankit Soni
7e5516e609 iommu/amd: Add HATDis feature support
The current AMD IOMMU driver assumes Host Address Translation (HAT) is
always supported, and the Linux kernel enables this capability by
default. However, in the case of an emulated or virtualized IOMMU, this
might not be true. For example, the current QEMU-emulated AMD vIOMMU does
not support host translation for VFIO pass-through devices, but its
interrupt remapping support is required for x2APIC (i.e.
kvm-msi-ext-dest-id is also not supported by the guest OS). This would
require the guest kernel to boot with the kernel option iommu=pt to
bypass initialization of the host (v1) table.

The AMD I/O Virtualization Technology (IOMMU) Specification Rev 3.10 [1]
introduces a new flag 'HATDis' in the IVHD 11h IOMMU attributes to indicate
that HAT is not supported on a particular IOMMU instance.

Therefore, modify the AMD IOMMU driver to detect the new HATDis
attribute, disable host translation, and switch to guest translation if
it is available. Otherwise, the driver will disable DMA translation.

[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/8109b208f87b80e400c2abd24a2e44fcbc0763a5.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-06-27 08:51:18 +02:00
Sean Christopherson
b9e53f9ff4 iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts.  KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.

Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.

Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that they match amd_iommu_activate_guest_mode().

Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.

Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:51 -07:00
Sean Christopherson
a23480fe21 iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support
WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully
supported, as KVM should guard all such calls on IRQ posting being enabled.

Link: https://lore.kernel.org/r/20250611224604.313496-58-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:48 -07:00
Sean Christopherson
6df262f915 iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode.  KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.

Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:40 -07:00
Sean Christopherson
f965255dc5 iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity.  Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.

This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.

Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:39 -07:00
Sean Christopherson
0b2b541fa3 iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
Split the guts of amd_iommu_update_ga() to a dedicated helper so that the
logic can be shared with flows that put the IRTE into posted mode.

Opportunistically move amd_iommu_update_ga() and its new helper above
amd_iommu_activate_guest_mode() so that it's all co-located.

Link: https://lore.kernel.org/r/20250611224604.313496-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:38 -07:00
Sean Christopherson
08d9ccdd1a iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running.  amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:37 -07:00
Sean Christopherson
3be405e89f iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify
Add a comment to amd_iommu_update_ga() to document what fields it can
safely modify without issuing an invalidation of the IRTE, and to explain
its role in keeping GA IRTEs up-to-date.

Per page 93 of the IOMMU spec dated Feb 2025:

  When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
  IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
  are not cached by the IOMMU. Modifications to these fields do not require an
  invalidation of the Interrupt Remapping Table.

Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:37 -07:00
Sean Christopherson
53527ea1b7 iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures.  Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:31 -07:00
Sean Christopherson
95d50ebe6d iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:25 -07:00
Sean Christopherson
c4cdbaf9d8 iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer.  KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:24 -07:00
Sean Christopherson
1da19c5ce0 iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:55 -07:00
Joerg Roedel
879b141b7c Merge branches 'fixes', 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2025-05-23 17:14:32 +02:00
Mario Limonciello
51c33f333b iommu/amd: Allow matching ACPI HID devices without matching UIDs
A BIOS upgrade has changed the IVRS DTE UID for a device so that it no
longer matches the UID in the SSDT. In this case there is only one ACPI
device on the system with that _HID, but its _UID does not match.

IVRS:
```
              Subtable Type : F0 [Device Entry: ACPI HID Named Device]
                  Device ID : 0060
Data Setting (decoded below) : 40
                 INITPass : 0
                 EIntPass : 0
                 NMIPass : 0
                 Reserved : 0
                 System MGMT : 0
                 LINT0 Pass : 1
                 LINT1 Pass : 0
                   ACPI HID : "MSFT0201"
                   ACPI CID : 0000000000000000
                 UID Format : 02
                 UID Length : 09
                        UID : "\_SB.MHSP"
```

SSDT:
```
Device (MHSP)
{
    Name (_ADR, Zero)  // _ADR: Address
    Name (_HID, "MSFT0201")  // _HID: Hardware ID
    Name (_UID, One)  // _UID: Unique ID
```

To handle this case, while enumerating ACPI devices in
get_acpihid_device_id(), count the number of ACPI devices with a matching
_HID. If there is exactly one _HID match, then accept it even if the _UID
doesn't match. Other operating systems allow this, but the current IVRS
spec leaves some ambiguity about whether to allow or disallow it. This
should be clarified in future revisions of the spec. Output
'Firmware Bug' for this case to encourage it to be fixed in the BIOS.
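
A rough userspace sketch of the matching rule, using a flat array as a
hypothetical stand-in for the ACPI device list (the real logic lives in
get_acpihid_device_id()):

```
#include <stdio.h>
#include <string.h>

struct acpi_dev_stub {		/* hypothetical stand-in for ACPI device data */
	const char *hid;
	const char *uid;
};

static const struct acpi_dev_stub *
match_hid_uid(const struct acpi_dev_stub *devs, int n,
	      const char *hid, const char *uid)
{
	const struct acpi_dev_stub *hid_match = NULL;
	int hid_matches = 0;

	for (int i = 0; i < n; i++) {
		if (strcmp(devs[i].hid, hid))
			continue;
		if (!strcmp(devs[i].uid, uid))
			return &devs[i];	/* exact _HID + _UID match wins */
		hid_match = &devs[i];
		hid_matches++;
	}

	/* Tolerate a _UID mismatch only when the _HID match is unique. */
	if (hid_matches == 1) {
		printf("Firmware Bug: _UID mismatch for %s, using sole _HID match\n", hid);
		return hid_match;
	}
	return NULL;
}
```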

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250512173129.1274275-1-superm1@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-05-16 08:48:18 +02:00
Sairaj Kodilkar
94a9921e2c iommu/amd: Add support for HTRangeIgnore feature
The AMD IOMMU reserves the address range 0xfd00000000-0xffffffffff for
the HyperTransport (HT) protocol, and this range has special meaning.
Hence devices cannot use this address range for DMA. However, on some
AMD platforms this HT range is shifted to the very top of the address
space and a new feature bit `HTRangeIgnore` is introduced. When this
feature bit is on, the IOMMU treats GPA accesses to the legacy HT range
as regular GPA accesses.

Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250317055020.25214-1-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-05-02 08:38:09 +02:00
Sean Christopherson
94c721ea03 iommu/amd: Ensure GA log notifier callbacks finish running before module unload
Synchronize RCU when unregistering KVM's GA log notifier to ensure all
in-flight interrupt handlers complete before KVM-the module is unloaded.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315031048.2374109-1-seanjc@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28 13:40:20 +02:00
Lu Baolu
f984fb09e6 iommu: Remove iommu_dev_enable/disable_feature()
No external drivers use these interfaces anymore. Furthermore, no existing
iommu drivers implement anything in the callbacks. Remove them to avoid
dead code.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28 13:04:35 +02:00
Jason Gunthorpe
7c8896dd4a iommu: Remove IOMMU_DEV_FEAT_SVA
None of the drivers implement anything here anymore, remove the dead code.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28 13:04:29 +02:00
Sean Christopherson
aae251a380 iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
enabled, as KVM shouldn't try to enable posting when they're unsupported,
and the IOMMU driver darn well should only advertise posting support when
AMD_IOMMU_GUEST_IR_VAPIC() is true.

Note, KVM consumes is_guest_mode only on success.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
07172206a2 iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
invoked without use_vapic; lying to KVM about whether or not the IRTE was
configured to post IRQs is all kinds of bad.

Fixes: d98de49a53 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Jason Gunthorpe
c3b42b6ffa iommu/amd: Use iommu_alloc_pages_node_sz() for the IRT
Use the actual size of the irq_table allocation, limiting to 128 due to
the HW alignment needs.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/22-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17 16:22:53 +02:00
Jason Gunthorpe
5087f663c2 iommu/pages: Remove iommu_alloc_page_node()
Use iommu_alloc_pages_node_sz() instead.

AMD and Intel are both using 4K pages for these structures since those
drivers only work on 4K PAGE_SIZE.

riscv is also spec'd to use SZ_4K.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17 16:22:51 +02:00
Jason Gunthorpe
3e8e986ce8 iommu/pages: Remove iommu_free_page()
Use iommu_free_pages() instead.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17 16:22:36 +02:00
Jason Gunthorpe
4316ba4a50 iommu/pages: Remove the order argument to iommu_free_pages()
Now that we have a folio under the allocation, iommu_free_pages() can know
the order of the original allocation and do the correct thing to free it.

The next patch will rename iommu_free_page() to iommu_free_pages() so we
have naming consistency with iommu_alloc_pages_node().

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17 16:22:33 +02:00
Joerg Roedel
22df63a23a Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next 2025-03-20 09:11:09 +01:00
Kishon Vijay Abraham I
19e5cc156c iommu/amd: Enable support for up to 2K interrupts per function
AMD IOMMU optionally supports up to 2K interrupts per function on newer
platforms. Support for this feature is indicated through Extended
Feature 2 Register (MMIO Offset 01A0h[NumIntRemapSup]). Allocate 2K IRTEs
per device when this support is available.
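
A minimal sketch of the size selection; the macro names are illustrative
(the "512" variant is introduced by the follow-up rename, the "2K"
variant is an assumption here):

```
#define MAX_IRQS_PER_TABLE_512	512
#define MAX_IRQS_PER_TABLE_2K	2048

/* Pick the per-device IRT size from EFR2[NumIntRemapSup]. */
static unsigned int max_irqs_per_table(int num_int_remap_sup)
{
	return num_int_remap_sup ? MAX_IRQS_PER_TABLE_2K : MAX_IRQS_PER_TABLE_512;
}
```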

Co-developed-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-5-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-13 12:14:17 +01:00
Sairaj Kodilkar
950865c1b8 iommu/amd: Rename DTE_INTTABLEN* and MAX_IRQS_PER_TABLE macro
The AMD IOMMU can support both 512 and 2K interrupts on newer platforms.
Hence add the suffix "512" to the existing macros.

Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-4-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-13 12:14:16 +01:00
Sairaj Kodilkar
eaf717fa1c iommu/amd: Replace slab cache allocator with page allocator
Commit 05152a0494 ("iommu/amd: Add slab-cache for irq remapping tables")
introduced a slab cache allocator. But a slab cache allocator provides a
benefit only when the allocation and deallocation of many identical
objects is frequent. The AMD IOMMU driver allocates an interrupt
remapping table (IRT) when a device driver requests an IRQ for the first
time and never frees it. Hence the slab allocator does not provide any
benefit here.

Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-3-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-13 12:14:15 +01:00
Vasant Hegde
625586855f iommu/amd: Consolidate protection domain free code
Consolidate protection domain free code inside amd_iommu_domain_free()
and remove protection_domain_free() function.

Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-8-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10 09:27:11 +01:00
Vasant Hegde
558d2bbd45 iommu/amd: Log IOMMU control register in event log path
Useful for debugging ILLEGAL_DEV_TABLE_ENTRY events as some of the
DTE settings depend on Control register settings.

Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-2-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10 09:27:08 +01:00
Alejandro Jimenez
c5b0320bbf iommu/amd: Preserve default DTE fields when updating Host Page Table Root
When updating the page table root field on the DTE, avoid overwriting any
bits that are already set. The earlier call to make_clear_dte() writes
default values that all DTEs must have set (currently DTE[V]), and those
must be preserved.

Currently this doesn't cause problems since the page table root update is
the first field that is set after make_clear_dte() is called, and
DTE_FLAG_V is set again later along with the permission bits (IR/IW).
Remove this redundant assignment too.
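
A minimal sketch of the read-modify-write idea; the field mask and helper
are illustrative, not the driver's 256-bit DTE helpers:

```
#include <stdint.h>

#define DTE_FLAG_V	(UINT64_C(1) << 0)		/* default bit set by make_clear_dte() */
#define DTE_ROOT_MASK	UINT64_C(0x000ffffffffff000)	/* hypothetical root-pointer field */

/* Update only the page table root field, preserving bits already set. */
static void set_pt_root(uint64_t *dte_lo, uint64_t root_pa)
{
	uint64_t val = *dte_lo;

	val &= ~DTE_ROOT_MASK;
	val |= root_pa & DTE_ROOT_MASK;
	*dte_lo = val;			/* DTE[V] and friends survive */
}
```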

Fixes: fd5dff9de4 ("iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250106191413.3107140-1-alejandro.j.jimenez@oracle.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-02-28 12:18:00 +01:00
Linus Torvalds
f1c243fc78 IOMMU Updates for Linux v6.14
Including:
 
 	- Core changes:
 	  - PASID support for the blocked_domain.
 
 	- ARM-SMMU Updates:
 	  - SMMUv2:
 	    * Implement per-client prefetcher configuration on Qualcomm SoCs.
 	    * Support for the Adreno SMMU on Qualcomm's SDM670 SOC.
 	  - SMMUv3:
 	    * Pretty-printing of event records.
 	    * Drop the ->domain_alloc_paging implementation in favour of
 	      ->domain_alloc_paging_flags(flags==0).
 	  - IO-PGTable:
 	    * Generalisation of the page-table walker to enable external walkers
 	      (e.g. for debugging unexpected page-faults from the GPU).
 	    * Minor fix for handling concatenated PGDs at stage-2 with 16KiB pages.
 	  - Misc:
 	    * Clean-up device probing and replace the crufty probe-deferral hack
 	      with a more robust implementation of arm_smmu_get_by_fwnode().
 	    * Device-tree binding updates for a bunch of Qualcomm platforms.
 
 	- Intel VT-d Updates:
 	  - Remove domain_alloc_paging().
 	  - Remove capability audit code.
 	  - Draining PRQ in sva unbind path when FPD bit set.
 	  - Link cache tags of same iommu unit together.
 
 	- AMD-Vi Updates:
 	  - Use CMPXCHG128 to update DTE.
 	  - Cleanups of the domain_alloc_paging() path.
 
 	- RiscV IOMMU:
 	  - Platform MSI support.
 	  - Shutdown support.
 
 	- Rockchip IOMMU:
 	  - Add DT bindings for Rockchip RK3576.
 
 	- More smaller fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmeQsnQACgkQK/BELZcB
 GuOHug/8DDIuKlUHU2+3U2kKxMb8o1kDxGPkfKgXBaxxpprvY9DARRv7N4mnF6Vu
 Db8P7QihXfQKnZHXEiqHE7TlA5B+IVQd3Kz96P4sY3OlVWGZYqyKv2GEHyG5CjN/
 bay7bfgeo2EVEiAio6VToFFWTm+oxFZzhoYFIlAyZAuIQUp17gHXf7YyhUwk4rOz
 8g0XMH6uldidID6BVpArxHh/bN9MOTdHzkyhwPF3FL8E94ziX6rWILH9ADYxBn2o
 wqHR1STxv398k62toPpWb78c2RdANI8treDXsYpCyDF87dygdP+SA0qkK3G6kAVA
 /IiPothAj6oNm+Gvwd04tEkuqVVrVqrmWE3VXSps33Tk+apYtcmLCtdpwY/F93D1
 EZwTVqveBKk2hSWYVDlEyj9XKmZ9dYWDGvg2Fx844gltQHoHtEgYL+azCMU/ry7k
 3+KlkUqFZZxUDbVBbbDdMF+NpxyZTtfKsaLB8f5laOP1R8Os3+dC69Gw6bhoxaMc
 xfL3v245V5kWSRy+w02TyZGmZSzRx0FKUbFLKpLOvZD6pfx8t8oqJTG9FwN1KzpL
 lvcBpPB5AUeNJQpKcVSjs/ne8WiOqdABt7ge4E9J5TwrDI8sXk0mwJaPrlDgK1V/
 0xkkLmxWnqsu04CyDay+Uv3Bls/rRBpikR/iGt9P3BbZs7Cyryw=
 =Xei0
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu updates from Joerg Roedel:
 "Core changes:
   - PASID support for the blocked_domain

  ARM-SMMU Updates:
   - SMMUv2:
      - Implement per-client prefetcher configuration on Qualcomm SoCs
      - Support for the Adreno SMMU on Qualcomm's SDM670 SOC
   - SMMUv3:
      - Pretty-printing of event records
      - Drop the ->domain_alloc_paging implementation in favour of
        domain_alloc_paging_flags(flags==0)
   - IO-PGTable:
      - Generalisation of the page-table walker to enable external
        walkers (e.g. for debugging unexpected page-faults from the GPU)
      - Minor fix for handling concatenated PGDs at stage-2 with 16KiB
        pages
   - Misc:
      - Clean-up device probing and replace the crufty probe-deferral
        hack with a more robust implementation of
        arm_smmu_get_by_fwnode()
      - Device-tree binding updates for a bunch of Qualcomm platforms

  Intel VT-d Updates:
   - Remove domain_alloc_paging()
   - Remove capability audit code
   - Draining PRQ in sva unbind path when FPD bit set
   - Link cache tags of same iommu unit together

  AMD-Vi Updates:
   - Use CMPXCHG128 to update DTE
   - Cleanups of the domain_alloc_paging() path

  RiscV IOMMU:
   - Platform MSI support
   - Shutdown support

  Rockchip IOMMU:
   - Add DT bindings for Rockchip RK3576

  More smaller fixes and cleanups"

* tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (66 commits)
  iommu: Use str_enable_disable-like helpers
  iommu/amd: Fully decode all combinations of alloc_paging_flags
  iommu/amd: Move the nid to pdom_setup_pgtable()
  iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
  iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
  iommu/amd: Remove dev == NULL checks
  iommu/amd: Remove domain_alloc()
  iommu/amd: Remove unused amd_iommu_domain_update()
  iommu/riscv: Fixup compile warning
  iommu/arm-smmu-v3: Add missing #include of linux/string_choices.h
  iommu/arm-smmu-v3: Use str_read_write helper w/ logs
  iommu/io-pgtable-arm: Add way to debug pgtable walk
  iommu/io-pgtable-arm: Re-use the pgtable walk for iova_to_phys
  iommu/io-pgtable-arm: Make pgtable walker more generic
  iommu/arm-smmu: Add ACTLR data and support for qcom_smmu_500
  iommu/arm-smmu: Introduce ACTLR custom prefetcher settings
  iommu/arm-smmu: Add support for PRR bit setup
  iommu/arm-smmu: Refactor qcom_smmu structure to include single pointer
  iommu/arm-smmu: Re-enable context caching in smmu reset operation
  iommu/vt-d: Link cache tags of same iommu unit together
  ...
2025-01-24 07:33:46 -08:00
Linus Torvalds
4c551165e7 Updates for the interrupt subsystem:
- Consolidation of the machine_kexec_mask_interrupts() by providing a
     generic implementation and replacing the copy & pasta orgy in the
     relevant architectures.
 
   - Prevent unconditional operations on interrupt chips during kexec
     shutdown, which can trigger warnings in certain cases when the
     underlying interrupt has been shut down before.
 
   - Make the enforcement of interrupt handling in interrupt context
     unconditionally available, so that it actually works for non x86
     related interrupt chips. The earlier enablement for ARM GIC chips set
     the required chip flag, but did not notice that the check was hidden
     behind a config switch which is not selected by ARM[64].
 
   - Decrapify the handling of deferred interrupt affinity setting. Some
     interrupt chips require that affinity changes are made from the context
     of handling an interrupt to avoid certain race conditions. For x86 this
     was the default, but with interrupt remapping this requirement was
     lifted and a flag was introduced which tells the core code that
     affinity changes can be done in any context. Unrestricted affinity
     changes are the default for the majority of interrupt chips. RISCV has
     the requirement to add the deferred mode to one of its interrupt
     controllers, but with the original implementation this would require to
     add the any context flag to all other RISC-V interrupt chips. That's
     backwards, so reverse the logic and require that chips, which need the
     deferred mode have to be marked accordingly. That avoids chasing the
     'sane' chips and marking them.
 
   - Add multi-node support to the Loongarch AVEC interrupt controller
     driver.
 
   - The usual tiny cleanups, fixes and improvements all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmePkVITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRbQD/9bHVph/V9Ekl7JAX3aY4gG4JbRhOc7
 dp1VAcHRhktRfoTztYRbjsbMu2nvZ58GKA8bkOS2jHSF/m3PbkIJfOhwk0YdIAoa
 +kdy5yDgqCGfkqW43DN4Cr+CnzGjWMitw67tFp3fhwehMDpDjdt2L28IjtanSS0f
 hO6FV7o65MWeJwxk4Isb2/nvkO+X23Lrp6RrWS8SXBnF9FFXxiPIg/fiOPTizhCh
 1W/bSGxLLb9WwsVzmlGAKVFlXDij0QGaIUug2fdVZ63OsELXD7tJrLSPG133yk92
 ppIa0s6BT4IBsfM00us4hG15PkLuJmP3yWWcoquG0rP8Wq58VOXiN6+rcJIyvB+5
 mWceTH6IKfZGoRQKwXC7BxeBAIb147reiJtb06meq1/8ADIvzafiNy0c8x9i/UaV
 QiyhPVENjaGCGDomZmJQqN7Yb02Wge1k8InQnodDrHxZNl/bX/B1Z8Bxd0n6hPHg
 NSJXYif2AxgaddpohsdygqRDbT6SNyQdj7YjJFY5qAGJ3yFyJ4JB6WTqkWW4o1vH
 3FVqdAnJmejAmmYSkah0Hkem2T5QASQmTWb93PLxiV6q+d0NM8stWAujjyVdIV/B
 W4Uj9mQ20cz54TjLtxqX+A1k6KcqOWRgh1l2QbUlFsgsOP3V8yz47yqYdR9qMWlO
 9kNEjI3sw+G/IQ==
 =q4rj
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt subsystem updates from Thomas Gleixner:

 - Consolidate the machine_kexec_mask_interrupts() by providing a
   generic implementation and replacing the copy & pasta orgy in the
   relevant architectures.

 - Prevent unconditional operations on interrupt chips during kexec
   shutdown, which can trigger warnings in certain cases when the
   underlying interrupt has been shut down before.

 - Make the enforcement of interrupt handling in interrupt context
   unconditionally available, so that it actually works for non x86
   related interrupt chips. The earlier enablement for ARM GIC chips set
   the required chip flag, but did not notice that the check was hidden
   behind a config switch which is not selected by ARM[64].

 - Decrapify the handling of deferred interrupt affinity setting.

   Some interrupt chips require that affinity changes are made from the
   context of handling an interrupt to avoid certain race conditions.
   For x86 this was the default, but with interrupt remapping this
   requirement was lifted and a flag was introduced which tells the core
   code that affinity changes can be done in any context. Unrestricted
   affinity changes are the default for the majority of interrupt chips.

   RISCV has the requirement to add the deferred mode to one of its
   interrupt controllers, but with the original implementation this
   would require to add the any context flag to all other RISC-V
   interrupt chips. That's backwards, so reverse the logic and require
   that chips, which need the deferred mode have to be marked
   accordingly. That avoids chasing the 'sane' chips and marking them.

 - Add multi-node support to the Loongarch AVEC interrupt controller
   driver.

 - The usual tiny cleanups, fixes and improvements all over the place.

* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
  genirq/timings: Add kernel-doc for a function parameter
  genirq: Remove IRQ_MOVE_PCNTXT and related code
  x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
  genirq: Provide IRQCHIP_MOVE_DEFERRED
  hexagon: Remove GENERIC_PENDING_IRQ leftover
  ARC: Remove GENERIC_PENDING_IRQ
  genirq: Remove handle_enforce_irqctx() wrapper
  genirq: Make handle_enforce_irqctx() unconditionally available
  irqchip/loongarch-avec: Add multi-nodes topology support
  irqchip/ts4800: Replace seq_printf() by seq_puts()
  irqchip/ti-sci-inta : Add module build support
  irqchip/ti-sci-intr: Add module build support
  irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
  irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
  genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
  kexec: Consolidate machine_kexec_mask_interrupts() implementation
  genirq: Reuse irq_thread_fn() for forced thread case
  genirq: Move irq_thread_fn() further up in the code
2025-01-21 13:51:07 -08:00
Joerg Roedel
125f34e4c1 Merge branches 'arm/smmu/updates', 'arm/smmu/bindings', 'qualcomm/msm', 'rockchip', 'riscv', 'core', 'intel/vt-d' and 'amd/amd-vi' into next 2025-01-17 09:02:35 +01:00
Jason Gunthorpe
082f1bcae8 iommu/amd: Fully decode all combinations of alloc_paging_flags
Currently AMD does not support
 IOMMU_HWPT_ALLOC_PASID | IOMMU_HWPT_ALLOC_DIRTY_TRACKING

It should be rejected. Instead it creates a V1 domain without dirty
tracking support.

Use a switch to fully decode the flags.
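
A minimal sketch of the decode, with hypothetical stand-ins for the
IOMMU_HWPT_ALLOC_* flags (the real function is
amd_iommu_domain_alloc_paging_flags()):

```
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the iommufd allocation flags. */
#define HWPT_ALLOC_DIRTY_TRACKING	(1u << 0)
#define HWPT_ALLOC_PASID		(1u << 1)

enum pgtable_pick { PGTABLE_V1, PGTABLE_V2, PGTABLE_DEFAULT };

static int pick_pgtable(uint32_t flags, enum pgtable_pick *out)
{
	switch (flags & (HWPT_ALLOC_DIRTY_TRACKING | HWPT_ALLOC_PASID)) {
	case 0:
		*out = PGTABLE_DEFAULT;		/* kernel default table type */
		return 0;
	case HWPT_ALLOC_DIRTY_TRACKING:
		*out = PGTABLE_V1;		/* only V1 supports dirty tracking */
		return 0;
	case HWPT_ALLOC_PASID:
		*out = PGTABLE_V2;		/* PASID needs the V2 table */
		return 0;
	default:
		return -EOPNOTSUPP;		/* PASID + dirty tracking: reject */
	}
}
```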

Fixes: ce2cd17546 ("iommu/amd: Enhance amd_iommu_domain_alloc_user()")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:32 +01:00
Jason Gunthorpe
5a081f7f42 iommu/amd: Move the nid to pdom_setup_pgtable()
The only thing that uses the nid is the io_pgtable code, and it should be
set before calling alloc_io_pgtable_ops() to ensure that the top levels
are allocated on the correct nid.

Since dev is never NULL now we can just do this trivially and remove the
other uses of nid. SVA and identity code paths never use it since they
don't use io_pgtable.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:31 +01:00
Jason Gunthorpe
13b4ec7491 iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
Currently it uses enum io_pgtable_fmt, which is from the io pagetable
code and most of whose enum values are invalid here.
protection_domain_mode is internal to the driver and has only the two
valid values.

Fix some signatures and variables to use the right type as well.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:30 +01:00
Jason Gunthorpe
55b237dd7f iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
do_iommu_domain_alloc() is only called from
amd_iommu_domain_alloc_paging_flags() so type is always
IOMMU_DOMAIN_UNMANAGED. Remove type and all the dead conditionals checking
it.

IOMMU_DOMAIN_IDENTITY checks are similarly obsolete as the conversion to
the global static identity domain removed those call paths.

The caller of protection_domain_alloc() should set the type; fix the
missing assignment in the SVA code.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:29 +01:00
Jason Gunthorpe
02bcd1a8b9 iommu/amd: Remove dev == NULL checks
This is no longer possible, amd_iommu_domain_alloc_paging_flags() is never
called with dev = NULL from the core code. Similarly
get_amd_iommu_from_dev() can never be NULL either.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:29 +01:00
Jason Gunthorpe
f9b80f941e iommu/amd: Remove domain_alloc()
IOMMU drivers should not be sensitive to the domain type, a paging domain
should be created based only on the flags passed in, the same for all
callers.

AMD was using the domain_alloc() path to force VFIO into a v1 domain type,
because v1 gives higher performance. However now that
IOMMU_HWPT_ALLOC_PASID is present, and a NULL device is not possible,
domain_alloc_paging_flags() will do the right thing for VFIO.

When invoked from VFIO flags will be 0 and the amd_iommu_pgtable type of
domain will be selected. This is v1 by default unless the kernel command
line has overridden it to v2.

If the admin is forcing v2 assume they know what they are doing so force
it everywhere, including for VFIO.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:28 +01:00
Alejandro Jimenez
1a684b099f iommu/amd: Remove unused amd_iommu_domain_update()
All the callers have been removed by the below commit, remove the
implementation and prototypes.

Fixes: 322d889ae7 ("iommu/amd: Remove amd_iommu_domain_update() from page table freeing")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-01-17 08:59:28 +01:00
Thomas Gleixner
7d04319a05 x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
Instead of marking individual interrupts as safe to be migrated in
arbitrary contexts, mark the interrupt chips, which require the interrupt
to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED
flag. This makes more sense because this is a per interrupt chip property
and not restricted to individual interrupts.

That flips the logic from the historical opt-out to an opt-in model. This is
simpler to handle for other architectures, which default to unrestricted
affinity setting. It also allows the redundant core logic to be cleaned up
significantly.

All interrupt chips, which belong to a top-level domain sitting directly on
top of the x86 vector domain are marked accordingly, unless the related
setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN.

No functional change intended.
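
A minimal sketch of the opt-in, using a hypothetical chip built on the
generic parent callbacks:

```
#include <linux/irq.h>

/* Chips that must move IRQs from interrupt context set the new flag;
 * everything else gets unrestricted affinity changes by default. */
static struct irq_chip example_deferred_chip = {
	.name			= "example-deferred",
	.irq_mask		= irq_chip_mask_parent,
	.irq_unmask		= irq_chip_unmask_parent,
	.irq_set_affinity	= irq_chip_set_affinity_parent,
	.flags			= IRQCHIP_MOVE_DEFERRED,
};
```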

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
2025-01-15 21:38:53 +01:00
Yi Liu
5f53638882 iommu/amd: Make the blocked domain support PASID
The blocked domain can be extended to park a device's PASID in the DMA
blocking state. With this, the remove_dev_pasid() op is dropped.

Remove the PASID from the old domain and the device's GCR3 table. There
is no need to attach the PASID to the blocked domain, as clearing the
PASID from the GCR3 table will make sure all DMAs for that PASID are
blocked.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-7-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-12-18 09:39:37 +01:00