Introduce the helper function cdns_pcie_host_disable() which will undo
the configuration performed by cdns_pcie_host_setup(). Also, export it
for use by existing callers of cdns_pcie_host_setup(), thereby allowing
them to cleanup on their exit path.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250417124408.2752248-3-s-vadapalli@ti.com
Currently, the Cadence PCIe controller driver can be built as a built-in
module only. Since PCIe functionality is not a necessity for booting, add
support to build the Cadence PCIe driver as a loadable module as well.
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250417124408.2752248-2-s-vadapalli@ti.com
No need to split the line in __pci_setup_bridge() as it is way shorter
than the limit.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20250404124547.51185-1-ilpo.jarvinen@linux.intel.com
Resource fitting/assignment code checks if there's a remainder in
add_list (aka. realloc_head in the inner functions) using BUG_ON().
This problem typically results in a mere PCI device resource assignment
failure which does not warrant using BUG_ON(). The machine could well
come up usable even if this condition occurs because the realloc_head
relates to resources which are optional anyway.
Change BUG_ON() to WARN_ON_ONCE() and free the list if it's not empty.
[bhelgaas: subject]
Reported-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/linux-pci/5f103643-5e1c-43c6-b8fe-9617d3b5447c@linaro.org
Link: https://lore.kernel.org/r/20250511215223.7131-1-ilpo.jarvinen@linux.intel.com
This common library will be used as a placeholder for helper functions
shared by the host controller drivers. This avoids placing the host
controller drivers specific helpers in drivers/pci/*.c, to avoid enlarging
the kernel image on platforms that do not use host controller drivers at
all (like x86/ACPI platforms).
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250508-pcie-reset-slot-v4-3-7050093e2b50@linaro.org
A PCI device is just another peripheral in a system. So failure to
recover it, must not result in a kernel panic. So remove the TODO which
is quite misleading.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Link: https://patch.msgid.link/20250508-pcie-reset-slot-v4-1-7050093e2b50@linaro.org
The Cadence PCIe controller driver defines message codes in enum
cdns_pcie_msg_code duplicating the existing PCIE_MSG_CODE_* definitions in
drivers/pci/pci.h. The driver only uses ASSERT_INTA and DEASSERT_INTA codes
from this enum.
Remove the redundant Cadence-specific enum definitions and use the ones
available in drivers/pci/pci.h. This helps in avoiding code duplication,
maintaining consistency with the spec, and simplifying the code
maintenance.
Signed-off-by: Hans Zhang <18255117159@163.com>
[mani: commit message rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250401145023.22948-1-18255117159@163.com
The kdoc for pci_epc_set_msix() says:
"Invoke to set the required number of MSI-X interrupts."
The kdoc for the callback pci_epc_ops->set_msix() says:
"ops to set the requested number of MSI-X interrupts in the MSI-X
capability register"
pci_epc_ops::set_msix() does however expect the parameter 'interrupts' to
be in the encoding as defined by the Table Size field. Nowhere in the
kdoc does it say that the number of interrupts should be in Table Size
encoding.
It is very confusing that the API pci_epc_set_msix() and the callback
function pci_epc_ops::set_msix() both take a parameter named 'interrupts',
but they expect completely different encodings.
Clean up the API and the callback function to have the same semantics,
i.e. the parameter represents the number of interrupts, regardless of the
internal encoding of that value.
Also rename the parameter 'interrupts' to 'nr_irqs', in both the wrapper
function and the callback function, such that the name is unambiguous.
[bhelgaas: more specific subject]
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable+noautosel@kernel.org # this is simply a cleanup
Link: https://patch.msgid.link/20250514074313.283156-14-cassel@kernel.org
The kdoc for pci_epc_set_msi() says:
"Invoke to set the required number of MSI interrupts."
The kdoc for the callback pci_epc_ops::set_msi() says:
"ops to set the requested number of MSI interrupts in the MSI capability
register"
pci_epc_ops::set_msi() does however expect the parameter 'interrupts' to be
in the encoding as defined by the Multiple Message Capable (MMC) field of
the MSI capability structure. Nowhere in the kdoc does it say that the
number of interrupts should be in MMC encoding.
It is very confusing that the API pci_epc_set_msi() and the callback
function pci_epc_ops::set_msi() both take a parameter named 'interrupts',
but they expect completely different encodings.
Clean up the API and the callback function to have the same semantics,
i.e. the parameter represents the number of interrupts, regardless of the
internal encoding of that value.
Also rename the parameter 'interrupts' to 'nr_irqs', in both the wrapper
function and the callback function, such that the name is unambiguous.
[bhelgaas: more specific subject]
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable+noautosel@kernel.org # this is simply a cleanup
Link: https://patch.msgid.link/20250514074313.283156-13-cassel@kernel.org
The kdoc for pci_epc_get_msix() says:
"Invoke to get the number of MSI-X interrupts allocated by the RC"
The kdoc for the callback pci_epc_ops->get_msix() says:
"ops to get the number of MSI-X interrupts allocated by the RC from the
MSI-X capability register"
pci_epc_ops::get_msix() does however return the number of interrupts in the
encoding as defined by the Table Size field. Nowhere in the kdoc does it
say that the returned number of interrupts is in Table Size encoding.
It is very confusing that the API pci_epc_get_msix() and the callback
function pci_epc_ops::get_msix() don't return the same value.
Clean up the API and the callback function to have the same semantics,
i.e. return the number of interrupts, regardless of the internal encoding
of that value.
[bhelgaas: more specific subject]
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: stable+noautosel@kernel.org # this is simply a cleanup
Link: https://patch.msgid.link/20250514074313.283156-12-cassel@kernel.org
The kdoc for API pci_epc_get_msi() says:
"Invoke to get the number of MSI interrupts allocated by the RC"
The kdoc for the callback pci_epc_ops::get_msi() says:
"ops to get the number of MSI interrupts allocated by the RC from
the MSI capability register"
pci_epc_ops::get_msi() does however return the number of interrupts in the
encoding as defined by the Multiple Message Enable (MME) field of the MSI
Capability structure.
Nowhere in the kdoc does it say that the returned number of interrupts is
in MME encoding. It is very confusing that the API pci_epc_get_msi() and
the callback function pci_epc_ops::get_msi() don't return the same value.
Clean up the API and the callback function to have the same semantics,
i.e. return the number of interrupts, regardless of the internal encoding
of that value.
[bhelgaas: more specific subject]
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: stable+noautosel@kernel.org # this is simply a cleanup
Link: https://patch.msgid.link/20250514074313.283156-11-cassel@kernel.org
While cdns_pcie_ep_set_msix() writes the Table Size field correctly (N-1),
the calculation of the PBA offset is wrong because it calculates space for
(N-1) entries instead of N.
This results in the following QEMU error when using PCI passthrough on a
device which relies on the PCI endpoint subsystem:
failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they don't fit in BARs, or don't align
Fix the calculation of PBA offset in the MSI-X capability.
[bhelgaas: more specific subject and commit log]
Fixes: 3ef5d16f50 ("PCI: cadence: Add MSI-X support to Endpoint driver")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250514074313.283156-10-cassel@kernel.org
While dw_pcie_ep_set_msix() writes the Table Size field correctly (N-1),
the calculation of the PBA offset is wrong because it calculates space for
(N-1) entries instead of N.
This results in the following QEMU error when using PCI passthrough on a
device which relies on the PCI endpoint subsystem:
failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they don't fit in BARs, or don't align
Fix the calculation of PBA offset in the MSI-X capability.
[bhelgaas: more specific subject and commit log]
Fixes: 83153d9f36 ("PCI: endpoint: Fix ->set_msix() to take BIR and offset as arguments")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250514074313.283156-9-cassel@kernel.org
When allocating the shared ctrl/SPAD space, epf_ntb_config_spad_bar_alloc()
should not try to handle the size quirks for underlying BAR, whether it is
fixed size or alignment. This is already handled by pci_epf_alloc_space().
Also, when handling the alignment, this allocates more space than
necessary. For example, with a SPAD size of 1024B and a ctrl size of 308B,
the space necessary is 1332B. If the alignment is 1MB,
epf_ntb_config_spad_bar_alloc() tries to allocate 2MB where 1MB would have
been more than enough.
Drop the handling of the BAR size quirks and let pci_epf_alloc_space()
handle that. Just make sure the 32bits SPAD register are aligned on 32bits.
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250424-pci-ep-size-alignment-v5-2-2d4ec2af23f5@baylibre.com
When allocating space for an endpoint function on a BAR with a fixed size,
the size saved in 'struct pci_epf_bar.size' should be the fixed size as
expected by pci_epc_set_bar().
However, if pci_epf_alloc_space() increased the allocation size to
accommodate iATU alignment requirements, it previously saved the larger
aligned size in .size, which broke pci_epc_set_bar().
To solve this, keep the fixed BAR size in .size and save the aligned size
in a new .aligned_size for use when deallocating it.
Fixes: 2a9a801620 ("PCI: endpoint: Add support to specify alignment for buffers allocated to BARs")
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
[mani: commit message fixup]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[bhelgaas: more specific subject, commit log, wrap comment to match file]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20250424-pci-ep-size-alignment-v5-1-2d4ec2af23f5@baylibre.com
- new two step DMA mapping API, which is is a first step to a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists; this new approach optimizes
some calls to DMA-IOMMU layer and cache maintenance by batching them,
reduces memory usage as it is no need to store mapped DMA addresses to
unmap them, and reduces some function call overhead; it is a combination
effort of many people, lead and developed by Christoph Hellwig and Leon
Romanovsky
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaDRXIQAKCRCJp1EFxbsS
RG8tAP9kgjIwMoJqfr6DC8yYraIIUuNDyhb/fZ9vPppW6Cb7aAD/cg8udjrsUu3h
iAZBIHkYuWmkx8JG7t5/lqBc4AOC1AA=
=F3TU
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
"New two step DMA mapping API, which is is a first step to a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists.
This new approach optimizes some calls to DMA-IOMMU layer and cache
maintenance by batching them, reduces memory usage as it is no need to
store mapped DMA addresses to unmap them, and reduces some function
call overhead. It is a combination effort of many people, lead and
developed by Christoph Hellwig and Leon Romanovsky"
* tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
docs: core-api: document the IOVA-based API
dma-mapping: add a dma_need_unmap helper
dma-mapping: Implement link/unlink ranges API
iommu/dma: Factor out a iommu_dma_map_swiotlb helper
dma-mapping: Provide an interface to allow allocate IOVA
iommu: add kernel-doc for iommu_unmap_fast
iommu: generalize the batched sync after map interface
dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
PCI/P2PDMA: Refactor the p2pdma mapping helpers
- Switch the MSI decriptor locking to lock guards
- Replace a broken and naive implementation of PCI/MSI-X control word
updates in the PCI/TPH driver with a properly serialized variant in the
PCI/MSI core code.
- Remove the MSI descriptor abuse in the SCCI/UFS/QCOM driver by
replacing the direct access to the MSI descriptors with the proper API
function calls. People will never understand that APIs exist for a
reason...
- Provide core infrastructre for the upcoming PCI endpoint library
extensions. Currently limited to ARM GICv3+, but in theory extensible
to other architectures.
- Provide a MSI domain::teardown() callback, which allows drivers to undo
the effects of the prepare() callback.
- Move the MSI domain::prepare() callback invocation to domain creation
time to avoid redundant (and in case of ARM/GIC-V3-ITS confusing)
invocations on every allocation.
In combination with the new teardown callback this removes some ugly
hacks in the GIC-V3-ITS driver, which pretended to work around the
short comings of the core code so far. With this update the code is
correct by design and implementation.
- Make the irqchip MSI library globally available, provide a MSI parent
domain creation helper and convert a bunch of (PCI/)MSI drivers over to
the modern MSI parent mechanism. This is the first step to get rid of
at least one incarnation of the three PCI/MSI management schemes.
- The usual small cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzgFsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR0KD/402K12tlI/D70H2aTG25dbTx+dkVk+
pKpJz0985uUlLJiPCR54dZL0ofcfRU+CdjEIf1I+6TPshtg6IWLJCfqu7OWVPYzz
2lJDO0yeUGwJqc0CIa1vttvJWvcUcxfWBX/ZSkOIM5avaXqSwRwsFNfd7TQ+T+eG
79VS1yyW197mUva53ekSF2voa8EEPWfEslAjoX1dRg5d4viAxaLtKm/KpBqo1oPh
Eb+E67xEWiIonvWNdr1AOisxnbi19PyDo1xnftgBToaeXXYBodNrNIAfAkx40YUZ
IZQLHvhZ91x15hXYIS4Cz1RXqPECbu/tHxs4AFUgGvqdgJUF89wzI3C21ymrKA6E
tDlWfpIcuE3vV/bsqj1gHGL5G5m1tyBRgIdIAOOmMoTHvwp5rrQtuZzpuqzGmEzj
iVIHnn5m08kRpOZQc7+PlxQMh3eunEyj9WWG49EJgoAnJPb5lou4shTwBUheHcKm
NXxKsfo4x5C+WehGTxv80UlnMcK3Yh/TuWf2OPR6QuT2iHP2VL5jyHjIs0ICn0cp
1tvSJtdc1rgvk/4Vn4lu5eyVaTx5ZAH8ZXNQfwwBTWTp3ZyAW+7GkaCq3LPaNJoZ
4LWpgZ5gs6wT+1XNT3boKdns81VolmeTI8P1ciQKpUtaTt6Cy9P/i2az/J+BCS4U
Fn5Qqk08PHGrUQ==
=OBMj
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for the MSI subsystem (core code and PCI):
- Switch the MSI descriptor locking to lock guards
- Replace a broken and naive implementation of PCI/MSI-X control word
updates in the PCI/TPH driver with a properly serialized variant in
the PCI/MSI core code.
- Remove the MSI descriptor abuse in the SCCI/UFS/QCOM driver by
replacing the direct access to the MSI descriptors with the proper
API function calls. People will never understand that APIs exist
for a reason...
- Provide core infrastructre for the upcoming PCI endpoint library
extensions. Currently limited to ARM GICv3+, but in theory
extensible to other architectures.
- Provide a MSI domain::teardown() callback, which allows drivers to
undo the effects of the prepare() callback.
- Move the MSI domain::prepare() callback invocation to domain
creation time to avoid redundant (and in case of ARM/GIC-V3-ITS
confusing) invocations on every allocation.
In combination with the new teardown callback this removes some
ugly hacks in the GIC-V3-ITS driver, which pretended to work around
the short comings of the core code so far. With this update the
code is correct by design and implementation.
- Make the irqchip MSI library globally available, provide a MSI
parent domain creation helper and convert a bunch of (PCI/)MSI
drivers over to the modern MSI parent mechanism. This is the first
step to get rid of at least one incarnation of the three PCI/MSI
management schemes.
- The usual small cleanups and improvements"
* tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
PCI/MSI: Use bool for MSI enable state tracking
PCI: tegra: Convert to MSI parent infrastructure
PCI: xgene: Convert to MSI parent infrastructure
PCI: apple: Convert to MSI parent infrastructure
irqchip/msi-lib: Honour the MSI_FLAG_NO_AFFINITY flag
irqchip/mvebu: Convert to msi_create_parent_irq_domain() helper
irqchip/gic: Convert to msi_create_parent_irq_domain() helper
genirq/msi: Add helper for creating MSI-parent irq domains
irqchip: Make irq-msi-lib.h globally available
irqchip/gic-v3-its: Use allocation size from the prepare call
genirq/msi: Engage the .msi_teardown() callback on domain removal
genirq/msi: Move prepare() call to per-device allocation
irqchip/gic-v3-its: Implement .msi_teardown() callback
genirq/msi: Add .msi_teardown() callback as the reverse of .msi_prepare()
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
dt-bindings: PCI: pci-ep: Add support for iommu-map and msi-map
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
genirq/msi: Rename msi_[un]lock_descs()
...
- Consolidate on one set of functions for the interrupt domain code to
get rid of pointlessly duplicated code with only marginal different
semantics.
- Update the documentation accordingly and consolidate the coding style
of the irqdomain header.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzd+MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodTRD/0RmG5tngCbEJmTw6lPDQzRZH4OO3ja
yRYlyBipemoRmvJRGjV4uHqN2QPrdOuoqMuyBO1aWcMdkpww5bAHcbgSFrlGM1lW
kqtaxVMbufPiLQSGYe7OQf478CE1ykoBd5Va8whFKrtA73qEUdEMfWT0stspg780
7BlmQOemL91p7Ytf03FbDdo8tZ5Xu9uXGAulwY9FZsFtsCNyvhl7nOv5Sk8ZQtGO
xHRCeunjZLWR+IaK59hdakvQybXwSnjT6jODp96nlyKABEKSPShGSPFDWd3g9px7
4911QwgnvTbcrsk6YmQEmPIOgXZzypjbnjpJr8tFpTbkVIy+6chi5cBJzXoqsUaM
ylTwFcUQNvcP8yF447qb+nyPFKM5xsC07W0UpZMuJUDmhhPRtDm5pK0jpsif96GP
l4aMsWe65PUmXHQqLdE89RJXAa8XQ2qspKVtNKq9DmEVgTviQ09Z9SSQIx4U0yIx
w+YPde8kH2+O+YtMUn/MmfHhUP4MKya7j5zd8Bnv8wLBi7XGPPA5EKKh9I0dz9m+
X94lweNXyH+Q8U9mt2cQf8VG8Yzgk0eeC0sliJIlybwRgEgRcQbVWw0VvZUA1ySa
VBlaj3SinO90FEQ0CctT51ss2mUJ/XsGCnxpiGZXfqIZzFbyD1YfZQnXJH0H67DI
CqdHw22I27Mu/A==
=9nLp
-----END PGP SIGNATURE-----
Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
"A set of cleanups for the generic interrupt subsystem:
- Consolidate on one set of functions for the interrupt domain code
to get rid of pointlessly duplicated code with only marginal
different semantics.
- Update the documentation accordingly and consolidate the coding
style of the irqdomain header"
* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
irqdomain: Consolidate coding style
irqdomain: Fix kernel-doc and add it to Documentation
Documentation: irqdomain: Update it
Documentation: irq-domain.rst: Simple improvements
Documentation: irq/concepts: Minor improvements
Documentation: irq/concepts: Add commas and reflow
irqdomain: Improve kernel-docs of functions
irqdomain: Make struct irq_domain_info variables const
irqdomain: Use irq_domain_instantiate()'s return value as initializers
irqdomain: Drop irq_linear_revmap()
pinctrl: keembay: Switch to irq_find_mapping()
irqchip/armada-370-xp: Switch to irq_find_mapping()
gpu: ipu-v3: Switch to irq_find_mapping()
gpio: idt3243x: Switch to irq_find_mapping()
sh: Switch to irq_find_mapping()
powerpc: Switch to irq_find_mapping()
irqdomain: Drop irq_domain_add_*() functions
powerpc: Switch irq_domain_add_nomap() to use fwnode
thermal: Switch to irq_domain_create_linear()
soc: Switch to irq_domain_create_*()
...
- Large rework of the protected key crypto code to allow for asynchronous
handling without memory allocation
- Speed up system call entry/exit path by re-implementing lazy ASCE
handling
- Add module autoload support for the diag288_wdt watchdog device driver
- Get rid of s390 specific strcpy() and strncpy() implementations, and
switch all remaining users to strscpy() when possible
- Various other small fixes and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmgwuaIACgkQIg7DeRsp
bsJ+xA//VTVWo15XHqX6xuJOSGSjwE7BbyA2RBNoIK+qH2PqCRvwvJU+QUcmjiKF
v+GHOL5GGypKneEVjHCB4xTlKeBPt61yf9/dmTbE4K8KD/3olpHcC6BY62VJiet1
hDd6Tx9JT3TmjiTV+BanJ5KdiioUxc7jjJ2etMpQkCsHwAZlBtJ+yK+P/IoSI2kP
80hhNPHvlBadd36ke+Ell95TJBoxhQkKPX/u8ryIKybCvUf4ybsoEaYSPJv8/2I8
G4IqhTQ0Ft5BcO0g2gcfkpU+8HUtrqhjIDDDhy2h42wuWXsE2Jn3/PieNJUybtwA
QdsAcN7P85a1TX2L7e4cEM9LEeIk15syZ8mUDKl6H69UHz4M8VNjMO8XfPpFD2ml
5xCPp7D+RcvUkdnjmz0jLx18mgrF9xjTTqrhYwcP2PWtANvf2m602ElEUlIWuRpI
Mew/wgCfSdqKsYEbPenvfT+XTd62+hMdo3U3dwZ6OZlKwVXNTk5EUXe3UgMDkKdO
37OmAh5N2DcYwtPtL9vtl1CLguVJD+FXRnB/veHLCg25+OO/G4IrobySOjI7K0m7
C48Z5lUjxVJiRjKe35XXwBjuEaN+/TctK1kH3RfCDQGvBGBNoPyvC+L4kOprsbce
1Lnc7/pFCKdP5y90sJMomMdJCyP5BHKiJvHHWTr8maeslfxd9Vw=
=lFqh
-----END PGP SIGNATURE-----
Merge tag 's390-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Large rework of the protected key crypto code to allow for
asynchronous handling without memory allocation
- Speed up system call entry/exit path by re-implementing lazy ASCE
handling
- Add module autoload support for the diag288_wdt watchdog device
driver
- Get rid of s390 specific strcpy() and strncpy() implementations, and
switch all remaining users to strscpy() when possible
- Various other small fixes and improvements
* tag 's390-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (51 commits)
s390/pci: Serialize device addition and removal
s390/pci: Allow re-add of a reserved but not yet removed device
s390/pci: Prevent self deletion in disable_slot()
s390/pci: Remove redundant bus removal and disable from zpci_release_device()
s390/crypto: Extend protected key conversion retry loop
s390/pci: Fix __pcilg_mio_inuser() inline assembly
s390/ptrace: Always inline regs_get_kernel_stack_nth() and regs_get_register()
s390/thread_info: Cleanup header includes
s390/extmem: Add workaround for DCSS unload diag
s390/crypto: Rework protected key AES for true asynch support
s390/cpacf: Rework cpacf_pcc() to return condition code
s390/mm: Fix potential use-after-free in __crst_table_upgrade()
s390/mm: Add mmap_assert_write_locked() check to crst_table_upgrade()
s390/string: Remove strcpy() implementation
s390/con3270: Use strscpy() instead of strcpy()
s390/boot: Use strspcy() instead of strcpy()
s390: Simple strcpy() to strscpy() conversions
s390/pkey/crypto: Introduce xflags param for pkey in-kernel API
s390/pkey: Provide and pass xflags within pkey and zcrypt layers
s390/uv: Remove uv_get_secret_metadata function
...
Commits b88cbaaa6f ("PCI/pwrctrl: Rename pwrctl files to pwrctrl") and
3f925cd628 ("PCI/pwrctrl: Rename pwrctrl functions and structures")
renamed the "pwrctl" framework to "pwrctrl" for consistency reasons.
Rename also the Kconfig symbols so that they reflect the new name while
adding entries for the deprecated ones. The old symbols can be removed once
everything that depends on them has been updated.
Note that no deprecated symbol is added for the new slot driver to avoid
having to add a user visible option.
Rename the new slot module to reflect the framework name and match the
other pwrctrl modules.
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Link: https://patch.msgid.link/20250402132634.18065-2-johan+linaro@kernel.org
It's possible to trigger use-after-free here by:
(a) forcing rescan_work_func() to take a long time and
(b) utilizing a pwrctrl driver that may be unloaded for some reason
Cancel outstanding work to ensure it is finished before we allow our data
structures to be cleaned up.
[bhelgaas: tidy commit log]
Fixes: 8f62819aaa ("PCI/pwrctl: Rescan bus on a separate thread")
Signed-off-by: Brian Norris <briannorris@google.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Link: https://patch.msgid.link/20250409115313.1.Ia319526ed4ef06bec3180378c9a008340cec9658@changeid
struct pci_packet contains a "message" field that is a flex array
of struct pci_message. struct pci_packet is usually followed by a
second struct in a containing struct that is defined locally in
individual functions in pci-hyperv.c. As such, the compiler
flag -Wflex-array-member-not-at-end (introduced in gcc-14) generates
multiple warnings such as:
drivers/pci/controller/pci-hyperv.c:3809:35: warning: structure
containing a flexible array member is not at the end of another
structure [-Wflex-array-member-not-at-end]
The Linux kernel intends to introduce this compiler flag in standard
builds, so the current code is problematic in generating these warnings.
The "message" field is used only to locate the start of the second
struct, and not as an array. Because the second struct can be
addressed directly, the "message" field is not really necessary.
Rather than try to fix its usage to meet the requirements of
-Wflex-array-member-not-at-end, just eliminate the field and
either directly reference the second struct, or use "pkt + 1"
when "pkt" is dynamically allocated.
Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250514044440.48924-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250514044440.48924-1-mhklinux@outlook.com>
The hyperv-pci driver uses ACPI for MSI IRQ domain configuration on
arm64. It won't be able to do that in the VTL mode where only DeviceTree
can be used.
Update the hyperv-pci driver to get vPCI MSI IRQ domain in the DeviceTree
case, too.
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250428210742.435282-12-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250428210742.435282-12-romank@linux.microsoft.com>
Allow userspace to read/write log ratelimits per device (including
enable/disable). Create aer/ sysfs directory to store them and any
future AER configs.
The new sysfs files are:
/sys/bus/pci/devices/*/aer/correctable_ratelimit_burst
/sys/bus/pci/devices/*/aer/correctable_ratelimit_interval_ms
/sys/bus/pci/devices/*/aer/nonfatal_ratelimit_burst
/sys/bus/pci/devices/*/aer/nonfatal_ratelimit_interval_ms
The default values are ratelimit_burst=10, ratelimit_interval_ms=5000, so
if we try to emit more than 10 messages in a 5 second period, some are
suppressed.
Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
attributes (e.g. stats and ratelimits).
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
sysfs-bus-pci-devices-aer
Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
Sent 6 AER errors. Observed 5 errors logged while AER stats
(cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
logged and accounted in AER stats (12 total errors).
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
[bhelgaas: note fatal errors are not ratelimited, "aer_report" ->
"aer_info", replace ratelimit_log_enable toggle with *_ratelimit_interval_ms]
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-21-helgaas@kernel.org
Spammy devices can flood kernel logs with AER errors and slow/stall
execution. Add per-device ratelimits for AER correctable and non-fatal
uncorrectable errors that use the kernel defaults (10 per 5s). Logging of
fatal errors is not ratelimited.
There are two AER logging entry points:
- aer_print_error() is used by DPC and native AER
- pci_print_aer() is used by GHES and CXL
The native AER aer_print_error() case includes a loop that may log details
from multiple devices, which are ratelimited individually. If we log
details for any device, we also log the Error Source ID from the Root Port
or RCEC.
If no such device details are found, we still log the Error Source from the
ERR_* Message, ratelimited by the Root Port or RCEC that received it.
The DPC aer_print_error() case is not ratelimited, since this only happens
for fatal errors.
The CXL pci_print_aer() case is ratelimited by the Error Source device.
The GHES pci_print_aer() case is via aer_recover_work_func(), which
searches for the Error Source device. If the device is not found, there's
no per-device ratelimit, so we use a system-wide ratelimit that covers all
error types (correctable, non-fatal, and fatal).
Sargun at Meta reported internally that a flood of AER errors causes RCU
CPU stall warnings and CSD-lock warnings.
Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
true count of 11.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
[bhelgaas: commit log, factor out trace_aer_event() and aer_print_rp_info()
changes to previous patches, enable Error Source logging if any downstream
detail will be printed, don't ratelimit fatal errors, "aer_report" ->
"aer_info", "cor_log_ratelimit" -> "correctable_ratelimit",
"uncor_log_ratelimit" -> "nonfatal_ratelimit"]
Reported-by: Sargun Dhillon <sargun@meta.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-19-helgaas@kernel.org
Return -ENOSPC error early so the usual path through add_error_device() is
the straightline code.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-18-helgaas@kernel.org
Previously aer_get_device_error_info() and aer_print_error() took a pointer
to struct aer_err_info and a pointer to a pci_dev. Typically the pci_dev
was one of the elements of the aer_err_info.dev[] array (DPC was an
exception, where the dev[] array was unused).
Convert aer_get_device_error_info() and aer_print_error() to take an index
into the aer_err_info.dev[] array instead. A future patch will add
per-device ratelimit information, so the index makes it convenient to find
the ratelimit associated with the device.
To accommodate DPC, set info->dev[0] to the DPC port before using these
interfaces.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-17-helgaas@kernel.org
Update name to reflect the broader definition of structs/variables that are
stored (e.g. ratelimits). This is a preparatory patch for adding rate limit
support.
[bhelgaas: "aer_report" -> "aer_info"]
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-16-helgaas@kernel.org
Some existing logs in pci_print_aer() log with error severity by default.
Convert them to use KERN_WARNING for correctable errors and KERN_ERR for
uncorrectable errors.
[bhelgaas: commit log]
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-15-helgaas@kernel.org
aer_print_error() produces output at a printk level (KERN_ERR/KERN_WARNING/
etc) that depends on the kind of error, and it calls pcie_print_tlp_log(),
which previously always produced output at KERN_ERR.
Add a "level" parameter so aer_print_error() can control the level of the
pcie_print_tlp_log() output to match.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-14-helgaas@kernel.org
When reporting an AER error, we check its type multiple times to determine
the log level for each message. Do this check only in the top-level
functions (aer_isr_one_error(), pci_print_aer()) and save the level in
struct aer_err_info.
[bhelgaas: save log level in struct aer_err_info instead of passing it
as a parameter]
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-13-helgaas@kernel.org
As with the AER statistics, we always want to emit trace events, even if
the actual dmesg logging is rate limited.
Call trace_aer_event() immediately after pci_dev_aer_stats_incr() so both
happen before ratelimiting.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-12-helgaas@kernel.org
There are two AER logging entry points:
- aer_print_error() is used by DPC (dpc_process_error()) and native AER
handling (aer_process_err_devices()).
- pci_print_aer() is used by GHES (aer_recover_work_func()) and CXL
(cxl_handle_rdport_errors())
Both use __aer_print_error() to print the AER error bits. Previously
__aer_print_error() also incremented the AER statistics via
pci_dev_aer_stats_incr().
Call pci_dev_aer_stats_incr() early in the entry points instead of in
__aer_print_error() so we update the statistics even if the actual printing
of error bits is rate limited by a future change.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-11-helgaas@kernel.org
Simplify pci_print_aer() by initializing the struct aer_err_info "info"
with a designated initializer list (it was previously initialized with
memset()) and using pci_name().
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-10-helgaas@kernel.org
Previously the struct aer_err_info "e_info" was allocated on the stack
without being initialized, so it contained junk except for the fields we
explicitly set later.
Initialize "e_info" at declaration with a designated initializer list,
which initializes the other members to zero.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-9-helgaas@kernel.org
Move aer_print_source() earlier in the file so a future change can use it
from aer_print_error(), where it's easier to rate limit it.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-8-helgaas@kernel.org
Rename aer_print_port_info() to aer_print_source() to be more descriptive.
This logs the Error Source ID logged by a Root Port or Root Complex Event
Collector when it receives an ERR_COR, ERR_NONFATAL, or ERR_FATAL Message.
[bhelgaas: aer_print_rp_info() -> aer_print_source()]
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-7-helgaas@kernel.org
Use PCI_BUS_NUM(), PCI_SLOT(), PCI_FUNC() to extract the bus number,
device, and function number directly from the Error Source ID. There's no
need to shift and mask it explicitly.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-6-helgaas@kernel.org
Previously we decoded the AER Error Source ID in aer_isr_one_error_type(),
then again in find_source_device() if we didn't find any devices with
errors logged in their AER Capabilities.
Consolidate this so we only decode and log the Error Source ID once in
aer_isr_one_error_type(). Add a "found" parameter so we can add a note
when we didn't find any downstream devices with errors logged in their AER
Capability.
This changes the dmesg logging when we found no devices with errors logged:
- pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0
- pci 0000:00:01.0: AER: found no error details for 0000:02:00.0
+ pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0 (no details found)
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-5-helgaas@kernel.org
aer_isr_one_error() duplicates the Error Source ID logging and AER error
processing for Correctable Errors and Uncorrectable Errors. Factor out the
duplicated code to aer_isr_one_error_type().
aer_isr_one_error() doesn't need the struct aer_rpc pointer, so pass it the
Root Port or RCEC pci_dev pointer instead.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-4-helgaas@kernel.org
DPC Error Source ID is only valid when the DPC Trigger Reason indicates
that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
Message (PCIe r6.0, sec 7.9.14.5).
When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
log the Error Source ID (decoded into domain/bus/device/function). Don't
print the source otherwise, since it's not valid.
For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
logging changes:
- pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
- pci 0000:00:01.0: DPC: ERR_FATAL detected
+ pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0
and when DPC triggered for other reasons, where DPC Error Source ID is
undefined, e.g., unmasked uncorrectable error:
- pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
- pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
+ pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected
Previously the "containment event" message was at KERN_INFO and the
"%s detected" message was at KERN_WARNING. Now the single message is at
KERN_WARNING.
Fixes: 26e5157133 ("PCI: Add Downstream Port Containment driver")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250522232339.1525671-3-helgaas@kernel.org
Previously the struct aer_err_info "info" was allocated on the stack
without being initialized, so it contained junk except for the fields we
explicitly set later.
Initialize "info" at declaration so it starts as all zeros.
Fixes: 8aefa9b0d9 ("PCI/DPC: Print AER status in DPC event handling")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://patch.msgid.link/20250522232339.1525671-2-helgaas@kernel.org
As disable_slot() takes a struct zpci_dev from the Configured to the
Standby state. In Standby there is still a hotplug slot so this is not
usually a case of sysfs self deletion. This is important because self
deletion gets very hairy in terms of locking (see for example
recover_store() in arch/s390/pci/pci_sysfs.c).
Because the pci_dev_put() is not within the critical section of the
zdev->state_lock however, disable_slot() can turn into a case of self
deletion if zPCI device event handling slips between the mutex_unlock()
and the pci_dev_put(). If the latter is the last put and
zpci_release_device() is called this then tries to remove the hotplug
slot via zpci_exit_slot() which will try to remove the hotplug slot
directory the disable_slot() is part of i.e. self deletion.
Prevent this by widening the zdev->state_lock critical section to
include the pci_dev_put() which is then guaranteed to happen with the
struct zpci_dev still in Standby state ensuring it will not lead to
a zpci_release_device() call as at least the zPCI event handling code
still holds a reference.
Cc: stable@vger.kernel.org
Fixes: a46044a92a ("s390/pci: fix zpci_zdev_put() on reserve")
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Tested-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In the pci_acpi_scan_root() function, when creating a PCI bus fails,
we need to free up the previously allocated memory, which can avoid
invalid memory usage and save resources.
Fixes: 789befdfa3 ("arm64: PCI: Migrate ACPI related functions to pci-acpi.c")
Signed-off-by: Zhe Qiao <qiaozhe@iscas.ac.cn>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250430060603.381504-1-qiaozhe@iscas.ac.cn
The subsystem-internal header pci.h still contains the function
prototype of pcim_intx(), which has since been made public in the global
header.
Remove the redundant function prototype.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250522084626.150148-2-phasta@kernel.org
Convert pci_msi_enable and pci_msi_enabled() to use bool type for clarity.
No functional changes, only code cleanup.
Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250516165223.125083-2-18255117159@163.com
pci/iomap.c still contains warnings about those functions not behaving
in a managed manner if pcim_enable_device() was called. Since all hybrid
behavior that users could know about has been removed by now, those
explicit warnings are no longer necessary.
Remove the hybrid-devres usage warnings from the kernel-doc.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/r/20250519112959.25487-8-phasta@kernel.org
When the demangling of the hybrid devres functions within PCI was
implemented, it was necessary to implement several PCI functions a
second time to avoid cyclic calls, since the hybrid functions in pci.c
call the managed functions in devres.c, which in turn can be directly
used outside of PCI and needed request infrastructure, too.
Therefore, __pcim_request_region_range(), __pci_release_region_range()
and wrappers around them were implemented.
The hybrid nature has recently been removed from all functions in pci.c.
Therefore, the functions in devres.c can now directly use their
counterparts in pci.c without causing a call-cycle.
Remove __pcim_request_region_range(), __pcim_request_region_range() and
the wrappers. Use the corresponding request functions from pci.c in
devres.c
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/r/20250519112959.25487-7-phasta@kernel.org
pcim_request_region_exclusive(), the only user in PCI devres that needed
exclusive region requests, has been removed.
All features related to exclusive requests can, therefore, be removed,
too. Remove them.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250519112959.25487-6-phasta@kernel.org
pcim_request_region_exclusive() was only needed for redirecting the
relatively exotic exclusive request functions in pci.c in case of them
operating in managed mode.
The managed nature has been removed from those functions and no one else
uses pcim_request_region_exclusive().
Remove pcim_request_region_exclusive().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/r/20250519112959.25487-5-phasta@kernel.org
All functions based on __pci_request_region() and its release counter
part support "hybrid mode", where the functions become managed if the
PCI device was enabled with pcim_enable_device().
Removing this undesirable feature requires to remove all users who
activated their device with that function and use one of the affected
request functions.
These users were:
ASoC
alsa
cardreader
cirrus
i2c
mmc
mtd
mtd
mxser
net
spi
vdpa
vmwgfx
all of which have been ported to always-managed pcim_ functions by now.
The hybrid nature can, thus, be removed from the aforementioned PCI
functions.
Remove all function guards and documentation in pci.c related to the
hybrid redirection. Adjust the visibility of pcim_release_region().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/r/20250519112959.25487-3-phasta@kernel.org
In an effort to move ARM64 away from the legacy MSI setup, convert the
Tegra PCIe driver to the MSI-parent infrastructure and let each device have
its own MSI domain.
[ tglx: Moved the struct out of the function call argument ]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513172819.2216709-10-maz@kernel.org
In an effort to move ARM64 away from the legacy MSI setup, convert the
XGENE PCIe driver to the MSI-parent infrastructure and let each device have
its own MSI domain.
[ tglx: Moved the struct out of the function call argument ]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513172819.2216709-9-maz@kernel.org
In an effort to move ARM64 away from the legacy MSI setup, convert the
Apple PCIe driver to the MSI-parent infrastructure and let each device have
its own MSI domain.
[ tglx: Moved the struct out of the function call argument ]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://lore.kernel.org/all/20250513172819.2216709-8-maz@kernel.org
irq_domain_add_linear() is going away as being obsolete now. Switch to
the preferred irq_domain_create_linear(). That differs in the first
parameter: It takes more generic struct fwnode_handle instead of struct
device_node. Therefore, of_fwnode_handle() is added around the
parameter.
Note some of the users can likely use dev->fwnode directly instead of
indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not
guaranteed to be set for all, so this has to be investigated on case to
case basis (by people who can actually test with the HW).
[ tglx: Fix up subject prefix and convert the new instance in
dwc/pcie-amd-mdb.c ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-30-jirislaby@kernel.org
PCIe Link Retraining can alter Link Speed. pcie_retrain_link() that
performs the Link Training is called from bwctrl and ASPM driver.
While bwctrl listens for Link Bandwidth Management Status (LBMS) to
pick up changes in Link Speed, there is a race between
pcie_reset_lbms() clearing LBMS after the Link Training and
pcie_bwnotif_irq() reading the Link Status register. If LBMS is already
cleared when the irq handler reads the register, the interrupt handler
will return early with IRQ_NONE and won't update the Link Speed.
When Link Speed update originates from bwctrl,
pcie_bwctrl_change_speed() ensures Link Speed is updated after the
retraining. ASPM driver, however, calls pcie_retrain_link() but does
not update the Link Speed after retraining which can result in stale
Link Speed. Also, it is possible to have ASPM support with
CONFIG_PCIEPORTBUS=n in which case bwctrl will not be built in (and
thus won't update the Link Speed at all).
To ensure Link Speed is not left stale after Link Training, move the
call to pcie_update_link_speed() from pcie_bwctrl_change_speed() into
pcie_retrain_link().
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Link: https://lore.kernel.org/linux-pci/aBCjpfyYmlkJ12AZ@wunner.de
Link: https://lore.kernel.org/r/20250514132821.15705-1-ilpo.jarvinen@linux.intel.com
Since commit 58d9a38f6f ("PCI: Skip attaching driver in device_add()"),
PCI enumeration is split into two steps: In the first step, all devices
are published in sysfs with device_add(). In the second step, drivers are
bound to the devices with device_attach(). To delay driver binding until
the second step, a "bool match_driver" in struct pci_dev is used.
Instead of a bool, use a bit in the "unsigned long priv_flags" to shrink
struct pci_dev a little and prevent use of the bool outside the PCI core
(as has happened with commit cbbc00be2c ("iommu/amd: Prevent binding
other PCI drivers to IOMMU PCI devices")).
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://patch.msgid.link/d22a9e5b81d6bd8dd1837607d6156679b3b1199c.1745572340.git.lukas@wunner.de
When PTM is enabled, PTM_UPDATING interrupt will be fired for each PTM
context update, which will be once every 10ms in the case of auto context
update. Since the interrupt is not strictly needed for making use of PTM,
mask it to avoid the overhead of processing it.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://patch.msgid.link/20250505-pcie-ptm-v4-4-02d26d51400b@linaro.org
Synopsys Designware PCIe IPs support PTM capability as defined in the PCIe
spec r6.0, sec 6.21. The PTM context information is exposed through Vendor
Specific Extended Capability (VSEC) registers on supported controller
implementation.
Hence, add support for exposing these context information to userspace
through the debugfs interface for the DWC controllers (both RC and EP).
Currently, only Qcom controllers are supported. For adding support for
other DWC vendor controllers, dwc_pcie_ptm_vsec_ids[] needs to be extended.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://patch.msgid.link/20250505-pcie-ptm-v4-3-02d26d51400b@linaro.org
Upcoming PTM debugfs interface relies on the DWC PCIe mode to expose the
relevant debugfs attributes to userspace. So pass the mode to
dwc_pcie_debugfs_init() API from host and ep drivers and save it in
'struct dw_pcie::mode'.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://patch.msgid.link/20250505-pcie-ptm-v4-2-02d26d51400b@linaro.org
Precision Time Management (PTM) mechanism defined in PCIe spec r6.0,
sec 6.21 allows precise coordination of timing information across multiple
components in a PCIe hierarchy with independent local time clocks.
PCI core already supports enabling PTM in the root port and endpoint
devices through PTM Extended Capability registers. But the PTM context
supported by the PTM capable components such as Root Complex (RC) and
Endpoint (EP) controllers were not exposed as of now. Part of the reason is
that the spec doesn't define how the context information is exposed to the
software and left it to the vendor implementation. So there is no
standardized way to get access to the context information and each vendor
have defined their own way.
This commit adds debugfs support to expose the PTM context to userspace
from both PCIe RC and EP controllers. Since the context information is
exposed in a vendor specific way, the debugfs interface allows the
controller drivers to implement callbacks for each attribute, to be called
by the generic PTM driver.
The Controller drivers are expected to call pcie_ptm_create_debugfs() to
create the debugfs attributes for the PTM context and call
pcie_ptm_destroy_debugfs() to destroy them. The drivers should also
populate the relevant callbacks in the 'struct pcie_ptm_ops' structure
based on the controller implementation.
Below PTM context are exposed through debugfs:
PCIe RC
=======
1. PTM Local clock
2. PTM T2 timestamp
3. PTM T3 timestamp
4. PTM Context valid
PCIe EP
=======
1. PTM Local clock
2. PTM T1 timestamp
3. PTM T4 timestamp
4. PTM Master clock
5. PTM Context update
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[kwilczynski: fix overflow issue reported by Dan Carpenter from
https://lore.kernel.org/linux-pci/b41c1754-c6b7-4805-9f14-7c643d6c5304@suswa.mountain]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://patch.msgid.link/20250505-pcie-ptm-v4-1-02d26d51400b@linaro.org
PCIe BW controller counted LBMS assertions for the purposes of the Target
Speed quirk (pcie_failed_link_retrain()). It was also a plan to expose the
LBMS count through sysfs to allow better diagnosing link related issues.
Lukas Wunner suggested, however, that adding a trace event would be better
for diagnostics purposes, leaving only pcie_failed_link_retrain() as a user
of the lbms_count.
The logic in pcie_failed_link_retrain() does not require keeping count of
LBMS assertions, so replace lbms_count with a simple flag in pci_dev's
priv_flags. The reduced complexity allows removing pcie_bwctrl_lbms_rwsem.
Since pcie_failed_link_retrain() runs before bwctrl is probed during boot,
the LBMS in Link Status register still has to be checked by the quirk.
The priv_flags numbering is not continuous because hotplug code added a few
flags to fill numbers 4-5 (hotplug and bwctrl changes are routed through in
different branches).
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
[kwilczynski: squashed a fix to resolve build failures from
https://lore.kernel.org/all/20250508090036.1528-1-ilpo.jarvinen@linux.intel.com]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Link: https://patch.msgid.link/20250422115548.1483-1-ilpo.jarvinen@linux.intel.com
Replace explicit if-else condition with direct return statement in
j721e_pcie_link_up(). This reduces code verbosity while maintaining
the same logic for detecting PCIe link completion.
Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20250510160710.392122-4-18255117159@163.com
PCIe link status check is supposed to return a boolean to indicate whether
the link is up or not. So update ls_g4_pcie_link_up() to return bool and
also simplify the LTSSM state check.
Signed-off-by: Hans Zhang <18255117159@163.com>
[mani: commit message rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20250510160710.392122-3-18255117159@163.com
PCIe link status check is supposed to return a boolean to indicate whether
the link is up or not. So, modify the link_up callbacks and
dw_pcie_link_up() function to return bool instead of int.
Signed-off-by: Hans Zhang <18255117159@163.com>
[mani: commit message reword]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20250510160710.392122-2-18255117159@163.com
With commit bcb5d6c769 ("s390/pci: introduce lock to synchronize state
of zpci_dev's") the code to ignore power off of a PF that has child VFs
was changed from a direct return to a goto to the unlock and
pci_dev_put() section. The change however left the existing pci_dev_put()
untouched resulting in a doubple put. This can subsequently cause a use
after free if the struct pci_dev is released in an unexpected state.
Fix this by removing the extra pci_dev_put().
Cc: stable@vger.kernel.org
Fixes: bcb5d6c769 ("s390/pci: introduce lock to synchronize state of zpci_dev's")
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The current scheme with a single helper to determine the P2P status
and map a scatterlist segment force users to always use the map_sg
helper to DMA map, which we're trying to get away from because they
are very cache inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
AMD BIOS team has root caused an issue that NVMe storage failed to come
back from suspend to a lack of a call to _REG when NVMe device was probed.
112a7f9c8e ("PCI/ACPI: Call _REG when transitioning D-states") added
support for calling _REG when transitioning D-states, but this only works
if the device actually "transitions" D-states.
967577b062 ("PCI/PM: Keep runtime PM enabled for unbound PCI devices")
added support for runtime PM on PCI devices, but never actually
'explicitly' sets the device to D0.
To make sure that devices are in D0 and that platform methods such as
_REG are called, explicitly set all devices into D0 during initialization.
Fixes: 967577b062 ("PCI/PM: Keep runtime PM enabled for unbound PCI devices")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Denis Benato <benato.denis96@gmail.com>
Tested-By: Yijun Shen <Yijun_Shen@Dell.com>
Tested-By: David Perry <david.perry@amd.com>
Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://patch.msgid.link/20250424043232.1848107-1-superm1@kernel.org
The commit a4e772898f ("PCI: Add missing bridge lock to pci_bus_lock()")
made the lock function to call depend on dev->subordinate but left
pci_slot_unlock() unmodified creating locking asymmetry compared with
pci_slot_lock().
Because of the asymmetric lock handling, the same bridge device is unlocked
twice. First pci_bus_unlock() unlocks bus->self and then pci_slot_unlock()
will unconditionally unlock the same bridge device.
Move pci_dev_unlock() inside an else branch to match the logic in
pci_slot_lock().
Fixes: a4e772898f ("PCI: Add missing bridge lock to pci_bus_lock()")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250505115412.37628-1-ilpo.jarvinen@linux.intel.com
Previously, the debugfs directory was unconditionally created in
tegra_pcie_config_rp() regardless of the CONFIG_PCIEASPM setting.
This led to unnecessary directory creation when ASPM support was disabled
since only ASPM state count was exposed through debugfs.
Hence, move the debugfs directory creation into init_debugfs() which is
conditionally compiled based on CONFIG_PCIEASPM. This ensures that both
the directory and 'aspm_state_cnt' entry are only created when ASPM is
enabled and avoids cluttering debugfs with empty directories when ASPM is
disabled.
Signed-off-by: Hans Zhang <18255117159@163.com>
[mani: subject and description change]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://patch.msgid.link/20250407124331.69459-1-18255117159@163.com
As per DWC PCIe registers description 4.30a, section 1.13.43, NUM_OF_LANES
named as PORT_LOGIC_LINK_WIDTH in PCIe DWC driver, is referred to as the
"Predetermined Number of Lanes" in PCIe r6.0, sec 4.2.7.2.1, which explains
the conditions required to enter Polling.Configuration:
Next state is Polling.Configuration after at least 1024 TS1 Ordered Sets
were transmitted, and all Lanes that detected a Receiver during Detect
receive eight consecutive training sequences ...
Otherwise, after a 24 ms timeout the next state is:
Polling.Configuration if,
(i) Any Lane, which detected a Receiver during Detect, received eight
consecutive training sequences ... and a minimum of 1024 TS1 Ordered
Sets are transmitted after receiving one TS1 or TS2 Ordered Set.
And
(ii) At least a predetermined set of Lanes that detected a Receiver
during Detect have detected an exit from Electrical Idle at least
once since entering Polling.Active.
Note: This may prevent one or more bad Receivers or Transmitters
from holding up a valid Link from being configured, and allow for
additional training in Polling.Configuration. The exact set of
predetermined Lanes is implementation specific.
Note: Any Lane that receives eight consecutive TS1 or TS2 Ordered
Sets should have detected an exit from Electrical Idle at least
once since entering Polling.Active.
In a PCIe link supporting multiple lanes, if PORT_LOGIC_LINK_WIDTH is set
to lane width the hardware supports, all lanes that detect a receiver
during the Detect phase must receive eight consecutive training sequences.
Otherwise, LTSSM will not enter Polling.Configuration and link training
will fail.
Therefore, always set PORT_LOGIC_LINK_WIDTH to 1, regardless of the number
of lanes the port actually supports, to make link up more robust. This
setting will not affect the intended link width if all lanes are
functional. Additionally, the link can still be established with at least
one lane if other lanes are faulty.
Co-developed-by: Qiang Yu <quic_qianyu@quicinc.com>
Signed-off-by: Qiang Yu <quic_qianyu@quicinc.com>
Signed-off-by: Wenbin Yao <quic_wenbyao@quicinc.com>
[mani: subject change]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[bhelgaas: update PCIe spec citation, format quote]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20250422103623.462277-1-quic_wenbyao@quicinc.com
The documentation for the phy_power_off() function explicitly says that it
must be called before phy_exit().
Hence, follow the same rule in rockchip_pcie_phy_deinit().
Fixes: 0e898eb8df ("PCI: rockchip-dwc: Add Rockchip RK356X host controller driver")
Signed-off-by: Diederik de Haas <didi.debian@cknow.org>
[mani: commit message change]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Dragan Simic <dsimic@manjaro.org>
Acked-by: Shawn Lin <shawn.lin@rock-chips.com>
Cc: stable@vger.kernel.org # v5.15+
Link: https://patch.msgid.link/20250417142138.1377451-1-didi.debian@cknow.org
Some of the callers of rockchip_pcie_link_up() are open coding the
rockchip_pcie_link_up() function, leading to code duplication. So switch
them to use rockchip_pcie_link_up() function.
Also, use the FIELD_GET() macro to simplify the link up check in
rockchip_pcie_link_up().
Signed-off-by: Hans Zhang <18255117159@163.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250427125316.99627-4-18255117159@163.com
Register definitions were scattered with ambiguous names (e.g.,
PCIE_RDLH_LINK_UP_CHGED in PCIE_CLIENT_INTR_STATUS_MISC) and lacked
hierarchical grouping.
Group registers and their associated bitfields logically. This improves
maintainability and aligns the code with hardware documentation.
Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250427125316.99627-3-18255117159@163.com
The PCIE_CLIENT_GENERAL_DEBUG register offset is defined but never
used in the driver. It's presence adds noise to the register map.
Remove this unused definition to keep the register list minimal and aligned
with actual hardware usage.
Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250427125316.99627-2-18255117159@163.com
The look up table (LUT) setting would be lost during the PCIe suspend on
i.MX95 SoC. So to ensure proper functionality after resume, save it during
suspend and restore it while resuming.
Fixes: 9d6b1bd6b3 ("PCI: imx6: Add i.MX8MQ, i.MX8Q and i.MX95 PM support")
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250416081314.3929794-8-hongxing.zhu@nxp.com
PLL lock is required to ensure that the PLL clock is stable before enabling
the controller in i.MX95 SoC.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250416081314.3929794-7-hongxing.zhu@nxp.com
ERR051586: Compliance with 8GT/s Receiver Impedance ECN.
The default value of GEN3_RELATED_OFF[GEN3_ZRXDC_NONCOMPL] is 1 which
makes receiver non-compliant with the ZRX-DC parameter for 2.5 GT/s when
operating at 8 GT/s or higher. It causes unnecessary timeout in L1.
So the workaround is to set GEN3_RELATED_OFF[GEN3_ZRXDC_NONCOMPL] to 0.
Add this workaround in the dw_pcie_host_ops::post_init() callback for
i.MX95 platforms.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250416081314.3929794-6-hongxing.zhu@nxp.com
ERR051624: The Controller Without Vaux Cannot Exit L23 Ready Through Beacon
or PERST# De-assertion
When the auxiliary power is not available, the controller cannot exit from
L23 Ready with beacon or PERST# de-assertion when main power is not
removed. So the workaround is to set SS_RW_REG_1[SYS_AUX_PWR_DET] to 1.
This workaround is required irrespective of whether Vaux is supplied to the
link partner or not.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250416081314.3929794-5-hongxing.zhu@nxp.com
Add toggling core reset for i.MX95 to align with PHY's power-up sequence.
Note that the register is named as IMX95_PCIE_COLD_RST in hardware, though
it is used to reset the PCIe core.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[mani: subject and description rewording]
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20250416081314.3929794-4-hongxing.zhu@nxp.com
Since the DWC driver is already calling dw_pcie_wait_for_link() after
calling the start_link() callback, remove the redundant
dw_pcie_wait_for_link() call from imx_pcie_start_link(). It is still
required to call this function for controllers supporting Gen 2 and higher
link speeds.
Suggested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20250416081314.3929794-3-hongxing.zhu@nxp.com
The current link setup procedure has one workaround to detect the devices
behind PCIe switches on some i.MX6 platforms. But this workaround is not
needed on recent i.MX7 platforms. So skip the workaround for platforms that
do not set the flag and start LTSSM directly.
Also, change the flag name from IMX_PCIE_FLAG_IMX_SPEED_CHANGE to
IMX_PCIE_FLAG_SPEED_CHANGE_WORKAROUND to match the usecase.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250416081314.3929794-2-hongxing.zhu@nxp.com
In the case of PERST# deassert, non-sticky registers will get reset to
their hardware default state and EXT_CAP registers are one among them. But
since the broken ATS cap is hidden only in dw_pcie_ep_ops::pre_init()
callback which is not gettting called during PERST# deassert, it results in
the capability getting advertised again.
So move it to dw_pcie_ep_ops::init() to fix it.
Suggested-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/1744594109-209312-1-git-send-email-shawn.lin@rock-chips.com
L0s capability isn't enabled on all Rockchip SoCs by default, so enable it
in order to make ASPM L0s work on Rockchip platforms.
Testing the L0s for a long time revealed that the default N_FTS value of
210 in the hardware doesn't work stable and causes LTSSM to switch between
L0s and Recovery states. This leads to long exit latency and also causes
link down sometimes. So override the value to the max 255, which seems to
work fine under both PHYs used on Rockchip platforms.
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
[mani: subject and description rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/1744850111-236269-2-git-send-email-shawn.lin@rock-chips.com
rockchip_pcie_link_up() currently has two issues:
1. Value 0x11 of PCIE_L0S_ENTRY corresponds to L0 state, not L0S. So the
naming is wrong from the very beginning.
2. Checking for value 0x11 treats other states like L0S and L1 as link
down, which is wrong.
Hence, remove the PCIE_L0S_ENTRY check and also its definition. This allows
adding ASPM support in the successive commits.
Fixes: 0e898eb8df ("PCI: rockchip-dwc: Add Rockchip RK356X host controller driver")
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
[mani: commit message rewording]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/1744850111-236269-1-git-send-email-shawn.lin@rock-chips.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmgNN/gUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vyu/BAAlByPLoEBG3POoliiXkCGw8sW+mhi
B/TzfHeg0gjzMAziRUySG3Fq9D287zlK0MQ9SXHW94FQ6tOd4I3eonMNsBAfYhLR
wRhKw60d2lx7VM1ffVf/Kahn9fkF1N+VptMNOdMAM+yLsSLvGAk2mpVEBNAOUmRw
VO6QSycApJ3XtT1vVWImUwmBZLcSDX44Ud6YEwGuSWVrTWzRC7zhLwRrJeglXmTW
5R2SdbSc3zTqBYojPkmZSxT5CM6FpmUVOesiAuDrwQcGfLDcArS+IrozeBZ5uy+g
qxyz+oH4c2U/ZBGLSUU/WTwjoR/uaImv20altqqQ2glaYx8nlDMhVw7MzdfjxGQ3
RvcMBJaPguoJVSM6JNZLvIwwjQ1k9Gw/Xt5X0K1E2qmDf/mWHRr0Ct6Qs8xD6W4h
otd9Q6f4y6a/wQhSu5etn34xcMQsDNaRSivUi+ABYhQrhPBovmXPNsN9myBTp+HI
kej8Usr7Yf0eBE7pMAPtEyyrzcbrEinZG7FAErMyrMGWnF5QzoinVQBXG0/rGVMr
uQ8KD3eJRveWRTeOoZpUi6Ll/UzHEAnKrANxIU3iIDZT3sn8ZbORAV8ANeX7H5sJ
NCgZfmHZX+9F/PHAX6t7msoKCj+I3nXDZVy7w+iKVG8vdQGWVsm4xb4kJg3OkE60
bPjjEWYDSn+PodA=
=5H9J
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fixes from Bjorn Helgaas:
- When releasing a start-aligned resource, e.g., a bridge window, save
start/end/flags for the next assignment attempt; fixes a v6.15-rc1
regression (Ilpo Järvinen)
- Move set_pcie_speed.sh from TEST_PROGS to TEST_FILE; fixes a bwctrl
selftest v6.15-rc1 regression (Ilpo Järvinen)
- Add Manivannan Sadhasivam as maintainer of native host bridge and
endpoint drivers (Manivannan Sadhasivam)
- In endpoint test driver, defer IRQ allocation from .probe() until
ioctl() to fix a regression on platforms where the Vendor/Device ID
match doesn't include driver_data (Niklas Cassel)
* tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
misc: pci_endpoint_test: Defer IRQ allocation until ioctl(PCITEST_SET_IRQTYPE)
MAINTAINERS: Move Manivannan Sadhasivam as PCI Native host bridge and endpoint maintainer
selftests/pcie_bwctrl: Fix test progs list
PCI: Restore assigned resources fully after release
We can get different results probing reset methods for a device depending
on its power state. For example, reading the PM control register of a
device in D3cold will always indicate NoSoftRst+ because we get ~0 data
when the config read fails on PCI, preventing us from correctly probing PM
reset support.
Increment the PM usage counter before any probes and use the cleanup __free
facility to automatically drop the usage counter out of scope.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250422230534.2295291-3-alex.williamson@redhat.com
It turns out that there are no platforms that have PCI but don't have an
MMU, so adding a Kconfig dependency on CONFIG_PCI simplifies build testing
kernels for those platforms a lot, and avoids a lot of inadvertent build
regressions.
Add a dependency for CONFIG_PCI and remove all the ones for PCI specific
device drivers that are currently marked not having it.
There are a few platforms that have an optional MMU, but they usually
cannot have PCI at all. The one exception is Coldfire MCF54xx, but this is
mainly for historic reasons, and anyone using those chips should really use
the MMU these days.
Link: https://lore.kernel.org/lkml/a41f1b20-a76c-43d8-8c36-f12744327a54@app.fastmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20250423202215.3315550-1-arnd@kernel.org
This version of the hardware moved around a bunch of registers, so we
avoid the old compatible for these and introduce register offset
structures to handle the differences.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-14-maz@kernel.org
Newer versions of the Apple PCIe block have a bunch of small, but
annoying differences.
In order to embrace this diversity of implementations, move the
currently hardcoded offsets into a hw_info structure. Future SoCs
will provide their own structure describing the applicable offsets.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
[maz: split from original patch to only address T8103]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-13-maz@kernel.org
This is checking a core refclk in per-port setup which doesn't make a
lot of sense, and the bootloader needs to have gone through this anyway.
It doesn't work on T602x, so just drop it across the board.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-11-maz@kernel.org
T602x PCIe cores move these registers around. Instead of hardcoding in
another offset, let's move them into their own reg entries. This matches
what Apple does on macOS device trees too.
Maintains backwards compatibility with old DTs by using the old offsets.
Note that we open code devm_platform_ioremap_resource_byname() to avoid
error messages on older platforms with missing resources in the pcie
node. ("pcie-apple 590000000.pcie: invalid resource (null)" on probe)
Co-developed-by: Janne Grunau <j@jannau.net>
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-10-maz@kernel.org
In the success path, we hang onto a reference to the node, so make sure
to grab one. The caller iterator puts our borrowed reference when we
return.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-9-maz@kernel.org
T602x seems to have dropped the rather useful SET/CLR accessors
to the masking register.
Instead, let's use the mask register directly, and wrap it with
a brand new spinlock. No, this isn't moving in the right direction.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-8-maz@kernel.org
As we move towards supporting SoCs with varying RID-to-SID mapping
capabilities, turn the static SID tracking bitmap into a dynamically
allocated one. The current allocation size is still the same, but
that's about to change.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-7-maz@kernel.org
Now that we have the required infrastructure, split the Apple PCIe
setup into two categories:
- stuff that has to do with PCI setup stays in the .init() callback
- stuff that is just driver gunk (such as MSI setup) goes into a
probe routine, which will eventually call into the host-common
code
The result is a far more logical setup process.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-6-maz@kernel.org
In order to decouple ecam config space creation from probing via
pci_host_common_probe(), allow the private pointer to be populated
via the device drvdata pointer.
Crucially, this is set before calling ops->init(), allowing that
particular callback to have access to probe data.
This should have no impact on existing code which ignores the
current value of cfg->priv.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-5-maz@kernel.org
pci_host_common_probe() is an extremely useful helper, as it
abstracts away most of the gunk that a "mostly-ECAM-compliant"
device driver needs.
However, it is structured as a probe function, meaning that a lot
of the driver-specific setup has to happen in a .init() callback,
after the bridge and config space have been instantiated.
This is a bit awkward, and results in a number of convolutions
that could be avoided if the host-common code was more like
a library.
Introduce a pci_host_common_init() helper that does exactly that,
taking the platform device and a struct pci_ecam_op as parameters.
This can then be called from the probe routine, and a lot of the
code that isn't relevant to PCI setup moved away from the .init()
callback. This also removes the dependency on the device match
data, which is an oddity.
Signed-off-by: Marc Zyngier <maz@kernel.org>
[mani: fixed spelling mistakes]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://patch.msgid.link/20250401091713.2765724-4-maz@kernel.org
Add IPQ5018 platform with is based on Qcom IP rev. 2.9.0 and Synopsys IP
rev. 5.00a.
The platform itself has two PCIe Gen2 controllers: one single-lane and
one dual-lane. So add the IPQ5018 compatible and re-use 2_9_0 ops.
Signed-off-by: Nitheesh Sekar <quic_nsekar@quicinc.com>
Signed-off-by: Sricharan R <quic_srichara@quicinc.com>
Signed-off-by: George Moussalem <george.moussalem@outlook.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250326-ipq5018-pcie-v7-4-e1828fef06c9@outlook.com
PCIe equalization presets are predefined settings used to optimize signal
integrity by compensating for signal loss and distortion in high-speed data
transmission.
Based upon the number of lanes and the data rate supported, write the
preset data read from the device tree in to the lane equalization control
registers. These preset values will be used by the controller during the
LTSSM lane equalization procedure.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
[mani: reworded the commit message and comments in the driver]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250328-preset_v6-v9-5-22cfa0490518@oss.qualcomm.com
If the call to pci_host_probe() in cdns_pcie_host_setup() fails, PM
runtime count is decremented in the error path using pm_runtime_put_sync().
But the runtime count is not incremented by this driver, but only by the
callers (cdns_plat_pcie_probe/j721e_pcie_probe). And the callers also
decrement the runtime PM count in their error path. So this leads to the
below warning from the PM core:
"runtime PM usage count underflow!"
So fix it by getting rid of pm_runtime_put_sync() in the error path and
directly return the errno.
Fixes: 49e427e6bd ("Merge branch 'pci/host-probe-refactor'")
Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250419133058.162048-1-18255117159@163.com
RK3399 can raise INTx interrupts, as can be seen by
rockchip_pcie_ep_send_intx_irq().
This is also in line with the register description of
PCIE_CLIENT_LEGACY_INT_CTRL, section "17.6.3 PCIe Client Detail Register
Description" of the RK3399 TRM.
Thus, mark RK3399 as intx_capable.
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250416142749.336298-2-cassel@kernel.org
Iterating over disabled ports results in of_irq_parse_raw() parsing
the wrong "interrupt-map" entries, as it takes the status of the node
into account.
This became apparent after disabling unused PCIe ports in the Apple
Silicon device trees instead of deleting them.
Switching from for_each_child_of_node_scoped() to
for_each_available_child_of_node_scoped() solves this issue.
Fixes: 1e33888fbe ("PCI: apple: Add initial hardware bring-up")
Fixes: a0189fdfb7 ("arm64: dts: apple: t8103: Disable unused PCIe ports")
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/asahi/20230214-apple_dts_pcie_disable_unused-v1-0-5ea0d3ddcde3@jannau.net/
Link: https://lore.kernel.org/asahi/1ea2107a-bb86-8c22-0bbc-82c453ab08ce@linaro.org/
Link: https://patch.msgid.link/20250401091713.2765724-2-maz@kernel.org
The mvebu "ranges" is a bit unusual with its own encoding of addresses,
but it's still just normal "ranges" as far as parsing is concerned.
Convert mvebu_get_tgt_attr() to use the for_each_of_range() iterator
instead of open coding the parsing.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20241107153255.2740610-1-robh@kernel.org
If the num-lanes property is not present in the devicetree, update
pci->num_lanes with the hardware supported maximum link width using
the newly introduced dw_pcie_link_get_max_link_width() API.
The API is used to get the Maximum Link Width (MLW) of the controller.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
[mani: reworded commit message a bit]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250328-preset_v6-v9-3-22cfa0490518@oss.qualcomm.com
PCIe equalization presets are predefined settings used to optimize
signal integrity by compensating for signal loss and distortion in
high-speed data transmission.
As per PCIe spec 6.0.1 revision section 8.3.3.3 & 4.2.4 for data rates
of 8.0 GT/s, 16.0 GT/s, 32.0 GT/s, and 64.0 GT/s, there is a way to
configure lane equalization presets for each lane to enhance the PCIe
link reliability. Each preset value represents a different combination
of pre-shoot and de-emphasis values. For each data rate, different
registers are defined: for 8.0 GT/s, registers are defined in section
7.7.3.4; for 16.0 GT/s, in section 7.7.5.9, etc. The 8.0 GT/s rate has
an extra receiver preset hint, requiring 16 bits per lane, while the
remaining data rates use 8 bits per lane.
Based on the number of lanes and the supported data rate,
of_pci_get_equalization_presets() reads the device tree property and
stores in the presets structure.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250328-preset_v6-v9-2-22cfa0490518@oss.qualcomm.com
On rcar-gen4, the ep BAR4 has a fixed size of 256B. Document this
constraint in the epc features of the platform.
Fixes: e311b3834d ("PCI: rcar-gen4: Add endpoint mode support")
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://patch.msgid.link/20250328-rcar-gen4-bar4-v1-1-10bb6ce9ee7f@baylibre.com
The order of rockchip_pci_core_rsts introduced in the offending commit
followed the previous comment that warned not to reorder them. But the
commit failed to take into account that reset_control_bulk_deassert()
deasserts the resets in reverse order. So this leads to the link getting
downgraded to 2.5 GT/s.
Hence, restore the deassert order and also add back the comments for
rockchip_pci_core_rsts.
Tested on NanoPC-T4 with Samsung 970 Pro.
Fixes: 18715931a5 ("PCI: rockchip: Simplify reset control handling by using reset_control_bulk*() function")
Signed-off-by: Jensen Huang <jensenhuang@friendlyarm.com>
[mani: reworded the commit message and the comment above rockchip_pci_core_rsts]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Shawn Lin <shawn.lin@rock-chips.com>
Link: https://patch.msgid.link/20250328105822.3946767-1-jensenhuang@friendlyarm.com
PCI resource fitting code in __assign_resources_sorted() runs in multiple
steps. A resource that was successfully assigned may have to be released
before the next step attempts assignment again. The assign+release cycle is
destructive to a start-aligned struct resource (bridge window or IOV
resource) because the start field is overwritten with the real address when
the resource got assigned.
One symptom:
pci 0002:00:00.0: bridge window [mem size 0x00100000]: can't assign; bogus alignment
Properly restore the resource after releasing it. The start, end, and flags
fields must be stored into the related struct pci_dev_resource in order to
be able to restore the resource to its original state.
Fixes: 96336ec702 ("PCI: Perform reset_resource() and build fail list in sync")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/r/01eb7d40-f5b5-4ec5-b390-a5c042c30aff@roeck-us.net/
Reported-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Closes: https://lore.kernel.org/r/3578030.5fSG56mABF@workhorse
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Ondrej Jirman <megi@xff.cz>
Link: https://patch.msgid.link/20250403093137.1481-1-ilpo.jarvinen@linux.intel.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmgBgqMUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxXkRAAgicW9BMwem/xsX2IcJKDJ1lFrXq0
DCeUyDMh3U3EnbkFGtG+lhkJnWVOcyH8jlBbRd44dsFE95TbL4dvd0l1ZR8TUY9u
HwSR80CsejO3ZrXMygKQkLgHOBdbvw3knVSHBH+nmQGM9PbYxG1OtCLJMyrrmtPV
8e2lM6CxvqY2NsajOA4oHk9TW35FyKicmHB6iQV8iANj8/MO6ZfbAwqL9EdD/qRy
wZvycNfQRjD2qbHDqi1S8Mh3JTPCp9F1Ld7KhrkRCzIW8AUBDpMlWur0aKKWHK7F
jPSuppO+qNN5HbRUuQWgbh8AxSRVzvMaDo3VCi21G/tgOOWpPggJL5qqQ8yS4GPk
ZcIP7d3/B7rSkUqe1dNHlzW4GSRf1XSIVZSUIzjbZEraHkbgsIsoeZM3iD8WsZ+F
EilDFIeFEswK4DFMthVtWDsWy8EBEcO6aX6qDT8pJBbJNj2GPuL5ZeCuKYWwOHhh
Pkqf5kAIj06QCx6AghmK9RQk7lt6uAhnuP9dZq+/L+DYRFej3caOjQPFH7n6jdEY
e42B+Ceb4eANCnaKnY0ZdlsnC//VRsSFWJAiGZIZX1P3jnvlyin7YFugp739U/La
fUJUT8MhnSibZsSF1bPSwqTDl0D9tstnPVXnio+umfe9YATf21m3W/lLrRYu4Emy
6ZNJIszfQrY/cqw=
=ebpM
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.15-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fix from Bjorn Helgaas:
- Revert a reset patch that broke VFIO passthrough because devices
ended up with no available reset mechanisms (Alex Williamson)
* tag 'pci-v6.15-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Revert "PCI: Avoid reset when disabled via sysfs"
Loongson PCIe Root Ports don't advertise an ACS capability, but they do not
allow peer-to-peer transactions between Root Ports. Add an ACS quirk so
each Root Port can be in a separate IOMMU group.
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250403040756.720409-1-chenhuacai@loongson.cn
Print the delay amount that pcie_wait_for_link_delay() is invoked with
instead of the hardcoded 1000ms value in the debug info print.
Fixes: 7b3ba09feb ("PCI/PM: Shorten pci_bridge_wait_for_secondary_bus() wait time for slow links")
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250414001505.21243-2-wilfred.opensource@gmail.com
In February 2003, historic commit
https://git.kernel.org/tglx/history/c/280c1c9a0ea4
("[PATCH] PCI Hotplug: Replace pcihpfs with sysfs.")
removed all invocations of __get_free_page() and free_page() from the PCI
hotplug core without also removing the #include <linux/pagemap.h>
directive.
It removed all invocations of kern_mount(), mntget() and mntput()
without also removing the #include <linux/mount.h> directive.
It removed all invocations of lookup_hash()
without also removing the #include <linux/namei.h> directive.
It removed all invocations of copy_to_user() and copy_from_user()
without also removing the #include <linux/uaccess.h> directive.
These #include directives are still unnecessary today, so drop them.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/c19e25bf2cefecc14e0822c6a9bb3a7f546258bc.1744640331.git.lukas@wunner.de
pci_read_bases() is given literal 6 that means PCI_STD_NUM_BARS. Replace
the literal with the define to annotate the code better.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20250416100239.6958-1-ilpo.jarvinen@linux.intel.com
This reverts commit 479380efe1.
The reset_method attribute on a PCI device is only intended to manage the
availability of function scoped resets for a device. It was never intended
to restrict resets targeting the bus or slot.
In introducing a restriction that each device must support function level
reset by testing pci_reset_supported(), we essentially create a catch-22,
that a device must have a function scope reset in order to support bus/slot
reset, when we use bus/slot reset to effect a reset of a device that does
not support a function scoped reset, especially multi-function devices.
This breaks the majority of uses cases where vfio-pci uses bus/slot resets
to manage multifunction devices that do not support function scoped resets.
Fixes: 479380efe1 ("PCI: Avoid reset when disabled via sysfs")
Reported-by: Cal Peake <cp@absolutedigital.net>
Closes: https://lore.kernel.org/all/808e1111-27b7-f35b-6d5c-5b275e73677b@absolutedigital.net
Reported-by: Athul Krishna <athul.krishna.kr@protonmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220010
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250414211828.3530741-1-alex.williamson@redhat.com
When a Secondary Bus Reset is issued at a hotplug port, it causes a Data
Link Layer State Changed event as a side effect. On hotplug ports using
in-band presence detect, it additionally causes a Presence Detect Changed
event.
These spurious events should not result in teardown and re-enumeration of
the device in the slot. Hence commit 2e35afaefe ("PCI: pciehp: Add
reset_slot() method") masked the Presence Detect Changed Enable bit in the
Slot Control register during a Secondary Bus Reset. Commit 06a8d89af5
("PCI: pciehp: Disable link notification across slot reset") additionally
masked the Data Link Layer State Changed Enable bit.
However masking those bits only disables interrupt generation (PCIe r6.2
sec 6.7.3.1). The events are still visible in the Slot Status register
and picked up by the IRQ handler if it runs during a Secondary Bus Reset.
This can happen if the interrupt is shared or if an unmasked hotplug event
occurs, e.g. Attention Button Pressed or Power Fault Detected.
The likelihood of this happening used to be small, so it wasn't much of a
problem in practice. That has changed with the recent introduction of
bandwidth control in v6.13-rc1 with commit 665745f274 ("PCI/bwctrl:
Re-add BW notification portdrv as PCIe BW controller"):
Bandwidth control shares the interrupt with PCIe hotplug. A Secondary Bus
Reset causes a Link Bandwidth Notification, so the hotplug IRQ handler
runs, picks up the masked events and tears down the device in the slot.
As a result, Joel reports VFIO passthrough failure of a GPU, which Ilpo
root-caused to the incorrect handling of masked hotplug events.
Clearly, a more reliable way is needed to ignore spurious hotplug events.
For Downstream Port Containment, a new ignore mechanism was introduced by
commit a97396c6eb ("PCI: pciehp: Ignore Link Down/Up caused by DPC").
It has been working reliably for the past four years.
Adapt it for Secondary Bus Resets.
Introduce two helpers to annotate code sections which cause spurious link
changes: pci_hp_ignore_link_change() and pci_hp_unignore_link_change()
Use those helpers in lieu of masking interrupts in the Slot Control
register.
Introduce a helper to check whether such a code section is executing
concurrently and if so, await it: pci_hp_spurious_link_change()
Invoke the helper in the hotplug IRQ thread pciehp_ist(). Re-use the
IRQ thread's existing code which ignores DPC-induced link changes unless
the link is unexpectedly down after reset recovery or the device was
replaced during the bus reset.
That code block in pciehp_ist() was previously only executed if a Data
Link Layer State Changed event has occurred. Additionally execute it for
Presence Detect Changed events. That's necessary for compatibility with
PCIe r1.0 hotplug ports because Data Link Layer State Changed didn't exist
before PCIe r1.1. DPC was added with PCIe r3.1 and thus DPC-capable
hotplug ports always support Data Link Layer State Changed events.
But the same cannot be assumed for Secondary Bus Reset, which already
existed in PCIe r1.0.
Secondary Bus Reset is only one of many causes of spurious link changes.
Others include runtime suspend to D3cold, firmware updates or FPGA
reconfiguration. The new pci_hp_{,un}ignore_link_change() helpers may be
used by all kinds of drivers to annotate such code sections, hence their
declarations are publicly visible in <linux/pci.h>. A case in point is
the Mellanox Ethernet driver which disables a firmware reset feature if
the Ethernet card is attached to a hotplug port, see commit 3d7a3f2612
("net/mlx5: Nack sync reset request when HotPlug is enabled"). Going
forward, PCIe hotplug will be able to cope gracefully with all such use
cases once the code sections are properly annotated.
The new helpers internally use two bits in struct pci_dev's priv_flags as
well as a wait_queue. This mirrors what was done for DPC by commit
a97396c6eb ("PCI: pciehp: Ignore Link Down/Up caused by DPC"). That may
be insufficient if spurious link changes are caused by multiple sources
simultaneously. An example might be a Secondary Bus Reset issued by AER
during FPGA reconfiguration. If this turns out to happen in real life,
support for it can easily be added by replacing the PCI_LINK_CHANGING flag
with an atomic_t counter incremented by pci_hp_ignore_link_change() and
decremented by pci_hp_unignore_link_change(). Instead of awaiting a zero
PCI_LINK_CHANGING flag, the pci_hp_spurious_link_change() helper would
then simply await a zero counter.
Fixes: 665745f274 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Joel Mathew Thomas <proxy0@tutamail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Joel Mathew Thomas <proxy0@tutamail.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/d04deaf49d634a2edf42bf3c06ed81b4ca54d17b.1744298239.git.lukas@wunner.de
Commit a97396c6eb ("PCI: pciehp: Ignore Link Down/Up caused by DPC")
amended PCIe hotplug to not bring down the slot upon Data Link Layer State
Changed events caused by Downstream Port Containment.
However Keith reports off-list that if the slot uses in-band presence
detect (i.e. Presence Detect State is derived from Data Link Layer Link
Active), DPC also causes a spurious Presence Detect Changed event.
This needs to be ignored as well.
Unfortunately there's no register indicating that in-band presence detect
is used. PCIe r5.0 sec 7.5.3.10 introduced the In-Band PD Disable bit in
the Slot Control Register. The PCIe hotplug driver sets this bit on
ports supporting it. But older ports may still use in-band presence
detect.
If in-band presence detect can be disabled, Presence Detect Changed events
occurring during DPC must not be ignored because they signal device
replacement. On all other ports, device replacement cannot be detected
reliably because the Presence Detect Changed event could be a side effect
of DPC. On those (older) ports, perform a best-effort device replacement
check by comparing the Vendor ID, Device ID and other data in Config Space
with the values cached in struct pci_dev. Use the existing helper
pciehp_device_replaced() to accomplish this. It is currently #ifdef'ed to
CONFIG_PM_SLEEP in pciehp_core.c, so move it to pciehp_hpc.c where most
other functions accessing config space reside.
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/fa264ff71952915c4e35a53c89eb0cde8455a5c5.1744298239.git.lukas@wunner.de
Commit 7d5ec3d361 ("PCI/MSI: Mask all unused MSI-X entries") introduced a
readl() from ENTRY_VECTOR_CTRL before the writel() to ENTRY_DATA.
This is correct, however some hardware, like the Sun Neptune chips, the NIU
module, will cause an error and/or fatal trap if any MSIX table entry is
read before the corresponding ENTRY_DATA field is written to.
Add an optional early writel() in msix_prepare_msi_desc().
Fixes: 7d5ec3d361 ("PCI/MSI: Mask all unused MSI-X entries")
Signed-off-by: Jonathan Currier <dullfire@yahoo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241117234843.19236-2-dullfire@yahoo.com
quirk_huawei_pcie_sva() sets properties needed by arm_smmu_probe_device(),
but bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
changed the iommu_probe_device() flow so arm_smmu_probe_device() is now
invoked before the quirk, leading to failures like this:
reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here!
WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x140/0x570
RIP: 0010:__iommu_probe_device+0x140/0x570
The SR-IOV enumeration ordering changes like this:
pci_iov_add_virtfn
pci_device_add
pci_fixup_device(pci_fixup_header) <--
device_add
bus_notify
iommu_bus_notifier
+ iommu_probe_device
+ arm_smmu_probe_device
pci_bus_add_device
pci_fixup_device(pci_fixup_final) <--
device_attach
driver_probe_device
really_probe
pci_dma_configure
acpi_dma_configure_id
- iommu_probe_device
- arm_smmu_probe_device
The non-SR-IOV case is similar in that pci_device_add() is called from
pci_scan_single_device() in the generic enumeration path and
pci_bus_add_device() is called later, after all host bridges have been
enumerated.
Declare quirk_huawei_pcie_sva() as a header fixup to ensure that it happens
before arm_smmu_probe_device().
Fixes: bcb81ac6ae ("iommu: Get DT/ACPI parsing into the proper probe path")
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/all/SJ1PR11MB61295DE21A1184AEE0786E25B9D22@SJ1PR11MB6129.namprd11.prod.outlook.com/
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
[bhelgaas: commit log, add failure info and reporter]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250317011352.5806-1-zhangfei.gao@linaro.org
All users of the deprecated function pcim_iounmap_regions() have been
ported by now. Remove it.
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20250327110707.20025-4-phasta@kernel.org
The driver walks the MSI descriptors to test whether a descriptor exists
for a given index. That's just abuse of the MSI internals.
The same test can be done with a single function call by looking up whether
there is a Linux interrupt number assigned at the index.
What's worse is that the function is completely unserialized against
modifications of the MSI-X control by operations issued from the interrupt
core. It also brings the PCI/MSI-X internal cached control word out of
sync.
Remove the trainwreck and invoke the function provided by the PCI/MSI core
to update it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.744271447@linutronix.de
The PCI/TPH driver fiddles with the MSI-X control word of an active
interrupt completely unserialized against concurrent operations issued
from the interrupt core. It also brings the PCI/MSI-X internal cached
control word out of sync.
Provide a function, which has the required serialization and keeps the
control word cache in sync.
Unfortunately this requires to look up and lock the interrupt descriptor,
which should be only done in the interrupt core code. But confining this
particular oddity in the PCI/MSI core is the lesser of all evil. A
interrupt core implementation would require a larger pile of infrastructure
and indirections for dubious value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.683663807@linutronix.de
Convert the code to use the new guard(msi_descs_lock).
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.624838146@linutronix.de
Split the lock protected functionality of msix_capability_init() out into a
helper function and use guard(msi_desc_lock) to replace the lock/unlock
pair.
Simplify the error path in the helper function by utilizing a custom
cleanup to get rid of the remaining gotos.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.564105011@linutronix.de
Split the lock protected functionality of msi_capability_init() out into a
helper function and use guard(msi_desc_lock) to replace the lock/unlock
pair.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.504992208@linutronix.de
Let cleanup handle the freeing of the affinity mask. That prepares for
switching the MSI descriptor locking to a guard().
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.444764312@linutronix.de
The comment claiming that pci_dev::msi_enabled has to be set across setup
is a leftover from ancient code versions. Nothing in the setup code
requires the flag to be set anymore.
Set it in the success path and remove the extra goto label.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250319105506.383222333@linutronix.de
of_node_to_fwnode() is irqdomain's reimplementation of the "officially"
defined of_fwnode_handle(). The former is in the process of being removed,
so use the latter instead.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250319092951.37667-8-jirislaby@kernel.org
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are clearly
linked to a device. This is linked to the VIOMMU object and is intended to
be used by a VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered through
a simple FD returning event records on read().
- PASID support in VFIO. "Process Address Space ID" is a PCI feature that
allows the device to tag all PCI DMA operations with an ID. The IOMMU
will then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the hypervisor to
manage each PASID entry. The support is generic so any VFIO user could
attach any translation to a PASID, and the support should work on ARM
SMMUv3 as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate what
owns the various opaque owner fields
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+q6NgAKCRCFwuHvBreF
YZ3zAQDbl4/Z0O+CLN2AXq4Zeiyq1HTSoF94hzqmm7lQ17zTIwD8CCdyLXHvupaq
tkBIv5IovpaxlrSk6M0kh2K8vPCk9Qk=
=CIM3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
Uros Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was founf to be incorrect.
- The 4 patch series "Some cleanup for memcg" from Chen Ridong
implements some relatively monir cleanups for the memcontrol code.
- The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
from David Hildenbrand fixes a boatload of issues which David found then
using device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now succeed.
- The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
Ahmed remove the z3fold and zbud implementations. They have been
deprecated for half a year and nobody has complained.
- The 5 patch series "mm: further simplify VMA merge operation" from
Lorenzo Stoakes implements numerous simplifications in this area. No
runtime effects are anticipated.
- The 4 patch series "mm/madvise: remove redundant mmap_lock operations
from process_madvise()" from SeongJae Park rationalizes the locking in
the madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The 12 patch series "Tiny cleanup and improvements about SWAP code"
from Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak user-visible
output.
- The 2 patch series "mm/damon/paddr: fix large folios access and
schemes handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The 3 patch series "mm/damon/core: fix wrong and/or useless
damos_walk() behaviors" from SeongJae Park fixes a few issues with the
accuracy of kdamond's walking of DAMON regions.
- The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
Lorenzo Stoakes changes the interaction between framebuffer deferred-io
and core MM. No functional changes are anticipated - this is
preparatory work for the future removal of page structure fields.
- The 4 patch series "mm/damon: add support for hugepage_size DAMOS
filter" from Usama Arif adds a DAMOS filter which permits the filtering
by huge page sizes.
- The 4 patch series "mm: permit guard regions for file-backed/shmem
mappings" from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The 4 patch series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping for
pte-mapped large folios.
- The 18 patch series "reimplement per-vma lock as a refcount" from
Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
fixes and improves" from SeongJae Park does some maintenance work on the
DAMON docs.
- The 27 patch series "hugetlb/CMA improvements for large systems" from
Frank van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The 12 patch series "selftests/mm: Some cleanups from trying to run
them" from Brendan Jackman fixes a pile of unrelated issues which
Brendan encountered while runnimg our selftests.
- The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply wasn't
being effective.
- The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
from David Hildenbrand implements a number of unrelated cleanups in this
code.
- The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
Khandual implements a number of preparatoty cleanups to the
GENERIC_PTDUMP Kconfig logic.
- The 8 patch series "mm/damon: auto-tune aggregation interval" from
SeongJae Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did
this in preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The 2 patch series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the code
easier to follow.
- The 3 patch series "page_counter cleanup and size reduction" from
Shakeel Butt cleans up the page_counter code and fixes a size increase
which we accidentally added late last year.
- The 3 patch series "Add a command line option that enables control of
how many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The 3 patch series "Fix calculations in trace_balance_dirty_pages()
for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The 9 patch series "mm/damon: make allow filters after reject filters
useful and intuitive" from SeongJae Park improves the handling of allow
and reject filters. Behaviour is made more consistent and the
documention is updated accordingly.
- The 5 patch series "Switch zswap to object read/write APIs" from Yosry
Ahmed updates zswap to the new object read/write APIs and thus permits
the removal of some legacy code from zpool and zsmalloc.
- The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
does as it claims.
- The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
is a preparatoty refactoring and cleanup of the mremap() code.
- The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
+ CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
filters based on handling layers" from SeongJae Park adds a couple of
new sysfs directories to ease the management of DAMON/DAMOS filters.
- The 13 patch series "arch, mm: reduce code duplication in mem_init()"
from Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The 13 patch series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The 3 patch series "mm: page_ext: Introduce new iteration API" from
Luiz Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The 8 patch series "Buddy allocator like (or non-uniform) folio split"
from Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios are
generated.
- The 2 patch series "Minimize xa_node allocation during xarry split"
from Zi Yan reduces the number of xarray xa_nodes which are generated
during an xarray split.
- The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The 3 patch series "Add tracepoints for lowmem reserves, watermarks
and totalreserve_pages" from Martin Liu adds some more tracepoints to
the page allocator code.
- The 4 patch series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which SeongJae
observed during his earlier madvise work.
- The 3 patch series "mm/hwpoison: Fix regressions in memory failure
handling" from Shuai Xue addresses two quite serious regressions which
Shuai has observed in the memory-failure implementation.
- The 5 patch series "mm: reliable huge page allocator" from Johannes
Weiner makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The 5 patch series "Minor memcg cleanups & prep for memdescs" from
Matthew Wilcox is preparatory work for the future implementation of
memdescs.
- The 4 patch series "track memory used by balloon drivers" from Nico
Pache introduces a way to track memory used by our various balloon
drivers.
- The 2 patch series "mm/damon: introduce DAMOS filter type for active
pages" from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
Hao Jia separates the proactive reclaim statistics from the direct
reclaim statistics.
- The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
from Jinjiang Tu fixes our handling of hwpoisoned pages within the
reclaim code.
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
=Pn2K
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmfmt1AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxqNw//cC1BlVe4UUVR5r9nfpoFeGAZeDJz
32naGvZKjwL0tR6dStO/BEZx4QrBp+smVfJfuxtQxfzLHLgMigM2jVhfa7XUmaun
7yZlJZu4Jmydc57sPf56CFOYOFP6zyPzSaE8u1Eb4IIqpvuoYpvDayDt6PSsLmFS
PDzqmicT3nuNbbcfE4rYLyL6JsXooKCR1h+NNcDjy7Run9DvQbE6N2PPvXCu6O97
aC3+kYUydEpgn9DfjBDghe+GBQCkBPldwnWqXBxKDFmYj5bKFujNccS9/IDSEuuX
oWntDRAXgLWg048sBC1AuJQajF3UaqffRGJkzUBaZWbU/jB9t5N/Z3GpYlXzizRx
CAqnt/ciGUKVbaESKwoeeIqgK+wG1bnrmoEaJHXFGqjr6sjm2A2T5EzyBMJ1hFwE
wUq6SDnkp5igG7rWtsBPo/lGa5h/pNlaXng11570ikD2ZfHVfRgwy2MpXYxChrkt
X2q/lRYU7yyfNJQ8O5LQJ6bYztatjxT0TxXNFv+cxfVrdI7vnQMuaeBE352jn+Lo
aZ8fTJDFbRXtrZolcoetZjBdHwPIS42wYQtvxo/ylUl64xKEzZEzN09XODjwn74v
nuOzDtbM0TAjZWxi6bwRFPnemTEAQxPuv1i4VdPQ7yblvC0at2agNBsz6Fdq60GS
s80YMJVUU0UcVoc=
=gM1i
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Enable Configuration RRS SV, which makes device readiness visible,
early instead of during child bus scanning (Bjorn Helgaas)
- Log debug messages about reset methods being used (Bjorn Helgaas)
- Avoid reset when it has been disabled via sysfs (Nishanth
Aravamudan)
- Add common pci-ep-bus.yaml schema for exporting several peripherals
of a single PCI function via devicetree (Andrea della Porta)
- Create DT nodes for PCI host bridges to enable loading device tree
overlays to create platform devices for PCI devices that have
several features that require multiple drivers (Herve Codina)
Resource management:
- Enlarge devres table[] to accommodate bridge windows, ROM, IOV
BARs, etc., and validate BAR index in devres interfaces (Philipp
Stanner)
- Fix typo that repeatedly distributed resources to a bridge instead
of iterating over subordinate bridges, which resulted in too little
space to assign some BARs (Kai-Heng Feng)
- Relax bridge window tail sizing for optional resources, e.g., IOV
BARs, to avoid failures when removing and re-adding devices (Ilpo
Järvinen)
- Allow drivers to enable devices even if we haven't assigned
optional IOV resources to them (Ilpo Järvinen)
- Rework handling of optional resources (IOV BARs, ROMs) to reduce
failures if we can't allocate them (Ilpo Järvinen)
- Fix a NULL dereference in the SR-IOV VF creation error path (Shay
Drory)
- Fix s390 mmio_read/write syscalls, which didn't cause page faults
in some cases, which broke vfio-pci lazy mapping on first access
(Niklas Schnelle)
- Add pdev->non_mappable_bars to replace CONFIG_VFIO_PCI_MMAP, which
was disabled only for s390 (Niklas Schnelle)
- Support mmap of PCI resources on s390 except for ISM devices
(Niklas Schnelle)
ASPM:
- Delay pcie_link_state deallocation to avoid dangling pointers that
cause invalid references during hot-unplug (Daniel Stodden)
Power management:
- Allow PCI bridges to go to D3Hot when suspending on all non-x86
systems (Manivannan Sadhasivam)
Power control:
- Create pwrctrl devices in pci_scan_device() to make it more
symmetric with pci_pwrctrl_unregister() and make pwrctrl devices
for PCI bridges possible (Manivannan Sadhasivam)
- Unregister pwrctrl devices in pci_destroy_dev() so DOE, ASPM, etc.
can still access devices after pci_stop_dev() (Manivannan
Sadhasivam)
- If there's a pwrctrl device for a PCI device, skip scanning it
because the pwrctrl core will rescan the bus after the device is
powered on (Manivannan Sadhasivam)
- Add a pwrctrl driver for PCI slots based on voltage regulators
described via devicetree (Manivannan Sadhasivam)
Bandwidth control:
- Add set_pcie_speed.sh to TEST_PROGS to fix issue when executing the
set_pcie_cooling_state.sh test case (Yi Lai)
- Avoid a NULL pointer dereference when we run out of bus numbers to
assign for a bridge secondary bus (Lukas Wunner)
Hotplug:
- Drop superfluous pci_hotplug_slot_list, try_module_get() calls, and
NULL pointer checks (Lukas Wunner)
- Drop shpchp module init/exit logging, replace shpchp dbg() with
ctrl_dbg(), and remove unused dbg(), err(), info(), warn() wrappers
(Ilpo Järvinen)
- Drop 'shpchp_debug' module parameter in favor of standard dynamic
debugging (Ilpo Järvinen)
- Drop unused cpcihp .get_power(), .set_power() function pointers
(Guilherme Giacomo Simoes)
- Disable hotplug interrupts in portdrv only when pciehp is not
enabled to avoid issuing two hotplug commands too close together
(Feng Tang)
- Skip pciehp 'device replaced' check if the device has been removed
to address a deadlock when resuming after a device was removed
during system sleep (Lukas Wunner)
- Don't enable pciehp hotplug interupt when resuming in poll mode
(Ilpo Järvinen)
Virtualization:
- Fix bugs in 'pci=config_acs=' kernel command line parameter (Tushar
Dave)
DOE:
- Expose supported DOE features via sysfs (Alistair Francis)
- Allow DOE support to be enabled even if CXL isn't enabled (Alistair
Francis)
Endpoint framework:
- Convert PCI device data so pci-epf-test works correctly on
big-endian endpoint systems (Niklas Cassel)
- Add BAR_RESIZABLE type to endpoint framework and add DWC core
support for EPF drivers to set BAR_RESIZABLE type and size (Niklas
Cassel)
- Fix pci-epf-test double free that causes an oops if the host
reboots and PERST# deassertion restarts endpoint BAR allocation
(Christian Bruel)
- Fix endpoint BAR testing so tests can skip disabled BARs instead of
reporting them as failures (Niklas Cassel)
- Widen endpoint test BAR size variable to accommodate BARs larger
than INT_MAX (Niklas Cassel)
- Remove unused tools 'pci' build target left over after moving tests
to tools/testing/selftests/pci_endpoint (Jianfeng Liu)
Altera PCIe controller driver:
- Add DT binding and driver support for Agilex family (P-Tile,
F-Tile, R-Tile) (Matthew Gerlach and D M, Sharath Kumar)
AMD MDB PCIe controller driver:
- Add DT binding and driver for AMD MDB (Multimedia DMA Bridge)
(Thippeswamy Havalige)
Broadcom STB PCIe controller driver:
- Add BCM2712 MSI-X DT binding and interrupt controller drivers and
add softdep on irq_bcm2712_mip driver to ensure that it is loaded
first (Stanimir Varbanov)
- Expand inbound window map to 64GB so it can accommodate BCM2712
(Stanimir Varbanov)
- Add BCM2712 support and DT updates (Stanimir Varbanov)
- Apply link speed restriction before bringing link up, not after
(Jim Quinlan)
- Update Max Link Speed in Link Capabilities via the internal
writable register, not the read-only config register (Jim Quinlan)
- Handle regulator_bulk_get() error to avoid panic when we call
regulator_bulk_free() later (Jim Quinlan)
- Disable regulators only when removing the bus immediately below a
Root Port because we don't support regulators deeper in the
hierarchy (Jim Quinlan)
- Make const read-only arrays static (Colin Ian King)
Cadence PCIe endpoint driver:
- Correct MSG TLP generation so endpoints can generate INTx messages
(Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Identify the second controller on i.MX8MQ based on devicetree
'linux,pci-domain' instead of DBI 'reg' address (Richard Zhu)
- Remove imx_pcie_cpu_addr_fixup() since dwc core can now derive the
ATU input address (using parent_bus_offset) from devicetree (Frank
Li)
Freescale Layerscape PCIe controller driver:
- Drop deprecated 'num-ib-windows' and 'num-ob-windows' and
unnecessary 'status' from example (Krzysztof Kozlowski)
- Correct the syscon_regmap_lookup_by_phandle_args("fsl,pcie-scfg")
arg_count to fix probe failure on LS1043A (Ioana Ciornei)
HiSilicon STB PCIe controller driver:
- Call phy_exit() to clean up if histb_pcie_probe() fails (Christophe
JAILLET)
Intel Gateway PCIe controller driver:
- Remove intel_pcie_cpu_addr() since dwc core can now derive the ATU
input address (using parent_bus_offset) from devicetree (Frank Li)
Intel VMD host bridge driver:
- Convert vmd_dev.cfg_lock from spinlock_t to raw_spinlock_t so
pci_ops.read() will never sleep, even on PREEMPT_RT where
spinlock_t becomes a sleepable lock, to avoid calling a sleeping
function from invalid context (Ryo Takakura)
MediaTek PCIe Gen3 controller driver:
- Remove leftover mac_reset assert for Airoha EN7581 SoC (Lorenzo
Bianconi)
- Add EN7581 PBUS controller 'mediatek,pbus-csr' DT property and
program host bridge memory aperture to this syscon node (Lorenzo
Bianconi)
Qualcomm PCIe controller driver:
- Add qcom,pcie-ipq5332 binding (Varadarajan Narayanan)
- Add qcom i.MX8QM and i.MX8QXP/DXP optional DMA interrupt (Alexander
Stein)
- Add optional dma-coherent DT property for Qualcomm SA8775P (Dmitry
Baryshkov)
- Make DT iommu property required for SA8775P and prohibited for
SDX55 (Dmitry Baryshkov)
- Add DT IOMMU and DMA-related properties for Qualcomm SM8450 (Dmitry
Baryshkov)
- Add endpoint DT properties for SAR2130P and enable endpoint mode in
driver (Dmitry Baryshkov)
- Describe endpoint BAR0 and BAR2 as 64-bit only and BAR1 and BAR3 as
RESERVED (Manivannan Sadhasivam)
Rockchip DesignWare PCIe controller driver:
- Describe rk3568 and rk3588 BARs as Resizable, not Fixed (Niklas
Cassel)
Synopsys DesignWare PCIe controller driver:
- Add debugfs-based Silicon Debug, Error Injection, Statistical
Counter support for DWC (Shradha Todi)
- Add debugfs property to expose LTSSM status of DWC PCIe link (Hans
Zhang)
- Add Rockchip support for DWC debugfs features (Niklas Cassel)
- Add dw_pcie_parent_bus_offset() to look up the parent bus address
of a specified 'reg' property and return the offset from the CPU
physical address (Frank Li)
- Use dw_pcie_parent_bus_offset() to derive CPU -> ATU addr offset
via 'reg[config]' for host controllers and 'reg[addr_space]' for
endpoint controllers (Frank Li)
- Apply struct dw_pcie.parent_bus_offset in ATU users to remove use
of .cpu_addr_fixup() when programming ATU (Frank Li)
TI J721E PCIe driver:
- Correct the 'link down' interrupt bit for J784S4 (Siddharth
Vadapalli)
TI Keystone PCIe controller driver:
- Describe AM65x BARs 2 and 5 as Resizable (not Fixed) and reduce
alignment requirement from 1MB to 64KB (Niklas Cassel)
Xilinx Versal CPM PCIe controller driver:
- Free IRQ domain in probe error path to avoid leaking it
(Thippeswamy Havalige)
- Add DT .compatible "xlnx,versal-cpm5nc-host" and driver support for
Versal Net CPM5NC Root Port controller (Thippeswamy Havalige)
- Add driver support for CPM5_HOST1 (Thippeswamy Havalige)
Miscellaneous:
- Convert fsl,mpc83xx-pcie binding to YAML (J. Neuschäfer)
- Use for_each_available_child_of_node_scoped() to simplify apple,
kirin, mediatek, mt7621, tegra drivers (Zhang Zekun)"
* tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (197 commits)
PCI: layerscape: Fix arg_count to syscon_regmap_lookup_by_phandle_args()
PCI: j721e: Fix the value of .linkdown_irq_regfield for J784S4
misc: pci_endpoint_test: Add support for PCITEST_IRQ_TYPE_AUTO
PCI: endpoint: pci-epf-test: Expose supported IRQ types in CAPS register
PCI: dw-rockchip: Endpoint mode cannot raise INTx interrupts
PCI: endpoint: Add intx_capable to epc_features struct
dt-bindings: PCI: Add common schema for devices accessible through PCI BARs
PCI: intel-gw: Remove intel_pcie_cpu_addr()
PCI: imx6: Remove imx_pcie_cpu_addr_fixup()
PCI: dwc: Use parent_bus_offset to remove need for .cpu_addr_fixup()
PCI: dwc: ep: Ensure proper iteration over outbound map windows
PCI: dwc: ep: Use devicetree 'reg[addr_space]' to derive CPU -> ATU addr offset
PCI: dwc: ep: Consolidate devicetree handling in dw_pcie_ep_get_resources()
PCI: dwc: ep: Call epc_create() early in dw_pcie_ep_init()
PCI: dwc: Use devicetree 'reg[config]' to derive CPU -> ATU addr offset
PCI: dwc: Add dw_pcie_parent_bus_offset() checking and debug
PCI: dwc: Add dw_pcie_parent_bus_offset()
PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustion
PCI: xilinx-cpm: Add cpm_csr register mapping for CPM5_HOST1 variant
PCI: brcmstb: Make const read-only arrays static
...
This reverts commit 36f5f026df, reversing
changes made to 43a7eec035.
Thomas says:
"I just noticed that for some incomprehensible reason, probably sheer
incompetemce when trying to utilize b4, I managed to merge an outdated
_and_ buggy version of that series.
Can you please revert that merge completely?"
Done.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PASID usage requires PASID support in both device and IOMMU. Since the
iommu drivers always enable the PASID capability for the device if it
is supported, this extends the IOMMU_GET_HW_INFO to report the PASID
capability to userspace. Also, enhances the selftest accordingly.
Link: https://patch.msgid.link/r/20250321180143.8468-5-yi.l.liu@intel.com
Cc: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> #aarch64 platform
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
- Ioremap() msg_res region using res->start (the CPU address), not the ATU
'cpu_addr', which will be replaced with the ATU input address (which may
not be the CPU address) (Frank Li)
- Rename struct dw_pcie_ob_atu_cfg.cpu_addr to 'parent_bus_addr' (Frank Li)
- Call devm_pci_alloc_host_bridge() early in dw_pcie_host_init() to keep
devicetree-related code together (Frank Li)
- Consolidate devicetree handling in dw_pcie_host_get_resources() (Bjorn
Helgaas)
- Add dw_pcie_parent_bus_offset() to look up the parent bus address of a
specified 'reg' property and return the offset from the CPU physical
address (Frank Li)
- Add cross-checking with .cpu_addr_fixup() and debug logging to
dw_pcie_parent_bus_offset() (Frank Li)
- Use devicetree 'reg[config]' via dw_pcie_parent_bus_offset() to derive
CPU -> ATU addr offset for host controller (Frank Li)
- Call epc_create() early in dw_pcie_ep_init() to keep devicetree-related
code together (Bjorn Helgaas)
- Consolidate devicetree handling in dw_pcie_ep_get_resources() (Bjorn
Helgaas)
- Use devicetree 'reg[addr_space]' via dw_pcie_parent_bus_offset() to
derive CPU -> ATU addr offset for endpoint controller (Frank Li)
- Update dw_pcie_find_index() to remove assumption that ATU input address
is non-zero (Frank Li)
- Apply struct dw_pcie.parent_bus_offset in ATU users to remove use of
.cpu_addr_fixup() when programming ATU (Frank Li)
- Remove imx_pcie_cpu_addr_fixup() since dwc core can now derive the ATU
input address (using parent_bus_offset) from devicetree (Frank Li)
- Remove intel_pcie_cpu_addr() since dwc core can now derive the ATU input
address (using parent_bus_offset) from devicetree (Frank Li)
* pci/controller/dwc-cpu-addr-fixup:
PCI: intel-gw: Remove intel_pcie_cpu_addr()
PCI: imx6: Remove imx_pcie_cpu_addr_fixup()
PCI: dwc: Use parent_bus_offset to remove need for .cpu_addr_fixup()
PCI: dwc: ep: Ensure proper iteration over outbound map windows
PCI: dwc: ep: Use devicetree 'reg[addr_space]' to derive CPU -> ATU addr offset
PCI: dwc: ep: Consolidate devicetree handling in dw_pcie_ep_get_resources()
PCI: dwc: ep: Call epc_create() early in dw_pcie_ep_init()
PCI: dwc: Use devicetree 'reg[config]' to derive CPU -> ATU addr offset
PCI: dwc: Add dw_pcie_parent_bus_offset() checking and debug
PCI: dwc: Add dw_pcie_parent_bus_offset()
PCI: dwc: Consolidate devicetree handling in dw_pcie_host_get_resources()
PCI: dwc: Call devm_pci_alloc_host_bridge() early in dw_pcie_host_init()
PCI: dwc: Rename cpu_addr to parent_bus_addr for ATU configuration
PCI: dwc: Use resource start as ioremap() input in dw_pcie_pme_turn_off()
# Conflicts:
# drivers/pci/controller/dwc/pcie-designware.c
# drivers/pci/controller/dwc/pcie-designware.h
- Free IRQ domain in probe error path to avoid leaking it (Thippeswamy
Havalige)
- Add DT .compatible "xlnx,versal-cpm5nc-host" and driver support for
Versal Net CPM5NC Root Port controller (Thippeswamy Havalige)
- Add driver support for CPM5_HOST1 (Thippeswamy Havalige)
* pci/controller/xilinx-cpm:
PCI: xilinx-cpm: Add cpm_csr register mapping for CPM5_HOST1 variant
PCI: xilinx-cpm: Add support for Versal Net CPM5NC Root Port controller
dt-bindings: PCI: xilinx-cpm: Add compatible string for CPM5NC Versal Net host
PCI: xilinx-cpm: Fix IRQ domain leak in error path of probe
- Convert vmd_dev.cfg_lock from spinlock_t to raw_spinlock_t so
pci_ops.read() will never sleep, even on PREEMPT_RT where spinlock_t
becomes a sleepable lock (Ryo Takakura)
* pci/controller/vmd:
PCI: vmd: Make vmd_dev::cfg_lock a raw_spinlock_t type
- Describe endpoint BAR0 and BAR2 as 64-bit only and BAR1 and BAR3 as
RESERVED (Manivannan Sadhasivam)
- Add optional dma-coherent DT property for Qualcomm SA8775P (Dmitry
Baryshkov)
- Make DT iommu property required for SA8775P and prohibited for SDX55
(Dmitry Baryshkov)
- Add DT iommu and DMA-related properties for Qualcomm SM8450 (Dmitry
Baryshkov)
- Consolidate DMA vs non-DMA cases in DT (Dmitry Baryshkov)
- Add endpoint DT properties for SAR2130P and enable endpoint mode in
driver (Dmitry Baryshkov)
* pci/controller/qcom:
PCI: qcom-ep: Enable EP mode support for SAR2130P
dt-bindings: PCI: qcom-ep: Add SAR2130P compatible
dt-bindings: PCI: qcom-ep: Consolidate DMA vs non-DMA cases
dt-bindings: PCI: qcom-ep: Enable DMA for SM8450
dt-bindings: PCI: qcom-ep: Describe optional IOMMU
dt-bindings: PCI: qcom-ep: Describe optional dma-coherent property
PCI: qcom-ep: Mark BAR0/BAR2 as 64bit BARs and BAR1/BAR3 as RESERVED
- Correct the 'link down' interrupt bit for J784S4 (Siddharth Vadapalli)
* pci/controller/j721e:
PCI: j721e: Fix the value of .linkdown_irq_regfield for J784S4
- Identify the second controller on i.MX8MQ based on devicetree
'linux,pci-domain' instead of DBI 'reg' address (Richard Zhu)
- Use devm_clk_bulk_get_all() to fetch clocks to simplify the code (Richard
Zhu)
* pci/controller/imx6:
PCI: imx6: Use devm_clk_bulk_get_all() to fetch clocks
PCI: imx6: Identify controller via 'linux,pci-domain', not address
- Correct comment to say that invalidations from a PF driver are delivered
to the VF endpoint driver, not by the VF driver (Easwar Hariharan)
* pci/controller/hyperv:
PCI: hv: Correct a comment
- Call phy_exit() to clean up if histb_pcie_probe() fails (Christophe
JAILLET)
* pci/controller/histb:
PCI: histb: Fix an error handling path in histb_pcie_probe()
- Move struct dwc_pcie_vsec_id to include/linux/pcie-dwc.h, where it can be
shared by debugfs, perf, sysfs, etc (Manivannan Sadhasivam)
- Add dw_pcie_find_vsec_capability() to locate Vendor Specific Extended
Capabilities (Shradha Todi)
- Add debugfs-based Silicon Debug, Error Injection, Statistical Counter
support for DWC (Shradha Todi)
- Add debugfs property to expose LTSSM status of DWC PCIe link (Hans Zhang)
- Add Rockchip Vendor ID and Vendor Specific ID of RAS DES Capability so
the DWC debugfs features work for Rockchip as well (Niklas Cassel)
* pci/controller/dwc:
PCI: dw-rockchip: Hide broken ATS capability for RK3588 running in EP mode
PCI: dwc: ep: Add dw_pcie_ep_hide_ext_capability()
PCI: dwc: ep: Return -ENOMEM for allocation failures
PCI: dwc: Add Rockchip to the RAS DES allowed vendor list
PCI: Add Rockchip Vendor ID
PCI: dwc: Add debugfs property to provide LTSSM status of the PCIe link
PCI: dwc: Add debugfs based Statistical Counter support for DWC
PCI: dwc: Add debugfs based Error Injection support for DWC
PCI: dwc: Add debugfs based Silicon Debug support for DWC
PCI: dwc: Add helper to find the Vendor Specific Extended Capability (VSEC)
perf/dwc_pcie: Move common DWC struct definitions to 'pcie-dwc.h'
- Correct MSG TLP generation so endpoint can generate INTx messages (Hans
Zhang)
* pci/controller/cadence:
PCI: cadence-ep: Fix the driver to send MSG TLP for INTx without data payload
- Add missing of_node refcount release after of_parse_phandle() (Stanimir
Varbanov)
- Add BCM2712 MSI-X DT binding and interrupt controller drivers (Stanimir
Varbanov)
- Add brcmstb softdep on irq_bcm2712_mip MIP MSI-X interrupt controller
driver to ensure that it is loaded first (Stanimir Varbanov)
- Add struct brcm_pcie pointer to pcie_cfg_data so we can reference the
pcie_cfg_data directly instead of copying it to brcm_pcie (Stanimir
Varbanov)
- Expand inbound window map to 64GB so it can accommodate BCM2712 (Stanimir
Varbanov)
- Add BCM2712 support and DT updates (Stanimir Varbanov)
- Apply link speed restriction before bringing link up, not after (Jim
Quinlan)
- Update Max Link Speed in Link Capabilities via the internal writable
register, not the read-only config register (Jim Quinlan)
- Handle regulator_bulk_get() error to avoid panic when we call
regulator_bulk_free() later (Jim Quinlan)
- Disable regulators only when removing the bus immediately below a Root
Port because we don't support regulators deeper in the hierarchy (Jim
Quinlan)
- Consistently use config access index/data register offsets from the
SoC-specific pcie_offsets[] table (Jim Quinlan)
- Update MDIO register fields that reduced CMD from 12 bits to 1 and
widened PORT from 4 bits to 5 and split it into two parts (Jim Quinlan)
- Make const read-only arrays static (Colin Ian King)
* pci/controller/brcmstb:
PCI: brcmstb: Make const read-only arrays static
PCI: brcmstb: Make irq_domain_set_info() parameter cast explicit
PCI: brcmstb: Make two changes in MDIO register fields
PCI: brcmstb: Use same constant table for config space access
PCI: brcmstb: Fix potential premature regulator disabling
PCI: brcmstb: Fix error path after a call to regulator_bulk_get()
PCI: brcmstb: Do not assume that register field starts at LSB
PCI: brcmstb: Use internal register to change link capability
PCI: brcmstb: Set generation limit before PCIe link up
PCI: brcmstb: Add BCM2712 support
PCI: brcmstb: Expand inbound window size up to 64GB
PCI: brcmstb: Reuse pcie_cfg_data structure
PCI: brcmstb: Add a softdep to MIP MSI-X driver
irqchip: Add Broadcom BCM2712 MSI-X interrupt controller
dt-bindings: PCI: brcmstb: Update bindings for PCIe on BCM2712
dt-bindings: interrupt-controller: Add BCM2712 MSI-X bindings
PCI: brcmstb: Fix missing of_node_put() in brcm_pcie_probe()
- Add DT binding for Agilex family (P-Tile, F-Tile, R-Tile) (Matthew
Gerlach)
- Add driver support for Agilex family (P-Tile, F-Tile, R-Tile) (D M,
Sharath Kumar)
* pci/controller/altera:
PCI: altera: Add Agilex support
dt-bindings: PCI: altera: Add binding for Agilex
- Use for_each_available_child_of_node_scoped() to simplify apple, kirin,
mediatek, mt7621, tegra drivers (Zhang Zekun)
* pci/scoped-cleanup:
PCI: tegra: Use helper function for_each_child_of_node_scoped()
PCI: apple: Use helper function for_each_child_of_node_scoped()
PCI: mt7621: Use helper function for_each_available_child_of_node_scoped()
PCI: mediatek: Use helper function for_each_available_child_of_node_scoped()
PCI: kirin: Tidy up _probe() related function with dev_err_probe()
PCI: kirin: Use helper function for_each_available_child_of_node_scoped()
- Fix endpoint BAR testing so the test can skip disabled BARs instead of
reporting them as failures (Niklas Cassel)
- Verify that pci_endpoint interrupt tests set the correct IRQ type
(Kunihiko Hayashi)
- Fix interpretation of pci_endpoint_test_bars_read_bar() error returns
(Niklas Cassel)
- Fix potential string truncation in pci_endpoint_test_probe() (Niklas
Cassel)
- Increase endpoint test BAR size variable to accommodate BARs larger than
INT_MAX (Niklas Cassel)
- Release IRQs to avoid leak in pci_endpoint interrupt tests (Kunihiko
Hayashi)
- Log the correct IRQ type when pci_endpoint IRQ request test fails
(Kunihiko Hayashi)
- Remove pci_endpoint_test irq_type and no_msi globals; instead use
test->irq_type (Kunihiko Hayashi)
- Remove unnecessary use of managed IRQ functions in pci_endpoint_test
(Kunihiko Hayashi)
- Add and use IRQ_TYPE_* defines in pci_endpoint_test (Niklas Cassel)
- Add struct pci_epc_features.intx_capable and note that RK3568 and RK3588
can't raise INTx interrupts (Niklas Cassel)
- Expose supported IRQ types in CAPS so pci_endpoint_test can set
appropriate type (Niklas Cassel)
- Add PCITEST_IRQ_TYPE_AUTO to pci_endpoint_test for cases where the IRQ
type doesn't matter (Niklas Cassel)
* pci/endpoint-test:
misc: pci_endpoint_test: Add support for PCITEST_IRQ_TYPE_AUTO
PCI: endpoint: pci-epf-test: Expose supported IRQ types in CAPS register
PCI: dw-rockchip: Endpoint mode cannot raise INTx interrupts
PCI: endpoint: Add intx_capable to epc_features struct
selftests: pci_endpoint: Use IRQ_TYPE_* defines from UAPI header
misc: pci_endpoint_test: Use IRQ_TYPE_* defines from UAPI header
PCI: endpoint: pcitest: Add IRQ_TYPE_* defines to UAPI header
misc: pci_endpoint_test: Do not use managed IRQ functions
misc: pci_endpoint_test: Remove global 'irq_type' and 'no_msi'
misc: pci_endpoint_test: Fix 'irq_type' to convey the correct type
misc: pci_endpoint_test: Fix displaying 'irq_type' after 'request_irq' error
misc: pci_endpoint_test: Avoid issue of interrupts remaining after request_irq error
misc: pci_endpoint_test: Handle BAR sizes larger than INT_MAX
misc: pci_endpoint_test: Give disabled BARs a distinct error code
misc: pci_endpoint_test: Fix potential truncation in pci_endpoint_test_probe()
misc: pci_endpoint_test: Fix pci_endpoint_test_bars_read_bar() error handling
selftests: pci_endpoint: Add GET_IRQTYPE checks to each interrupt test
selftests: pci_endpoint: Skip disabled BARs
- Convert PCI device data so pci-epf-test works correctly on big-endian
endpoint systems (Niklas Cassel)
- Add BAR_RESIZABLE type to endpoint framework (Niklas Cassel)
- Add pci_epc_bar_size_to_rebar_cap() to convert a size to the Resizable
BAR Capability so endpoint drivers can configure what the Capability
register advertises (Niklas Cassel)
- Add DWC core support for EPF drivers to set BAR_RESIZABLE type and size
via dw_pcie_ep_set_bar() (Niklas Cassel)
- Describe TI AM65x (keystone) BARs 2 and 5 as Resizable, not Fixed
(Niklas Cassel)
- Reduce TI AM65x (keystone) BAR alignment requirement from 1MB to 64KB
(Niklas Cassel)
- Describe Rockchip rk3568 and rk3588 BARs as Resizable, not Fixed (Niklas
Cassel)
- Drop unused devm_pci_epc_destroy() (Zijun Hu)
- Fix pci-epf-test double free that causes an oops if the host reboots and
PERST# deassertion restarts endpoint BAR allocation (Christian Bruel)
- Drop dw_pcie_ep_find_ext_capability() and use
dw_pcie_find_ext_capability() instead (Niklas Cassel)
* pci/endpoint:
PCI: dwc: ep: Remove superfluous function dw_pcie_ep_find_ext_capability()
PCI: endpoint: pci-epf-test: Fix double free that causes kernel to oops
PCI: endpoint: Remove unused devm_pci_epc_destroy()
PCI: dw-rockchip: Describe Resizable BARs as Resizable BARs
PCI: keystone: Specify correct alignment requirement
PCI: keystone: Describe Resizable BARs as Resizable BARs
PCI: dwc: ep: Allow EPF drivers to configure the size of Resizable BARs
PCI: dwc: ep: Move dw_pcie_ep_find_ext_capability()
PCI: endpoint: Add pci_epc_bar_size_to_rebar_cap()
PCI: endpoint: Allow EPF drivers to configure the size of Resizable BARs
PCI: endpoint: pci-epf-test: Handle endianness properly
- Add device_add_of_node() to set dev->of_node and dev->fwnode only if they
haven't been set already (Herve Codina)
- Allow of_pci_set_address() to set the DT address property for root bus
nodes, where there is no PCI bridge to supply the PCI bus/device/function
part of the property (Herve Codina)
- Create DT nodes for PCI host bridges to enable loading device tree
overlays to create platform devices for PCI devices that have several
features that require multiple drivers (Herve Codina)
* pci/devtree-create:
PCI: of: Create device tree PCI host bridge node
PCI: of_property: Constify parameter in of_pci_get_addr_flags()
PCI: of_property: Add support for NULL pdev in of_pci_set_address()
PCI: of: Use device_{add,remove}_of_node() to attach of_node to existing device
driver core: Introduce device_{add,remove}_of_node()
- Use pci_resource_n() to simplify BAR/window resource lookup (Ilpo
Järvinen)
- Fix typo that repeatedly distributed resources to a bridge instead of
iterating over subordinate bridges, which resulted in too little space to
assign some BARs (Kai-Heng Feng)
- Relax bridge window tail sizing for optional resources, e.g., IOV BARs,
to avoid failures when removing and re-adding devices (Ilpo Järvinen)
- Fix a double counting error for I/O resources, as we previously did for
memory resources (Ilpo Järvinen)
- Use resource_set_{range,size}() helpers in more places (Ilpo Järvinen)
- Add pci_resource_is_iov() to identify IOV resources (Ilpo Järvinen)
- Add pci_resource_num() to look up the BAR number from the resource
pointer (Ilpo Järvinen)
- Add restore_dev_resource() to simplify code that resources saved device
resources (Ilpo Järvinen)
- Allow drivers to enable devices even if we haven't assigned optional IOV
resources to them (Ilpo Järvinen)
- Improve debug output during resource reallocation (Ilpo Järvinen)
- Rework handling of optional resources (IOV BARs, ROMs) to reduce failures
if we can't allocate them (Ilpo Järvinen)
- Move declarations of pci_rescan_bus_bridge_resize(),
pci_reassign_bridge_resources(), and CardBus-related sizes from
include/linux/pci.h to drivers/pci/pci.h since they're not used outside
the PCI core (Ilpo Järvinen)
- Make pci_setup_bridge() static (Ilpo Järvinen)
- Fix a NULL dereference in the SR-IOV VF creation error path (Shay Drory)
- Fix s390 mmio_read/write syscalls, which didn't cause page faults in some
cases, which broke vfio-pci lazy mapping on first access (Niklas
Schnelle)
- Add pdev->non_mappable_bars to replace CONFIG_VFIO_PCI_MMAP, which was
disabled only for s390 (Niklas Schnelle)
- Support mmap of PCI resources on s390 except for ISM devices (Niklas
Schnelle)
* pci/resource:
s390/pci: Support mmap() of PCI resources except for ISM devices
s390/pci: Introduce pdev->non_mappable_bars and replace VFIO_PCI_MMAP
s390/pci: Fix s390_mmio_read/write syscall page fault handling
PCI: Fix NULL dereference in SR-IOV VF creation error path
PCI: Move cardbus IO size declarations into pci/pci.h
PCI: Make pci_setup_bridge() static
PCI: Move resource reassignment func declarations into pci/pci.h
PCI: Move pci_rescan_bus_bridge_resize() declaration to pci/pci.h
PCI: Fix BAR resizing when VF BARs are assigned
PCI: Do not claim to release resource falsely
PCI: Increase Resizable BAR support from 512 GB to 128 TB
PCI: Rework optional resource handling
PCI: Perform reset_resource() and build fail list in sync
PCI: Use res->parent to check if resource is assigned
PCI: Add debug print when releasing resources before retry
PCI: Indicate optional resource assignment failures
PCI: Always have realloc_head in __assign_resources_sorted()
PCI: Extend enable to check for any optional resource
PCI: Add restore_dev_resource()
PCI: Remove incorrect comment from pci_reassign_resource()
PCI: Consolidate assignment loop next round preparation
PCI: Rename retval to ret
PCI: Use while loop and break instead of gotos
PCI: Refactor pdev_sort_resources() & __dev_sort_resources()
PCI: Converge return paths in __assign_resources_sorted()
PCI: Add dev & res local variables to resource assignment funcs
PCI: Add pci_resource_num() helper
PCI: Check resource_size() separately
PCI: Add pci_resource_is_iov() to identify IOV resources
PCI: Use resource_set_{range,size}() helpers
PCI: Use SZ_* instead of literals in setup-bus.c
PCI: Fix old_size lower bound in calculate_iosize() too
PCI: Allow relaxed bridge window tail sizing for optional resources
PCI: Simplify size1 assignment logic
PCI: Use min_align, not unrelated add_align, for size0
PCI: Remove add_align overwrite unrelated to size0
PCI: Use downstream bridges for distributing resources
PCI: Cleanup dev->resource + resno to use pci_resource_n()
- Log debug messages about reset methods being used (Bjorn Helgaas)
- Avoid reset when it has been disabled via sysfs (Nishanth Aravamudan)
* pci/reset:
PCI: Avoid reset when disabled via sysfs
PCI: Log debug messages about reset method
- Create pwrctrl devices in pci_scan_device() to make it more symmetric
with pci_pwrctrl_unregister() and make pwrctrl devices for PCI bridges
possible (Manivannan Sadhasivam)
- Unregister pwrctrl devices in pci_destroy_dev() so DOE, ASPM, etc. can
still access devices after pci_stop_dev() (Manivannan Sadhasivam)
- If there's a pwrctrl device for a PCI device, skip scanning it because
the pwrctrl core will rescan the bus after the device is powered on
(Manivannan Sadhasivam)
- Add a pwrctrl driver for PCI slots based on voltage regulators described
via devicetree (Manivannan Sadhasivam)
* pci/pwrctrl:
PCI/pwrctrl: Add pwrctrl driver for PCI slots
dt-bindings: vendor-prefixes: Document the 'pciclass' prefix
PCI/pwrctrl: Skip scanning for the device further if pwrctrl device is created
PCI/pwrctrl: Move pci_pwrctrl_unregister() to pci_destroy_dev()
PCI/pwrctrl: Move creation of pwrctrl devices to pci_scan_device()
- Drop shpchp module init/exit logging (Ilpo Järvinen)
- Replace shpchp dbg() with ctrl_dbg() and remove unused dbg(), err(),
info(), warn() wrappers (Ilpo Järvinen)
- Drop 'shpchp_debug' module parameter in favor of standard dynamic
debugging (Ilpo Järvinen)
- Drop unused .get_power(), .set_power() function pointers (Guilherme
Giacomo Simoes)
- Drop superfluous pci_hotplug_slot_list (Lukas Wunner)
- Drop superfluous try_module_get() calls (Lukas Wunner)
- Drop superfluous NULL pointer checks (Lukas Wunner)
- Pass struct hotplug_slot pointers directly to avoid backpointer
dereferencing in has_*_file() (Lukas Wunner)
- Inline pci_hp_{create,remove}_module_link() to reduce exported symbols
(Lukas Wunner)
- Disable hotplug interrupts in portdrv only when pciehp is not enabled to
prevent issuing two hotplug commands too close together (Feng Tang)
- Skip pciehp 'device replaced' check if the device has been removed to
address a common deadlock when resuming after a device was removed during
system sleep (Lukas Wunner)
- Don't enable pciehp hotplug interupt when resuming in poll mode (Ilpo
Järvinen)
* pci/hotplug:
PCI: pciehp: Don't enable HPIE when resuming in poll mode
PCI: pciehp: Avoid unnecessary device replacement check
PCI/portdrv: Only disable pciehp interrupts early when needed
PCI: hotplug: Inline pci_hp_{create,remove}_module_link()
PCI: hotplug: Avoid backpointer dereferencing in has_*_file()
PCI: hotplug: Drop superfluous NULL pointer checks in has_*_file()
PCI: hotplug: Drop superfluous try_module_get() calls
PCI: hotplug: Drop superfluous pci_hotplug_slot_list
PCI: cpcihp: Remove unused .get_power() and .set_power()
PCI: shpchp: Remove 'shpchp_debug' module parameter
PCI: shpchp: Remove unused logging wrappers
PCI: shpchp: Change dbg() -> ctrl_dbg()
PCI: shpchp: Remove logging from module init/exit functions
- Enable Configuration RRS SV early instead of during child bus scanning
(Bjorn Helgaas)
- Cache offset of Resizable BAR capability to avoid redundant searches for
it (Bjorn Helgaas)
- Fix reference leaks in pci_register_host_bridge() and
pci_alloc_child_bus() (Ma Ke)
- Drop put_device() in pci_register_host_bridge() left over from converting
device_register() to device_add() (Dan Carpenter)
* pci/enumeration:
PCI: Remove stray put_device() in pci_register_host_bridge()
PCI: Fix reference leak in pci_alloc_child_bus()
PCI: Fix reference leak in pci_register_host_bridge()
PCI: Cache offset of Resizable BAR capability
PCI: Enable Configuration RRS SV early
- Rename DOE 'protocol' to 'feature' to follow spec terminology (Alistair
Francis)
- Expose supported DOE features via sysfs (Alistair Francis)
- Allow DOE support to be enabled even if CXL isn't enabled (Alistair
Francis)
* pci/doe:
PCI/DOE: Allow enabling DOE without CXL
PCI/DOE: Expose DOE features via sysfs
PCI/DOE: Rename Discovery Response Data Object Contents to type
PCI/DOE: Rename DOE protocol to feature
- Enlarge the devres table[] to accommodate bridge windows, ROM, IOV BARs,
etc (Philipp Stanner)
- Validate BAR index in devres interfaces (Philipp Stanner)
* pci/devres:
PCI: Check BAR index for validity
PCI: Fix wrong length of devres array
- Add set_pcie_speed.sh to TEST_PROGS to fix issue when executing the
set_pcie_cooling_state.sh test case (Yi Lai)
- Fix the pcie_bwctrl_select_speed() return value in cases where a
non-compliant device doesn't advertise valid supported speeds (Ilpo
Järvinen)
- Avoid a NULL pointer dereference when we run out of bus numbers to assign
for a bridge secondary bus (Lukas Wunner)
* pci/bwctrl:
PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustion
PCI/bwctrl: Fix pcie_bwctrl_select_speed() return type
selftests/pcie_bwctrl: Add 'set_pcie_speed.sh' to TEST_PROGS
- Delay pcie_link_state deallocation to avoid dangling pointers that cause
invalid references during hot-unplug (Daniel Stodden)
* pci/aspm:
PCI/ASPM: Fix link state exit during switch upstream function removal
- Implement local aer_printk() since AER is the only place that prints a
message with level depending on the error severity (Ilpo Järvinen)
* pci/aer:
PCI/ERR: Handle TLP Log in Flit mode
PCI: Track Flit Mode Status & print it with link status
PCI/AER: Descope pci_printk() to aer_printk()
The arg_count parameter to syscon_regmap_lookup_by_phandle_args()
represents the number of argument cells following the phandle. In this
case, the number of arguments should be 1 instead of 2 since the dt
property looks like this:
fsl,pcie-scfg = <&scfg 0>;
Without this fix, layerscape-pcie fails with the following message on
LS1043A:
OF: /soc/pcie@3500000: phandle scfg@1570000 needs 2, found 1
layerscape-pcie 3500000.pcie: No syscfg phandle specified
layerscape-pcie 3500000.pcie: probe with driver layerscape-pcie failed with error -22
Link: https://lore.kernel.org/r/20250327151949.2765193-1-ioana.ciornei@nxp.com
Fixes: 149fc35734 ("PCI: layerscape: Use syscon_regmap_lookup_by_phandle_args")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Roy Zang <Roy.Zang@nxp.com>
Cc: stable@vger.kernel.org
Including:
- Core: IOMMUFD dependencies from Jason:
- Change the iommufd fault handle into an always present hwpt handle in
the domain
- Give iommufd its own SW_MSI implementation along with some IRQ layer
rework
- Improvements to the handle attach API
- Core: Fixes for probe-issues from Robin
- Intel VT-d changes:
- Checking for SVA support in domain allocation and attach paths
- Move PCI ATS and PRI configuration into probe paths
- Fix a pentential hang on reboot -f
- Miscellaneous cleanups
- AMD-Vi changes:
- Support for up to 2k IRQs per PCI device function
- Set of smaller fixes
- ARM-SMMU changes:
- SMMUv2 devicetree binding updates for Qualcomm implementations
(QCS8300 GPU and MSM8937)
- Clean up SMMUv2 runtime PM implementation to help with wider rework of
pm_runtime_put_autosuspend()
- Rockchip driver changes:
- Driver adjustments for recent DT probing changes
- S390 IOMMU changes:
- Support for IOMMU passthrough
- Apple Dart changes:
- Driver adjustments to meet ISP device requirements
- Null-ptr deref fix
- Disable subpage protection for DART 1
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmfkFSkACgkQK/BELZcB
GuOVUg/9En4QlF0eeg4E1stszGcOOXn9IkKiGP9iqKpL/WqVVitoagn7nwq0rBFF
PdqqcGbSPNHctiUursB18DyENOAmRUE8mGG29po9rAfq5wxZ2yA6eLmpFAf9QDCW
vY3nz2oen1lUrzbPwVI1IR67L6BrtqEUz/1syf295RMr2yXkdmBXb12lbL1rdjv0
7YfWwlDtMOVsewlZmCuDAAfWHtd7cO/ulDcYwq+GOiSXojVmKw8624EFlFbM9E9f
tkceE4mjGQnN/CpZzUc48f1DgIW82PVVTek95Cznbk9ACDJQ4VYYwDMx2WszfMzu
D1CXLiQu0vrxQgFdHn02bw3T4yvXQ4HeawIaTF+1J1gEHbrj6MoVQ88zaO3GnKFk
yxYI5sfOs5iL6zSMkig4OytXuRmRqDVJhgrF1i4XIM3GXrRKf9EiRZ/1eXb+1gab
3Q5k7+u4+7vWbaKLsxOdIBNzz02hb7LVndPIFlWmSnr1YY0LFBfwmgbfTAsSINZo
22Ni8aE2C42lYOZxJ67m3khLpqTZaTERtQIab/gUd5FcuEjxKuHBIzj6SZeC/0Q9
Y6rzklHxJxxtQLrUJp16IzjpiMJbWH0H2jKstRberOpxk3L5aLrJNd8M1B+CokmP
LZeKgqgxV5F+VlR2Bap5AXL9ZEuqjCVDIEy79j8XzAzxobDVnVM=
=a8Pd
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core iommufd dependencies from Jason:
- Change the iommufd fault handle into an always present hwpt handle
in the domain
- Give iommufd its own SW_MSI implementation along with some IRQ
layer rework
- Improvements to the handle attach API
Core fixes for probe-issues from Robin
Intel VT-d changes:
- Checking for SVA support in domain allocation and attach paths
- Move PCI ATS and PRI configuration into probe paths
- Fix a pentential hang on reboot -f
- Miscellaneous cleanups
AMD-Vi changes:
- Support for up to 2k IRQs per PCI device function
- Set of smaller fixes
ARM-SMMU changes:
- SMMUv2 devicetree binding updates for Qualcomm implementations
(QCS8300 GPU and MSM8937)
- Clean up SMMUv2 runtime PM implementation to help with wider rework
of pm_runtime_put_autosuspend()
Rockchip driver changes:
- Driver adjustments for recent DT probing changes
S390 IOMMU changes:
- Support for IOMMU passthrough
Apple Dart changes:
- Driver adjustments to meet ISP device requirements
- Null-ptr deref fix
- Disable subpage protection for DART 1"
* tag 'iommu-updates-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
iommu/vt-d: Fix possible circular locking dependency
iommu/vt-d: Don't clobber posted vCPU IRTE when host IRQ affinity changes
iommu/vt-d: Put IRTE back into posted MSI mode if vCPU posting is disabled
iommu: apple-dart: fix potential null pointer deref
iommu/rockchip: Retire global dma_dev workaround
iommu/rockchip: Register in a sensible order
iommu/rockchip: Allocate per-device data sensibly
iommu/mediatek-v1: Support COMPILE_TEST
iommu/amd: Enable support for up to 2K interrupts per function
iommu/amd: Rename DTE_INTTABLEN* and MAX_IRQS_PER_TABLE macro
iommu/amd: Replace slab cache allocator with page allocator
iommu/amd: Introduce generic function to set multibit feature value
iommu: Don't warn prematurely about dodgy probes
iommu/arm-smmu: Set rpm auto_suspend once during probe
dt-bindings: arm-smmu: Document QCS8300 GPU SMMU
iommu: Get DT/ACPI parsing into the proper probe path
iommu: Keep dev->iommu state consistent
iommu: Resolve ops in iommu_init_device()
iommu: Handle race with default domain setup
iommu: Unexport iommu_fwspec_free()
...
XEN used a global variable to disable the masking of MSI interrupts as
XEN handles that on the hypervisor side. This turned out to be a problem
with VMD as the PCI devices behind a VMD bridge are not always handled
by the hypervisor and then require masking by guest.
To solve this the global variable was replaced by a interrupt domain
specific flag, which is set by the generic XEN PCI/MSI domain, but not by
VMD or any other domain in the system.
So far, so good. But the implementation (and the reviewer) missed the
fact, that accessing the domain flag cannot be done directly because
there are at least two situations, where this fails. Legacy architectures
are not providing interrupt domains at all. The new MSI parent domains do
not require to have a domain info pointer. Both cases result in a
unconditional NULL pointer derefence.
The PCI/MSI code already has a function to query the MSI domain specific
flag in a safe way, which handles all possible cases of PCI/MSI backends.
So the fix it simply to replace the open coded checks by invoking the
safe helper to query the flag.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfkUA4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVymD/4xvcTnjjBQgesBWtkZ/wyvOx+zAoZD
s6ikEnYVzG6jjiob9SXrttJS/fZDuFujmntvq1lp+Lw1MbK/k9qyeKpMUkjVw4V5
gHgPUZUkKapk2Jf3vYy23cb8CuRp9nANi7r//kzefN/pyEmfuqme3AYNJjusyEyn
QfhLGE046GI2PLLS+4FKUjk+/aPEk/Pywy3moaSpDVd8Kmsus5GTmxt/AWcwGfcw
00TSRKstpew+RVLBb7NvSar0D43vtvP/8If6qkIpEttVhB6/4zTREB1z6iPjUSf0
2+90iubCdSm8wO+y0JDmA13cUct2XqW19M75imSLBIsJAlVtgWgOi2IJF787vDPJ
3bN2HeScCbl8+0e4Uv+NOi2c906k43MtC4XTrIq0nZjAuRyF9FEtdK+IrcywxuKH
nuaHtlseGWg97kU49yDTi1k+tRvIpmsibKM4pKZoUZKNMfE2soCFjzj40O7hTcKJ
LaN/fcDdPGnn+GMPywwIWc7e6kOIYnCqakom/x7e+KWpqfujnM3KLiyOy9iys/oL
xXfKnMbbmL0CSdUrC7EJn20KFRA2ovAJ+ouzJRkFhCdERCVidY29O3VrstacbevS
zQKb302PbFraGHa/MiuzyF0BuWuM0T0d12+2wEoVJpN/GTAbgO2vVbk2gvHEReXJ
eMDn02f5lF659w==
=ZPMY
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2025-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI irq fix from Thomas Gleixner:
"An urgent fix for the XEN related PCI/MSI changes:
XEN used a global variable to disable the masking of MSI interrupts as
XEN handles that on the hypervisor side. This turned out to be a
problem with VMD as the PCI devices behind a VMD bridge are not always
handled by the hypervisor and then require masking by guest.
To solve this the global variable was replaced by a interrupt domain
specific flag, which is set by the generic XEN PCI/MSI domain, but not
by VMD or any other domain in the system.
So far, so good. But the implementation (and the reviewer) missed the
fact, that accessing the domain flag cannot be done directly because
there are at least two situations, where this fails.
Legacy architectures are not providing interrupt domains at all. The
new MSI parent domains do not require to have a domain info pointer.
Both cases result in a unconditional NULL pointer derefence.
The PCI/MSI code already has a function to query the MSI domain
specific flag in a safe way, which handles all possible cases of
PCI/MSI backends.
So the fix it simply to replace the open coded checks by invoking the
safe helper to query the flag"
* tag 'irq-urgent-2025-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
PCI/MSI: Handle the NOMASK flag correctly for all PCI/MSI backends
The conversion of the XEN specific global variable pci_msi_ignore_mask to a
MSI domain flag, missed the facts that:
1) Legacy architectures do not provide a interrupt domain
2) Parent MSI domains do not necessarily have a domain info attached
Both cases result in an unconditional NULL pointer dereference. This was
unfortunatly missed in review and testing revealed it late.
Cure this by using the existing pci_msi_domain_supports() helper, which
handles all possible cases correctly.
Fixes: c3164d2e0d ("PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flag")
Reported-by: Daniel Gomez <da.gomez@kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Daniel Gomez <da.gomez@kernel.org>
Link: https://lore.kernel.org/all/87iknwyp2o.ffs@tglx
Closes: https://lore.kernel.org/all/qn7fzggcj6qe6r6gdbwcz23pzdz2jx64aldccmsuheabhmjgrt@tawf5nfwuvw7
Commit e49ad66781 ("PCI: j721e: Add TI J784S4 PCIe configuration")
assigned the value of .linkdown_irq_regfield for the J784S4 SoC as the
"LINK_DOWN" macro corresponding to BIT(1), and as a result, the Link
Down interrupts on J784S4 SoC are missed.
According to the Technical Reference Manual and Register Documentation
for the J784S4 SoC[1], BIT(1) corresponds to "ENABLE_SYS_EN_PCIE_DPA_1",
which is not the correct field for the link-state interrupt. Instead, it
is BIT(10) of the "PCIE_INTD_ENABLE_REG_SYS_2" register that corresponds
to the link-state field named as "ENABLE_SYS_EN_PCIE_LINK_STATE".
Thus, set .linkdown_irq_regfield to the macro "J7200_LINK_DOWN", which
expands to BIT(10) and was first defined for the J7200 SoC. Other SoCs
already reuse this macro since it accurately represents the "link-state"
field in their respective "PCIE_INTD_ENABLE_REG_SYS_2" register.
1: https://www.ti.com/lit/zip/spruj52
Fixes: e49ad66781 ("PCI: j721e: Add TI J784S4 PCIe configuration")
Cc: stable@vger.kernel.org
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
[kwilczynski: commit log, add a missing .linkdown_irq_regfield member
set to the J7200_LINK_DOWN macro to struct j7200_pcie_ep_data]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250305132018.2260771-1-s-vadapalli@ti.com
Expose the supported IRQ types in the CAPS register.
This way, the host side driver (drivers/misc/pci_endpoint_test.c) can
know which IRQ types that the endpoint supports.
The host side driver will make use of this information in a follow-up
commit.
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250310111016.859445-15-cassel@kernel.org
Neither RK3568 or RK3588 supports INTx interrupts.
Since epc_features is zero initialized, this is strictly not needed.
However, setting intx_capable explicitly to false makes it more clear
that neither RK3568 or RK3588 supports INTx interrupts.
No functional change.
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250310111016.859445-14-cassel@kernel.org
- Manage sysfs attributes and boost frequencies efficiently from
cpufreq core to reduce boilerplate code in drivers (Viresh Kumar).
- Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
Dhananjay Ugwekar, Imran Shaik, zuoqian).
- Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
Bai).
- cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski).
- Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng).
- Optimize the amd-pstate driver to avoid cases where call paths end
up calling the same writes multiple times and needlessly caching
variables through code reorganization, locking overhaul and tracing
adjustments (Mario Limonciello, Dhananjay Ugwekar).
- Make it possible to avoid enabling capacity-aware scheduling (CAS) in
the intel_pstate driver and relocate a check for out-of-band (OOB)
platform handling in it to make it detect OOB before checking HWP
availability (Rafael Wysocki).
- Fix dbs_update() to avoid inadvertent conversions of negative integer
values to unsigned int which causes CPU frequency selection to be
inaccurate in some cases when the "conservative" cpufreq governor is
in use (Jie Zhan).
- Update the handling of the most recent idle intervals in the menu
cpuidle governor to prevent useful information from being discarded
by it in some cases and improve the prediction accuracy (Rafael
Wysocki).
- Make it possible to tell the intel_idle driver to ignore its built-in
table of idle states for the given processor, clean up the handling
of auto-demotion disabling on Baytrail and Cherrytrail chips in it,
and update its MAINTAINERS entry (David Arcari, Artem Bityutskiy,
Rafael Wysocki).
- Make some cpuidle drivers use for_each_present_cpu() instead of
for_each_possible_cpu() during initialization to avoid issues
occurring when nosmp or maxcpus=0 are used (Jacky Bai).
- Clean up the Energy Model handling code somewhat (Rafael Wysocki).
- Use kfree_rcu() to simplify the handling of runtime Energy Model
updates (Li RongQing).
- Add an entry for the Energy Model framework to MAINTAINERS as
properly maintained (Lukasz Luba).
- Address RCU-related sparse warnings in the Energy Model code (Rafael
Wysocki).
- Remove ENERGY_MODEL dependency on SMP and allow it to be selected
when DEVFREQ is set without CPUFREQ so it can be used on a wider
range of systems (Jeson Gao).
- Unify error handling during runtime suspend and runtime resume in the
core to help drivers to implement more consistent runtime PM error
handling (Rafael Wysocki).
- Drop a redundant check from pm_runtime_force_resume() and rearrange
documentation related to __pm_runtime_disable() (Rafael Wysocki).
- Rework the handling of the "smart suspend" driver flag in the PM core
to avoid issues hat may occur when drivers using it depend on some
other drivers and clean up the related PM core code (Rafael Wysocki,
Colin Ian King).
- Fix the handling of devices with the power.direct_complete flag set
if device_suspend() returns an error for at least one device to avoid
situations in which some of them may not be resumed (Rafael Wysocki).
- Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
possible deadlock that may occur if the "compressor" hibernation
module parameter is accessed during the registration of a new
ieee80211 device (Lizhi Xu).
- Suppress sleeping parent warning in device_pm_add() in the case when
new children are added under a device with the power.direct_complete
set after it has been processed by device_resume() (Xu Yang).
- Remove needless return in three void functions related to system
wakeup (Zijun Hu).
- Replace deprecated kmap_atomic() with kmap_local_page() in the
hibernation core code (David Reaver).
- Remove unused helper functions related to system sleep (David Alan
Gilbert).
- Clean up s2idle_enter() so it does not lock and unlock CPU offline
in vain and update comments in it (Ulf Hansson).
- Clean up broken white space in dpm_wait_for_children() (Geert
Uytterhoeven).
- Update the cpupower utility to fix lib version-ing in it and memory
leaks in error legs, remove hard-coded values, and implement CPU
physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah
Khan, Yiwei Lin, Zhongqiu Han).
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmfhhTYSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO16/gIAKuRiG1fFgUcUSXC1iFu42vrB/1i4wpA
02GICACqM3K6/5jd3ct/WOU28GUgDs+xcmqH7CnMaM6y9nXEWjWarmSfFekAO+0q
TPtQ7xTy0hBCB3he1P2uLKBJBin4Wn47U9/rvs4J7mQd5zDxTINKIiVoHg2lEE+s
HAeSoNRb2sp5IZDm9+/LfhHNYRP1mJ97cbZlymqctGB3xgDL7qMLid/1+gFPHAQS
4/LXj3IgyU8DpA/j5nhtpaAqjN5g2QxIUfQgADRIcESK99Y/7aAMs1/G0WhJKaay
9yx+4/xmkGvVCZQx1DphksFLISEzltY0SFWLsoppPzBTGVEW2GQQsNI=
=LqVy
-----END PGP SIGNATURE-----
Merge tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These are dominated by cpufreq updates which in turn are dominated by
updates related to boost support in the core and drivers and
amd-pstate driver optimizations.
Apart from the above, there are some cpuidle updates including a
rework of the most recent idle intervals handling in the venerable
menu governor that leads to significant improvements in some
performance benchmarks, as the governor is now more likely to predict
a shorter idle duration in some cases, and there are updates of the
core device power management code, mostly related to system suspend
and resume, that should help to avoid potential issues arising when
the drivers of devices depending on one another want to use different
optimizations.
There is also a usual collection of assorted fixes and cleanups,
including removal of some unused code.
Specifics:
- Manage sysfs attributes and boost frequencies efficiently from
cpufreq core to reduce boilerplate code in drivers (Viresh Kumar)
- Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
Dhananjay Ugwekar, Imran Shaik, zuoqian)
- Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
Bai)
- cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski)
- Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng)
- Optimize the amd-pstate driver to avoid cases where call paths end
up calling the same writes multiple times and needlessly caching
variables through code reorganization, locking overhaul and tracing
adjustments (Mario Limonciello, Dhananjay Ugwekar)
- Make it possible to avoid enabling capacity-aware scheduling (CAS)
in the intel_pstate driver and relocate a check for out-of-band
(OOB) platform handling in it to make it detect OOB before checking
HWP availability (Rafael Wysocki)
- Fix dbs_update() to avoid inadvertent conversions of negative
integer values to unsigned int which causes CPU frequency selection
to be inaccurate in some cases when the "conservative" cpufreq
governor is in use (Jie Zhan)
- Update the handling of the most recent idle intervals in the menu
cpuidle governor to prevent useful information from being discarded
by it in some cases and improve the prediction accuracy (Rafael
Wysocki)
- Make it possible to tell the intel_idle driver to ignore its
built-in table of idle states for the given processor, clean up the
handling of auto-demotion disabling on Baytrail and Cherrytrail
chips in it, and update its MAINTAINERS entry (David Arcari, Artem
Bityutskiy, Rafael Wysocki)
- Make some cpuidle drivers use for_each_present_cpu() instead of
for_each_possible_cpu() during initialization to avoid issues
occurring when nosmp or maxcpus=0 are used (Jacky Bai)
- Clean up the Energy Model handling code somewhat (Rafael Wysocki)
- Use kfree_rcu() to simplify the handling of runtime Energy Model
updates (Li RongQing)
- Add an entry for the Energy Model framework to MAINTAINERS as
properly maintained (Lukasz Luba)
- Address RCU-related sparse warnings in the Energy Model code
(Rafael Wysocki)
- Remove ENERGY_MODEL dependency on SMP and allow it to be selected
when DEVFREQ is set without CPUFREQ so it can be used on a wider
range of systems (Jeson Gao)
- Unify error handling during runtime suspend and runtime resume in
the core to help drivers to implement more consistent runtime PM
error handling (Rafael Wysocki)
- Drop a redundant check from pm_runtime_force_resume() and rearrange
documentation related to __pm_runtime_disable() (Rafael Wysocki)
- Rework the handling of the "smart suspend" driver flag in the PM
core to avoid issues hat may occur when drivers using it depend on
some other drivers and clean up the related PM core code (Rafael
Wysocki, Colin Ian King)
- Fix the handling of devices with the power.direct_complete flag set
if device_suspend() returns an error for at least one device to
avoid situations in which some of them may not be resumed (Rafael
Wysocki)
- Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
possible deadlock that may occur if the "compressor" hibernation
module parameter is accessed during the registration of a new
ieee80211 device (Lizhi Xu)
- Suppress sleeping parent warning in device_pm_add() in the case
when new children are added under a device with the
power.direct_complete set after it has been processed by
device_resume() (Xu Yang)
- Remove needless return in three void functions related to system
wakeup (Zijun Hu)
- Replace deprecated kmap_atomic() with kmap_local_page() in the
hibernation core code (David Reaver)
- Remove unused helper functions related to system sleep (David Alan
Gilbert)
- Clean up s2idle_enter() so it does not lock and unlock CPU offline
in vain and update comments in it (Ulf Hansson)
- Clean up broken white space in dpm_wait_for_children() (Geert
Uytterhoeven)
- Update the cpupower utility to fix lib version-ing in it and memory
leaks in error legs, remove hard-coded values, and implement CPU
physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah
Khan, Yiwei Lin, Zhongqiu Han)"
* tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (139 commits)
PM: sleep: Fix bit masking operation
dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650
dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1
dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names
dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible
cpufreq: Init cpufreq only for present CPUs
PM: sleep: Fix handling devices with direct_complete set on errors
cpuidle: Init cpuidle only for present CPUs
PM: clk: Remove unused pm_clk_remove()
PM: sleep: core: Fix indentation in dpm_wait_for_children()
PM: s2idle: Extend comment in s2idle_enter()
PM: s2idle: Drop redundant locks when entering s2idle
PM: sleep: Remove unused pm_generic_ wrappers
cpufreq: tegra186: Share policy per cluster
cpupower: Make lib versioning scheme more obvious and fix version link
PM: EM: Rework the depends on for CONFIG_ENERGY_MODEL
PM: EM: Address RCU-related sparse warnings
cpupower: Implement CPU physical core querying
pm: cpupower: remove hard-coded topology depth values
pm: cpupower: Fix cmd_monitor() error legs to free cpu_topology
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZ9/gEwAKCRCAXGG7T9hj
vlxhAQCRzSCNI8wwvENnuc2OnRyWKy8gq7C5WAOIOJdJ3U+scQEAwKGhPJLwE4IS
/JDh5PRJgZ4rdMYatuDfldEcSAfRRgw=
=dF6Z
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- cleanup: remove an used function
- add support for a XenServer specific virtual PCI device
- fix the handling of a sparse Xen hypervisor symbol table
- avoid warnings when building the kernel with gcc 15
- fix use of devices behind a VMD bridge when running as a Xen PV dom0
* tag 'for-linus-6.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flag
PCI: vmd: Disable MSI remapping bypass under Xen
xen/pci: Do not register devices with segments >= 0x10000
xen/pciback: Remove unused pcistub_get_pci_dev
xenfs/xensyms: respect hypervisor's "next" indication
xen/mcelog: Add __nonstring annotations for unterminated strings
xen: Add support for XenServer 6.1 platform device
- Switch the MSI descriptor locking to guards
- Replace the broken PCI/TPH implementation, which lacks any form of
serialization against concurrent modifications with a properly
serialized mechanism in the PCI/MSI core code.
- Replace the MSI descriptor abuse in the SCSI/UFS Qualcom driver with
dedicated driver internal storage.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5JcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocp6D/9IUXX/8HsBVZwCBY1/SYAC8fnTUNGG
fMxPTX0BSRNOaChvGkyIllEEkNbh+WDMh1qkV94fusJCRvQVaDnwJZ9zoTMJekiM
ZEBXdqTiDyqUG5LzFeoYQFu/XBth59HMkaDqCxJEy+YbjnTNamX8QS8A/FZlPNH0
uRs4m7oQxABL6E/JkjOkC0UOqddZ5BDwbLgXoGq+uDSLTicO9yR6EsigVacoreap
zmkaUPRmL9NM8onrP2/9LduWwgcxn9u2wvd0uOCoIHBUKrlIpHkuNjieQ4R8QCP0
b7INNTEnwv815O11EoU+AVoL+sTYjUVOAUzfeNejlUspwvjBASm5lUkyB2TS5nq7
TtCxFyIS7/eBXwrKeV1w9ZcGjcso0DRJPvBQvhhL+q00Mtlrbpy7NUIF5S/8I60q
QWLN1h1iKISl9aT5p7K1v1lnGyPTb942PNmRc9Cd8Si3QMt6be3RkAJekPWU78Wd
Q9fXiMS4vyZhtt1S0vKU1L80Ezg3JVL6tWWpsSdZhMBM8wYiw1PI9NYtYEO2lzP/
hDHOu4EaItPg5A6Z+IDOj13pddtQsnssiWJmcDvarnJKBOfmAbwOcNGpwcwCbm62
bephtPc4h7En6/d4lsi8yzK6dSpf7yHLxTRrp9lomMaydEjfsNWcAs/nN2Y+wbr4
UMuZDSrlTGHP0Q==
=Vl8Q
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI irq updates from Thomas Gleixner:
- Switch the MSI descriptor locking to guards
- Replace the broken PCI/TPH implementation, which lacks any form of
serialization against concurrent modifications with a properly
serialized mechanism in the PCI/MSI core code
- Replace the MSI descriptor abuse in the SCSI/UFS Qualcom driver with
dedicated driver internal storage
* tag 'irq-msi-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi: Rename msi_[un]lock_descs()
scsi: ufs: qcom: Remove the MSI descriptor abuse
PCI/TPH: Replace the broken MSI-X control word update
PCI/MSI: Provide a sane mechanism for TPH
PCI: hv: Switch MSI descriptor locking to guard()
PCI/MSI: Switch to MSI descriptor locking to guard()
NTB/msi: Switch MSI descriptor locking to lock guard()
soc: ti: ti_sci_inta_msi: Switch MSI descriptor locking to guard()
genirq/msi: Use lock guards for MSI descriptor locking
cleanup: Provide retain_ptr()
genirq/msi: Make a few functions static
two locking commits in the locking tree,
part of the locking-core-2025-03-22 pull request. ]
x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}puid='
(Brendan Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and
related cleanups (Brian Gerst)
- Convert the stackprotector canary to a regular percpu
variable (Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB instruction
(Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation
(Kirill A. Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems,
to support PCI BAR space beyond the 10TiB region
(CONFIG_PCI_P2PDMA=y) (Balbir Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Bootup:
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0
(Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers
(Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
(Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking instructions
(Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
(Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel,
Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst,
Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin,
Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport,
Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker,
Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra,
Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak,
Vitaly Kuznetsov, Xin Li, liuye.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfenkQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g1FRAAi6OFTSn/5aeLMI0IMNBxJ6ddQiFc3imd
7+C/vU5nul4CyDs8mKyj/+f/DDrbkG9lKz3VG631Yl237lXHjD8XWcVMeC/1z/q0
3zInDIloE9/nBHRPkF6F7fARBLBZ0LFgaBsGrCo7mwpGybiQdqGcqcxllvTbtXaw
OHta4q6ok+lBDNlfc0v6H4cRnzhmmlKu6Ng0j6UI3V7uFhi3vtxas32ltDQtzorq
2+jbV6/+kbrrv+xPC+jlzOFhTEKRupNPQXmvyQteoQg6G3kqAKMDvBthGXd1rHuX
Qa+BoDIifE/2NiVeRwNrhoqYH/pHCzUzDREW5IW8+ca+4XNKuzAC6EuC8CeCzyK1
q8ZjZjooQW4zEeVFeJYllHONzJYfxfSH5CLsnbcuhq99yfGlrQhF1qL72/Omn1w/
DfPJM8Zt5zyKvLqUg3Md+fkVCO2wyDNhB61QPzRgHF+yD+rvuDpoqvUWir+w7cSn
fwEDVZGXlFx6dumtSrqRaTd1nvFt80s8yP2ll09DMvGQ8D/yruS7hndGAmmJVCSW
NAfd8pSjq5v2+ux2UR92/Cc3VF3SjaUqHBOp/Nq9rESya18ZVa3cJpHhVYYtPIVf
THW0h07RIkGVKs1uq+5ekLCr/8uAZg58UPIqmhTuW0ttymRHCNfohR45FQZzy+0M
tJj1oc2TIZw=
=Dcb3
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
"x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and related cleanups
(Brian Gerst)
- Convert the stackprotector canary to a regular percpu variable
(Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB
instruction (Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation (Kirill A.
Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan
Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
headers (Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh
Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from
<asm/asm.h> (Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking
instructions (Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in
nmi_shootdown_cpus() (Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
Kuznetsov, Xin Li, liuye"
* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
x86/asm: Make asm export of __ref_stack_chk_guard unconditional
x86/mm: Only do broadcast flush from reclaim if pages were unmapped
perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
x86/asm: Use asm_inline() instead of asm() in clwb()
x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
x86/hweight: Use asm_inline() instead of asm()
x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
x86/hweight: Use named operands in inline asm()
x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
x86/head/64: Avoid Clang < 17 stack protector in startup code
x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
x86/cpu/intel: Limit the non-architectural constant_tsc model checks
x86/mm/pat: Replace Intel x86_model checks with VFM ones
x86/cpu/intel: Fix fast string initialization for extended Families
...
Remove imx_pcie_cpu_addr_fixup, the .cpu_addr_fixup() method, because the
dwc core driver already handles address translation based on the devicetree
description.
Link: https://lore.kernel.org/r/20250315201548.858189-14-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Richard Zhu <hongxing.zhu@nxp.com>
We know the parent_bus_offset, either computed from a DT reg property (the
offset is the CPU physical addr - the 'config'/'addr_space' address on the
parent bus) or from a .cpu_addr_fixup() (which may have used a host bridge
window offset).
Apply that parent_bus_offset instead of calling .cpu_addr_fixup() when
programming the ATU.
This assumes all intermediate addresses are at the same offset from the CPU
physical addresses.
[bhelgaas: commit log]
Link: https://lore.kernel.org/r/20250315201548.858189-13-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Most systems' PCIe outbound map windows have non-zero physical addresses,
but the possibility of encountering zero increased after following commit
("PCI: dwc: Use parent_bus_offset").
'ep->outbound_addr[n]', representing 'parent_bus_address', might be 0 on
some hardware, which trims high address bits through bus fabric before
sending to the PCIe controller.
Replace the iteration logic with 'for_each_set_bit()' to ensure only
allocated map windows are iterated when determining the ATU index from a
given address.
Link: https://lore.kernel.org/r/20250315201548.858189-12-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Endpoint
┌───────────────────────────────────────────────┐
│ pcie-ep@5f010000 │
│ ┌────────────────┐│
│ │ Endpoint ││
│ │ PCIe ││
│ │ Controller ││
│ bus@5f000000 │ ┌────────►
│ ┌──────────┐ │ │ ││dynamically
│ │ │ Outbound Transfer │ ││allocated
│┌─────┐ │ Bus ┼─────►│ ATU ───────┘ ││PCI Addr
││ │ │ Fabric │Bus │ ││
││ CPU ├───►│ │Addr │ ││
││ │CPU │ │0x8000_0000 ││
│└─────┘Addr└──────────┘ │ ││
│ 0x7000_0000 └────────────────┘│
└───────────────────────────────────────────────┘
bus@5f000000 {
compatible = "simple-bus";
ranges = <0x80000000 0x0 0x70000000 0x10000000>;
pcie-ep@5f010000 {
reg = <0x80000000 0x10000000>;
reg-names ="addr_space";
...
};
...
};
In the diagram above, CPU writes data to outbound window address
0x7000_0000, and the bus fabric maps it to 0x8000_0000. The ATU uses
bus address 0x8000_0000 as input address and maps to some PCI address
dynamically allocated by a PCI device driver on the host side.
The pcie-ep@5f010000 'reg[addr_space]' is the parent bus address, which is
the input of PCIe controller, including the ATU.
Set parent_bus_offset, the offset from the CPU address to the PCIe
controller input address using dw_pcie_init_parent_bus_offset(). The
parent_bus_offset is not used yet, so no functional change intended.
[bhelgaas: commit log]
Link: https://lore.kernel.org/r/20250315201548.858189-11-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Move devm_pci_epc_create() to the beginning of dw_pcie_ep_init().
devm_pci_epc_create() is generic code that doesn't depend on any DWC
resource, so moving it earlier keeps all the subsequent devicetree-related
code together.
Link: https://lore.kernel.org/r/20250315201548.858189-9-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
The 'ranges' property of a PCI controller's parent can indicate address
translation information. Most system use 1:1 map between CPU physical and
PCI controller input addresses.
But some hardware, like i.MX8QXP, doesn't use 1:1 map. See below diagram:
┌─────────┐ ┌────────────┐
┌─────┐ │ │ IA: 0x8ff8_0000 │ │
│ CPU ├───►│ ┌────►├─────────────────┐ │ PCI │
└─────┘ │ │ │ IA: 0x8ff0_0000 │ │ │
CPU Addr │ │ ┌─►├─────────────┐ │ │ Controller │
0x7ff8_0000─┼───┘ │ │ │ │ │ │
│ │ │ │ │ │ │ PCI Addr
0x7ff0_0000─┼──────┘ │ │ └──► IOSpace ─┼────────────►
│ │ │ │ │ 0
0x7000_0000─┼────────►├─────────┐ │ │ │
└─────────┘ │ └──────► CfgSpace ─┼────────────►
Bus Fabric │ │ │ 0
│ │ │
└──────────► MemSpace ─┼────────────►
IA: 0x8000_0000 │ │ 0x8000_0000
└────────────┘
bus@5f000000 {
compatible = "simple-bus";
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x80000000 0x0 0x70000000 0x10000000>;
pcie@5f010000 {
compatible = "fsl,imx8q-pcie";
reg = <0x5f010000 0x10000>, <0x8ff00000 0x80000>;
reg-names = "dbi", "config";
...
};
};
Intermediate address (IA) here means the PCIe controller input address.
The pcie@5f010000 'reg[config]' address is the parent bus (PCIe controller
input) address of CfgSpace.
The ATU in MemSpace is not explicitly described via devicetree, so we
assume the offset from CPU address to intermediate MemSpace address is the
same as that for CfgSpace.
We could use bus@5f000000 'ranges' for the same purpose.
Set parent_bus_offset using dw_pcie_init_parent_bus_offset(). The
parent_bus_offset is not used yet, so no functional change intended.
Link: https://lore.kernel.org/r/20250315201548.858189-8-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
dw_pcie_parent_bus_offset() looks up the parent bus address of a PCI
controller 'reg' property in devicetree. If implemented, .cpu_addr_fixup()
is a hard-coded way to get the parent bus address corresponding to a CPU
physical address.
Add debug code to compare the address from .cpu_addr_fixup() with the
address from devicetree. If they match, warn that .cpu_addr_fixup() is
redundant and should be removed; if they differ, warn that something is
wrong with the devicetree.
If .cpu_addr_fixup() is not implemented, the parent bus address should be
identical to the CPU physical address because we previously ignored the
parent bus address from devicetree. If the devicetree has a different
parent bus address, warn about it being broken.
[bhelgaas: split debug to separate patch for easier future revert, commit
log]
Link: https://lore.kernel.org/r/20250315201548.858189-7-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
[bhelgaas: squash Ioana Ciornei <ioana.ciornei@nxp.com> fix for NULL
pointer deref when driver doesn't supply dw_pcie_ops, e.g., layerscape-pcie
https://lore.kernel.org/r/20250319134339.3114817-1-ioana.ciornei@nxp.com]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Return the offset from CPU physical address to the parent bus address of
the specified element of the devicetree 'reg' property.
[bhelgaas: cpu_phy_addr -> cpu_phys_addr, return offset, split
.cpu_addr_fixup() checking and debug to separate patch]
Link: https://lore.kernel.org/r/20250315201548.858189-6-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
When BIOS neglects to assign bus numbers to PCI bridges, the kernel
attempts to correct that during PCI device enumeration. If it runs out
of bus numbers, no pci_bus is allocated and the "subordinate" pointer in
the bridge's pci_dev remains NULL.
The PCIe bandwidth controller erroneously does not check for a NULL
subordinate pointer and dereferences it on probe.
Bandwidth control of unusable devices below the bridge is of questionable
utility, so simply error out instead. This mirrors what PCIe hotplug does
since commit 62e4492c30 ("PCI: Prevent NULL dereference during pciehp
probe").
The PCI core emits a message with KERN_INFO severity if it has run out of
bus numbers. PCIe hotplug emits an additional message with KERN_ERR
severity to inform the user that hotplug functionality is disabled at the
bridge. A similar message for bandwidth control does not seem merited,
given that its only purpose so far is to expose an up-to-date link speed
in sysfs and throttle the link speed on certain laptops with limited
Thermal Design Power. So error out silently.
User-visible messages:
pci 0000:16:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[...]
pci_bus 0000:45: busn_res: [bus 45-74] end is updated to 74
pci 0000:16:02.0: devices behind bridge are unusable because [bus 45-74] cannot be assigned for them
[...]
pcieport 0000:16:02.0: pciehp: Hotplug bridge without secondary bus, ignoring
[...]
BUG: kernel NULL pointer dereference
RIP: pcie_update_link_speed
pcie_bwnotif_enable
pcie_bwnotif_probe
pcie_port_probe_service
really_probe
Fixes: 665745f274 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Wouter Bijlsma <wouter@wouterbijlsma.nl>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219906
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Tested-by: Wouter Bijlsma <wouter@wouterbijlsma.nl>
Cc: stable@vger.kernel.org # v6.13+
Link: https://lore.kernel.org/r/3b6c8d973aedc48860640a9d75d20528336f1f3c.1742669372.git.lukas@wunner.de
Update the CPM5 check to include CPM5_HOST1 variant. Previously, only
CPM5 was considered when mapping the "cpm_csr" register.
With this change, CPM5_HOST1 is also supported, ensuring proper
resource mapping for this variant.
Signed-off-by: Thippeswamy Havalige <thippeswamy.havalige@amd.com>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250317124136.1317723-1-thippeswamy.havalige@amd.com
Don't populate the const read-only arrays "data" and "regs" on the
stack at run time, instead make them static.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
[kwilczynski: commit log, wrap overly long line to 80 columns]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250317143456.477901-1-colin.i.king@gmail.com
Add support for AMD MDB (Multimedia DMA Bridge) IP core as Root Port.
The Versal2 devices include MDB Module. The integrated block for MDB
along with the integrated bridge can function as PCIe Root Port
controller at Gen5 32-GT/s operation per lane.
Bridge supports error and INTx interrupts and are handled using platform
specific interrupt line in Versal2.
Signed-off-by: Thippeswamy Havalige <thippeswamy.havalige@amd.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20250228093351.923615-4-thippeswamy.havalige@amd.com
[bhelgaas: only present on ARM64-based SoCs; squash Kconfig dependency on
ARM64 from Geert Uytterhoeven <geert+renesas@glider.be>:
https://lore.kernel.org/r/eaef1dea7edcf146aa377d5e5c5c85a76ff56bae.1742306383.git.geert+renesas@glider.be]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
[kwilczynski: commit log, code comments and error messages clean-up,
drop redundant "depends on PCI" from Kconfig, expose the error code
as part of error messages where appropriatie, change "depends on"
expression to match existing style from other drivers]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
PCIe devices (not CXL) can support DOE as well, so allow DOE to be enabled
even if CXL isn't.
Link: https://lore.kernel.org/r/20250306075211.1855177-4-alistair@alistair23.me
Signed-off-by: Alistair Francis <alistair@alistair23.me>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
PCIe r6.0 added support for Data Object Exchange (DOE). When DOE is
supported, the DOE Discovery Feature must be implemented per PCIe r6.1, sec
6.30.1.1. DOE allows a requester to obtain information about the other DOE
features supported by the device.
The kernel already queries the DOE features supported and caches the
values. Expose the values in sysfs to allow user space to determine which
DOE features are supported by the PCIe device.
By exposing the information to userspace, tools like lspci can relay the
information to users. By listing all of the supported features we can allow
userspace to parse the list, which might include vendor specific features
as well as yet to be supported features.
As the DOE Discovery feature must always be supported we treat it as a
special named attribute case. This allows the usual PCI attribute_group
handling to correctly create the doe_features directory when registering
pci_doe_sysfs_group (otherwise it doesn't and sysfs_add_file_to_group()
will seg fault).
After this patch is supported you can see something like this when
attaching a DOE device:
$ ls /sys/devices/pci0000:00/0000:00:02.0//doe*
0001:01 0001:02 doe_discovery
Link: https://lore.kernel.org/r/20250306075211.1855177-3-alistair@alistair23.me
Signed-off-by: Alistair Francis <alistair@alistair23.me>
[bhelgaas: drop pci_doe_sysfs_init() stub return, make
DEVICE_ATTR_RO(doe_discovery) static]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The ability to map PCI resources to user-space is controlled by global
defines. For vfio there is VFIO_PCI_MMAP which is only disabled on s390 and
controls mapping of PCI resources using vfio-pci with a fallback option via
the pread()/pwrite() interface.
For the PCI core there is ARCH_GENERIC_PCI_MMAP_RESOURCE which enables a
generic implementation for mapping PCI resources plus the newer sysfs
interface. Then there is HAVE_PCI_MMAP which can be used with custom
definitions of pci_mmap_resource_range() and the historical /proc/bus/pci
interface. Both mechanisms are all or nothing.
For s390 mapping PCI resources is possible and useful for testing and
certain applications such as QEMU's vfio-pci based user-space NVMe driver.
For certain devices, however access to PCI resources via mappings to
user-space is not possible and these must be excluded from the general PCI
resource mapping mechanisms.
Introduce pdev->non_mappable_bars to indicate that a PCI device's BARs can
not be accessed via mappings to user-space. In the future this enables
per-device restrictions of PCI resource mapping.
For now, set this flag for all PCI devices on s390 in line with the
existing, general disable of PCI resource mapping. As s390 is the only user
of the VFI_PCI_MMAP Kconfig options this can already be replaced with a
check of this new flag. Also add similar checks in the other code protected
by HAVE_PCI_MMAP respectively ARCH_GENERIC_PCI_MMAP in preparation for
enabling these for supported devices.
Link: https://lore.kernel.org/lkml/20250212132808.08dcf03c.alex.williamson@redhat.com/
Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-2-c5c0f1d26efd@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
pcie_bwctrl_select_speed() should take __fls() of the speed bit, not return
it as a raw value. Instead of directly returning 2.5GT/s speed bit, simply
assign the fallback speed (2.5GT/s) into supported_speeds variable to share
the normal return path that calls pcie_supported_speeds2target_speed() to
calculate __fls().
This code path is not very likely to execute because
pcie_get_supported_speeds() should provide valid ->supported_speeds but a
spec violating device could fail to synthesize any speed in
pcie_get_supported_speeds(). It could also happen in case the
supported_speeds intersection is empty (also a violation of the current
PCIe specs).
Link: https://lore.kernel.org/r/20250321163103.5145-1-ilpo.jarvinen@linux.intel.com
Fixes: de9a6c8d5d ("PCI/bwctrl: Add pcie_set_target_speed() to set PCIe Link Speed")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
PCIe hotplug can operate in poll mode without interrupt handlers using a
polling kthread only. eb34da60ed ("PCI: pciehp: Disable hotplug
interrupt during suspend") failed to consider that and enables HPIE
(Hot-Plug Interrupt Enable) unconditionally when resuming the Port.
Only set HPIE if non-poll mode is in use. This makes
pcie_enable_interrupt() match how pcie_enable_notification() already
handles HPIE.
Link: https://lore.kernel.org/r/20250321162114.3939-1-ilpo.jarvinen@linux.intel.com
Fixes: eb34da60ed ("PCI: pciehp: Disable hotplug interrupt during suspend")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over
event channels, to prevent PCI code from attempting to toggle the maskbit,
as it's Xen that controls the bit.
However, the pci_msi_ignore_mask being global will affect devices that use
MSI interrupts but are not routing those interrupts over event channels
(not using the Xen pIRQ chip). One example is devices behind a VMD PCI
bridge. In that scenario the VMD bridge configures MSI(-X) using the
normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
bridge configure the MSI entries using indexes into the VMD bridge MSI
table. The VMD bridge then demultiplexes such interrupts and delivers to
the destination device(s). Having pci_msi_ignore_mask set in that scenario
prevents (un)masking of MSI entries for devices behind the VMD bridge.
Move the signaling of no entry masking into the MSI domain flags, as that
allows setting it on a per-domain basis. Set it for the Xen MSI domain
that uses the pIRQ chip, while leaving it unset for the rest of the
cases.
Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and
with Xen dropping usage the variable is unneeded.
This fixes using devices behind a VMD bridge on Xen PV hardware domains.
Albeit Devices behind a VMD bridge are not known to Xen, that doesn't mean
Linux cannot use them. By inhibiting the usage of
VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask
bodge devices behind a VMD bridge do work fine when use from a Linux Xen
hardware domain. That's the whole point of the series.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-ID: <20250219092059.90850-4-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
MSI remapping bypass (directly configuring MSI entries for devices on the
VMD bus) won't work under Xen, as Xen is not aware of devices in such bus,
and hence cannot configure the entries using the pIRQ interface in the PV
case, and in the PVH case traps won't be setup for MSI entries for such
devices.
Until Xen is aware of devices in the VMD bus prevent the
VMD_FEAT_CAN_BYPASS_MSI_REMAP capability from being used when running as
any kind of Xen guest.
The MSI remapping bypass is an optional feature of VMD bridges, and hence
when running under Xen it will be masked and devices will be forced to
redirect its interrupts from the VMD bridge. That mode of operation must
always be supported by VMD bridges and works when Xen is not aware of
devices behind the VMD bridge.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-ID: <20250219092059.90850-3-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
For some reason, cardbus related io/mem size declarations are in
linux/pci.h, whereas non-cardbus sizes are already in pci/pci.h.
Move all them into one place in pci/pci.h.
Link: https://lore.kernel.org/r/20250311174701.3586-4-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Neither pci_reassign_bridge_resources() nor pci_reassign_resource() is used
outside of the PCI subsystem. They seem to be naturally static functions
but since resource fitting/assignment is split between setup-bus.c and
setup-res.c, they fall into different sides of the divide and need to be
declared.
Move the declarations of pci_reassign_bridge_resources() and
pci_reassign_resource() into pci/pci.h to keep them internal to PCI
subsystem.
Link: https://lore.kernel.org/r/20250311174701.3586-2-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
pci_rescan_bus_bridge_resize() is only used by code inside PCI subsystem.
The comment also falsely advertises it to be for hotplug drivers, yet the
only caller is from sysfs store function. Move the function declaration
into pci/pci.h.
Link: https://lore.kernel.org/r/20250311174701.3586-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
__resource_resize_store() attempts to release all resources of the device
before attempting the resize. The loop, however, only covers standard BARs
(< PCI_STD_NUM_BARS). If a device has VF BARs that are assigned,
pci_reassign_bridge_resources() finds the bridge window still has some
assigned child resources and returns -NOENT which makes
pci_resize_resource() to detect an error and abort the resize.
Change the release loop to cover all resources up to VF BARs which allows
the resize operation to release the bridge windows and attempt to assigned
them again with the different size.
If SR-IOV is enabled, disallow resize as it requires releasing also IOV
resources.
Link: https://lore.kernel.org/r/20250320142837.8027-1-ilpo.jarvinen@linux.intel.com
Fixes: 91fa127794 ("PCI: Expose PCIe Resizable BAR support via sysfs")
Reported-by: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Currently, pci_bridge_d3_possible() encodes a variety of decision factors
when deciding whether a given bridge can be put into D3. A particular one
of note is for "recent enough PCIe ports." Per Rafael [0]:
"There were hardware issues related to PM on x86 platforms predating
the introduction of Connected Standby in Windows. For instance,
programming a port into D3hot by writing to its PMCSR might cause the
PCIe link behind it to go down and the only way to revive it was to
power cycle the Root Complex. And similar."
Thus, this function contains a DMI-based check for post-2015 BIOS.
The above factors (Windows, x86) don't really apply to non-x86 systems, and
also, many such systems don't have BIOS or DMI. However, we'd like to be
able to suspend bridges on non-x86 systems too.
Restrict the "recent enough" check to x86. If we find further
incompatibilities, it probably makes sense to expand on the deny-list
approach (i.e., bridge_d3_blacklist or similar).
Link: https://lore.kernel.org/r/20250320110604.v6.1.Id0a0e78ab0421b6bce51c4b0b87e6aebdfc69ec7@changeid
Link: https://lore.kernel.org/linux-pci/CAJZ5v0j_6jeMAQ7eFkZBe5Yi+USGzysxAgfemYh=-zq4h5W+Qg@mail.gmail.com/ [0]
Link: https://lore.kernel.org/linux-pci/20240227225442.GA249898@bhelgaas/ [1]
Link: https://lore.kernel.org/linux-pci/20240828210705.GA37859@bhelgaas/ [2]
[Brian: rewrite to !X86 based on Rafael's suggestions]
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The access to the PCI config space via pci_ops::read and pci_ops::write is
a low-level hardware access. The functions can be accessed with disabled
interrupts even on PREEMPT_RT. The pci_lock is a raw_spinlock_t for this
purpose.
A spinlock_t becomes a sleeping lock on PREEMPT_RT, so it cannot be
acquired with disabled interrupts. The vmd_dev::cfg_lock is accessed in
the same context as the pci_lock.
Make vmd_dev::cfg_lock a raw_spinlock_t type so it can be used with
interrupts disabled.
This was reported as:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
Call Trace:
rt_spin_lock+0x4e/0x130
vmd_pci_read+0x8d/0x100 [vmd]
pci_user_read_config_byte+0x6f/0xe0
pci_read_config+0xfe/0x290
sysfs_kf_bin_read+0x68/0x90
Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com>
Tested-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Acked-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
[bigeasy: reword commit message]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Link: https://lore.kernel.org/r/20250218080830.ufw3IgyX@linutronix.de
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
[bhelgaas: add back report info from
https://lore.kernel.org/lkml/20241218115951.83062-1-ryotkkr98@gmail.com/]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are not
supported. This is because MEMORY_DEVICE_FS_DAX pages are the only user
of higher order zone device pages and have their own page reference
counting.
A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference counting.
Supporting that requires compound zone device pages.
Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.
A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head page.
For zone device pages page->compound_head is shared with page->pgmap.
The page->pgmap field must be common to all pages within a folio, even if
the folio spans memory sections. Therefore pgmap is the same for both
head and tail pages and can be moved into the folio and we can use the
standard scheme to find compound_head from a tail page.
Link: https://lkml.kernel.org/r/67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently ZONE_DEVICE page reference counts are initialised by core memory
management code in __init_zone_device_page() as part of the memremap()
call which driver modules make to obtain ZONE_DEVICE pages. This
initialises page refcounts to 1 before returning them to the driver.
This was presumably done because it drivers had a reference of sorts on
the page. It also ensured the page could always be mapped with
vm_insert_page() for example and would never get freed (ie. have a zero
refcount), freeing drivers of manipulating page reference counts.
However it complicates figuring out whether or not a page is free from the
mm perspective because it is no longer possible to just look at the
refcount. Instead the page type must be known and if GUP is used a
secondary pgmap reference is also sometimes needed.
To simplify this it is desirable to remove the page reference count for
the driver, so core mm can just use the refcount without having to account
for page type or do other types of tracking. This is possible because
drivers can always assume the page is valid as core kernel will never
offline or remove the struct page.
This means it is now up to drivers to initialise the page refcount as
required. P2PDMA uses vm_insert_page() to map the page, and that requires
a non-zero reference count when initialising the page so set that when the
page is first mapped.
Link: https://lkml.kernel.org/r/6aedb0ac2886dcc4503cb705273db5b3863a0b66.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move devm_pci_alloc_host_bridge() to the beginning of dw_pcie_host_init().
devm_pci_alloc_host_bridge() is generic code that doesn't depend on any DWC
resource, so moving it earlier keeps all the subsequent devicetree-related
code together.
[bhelgaas: reorder earlier in series]
Link: https://lore.kernel.org/r/20250315201548.858189-4-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rename 'cpu_addr' to 'parent_bus_addr' in the DesignWare ATU configuration.
The ATU translates parent bus addresses to PCI addresses, which are often
the same as CPU addresses but can differ in systems where the bus fabric
translates addresses before passing them to the PCIe controller. This
renaming clarifies the purpose and avoids confusion.
Link: https://lore.kernel.org/r/20250315201548.858189-3-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The msg_res region translates writes into PCIe Message TLPs. Previously we
mapped this region using atu.cpu_addr, the input address programmed into
the ATU.
"cpu_addr" is a misnomer because when a bus fabric translates addresses
between the CPU and the ATU, the ATU input address is different from the
CPU address. A future patch will rename "cpu_addr" and correct the value
to be the ATU input address instead of the CPU physical address.
Map the msg_res region before writing to it using the msg_res resource
start, a CPU physical address.
Link: https://lore.kernel.org/r/20250315201548.858189-2-helgaas@kernel.org
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Use devm_clk_bulk_get_all() helper to simplify clock handle code.
No functional changes intended.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
[kwilczynski: commit log, refactor to use dev_err_probe()]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20250226025628.1681206-1-hongxing.zhu@nxp.com
Instead of testing the controller register address to distinguish
controller 1 from controller 0 on i.MX8MQ platforms, use the PCI domain
number, which comes from the devicetree 'linux,pci-domain' property.
All relevant devicetrees should already supply 'linux,pci-domain', which
was added by c0b70f05c8 ("arm64: dts: imx8mq: use_dt_domains for pci
node").
Instead of being set directly in imx_pcie_probe(), pci->dbi_base will be
set by the DWC core in dw_pcie_get_resources().
No functional changes intended.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20250226024256.1678103-3-hongxing.zhu@nxp.com
When running the RK3588 in Endpoint mode, with an Intel host with IOMMU
enabled, the host side prints:
DMAR: VT-d detected Invalidation Time-out Error: SID 0
When running the RK3588 in Endpoint mode, with an AMD host with IOMMU
enabled, the host side prints:
iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=63:00.0 address=0x42b5b01a0]
Rockchip has confirmed that the ATS support for RK3588 only works when
running the PCIe controller in Root Complex (RC) mode, see:
https://lore.kernel.org/linux-pci/93cdce39-1ae6-4939-a3fc-db10be7564e5@rock-chips.com
Usually, to handle these issues, we add a quirk for the PCI vendor and
device ID in drivers/pci/quirks.c with quirk_no_ats(). That is because
we cannot usually modify the capabilities on the EP side. In this case,
we can modify the capabilities on the EP side.
Thus, hide the broken ATS capability on RK3588 when running in EP mode.
That way, we don't need any quirk on the host side, and we see no errors
on the host side, and we can run pci_endpoint_test successfully, with
the IOMMU enabled on the host side.
Acked-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
[kwilczynski: commit log, tidy up code comments and error message]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20250310094826.842681-6-cassel@kernel.org
Add dw_pcie_ep_hide_ext_capability() which can be used by an endpoint
controller driver to hide a capability.
This can be useful to hide a capability that is buggy, such that the
host side does not try to enable the buggy capability.
Suggested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250310094826.842681-5-cassel@kernel.org
If the bitmap or memory allocations fail, then dw_pcie_ep_init_registers()
will incorrectly return a success.
Return -ENOMEM instead.
Fixes: 869bc52534 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Krzysztof Wilczyński <kw@linux.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/36dcb6fc-f292-4dd5-bd45-a8c6f9dc3df7@stanley.mountain
Many functions in PCI use accessor macros such as pci_resource_len(),
which take a BAR index. That index, however, is never checked for
validity, potentially resulting in undefined behavior by overflowing the
array pci_dev.resource in the macro pci_resource_n().
Since many users of those macros directly assign the accessed value to
an unsigned integer, the macros cannot be changed easily anymore to
return -EINVAL for invalid indexes. Consequently, the problem has to be
mitigated in higher layers.
Add pci_bar_index_valid(). Use it where appropriate.
Link: https://lore.kernel.org/r/20250312080634.13731-4-phasta@kernel.org
Closes: https://lore.kernel.org/all/adb53b1f-29e1-3d14-0e61-351fd2d3ff0d@linux.intel.com/
Reported-by: Bingbu Cao <bingbu.cao@linux.intel.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
[kwilczynski: correct if-statement condition the pci_bar_index_is_valid()
helper function uses, tidy up code comments]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
[bhelgaas: fix typo]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Hot-removal of nested PCI hotplug ports suffers from a long-standing race
condition which can lead to a deadlock: A parent hotplug port acquires
pci_lock_rescan_remove(), then waits for pciehp to unbind from a child
hotplug port. Meanwhile that child hotplug port tries to acquire
pci_lock_rescan_remove() as well in order to remove its own children.
The deadlock only occurs if the parent acquires pci_lock_rescan_remove()
first, not if the child happens to acquire it first.
Several workarounds to avoid the issue have been proposed and discarded
over the years, e.g.:
https://lore.kernel.org/r/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
A proper fix is being worked on, but needs more time as it is nontrivial
and necessarily intrusive.
Recent commit 9d573d1954 ("PCI: pciehp: Detect device replacement during
system sleep") provokes more frequent occurrence of the deadlock when
removing more than one Thunderbolt device during system sleep. The commit
sought to detect device replacement, but also triggered on device removal.
Differentiating reliably between replacement and removal is impossible
because pci_get_dsn() returns 0 both if the device was removed, as well as
if it was replaced with one lacking a Device Serial Number.
Avoid the more frequent occurrence of the deadlock by checking whether the
hotplug port itself was hot-removed. If so, there's no sense in checking
whether its child device was replaced.
This works because the ->resume_noirq() callback is invoked in top-down
order for the entire hierarchy: A parent hotplug port detecting device
replacement (or removal) marks all children as removed using
pci_dev_set_disconnected() and a child hotplug port can then reliably
detect being removed.
Link: https://lore.kernel.org/r/02f166e24c87d6cde4085865cce9adfdfd969688.1741674172.git.lukas@wunner.de
Fixes: 9d573d1954 ("PCI: pciehp: Detect device replacement during system sleep")
Reported-by: Kenneth Crudup <kenny@panix.com>
Closes: https://lore.kernel.org/r/83d9302a-f743-43e4-9de2-2dd66d91ab5b@panix.com/
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/r/20240926125909.2362244-1-acelan.kao@canonical.com/
Tested-by: Kenneth Crudup <kenny@panix.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: stable@vger.kernel.org # v6.11+
The array for the iomapping cookie addresses has a length of
PCI_STD_NUM_BARS. This constant, however, only describes standard BARs;
while PCI can allow for additional, special BARs.
The total number of PCI resources is described by constant
PCI_NUM_RESOURCES, which is also used in, e.g., pci_select_bars().
Thus, the devres array has so far been too small.
Change the length of the devres array to PCI_NUM_RESOURCES.
Link: https://lore.kernel.org/r/20250312080634.13731-3-phasta@kernel.org
Fixes: bbaff68bf4 ("PCI: Add managed partial-BAR request and map infrastructure")
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Cc: stable@vger.kernel.org # v6.11+
The driver walks the MSI descriptors to test whether a descriptor exists
for a given index. That's just abuse of the MSI internals.
The same test can be done with a single function call by looking up whether
there is a Linux interrupt number assigned at the index.
What's worse is that the function is completely unserialized against
modifications of the MSI-X control by operations issued from the interrupt
core. It also brings the PCI/MSI-X internal cached control word out of
sync.
Remove the trainwreck and invoke the function provided by the PCI/MSI core
to update it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250313130321.898592817@linutronix.de
The PCI/TPH driver fiddles with the MSI-X control word of an active
interrupt completely unserialized against concurrent operations issued
from the interrupt core. It also brings the PCI/MSI-X internal cached
control word out of sync.
Provide a function, which has the required serialization and keeps the
control word cache in sync.
Unfortunately this requires to look up and lock the interrupt descriptor,
which should be only done in the interrupt core code. But confining this
particular oddity in the PCI/MSI core is the lesser of all evil. A
interrupt core implementation would require a larger pile of infrastructure
and indirections for dubious value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250313130321.822790423@linutronix.de
Convert the code to use the new guard(msi_descs_lock).
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/all/20250313130321.758905320@linutronix.de
Convert the code to use the new guard(msi_descs_lock).
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/all/20250313130321.695027112@linutronix.de
The Versal Net ACAP (Adaptive Compute Acceleration Platform) devices
incorporate the Coherency and PCIe Gen5 Module, specifically the
Next-Generation Compact Module (CPM5NC).
The integrated CPM5NC block, along with the built-in bridge, can function
as a PCIe Root Port and supports the PCIe Gen5 protocol with data transfer
rates of up to 32 GT/s, and is capable of supporting up to a x16 lane-width
configuration.
Bridge errors are managed using a specific interrupt line designed for
CPM5N. The INTx interrupt support is not available.
Currently in this commit platform specific bridge errors support is not
added.
Signed-off-by: Thippeswamy Havalige <thippeswamy.havalige@amd.com>
[kwilczynski: commit log, squashed patch to fix an if-statement condition
to ensure that xilinx_cpm_pcie_init_port() does not run on the CPM5NC_HOST
variant from https://lore.kernel.org/linux-pci/20250311072402.1049990-1-thippeswamy.havalige@amd.com]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20250224155025.782179-4-thippeswamy.havalige@amd.com
The IRQ domain allocated for the PCIe controller is not freed if
resource_list_first_type() returns NULL, leading to a resource leak.
This fix ensures properly cleaning up the allocated IRQ domain in
the error path.
Fixes: 49e427e6bd ("Merge branch 'pci/host-probe-refactor'")
Signed-off-by: Thippeswamy Havalige <thippeswamy.havalige@amd.com>
[kwilczynski: added missing Fixes: tag, refactored to use one of the goto labels]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Link: https://lore.kernel.org/r/20250224155025.782179-2-thippeswamy.havalige@amd.com
In hindsight, there were some crucial subtleties overlooked when moving
{of,acpi}_dma_configure() to driver probe time to allow waiting for
IOMMU drivers with -EPROBE_DEFER, and these have become an
ever-increasing source of problems. The IOMMU API has some fundamental
assumptions that iommu_probe_device() is called for every device added
to the system, in the order in which they are added. Calling it in a
random order or not at all dependent on driver binding leads to
malformed groups, a potential lack of isolation for devices with no
driver, and all manner of unexpected concurrency and race conditions.
We've attempted to mitigate the latter with point-fix bodges like
iommu_probe_device_lock, but it's a losing battle and the time has come
to bite the bullet and address the true source of the problem instead.
The crux of the matter is that the firmware parsing actually serves two
distinct purposes; one is identifying the IOMMU instance associated with
a device so we can check its availability, the second is actually
telling that instance about the relevant firmware-provided data for the
device. However the latter also depends on the former, and at the time
there was no good place to defer and retry that separately from the
availability check we also wanted for client driver probe.
Nowadays, though, we have a proper notion of multiple IOMMU instances in
the core API itself, and each one gets a chance to probe its own devices
upon registration, so we can finally make that work as intended for
DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
be able to run the iommu_fwspec machinery currently buried deep in the
wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
surprisingly straightforward to bootstrap this transformation by pretty
much just calling the same path twice. At client driver probe time,
dev->driver is obviously set; conversely at device_add(), or a
subsequent bus_iommu_probe(), any device waiting for an IOMMU really
should *not* have a driver already, so we can use that as a condition to
disambiguate the two cases, and avoid recursing back into the IOMMU core
at the wrong times.
Obviously this isn't the nicest thing, but for now it gives us a
functional baseline to then unpick the layers in between without many
more awkward cross-subsystem patches. There are some minor side-effects
like dma_range_map potentially being created earlier, and some debug
prints being repeated, but these aren't significantly detrimental. Let's
make things work first, then deal with making them nice.
With the basic flow finally in the right order again, the next step is
probably turning the bus->dma_configure paths inside-out, since all we
really need from bus code is its notion of which device and input ID(s)
to parse the common firmware properties with...
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This put_device() was accidentally left over from when we changed the code
from using device_register() to calling device_add(). Delete it.
Link: https://lore.kernel.org/r/55b24870-89fb-4c91-b85d-744e35db53c2@stanley.mountain
Fixes: 9885440b16 ("PCI: Fix pci_host_bridge struct device release/free handling")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
If device_register(&child->dev) fails, call put_device() to explicitly
release child->dev, per the comment at device_register().
Found by code review.
Link: https://lore.kernel.org/r/20250202062357.872971-1-make24@iscas.ac.cn
Fixes: 4f535093cf ("PCI: Put pci_dev in device tree as early as possible")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: stable@vger.kernel.org
Previously most resizable BAR interfaces (pci_rebar_get_possible_sizes(),
pci_rebar_set_size(), etc) as well as pci_restore_state() searched config
space for a Resizable BAR capability. Most devices don't have such a
capability, so this is wasted effort, especially for pci_restore_state().
Search for a Resizable BAR capability once at enumeration-time and cache
the offset so we don't have to search every time we need it. No functional
change intended.
Link: https://lore.kernel.org/r/20250215000301.175097-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Following a reset, a Function may respond to Config Requests with Request
Retry Status (RRS) Completion Status to indicate that it is temporarily
unable to process the Request, but will be able to process the Request in
the future (PCIe r6.0, sec 2.3.1).
If the Configuration RRS Software Visibility feature is enabled and a Root
Complex receives RRS for a config read of the Vendor ID, the Root Complex
completes the Request to the host by returning PCI_VENDOR_ID_PCI_SIG,
0x0001 (sec 2.3.2).
The Config RRS SV feature applies only to Root Ports and is not directly
related to pci_scan_bridge_extend(). Move the RRS SV enable to
set_pcie_port_type() where we handle other PCIe-specific configuration.
Link: https://lore.kernel.org/r/20250303210217.199504-1-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Remove the superfluous function dw_pcie_ep_find_ext_capability(),
as it is virtually identical to dw_pcie_find_ext_capability().
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20250221202646.395252-3-cassel@kernel.org
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Fix a kernel oops found while testing the stm32_pcie Endpoint driver
with handling of PERST# deassertion:
During EP initialization, pci_epf_test_alloc_space() allocates all BARs,
which are further freed if epc_set_bar() fails (for instance, due to no
free inbound window).
However, when pci_epc_set_bar() fails, the error path:
pci_epc_set_bar() ->
pci_epf_free_space()
does not clear the previous assignment to epf_test->reg[bar].
Then, if the host reboots, the PERST# deassertion restarts the BAR
allocation sequence with the same allocation failure (no free inbound
window), creating a double free situation since epf_test->reg[bar] was
deallocated and is still non-NULL.
Thus, make sure that pci_epf_alloc_space() and pci_epf_free_space()
invocations are symmetric, and as such, set epf_test->reg[bar] to NULL
when memory is freed.
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Christian Bruel <christian.bruel@foss.st.com>
Link: https://lore.kernel.org/r/20250124123043.96112-1-christian.bruel@foss.st.com
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
The static function devm_pci_epc_match() is only invoked within the
devm_pci_epc_destroy(). However, since it was initially introduced,
this new API has had no callers.
Thus, remove both the unused API and the static function.
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/r/20250217-remove_api-v2-1-b169c9117045@quicinc.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Looking at section "11.4.4.29 USP_PCIE_RESBAR Registers Summary" in the
Technical Reference Manual (TRM) for RK3588, we can see that none of the
BARs are Fixed BARs, but actually Resizable BARs.
I couldn't find any reference in the TRM for RK3568, but looking at the
downstream PCIe endpoint driver, both RK3568 and RK3588 are treated as
the same, so the BARs on RK3568 must also be Resizable BARs.
Now when we actually have support for Resizable BARs, let's configure
these BARs as such.
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Link: https://lore.kernel.org/r/20250131182949.465530-16-cassel@kernel.org
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
The support for a specific iATU alignment was added in
commit 2a9a801620 ("PCI: endpoint: Add support to specify alignment for
buffers allocated to BARs").
This commit specifically mentions both that the alignment by each DWC
based EP driver should match CX_ATU_MIN_REGION_SIZE, and that AM65x
specifically has a 64 KB alignment.
This also matches the CX_ATU_MIN_REGION_SIZE value specified in the
section "12.2.2.4.7 PCIe Subsystem Address Translation" of the Technical
Reference Manual (TRM) for AM65x:
https://www.ti.com/lit/ug/spruid7e/spruid7e.pdf
This higher value, 1 MB, was obviously an ugly hack used to be able to
handle Resizable BARs which have a minimum size of 1 MB.
Now when we actually have support for Resizable BARs, let's configure the
iATU alignment requirement to the actual requirement.
(BARs described as Resizable will still get aligned to 1 MB.)
Cc: stable+noautosel@kernel.org # Depends on PCI endpoint Resizable BARs series
Fixes: 23284ad677 ("PCI: keystone: Add support for PCIe EP in AM654x Platforms")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20250131182949.465530-15-cassel@kernel.org
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>