Commit Graph

332 Commits

Author SHA1 Message Date
Ofir Bitton
fa58b59493 accel/habanalabs: modify pci health check
Today we read PCI VENDOR-ID in order to make sure PCI link is
healthy. Apparently the VENDOR-ID might be stored on host and
hence, when we read it we might not access the PCI bus.
In order to make sure PCI health check is reliable, we will start
checking the DEVICE-ID instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:32 +02:00
Tomer Tayar
c517068349 accel/habanalabs: keep explicit size of reserved memory for FW
The reserved memory for FW is currently saved in an ASIC property in
units of MB, just like the value that comes from FW.
Except the fact that it is not clear from the property's name, it means
also that a calculation to actual size is required everywhere that it is
used.
Modify the property to hold the size in bytes.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:27 +02:00
Tomer Tayar
db45bbdd02 accel/habanalabs: handle reserved memory request when working with full FW
Currently the reserved memory request from FW is handled when running
with preboot only, but this request is relevant also when running with
full FW.
Modify to always handle this reservation request.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:24 +02:00
Ofir Bitton
5b6658eb7c accel/habanalabs/hwmon: rate limit errors user can generate
Fetching sensor data can fail due to various reasons. In order
not to pollute the kernel log, those error prints must be
rate limited.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:19 +02:00
Ofir Bitton
3bf6ef981f accel/habanalabs/gaudi2: drain event lacks rd/wr indication
Due to a H/W issue, AXI drain event does not include a read/write
indication, hence we remove this print.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:16 +02:00
Dani Liberman
fd8d2fa066 accel/habanalabs: fix error print
The unmasking is for event and it can be other event than RAZWI.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:12 +02:00
Tal Risin
c8c062e967 accel/habanalabs: initialize maybe-uninitialized variables
Prevent static analysis warning.

Signed-off-by: Tal Risin <trisin@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:08 +02:00
Avri Kehat
0b105a2a72 accel/habanalabs: fix debugfs files permissions
debugfs files are created with permissions that don't align
with the access requirements.

Signed-off-by: Avri Kehat <akehat@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:03 +02:00
Tomer Tayar
e855869bec accel/habanalabs: fix glbl error cause handling
The glbl error cause handling has a wrong assumption that all error
bits are consecutive.
Fix the handling to check all relevant error bits per ASIC.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:00 +02:00
Tomer Tayar
c1e89ae455 accel/habanalabs/gaudi2: check extended errors according to PCIe addr_dec interrupt info
The FW interrupt info for a PCIe addr_dec event is set correctly, so
check for either global errors or razwi according to the indications
there.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:46:55 +02:00
Tomer Tayar
7159813c91 accel/habanalabs: modify print for skip loading linux FW to debug log
Skip loading a linux FW image into the device with the current supported
ASICs is done for test purposes only.
Moreover, for future supported ASICs it is possible that there won't be
a need to load such an image.
The print in such a case is therefore not needed in most cases, so
replace the used dev_info() with dev_dbg().

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:46:52 +02:00
Farah Kassabri
c14e5cd3ed accel/habanalabs: remove hop size from asic properties
The hop size related properties is a MMU properties and not
asic properties.
As for PMMU and HMMU we could have different sizes.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:46:40 +02:00
Erick Archer
9e263c5042 accel/habanalabs: use kcalloc() instead of kzalloc()
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/162

Signed-off-by: Erick Archer <erick.archer@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:41 +02:00
Colin Ian King
5ae8b6b774 accel/habanalabs/goya: remove redundant assignment to pointer 'input'
The pointer input is assigned a value that is not read, it is
being re-assigned again later with the same value. Resolve this
by moving the declaration to input into the if block.

Cleans up clang scan build warning:
warning: Value stored to 'input' during its initialization is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Tomer Tayar
01f8cd0faf accel/habanalabs/gaudi2: fail memory memset when failing to copy QM packet to device
gaudi2_memset_memory_chunk_using_edma_qm() calls the access_dev_mem()
ASIC function, but ignores its return value.
Add this missing check.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Dani Liberman
731d320e68 accel/habanalabs: remove call to deprecated function
In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Malkoot Khan
8a5be2b62b accel/habanalabs: Remove unnecessary braces from if statement
The coding style in the Linux kernel prefers not to use
braces for single-statement if conditions.
This patch removes the unnecessary braces from an if statement
in the file drivers/accel/habanalabs/common/command_submission.c,
which also resolves a coding style warning.

Signed-off-by: Malkoot Khan <engr.mkhan1990@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Farah Kassabri
f728c17fc9 accel/habanalabs/gaudi2: move HMMU page tables to device memory
Currently the HMMU page tables reside in the host memory,
which will cause host access from the device for every page walk.
This can affect PCIe bandwidth in certain scenarios.

To prevent that problem, HMMU page tables will be moved to the device
memory so the miss transaction will read the hops from there instead of
going to the host.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Tomer Tayar
246d8b6cfb accel/habanalabs: abort device reset for consecutive heartbeat failures
The mechanism of aborting device reset for consecutive fatal errors is
currently only for fatal errors that are reported by FW.
A non-responsive FW and consecutive heartbeat failures is also
considered fatal, so add them as well to this mechanism to avoid
recurring device reset in such a case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Tomer Tayar
d0df8a35a7 accel/habanalabs: fix DRAM BAR base address calculation
When the DRAM region size in the BAR is not a power of 2, calculating
the corresponding BAR base address should be done using the offset from
the DRAM start address, and not using directly the DRAM address.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Koby Elbaz
8c075401f2 accel/habanalabs: increase HL_MAX_STR to 64 bytes to avoid warnings
Fix a warning of a buffer overflow:
‘snprintf’ output between 38 and 47 bytes into a destination of size 32

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Dani Liberman
e91c37f194 accel/habanalabs/gaudi2: add interrupt affinity for user interrupts
User interrupts are MSIx interrupts coming from Gaudi2, that have
specific range of IDs and are assigned to the sole use of the user
process that opened the Gaudi2 device (reminder: there can be only
a single user process running on Gaudi2 at any given time).

The interrupts are allocated and managed by the driver and therefore,
the user expects the driver to initialize them properly, which also
includes setting the affinity to the related CPU cores of the
device's NUMA node to get maximum performance.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Linus Torvalds
cf65598d59 drm-next for 6.8:
new drivers:
 - imagination - new driver for Imagination Technologies GPU
 - xe - new driver for Intel GPUs using core drm concepts
 
 core:
 - add CLOSE_FB ioctl
 - remove old UMS ioctls
 - increase max objects to accomodate AMD color mgmt
 
 encoder:
 - create per-encoder debugfs directory
 
 edid:
 - split out drm_eld
 - SAD helpers
 - drop edid_firmware module parameter
 
 format-helper:
 - cache format conversion buffers
 
 sched:
 - move from kthread to workqueue
 - rename some internals
 - implement dynamic job-flow control
 
 gpuvm:
 - provide more features to handle GEM objects
 
 client:
 - don't acquire module reference
 
 displayport:
 - add mst path property documentation
 
 fdinfo:
 - alignment fix
 
 dma-buf:
 - add fence timestamp helper
 - add fence deadline support
 
 bridge:
 - transparent aux-bridge for DP/USB-C
 - lt8912b: add suspend/resume support and power regulator support
 
 panel:
 - edp: AUO B116XTN02, BOE NT116WHM-N21,836X2, NV116WHM-N49
 - chromebook panel support
 - elida-kd35t133: rework pm
 - powkiddy RK2023 panel
 - himax-hx8394: drop prepare/unprepare and shutdown logic
 - BOE BP101WX1-100, Powkiddy X55, Ampire AM8001280G
 - Evervision VGG644804, SDC ATNA45AF01
 - nv3052c: register docs, init sequence fixes, fascontek FS035VG158
 - st7701: Anbernic RG-ARC support
 - r63353 panel controller
 - Ilitek ILI9805 panel controller
 - AUO G156HAN04.0
 
 simplefb:
 - support memory regions
 - support power domains
 
 amdgpu:
 - add new 64-bit sequence number infrastructure
 - add AMD specific color management
 - ACPI WBRF support for RF interference handling
 - GPUVM updates
 - RAS updates
 - DCN 3.5 updates
 - Rework PCIe link speed handling
 - Document GPU reset types
 - DMUB fixes
 - eDP fixes
 - NBIO 7.9/7.11 updates
 - SubVP updates
 - XGMI PCIe state dumping for aqua vanjaram
 - GFX11 golden register updates
 - enable tunnelling on high pri compute
 
 amdkfd:
 - Migrate TLB flushing logic to amdgpu
 - Trap handler fixes
 - Fix restore workers handling on suspend/resume
 - Fix possible memory leak in pqm_uninit()
 - support import/export of dma-bufs using GEM handles
 
 radeon:
 - fix possible overflows in command buffer checking
 - check for errors in ring_lock
 
 i915:
 - reorg display code for reuse in xe driver
 - fdinfo memory stats printing
 - DP MST bandwidth mgmt improvements
 - DP panel replay enabling
 - MTL C20 phy state verification
 - MTL DP DSC fractional bpp support
 - Audio fastset support
 - use dma_fence interfaces instead of i915_sw_fence
 - Separate gem and display code
 - AUX register macro refactoring
 - Separate display module/device parameters
 - Move display capabilities debugfs under display
 - Makefile cleanups
 - Register cleanups
 - Move display lock inits under display/
 - VLV/CHV DPIO PHY register and interface refactoring
 - DSI VBT sequence refactoring
 - C10/C20 PHY PLL hardware readout
 - DPLL code cleanups
 - Cleanup PXP plane protection checks
 - Improve display debug msgs
 - PSR selective fetch fixes/improvements
 - DP MST fixes
 - Xe2LPD FBC restrictions removed
 - DGFX uses direct VBT pin mapping
 - more MTL WAs
 - fix MTL eDP bug
 - eliminate use of kmap_atomic
 
 habanalabs:
 - sysfs entry to identify a device minor id with debugfs path
 - sysfs entry to expose device module id
 - add signed device info retrieval through INFO ioctl
 - add Gaudi2C device support
 - pcie reset prepare/done hooks
 
 msm:
 - Add support for SDM670, SM8650
 - Handle the CFG interconnect to fix the obscure hangs / timeouts
 - Kconfig fix for QMP dependency
 - use managed allocators
 - DPU: SDM670, SM8650 support
 - DPU: Enable SmartDMA on SM8350 and SM8450
 - DP: enable runtime PM support
 - GPU: add metadata UAPI
 - GPU: move devcoredumps to GPU device
 - GPU: convert to drm_exec
 
 ivpu:
 - update FW API
 - new debugfs file
 - a new NOP job submission test mode
 - improve suspend/resume
 - PM improvements
 - MMU PT optimizations
 - firmware profile frequency support
 - support for uncached buffers
 - switch to gem shmem helpers
 - replace kthread with threaded irqs
 
 rockchip:
 - rk3066_hdmi: convert to atomic
 - vop2: support nv20 and nv30
 - rk3588 support
 
 mediatek:
 - use devm_platform_ioremap_resource
 - stop using iommu_present
 - MT8188 VDOSYS1 display support
 
 panfrost:
 - PM improvements
 - improve interrupt handling as poweroff
 
 qaic:
 - allow to run with single MSI
 - support host/device time sync
 - switch to persistent DRM devices
 
 exynos:
 - fix potential error pointer dereference
 - fix wrong error checking
 - add missing call to drm_atomic_helper_shutdown
 
 omapdrm:
 - dma-fence lockdep annotation fix
 
 tidss:
 - dma-fence lockdep annotation fix
 - support for AM62A7
 
 v3d:
 - BCM2712 - rpi5 support
 - fdinfo + gputop support
 - uapi for CPU job handling
 
 virtio-gpu:
 - add context debug name
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmWeLcQACgkQDHTzWXnE
 hr54zg//dtPiG9nRA3OeoQh/pTmbFO26uhS8OluLiXhcX/7T/c1e6ck4dA3De5kB
 wgaqVH6/TFuMgiBbEqZSFuQM6k2X3HLCgHcCRpiz7iGse2GODLtFiUE/E4XFPrSP
 VhycI64and9XLBmxW87yGdmezVXxo6KZNX4nYabgZ7SD83/2w+ub6rxiAvd0KfSO
 gFmaOrujOIYBjFYFtKLZIYLH4Jzsy81bP0REBzEnAiWYV5qHdsXfvVgwuOU+3G/B
 BAVUUf++SU046QeD3HPEuOp3AqgazF4uNHQH5QL0UD2144uGWsk0LA4OZBnU0qhd
 oM4Oxu9V+TXvRfYhHwiQKeVleifcZBijndqiF7rlrTnNqS4YYOCPxuXzMlZO9aEJ
 6wQL/0JX8d5G6lXsweoBzNC76jeU/gspd1DvyaTFt7I8l8YqWvR5V8l8KRf2s14R
 +CwwujoqMMVmhZ4WhB+FgZTiWw5PaWoMM9ijVFOv8QhXOz21rj718NPdBspvdJK3
 Lo3obSO5p4lqgkMEuINBEXzkHjcSyOmMe1fG4Et8Wr+IrEBr1gfG9E4Twr+3/k3s
 9Ok9nOPykbYmt4gfJp/RDNCWBr8QGZKznP6Nq8EFfIqhEkXOHQo9wtsofVUhyW7P
 qEkCYcYkRa89KFp4Lep6lgDT5O7I+32eRmbRg716qRm9nn3Vj3Y=
 =nuw0
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2024-01-10' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "This contains two major new drivers:

   - imagination is a first driver for Imagination Technologies devices,
     it only covers very specific devices, but there is hope to grow it

   - xe is a reboot of the i915 GPU (shares display) side using a more
     upstream focused development model, and trying to maximise code
     sharing. It's not enabled for any hw by default, and will hopefully
     get switched on for Intel's Lunarlake.

  This also drops a bunch of the old UMS ioctls. It's been dead long
  enough.

  amdgpu has a bunch of new color management code that is being used in
  the Steam Deck.

  amdgpu also has a new ACPI WBRF interaction to help avoid radio
  interference.

  Otherwise it's the usual lots of changes in lots of places.

  Detailed summary:

  new drivers:
   - imagination - new driver for Imagination Technologies GPU
   - xe - new driver for Intel GPUs using core drm concepts

  core:
   - add CLOSE_FB ioctl
   - remove old UMS ioctls
   - increase max objects to accomodate AMD color mgmt

  encoder:
   - create per-encoder debugfs directory

  edid:
   - split out drm_eld
   - SAD helpers
   - drop edid_firmware module parameter

  format-helper:
   - cache format conversion buffers

  sched:
   - move from kthread to workqueue
   - rename some internals
   - implement dynamic job-flow control

  gpuvm:
   - provide more features to handle GEM objects

  client:
   - don't acquire module reference

  displayport:
   - add mst path property documentation

  fdinfo:
   - alignment fix

  dma-buf:
   - add fence timestamp helper
   - add fence deadline support

  bridge:
   - transparent aux-bridge for DP/USB-C
   - lt8912b: add suspend/resume support and power regulator support

  panel:
   - edp: AUO B116XTN02, BOE NT116WHM-N21,836X2, NV116WHM-N49
   - chromebook panel support
   - elida-kd35t133: rework pm
   - powkiddy RK2023 panel
   - himax-hx8394: drop prepare/unprepare and shutdown logic
   - BOE BP101WX1-100, Powkiddy X55, Ampire AM8001280G
   - Evervision VGG644804, SDC ATNA45AF01
   - nv3052c: register docs, init sequence fixes, fascontek FS035VG158
   - st7701: Anbernic RG-ARC support
   - r63353 panel controller
   - Ilitek ILI9805 panel controller
   - AUO G156HAN04.0

  simplefb:
   - support memory regions
   - support power domains

  amdgpu:
   - add new 64-bit sequence number infrastructure
   - add AMD specific color management
   - ACPI WBRF support for RF interference handling
   - GPUVM updates
   - RAS updates
   - DCN 3.5 updates
   - Rework PCIe link speed handling
   - Document GPU reset types
   - DMUB fixes
   - eDP fixes
   - NBIO 7.9/7.11 updates
   - SubVP updates
   - XGMI PCIe state dumping for aqua vanjaram
   - GFX11 golden register updates
   - enable tunnelling on high pri compute

  amdkfd:
   - Migrate TLB flushing logic to amdgpu
   - Trap handler fixes
   - Fix restore workers handling on suspend/resume
   - Fix possible memory leak in pqm_uninit()
   - support import/export of dma-bufs using GEM handles

  radeon:
   - fix possible overflows in command buffer checking
   - check for errors in ring_lock

  i915:
   - reorg display code for reuse in xe driver
   - fdinfo memory stats printing
   - DP MST bandwidth mgmt improvements
   - DP panel replay enabling
   - MTL C20 phy state verification
   - MTL DP DSC fractional bpp support
   - Audio fastset support
   - use dma_fence interfaces instead of i915_sw_fence
   - Separate gem and display code
   - AUX register macro refactoring
   - Separate display module/device parameters
   - Move display capabilities debugfs under display
   - Makefile cleanups
   - Register cleanups
   - Move display lock inits under display/
   - VLV/CHV DPIO PHY register and interface refactoring
   - DSI VBT sequence refactoring
   - C10/C20 PHY PLL hardware readout
   - DPLL code cleanups
   - Cleanup PXP plane protection checks
   - Improve display debug msgs
   - PSR selective fetch fixes/improvements
   - DP MST fixes
   - Xe2LPD FBC restrictions removed
   - DGFX uses direct VBT pin mapping
   - more MTL WAs
   - fix MTL eDP bug
   - eliminate use of kmap_atomic

  habanalabs:
   - sysfs entry to identify a device minor id with debugfs path
   - sysfs entry to expose device module id
   - add signed device info retrieval through INFO ioctl
   - add Gaudi2C device support
   - pcie reset prepare/done hooks

  msm:
   - Add support for SDM670, SM8650
   - Handle the CFG interconnect to fix the obscure hangs / timeouts
   - Kconfig fix for QMP dependency
   - use managed allocators
   - DPU: SDM670, SM8650 support
   - DPU: Enable SmartDMA on SM8350 and SM8450
   - DP: enable runtime PM support
   - GPU: add metadata UAPI
   - GPU: move devcoredumps to GPU device
   - GPU: convert to drm_exec

  ivpu:
   - update FW API
   - new debugfs file
   - a new NOP job submission test mode
   - improve suspend/resume
   - PM improvements
   - MMU PT optimizations
   - firmware profile frequency support
   - support for uncached buffers
   - switch to gem shmem helpers
   - replace kthread with threaded irqs

  rockchip:
   - rk3066_hdmi: convert to atomic
   - vop2: support nv20 and nv30
   - rk3588 support

  mediatek:
   - use devm_platform_ioremap_resource
   - stop using iommu_present
   - MT8188 VDOSYS1 display support

  panfrost:
   - PM improvements
   - improve interrupt handling as poweroff

  qaic:
   - allow to run with single MSI
   - support host/device time sync
   - switch to persistent DRM devices

  exynos:
   - fix potential error pointer dereference
   - fix wrong error checking
   - add missing call to drm_atomic_helper_shutdown

  omapdrm:
   - dma-fence lockdep annotation fix

  tidss:
   - dma-fence lockdep annotation fix
   - support for AM62A7

  v3d:
   - BCM2712 - rpi5 support
   - fdinfo + gputop support
   - uapi for CPU job handling

  virtio-gpu:
   - add context debug name"

* tag 'drm-next-2024-01-10' of git://anongit.freedesktop.org/drm/drm: (2340 commits)
  drm/amd/display: Allow z8/z10 from driver
  drm/amd/display: fix bandwidth validation failure on DCN 2.1
  drm/amdgpu: apply the RV2 system aperture fix to RN/CZN as well
  drm/amd/display: Move fixpt_from_s3132 to amdgpu_dm
  drm/amd/display: Fix recent checkpatch errors in amdgpu_dm
  Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
  drm/amd/display: avoid stringop-overflow warnings for dp_decide_lane_settings()
  drm/amd/display: Fix power_helpers.c codestyle
  drm/amd/display: Fix hdcp_log.h codestyle
  drm/amd/display: Fix hdcp2_execution.c codestyle
  drm/amd/display: Fix hdcp_psp.h codestyle
  drm/amd/display: Fix freesync.c codestyle
  drm/amd/display: Fix hdcp_psp.c codestyle
  drm/amd/display: Fix hdcp1_execution.c codestyle
  drm/amd/pm/smu7: fix a memleak in smu7_hwmgr_backend_init
  drm/amdkfd: Fix iterator used outside loop in 'kfd_add_peer_prop()'
  drm/amdgpu: Drop 'fence' check in 'to_amdgpu_amdkfd_fence()'
  drm/amdkfd: Confirm list is non-empty before utilizing list_first_entry in kfd_topology.c
  drm/amdgpu: Fix '*fw' from request_firmware() not released in 'amdgpu_ucode_request()'
  drm/amdgpu: Fix variable 'mca_funcs' dereferenced before NULL check in 'amdgpu_mca_smu_get_mca_entry()'
  ...
2024-01-12 11:32:19 -08:00
Xingyuan Mo
a9f07790a4 accel/habanalabs: fix information leak in sec_attest_info()
This function may copy the pad0 field of struct hl_info_sec_attest to user
mode which has not been initialized, resulting in leakage of kernel heap
data to user mode. To prevent this, use kzalloc() to allocate and zero out
the buffer, which can also eliminate other uninitialized holes, if any.

Fixes: 0c88760f8f ("habanalabs/gaudi2: add secured attestation info uapi")
Signed-off-by: Xingyuan Mo <hdthky0@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Tomer Tayar
bc5f15abcf accel/habanalabs/gaudi2: avoid overriding existing undefined opcode data
Part of the undefined opcode data is updated in
gaudi2_handle_qman_err_generic() and some in
handle_lower_qman_data_on_err().
However, the 'write_enable' flag is checked only in
gaudi2_handle_qman_err_generic(), and information of more than a single
error can be mixed there.

Moreover, handle_lower_qman_data_on_err() is called only for the lower
QMAN, so for an error in the upper QMAN there is only a partial info.

Move all the data update to be done in a single place, protected by the
'write_enable' flag.
As mainly the lower QMAN's info is interesting, avoid saving the partial
info for the upper QMAN.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Tomer Tayar
aa5cea38ce accel/habanalabs: add parent_device sysfs attribute
The device debugfs directory was modified to be named as the
device-name.
This name is the parent device name, i.e. either the PCI address in case
of an ASIC, or the simulator device name in case of a simulator.

This change makes it more difficult for a user to access the debugfs
directory for a specific accel device, because he can't just use the
accel minor id, but he needs to do more device-dependent operations to
get the device name.

To make it easier to get this name, add a 'parent_device' sysfs
attribute that the user can read using the minor id before accessing
debugfs.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Tomer Tayar
565ee78840 accel/habanalabs/gaudi2: add zero padding when printing QM CP instruction
QM instructions are in multiples of 64 bits and the command type is in
the upper bits of first QWORD.
To make it clearer that an undefined command is due to a type of 0x0,
always print all 64 bits and add a zero padding if needed.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Ariel Suller
d980e1ced9 accel/habanalabs: report 3 instances of Infineon second stage
Infineon controller second stage has 3 instances that their version
need to be reported by driver.

Signed-off-by: Ariel Suller <asuller@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Moti Haimovski
7259eb7b53 accel/habanalabs/gaudi2: add signed dev info uAPI
User will provide a nonce via the INFO ioctl, and will retrieve
the signed device info generated using given nonce.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Tomer Tayar
5bc155cfea accel/habanalabs/gaudi2: use correct registers to dump QM CQ info
The QM CQ PTR_LO/PTR_HI/TSIZE registers are for pushing a CQ entry, and
although they are updated by HW even when descriptors are fetched by PQ
and CB addresses are fed into CQ, the correct registers to use when
dumping the CQ info are the ones with the _STS suffix.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Dani Liberman
47a552863d accel/habanalabs: expose module id through sysfs
Module ID exposes the physical location of the device in the server,
from the pov of the devices in regard to how they are connected by
internal fabric.

This information is already exposed in our INFO ioctl, but there are
utilities and scripts running in data-center which are already
accessing sysfs for topology information and it is easier for them
to continue getting that information from sysfs instead of opening
a file descriptor.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Dani Liberman
c9f9d0e3d0 accel/habanalabs: print error code when mapping fails
Failure to map is considered a non-trivial error and we need to notify
the user about it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Tomer Tayar
ae303d885d accel/habanalabs/gaudi2: get the correct QM CQ info upon an error
Upon a QM error, the address/size from both the CQ and the ARC_CQ are
printed, although the instruction that led to the error was received
from only one of them.

Moreover, in case of a QM undefined opcode, only one of these
address/size sets will be captured based on the value of ARC_CQ_PTR.
However, this value can be non-zero even if currently the CQ is used, in
case the CQ/ARC_CQ are alternately used.

Under the assumption of having a stop-on-error configuration, modify to
use CP_STS.CUR_CQ field to get the relevant CQ for the QM error.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Tomer Tayar
4b0b1fbc77 accel/habanalabs: set hard reset flag if graceful reset is skipped
hl_device_cond_reset() might be called with the hard reset flag unset,
because a compute reset upon device release as part of a graceful reset
is valid.
If the conditions for graceful reset are not met, hl_device_reset() will
be called for an immediate reset. In this case a compute reset is not
valid, so it will be replaced with a hard reset together with a debug
message about it.
This message might be confusing, as it implies that a compute reset was
requested when it shouldn't. To prevent this confusion, set the hard
reset flag in hl_device_cond_reset() if going to an immediate reset.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Ofir Bitton
571cdb6e3b accel/habanalabs: remove 'get temperature' debug print
The print was added long back for a specific debug and can
now be removed.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Dafna Hirschfeld
0ec3467796 accel/habanalabs/gaudi2: fix undef opcode reporting
currently the undefined opcode event bit in set only for lower cp and
only if 'write_enable' is true. It should be set anyway and for all
streams in order to report that event to userspace.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Farah Kassabri
d1958dce5a accel/habanalabs: fix EQ heartbeat mechanism
Stop rescheduling another heartbeat check when EQ heartbeat check fails
as it generates confusing logs in dmesg that the heartbeat fails.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Oded Gabbay
42422993cf accel/habanalabs: add support for Gaudi2C device
Gaudi2 with PCI revision ID with the value of '3' represents Gaudi2C
device and should be detected and initialized as Gaudi2.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Farah Kassabri
e8bc0c1b1b accel/habanalabs: add log when eq event is not received
Add error log when no eq event is received from FW,
to cover a scenario when FW is stuck for some reason.
In such case driver will not receive neither the eq error interrupt
or the eq heartbeat event, and will just initiate a reset without
indication in the dmesg about the reason.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
Tomer Tayar
c648548233 accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain
When a PCIe AXI drain event happens, it is possible that the driver
cannot access the device through PCIe, and therefore cannot send a
hard-reset request to FW.
Starting from FW version 1.13, FW will initiate a hard-reset in such
a case without waiting for a reset request from the driver.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
Farah Kassabri
fbc2a09e09 accel/habanalabs: update device boot error check
Use a predefined mask which set the device critical boot errors.
Driver will fail and stop its loading, only upon detecting at least
one of those errors defined in this mask.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
farah kassabri
f64fa33260 accel/habanalabs: add pcie reset prepare/done hooks
When working on a bare-metal system, if FLR will happen the firmware
will handle it and driver will have no knowledge of it, and this will
cause two issues:

1.The driver will be in operational state while it should be in reset.
  This will cause the heartbeat mechanism to keep sending messages to FW
  while pci device is in reset. Eventually heartbeat will fail and
  the device will end up in non-operational state.

2. After FW handles the FLR, and due to the reset it'll go back to
   preboot stage, and driver need to perform hard reset in order to
   load the boot fit binary.

This patch will add reset_prepare hook that will set the device to
be in disabled state, so it'll be not operational, and also
reset_done hook which will be called after the actual FLR handling,
then it will perform hard reset.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
Christian Brauner
3652117f85 eventfd: simplify eventfd_signal()
Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c7 ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com>  # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28 14:08:38 +01:00
Oded Gabbay
4db74c0fde accel/habanalabs/gaudi2: fix spmu mask creation
event_types_num received from the user can be 0. In that case, the
event_mask should be 0.

In addition, to create a correct mask we need to match the number
of event types to the bit location such that bit 0 represents a single
event type, bit 1 represents 2 types and so on.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00
Tomer Tayar
0426e03126 accel/habanalabs/gaudi2: perform hard-reset upon PCIe AXI drain event
Non-completed transactions from PCIe towards the device are handled by
the AXI drain mechanism. This handling is in the PCIe level, but the
transactions are still there in the device consuming some queues
entries, and therefore the device must be reset.
Modify to perform hard-reset upon PCIe AXI drain events.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
farah kassabri
84190b92cc accel/habanalabs: fix bug in decoder wait for cs completion
The decoder interrupts are handled in the interrupt context
same as all user interrupts.
In such case, the wait list should be protected by
spin_lock_irqsave in order to avoid deadlock that might happen
with the user submission flow.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
Dafna Hirschfeld
2ba0236f5b accel/habanalabs: remove wrong doc for init_phys_pg_pack_from_userptr
The function does not pin the pages so remove that from the inline doc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
Arnd Bergmann
c1805bf36a accel/habanalabs: add missing debugfs function stubs
Two function stubs were removed in an earlier commit but are now needed
again:

drivers/accel/habanalabs/common/device.c: In function 'hl_device_init':
drivers/accel/habanalabs/common/device.c:2231:14: error: implicit declaration of function 'hl_debugfs_device_init'; did you mean 'drm_debugfs_dev_init'? [-Werror=implicit-function-declaration]
 2231 |         rc = hl_debugfs_device_init(hdev);
drivers/accel/habanalabs/common/device.c:2367:9: error: implicit declaration of function 'hl_debugfs_device_fini'; did you mean 'hl_debugfs_remove_file'? [-Werror=implicit-function-declaration]
 2367 |         hl_debugfs_device_fini(hdev);
      |         ^~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
Oded Gabbay
1630d14f8d accel/habanalabs: minor cosmetic update to habanalabs.h
- Update copyright years
- Align fields in struct hl_userptr
- Fix comments

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00
Oded Gabbay
4355f2c322 accel/habanalabs/gaudi: remove define used for simulator
We don't support simulator in upstream.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00
Oded Gabbay
87c60e23f2 accel/habanalabs: remove leftover code
This code was added as part of a bigger feature which was never
upstreamed, so remove this code.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00
Oded Gabbay
6fc69ca84a accel/habanalabs: print device name when it is removed
Notifies the user which device was removed. It is important in
a server with multiple devices.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:23 +03:00
Oded Gabbay
e5873f6b91 accel/habanalabs: remove unused field
flags in struct wait_interrupt_data is not used anywhere so remove it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:23 +03:00
Oded Gabbay
b5305d23aa accel/habanalabs/gaudi: remove unused structure definition
struct gaudi_nic_status is not used anywhere in the code.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:23 +03:00
Ohad Sharabi
ff92d01052 accel/habanalabs: trace dma map sgtable
Traces the DMA [un]map_sgtable using the new traces we added.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:23 +03:00
Oded Gabbay
d7aa294805 accel/habanalabs: remove unused asic functions
asic_dma_{un}map_single() asic-specific functions are no longer called
from the common code, so delete these functions.

In addition, delete the gaudi2 implementation as they are also not
called.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:23 +03:00
Ariel Suller
de8773fdc5 accel/habanalabs: update boot status print
FW shutdown preparation status was added to spec.

Signed-off-by: Ariel Suller <asuller@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:23 +03:00
Dafna Hirschfeld
674f77798e accel/habanalabs: extend preboot timeout when preboot might take longer
There are cases such when FW runs MBIST, that preboot is expected to take
longer than the usual. In such cases the firmware reports status
SECURITY_READY/IN_PREBOOT and we extend the timeout waiting for it.
This is currently implemented for Gaudi2 only.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
3824be1f4d accel/habanalabs: add debug prints to dump content of SG table for dma-buf
Add debug prints to dump the content of the SG table which is prepared
when the dma-buf map op is called.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
d16945f602 accel/habanalabs: add missing offset handling for dma-buf
On devices with virtual device memory (Gaudi2 onwards), user can provide
an offset within an allocated device memory from which he wants to
export a dma-buf object.
The offset value is verified by driver, but it is not taken into
consideration when the importer driver maps the dma-buf and the SG table
it prepared.
Add the missing offset handling.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
878ebc14db accel/habanalabs: set hl_dmabuf_priv.device_address only when needed
The device_address member of 'struct hl_dmabuf_priv' is used only when
virtual device memory is not supported and dma-buf is exported from
address.
Set the value of this field only when it is relevant, and add "phys" to
its name so it would be clearer that it can't be a device virtual
address.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
bb644f6197 accel/habanalabs: fix SG table creation for dma-buf mapping
In some cases the calculated number of required entries for the dma-buf
SG table is wrong. For example, if the page size is larger than both the
dma max segment size of the importer device and from the exported side,
or if the exported size is part of a phys_pg_pack that is composed of
several pages.
In these cases, redundant entries will be added to the SG table.

Modify the method that the number of entries is calculated, and the way
they are prepared.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
farah kassabri
ba24b5ec78 accel/habanalabs: split user interrupts pending list
Currently driver maintain one list for both pending user interrupts
which seeks to wait till CQ reaches it's target value and also the ones
that seeks to get timestamp records when the CQ reaches it's target
value.
This causes delay in handling the waiters which gets higher priority
than the timestamp records.
In order to solve this, let's split the list into two,
one for each case and each one is protected by it's own spinlock.
Waiters will be handled within the interrupt context first,
then the timestamp records will be set.
Freeing the timestamp related memory will be handled in a workqueue.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
farah kassabri
1157b5d6b3 accel/habanalabs: optimize timestamp registration handler
Currently we use dynamic allocation inside the irq handler
in order to allocate free node to be used for the free jobs.

This operation is expensive, especially when we deal with large
burst of events records that get released at the same time.

The alternative is to have pre allocated pool of free nodes
and just fetch nodes from this pool at irq handling time instead
of allocating them.

In case the pool becomes full, then the driver will fallback to
dynamic allocations.

As part of the optimization also update the unregister flow
upon re-using a timestamp record, by making the operation much
simpler and quicker. We already have the record in the registration
flow and now we just seek to re-use with different interrupt.
Therefore, no need to look for buffer according to the user handle.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
farah kassabri
0165994c21 accel/habanalabs: fix bug in timestamp interrupt handling
There is a potential race between user thread seeking to re-use
a timestamp record with new interrupt id, while this record is still
in the middle of interrupt handling and it is about to be freed.
Imagine the driver set the record in_use to 0 and only then fill the
free_node information. This might lead to unpleasant scenario where
the new registration thread detects the record as free to use, and
change the cq buff address. That will cause the free_node to get
the wrong buffer address to put refcount to.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
d89d329a2b accel/habanalabs: tiny refactor of hl_map_dmabuf()
alloc_sgt_from_device_pages() includes relatively many parameters, and
in a subsequent change another offset parameter is going to be added.
Using structure fields directly when calling this function, and in
hl_map_dmabuf() it is done twice, makes it a little bit difficult to
understand the meaning of the parameters.
To make it clearer, assign the required values into local variables with
explicit names, and use the variables when calling the function.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
0b75cb5b24 accel/habanalabs: export dma-buf only if size/offset multiples of PAGE_SIZE
It is currently allowed for a user to export dma-buf with size and
offset that are not multiples of PAGE_SIZE.
The exported memory is mapped for the importer device, and there it will
be rounded to PAGE_SIZE, leading to actually exporting more than the
user intended to.
To make the user be aware of it, accept only size and offset which are
multiple of PAGE_SIZE.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:22 +03:00
Tomer Tayar
efbca048c6 accel/habanalabs: use exported size from dma_buf and not from phys_pg_pack
The 'exported_size' member in 'struct hl_vm_phys_pg_pack' is used to
keep the exported dma-buf size, to be later used when the buffer is
mapped.
However it is possible that the same phys_pg_pack will be exported more
than once, and independently of when the mapping takes place.
Remove this member from the phys_pg_pack structure, and simply use the
size in the dma-buf object as the exported size when mapping.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Tomer Tayar
dfdbc55a9c accel/habanalabs: always pass exported size to alloc_sgt_from_device_pages()
For Gaudi1 the exported dma-buf is always composed of a single page, and
therefore the exported size is equal to this page's size.
When calling alloc_sgt_from_device_pages(), we pass 0 as the exported
size and internally calculate it as "number of pages * page size".
This makes alloc_sgt_from_device_pages() less clear, because the
exported size parameter is not understood as a restriction on the pages'
size.
Modify to always pass the exported size explicitly.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
farah kassabri
051868d93c accel/habanalabs: prevent sending heartbeat before events are enabled
After the heartbeat mechanism is now expanded to be used also
for EQ health check, we shouldn't send heartbeat messages
to FW before driver allow events to be received from FW.

Because if the driver will send two heartbeats before it enables
events to be received from FW, then the EQ health check
will fail and reset the device.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
farah kassabri
764bfd138f accel/habanalabs/gaudi2: add eq health check using irq
This is the second patch for applying the eq health check mechanism
which will add support for the interrupt flow for gaudi2 asic.

More info about the interrupt mechanism:
set a dedicated msix for the eq error interrupt, and add
interrupt handler for it.
when FW detects some issue with EQ like EQ_FULL, it'll
raise that interrupt and driver should reset the device.
Driver will inform the FW which msix index to use through
the already existing handshake mechanism which will
send msix info message to fw.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
farah kassabri
7c4130e6dd accel/habanalabs/gaudi2: handle eq health heartbeat check
Add mechanism for fw eq health check. this will be done using two flows:
using the heartbeat mechanism and raising a dedicated interrupt to
indicate an eq failure like EQ full.
This patch will add implementation for the eq heartbeat for gaudi2 asic.

More info about the heartbeat mechanism:
Expand the heartbeat mechanism to monitor a new event that
will be sent from FW upon receiving heartbeat message.
that way driver can know that the eq is working or not.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Moti Haimovski
72bff371b2 accel/habanalabs/gaudi2: print power-mode changes
Print to kernel log any device power mode changes events reported by
the FW.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Hen Alon
0648c4d080 accel/habanalabs: add tsc clock sampling to clock sync info
Add tsc clock to clock sync info, to enable using this clock for
sampling and sync it with device time.

Signed-off-by: Hen Alon <halon@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Dafna Hirschfeld
e0f452802b accel/habanalabs: fix inline doc typos
Fix two typos

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Dafna Hirschfeld
ab574f6a81 accel/habanalabs: disable events ioctls on control device
Because it is not used and also, for graceful reset to work
those ioctls should run on the compute device.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
David Meriin
2b76129c5a accel/habanalabs: move cpucp interface to linux/habanalabs
The CPUCP interface is moved to a shared folder outside of accel as
a pre-requisite to upstream the NIC drivers that will also include
this file.

Signed-off-by: David Meriin <dmeriin@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Ofir Bitton
d261b0ab13 accel/habanalabs/gaudi2: include block id in ECC error reporting
During ECC event handling, Memory wrapper id was mistakenly
printed as block id. Fix the print and in addition fetch the actual
block-id from firmware.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:21 +03:00
Benjamin Dotan
10d260f655 accel/habanalabs: improve etf configuration
coresight ETF blocks have different size. As a result, sync packets
need to be aligned based on fifo size.

Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Justin Stitt
571bfeb48a accel/habanalabs: refactor deprecated strncpy
`strncpy` is deprecated for use on NUL-terminated destination strings [1].

A suitable replacement is `strscpy` [2] due to the fact that it
guarantees NUL-termination on its destination buffer argument which is
_not_ the case for `strncpy`!

There is likely no bug happening in this case since HL_STR_MAX is
strictly larger than all source strings. Nonetheless, prefer a safer and
more robust interface.

It should also be noted that `strscpy` will not pad like `strncpy`. If
this NUL-padding behavior is _required_ we should use `strscpy_pad`
instead of `strscpy`.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Christophe JAILLET
90f3de6162 accel/habanalabs/gaudi2: Fix incorrect string length computation in gaudi2_psoc_razwi_get_engines()
snprintf() returns the "number of characters which *would* be generated for
the given input", not the size *really* generated.

In order to avoid too large values for 'str_size' (and potential negative
values for "PSOC_RAZWI_ENG_STR_SIZE - str_size") use scnprintf()
instead of snprintf().

Fixes: c0e6df9160 ("accel/habanalabs: fix address decode RAZWI handling")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Justin Stitt
a45d5cf09d accel/habanalabs: refactor deprecated strncpy to strscpy_pad
`strncpy` is deprecated for use on NUL-terminated destination strings [1].

We see that `prop->cpucp_info.card_name` is supposed to be
NUL-terminated based on its usage within `__hwmon_device_register()`
(wherein it's called "name"):
|	if (name && (!strlen(name) || strpbrk(name, "-* \t\n")))
|		dev_warn(dev,
|			 "hwmon: '%s' is not a valid name attribute, please fix\n",
|			 name);

A suitable replacement is `strscpy_pad` [2] due to the fact that it
guarantees both NUL-termination and NUL-padding on its destination
buffer.

NUL-padding on `prop->cpucp_info.card_name` is not strictly necessary as
`hdev->prop` is explicitly zero-initialized but should be used
regardless as it gets copied out to userspace directly -- as per Kees'
suggestion.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Benjamin Dotan
428f6882a6 accel/habanalabs: fix ETR/ETF flush logic
When config_etr or config_etf are called we need to validate the
parameters that are passed into them to make sure the requested
operation is valid.

Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Benjamin Dotan
cf1ed52d12 accel/habanalabs/gaudi2 : remove psoc_arc access
Because firmware is blocking PSOC_ARC_DBG, we need to disable access
to this block.

Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Igor Grinberg
01ab1629ad accel/habanalabs/gaudi2: prepare to remove cpu_rst_status
The soft reset has transitioned to CPUCP packet instead of plain
register write and is about to be removed from the struct cpu_dyn_regs.
As a preparation for removing the cpu_rst_status field from
struct cpu_dyn_regs, switch to use the plain macro - this keeps the
backward compatibility.

Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Tomer Tayar
57963ff8ad accel/habanalabs: Move ioctls to the device specific ioctls range
To use drm_ioctl(), move the ioctls to the device specific ioctls
range at [DRM_COMMAND_BASE, DRM_COMMAND_END).

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Tomer Tayar
fe77368c0f accel/habanalabs: register compute device as an accel device
Register the compute device as an accel device, and remove the creation
of the habanalabs compute char device.

The IOCTLs in this patch are still handled by the current driver
handler. Moving to DRM IOCTL handling requires moving the IOCTLs
numbers to a specific range, so it will be handled in subsequent
patches.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:20 +03:00
Ofir Bitton
a8ab1a81cc accel/habanalabs: add info ioctl for engine error reports
User gets notification for every engine error report, but he still
lacks the exact engine information. Hence, we allow user to query
for the exact engine reported an error.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Tomer Tayar
10926f6005 accel/habanalabs: set default device release watchdog T/O as 30 sec
After being notified about certain errors, user is expected to finish
his post-errors actions and to release the device within some timeout,
after which is deice is being reset.
The default timeout value is 5 sec, which in some case is not enough for
a user application to collect debug data.
Increase the default value to 30 sec.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Dani Liberman
8887279092 accel/habanalabs: handle f/w reserved dram space request
It is possible for FW to request reserved space in dram.
If the device supports this option, it will retrieve the size from the
f/w and will reserve it.

Currently we add the common code infrastructure to support it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Oded Gabbay
fa46c7bb50 accel/habanalabs/gaudi2: fix missing check of kernel ctx
If we are initializing the kernel context when we have a Gaudi2 device,
we don't need to do any late initializing of that context with
specific Gaudi2 code.

Reviewed-by: Ofir Bitton <obitton@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Igor Grinberg
15c0bb1623 accel/habanalabs/gaudi2: prepare to remove soft_rst_irq
The soft reset has transitioned to CPUCP packet instead of plain
register write and is about to be removed from the struct cpu_dyn_regs.
As a preparation for removing the gic_host_soft_rst_irq field from
struct cpu_dyn_regs, switch to use the plain macro - this keeps the
backward compatibility.

Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Ofir Bitton
1e3a78270b accel/habanalabs/gaudi2: unsecure tpc count registers
As TPC kernels now must use those registers we unsecure them.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Tomer Tayar
5a8487ac54 accel/habanalabs/gaudi2: un-secure register for engine cores interrupt
The F/W dynamically allocates one of the PSOC scratchpad registers for
the engine cores, so they can raise events towards the F/W.
To allow the engine cores to access this register, this register must be
non-secured.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Juerg Haefliger
b03dc2b621 accel/habanalabs/gaudi: Add MODULE_FIRMWARE macros
The module loads firmware so add MODULE_FIRMWARE macros to provide that
information via modinfo.

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Ofir Bitton
d33c3d0541 accel/habanalabs: dump temperature threshold boot error
Add dump of an error reported from f/w during boot time.
This error indicates a failure with setting temperature threshold.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:19 +03:00
Oded Gabbay
37d72439a4 accel/habanalabs: reset device if scrubbing failed
If scrubbing memory after user released device has failed it means
the device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:19 +03:00
Oded Gabbay
89803af535 accel/habanalabs: remove pdev check on idle check
Our simulator supports idle check so no need anymore to check if pdev
exists.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:18 +03:00
farah kassabri
2da9f8d805 accel/habanalabs: fix wait_for_interrupt abortion flow
When the driver needs to abort waiters for interrupts, for cases
such as critical events that occur and driver need to do hard reset,
in such scenario the driver will complete the fence to wake up the
waiting thread, and will set the fence error indication.
The return value of the completion API will be greater than 0
since it will return the timeout, but as this indicates successful
completion, the driver should mark it as aborted.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
farah kassabri
eaa43a06b7 accel/habanalabs: Allow single timestamp registration request at a time
Protect against concurrency of user requesting to register a timestamp
offset (where the driver fills the timestamp when the command submission
has finished executing) to a specific user interrupt ID. The
protection is basically to allow only one timestamp registration
request to be handled at a time.

This is needed because the user can decide to re-use a timestamp
offset (register an already registered offset, to a different
interrupt ID). This means the request will cause the timestamp node to
move from one interrupt list to another interrupt list. In such
scenario, without proper protection, we could end up adding the same
node twice to the interrupts wait lists.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00