Commit Graph

332 Commits

Author SHA1 Message Date
Koby Elbaz
964b1f675d accel/habanalabs: rename fd_list to hpriv_list
Every time an FD is returned to the user, the driver adds
a corresponding private structure to the list.
Yet, it's still a list of private structures rather than of FDs.
Remove, as well, an unnecessary comment.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Koby Elbaz
942f18c56d accel/habanalabs: call put_pid after hpriv list is updated
Because we might still be using related resources, decrementing PID's
reference count should be done at later stages of the device release.
A good place is right after the representing private structure is
removed from LKD's list.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Koby Elbaz
2b541cf913 accel/habanalabs: print return code when process termination fails
As part of driver teardown, we attempt to kill all user processes.
It shouldn't fail, but if it does we want to print the error code that
the kapi returned to us.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
farah kassabri
bffd2f16ae accel/habanalabs: fix standalone preboot descriptor request
The preboot used to statically allocate memory for the comms descriptor
on the device memory when driver requested the descriptor information.
Now preboot moved to dynamic memory allocation where it wants to check
the size the driver expects vs. what the f/w expects.
Note there are no backward compatibility issues as older f/w versions
simply ignore this value.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Dani Liberman
43d8acce60 accel/habanalabs: handle arc farm razwi
Implement razwi handling for arc farm and add it to arc farm sei
event handler.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Ofir Bitton
f17182d036 accel/habanalabs: stop fetching MME SBTE error cause
Because in this case we have only a single possible cause, we can
safely stop fetching the cause from firmware.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Koby Elbaz
e4a97d6b62 accel/habanalabs: set device status 'malfunction' while in rmmod
hl_device_status() returns the status of an acquired device.
If a device is going down (following an rmmod cmd),
it should be marked as an unusable/malfunctioning device, and
hence should not be acquired.
However, since this was not the case so far (i.e., a device going
down would inaccurately return 'in reset' status allowing the user
to acquire the device) it introduced a bug where as part of a reset
flow, the driver could not kill processes that have not run yet, and
since those processes aren't blocked from reacquiring a device,
we get eventually a new flow of a driver attempting to kill all
processes in a list that can't be ever really empty.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Tomer Tayar
e7b2902a33 accel/habanalabs: print task name upon creation of a user context
It is useful for debug to know which user process have acquired the
device.
Add this info to the relevant debug print, in addition to the already
printed user context's ASID.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:18 +03:00
Tomer Tayar
7dccb064a7 accel/habanalabs: print task name and request code upon ioctl failure
When an ioctl fails, it is useful to know what is the task command name
and the full ioctl request code, in addition to the task pid and the
ioctl number.
Add the additional information to the relevant debug error prints.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:17 +03:00
Ofir Bitton
c6a4f256ae accel/habanalabs: notify user about undefined opcode event
In order for user to be aware of undefined opcode events, we must
store all relevant information and notify user about the failure.
The user will fetch the stored info via info ioctl.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:17 +03:00
Tomer Tayar
a35c997601 accel/habanalabs: update pending reset flags with new reset requests
If hl_device_cond_reset() is called while a reset is already pending but
hasn't started, the reset request will be dropped.
If the flags of the new request are more severe, e.g. a hard reset while
the pending reset is a compute reset, the eventual reset won't be
suitable for the device status.

To prevent such cases, update the pending reset flags with the new
requests flags before the requests are dropped.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:17 +03:00
Tomer Tayar
5d89ce6f8c accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W events
When a H/W event is received while a user is registered to events, no
immediate hard reset will happen, and instead the user will be notified
and will have some time to handle it and eventually release the
device, after which the reset will be done.
If a user, as part of the handling and as part of the cleanup steps
towards releasing the device, unregisters from receiving those events,
and at that time an adjacent H/W event is received, it will be assumed
that the user is not registered to events and thus an immediate hard
reset is required.

To prevent such an unwanted immediate reset, modify the driver to
perform it if the user is not registered to events AND we don't already
have a pending reset for a previous H/W event.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:17 +03:00
Arnd Bergmann
78e9b217d7 accel/habanalabs: add more debugfs stub helpers
Two functions got added with normal prototypes for debugfs, but not
alternative when building without it:

drivers/accel/habanalabs/common/device.c: In function 'hl_device_init':
drivers/accel/habanalabs/common/device.c:2177:14: error: implicit declaration of function 'hl_debugfs_device_init'; did you mean 'hl_debugfs_init'? [-Werror=implicit-function-declaration]
drivers/accel/habanalabs/common/device.c:2305:9: error: implicit declaration of function 'hl_debugfs_device_fini'; did you mean 'hl_debugfs_remove_file'? [-Werror=implicit-function-declaration]

Add stubs for these as well.

Fixes: 3b9abb4fa6 ("accel/habanalabs: expose debugfs files later")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20230609120636.3969045-1-arnd@kernel.org
2023-07-20 12:51:59 +02:00
Dani Liberman
e6f49e96bc accel/habanalabs: refactor error info reset
Moved error info reset code to single function for future use from
other places in the driver.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Ofir Bitton
fac91dd54f accel/habanalabs: add event queue extra validation
In order to increase reliability of the event queue interface,
we apply to Gaudi2 the same mechanism we have in Gaudi1.
The extra validation is basically checking that the received
event index matches the expected index.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Ofir Bitton
19aa21b980 accel/habanalabs: unsecure TSB_CFG_MTRR regs
In order to utilize Engine Barrier padding, user must have access to
this register set.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Oded Gabbay
ff5c702522 accel/habanalabs: move ioctl error print to debug level
We don't want to allow users to spam the kernel log and sending
ioctls with bad opcodes is a sure way to do it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08 12:35:56 +03:00
Ofir Bitton
8a20b38164 accel/habanalabs: fix bug of not fetching addr_dec info
addr_dec info should always be fetched, regardless of cause value.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Oded Gabbay
569210233a accel/habanalabs: remove sim code
There were a few places where simulator only code got into the upstream.
Remove those places that can confuse other developers.

Fixes: 2a0a839b6a ("habanalabs: extend fatal messages to contain PCI info")
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Dani Liberman
5d658d0c51 accel/habanalabs: mask part of hmmu page fault captured address
When receiving page fault from hmmu, the captured address is scrambled
both by HW and by driver. The driver part is unscrambled but the HW
part isn't getting unscrambled.
To avoid declaring wrong address, the HW scrambled part will be
masked.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Koby Elbaz
7e63f317c0 accel/habanalabs: update state when loading boot fit
Any FW component we load must be followed by a corresponding state
update. However, it seems that so far we skipped doing so for the
bootfit case, so fix that.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Tomer Tayar
6092cedfff accel/habanalabs: print qman data on error only for lower qman
By default, the upper QMANs are not used, and instead engines ARCs
access the lower QMANs directly.
Errors for upper QMANs are therefore not expected, and the debug print
of the PQ entries is not needed.

Modify the QMAN debug data print on errors to include only information
for the lower QMAN.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:56 +03:00
Tomer Tayar
54381ee809 accel/habanalabs: use lower QM in QM errors handling
The QMAN GLBL_ERR_STS_4 register has indications for errors also in the
lower CQ and the ARC CQ, and not just for errors in the lower CP.
Modify the relevant define/struct and the related print to use "lower
QM" instead of "lower CP".

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Dani Liberman
dcc8fa88d4 accel/habanalabs: use binning info when handling razwi
When receiving sei interrupt from tpc or decoder, we need to check
the binning mask because if the engine is binned, the razwi info
won't be in the router of the binned engine, instead will be in the
router of the substitute engine.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Ofir Bitton
583f12a80d accel/habanalabs: remove support for mmu disable
As mmu disable mode is only used for bring-up stages, let's remove this
option and all code related to it.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Koby Elbaz
b2d61fecb4 accel/habanalabs: upon DMA errors, use FW-extracted error cause
Initially, the driver used to read the error cause data directly from
the ASIC. However, the FW now clears it before the driver could read
it. Therefore we should use the error cause data that is extracted by
the FW.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Oded Gabbay
adda800c04 accel/habanalabs: print max timeout value on CS stuck
If a workload got stuck, we print an error to the kernel log about it.
Add to that print the configured max timeout value, as that value is
not fixed between ASICs and in addition it can be configured using
a kernel module parameter.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08 12:35:55 +03:00
Oded Gabbay
dcfce96ee8 accel/habanalabs: align to latest firmware specs
Update the firmware common interface files with the latest version.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08 12:35:55 +03:00
Moti Haimovski
314a7ffd7c accel/habanalabs: fix mem leak in capture user mappings
This commit fixes a memory leak caused when clearing the user_mappings
info when a new context is opened immediately after user_mapping is
captured and a hard reset is performed.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Oded Gabbay
e715008b7c accel/habanalabs: set unused bit as reserved
Get latest f/w gaudi2 interface file which marks unused
bist_need_iatu_config bit in cold_rst_data structure as reserved bit.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-06-08 12:35:55 +03:00
Koby Elbaz
964234aba5 accel/habanalabs: rename security functions related arguments
Make the argument names specify the registers array represent
registers that should be unsecured so the user can access them.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Dan Carpenter
9ec7639b5e accel/habanalabs: fix gaudi2_get_tpc_idle_status() return
The gaudi2_get_tpc_idle_status() function returned the incorrect variable
so it always returned true.

Fixes: d85f0531b9 ("accel/habanalabs: break is_idle function into per-engine sub-routines")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:55 +03:00
Yang Li
cc1eeaa335 accel/habanalabs: Fix some kernel-doc comments
Make the description of @regs_range_array and @regs_range_array_size
to @user_regs_range_array and @user_regs_range_array_size  to silence
the warnings:

drivers/accel/habanalabs/common/security.c:506: warning: Function parameter or member 'user_regs_range_array' not described in 'hl_init_pb_ranges_single_dcore'
drivers/accel/habanalabs/common/security.c:506: warning: Function parameter or member 'user_regs_range_array_size' not described in 'hl_init_pb_ranges_single_dcore'
drivers/accel/habanalabs/common/security.c:506: warning: Excess function parameter 'regs_range_array' description in 'hl_init_pb_ranges_single_dcore'
drivers/accel/habanalabs/common/security.c:506: warning: Excess function parameter 'regs_range_array_size' description in 'hl_init_pb_ranges_single_dcore'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=4940
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Ofir Bitton
d0dcd4bbfa accel/habanalabs: always fetch pci addr_dec error info
Due to missing indication of address decode source (LBW/HBW bus),
we should always try and fetch extended information.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Koby Elbaz
7d21296336 accel/habanalabs: fix a static warning - 'dubious: x & !y'
Use a straight forward approach to get a conditional result.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Koby Elbaz
67d19a2f49 accel/habanalabs: poll for device status update following WFE cmd
Currently, we rely on COMMS protocol's ack to verify that WFE command
has been acknowledged by the FW. However, this does not guarantee that
the device status has been updated.
Although unlikely, this could trigger a race since the driver expects
the device to be halted at that stage, but it might not be.
Therefore, we increase WFE's robustness by polling on the status
register that will be updated once the device is actually halted.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Tomer Tayar
3b9abb4fa6 accel/habanalabs: expose debugfs files later
Currently the debugfs root folder and files for a device are created at
an early step, before the device initialization and before the char
device and sysfs files are exposed to user.
As there is no real reason not to do it together with the device
creation, postpone it to be done right afterwards.

The initialization of the debugfs entry structure is left in its
current position because it is used before creating the files.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Ofir Bitton
d8b9cea584 accel/habanalabs: add pci health check during heartbeat
Currently upon a heartbeat failure, we don't know if the failure
is due to firmware hang or due to a bad PCI link. Hence, we
are reading a PCI config space register with a known value (vendor ID)
so we will know which of the two possibilities caused the heartbeat
failure.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Dafna Hirschfeld
3d21ec6424 accel/habanalabs: add missing tpc interrupt info
For some reason the last possible tpc interrupt cause in
gaudi2_tpc_interrupts_cause is missing from the code.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Koby Elbaz
9a4e44a4ee accel/habanalabs: refactor abort of completions and waits
Aborting CS completions should be in command_submission.c but aborting
waiting for user interrupts should be in device.c.

This separation is also for adding more abort operations in the future.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Koby Elbaz
ad8bfd3619 accel/habanalabs: minimize encapsulation signal mutex lock time
Sync Stream Encapsulated Signal Handlers can be managed from different
contexts, and as such they are protected via a spin_lock.
However, spin_lock was unnecessarily protecting a larger code section
than really needed, covering a sleepable code section as well.
Since spin_lock disables preemption, it could lead to sleeping in
atomic context.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-08 12:35:54 +03:00
Moti Haimovski
04729b418f accel/habanalabs: call to HW/FW err returns 0 when no events exist
This commit modifies the call to retrieve HW or FW error events to
return success when no events are pending, as done in the calls to
other events.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:34 +03:00
Ofir Bitton
57469c1206 accel/habanalabs: unsecure TPC bias registers
User needs to be able to perform downcast / upcast of fp8_143 dtype.
Hence bias register needs to be accessed by the user.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:34 +03:00
Dafna Hirschfeld
cc7b790d41 accel/habanalabs: do soft-reset using cpucp packet
This is done depending on the FW version. The cpucp method is
preferable and saves scratchpads resource.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:34 +03:00
Dafna Hirschfeld
a12428acf8 accel/habanalabs: check fw version using sw version
The fw inner version is less trustable, instead use the fw general
sw release version.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Dafna Hirschfeld
dd5667ff6f accel/habanalabs: extract and save the FW's SW major/minor/sub-minor
It is not always possible to know the FW's SW version from the inner FW
version. Therefore we should extract the general SW version in addition
to the FW version and use it in functions like
'hl_is_fw_ver_below_1_9' etc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Dafna Hirschfeld
3071247ca0 accel/habanalabs: rename fw_{major/minor}_version to fw_inner_{major/minor}_ver
We later want to add fields for Firmware SW version. The current
extracted FW version is the inner FW versioning so the new name
is better and also better differentiate from the FW's SW version.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Dafna Hirschfeld
9ef23f05ae accel/habanalabs: add helper to extract the FW major/minor
the helper is extract_u32_until_given_char and can later be used to
also get the major/minor of the sw version.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Moti Haimovski
f9b60242af accel/habanalabs: fix bug in free scratchpad memory
This commit fixes a bug in Gaudi2 when freeing the scratchpad memory
in case software init fails.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Koby Elbaz
574ee40f51 accel/habanalabs: remove commented code that won't be used
Once it was decided that these security settings are to be done by FW
rather than by the driver, there's no reason to keep them in the code.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Rakesh Ughreja
9ce36082c1 accel/habanalabs: allow user to modify EDMA RL register
EDMA transpose workload requires to signal for every activation.
User FW sends all the dummy signals to RD_LBW_RATE_LIM_CFG, to save
lbw bandwidth. We need the user to be able to access that register to
configure it.

Signed-off-by: Rakesh Ughreja <rughreja@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Tal Cohen
1464fbd8ba accel/habanalabs: ignore false positive razwi
In Gaudi2 asic, PSOC RAZWI may cause in HBW or LBW. The address that
caused the error is read from HW register and printed by the Driver.
There are cases where the Driver receives an indication on PSOC
RAZWI error but the address value is zero. In that case, the indication
is a false positive.
The Driver should not "count" a PSOC RAZWI event error when the
caused the address is zeroed.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Tom Rix
d95f87d29c accel/habanalabs: remove variable gaudi_irq_name
gcc with W=1 reports
drivers/accel/habanalabs/gaudi/gaudi.c:117:19: error:
  ‘gaudi_irq_name’ defined but not used [-Werror=unused-const-variable=]
  117 | static const char gaudi_irq_name[GAUDI_MSI_ENTRIES][GAUDI_MAX_STRING_LEN] = {
      |                   ^~~~~~~~~~~~~~

This variable is not used so remove it.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-06-05 15:31:33 +03:00
Linus Torvalds
556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firwmare loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
 LEGadNS38k5fs+73UaxV
 =7K4B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Linus Torvalds
4173cf6fb6 hwmon updates for v6.4
- New drivers
 
   - Driver for Acbel FSB032 power supply
 
   - Driver for StarFive JH71x0 temperature sensor
 
 - Added support to existing drivers
 
   - aquacomputer_d5next: Support for Aquacomputer Aquastream XT
 
   - nct6775: Added various ASUS boards to list of boards supporting WMI
 
   - asus-ec-sensors: ROG STRIX Z390-F GAMING, ProArt B550-Creator,
 
 - Notable improvements
 
   - Regulator event and sysfs notification support for PMBus drivers
 
 - Notable cleanup:
 
   - Constified pointers to hwmon_channel_info
 
 - Various other minor bug fixes and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEiHPvMQj9QTOCiqgVyx8mb86fmYEFAmRF2DQACgkQyx8mb86f
 mYEGxhAAnm4r7MkxdBPv8UloSxljTGMtPr+xfJ+TqW+bJVFoPJ57OZ+Coqfa8NTs
 h58inRehmkYVuYU3dIKYLWq4TiCyCq8W8ACZ/oRiUB7vKEG5Yt5AWw8ZSWjYWf/b
 b3+N/IdEFSrnR2+xjIx2HcCYfpBomaq/5VDH6DDfRGD+HHkhfp065z1TeiOj4Dr3
 3sueu+vbVPowIs55VXhmUwSC95n7+qv9qrA0ZYFtGUd1klDC+4fWy2wQZ+ed1gz2
 WATJmbhHu+QhrqFC4qlgiow50FrHjx0EWywEBiwBm9A6bfKNRWitz7YXqe7entP3
 TFg9rykBMleowNstwqtMARmaLxzDoH/IpsoIOY8NhGgzka0zy1pVyQ3bXRgVTP2X
 BRUOqKfj6gsysT3KenUmsMi7Mr+NFJ8LwSl9d4z/v+4ojE5N57l12tdoMaY8qlGs
 9bk75xtGuuZm8LYYPTVp1DU3XaF816Qj30fmaBruy3HVgKRyaWw1OEVayZMWQucr
 KJ+1GoEyoCwlaRDX9UrazPtVg2TocAvIZfUdl2+Zd6WF7Mifl3q9HJg+XMmeZlI5
 xRmvMj+srJ99Q2HEWS/SyiOr4n2rjRC+WeTs7N0x6X4zDsbbZTK7ONfEf0Yz1aLL
 rykeG94DO5qLpz8PQ38X8ONT8OwVcpCg0UF/0xbfAO1zYSqzv5E=
 =i+Fr
 -----END PGP SIGNATURE-----

Merge tag 'hwmon-for-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon updates from Guenter Roeck:
 "New drivers

   - Driver for Acbel FSB032 power supply

   - Driver for StarFive JH71x0 temperature sensor

  Added support to existing drivers:

   - aquacomputer_d5next: Support for Aquacomputer Aquastream XT

   - nct6775: Added various ASUS boards to list of boards supporting WMI

   - asus-ec-sensors: ROG STRIX Z390-F GAMING, ProArt B550-Creator,

  Notable improvements:

   - Regulator event and sysfs notification support for PMBus drivers

  Notable cleanup:

   - Constified pointers to hwmon_channel_info

  .. and various other minor bug fixes and improvements"

* tag 'hwmon-for-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (131 commits)
  hwmon: lochnagar: Remove the unneeded include <linux/i2c.h>
  hwmon: (pmbus/fsp-3y) Fix functionality bitmask in FSP-3Y YM-2151E
  hwmon: (adt7475) Use device_property APIs when configuring polarity
  hwmon: (aquacomputer_d5next) Add support for Aquacomputer Aquastream XT
  hwmon: (it87) Disable/enable SMBus access for IT8622E chipset
  hwmon: (it87) Add calls to smbus_enable/smbus_disable as required
  hwmon: (it87) Test for error in it87_update_device
  hwmon: (it87) Disable SMBus access for environmental controller registers.
  docs: hwmon: Add documentaion for acbel-fsg032 PSU
  hwmon: (pmbus/acbel-fsg032) Add Acbel power supply
  dt-bindings: trivial-devices: Add acbel,fsg032
  dt-bindings: vendor-prefixes: Add prefix for acbel
  hwmon: (sfctemp) Simplify error message
  hwmon: (pmbus/ibm-cffps) Use default debugfs attributes and lock function
  hwmon: (pmbus/core) Add lock and unlock functions
  hwmon: (pmbus/core) Request threaded interrupt with IRQF_ONESHOT
  hwmon: (nct6775) update ASUS WMI monitoring list A620/B760/W790
  hwmon: ina2xx: add optional regulator support
  dt-bindings: hwmon: ina2xx: add supply property
  dt-bindings: hwmon: pwm-fan: Convert to DT schema
  ...
2023-04-25 17:43:44 -07:00
Tomer Tayar
56499c4615 accel/habanalabs: add missing error flow in hl_sysfs_init()
hl_sysfs_fini() is called only if hl_sysfs_init() completes
successfully. Therefore if hl_sysfs_init() fails, need to remove any
sysfs group that was added until that point.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:44:23 +03:00
Moti Haimovski
31420f93b5 accel/habanalabs: speedup h/w queues test in Gaudi2
HW queues testing at driver load and after reset takes a substantial
amount of time.
This commit reduces the queues test time in Gaudi2 devices by running
all the tests in parallel instead of one after the other.
Time measurements on tests duration shows that the new method is almost
x100 faster than the serial approach.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:43:34 +03:00
Dani Liberman
91204e4703 accel/habanalabs: fix handling of arc farm sei event
There is only single eq entry for arc farm sei event which aggregates
events from the four arc farms.
Fix the code to handle this event according to this behavior.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:42:13 +03:00
Ofir Bitton
b207e166db accel/habanalabs: remove Gaudi1 multi MSI code
Multi MSI interrupts aren't working in Gaudi1 and because of that,
we are only using a single MSI interrupt. Therefore, let's remove this
dead code in order to avoid confusion.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:40:35 +03:00
Ofir Bitton
38f3c732fc accel/habanalabs: fixes for unexpected error interrupt
Removing redundant asic prop variable as we don't need to expose this
to common code. In addition, fix some typos.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Koby Elbaz
c19350efa9 accel/habanalabs: don't wait for STS_OK after sending COMMS WFE
Sending COMMS_GOTO_WFE instructs the FW's CPU to halt (WFE state).
Once sent, FW's CPU isn't expected to continue communicating with LKD.
Therefore, the stage of waiting for COMMS_STS_OK should be skipped or
else waiting for COMMS_STS_OK will simply timeout, which will trigger
unexpected behavior.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Tal Cohen
802f25b6c2 accel/habanalabs: sync f/w events interrupt in hard reset
Receiving events from FW, while the device is in hard reset, causes
a warning message in Driver log. The message may point to a
problem in the Driver or FW. But It also can appear as a result
of events that have been sent from FW just before the hard reset.
In order to avoid receiving events from FW while the device is in reset
and is already in 'disabled' mode, sync the f/w events interrupt right
before setting the device to 'disabled'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Ofir Bitton
82a1b48a4e accel/habanalabs: fix wrong reset and event flags
During event handling, driver sets relevant reset and user event
notifier flags. Fix few wrong flags settings.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Tomer Tayar
d4801c0485 accel/habanalabs: fix events mask of decoder abnormal interrupts
The decoder IRQ status register may have several set bits upon an
abnormal interrupt. Therefore, when setting the events mask, need to
check all bits and not using if-else.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Tomer Tayar
9cf56f0d97 accel/habanalabs: remove completion from abnormal interrupt work name
Decoder abnormal interrupts are for errors and not for completion, so
rename the relevant work and work function to not include 'completion'.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Ofir Bitton
49fd071d15 accel/habanalabs: print raw binning masks in debug level
There are rare cases of failures when cards are initialized due to
wrong values in efuse mappings that are parsed by firmware.

To help debug those cases, print (in debug level) the raw binning masks
as fetched from the firmware during device initialization.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Ofir Bitton
d1943f1b97 accel/habanalabs: fix HBM MMU interrupt handling
Current mapping between HMMU event and HMMU block is wrong.
In addition the captured address in case of a page fault or
an access error is scrambled, Hence we must call the descramble
function.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:34 +03:00
Dafna Hirschfeld
12f7701138 accel/habanalabs: improvements to FW ver extraction
1. Rename the func to hl_get_preboot_major_minor because we also set
   the extracted values in hdev fields.

2. Free the allocated string in the calling function which makes more
   sense

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:33 +03:00
Dani Liberman
6306e81583 accel/habanalabs: fix access error clear event
The register which needs to be cleared is the valid register instead
of the address.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:33 +03:00
Tal Cohen
3a8d7c3a7d accel/habanalabs: send disable pci when compute ctx is active
Fix an issue in hard reset flow in which the driver didn't send a
disable pci message if there was an active compute context.
In hard reset, disable pci message should be sent no matter if
a compute context exists or not.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:33 +03:00
Tal Cohen
a855f710f5 accel/habanalabs: remove duplicated disable pci msg
The disable pci message is sent in reset device. It informs the FW not
to raise more EQs. The Driver may ignore received EQs, when the device
is in disabled mode.
The duplication happens when hard reset is scheduled during compute
reset and also performs 'escalate_reset_flow'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:33 +03:00
Koby Elbaz
9d7fef7c59 accel/habanalabs: change COMMS warning messages to error level
COMMS protocol is used for LKD <--> FW communication, and any
communication failure between the two might turn out to be
destructive, hence, it should be well emphasized.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:33 +03:00
Dafna Hirschfeld
fb10da9337 accel/habanalabs: check return value of add_va_block_locked
since the function might fail and we should propagate the failure.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:33 +03:00
Tal Cohen
957b247bca accel/habanalabs: print event type when device is disabled
When the device is in disabled state, the driver isn't suppose to
receive any events from FW. Printing the event type, as part of the
message that was already printed, shall help to get more info if this
unexpected message is received.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:33 +03:00
Koby Elbaz
6c31c13759 accel/habanalabs: unmap mapped memory when TLB inv fails
Once a memory mapping is added to the page tables, it's followed by
a TLB invalidation request which could potentially fail (HW failure).
Removing the mapping is simply a part of this failure handling routine.
TLB invalidation failure prints were updated to be more accurate.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 10:39:33 +03:00
Cai Huoqing
248ed9e227 accel/habanalabs: Remove redundant pci_clear_master
Remove pci_clear_master to simplify the code,
the bus-mastering is also cleared in do_pci_disable_device,
like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
	u16 pci_command;

	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
	if (pci_command & PCI_COMMAND_MASTER) {
		pci_command &= ~PCI_COMMAND_MASTER;
		pci_write_config_word(dev, PCI_COMMAND, pci_command);
	}

	pcibios_disable_device(dev);
}.
And dev->is_busmaster is set to 0 in pci_disable_device.

Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 10:39:33 +03:00
Krzysztof Kozlowski
d8cc9415a4 hwmon: constify pointers to hwmon_channel_info
HWmon core receives an array of pointers to hwmon_channel_info and it
does not modify it, thus it can be array of const pointers for safety.
This allows drivers to make them also const.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2023-04-07 08:45:17 -07:00
Ofir Bitton
75b4457530 accel/habanalabs: remove redundant TODOs
As mmu refactor and nic resume are not relevant anymore, remove
their TODO comments.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Dani Liberman
ec48493183 accel/habanalabs: change razwi handle after fw fix
FW had one data route for tpc0 and tpc1 when running in secured mode
and a different one when running without secured mode. After fw fixed
this issue, both mode have the same data path.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Ofir Bitton
e1ef053e08 accel/habanalabs: add handling for unexpected user event
In order for the user to be aware of unexpected events in Gaudi2 that
aren't assigned to a specific engine, we are adding the handling of
this dedicated interrupt.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Tomer Tayar
7cd6b5625b accel/habanalabs: fix a missing-braces compilation warning
Replace initialization of "struct cpucp_packet" from "{0} to "{}" to
avoid a "missing braces around initializer" compilation warning.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Tomer Tayar
dc934c183d accel/habanalabs: fix a maybe-uninitialized compilation warnings
Initialize 'index' in gaudi2_handle_qman_err() and 'offset' in
gaudi2_get_nic_idle_status() to avoid "maybe-uninitialized" compilation
warnings.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Dani Liberman
9669b96f27 accel/habanalabs: fix page fault event clear
After getting page fault in gaudi2, we need to clear the valid bit
instead of the address.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Ofir Bitton
958e47977b accel/habanalabs: expose rotator mask to userspace
All engine masks are exposed to user, make sure user gets the
correct rotator enabled mask in gaudi2.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Ohad Sharabi
481e9a0fda accel/habanalabs: regenerate gaudi2 ids_map_extended
Some names of events has been modified/added.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:34 +02:00
Ofir Bitton
76e1ff37b6 accel/habanalabs: expose dram reserved size by kmd
We expose this in order for user applications to know how much dram
is reserved for internal use.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:33 +02:00
Tomer Tayar
0e418ab718 accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event()
Remove all '\n' from strings which are passed as arguments to
gaudi2_print_event(), because the newline character is added internally
in this function.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:33 +02:00
Koby Elbaz
af5e675f13 accel/habanalabs: return tlb inv error code upon failure
Now that CQ-completion based jobs do not trigger a reset upon failure,
failure of such jobs (e.g., MMU cache invalidation) should be handled
by the caller itself depending on the error code returned to it.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:33 +02:00
Dafna Hirschfeld
5d8a5f2965 accel/habanalabs: in {e/p}dma_core events read the err cause reg
Since the err_cause register is unprivileged, we should read it from
the driver instead of using the param that came from the FW.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:30 +02:00
Dafna Hirschfeld
f8d139a71b accel/habanalabs: fix use of var reset_sleep_ms
- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini
as it is called from there already for other flow.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:30 +02:00
Dafna Hirschfeld
077a39fabe accel/habanalabs: in hw_fini return error code if polling timed-out
In hw_fini callback, we use either the cpucp packet method or polling a
register. Currently we return error only in the case of cpucp packet
failure. In this patch we also return error if polling timed out.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:29 +02:00
Ofir Bitton
8c695455ee accel/habanalabs: increase reset poll timeout
Due to a firmware bug we need to increase reset poll timeout
or else we will timeout in secured environments.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:29 +02:00
Koby Elbaz
7c766e58cc accel/habanalabs: do not verify engine modes after being changed
Engines idle state can't always be verified between changes of
engine modes (e.g., stall/halt).
For example, if a CS is inflight when altering engine's mode,
idle state will return NOT idle, always.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:29 +02:00
Oded Gabbay
336b78c655 accel/habanalabs: align to latest firmware specs
Copy the most up-to-date interface files to the firmware.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-03-20 17:35:25 +02:00
Colin Ian King
443355d2ff accel/habanalabs: Fix spelling mistake "maped" -> "mapped"
There is a spelling mistake in a dev_err message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:25 +02:00
Oded Gabbay
79c164372e accel/habanalabs: make gaudi2_is_device_idle() static
This function is only called inside gaudi2.c file.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202303071320.X5ouBlNY-lkp@intel.com/
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-03-20 17:31:02 +02:00
Greg Kroah-Hartman
1aaba11da9 driver core: class: remove module * from class_create()
The module pointer in class_create() never actually did anything, and it
shouldn't have been requred to be set as a parameter even if it did
something.  So just remove it and fix up all callers of the function in
the kernel tree at the same time.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17 15:16:33 +01:00
Bjorn Helgaas
4516fede7c accel/habanalabs: Drop redundant pci_enable_pcie_error_reporting()
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages. Since
commit f26e58bf6f ("PCI/AER: Enable error reporting when AER is native"),
the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.

Remove the redundant pci_enable_pcie_error_reporting() call from the
driver.  Also remove the corresponding pci_disable_pcie_error_reporting()
from the driver .remove() path.

Note that this only controls ERR_* Messages from the device.  An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:16 +02:00
Tomer Tayar
2e8e9a895c accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()
The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
801507d3b3 accel/habanalabs: move soft-reset wait to soft-reset execute
We plan to do soft-reset either by mmio or by using cpucp packet
depending on the FW version. We don't want to check FW version in two
different places for that (execute soft-reset and wait to soft-reset)
so move the waiting to gaudi2_execute_soft_reset. This also makes sense
because the cpucp also does the waiting.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
f7f0085eec accel/habanalabs: add uapi to stall/resume engine
The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.

The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Tomer Tayar
28fbc058f2 accel/habanalabs: use scnprintf() in print_device_in_use_info()
compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check if the return value larger than the given size.

Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
087fe7c9c2 accel/habanalabs: unify err log of hw-fini failure in dirty state
print more informative message when failing in dirty state

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:15 +02:00
Koby Elbaz
7ffb5ced2b accel/habanalabs: use a mutex rather than a spinlock
There are two reasons why mutex is better here:
1. There's a critical section relatively long, where in
certain scenarios (e.g., multiple VM allocations) taking a spinlock
might cause noticeable performance degradation.
2. It will remove the incorrect usage of mutex under
spin_lock (where preemption is disabled).

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
c7ac65c881 accel/habanalabs: allow getting HL_INFO_DRAM_USAGE during soft-reset
We can allow userspace to query the dram usage during soft-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
9732d5d0d4 accel/habanalabs: fix register address on PDMA/EDMA idle check
The PDMA/EDMA is_idle routines didn't check the correct CORE register
in order to get the accurate idle state.
Moreover, it's better to make the is_idle routine more robust by adding
additional checks (IS_HALTED) before announcing that the core is idle.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
f831aade13 accel/habanalabs: remove a useless is_idle TPC flag
Is appears that the flag -
DCORE0_TPC0_CFG_STATUS_VECTOR_PIPE_EMPTY_MASK, has no actual use when
it comes to querying TPC idleness, since this flag's corresponding bit
turns-off after stalling the engine, and turns back on after resuming
it.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
farah kassabri
25ebbc57ca accel/habanalabs: fix few misspelled words in the code
Run spell checker on the code and fix accordingly.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
75276e23d1 accel/habanalabs: verify return code after scrubbing ARCs DCCMs
In case the KDMA fails scrubbing the DCCMs (following a soft-reset
upon device release), the driver will only print failure until reset
flow ends, rather than escalating it into a hard-reset.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Tomer Tayar
81e609bebb accel/habanalabs: use notifications and graceful reset for decoder
Add notifications to user in case of decoder abnormal interrupts, and
use the graceful reset mechanism if reset is required.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Dafna Hirschfeld
86b74d8438 accel/habanalabs: assert return value of hw_fini
Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Koby Elbaz
d85f0531b9 accel/habanalabs: break is_idle function into per-engine sub-routines
is_idle() was too long, so break it up for readability.
In addition, we can now use the new sub-routines from other places.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Sagiv Ozeri
efbd36b281 accel/habanalabs: add device id to all threads names
Compute driver threads names will start with hlX-*, when X is the
device id.
This will help distinguish them from the NIC thread names.

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Tomer Tayar
b041e788a7 accel/habanalabs: add helper function to get vm hash node
Add a helper function to search the vm hash for a node with a given
virtual address.
As opposed to the current code, this function explicitly returns NULL
when no node is found, instead of basing on the loop cursor object's
value.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Tomer Tayar
d1bae8199a accel/habanalabs: remove unneeded irq_handler variable
'irq_handler' in gaudi2_enable_msix(), is just assigned with a function
name and then used when calling request_threaded_irq().
Remove the variable and use the function name directly as an argument.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:14 +02:00
Dafna Hirschfeld
5e09ae9203 accel/habanalabs: change hw_fini to return int to indicate error
We later use cpucp packet for soft reset which might fail
so we should be able propagate the failure case.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:14 +02:00
Tomer Tayar
a8c14f5388 accel/habanalabs: improve readability of engines idle mask print
Remove leading zeroes when printing the idle mask to make it clearer.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:14 +02:00
Sagiv Ozeri
66116a3981 accel/habanalabs: organize hl_device structure comment
Make the comments align with the order of the fields in the structure

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:14 +02:00
Tom Rix
3a621af637 accel/habanalabs: set hl_capture_*_err storage-class-specifier to static
smatch reports
drivers/accel/habanalabs/common/device.c:2619:6: warning:
  symbol 'hl_capture_hw_err' was not declared. Should it be static?
drivers/accel/habanalabs/common/device.c:2641:6: warning:
  symbol 'hl_capture_fw_err' was not declared. Should it be static?

both are only used in device.c, so they should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Tom Rix
69ff5bccbc accel/habanalabs: change unused extern decl of hdev to forward decl of hl_device
Building with clang W=2 has several similar warnings
drivers/accel/habanalabs/common/decoder.c:46:51: error: declaration shadows a variable in the global scope [-Werror,-Wshadow]
static void
dec_error_intr_work(struct hl_device *hdev, u32 base_addr, u32 core_id)
                                                  ^
drivers/accel/habanalabs/common/security.h:13:26: note: previous declaration is here
extern struct hl_device *hdev;
                         ^

There is no global definition of hdev, so the extern is not needed.
Searched with
grep -r '^struct' . | grep hl_dev

Change to an forward decl to resolve these issues
drivers/accel/habanalabs/common/mmu/../security.h:133:40: error: ‘struct hl_device’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
  133 |         bool (*skip_block_hook)(struct hl_device *hdev,
      |                                        ^~~~~~~~~

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Dafna Hirschfeld
4a2e9d11fc accel/habanalabs: don't trace cpu accessible dma alloc/free
The cpu accessible dma allocations use the gen_pool api which actually
does not allocate new memory from the system but manages memory already
allocated before. When tracing this together with real dma
allocation/free it cause confusing logs like a '0' dma address and
a cpu address appearing twice etc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Dafna Hirschfeld
1d0f9ad7ce accel/habanalabs: in hl_device_reset small refactor for readabilty
in the out_err flow, combine the two cases of soft-reset since
they have mostly common code. In addition unlock reset_info.lock
after touching reset count.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Dafna Hirschfeld
39ab4da9c1 accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft'
Because this field is only used for debug print,
we can do more precise debug directly instead.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Koby Elbaz
3b3d853a84 accel/habanalabs: rename security function parameters
To match their description above the function

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Dafna Hirschfeld
7810c5244d accel/habanalabs: tiny refactor of hl_device_reset for readability
Align assignment of reset_upon_device_release to the convention used
in this function.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Tomer Tayar
34bef313e4 accel/habanalabs: remove hl_irq_handler_default()
hl_irq_handler_default() is not used and can be removed.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Tomer Tayar
b87b8b3e72 accel/habanalabs: fix print in hl_irq_handler_eq()
"eq_base[eq->ci].hdr.ctl" is used directly in a print without a
le32_to_cpu() conversion.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ofir Bitton
4713ace324 accel/habanalabs: add support for TPC assert
In order to allow TPC engines to raise an assert, we must expose
the relevant MSIX interrupt to the user so he will configure the engine
correctly. In addition, we implement the corresponding interrupt
handler that will notify the user upon such an event.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ofir Bitton
bcfcd084aa accel/habanalabs: capture interrupt timestamp in handler
In order for interrupt timestamp to be more accurate we should
capture it during the interrupt handling rather than in threaded
irq context.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Tal Cohen
6012235899 accel/habanalabs: change user interrupt to threaded IRQ
We prefer not to handle the user interrupt job inside the interrupt
context. Instead, use threaded IRQ to handle the user interrupts.
This will allow to avoid disabling interrupts when the user process
registers for a new event and to avoid long handling inside an
interrupt.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ohad Sharabi
bca98be5d4 accel/habanalabs: modify events reset policy
The policy file of the events reset has been modified.
This change is reflected in the autogenerated file.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:13 +02:00
Ohad Sharabi
32231b6c30 accel/habanalabs: get reset type indication from irq_map
When getting an event, add the ability to deduce the reset type from
the IRQ map table instead of using hard reset regardless.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Tomer Tayar
18d1358459 accel/habanalabs: enable graceful reset mechanism for compute-reset
The graceful reset mechanism is currently enabled only for reset
requests that will end up with hard-reset.
In future, reset requests due to errors in some device engines, are
going to be modified to request compute-reset, as the much longer
hard-reset is not really needed there.
To allow it, enable graceful reset also for compute-reset, and reset
after user releases the device won't be escalated to hard-reset in those
cases.
If watchdog expires and user didn't release the device, hard-reset will
be initiated in any case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Koby Elbaz
57479adb41 accel/habanalabs: disable PCI when escalating compute to hard-reset
In case a compute reset has failed or a request for a hard reset has
just arrived, then we escalate current reset procedure from compute
to hard-reset.
In such a case, the FW should be aware of the updated error cause,
and if LKD is the one who performs the reset (rather than the FW),
then we ask the FW to disable PCI access.

We would also like to have relevant debug info and therefore
we print the currently escalating reset type.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Moti Haimovski
e0ad6bad80 accel/habanalabs: minimize error prints when mem map fails
This commit minimizes the "chain of errors" displayed when memory
mapping fails.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Koby Elbaz
e38f88e42a accel/habanalabs: unsecure CFG_TPC_ID register
Required to allow the TPC compiler to know on which offset of the index
space it works on.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Ofir Bitton
7fc0d011c3 accel/habanalabs: expose engine core int reg address
In order for engine cores to raise interrupts towards FW, They need
to know which register the event data should be written to.
Hence, we forward the relevant scratchpad register received during
dynamic regs handshake with FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Moti Haimovski
313e9f63b7 accel/habanalabs: add critical-event bit in notifier
Enhance the existing user notifications by adding a HW and FW critical
event bits to be used when a HW or FW event occur that requires
both SW abort and hard-resetting the chip.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Tomer Tayar
09524eb882 accel/habanalabs: enforce release order of compute device and dma-buf
When user closes the compute device file descriptor without closing a
dma-buf file descriptor, the device will be considered as in use,
leading to hard reset and killing the user process, to ensure the
release of the dma-buf.
Same thing will happen if user first releases the compute device file
and only then the dma-buf.

The implication of this is the duration of hard reset, during which the
device cannot be reacquired.
Moreover, this behavior adds a constraint on a user process to follow
this order of release operations.

To avoid killing the user process and to remove this constraint, enforce
the correct order of release operations inside the driver, by
incrementing the device file refcount for any dma-buf until it is
released.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Tomer Tayar
d43bce6e76 accel/habanalabs: add info when FD released while device still in use
When user closes the device file descriptor, it is checked whether the
device is still in use, and a message is printed if it is.
To make this message more informative, add to this print also the reason
due to which the device is considered as in use.
The possible reasons which are checked for now are active CS and
exported dma-buf.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Dani Liberman
c0e6df9160 accel/habanalabs: fix address decode RAZWI handling
PSOC RAZWI handling code did not took into account single router that
supports several initiators with different XY coordinates. Also, it
ignored XY_HI coordinate. This caused 2 problems:
1. RAZWI handle ignored some initiators.
2. When getting PSOC RAZWI from some routers, there was a lot of
   possible engines which could have caused the RAZWI.

Fixed the above issue by handling PSOC RAZWI with both low and high
XY coordinates. This way driver supports all initiators and in
the worst case there are not more than 2 possible engines for RAZWI.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Tomer Tayar
6071e21325 accel/habanalabs: use memhash_node_export_put() in hl_release_dmabuf()
The same mutex lock/unlock and counter decrementing in
hl_release_dmabuf() is already done in the memhash_node_export_put()
helper function.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:12 +02:00
Oded Gabbay
271b2d5f30 accel/habanalabs: refactor debugfs init
Make it easier to later add support for accel device.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:11 +02:00
Oded Gabbay
323adae99f accel/habanalabs: save class in hdev
It is more concise than to pass it to device init. Once we will add the
accel class, then we won't need to change the function signatures.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:11 +02:00
Oded Gabbay
89859a8997 accel/habanalabs: split cdev creation to separate function
Move the cdev creation code from the main hdev init function to
a separate function. This will make the code more readable once we
add the accel registration code (instead/in addition to legacy
cdev).

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:11 +02:00
Koby Elbaz
4b9c2d3633 accel/habanalabs: capture RAZWI info only if HW indication detected
RAZWI handling routine is called from most EQ events,
no matter if a RAZWI happens or not.
This fix is added to verify the handler is called only if
a real RAZWI indication in HW has been detected.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:11 +02:00
Ofir Bitton
6168a83788 accel/habanalabs: increase user interrupt grace time
Currently we support scenarios where a timestamp registration request
of a certain offset is received during the interrupt handling of the
same offset. In this case we give a grace period of up to 100us for
the interrupt handler to finish.
It seems that sometimes the interrupt handling takes more than expected,
and therefore this path should be optimized. Until that happens, let's
increase the grace period in order not to reach timeout which will
cause user call to be rejected.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:11 +02:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Koby Elbaz
f7d67c1cfd habanalabs/gaudi2: find decode error root cause
When a decode error happens, we often don't know the exact root
cause (the erroneous address that was accessed) and the exact engine
that created the erroneous transaction.

To find out, we need to go over all the relevant register blocks
in the ASIC. Once we find the relevant engine, we print its details
and the offending address.

This helps tremendously when debugging an error that was created
by running a user workload.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:14 +02:00
Ofir Bitton
ce582bea86 habanalabs/gaudi2: unsecure tpc kernel_config registers
This is required in order to allow the kernel to control relevant
configuration space via load and store instructions.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:14 +02:00
Tomer Tayar
44155bb627 habanalabs: clear in_compute_reset when escalating to hard reset
If resetting device upon release while the release watchdog work is
scheduled, the compute reset is replaced with hard reset.
In this case, need to clear the in_compute_reset indication in the
device reset information structure.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:14 +02:00
Tomer Tayar
0c93eb098f habanalabs: run error handling if scrub_device_mem fails after reset
If device memory scrubbing from hl_device_reset() fails, we return with
an error code but not perform error handling code.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Moti Haimovski
eba773d30d habanalabs: enhance info printed on FW load errors
This commit enhances the following error messages to also provide the
type of error occurred, this in order to ease debugging of errors
detected during firmware-load.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Ofir Bitton
75b6984ef6 habanalabs: optimize command submission completion timestamp
Completion timestamp is taken during the actual command submission
release. As the release happens in a work queue, the timestamp taken
is not accurate. Hence, we will take the timestamp in the interrupt
handler itself while propagating it to the release function.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Ofir Bitton
9a7d530a80 habanalabs: refactor user interrupt type
In order to support more user interrupt types in the future, we
enumerate the user interrupt type instead of using a boolean.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Dani Liberman
12d3ea014d habanalabs/gaudi2: fix emda range registers razwi handling
Handling edma razwi is different than all other engines since edma
uses sft routers. For hbw transactions sft router contain separate
interface for each edma and for lbw there is common interface for
both edma engines of the same dcore.

To handle the razwi correctly we need to:
1. Simplify the calculation of the sft router address.
2. Add razwi handling for edma qm errors, since edma qman doesn't
   reports axi error response.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Koby Elbaz
a6685b573c habanalabs: block soft-reset on an unusable device
A device with status malfunction indicates that it can't be used.
In such a case we do not support certain reset types, e.g.,
all kinds of soft-resets (compute reset, inference soft-reset),
and reset upon device release.

A hard-reset is the only way that an unusable device can change its
status. All other reset procedures can't put the device in a reset
procedure, which might ultimately cause the device to change its
status, unintentionally, to become operational again.

Such a scenario has recently occurred, when a user requested
a hard-reset while another heavy user workload was ongoing (reset
request is queued).
Since the workload couldn't finish within reset's timeout limits, the
reset has failed and set a device status malfunction.
Eventually, when the user released the FD, an unsuccessful soft-reset
occurred, hence followed by an additional hard-reset that changed the
ASICs status back to be operational.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Dani Liberman
436479522d habanalabs/gaudi2: print page fault axi transaction id
AXI transaction id holds information about the initiator which caused
the page fault. In the future it will be translated automatically by
driver to an initiator name.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
Dani Liberman
c89d19f7a3 habanalabe/gaudi2: add cfg base when displaying razwi addresses
Captured addresses of low b/w razwi information contains only the
offset from the cfg base. To make it more user readable, add the cfg
base to it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
Dani Liberman
c21f9f3471 habanalabs/gaudi2: read mmio razwi information
In gaudi2 there night be different routers for low b/w and high b/w
transactions. But in the code that collects razwi information, we used
the same router for high b/w and low b/w.

Fixed it by reading the information also from low b/w routers.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
farah kassabri
ac5af9900f habanalabs: fix bug in timestamps registration code
Protect re-using the same timestamp buffer record before actually
adding it to the to interrupt wait list.
Mark ts buff offset as in use in the spinlock protection area of the
interrupt wait list to avoid getting in the re-use section in
ts_buff_get_kernel_ts_record before adding the node to the list.
this scenario might happen when multiple threads are racing on
same offset and one thread could set data in the ts buff in
ts_buff_get_kernel_ts_record then the other thread takes over
and get to ts_buff_get_kernel_ts_record and we will try
to re-use the same ts buff offset then we will try to
delete a non existing node from the list.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
farah kassabri
1693fef9e9 habanalabs: bugs fixes in timestamps buff alloc
use argument instead of fixed GFP value for allocation
in Timestamps buffers alloc function.
change data type of size to size_t.

Fixes: 9158bf69e7 ("habanalabs: Timestamps buffers registration")
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
farah kassabri
72848de04b habanalabs: check pad and reserved fields in ioctls
Make sure all reserved/pad fields in uapi input structures
are set to 0.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
XU pengfei
0970380a7e habanalabs: remove unnecessary (void*) conversions
data is a void * type and does not require a cast.

Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
Gustavo A. R. Silva
7d352a8160 habanalabs: Replace zero-length arrays with flexible-array members
Zero-length arrays are deprecated[1] and we are moving towards
adopting C99 flexible-array members instead. So, replace zero-length
arrays in a couple of structures with flex-array members.

This helps with the ongoing efforts to tighten the FORTIFY_SOURCE
routines on memcpy() and help us make progress towards globally
enabling -fstrict-flex-arrays=3 [2].

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [2]
Link: https://github.com/KSPP/linux/issues/78
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
Moti Haimovski
2a0a839b6a habanalabs: extend fatal messages to contain PCI info
This commit attaches the PCI device address to driver fatal messages
in order to ease debugging in multi-device setups.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
Dani Liberman
eaca606e88 habanalabs/gaudi2: remove use of razwi info received from f/w
Because f/w does not update razwi info when sending events, remove the
use of it.
The driver is responsible to check if razwi happened and to
collect razwi data.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:12 +02:00
Ohad Sharabi
54fcb384be habanalabs: trace LBW reads/writes
Add traces to LBW reads/writes.
This may be handy when debugging configuration failure or events when
tracking configuration flow.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Carmit Carmel
200f3cf047 habanalabs/gaudi2: fix log for sob value overflow/underflow
The value in SM_SEI_CAUSE includes the SOB index and not the SOB group
index.
Remove usage of log_mask in sm_sei_cause structure as it was never
used.

Signed-off-by: Carmit Carmel <ccarmel@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Ohad Sharabi
ab509d81c9 habanalabs: add set engines masks ASIC function
This function shall be used whenever components enable/binning masks
should be updated.

Usage is in one of the below cases:
- update user (or default) component masks
- update when getting the masks from FW (either CPUCP or COMMS)

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Koby Elbaz
571d1a7222 habanalabs: protect access to dynamic mem 'user_mappings'
When HL_INFO_USER_MAPPINGS IOCTL is called, we copy_to_user from
a dynamically allocated memory - 'user_mappings'.
Since freeing/allocating it happens in runtime (upon a page fault),
it not unlikely to access it even before being initially allocated
(i.e., accessing a NULL pointer).

The solution is to simply mark the spot when the err info has been
collected, and that way to know whether err info (either page fault
or RAZWI) is available to be read.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Tom Rix
c7d7b9aca2 habanalabs: remove redundant memset
From reviewing the code, the line
  memset(kdata, 0, usize);
is not needed because kdata is either zeroed by
  kdata = kzalloc(asize, GFP_KERNEL);
when allocated at runtime or by
  char stack_kdata[128] = {0};
at compile time.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Koby Elbaz
78baccbdc3 habanalabs: refactor razwi/page-fault information structures
This refactor makes the code clearer and the new variables' names
better describe their roles.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Koby Elbaz
6cfb00139d habanalabs/gaudi2: avoid reconfiguring the same PB registers
It appears that, within the sync manager security configuration,
we reconfigure PB registers over and over without any need to do that.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Ofir Bitton
4083697a36 habanalabs/gaudi: allow device acquire while in debug mode
During device acquire, the driver is using a QMAN for clearing some
registers. In order to avoid internal races, the driver verifies
the device is idle before submitting the register clear job.

This check introduces an issue, as debug mode will cause the device
to be non-idle which will lead to device acquire failure.

In order to overcome this issue we can entirely remove the idle
check as the driver is using the QMAN only when there is no active
context.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:11 +02:00
Oded Gabbay
e1e8e7472b habanalabs: move some prints to debug level
When entering an IOCTL, the driver prints a message in case device is
not operational. This message should be printed in debug level as
it can spam the kernel log and it is not an error.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Oded Gabbay
139dad0471 habanalabs: update f/w files
Update common firmware files with the latest version.
There is no functional change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Oded Gabbay
bcace6f058 habanalabs/gaudi2: update f/w files
Update gaudi2 firmware files with the latest version.
There is no functional change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Oded Gabbay
2fd7db3c80 habanalabs/gaudi2: update asic register files
Update some register files with the latest h/w auto-generated files.
There is no functional change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Tomer Tayar
e2a079a206 habanalabs: verify that kernel CB is destroyed only once
Remove the distinction between user CB and kernel CB, and verify for
both that they are not destroyed more than once.

As kernel CB might be taken from the pre-allocated CB pool, so we need
to clear the handle destroyed indication when returning a CB to the
pool.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Ohad Sharabi
20faaeec37 habanalabs: add uapi to flush inbound HBM transactions
When doing p2p with a NIC device, the NIC needs to make sure all the
writes to the HBM (through the PCI bar of the Gaudi device) were
flushed.

It can be done by either the NIC or the host reading through the PCI
bar.

To support the host side, we supply a simple uapi to perform this flush
through the driver, because the user can't create such a transaction
by itself (the PCI bar isn't exposed to normal users).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00
Oded Gabbay
e65e175b07 habanalabs: move driver to accel subsystem
Now that we have a subsystem for compute accelerators, move the
habanalabs driver to it.

This patch only moves the files and fixes the Makefiles. Future
patches will change the existing code to register to the accel
subsystem and expose the accel device char files instead of the
habanalabs device char files.

Update the MAINTAINERS file to reflect this change.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:10 +02:00