Commit Graph

188 Commits

Author SHA1 Message Date
YuanShang
c00d8b79fd drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
The field job->vm is used in function amdgpu_job_run to get the page
table re-generation counter and decide whether the job should be skipped.

Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use.
For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed,
then the gfx job should be skipped.

Fixes: 26c95e838e ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare")
Signed-off-by: YuanShang <YuanShang.Mao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ed76936c6b10b547c6df4ca75412331e9ef6d339)
Cc: stable@vger.kernel.org
2025-08-04 15:41:09 -04:00
Dave Airlie
acab5fbd77 amd-drm-next-6.17-2025-07-17:
amdgpu:
 - Partition fixes
 - Reset fixes
 - RAS fixes
 - i2c fix
 - MPC updates
 - DSC cleanup
 - EDID fixes
 - Display idle D3 update
 - IPS updates
 - DMUB updates
 - Retimer fix
 - Replay fixes
 - Fix DC memory leak
 - Initial support for smartmux
 - DCN 4.0.1 degamma LUT fix
 - Per queue reset cleanups
 - Track ring state associated with a fence
 - SR-IOV fixes
 - SMU fixes
 - Per queue reset improvements for GC 9+ compute
 - Per queue reset improvements for GC 10+ gfx
 - Per queue reset improvements for SDMA 5+
 - Per queue reset improvements for JPEG 2+
 - Per queue reset improvements for VCN 2+
 - GC 8 fix
 - ISP updates
 
 amdkfd:
 - Enable KFD on LoongArch
 
 radeon:
 - Drop console lock during suspend/resume
 
 UAPI:
 - Add userq slot info to INFO IOCTL
   Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaHlrvgAKCRC93/aFa7yZ
 2LuDAP9c5phb6TBIFEmLlQEcZ+Ngx8LVkEl1JIfAhwuAwN2V0wD9H1pVVNDXmFlh
 jfB8hLYikqnxYznXuImvutGEBHM8BgQ=
 =W2D8
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-17' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-17:

amdgpu:
- Partition fixes
- Reset fixes
- RAS fixes
- i2c fix
- MPC updates
- DSC cleanup
- EDID fixes
- Display idle D3 update
- IPS updates
- DMUB updates
- Retimer fix
- Replay fixes
- Fix DC memory leak
- Initial support for smartmux
- DCN 4.0.1 degamma LUT fix
- Per queue reset cleanups
- Track ring state associated with a fence
- SR-IOV fixes
- SMU fixes
- Per queue reset improvements for GC 9+ compute
- Per queue reset improvements for GC 10+ gfx
- Per queue reset improvements for SDMA 5+
- Per queue reset improvements for JPEG 2+
- Per queue reset improvements for VCN 2+
- GC 8 fix
- ISP updates

amdkfd:
- Enable KFD on LoongArch

radeon:
- Drop console lock during suspend/resume

UAPI:
- Add userq slot info to INFO IOCTL
  Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html)

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250717213827.2061581-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-21 11:57:43 +10:00
Alex Deucher
6ac55eab4f drm/amdgpu: move reset support type checks into the caller
Rather than checking in the callbacks, check if the reset
type is supported in the caller.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:56 -04:00
Alex Deucher
77cc0da39c drm/amdgpu: track ring state associated with a fence
We need to know the wptr and sequence number associated
with a fence so that we can re-emit the unprocessed state
after a ring reset.  Pre-allocate storage space for
the ring buffer contents and add helpers to save off
and re-emit the unprocessed state so that it can be
re-emitted after the queue is reset.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:11 -04:00
Maíra Canal
0a5dc1b67e
drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
Among the scheduler's statuses, the only one that indicates an error is
DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV
signifies that the operation succeeded and the GPU is in a nominal state.

However, to provide more information about the GPU's status, it is needed
to convey more information than just "OK".

Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to
DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this
status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has
hung, but it has been successfully reset and is now in a nominal state
again.

Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-1-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
2025-07-15 08:27:00 -03:00
André Almeida
667efb3419 drm/amdgpu: Fix lifetime of struct amdgpu_task_info after ring reset
When a ring reset happens, amdgpu calls drm_dev_wedged_event() using
struct amdgpu_task_info *ti as one of the arguments. After using *ti, a
call to amdgpu_vm_put_task_info(ti) is required to correctly track its
lifetime.

However, it's called from a place that the ring reset path never reaches
due to a goto after drm_dev_wedged_event() is called. Move
amdgpu_vm_put_task_info() bellow the exit label to make sure that it's
called regardless of the code path.

amdgpu_vm_put_task_info() can only accept a valid address or NULL as
argument, so initialise *ti to make sure we can call this function if
*ti isn't used.

Fixes: a72002cb18 ("drm/amdgpu: Make use of drm_wedge_task_info")
Reported-by: Dave Airlie <airlied@gmail.com>
Closes: https://lore.kernel.org/dri-devel/CAPM=9tz0rQP8VZWKWyuF8kUMqRScxqoa6aVdwWw9=5yYxyYQ2Q@mail.gmail.com/
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250704030629.1064397-1-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-07-10 18:41:21 -03:00
Dave Airlie
7e2818386a amd-drm-next-6.17-2025-07-01:
amdgpu:
 - FAMS2 fixes
 - OLED fixes
 - Misc cleanups
 - AUX fixes
 - DMCUB updates
 - SR-IOV hibernation support
 - RAS updates
 - DP tunneling fixes
 - DML2 fixes
 - Backlight improvements
 - Suspend improvements
 - Use scaling for non-native modes on eDP
 - SDMA 4.4.x fixes
 - PCIe DPM fixes
 - SDMA 5.x fixes
 - Cleaner shader updates for GC 9.x
 - Remove fence slab
 - ISP genpd support
 - Parition handling rework
 - SDMA FW checks for userq support
 - Add missing firmware declaration
 - Fix leak in amdgpu_ctx_mgr_entity_fini()
 - Freesync fix
 - Ring reset refactoring
 - Legacy dpm verbosity changes
 
 amdkfd:
 - GWS fix
 - mtype fix for ext coherent system memory
 - MMU notifier fix
 - gfx7/8 fix
 
 radeon:
 - CS validation support for additional GL extensions
 - Bump driver version for new CS validation checks
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaGQ6hwAKCRC93/aFa7yZ
 2JVJAQC/IpZoSmW22VrtXjBiQ3yoYt61oOM/WJVXPV11FmisyQEAqQ9nc+/lY9SM
 s5/RskaQGlfCGqafnaSGuoILkD/YDQ4=
 =i8mT
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-01:

amdgpu:
- FAMS2 fixes
- OLED fixes
- Misc cleanups
- AUX fixes
- DMCUB updates
- SR-IOV hibernation support
- RAS updates
- DP tunneling fixes
- DML2 fixes
- Backlight improvements
- Suspend improvements
- Use scaling for non-native modes on eDP
- SDMA 4.4.x fixes
- PCIe DPM fixes
- SDMA 5.x fixes
- Cleaner shader updates for GC 9.x
- Remove fence slab
- ISP genpd support
- Parition handling rework
- SDMA FW checks for userq support
- Add missing firmware declaration
- Fix leak in amdgpu_ctx_mgr_entity_fini()
- Freesync fix
- Ring reset refactoring
- Legacy dpm verbosity changes

amdkfd:
- GWS fix
- mtype fix for ext coherent system memory
- MMU notifier fix
- gfx7/8 fix

radeon:
- CS validation support for additional GL extensions
- Bump driver version for new CS validation checks

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-04 10:06:29 +10:00
Alex Deucher
38b20968f3 drm/amdgpu: move scheduler wqueue handling into callbacks
Move the scheduler wqueue stopping and starting into
the ring reset callbacks.  On some IPs we have to reset
an engine which may have multiple queues.  Move the wqueue
handling into the backend so we can handle them as needed
based on the type of reset available.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:58:22 -04:00
Alex Deucher
43ca5eb94b drm/amdgpu: move guilty handling into ring resets
Move guilty logic into the ring reset callbacks.  This
allows each ring reset callback to better handle fence
errors and force completions in line with the reset
behavior for each IP.  It also allows us to remove
the ring guilty callback since that logic now lives
in the reset callback.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:36 -04:00
Alex Deucher
2dee58ca47 drm/amdgpu: move force completion into ring resets
Move the force completion handling into each ring
reset function so that each engine can determine
whether or not it needs to force completion on the
jobs in the ring.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:29 -04:00
Christian König
821aacb2dc drm/amdgpu: rework queue reset scheduler interaction
Stopping the scheduler for queue reset is generally a good idea because
it prevents any worker from touching the ring buffer.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:21 -04:00
Alex Deucher
787e2ce10f drm/amdgpu: update ring reset function signature
Going forward, we'll need more than just the vmid.  Add the
guilty amdgpu_fence.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:16 -04:00
Alex Deucher
bf1cd14f9e drm/amdgpu: switch job hw_fence to amdgpu_fence
Use the amdgpu fence container so we can store additional
data in the fence.  This also fixes the start_time handling
for MCBP since we were casting the fence to an amdgpu_fence
and it wasn't.

Fixes: 3f4c175d62 ("drm/amdgpu: MCBP based on DRM scheduler (v9)")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
André Almeida
a72002cb18 drm/amdgpu: Make use of drm_wedge_task_info
To notify userspace about which task (if any) made the device get in a
wedge state, make use of drm_wedge_task_info parameter, filling it with
the task PID and name.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-7-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:48 -03:00
André Almeida
183bccafa1 drm: Create a task info option for wedge events
When a device get wedged, it might be caused by a guilty application.
For userspace, knowing which task was involved can be useful for some
situations, like for implementing a policy, logs or for giving a chance
for the compositor to let the user know what task was involved in the
problem.  This is an optional argument, when the task info is not
available, the PID and TASK string won't appear in the event string.

Sometimes just the PID isn't enough giving that the task might be already
dead by the time userspace will try to check what was this PID's name,
so to make the life easier also notify what's the task's name in the user
event.

Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-4-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida
3bfd1af74a drm: amdgpu: Create amdgpu_vm_print_task_info()
To avoid repetitive code in amdgpu, create a function that prints the
content of struct amdgpu_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
Pierre-Eric Pelloux-Prayer
2956554823 drm/sched: Store the drm client_id in drm_sched_fence
This will be used in a later commit to trace the drm client_id in
some of the gpu_scheduler trace events.

This requires changing all the users of drm_sched_job_init to
add an extra parameter.

The newly added drm_client_id field in the drm_sched_fence is a bit
of a duplicate of the owner one. One suggestion I received was to
merge those 2 fields - this can't be done right now as amdgpu uses
some special values (AMDGPU_FENCE_OWNER_*) that can't really be
translated into a client id. Christian is working on getting rid of
those; when it's done we should be able to squash owner/drm_client_id
together.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-3-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:15:58 +02:00
Christian König
bd22e44ad4 drm/amdgpu: rework how isolation is enforced v2
Limiting the number of available VMIDs to enforce isolation causes some
issues with gang submit and applying certain HW workarounds which
require multiple VMIDs to work correctly.

So instead start to track all submissions to the relevant engines in a
per partition data structure and use the dma_fences of the submissions
to enforce isolation similar to what a VMID limit does.

v2: use ~0l for jobs without isolation to distinct it from kernel
    submissions which uses NULL for the owner. Add some warning when we
    are OOM.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-21 12:16:34 -04:00
André Almeida
099f273eff drm/amdgpu: Trigger a wedged event for ring reset
Instead of only triggering a wedged event for complete GPU resets,
trigger for ring resets. Regardless of the reset, it's useful for
userspace to know that it happened because the kernel will reject
further submissions from that app.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-10 14:56:25 -04:00
Dave Airlie
236f475d29 amdgpu:
- Fix spelling typos
 - RAS updates
 - VCN 5.0.1 updates
 - SubVP fixes
 - DCN 4.0.1 fixes
 - MSO DPCD fixes
 - DIO encoder refactor
 - PCON fixes
 - Misc cleanups
 - DMCUB fixes
 - USB4 DP fixes
 - DM cleanups
 - Backlight cleanups and fixes
 - Support platform backlight curves
 - Misc code cleanups
 - SMU 14 fixes
 - JPEG 4.0.3 reset updates
 - SR-IOV fixes
 - SVM fixes
 - GC 12 DCC fixes
 - DC DCE 6.x fix
 - Hiberation fix
 
 amdkfd:
 - Fix possible NULL pointer in queue validation
 - Remove unnecessary CP domain validation
 - SDMA queue reset support
 - Add per process flags
 
 radeon:
 - Fix spelling typos
 - RS400 hyperZ fix
 
 UAPI:
 - Add KFD per process flags for setting precision
   Proposed user space: 2a64fa5e06
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZ8tfmgAKCRC93/aFa7yZ
 2CmkAP4xoy09aHecz6WOdNN9VBJiSMSZ3UOyBL9Ll2i5nI3YegEA7Co+AmQpXOPn
 HKeD6Vy1tt1XE/Liuur5dbLUKsVYSAI=
 =IQ45
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.15-2025-03-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amdgpu:
- Fix spelling typos
- RAS updates
- VCN 5.0.1 updates
- SubVP fixes
- DCN 4.0.1 fixes
- MSO DPCD fixes
- DIO encoder refactor
- PCON fixes
- Misc cleanups
- DMCUB fixes
- USB4 DP fixes
- DM cleanups
- Backlight cleanups and fixes
- Support platform backlight curves
- Misc code cleanups
- SMU 14 fixes
- JPEG 4.0.3 reset updates
- SR-IOV fixes
- SVM fixes
- GC 12 DCC fixes
- DC DCE 6.x fix
- Hiberation fix

amdkfd:
- Fix possible NULL pointer in queue validation
- Remove unnecessary CP domain validation
- SDMA queue reset support
- Add per process flags

radeon:
- Fix spelling typos
- RS400 hyperZ fix

UAPI:
- Add KFD per process flags for setting precision
  Proposed user space: 2a64fa5e06

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250307211051.1880472-1-alexander.deucher@amd.com
2025-03-10 09:04:52 +10:00
André Almeida
9c696cc57c drm/amdgpu: Create a debug option to disable ring reset
Prior to the addition of ring reset, the debug option
`debug_disable_soft_recovery` could be used to force a full device
reset. Now that we have ring reset, create a debug option to disable
them in amdgpu, forcing the driver to go with the full device
reset path again when both options are combined.

This option is useful for testing and debugging purposes when one wants
to test the full reset from userspace.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-27 16:50:04 -05:00
André Almeida
b7fd6528b5 drm/amdgpu: Log after a successful ring reset
When a ring reset happens, the kernel log shows only "amdgpu: Starting
<ring name> ring reset", but when it finishes nothing appears in the
log. Explicitly write in the log that the reset has finished correctly.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25 11:44:00 -05:00
Jesse.zhang@amd.com
c94943b086 drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
This patch updates the `amdgpu_job_timedout` function to check if
the ring is actually guilty of causing the timeout. If not, it
skips error handling and fence completion.

v2: move the is_guilty check down into the queue reset area (Alex)
v3: need to call is_guilty before reset (Alex)
v4: squash in is_guilty logic fixes (Alex)

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25 11:43:59 -05:00
Tvrtko Ursulin
80b6ef8ae2 drm/amdgpu: Pop jobs from the queue more robustly
Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly
added drm_sched_entity_queue_pop.

This allows breaking the hidden dependency that queue_node has to be the
first element in struct drm_sched_job.

A comment is also added with a reference to the mailing list discussion
explaining the copied helper will be removed when the whole broken
amdgpu_job_stop_all_jobs_on_sched is removed.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Zhang, Hawking <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250221105038.79665-3-tvrtko.ursulin@igalia.com
2025-02-24 10:17:40 +01:00
Christian König
11815bb0e3 drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cb.

This commit introduced a new state variable into adev without even
remotely worrying about CPU barriers.

Since we already have the amdgpu_in_reset() function exactly for this
use case partially revert that.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Christian König
26c95e838e drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare
As soon as the prepare phase is completed the VM might be released,
better set it to NULL.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Pierre-Eric Pelloux-Prayer
54a1b36d4b drm/amdgpu: remove useless init from amdgpu_job_alloc
This init is useless because base.sched will be cleared to 0 in drm_sched_job_init
because of commit 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()").

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:17:46 -05:00
Pierre-Eric Pelloux-Prayer
0014952b17 drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_free
It's unused.

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:17:32 -05:00
Pierre-Eric Pelloux-Prayer
2ae520cb12 drm/amdgpu: don't access invalid sched
Since 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()")
accessing job->base.sched can produce unexpected results as the initialisation
of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the
memset.

This commit fixes an issue when a CS would fail validation and would
be rejected after job->num_ibs is incremented. In this case,
amdgpu_ib_free(ring->adev, ...) will be called, which would crash the
machine because the ring value is bogus.

To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this
because the device is actually not used in this function.

The next commit will remove the ring argument completely.

Fixes: 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:16:35 -05:00
Dave Airlie
1f8bdc31c7 amd-drm-next-6.13-2024-11-06:
amdgpu:
 - Misc cleanups
 - OLED fixes
 - DCN 4.x fixes
 - DCN 3.5 fixes
 - 8K fixes
 - IPS fixes
 - DSC fixes
 - S3 fix
 - KASAN fix
 - SMU13 fixes
 - fdinfo fixes
 - USB-C fixes
 - ACPI fix
 - Fix dummy page overlapping mappings
 - Fix workload profile handling
 - Add user control for zero RPM on SMU13
 - Cleaner shader updates
 - Stop syncing PRT map operations
 - Debugfs permissions fixes
 - Debugfs bounds check fix
 - RAS cleanups
 - Enforce isolation updates
 
 amdkfd:
 - Add topology cap flag for per queue reset
 - Add an interface to query whether KFD queues are present
 - Use dynamic allocation for get_cu_occupancy
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZyua0wAKCRC93/aFa7yZ
 2DjiAP9aBOidQQX+qgq9brFBcm6QlSOFKnOf8ZNKJEZ3yYOYBwEAv7EY0S2xnox1
 UrmLDd8APpVJZhDbQgJWaQUe09fkIgg=
 =G1Jb
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.13-2024-11-06' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.13-2024-11-06:

amdgpu:
- Misc cleanups
- OLED fixes
- DCN 4.x fixes
- DCN 3.5 fixes
- 8K fixes
- IPS fixes
- DSC fixes
- S3 fix
- KASAN fix
- SMU13 fixes
- fdinfo fixes
- USB-C fixes
- ACPI fix
- Fix dummy page overlapping mappings
- Fix workload profile handling
- Add user control for zero RPM on SMU13
- Cleaner shader updates
- Stop syncing PRT map operations
- Debugfs permissions fixes
- Debugfs bounds check fix
- RAS cleanups
- Enforce isolation updates

amdkfd:
- Add topology cap flag for per queue reset
- Add an interface to query whether KFD queues are present
- Use dynamic allocation for get_cu_occupancy

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241106163904.189108-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-11-08 12:04:24 +10:00
Alex Deucher
35984fd4a0 drm/amdgpu: add ring reset messages
Add messages to make it clear when a per ring reset
happens.  This is helpful for debugging and aligns with
other reset methods.

v2: add ring name in success/fail messages (Lijo)

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-04 11:29:49 -05:00
Dave Airlie
e7103f8785 amd-drm-next-6.13-2024-10-25:
amdgpu:
 - SDMA queue reset support
 - SMU 13.0.6 updates
 - Add debugfs interface to help limit jpeg queue scheduling for testing
 - JPEG 4.0.3 updates
 - Initial runtime repartitioning support
 - GFX9 fixes
 - Misc code cleanups
 - Rework IP structures to better handle multiple instances of an IP
 - DML updates
 - DSC fixes
 - HDR fixes
 - Brightness control updates
 - Runtime pm cleanup
 - DMCUB fixes
 - DCN 3.5 updates
 - Struct drm_edid cleanup
 - Fetch EDID from _DDC if available
 - Ring noop optimizations
 - MES logging fixes
 - 3DLUT fixes
 - DCN 4.x fixes
 - SMU 13.x fixes
 - Fixes for set_soft_freq_range()
 - ACPI fixes
 - SMU 14.x updates
 - PSR-SU fixes
 - fdinfo cleanup
 - DCN documentation updates
 
 amdkfd:
 - Misc code cleanups
 - Increase event FIFO size
 - Copy wave state fixes for SDMA
 
 radeon:
 - Fix possible overflow in packet3 check
 - Late init connector fix
 - Always set GEM function pointer
 
 Documentation:
 - Update drm-memory documentation
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZxua4QAKCRC93/aFa7yZ
 2C/TAQC3PZqI36hkKOPwdcbFq2ydK1r3xiG7Q60K0PxpTnsqKQEAuF1MEuTXfamv
 mVqZfJuqF3wWXzoqM190qf3947f0eQk=
 =MSZa
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.13-2024-10-25' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.13-2024-10-25:

amdgpu:
- SDMA queue reset support
- SMU 13.0.6 updates
- Add debugfs interface to help limit jpeg queue scheduling for testing
- JPEG 4.0.3 updates
- Initial runtime repartitioning support
- GFX9 fixes
- Misc code cleanups
- Rework IP structures to better handle multiple instances of an IP
- DML updates
- DSC fixes
- HDR fixes
- Brightness control updates
- Runtime pm cleanup
- DMCUB fixes
- DCN 3.5 updates
- Struct drm_edid cleanup
- Fetch EDID from _DDC if available
- Ring noop optimizations
- MES logging fixes
- 3DLUT fixes
- DCN 4.x fixes
- SMU 13.x fixes
- Fixes for set_soft_freq_range()
- ACPI fixes
- SMU 14.x updates
- PSR-SU fixes
- fdinfo cleanup
- DCN documentation updates

amdkfd:
- Misc code cleanups
- Increase event FIFO size
- Copy wave state fixes for SDMA

radeon:
- Fix possible overflow in packet3 check
- Late init connector fix
- Always set GEM function pointer

Documentation:
- Update drm-memory documentation

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241025132336.2416913-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-10-29 18:25:24 +10:00
Dave Airlie
7fefa1edc2 Merge tag 'drm-misc-next-2024-09-20' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for v6.12:

UAPI Changes:
- Add panthor/DEV_QUERY_TIMESTAMP_INFO query.

Cross-subsystem Changes:
- Updated dt bindings.
- Add documentation explaining default errnos for fences.
- Mark dma-buf heaps creation functions as __init.

Core Changes:
- Split DSC helpers from DP helpers.
- Clang build fixes for drm/mm test.
- Remove simple pipeline support for gem-vram,
  no longer any users left after converting bochs.
- Add erno to drm_sched_start to distinguish between GPU and queue
  reset.
- Add drm_framebuffer testcases.
- Fix uninitialized spinlock acquisition with CONFIG_DRM_PANIC=n.
- Use read_trylock instead of read_lock in dma_fence_begin_signalling to
  quiesce lockdep.

Driver Changes:
- Assorted small fixes and updates for tegra, host1x, imagination,
  nouveau, panfrost, panthor, panel/ili9341, mali, exynos,
  panel/samsung-s6e3fa7, ast, bridge/ti-sn65dsi86, panel/himax-hx83112a,
  bridge/tc358767, bridge/imx8mp-hdmi-tx, panel/khadas-ts050,
  panel/nt36523, panel/sony-acx565akm, kmb, accel/qaic, omap, v3d.
- Add bridge/TI TDP158.
- Assorted documentation updates.
- Convert bochs from simple drm to gem shmem, and check modes
  against available memory.
- Many VC4 fixes, most related to scaling and YUV support.
- Convert some drivers to use SYSTEM_SLEEP_PM_OPS and RUNTIME_PM_OPS.
- Rockchip 4k@60 support.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/445713a6-2427-4c53-8ec2-3a894ec62405@linux.intel.com
2024-10-09 09:03:46 +10:00
Tvrtko Ursulin
89cfa73b61 drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job
While loop makes it sound like amdgpu_vmid_grab() potentially needs to be
called multiple times to produce a fence, while in reality all code paths
either return an error, assign a valid job->vmid or assign a vmid which
will be valid once the returned fence signals.

Therefore we can remove the loop to make it clear the call does not need
to be repeated.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-08 09:43:43 -04:00
Tvrtko Ursulin
871f44b4ba drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job
Fence has been initialised to NULL so no need to test it.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-08 09:43:39 -04:00
Sunil Khatri
fa73462dc0 drm/amdgpu: update the handle ptr in dump_ip_state
Update the ptr handle to amdgpu_ip_block ptr in all
the functions.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-01 17:28:51 -04:00
ZhenGuo Yin
e1d27f7a9c drm/amdgpu: skip coredump after job timeout in SRIOV
VF FLR will be triggered by host driver before job timeout,
hence the error status of GPU get cleared. Performing a
coredump here is unnecessary.

Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25 12:55:52 -04:00
Thomas Zimmermann
61b86391fb Merge drm/drm-next into drm-misc-next
Backmerging to get fixes from v6.12-rc7.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-09-11 09:48:49 +02:00
Christian König
b2ef808786 drm/sched: add optional errno to drm_sched_start()
The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult
to distinguish between recovery methods, whether a queue reset or a
full GPU reset was used.

To improve this, we first try a soft recovery for timeout jobs and
use the error code -ENODATA. If soft recovery fails, we proceed with
a queue reset, where the error code remains -ENODATA for the job.
Finally, for a full GPU reset, we use error codes -ECANCELED or
-ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset
failures. This enables user mode and test applications to validate
the expected correctness of the requested operation. After a
successful queue reset, the only way to continue normal operation is
to call drm_sched_job_done with the specific error code -ENODATA.

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
    and amdgpu_device_unlock_reset_domain to allow user mode to track
    the queue reset status and distinguish between queue reset and
    GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
    and -ECANCELED or -ETIME for GPU reset, returned to
    amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
    drm_sched_start_ex with an additional parameter to set
    dma_fence_set_error, allowing us to handle the specific error
    codes appropriately and dispose of bad jobs with the selected
    error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
    which more accurately describes the function's purpose.
    Additionally, it was recommended to add documentation details
    about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message,
          drop the new function again and update all callers,
          apply the errno also to scheduler fences with hw fences
v7 (chk): rebased

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826122541.85663-1-christian.koenig@amd.com
2024-09-06 18:05:52 +02:00
Sunil Khatri
30e8f4c2bd drm/amdgpu: Move the dumping log out of for loop
log message "Dumping IP State Completed" needs to
be logged only once when state dumping is complete.

Hence moving it out of the for loop.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Trigger Huang <Trigger.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:19 -04:00
Trigger Huang
c67db6a6a6 drm/amdgpu: Do core dump immediately when job tmo
Do the coredump immediately after a job timeout to get a closer
representation of GPU's error status.

V2: This will skip printing vram_lost as the GPU reset is not
happened yet (Alex)

V3: Unconditionally call the core dump as we care about all the reset
functions(soft-recovery and queue reset and full adapter reset, Alex)

V4: Do the dump after adev->job_hang = true (Sunil)

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Acked-by:  Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:00 -04:00
Daniel Vetter
e55ef65510 amd-drm-next-6.12-2024-08-26:
amdgpu:
 - SDMA devcoredump support
 - DCN 4.0.1 updates
 - DC SUBVP fixes
 - Refactor OPP in DC
 - Refactor MMHUBBUB in DC
 - DC DML 2.1 updates
 - DC FAMS2 updates
 - RAS updates
 - GFX12 updates
 - VCN 4.0.3 updates
 - JPEG 4.0.3 updates
 - Enable wave kill (soft recovery) for compute queues
 - Clean up CP error interrupt handling
 - Enable CP bad opcode interrupts
 - VCN 4.x fixes
 - VCN 5.x fixes
 - GPU reset fixes
 - Fix vbios embedded EDID size handling
 - SMU 14.x updates
 - Misc code cleanups and spelling fixes
 - VCN devcoredump support
 - ISP MFD i2c support
 - DC vblank fixes
 - GFX 12 fixes
 - PSR fixes
 - Convert vbios embedded EDID to drm_edid
 - DCN 3.5 updates
 - DMCUB updates
 - Cursor fixes
 - Overdrive support for SMU 14.x
 - GFX CP padding optimizations
 - DCC fixes
 - DSC fixes
 - Preliminary per queue reset infrastructure
 - Initial per queue reset support for GFX 9
 - Initial per queue reset support for GFX 7, 8
 - DCN 3.2 fixes
 - DP MST fixes
 - SR-IOV fixes
 - GFX 9.4.3/4 devcoredump support
 - Add process isolation framework
 - Enable process isolation support for GFX 9.4.3/4
 - Take IOMMU remapping into account for P2P DMA checks
 
 amdkfd:
 - CRIU fixes
 - Improved input validation for user queues
 - HMM fix
 - Enable process isolation support for GFX 9.4.3/4
 - Initial per queue reset support for GFX 9
 - Allow users to target recommended SDMA engines
 
 radeon:
 - remove .load and drm_dev_alloc
 - Fix vbios embedded EDID size handling
 - Convert vbios embedded EDID to drm_edid
 - Use GEM references instead of TTM
 - r100 cp init cleanup
 - Fix potential overflows in evergreen CS offset tracking
 
 UAPI:
 - KFD support for targetting queues on recommended SDMA engines
   Proposed userspace:
   2f588a2406
   eb30a5bbc7
 
 drm/buddy:
 - Add start address support for trim function
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZszhcQAKCRC93/aFa7yZ
 2M4ZAQD+xgIJkQ9HISQeqER5GblnfrorARd32yP/BH0c+JbGUAD9H/BIB41teZ80
 vw2WTx+4TyB39awgvtpDH8iEQdkcSAE=
 =717w
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.12-2024-08-26:

amdgpu:
- SDMA devcoredump support
- DCN 4.0.1 updates
- DC SUBVP fixes
- Refactor OPP in DC
- Refactor MMHUBBUB in DC
- DC DML 2.1 updates
- DC FAMS2 updates
- RAS updates
- GFX12 updates
- VCN 4.0.3 updates
- JPEG 4.0.3 updates
- Enable wave kill (soft recovery) for compute queues
- Clean up CP error interrupt handling
- Enable CP bad opcode interrupts
- VCN 4.x fixes
- VCN 5.x fixes
- GPU reset fixes
- Fix vbios embedded EDID size handling
- SMU 14.x updates
- Misc code cleanups and spelling fixes
- VCN devcoredump support
- ISP MFD i2c support
- DC vblank fixes
- GFX 12 fixes
- PSR fixes
- Convert vbios embedded EDID to drm_edid
- DCN 3.5 updates
- DMCUB updates
- Cursor fixes
- Overdrive support for SMU 14.x
- GFX CP padding optimizations
- DCC fixes
- DSC fixes
- Preliminary per queue reset infrastructure
- Initial per queue reset support for GFX 9
- Initial per queue reset support for GFX 7, 8
- DCN 3.2 fixes
- DP MST fixes
- SR-IOV fixes
- GFX 9.4.3/4 devcoredump support
- Add process isolation framework
- Enable process isolation support for GFX 9.4.3/4
- Take IOMMU remapping into account for P2P DMA checks

amdkfd:
- CRIU fixes
- Improved input validation for user queues
- HMM fix
- Enable process isolation support for GFX 9.4.3/4
- Initial per queue reset support for GFX 9
- Allow users to target recommended SDMA engines

radeon:
- remove .load and drm_dev_alloc
- Fix vbios embedded EDID size handling
- Convert vbios embedded EDID to drm_edid
- Use GEM references instead of TTM
- r100 cp init cleanup
- Fix potential overflows in evergreen CS offset tracking

UAPI:
- KFD support for targetting queues on recommended SDMA engines
  Proposed userspace:
  2f588a2406
  eb30a5bbc7

drm/buddy:
- Add start address support for trim function

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826201528.55307-1-alexander.deucher@amd.com
2024-08-27 14:33:12 +02:00
Prike Liang
fb0a5834a3 drm/amdgpu: increase the reset counter for the queue reset
Update the reset counter for the amdgpu_cs_query_reset_state()

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:17:56 -04:00
Alex Deucher
15789fa0f0 drm/amdgpu: add per ring reset support (v5)
If a specific job is hung, try and reset just the
ring associated with the job.

v2: move to amdgpu_job.c
v3: fix drm_sched_stop() handling when ring reset fails
v4: drop unnecessary amdgpu_fence_driver_clear_job_fences() and
    drm_sched_increase_karma()
v5: rework sched_stop handling

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:17:52 -04:00
Joshua Ashton
829798c789 drm/amdgpu: Forward soft recovery errors to userspace
As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting hanging
command buffers cascading us to a hard reset.

1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 434967aadb)
Cc: stable@vger.kernel.org
2024-08-07 18:20:07 -04:00
Joshua Ashton
434967aadb drm/amdgpu: Forward soft recovery errors to userspace
As we discussed before[1], soft recovery should be
forwarded to userspace, or we can get into a really
bad state where apps will keep submitting hanging
command buffers cascading us to a hard reset.

1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06 11:11:01 -04:00
Alex Deucher
7d570f56f1 drm/amdgpu/job: Replace DRM_INFO/ERROR logging
Use the dev_info/err variants so we get per device logging
in multi-GPU cases.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-10 10:12:55 -04:00
Eric Huang
bac640ddb5 drm/amdgpu: add reset source in various cases
To fullfill the reset event description.

Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14 16:15:58 -04:00
Lijo Lazar
cd409dbc69 drm/amdgpu: Refine IB schedule error logging
Downgrade to debug information when IBs are skipped. Also, use dev_* to
identify the device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 21:49:09 -04:00
Shashank Sharma
b8f67b9ddf drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
  reference counted.
- introducing two new helper funcs for task_info lifecycle management
    - amdgpu_vm_get_task_info: reference counts up task_info before
      returning this info
    - amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)

V3: Fixed review comments from Felix
   - Fix wrong indentation
   - No debug message for -ENOMEM
   - Add NULL check for task_info
   - Do not duplicate the debug messages (ti vs no ti)
   - Get first reference of task_info in vm_init(), put last
     in vm_fini()

V4: Fixed review comments from Felix
   - fix double reference increment in create_task_info
   - change amdgpu_vm_get_task_info_pasid
   - additional changes in amdgpu_gem.c while porting

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04 15:59:08 -05:00