Commit Graph

1817 Commits

Author SHA1 Message Date
Mukul Joshi
9a16042f02 drm/amdkfd: Update queue unmap after VM fault with MES
MEC FW expects MES to unmap all queues when a VM fault is observed
on a queue and then resumed once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:14:13 -04:00
Amber Lin
87758a0ef1 drm/amdkfd: Enable processes isolation on gfx9
When amdgpu enable enforce_isolation, KFD enables single-process mode in
HWS and sets exec_cleaner_shader bit in MAP_PROCESS.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:08:07 -04:00
Amber Lin
234eebe161 drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.

v2: fix build (Alex)

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:28 -04:00
Lijo Lazar
42b3a6f12a drm/amdkfd: Add node_id to location_id generically
If there are multiple nodes per kfd device, add nodeid to location_id to
differentiate.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-13 12:12:52 -04:00
Philip Yang
a1fc9f584c drm/amdkfd: Handle queue destroy buffer access race
Add helper function kfd_queue_unreference_buffers to reduce queue buffer
refcount, separate it from release queue buffers.

Because it is circular locking to hold dqm_lock to take vm lock,
kfd_ioctl_destroy_queue should take vm lock, unreference queue buffers
first, but not release queue buffers, to handle error in case failed to
hold vm lock. Then hold dqm_lock to remove queue from queue list and
then release queue buffers.

Restore process worker restore queue hold dqm_lock, will always find
the queue with valid queue buffers.

v2 (Felix):
- renamed kfd_queue_unreference_buffer(s) to kfd_queue_unref_bo_va(s)
- added two FIXME comments for follow up

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-13 10:43:16 -04:00
Jonathan Kim
70f83e7706 drm/amdkfd: fix partition query when setting up recommended sdma engines
When users dynamically set the partition mode through sysfs writes,
this can lead to a double lock situation where the KFD is trying to take
the partition lock when updating the recommended SDMA engines.
Have the KFD reference its saved socket device number count instead.
Also ensure we have enough SDMA xGMI engines to report the recommended
engines in the first place.

Fixes: e06b71b231 ("drm/amdkfd: allow users to target recommended SDMA engines")
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-13 10:35:00 -04:00
Jonathan Kim
b41a382932 drm/amdkfd: fix debug watchpoints for logical devices
The number of watchpoints should be set and constrained per logical
partition device, not by the socket device.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06 10:43:39 -04:00
Jonathan Kim
ee0a469cf9 drm/amdkfd: support per-queue reset on gfx9
Support per-queue reset for GFX9.  The recommendation is for the driver
to target reset the HW queue via a SPI MMIO register write.

Since this requires pipe and HW queue info and MEC FW is limited to
doorbell reports of hung queues after an unmap failure, scan the HW
queue slots defined by SET_RESOURCES first to identify the user queue
candidates to reset.

Only signal reset events to processes that have had a queue reset.

If queue reset fails, fall back to GPU reset.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06 10:43:18 -04:00
Philip Yang
1cb62da080 drm/amdkfd: Fix compile error if HMM support not enabled
Fixes the below if kernel config not enable HMM support

>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:107:26: error:
implicit declaration of function 'svm_range_from_addr'

>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:107:24: error:
assignment to 'struct svm_range *' from 'int' makes pointer from integer
without a cast [-Wint-conversion]

>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:111:28: error:
invalid use of undefined type 'struct svm_range'

Fixes: b049504e21 ("drm/amdkfd: Validate user queue svm memory residency")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407252127.zvnxaKRA-lkp@intel.com/
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06 10:40:30 -04:00
Srinivasan Shanmugam
7c5b344537 drm/amdkfd: Fix missing error code in kfd_queue_acquire_buffers
The fix involves setting 'err' to '-EINVAL' before each 'goto
out_err_unreserve'.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:265 kfd_queue_acquire_buffers()
warn: missing error code 'err'

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c
    226 int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_properties *properties)
    227 {
    228         struct kfd_topology_device *topo_dev;
    229         struct amdgpu_vm *vm;
    230         u32 total_cwsr_size;
    231         int err;
    232
    233         topo_dev = kfd_topology_device_by_id(pdd->dev->id);
    234         if (!topo_dev)
    235                 return -EINVAL;
    236
    237         vm = drm_priv_to_vm(pdd->drm_priv);
    238         err = amdgpu_bo_reserve(vm->root.bo, false);
    239         if (err)
    240                 return err;
    241
    242         err = kfd_queue_buffer_get(vm, properties->write_ptr, &properties->wptr_bo, PAGE_SIZE);
    243         if (err)
    244                 goto out_err_unreserve;
    245
    246         err = kfd_queue_buffer_get(vm, properties->read_ptr, &properties->rptr_bo, PAGE_SIZE);
    247         if (err)
    248                 goto out_err_unreserve;
    249
    250         err = kfd_queue_buffer_get(vm, (void *)properties->queue_address,
    251                                    &properties->ring_bo, properties->queue_size);
    252         if (err)
    253                 goto out_err_unreserve;
    254
    255         /* only compute queue requires EOP buffer and CWSR area */
    256         if (properties->type != KFD_QUEUE_TYPE_COMPUTE)
    257                 goto out_unreserve;

This is clearly a success path.

    258
    259         /* EOP buffer is not required for all ASICs */
    260         if (properties->eop_ring_buffer_address) {
    261                 if (properties->eop_ring_buffer_size != topo_dev->node_props.eop_buffer_size) {
    262                         pr_debug("queue eop bo size 0x%lx not equal to node eop buf size 0x%x\n",
    263                                 properties->eop_buf_bo->tbo.base.size,
    264                                 topo_dev->node_props.eop_buffer_size);
--> 265                         goto out_err_unreserve;

This has err in the label name.  err = -EINVAL?

    266                 }
    267                 err = kfd_queue_buffer_get(vm, (void *)properties->eop_ring_buffer_address,
    268                                            &properties->eop_buf_bo,
    269                                            properties->eop_ring_buffer_size);
    270                 if (err)
    271                         goto out_err_unreserve;
    272         }
    273
    274         if (properties->ctl_stack_size != topo_dev->node_props.ctl_stack_size) {
    275                 pr_debug("queue ctl stack size 0x%x not equal to node ctl stack size 0x%x\n",
    276                         properties->ctl_stack_size,
    277                         topo_dev->node_props.ctl_stack_size);
    278                 goto out_err_unreserve;

err?

    279         }
    280
    281         if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
    282                 pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
    283                         properties->ctx_save_restore_area_size,
    284                         topo_dev->node_props.cwsr_size);
    285                 goto out_err_unreserve;

err?  Not sure.

    286         }
    287
    288         total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
    289                           * NUM_XCC(pdd->dev->xcc_mask);
    290         total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
    291
    292         err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
    293                                    &properties->cwsr_bo, total_cwsr_size);
    294         if (!err)
    295                 goto out_unreserve;
    296
    297         amdgpu_bo_unreserve(vm->root.bo);
    298
    299         err = kfd_queue_buffer_svm_get(pdd, properties->ctx_save_restore_area_address,
    300                                        total_cwsr_size);
    301         if (err)
    302                 goto out_err_release;
    303
    304         return 0;
    305
    306 out_unreserve:
    307         amdgpu_bo_unreserve(vm->root.bo);
    308         return 0;
    309
    310 out_err_unreserve:
    311         amdgpu_bo_unreserve(vm->root.bo);
    312 out_err_release:
    313         kfd_queue_release_buffers(pdd, properties);
    314         return err;
    315 }

Fixes: 629568d25f ("drm/amdkfd: Validate queue cwsr area and eop buffer size")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-27 17:29:37 -04:00
Jonathan Kim
e06b71b231 drm/amdkfd: allow users to target recommended SDMA engines
Certain GPUs have better copy performance over xGMI on specific
SDMA engines depending on the source and destination GPU.
Allow users to create SDMA queues on these recommended engines.
Close to 2x overall performance has been observed with this
optimization.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-25 17:43:41 -04:00
Philip Yang
629568d25f drm/amdkfd: Validate queue cwsr area and eop buffer size
When creating KFD user compute queue, check if queue eop buffer size,
cwsr area size, ctl stack size equal to the size of KFD node
properities.

Check the entire cwsr area which may split into multiple svm ranges
aligned to granularity boundary.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24 14:46:06 -04:00
Philip Yang
517fff221c drm/amdkfd: Store queue cwsr area size to node properties
Use the queue eop buffer size, cwsr area size, ctl stack size
calculation from Thunk, store the value to KFD node properties.

Those will be used to validate queue eop buffer size, cwsr area size,
ctl stack size when creating KFD user compute queue.

Those will be exposed to user space via sysfs KFD node properties, to
remove the duplicate calculation code from Thunk.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24 14:45:58 -04:00
Philip Yang
305cd109b7 drm/amdkfd: Validate user queue update
Ensure update queue new ring buffer is mapped on GPU with correct size.

Decrease queue old ring_bo queue_refcount and increase new ring_bo
queue_refcount.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24 14:43:49 -04:00
Philip Yang
b049504e21 drm/amdkfd: Validate user queue svm memory residency
Queue CWSR area maybe registered to GPU as svm memory, create queue to
ensure svm mapped to GPU with KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED flag.

Add queue_refcount to struct svm_range, to track queue CWSR area usage.

Because unmap mmu notifier callback return value is ignored, if
application unmap the CWSR area while queue is active, pr_warn message
in dmesg log. To be safe, evict user queue.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24 14:43:28 -04:00
Philip Yang
834368eab3 drm/amdkfd: Ensure user queue buffers residency
Add atomic queue_refcount to struct bo_va, return -EBUSY to fail unmap
BO from the GPU if the bo_va queue_refcount is not zero.

Create queue to increase the bo_va queue_refcount, destroy queue to
decrease the bo_va queue_refcount, to ensure the queue buffers mapped on
the GPU when queue is active.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:42:54 -04:00
Philip Yang
68e599db7a drm/amdkfd: Validate user queue buffers
Find user queue rptr, ring buf, eop buffer and cwsr area BOs, and
check BOs are mapped on the GPU with correct size and take the BO
reference.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:42:14 -04:00
Philip Yang
fb91065851 drm/amdkfd: Refactor queue wptr_bo GART mapping
Add helper function kfd_queue_acquire_buffers to get queue wptr_bo
reference from queue write_ptr if it is mapped to the KFD node with
expected size.

Add wptr_bo to structure queue_properties because structure queue is
allocated after queue buffers are validated, then we can remove wptr_bo
parameter from pqm_create_queue.

Rename structure queue wptr_bo_gart to hold wptr_bo reference for GART
mapping and umapping. Move MES wptr_bo_gart mapping to init_user_queue,
the same location with queue ctx_bo GART mapping.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:35:04 -04:00
Philip Yang
c86ad39140 drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer,
otherwise amdgpu_bo_unref clear the local variable, the original pointer
not set to NULL, this could cause use-after-free bug.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:34:44 -04:00
Philip Yang
f9e292cbba drm/amdkfd: kfd_bo_mapped_dev support partition
Change amdgpu_amdkfd_bo_mapped_to_dev to use drm_priv as parameter
instead of adev, to support spatial partition. This is only used by CRIU
checkpoint restore now. No functional change.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:34:36 -04:00
Zhigang Luo
ee98fb71ba drm/amdgpu: set CP_HQD_PQ_DOORBELL_CONTROL.DOORBELL_MODE to 1
to avoid reading wrong WPTR from doorbell in sriov vf, set
CP_HQD_PQ_DOORBELL_CONTROL.DOORBELL_MODE to 1 to read WPTR from MQD.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-08 16:56:27 -04:00
Yang Wang
12b435a40c drm/amdgpu: add ras POSION_CONSUMPTION event id support
add amdgpu ras POSION_CONSUMPTION event id support.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-08 16:55:37 -04:00
Stanley.Yang
91ba536ead drm/amdkfd: Use mode1 reset for GFX v9.4.4
GFX v9.4.4 uses mode1 reset to handle poison consumption.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-08 16:55:26 -04:00
Tim Huang
71d8af38d3 drm/amdkfd: add KFD support for SDMA IP v6.1.2
Enable KFD setting SDMA info for SDMA 6.1.2.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-02 18:05:47 -04:00
Tim Huang
53c3a37436 drm/amdkfd: add KFD support for GC IP v11.5.2
Enable KFD for GC 11.5.2.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-02 18:05:29 -04:00
Lijo Lazar
62ec7d38b7 drm/amdkfd: Use device based logging for errors
Convert some pr_* to some dev_* APIs to identify the device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-01 16:10:47 -04:00
Dan Carpenter
39de69c4f9 drm/amdgpu/kfd: Add unlock() on error path to add_queue_mes()
We recently added locking to add_queue_mes() but this error path was
overlooked.  Add an unlock to the error path.

Fixes: 1802b042a3 ("drm/amdgpu/kfd: remove is_hws_hang and is_resetting")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-27 17:09:47 -04:00
Tao Zhou
9d308e32a9 drm/amdkfd: add ASIC version check for the reset selection of RAS poison
GFX v9.4.3 uses mode1 reset, other ASICs choose mode2.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14 16:18:27 -04:00
Tao Zhou
4280f60e8e drm/amdkfd: use mode1 reset for RAS poison consumption
Per firmware's requirement, replace mode2 with mode1.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14 16:18:27 -04:00
Jay Cornwall
ec14eab37d drm/amdkfd: Extend gfx12 trap handler fix to gfx10/11
In commit fda812ebe3 ("drm/amdkfd: gfx12 context save/restore trap handler fixes")
the following fix was introduced but incorrectly restricted to gfx12.
The same issue and a corresponding fix apply to gfx10 and gfx11.

Do not overwrite TRAPSTS.{SAVECTX,HOST_TRAP} when restoring this
register. Both of these fields can assert while the wavefront is
running the trap handler.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Lancelot Six <lancelot.six@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14 16:17:15 -04:00
Yunxiang Li
d225960c23 drm/amdgpu: add lock in kfd_process_dequeue_from_device
We need to take the reset domain lock before talking to MES. While in
this case we can take the lock inside the mes helper. We can't do so for
most other mes helpers since they are used during reset. So for
consistency sake we add the lock here.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14 16:17:11 -04:00
Yunxiang Li
1802b042a3 drm/amdgpu/kfd: remove is_hws_hang and is_resetting
is_hws_hang and is_resetting serves pretty much the same purpose and
they all duplicates the work of the reset_domain lock, just check that
directly instead. This also eliminate a few bugs listed below and get
rid of dqm->ops.pre_reset.

kfd_hws_hang did not need to avoid scheduling another reset. If the
on-going reset decided to skip GPU reset we have a bad time, otherwise
the extra reset will get cancelled anyway.

remove_queue_mes forgot to check is_resetting flag compared to the
pre-MES path unmap_queue_cpsch, so it did not block hw access during
reset correctly.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14 16:15:58 -04:00
Jesse Zhang
7f7f43f28e drm/amdkfd: remove logically dead code
idr_for_each_entry can ensure that mem is not empty during the loop.
So don't need check mem again.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:25:14 -04:00
Eric Huang
dbe2c4c8ab drm/amdkfd: add reset cause in gpu pre-reset smi event
reset cause is requested by customer as additional
info for gpu reset smi event.

v2: integerate reset sources suggested by Lijo Lazar

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:25:14 -04:00
Jesse Zhang
b5b561621d drm/amdkfd: remove dead code in kfd_create_vcrat_image_gpu
kfd_create_vcrat_image_gpu itself checks the avail_size at the start.
So the value of avail_size is at least VCRAT_SIZE_FOR_GPU(16384),
minus struct crat_header(40UL) and struct crat_subtype_compute(40UL) it cannot be less than 0.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:25:13 -04:00
Jesse Zhang
8178cfb0b4 drm/amdkfd: fix the kdf debugger issue
The expression caps | HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED
and  caps | HSA_CAP_TRAP_DEBUG_PRECISE_ALU_OPERATIONS_SUPPORTED
are always 1/true regardless of the values of its operand.

Fixes: 9243240bed ("drm/amdkfd: enable single alu ops for gfx12")
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:25:00 -04:00
Jesse Zhang
46eb63ec8a drm/amdkfd: Comment out the unused variable use_static in pm_map_queues_v9
To fix the warning about unused value,
remove the use_static and use the parameter is_static directly.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:06:52 -04:00
Jay Cornwall
c5afb313e7 drm/amdkfd: Handle deallocated VPGRs in gfx11+ trap handler
A wavefront may deallocate its VGPRs at the end of a program while
waiting for memory transactions to complete. If it subsequently
receives a context save exception it will be unable to save,
since this requires VGPRs. In this case the trap handler should
terminate the wavefront.

Fixes intermittent VM faults under context switching load.

V2: Use S_ENDPGM instead of S_ENDPGM_SAVED for performance counters

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Lancelot Six <lancelot.six@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:05:57 -04:00
Jesse Zhang
dfe190aff8 drm/amdkfd: remove dead code in the function svm_range_get_pte_flags
The varible uncached  set false, the condition uncached cannot be true.
So remove the dead code, mapping flags will set the flag AMDGPU_VM_MTYPE_UC in else.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 11:02:59 -04:00
Jay Cornwall
fda812ebe3 drm/amdkfd: gfx12 context save/restore trap handler fixes
Fix LDS size interpretation: 512 bytes (>= gfx12) vs 256 (< gfx12).

Ensure STATE_PRIV.BARRIER_COMPLETE cannot change after reading or
before writing. Other waves in the threadgroup may cause this field
to assert if they complete the barrier.

Do not overwrite EXCP_FLAG_PRIV.{SAVE_CONTEXT,HOST_TRAP} when
restoring this register. Both of these fields can assert while the
wavefront is running the trap handler.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Lancelot Six <lancelot.six@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05 10:57:40 -04:00
Jay Cornwall
c5e358913d drm/amdkfd: Replace deprecated gfx12 trap handler instructions
Newer assemblers reject S_WAITCNT. All instances of S_WAITCNT can be
replaced by S_WAITCNT 0 (< gfx12) or S_WAIT_IDLE (>= gfx12) since
there is no concurrency of different memory instruction classes.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Lancelot Six <lancelot.six@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29 14:48:30 -04:00
Jay Cornwall
dec4f2d224 drm/amdkfd: Sync trap handler binary with source
Source and binary have become mismatched during branch activity.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Lancelot Six <lancelot.six@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29 14:48:30 -04:00
Alex Deucher
7978c4d414 drm/amdkfd: simplify APU VRAM handling
With commit 89773b8559
("drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs")
big and small APU "VRAM" handling in KFD was unified.  Since AMD_IS_APU
is set for both big and small APUs, we can simplify the checks in
the code.

v2: clean up a few more places (Lang)

Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Lang Yu <Lang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29 14:48:30 -04:00
Alex Deucher
d4ab6c409b Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"
This reverts commit 28ebbb4981.

Revert this commit as apparently the LLVM code to take advantage of
this never landed.

Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Feifei Xu <feifei.xu@amd.com>
2024-05-29 14:48:30 -04:00
Alex Deucher
f2a1fbdd1f drm/amdgpu: drop MES 10.1 support v3
It was an enablement vehicle for MES 11 and was never
productized.  Remove it.

v2: drop additional checks in the GFX10 code.
v3: drop mes_api_def.h

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29 14:09:01 -04:00
Sreekant Somasekharan
5d32b7e77b drm/amdkfd: Add GFX1201 to svm_range_get_pte_flags function
GFX1201 was missed in the commit below. Adding it in.

Fixes: 628e1ace23 ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC")
Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17 17:40:37 -04:00
Harish Kasiviswanathan
3ed181b8ff drm/amdkfd: Ensure gpu_id is unique
gpu_id needs to be unique for user space to identify GPUs via KFD
interface. In the current implementation there is a very small
probability of having non unique gpu_ids.

v2: Add check to confirm if gpu_id is unique. If not unique, find one
    Changed commit header to reflect the above
v3: Use crc16 as suggested-by: Lijo Lazar <lijo.lazar@amd.com>
    Ensure that gpu_id != 0

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17 17:40:37 -04:00
Harish Kasiviswanathan
8e8c68f4c9 drm/amdkfd: Use dev_error intead of pr_error
No functional change. This will help in moving gpu_id creation to next
step while still being able to identify the correct GPU

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13 16:12:02 -04:00
Mukul Joshi
85cf43c554 drm/amdkfd: Fix CU Masking for GFX 9.4.3
We are incorrectly passing the first XCC's MQD when
updating CU masks for other XCCs in the partition. Fix
this by passing the MQD for the XCC currently being
updated with CU mask to update_cu_mask function.

Fixes: fc6efed2c7 ("drm/amdkfd: Update CU masking for GFX 9.4.3")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13 15:48:21 -04:00
Ramesh Errabolu
d2d3a44008 drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map()
Analysis of code by Coverity, a static code analyser, has identified
a resource leak in the symbol hmm_range. This leak occurs when one of
the prior steps before it is released encounters an error.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13 15:44:32 -04:00
Philip Yang
9095e55440 drm/amdkfd: Remove arbitrary timeout for hmm_range_fault
On system with khugepaged enabled and user cases with THP buffer, the
hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary
timeout value is not accurate, cause memory allocation failure.

Remove the arbitrary timeout value, return EAGAIN to application if
hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call
ioctl again.

Change EAGAIN to debug message as this is not error.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13 15:44:02 -04:00
Michael Chen
10f624ef23 drm/amdkfd: Reconcile the definition and use of oem_id in struct kfd_topology_device
Currently oem_id is defined as uint8_t[6] and casted to uint64_t*
in some use case. This would lead code scanner to complain about
access beyond. Re-define it in union to enforce 8-byte size and
alignment to avoid potential issue.

Signed-off-by: Michael Chen <michael.chen@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 18:46:26 -04:00
Alex Deucher
24e82654e9 drm/amdkfd: don't allow mapping the MMIO HDP page with large pages
We don't get the right offset in that case.  The GPU has
an unused 4K area of the register BAR space into which you can
remap registers.  We remap the HDP flush registers into this
space to allow userspace (CPU or GPU) to flush the HDP when it
updates VRAM.  However, on systems with >4K pages, we end up
exposing PAGE_SIZE of MMIO space.

Fixes: d8e408a827 ("drm/amdkfd: Expose HDP registers to user space")
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-05-08 15:17:05 -04:00
Lin.Cao
547033b593 drm/amdkfd: Check debug trap enable before write dbg_ev_file
In interrupt context, write dbg_ev_file will be run by work queue. It
will cause write dbg_ev_file execution after debug_trap_disable, which
will cause NULL pointer access.
v2: cancel work "debug_event_workarea" before set dbg_ev_file as NULL.

Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:05 -04:00
Lijo Lazar
421226e5c9 Revert "drm/amdkfd: Add partition id field to location_id"
This reverts commit c37ce764cd.

RCCL library is currently not treating spatial partitions differently,
hence this change is causing issues. Revert temporarily till RCCL
implementation is ready for spatial partitions.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:04 -04:00
Lang Yu
89773b8559 drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
memory allocation requirements.

We can't even run a Basic MNIST Example with a default 512MB carveout.
https://github.com/pytorch/examples/tree/main/mnist. Error Log:

"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate
84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes
is free. Of the allocated memory 103.83 MiB is allocated by PyTorch,
and 22.17 MiB is reserved by PyTorch but unallocated"

Though we can change BIOS settings to enlarge carveout size,
which is inflexible and may bring complaint. On the other hand,
the memory resource can't be effectively used between host and device.

The solution is MI300A approach, i.e., let VRAM allocations go to GTT.
Then device and host can flexibly and effectively share memory resource.

v2: Report local_mem_size_private as 0. (Felix)

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:04 -04:00
Sreekant Somasekharan
628e1ace23 drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC
Due to a HW bug, the system memory mappings and peer GPU mappings
on GFX12 need to be marked as MTYPE_NC.

Cc: Joe Greathouse <joseph.greathouse@amd.com>
Cc: David Belanger <david.belanger@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
Sreekant Somasekharan
a8a4615ba0 drm/amd/amdkfd: Add GFX12 PTE flag to SVM get PTE function
Add new GFX12 PTE flag AMDGPU_PTE_IS_PTE to svm_range_get_pte_flags
function. This resolves the issues related to SVM enablement in GFX12.

Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
David Belanger
c5faf18bbe drm/amdkfd: Enable atomic support for GFX12
Enable flag in KFD and set the atomic support bit in MQD.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
Eric Huang
a921c35ae5 drm/amdkfd: fix NULL ptr for debugfs mqds on GFX v12
mqd_stride function in gfx v12 is not implemented, that
causes NULL ptr error. Add the generic func to fix it.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
Jonathan Kim
9243240bed drm/amdkfd: enable single alu ops for gfx12
GFX12 debugging requires setting up precise ALU operation for catching
ALU exceptions.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Lancelot Six <lancelot.six@amd.com>
Reviewed-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
Jonathan Kim
fda3f378c4 drm/amdkfd: always enable ttmp setup for gfx12
Similar to GFX11, always enable the setup of trap temporaries on GFX12.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
David Belanger
782b93436a drm/amdkfd: Enable GFX12 trap handler
Updated switch statement to use GFX12 trap handler.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
Laurent Morichetti
cf338b5dfe drm/amdkfd: enable missed single-step workaround for gfx12
When trap_ctrl.trap_after_inst is set, it is possible for a wave to
enter the trap handler, after single-stepping an instruction and a
save_context is raised, with only save_context set in excp_flag_priv.

Because excp_flag_priv.trap_after_inst is not reliably set, we need to
use the missed single-step workaround for gfx12 as well.

Also add wave_start and wave_end as exceptions that should be handled
by the 2nd level trap handler.

Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com>
Tested-by: Lancelot Six <lancelot.six@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:13 -04:00
Lancelot SIX
450abfe433 drm/amdkfd: save and restore barrier state for gfx12
Add support to save and restore the work group barrier state in gfx12
CWSR trap handler.

There is no support to directly restore the signal count of a barrier
state, so instead this patch repeatedly calls s_barrier_signal to
increment the signal count to the desired value.

In this patch, I have implemented the logic to restore the barrier at
the end of the block restoring the HWREGs.  This process needs to be
done by exactly 1 wave per work group.  To achieve this, the initial
value of s_restore_spi_init_hi (containing a FIRST_WAVE bit) needs to be
saved up until that point.  An alternative could be restore the barrier
earlier in the process (around when LDS is restored, as the same wave
does both).  Doing this would break the pattern that the restore
procedure follows the CWSR area layout.

Before restoring the barrier, this patch checks if the barrier was whose
state was saved has the "valid" bit set, even if I don't think this
barrier can be in an invalid state during context save.  I expect this
test to always be true.

Signed-off-by: Lancelot SIX <lancelot.six@amd.com>
Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
Jay Cornwall
f281003336 drm/amdkfd: Add gfx12 trap handler support
- HWREG changes since gfx11
- Save/restore barrier state
- get_wave_size is now reserved by assembler

v2: rebase (Alex)

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
Jay Cornwall
385093fde8 drm/amdkfd: Move trap handler coherence flags to preprocessor
No functional change. Preparation for gfx12 support.

v2: drop unrelated change (Alex)

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
David Belanger
90e4fc8369 drm/amdkfd: Added gfx_v12_kfd2kgd interface for GFX12.
Initial implementation, based on GFX11.

v2: Removed functions not needed by cp scheduler.
v3: Fixed typos.
v4: squash in warning fix (Alex)

Signed-off-by: David Belanger <david.belanger@amd.com>
Acked-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
David Belanger
47fa09b788 drm/amdkfd: Added device queue manager files for GFX12.
Initial implementation, based on GFX11.

v2: squash in include fix from David (Alex)

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
David Belanger
48f0bdf4e3 drm/amdkfd: Added MQD manager files for GFX12.
Initial implementation, based on GFX11.

v2: Removed dbg_wa code as not needed on GFX12.
v3: squash in SDMA queue fixes (Alex)
v4: rebase (Alex)

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
David Belanger
8aa89b69d6 drm/amdkfd: Added temporary changes for GFX12.
Added cases for GFX12 in switch statement, code relying on GFX11
implementation until GFX12 implementation is complete.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
David Belanger
592a5d7de4 drm/amdkfd: Basic SDMA and cache info changes for GFX12.
Added GFX12 support to a few switch statements.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 16:18:12 -04:00
Hawking Zhang
5f571c61b9 drm/amdgpu: Add gfx v9_4_4 ip block
Add gfx v9_4_4 ip block support

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 15:49:16 -04:00
Hawking Zhang
3d1bb1a2e0 drm/amdgpu: Add sdma v4_4_5 ip block
Add sdma v4_4_5 ip block support

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 15:48:57 -04:00
Lancelot SIX
a89a05e3ca drm/amdkfd: Flush the process wq before creating a kfd_process
There is a race condition when re-creating a kfd_process for a process.
This has been observed when a process under the debugger executes
exec(3).  In this scenario:
- The process executes exec.
 - This will eventually release the process's mm, which will cause the
   kfd_process object associated with the process to be freed
   (kfd_process_free_notifier decrements the reference count to the
   kfd_process to 0).  This causes kfd_process_ref_release to enqueue
   kfd_process_wq_release to the kfd_process_wq.
- The debugger receives the PTRACE_EVENT_EXEC notification, and tries to
  re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE).
 - When handling this request, KFD tries to re-create a kfd_process.
   This eventually calls kfd_create_process and kobject_init_and_add.

At this point the call to kobject_init_and_add can fail because the
old kfd_process.kobj has not been freed yet by kfd_process_wq_release.

This patch proposes to avoid this race by making sure to drain
kfd_process_wq before creating a new kfd_process object.  This way, we
know that any cleanup task is done executing when we reach
kobject_init_and_add.

Signed-off-by: Lancelot SIX <lancelot.six@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30 10:00:15 -04:00
Harish Kasiviswanathan
7f11a836e1 drm/amdkfd: Enforce queue BO's adev
Queue buffer, though it is in system memory, has to be created using the
correct amdgpu device. Enforce this as the BO needs to mapped to the
GART for MES Hardware scheduler to access it.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
bfa579b38b drm/amdgpu: prepare to handle pasid poison consumption
Prepare to handle pasid poison consumption.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
Frank Min
ea9238a81b drm/amdgpu: replace tmz flag into buffer flag
Replace tmz flag into buffer flag to make it easier to understand
and extend

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Mukul Joshi
63335b383a drm/amdkfd: Add VRAM accounting for SVM migration
Do VRAM accounting when doing migrations to vram to make sure
there is enough available VRAM and migrating to VRAM doesn't evict
other possible non-unified memory BOs. If migrating to VRAM fails,
driver can fall back to using system memory seamlessly.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Felix Kuehling
f989ecccdf drm/amdkfd: Fix rescheduling of restore worker
Handle the case that the restore worker was already scheduled by another
eviction while the restore was in progress.

Fixes: 9a1c1339ab ("drm/amdkfd: Run restore_workers on freezable WQs")
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Yunxiang Li <Yunxiang.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Alex Deucher
80f071a343 drm/amdkfd: demote unsupported device messages to dev_info
It's not really an error since the devices don't support
the necessary hardware functionality.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3331
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Dave Airlie
0208ca55aa Linux 6.9-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYlapoeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGJLgH/RFrsXefpxAWACJv
 wRmIx/YFjI3Z4nypBZPXysX/JOuf/cf4bR22jMHC1vhQDtxVgB7eUpEYx0C0Y5RD
 ll7cunAZCpP+PB725/bOWbaF94eigl8xC4RrKPon4tG1CS9NAvCJFDyNgWpqh48s
 sSKKae68a16zMtIX5Za8nvmsBQOeD67ZFBgI4m7wIDDhTXY3XlG84r8Rz/ZwcdE/
 rExFkDMUqjpoJVAMBtETQNk4pvKZVMrW2B0f/xZSR+IKXw5l2FO0raXkIG/6TERs
 CxJvCuUjKIosqFLIK2O/cU9G+5V2axZ/WD8YRwDr2vlnnOPht+dFQzyPcuSqezbG
 ShIhQo8=
 =GI5F
 -----END PGP SIGNATURE-----

Backmerge tag 'v6.9-rc5' into drm-next

Linux 6.9-rc5

I've had a persistent msm failure on clang, and the fix is in fixes
so just pull it back to fix that.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-04-22 14:35:52 +10:00
Felix Kuehling
7e38ccb527 drm/amdkfd: Fix eviction fence handling
Handle case that dma_fence_get_rcu_safe returns NULL.

If restore work is already scheduled, only update its timer. The same
work item cannot be queued twice, so undo the extra queue eviction.

Fixes: 9a1c1339ab ("drm/amdkfd: Run restore_workers on freezable WQs")
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Gang BA <Gang.Ba@amd.com>
Reviewed-by: Gang BA <Gang.Ba@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:54:48 -04:00
Hawking Zhang
5e984b0a3d drm/amdgpu: Use driver mode reset for data poison
mode-2 reset is the only reliable method that can get
GC/SDMA back when poison is consumed. mmhub requires
mode-1 reset.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:23 -04:00
Felix Kuehling
18921b2050 drm/amdkfd: Fix memory leak in create_process failure
Fix memory leak due to a leaked mmget reference on an error handling
code path that is triggered when attempting to create KFD processes
while a GPU reset is in progress.

Fixes: 0ab2d7532b ("drm/amdkfd: prepare per-process debug enable and disable")
CC: Xiaogang Chen <xiaogang.chen@amd.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Tested-by: Harish Kasiviswanthan <Harish.Kasiviswanthan@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-17 11:05:09 -04:00
Felix Kuehling
4135899209 drm/amdkfd: Fix memory leak in create_process failure
Fix memory leak due to a leaked mmget reference on an error handling
code path that is triggered when attempting to create KFD processes
while a GPU reset is in progress.

Fixes: 0ab2d7532b ("drm/amdkfd: prepare per-process debug enable and disable")
CC: Xiaogang Chen <xiaogang.chen@amd.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Tested-by: Harish Kasiviswanthan <Harish.Kasiviswanthan@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:35:49 -04:00
Dave Airlie
3b0daecfea amdkfd: use calloc instead of kzalloc to avoid integer overflow
This uses calloc instead of doing the multiplication which might
overflow.

Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-04-12 11:11:59 +10:00
Zhigang Luo
d06af584be amd/amdkfd: sync all devices to wait all processes being evicted
If there are more than one device doing reset in parallel, the first
device will call kfd_suspend_all_processes() to evict all processes
on all devices, this call takes time to finish. other device will
start reset and recover without waiting. if the process has not been
evicted before doing recover, it will be restored, then caused page
fault.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:28:30 -04:00
Harish Kasiviswanathan
8bdfb4ea95 drm/amdkfd: Reset GPU on queue preemption failure
Currently, with F32 HWS GPU reset is only when unmap queue fails.

However, if compute queue doesn't repond to preemption request in time
unmap will return without any error. In this case, only preemption error
is logged and Reset is not triggered. Call GPU reset in this case also.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:09:31 -04:00
Zhigang Luo
dfb15c4ab5 amd/amdkfd: sync all devices to wait all processes being evicted
If there are more than one device doing reset in parallel, the first
device will call kfd_suspend_all_processes() to evict all processes
on all devices, this call takes time to finish. other device will
start reset and recover without waiting. if the process has not been
evicted before doing recover, it will be restored, then caused page
fault.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:14:03 -04:00
Jonathan Kim
0cac183b98 drm/amdkfd: range check cp bad op exception interrupts
Due to a CP interrupt bug, bad packet garbage exception codes are raised.
Do a range check so that the debugger and runtime do not receive garbage
codes.
Update the user api to guard exception code type checking as well.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-27 08:53:02 -04:00
Eric Huang
1210e2f103 drm/amdkfd: fix TLB flush after unmap for GFX9.4.2
TLB flush after unmap accidentially was removed on
gfx9.4.2. It is to add it back.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-03-27 08:51:47 -04:00
Mukul Joshi
9d7993a7ab drm/amdkfd: Check cgroup when returning DMABuf info
Check cgroup permissions when returning DMA-buf info and
based on cgroup info return the GPU id of the GPU that have
access to the BO.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-27 08:49:29 -04:00
Harish Kasiviswanathan
d7e8ddc392 drm/amdkfd: Reset GPU on queue preemption failure
Currently, with F32 HWS GPU reset is only when unmap queue fails.

However, if compute queue doesn't repond to preemption request in time
unmap will return without any error. In this case, only preemption error
is logged and Reset is not triggered. Call GPU reset in this case also.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-27 01:44:53 -04:00
Mukul Joshi
417f78a2a1 drm/amdkfd: Cleanup workqueue during module unload
Destroy the high priority workqueue that handles interrupts
during KFD node cleanup.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:54:48 -04:00
Jonathan Kim
fb880635e0 drm/amdkfd: range check cp bad op exception interrupts
Due to a CP interrupt bug, bad packet garbage exception codes are raised.
Do a range check so that the debugger and runtime do not receive garbage
codes.
Update the user api to guard exception code type checking as well.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:54:42 -04:00
Eric Huang
acf760c890 drm/amdkfd: fix TLB flush after unmap for GFX9.4.2
TLB flush after unmap accidentially was removed on
gfx9.4.2. It is to add it back.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:51:49 -04:00
Mukul Joshi
e2680ee222 drm/amdkfd: Check cgroup when returning DMABuf info
Check cgroup permissions when returning DMA-buf info and
based on cgroup info return the GPU id of the GPU that have
access to the BO.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:47:56 -04:00
Tao Zhou
2fc46e0b2f drm/amdgpu: make reset method configurable for RAS poison
Each RAS block has different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling.

v2: remove the mmhub poison support for kfd int v10.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Tao Zhou
d8070c4241 drm/amdgpu: support utcl2 RAS poison query for mmhub
Support the query for both gfxhub and mmhub, also replace
xcc_id with hub_inst.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Mukul Joshi
0991a4c192 drm/amdkfd: Check preemption status on all XCDs
This patch adds the following functionality:
- Check the queue preemption status on all XCDs in a partition
  for GFX 9.4.3.
- Update the queue preemption debug message to print the queue
  doorbell id for which preemption failed.
- Change the signature of check preemption failed function to
  return a bool instead of uint32_t and pass the MQD manager
  as an argument.

Suggested-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:12 -04:00
Mukul Joshi
26d97182bb drm/amdkfd: Rename read_doorbell_id in MQD functions
Rename read_doorbell_id function to a more meaningful name,
implying what it is used for. No functional change.

Suggested-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:12 -04:00
Tao Zhou
71a8d61ebc drm/amdgpu: retire gfx ras query_utcl2_poison_status
Replace it with related interface in gfxhub functions.

v2: replace node id with xcc id.
    get node id for query_utcl2_poison_status

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:37:36 -04:00
Ricardo B. Marliere
6d3b27e046 drm/amdkfd: make kfd_class constant
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the kfd_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-06 15:24:50 -05:00
Laurent Morichetti
45bbf800c5 drm/amdkfd: Use SQC when TCP would fail in gfx10.1 context save
Similarly to gfx9, gfx10.1 drops vector stores when an xnack error is
raised. To work around this issue, use scalar stores instead of vector
stores when trapsts.xnack_error == 1.

Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com>
Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-06 15:24:50 -05:00
Laurent Morichetti
f36e3f7260 drm/amdkfd: Increase the size of the memory reserved for the TBA
In a future commit, the cwsr trap handler code size for gfx10.1 will
increase to slightly above the one page mark. Since the TMA does not
need to be page aligned, and only 2 pointers are stored in it, push
the TMA offset by 2 KiB and keep the TBA+TMA reserved memory size
to two pages.

Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-06 15:24:49 -05:00
Shashank Sharma
b8f67b9ddf drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
  reference counted.
- introducing two new helper funcs for task_info lifecycle management
    - amdgpu_vm_get_task_info: reference counts up task_info before
      returning this info
    - amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)

V3: Fixed review comments from Felix
   - Fix wrong indentation
   - No debug message for -ENOMEM
   - Add NULL check for task_info
   - Do not duplicate the debug messages (ti vs no ti)
   - Get first reference of task_info in vm_init(), put last
     in vm_fini()

V4: Fixed review comments from Felix
   - fix double reference increment in create_task_info
   - change amdgpu_vm_get_task_info_pasid
   - additional changes in amdgpu_gem.c while porting

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04 15:59:08 -05:00
Eric Huang
1761d9a688 amd/amdkfd: remove unused parameter
The adev can be found from bo by amdgpu_ttm_adev(bo->tbo.bdev),
and adev is also not used in the function
amdgpu_amdkfd_map_gtt_bo_to_gart().

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-28 17:10:53 -05:00
Lijo Lazar
c37ce764cd drm/amdkfd: Add partition id field to location_id
On devices which have multi-partition nodes, keep partition id in
location_id[31:28].

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:15:38 -05:00
Lijo Lazar
e1f6746f33 drm/amdkfd: Skip packet submission on fatal error
If fatal error is detected, packet submission won't go through. Return
error in such cases. Also, avoid waiting for fence when fatal error is
detected.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:14:31 -05:00
Jonathan Kim
6f05159a0d drm/amdkfd: fix process reference drop on debug ioctl
Prevent dropping the KFD process reference at the end of a debug
IOCTL call where the acquired process value is an error.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-22 10:27:50 -05:00
Yifan Zhang
f5f83441c4 drm/amdkfd: add KFD support for GC 11.5.1
Enable KFD for GC 11.5.1.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-22 10:27:10 -05:00
Felix Kuehling
34a1de0f79 drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole
The TBA and TMA, along with an unused IB allocation, reside at low
addresses in the VM address space. A stray VM fault which hits these
pages must be serviced by making their page table entries invalid.
The scheduler depends upon these pages being resident and fails,
preventing a debugger from inspecting the failure state.

By relocating these pages above 47 bits in the VM address space they
can only be reached when bits [63:48] are set to 1. This makes it much
less likely for a misbehaving program to generate accesses to them.
The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL
access with a small offset.

v2:
- Move it to the reserved space to avoid concflicts with Mesa
- Add macros to make reserved space management easier

v3:
- Move VM  max PFN calculation into AMDGPU_VA_RESERVED macros

Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-16 15:41:50 -05:00
Rajneesh Bhardwaj
3459ffe8a8 drm/amdgpu: Fix implicit assumtion in gfx11 debug flags
Gfx11 debug flags mask is currently set with an implicit assumption that
no other mqd update flags exist. This needs to be fixed with newly
introduced flag UPDATE_FLAG_IS_GWS by the previous patch.

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-14 17:15:35 -05:00
Rajneesh Bhardwaj
5869b32bbe drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards
In certain cooperative group dispatch scenarios the default SPI resource
allocation may cause reduced per-CU workgroup occupancy. Set
COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang
scenarions.

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Suggested-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-14 17:15:26 -05:00
Kent Russell
1727816961 drm/amdkfd: Fix L2 cache size reporting in GFX9.4.3
Its currently incorrectly multiplied by number of XCCs in the partition

Fixes: be457b2252 ("drm/amdkfd: Update cache info for GFX 9.4.3")
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-12 16:06:05 -05:00
Laurent Morichetti
804bf74b16 drm/amdkfd: pass debug exceptions to second-level trap handler
Call the 2nd level trap handler if the cwsr handler is entered with any
one of wave_start, wave_end, or trap_after_inst exceptions.

Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com>
Tested-by: Lancelot Six <lancelot.six@amd.com>
Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-12 16:05:33 -05:00
Jonathan Kim
7f9dde7884 drm/amdkfd: fill in data for control stack header for gfx10
The debugger requires the control stack header to be filled in to
update_waves.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-12 16:05:26 -05:00
Joseph Greathouse
5a2df8ecba drm/amdkfd: Add cache line sizes to KFD topology
The KFD topology includes cache line size, but we have not been
filling that information out unless we are parsing a CRAT table.
Fill in this information for the devices where we have cache
information structs, and pipe this information to the topology
sysfs files.

v2: squash in fix from Joe (Alex)

Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-07 12:25:51 -05:00
Lang Yu
0c93bd4957 drm/amdkfd: reserve the BO before validating it
Fix a warning.

v2: Avoid unmapping attachment repeatedly when ERESTARTSYS.

v3: Lock the BO before accessing ttm->sg to avoid race conditions.(Felix)

[   41.708711] WARNING: CPU: 0 PID: 1463 at drivers/gpu/drm/ttm/ttm_bo.c:846 ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.708989] Call Trace:
[   41.708992]  <TASK>
[   41.708996]  ? show_regs+0x6c/0x80
[   41.709000]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.709008]  ? __warn+0x93/0x190
[   41.709014]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.709024]  ? report_bug+0x1f9/0x210
[   41.709035]  ? handle_bug+0x46/0x80
[   41.709041]  ? exc_invalid_op+0x1d/0x80
[   41.709048]  ? asm_exc_invalid_op+0x1f/0x30
[   41.709057]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 [amdgpu]
[   41.709185]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
[   41.709197]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 [amdgpu]
[   41.709337]  ? srso_alias_return_thunk+0x5/0x7f
[   41.709346]  kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu]
[   41.709467]  amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80 [amdgpu]
[   41.709586]  kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu]
[   41.709710]  kfd_ioctl+0x1ec/0x650 [amdgpu]
[   41.709822]  ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10 [amdgpu]
[   41.709945]  ? srso_alias_return_thunk+0x5/0x7f
[   41.709949]  ? tomoyo_file_ioctl+0x20/0x30
[   41.709959]  __x64_sys_ioctl+0x9c/0xd0
[   41.709967]  do_syscall_64+0x3f/0x90
[   41.709973]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Fixes: 101b810430 ("drm/amdkfd: Move dma unmapping after TLB flush")
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:05:19 -05:00
Mukul Joshi
50d3cf5e5a drm/amdkfd: Use correct drm device for cgroup permission check
On GFX 9.4.3, for a given KFD node, fetch the correct drm device from
XCP manager when checking for cgroup permissions.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-29 15:38:30 -05:00
Jay Cornwall
7297ff96ea drm/amdkfd: Use S_ENDPGM_SAVED in trap handler
This instruction has no functional difference to S_ENDPGM
but allows performance counters to track save events correctly.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Laurent Morichetti <laurent.morichetti@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-29 15:38:20 -05:00
Philip Yang
34e98e5b07 drm/amdkfd: Correct partial migration virtual addr
Partial migration to system memory should use migrate.addr, not
prange->start as virtual address to allocate system memory page.

Fixes: a546a27684 ("drm/amdkfd: Use partial migrations/mapping for GPU/CPU page faults in SVM")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-29 15:36:47 -05:00
YiPeng Chai
ed1e1e42fd drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:58:03 -05:00
Alex Deucher
fc8f5a29d4 drm/amdgpu/gfx11: set UNORD_DISPATCH in compute MQDs
This needs to be set to 1 to avoid a potential deadlock in
the GC 10.x and newer.  On GC 9.x and older, this needs
to be set to 0. This can lead to hangs in some mixed
graphics and compute workloads. Updated firmware is also
required for AQL.

Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-01-25 14:49:03 -05:00
Alex Deucher
ca01082353 drm/amdgpu/gfx10: set UNORD_DISPATCH in compute MQDs
This needs to be set to 1 to avoid a potential deadlock in
the GC 10.x and newer.  On GC 9.x and older, this needs
to be set to 0.  This can lead to hangs in some mixed
graphics and compute workloads.  Updated firmware is also
required for AQL.

Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-01-25 14:48:58 -05:00
YiPeng Chai
22f6e3e112 drm/amdgpu: Add log info for umc_v12_0
Add log info for umc_v12_0.

v2:
 Delete redundant logs.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Srinivasan Shanmugam
81d4b97068 drm/amdkfd: Fix variable dereferenced before NULL check in 'kfd_dbg_trap_device_snapshot()'
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_debug.c:1024 kfd_dbg_trap_device_snapshot() warn: variable dereferenced before check 'entry_size' (see line 1021)

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:37 -05:00
Felix Kuehling
50661eb1a2 drm/amdgpu: Auto-validate DMABuf imports in compute VMs
DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the
process_info->kfd_bo_list. There is no explicit KFD API call to validate
them or add eviction fences to them.

This patch automatically validates and fences dymanic DMABuf imports when
they are added to a compute VM. Revalidation after evictions is handled
in the VM code.

v2:
* Renamed amdgpu_vm_validate_evicted_bos to amdgpu_vm_validate
* Eliminated evicted_user state, use evicted state for VM BOs and user BOs
* Fixed and simplified amdgpu_vm_fence_imports, depends on reserved BOs
* Moved dma_resv_reserve_fences for amdgpu_vm_fence_imports into
  amdgpu_vm_validate, outside the vm->status_lock
* Added dummy version of amdgpu_amdkfd_bo_validate_and_fence for builds
  without KFD

v4: Eliminate amdgpu_vm_fence_imports. It's not needed because the
reservation with its fences is shared with the export, as long as all
imports are from KFD, with the exports already reserved, validated and
fenced by the KFD restore worker.

v5: Reintroduced separate evicted_user state to simplify the state machine
and CS error handling when amdgpu_vm_validate is called without a ticket.

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Srinivasan Shanmugam
d7a254fad8 drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()'
Range interval [start, last] is ordered by rb_tree, rb_prev, rb_next
return value still needs NULL check, thus modified from "node" to "rb_node".

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2691 svm_range_get_range_boundaries() warn: can 'node' even be NULL?

Suggested-by: Philip Yang <Philip.Yang@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:33:00 -05:00
Dafna Hirschfeld
02eed83abc drm/amdkfd: fixes for HMM mem allocation
Fix err return value and reset pgmap->type after checking it.

Fixes: c83dee9b63 ("drm/amdkfd: add SPM support for SVM")
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:30:46 -05:00
Felix Kuehling
c147ddc68e drm/amdkfd: Fix sparse __rcu annotation warnings
Properly mark kfd_process->ef as __rcu and consistently use the right
accessor functions.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Philip Yang
2a9de42e8d drm/amdkfd: Fix lock dependency warning with srcu
======================================================
WARNING: possible circular locking dependency detected
6.5.0-kfd-yangp #2289 Not tainted
------------------------------------------------------
kworker/0:2/996 is trying to acquire lock:
        (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x5/0x1a0

but task is already holding lock:
        ((work_completion)(&svms->deferred_list_work)){+.+.}-{0:0}, at:
	process_one_work+0x211/0x560

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 ((work_completion)(&svms->deferred_list_work)){+.+.}-{0:0}:
        __flush_work+0x88/0x4f0
        svm_range_list_lock_and_flush_work+0x3d/0x110 [amdgpu]
        svm_range_set_attr+0xd6/0x14c0 [amdgpu]
        kfd_ioctl+0x1d1/0x630 [amdgpu]
        __x64_sys_ioctl+0x88/0xc0

-> #2 (&info->lock#2){+.+.}-{3:3}:
        __mutex_lock+0x99/0xc70
        amdgpu_amdkfd_gpuvm_restore_process_bos+0x54/0x740 [amdgpu]
        restore_process_helper+0x22/0x80 [amdgpu]
        restore_process_worker+0x2d/0xa0 [amdgpu]
        process_one_work+0x29b/0x560
        worker_thread+0x3d/0x3d0

-> #1 ((work_completion)(&(&process->restore_work)->work)){+.+.}-{0:0}:
        __flush_work+0x88/0x4f0
        __cancel_work_timer+0x12c/0x1c0
        kfd_process_notifier_release_internal+0x37/0x1f0 [amdgpu]
        __mmu_notifier_release+0xad/0x240
        exit_mmap+0x6a/0x3a0
        mmput+0x6a/0x120
        do_exit+0x322/0xb90
        do_group_exit+0x37/0xa0
        __x64_sys_exit_group+0x18/0x20
        do_syscall_64+0x38/0x80

-> #0 (srcu){.+.+}-{0:0}:
        __lock_acquire+0x1521/0x2510
        lock_sync+0x5f/0x90
        __synchronize_srcu+0x4f/0x1a0
        __mmu_notifier_release+0x128/0x240
        exit_mmap+0x6a/0x3a0
        mmput+0x6a/0x120
        svm_range_deferred_list_work+0x19f/0x350 [amdgpu]
        process_one_work+0x29b/0x560
        worker_thread+0x3d/0x3d0

other info that might help us debug this:
Chain exists of:
  srcu --> &info->lock#2 --> (work_completion)(&svms->deferred_list_work)

Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
        lock((work_completion)(&svms->deferred_list_work));
                        lock(&info->lock#2);
			lock((work_completion)(&svms->deferred_list_work));
        sync(srcu);

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Felix Kuehling
47bf0f83fc drm/amdkfd: Fix lock dependency warning
======================================================
WARNING: possible circular locking dependency detected
6.5.0-kfd-fkuehlin #276 Not tainted
------------------------------------------------------
kworker/8:2/2676 is trying to acquire lock:
ffff9435aae95c88 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}, at: __flush_work+0x52/0x550

but task is already holding lock:
ffff9435cd8e1720 (&svms->lock){+.+.}-{3:3}, at: svm_range_deferred_list_work+0xe8/0x340 [amdgpu]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&svms->lock){+.+.}-{3:3}:
       __mutex_lock+0x97/0xd30
       kfd_ioctl_alloc_memory_of_gpu+0x6d/0x3c0 [amdgpu]
       kfd_ioctl+0x1b2/0x5d0 [amdgpu]
       __x64_sys_ioctl+0x86/0xc0
       do_syscall_64+0x39/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> #1 (&mm->mmap_lock){++++}-{3:3}:
       down_read+0x42/0x160
       svm_range_evict_svm_bo_worker+0x8b/0x340 [amdgpu]
       process_one_work+0x27a/0x540
       worker_thread+0x53/0x3e0
       kthread+0xeb/0x120
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x11/0x20

-> #0 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}:
       __lock_acquire+0x1426/0x2200
       lock_acquire+0xc1/0x2b0
       __flush_work+0x80/0x550
       __cancel_work_timer+0x109/0x190
       svm_range_bo_release+0xdc/0x1c0 [amdgpu]
       svm_range_free+0x175/0x180 [amdgpu]
       svm_range_deferred_list_work+0x15d/0x340 [amdgpu]
       process_one_work+0x27a/0x540
       worker_thread+0x53/0x3e0
       kthread+0xeb/0x120
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x11/0x20

other info that might help us debug this:

Chain exists of:
  (work_completion)(&svm_bo->eviction_work) --> &mm->mmap_lock --> &svms->lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&svms->lock);
                               lock(&mm->mmap_lock);
                               lock(&svms->lock);
  lock((work_completion)(&svm_bo->eviction_work));

I believe this cannot really lead to a deadlock in practice, because
svm_range_evict_svm_bo_worker only takes the mmap_read_lock if the BO
refcount is non-0. That means it's impossible that svm_range_bo_release
is running concurrently. However, there is no good way to annotate this.

To avoid the problem, take a BO reference in
svm_range_schedule_evict_svm_bo instead of in the worker. That way it's
impossible for a BO to get freed while eviction work is pending and the
cancel_work_sync call in svm_range_bo_release can be eliminated.

v2: Use svm_bo_ref_unless_zero and explained why that's safe. Also
removed redundant checks that are already done in
amdkfd_fence_enable_signaling.

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:43:53 -05:00
Dave Airlie
e54478fbda amd-drm-next-6.8-2024-01-05:
amdgpu:
 - VRR fixes
 - PSR-SU fixes
 - SubVP fixes
 - DCN 3.5 fixes
 - Documentation updates
 - DMCUB fixes
 - DML2 fixes
 - UMC 12.0 updates
 - GPUVM fix
 - Misc code cleanups and whitespace cleanups
 - DP MST fix
 - Let KFD sync with GPUVM fences
 - GFX11 reset fix
 - SMU 13.0.6 fixes
 - VSC fix for DP/eDP
 - Navi12 display fix
 - RN/CZN system aperture fix
 - DCN 2.1 bandwidth validation fix
 - DCN INIT cleanup
 
 amdkfd:
 - SVM fixes
 - Revert TBA/TMA location change
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZZh7ZQAKCRC93/aFa7yZ
 2EPJAQDkoOlSjRLZoqwOPvBCo3WIzVO+4N6pOgohbTrjhHvDFAD+ONEgIH/wydk1
 IOdtyizh9o7spo2qN2Oi06MDimclDg8=
 =bFa3
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.8-2024-01-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.8-2024-01-05:

amdgpu:
- VRR fixes
- PSR-SU fixes
- SubVP fixes
- DCN 3.5 fixes
- Documentation updates
- DMCUB fixes
- DML2 fixes
- UMC 12.0 updates
- GPUVM fix
- Misc code cleanups and whitespace cleanups
- DP MST fix
- Let KFD sync with GPUVM fences
- GFX11 reset fix
- SMU 13.0.6 fixes
- VSC fix for DP/eDP
- Navi12 display fix
- RN/CZN system aperture fix
- DCN 2.1 bandwidth validation fix
- DCN INIT cleanup

amdkfd:
- SVM fixes
- Revert TBA/TMA location change

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240105220522.4976-1-alexander.deucher@amd.com
2024-01-09 09:07:50 +10:00
Kaibo Ma
0f35b0a7b8 Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
That commit causes NULL pointer dereferences in dmesgs when
running applications using ROCm, including clinfo, blender,
and PyTorch, since v6.6.1. Revert it to fix blender again.

This reverts commit 96c211f1f9.

Closes: https://github.com/ROCm/ROCm/issues/2596
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2991
Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Kaibo Ma <ent3rm4n@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-05 16:10:44 -05:00
Srinivasan Shanmugam
b1a428b45d drm/amdkfd: Fix iterator used outside loop in 'kfd_add_peer_prop()'
Fix the following about iterator use:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:1456 kfd_add_peer_prop() warn: iterator used outside loop: 'iolink3'

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-05 16:10:43 -05:00
Srinivasan Shanmugam
499839eca3 drm/amdkfd: Confirm list is non-empty before utilizing list_first_entry in kfd_topology.c
Before using list_first_entry, make sure to check that list is not
empty, if list is empty return -ENODATA.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:1347 kfd_create_indirect_link_prop() warn: can 'gpu_link' even be NULL?
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:1428 kfd_add_peer_prop() warn: can 'iolink1' even be NULL?
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:1433 kfd_add_peer_prop() warn: can 'iolink2' even be NULL?

Fixes: 0f28cca87e ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links")
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-05 16:10:43 -05:00
Srinivasan Shanmugam
217e85f970 drm/amdkfd: Fix type of 'dbg_flags' in 'struct kfd_process'
dbg_flags looks to be defined with incorrect data type; to process
multiple debug flag options, and hence defined dbg_flags as u32.

Fixes the below:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager_v9.c:117 pm_map_process_aldebaran() warn: maybe use && instead of &

Fixes: 0de4ec9a03 ("drm/amdgpu: prepare map process for multi-process debug devices")
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 11:16:06 -05:00
Dave Airlie
22a2decedf Merge tag 'drm-msm-next-2023-12-15' of https://gitlab.freedesktop.org/drm/msm into drm-next
Updates for v6.8:

Core:
- Add support for SDM670, SM8650
- Handle the CFG interconnect to fix the obscure hangs / timeouts
  on register write
- Kconfig fix for QMP dependency
- DT schema fixes

DPU:
- Add support for SDM670, SM8650
- Enable SmartDMA on SM8350 and SM8450
- Correct UBWC settings for SC8280XP
- Fix catalog settings for SC8180X
- Actually make use of the version to switch between QSEED3/3LITE/4
  scalers
- Use devres-managed and drm-managed allocations where appropriate
- misc other fixes
- Enabled YUV writeback on SC7280, SM8250
- Enabled writeback on SM8350, SM8450
- CRC fix when encoder is selected as the input source
- other misc fixes

MDP4:
- Use devres-managed and drm-managed allocations where appropriate
- flush vblank event on CRTC disable

MDP5:
- Use devres-managed and drm-managed allocations where appropriate

DP:
- Add support for SM8650
- Enable PM runtime support
- Merge msm-specific debugfs dir with the generic one
- Described DisplayPort on SM8150 in DeviceTree bindings
- Moved dp_display_get_next_bridge() to probe()

DSI:
- Add support for SM8650
- Enable PM runtime support

GPU/GEM:
- demote userspace triggerable warnings to debug
- add GEM object metadata UAPI
- move GPU devcoredumps to GPU device
- fix hangcheck to skip retired submits
- expose UBWC config to userspace
- fix a680 chip-id
- drm_exec conversion
- drm/ci: remove rebase-merge directory (to unblock CI)

[airlied: fix drm_exec/amd interaction]
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGs9auYqmo-7NSd9FsbNBCDf7aBevd=4xkcF3A5G_OGvMQ@mail.gmail.com
2023-12-20 07:54:03 +10:00
Xiaogang Chen
006ad514a5 drm/amdkfd: Use partial hmm page walk during buffer validation in SVM
SVM uses hmm page walk to valid buffer before map to gpu vm. After have partial
migration/mapping do validation on same vm range as migration/map do instead of
whole svm range that can be very large. This change is expected to improve svm
code performance.

Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:58:41 -05:00
Philip Yang
65a618dd73 drm/amdkfd: svm range always mapped flag not working on APU
On gfx943 APU there is no VRAM and page migration, queue CWSR area, svm
range with always mapped flag, is not mapped to GPU correctly. This
works fine if retry fault on CWSR area can be recovered, but could cause
deadlock if there is another retry fault recover waiting for CWSR to
finish.

Fix this by mapping svm range with always mapped flag to GPU with ACCESS
attribute if XNACK ON.

There is side effect, because all GPUs have ACCESS attribute by default
on new svm range with XNACK on, the CWSR area will be mapped to all GPUs
after this change. This side effect will be fixed with Thunk change to
set CWSR svm range with ACCESS_IN_PLACE attribute on the GPU that user
queue is created.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-15 12:16:26 -05:00
Jonathan Kim
24149412df drm/amdkfd: only flush mes process context if mes support is there
Fix up on mes process context flush to prevent non-mes devices from
spamming error messages or running into undefined behaviour during
process termination.

Fixes: bd33bb1409 ("drm/amdkfd: fix mes set shader debugger process management")
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-15 12:15:13 -05:00
Jonathan Kim
bd33bb1409 drm/amdkfd: fix mes set shader debugger process management
MES provides the driver a call to explicitly flush stale process memory
within the MES to avoid a race condition that results in a fatal
memory violation.

When SET_SHADER_DEBUGGER is called, the driver passes a memory address
that represents a process context address MES uses to keep track of
future per-process calls.

Normally, MES will purge its process context list when the last queue
has been removed.  The driver, however, can call SET_SHADER_DEBUGGER
regardless of whether a queue has been added or not.

If SET_SHADER_DEBUGGER has been called with no queues as the last call
prior to process termination, the passed process context address will
still reside within MES.

On a new process call to SET_SHADER_DEBUGGER, the driver may end up
passing an identical process context address value (based on per-process
gpu memory address) to MES but is now pointing to a new allocated buffer
object during KFD process creation.  Since the MES is unaware of this,
access of the passed address points to the stale object within MES and
triggers a fatal memory violation.

The solution is for KFD to explicitly flush the process context address
from MES on process termination.

Note that the flush call and the MES debugger calls use the same MES
interface but are separated as KFD calls to avoid conflicting with each
other.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Alice Wong <shiwei.wong@amd.com>
Reviewed-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-13 16:07:43 -05:00
Felix Kuehling
0188006d7c drm/amdkfd: Import DMABufs for interop through DRM
Use drm_gem_prime_fd_to_handle to import DMABufs for interop. This
ensures that a GEM handle is created on import and that obj->dma_buf
will be set and remain set as long as the object is imported into KFD.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Xiaogang.Chen <Xiaogang.Chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-13 15:09:55 -05:00
Felix Kuehling
1819200166 drm/amdkfd: Export DMABufs from KFD using GEM handles
Create GEM handles for exporting DMABufs using GEM-Prime APIs. The GEM
handles are created in a drm_client_dev context to avoid exposing them
in user mode contexts through a DMABuf import.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-13 15:09:55 -05:00
Rob Clark
05d249352f drm/exec: Pass in initial # of objects
In cases where the # is known ahead of time, it is silly to do the table
resize dance.

Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Patchwork: https://patchwork.freedesktop.org/patch/568338/
2023-12-10 10:38:47 -08:00
Xiaogang Chen
a546a27684 drm/amdkfd: Use partial migrations/mapping for GPU/CPU page faults in SVM
This patch implements partial migration/mapping for gpu/cpu page faults in SVM
according to migration granularity(default 2MB). A svm range may include pages
from both system ram and vram of one gpu now. These chagnes are expected to
improve migration performance and reduce mmu callback and TLB flush workloads.

Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-06 15:22:32 -05:00
ZhenGuo Yin
d581ceab26 drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
[Why]
Memory leaks of gang_ctx_bo and wptr_bo.

[How]
Free gang_ctx_bo and wptr_bo in pqm_uninit.

v2: add a common function pqm_clean_queue_resource to
free queue's resources.
v3: reset pdd->pqd.num_gws when destorying GWS queue.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:49:23 -05:00
Felix Kuehling
9a1c1339ab drm/amdkfd: Run restore_workers on freezable WQs
Make restore workers freezable so we don't have to explicitly flush them
in suspend and GPU reset code paths, and we don't accidentally try to
restore BOs while the GPU is suspended. Not having to flush restore_work
also helps avoid lock/fence dependencies in the GPU reset case where we're
not allowed to wait for fences.

A side effect of this is, that we can now have multiple concurrent threads
trying to signal the same eviction fence. Rework eviction fence signaling
and replacement to account for that.

The GPU reset path can no longer rely on restore_process_worker to resume
queues because evict/restore workers can run independently of it. Instead
call a new restore_process_helper directly.

This is an RFC and request for testing.

v2:
- Reworked eviction fence signaling
- Introduced restore_process_helper

v3:
- Handle unsignaled eviction fences in restore_process_bos

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Tested-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:49:23 -05:00
Mukul Joshi
fbd2076c31 drm/amdkfd: Use common function for IP version check
KFD_GC_VERSION was recently updated to use a new function
for IP version checks. As a result, use KFD_GC_VERSION as
the common function for all IP version checks in KFD.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:49:01 -05:00
David Yat Sin
a2d3c69261 drm/amdkfd: Copy HW exception data to user event
Fixes issue where user events of type KFD_EVENT_TYPE_HW_EXCEPTION do not
have valid data

Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:24:43 -05:00
Laurent Morichetti
f4fac4163c drm/amdkfd: Clear the VALU exception state in the trap handler
The trap handler could be entered with pending VALU exceptions, so
clear the exception state before issuing vector instructions.

Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com>
Tested-by: Lancelot Six <lancelot.six@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17 09:29:54 -05:00
Felix Kuehling
94e2dae0a8 drm/amdkfd: Move TLB flushing logic into amdgpu
This will make it possible for amdgpu GEM ioctls to flush TLBs on compute
VMs.

This removes VMID-based TLB flushing and always uses PASID-based
flushing. This still works because it scans the VMID-PASID mapping
registers to find the right VMID. It's only slightly less efficient. This
is not a production use case.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17 09:29:53 -05:00
Linus Torvalds
c0d12d7692 drm fixes for 6.7-rc1
- big pile of amd fixes, but mostly hw support newly added in 6.7
 - i915 fixes, mostly minor things
 - qxl memory leak fix
 - vc4 uaf fix in mock helpers
 - syncobj fix for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmVOlEQACgkQTA9ye/CY
 qnGawQ//d7M2CZpd9LvkRnaV2+fG9yGONOwZsG5fVtXrfT4RDQmITC9KMEh2TxJb
 s1W6HA+UijMMx6RtQN6cTYNHeYDaX2b55/g3lMnreXydii0COJwkWe52iFn0Dpcm
 RpsT5cYLEiRtiTvEzKbrkxS+rrQMu9jwxcA23b+lMkmybVgqQe1m9hYtRxZCFqr7
 6BMKOgrCRoY1mYZrNaccXBHvvgOtcpWPOsuNNgjW3MDKB663BmpABT1iDg3xxPdX
 BI8SAl6PX6ju2Jwi3WPlscmI199fhcUXDuqb8LXhJsuqynhwU940aUlxcbI7Hz6Y
 LaFwSK4OiaIsIC8yisa7cZ1z2mqnMIiGXasP6mfYVmYqpGYMW+AZcOmzui/MLiGd
 duOcvK/bLxN7moSqcKgz+mmrLfZzJkPLV8pGEVk0IbTn239AnR76bqrEQJ+Iqukx
 d4yLGE4OJchD4zzw5RlVmQIhwA8M/5croJBIo6yNyQB5xgN/Krd4QUc6KjjgzNo8
 e402NjaW2/PqWIiPtsL5tK4XdkVtvMvVUq+bJSch6Wfn3j9u06/wgFl1FBRyX7zS
 3QYvMtF/QM5/QTpbdl82hSSsJ78iO62tPYSOhycIxgB/BHoc/fap+IOFtKKnT3RZ
 xr7UiwAQA043gAMvS+TkZAc6bFW8U8Dzxu5XxEPE1L+WU6XCSbs=
 =l45e
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:
 "Dave's VPN to the big machine died, so it's on me to do fixes pr this
  and next week while everyone else is at plumbers.

   - big pile of amd fixes, but mostly for hw support newly added in 6.7

   - i915 fixes, mostly minor things

   - qxl memory leak fix

   - vc4 uaf fix in mock helpers

   - syncobj fix for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE"

* tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm: (78 commits)
  drm/amdgpu: fix error handling in amdgpu_vm_init
  drm/amdgpu: Fix possible null pointer dereference
  drm/amdgpu: move UVD and VCE sched entity init after sched init
  drm/amdgpu: move kfd_resume before the ip late init
  drm/amd: Explicitly check for GFXOFF to be enabled for s0ix
  drm/amdgpu: Change WREG32_RLC to WREG32_SOC15_RLC where inst != 0 (v2)
  drm/amdgpu: Use correct KIQ MEC engine for gfx9.4.3 (v5)
  drm/amdgpu: add smu v13.0.6 pcs xgmi ras error query support
  drm/amdgpu: fix software pci_unplug on some chips
  drm/amd/display: remove duplicated argument
  drm/amdgpu: correct mca debugfs dump reg list
  drm/amdgpu: correct acclerator check architecutre dump
  drm/amdgpu: add pcs xgmi v6.4.0 ras support
  drm/amdgpu: Change extended-scope MTYPE on GC 9.4.3
  drm/amdgpu: disable smu v13.0.6 mca debug mode by default
  drm/amdgpu: Support multiple error query modes
  drm/amdgpu: refine smu v13.0.6 mca dump driver
  drm/amdgpu: Do not program PF-only regs in hdp_v4_0.c under SRIOV (v2)
  drm/amdgpu: Skip PCTL0_MMHUB_DEEPSLEEP_IB write in jpegv4.0.3 under SRIOV
  drm: amd: Resolve Sphinx unexpected indentation warning
  ...
2023-11-10 14:59:30 -08:00
David Yat Sin
4abf0b0bdf drm/amdgpu: Change extended-scope MTYPE on GC 9.4.3
Change local memory type to MTYPE_UC on revision id 0

Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:02:14 -05:00
Linus Torvalds
25b6377007 drm next and fixes for 6.7-rc1
renesas:
 - atomic conversion
 - DT support
 
 ssd13xx:
 - dt binding fix for ssd132x
 - Initialize ssd130x crtc_state to NULL.
 
 amdgpu:
 - Fix RAS support check
 - RAS fixes
 - MES fixes
 - SMU13 fixes
 - Contiguous memory allocation fix
 - BACO fixes
 - GPU reset fixes
 - Min power limit fixes
 - GFX11 fixes
 - USB4/TB hotplug fixes
 - ARM regression fix
 - GFX9.4.3 fixes
 - KASAN/KCSAN stack size check fixes
 - SR-IOV fixes
 - SMU14 fixes
 - PSP13 fixes
 - Display blend fixes
 - Flexible array size fixes
 
 amdkfd:
 - GPUVM fix
 
 radeon:
 - Flexible array size fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmVJmdsACgkQDHTzWXnE
 hr5hphAAoFdk4ma7TauUNyP3JoUwht+Ohm9NcUHq/9P9kOCwPIIehxMZnwPoTGyw
 VWwXpqpeVsW6zyMgfxWq/+P1S1C5LvZ1HLccbP3xv327fUZ1QnLapHwFxT1SYNpi
 Tw5qhQN/cwNOX0Pc9uBavYmJzf54OhvPxt2CHHPDShsHBOBc0Gd88gKJr7GWUF5M
 Ri6i20Tfsgq8AopWQj9628TT+y/aN3rVIfYYBiNejxejlFtt1HODkKFX3DuBNPOI
 bZOtQm11cbxmX7/2RI92mI20axUb4UMNIQFDYEl3bVlyPyEhhwKPQMSDxhQQodPg
 8zMY9Fbl4Z4VaOFEDbpiRv/0/HeWoLefmpQ5LZbz35RhKLTkwsWXHUELPUEj3uTr
 7+EnLwuvQdtbT9W2J8btO7v1dHOy86ArlnmqNjB2cEnvGaR3DNM4jxPVLn60SyAc
 N7CFWNU4EoJf02XhwAludYa5pQHEaTFL8ss6TSoRWXuHHg1vNdu2SorOvkcGkg28
 q/t28gZDZaOpXfMmf3ec+PHhO6nxXLXJRbiksVP/rpXQQ42cEI72hr6//UYLRuzg
 BbYPZ8uXmDjsSZIqYceZwc3Vr6oEmD6EAzHM9+zwS5h0IZ12jKTmFiDEAhG2DwoG
 8PCaj5UXkxVK+6iHndz8Qwg4+Fu1j5nodKM+vBNem/iSnxC/OVw=
 =TsFN
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm

Pull more drm updates from Dave Airlie:
 "Geert pointed out I missed the renesas reworks in my main pull, so
  this pull contains the renesas next work for atomic conversion and DT
  support.

  It also contains a bunch of amdgpu and some small ssd13xx fixes.

  renesas:
   - atomic conversion
   - DT support

  ssd13xx:
   - dt binding fix for ssd132x
   - Initialize ssd130x crtc_state to NULL.

  amdgpu:
   - Fix RAS support check
   - RAS fixes
   - MES fixes
   - SMU13 fixes
   - Contiguous memory allocation fix
   - BACO fixes
   - GPU reset fixes
   - Min power limit fixes
   - GFX11 fixes
   - USB4/TB hotplug fixes
   - ARM regression fix
   - GFX9.4.3 fixes
   - KASAN/KCSAN stack size check fixes
   - SR-IOV fixes
   - SMU14 fixes
   - PSP13 fixes
   - Display blend fixes
   - Flexible array size fixes

  amdkfd:
   - GPUVM fix

  radeon:
   - Flexible array size fixes"

* tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm: (83 commits)
  drm/amd/display: Enable fast update on blendTF change
  drm/amd/display: Fix blend LUT programming
  drm/amd/display: Program plane color setting correctly
  drm/amdgpu: Query and report boot status
  drm/amdgpu: Add psp v13 function to query boot status
  drm/amd/swsmu: remove fw version check in sw_init.
  drm/amd/swsmu: update smu v14_0_0 driver if and metrics table
  drm/amdgpu: Add C2PMSG_109/126 reg field shift/masks
  drm/amdgpu: Optimize the asic type fix code
  drm/amdgpu: fix GRBM read timeout when do mes_self_test
  drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count
  drm/amd/pm: only check sriov vf flag once when creating hwmon sysfs
  drm/amdgpu: Attach eviction fence on alloc
  drm/amdkfd: Improve amdgpu_vm_handle_moved
  drm/amd/display: Increase frame warning limit with KASAN or KCSAN in dml2
  drm/amd/display: Avoid NULL dereference of timing generator
  drm/amdkfd: Update cache info for GFX 9.4.3
  drm/amdkfd: Populate cache info for GFX 9.4.3
  drm/amdgpu: don't put MQDs in VRAM on ARM | ARM64
  drm/amdgpu/smu13: drop compute workload workaround
  ...
2023-11-07 17:10:02 -08:00
Surbhi Kakarya
9256e8d47a drm/amd: Disable XNACK on SRIOV environment
The purpose of this patch is to disable XNACK or set XNACK OFF mode
on SRIOV platform which doesn't support it.

This will prevent user-space application to fail or result into
unexpected behaviour whenever the application need to run test-case
in XNACK ON mode.

Signed-off-by: Surbhi Kakarya <surbhi.kakarya@amd.com>
Reviewed-by: Shaoyun Liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-07 11:15:37 -05:00
Mukul Joshi
be457b2252 drm/amdkfd: Update cache info for GFX 9.4.3
Update cache info reporting based on compute and
memory partitioning modes.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03 11:59:51 -04:00
Mukul Joshi
0ce8edae8b drm/amdkfd: Populate cache info for GFX 9.4.3
GFX 9.4.3 uses a new version of the GC info table which
contains the cache info. This patch adds a new function
to populate the cache info from IP discovery for GFX 9.4.3.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03 11:59:51 -04:00
Linus Torvalds
7d461b291e drm for 6.7-rc1
kernel:
 - add initial vmemdup-user-array
 
 core:
 - fix platform remove() to return void
 - drm_file owner updated to reflect owner
 - move size calcs to drm buddy allocator
 - let GPUVM build as a module
 - allow variable number of run-queues in scheduler
 
 edid:
 - handle bad h/v sync_end in EDIDs
 
 panfrost:
 - add Boris as maintainer
 
 fbdev:
 - use fb_ops helpers more
 - only allow logo use from fbcon
 - rename fb_pgproto to pgprot_framebuffer
 - add HPD state to drm_connector_oob_hotplug_event
 - convert to fbdev i/o mem helpers
 
 i915:
 - Enable meteorlake by default
 - Early Xe2 LPD/Lunarlake display enablement
 - Rework subplatforms into IP version checks
 - GuC based TLB invalidation for Meteorlake
 - Display rework for future Xe driver integration
 - LNL FBC features
 - LNL display feature capability reads
 - update recommended fw versions for DG2+
 - drop fastboot module parameter
 - added deviceid for Arrowlake-S
 - drop preproduction workarounds
 - don't disable preemption for resets
 - cleanup inlines in headers
 - PXP firmware loading fix
 - Fix sg list lengths
 - DSC PPS state readout/verification
 - Add more RPL P/U PCI IDs
 - Add new DG2-G12 stepping
 - DP enhanced framing support to state checker
 - Improve shared link bandwidth management
 - stop using GEM macros in display code
 - refactor related code into display code
 - locally enable W=1 warnings
 - remove PSR watchdog timers on LNL
 
 amdgpu:
 - RAS/FRU EEPROM updatse
 - IP discovery updatses
 - GC 11.5 support
 - DCN 3.5 support
 - VPE 6.1 support
 - NBIO 7.11 support
 - DML2 support
 - lots of IP updates
 - use flexible arrays for bo list handling
 - W=1 fixes
 - Enable seamless boot in more cases
 - Enable context type property for HDMI
 - Rework GPUVM TLB flushing
 - VCN IB start/size alignment fixes
 
 amdkfd:
 - GC 10/11 fixes
 - GC 11.5 support
 - use partial migration in GPU faults
 
 radeon:
 - W=1 Fixes
 - fix some possible buffer overflow/NULL derefs
 nouveau:
 - update uapi for NO_PREFETCH
 - scheduler/fence fixes
 - rework suspend/resume for GSP-RM
 - rework display in preparation for GSP-RM
 
 habanalabs:
 - uapi: expose tsc clock
 - uapi: block access to eventfd through control device
 - uapi: force dma-buf export to PAGE_SIZE alignments
 - complete move to accel subsystem
 - move firmware interface include files
 - perform hard reset on PCIe AXI drain event
 - optimise user interrupt handling
 
 msm:
 - DP: use existing helpers for DPCD
 - DPU: interrupts reworked
 - gpu: a7xx (a730/a740) support
 - decouple msm_drv from kms for headless devices
 
 mediatek:
 - MT8188 dsi/dp/edp support
 - DDP GAMMA - 12 bit LUT support
 - connector dynamic selection capability
 
 rockchip:
 - rv1126 mipi-dsi/vop support
 - add planar formats
 
 ast:
 - rename constants
 
 panels:
 - Mitsubishi AA084XE01
 - JDI LPM102A188A
 - LTK050H3148W-CTA6
 
 ivpu:
 - power management fixes
 
 qaic:
 - add detach slice bo api
 
 komeda:
 - add NV12 writeback
 
 tegra:
 - support NVSYNC/NHSYNC
 - host1x suspend fixes
 
 ili9882t:
 - separate into own driver
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmVAgzYACgkQDHTzWXnE
 hr7ZEQ//UXne3tyGOsU3X8r+lstLFDMa90a3hvTg6hX+Q0MjHd/clwkKFkLpkipL
 n7gIZlaHl11dRs0FzrIZA5EVAAgjMLKmIl10NBDFec6ZFA3VERcggx8y61uifI15
 VviMR1VbLHYZaCdyrQOK0A4wcktWnKXyoXp7cwy9crdc2GOBMUZkdIqtvD7jHxQx
 UMIFnzi1CyKUX/Fjt/JceYcNk9y2ZGkzakYO3sHcUdv4DPu9qX4kNzpjF691AZBP
 UeKWvCswTRVg2M0kuo/RYIBzqaTmOlk6dHLWBognIeZPyuyhCcaGC2d64c6tShwQ
 dtHdi+IgyQ8s2qb350ymKTQUP7xA/DfZBwH7LvrZALBxeQGYQN1CnsgDMOS2wcUc
 XrRFiS7PxEOtMMBctcPBnnoV5ttnsLLlPpzM9puh9sUFMn6CgLzcAMqXdqxzMajH
 +dz2aD1N0vMqq4varozOg9SC2QamgUiPN/TQfrulhCTCfQaXczy5x1OYiIz65+Sl
 mKoe2WASuP9Ve8do4N/wEwH5SZY2ItipBdUTRxttY9NTanmV0X5DjZBXH5b9XGci
 Zl5Ar613f9zwm5T5BVA5k6s3ZbGY6QcP5pDNTCPaSgitfFXIdReBZ2CaYzK3MPg/
 Wit/TXrud9yT6VPpI1igboMyasf5QubV1MY1K83kOCWr9u8R2CM=
 =l79u
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "Highlights:
   - AMD adds some more upcoming HW platforms
   - Intel made Meteorlake stable and started adding Lunarlake
   - nouveau has a bunch of display rework in prepartion for the NVIDIA
     GSP firmware support
   - msm adds a7xx support
   - habanalabs has finished migration to accel subsystem

  Detail summary:

  kernel:
   - add initial vmemdup-user-array

  core:
   - fix platform remove() to return void
   - drm_file owner updated to reflect owner
   - move size calcs to drm buddy allocator
   - let GPUVM build as a module
   - allow variable number of run-queues in scheduler

  edid:
   - handle bad h/v sync_end in EDIDs

  panfrost:
   - add Boris as maintainer

  fbdev:
   - use fb_ops helpers more
   - only allow logo use from fbcon
   - rename fb_pgproto to pgprot_framebuffer
   - add HPD state to drm_connector_oob_hotplug_event
   - convert to fbdev i/o mem helpers

  i915:
   - Enable meteorlake by default
   - Early Xe2 LPD/Lunarlake display enablement
   - Rework subplatforms into IP version checks
   - GuC based TLB invalidation for Meteorlake
   - Display rework for future Xe driver integration
   - LNL FBC features
   - LNL display feature capability reads
   - update recommended fw versions for DG2+
   - drop fastboot module parameter
   - added deviceid for Arrowlake-S
   - drop preproduction workarounds
   - don't disable preemption for resets
   - cleanup inlines in headers
   - PXP firmware loading fix
   - Fix sg list lengths
   - DSC PPS state readout/verification
   - Add more RPL P/U PCI IDs
   - Add new DG2-G12 stepping
   - DP enhanced framing support to state checker
   - Improve shared link bandwidth management
   - stop using GEM macros in display code
   - refactor related code into display code
   - locally enable W=1 warnings
   - remove PSR watchdog timers on LNL

  amdgpu:
   - RAS/FRU EEPROM updatse
   - IP discovery updatses
   - GC 11.5 support
   - DCN 3.5 support
   - VPE 6.1 support
   - NBIO 7.11 support
   - DML2 support
   - lots of IP updates
   - use flexible arrays for bo list handling
   - W=1 fixes
   - Enable seamless boot in more cases
   - Enable context type property for HDMI
   - Rework GPUVM TLB flushing
   - VCN IB start/size alignment fixes

  amdkfd:
   - GC 10/11 fixes
   - GC 11.5 support
   - use partial migration in GPU faults

  radeon:
   - W=1 Fixes
   - fix some possible buffer overflow/NULL derefs

  nouveau:
   - update uapi for NO_PREFETCH
   - scheduler/fence fixes
   - rework suspend/resume for GSP-RM
   - rework display in preparation for GSP-RM

  habanalabs:
   - uapi: expose tsc clock
   - uapi: block access to eventfd through control device
   - uapi: force dma-buf export to PAGE_SIZE alignments
   - complete move to accel subsystem
   - move firmware interface include files
   - perform hard reset on PCIe AXI drain event
   - optimise user interrupt handling

  msm:
   - DP: use existing helpers for DPCD
   - DPU: interrupts reworked
   - gpu: a7xx (a730/a740) support
   - decouple msm_drv from kms for headless devices

  mediatek:
   - MT8188 dsi/dp/edp support
   - DDP GAMMA - 12 bit LUT support
   - connector dynamic selection capability

  rockchip:
   - rv1126 mipi-dsi/vop support
   - add planar formats

  ast:
   - rename constants

  panels:
   - Mitsubishi AA084XE01
   - JDI LPM102A188A
   - LTK050H3148W-CTA6

  ivpu:
   - power management fixes

  qaic:
   - add detach slice bo api

  komeda:
   - add NV12 writeback

  tegra:
   - support NVSYNC/NHSYNC
   - host1x suspend fixes

  ili9882t:
   - separate into own driver"

* tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm: (1803 commits)
  drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo
  drm/amdgpu: Remove duplicate fdinfo fields
  drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded
  drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
  drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo table
  drm/amdgpu: Identify data parity error corrected in replay mode
  drm/amdgpu: Fix typo in IP discovery parsing
  drm/amd/display: fix S/G display enablement
  drm/amdxcp: fix amdxcp unloads incompletely
  drm/amd/amdgpu: fix the GPU power print error in pm info
  drm/amdgpu: Use pcie domain of xcc acpi objects
  drm/amd: check num of link levels when update pcie param
  drm/amdgpu: Add a read to GFX v9.4.3 ring test
  drm/amd/pm: call smu_cmn_get_smc_version in is_mode1_reset_supported.
  drm/amdgpu: get RAS poison status from DF v4_6_2
  drm/amdgpu: Use discovery table's subrevision
  drm/amd/display: 3.2.256
  drm/amd/display: add interface to query SubVP status
  drm/amd/display: Read before writing Backlight Mode Set Register
  drm/amd/display: Disable SYMCLK32_SE RCO on DCN314
  ...
2023-11-01 06:28:35 -10:00
Linus Torvalds
eb55307e67 X86 core code updates:
- Limit the hardcoded topology quirk for Hygon CPUs to those which have a
     model ID less than 4. The newer models have the topology CPUID leaf 0xB
     correctly implemented and are not affected.
 
   - Make SMT control more robust against enumeration failures
 
     SMT control was added to allow controlling SMT at boottime or
     runtime. The primary purpose was to provide a simple mechanism to
     disable SMT in the light of speculation attack vectors.
 
     It turned out that the code is sensible to enumeration failures and
     worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
     which means the primary thread mask is not set up correctly. By chance
     a XEN/PV boot ends up with smp_num_siblings == 2, which makes the
     hotplug control stay at its default value "enabled". So the mask is
     never evaluated.
 
     The ongoing rework of the topology evaluation caused XEN/PV to end up
     with smp_num_siblings == 1, which sets the SMT control to "not
     supported" and the empty primary thread mask causes the hotplug core to
     deny the bringup of the APS.
 
     Make the decision logic more robust and take 'not supported' and 'not
     implemented' into account for the decision whether a CPU should be
     booted or not.
 
   - Fake primary thread mask for XEN/PV
 
     Pretend that all XEN/PV vCPUs are primary threads, which makes the
     usage of the primary thread mask valid on XEN/PV. That is consistent
     with because all of the topology information on XEN/PV is fake or even
     non-existent.
 
   - Encapsulate topology information in cpuinfo_x86
 
     Move the randomly scattered topology data into a separate data
     structure for readability and as a preparatory step for the topology
     evaluation overhaul.
 
   - Consolidate APIC ID data type to u32
 
     It's fixed width hardware data and not randomly u16, int, unsigned long
     or whatever developers decided to use.
 
   - Cure the abuse of cpuinfo for persisting logical IDs.
 
     Per CPU cpuinfo is used to persist the logical package and die
     IDs. That's really not the right place simply because cpuinfo is
     subject to be reinitialized when a CPU goes through an offline/online
     cycle.
 
     Use separate per CPU data for the persisting to enable the further
     topology management rework. It will be removed once the new topology
     management is in place.
 
   - Provide a debug interface for inspecting topology information
 
     Useful in general and extremly helpful for validating the topology
     management rework in terms of correctness or "bug" compatibility.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+yX0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoROUD/4vlvKEcpm9rbI5DzLcaq4DFHKbyEZF
 cQtzuOSM/9vTc9DHnuoNNLl9TWSYxiVYnejf3E21evfsqspYlzbTH8bId9XBCUid
 6B68AJW842M2erNuwj0b0HwF1z++zpDmBDyhGOty/KQhoM8pYOHMvntAmbzJbuso
 Dgx6BLVFcboTy6RwlfRa0EE8f9W5V+JbmG/VBDpdyCInal7VrudoVFZmWQnPIft7
 zwOJpAoehkp8OKq7geKDf79yWxu9a1sNPd62HtaVEvfHwehHqE6OaMLss1us+0vT
 SJ/D6gmRQBOwcXaZL0wL1dG7Km9Et4AisOvzhXGvTa5b2D5oljVoqJ7V7FTf5g3u
 y3aqWbeUJzERUbeJt1HoGVAKyA4GtZOvg+TNIysf6F1Z4khl9alfa9jiqjj4g1au
 zgItq/ZMBEBmJ7X4FxQUEUVBG2CDsEidyNBDRcimWQUDfBakV/iCs0suD8uu8ZOD
 K5jMx8Hi2+xFx7r1YqsfsyMBYOf/zUZw65RbNe+kI992JbJ9nhcODbnbo5MlAsyv
 vcqlK5FwXgZ4YAC8dZHU/tyTiqAW7oaOSkqKwTP5gcyNEqsjQHV//q6v+uqtjfYn
 1C4oUsRHT2vJiV9ktNJTA4GQHIYF4geGgpG8Ih2SjXsSzdGtUd3DtX1iq0YiLEOk
 eHhYsnniqsYB5g==
 =xrz8
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Thomas Gleixner:

 - Limit the hardcoded topology quirk for Hygon CPUs to those which have
   a model ID less than 4.

   The newer models have the topology CPUID leaf 0xB correctly
   implemented and are not affected.

 - Make SMT control more robust against enumeration failures

   SMT control was added to allow controlling SMT at boottime or
   runtime. The primary purpose was to provide a simple mechanism to
   disable SMT in the light of speculation attack vectors.

   It turned out that the code is sensible to enumeration failures and
   worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
   which means the primary thread mask is not set up correctly. By
   chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes
   the hotplug control stay at its default value "enabled". So the mask
   is never evaluated.

   The ongoing rework of the topology evaluation caused XEN/PV to end up
   with smp_num_siblings == 1, which sets the SMT control to "not
   supported" and the empty primary thread mask causes the hotplug core
   to deny the bringup of the APS.

   Make the decision logic more robust and take 'not supported' and 'not
   implemented' into account for the decision whether a CPU should be
   booted or not.

 - Fake primary thread mask for XEN/PV

   Pretend that all XEN/PV vCPUs are primary threads, which makes the
   usage of the primary thread mask valid on XEN/PV. That is consistent
   with because all of the topology information on XEN/PV is fake or
   even non-existent.

 - Encapsulate topology information in cpuinfo_x86

   Move the randomly scattered topology data into a separate data
   structure for readability and as a preparatory step for the topology
   evaluation overhaul.

 - Consolidate APIC ID data type to u32

   It's fixed width hardware data and not randomly u16, int, unsigned
   long or whatever developers decided to use.

 - Cure the abuse of cpuinfo for persisting logical IDs.

   Per CPU cpuinfo is used to persist the logical package and die IDs.
   That's really not the right place simply because cpuinfo is subject
   to be reinitialized when a CPU goes through an offline/online cycle.

   Use separate per CPU data for the persisting to enable the further
   topology management rework. It will be removed once the new topology
   management is in place.

 - Provide a debug interface for inspecting topology information

   Useful in general and extremly helpful for validating the topology
   management rework in terms of correctness or "bug" compatibility.

* tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
  x86/cpu: Provide debug interface
  x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
  x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
  x86/apic: Use u32 for [gs]et_apic_id()
  x86/apic: Use u32 for phys_pkg_id()
  x86/apic: Use u32 for cpu_present_to_apicid()
  x86/apic: Use u32 for check_apicid_used()
  x86/apic: Use u32 for APIC IDs in global data
  x86/apic: Use BAD_APICID consistently
  x86/cpu: Move cpu_l[l2]c_id into topology info
  x86/cpu: Move logical package and die IDs into topology info
  x86/cpu: Remove pointless evaluation of x86_coreid_bits
  x86/cpu: Move cu_id into topology info
  x86/cpu: Move cpu_core_id into topology info
  hwmon: (fam15h_power) Use topology_core_id()
  scsi: lpfc: Use topology_core_id()
  x86/cpu: Move cpu_die_id into topology info
  x86/cpu: Move phys_proc_id into topology info
  x86/cpu: Encapsulate topology information in cpuinfo_x86
  ...
2023-10-30 17:37:47 -10:00
David Francis
142262a1c0 drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and
MTYPE_UC for nonlocal memory.

On NUMA systems, local memory gets the local mtype, set by an
override callback. If EXT_COHERENT is set, memory will be set as
MTYPE_UC by default, with local memory MTYPE_CC.

Add an option in the override function for this case, and
add a check to ensure it is not used on UNCACHED memory.

V2: Combined APU and NUMA code into one patch
V3: Fixed a potential nullptr in amdgpu_vm_bo_update

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27 14:15:16 -04:00
Philip Yang
541c341d2e Revert "drm/amdkfd: Use partial migrations in GPU page faults"
This reverts commit dc427a473e.

The change prevents migrating the entire range to VRAM because retry
fault restore_pages map the remaining system memory range to GPUs. It
will work correctly to submit together with partial mapping to GPU
patch later.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:23 -04:00
Philip Yang
afaec204d2 Revert "drm/amdkfd:remove unused code"
This reverts commit f9caf6cdd5.

Needed for the next revert patch.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:23 -04:00
Srinivasan Shanmugam
0300882ed6 drm/amdkfd: Address 'remap_list' not described in 'svm_range_add'
Fixes the below:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2073: warning: Function parameter or member 'remap_list' not described in 'svm_range_add'

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:22 -04:00
Jesse Zhang
282c1d7930 drm/amdkfd: Fix shift out-of-bounds issue
[  567.613292] shift exponent 255 is too large for 64-bit type 'long unsigned int'
[  567.614498] CPU: 5 PID: 238 Comm: kworker/5:1 Tainted: G           OE      6.2.0-34-generic #34~22.04.1-Ubuntu
[  567.614502] Hardware name: AMD Splinter/Splinter-RPL, BIOS WS43927N_871 09/25/2023
[  567.614504] Workqueue: events send_exception_work_handler [amdgpu]
[  567.614748] Call Trace:
[  567.614750]  <TASK>
[  567.614753]  dump_stack_lvl+0x48/0x70
[  567.614761]  dump_stack+0x10/0x20
[  567.614763]  __ubsan_handle_shift_out_of_bounds+0x156/0x310
[  567.614769]  ? srso_alias_return_thunk+0x5/0x7f
[  567.614773]  ? update_sd_lb_stats.constprop.0+0xf2/0x3c0
[  567.614780]  svm_range_split_by_granularity.cold+0x2b/0x34 [amdgpu]
[  567.615047]  ? srso_alias_return_thunk+0x5/0x7f
[  567.615052]  svm_migrate_to_ram+0x185/0x4d0 [amdgpu]
[  567.615286]  do_swap_page+0x7b6/0xa30
[  567.615291]  ? srso_alias_return_thunk+0x5/0x7f
[  567.615294]  ? __free_pages+0x119/0x130
[  567.615299]  handle_pte_fault+0x227/0x280
[  567.615303]  __handle_mm_fault+0x3c0/0x720
[  567.615311]  handle_mm_fault+0x119/0x330
[  567.615314]  ? lock_mm_and_find_vma+0x44/0x250
[  567.615318]  do_user_addr_fault+0x1a9/0x640
[  567.615323]  exc_page_fault+0x81/0x1b0
[  567.615328]  asm_exc_page_fault+0x27/0x30
[  567.615332] RIP: 0010:__get_user_8+0x1c/0x30

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Suggested-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:21 -04:00
Alex Sierra
7ef6b2d4b7 drm/amdkfd: remap unaligned svm ranges that have split
Split SVM ranges that have been mapped into 2MB page table entries,
require to be remap in case the split has happened in a non-aligned
VA.
[WHY]:
This condition causes the 2MB page table entries be split into 4KB
PTEs.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:29 -04:00
Jesse Zhang
f9caf6cdd5 drm/amdkfd:remove unused code
Function svm_range_split_by_grinity is not used,
so it is removed.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Suggested-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Jiapeng Chong
8e9a110cb2 drm/amdkfd: clean up some inconsistent indenting
No functional modification involved.

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:305 svm_range_free() warn: inconsistent indenting.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6804
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-19 18:26:52 -04:00
Dave Airlie
27442758e9 amd-drm-next-6.7-2023-10-13:
amdgpu:
 - DC replay fixes
 - Misc code cleanups and spelling fixes
 - Documentation updates
 - RAS EEPROM Updates
 - FRU EEPROM Updates
 - IP discovery updates
 - SR-IOV fixes
 - RAS updates
 - DC PQ fixes
 - SMU 13.0.6 updates
 - GC 11.5 Support
 - NBIO 7.11 Support
 - GMC 11 Updates
 - Reset fixes
 - SMU 11.5 Updates
 - SMU 13.0 OD support
 - Use flexible arrays for bo list handling
 - W=1 Fixes
 - SubVP fixes
 - DPIA fixes
 - DCN 3.5 Support
 - Devcoredump fixes
 - VPE 6.1 support
 - VCN 4.0 Updates
 - S/G display fixes
 - DML fixes
 - DML2 Support
 - MST fixes
 - VRR fixes
 - Enable seamless boot in more cases
 - Enable content type property for HDMI
 - OLED fixes
 - Rework and clean up GPUVM TLB flushing
 - DC ODM fixes
 - DP 2.x fixes
 - AGP aperture fixes
 - SDMA firmware loading cleanups
 - Cyan Skillfish GPU clock counter fix
 - GC 11 GART fix
 - Cache GPU fault info for userspace queries
 - DC cursor check fixes
 - eDP fixes
 - DC FP handling fixes
 - Variable sized array fixes
 - SMU 13.0.x fixes
 - IB start and size alignment fixes for VCN
 - SMU 14 Support
 - Suspend and resume sequence rework
 - vkms fix
 
 amdkfd:
 - GC 11 fixes
 - GC 10 fixes
 - Doorbell fixes
 - CWSR fixes
 - SVM fixes
 - Clean up GC info enumeration
 - Rework memory limit handling
 - Coherent memory handling fixes
 - Use partial migrations in GPU faults
 - TLB flush fixes
 - DMA unmap fixes
 - GC 9.4.3 fixes
 - SQ interrupt fix
 - GTT mapping fix
 - GC 11.5 Support
 
 radeon:
 - Misc code cleanups
 - W=1 Fixes
 - Fix possible buffer overflow
 - Fix possible NULL pointer dereference
 
 UAPI:
 - Add EXT_COHERENT memory allocation flags.  These allow for system scope atomics.
   Proposed userspace: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/pull/88
 - Add support for new VPE engine.  This is a memory to memory copy engine with advanced scaling, CSC, and color management features
   Proposed mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25713
 - Add INFO IOCTL interface to query GPU faults
   Proposed Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23238
   Proposed libdrm MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/298
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZSmDAQAKCRC93/aFa7yZ
 2EdeAQC2lkQ9IHLOon5kIZUK+r9IPYlgFsii+qfmMPLBaMcuwgEA8F4eJln/cc9V
 02EKhlapkggYXYa+uhOE2KTnWgMFJgI=
 =SEXq
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.7-2023-10-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.7-2023-10-13:

amdgpu:
- DC replay fixes
- Misc code cleanups and spelling fixes
- Documentation updates
- RAS EEPROM Updates
- FRU EEPROM Updates
- IP discovery updates
- SR-IOV fixes
- RAS updates
- DC PQ fixes
- SMU 13.0.6 updates
- GC 11.5 Support
- NBIO 7.11 Support
- GMC 11 Updates
- Reset fixes
- SMU 11.5 Updates
- SMU 13.0 OD support
- Use flexible arrays for bo list handling
- W=1 Fixes
- SubVP fixes
- DPIA fixes
- DCN 3.5 Support
- Devcoredump fixes
- VPE 6.1 support
- VCN 4.0 Updates
- S/G display fixes
- DML fixes
- DML2 Support
- MST fixes
- VRR fixes
- Enable seamless boot in more cases
- Enable content type property for HDMI
- OLED fixes
- Rework and clean up GPUVM TLB flushing
- DC ODM fixes
- DP 2.x fixes
- AGP aperture fixes
- SDMA firmware loading cleanups
- Cyan Skillfish GPU clock counter fix
- GC 11 GART fix
- Cache GPU fault info for userspace queries
- DC cursor check fixes
- eDP fixes
- DC FP handling fixes
- Variable sized array fixes
- SMU 13.0.x fixes
- IB start and size alignment fixes for VCN
- SMU 14 Support
- Suspend and resume sequence rework
- vkms fix

amdkfd:
- GC 11 fixes
- GC 10 fixes
- Doorbell fixes
- CWSR fixes
- SVM fixes
- Clean up GC info enumeration
- Rework memory limit handling
- Coherent memory handling fixes
- Use partial migrations in GPU faults
- TLB flush fixes
- DMA unmap fixes
- GC 9.4.3 fixes
- SQ interrupt fix
- GTT mapping fix
- GC 11.5 Support

radeon:
- Misc code cleanups
- W=1 Fixes
- Fix possible buffer overflow
- Fix possible NULL pointer dereference

UAPI:
- Add EXT_COHERENT memory allocation flags.  These allow for system scope atomics.
  Proposed userspace: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/pull/88
- Add support for new VPE engine.  This is a memory to memory copy engine with advanced scaling, CSC, and color management features
  Proposed mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25713
- Add INFO IOCTL interface to query GPU faults
  Proposed Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23238
  Proposed libdrm MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/298

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231013175758.1735031-1-alexander.deucher@amd.com
2023-10-18 16:08:07 +10:00
Thomas Gleixner
b9655e702d x86/cpu: Encapsulate topology information in cpuinfo_x86
The topology related information is randomly scattered across cpuinfo_x86.

Create a new structure cpuinfo_topo and move in a first step initial_apicid
and apicid into it.

Aside of being better readable this is in preparation for replacing the
horribly fragile CPU topology evaluation code further down the road.

Consolidate APIC ID fields to u32 as that represents the hardware type.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.269787744@linutronix.de
2023-10-10 14:38:17 +02:00
Arvind Yadav
367a0af433 drm/amdkfd: get doorbell's absolute offset based on the db_size
Here, Adding db_size in byte to find the doorbell's
absolute offset for both 32-bit and 64-bit doorbell sizes.
So that doorbell offset will be aligned based on the doorbell
size.

v2:
- Addressed the review comment from Felix.
v3:
- Adding doorbell_size as parameter to get db absolute offset.
v4:
  Squash the two patches into one.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-09 17:02:34 -04:00
Xiaogang Chen
dc427a473e drm/amdkfd: Use partial migrations in GPU page faults
This patch implements partial migration in gpu page fault according to migration
granularity(default 2MB) and not split svm range in cpu page fault handling.
A svm range may include pages from both system ram and vram of one gpu now.
These chagnes are expected to improve migration performance and reduce mmu
callback and TLB flush workloads.

Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-05 17:58:59 -04:00
Philip Yang
23de7616f3 drm/amdkfd: Fix EXT_COHERENT memory allocation crash
If there is no VRAM domain, bo_node is NULL and this causes crash.
Refactor the change, and use the module parameter as higher privilege.

Need another patch to support override PTE flag on APU.

Fixes: 5f248462c6 ("drm/amdgpu: Add EXT_COHERENT memory allocation flags")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-04 18:40:30 -04:00
Alex Deucher
0021d70a06 drm/amdkfd: drop struct kfd_cu_info
I think this was an abstraction back from when
kfd supported both radeon and amdgpu.  Since we just
support amdgpu now, there is no more need for this and
we can use the amdgpu structures directly.

This also avoids having the kfd_cu_info structures on
the stack when inlining which can blow up the stack.

Cc: Arnd Bergmann <arnd@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-03 15:41:13 -04:00
Alex Deucher
4ff91f2185 drm/amdkfd: reduce stack size in kfd_topology_add_device()
kfd_topology.c:2082:1: warning: the frame size of 1440 bytes is larger than 1024 bytes

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2866
Cc: Arnd Bergmann <arnd@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-03 15:40:50 -04:00
Xiaogang Chen
709c348261 drm/amdkfd: Fix a race condition of vram buffer unref in svm code
prange->svm_bo unref can happen in both mmu callback and a callback after
migrate to system ram. Both are async call in different tasks. Sync svm_bo
unref operation to avoid random "use-after-free".

Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Tested-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-28 15:44:29 -04:00
Philip Yang
eb3c357bcb drm/amdkfd: Handle errors from svm validate and map
If new range is splited to multiple pranges with max_svm_range_pages
alignment and added to update_list, svm validate and map should keep
going after error to make sure prange->mapped_to_gpu flag is up to date
for the whole range.

svm validate and map update set prange->mapped_to_gpu after mapping to
GPUs successfully, otherwise clear prange->mapped_to_gpu flag (for
update mapping case) instead of setting error flag, we can remove
the redundant error flag to simpliy code.

Refactor to remove goto and update prange->mapped_to_gpu flag inside
svm_range_lock, to guarant we always evict queues or unmap from GPUs if
there are invalid ranges.

After svm validate and map return error -EAGIN, the caller retry will
update the mapping for the whole range again.

Fixes: c22b044070 ("drm/amdkfd: flag added to handle errors from svm validate and map")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: James Zhu <james.zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-26 17:00:23 -04:00
Xiaogang Chen
7bfaa160ca drm/amdkfd: fix some race conditions in vram buffer alloc/free of svm code
This patch fixes:
1: ref number of prange's svm_bo got decreased by an async call from hmm. When
wait svm_bo of prange got released we shoul also wait prang->svm_bo become NULL,
otherwise prange->svm_bo may be set to null after allocate new vram buffer.

2: During waiting svm_bo of prange got released in a while loop should reschedule
current task to give other tasks oppotunity to run, specially the the workque
task that handles svm_bo ref release, otherwise we may enter to softlock.

Signed-off-by: Xiaogang.Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-26 17:00:02 -04:00
Philip Yang
101b810430 drm/amdkfd: Move dma unmapping after TLB flush
Otherwise GPU may access the stale mapping and generate IOMMU
IO_PAGE_FAULT.

Move this to inside p->mutex to prevent multiple threads mapping and
unmapping concurrently race condition.

After kfd_mem_dmaunmap_attachment is removed from unmap_bo_from_gpuvm,
kfd_mem_dmaunmap_attachment is called if failed to map to GPUs, and
before free the mem attachment in case failed to unmap from GPUs.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-26 16:55:10 -04:00
YuBiao Wang
cc39f9ccb8 drm/amdkfd: Use gpu_offset for user queue's wptr
Directly use tbo's start address will miss the domain start offset. Need
to use gpu_offset instead.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2023-09-20 17:30:42 -04:00
Philip Yang
c99b161280 drm/amdkfd: Remove svm range validated_once flag
The validated_once flag is not used after the prefault was removed, The
prefault was needed to ensure validate all system memory pages at least
once before mapping or migrating the range to GPU.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:25:17 -04:00
Xiaogang Chen
df954b695c drm/amdkfd: Separate dma unmap and free of dma address array operations
We do not need free dma address array of svm_range each time we do dma unmap
for pages in svm_range as we can reuse the same array. Only free it when free
svm_range. Separate these two operations and use them accordingly.

Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:24:22 -04:00
YuBiao Wang
b157df66d8 drm/amdkfd: Use gpu_offset for user queue's wptr
Directly use tbo's start address will miss the domain start offset. Need
to use gpu_offset instead.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:24:09 -04:00
David Francis
5f248462c6 drm/amdgpu: Add EXT_COHERENT memory allocation flags
These flags (for GEM and SVM allocations) allocate
memory that allows for system-scope atomic semantics.

On GFX943 these flags cause caches to be avoided on
non-local memory.

On all other ASICs they are identical in functionality to the
equivalent COHERENT flags.

Corresponding Thunk patch is at
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/pull/88

Reviewed-by: David Yat Sin <David.YatSin@amd.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:24:06 -04:00
Jonathan Kim
d92e55565c drm/amdkfd: fix add queue process context clear without runtime enable
There are cases where HSA runtime is not enabled through the
AMDKFD_IOC_RUNTIME_ENABLE call when adding queues and the MES ADD_QUEUE
API should clear the MES process context instead of SET_SHADER_DEBUGGER.
Such examples are legacy HSA runtime builds that do not support the
current exception handling and running KFD tests.

The only time ADD_QUEUE.skip_process_ctx_clear is required is for
debugger use cases where a debugged process is always runtime enabled
when adding a queue.

Tested-by: Shikai Guo <shikai.guo@amd.com>
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:23:57 -04:00
Lijo Lazar
4e8303cf2c drm/amdgpu: Use function for IP version check
Use an inline function for version check. Gives more flexibility to
handle any format changes.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 12:23:28 -04:00
Harish Kasiviswanathan
edcfe22985 drm/amdkfd: Insert missing TLB flush on GFX10 and later
Heavy-weight TLB flush is required after unmap on all GPUs for
correctness and security.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2023-09-12 17:45:40 -04:00
Harish Kasiviswanathan
4412f8529c drm/amdkfd: Insert missing TLB flush on GFX10 and later
Heavy-weight TLB flush is required after unmap on all GPUs for
correctness and security.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-12 17:30:01 -04:00
David Francis
9296da8c40 drm/amdkfd: Checkpoint and restore queues on GFX11
The code in kfd_mqd_manager_v11.c to support criu dump and
restore of queue state was missing.

Added it; should be equivalent to kfd_mqd_manager_v10.c.

CC: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 18:22:38 -04:00
Mukul Joshi
fc6efed2c7 drm/amdkfd: Update CU masking for GFX 9.4.3
The CU mask passed from user-space will change based on
different spatial partitioning mode. As a result, update
CU masking code for GFX9.4.3 to work for all partitioning
modes.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 18:17:27 -04:00
Mukul Joshi
0752e66e91 drm/amdkfd: Update cache info reporting for GFX v9.4.3
Update cache info reporting in sysfs to report the correct
number of CUs and associated cache information based on
different spatial partitioning modes.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 18:17:20 -04:00
Mukul Joshi
97e3c6a853 drm/amdgpu: Store CU info from all XCCs for GFX v9.4.3
Currently, we store CU info only for a single XCC assuming
that it is the same for all XCCs. However, that may not be
true. As a result, store CU info for all XCCs. This info is
later used for CU masking.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 18:16:31 -04:00
Mukul Joshi
2f06b27444 drm/amdkfd: Fix unaligned 64-bit doorbell warning
This patch fixes the following unaligned 64-bit doorbell
warning seen when submitting packets on HIQ on GFX v9.4.3
by making the HIQ doorbell 64-bit aligned.
The warning is seen when GPU is loaded in any mode other
than SPX mode.

[  +0.000301] ------------[ cut here ]------------
[  +0.000003] Unaligned 64-bit doorbell
[  +0.000030] WARNING: /amdkfd/kfd_doorbell.c:339 write_kernel_doorbell64+0x72/0x80
[  +0.000003] RIP: 0010:write_kernel_doorbell64+0x72/0x80
[  +0.000004] RSP: 0018:ffffc90004287730 EFLAGS: 00010246
[  +0.000005] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  +0.000003] RDX: 0000000000000001 RSI: ffffffff82837c71 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffc90004287748 R08: 0000000000000003 R09: 0000000000000001
[  +0.000002] R10: 000000000000001a R11: ffff88a034008198 R12: ffffc900013bd004
[  +0.000003] R13: 0000000000000008 R14: ffffc900042877b0 R15: 000000000000007f
[  +0.000003] FS:  00007fa8c7b62000(0000) GS:ffff889f88400000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056111c45aaf0 CR3: 00000001414f2002 CR4: 0000000000770ee0
[  +0.000003] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000004]  <TASK>
[  +0.000006]  kq_submit_packet+0x45/0x50 [amdgpu]
[  +0.000524]  pm_send_set_resources+0x7f/0xc0 [amdgpu]
[  +0.000500]  set_sched_resources+0xe4/0x160 [amdgpu]
[  +0.000503]  start_cpsch+0x1c5/0x2a0 [amdgpu]
[  +0.000497]  kgd2kfd_device_init.cold+0x816/0xb42 [amdgpu]
[  +0.000743]  amdgpu_amdkfd_device_init+0x15f/0x1f0 [amdgpu]
[  +0.000602]  amdgpu_device_init.cold+0x1813/0x2176 [amdgpu]
[  +0.000684]  ? pci_bus_read_config_word+0x4a/0x80
[  +0.000012]  ? do_pci_enable_device+0xdc/0x110
[  +0.000008]  amdgpu_driver_load_kms+0x1a/0x110 [amdgpu]
[  +0.000545]  amdgpu_pci_probe+0x197/0x400 [amdgpu]

Fixes: c318666510 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 18:15:49 -04:00
Mukul Joshi
81faf9e0c3 drm/amdkfd: Fix reg offset for setting CWSR grace period
This patch fixes the case where the code currently passes
absolute register address and not the reg offset, which HWS
expects, when sending the PM4 packet to set/update CWSR grace
period. Additionally, cleanup the signature of
build_grace_period_packet_info function as it no longer needs
the inst parameter.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 18:15:43 -04:00
André Almeida
887db1e49a drm/amdgpu: Merge debug module parameters
Merge all developer debug options available as separated module
parameters in one, making it obvious that are for developers.

Drop the obsolete module options in favor of the new ones.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:19:54 -04:00
David Francis
e379162adf drm/amdkfd: Checkpoint and restore queues on GFX11
The code in kfd_mqd_manager_v11.c to support criu dump and
restore of queue state was missing.

Added it; should be equivalent to kfd_mqd_manager_v10.c.

CC: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David Francis <David.Francis@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:16:42 -04:00
Mukul Joshi
68fa72a437 drm/amdgpu: Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES
Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES to conform with
the naming convention followed in amdgpu_gfx.h. No functional
change.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:15:41 -04:00
Mukul Joshi
f4fa8fcd25 drm/amdkfd: Update CU masking for GFX 9.4.3
The CU mask passed from user-space will change based on
different spatial partitioning mode. As a result, update
CU masking code for GFX9.4.3 to work for all partitioning
modes.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:11:12 -04:00