Commit Graph

16174 Commits

Author SHA1 Message Date
Hawking Zhang
4dbc17b455 drm/amdgpu: Convert update_partition_sched_list into a common helper v3
The update_partition_sched_list function does not
need to remain as a soc specific callback. It can
be reused for future products.

v2: bypass the function if xcp_mgr is not available (Likun)

v3: Let caller check the availability of xcp_mgr (Lijo)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:41 -04:00
Hawking Zhang
bf587417ff drm/amdgpu: Convert select_sched into a common helper v3
The xcp select_sched function does not need to
remain as a soc specific callback. It can be reused
for future products.

v2: bypass the function if xcp_mgr is not available (Likun)

v3: Let caller check the availability of xcp mgr (Lijo)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:32 -04:00
Likun Gao
37b791d667 drm/amdgpu: use common function to map ip for aqua_vanjaram
Switch to the function amdgpu_ip_map_init to map IP
instances for aqua_vanjaram instead of a per-ASIC
operation.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:25 -04:00
Likun Gao
20905edb24 drm/amdgpu: make ip map init to common function
The IP instance map init function can be a common function
instead of a per-ASIC operation.
V2: Create amdgpu_ip.[ch] file for ip related functions.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:16 -04:00
Pratap Nirujogi
f0ebe9e578 drm/amd/amdgpu: Refine isp_v4_1_1 logging
Replace DRM_ERROR with drm_err function and update log
messages to drop __func__ and print return value.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:02:55 -04:00
Pratap Nirujogi
fd14786071 drm/amd/amdgpu: Add ISP Generic PM Domain (genpd) support
The AMDISP I2C device requires the ISP HW to be powered on to probe the
sensor device. Instead of using the exported symbols from the ISP driver to
control the power and clocks remotely, add Generic PM Domain (genpd)
support in the amdgpu_isp device for its child devices (amd_isp_capture,
amd_isp_i2c_designware) to set power and clocks using PM methods.

Co-developed-by: Bin Du <bin.du@amd.com>
Signed-off-by: Bin Du <bin.du@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:02:50 -04:00
Vitaly Prosyak
5fb90421fa drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c
The issue was reproduced on NV10 using IGT pci_unplug test.
It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`.
However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then
later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`.
As a result, KASAN detected a use-after-free condition, as shown in the log below.
The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and
`amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before
`amdgpu_fpriv` is freed.

This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`,
where the following components are initialized:
- `amdgpu_userq_mgr_init()`
- `amdgpu_eviction_fence_init()`
- `amdgpu_ctx_mgr_init()`

Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using:
- `amdgpu_userq_mgr_fini()`
- `amdgpu_eviction_fence_destroy()`
- `amdgpu_ctx_mgr_fini()`

This change eliminates the use-after-free and improves consistency in resource management between open and close paths.

[  +0.094367] ==================================================================
[  +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737
[  +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ #2
[  +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000010]  print_report+0xce/0x600
[  +0.000009]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000790]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000008]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000684]  kasan_report+0xbe/0x110
[  +0.000007]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000601]  __asan_report_store8_noabort+0x17/0x30
[  +0.000007]  amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000801]  ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu]
[  +0.000819]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_drm_release+0xa3/0xe0 [amdgpu]
[  +0.000604]  __fput+0x354/0xa90
[  +0.000010]  __fput_sync+0x59/0x80
[  +0.000005]  __x64_sys_close+0x7d/0xe0
[  +0.000006]  x64_sys_call+0x2505/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? kasan_record_aux_stack+0xae/0xd0
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kmem_cache_free+0x398/0x580
[  +0.000006]  ? __fput+0x543/0xa90
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __fput+0x543/0xa90
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7ffff7b14f67
[  +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67
[  +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003
[  +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8
[  +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040
[  +0.000007]  </TASK>

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1737 on cpu 9 at 76.455063s:
[  +0.000010]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000006]  __kasan_slab_free+0x54/0x80
[  +0.000005]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000485]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000007]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000006]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000006]  device_remove+0xc7/0x180
[  +0.000006]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000006]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000007]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000007]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] The buggy address belongs to the object at ffff88811c068000
               which belongs to the cache kmalloc-rnd-01-4k of size 4096
[  +0.000016] The buggy address is located 3168 bytes inside of
               freed 4096-byte region [ffff88811c068000, ffff88811c069000)

[  +0.000022] The buggy address belongs to the physical page:
[  +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068
[  +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000004] page dumped because: kasan: bad access detected

[  +0.000011] Memory state around the buggy address:
[  +0.000009]  ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                        ^
[  +0.000010]  ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Cc: Arvind Yadav <arvind.yadav@amd.com>

v2: drop amdgpu_drm_release() and assign drm_release()
    as the callback directly. (Alex)

Fixes: adba092973 ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:01:28 -04:00
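
The teardown ordering described in commit 5fb90421fa can be illustrated with a small user-space sketch. This is not the amdgpu code: `struct file_priv`, `file_open()`, and `file_postclose()` are hypothetical stand-ins for `amdgpu_fpriv`, `amdgpu_driver_open_kms()`, and `amdgpu_driver_postclose_kms()`, and the boolean flags only mimic the three components the commit message lists. The point it tries to show is that every sub-component is torn down while the container is still valid, and the container is freed last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-file private data. */
struct file_priv {
    bool userq_mgr_ready;
    bool eviction_fence_ready;
    bool ctx_mgr_ready;
};

/* Open path: allocate the container, then bring up each component. */
struct file_priv *file_open(void)
{
    struct file_priv *fpriv = calloc(1, sizeof(*fpriv));

    if (!fpriv)
        return NULL;
    fpriv->userq_mgr_ready = true;
    fpriv->eviction_fence_ready = true;
    fpriv->ctx_mgr_ready = true;
    return fpriv;
}

/*
 * Close path: tear down each component while fpriv is still valid,
 * and free the container only as the very last step. Returns the
 * number of components that were torn down.
 */
int file_postclose(struct file_priv *fpriv)
{
    int torn_down = 0;

    assert(fpriv != NULL);
    if (fpriv->userq_mgr_ready) {
        fpriv->userq_mgr_ready = false;
        torn_down++;
    }
    if (fpriv->eviction_fence_ready) {
        fpriv->eviction_fence_ready = false;
        torn_down++;
    }
    if (fpriv->ctx_mgr_ready) {
        fpriv->ctx_mgr_ready = false;
        torn_down++;
    }
    free(fpriv); /* nothing may touch fpriv after this point */
    return torn_down;
}
```

Any later release hook that still dereferenced the container after `file_postclose()` would reproduce the class of bug KASAN reported; keeping all cleanup in one function makes that harder to do by accident.
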
Alex Deucher
684385273d drm/amdgpu: remove fence slab
Just use kmalloc for the fences in the rare case we need
an independent fence.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:00:03 -04:00
Mario Limonciello
49f1f9f6c3 drm/amd: Adjust output for discovery error handling
commit 017fbb6690 ("drm/amdgpu/discovery: check ip_discovery fw file
available") added support for reading an amdgpu IP discovery bin file
for some specific products. If it's not found then it will fallback to
hardcoded values. However if it's not found there is also a lot of noise
about missing files and errors.

Adjust the error handling to decrease most messages to DEBUG and to show
users less about missing files.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reported-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4312
Tested-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Fixes: 017fbb6690 ("drm/amdgpu/discovery: check ip_discovery fw file available")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250617183052.1692059-1-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 09:58:35 -04:00
Alex Deucher
0180e0a5dd drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
Seems some older MES firmware versions do not properly support
this packet.  Add back some of the compatibility checks.

v2: switch to fw version check (Shaoyun)

Fixes: f81cd79311 ("drm/amd/amdgpu: Fix MES init sequence")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4295
Cc: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: shaoyun.liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 09:54:33 -04:00
Srinivasan Shanmugam
99808926d0 drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
Enable the cleaner shader for other GFX9.x series of GPUs to provide
data isolation between GPU workloads. The cleaner shader is responsible
for clearing the Local Data Store (LDS), Vector General Purpose
Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which
helps prevent data leakage and ensures accurate computation results.

This update extends cleaner shader support to GFX9.x GPUs, previously
available for GFX9.4.2. It enhances security by clearing GPU memory
between processes and maintains a consistent GPU state across KGD and
KFD workloads.

Cc: Manu Rastogi <manu.rastogi@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 09:54:23 -04:00
Greg Kroah-Hartman
63dafeb392 Merge 6.16-rc3 into driver-core-next
We need the driver-core fixes that are in 6.16-rc3 into here as well
to build on top of.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-23 07:53:36 +02:00
Alex Deucher
fe79ef3530 drm/amdgpu/sdma5.2: init engine reset mutex
Missing the mutex init.

Fixes: 47454f2dc0 ("drm/amdgpu: Register the new sdma function pointers for sdma_v5_2")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ea685ff30a)
2025-06-18 13:17:49 -04:00
Alex Deucher
49cc5beeab drm/amdgpu/sdma5: init engine reset mutex
Missing the mutex init.

Fixes: e56d4bf57f ("drm/amdgpu/: drm/amdgpu: Register the new sdma function pointers for sdma_v5_0")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3f4caf092f)
2025-06-18 13:17:00 -04:00
Alex Deucher
ebe4354270 drm/amdgpu: switch job hw_fence to amdgpu_fence
Use the amdgpu fence container so we can store additional
data in the fence.  This also fixes the start_time handling
for MCBP since we were casting the fence to an amdgpu_fence
and it wasn't.

Fixes: 3f4c175d62 ("drm/amdgpu: MCBP based on DRM scheduler (v9)")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bf1cd14f9e)
Cc: stable@vger.kernel.org
2025-06-18 13:15:26 -04:00
Jesse Zhang
7f3b16f3f2 drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequences
This commit makes two key fixes to SDMA v4.4.2 handling:

1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines
   by reading the current value before modifying UTC_L1_ENABLE bit.

2. Ensure UTC_L1_ENABLE is consistently managed by:
   - Adding the missing register write when enabling UTC_L1 during start
   - Keeping UTC_L1 enabled by default as per hardware requirements

v2: Correct SDMA_CNTL setting (Philip)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 375bf56465)
Cc: stable@vger.kernel.org
2025-06-18 13:14:40 -04:00
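
The read-modify-write pattern that commit 7f3b16f3f2 describes for the UTC_L1_ENABLE bit can be sketched as below. This is a minimal illustration under stated assumptions, not the SDMA v4.4.2 code: the register is modeled as a plain variable, `reg_read()` and `reg_write()` are hypothetical accessors, and the bit position in `UTC_L1_ENABLE_MASK` is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, chosen only for this example. */
#define UTC_L1_ENABLE_MASK (UINT32_C(1) << 1)

/* A plain variable standing in for the memory-mapped control register. */
static uint32_t fake_sdma_cntl;

uint32_t reg_read(void)
{
    return fake_sdma_cntl;
}

void reg_write(uint32_t value)
{
    fake_sdma_cntl = value;
}

/*
 * Read the current register value first, change only the one bit,
 * then write the result back so unrelated bits are preserved.
 */
void set_utc_l1(bool enable)
{
    uint32_t value = reg_read();

    if (enable)
        value |= UTC_L1_ENABLE_MASK;
    else
        value &= ~UTC_L1_ENABLE_MASK;
    reg_write(value);
}
```

Skipping either the initial read or the final write is the kind of mistake the commit message mentions: the first clobbers neighboring bits, the second leaves the hardware unchanged.
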
Lijo Lazar
785c536c31 drm/amdgpu: Release reset locks during failures
Make sure to release reset domain lock in case of failures.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Fixes: 11bb33766f ("drm/amdgpu: refactor amdgpu_device_gpu_recover")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1ab11a8268)
2025-06-18 13:14:10 -04:00
Sonny Jiang
46e15197b5 drm/amdgpu: VCN v5_0_1 to prevent FW checking RB during DPG pause
Add a protection to ensure all programming is complete prior to the VCPU
starting. This is a workaround (WA) for an unintended VCPU running.

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c29521b529)
Cc: stable@vger.kernel.org
2025-06-18 13:13:13 -04:00
Jesse Zhang
caade9d69f drm/amdgpu: Use logical instance ID for SDMA v4_4_2 queue operations
Simplify SDMA v4_4_2 queue reset and stop operations by:
1. Removing GET_INST(SDMA0) conversion for ring->me
2. Using the logical instance ID (ring->me) directly
3. Maintaining consistent behavior with other SDMA queue operations

This change aligns with the existing queue handling logic where
ring->me already represents the correct instance identifier.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3bab282dfe)
Cc: stable@vger.kernel.org
2025-06-18 13:11:15 -04:00
Jesse Zhang
09b585592f drm/amdgpu: Fix SDMA engine reset with logical instance ID
This commit makes the following improvements to SDMA engine reset handling:

1. Clarifies in the function documentation that instance_id refers to a logical ID
2. Adds conversion from logical to physical instance ID before performing reset
   using GET_INST(SDMA0, instance_id) macro
3. Improves error messaging to indicate when a logical instance reset fails
4. Adds better code organization with blank lines for readability

The change ensures proper SDMA engine reset by using the correct physical
instance ID while maintaining the logical ID interface for callers.

V2: Remove harvest_config check and convert directly to physical instance (Lijo)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5efa6217c2)
Cc: stable@vger.kernel.org
2025-06-18 13:10:44 -04:00
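
The logical-to-physical instance conversion that commit 09b585592f refers to can be pictured with the sketch below. It does not reproduce the `GET_INST` macro or the driver's IP map; `struct inst_map`, `inst_map_init()`, and `logical_to_physical()` are hypothetical names, and building the map from a bitmask of present instances is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INST 8

/* Hypothetical lookup table: index is the logical ID, value the physical ID. */
struct inst_map {
    uint8_t phys[MAX_INST];
    unsigned int count;
};

/* Build the table from a bitmask of physically present instances. */
void inst_map_init(struct inst_map *map, uint32_t present_mask)
{
    map->count = 0;
    for (unsigned int p = 0; p < MAX_INST; p++) {
        if (present_mask & (UINT32_C(1) << p))
            map->phys[map->count++] = (uint8_t)p;
    }
}

/* Callers pass a logical ID; the physical ID is resolved just before use. */
unsigned int logical_to_physical(const struct inst_map *map,
                                 unsigned int logical)
{
    assert(logical < map->count);
    return map->phys[logical];
}
```

With a mask of `0xD` (physical instances 0, 2, and 3 present), logical instance 1 resolves to physical instance 2, which is why a reset that used the logical ID directly could target the wrong engine.
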
Frank Min
854171405e drm/amdgpu: add kicker fws loading for gfx11/smu13/psp13
1. Add kicker firmwares loading for gfx11/smu13/psp13
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_11_0_0_rlc_kicker.bin
   - gc_11_0_0_imu_kicker.bin
   - psp_13_0_0_sos_kicker.bin
   - psp_13_0_0_ta_kicker.bin
   - smu_13_0_0_kicker.bin

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fb5ec2174d)
Cc: stable@vger.kernel.org
2025-06-18 13:09:41 -04:00
Frank Min
0bbf5fd86c drm/amdgpu: Add kicker device detection
1. add kicker device list
2. add kicker device checking helper function

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 09aa2b408f)
Cc: stable@vger.kernel.org
2025-06-18 13:08:52 -04:00
Alex Deucher
ea685ff30a drm/amdgpu/sdma5.2: init engine reset mutex
Missing the mutex init.

Fixes: 47454f2dc0 ("drm/amdgpu: Register the new sdma function pointers for sdma_v5_2")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:22 -04:00
Alex Deucher
3f4caf092f drm/amdgpu/sdma5: init engine reset mutex
Missing the mutex init.

Fixes: e56d4bf57f ("drm/amdgpu/: drm/amdgpu: Register the new sdma function pointers for sdma_v5_0")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Alex Deucher
bf1cd14f9e drm/amdgpu: switch job hw_fence to amdgpu_fence
Use the amdgpu fence container so we can store additional
data in the fence.  This also fixes the start_time handling
for MCBP since we were casting the fence to an amdgpu_fence
and it wasn't.

Fixes: 3f4c175d62 ("drm/amdgpu: MCBP based on DRM scheduler (v9)")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar
9750ad5aee drm/amdgpu: Add xgmi API to set max speed/width
Add an API to set the max possible xgmi speed/width.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar
8c9eb6ce50 drm/amdgpu: Deprecate xgmi_link_speed enum
xgmi doesn't have discrete max speeds defined. Speed numbers can be
arbitrary based on SOC. Deprecate the enum.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar
04141c05f3 drm/amdgpu: Extend bus status check to more cases
In case of unexpected errors, check if device is alive on the bus.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Frank Min
a3b7f9c306 drm/amdgpu: reclaim psp fw reservation memory region
A PSP v14 fw update introduced changes to the memory reservation region;
following that change, the driver reclaims some non-reserved regions.

1. introduce 2 new psp commands to query fw reservation regions
2. add a new reservation region for psp
3. reclaim psp non-used region

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
ganglxie
e2d1e96c53 drm/amdgpu: refine usage of amdgpu_bad_page_threshold
When amdgpu_bad_page_threshold == -1 or -2, the driver will issue a warning
message when the threshold is reached and continue runtime services.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Jesse Zhang
375bf56465 drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequences
This commit makes two key fixes to SDMA v4.4.2 handling:

1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines
   by reading the current value before modifying UTC_L1_ENABLE bit.

2. Ensure UTC_L1_ENABLE is consistently managed by:
   - Adding the missing register write when enabling UTC_L1 during start
   - Keeping UTC_L1 enabled by default as per hardware requirements

v2: Correct SDMA_CNTL setting (Philip)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar
1ab11a8268 drm/amdgpu: Release reset locks during failures
Make sure to release reset domain lock in case of failures.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Fixes: 11bb33766f ("drm/amdgpu: refactor amdgpu_device_gpu_recover")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Alex Deucher
9a9e87d152 drm/amdgpu/sdma: handle paging queues in amdgpu_sdma_reset_engine()
Need to properly start and stop paging queues if they are present.

This is not an issue today since we don't support a paging queue
on any chips with queue reset.

Fixes: b22659d5d3 ("drm/amdgpu: switch amdgpu_sdma_reset_engine to use the new sdma function pointers")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:20 -04:00
Emily Deng
54f7a24e14 drm/amdkfd: Move the process suspend and resume out of full access
For the suspend and resume process, exclusive access is not required.
Therefore, it can be moved out of the full access section to reduce the
duration of exclusive access.

v3:
Move suspend processes before hardware fini.
Remove the duplicate call for bare metal.

v4:
Refine code

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Sonny Jiang
c29521b529 drm/amdgpu: VCN v5_0_1 to prevent FW checking RB during DPG pause
Add a protection to ensure all programming is complete prior to the VCPU
starting. This is a workaround (WA) for an unintended VCPU running.

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jesse Zhang
0c3f972394 drm/amdgpu: Add soft reset callback to SDMA v4.4.x
Implement soft reset engine callback for SDMA 4.4.x IPs. This avoids IP
version check in generic implementation.

V2: Correct physical instance ID calculation in soft_reset_engine (Jesse)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jesse Zhang
3bab282dfe drm/amdgpu: Use logical instance ID for SDMA v4_4_2 queue operations
Simplify SDMA v4_4_2 queue reset and stop operations by:
1. Removing GET_INST(SDMA0) conversion for ring->me
2. Using the logical instance ID (ring->me) directly
3. Maintaining consistent behavior with other SDMA queue operations

This change aligns with the existing queue handling logic where
ring->me already represents the correct instance identifier.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jesse Zhang
5efa6217c2 drm/amdgpu: Fix SDMA engine reset with logical instance ID
This commit makes the following improvements to SDMA engine reset handling:

1. Clarifies in the function documentation that instance_id refers to a logical ID
2. Adds conversion from logical to physical instance ID before performing reset
   using GET_INST(SDMA0, instance_id) macro
3. Improves error messaging to indicate when a logical instance reset fails
4. Adds better code organization with blank lines for readability

The change ensures proper SDMA engine reset by using the correct physical
instance ID while maintaining the logical ID interface for callers.

V2: Remove harvest_config check and convert directly to physical instance (Lijo)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Lijo Lazar
3f1e81ecb6 drm/amdgpu: Suspend IH during mode-2 reset
On multi-aid SOCs, there could be a continuous stream of interrupts from
GC after poison consumption. Suspend IH to disable them before doing
mode-2 reset. This avoids conflicts in hardware accesses during
interrupt handlers while a reset is ongoing.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Mario Limonciello
64c3e4a868 drm/amd: Add support for a complete pmops action
complete() callbacks are supposed to handle reversing anything
that occurred during prepare() callbacks.  They'll be called on every
power state transition, and will also be called if the sequence is
failed (such as an aborted suspend).

Add support for IP blocks to support this action.

Reviewed-by: Alex Hung <alex.hung@amd.com>
Link: https://lore.kernel.org/r/20250602014432.3538345-2-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
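
The prepare()/complete() pairing described in commit 64c3e4a868 can be sketched roughly as follows. This is an illustrative model rather than the amdgpu or PM core implementation: `struct ip_block`, `demo_prepare()`, `demo_complete()`, and `run_transition()` are hypothetical, and the rule that complete() runs only for blocks whose prepare() succeeded is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IP block with a prepare/complete callback pair. */
struct ip_block {
    int prepared;
    int (*prepare)(struct ip_block *block);
    void (*complete)(struct ip_block *block);
};

int demo_prepare(struct ip_block *block)
{
    block->prepared = 1;
    return 0;
}

void demo_complete(struct ip_block *block)
{
    block->prepared = 0;
}

/*
 * Run prepare() on each block, then the transition itself (modeled by
 * transition_result). complete() is called for every prepared block in
 * reverse order, whether the transition succeeded or was aborted.
 */
int run_transition(struct ip_block *blocks, size_t n, int transition_result)
{
    size_t prepared = 0;
    int ret = 0;

    for (; prepared < n; prepared++) {
        ret = blocks[prepared].prepare(&blocks[prepared]);
        if (ret)
            break;
    }
    if (!ret)
        ret = transition_result;

    while (prepared > 0) {
        prepared--;
        blocks[prepared].complete(&blocks[prepared]);
    }
    return ret;
}
```

Passing a non-zero `transition_result` models an aborted suspend: the error is returned to the caller, yet every block still has its prepare() work reversed.
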
Xiang Liu
f43411978d drm/amdgpu: Add debug mask to disable CE logs
Add debug mask to disable kernel logs of RAS correctable errors,
including both ACA and CE error counter kernel messages.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Frank Min
fb5ec2174d drm/amdgpu: add kicker fws loading for gfx11/smu13/psp13
1. Add kicker firmwares loading for gfx11/smu13/psp13
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_11_0_0_rlc_kicker.bin
   - gc_11_0_0_imu_kicker.bin
   - psp_13_0_0_sos_kicker.bin
   - psp_13_0_0_ta_kicker.bin
   - smu_13_0_0_kicker.bin

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Frank Min
09aa2b408f drm/amdgpu: Add kicker device detection
1. add kicker device list
2. add kicker device checking helper function

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Lijo Lazar
3bdf8dd84e drm/amdgpu: Clear reset flags from ras context
Once RAS errors are cleared with appropriate recovery mechanism, clear
reset flags also from RAS context. Otherwise, stale flag values could
affect the subsequent RAS reset handling on the device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Alex Deucher
87fbe3a548 drm/amdgpu/gfx9: drop reset_kgq
It doesn't work reliably and we have soft recover and
full adapter reset so drop this.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Alex Deucher
fda02c911a drm/amdgpu/gfx8: drop reset_kgq
It doesn't work reliably and we have soft recover and
full adapter reset so drop this.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Alex Deucher
18d321c1dc drm/amdgpu/gfx7: drop reset_kgq
It doesn't work reliably and we have soft recover and
full adapter reset so drop this.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:17 -04:00
Jonathan Kim
96f75f9594 drm/amdkfd: allow compute partition mode switch with cgroup exclusions
The KFD currently bars a compute partition mode switch while a KFD
process exists.

Since cgroup excluded devices remain excluded for the lifetime of a KFD
process and user space is able to mode switch single devices, allow
users to mode switch a device with any running process that has been
cgroup excluded from this device.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:17 -04:00
ganglxie
d0cc8d2b7d drm/amdgpu: clear pa and mca record counter when resetting eeprom
clear pa and mca record counter when resetting eeprom, so that
ras_num_bad_pages can be calculated correctly

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Samuel Zhang
4108c2be12 drm/amdgpu: fix fence fallback timer expired error
IH is not working after switching to a new GPU index for the first time.

During VM resume, QEMU programming of the VF MSIX table (register GFXMSIX_VECT0_ADDR_LO)
may not work. The access could be blocked by nBIF protection as the VF isn't in
exclusive access mode. Exclusive access is enabled now, so disable/enable MSIX
so that QEMU reprograms the MSIX table.

Call amdgpu_restore_msix on resume to restore the MSIX table.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Samuel Zhang
2f405eb45c drm/amdgpu: enable pdb0 for hibernation on SRIOV
When switching to a new GPU index after hibernation and then resuming,
the VRAM offset of each VRAM BO will change, and the cached GPU
addresses need to be updated.

This enables pdb0 and switches to pdb0-based virtual GPU
addresses by default in amdgpu_bo_create_reserved(). Since the virtual
addresses do not change, this avoids the need to update all
cached GPU addresses all over the codebase.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Samuel Zhang
18b66a6c2a drm/amdgpu: update GPU addresses for SMU and PSP
add amdgpu_bo_fb_aper_addr() and update the cached GPU addresses to use
the FB aperture address for SMU and PSP.

2 reasons for this change:
1. When pdb0 is enabled, the GPU address from amdgpu_bo_create_kernel() is a GART
aperture address, which is not compatible with SMU and PSP; it needs to be
updated to use the FB aperture address.
2. Since the FB aperture address will change after switching to a new GPU
index after hibernation, it needs to be updated on resume.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Lijo Lazar
0f566f0e9c drm/amdgpu: Remove nbiov7.9 replay count reporting
Direct pcie replay count reporting is not available on nbio v7.9.
Reporting is done through firmware.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Fixes: 50709d18f4 ("drm/amdgpu: Add pci replay count to nbio v7.9")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:09 -04:00
Lijo Lazar
196aefea44 drm/amdgpu: Check pcie replays reporting support
Check if pcie replay count reporting is supported before creating sysfs
attribute.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:02 -04:00
Shiwu Zhang
c09910b511 drm/amdgpu: Enable IFWI update support for PSPv14.0.2 and v14.0.3
Make the psp_vbflash and psp_vbflash_status available in sysfs.

v2: make it available for v14.0.2 as well (hawking)

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:01 -04:00
Samuel Zhang
855a2a029a drm/amdgpu: update xgmi info and vram_base_offset on resume
For SRIOV VM environments on XGMI-enabled systems, the XGMI physical node id may
change when hibernating and resuming with a different VF.

Update XGMI info and vram_base_offset on resume for gfx444 SRIOV env.
Add amdgpu_virt_xgmi_migrate_enabled() as the feature flag.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:01 -04:00
André Almeida
a72002cb18 drm/amdgpu: Make use of drm_wedge_task_info
To notify userspace about which task (if any) caused the device to enter a
wedged state, make use of the drm_wedge_task_info parameter, filling it with
the task PID and name.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-7-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:48 -03:00
André Almeida
35dc4ce200 drm: amdgpu: Use struct drm_wedge_task_info inside of struct amdgpu_task_info
To avoid a cast when calling drm_dev_wedged_event(), replace pid and
task name inside of struct amdgpu_task_info with struct
drm_wedge_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-6-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida
183bccafa1 drm: Create a task info option for wedge events
When a device gets wedged, it might be caused by a guilty application.
For userspace, knowing which task was involved can be useful for some
situations, like for implementing a policy, logs or for giving a chance
for the compositor to let the user know what task was involved in the
problem.  This is an optional argument, when the task info is not
available, the PID and TASK string won't appear in the event string.

Sometimes just the PID isn't enough, given that the task might already be
dead by the time userspace tries to check what this PID's name was,
so to make life easier, also report the task's name in the user
event.

Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-4-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
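The optional task info described in the commit above can be illustrated with a small userspace sketch. The single-string layout, `struct task_info` and `format_wedge_event()` are assumptions made for this example; the commit only says that the PID and TASK strings are omitted when no task info is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical task descriptor: PID plus a short task name. */
struct task_info {
	int pid;
	char comm[16];
};

/*
 * Write "WEDGED=<method>" into buf and, only when info is non-NULL,
 * append " PID=<pid> TASK=<name>". Returns the string length, or -1
 * if buf is too small.
 */
static int format_wedge_event(char *buf, size_t len, const char *method,
			      const struct task_info *info)
{
	assert(buf != NULL && method != NULL);

	int n = snprintf(buf, len, "WEDGED=%s", method);
	if (n < 0 || (size_t)n >= len)
		return -1;

	if (info) {
		int m = snprintf(buf + n, len - (size_t)n, " PID=%d TASK=%s",
				 info->pid, info->comm);
		if (m < 0 || (size_t)m >= len - (size_t)n)
			return -1;
		n += m;
	}
	return n;
}
```

Passing the name along with the PID matters because the task may already have exited by the time a listener tries to look the PID up.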
André Almeida
3bfd1af74a drm: amdgpu: Create amdgpu_vm_print_task_info()
To avoid repetitive code in amdgpu, create a function that prints the
content of struct amdgpu_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida
2a4f069d0f drm: amdgpu: Allow NULL pointers at amdgpu_vm_put_task_info()
Allow NULL pointers at amdgpu_vm_put_task_info() as is common practice
for "put" or "free" functions. This avoids an extra check for NULL in
callers.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-2-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
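The NULL-tolerant "put" convention from the commit above looks roughly like the following. This is a userspace sketch with hypothetical names (`struct task_info_ref`, `task_info_new()`, `task_info_put()`), not the amdgpu code; the return value exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object. */
struct task_info_ref {
	int refcount;
};

/* Allocate an object holding one reference; returns NULL on failure. */
static struct task_info_ref *task_info_new(void)
{
	struct task_info_ref *ti = calloc(1, sizeof(*ti));

	if (ti)
		ti->refcount = 1;
	return ti;
}

/*
 * Drop one reference. A NULL pointer is accepted and ignored, so
 * callers can write task_info_put(ti) without checking ti first.
 * Returns 1 if the object was freed, 0 otherwise.
 */
static int task_info_put(struct task_info_ref *ti)
{
	if (!ti)
		return 0;

	assert(ti->refcount > 0);
	if (--ti->refcount == 0) {
		free(ti);
		return 1;
	}
	return 0;
}
```

This mirrors free(), which is also defined to do nothing for a NULL argument.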
Thomas Weißschuh
fb506e31b3 sysfs: treewide: switch back to attribute_group::bin_attrs
The normal bin_attrs field can now handle const pointers.
This makes the _new variant unnecessary.
Switch all users back.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17 10:44:15 +02:00
Thomas Weißschuh
2fbe82037a sysfs: treewide: switch back to bin_attribute::read()/write()
The bin_attribute argument of bin_attribute::read() is now const.
This makes the _new() callbacks unnecessary. Switch all users back.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-3-724bfcf05b99@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17 10:44:13 +02:00
Thomas Zimmermann
c598d5eb9f Merge drm/drm-next into drm-misc-next
Backmerging to forward to v6.16-rc1

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2025-06-11 09:01:34 +02:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
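Both the old and the new helper name refer to the same container-of idiom: a timer callback receives a pointer to the embedded timer and recovers the enclosing structure from it. The sketch below shows that idiom in userspace C. `container_of_sketch`, `struct timer` and `struct watchdog` are hypothetical, and the macro here takes the type explicitly, which may differ from how the kernel helper obtains it.

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing struct from a pointer to a member. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical timer object embedded in a larger structure. */
struct timer {
	unsigned long expires;
};

struct watchdog {
	int id;
	struct timer tmr;   /* embedded timer */
	int fired;          /* number of times the callback ran */
};

/* Timer callback: only the embedded timer pointer is passed in. */
static void watchdog_timeout(struct timer *t)
{
	assert(t != NULL);

	struct watchdog *wd = container_of_sketch(t, struct watchdog, tmr);

	wd->fired++;
}
```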
Linus Torvalds
e332935a54 drm fixes for 6.16-rc1

Merge tag 'drm-next-2025-06-06' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "This is pretty much two weeks worth of fixes, plus one thing that
  might be considered next: amdkfd is now able to be enabled on risc-v
  platforms.

  Otherwise, amdgpu and xe with the majority of fixes, and then a
  smattering all over.

  panel:
   - nt37801: fix IS_ERR
   - nt37801: fix KConfig

  connector:
   - Fix null deref in HDMI audio helper.

  bridge:
   - analogix_dp: fixup clk-disable removal

  nouveau:
   - minor typo fix (',' vs ';')

  msm:
   - mailmap updates

  i915:
   - Fix the enabling/disabling of DP audio SDP splitting
   - Fix PSR register definitions for ALPM
   - Fix u32 overflow in SNPS PHY HDMI PLL setup
   - Fix GuC pending message underflow when submit fails
   - Fix GuC wakeref underflow race during reset

  xe:
   - Two documentation fixes
   - A couple of vm init fixes
   - Hwmon fixes
   - Drop redundant conversion to bool
   - Fix CONFIG_INTEL_VSEC dependency
   - Rework eviction rejection of bound external bos
   - Stop re-submitting signalled jobs
   - A couple of pxp fixes
   - Add back a fix that got lost in a merge
   - Create LRC bo without VM
   - Fix for the above fix

  amdgpu:
   - UserQ fixes
   - SMU 13.x fixes
   - VCN fixes
   - JPEG fixes
   - Misc cleanups
   - runtime pm fix
   - DCN 4.0.1 fixes
   - Misc display fixes
   - ISP fix
   - VRAM manager fix
   - RAS fixes
   - IP discovery fix
   - Cleaner shader fix for GC 10.1.x
   - OD fix
   - Non-OLED panel fix
   - Misc display fixes
   - Brightness fixes

  amdkfd:
   - Enable CONFIG_HSA_AMD on RISCV
   - SVM fix
   - Misc cleanups
   - Ref leak fix
   - WPTR BO fix

  radeon:
   - Misc cleanups"

* tag 'drm-next-2025-06-06' of https://gitlab.freedesktop.org/drm/kernel: (105 commits)
  drm/nouveau/vfn/r535: Convert comma to semicolon
  drm/xe: remove unmatched xe_vm_unlock() from __xe_exec_queue_init()
  drm/xe: Create LRC BO without VM
  drm/xe/guc_submit: add back fix
  drm/xe/pxp: Clarify PXP queue creation behavior if PXP is not ready
  drm/xe/pxp: Use the correct define in the set_property_funcs array
  drm/xe/sched: stop re-submitting signalled jobs
  drm/xe: Rework eviction rejection of bound external bos
  drm/xe/vsec: fix CONFIG_INTEL_VSEC dependency
  drm/xe: drop redundant conversion to bool
  drm/xe/hwmon: Move card reactive critical power under channel card
  drm/xe/hwmon: Add support to manage power limits though mailbox
  drm/xe/vm: move xe_svm_init() earlier
  drm/xe/vm: move rebind_work init earlier
  MAINTAINERS: .mailmap: update Rob Clark's email address
  mailmap: Update entry for Akhil P Oommen
  MAINTAINERS: update my email address
  MAINTAINERS: drop myself as maintainer
  drm/i915/display: Fix u32 overflow in SNPS PHY HDMI PLL setup
  drm/amd/display: Fix default DC and AC levels
  ...
2025-06-06 08:09:56 -07:00
Linus Torvalds
3719a04a80 pci-v6.16-changes

Merge tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Print the actual delay time in pci_bridge_wait_for_secondary_bus()
     instead of assuming it was 1000ms (Wilfred Mallawa)

   - Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI
     devices', which broke resume from system sleep on AMD platforms and
     has been fixed by other commits (Lukas Wunner)

  Resource management:

   - Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated
     and unnecessary (Philipp Stanner)

   - Remove pcim_iounmap_regions() and pcim_request_region_exclusive()
     and related flags since all uses have been removed (Philipp
     Stanner)

   - Rework devres 'request' functions so they are no longer 'hybrid',
     i.e., their behavior no longer depends on whether
     pcim_enable_device or pci_enable_device() was used, and remove
     related code (Philipp Stanner)

   - Warn (not BUG()) about failure to assign optional resources (Ilpo
     Järvinen)

  Error handling:

   - Log the DPC Error Source ID only when it's actually valid (when
     ERR_FATAL or ERR_NONFATAL was received from a downstream device)
     and decode into bus/device/function (Bjorn Helgaas)

   - Determine AER log level once and save it so all related messages
     use the same level (Karolina Stolarek)

   - Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable
     Errors (Karolina Stolarek)

   - Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs
     controls on interval and burst count, to avoid flooding logs and
     RCU stall warnings (Jon Pan-Doh)

  Power management:

   - Increment PM usage counter when probing reset methods so we don't
     try to read config space of a powered-off device (Alex Williamson)

   - Set all devices to D0 during enumeration to ensure ACPI opregion is
     connected via _REG (Mario Limonciello)

  Power control:

   - Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match
     the filename paths. Retain old deprecated symbols for
     compatibility, except for the pwrctrl slot driver
     (PCI_PWRCTRL_SLOT) (Johan Hovold)

   - When unregistering pwrctrl, cancel outstanding rescan work before
     cleaning up data structures to avoid use-after-free issues (Brian
     Norris)

  Bandwidth control:

   - Simplify link bandwidth controller by replacing the count of Link
     Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN
     flag (Ilpo Järvinen)

   - Update the Link Speed after retraining, since the Link Speed may
     have changed (Ilpo Järvinen)

  PCIe native device hotplug:

   - Ignore Presence Detect Changed caused by DPC.

     pciehp already ignores Link Down/Up events caused by DPC, but on
     slots using in-band presence detect, DPC causes a spurious Presence
     Detect Changed event (Lukas Wunner)

   - Ignore Link Down/Up caused by Secondary Bus Reset.

     On hotplug ports using in-band presence detect, the reset causes a
     Presence Detect Changed event, which mistakenly caused teardown and
     re-enumeration of the device. Drivers may need to annotate code
     that resets their device (Lukas Wunner)

  Virtualization:

   - Add an ACS quirk for Loongson Root Ports that don't advertise ACS
     but don't allow peer-to-peer transactions between Root Ports; the
     quirk allows each Root Port to be in a separate IOMMU group (Huacai
     Chen)

  Endpoint framework:

   - For fixed-size BARs, retain both the actual size and the possibly
     larger size allocated to accommodate iATU alignment requirements
     (Jerome Brunet)

   - Simplify ctrl/SPAD space allocation and avoid allocating more space
     than needed (Jerome Brunet)

   - Correct MSI-X PBA offset calculations for DesignWare and Cadence
     endpoint controllers (Niklas Cassel)

   - Align the return value (number of interrupts) encoding for
     pci_epc_get_msi()/pci_epc_ops::get_msi() and
     pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel)

   - Align the nr_irqs parameter encoding for
     pci_epc_set_msi()/pci_epc_ops::set_msi() and
     pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel)

  Common host controller library:

   - Convert pci-host-common to a library so platforms that don't need
     native host controller drivers don't need to include these helper
     functions (Manivannan Sadhasivam)

  Apple PCIe controller driver:

   - Extract ECAM bridge creation helper from pci_host_common_probe() to
     separate driver-specific things like MSI from PCI things (Marc
     Zyngier)

   - Dynamically allocate RID-to-SID bitmap to prepare for SoCs with
     varying capabilities (Marc Zyngier)

   - Skip ports disabled in DT when setting up ports (Janne Grunau)

   - Add t6020 compatible string (Alyssa Rosenzweig)

   - Add T602x PCIe support (Hector Martin)

   - Directly set/clear INTx mask bits because T602x dropped the
     accessors that could do this without locking (Marc Zyngier)

   - Move port PHY registers to their own reg items to accommodate
     T602x, which moves them around; retain default offsets for existing
     DTs that lack phy%d entries with the reg offsets (Hector Martin)

   - Stop polling for core refclk, which doesn't work on T602x and the
     bootloader has already done anyway (Hector Martin)

   - Use gpiod_set_value_cansleep() when asserting PERST# in probe
     because we're allowed to sleep there (Hector Martin)

  Cadence PCIe controller driver:

   - Drop a runtime PM 'put' to resolve a runtime atomic count underflow
     (Hans Zhang)

   - Make the cadence core buildable as a module (Kishon Vijay Abraham I)

   - Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by
     loadable drivers when they are removed (Siddharth Vadapalli)

  Freescale i.MX6 PCIe controller driver:

   - Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP
     (Richard Zhu)

   - Remove redundant dw_pcie_wait_for_link() from
     imx_pcie_start_link(); since the DWC core does this, imx6 only
     needs it when retraining for a faster link speed (Richard Zhu)

   - Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu)

   - Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in
     some cases, the controller can't exit 'L23 Ready' through Beacon or
     PERST# deassertion (Richard Zhu)

   - Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum:
     controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8
     GT/s, causing timeouts in L1 (Richard Zhu)

   - Wait for i.MX95 PLL lock before enabling controller (Richard Zhu)

   - Save/restore i.MX95 LUT for suspend/resume (Richard Zhu)

  Mobiveil PCIe controller driver:

   - Return bool (not int) for link-up check in
     mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans
     Zhang)

  NVIDIA Tegra194 PCIe controller driver:

   - Create debugfs directory for 'aspm_state_cnt' only when
     CONFIG_PCIEASPM is enabled, since there are no other entries (Hans
     Zhang)

  Qualcomm PCIe controller driver:

   - Add OF support for parsing DT 'eq-presets-<N>gts' property for lane
     equalization presets (Krishna Chaitanya Chundru)

   - Read Maximum Link Width from the Link Capabilities register if DT
     lacks 'num-lanes' property (Krishna Chaitanya Chundru)

   - Add Physical Layer 64 GT/s Capability ID and register offsets for
     8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya
     Chundru)

   - Add generic dwc support for configuring lane equalization presets
     (Krishna Chaitanya Chundru)

   - Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar)

  Renesas R-Car PCIe controller driver:

   - Describe endpoint BAR 4 as being fixed size (Jerome Brunet)

   - Document how to obtain R-Car V4H (r8a779g0) controller firmware
     (Yoshihiro Shimoda)

  Rockchip PCIe controller driver:

   - Reorder rockchip_pci_core_rsts because
     reset_control_bulk_deassert() deasserts in reverse order, to fix a
     link training regression (Jensen Huang)

   - Mark RK3399 as being capable of raising INTx interrupts (Niklas
     Cassel)

  Rockchip DesignWare PCIe controller driver:

   - Check only PCIE_LINKUP, not LTSSM status, to determine whether the
     link is up (Shawn Lin)

   - Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s
     for Root Complex and Endpoint modes (Shawn Lin)

   - Hide the broken ATS Capability in rockchip_pcie_ep_init() instead
     of rockchip_pcie_ep_pre_init() so it stays hidden after PERST#
     resets non-sticky registers (Shawn Lin)

   - Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit()
     (Diederik de Haas)

  Synopsys DesignWare PCIe controller driver:

   - Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training
     more robust; this will not affect the intended link width if all
     lanes are functional (Wenbin Yao)

   - Return bool (not int) for link-up check in dw_pcie_ops.link_up()
     and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay,
     keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx,
     tegra194, uniphier, visconti (Hans Zhang)

   - Add debugfs support for exposing DWC device-specific PTM context
     (Manivannan Sadhasivam)

  TI J721E PCIe driver:

   - Make j721e buildable as a loadable and removable module (Siddharth
     Vadapalli)

   - Fix j721e host/endpoint dependencies that result in link failures
     in some configs (Arnd Bergmann)

  Device tree bindings:

   - Add qcom DT binding for 'global' interrupt (PCIe controller and
     link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p,
     sc7280, sc8180x, sdm845, sm8150, sm8250, sm8350 (Manivannan
     Sadhasivam)

   - Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074,
     ipq8074-gen3, ipq6018 (Manivannan Sadhasivam)

   - Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang)

   - Correct indentation and style of examples in brcm,stb-pcie,
     cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie,
     microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm
     (Krzysztof Kozlowski)

   - Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and
     armada8k from text to schema DT bindings (Rob Herring)

   - Remove obsolete .txt DT bindings for content that has been moved to
     schemas (Rob Herring)

   - Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074
     and IPQ9574 (Varadarajan Narayanan)

   - Convert v3,v360epc-pci from text to DT schema binding (Rob Herring)

   - Change microchip,pcie-host DT binding to be 'dma-noncoherent' since
     PolarFire may be configured that way (Conor Dooley)

  Miscellaneous:

   - Drop 'pci' suffix from intel_mid_pci.c filename to match similar
     files (Andy Shevchenko)

   - All platforms with PCI have an MMU, so add PCI Kconfig dependency
     on MMU to simplify build testing and avoid inadvertent build
     regressions (Arnd Bergmann)

   - Update Krzysztof Wilczyński's email address in MAINTAINERS
     (Krzysztof Wilczyński)

   - Update Manivannan Sadhasivam's email address in MAINTAINERS
     (Manivannan Sadhasivam)"

* tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits)
  MAINTAINERS: Update Manivannan Sadhasivam email address
  PCI: j721e: Fix host/endpoint dependencies
  PCI: j721e: Add support to build as a loadable module
  PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup
  PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup
  PCI: cadence: Add support to build pcie-cadence library as a kernel module
  MAINTAINERS: Update Krzysztof Wilczyński email address
  PCI: Remove unnecessary linesplit in __pci_setup_bridge()
  PCI: WARN (not BUG()) when we fail to assign optional resources
  PCI: Remove unused pci_printk()
  PCI: qcom: Replace PERST# sleep time with proper macro
  PCI: dw-rockchip: Replace PERST# sleep time with proper macro
  PCI: host-common: Convert to library for host controller drivers
  PCI/ERR: Remove misleading TODO regarding kernel panic
  PCI: cadence: Remove duplicate message code definitions
  PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding
  PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding
  PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding
  PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding
  PCI: cadence-ep: Correct PBA offset in .set_msix() callback
  ...
2025-06-04 11:26:17 -07:00
Arunpravin Paneer Selvam
e34bcf1594 drm/amdgpu: Add userq fence support to SDMAv7.0
- Add userq fence support to SDMAv7.0.
- GFX12's user fence irq src id differs from GFX11's,
  hence we need to create a new irq srcid header file for GFX12.

  User fence irq src id information-
  GFX11 and SDMA6.0 - 0x43
  GFX12 and SDMA7.0 - 0x46

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:32:50 -04:00
Dan Carpenter
335f1e797c drm/amdgpu: Fix integer overflow in amdgpu_gem_add_input_fence()
The "num_syncobj_handles" is a u32 value that comes from the user via the
ioctl.  On 32bit systems the "sizeof(uint32_t) * num_syncobj_handles"
multiplication can have an integer overflow.  Use size_mul() to fix that.

Fixes: 38c67ec9aa ("drm/amdgpu: Add input fence to sync bo map/unmap")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:31:13 -04:00
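The overflow fixed in the commit above can be reproduced in userspace by doing the arithmetic in a 32-bit type, as a 32-bit system's size_t would. The helpers below are hypothetical and only illustrate the idea behind a saturating multiply: on overflow the result becomes the maximum value, which a subsequent allocation is expected to reject, rather than a small wrapped value.

```c
#include <assert.h>
#include <stdint.h>

/* Plain multiply in a 32-bit size type: silently wraps on overflow. */
static uint32_t mul_wrap32(uint32_t a, uint32_t b)
{
	return a * b;
}

/* Saturating multiply: returns UINT32_MAX instead of wrapping. */
static uint32_t mul_sat32(uint32_t a, uint32_t b)
{
	if (a != 0 && b > UINT32_MAX / a)
		return UINT32_MAX;
	return a * b;
}
```

With a user-supplied count of 0x40000001 and 4-byte elements, the wrapping product is 4, so a 4-byte buffer would be allocated for roughly a billion entries; the saturating product is UINT32_MAX instead.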
Dan Carpenter
98a46a4089 drm/amdgpu: Fix integer overflow issues in amdgpu_userq_fence.c
This patch only affects 32bit systems.  There are several integer
overflow bugs here but only the "sizeof(u32) * num_syncobj"
multiplication is a problem at runtime.  (The last lines of this patch).

These variables are u32 variables that come from the user.  The issue
is the multiplications can overflow leading to us allocating a smaller
buffer than intended.  For the first couple integer overflows, the
syncobj_handles = memdup_user() allocation is immediately followed by
a kmalloc_array():

	syncobj = kmalloc_array(num_syncobj_handles, sizeof(*syncobj), GFP_KERNEL);

In that situation the kmalloc_array() works as a bounds check and we
haven't accessed the syncobj_handles[] array yet so the integer overflow
is harmless.

But the "num_syncobj" multiplication doesn't have that and the integer
overflow could lead to an out of bounds access.

Fixes: a292fdecd7 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:06:35 -04:00
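The remark that kmalloc_array() "works as a bounds check" relies on the array allocator refusing a count whose byte size would overflow. A userspace analogue, under the assumption of a hypothetical `alloc_array_checked()` helper built on malloc(), might look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n elements of the given size. Returns NULL if
 * n * size would overflow size_t, so an oversized user-supplied
 * count fails here instead of yielding a too-small buffer.
 */
static void *alloc_array_checked(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

A caller that first copies n handles with an unchecked multiplication and then calls a checked allocator with the same n gets an error before the undersized copy is ever indexed, which is why the message calls the earlier overflows harmless.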
Alex Deucher
5cccf10f65 drm/amdgpu: disable workload profile switching when OD is enabled
Users have reported that they have to reduce the level of undervolting
to achieve stability when dynamic workload profiles are enabled on
GC 10.3.x. Disable dynamic workload profiles if the user has enabled
OD.

Fixes: b9467983b7 ("drm/amdgpu: add dynamic workload profile switching for gfx10")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4262
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.15.x
2025-06-03 15:04:24 -04:00
Vitaly Prosyak
d26625d034 drm/amdgpu/gfx10: Refine Cleaner Shader for GFX10.1.10
This patch updates the cleaner shader, which is responsible for
initializing GPU resources such as Local Data Share (LDS), Vector
General Purpose Registers (VGPRs), and Scalar General Purpose Registers
(SGPRs). Changes include adjustments to register clearing and shader
configuration.

- Updated GPU resource initialization addresses in the cleaner shader
  from `be803080` to `be803000`.
- Simplified the logic in the SGPR clearing section, ensuring all SGPRs
  are set to zero.

Fixes: 25961bad92 ("drm/amdgpu/gfx10: Add cleaner shader for GFX10.1.10")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Manu Rastogi <manu.rastogi@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:03:09 -04:00
Lijo Lazar
719d84f8a8 drm/amdgpu: Add more checks to discovery fetch
Add more checks for valid vram size and log error, if any.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:02:58 -04:00
Tvrtko Ursulin
bf33a0003d dma-fence: Use a flag for 64-bit seqnos
With the goal of reducing the need for drivers to touch (and dereference)
fence->ops, we move the 64-bit seqnos flag from struct dma_fence_ops to
the fence->flags.

Drivers which were setting this flag are changed to use new
dma_fence_init64() instead of dma_fence_init().
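
As a rough userspace sketch of the idea (not the dma-fence code itself),
the seqno width can be recorded in a per-fence flag at init time and
consulted when two seqnos are compared.  struct sketch_fence, the
SKETCH_FENCE_FLAG_SEQNO64 bit and the sketch_* helpers are hypothetical
names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit marking a fence that uses 64-bit seqnos. */
#define SKETCH_FENCE_FLAG_SEQNO64 (1ul << 0)

struct sketch_fence {
	uint64_t seqno;
	unsigned long flags;
};

/* Default init: seqnos are treated as 32-bit values that may wrap. */
static void sketch_fence_init(struct sketch_fence *f, uint64_t seqno)
{
	f->seqno = seqno;
	f->flags = 0;
}

/* 64-bit init: same as above, but records the width in the flags. */
static void sketch_fence_init64(struct sketch_fence *f, uint64_t seqno)
{
	f->seqno = seqno;
	f->flags = SKETCH_FENCE_FLAG_SEQNO64;
}

/* Returns true if seqno a is later than seqno b for this fence. */
static bool sketch_seqno_is_later(const struct sketch_fence *f,
				  uint64_t a, uint64_t b)
{
	/* The flag lives in the fence, so no ops table is dereferenced. */
	if (f->flags & SKETCH_FENCE_FLAG_SEQNO64)
		return a > b;

	/* 32-bit seqnos: compare the wrapped difference as signed. */
	return (int32_t)((uint32_t)a - (uint32_t)b) > 0;
}
```

The comparison only reads f->flags, which is the kind of access the
commit aims for instead of going through fence->ops.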

v2:
 * Streamlined init and added kerneldoc.
 * Rebase for amdgpu userq which landed since.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Link: https://lore.kernel.org/r/20250515095004.28318-3-tvrtko.ursulin@igalia.com
2025-06-03 17:38:04 +01:00
Maxime Ripard
7b1166dee8 drm for 6.16-rc1
new drivers:
 - bring in the asahi uapi header standalone
 - nova-drm: stub driver
 
 rust dependencies (for nova-core):
 - auxiliary
   - bus abstractions
   - driver registration
   - sample driver
 - devres changes from driver-core
 - revocable changes
 
 core:
 - add Apple fourcc modifiers
 - add virtio capset definitions
 - extend EXPORT_SYNC_FILE for timeline syncobjs
 - convert to devm_platform_ioremap_resource
 - refactor shmem helper page pinning
 - DP powerup/down link helpers
 - remove disgusting turds
 - extended %p4cc in vsprintf.c to support fourcc prints
 - change vsprintf %p4cn to %p4chR, remove %p4cn
 - Add drm_file_err function
 - IN_FORMATS_ASYNC property
 - move sitronix from tiny to their own subdir
 
 rust:
 - add drm core infrastructure rust abstractions
   (device/driver, ioctl, file, gem)
 
 dma-buf:
 - adjust sg handling to not cache map on attach
 - allow setting dma-device for import
 - Add a helper to sort and deduplicate dma_fence arrays
 
 docs:
 - updated drm scheduler docs
 - fbdev todo update
 - fb rendering
 - actual brightness
 
 ttm:
 - fix delayed destroy resv object
 
 bridge:
 - add kunit tests
 - convert tc358775 to atomic
 - convert drivers to devm_drm_bridge_alloc
 - convert rk3066_hdmi to bridge driver
 
 scheduler:
 - add kunit tests
 
 panel:
 - refcount panels to improve lifetime handling
 - Powertip PH128800T004-ZZA01
 - NLT NL13676BC25-03F, Tianma TM070JDHG34-00
 - Himax HX8279/HX8279-D DDIC
 - Visionox G2647FB105
 - Sitronix ST7571
 - ZOTAC rotation quirk
 
 vkms:
 - allow attaching more displays
 
 i915:
 - xe3lpd display updates
 - vrr refactor
 - intel_display struct conversions
 - xe2hpd memory type identification
 - add link rate/count to i915_display_info
 - cleanup VGA plane handling
 - refactor HDCP GSC
 - fix SLPC wait boosting reference counting
 - add 20ms delay to engine reset
 - fix fence release on early probe errors
 
 xe:
 - SRIOV updates
 - BMG PCI ID update
 - support separate firmware for each GT
 - SVM fix, prelim SVM multi-device work
 - export fan speed
 - temp disable d3cold on BMG
 - backup VRAM in PM notifier instead of suspend/freeze
 - update xe_ttm_access_memory to use GPU for non-visible access
 - fix guc_info debugfs for VFs
 - use copy_from_user instead of __copy_from_user
 - append PCIe gen5 limitations to xe_firmware document
 
 amdgpu:
 - DSC cleanup
 - DC Scaling updates
 - Fused I2C-over-AUX updates
 - DMUB updates
 - Use drm_file_err in amdgpu
 - Enforce isolation updates
 - Use new dma_fence helpers
 - USERQ fixes
 - Documentation updates
 - SR-IOV updates
 - RAS updates
 - PSP 12 cleanups
 - GC 9.5 updates
 - SMU 13.x updates
 - VCN / JPEG SR-IOV updates
 
 amdkfd:
 - Update error messages for SDMA
 - Userptr updates
 - XNACK fixes
 
 radeon:
 - CIK doorbell cleanup
 
 nouveau:
 - add support for NVIDIA r570 GSP firmware
 - enable Hopper/Blackwell support
 
 nova-core:
 - fix task list
 - register definition infrastructure
 - move firmware into own rust module
 - register auxiliary device for nova-drm
 
 nova-drm:
 - initial driver skeleton
 
 msm:
 - GPU:
   - ACD (adaptive clock distribution) for X1-85
   - drop fictional address_space_size
   - improve GMU HFI response time out robustness
   - fix crash when throttling during boot
 - DPU:
   - use single CTL path for flushing on DPU 5.x+
   - improve SSPP allocation code for better sharing
   - Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
   - Added SAR2130P support
   - Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
 - DP:
   - switch to new audio helpers
   - better LTTPR handling
 - DSI:
   - Added support for SA8775P
   - Added SAR2130P support
 - HDMI:
   - Switched to use new helpers for ACR data
   - Fixed old standing issue of HPD not working in some cases
 
 amdxdna:
 - add dma-buf support
 - allow empty command submits
 
 renesas:
 - add dma-buf support
 - add zpos, alpha, blend support
 
 panthor:
 - fail properly for NO_MMAP bos
 - add SET_LABEL ioctl
 - debugfs BO dumping support
 
 imagination:
 - update DT bindings
 - support TI AM68 GPU
 
 hibmc:
 - improve interrupt handling and HPD support
 
 virtio:
 - add panic handler support
 
 rockchip:
 - add RK3588 support
 - add DP AUX bus panel support
 
 ivpu:
 - add heartbeat based hangcheck
 
 mediatek:
 - prepares support for MT8195/99 HDMIv2/DDCv2
 
 anx7625:
 - improve HPD
 
 tegra:
 - speed up firmware loading

Merge drm-next-2025-05-28 into drm-misc-next

Christian needs a recent drm-next branch to merge fence patches.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-06-03 15:07:39 +02:00
Lang Yu
30837a49bd drm/amdkfd: Map wptr BO to GART unconditionally
Simulation C models that don't run CP FW do not populate
adev->mes.sched_version correctly. This causes a NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev->adev, (void **)&pqn->q->wptr_bo_gart)
and a warning on an unpinned BO in amdgpu_bo_gpu_offset(q->properties.wptr_bo).

Compared with adding version checks here and there,
always mapping the wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:44 -04:00
Alex Deucher
684530526f drm/amdgpu/mes: remove some unused functions
Nothing uses them so remove them.  Leftover from
MES bring up.

Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:39 -04:00
Alex Deucher
40f970ba7a drm/amdgpu/mes: add missing locking in helper functions
We need to take the MES lock.

Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-05-29 10:58:10 -04:00
Mario Limonciello
82a277d529 drm/amd: Export DMCUB version to sysfs
For supported ASICs the DMCU version is exported, but for ASICs that
support DMCUB no information is exported to sysfs.

Add an attribute for DMCUB.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Link: https://lore.kernel.org/r/20250527155942.476354-1-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:42 -04:00
ganglxie
fce0afca35 drm/amdgpu: Get mca address for old eeprom records
After getting the MCA address for old eeprom records with 'address==0', the
records can be correctly parsed in non-nps1 mode; otherwise they will be dropped.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:09 -04:00
ganglxie
31e837d242 drm/amdgpu: handle old RAS eeprom data in non-nps1 mode
Get the MCA address from the PA in nps1 mode, then convert the MCA address to
a PA in the specific nps mode.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:05 -04:00
Arunpravin Paneer Selvam
5ae9de5867 drm/amdgpu: Add userq fence support to SDMAv6.0
Add userq fence support to SDMAv6.0

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:58 -04:00
John Olender
4d2f6b4e4c drm/amdgpu: amdgpu_vram_mgr_new(): Clamp lpfn to total vram
The drm_mm allocator tolerated being passed end > mm->size, but the
drm_buddy allocator does not.

Restore the pre-buddy-allocator behavior of allowing such placements.
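
A minimal sketch of the clamping idea, in plain C rather than the driver
code.  clamp_lpfn() and its parameters are hypothetical names, and the
treatment of lpfn == 0 as "no upper limit" is an assumption of this
sketch.

```c
#include <stdint.h>

/*
 * Clamp the last allowed page frame number of a placement to the size
 * of the managed range.  An lpfn of 0 is assumed to mean "no limit".
 */
static uint64_t clamp_lpfn(uint64_t lpfn, uint64_t total_pages)
{
	if (lpfn == 0 || lpfn > total_pages)
		return total_pages;

	return lpfn;
}
```

A placement that asks for an end beyond the total size is then treated
like one that ends at the total size, instead of being passed on
unchanged.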

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3448
Signed-off-by: John Olender <john.olender@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-05-29 10:56:23 -04:00
David (Ming Qiang) Wu
bf394d2854 drm/amdgpu/vcn5.0.1: read back register after written
The addition of register read-back in VCN v5.0.1 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:19 -04:00
David (Ming Qiang) Wu
a8bce9b7a2 drm/amdgpu/vcn5: read back register after written
The addition of register read-back in VCN v5.0.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:17 -04:00
David (Ming Qiang) Wu
4d4275a038 drm/amdgpu/vcn4.0.5: read back register after written
The addition of register read-back in VCN v4.0.5 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:14 -04:00
David (Ming Qiang) Wu
5b4c6413c8 drm/amdgpu/vcn4.0.3: read back register after written
The addition of register read-back in VCN v4.0.3 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:12 -04:00
David (Ming Qiang) Wu
a3810a5e37 drm/amdgpu/vcn4: read back register after written
The addition of register read-back in VCN v4.0.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:10 -04:00
David (Ming Qiang) Wu
b7a4842a91 drm/amdgpu/vcn3: read back register after written
The addition of register read-back in VCN v3.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:08 -04:00
David (Ming Qiang) Wu
d9e688b914 drm/amdgpu/vcn2.5: read back register after written
The addition of register read-back in VCN v2.5 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:05 -04:00
David (Ming Qiang) Wu
8c5ed7f5ab drm/amdgpu/vcn2: read back register after written
The addition of register read-back in VCN v2.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:03 -04:00
David (Ming Qiang) Wu
0ef2803173 drm/amdgpu/vcn1: read back register after written
V3: drop changes where readbacks are already implemented. This patch set
    only adds readbacks.

V2: use common register UVD_STATUS for readback (standard PCI MMIO
    behavior, i.e. readback post all writes to let the writes hit
    the hardware)
    add readback in ..._stop() for more coverage.

Similar to the changes made for VCN v4.0.5, where a readback posts the
writes to avoid a race with the doorbell, register readback is added to
the other VCN versions to prevent potential race conditions, even though
such issues have not been observed yet. This change ensures consistency
across different VCN variants and helps avoid similar issues. The
overhead introduced is negligible.
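
The write-then-readback pattern can be sketched as follows.  This is a
userspace illustration with volatile pointers standing in for MMIO
registers; mmio_write_posted() is a hypothetical helper, not a function
from the driver.

```c
#include <stdint.h>

/*
 * Write a value to a register and then read a status register back.
 * On real hardware the read forces earlier posted writes to reach the
 * device before execution continues; here it only models the ordering.
 * The value read back is returned so callers can inspect it.
 */
static uint32_t mmio_write_posted(volatile uint32_t *reg, uint32_t val,
				  volatile uint32_t *status)
{
	*reg = val;	/* posted write, may still be in flight */

	return *status;	/* readback, completes after the write */
}
```

In the commit the readback register is UVD_STATUS; in this sketch any
register on the same device would play that role.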

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:55:39 -04:00
Pratap Nirujogi
3e9d9df850 drm/amd/amdgpu: Add GPIO resources required for amdisp
ISP is a child device of GFX, and its device specific information
is not available in ACPI. Add the two GPIO resources required for
ISP_v4_1_1 to the amdgpu_isp driver.

- GPIO 0 to allow sensor driver to enable and disable sensor module.
- GPIO 85 to allow ISP driver to enable and disable ISP RGB streaming mode.

Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-28 16:01:39 -04:00
Linus Torvalds
b08494a8f7 drm for 6.16-rc1
new drivers:
 - bring in the asahi uapi header standalone
 - nova-drm: stub driver
 
 rust dependencies (for nova-core):
 - auxiliary
   - bus abstractions
   - driver registration
   - sample driver
 - devres changes from driver-core
 - revocable changes
 
 core:
 - add Apple fourcc modifiers
 - add virtio capset definitions
 - extend EXPORT_SYNC_FILE for timeline syncobjs
 - convert to devm_platform_ioremap_resource
 - refactor shmem helper page pinning
 - DP powerup/down link helpers
 - remove disgusting turds
 - extended %p4cc in vsprintf.c to support fourcc prints
 - change vsprintf %p4cn to %p4chR, remove %p4cn
 - Add drm_file_err function
 - IN_FORMATS_ASYNC property
 - move sitronix from tiny to their own subdir
 
 rust:
 - add drm core infrastructure rust abstractions
   (device/driver, ioctl, file, gem)
 
 dma-buf:
 - adjust sg handling to not cache map on attach
 - allow setting dma-device for import
 - Add a helper to sort and deduplicate dma_fence arrays
 
 docs:
 - updated drm scheduler docs
 - fbdev todo update
 - fb rendering
 - actual brightness
 
 ttm:
 - fix delayed destroy resv object
 
 bridge:
 - add kunit tests
 - convert tc358775 to atomic
 - convert drivers to devm_drm_bridge_alloc
 - convert rk3066_hdmi to bridge driver
 
 scheduler:
 - add kunit tests
 
 panel:
 - refcount panels to improve lifetime handling
 - Powertip PH128800T004-ZZA01
 - NLT NL13676BC25-03F, Tianma TM070JDHG34-00
 - Himax HX8279/HX8279-D DDIC
 - Visionox G2647FB105
 - Sitronix ST7571
 - ZOTAC rotation quirk
 
 vkms:
 - allow attaching more displays
 
 i915:
 - xe3lpd display updates
 - vrr refactor
 - intel_display struct conversions
 - xe2hpd memory type identification
 - add link rate/count to i915_display_info
 - cleanup VGA plane handling
 - refactor HDCP GSC
 - fix SLPC wait boosting reference counting
 - add 20ms delay to engine reset
 - fix fence release on early probe errors
 
 xe:
 - SRIOV updates
 - BMG PCI ID update
 - support separate firmware for each GT
 - SVM fix, prelim SVM multi-device work
 - export fan speed
 - temp disable d3cold on BMG
 - backup VRAM in PM notifier instead of suspend/freeze
 - update xe_ttm_access_memory to use GPU for non-visible access
 - fix guc_info debugfs for VFs
 - use copy_from_user instead of __copy_from_user
 - append PCIe gen5 limitations to xe_firmware document
 
 amdgpu:
 - DSC cleanup
 - DC Scaling updates
 - Fused I2C-over-AUX updates
 - DMUB updates
 - Use drm_file_err in amdgpu
 - Enforce isolation updates
 - Use new dma_fence helpers
 - USERQ fixes
 - Documentation updates
 - SR-IOV updates
 - RAS updates
 - PSP 12 cleanups
 - GC 9.5 updates
 - SMU 13.x updates
 - VCN / JPEG SR-IOV updates
 
 amdkfd:
 - Update error messages for SDMA
 - Userptr updates
 - XNACK fixes
 
 radeon:
 - CIK doorbell cleanup
 
 nouveau:
 - add support for NVIDIA r570 GSP firmware
 - enable Hopper/Blackwell support
 
 nova-core:
 - fix task list
 - register definition infrastructure
 - move firmware into own rust module
 - register auxiliary device for nova-drm
 
 nova-drm:
 - initial driver skeleton
 
 msm:
 - GPU:
   - ACD (adaptive clock distribution) for X1-85
   - drop fictional address_space_size
   - improve GMU HFI response time out robustness
   - fix crash when throttling during boot
 - DPU:
   - use single CTL path for flushing on DPU 5.x+
   - improve SSPP allocation code for better sharing
   - Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
   - Added SAR2130P support
   - Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
 - DP:
   - switch to new audio helpers
   - better LTTPR handling
 - DSI:
   - Added support for SA8775P
   - Added SAR2130P support
 - HDMI:
   - Switched to use new helpers for ACR data
   - Fixed old standing issue of HPD not working in some cases
 
 amdxdna:
 - add dma-buf support
 - allow empty command submits
 
 renesas:
 - add dma-buf support
 - add zpos, alpha, blend support
 
 panthor:
 - fail properly for NO_MMAP bos
 - add SET_LABEL ioctl
 - debugfs BO dumping support
 
 imagination:
 - update DT bindings
 - support TI AM68 GPU
 
 hibmc:
 - improve interrupt handling and HPD support
 
 virtio:
 - add panic handler support
 
 rockchip:
 - add RK3588 support
 - add DP AUX bus panel support
 
 ivpu:
 - add heartbeat based hangcheck
 
 mediatek:
 - prepares support for MT8195/99 HDMIv2/DDCv2
 
 anx7625:
 - improve HPD
 
 tegra:
 - speed up firmware loading

Merge tag 'drm-next-2025-05-28' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "As part of building up nova-core/nova-drm pieces we've brought in some
  rust abstractions through this tree, aux bus being the main one, with
  devres changes also in the driver-core tree. Along with the drm core
  abstractions and enough nova-core/nova-drm to use them. This is still
  all stub work under construction, to build the nova driver upstream.

  The other big NVIDIA related one is nouveau adds support for
  Hopper/Blackwell GPUs, this required a new GSP firmware update to
  570.144, and a bunch of rework in order to support multiple fw
  interfaces.

  There is also the introduction of an asahi uapi header file as a
  precursor to getting the real driver in later, but to unblock
  userspace mesa packages while the driver is trapped behind rust
  enablement.

  Otherwise it's the usual mixture of stuff all over, amdgpu, i915/xe,
  and msm being the main ones, and some changes to vsprintf.

  new drivers:
   - bring in the asahi uapi header standalone
   - nova-drm: stub driver

  rust dependencies (for nova-core):
   - auxiliary
       - bus abstractions
       - driver registration
       - sample driver
   - devres changes from driver-core
   - revocable changes

  core:
   - add Apple fourcc modifiers
   - add virtio capset definitions
   - extend EXPORT_SYNC_FILE for timeline syncobjs
   - convert to devm_platform_ioremap_resource
   - refactor shmem helper page pinning
   - DP powerup/down link helpers
   - extended %p4cc in vsprintf.c to support fourcc prints
   - change vsprintf %p4cn to %p4chR, remove %p4cn
   - Add drm_file_err function
   - IN_FORMATS_ASYNC property
   - move sitronix from tiny to their own subdir

  rust:
   - add drm core infrastructure rust abstractions
     (device/driver, ioctl, file, gem)

  dma-buf:
   - adjust sg handling to not cache map on attach
   - allow setting dma-device for import
   - Add a helper to sort and deduplicate dma_fence arrays

  docs:
   - updated drm scheduler docs
   - fbdev todo update
   - fb rendering
   - actual brightness

  ttm:
   - fix delayed destroy resv object

  bridge:
   - add kunit tests
   - convert tc358775 to atomic
   - convert drivers to devm_drm_bridge_alloc
   - convert rk3066_hdmi to bridge driver

  scheduler:
   - add kunit tests

  panel:
   - refcount panels to improve lifetime handling
   - Powertip PH128800T004-ZZA01
   - NLT NL13676BC25-03F, Tianma TM070JDHG34-00
   - Himax HX8279/HX8279-D DDIC
   - Visionox G2647FB105
   - Sitronix ST7571
   - ZOTAC rotation quirk

  vkms:
   - allow attaching more displays

  i915:
   - xe3lpd display updates
   - vrr refactor
   - intel_display struct conversions
   - xe2hpd memory type identification
   - add link rate/count to i915_display_info
   - cleanup VGA plane handling
   - refactor HDCP GSC
   - fix SLPC wait boosting reference counting
   - add 20ms delay to engine reset
   - fix fence release on early probe errors

  xe:
   - SRIOV updates
   - BMG PCI ID update
   - support separate firmware for each GT
   - SVM fix, prelim SVM multi-device work
   - export fan speed
   - temp disable d3cold on BMG
   - backup VRAM in PM notifier instead of suspend/freeze
   - update xe_ttm_access_memory to use GPU for non-visible access
   - fix guc_info debugfs for VFs
   - use copy_from_user instead of __copy_from_user
   - append PCIe gen5 limitations to xe_firmware document

  amdgpu:
   - DSC cleanup
   - DC Scaling updates
   - Fused I2C-over-AUX updates
   - DMUB updates
   - Use drm_file_err in amdgpu
   - Enforce isolation updates
   - Use new dma_fence helpers
   - USERQ fixes
   - Documentation updates
   - SR-IOV updates
   - RAS updates
   - PSP 12 cleanups
   - GC 9.5 updates
   - SMU 13.x updates
   - VCN / JPEG SR-IOV updates

  amdkfd:
   - Update error messages for SDMA
   - Userptr updates
   - XNACK fixes

  radeon:
   - CIK doorbell cleanup

  nouveau:
   - add support for NVIDIA r570 GSP firmware
   - enable Hopper/Blackwell support

  nova-core:
   - fix task list
   - register definition infrastructure
   - move firmware into own rust module
   - register auxiliary device for nova-drm

  nova-drm:
   - initial driver skeleton

  msm:
   - GPU:
       - ACD (adaptive clock distribution) for X1-85
       - drop fictional address_space_size
       - improve GMU HFI response time out robustness
       - fix crash when throttling during boot
   - DPU:
       - use single CTL path for flushing on DPU 5.x+
       - improve SSPP allocation code for better sharing
       - Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
       - Added SAR2130P support
       - Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
   - DP:
       - switch to new audio helpers
       - better LTTPR handling
   - DSI:
       - Added support for SA8775P
       - Added SAR2130P support
   - HDMI:
       - Switched to use new helpers for ACR data
       - Fixed old standing issue of HPD not working in some cases

  amdxdna:
   - add dma-buf support
   - allow empty command submits

  renesas:
   - add dma-buf support
   - add zpos, alpha, blend support

  panthor:
   - fail properly for NO_MMAP bos
   - add SET_LABEL ioctl
   - debugfs BO dumping support

  imagination:
   - update DT bindings
   - support TI AM68 GPU

  hibmc:
   - improve interrupt handling and HPD support

  virtio:
   - add panic handler support

  rockchip:
   - add RK3588 support
   - add DP AUX bus panel support

  ivpu:
   - add heartbeat based hangcheck

  mediatek:
   - prepares support for MT8195/99 HDMIv2/DDCv2

  anx7625:
   - improve HPD

  tegra:
   - speed up firmware loading

* tag 'drm-next-2025-05-28' of https://gitlab.freedesktop.org/drm/kernel: (1627 commits)
  drm/nouveau/tegra: Fix error pointer vs NULL return in nvkm_device_tegra_resource_addr()
  drm/xe: Default auto_link_downgrade status to false
  drm/xe/guc: Make creation of SLPC debugfs files conditional
  drm/i915/display: Add check for alloc_ordered_workqueue() and alloc_workqueue()
  drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read
  drm/i915/ptl: Use everywhere the correct DDI port clock select mask
  drm/nouveau/kms: add support for GB20x
  drm/dp: add option to disable zero sized address only transactions.
  drm/nouveau: add support for GB20x
  drm/nouveau/gsp: add hal for fifo.chan.doorbell_handle
  drm/nouveau: add support for GB10x
  drm/nouveau/gf100-: track chan progress with non-WFI semaphore release
  drm/nouveau/nv50-: separate CHANNEL_GPFIFO handling out from CHANNEL_DMA
  drm/nouveau: add helper functions for allocating pinned/cpu-mapped bos
  drm/nouveau: add support for GH100
  drm/nouveau: improve handling of 64-bit BARs
  drm/nouveau/gv100-: switch to volta semaphore methods
  drm/nouveau/gsp: support deeper page tables in COPY_SERVER_RESERVED_PDES
  drm/nouveau/gsp: init client VMMs with NV0080_CTRL_DMA_SET_PAGE_DIRECTORY
  drm/nouveau/gsp: fetch level shift and PDE from BAR2 VMM
  ...
2025-05-28 09:46:39 -07:00
Pierre-Eric Pelloux-Prayer
6c8e8a1c43 drm/amdgpu: update trace format to match gpu_scheduler_trace
Log fences using the same format for coherency.

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-11-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:16:20 +02:00
Pierre-Eric Pelloux-Prayer
4f7fa5fa41 drm: Get rid of drm_sched_job.id
Its only purpose was for trace events, but jobs can already be
uniquely identified using their fence.

The downside of using the fence is that it's only available
after 'drm_sched_job_arm' was called which is true for all trace
events that used job.id so they can safely switch to using it.

Suggested-by: Tvrtko Ursulin <tursulin@igalia.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-9-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:16:15 +02:00
Pierre-Eric Pelloux-Prayer
2956554823 drm/sched: Store the drm client_id in drm_sched_fence
This will be used in a later commit to trace the drm client_id in
some of the gpu_scheduler trace events.

This requires changing all the users of drm_sched_job_init to
add an extra parameter.

The newly added drm_client_id field in the drm_sched_fence is a bit
of a duplicate of the owner one. One suggestion I received was to
merge those 2 fields - this can't be done right now as amdgpu uses
some special values (AMDGPU_FENCE_OWNER_*) that can't really be
translated into a client id. Christian is working on getting rid of
those; when it's done we should be able to squash owner/drm_client_id
together.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-3-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:15:58 +02:00
Linus Torvalds
2bd1bea5fa A set of cleanups for the generic interrupt subsystem:
- Consolidate on one set of functions for the interrupt domain code to
     get rid of pointlessly duplicated code with only marginal different
     semantics.
 
   - Update the documentation accordingly and consolidate the coding style
     of the irqdomain header.

Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq cleanups from Thomas Gleixner:
 "A set of cleanups for the generic interrupt subsystem:

   - Consolidate on one set of functions for the interrupt domain code
     to get rid of pointlessly duplicated code with only marginal
     different semantics.

   - Update the documentation accordingly and consolidate the coding
     style of the irqdomain header"

* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  irqdomain: Consolidate coding style
  irqdomain: Fix kernel-doc and add it to Documentation
  Documentation: irqdomain: Update it
  Documentation: irq-domain.rst: Simple improvements
  Documentation: irq/concepts: Minor improvements
  Documentation: irq/concepts: Add commas and reflow
  irqdomain: Improve kernel-docs of functions
  irqdomain: Make struct irq_domain_info variables const
  irqdomain: Use irq_domain_instantiate()'s return value as initializers
  irqdomain: Drop irq_linear_revmap()
  pinctrl: keembay: Switch to irq_find_mapping()
  irqchip/armada-370-xp: Switch to irq_find_mapping()
  gpu: ipu-v3: Switch to irq_find_mapping()
  gpio: idt3243x: Switch to irq_find_mapping()
  sh: Switch to irq_find_mapping()
  powerpc: Switch to irq_find_mapping()
  irqdomain: Drop irq_domain_add_*() functions
  powerpc: Switch irq_domain_add_nomap() to use fwnode
  thermal: Switch to irq_domain_create_linear()
  soc: Switch to irq_domain_create_*()
  ...
2025-05-27 08:07:32 -07:00
Philip Yang
a359288ccb drm/amdgpu: seq64 memory unmap uses uninterruptible lock
When the DRM node is closed and the VM is freed, seq64 memory must be
unmapped and freed. If a signal arrives at that point, taking the VM lock
fails and the seq64 VA mapping leaks, and dmesg then logs the error
"still active bo inside vm".

Use an uninterruptible lock instead, which fixes the mapping leak and
the dmesg error.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:03:02 -04:00
Mangesh Gadre
b758667f55 drm/amdgpu: update ras support check
update ras support check for vcn 5.0.1

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:55 -04:00
Mangesh Gadre
25e9fb6e3a drm/amdgpu: Enable RAS for jpeg 5.0.1
Enable jpeg ras poison processing and aca error logging

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:52 -04:00
Mangesh Gadre
5035caf18d drm/amdgpu: Enable RAS for vcn 5.0.1
Enable vcn ras poison processing and aca error logging

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:46 -04:00
Tvrtko Ursulin
dd64956685 drm/amdgpu: Remove duplicated "context still alive" check
When amdgpu_ctx_mgr_fini() calls amdgpu_ctx_mgr_entity_fini(), the latter
performs the exact same "context still alive" check that the former does
next. Remove the duplicated copy.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:25 -04:00
Tvrtko Ursulin
16f2c942b6 drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static
Function amdgpu_ctx_mgr_entity_fini() only has a single local caller so
let's make it local.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:21 -04:00
Alex Deucher
e90bd6d898 drm/amdgpu: Update runtime pm checks
Don't enable BACO when in passthrough. PCI resets don't work
correctly when in BACO.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:14 -04:00
Lijo Lazar
74956242a0 drm/amd/pm: Use external link order for xgmi data
xgmi_port_num interface reports external link number for port number. To
be consistent, use the external link number for reporting other XGMI
link data also.

v2: For invalid link number return -EINVAL (Kevin)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:04 -04:00
Lijo Lazar
cbbab29246 drm/amdgpu: Add sysfs nodes for partition
Add sysfs nodes to provide compute partition specific data.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:01:36 -04:00
Stanley.Yang
1b2231de41 drm/amdgpu: Register aqua vanjaram jpeg poison irq
Register aqua vanjaram jpeg poison irq, add jpeg poison handle.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:01:28 -04:00
Stanley.Yang
4c4a891496 drm/amdgpu: Register aqua vanjaram vcn poison irq
Register aqua vanjaram vcn poison irq, add vcn poison handle.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:01:24 -04:00
Jesse.Zhang
0132ba7ff0 drm/amdgpu: Fix eviction fence worker race during fd close
The current cleanup order during file descriptor close can lead to
a race condition where the eviction fence worker attempts to access
a destroyed mutex from the user queue manager:

[  517.294055] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[  517.294060] WARNING: CPU: 8 PID: 2030 at kernel/locking/mutex.c:564
[  517.294094] Workqueue: events amdgpu_eviction_fence_suspend_worker [amdgpu]

The issue occurs because:
1. We destroy the user queue manager (including its mutex) first
2. Then try to destroy eviction fences which may have pending work
3. The eviction fence worker may try to access the already-destroyed mutex

Fix this by reordering the cleanup to:
1. First mark the fd as closing and destroy eviction fences,
   which flushes any pending work
2. Then safely destroy the user queue manager after we're certain
   no more fence work will be executed
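
The reordering above can be illustrated with a minimal user-space sketch.
It is not the driver code: pthreads stand in for the kernel workqueue and
mutex, and every name (queue_mgr, fence_mgr, suspend_worker,
close_fd_in_safe_order) is hypothetical. It only shows the principle that
pending work is waited for before the lock it uses is destroyed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-file user queue manager. */
struct queue_mgr {
    pthread_mutex_t lock;
    int suspended_queues;
};

/* Hypothetical stand-in for the eviction fence manager and its worker. */
struct fence_mgr {
    pthread_t worker;
    bool worker_started;
    struct queue_mgr *qmgr;
};

/* Worker body: it needs the queue manager's lock, like the worker in the trace. */
static void *suspend_worker(void *arg)
{
    struct fence_mgr *fm = arg;

    pthread_mutex_lock(&fm->qmgr->lock);
    fm->qmgr->suspended_queues++;
    pthread_mutex_unlock(&fm->qmgr->lock);
    return NULL;
}

/* Initialise the queue manager and queue one piece of fence work. */
static int fence_mgr_start(struct fence_mgr *fm, struct queue_mgr *qm)
{
    fm->qmgr = qm;
    qm->suspended_queues = 0;
    if (pthread_mutex_init(&qm->lock, NULL) != 0)
        return -1;
    if (pthread_create(&fm->worker, NULL, suspend_worker, fm) != 0) {
        pthread_mutex_destroy(&qm->lock);
        return -1;
    }
    fm->worker_started = true;
    return 0;
}

/* Step 1: destroy the fence side first, which waits for pending work. */
static void fence_mgr_destroy(struct fence_mgr *fm)
{
    if (fm->worker_started) {
        pthread_join(fm->worker, NULL);
        fm->worker_started = false;
    }
}

/* Step 2: only now is it safe to destroy the lock the worker used. */
static void queue_mgr_destroy(struct queue_mgr *qm)
{
    pthread_mutex_destroy(&qm->lock);
}

/* Returns the number of suspend operations the worker completed, or -1. */
static int close_fd_in_safe_order(void)
{
    struct queue_mgr qm;
    struct fence_mgr fm = {0};
    int done;

    if (fence_mgr_start(&fm, &qm) != 0)
        return -1;
    fence_mgr_destroy(&fm);      /* worker has finished after this call */
    done = qm.suspended_queues;  /* no concurrent access is possible now */
    queue_mgr_destroy(&qm);
    return done;
}
```

Swapping the two destroy calls in close_fd_in_safe_order() would recreate
the reported pattern, where the worker may lock a mutex that has already
been destroyed.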

The copy in amdgpu_driver_postclose_kms() needs to be removed (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:00:52 -04:00
Prike Liang
b2c11e2708 drm/amdgpu: lock the eviction fence for wq signals it
Lock the eviction fence and take a reference to it before the work queue
it schedules tries to signal it.

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:00:44 -04:00
Jiri Slaby (SUSE)
493e109267 gpu: Switch to irq_domain_create_linear()
irq_domain_add_linear() is going away as being obsolete now. Switch to
the preferred irq_domain_create_linear(). That differs in the first
parameter: It takes more generic struct fwnode_handle instead of struct
device_node. Therefore, of_fwnode_handle() is added around the
parameter.

Note some of the users can likely use dev->fwnode directly instead of
indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not
guaranteed to be set for all, so this has to be investigated on a
case-by-case basis (by people who can actually test with the HW).

[ tglx: Fix up subject prefix ]

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-19-jirislaby@kernel.org
2025-05-16 21:06:09 +02:00
fanhuang
2f0268ca1c drm/amdgpu/jpeg: sriov support for jpeg_v5_0_1
initialization table handshake with mmsch

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:39:14 -04:00
fanhuang
56fc141a5c drm/amdgpu/vcn: sriov support for vcn_v5_0_1
initialization table handshake with mmsch

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:39:10 -04:00
Arvind Yadav
13d0724f0f drm/amdgpu: fix use-after-unlock in eviction fence destroy
The eviction fence destroy path incorrectly calls dma_fence_put() on
evf_mgr->ev_fence after releasing the ev_fence_lock. This introduces a
potential use-after-unlock or race because another thread may concurrently
modify evf_mgr->ev_fence.

Fix this by grabbing a local reference to evf_mgr->ev_fence under the
lock and using that for dma_fence_put() after waiting.
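
A minimal user-space sketch of this pattern follows. It is an
illustration under assumptions, not the driver code: the fence, fence_mgr,
fence_alloc and fence_put names are hypothetical, a pthread mutex stands
in for ev_fence_lock, and a C11 atomic counter stands in for the dma_fence
reference count. The sketch also clears the shared pointer while holding
the lock, which the commit message does not mention.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted fence. */
struct fence {
    atomic_int refcount;
};

/* Hypothetical manager; cur_fence is protected by lock. */
struct fence_mgr {
    pthread_mutex_t lock;
    struct fence *cur_fence;
};

/* Allocate a fence holding one reference (owned by the caller). */
static struct fence *fence_alloc(void)
{
    struct fence *f = malloc(sizeof(*f));

    if (f)
        atomic_init(&f->refcount, 1);
    return f;
}

/* Drop one reference; returns 1 if this was the last one and f was freed. */
static int fence_put(struct fence *f)
{
    if (!f)
        return 0;
    if (atomic_fetch_sub(&f->refcount, 1) == 1) {
        free(f);
        return 1;
    }
    return 0;
}

/*
 * Destroy path: read the shared pointer only while holding the lock, then
 * work exclusively on the local copy after the lock has been released.
 */
static int fence_mgr_destroy(struct fence_mgr *mgr)
{
    struct fence *local;

    pthread_mutex_lock(&mgr->lock);
    local = mgr->cur_fence;   /* take over the manager's reference */
    mgr->cur_fence = NULL;    /* sketch-only choice: nobody else can see it */
    pthread_mutex_unlock(&mgr->lock);

    /* A wait on the fence would go here; mgr->cur_fence is not read again. */
    return fence_put(local);
}
```

Because the unlocked code touches only the local pointer, a concurrent
update of the shared field can no longer change which fence is released.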

Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:38:10 -04:00
Lijo Lazar
cc473057bb drm/amdgpu: Allow NPS2-CPX combination for VFs
CPX partition mode is compatible with NPS2 on aquavanjaram VFs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:38:05 -04:00
fanhuang
67cc7f9096 drm/amdgpu/mmsch: Add MMSCH v5_0 support for sriov
These structures are basically ported from MMSCH v4_0.
They are the same as in v4_0 except for the
init header.

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:57 -04:00
Lijo Lazar
3aa37922c6 drm/amdgpu: Use compatible NPS mode info
Compatible NPS modes for a partition mode are exposed through xcp_config
interface. To determine if a compute partition mode is valid, check if
the current NPS mode is part of compatible NPS modes.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:50 -04:00
Asad Kamal
58c397890f drm/amdgpu: Add pldm version reporting
Add pldm version reporting through sysfs node

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:38 -04:00
Amber Lin
e3d0870a90 drm/amdkfd: Support chain runlists of XNACK+/XNACK-
If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.

When the MEC/HWS supports it, KFD checks whether XNACK+ and XNACK-
processes are mixed. If they are, it enters over-subscription.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:29 -04:00
David (Ming Qiang) Wu
ee7360fc27 drm/amdgpu: read back register after written for VCN v4.0.5
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when doorbell is used. Add a
register read-back after the writes at the end of the function to
ensure that all register writes have completed before they are used.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 07c9db090b)
Cc: stable@vger.kernel.org
2025-05-14 11:51:31 -04:00
Shiwu Zhang
72ea78335e drm/amdgpu: add debugfs for spirom IFWI dump
Expose the debugfs file node for user space to dump the IFWI image
on spirom.

For one transaction between PSP and host, it will read out the
images on both active and inactive partitions so a buffer with two
times the size of maximum IFWI image (currently 16MByte) is needed.

v2: move the vbios gfl macros to the common header and rename the
    bo triplet struct to spirom_bo for this specific usage (Hawking)

v3: return directly the result of last command execution (Lijo)

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:30:15 -04:00
Prike Liang
64db767013 drm/amdgpu: fix userq resource double freed
The userq resource is already freed in the early phase of
drm_release, so avoid freeing it again in the later
kms postclose callback.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:30:03 -04:00
Jesse.Zhang
96a86dcb5b drm/amdgpu: Fix circular locking in userq creation
A circular locking dependency was detected between the global
`adev->userq_mutex` and per-file `userq_mgr->userq_mutex` when
creating user queues. The issue occurs because:

1. `amdgpu_userq_suspend()` and `amdgpu_userq_resume()` take
   `adev->userq_mutex` first, then `userq_mgr->userq_mutex`
2. While `amdgpu_userq_create()` takes them in reverse order

This patch resolves the issue by:
1. Moving the `adev->userq_mutex` lock earlier in `amdgpu_userq_create()`
   to cover the `amdgpu_userq_ensure_ev_fence()` call
2. Releasing it after we're done with both queue creation and the
   scheduling halt check

v2: remove unused adev->userq_mutex lock (Prike)

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:29:38 -04:00
David (Ming Qiang) Wu
07c9db090b drm/amdgpu: read back register after written for VCN v4.0.5
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when doorbell is used. Add a
register read-back after the writes at the end of the function to
ensure that all register writes have completed before they are used.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:28:39 -04:00
Tim Huang
2d73b0845a drm/amdgpu: fix incorrect MALL size for GFX1151
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a5c060b59)
Cc: stable@vger.kernel.org
2025-05-13 14:16:43 -04:00
Philip Yang
a0fa7873f2 drm/amdgpu: csa unmap use uninterruptible lock
After a process exits, the CSA is unmapped and the GPU VM is freed. If a
signal arrives, the wait for the VM lock is interrupted and returns
early, which leaks memory and triggers the warning backtrace below.

Use an uninterruptible wait for the lock to fix the issue.

WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
 amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
 Call Trace:
  <TASK>
  drm_file_free.part.0+0x1da/0x230 [drm]
  drm_close_helper.isra.0+0x65/0x70 [drm]
  drm_release+0x6a/0x120 [drm]
  amdgpu_drm_release+0x51/0x60 [amdgpu]
  __fput+0x9f/0x280
  ____fput+0xe/0x20
  task_work_run+0x67/0xa0
  do_exit+0x217/0x3c0
  do_group_exit+0x3b/0xb0
  get_signal+0x14a/0x8d0
  arch_do_signal_or_restart+0xde/0x100
  exit_to_user_mode_loop+0xc1/0x1a0
  exit_to_user_mode_prepare+0xf4/0x100
  syscall_exit_to_user_mode+0x17/0x40
  do_syscall_64+0x69/0xc0

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7dbbfb3c17)
Cc: stable@vger.kernel.org
2025-05-13 14:16:30 -04:00
Arunpravin Paneer Selvam
553ad6fc2b drm/amdgpu/userq: Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock)
Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock) warning logs.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 13:39:52 -04:00
Arunpravin Paneer Selvam
bc5bab82d3 drm/amdgpu: Fix userq ttm_bo_pin and ttm_bo_unpin lockdep warnings
The ttm_bo_pin and ttm_bo_unpin warnings are resolved by moving the
doorbell bo reserve up before pin/unpin.

WARNING: CPU: 11 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:592 ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000277] CPU: 11 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G        W          6.12.0+ #15
[  +0.000006] Tainted: [W]=WARN
[  +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024
[  +0.000004] RIP: 0010:ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000005] RSP: 0018:ffff88846ca879d0 EFLAGS: 00010246
[  +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000
[  +0.000004] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  +0.000005] RBP: ffff88846ca879e8 R08: 0000000000000000 R09: 0000000000000000
[  +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810b7ca848
[  +0.000004] R13: ffff88846c666250 R14: 1ffff1108d950f44 R15: ffff88846ca87aa0
[  +0.000005] FS:  00007c45ff436d00(0000) GS:ffff888409580000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000005] CR2: 00005b0c142a60e0 CR3: 000000012ce5a000 CR4: 0000000000f50ef0
[  +0.000004] PKRU: 55555554
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000005]  ? show_regs+0x6c/0x80
[  +0.000007]  ? __warn+0xd2/0x2d0
[  +0.000007]  ? ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000031]  ? report_bug+0x282/0x2f0
[  +0.000012]  ? handle_bug+0x6e/0xc0
[  +0.000007]  ? exc_invalid_op+0x18/0x50
[  +0.000007]  ? asm_exc_invalid_op+0x1b/0x20
[  +0.000017]  ? ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000014]  amdgpu_bo_pin+0x365/0x9d0 [amdgpu]
[  +0.000191]  ? __pfx_amdgpu_bo_pin+0x10/0x10 [amdgpu]
[  +0.000185]  ? drm_gem_object_lookup+0x81/0xc0
[  +0.000008]  ? kasan_save_alloc_info+0x37/0x60
[  +0.000007]  ? __kasan_kmalloc+0xc3/0xd0
[  +0.000013]  amdgpu_userqueue_get_doorbell_index+0xee/0x5f0 [amdgpu]
[  +0.000209]  amdgpu_userq_ioctl+0x6b4/0xd40 [amdgpu]
[  +0.000193]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000211]  ? lock_acquire+0x7c/0xc0
[  +0.000006]  ? drm_dev_enter+0x51/0x190
[  +0.000015]  drm_ioctl_kernel+0x18b/0x330
[  +0.000007]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000190]  ? __pfx_drm_ioctl_kernel+0x10/0x10
[  +0.000005]  ? lock_acquire+0x7c/0xc0
[  +0.000009]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000011]  drm_ioctl+0x589/0xd00
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000194]  ? __pfx_drm_ioctl+0x10/0x10
[  +0.000006]  ? __pm_runtime_resume+0x80/0x110
[  +0.000021]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? trace_hardirqs_on+0x53/0x60
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? _raw_spin_unlock_irqrestore+0x51/0x80
[  +0.000013]  amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu]
[  +0.000185]  __x64_sys_ioctl+0x13a/0x1c0
[  +0.000010]  x64_sys_call+0x11ad/0x25f0
[  +0.000007]  do_syscall_64+0x91/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? irqentry_exit+0x77/0xb0
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? exc_page_fault+0x93/0x150
[  +0.000009]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7c45ff924ded
[  +0.000005] RSP: 002b:00007ffff7167810 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded
[  +0.000004] RDX: 00007ffff7167870 RSI: 00000000c0486456 RDI: 000000000000000b
[  +0.000004] RBP: 00007ffff7167860 R08: ffff800100000000 R09: 0000000000010000
[  +0.000005] R10: 00007ffff7167950 R11: 0000000000000246 R12: 00005b0c2a51bc48
[  +0.000004] R13: 000000000000000b R14: 0000000000000000 R15: 00007ffff7167950
[  +0.000022]  </TASK>
[  +0.000004] irq event stamp: 80693
[  +0.000004] hardirqs last  enabled at (80699): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0
[  +0.000005] hardirqs last disabled at (80704): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0
[  +0.000005] softirqs last  enabled at (80390): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005] softirqs last disabled at (80385): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000006] ---[ end trace 0000000000000000 ]---
------------------------------------------------------------------------------------------------------

[  +0.000006] WARNING: CPU: 10 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:611 ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000280] CPU: 10 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G        W          6.12.0+ #15
[  +0.000006] Tainted: [W]=WARN
[  +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024
[  +0.000004] RIP: 0010:ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000005] RSP: 0018:ffff88846ca87888 EFLAGS: 00010246
[  +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000
[  +0.000005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  +0.000004] RBP: ffff88846ca878a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888164e90050
[  +0.000005] R13: ffff88846c666200 R14: 0000000000000001 R15: ffff888168402d28
[  +0.000004] FS:  00007c45ff436d00(0000) GS:ffff888409500000(0000) knlGS:0000000000000000
[  +0.000005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000004] CR2: 00007c45f7373b20 CR3: 000000012ce5a000 CR4: 0000000000f50ef0
[  +0.000005] PKRU: 55555554
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000005]  ? show_regs+0x6c/0x80
[  +0.000008]  ? __warn+0xd2/0x2d0
[  +0.000007]  ? ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000012]  ? report_bug+0x282/0x2f0
[  +0.000013]  ? handle_bug+0x6e/0xc0
[  +0.000006]  ? exc_invalid_op+0x18/0x50
[  +0.000008]  ? asm_exc_invalid_op+0x1b/0x20
[  +0.000017]  ? ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000011]  ? ttm_bo_unpin+0x217/0x2c0 [ttm]
[  +0.000011]  amdgpu_bo_unpin+0x45/0x250 [amdgpu]
[  +0.000216]  amdgpu_userq_ioctl+0x2c3/0xd40 [amdgpu]
[  +0.000226]  ? drm_dev_exit+0x2d/0x60
[  +0.000010]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000201]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? lock_acquire+0x7c/0xc0
[  +0.000006]  ? drm_dev_enter+0x51/0x190
[  +0.000015]  drm_ioctl_kernel+0x18b/0x330
[  +0.000007]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000188]  ? __pfx_drm_ioctl_kernel+0x10/0x10
[  +0.000006]  ? lock_acquire+0x7c/0xc0
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000006]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000010]  drm_ioctl+0x589/0xd00
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000211]  ? __pfx_drm_ioctl+0x10/0x10
[  +0.000006]  ? __pm_runtime_resume+0x80/0x110
[  +0.000020]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? trace_hardirqs_on+0x53/0x60
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? _raw_spin_unlock_irqrestore+0x51/0x80
[  +0.000013]  amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu]
[  +0.000186]  __x64_sys_ioctl+0x13a/0x1c0
[  +0.000010]  x64_sys_call+0x11ad/0x25f0
[  +0.000007]  do_syscall_64+0x91/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? do_syscall_64+0x9d/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000010]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
[  +0.000005]  ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[  +0.000013]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000009]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? syscall_exit_to_user_mode+0x95/0x260
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? do_syscall_64+0x9d/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? do_syscall_64+0x9d/0x180
[  +0.000011]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000010]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000009]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? irqentry_exit_to_user_mode+0x8b/0x260
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? irqentry_exit+0x77/0xb0
[  +0.000004]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? exc_page_fault+0x93/0x150
[  +0.000010]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7c45ff924ded
[  +0.000005] RSP: 002b:00007ffff7168790 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded
[  +0.000005] RDX: 00007ffff71687f0 RSI: 00000000c0486456 RDI: 000000000000000b
[  +0.000004] RBP: 00007ffff71687e0 R08: 00005b0c2a49b010 R09: 0000000000000007
[  +0.000004] R10: 00005b0c2a4d7140 R11: 0000000000000246 R12: 000000000000000b
[  +0.000004] R13: 00007c45ff19e5cc R14: 00005b0c2a51c538 R15: 00005b0c2a51bbd8
[  +0.000022]  </TASK>
[  +0.000005] irq event stamp: 87419
[  +0.000004] hardirqs last  enabled at (87425): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0
[  +0.000005] hardirqs last disabled at (87430): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0
[  +0.000005] softirqs last  enabled at (87058): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000006] softirqs last disabled at (87053): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005] ---[ end trace 0000000000000000 ]---

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 13:39:46 -04:00
Arunpravin Paneer Selvam
218caca4ba drm/amdgpu/userq: Fix lock contention in userq fence
Fix lockdep warnings.

[  +0.000637] ================================
[  +0.000004] WARNING: inconsistent lock state
[  +0.000004] 6.12.0+ #18 Tainted: G        W  OE
[  +0.000004] --------------------------------
[  +0.000004] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[  +0.000004] Xwayland/1952 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  +0.000005] ffff8884636f4740 (&fence_drv->fence_list_lock){?...}-{2:2}, at: amdgpu_userq_fence_driver_destroy+0xb8/0x540 [amdgpu]
[  +0.000208] {IN-HARDIRQ-W} state was registered at:
[  +0.000004]   lock_acquire.part.0+0x116/0x360
[  +0.000005]   lock_acquire+0x7c/0xc0
[  +0.000005]   _raw_spin_lock+0x2f/0x60
[  +0.000005]   amdgpu_userq_fence_driver_process+0x75/0x400 [amdgpu]
[  +0.000185]   gfx_v12_0_eop_irq+0x29f/0x420 [amdgpu]
[  +0.000210]   amdgpu_irq_dispatch+0x2a4/0x7b0 [amdgpu]
[  +0.000191]   amdgpu_ih_process+0x1e1/0x3d0 [amdgpu]
[  +0.000185]   amdgpu_irq_handler+0x28/0xc0 [amdgpu]
[  +0.000183]   __handle_irq_event_percpu+0x1bb/0x590
[  +0.000005]   handle_irq_event+0xab/0x1d0
[  +0.000005]   handle_edge_irq+0x1fd/0xc10
[  +0.000005]   __common_interrupt+0x83/0x190
[  +0.000004]   common_interrupt+0xb1/0xe0
[  +0.000005]   asm_common_interrupt+0x27/0x40
[  +0.000004]   cpuidle_enter_state+0x2ba/0x530
[  +0.000005]   cpuidle_enter+0x4f/0xb0
[  +0.000006]   call_cpuidle+0x46/0xd0
[  +0.000005]   do_idle+0x367/0x430
[  +0.000004]   cpu_startup_entry+0x58/0x70
[  +0.000005]   start_secondary+0x224/0x2b0
[  +0.000005]   common_startup_64+0x13e/0x141
[  +0.000005] irq event stamp: 88271
[  +0.000004] hardirqs last  enabled at (88271): [<ffffffffad9ca7a1>] _raw_spin_unlock_irqrestore+0x51/0x80
[  +0.000005] hardirqs last disabled at (88270): [<ffffffffad9ca424>] _raw_spin_lock_irqsave+0x74/0x80
[  +0.000005] softirqs last  enabled at (87858): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005] softirqs last disabled at (87849): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005]
              other info that might help us debug this:
[  +0.000004]  Possible unsafe locking scenario:

[  +0.000003]        CPU0
[  +0.000004]        ----
[  +0.000003]   lock(&fence_drv->fence_list_lock);

v2:
  Drop fence_list_flags and use xa_lock_irqsave() flags parameter (Christian)
  Fix merge conflicts.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 13:39:20 -04:00
Jesse.Zhang
3b63602614 drm/amdgpu: Add GFX 9.5.0 support for per-queue/pipe reset
This patch enables per-queue and per-pipe reset functionality for
GFX IP v9.5.0 when using MEC firmware version 21 (0x15) or later.

This change:
1. Refactors the pipe reset support check in gfx_v9_4_3_pipe_reset_support()
   to use the compute_supported_reset flags instead of hardcoding
   version checks.
2. Adds support for GFX9.5.0 (IP 9.5.0) with MEC firmware version >= 21
   to enable per-queue and per-pipe reset capabilities.

v2: Replaced mec version check with !!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE) (Lijo)
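
The v2 approach, deriving support from a capability mask instead of
repeating a firmware version check at each call site, can be sketched as
below. This is a stand-alone illustration: the bit values, the gfx_caps
struct and both helpers are hypothetical, and only the version threshold
(21, i.e. 0x15) and the `!!(mask & flag)` idiom come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real AMDGPU_RESET_TYPE_* values are not shown here. */
enum reset_type {
    RESET_TYPE_FULL      = 1u << 0,
    RESET_TYPE_PER_QUEUE = 1u << 1,
    RESET_TYPE_PER_PIPE  = 1u << 2,
};

/* Hypothetical subset of the per-device GFX state. */
struct gfx_caps {
    uint32_t mec_fw_version;
    uint32_t compute_supported_reset;
};

/* The version check happens once, when the capability mask is built. */
static void gfx_caps_init(struct gfx_caps *caps, uint32_t mec_fw_version)
{
    caps->mec_fw_version = mec_fw_version;
    caps->compute_supported_reset = RESET_TYPE_FULL;
    if (mec_fw_version >= 21) /* 0x15, the threshold named in the message */
        caps->compute_supported_reset |=
            RESET_TYPE_PER_QUEUE | RESET_TYPE_PER_PIPE;
}

/* Callers test the flag only and never look at the firmware version. */
static bool pipe_reset_supported(const struct gfx_caps *caps)
{
    return !!(caps->compute_supported_reset & RESET_TYPE_PER_PIPE);
}
```

Keeping the version logic in one initialisation path means that a new IP
or firmware version only has to set the flag there, and the support check
itself stays unchanged.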

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:37:32 -04:00
Tao Zhou
5d6fddac55 drm/amdgpu: set vram type for GC 9.5.0
Set vram type so we can take different actions according to the type.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:37:28 -04:00
Tao Zhou
dc111f8fb1 drm/amdgpu: set flip bits for RAS bad pages
Make the code more general so that the user doesn't need to pay attention
to the details of setting the flip bits.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:37:19 -04:00
Ce Sun
533aa8bdbe drm/amdgpu: Modify the count method of defer error
The number of newly added de counts and the number of
newly added error addresses remain consistent

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:35:12 -04:00
Lijo Lazar
937467b7d5 drm/amdgpu: Log RAS errors during load
During driver load, RAS event manager may not be initialized. This will
cause any ATHUB event during driver load to be skipped in dmesg log. Log
the error in dmesg log for easier diagnosis.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:34:02 -04:00
Jesse.Zhang
648a0dc0d7 drm/amdgpu: Fix user queue deadlock by reordering mutex locking
This resolves a deadlock between user queue management and GPU reset
paths by enforcing consistent lock ordering.

The deadlock occurred when:

1. Process exit path (amdgpu_userq_mgr_fini) would:
   - Take uqm->userq_mutex
   - Then try to take adev->userq_mutex for list operations

2. GPU reset path (amdgpu_userq_pre_reset) would:
   - Take adev->userq_mutex first (for list traversal)
   - Then take uqm->userq_mutex

The solution establishes a strict top-down locking order:
1. Always take adev->userq_mutex before any uqm->userq_mutex
2. Maintain this order consistently across all code paths

Changes made:
- Reordered locking in amdgpu_userq_mgr_fini() to take device lock first
- Kept existing proper order in amdgpu_userq_pre_reset()
- Simplified the fini flow by removing redundant operations

This prevents circular dependencies while maintaining thread safety
during both normal operation and GPU reset scenarios.
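
The ordering rule above can be sketched in user space with pthreads. This is an analogy under stated assumptions, not the driver code: `struct device`, `struct queue_mgr`, `mgr_fini()` and `pre_reset()` are hypothetical names that mirror the two paths described in the message.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device and the per-process queue manager. */
struct device {
    pthread_mutex_t lock;   /* plays the role of adev->userq_mutex */
    int nr_mgrs;            /* number of managers on the device list */
};

struct queue_mgr {
    pthread_mutex_t lock;   /* plays the role of uqm->userq_mutex */
    struct device *dev;
    int registered;
};

/* Teardown path: device lock first, then the manager lock. */
static void mgr_fini(struct queue_mgr *mgr)
{
    assert(mgr != NULL && mgr->dev != NULL);
    pthread_mutex_lock(&mgr->dev->lock);
    pthread_mutex_lock(&mgr->lock);
    if (mgr->registered) {
        mgr->registered = 0;
        mgr->dev->nr_mgrs--;
    }
    pthread_mutex_unlock(&mgr->lock);
    pthread_mutex_unlock(&mgr->dev->lock);
}

/* Reset path: takes the two locks in the same order as mgr_fini(). */
static int pre_reset(struct device *dev, struct queue_mgr *mgr)
{
    int seen;

    assert(dev != NULL && mgr != NULL);
    pthread_mutex_lock(&dev->lock);
    pthread_mutex_lock(&mgr->lock);
    seen = mgr->registered;
    pthread_mutex_unlock(&mgr->lock);
    pthread_mutex_unlock(&dev->lock);
    return seen;
}
```

Because both functions acquire the device lock before the manager lock, neither thread can hold one lock while waiting for the other in the opposite order, which is the circular wait the fix removes.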

Fixes: 4ce60dbada ("drm/amdgpu: store userq_managers in a list in adev")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:25 -04:00
Ce Sun
f71509fdd0 drm/amdgpu: Fix the kernel panic caused by RAS records exceed threshold
Fix a kernel panic caused by RAS records exceeding the threshold when the
driver is loaded with RMA specified (bad_page_threshold=128).

1. Fix the warnings caused by disabling the interrupt source
   before it was enabled.
2. Fix the kernel panic (NULL pointer dereference during fini) when
   xcp sysfs is not initialized.
3. Fix the memory leak caused by the device's early exit due to RMA.

The first reason:
[ 2744.246650] ------------[ cut here ]------------
[ 2744.246651] WARNING: CPU: 0 PID: 289 at /tmp/amd.BkfTLqYV/amd/amdgpu/amdgpu_irq.c:635 amdgpu_irq_put.cold+0x42/0x6e [amdgpu]
[ 2744.247108] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay binfmt_misc intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel nls_iso8859_1 kvm rapl isst_if_mbox_pci isst_if_mmio pmt_telemetry pmt_crashlog isst_if_common pmt_class mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 2744.247167]  linear mlx5_ib ib_uverbs ib_core ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul ghash_clmulni_intel mlx5_core sysfillrect sysimgblt aesni_intel mlxfw fb_sys_fops psample cec crypto_simd cryptd rc_core i2c_i801 nvme xhci_pci tls intel_pmt drm pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg
[ 2744.247194] CPU: 0 PID: 289 Comm: kworker/0:1 Tainted: G           OE     5.15.0-70-generic #77-Ubuntu
[ 2744.247197] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024
[ 2744.247198] Workqueue: events work_for_cpu_fn
[ 2744.247206] RIP: 0010:amdgpu_irq_put.cold+0x42/0x6e [amdgpu]
[ 2744.247634] Code: 79 7f ff 44 89 ee 48 c7 c7 4d 5a 42 c2 89 55 d4 e8 90 09 bc bf 8b 55 d4 4c 89 e6 4c 89 ff e8 3c 76 7f ff 8b 55 d4 84 c0 75 07 <0f> 0b e9 95 79 7f ff 49 03 5c 24 08 f0 ff 0b 75 13 4c 89 e6 4c 89
[ 2744.247636] RSP: 0018:ffa0000019e27cb0 EFLAGS: 00010246
[ 2744.247639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff11000150fa87c0
[ 2744.247641] RDX: 0000000000000000 RSI: ffffffffc2222430 RDI: ff1100019f200000
[ 2744.247642] RBP: ffa0000019e27ce0 R08: 0000000000000003 R09: ffffffffffe41a08
[ 2744.247643] R10: 0000000000ffff0a R11: 0000000000000001 R12: ff1100019f22ce60
[ 2744.247644] R13: 0000000000000000 R14: 00000000ffffffea R15: ff1100019f200000
[ 2744.247645] FS:  0000000000000000(0000) GS:ff11007e7e400000(0000) knlGS:0000000000000000
[ 2744.247647] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2744.247649] CR2: 00007f3d2002819c CR3: 0000000006810003 CR4: 0000000000771ef0
[ 2744.247650] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2744.247651] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 2744.247652] PKRU: 55555554
[ 2744.247653] Call Trace:
[ 2744.247654]  <TASK>
[ 2744.247656]  sdma_v4_4_2_hw_fini+0x7a/0xc0 [amdgpu]
[ 2744.247997]  ? vcn_v4_0_3_hw_fini+0x5f/0xa0 [amdgpu]
[ 2744.248336]  amdgpu_ip_block_hw_fini+0x31/0x61 [amdgpu]
[ 2744.248776]  amdgpu_device_fini_hw+0x3bb/0x47b [amdgpu]
[ 2744.249197]  ? blocking_notifier_chain_unregister+0x56/0xb0
[ 2744.249202]  amdgpu_driver_unload_kms+0x51/0x60 [amdgpu]
[ 2744.249482]  amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[ 2744.249913]  amdgpu_pci_probe+0x23e/0x590 [amdgpu]
[ 2744.250187]  local_pci_probe+0x48/0x90
[ 2744.250191]  work_for_cpu_fn+0x17/0x30
[ 2744.250196]  process_one_work+0x228/0x3d0
[ 2744.250198]  worker_thread+0x223/0x420
[ 2744.250200]  ? process_one_work+0x3d0/0x3d0
[ 2744.250201]  kthread+0x127/0x150
[ 2744.250204]  ? set_kthread_struct+0x50/0x50
[ 2744.250207]  ret_from_fork+0x1f/0x30
[ 2744.250212]  </TASK>
[ 2744.250213] ---[ end trace 488c997a88508bc3 ]---

The second reason:
[ 5139.303446] Memory manager not clean during takedown.
[ 5139.303509] WARNING: CPU: 145 PID: 117699 at drivers/gpu/drm/drm_mm.c:998 drm_mm_takedown+0x27/0x30 [drm]
[ 5139.303542] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel binfmt_misc kvm nls_iso8859_1 rapl isst_if_mbox_pci pmt_telemetry pmt_crashlog isst_if_mmio pmt_class isst_if_common mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 5139.303572]  linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul ast crc32_pclmul i2c_algo_bit ghash_clmulni_intel aesni_intel crypto_simd drm_vram_helper cryptd drm_ttm_helper mlx5_core ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core mlxfw psample intel_pmt nvme xhci_pci drm tls i2c_i801 pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [last unloaded: amdkcl]
[ 5139.303588] CPU: 145 PID: 117699 Comm: modprobe Tainted: G     U     OE     5.15.0-70-generic #77-Ubuntu
[ 5139.303590] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024
[ 5139.303591] RIP: 0010:drm_mm_takedown+0x27/0x30 [drm]
[ 5139.303605] Code: cc 66 90 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 f8 75 05 c3 cc cc cc cc 55 48 c7 c7 18 d0 10 c0 48 89 e5 e8 5a bc c3 c1 <0f> 0b 5d c3 cc cc cc cc 90 0f 1f 44 00 00 55 b9 15 00 00 00 48 89
[ 5139.303607] RSP: 0018:ffa00000325c3940 EFLAGS: 00010286
[ 5139.303608] RAX: 0000000000000000 RBX: ff1100012f5cecb0 RCX: 0000000000000027
[ 5139.303609] RDX: ff11007e7fa60588 RSI: 0000000000000001 RDI: ff11007e7fa60580
[ 5139.303610] RBP: ffa00000325c3940 R08: 0000000000000003 R09: fffffffff00c2b78
[ 5139.303610] R10: 000000000000002b R11: 0000000000000001 R12: ff1100012f5cec00
[ 5139.303611] R13: ff1100012138f068 R14: 0000000000000000 R15: ff1100012f5cec90
[ 5139.303611] FS:  00007f42ffca0000(0000) GS:ff11007e7fa40000(0000) knlGS:0000000000000000
[ 5139.303612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.303613] CR2: 00007f23d945ab68 CR3: 00000001212ce005 CR4: 0000000000771ee0
[ 5139.303614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.303615] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 5139.303615] PKRU: 55555554
[ 5139.303616] Call Trace:
[ 5139.303617]  <TASK>
[ 5139.303619]  amdttm_range_man_fini_nocheck+0xfe/0x1c0 [amdttm]
[ 5139.303625]  amdgpu_ttm_fini+0x2ed/0x390 [amdgpu]
[ 5139.303800]  amdgpu_bo_fini+0x27/0xc0 [amdgpu]
[ 5139.303959]  gmc_v9_0_sw_fini+0x63/0x90 [amdgpu]
[ 5139.304144]  amdgpu_device_fini_sw+0x125/0x6a0 [amdgpu]
[ 5139.304302]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 5139.304455]  devm_drm_dev_init_release+0x4a/0x80 [drm]
[ 5139.304472]  devm_action_release+0x12/0x20
[ 5139.304476]  release_nodes+0x3d/0xb0
[ 5139.304478]  devres_release_all+0x9b/0xd0
[ 5139.304480]  really_probe+0x11d/0x420
[ 5139.304483]  __driver_probe_device+0x119/0x190
[ 5139.304485]  driver_probe_device+0x23/0xc0
[ 5139.304487]  __driver_attach+0xf7/0x1f0
[ 5139.304489]  ? __device_attach_driver+0x140/0x140
[ 5139.304491]  bus_for_each_dev+0x7c/0xd0
[ 5139.304493]  driver_attach+0x1e/0x30
[ 5139.304494]  bus_add_driver+0x148/0x220
[ 5139.304496]  driver_register+0x95/0x100
[ 5139.304498]  __pci_register_driver+0x68/0x70
[ 5139.304500]  amdgpu_init+0xbc/0x1000 [amdgpu]
[ 5139.304655]  ? 0xffffffffc0b8f000
[ 5139.304657]  do_one_initcall+0x46/0x1e0
[ 5139.304659]  ? kmem_cache_alloc_trace+0x19e/0x2e0
[ 5139.304663]  do_init_module+0x52/0x260
[ 5139.304665]  load_module+0xb2b/0xbc0
[ 5139.304667]  __do_sys_finit_module+0xbf/0x120
[ 5139.304669]  __x64_sys_finit_module+0x18/0x20
[ 5139.304670]  do_syscall_64+0x59/0xc0
[ 5139.304673]  ? exit_to_user_mode_prepare+0x37/0xb0
[ 5139.304676]  ? syscall_exit_to_user_mode+0x27/0x50
[ 5139.304678]  ? __x64_sys_mmap+0x33/0x50
[ 5139.304680]  ? do_syscall_64+0x69/0xc0
[ 5139.304681]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 5139.304684] RIP: 0033:0x7f42ffdbf88d
[ 5139.304686] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
[ 5139.304687] RSP: 002b:00007ffcb7427158 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 5139.304688] RAX: ffffffffffffffda RBX: 000055ce8b8f3150 RCX: 00007f42ffdbf88d
[ 5139.304689] RDX: 0000000000000000 RSI: 000055ce8b8f9a70 RDI: 000000000000000a
[ 5139.304690] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000011
[ 5139.304690] R10: 000000000000000a R11: 0000000000000246 R12: 000055ce8b8f9a70
[ 5139.304691] R13: 000055ce8b8f2ec0 R14: 000055ce8b8f2ab0 R15: 000055ce8b8f9aa0
[ 5139.304692]  </TASK>
[ 5139.304693] ---[ end trace 8536b052f7883003 ]---

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:11 -04:00
Tao Zhou
b7674ae75b drm/amdgpu: get RAS retire flip bits for new type of HBM
Get RAS retire flip bits for HBM with different types in various NPS modes.
Also set flip row bit and MCA R13 bit in PA in different NPS modes.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:08 -04:00
Tao Zhou
9b5b71895b drm/amdgpu: implement get_retire_flip_bits for UMC v12
The RAS bad page retire flip bits can be set per vram type,
vram vendor and NPS mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:05 -04:00
Tao Zhou
699bff37a5 drm/amdgpu: add get_retire_flip_bits for UMC
Add the general interface to get flip bits for RAS bad page retirement.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:01 -04:00
Tao Zhou
4ce5b99128 drm/amdgpu: adjust high bits for RAS retired page
Per UMC address conversion algorithm, the high row bits of UMC MCA
address are changed when they're converted into normalized address
on specific ASICs.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:31:46 -04:00
Tao Zhou
1df57411a6 drm/amd: add definition for new memory type
Support new version of HBM.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:31:40 -04:00
ganglxie
f5db59067c Refine RAS bad page records counting and parsing in eeprom V3
There are only MCA records in V3, so there is no need to handle PA records.
Recalculate the value of ras_num_bad_pages when parsing fails and
continue with the remaining records instead of quitting.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:31:33 -04:00
Tim Huang
0a5c060b59 drm/amdgpu: fix incorrect MALL size for GFX1151
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:23:23 -04:00
Arvind Yadav
010503a3cb drm/amdgpu: Fix amdgpu_userq_wait_ioctl() warn missing error code 'r'
To resolve the warning regarding the missing error code 'r' in
amdgpu_userq_wait_ioctl(), assign the value 'r = -EINVAL'.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505080458.rnV8YfiY-lkp@intel.com/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:21:56 -04:00
Arvind Yadav
f10eb185ad drm/amdgpu: Fix NULL dereference in amdgpu_userq_restore_worker
Switch cancel_delayed_work() to cancel_delayed_work_sync() to ensure
the delayed work has finished executing before proceeding with
resource cleanup. This prevents a potential use-after-free or
NULL dereference if the resume_work is still running during finalization.
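
The difference between cancelling and cancelling synchronously can be illustrated with a user-space analogy. This sketch uses pthreads rather than the kernel workqueue API, and `struct worker`, `worker_start()` and `worker_cancel_sync()` are hypothetical names: the point is only that teardown waits for a running worker before the resource it touches is released.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical worker that keeps touching a shared resource until cancelled. */
struct worker {
    pthread_t thread;
    atomic_bool cancel;
    atomic_bool done;
    long *resource;
};

static void *restore_worker(void *arg)
{
    struct worker *w = arg;

    while (!atomic_load(&w->cancel))
        (*w->resource)++;          /* would fault if the resource were freed */
    atomic_store(&w->done, true);
    return NULL;
}

static int worker_start(struct worker *w, long *resource)
{
    assert(w != NULL && resource != NULL);
    atomic_init(&w->cancel, false);
    atomic_init(&w->done, false);
    w->resource = resource;
    return pthread_create(&w->thread, NULL, restore_worker, w);
}

/*
 * "Sync" cancel: request the stop AND wait until the worker has really
 * finished. Only after this returns is it safe to free *resource.
 */
static void worker_cancel_sync(struct worker *w)
{
    atomic_store(&w->cancel, true);
    pthread_join(w->thread, NULL);
}
```

A non-waiting cancel would correspond to setting `cancel` and returning immediately, which leaves a window in which the worker still dereferences `resource` during cleanup.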

BUG: kernel NULL pointer dereference, address: 0000000000000140
[  +0.000050] #PF: supervisor read access in kernel mode
[  +0.000019] #PF: error_code(0x0000) - not-present page
[  +0.000021] PGD 0 P4D 0
[  +0.000015] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0.000021] CPU: 17 UID: 0 PID: 196299 Comm: kworker/17:0 Tainted: G     U             6.14.0-org-staging #1
[  +0.000032] Tainted: [U]=USER
[  +0.000015] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F39 03/22/2024
[  +0.000029] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[  +0.000426] RIP: 0010:drm_exec_lock_obj+0x32/0x210 [drm_exec]
[  +0.000025] Code: e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 48 83 ec 08 4c 8b 77 30 4d 85 f6 0f 85 c0 00 00 00 4c 8d 7f 08 48 39 77 38 74 54 <49> 8b bd f8 00 00 00 4c 89 fe 41 f6 04 24 01 75 3c e8 08 50 bc e0
[  +0.000046] RSP: 0018:ffffab1b04da3ce8 EFLAGS: 00010297
[  +0.000020] RAX: 0000000000000001 RBX: ffff930cc60e4bc0 RCX: 0000000000000000
[  +0.000025] RDX: 0000000000000004 RSI: 0000000000000048 RDI: ffffab1b04da3d88
[  +0.000028] RBP: ffffab1b04da3d10 R08: ffff930cc60e4000 R09: 0000000000000000
[  +0.000022] R10: ffffab1b04da3d18 R11: 0000000000000001 R12: ffffab1b04da3d88
[  +0.000023] R13: 0000000000000048 R14: 0000000000000000 R15: ffffab1b04da3d90
[  +0.000023] FS:  0000000000000000(0000) GS:ffff9313dea80000(0000) knlGS:0000000000000000
[  +0.000024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000021] CR2: 0000000000000140 CR3: 000000018351a000 CR4: 0000000000350ef0
[  +0.000025] Call Trace:
[  +0.000018]  <TASK>
[  +0.000015]  ? show_regs+0x69/0x80
[  +0.000022]  ? __die+0x25/0x70
[  +0.000019]  ? page_fault_oops+0x15d/0x510
[  +0.000024]  ? do_user_addr_fault+0x312/0x690
[  +0.000024]  ? sched_clock_cpu+0x10/0x1a0
[  +0.000028]  ? exc_page_fault+0x78/0x1b0
[  +0.000025]  ? asm_exc_page_fault+0x27/0x30
[  +0.000024]  ? drm_exec_lock_obj+0x32/0x210 [drm_exec]
[  +0.000024]  drm_exec_prepare_obj+0x21/0x60 [drm_exec]
[  +0.000021]  amdgpu_vm_lock_pd+0x22/0x30 [amdgpu]
[  +0.000266]  amdgpu_userq_validate_bos+0x6c/0x320 [amdgpu]
[  +0.000333]  amdgpu_userq_restore_worker+0x4a/0x120 [amdgpu]
[  +0.000316]  process_one_work+0x189/0x3c0
[  +0.000021]  worker_thread+0x2a4/0x3b0
[  +0.000022]  kthread+0x109/0x220
[  +0.000018]  ? __pfx_worker_thread+0x10/0x10
[  +0.000779]  ? _raw_spin_unlock_irq+0x1f/0x40
[  +0.000560]  ? __pfx_kthread+0x10/0x10
[  +0.000543]  ret_from_fork+0x3c/0x60
[  +0.000507]  ? __pfx_kthread+0x10/0x10
[  +0.000515]  ret_from_fork_asm+0x1a/0x30
[  +0.000515]  </TASK>

v2: Replace cancel_delayed_work() with cancel_delayed_work_sync()
    in amdgpu_userq_destroy() and amdgpu_userq_evict().

Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:21:39 -04:00
Philip Yang
7dbbfb3c17 drm/amdgpu: csa unmap use uninterruptible lock
When a process exits, the driver unmaps the CSA and frees the GPU VM. If a
signal arrives while waiting for the VM lock, the wait is interrupted and
returns early, which leaks memory and produces the warning backtrace below.

Use an uninterruptible lock wait to fix the issue.

WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
 amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
 Call Trace:
  <TASK>
  drm_file_free.part.0+0x1da/0x230 [drm]
  drm_close_helper.isra.0+0x65/0x70 [drm]
  drm_release+0x6a/0x120 [drm]
  amdgpu_drm_release+0x51/0x60 [amdgpu]
  __fput+0x9f/0x280
  ____fput+0xe/0x20
  task_work_run+0x67/0xa0
  do_exit+0x217/0x3c0
  do_group_exit+0x3b/0xb0
  get_signal+0x14a/0x8d0
  arch_do_signal_or_restart+0xde/0x100
  exit_to_user_mode_loop+0xc1/0x1a0
  exit_to_user_mode_prepare+0xf4/0x100
  syscall_exit_to_user_mode+0x17/0x40
  do_syscall_64+0x69/0xc0

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:21:31 -04:00
Dave Airlie
1faeeb315f amd-drm-next-6.16-2025-05-09:
amdgpu:
 - IPS fixes
 - DSC cleanup
 - DC Scaling updates
 - DC FP fixes
 - Fused I2C-over-AUX updates
 - SubVP fixes
 - Freesync fix
 - DMUB AUX fixes
 - VCN fix
 - Hibernation fixes
 - HDP fixes
 - DCN 2.1 fixes
 - DPIA fixes
 - DMUB updates
 - Use drm_file_err in amdgpu
 - Enforce isolation updates
 - Use new dma_fence helpers
 - USERQ fixes
 - Documentation updates
 - Misc code cleanups
 - SR-IOV updates
 - RAS updates
 - PSP 12 cleanups
 
 amdkfd:
 - Update error messages for SDMA
 - Userptr updates
 
 drm:
 - Add drm_file_err function
 
 dma-buf:
 - Add a helper to sort and deduplicate dma_fence arrays
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaB6KhwAKCRC93/aFa7yZ
 2FYIAQD43sIYoZCo6l7lZt0m54d6Mt1jangVhMK/jH31eOgKYQEAoRJKSMjT6Ktl
 FLBzG8FXekwQELNu6EgyG8Ywwim9AQ0=
 =dfkK
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.16-2025-05-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250509230951.3871914-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-12 07:14:34 +10:00
Alex Deucher
5a11a27677 drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
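
The idea of posting a write with a read of a different register can be sketched against a simulated register file. Everything here is an assumption made for illustration: `struct fake_mmio`, the register indices and the value written are hypothetical, and a real driver would use its own MMIO accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices for this sketch; not real hardware offsets. */
enum { REG_HDP_FLUSH = 0, REG_HDP_MEMCFG = 1, REG_COUNT = 2 };

/* Simulated register file that records how it was accessed. */
struct fake_mmio {
    volatile uint32_t regs[REG_COUNT];
    int last_read;   /* index of the most recent read, -1 if none */
    int writes;      /* number of writes performed */
};

static void reg_write(struct fake_mmio *m, int reg, uint32_t val)
{
    assert(reg >= 0 && reg < REG_COUNT);
    m->regs[reg] = val;
    m->writes++;
}

static uint32_t reg_read(struct fake_mmio *m, int reg)
{
    assert(reg >= 0 && reg < REG_COUNT);
    m->last_read = reg;
    return m->regs[reg];
}

/*
 * Trigger the flush, then issue a read so the write is posted. Any read
 * on the same path is enough, so read a harmless register instead of the
 * flush register itself.
 */
static void hdp_flush(struct fake_mmio *m)
{
    reg_write(m, REG_HDP_FLUSH, 0);
    (void)reg_read(m, REG_HDP_MEMCFG);
}
```

The read's return value is discarded on purpose: its only job is to force the preceding write out before the function returns.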

Fixes: 689275140c ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbc064adfc)
Cc: stable@vger.kernel.org
2025-05-08 11:48:12 -04:00
Alex Deucher
ca28e80abe drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: abe1cbaec6 ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 84141ff615)
Cc: stable@vger.kernel.org
2025-05-08 11:47:54 -04:00
Alex Deucher
dbc988c689 drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: f756dbac1c ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4a89b7698e)
Cc: stable@vger.kernel.org
2025-05-08 11:47:23 -04:00
Alex Deucher
0e33e0f339 drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: cf424020e0 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a5cb344033)
Cc: stable@vger.kernel.org
2025-05-08 11:46:57 -04:00
Lijo Lazar
afc6053d4c Reapply: drm/amdgpu: Use generic hdp flush function
All HDP versions except v5.2 share common logic for the HDP flush, so use a
generic function. HDP v5.2 forces NO_KIQ logic; revisit it later.

Reapply after fixing up an HDP regression.

v2: merge the fix (Alex)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:21:37 -04:00
Alex Deucher
dbc064adfc drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: 689275140c ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:21:12 -04:00
Alex Deucher
84141ff615 drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: abe1cbaec6 ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:20:48 -04:00
Huang Rui
793fa8ce4e drm/amdgpu: cleanup sriov function for psp v12
PSP v12 won't have SRIOV function.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:20:43 -04:00
Alex Deucher
4a89b7698e drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: f756dbac1c ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:20:19 -04:00
Alex Deucher
a5cb344033 drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: cf424020e0 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:18:30 -04:00
Huang Rui
518e22b42c drm/amdgpu: remove re-route ih in psp v12
APUs don't have a second IH ring, so the re-routing action here is a no-op.
It also costs a lot of time during initialization waiting for the PSP
to time out. So remove the function in psp v12.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:18:24 -04:00
Alex Deucher
f690e39747 drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: c9b8dcabb5 ("drm/amdgpu/hdp4.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5c937b4a60)
Cc: stable@vger.kernel.org
2025-05-07 18:24:56 -04:00
Alex Deucher
4aaffc8575 drm/amdgpu: fix pm notifier handling
Set the s3/s0ix and s4 flags in the pm notifier so that we can skip
the resource evictions properly in pm prepare based on whether
we are suspending or hibernating.  Drop the eviction, as processes
are not frozen at this time, so we can end up getting stuck trying
to evict VRAM while applications continue to submit work, which
causes the buffers to get pulled back into VRAM.

v2: Move suspend flags out of pm notifier (Mario)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4178
Fixes: 2965e6355d ("drm/amd: Add Suspend/Hibernate notification callback support")
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 06f2dcc241)
Cc: stable@vger.kernel.org
2025-05-07 18:24:30 -04:00
Alex Deucher
d0ce1aaa85 Revert "drm/amd: Stop evicting resources on APUs in suspend"
This reverts commit 3a9626c816.

This breaks S4 because we end up setting the s3/s0ix flags
even when we are entering s4 since prepare is used by both
flows.  This causes both the S3/s0ix and s4 flags to be set,
which breaks several checks in the driver which assume they
are mutually exclusive.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3634
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ce8f7d9589)
Cc: stable@vger.kernel.org
2025-05-07 18:24:04 -04:00
Ruijing Dong
b7e84fb708 drm/amdgpu/vcn: using separate VCN1_AON_SOC offset
The VCN1_AON_SOC_ADDRESS_3_0 offset varies across
VCN generations. The issue in vcn4.0.5 is caused by
a different VCN1_AON_SOC_ADDRESS_3_0 offset.

This patch does the following:

    1. use the same offset for other VCN generations.
    2. use the vcn4.0.5 special offset
    3. update vcn_4_0 and vcn_5_0

Acked-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5c89ceda99)
Cc: stable@vger.kernel.org
2025-05-07 18:23:40 -04:00
Mario Limonciello
b54695dae9 drm/amd: Add per-ring reset for vcn v5.0.0 use
If there is a problem requiring a reset of the VCN engine, it is better to
reset the VCN engine rather than the entire GPU.

Add a reset callback for the ring which will stop and start VCN if an
issue happens.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250506204948.12048-4-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:48:24 -04:00
Mario Limonciello
b8b6e6f165 drm/amd: Add per-ring reset for vcn v4.0.0 use
If there is a problem requiring a reset of the VCN engine, it is better to
reset the VCN engine rather than the entire GPU.

Add a reset callback for the ring which will stop and start VCN if an
issue happens.

Link: https://lore.kernel.org/r/20250506204948.12048-3-mario.limonciello@amd.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:48:19 -04:00
Mario Limonciello
d1a46cdd00 drm/amd: Add per-ring reset for vcn v4.0.5 use
There is a problem occurring on VCN 4.0.5 where in some situations a job
is timing out.  This triggers a job timeout which then causes a GPU
reset for recovery.  That has exposed a number of issues with GPU reset
that have since been fixed. But also a GPU reset isn't actually needed
for this circumstance. Just restarting the ring is enough.

Add a reset callback for the ring which will stop and start VCN if the
issue happens.
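
The stop-and-start idea can be sketched in plain user-space C as follows. This is only an illustration, not the amdgpu code: `struct toy_ring`, `toy_ring_funcs`, `toy_ring_reset()` and the fallback decision in `toy_timeout_needs_full_reset()` are hypothetical names and assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ring;

/* Hypothetical per-ring callbacks; only the reset hook matters here. */
struct toy_ring_funcs {
	int (*reset)(struct toy_ring *ring);
};

struct toy_ring {
	const struct toy_ring_funcs *funcs;
	bool running;
	unsigned int restarts;
};

static int toy_engine_stop(struct toy_ring *ring)
{
	ring->running = false;
	return 0;
}

static int toy_engine_start(struct toy_ring *ring)
{
	ring->running = true;
	ring->restarts++;
	return 0;
}

/* Per-ring reset: stop the engine, then start it again. */
static int toy_ring_reset(struct toy_ring *ring)
{
	int r;

	assert(ring != NULL);
	r = toy_engine_stop(ring);
	if (r)
		return r;
	return toy_engine_start(ring);
}

static const struct toy_ring_funcs toy_vcn_funcs = {
	.reset = toy_ring_reset,
};

/*
 * Job timeout handling (assumed policy): returns true when a full device
 * reset is still needed, false when restarting the ring was enough.
 */
static bool toy_timeout_needs_full_reset(struct toy_ring *ring)
{
	if (ring->funcs && ring->funcs->reset && ring->funcs->reset(ring) == 0)
		return false;
	return true;
}
```

A ring that provides the callback recovers by restarting only its own engine; a ring without one still reports that a full reset is required.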

Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3909
Link: https://lore.kernel.org/r/20250506204948.12048-2-mario.limonciello@amd.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:48:11 -04:00
Alex Deucher
5c937b4a60 drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: c9b8dcabb5 ("drm/amdgpu/hdp4.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:47:33 -04:00
Alex Deucher
e8614fc769 Revert "drm/amdgpu: Use generic hdp flush function"
This reverts commit 18a878fd8a.

Revert this temporarily to make it easier to fix a regression
in the HDP handling.

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:47:02 -04:00
Sunil Khatri
c2a3bac7c8 drm/amdgpu: fix the indentation
fix the indentation
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c:6992 gfx_v11_ip_dump

compiler: gcc-11 (Debian 11.3.0-12) 11.3.0

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505071619.7sHTLpNg-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:45:27 -04:00
Huang Rui
8465f0a372 drm/amdgpu: remove mdelay in psp v12
Since secure firmware is now more stable than it was during the bring-up
phase, I believe we no longer need such mdelays before waiting for the PSP
response on PSP v12.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Trigger Huang <Trigger.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:45:16 -04:00
Shane Xiao
2d274bf709 amd/amdkfd: Trigger segfault for early userptr unmmapping
If an application unmaps the memory before destroying the userptr, the
driver needs to trigger a segfault to notify user space to correct the free
sequence in VM debug mode.

v2: Send gpu access fault to user space
v3: Report gpu address to user space, remove unnecessary params
v4: update pr_err into one line, remove userptr log info

Signed-off-by: Shane Xiao <shane.xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:45:09 -04:00
Shane Xiao
8e320f67d4 drm/amdgpu: Add debug bit for userptr usage
In VM debug mode, it is desirable to notify the application
to correct the freeing sequence by unmapping the memory before
destroying the userptr in the old userptr path. Add a bitmask
to decide whether to send a gpu vm fault to the application.

Signed-off-by: Shane Xiao <shane.xiao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:45:04 -04:00
Prike Liang
def41146b9 drm/amdgpu: unreserve the gem BO before returning from attach error
The reserved gem BO must be unlocked before returning from the
error path taken when attaching the eviction fence fails.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:44:59 -04:00
Prike Liang
926c79ad6e drm/amdgpu: promote the implicit sync to the dependent read fences
The driver doesn't want to implicitly sync on the DMA_RESV_USAGE_BOOKKEEP
usage fences, and the BOOKKEEP fences should be synced explicitly. So the
VM implicit syncing only needs to return and sync the dependent read
fences.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:44:51 -04:00
Alex Deucher
6edc89645c drm/amdgpu/psp: mark securedisplay TA as optional
This is an optional TA which is only available on
certain embedded systems.  Mark it as optional to avoid
user confusion.  This mirrors what we already do for
other optional TAs.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4181
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:44:45 -04:00
Alex Deucher
06f2dcc241 drm/amdgpu: fix pm notifier handling
Set the s3/s0ix and s4 flags in the pm notifier so that we can skip
the resource evictions properly in pm prepare based on whether
we are suspending or hibernating.  Drop the eviction as processes
are not frozen at this time, so we can end up getting stuck trying
to evict VRAM while applications continue to submit work which
causes the buffers to get pulled back into VRAM.

v2: Move suspend flags out of pm notifier (Mario)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4178
Fixes: 2965e6355d ("drm/amd: Add Suspend/Hibernate notification callback support")
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:43:18 -04:00
Ellen Pan
086809c82c drm/amdgpu: Implement unrecoverable error message handling for VFs
This notification may arrive in VF mailbox while polling for response from
another event.

This patch covers the following scenarios:

- If VF is already in RMA state, then do not attempt to contact the host.
  Host will ignore the VF after sending the notification.

- If the notification is detected during polling, then set the RMA status,
  and return error to caller.

- If the notification arrives by interrupt, then set the RMA status and
  queue a reset.  This reset will fail and VF will stop runtime services.
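
The three scenarios above can be sketched as a small state model in user-space C. Everything here is hypothetical and chosen for illustration: `struct toy_vf`, the `toy_msg` values, the function names and the specific errno codes are assumptions, not the driver's actual mailbox interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_msg { TOY_MSG_NONE, TOY_MSG_RESPONSE, TOY_MSG_UNRECOVERABLE };

struct toy_vf {
	bool rma;           /* host reported an unrecoverable error */
	bool reset_queued;  /* a reset was scheduled from the interrupt path */
	int host_requests;  /* how many times the host was contacted */
};

/* Scenario 1: once in RMA state, do not attempt to contact the host. */
static int toy_vf_send_request(struct toy_vf *vf)
{
	assert(vf != NULL);
	if (vf->rma)
		return -ENODEV;
	vf->host_requests++;
	return 0;
}

/* Scenario 2: the notification shows up while polling for another event. */
static int toy_vf_poll(struct toy_vf *vf, enum toy_msg mailbox)
{
	if (vf->rma)
		return -ENODEV;
	if (mailbox == TOY_MSG_UNRECOVERABLE) {
		vf->rma = true;   /* record the RMA status ... */
		return -EIO;      /* ... and return an error to the caller */
	}
	return mailbox == TOY_MSG_RESPONSE ? 0 : -EAGAIN;
}

/* Scenario 3: the notification arrives by interrupt. */
static void toy_vf_irq(struct toy_vf *vf, enum toy_msg mailbox)
{
	if (mailbox != TOY_MSG_UNRECOVERABLE)
		return;
	vf->rma = true;
	vf->reset_queued = true;
}
```

Keeping the RMA status in one flag that every entry point checks first is what lets later requests fail fast without touching the host.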

Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Signed-off-by: Ellen Pan <yunru.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:43:13 -04:00
Ellen Pan
6be34e1d1f drm/amdgpu: Add unrecoverable error message definitions for VFs
Host may stop runtime services after reaching a bad page threshold.

This notification will indicate to the VF that it no longer has
access to the GPU.

Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Signed-off-by: Ellen Pan <yunru.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:43:07 -04:00
Alex Deucher
ce8f7d9589 Revert "drm/amd: Stop evicting resources on APUs in suspend"
This reverts commit 3a9626c816.

This breaks S4 because we end up setting the s3/s0ix flags
even when we are entering s4 since prepare is used by both
flows.  This causes both the S3/s0ix and s4 flags to be set
which breaks several checks in the driver which assume they
are mutually exclusive.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3634
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:42:30 -04:00
Ruijing Dong
5c89ceda99 drm/amdgpu/vcn: using separate VCN1_AON_SOC offset
VCN1_AON_SOC_ADDRESS_3_0 offset varies on different
VCN generations, the issue in vcn4.0.5 is caused by
a different VCN1_AON_SOC_ADDRESS_3_0 offset.

This patch does the following:

    1. use the same offset for other VCN generations.
    2. use the vcn4.0.5 special offset
    3. update vcn_4_0 and vcn_5_0

Acked-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:42:19 -04:00
Prike Liang
af7160c25c drm/amdgpu: fix the eviction fence dereference
dma_resv_add_fence() already takes a reference on the added fence,
so attaching the eviction fence to the gem bo does not need to
take another one.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:41:54 -04:00
Ellen Pan
5da3d8820d drm/amdgpu: Implement Runtime Bad Page query for VFs
Host will send a notification when new bad pages are available.

Upon guest request, the first 256 bad page addresses
will be placed into the PF2VF region.
Guest should pause the PF2VF worker thread while
the copy is in progress.

Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Signed-off-by: Ellen Pan <yunru.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:41:49 -04:00
Ellen Pan
6615f1ad34 drm/amdgpu: Add Runtime Bad Page message definitions for VFs
Currently VFs rely on poison consumption interrupt from HW
to kick off the bad page retirement process. Part of this process
includes a VF reset.

This patch adds the following:

1) Host Bad Pages notification message.
2) Guest request bad pages message.

When combined, VFs are able to reserve the pages early, and potentially
avoid future poison consumption that will disrupt user services
from consequent FLR.

Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Signed-off-by: Ellen Pan <yunru.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:41:43 -04:00
Rodrigo Siqueira
c8305c6327 drm/amdgpu: Add documentation to some parts of the AMDGPU ring and wb
Add some random documentation associated with the ring buffer
manipulations and writeback.

Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-07 17:41:35 -04:00
Dave Airlie
5e0c679981 Linux 6.15-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgX1CgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGxiIH/A7LHlVatGEQgRFi
 0JALDgcuGTMtMU1qD43rv8Z1GXqTpCAlaBt9D1C9cUH/86MGyBTVRWgVy0wkaU2U
 8QSfFWQIbrdaIzelHtzmAv5IDtb+KrcX1iYGLcMb6ZYaWkv8/CMzMX1nkgxEr1QT
 37Xo3/F17yJumAdNQxdRhVLGy2d3X5rScecpufwh97sMwoddllMCDs2LIoeSAYpG
 376/wzni09G2fADa8MEKqcaMue4qcf0FOo/gOkT8YwFGSZLKa6uumlBLg04QoCt0
 foK2vfcci1q4H4ZbCu3uQESYGLQHY0f2ICDCwC3m25VF9a81TmlbC3MLum3vhmKe
 RtLDcXg=
 =xyaI
 -----END PGP SIGNATURE-----

BackMerge tag 'v6.15-rc5' into drm-next

Linux 6.15-rc5, requested by tzimmerman for fixes required in drm-next.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-06 16:39:25 +10:00
Arvind Yadav
3e50b1d625 drm/amdgpu: only keep most recent fence for each context
Keep only the latest fences to reduce the number of values
given back to userspace

v2: - Export this code from dma-fence-unwrap.c(by Christian).
v3: - To split this in a dma_buf patch and amd userq patch(by Sunil).
    - No need to add a new function just re-use existing(by Christian).
v4: Export dma_fence_dedup_array function and use it(by Christian).
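
The idea of keeping only the newest fence per context can be sketched in plain user-space C as below. This is an illustration rather than the kernel's dma_fence code: `struct toy_fence`, `toy_fence_dedup()` and the plain integer comparison of sequence numbers (no wraparound handling) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a fence: a context id plus a sequence number. */
struct toy_fence {
	uint64_t context;
	uint64_t seqno;
};

/* Order by context, then by descending seqno so the newest comes first. */
static int toy_fence_cmp(const void *a, const void *b)
{
	const struct toy_fence *fa = a, *fb = b;

	if (fa->context != fb->context)
		return fa->context < fb->context ? -1 : 1;
	if (fa->seqno != fb->seqno)
		return fa->seqno > fb->seqno ? -1 : 1;
	return 0;
}

/*
 * Sort the array in place and keep only the first (newest) entry of each
 * context. Returns the number of entries left at the front of the array.
 */
static size_t toy_fence_dedup(struct toy_fence *fences, size_t count)
{
	size_t i, kept = 0;

	assert(fences != NULL || count == 0);
	if (count == 0)
		return 0;

	qsort(fences, count, sizeof(*fences), toy_fence_cmp);

	for (i = 0; i < count; i++) {
		/* An entry for this context with a newer or equal seqno is kept. */
		if (kept && fences[kept - 1].context == fences[i].context)
			continue;
		fences[kept++] = fences[i];
	}
	return kept;
}
```

Sorting first means one linear pass is enough to drop the older entries, and the caller only has to report the first `kept` elements.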

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05 13:29:58 -04:00
Srinivasan Shanmugam
68071eb0ae drm/amdgpu: Add Support for enforcing isolation without Cleaner Shader
Adjusted the enforce isolation setting handling to include the ability
to disable the cleaner shader without affecting isolation between tasks.

v2: Updated enforce isolation documentation and parameters. (Alex)

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05 13:29:53 -04:00
Sunil Khatri
71353c1a4f drm/amdgpu: change DRM_DBG_DRIVER to drm_dbg_driver
update the functions in amdgpu_userqueues.c from
DRM_DBG_DRIVER to drm_dbg_driver so multi gpu instance
can be logged in.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05 13:29:38 -04:00
Sunil Khatri
c46a37628a drm/amdgpu: change DRM_ERROR to drm_file_err in amdgpu_userq.c
change the DRM_ERROR and drm_err to drm_file_err
to add process name and pid to the logging.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05 13:29:33 -04:00
Sunil Khatri
8c97cdb1a6 drm/amdgpu: use drm_file_err in fence timeouts
use drm_file_err instead of DRM_ERROR which adds
process and pid information in the userqueue error
logging.

Sample log:

[   19.802315] amdgpu 0000:0a:00.0: [drm] *ERROR* comm: ibus-x11 pid: 2055 client: Unset ... Couldn't unmap all the queues
[   19.802319] amdgpu 0000:0a:00.0: [drm] *ERROR* comm: ibus-x11 pid: 2055 client: Unset ... Failed to evict userqueue
[   19.838432] amdgpu 0000:0a:00.0: [drm] *ERROR* comm: systemd-logind pid: 1042 client: Unset ... Couldn't unmap all the queues
[   19.838436] amdgpu 0000:0a:00.0: [drm] *ERROR* comm: systemd-logind pid: 1042 client: Unset ... Failed to evict userqueue

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05 13:29:25 -04:00
Sunil Khatri
30ff75809d drm/amdgpu: add drm_file reference in userq_mgr
drm_file will be used in the usermode queue code to
provide better process information in logging, so
add drm_file as part of the userq_mgr struct.

update the drm_file pointer in userq_mgr for each
amdgpu_driver_open_kms.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05 13:29:18 -04:00
Sonny Jiang
6718b10a5b drm/amdgpu: Add DPG pause for VCN v5.0.1
For vcn5.0.1 only, enable DPG PAUSE to avoid DPG resets.

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3e5f86c14c)
2025-05-01 11:02:00 -04:00
Lijo Lazar
79af0604eb drm/amdgpu: Fix offset for HDP remap in nbio v7.11
APUs in passthrough mode use HDP flush. 0x7F000 offset used for
remapping HDP flush is mapped to VPE space which could get power gated.
Use another unused offset in BIF space.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d8116a32cd)
Cc: stable@vger.kernel.org
2025-05-01 11:01:46 -04:00
Felix Kuehling
9397204ffa drm/amdgpu: Fail DMABUF map of XGMI-accessible memory
If peer memory is XGMI-accessible, we should never access it through PCIe
P2P DMA mappings. PCIe P2P is slower, has different coherence behaviour,
limited or no support for atomics, or may not work at all. Fail with a
warning if DMABUF mappings of such memory are attempted.

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbe4c63689)
2025-05-01 11:01:46 -04:00
Dan Carpenter
97c39b4da6 drm/amdgpu/userq: remove unnecessary NULL check
The "ticket" pointer points into the middle of the &exec struct so it
can't be NULL.  Remove the check.

Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-30 18:18:52 -04:00
Dan Carpenter
d6c6d5ec66 drm/amdgpu/userq: Call unreserve on error in amdgpu_userq_fence_read_wptr()
This error path should call amdgpu_bo_unreserve() before returning.
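
The general pattern, dropping a reservation on every exit path once it has been taken, can be sketched in user-space C like this. The `toy_bo` type and its helpers are hypothetical stand-ins and do not reflect the real amdgpu_bo API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer object with a trivial "reserved" lock flag. */
struct toy_bo {
	bool reserved;
	bool valid;
	unsigned long value;
};

static int toy_bo_reserve(struct toy_bo *bo)
{
	if (bo->reserved)
		return -EBUSY;
	bo->reserved = true;
	return 0;
}

static void toy_bo_unreserve(struct toy_bo *bo)
{
	assert(bo->reserved);
	bo->reserved = false;
}

/* Read a value under the reservation; every exit after reserve unreserves. */
static int toy_read_value(struct toy_bo *bo, unsigned long *out)
{
	int r;

	r = toy_bo_reserve(bo);
	if (r)
		return r; /* nothing to undo yet */

	if (!bo->valid) {
		r = -EINVAL;
		goto out_unreserve; /* error path still drops the reservation */
	}

	*out = bo->value;

out_unreserve:
	toy_bo_unreserve(bo);
	return r;
}
```

Routing both the success and the error exit through a single label keeps the reserve and unreserve calls balanced.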

Fixes: d8675102ba ("drm/amdgpu: add vm root BO lock before accessing the vm")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-30 18:17:42 -04:00
Alex Deucher
aded8b3c36 drm/amdgpu: properly handle GC vs MM in amdgpu_vmid_mgr_init()
When kernel queues are disabled, all GC vmids are available
for the scheduler.  MM vmids are still managed by the driver
so make all 16 available.

Also fix gmc 10 vs 11 mix up in
commit 1f61fc28b9 ("drm/amdgpu/mes: make more vmids available when disable_kq=1")

v2: Properly handle pre-GC 10 hardware

Fixes: 1f61fc28b9 ("drm/amdgpu/mes: make more vmids available when disable_kq=1")
Cc: Arvind Yadav <Arvind.Yadav@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-30 18:16:53 -04:00
Alex Deucher
2e828a25f8 drm/amdgpu/mes: use correct MES pipe for resets
Use the KIQ pipe for kernel queues and the SCHED pipe for
user queues.

Fixes: 2408b0272b ("drm/amdgpu/mes: consolidate on a single mes reset callback")
Cc: Michael Chen <Michael.Chen@amd.com>
Cc: Shaoyun Liu <Shaoyun.Liu@amd.com>
Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-30 18:16:14 -04:00
Alex Deucher
2408b0272b drm/amdgpu/mes: consolidate on a single mes reset callback
Use the legacy one as it covers both kernel queues and
user queues.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-30 18:16:07 -04:00
Alex Deucher
6535348a3e drm/amdgpu/mes: remove more unused functions
These were leftover from mes bring up and are unused.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-30 18:15:57 -04:00