linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2025-09-01 11:24:14 +00:00

Author	SHA1	Message	Date
Tomasz Lis	7eba6a80fe	drm/xe/vf: Make multi-GT migration less error prone There is a remote chance that after migration, some GTs will not send the MIGRATED interrupt, or due to current VF KMD state the interrupt will not lead to marking the GT for recovery. Requiring IRQs from all GTs before starting migration introduces the possibility that the process will get stalled due to one GuC. One could argue it is also waste of time to wait for all IRQs, but we should get them all IRQs as soon as VGPU starts, so that's not really an impactful argument. Still, not waiting for all GTs makes it easier to handle situations: * where one GuC IRQ is missing * where state before probe is unclean - getting MIGRATED IRQ as soon as interrupts are enabled * where multiple migrations happen close to each other To help with these cases, this patch alters the post-migration recovery so that recovery task is started as soon as one GuC IRQ is handled, and other GTs are included in recovery later as the subsequent IRQs are serviced. The post-migration recovery can now be called for any selection of GTs, and it will perform recovery on all GTs for which IRQs have arrived, even multiple times if necessary. v2: Typos and style fixes v3: Transferring gt_flags by value rather than reference to last function where it is used Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michal Winiarski <michal.winiarski@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Acked-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Michal Winiarski <michal.winiarski@intel.com> Link: https://lore.kernel.org/r/20250630152155.195648-1-tomasz.lis@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>	2025-07-03 21:05:21 +02:00
Michal Wajdeczko	eb9b34734c	drm/xe/vf: Move tile-related VF functions to separate file Some of our VF functions, even if they take a GT pointer, work only on primary GT and really are tile-related and would be better to keep them separate from the rest of true GT-oriented functions. Move them to a file and update to take a tile pointer instead. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com> Link: https://lore.kernel.org/r/20250602103325.549-3-michal.wajdeczko@intel.com	2025-06-03 12:35:57 +02:00
Tomasz Lis	20a07782da	drm/xe/vf: Fail migration recovery if fixups needed but platform not supported The post-migration recovery needs to be fully implemented for a specific platform in order to make continuation of workloads possible. New platforms introduce changes which affect the recovery procedure, and without a clear verification of support this leads to errors with no straight forward error message explaining the cause. This patch fixes that issue - it introduces a message to be logged when the current driver is known to not support the current platform. Wedging the driver immediately also decreases the amount of additional errors which would come afterwards if the driver continued operation. v2: Show the message during probe as well as during recovery; do not perform any recovery steps if the recovery is bound to fail v3: Use SRIOV-specific logging, fix typos v4: XE_DEBUG_SRIOV to XE_DEBUG check switch, to make testing more straightforward Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Acked-by: Michał Winiarski <michal.winiarski@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250519230035.3143966-1-tomasz.lis@intel.com	2025-05-22 12:04:09 +02:00
Tomasz Lis	cef88d1265	drm/xe/vf: Fixup CTB send buffer messages after migration During post-migration recovery of a VF, it is necessary to update GGTT references included in messages which are going to be sent to GuC. GuC will start consuming messages after VF KMD will inform it about fixups being done; before that, the VF KMD is expected to update any H2G messages which are already in send buffer but were not consumed by GuC. Only a small subset of messages allowed for VFs have GGTT references in them. This patch adds the functionality to parse the CTB send ring buffer and shift addresses contained within. While fixing the CTB content, ct->lock is not taken. This means the only barrier taken remains GGTT address lock - which is ok, because only requests with GGTT addresses matter, but it also means tail changes can happen during the CTB fixups execution (which may be ignored as any new messages will not have anything to fix). The GGTT address locking will be introduced in a future series. v2: removed storing shift as that's now done in VMA nodes patch; macros to inlines; warns to asserts; log messages fixes (Michal) v3: removed inline keywords, enums for offsets in CTB messages, less error messages, if return unused then made functs void (Michal) v4: update the cached head before starting fixups v5: removed/updated comments, wrapped lines, converted assert into error, enums for offsets to separate patch, reused xe_map_rd v6: define xe_map__array() macros, support CTB wrap which divides a message, updated comments, moved one function to an earlier patch v7: renamed few functions, wider use on previously introduced helper, separate cases in parsing messges, documented a static funct v8: Introduced more helpers, fixed coding style mistakes v9: Move xe_map() functs to macros, add asserts, add debug print v10: Errors in place of some asserts, style fixes v11: Fixed invalid conditionals, added debug-only local pointer v12: Removed redundant __maybe_unused Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512114018.361843-5-tomasz.lis@intel.com	2025-05-12 15:53:38 +02:00
Tomasz Lis	3e693945b1	drm/xe/vf: Shifting GGTT area post migration We have only one GGTT for all IOV functions, with each VF having assigned a range of addresses for its use. After migration, a VF can receive a different range of addresses than it had initially. This implements shifting GGTT addresses within drm_mm nodes, so that VMAs stay valid after migration. This will make the driver use new addresses when accessing GGTT from the moment the shifting ends. By taking the ggtt->lock for the period of VMA fixups, this change also adds constraint on that mutex. Any locks used during the recovery cannot ever wait for hardware response - because after migration, the hardware will not do anything until fixups are finished. v2: Moved some functs to xe_ggtt.c; moved shift computation to just after querying; improved documentation; switched some warns to asserts; skipping fixups when GGTT shift eq 0; iterating through tiles (Michal) v3: Updated kerneldocs, removed unused funct, properly allocate balloning nodes if non existent v4: Re-used ballooning functions from VF init, used bool in place of standard error codes v5: Renamed one function v6: Subject tag change, several kerneldocs updated, some functions renamed, some moved, added several asserts, shuffled declarations of variables, revealed more detail in high level functions v7: Fixed typos, added `_locked` suffix to some functs, improved readability of asserts, removed unneeded conditional v8: Moved one function, removed implementation detail from kerneldoc, added asserts v9: Code shuffling without much change, and one param rename v10: Minor error path change, added printing the shift via debugfs Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512114018.361843-3-tomasz.lis@intel.com	2025-05-12 15:53:35 +02:00
Tomasz Lis	abd2202047	drm/xe/vf: Defer fixups if migrated twice fast If another VF migration happened during post-migration recovery, then the current worker should be finished to allow the next one start swiftly and cleanly. Check for defer in two places: before fixups, and before sending RESFIX_DONE. Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104213449.1455694-6-tomasz.lis@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>	2024-11-06 15:25:30 +01:00
Tomasz Lis	4be3fca2ce	drm/xe/vf: Start post-migration fixups with provisioning query During post-migration recovery, only MMIO communication to GuC is allowed. The VF KMD needs to use that channel to ask for the new provisioning, which includes a new GGTT range assigned to the VF. v2: query config only instead of handshake; no need to get pm ref as it's now kept through whole recovery (Michal) v3: switched names of 'err' and 'ret' (Michal) Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104213449.1455694-5-tomasz.lis@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>	2024-11-06 15:25:30 +01:00
Tomasz Lis	1255954d9f	drm/xe/vf: Send RESFIX_DONE message at end of VF restore After restore, GuC will not answer to any messages from VF KMD until fixups are applied. When that is done, VF KMD sends RESFIX_DONE message to GuC, at which point GuC resumes normal operation. This patch implements sending the RESFIX_DONE message at end of post-migration recovery. v2: keep pm ref during whole recovery, style fixes (Michal) v3: assert removal to separate patch, debug message per GuC instead of one, comments changes (Michal) v4: improve one debug message (Michal) Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104213449.1455694-4-tomasz.lis@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>	2024-11-06 15:25:30 +01:00
Tomasz Lis	360a1f3e96	drm/xe/vf: Document SRIOV VF restore flow This adds a documentation chapter, containing high level flow of VF restore procedure. v2: Better describe initial conditions, include GuC states on sequence diagram (Michal) v3: moved DOC to .c (Michal) Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104213449.1455694-3-tomasz.lis@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>	2024-11-06 15:24:24 +01:00
Tomasz Lis	6e6d7b41f9	drm/xe/vf: React to MIGRATED interrupt To properly support VF Save/Restore procedure, fixups need to be applied after PF driver finishes its part of VF Restore. The fixups are required to adjust the ongoing execution for a hardware switch that happened, because some GFX resources are not fully virtualized, and assigned to a VF as range from a global pool. The VF on which a VM is restored will often have different ranges provisioned than the VF on which save process happened. Those resource fixups are applied by the VF driver within a restored VM. A VF driver gets informed that it was migrated by receiving an interrupt from each GuC. The interrupt assigned for that purpose is "GUC SW interrupt 0". Seeing that fields set from within the irq handler should be the trigger for fixups. The VF can safely do post-migration fixups on resources associated to each GuC only after that GuC issued the MIGRATED interrupt. This change introduces a worker to be used for post-migration fixups, and a mechanism to schedule said worker when all GuCs sent the irq. v2: renamed and moved functions, updated logged messages, removed unused includes, used anon struct (Michal) v3: ordering, kerneldoc, asserts, debug messages, on_all_tiles -> on_all_gts (Michal) v4: fixed missing header include v5: Explained what fixups are, explained which IRQ is used, style fixes (Michal) Bspec: 50868 Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104213449.1455694-2-tomasz.lis@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>	2024-11-06 14:53:35 +01:00

10 Commits