linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-31 22:23:05 +00:00

Author	SHA1	Message	Date
Yishai Hadas	f886473071	vfio/mlx5: Add support for tracker object change event Add support for tracker object change event by referring to its MLX5_EVENT_TYPE_OBJECT_CHANGE event when occurs. This lets the driver recognize whether the firmware moved the tracker object to an error state. In that case, the driver will skip/block any usage of that object including an early exit in case the object was previously marked with an error. This functionality also covers the case when no CQE is delivered as of the error state. The driver was adapted to the device specification to handle the above. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2024-02-22 12:17:32 -07:00
Kunwu Chan	19032628bd	vfio/pci: WARN_ON driver_override kasprintf failure kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. This is a blocking notifier callback, so errno isn't a proper return value. Use WARN_ON to small allocation failures. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240115063434.20278-1-chentao@kylinos.cn Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2024-02-22 12:14:37 -07:00
Paolo Bonzini	09e33b0455	vfio: replace CONFIG_HAVE_KVM with IS_ENABLED(CONFIG_KVM) It is more accurate to Check if KVM is enabled, instead of having the architecture say so. Architectures always "have" KVM, so for example checking CONFIG_HAVE_KVM in vfio code is pointless, but if KVM is disabled in a specific build, there is no need for support code. Alternatively, the #ifdefs could simply be deleted. However, this would add dead code. For example, when KVM is disabled, there is no need to include code in VFIO that uses symbol_get, as that symbol_get would always fail. Cc: Alex Williamson <alex.williamson@redhat.com> Cc: x86@kernel.org Cc: kbingham@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-02-08 08:45:35 -05:00
Linus Torvalds	244aefb1c6	VFIO updates for v6.8-rc1 - Add debugfs support, initially used for reporting device migration state. (Longfang Liu) - Fixes and support for migration dirty tracking across multiple IOVA regions in the pds-vfio-pci driver. (Brett Creeley) - Improved IOMMU allocation accounting visibility. (Pasha Tatashin) - Virtio infrastructure and a new virtio-vfio-pci variant driver, which provides emulation of a legacy virtio interfaces on modern virtio hardware for virtio-net VF devices where the PF driver exposes support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV VF to provide driver ABI compatibility to legacy devices. (Yishai Hadas & Feng Liu) - Migration fixes for the hisi-acc-vfio-pci variant driver. (Shameer Kolothum) - Kconfig dependency fix for new virtio-vfio-pci variant driver. (Arnd Bergmann) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmWhkhEbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiCLgQAJv6mzD79dVWKAZH27Lj PK0ZSyu3fwgPxTmhRXysKKMs79WI2GlVx6nyW8pVe3w+OGWpdTcbZK2H/T/FryZQ QsbKteueG83ni1cIdJFzmIM1jO79jhtsPxpclRS/VmECRhYA6+c7smynHyZNrVAL wWkJIkS2uUEx3eUefzH4U2CRen3TILwHAXi27fJ8pHbr6Yor+XvUOgM3eQDjUj+t eABL/pJr0qFDQstom6k7GLAsenRHKMLUG88ziSciSJxOg5YiT4py7zeLXuoEhVD1 kI9KE+Vle5EdZe8MzLLhmzLZoFVfhjyNfj821QjtfP3Gkj6TqnUWBKJAptMuQpdf HklOLNmabrZbat+i6QqswrnQ5Z1doPz1uNBsl2lH+2/KIaT8bHZI+QgjK7pg2H2L O679My0od4rVLpjnSLDdRoXlcLd6mmvq3663gPogziHBNdNl3oQBI3iIa7ixljkA lxJbOZIDBAjzPk+t5NLYwkTsab1AY4zGlfr0M3Sk3q7tyj/MlBcX/fuqyhXjUfqR Zhqaw2OaWD8R0EqfSK+wRXr1+z7EWJO/y1iq8RYlD5Mozo+6YMVThjLDUO+8mrtV 6/PL0woGALw0Tq1u0tw3rLjzCd9qwD9BD2fFUQwUWEe3j3wG2HCLLqyomxcmaKS8 WgvUXtufWyvonCcIeLKXI9Kt =IuK2 -----END PGP SIGNATURE----- Merge tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Add debugfs support, initially used for reporting device migration state (Longfang Liu) - Fixes and support for migration dirty tracking across multiple IOVA regions in the pds-vfio-pci driver (Brett Creeley) - Improved IOMMU allocation accounting visibility (Pasha Tatashin) - Virtio infrastructure and a new virtio-vfio-pci variant driver, which provides emulation of a legacy virtio interfaces on modern virtio hardware for virtio-net VF devices where the PF driver exposes support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV VF to provide driver ABI compatibility to legacy devices (Yishai Hadas & Feng Liu) - Migration fixes for the hisi-acc-vfio-pci variant driver (Shameer Kolothum) - Kconfig dependency fix for new virtio-vfio-pci variant driver (Arnd Bergmann) * tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio: (22 commits) vfio/virtio: fix virtio-pci dependency hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume vfio/virtio: Declare virtiovf_pci_aer_reset_done() static vfio/virtio: Introduce a vfio driver over virtio devices vfio/pci: Expose vfio_pci_core_iowrite/read##size() vfio/pci: Expose vfio_pci_core_setup_barmap() virtio-pci: Introduce APIs to execute legacy IO admin commands virtio-pci: Initialize the supported admin commands virtio-pci: Introduce admin commands virtio-pci: Introduce admin command sending function virtio-pci: Introduce admin virtqueue virtio: Define feature bit for administration virtqueue vfio/type1: account iommu allocations vfio/pds: Add multi-region support vfio/pds: Move seq/ack bitmaps into region struct vfio/pds: Pass region info to relevant functions vfio/pds: Move and rename region specific info vfio/pds: Only use a single SGL for both seq and ack vfio/pds: Fix calculations in pds_vfio_dirty_sync MAINTAINERS: Add vfio debugfs interface doc link ...	2024-01-18 15:57:25 -08:00
Arnd Bergmann	78f70c02bd	vfio/virtio: fix virtio-pci dependency The new vfio-virtio driver already has a dependency on VIRTIO_PCI_ADMIN_LEGACY, but that is a bool symbol and allows vfio-virtio to be built-in even if virtio-pci itself is a loadable module. This leads to a link failure: aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_probe': main.c:(.text+0xec): undefined reference to `virtio_pci_admin_has_legacy_io' aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_init_device': main.c:(.text+0x260): undefined reference to `virtio_pci_admin_legacy_io_notify_info' aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_bar0_rw': main.c:(.text+0x6ec): undefined reference to `virtio_pci_admin_legacy_common_io_read' aarch64-linux-ld: main.c:(.text+0x6f4): undefined reference to `virtio_pci_admin_legacy_device_io_read' aarch64-linux-ld: main.c:(.text+0x7f0): undefined reference to `virtio_pci_admin_legacy_common_io_write' aarch64-linux-ld: main.c:(.text+0x7f8): undefined reference to `virtio_pci_admin_legacy_device_io_write' Add another explicit dependency on the tristate symbol. Fixes: `eb61eca0e8` ("vfio/virtio: Introduce a vfio driver over virtio devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240109075731.2726731-1-arnd@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2024-01-10 15:10:41 -07:00
Linus Torvalds	c604110e66	vfs-6.8.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc= =0bRl -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_() calls with their current ida_() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...	2024-01-08 10:26:08 -08:00
Shameer Kolothum	be12ad45e1	hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume When the optional PRE_COPY support was added to speed up the device compatibility check, it failed to update the saving/resuming data pointers based on the fd offset. This results in migration data corruption and when the device gets started on the destination the following error is reported in some cases, [ 478.907684] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received: [ 478.913691] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000310200000010 [ 478.919603] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000002088000007f [ 478.925515] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000000000000000 [ 478.931425] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000000000000000 [ 478.947552] hisi_zip 0000:31:00.0: qm_axi_rresp [error status=0x1] found [ 478.955930] hisi_zip 0000:31:00.0: qm_db_timeout [error status=0x400] found [ 478.955944] hisi_zip 0000:31:00.0: qm sq doorbell timeout in function 2 Fixes: `d9a871e4a1` ("hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions") Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231120091406.780-1-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2024-01-05 09:06:45 -07:00
Yishai Hadas	daca194876	vfio/virtio: Declare virtiovf_pci_aer_reset_done() static Declare virtiovf_pci_aer_reset_done() as a static function to prevent the below build warning. "warning: no previous prototype for 'virtiovf_pci_aer_reset_done' [-Wmissing-prototypes]" Fixes: `eb61eca0e8` ("vfio/virtio: Introduce a vfio driver over virtio devices") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20231220143122.63337669@canb.auug.org.au/ Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231220082456.241973-1-yishaih@nvidia.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312202115.oDmvN1VE-lkp@intel.com/ Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-20 08:08:39 -07:00
Alex Williamson	0214392d5d	Merge branch 'v6.8/vfio/virtio' into v6.8/vfio/next	2023-12-19 12:16:24 -07:00
Yishai Hadas	eb61eca0e8	vfio/virtio: Introduce a vfio driver over virtio devices Introduce a vfio driver over virtio devices to support the legacy interface functionality for VFs. Background, from the virtio spec [1]. -------------------------------------------------------------------- In some systems, there is a need to support a virtio legacy driver with a device that does not directly support the legacy interface. In such scenarios, a group owner device can provide the legacy interface functionality for the group member devices. The driver of the owner device can then access the legacy interface of a member device on behalf of the legacy member device driver. For example, with the SR-IOV group type, group members (VFs) can not present the legacy interface in an I/O BAR in BAR0 as expected by the legacy pci driver. If the legacy driver is running inside a virtual machine, the hypervisor executing the virtual machine can present a virtual device with an I/O BAR in BAR0. The hypervisor intercepts the legacy driver accesses to this I/O BAR and forwards them to the group owner device (PF) using group administration commands. -------------------------------------------------------------------- Specifically, this driver adds support for a virtio-net VF to be exposed as a transitional device to a guest driver and allows the legacy IO BAR functionality on top. This allows a VM which uses a legacy virtio-net driver in the guest to work transparently over a VF which its driver in the host is that new driver. The driver can be extended easily to support some other types of virtio devices (e.g virtio-blk), by adding in a few places the specific type properties as was done for virtio-net. For now, only the virtio-net use case was tested and as such we introduce the support only for such a device. Practically, Upon probing a VF for a virtio-net device, in case its PF supports legacy access over the virtio admin commands and the VF doesn't have BAR 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a transitional device with I/O BAR in BAR 0. The existence of the simulated I/O bar is reported later on by overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device exposes itself as a transitional device by overwriting some properties upon reading its config space. Once we report the existence of I/O BAR as BAR 0 a legacy driver in the guest may use it via read/write calls according to the virtio specification. Any read/write towards the control parts of the BAR will be captured by the new driver and will be translated into admin commands towards the device. In addition, any data path read/write access (i.e. virtio driver notifications) will be captured by the driver and forwarded to the physical BAR which its properties were supplied by the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the probing/init flow. With that code in place a legacy driver in the guest has the look and feel as if having a transitional device with legacy support for both its control and data path flows. [1] `03c2d32e50` Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-19 11:51:34 -07:00
Yishai Hadas	8486ae162b	vfio/pci: Expose vfio_pci_core_iowrite/read##size() Expose vfio_pci_core_iowrite/read##size() to let it be used by drivers. This functionality is needed to enable direct access to some physical BAR of the device with the proper locks/checks in place. The next patches from this series will use this functionality on a data path flow when a direct access to the BAR is needed. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-19 11:51:33 -07:00
Yishai Hadas	8bccc5b806	vfio/pci: Expose vfio_pci_core_setup_barmap() Expose vfio_pci_core_setup_barmap() to be used by drivers. This will let drivers to mmap a BAR and re-use it from both vfio and the driver when it's applicable. This API will be used in the next patches by the vfio/virtio coming driver. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-19 11:51:33 -07:00
Alex Williamson	946cff255d	Merge branches 'v6.8/vfio/debugfs', 'v6.8/vfio/pds' and 'v6.8/vfio/type1-account' into v6.8/vfio/next	2023-12-04 14:42:54 -07:00
Pasha Tatashin	160912fc3d	vfio/type1: account iommu allocations iommu allocations should be accounted in order to allow admins to monitor and limit the amount of iommu memory. Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231130200900.2320829-1-pasha.tatashin@soleen.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:41:36 -07:00
Brett Creeley	2e7c6feb4e	vfio/pds: Add multi-region support Only supporting a single region/range is limiting, wasteful, and in some cases broken (i.e. when there are large gaps in the iova memory ranges). Fix this by adding support for multiple regions based on what the device tells the driver it can support. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-7-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:33:21 -07:00
Brett Creeley	0c320f223e	vfio/pds: Move seq/ack bitmaps into region struct Since the host seq/ack bitmaps are part of a region move them into struct pds_vfio_region. Also, make use of the bmp_bytes value for validation purposes. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-6-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:33:21 -07:00
Brett Creeley	87bdf9807e	vfio/pds: Pass region info to relevant functions A later patch in the series implements multi-region support. That will require specific regions to be passed to relevant functions. Prepare for that change by passing the region structure to relevant functions. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-5-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:33:21 -07:00
Brett Creeley	3f5898133a	vfio/pds: Move and rename region specific info An upcoming change in this series will add support for multiple regions. To prepare for that, move region specific information into struct pds_vfio_region and rename the members for readability. This will reduce the size of the patch that actually implements multiple region support. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-4-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:33:21 -07:00
Brett Creeley	3b8f7a24d1	vfio/pds: Only use a single SGL for both seq and ack Since the seq/ack operations never happen in parallel there is no need for multiple scatter gather lists per region. The current implementation is wasting memory. Fix this by only using a single scatter-gather list for both the seq and ack operations. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:33:20 -07:00
Brett Creeley	4004497cec	vfio/pds: Fix calculations in pds_vfio_dirty_sync The incorrect check is being done for comparing the iova/length being requested to sync. This can cause the dirty sync operation to fail. Fix this by making sure the iova offset added to the requested sync length doesn't exceed the region_size. Also, the region_start is assumed to always be at 0. This can cause dirty tracking to fail because the device/driver bitmap offset always starts at 0, however, the region_start/iova may not. Fix this by determining the iova offset from region_start to determine the bitmap offset. Fixes: `f232836a91` ("vfio/pds: Add support for dirty page tracking") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:33:20 -07:00
Longfang Liu	2202844e44	vfio/migration: Add debugfs to live migration driver There are multiple devices, software and operational steps involved in the process of live migration. An error occurred on any node may cause the live migration operation to fail. This complex process makes it very difficult to locate and analyze the cause when the function fails. In order to quickly locate the cause of the problem when the live migration fails, I added a set of debugfs to the vfio live migration driver. +-------------------------------------------+ \| \| \| \| \| QEMU \| \| \| \| \| +---+----------------------------+----------+ \| ^ \| ^ \| \| \| \| \| \| \| \| v \| v \| +---------+--+ +---------+--+ \|src vfio_dev\| \|dst vfio_dev\| +--+---------+ +--+---------+ \| ^ \| ^ \| \| \| \| v \| \| \| +-----------+----+ +-----------+----+ \|src dev debugfs \| \|dst dev debugfs \| +----------------+ +----------------+ The entire debugfs directory will be based on the definition of the CONFIG_DEBUG_FS macro. If this macro is not enabled, the interfaces in vfio.h will be empty definitions, and the creation and initialization of the debugfs directory will not be executed. vfio \| +---<dev_name1> \| +---migration \| +--state \| +---<dev_name2> +---migration +--state debugfs will create a public root directory "vfio" file. then create a dev_name() file for each live migration device. First, create a unified state acquisition file of "migration" in this device directory. Then, create a public live migration state lookup file "state". Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/r/20231106072225.28577-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-12-04 14:29:08 -07:00
Christian Brauner	3652117f85	eventfd: simplify eventfd_signal() Ever since the eventfd type was introduced back in 2007 in commit `e1ad7468c7` ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-28 14:08:38 +01:00
Brett Creeley	ae2667cd8a	vfio/pds: Fix possible sleep while in atomic context The driver could possibly sleep while in atomic context resulting in the following call trace while CONFIG_DEBUG_ATOMIC_SLEEP=y is set: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:283 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2817, name: bash preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 Call Trace: <TASK> dump_stack_lvl+0x36/0x50 __might_resched+0x123/0x170 mutex_lock+0x1e/0x50 pds_vfio_put_lm_file+0x1e/0xa0 [pds_vfio_pci] pds_vfio_put_save_file+0x19/0x30 [pds_vfio_pci] pds_vfio_state_mutex_unlock+0x2e/0x80 [pds_vfio_pci] pci_reset_function+0x4b/0x70 reset_store+0x5b/0xa0 kernfs_fop_write_iter+0x137/0x1d0 vfs_write+0x2de/0x410 ksys_write+0x5d/0xd0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 This can happen if pds_vfio_put_restore_file() and/or pds_vfio_put_save_file() grab the mutex_lock(&lm_file->lock) while the spin_lock(&pds_vfio->reset_lock) is held, which can happen during while calling pds_vfio_state_mutex_unlock(). Fix this by changing the reset_lock to reset_mutex so there are no such conerns. Also, make sure to destroy the reset_mutex in the driver specific VFIO device release function. This also fixes a spinlock bad magic BUG that was caused by not calling spinlock_init() on the reset_lock. Since, the lock is being changed to a mutex, make sure to call mutex_init() on it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/kvm/1f9bc27b-3de9-4891-9687-ba2820c1b390@moroto.mountain/ Fixes: `bb500dbe2a` ("vfio/pds: Add VFIO live migration support") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231122192532.25791-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-11-27 09:29:03 -07:00
Brett Creeley	91aeb563bd	vfio/pds: Fix mutex lock->magic != lock warning The following BUG was found when running on a kernel with CONFIG_DEBUG_MUTEXES=y set: DEBUG_LOCKS_WARN_ON(lock->magic != lock) RIP: 0010:mutex_trylock+0x10d/0x120 Call Trace: <TASK> ? __warn+0x85/0x140 ? mutex_trylock+0x10d/0x120 ? report_bug+0xfc/0x1e0 ? handle_bug+0x3f/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? mutex_trylock+0x10d/0x120 ? mutex_trylock+0x10d/0x120 pds_vfio_reset+0x3a/0x60 [pds_vfio_pci] pci_reset_function+0x4b/0x70 reset_store+0x5b/0xa0 kernfs_fop_write_iter+0x137/0x1d0 vfs_write+0x2de/0x410 ksys_write+0x5d/0xd0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 As shown, lock->magic != lock. This is because mutex_init(&pds_vfio->state_mutex) is called in the VFIO open path. So, if a reset is initiated before the VFIO device is opened the mutex will have never been initialized. Fix this by calling mutex_init(&pds_vfio->state_mutex) in the VFIO init path. Also, don't destroy the mutex on close because the device may be re-opened, which would cause mutex to be uninitialized. Fix this by implementing a driver specific vfio_device_ops.release callback that destroys the mutex before calling vfio_pci_core_release_dev(). Fixes: `bb500dbe2a` ("vfio/pds: Add VFIO live migration support") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231122192532.25791-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-11-27 09:29:03 -07:00
Linus Torvalds	d99b91a99b	Char/Misc and other driver changes for 6.7-rc1 Here is the big set of char/misc and other small driver subsystem changes for 6.7-rc1. Included in here are: - IIO subsystem driver updates and additions (largest part of this pull request) - FPGA subsystem driver updates - Counter subsystem driver updates - ICC subsystem driver updates - extcon subsystem driver updates - mei driver updates and additions - nvmem subsystem driver updates and additions - comedi subsystem dependency fixes - parport driver fixups - cdx subsystem driver and core updates - splice support for /dev/zero and /dev/full - other smaller driver cleanups All of these have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZUTSzg8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylH3QCfbZuG8MiglEZUd4slRLUNqcRQ5tQAn1yKpDFo l3KLkxo1UTLMXbJBWe+b =gafK -----END PGP SIGNATURE----- Merge tag 'char-misc-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc updates from Greg KH: "Here is the big set of char/misc and other small driver subsystem changes for 6.7-rc1. Included in here are: - IIO subsystem driver updates and additions (largest part of this pull request) - FPGA subsystem driver updates - Counter subsystem driver updates - ICC subsystem driver updates - extcon subsystem driver updates - mei driver updates and additions - nvmem subsystem driver updates and additions - comedi subsystem dependency fixes - parport driver fixups - cdx subsystem driver and core updates - splice support for /dev/zero and /dev/full - other smaller driver cleanups All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (326 commits) cdx: add sysfs for subsystem, class and revision cdx: add sysfs for bus reset cdx: add support for bus enable and disable cdx: Register cdx bus as a device on cdx subsystem cdx: Create symbol namespaces for cdx subsystem cdx: Introduce lock to protect controller ops cdx: Remove cdx controller list from cdx bus system dts: ti: k3-am625-beagleplay: Add beaglecc1352 greybus: Add BeaglePlay Linux Driver dt-bindings: net: Add ti,cc1352p7 dt-bindings: eeprom: at24: allow NVMEM cells based on old syntax dt-bindings: nvmem: SID: allow NVMEM cells based on old syntax Revert "nvmem: add new config option" MAINTAINERS: coresight: Add missing Coresight files misc: pci_endpoint_test: Add deviceID for J721S2 PCIe EP device support firmware: xilinx: Move EXPORT_SYMBOL_GPL next to zynqmp_pm_feature definition uacce: make uacce_class constant ocxl: make ocxl_class constant cxl: make cxl_class constant misc: phantom: make phantom_class constant ...	2023-11-03 14:51:08 -10:00
Linus Torvalds	463f46e114	iommufd for 6.7 This branch has three new iommufd capabilities: - Dirty tracking for DMA. AMD/ARM/Intel CPUs can now record if a DMA writes to a page in the IOPTEs within the IO page table. This can be used to generate a record of what memory is being dirtied by DMA activities during a VM migration process. A VMM like qemu will combine the IOMMU dirty bits with the CPU's dirty log to determine what memory to transfer. VFIO already has a DMA dirty tracking framework that requires PCI devices to implement tracking HW internally. The iommufd version provides an alternative that the VMM can select, if available. The two are designed to have very similar APIs. - Userspace controlled attributes for hardware page tables (HWPT/iommu_domain). There are currently a few generic attributes for HWPTs (support dirty tracking, and parent of a nest). This is an entry point for the userspace iommu driver to control the HW in detail. - Nested translation support for HWPTs. This is a 2D translation scheme similar to the CPU where a DMA goes through a first stage to determine an intermediate address which is then translated trough a second stage to a physical address. Like for CPU translation the first stage table would exist in VM controlled memory and the second stage is in the kernel and matches the VM's guest to physical map. As every IOMMU has a unique set of parameter to describe the S1 IO page table and its associated parameters the userspace IOMMU driver has to marshal the information into the correct format. This is 1/3 of the feature, it allows creating the nested translation and binding it to VFIO devices, however the API to support IOTLB and ATC invalidation of the stage 1 io page table, and forwarding of IO faults are still in progress. The series includes AMD and Intel support for dirty tracking. Intel support for nested translation. Along the way are a number of internal items: - New iommu core items: ops->domain_alloc_user(), ops->set_dirty_tracking, ops->read_and_clear_dirty(), IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user - UAF fix in iopt_area_split() - Spelling fixes and some test suite improvement -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZUDu2wAKCRCFwuHvBreF YcdeAQDaBmjyGLrRIlzPyohF6FrombyWo2512n51Hs8IHR4IvQEA3oRNgQ2tsJRr 1UPuOqnOD5T/oVX6AkUPRBwanCUQwwM= =nyJ3 -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "This brings three new iommufd capabilities: - Dirty tracking for DMA. AMD/ARM/Intel CPUs can now record if a DMA writes to a page in the IOPTEs within the IO page table. This can be used to generate a record of what memory is being dirtied by DMA activities during a VM migration process. A VMM like qemu will combine the IOMMU dirty bits with the CPU's dirty log to determine what memory to transfer. VFIO already has a DMA dirty tracking framework that requires PCI devices to implement tracking HW internally. The iommufd version provides an alternative that the VMM can select, if available. The two are designed to have very similar APIs. - Userspace controlled attributes for hardware page tables (HWPT/iommu_domain). There are currently a few generic attributes for HWPTs (support dirty tracking, and parent of a nest). This is an entry point for the userspace iommu driver to control the HW in detail. - Nested translation support for HWPTs. This is a 2D translation scheme similar to the CPU where a DMA goes through a first stage to determine an intermediate address which is then translated trough a second stage to a physical address. Like for CPU translation the first stage table would exist in VM controlled memory and the second stage is in the kernel and matches the VM's guest to physical map. As every IOMMU has a unique set of parameter to describe the S1 IO page table and its associated parameters the userspace IOMMU driver has to marshal the information into the correct format. This is 1/3 of the feature, it allows creating the nested translation and binding it to VFIO devices, however the API to support IOTLB and ATC invalidation of the stage 1 io page table, and forwarding of IO faults are still in progress. The series includes AMD and Intel support for dirty tracking. Intel support for nested translation. Along the way are a number of internal items: - New iommu core items: ops->domain_alloc_user(), ops->set_dirty_tracking, ops->read_and_clear_dirty(), IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user - UAF fix in iopt_area_split() - Spelling fixes and some test suite improvement" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (52 commits) iommufd: Organize the mock domain alloc functions closer to Joerg's tree iommufd/selftest: Fix page-size check in iommufd_test_dirty() iommufd: Add iopt_area_alloc() iommufd: Fix missing update of domains_itree after splitting iopt_area iommu/vt-d: Disallow read-only mappings to nest parent domain iommu/vt-d: Add nested domain allocation iommu/vt-d: Set the nested domain to a device iommu/vt-d: Make domain attach helpers to be extern iommu/vt-d: Add helper to setup pasid nested translation iommu/vt-d: Add helper for nested domain allocation iommu/vt-d: Extend dmar_domain to support nested domain iommufd: Add data structure for Intel VT-d stage-1 domain allocation iommu/vt-d: Enhance capability check for nested parent domain allocation iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs iommufd/selftest: Add nested domain allocation for mock domain iommu: Add iommu_copy_struct_from_user helper iommufd: Add a nested HW pagetable object iommu: Pass in parent domain with user_data to domain_alloc_user op iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable ...	2023-11-01 16:44:56 -10:00
Abhijit Gangurde	e3ed12f37e	cdx: Create symbol namespaces for cdx subsystem Create CDX_BUS and CDX_BUS_CONTROLLER symbol namespace for cdx bus subsystem. CDX controller modules are required to import symbols from CDX_BUS_CONTROLLER namespace and other than controller modules to import from CDX_BUS namespace. Signed-off-by: Abhijit Gangurde <abhijit.gangurde@amd.com> Link: https://lore.kernel.org/r/20231017160505.10640-4-abhijit.gangurde@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-10-27 13:23:24 +02:00
Joao Martins	13578d4ebe	iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol export convention i.e. using the IOMMUFD namespace. In doing so, import the namespace in the current users. This means VFIO and the vfio-pci drivers that use iova_bitmap_set(). Link: https://lore.kernel.org/r/20231024135109.73787-4-joao.m.martins@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:42 -03:00
Joao Martins	8c9c727b61	vfio: Move iova_bitmap into iommufd Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking the user bitmaps, so move to the common dependency into IOMMUFD. In doing so, create the symbol IOMMUFD_DRIVER which designates the builtin code that will be used by drivers when selected. Today this means MLX5_VFIO_PCI and PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when supporting dirty tracking and select IOMMUFD_DRIVER accordingly. Given that the symbol maybe be disabled, add header definitions in iova_bitmap.h for when IOMMUFD_DRIVER=n Link: https://lore.kernel.org/r/20231024135109.73787-3-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:42 -03:00
Joao Martins	53f0b02021	vfio/iova_bitmap: Export more API symbols In preparation to move iova_bitmap into iommufd, export the rest of API symbols that will be used in what could be used by modules, namely: iova_bitmap_alloc iova_bitmap_free iova_bitmap_for_each Link: https://lore.kernel.org/r/20231024135109.73787-2-joao.m.martins@oracle.com Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:42 -03:00
Alex Williamson	bd885fcf28	vfio: Fix smatch errors in vfio_combine_iova_ranges() smatch reports: vfio_combine_iova_ranges() error: uninitialized symbol 'last'. vfio_combine_iova_ranges() error: potentially dereferencing uninitialized 'comb_end'. vfio_combine_iova_ranges() error: potentially dereferencing uninitialized 'comb_start'. These errors are only reachable via invalid input, in the case of @last when we receive an empty rb-tree or for @comb_{start,end} if the rb-tree is empty or otherwise fails to produce a second node that reduces the gap. Add tests with warnings for these cases. Reported-by: Cong Liu <liucong2@kylinos.cn> Link: https://lore.kernel.org/all/20230920095532.88135-1-liucong2@kylinos.cn Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20231002224325.3150842-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-10-09 15:56:09 -06:00
Nathan Chancellor	f9af5ad0f5	vfio/cdx: Add parentheses between bitwise AND expression and logical NOT When building with clang, there is a warning (or error with CONFIG_WERROR=y) due to a bitwise AND and logical NOT in vfio_cdx_bm_ctrl(): drivers/vfio/cdx/main.c:77:6: error: logical not is only applied to the left hand side of this bitwise operator [-Werror,-Wlogical-not-parentheses] 77 \| if (!vdev->flags & BME_SUPPORT) \| ^ ~ drivers/vfio/cdx/main.c:77:6: note: add parentheses after the '!' to evaluate the bitwise operator first 77 \| if (!vdev->flags & BME_SUPPORT) \| ^ \| ( ) drivers/vfio/cdx/main.c:77:6: note: add parentheses around left hand side expression to silence this warning 77 \| if (!vdev->flags & BME_SUPPORT) \| ^ \| ( ) 1 error generated. Add the parentheses as suggested in the first note, which is clearly what was intended here. Closes: https://github.com/ClangBuiltLinux/linux/issues/1939 Fixes: `8a97ab9b8b` ("vfio-cdx: add bus mastering device feature support") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20231002-vfio-cdx-logical-not-parentheses-v1-1-a8846c7adfb6@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-10-03 11:40:49 -06:00
Yishai Hadas	fcb2f2ed4a	vfio/mlx5: Activate the chunk mode functionality Now that all pieces are in place, activate the chunk mode functionality based on device capabilities. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	a899cacab5	vfio/mlx5: Add support for READING in chunk mode Add support for READING in chunk mode. In case the last SAVE command recognized that there was still some image to be read, however, there was no available chunk to use for, this task was delayed for the reader till one chunk will be consumed and becomes available. In the above case, a work will be executed to read in the background the next image from the device. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	67135f2945	vfio/mlx5: Add support for SAVING in chunk mode Add support for SAVING in chunk mode, it includes running a work that will fill the next chunk from the device. In case the number of available chunks will reach the MAX_NUM_CHUNKS, the next chunk SAVING will be delayed till the reader will consume one chunk. The next patch from the series will add the reader part of the chunk mode. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	5798e4dd58	vfio/mlx5: Pre-allocate chunks for the STOP_COPY phase This patch is another preparation step towards working in chunk mode. It pre-allocates chunks for the STOP_COPY phase to let the driver use them immediately and prevent an extra allocation upon that phase. Before that patch we had a single large buffer that was dedicated for the STOP_COPY phase as there was a single SAVE in the source for the last image. Once we'll move to chunk mode the idea is to have some small buffers that will be used upon the STOP_COPY phase. The driver will read-ahead from the firmware the full state in small/optimized chunks while letting QEMU/user space read in parallel the available data. Each buffer holds its chunk number to let it be recognized down the road in the coming patches. The chunk buffer size is picked-up based on the minimum size that firmware requires, the total full size and some max value in the driver code which was set to 8MB to achieve some optimized downtime in the general case. As the chunk mode is applicable even if we move directly to STOP_COPY the buffers preparation and some other related stuff is done unconditionally with regards to STOP/PRE-COPY. Note: In that phase in the series we still didn't activate the chunk mode and the first buffer will be used in all the places. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	9114100d10	vfio/mlx5: Rename some stuff to match chunk mode Upon chunk mode there may be multiple images that will be read from the device upon STOP_COPY. This patch is some preparation for that mode by replacing the relevant stuff to a better matching name. As part of that, be stricter to recognize PRE_COPY error only when it didn't occur on a STOP_COPY chunk. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	543640af84	vfio/mlx5: Enable querying state size which is > 4GB Once the device supports 'chunk mode' the driver can support state size which is larger than 4GB. In that case the device has the capability to split a single image to multiple chunks as long as the software provides a buffer in the minimum size reported by the device. The driver should query for the minimum buffer size required using QUERY_VHCA_MIGRATION_STATE command with the 'chunk' bit set in its input, in that case, the output will include both the minimum buffer size (i.e. required_umem_size) and also the remaining total size to be reported/used where that it will be applicable. At that point in the series the 'chunk' bit is off, the last patch will activate the feature once all pieces will be ready. Note: Before this change we were limited to 4GB state size as of 4 bytes max value based on the device specification for the query/save/load commands. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	34a64c8eac	vfio/mlx5: Refactor the SAVE callback to activate a work only upon an error Upon a successful SAVE callback there is no need to activate a work, all the required stuff can be done directly. As so, refactor the above flow to activate a work only upon an error. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Yishai Hadas	82470eba9d	vfio/mlx5: Wake up the reader post of disabling the SAVING migration file Post of disabling the SAVING migration file, which includes setting the file state to be MLX5_MIGF_STATE_ERROR, call to wake_up_interruptible() on its poll_wait member. This lets any potential reader which is waiting already for data as part of mlx5vf_save_read() to wake up, recognize the error state and return with an error. Post of that we don't need to rely on any other condition to wake up the reader as of the returning of the SAVE command that was previously executed, etc. In addition, this change will simplify error flows (e.g health recovery) once we'll move to chunk mode and multiple SAVE commands may run in the STOP_COPY phase as we won't need to rely any more on a SAVE command to wake-up a potential waiting reader. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 13:07:29 -06:00
Nipun Gupta	8a97ab9b8b	vfio-cdx: add bus mastering device feature support Support Bus master enable and disable on VFIO-CDX devices using VFIO_DEVICE_FEATURE_BUS_MASTER flag over VFIO_DEVICE_FEATURE IOCTL. Co-developed-by: Shubham Rohila <shubham.rohila@amd.com> Signed-off-by: Shubham Rohila <shubham.rohila@amd.com> Signed-off-by: Nipun Gupta <nipun.gupta@amd.com> Link: https://lore.kernel.org/r/20230915045423.31630-3-nipun.gupta@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-28 12:12:07 -06:00
Jinjie Ruan	c777b11d34	vfio/mdev: Fix a null-ptr-deref bug for mdev_unregister_parent() Inject fault while probing mdpy.ko, if kstrdup() of create_dir() fails in kobject_add_internal() in kobject_init_and_add() in mdev_type_add() in parent_create_sysfs_files(), it will return 0 and probe successfully. And when rmmod mdpy.ko, the mdpy_dev_exit() will call mdev_unregister_parent(), the mdev_type_remove() may traverse uninitialized parent->types[i] in parent_remove_sysfs_files(), and it will cause below null-ptr-deref. If mdev_type_add() fails, return the error code and kset_unregister() to fix the issue. general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 2 PID: 10215 Comm: rmmod Tainted: G W N 6.6.0-rc2+ #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__kobject_del+0x62/0x1c0 Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 51 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 28 48 8d 7d 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 24 01 00 00 48 8b 75 10 48 89 df 48 8d 6b 3c e8 RSP: 0018:ffff88810695fd30 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffffffffa0270268 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000004 RDI: 0000000000000010 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10233a4ef1 R10: ffff888119d2778b R11: 0000000063666572 R12: 0000000000000000 R13: fffffbfff404e2d4 R14: dffffc0000000000 R15: ffffffffa0271660 FS: 00007fbc81981540(0000) GS:ffff888119d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc14a142dc0 CR3: 0000000110a62003 CR4: 0000000000770ee0 DR0: ffffffff8fb0bce8 DR1: ffffffff8fb0bce9 DR2: ffffffff8fb0bcea DR3: ffffffff8fb0bceb DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Call Trace: <TASK> ? die_addr+0x3d/0xa0 ? exc_general_protection+0x144/0x220 ? asm_exc_general_protection+0x22/0x30 ? __kobject_del+0x62/0x1c0 kobject_del+0x32/0x50 parent_remove_sysfs_files+0xd6/0x170 [mdev] mdev_unregister_parent+0xfb/0x190 [mdev] ? mdev_register_parent+0x270/0x270 [mdev] ? find_module_all+0x9d/0xe0 mdpy_dev_exit+0x17/0x63 [mdpy] __do_sys_delete_module.constprop.0+0x2fa/0x4b0 ? module_flags+0x300/0x300 ? __fput+0x4e7/0xa00 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fbc813221b7 Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe780e0648 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00007ffe780e06a8 RCX: 00007fbc813221b7 RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055e214df9b58 RBP: 000055e214df9af0 R08: 00007ffe780df5c1 R09: 0000000000000000 R10: 00007fbc8139ecc0 R11: 0000000000000206 R12: 00007ffe780e0870 R13: 00007ffe780e0ed0 R14: 000055e214df9260 R15: 000055e214df9af0 </TASK> Modules linked in: mdpy(-) mdev vfio_iommu_type1 vfio [last unloaded: mdpy] Dumping ftrace buffer: (ftrace buffer empty) ---[ end trace 0000000000000000 ]--- RIP: 0010:__kobject_del+0x62/0x1c0 Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 51 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 28 48 8d 7d 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 24 01 00 00 48 8b 75 10 48 89 df 48 8d 6b 3c e8 RSP: 0018:ffff88810695fd30 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffffffffa0270268 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000004 RDI: 0000000000000010 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10233a4ef1 R10: ffff888119d2778b R11: 0000000063666572 R12: 0000000000000000 R13: fffffbfff404e2d4 R14: dffffc0000000000 R15: ffffffffa0271660 FS: 00007fbc81981540(0000) GS:ffff888119d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc14a142dc0 CR3: 0000000110a62003 CR4: 0000000000770ee0 DR0: ffffffff8fb0bce8 DR1: ffffffff8fb0bce9 DR2: ffffffff8fb0bcea DR3: ffffffff8fb0bceb DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 1 seconds.. Fixes: `da44c340c4` ("vfio/mdev: simplify mdev_type handling") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230918115551.1423193-1-ruanjinjie@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-22 12:48:04 -06:00
Shixiong Ou	27004f89b0	vfio/pds: Use proper PF device access helper The pci_physfn() helper exists to support cases where the physfn field may not be compiled into the pci_dev structure. We've declared this driver dependent on PCI_IOV to avoid this problem, but regardless we should follow the precedent not to access this field directly. Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230914021332.1929155-1-oushixiong@kylinos.cn Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-18 13:36:53 -06:00
Shixiong Ou	5a59f2ff30	vfio/pds: Add missing PCI_IOV depends If PCI_ATS isn't set, then pdev->physfn is not defined. it causes a compilation issue: ../drivers/vfio/pci/pds/vfio_dev.c:165:30: error: ‘struct pci_dev’ has no member named ‘physfn’; did you mean ‘is_physfn’? 165 \| __func__, pci_dev_id(pdev->physfn), pci_id, vf_id, \| ^~~~~~ So adding PCI_IOV depends to select PCI_ATS. Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230906014942.1658769-1-oushixiong@kylinos.cn Fixes: `63f77a7161` ("vfio/pds: register with the pds_core PF") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-09-18 13:36:52 -06:00
Linus Torvalds	4debf77169	iommufd for 6.6 This includes a shared branch with VFIO: - Enhance VFIO_DEVICE_GET_PCI_HOT_RESET_INFO so it can work with iommufd FDs, not just group FDs. This removes the last place in the uAPI that required the group fd. - Give VFIO a new device node /dev/vfio/devices/vfioX (the so called cdev node) which is very similar to the FD from VFIO_GROUP_GET_DEVICE_FD. The cdev is associated with the struct device that the VFIO driver is bound to and shows up in sysfs in the normal way. - Add a cdev IOCTL VFIO_DEVICE_BIND_IOMMUFD which allows a newly opened /dev/vfio/devices/vfioX to be associated with an IOMMUFD, this replaces the VFIO_GROUP_SET_CONTAINER flow. - Add cdev IOCTLs VFIO_DEVICE_[AT\|DE]TACH_IOMMUFD_PT to allow the IOMMU translation the vfio_device is associated with to be changed. This is a significant new feature for VFIO as previously each vfio_device was fixed to a single translation. The translation is under the control of iommufd, so it can be any of the different translation modes that iommufd is learning to create. At this point VFIO has compilation options to remove the legacy interfaces and in modern mode it behaves like a normal driver subsystem. The /dev/vfio/iommu and /dev/vfio/groupX nodes are not present and each vfio_device only has a /dev/vfio/devices/vfioX cdev node that represents the device. On top of this is built some of the new iommufd functionality: - IOMMU_HWPT_ALLOC allows userspace to directly create the low level IO Page table objects and affiliate them with IOAS objects that hold the translation mapping. This is the basic functionality for the normal IOMMU_DOMAIN_PAGING domains. - VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current translation. This is wired up to through all the layers down to the driver so the driver has the ability to implement a hitless replacement. This is necessary to fully support guest behaviors when emulating HW (eg guest atomic change of translation) - IOMMU_GET_HW_INFO returns information about the IOMMU driver HW that owns a VFIO device. This includes support for the Intel iommu, and patches have been posted for all the other server IOMMU. Along the way are a number of internal items: - New iommufd kapis iommufd_ctx_has_group(), iommufd_device_to_ictx(), iommufd_device_to_id(), iommufd_access_detach(), iommufd_ctx_from_fd(), iommufd_device_replace() - iommufd now internally tracks iommu_groups as it needs some per-group data - Reorganize how the internal hwpt allocation flows to have more robust locking - Improve the access interfaces to support detach and replace of an IOAS from an access - New selftests and a rework of how the selftests creates a mock iommu driver to be more like a real iommu driver -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZO/QDQAKCRCFwuHvBreF YZ2iAP4hNEF6MJLRI2A28V3I/80f3x9Ed3Cirp/Q8ZdVEE+HYQD8DFaafJ0y3iPQ 5mxD4ZrZ9KfUns/gUqCT5oPHjrcvSAM= =EQCw -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "On top of the vfio updates is built some new iommufd functionality: - IOMMU_HWPT_ALLOC allows userspace to directly create the low level IO Page table objects and affiliate them with IOAS objects that hold the translation mapping. This is the basic functionality for the normal IOMMU_DOMAIN_PAGING domains. - VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current translation. This is wired up to through all the layers down to the driver so the driver has the ability to implement a hitless replacement. This is necessary to fully support guest behaviors when emulating HW (eg guest atomic change of translation) - IOMMU_GET_HW_INFO returns information about the IOMMU driver HW that owns a VFIO device. This includes support for the Intel iommu, and patches have been posted for all the other server IOMMU. Along the way are a number of internal items: - New iommufd kernel APIs: iommufd_ctx_has_group(), iommufd_device_to_ictx(), iommufd_device_to_id(), iommufd_access_detach(), iommufd_ctx_from_fd(), iommufd_device_replace() - iommufd now internally tracks iommu_groups as it needs some per-group data - Reorganize how the internal hwpt allocation flows to have more robust locking - Improve the access interfaces to support detach and replace of an IOAS from an access - New selftests and a rework of how the selftests creates a mock iommu driver to be more like a real iommu driver" Link: https://lore.kernel.org/lkml/ZO%2FTe6LU1ENf58ZW@nvidia.com/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (34 commits) iommufd/selftest: Don't leak the platform device memory when unloading the module iommu/vt-d: Implement hw_info for iommu capability query iommufd/selftest: Add coverage for IOMMU_GET_HW_INFO ioctl iommufd: Add IOMMU_GET_HW_INFO iommu: Add new iommu op to get iommu hardware information iommu: Move dev_iommu_ops() to private header iommufd: Remove iommufd_ref_to_users() iommufd/selftest: Make the mock iommu driver into a real driver vfio: Support IO page table replacement iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_REPLACE_IOAS coverage iommufd: Add iommufd_access_replace() API iommufd: Use iommufd_access_change_ioas in iommufd_access_destroy_object iommufd: Add iommufd_access_change_ioas(_id) helpers iommufd: Allow passing in iopt_access_list_id to iopt_remove_access() vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages() iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC iommufd/selftest: Return the real idev id from selftest mock_domain iommufd: Add IOMMU_HWPT_ALLOC iommufd/selftest: Test iommufd_device_replace() iommufd: Make destroy_rwsem use a lock class per object type ...	2023-08-30 20:41:37 -07:00
Linus Torvalds	ec0e2dc810	VFIO updates for v6.6-rc1 - VFIO direct character device (cdev) interface support. This extracts the vfio device fd from the container and group model, and is intended to be the native uAPI for use with IOMMUFD. (Yi Liu) - Enhancements to the PCI hot reset interface in support of cdev usage. (Yi Liu) - Fix a potential race between registering and unregistering vfio files in the kvm-vfio interface and extend use of a lock to avoid extra drop and acquires. (Dmitry Torokhov) - A new vfio-pci variant driver for the AMD/Pensando Distributed Services Card (PDS) Ethernet device, supporting live migration. (Brett Creeley) - Cleanups to remove redundant owner setup in cdx and fsl bus drivers, and simplify driver init/exit in fsl code. (Li Zetao) - Fix uninitialized hole in data structure and pad capability structures for alignment. (Stefan Hajnoczi) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmTvnDUbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsimEEP/AzG+VRcu5LfYbLGLe0z zB8ts6G7S78wXlmfN/LYi3v92XWvMMcm+vYF8oNAMfr1YL5sibWN6UtQfY1KCr7h nWKdQdqjajJ5yDDZnOFdhqHJGNfmZw6+fey8Z0j8zRI2oymK4DncWWX3g/7L1SNr 9tIexGJef+mOdAmC94yOut3YviAaZ+f95T/xrdXHzzoNr50DD0+PD6AJdKJfKggP vhiC/DAYH3Fofaa6tRasgWuKCYWdjZLR/kxgNpeEmW6kZnbq/dnzZ+kgn4HH1f9G 8p7UKVARR6FfG5aLheWu6Y9PDaKnfnqu8y/hobuE/ivXcmqqK+a6xSxrjgbVs8WJ 94SYnTBRoTlDJaKWa7GxqdgzJnV+s5ZyAgPhjzdi6mLTPWGzkuLhFWGtYL+LZAQ6 pNeZSM6CFBk+bva/xT0nNPCXxPh+/j/Y0G18FREj8aPFc03HrJQqz0RLydvTnoDz nX/by5KdzMSVSVLPr4uDMtAsgxsGqWiFcp7QMw1HhhlLWxqmYbA+mLZaqyMZUUOx 6b/P8WXT9P2I+qPVKWQ5CWyqpsEqm6P+72yg6LOM9kINvgwDhOa7cagMXIuMWYMH Rf97FL+K8p1eIy6AnvRHgFBMM5185uG+0YcJyVqtucDr/k8T/Om6ujAI6JbWtNe6 cLgaVAqKOYqCR4HC9bfVGSbd =eKSR -----END PGP SIGNATURE----- Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - VFIO direct character device (cdev) interface support. This extracts the vfio device fd from the container and group model, and is intended to be the native uAPI for use with IOMMUFD (Yi Liu) - Enhancements to the PCI hot reset interface in support of cdev usage (Yi Liu) - Fix a potential race between registering and unregistering vfio files in the kvm-vfio interface and extend use of a lock to avoid extra drop and acquires (Dmitry Torokhov) - A new vfio-pci variant driver for the AMD/Pensando Distributed Services Card (PDS) Ethernet device, supporting live migration (Brett Creeley) - Cleanups to remove redundant owner setup in cdx and fsl bus drivers, and simplify driver init/exit in fsl code (Li Zetao) - Fix uninitialized hole in data structure and pad capability structures for alignment (Stefan Hajnoczi) * tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits) vfio/pds: Send type for SUSPEND_STATUS command vfio/pds: fix return value in pds_vfio_get_lm_file() pds_core: Fix function header descriptions vfio: align capability structures vfio/type1: fix cap_migration information leak vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver vfio/pds: Add Kconfig and documentation vfio/pds: Add support for firmware recovery vfio/pds: Add support for dirty page tracking vfio/pds: Add VFIO live migration support vfio/pds: register with the pds_core PF pds_core: Require callers of register/unregister to pass PF drvdata vfio/pds: Initial support for pds VFIO driver vfio: Commonize combine_ranges for use in other VFIO drivers kvm/vfio: avoid bouncing the mutex when adding and deleting groups kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add() docs: vfio: Add vfio device cdev description vfio: Compile vfio_group infrastructure optionally vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev() ...	2023-08-30 20:36:01 -07:00
Brett Creeley	642265e22e	vfio/pds: Send type for SUSPEND_STATUS command Commit `bb500dbe2a` ("vfio/pds: Add VFIO live migration support") added live migration support for the pds-vfio-pci driver. When sending the SUSPEND command to the device, the driver sets the type of suspend (i.e. P2P or FULL). However, the driver isn't sending the type of suspend for the SUSPEND_STATUS command, which will result in failures. Fix this by also sending the suspend type in the SUSPEND_STATUS command. Fixes: `bb500dbe2a` ("vfio/pds: Add VFIO live migration support") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230821184215.34564-1-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-22 13:11:57 -06:00
Yang Yingliang	2d12d18f14	vfio/pds: fix return value in pds_vfio_get_lm_file() anon_inode_getfile() never returns NULL pointer, it will return ERR_PTR() when it fails, so replace the check with IS_ERR(). Fixes: `bb500dbe2a` ("vfio/pds: Add VFIO live migration support") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20230819023716.3469037-1-yangyingliang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-21 08:50:33 -06:00
Stefan Hajnoczi	a881b49694	vfio: align capability structures The VFIO_DEVICE_GET_INFO, VFIO_DEVICE_GET_REGION_INFO, and VFIO_IOMMU_GET_INFO ioctls fill in an info struct followed by capability structs: +------+---------+---------+-----+ \| info \| caps[0] \| caps[1] \| ... \| +------+---------+---------+-----+ Both the info and capability struct sizes are not always multiples of sizeof(u64), leaving u64 fields in later capability structs misaligned. Userspace applications currently need to handle misalignment manually in order to support CPU architectures and programming languages with strict alignment requirements. Make life easier for userspace by ensuring alignment in the kernel. This is done by padding info struct definitions and by copying out zeroes after capability structs that are not aligned. The new layout is as follows: +------+---------+---+---------+-----+ \| info \| caps[0] \| 0 \| caps[1] \| ... \| +------+---------+---+---------+-----+ In this example caps[0] has a size that is not multiples of sizeof(u64), so zero padding is added to align the subsequent structure. Adding zero padding between structs does not break the uapi. The memory layout is specified by the info.cap_offset and caps[i].next fields filled in by the kernel. Applications use these field values to locate structs and are therefore unaffected by the addition of zero padding. Note that code that copies out info structs with padding is updated to always zero the struct and copy out as many bytes as userspace requested. This makes the code shorter and avoids potential information leaks by ensuring padding is initialized. Originally-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230809203144.2880050-1-stefanha@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-17 12:17:44 -06:00
Stefan Hajnoczi	cd24e2a60a	vfio/type1: fix cap_migration information leak Fix an information leak where an uninitialized hole in struct vfio_iommu_type1_info_cap_migration on the stack is exposed to userspace. The definition of struct vfio_iommu_type1_info_cap_migration contains a hole as shown in this pahole(1) output: struct vfio_iommu_type1_info_cap_migration { struct vfio_info_cap_header header; /* 0 8 / __u32 flags; / 8 4 / / XXX 4 bytes hole, try to pack / __u64 pgsize_bitmap; / 16 8 / __u64 max_dirty_bitmap_size; / 24 8 / / size: 32, cachelines: 1, members: 4 / / sum members: 28, holes: 1, sum holes: 4 / / last cacheline: 32 bytes / }; The cap_mig variable is filled in without initializing the hole: static int vfio_iommu_migration_build_caps(struct vfio_iommu iommu, struct vfio_info_cap caps) { struct vfio_iommu_type1_info_cap_migration cap_mig; cap_mig.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION; cap_mig.header.version = 1; cap_mig.flags = 0; / support minimum pgsize / cap_mig.pgsize_bitmap = (size_t)1 << __ffs(iommu->pgsize_bitmap); cap_mig.max_dirty_bitmap_size = DIRTY_BITMAP_SIZE_MAX; return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig)); } The structure is then copied to a temporary location on the heap. At this point it's already too late and ioctl(VFIO_IOMMU_GET_INFO) copies it to userspace later: int vfio_info_add_capability(struct vfio_info_cap caps, struct vfio_info_cap_header cap, size_t size) { struct vfio_info_cap_header header; header = vfio_info_cap_add(caps, size, cap->id, cap->version); if (IS_ERR(header)) return PTR_ERR(header); memcpy(header + 1, cap + 1, size - sizeof(*header)); return 0; } This issue was found by code inspection. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Fixes: `ad721705d0` ("vfio iommu: Add migration capability to report supported features") Link: https://lore.kernel.org/r/20230801155352.1391945-1-stefanha@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 12:25:35 -06:00
Li Zetao	6c092088fa	vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code Use the module_fsl_mc_driver macro to simplify the code and remove redundant initialization owner in vfio_fsl_mc_driver. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230809131536.4021639-1-lizetao1@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 11:14:15 -06:00
Li Zetao	d7955ce40e	vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver The cdx_driver_register() will set "THIS_MODULE" to driver.owner when register a cdx_driver driver, so it is redundant initialization to set driver.owner in the statement. Remove it for clean code. Signed-off-by: Li Zetao <lizetao1@huawei.com> Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com> Link: https://lore.kernel.org/r/20230808020937.2975196-1-lizetao1@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 11:13:51 -06:00
Brett Creeley	fc9da66103	vfio/pds: Add Kconfig and documentation Add Kconfig entries and pds-vfio-pci.rst. Also, add an entry in the MAINTAINERS file for this new driver. It's not clear where documentation for vendor specific VFIO drivers should live, so just re-use the current amd ethernet location. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230807205755.29579-9-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:54:54 -06:00
Brett Creeley	7dabb1bcd1	vfio/pds: Add support for firmware recovery It's possible that the device firmware crashes and is able to recover due to some configuration and/or other issue. If a live migration is in progress while the firmware crashes, the live migration will fail. However, the VF PCI device should still be functional post crash recovery and subsequent migrations should go through as expected. When the pds_core device notices that firmware crashes it sends an event to all its client drivers. When the pds_vfio driver receives this event while migration is in progress it will request a deferred reset on the next migration state transition. This state transition will report failure as well as any subsequent state transition requests from the VMM/VFIO. Based on uapi/vfio.h the only way out of VFIO_DEVICE_STATE_ERROR is by issuing VFIO_DEVICE_RESET. Once this reset is done, the migration state will be reset to VFIO_DEVICE_STATE_RUNNING and migration can be performed. If the event is received while no migration is in progress (i.e. the VM is in normal operating mode), then no actions are taken and the migration state remains VFIO_DEVICE_STATE_RUNNING. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230807205755.29579-8-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:54:54 -06:00
Brett Creeley	f232836a91	vfio/pds: Add support for dirty page tracking In order to support dirty page tracking, the driver has to implement the VFIO subsystem's vfio_log_ops. This includes log_start, log_stop, and log_read_and_clear. All of the tracker resources are allocated and dirty tracking on the device is started during log_start. The resources are cleaned up and dirty tracking on the device is stopped during log_stop. The dirty pages are determined and reported during log_read_and_clear. In order to support these callbacks admin queue commands are used. All of the adminq queue command structures and implementations are included as part of this patch. PDS_LM_CMD_DIRTY_STATUS is added to query the current status of dirty tracking on the device. This includes if it's enabled (i.e. number of regions being tracked from the device's perspective) and the maximum number of regions supported from the device's perspective. PDS_LM_CMD_DIRTY_ENABLE is added to enable dirty tracking on the specified number of regions and their iova ranges. PDS_LM_CMD_DIRTY_DISABLE is added to disable dirty tracking for all regions on the device. PDS_LM_CMD_READ_SEQ and PDS_LM_CMD_DIRTY_WRITE_ACK are added to support reading and acknowledging the currently dirtied pages. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20230807205755.29579-7-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:54:46 -06:00
Brett Creeley	bb500dbe2a	vfio/pds: Add VFIO live migration support Add live migration support via the VFIO subsystem. The migration implementation aligns with the definition from uapi/vfio.h and uses the pds_core PF's adminq for device configuration. The ability to suspend, resume, and transfer VF device state data is included along with the required admin queue command structures and implementations. PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support the VF device suspend operation. PDS_LM_CMD_RESUME is added to support the VF device resume operation. PDS_LM_CMD_STATE_SIZE is added to determine the exact size of the VF device state data. PDS_LM_CMD_SAVE is added to get the VF device state data. PDS_LM_CMD_RESTORE is added to restore the VF device with the previously saved data from PDS_LM_CMD_SAVE. PDS_LM_CMD_HOST_VF_STATUS is added to notify the DSC/firmware when a migration is in/not-in progress from the host's perspective. The DSC/firmware can use this to clear/setup any necessary state related to a migration. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230807205755.29579-6-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:53:26 -06:00
Brett Creeley	63f77a7161	vfio/pds: register with the pds_core PF The pds_core driver will supply adminq services, so find the PF and register with the DSC services. Use the following commands to enable a VF: echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230807205755.29579-5-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:53:20 -06:00
Brett Creeley	38fe3975b4	vfio/pds: Initial support for pds VFIO driver This is the initial framework for the new pds-vfio-pci device driver. This does the very basics of registering the PDS PCI device and configuring it as a VFIO PCI device. With this change, the VF device can be bound to the pds-vfio-pci driver on the host and presented to the VM as an ethernet VF. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230807205755.29579-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:53:00 -06:00
Brett Creeley	9a4087fab3	vfio: Commonize combine_ranges for use in other VFIO drivers Currently only Mellanox uses the combine_ranges function. The new pds_vfio driver also needs this function. So, move it to a common location for other vendor drivers to use. Also, fix RCT ordering while moving/renaming the function. Cc: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20230807205755.29579-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-16 10:52:23 -06:00
Maher Sanalla	f14c1a14e6	net/mlx5: Allocate completion EQs dynamically This commit enables the dynamic allocation of EQs at runtime, allowing for more flexibility in managing completion EQs and reducing the memory overhead of driver load. Whenever a CQ is created for a given vector index, the driver will lookup to see if there is an already mapped completion EQ for that vector, if so, utilize it. Otherwise, allocate a new EQ on demand and then utilize it for the CQ completion events. Add a protection lock to the EQ table to protect from concurrent EQ creation attempts. While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with mlx5_comp_eqn_get() and mlx5_comp_irqn_get() which will allocate an EQ on demand if no EQ is found for the given vector. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-08-07 10:53:52 -07:00
Maher Sanalla	674dd4e2e0	net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max() To accurately represent its purpose, rename the function that retrieves the value of maximum vectors from mlx5_comp_vectors_count() to mlx5_comp_vectors_max(). Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-08-07 10:53:51 -07:00
Nicolin Chen	c157fd8861	vfio: Support IO page table replacement Now both the physical path and the emulated path can support an IO page table replacement. Call iommufd_device_replace/iommufd_access_replace(), when vdev->iommufd_attached is true. Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI header. Link: https://lore.kernel.org/r/b5f01956ff161f76aa52c95b0fa1ad6eaca95c4a.1690523699.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Nicolin Chen	89e07fd468	vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages() A driver that doesn't implement ops->dma_unmap shouldn't be allowed to do vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages(). Link: https://lore.kernel.org/r/85d622729d8f2334b35d42f1c568df1ededb9171.1690523699.git.nicolinc@nvidia.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Yi Liu	c1cce6d079	vfio: Compile vfio_group infrastructure optionally vfio_group is not needed for vfio device cdev, so with vfio device cdev introduced, the vfio_group infrastructures can be compiled out if only cdev is needed. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-26-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:20:50 -06:00
Yi Liu	5398be2564	vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev() The IOMMU_CAP_CACHE_COHERENCY check only applies to the physical devices that are IOMMU-backed. But it is now in the group code. If want to compile vfio_group infrastructure out, this check needs to be moved out of the group code. Another reason for this change is to fail the device registration for the physical devices that do not have IOMMU if the group code is not compiled as the cdev interface does not support such devices. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-25-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:20:41 -06:00
Yi Liu	b290a05fd8	vfio: Add VFIO_DEVICE_[AT\|DE]TACH_IOMMUFD_PT This adds ioctl for userspace to attach device cdev fd to and detach from IOAS/hw_pagetable managed by iommufd. VFIO_DEVICE_ATTACH_IOMMUFD_PT: attach vfio device to IOAS or hw_pagetable managed by iommufd. Attach can be undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close. VFIO_DEVICE_DETACH_IOMMUFD_PT: detach vfio device from the current attached IOAS or hw_pagetable managed by iommufd. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-24-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:20:37 -06:00
Yi Liu	5fcc26969a	vfio: Add VFIO_DEVICE_BIND_IOMMUFD This adds ioctl for userspace to bind device cdev fd to iommufd. VFIO_DEVICE_BIND_IOMMUFD: bind device to an iommufd, hence gain DMA control provided by the iommufd. open_device op is called after bind_iommufd op. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230718135551.6592-23-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:20:32 -06:00
Yi Liu	ca9e45b414	vfio: Avoid repeated user pointer cast in vfio_device_fops_unl_ioctl() This adds a local variable to store the user pointer cast result from arg. It avoids the repeated casts in the code when more ioctls are added. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-22-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:20:15 -06:00
Yi Liu	5c6de3ea73	vfio: Test kvm pointer in _vfio_device_get_kvm_safe() This saves some lines when adding the kvm get logic for the vfio_device cdev path. This also renames _vfio_device_get_kvm_safe() to be vfio_device_get_kvm_safe(). Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-20-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:45 -06:00
Yi Liu	8b6f173a4c	vfio: Add cdev for vfio_device This adds cdev support for vfio_device. It allows the user to directly open a vfio device w/o using the legacy container/group interface, as a prerequisite for supporting new iommu features like nested translation and etc. The device fd opened in this manner doesn't have the capability to access the device as the fops open() doesn't open the device until the successful VFIO_DEVICE_BIND_IOMMUFD ioctl which will be added in a later patch. With this patch, devices registered to vfio core would have both the legacy group and the new device interfaces created. - group interface : /dev/vfio/$groupID - device interface: /dev/vfio/devices/vfioX - normal device ("X" is a unique number across vfio devices) For a given device, the user can identify the matching vfioX by searching the vfio-dev folder under the sysfs path of the device. Take PCI device (0000:6a:01.0) as an example, /sys/bus/pci/devices/0000\:6a\:01.0/vfio-dev/vfioX implies the matching vfioX under /dev/vfio/devices/, and vfio-dev/vfioX/dev contains the major:minor number of the matching /dev/vfio/devices/vfioX. The user can get device fd by opening the /dev/vfio/devices/vfioX. The vfio_device cdev logic in this patch: ) __vfio_register_dev() path ends up doing cdev_device_add() for each vfio_device if VFIO_DEVICE_CDEV configured. ) vfio_unregister_group_dev() path does cdev_device_del(); cdev interface does not support noiommu devices, so VFIO only creates the legacy group interface for the physical devices that do not have IOMMU. noiommu users should use the legacy group interface. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-19-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:40 -06:00
Yi Liu	38c24544e1	vfio: Move device_del() before waiting for the last vfio_device registration refcount device_del() destroys the vfio-dev/vfioX under the sysfs for vfio_device. There is no reason to keep it while the device is going to be unregistered. This movement is also a preparation for adding vfio_device cdev. Kernel should remove the cdev node of the vfio_device to avoid new registration refcount increment while the device is going to be unregistered. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-18-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:26 -06:00
Yi Liu	291872a533	vfio: Move vfio_device_group_unregister() to be the first operation in unregister This avoids endless vfio_device refcount increment by userspace, which would keep blocking the vfio_unregister_group_dev(). Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-17-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:22 -06:00
Yi Liu	8cfa718602	vfio-iommufd: Add detach_ioas support for emulated VFIO devices This prepares for adding DETACH ioctl for emulated VFIO devices. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-16-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:18 -06:00
Yi Liu	9048c7341c	vfio-iommufd: Add detach_ioas support for physical VFIO devices This prepares for adding DETACH ioctl for physical VFIO devices. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-14-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:12 -06:00
Yi Liu	31014aef9e	vfio: Record devid in vfio_device_file .bind_iommufd() will generate an ID to represent this bond, which is needed by userspace for further usage. Store devid in vfio_device_file to avoid passing the pointer in multiple places. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-13-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:08 -06:00
Yi Liu	6f240ee677	vfio-iommufd: Split bind/attach into two steps This aligns the bind/attach logic with the coming vfio device cdev support. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-12-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:04 -06:00
Yi Liu	6086efe734	vfio-iommufd: Move noiommu compat validation out of vfio_iommufd_bind() This moves the noiommu compat validation logic into vfio_df_group_open(). This is more consistent with what will be done in vfio device cdev path. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-11-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:01 -06:00
Yi Liu	839e692fa4	vfio: Make vfio_df_open() single open for device cdev path VFIO group has historically allowed multi-open of the device FD. This was made secure because the "open" was executed via an ioctl to the group FD which is itself only single open. However, no known use of multiple device FDs today. It is kind of a strange thing to do because new device FDs can naturally be created via dup(). When we implement the new device uAPI (only used in cdev path) there is no natural way to allow the device itself from being multi-opened in a secure manner. Without the group FD we cannot prove the security context of the opener. Thus, when moving to the new uAPI we block the ability of opening a device multiple times. Given old group path still allows it we store a vfio_group pointer in struct vfio_device_file to differentiate. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-10-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:57 -06:00
Yi Liu	270bf4c019	vfio: Add cdev_device_open_cnt to vfio_group This is for counting the devices that are opened via the cdev path. This count is increased and decreased by the cdev path. The group path checks it to achieve exclusion with the cdev path. With this, only one path (group path or cdev path) will claim DMA ownership. This avoids scenarios in which devices within the same group may be opened via different paths. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-9-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:54 -06:00
Yi Liu	82d93f580f	vfio: Block device access via device fd until device is opened Allow the vfio_device file to be in a state where the device FD is opened but the device cannot be used by userspace (i.e. its .open_device() hasn't been called). This inbetween state is not used when the device FD is spawned from the group FD, however when we create the device FD directly by opening a cdev it will be opened in the blocked state. The reason for the inbetween state is that userspace only gets a FD but doesn't gain access permission until binding the FD to an iommufd. So in the blocked state, only the bind operation is allowed. Completing bind will allow user to further access the device. This is implemented by adding a flag in struct vfio_device_file to mark the blocked state and using a simple smp_load_acquire() to obtain the flag value and serialize all the device setup with the thread accessing this device. Following this lockless scheme, it can safely handle the device FD unbound->bound but it cannot handle bound->unbound. To allow this we'd need to add a lock on all the vfio ioctls which seems costly. So once device FD is bound, it remains bound until the FD is closed. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-8-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:50 -06:00
Yi Liu	05f37e1c03	vfio: Pass struct vfio_device_file * to vfio_device_open/close() This avoids passing too much parameters in multiple functions. Per the input parameter change, rename the function to be vfio_df_open/close(). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-7-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:47 -06:00
Yi Liu	34aeeecdb3	vfio: Accept vfio device file in the KVM facing kAPI This makes the vfio file kAPIs to accept vfio device files, also a preparation for vfio device cdev support. For the kvm set with vfio device file, kvm pointer is stored in struct vfio_device_file, and use kvm_ref_lock to protect kvm set and kvm pointer usage within VFIO. This kvm pointer will be set to vfio_device after device file is bound to iommufd in the cdev path. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-4-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:28 -06:00
Yi Liu	b1a59be8a2	vfio: Refine vfio file kAPIs for KVM This prepares for making the below kAPIs to accept both group file and device file instead of only vfio group file. bool vfio_file_enforced_coherent(struct file file); void vfio_file_set_kvm(struct file file, struct kvm *kvm); Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-3-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:22 -06:00
Yi Liu	b1a3b5c61d	vfio: Allocate per device file structure This is preparation for adding vfio device cdev support. vfio device cdev requires: 1) A per device file memory to store the kvm pointer set by KVM. It will be propagated to vfio_device:kvm after the device cdev file is bound to an iommufd. 2) A mechanism to block device access through device cdev fd before it is bound to an iommufd. To address the above requirements, this adds a per device file structure named vfio_device_file. For now, it's only a wrapper of struct vfio_device pointer. Other fields will be added to this per file structure in future commits. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-2-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:17 -06:00
Yi Liu	71791b9246	vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET This is the way user to invoke hot-reset for the devices opened by cdev interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing hot-reset for cdev devices. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-11-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:13 -06:00
Yi Liu	b56b7aabcf	vfio/pci: Copy hot-reset device info to userspace in the devices loop This copies the vfio_pci_dependent_device to userspace during looping each affected device for reporting vfio_pci_hot_reset_info. This avoids counting the affected devices and allocating a potential large buffer to store the vfio_pci_dependent_device of all the affected devices before copying them to userspace. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230718105542.4138-10-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:09 -06:00
Yi Liu	9062ff405b	vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev This allows VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl use the iommufd_ctx of the cdev device to check the ownership of the other affected devices. When VFIO_DEVICE_GET_PCI_HOT_RESET_INFO is called on an IOMMUFD managed device, the new flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is reported to indicate the values returned are IOMMUFD devids rather than group IDs as used when accessing vfio devices through the conventional vfio group interface. Additionally the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED will be reported in this mode if all of the devices affected by the hot-reset are owned by either virtue of being directly bound to the same iommufd context as the calling device, or implicitly owned via a shared IOMMU group. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-9-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:05 -06:00
Yi Liu	a80e1de932	vfio: Add helper to search vfio_device in a dev_set There are drivers that need to search vfio_device within a given dev_set. e.g. vfio-pci. So add a helper. vfio_pci_is_device_in_set() now returns -EBUSY in commit `a882c16a2b` ("vfio/pci: Change vfio_pci_try_bus_reset() to use the dev_set") where it was trying to preserve the return of vfio_pci_try_zap_and_vma_lock_cb(). However, it makes more sense to return -ENODEV. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-8-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:18:02 -06:00
Yi Liu	6e6c513fe1	vfio/pci: Move the existing hot reset logic to be a helper This prepares to add another method for hot reset. The major hot reset logic are moved to vfio_pci_ioctl_pci_hot_reset_groups(). No functional change is intended. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-3-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:17:42 -06:00
Yi Liu	c60f932043	vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() This suits more on what the code does. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-2-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:17:09 -06:00
Linus Torvalds	b25f62ccb4	VFIO updates for v6.5-rc1 - Adjust log levels for common messages. (Oleksandr Natalenko, Alex Williamson) - Support for dynamic MSI-X allocation. (Reinette Chatre) - Enable and report PCIe AtomicOp Completer capabilities. (Alex Williamson) - Cleanup Kconfigs for vfio bus drivers. (Alex Williamson) - Add support for CDX bus based devices. (Nipun Gupta) - Fix race with concurrent mdev initialization. (Eric Farman) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmSd7QsbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiOrcQALKcFPN5OtKQRueLaFC4 Mjg7CkHPUGMGYa7WIXGyBtoU2DvjzQ+6uJUJplY4bKNGLK2xQqUsilyRavDfgW5Q fRiBEGPd6FM8QYp2X0rbYWmQtTSWCbDhj/107Pu43gSOkH3MzXiiqJehNFO7pO5Y Az218iiq1nO6/g6Msswowk03C1LoH41maLjsDP4CKfdl9BTaLw0tNGmXBAkF5MZR Q6D54nu6g20OFXDicYKaKCrB4ydy+pkp7BfFgT7IqtPQtTiAHrhgMJYrU+0IlGwx ukBAGbKTiK/JySs0EY6Wz3K9hnQTzHWzWlqXO/FmlILBqqEMp18AnM3RZa36GqvT PrM/wiKoJgp9BFCPApVyPUyok/lrDmirKxBcWgogHC9uh0q7oI3i79k0vXkKPJnp 5IAGhvTpJC7KBIIct8eIGgVcLA6tVPXpd30iqKR61I6X9etfo+f2nwNSjx1dC3eG B/DnvuTWRjecNrIPUpunHnb/LZFoV/Qq3tam3MV572aMpF0lzowuOxevI5Z4TNf+ l/u923KBAVcotRaMm1huFP2Wkd0K8UCD9zwCpXrctNoXQXtRB9DskOinkOIGT5aS SaepCt003BWPJ5UlOyQMGtLG1+GPYxjrjvNWryoWStomVrc+WQ6XyAxk4iVZPJBs kcWTa7r9RU7uiOc2VK/UOiry =TOvh -----END PGP SIGNATURE----- Merge tag 'vfio-v6.5-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Adjust log levels for common messages (Oleksandr Natalenko, Alex Williamson) - Support for dynamic MSI-X allocation (Reinette Chatre) - Enable and report PCIe AtomicOp Completer capabilities (Alex Williamson) - Cleanup Kconfigs for vfio bus drivers (Alex Williamson) - Add support for CDX bus based devices (Nipun Gupta) - Fix race with concurrent mdev initialization (Eric Farman) * tag 'vfio-v6.5-rc1' of https://github.com/awilliam/linux-vfio: vfio/mdev: Move the compat_class initialization to module init vfio/cdx: add support for CDX bus vfio/fsl: Create Kconfig sub-menu vfio/platform: Cleanup Kconfig vfio/pci: Cleanup Kconfig vfio/pci-core: Add capability for AtomicOp completer support vfio/pci: Also demote hiding standard cap messages vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X vfio/pci: Support dynamic MSI-X vfio/pci: Probe and store ability to support dynamic MSI-X vfio/pci: Use bitfield for struct vfio_pci_core_device flags vfio/pci: Update stale comment vfio/pci: Remove interrupt context counter vfio/pci: Use xarray for interrupt context storage vfio/pci: Move to single error path vfio/pci: Prepare for dynamic interrupt context storage vfio/pci: Remove negative check on unsigned vector vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable vfio/pci: demote hiding ecap messages to debug level	2023-06-30 15:22:09 -07:00
Linus Torvalds	e4c8d01865	ARM: SoC drivers for 6.5 Nothing surprising in the SoC specific drivers, with the usual updates: * Added or improved SoC driver support for Tegra234, Exynos4121, RK3588, as well as multiple Mediatek and Qualcomm chips * SCMI firmware gains support for multiple SMC/HVC transport and version 3.2 of the protocol * Cleanups amd minor changes for the reset controller, memory controller, firmware and sram drivers * Minor changes to amd/xilinx, samsung, tegra, nxp, ti, qualcomm, amlogic and renesas SoC specific drivers -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmSdmbIACgkQYKtH/8kJ UicewQ/6Aq8j5pBFYBimZoyQ0bi9z+prGrHoDDYLew2vKjtOXJl5z7ZnM3J1oyPt Zvis3IaGkHJCuuqotPdsquZrzHq8slzXzwkHPfHORJBC4gV0V/vMS8w32tO5FfTq ULrMyWnbsU7Udeywc2xuEpAoC9+bXX9brnCpa3H41peIGZKM+0g7EE6FASt3YaOk O+ZMSGqF8QbCqSQrUH3GudFlFMy/VxIvwuUsbLt8aNkRACunQZXVgUdArvLV49nX SElFN7hOVRoVDv0rgYMxlwElymrta/kMyjLba8GU1GIhzyDGozVqIJQAnsQ3f6CC yyzaJm27zzJH0mx9jx4W+JLBdjqDL4ctE2WyllRVIpTGYMHiMQtutHNwtNupIuD5 j9j/fIVQWZqOdWXnA6V/CHYN1MZBRTH3KQcnLlYPC01dWKThPDnrHGfwOkfsrwtN zuERJJ+gd5b8KW4dmy1ueDOSB8162LxbS7iHxpOBGySmqVOYj3XUqACZhKRfXfIQ BVj9punCE/gO2fMb9IZByjeOzgtV+PBRmPxoglyaGkT4fVfL06kEbpKFYbXXq9b/ aAS/U84gGr8ebWsOXszwDnBzTZRzjMVv/T9KDTTJuWbBEPNyCR7fUG0cZ50rSKnJ 2cTPe3a0sS6LaBt71qfExCIfxG+cJ2c3N1U5/jb2C49Aob45obs= =zvLr -----END PGP SIGNATURE----- Merge tag 'soc-drivers-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC driver updates from Arnd Bergmann: "Nothing surprising in the SoC specific drivers, with the usual updates: - Added or improved SoC driver support for Tegra234, Exynos4121, RK3588, as well as multiple Mediatek and Qualcomm chips - SCMI firmware gains support for multiple SMC/HVC transport and version 3.2 of the protocol - Cleanups amd minor changes for the reset controller, memory controller, firmware and sram drivers - Minor changes to amd/xilinx, samsung, tegra, nxp, ti, qualcomm, amlogic and renesas SoC specific drivers" * tag 'soc-drivers-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (118 commits) dt-bindings: interrupt-controller: Convert Amlogic Meson GPIO interrupt controller binding MAINTAINERS: add PHY-related files to Amlogic SoC file list drivers: meson: secure-pwrc: always enable DMA domain tee: optee: Use kmemdup() to replace kmalloc + memcpy soc: qcom: geni-se: Do not bother about enable/disable of interrupts in secondary sequencer dt-bindings: sram: qcom,imem: document qdu1000 soc: qcom: icc-bwmon: Fix MSM8998 count unit dt-bindings: soc: qcom,rpmh-rsc: Require power-domains soc: qcom: socinfo: Add Soc ID for IPQ5300 dt-bindings: arm: qcom,ids: add SoC ID for IPQ5300 soc: qcom: Fix a IS_ERR() vs NULL bug in probe soc: qcom: socinfo: Add support for new fields in revision 19 soc: qcom: socinfo: Add support for new fields in revision 18 dt-bindings: firmware: scm: Add compatible for SDX75 soc: qcom: mdt_loader: Fix split image detection dt-bindings: memory-controllers: drop unneeded quotes soc: rockchip: dtpm: use C99 array init syntax firmware: tegra: bpmp: Add support for DRAM MRQ GSCs soc/tegra: pmc: Use devm_clk_notifier_register() soc/tegra: pmc: Simplify debugfs initialization ...	2023-06-29 15:22:19 -07:00
Eric Farman	ff598081e5	vfio/mdev: Move the compat_class initialization to module init The pointer to mdev_bus_compat_class is statically defined at the top of mdev_core, and was originally (commit `7b96953bc6` ("vfio: Mediated device Core driver") serialized by the parent_list_lock. The blamed commit removed this mutex, leaving the pointer initialization unserialized. As a result, the creation of multiple MDEVs in parallel (such as during boot) can encounter errors during the creation of the sysfs entries, such as: [ 8.337509] sysfs: cannot create duplicate filename '/class/mdev_bus' [ 8.337514] vfio_ccw 0.0.01d8: MDEV: Registered [ 8.337516] CPU: 13 PID: 946 Comm: driverctl Not tainted 6.4.0-rc7 #20 [ 8.337522] Hardware name: IBM 3906 M05 780 (LPAR) [ 8.337525] Call Trace: [ 8.337528] [<0000000162b0145a>] dump_stack_lvl+0x62/0x80 [ 8.337540] [<00000001622aeb30>] sysfs_warn_dup+0x78/0x88 [ 8.337549] [<00000001622aeca6>] sysfs_create_dir_ns+0xe6/0xf8 [ 8.337552] [<0000000162b04504>] kobject_add_internal+0xf4/0x340 [ 8.337557] [<0000000162b04d48>] kobject_add+0x78/0xd0 [ 8.337561] [<0000000162b04e0a>] kobject_create_and_add+0x6a/0xb8 [ 8.337565] [<00000001627a110e>] class_compat_register+0x5e/0x90 [ 8.337572] [<000003ff7fd815da>] mdev_register_parent+0x102/0x130 [mdev] [ 8.337581] [<000003ff7fdc7f2c>] vfio_ccw_sch_probe+0xe4/0x178 [vfio_ccw] [ 8.337588] [<0000000162a7833c>] css_probe+0x44/0x80 [ 8.337599] [<000000016279f4da>] really_probe+0xd2/0x460 [ 8.337603] [<000000016279fa08>] driver_probe_device+0x40/0xf0 [ 8.337606] [<000000016279fb78>] __device_attach_driver+0xc0/0x140 [ 8.337610] [<000000016279cbe0>] bus_for_each_drv+0x90/0xd8 [ 8.337618] [<00000001627a00b0>] __device_attach+0x110/0x190 [ 8.337621] [<000000016279c7c8>] bus_rescan_devices_helper+0x60/0xb0 [ 8.337626] [<000000016279cd48>] drivers_probe_store+0x48/0x80 [ 8.337632] [<00000001622ac9b0>] kernfs_fop_write_iter+0x138/0x1f0 [ 8.337635] [<00000001621e5e14>] vfs_write+0x1ac/0x2f8 [ 8.337645] [<00000001621e61d8>] ksys_write+0x70/0x100 [ 8.337650] [<0000000162b2bdc4>] __do_syscall+0x1d4/0x200 [ 8.337656] [<0000000162b3c828>] system_call+0x70/0x98 [ 8.337664] kobject: kobject_add_internal failed for mdev_bus with -EEXIST, don't try to register things with the same name in the same directory. [ 8.337668] kobject: kobject_create_and_add: kobject_add error: -17 [ 8.337674] vfio_ccw: probe of 0.0.01d9 failed with error -12 [ 8.342941] vfio_ccw_mdev aeb9ca91-10c6-42bc-a168-320023570aea: Adding to iommu group 2 Move the initialization of the mdev_bus_compat_class pointer to the init path, to match the cleanup in module exit. This way the code in mdev_register_parent() can simply link the new parent to it, rather than determining whether initialization is required first. Fixes: `89345d5177` ("vfio/mdev: embedd struct mdev_parent in the parent data structure") Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com> Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230626133642.2939168-1-farman@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-06-27 12:05:26 -06:00
Ryan Roberts	c33c794828	mm: ptep_get() conversion Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t v; @@ - v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-19 16:19:25 -07:00
Nipun Gupta	234489ac56	vfio/cdx: add support for CDX bus vfio-cdx driver enables IOCTLs for user space to query MMIO regions for CDX devices and mmap them. This change also adds support for reset of CDX devices. With VFIO enabled on CDX devices, user-space applications can also exercise DMA securely via IOMMU on these devices. This change adds the VFIO CDX driver and enables the following ioctls for CDX devices: - VFIO_DEVICE_GET_INFO: - VFIO_DEVICE_GET_REGION_INFO - VFIO_DEVICE_RESET Signed-off-by: Nipun Gupta <nipun.gupta@amd.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Tested-by: Nikhil Agarwal <nikhil.agarwal@amd.com> Link: https://lore.kernel.org/r/20230531124557.11009-1-nipun.gupta@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-06-16 12:27:04 -06:00
Alex Williamson	1e44c58cc4	vfio/fsl: Create Kconfig sub-menu For consistency with pci and platform, push the vfio-fsl-mc option into a sub-menu. Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20230614193948.477036-4-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-06-16 12:23:58 -06:00
Alex Williamson	8bee6f00fc	vfio/platform: Cleanup Kconfig Like vfio-pci, there's also a base module here where vfio-amba depends on vfio-platform, when really it only needs vfio-platform-base. Create a sub-menu for platform drivers and a nested menu for reset drivers. Cleanup Makefile to make use of new CONFIG_VFIO_PLATFORM_BASE for building the shared modules and traversing reset modules. Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20230614193948.477036-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-06-16 12:23:50 -06:00
Alex Williamson	8cc75183b7	vfio/pci: Cleanup Kconfig It should be possible to select vfio-pci variant drivers without building vfio-pci itself, which implies each variant driver should select vfio-pci-core. Fix the top level vfio Makefile to traverse pci based on vfio-pci-core rather than vfio-pci. Mark MMAP and INTX options depending on vfio-pci-core to cleanup resulting config if core is not enabled. Push all PCI related vfio options to a sub-menu and make descriptions consistent. Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20230614193948.477036-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-06-16 12:23:42 -06:00
Alex Williamson	a5bfe22db2	vfio/pci-core: Add capability for AtomicOp completer support Test and enable PCIe AtomicOp completer support of various widths and report via device-info capability to userspace. Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Robin Voetter <robin@streamhpc.com> Tested-by: Robin Voetter <robin@streamhpc.com> Link: https://lore.kernel.org/r/20230519214748.402003-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-06-16 12:22:18 -06:00
Lorenzo Stoakes	0b295316b3	mm/gup: remove unused vmas parameter from pin_user_pages_remote() No invocation of pin_user_pages_remote() uses the vmas parameter, so remove it. This forms part of a larger patch set eliminating the use of the vmas parameters altogether. Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:25 -07:00
Uwe Kleine-König	59272ad8d9	bus: fsl-mc: Make remove function return void The value returned by an fsl-mc driver's remove function is mostly ignored. (Only an error message is printed if the value is non-zero and then device removal continues unconditionally.) So change the prototype of the remove function to return no value. This way driver authors are not tempted to assume that passing an error to the upper layer is a good idea. All drivers are adapted accordingly. There is no intended change of behaviour, all callbacks were prepared to return 0 before. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # sanity checks Reviewed-by: Laurentiu Tudor <laurentiu.tudor@nxp.com> Tested-by: Laurentiu Tudor <laurentiu.tudor@nxp.com> Signed-off-by: Li Yang <leoyang.li@nxp.com>	2023-05-30 18:58:43 -05:00
Alex Williamson	d9824f70e5	vfio/pci: Also demote hiding standard cap messages Apply the same logic as commit `912b625b4d` ("vfio/pci: demote hiding ecap messages to debug level") for the less common case of hiding standard capabilities. Reviewed-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/r/20230523225250.1215911-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-26 13:58:27 -06:00
Reinette Chatre	6c8017c6a5	vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE to provide guidance to user space. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/fd1ef2bf6ae972da8e2805bc95d5155af5a8fb0a.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	e4163438e0	vfio/pci: Support dynamic MSI-X pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be allocated after MSI-X enabling. Use dynamic MSI-X (if supported by the device) to allocate an interrupt after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at the time a valid eventfd is assigned. This is different behavior from a range provided during MSI-X enabling where interrupts are allocated for the entire range whether a valid eventfd is provided for each interrupt or not. The PCI-MSIX API requires that some number of irqs are allocated for an initial set of vectors when enabling MSI-X on the device. When dynamic MSIX allocation is not supported, the vector table, and thus the allocated irq set can only be resized by disabling and re-enabling MSI-X with a different range. In that case the irq allocation is essentially a cache for configuring vectors within the previously allocated vector range. When dynamic MSI-X allocation is supported, the API still requires some initial set of irqs to be allocated, but also supports allocating and freeing specific irq vectors both within and beyond the initially allocated range. For consistency between modes, as well as to reduce latency and improve reliability of allocations, and also simplicity, this implementation only releases irqs via pci_free_irq_vectors() when either the interrupt mode changes or the device is released. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/ Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/956c47057ae9fd45591feaa82e9ae20929889249.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	dd27a70700	vfio/pci: Probe and store ability to support dynamic MSI-X Not all MSI-X devices support dynamic MSI-X allocation. Whether a device supports dynamic MSI-X should be queried using pci_msix_can_alloc_dyn(). Instead of scattering code with pci_msix_can_alloc_dyn(), probe this ability once and store it as a property of the virtual device. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/f1ae022c060ecb7e527f4f53c8ccafe80768da47.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	9387cf59dc	vfio/pci: Update stale comment In preparation for surrounding code change it is helpful to ensure that existing comments are accurate. Remove inaccurate comment about direct access and update the rest of the comment to reflect the purpose of writing the cached MSI message to the device. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/lkml/20230330164050.0069e2a5.alex.williamson@redhat.com/ Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5b605ce7dcdab5a5dfef19cec4d73ae2fdad3ae1.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	63972f63a6	vfio/pci: Remove interrupt context counter struct vfio_pci_core_device::num_ctx counts how many interrupt contexts have been allocated. When all interrupt contexts are allocated simultaneously num_ctx provides the upper bound of all vectors that can be used as indices into the interrupt context array. With the upcoming support for dynamic MSI-X the number of interrupt contexts does not necessarily span the range of allocated interrupts. Consequently, num_ctx is no longer a trusted upper bound for valid indices. Stop using num_ctx to determine if a provided vector is valid. Use the existence of allocated interrupt. This changes behavior on the error path when user space provides an invalid vector range. Behavior changes from early exit without any modifications to possible modifications to valid vectors within the invalid range. This is acceptable considering that an invalid range is not a valid scenario, see link to discussion. The checks that ensure that user space provides a range of vectors that is valid for the device are untouched. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/lkml/20230316155646.07ae266f.alex.williamson@redhat.com/ Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/e27d350f02a65b8cbacd409b4321f5ce35b3186d.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	b156e48fff	vfio/pci: Use xarray for interrupt context storage Interrupt context is statically allocated at the time interrupts are allocated. Following allocation, the context is managed by directly accessing the elements of the array using the vector as index. The storage is released when interrupts are disabled. It is possible to dynamically allocate a single MSI-X interrupt after MSI-X is enabled. A dynamic storage for interrupt context is needed to support this. Replace the interrupt context array with an xarray (similar to what the core uses as store for MSI descriptors) that can support the dynamic expansion while maintaining the custom that uses the vector as index. With a dynamic storage it is no longer required to pre-allocate interrupt contexts at the time the interrupts are allocated. MSI and MSI-X interrupt contexts are only used when interrupts are enabled. Their allocation can thus be delayed until interrupt enabling. Only enabled interrupts will have associated interrupt contexts. Whether an interrupt has been allocated (a Linux irq number exists for it) becomes the criteria for whether an interrupt can be enabled. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/lkml/20230404122444.59e36a99.alex.williamson@redhat.com/ Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/40e235f38d427aff79ae35eda0ced42502aa0937.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	8850336588	vfio/pci: Move to single error path Enabling and disabling of an interrupt involves several steps that can fail. Cleanup after failure is done when the error is encountered, resulting in some repetitive code. Support for dynamic contexts will introduce more steps during interrupt enabling and disabling. Transition to centralized exit path in preparation for dynamic contexts to eliminate duplicate error handling code. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/72dddae8aa710ce522a74130120733af61cffe4d.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	d977e0f766	vfio/pci: Prepare for dynamic interrupt context storage Interrupt context storage is statically allocated at the time interrupts are allocated. Following allocation, the interrupt context is managed by directly accessing the elements of the array using the vector as index. It is possible to allocate additional MSI-X vectors after MSI-X has been enabled. Dynamic storage of interrupt context is needed to support adding new MSI-X vectors after initial allocation. Replace direct access of array elements with pointers to the array elements. Doing so reduces impact of moving to a new data structure. Move interactions with the array to helpers to mostly contain changes needed to transition to a dynamic data structure. No functional change intended. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/eab289693c8325ede9aba99380f8b8d5143980a4.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	6578ed85c7	vfio/pci: Remove negative check on unsigned vector User space provides the vector as an unsigned int that is checked early for validity (vfio_set_irqs_validate_and_prepare()). A later negative check of the provided vector is not necessary. Remove the negative check and ensure the type used for the vector is consistent as an unsigned int. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/28521e1b0b091849952b0ecb8c118729fc8cdc4f.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Reinette Chatre	a65f35cfd5	vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable vfio_msi_disable() releases all previously allocated state associated with each interrupt before disabling MSI/MSI-X. vfio_msi_disable() iterates twice over the interrupt state: first directly with a for loop to do virqfd cleanup, followed by another for loop within vfio_msi_set_block() that removes the interrupt handler and its associated state using vfio_msi_set_vector_signal(). Simplify interrupt cleanup by iterating over allocated interrupts once. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/837acb8cbe86a258a50da05e56a1f17c1a19abbe.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:49:03 -06:00
Oleksandr Natalenko	912b625b4d	vfio/pci: demote hiding ecap messages to debug level Seeing a burst of messages like this: vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0 vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x25@0x200 vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x26@0x210 vfio-pci 0000:98:00.0: vfio_ecap_init: hiding ecap 0x27@0x250 vfio-pci 0000:98:00.1: vfio_ecap_init: hiding ecap 0x25@0x200 vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0 vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x25@0x200 vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x26@0x210 vfio-pci 0000:b1:00.0: vfio_ecap_init: hiding ecap 0x27@0x250 vfio-pci 0000:b1:00.1: vfio_ecap_init: hiding ecap 0x25@0x200 is of little to no value for an ordinary user. Hence, use pci_dbg() instead of pci_info(). Signed-off-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Cédric Le Goater <clg@redhat.com> Tested-by: YangHang Liu <yanghliu@redhat.com> Link: https://lore.kernel.org/r/20230504131654.24922-1-oleksandr@natalenko.name Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 15:47:48 -06:00
Yan Zhao	4752354af7	vfio/type1: check pfn valid before converting to struct page Check physical PFN is valid before converting the PFN to a struct page pointer to be returned to caller of vfio_pin_pages(). vfio_pin_pages() pins user pages with contiguous IOVA. If the IOVA of a user page to be pinned belongs to vma of vm_flags VM_PFNMAP, pin_user_pages_remote() will return -EFAULT without returning struct page address for this PFN. This is because usually this kind of PFN (e.g. MMIO PFN) has no valid struct page address associated. Upon this error, vaddr_get_pfns() will obtain the physical PFN directly. While previously vfio_pin_pages() returns to caller PFN arrays directly, after commit `34a255e676` ("vfio: Replace phys_pfn with pages for vfio_pin_pages()"), PFNs will be converted to "struct page " unconditionally and therefore the returned "struct page " array may contain invalid struct page addresses. Given current in-tree users of vfio_pin_pages() only expect "struct page * returned, check PFN validity and return -EINVAL to let the caller be aware of IOVAs to be pinned containing PFN not able to be returned in "struct page *" array. So that, the caller will not consume the returned pointer (e.g. test PageReserved()) and avoid error like "supervisor read access in kernel mode". Fixes: `34a255e676` ("vfio: Replace phys_pfn with pages for vfio_pin_pages()") Cc: Sean Christopherson <seanjc@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230519065843.10653-1-yan.y.zhao@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-05-23 14:16:29 -06:00
Linus Torvalds	7df047b3f0	VFIO updates for v6.4-rc1 - Expose and allow R/W access to the PCIe DVSEC capability through vfio-pci, as we already do with the legacy vendor capability. (K V P Satyanarayana) - Fix kernel-doc issues with structure definitions. (Simon Horman) - Clarify ordering of operations relative to the kvm-vfio device for driver dependencies against the kvm pointer. (Yi Liu) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmRRPjAbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiePoQAJkUBO4bN3BvTG3iOkh1 vGrD3llhoUeD3AetT+5Y7pX0ml6wx9SLjcuFIUyYDMomLebpFXQAlI3gjE5eGbnf PVfn6zJDrcA5w2tFwkJHUnN+RgcaPW5tiviP6XkRFpS95HCQRsFdp9GhI2U3X7q7 1G1pTV4Fb6BPhU0zAOueW1HKovTFDi3qbzaG5JjosdZc8V32/USQ6ekkXSGMbuzN HsqLi2ujrtwAYluNKGxsuiDKXBFT8z8Mji7S9Z8RqoR0IF9q8JMUC0UMImn/XhKN 4XgCsibPt4o+RSStAf3vUPoP1l3I5J+3dE4ynLiHN8mNmbJ2gCLzQx18Dfww3V9n Ao8wdwZvvskfc+gkEJAh47OFf9GvCvvHtg3uZS9KPDidS3uY8K4ooIr4Pv1F+QMh rcnjZRFCakPB2tijmtg+1/0dY5/YEAn0mYWQ8ph5Ips36q30WBrbhBLJ47wE6PZ0 bsChcCU+4wK9e20lzMewa9QYnrRkvjGbmIo3GnjyPt4aG0VM08orIePy7ZVen3S3 +ieXBbuW61BLN/fZBFrDB4IxmekWq2TkkxSQCe8rn+jgs7qUUlFzn34Iqna6QTQV pG+O3pyflULZGZHkij4lOn1i2SqqbLxZofYose9hHq5r6ALD5EFNg3QVLuiAc7Tb 3ctmfePHowrzKMyHMlXb8DE7 =Ylh5 -----END PGP SIGNATURE----- Merge tag 'vfio-v6.4-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Expose and allow R/W access to the PCIe DVSEC capability through vfio-pci, as we already do with the legacy vendor capability (K V P Satyanarayana) - Fix kernel-doc issues with structure definitions (Simon Horman) - Clarify ordering of operations relative to the kvm-vfio device for driver dependencies against the kvm pointer (Yi Liu) * tag 'vfio-v6.4-rc1' of https://github.com/awilliam/linux-vfio: docs: kvm: vfio: Suggest KVM_DEV_VFIO_GROUP_ADD vs VFIO_GROUP_GET_DEVICE_FD ordering vfio: correct kdoc for ops structures vfio/pci: Add DVSEC PCI Extended Config Capability to user visible list.	2023-05-02 11:56:43 -07:00
Linus Torvalds	70cc1b5307	powerpc updates for 6.4 - Add support for building the kernel using PC-relative addressing on Power10. - Allow HV KVM guests on Power10 to use prefixed instructions. - Unify support for the P2020 CPU (85xx) into a single machine description. - Always build the 64-bit kernel with 128-bit long double. - Drop support for several obsolete 2000's era development boards as identified by Paul Gortmaker. - A series fixing VFIO on Power since some generic changes. - Various other small features and fixes. Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Benjamin Gray, Bo Liu, Christophe Leroy, Dan Carpenter, David Binderman, Ira Weiny, Joel Stanley, Kajol Jain, Kautuk Consul, Liang He, Luis Chamberlain, Masahiro Yamada, Michael Neuling, Nathan Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Nick Desaulniers, Nysal Jan K.A, Pali Rohár, Paul Gortmaker, Paul Mackerras, Petr Vaněk, Randy Dunlap, Rob Herring, Sachin Sant, Sean Christopherson, Segher Boessenkool, Timothy Pearson. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmRLdD8THG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPrED/93Hwi1/8udXYV6OAFBnLLLDX9RZWNx sj6W7PC/yY7MGUTIGcayrAURt5xFfQZMBdq9nhhH46Wyd8pSUe1IlpXIEgqeH64Y 5uaSe6u4OZ5dDmiYz8bM+Y4Ixkfq1xMO0Rj27FIRJyU4Pp6gyMQ6/8W+iPU2vYIb jl4frVO8PpmCbW8euOyT/b9YB2h79+5nLgZT4RvmYJblKQwgzRZWaiCAU0wYf+xT xYsMQcqJlPstizTnXIr+GC08VrMPQR51kJCurnNUMXoAQY6toEXebveWRNZ3sH39 K0BRQ036NNGPS4GlHVbjgLIdoWr4pUEROZ48Jy9WeiDh6OwO2vb2zrm7ZLtKlGXI LCQ08T9diPbAAJyVJaKjpsNXhTffuLPRhIOr1o2vqGvY+Fqbw96bPQAyFbxpJtrw rGTTc5e93fEdZV+eaGR8kfdNM75WM6lgiteaIbUnpirYTh/0j6YEO1Ijc4d9e5ux aoHN2eXQDjMm9foZf1D6vaCTRyN8LwLa9hcydKCn6+qULUQzoZ2r4cRt6eSD7+fx Wni1Zg7vD6xx294nVmhg5dWuD4qVIAZj3BZPFHHR2SPqy+UbEJLk/NKgznmrhhgQ HBcZAEFqIBZkA4e0w6LJKh/1j1rxpZUokgjMotOJq4VRivq82W1P7SioFDY3/6w5 UvGtIgGq4Qsa2Q== =89WH -----END PGP SIGNATURE----- Merge tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add support for building the kernel using PC-relative addressing on Power10. - Allow HV KVM guests on Power10 to use prefixed instructions. - Unify support for the P2020 CPU (85xx) into a single machine description. - Always build the 64-bit kernel with 128-bit long double. - Drop support for several obsolete 2000's era development boards as identified by Paul Gortmaker. - A series fixing VFIO on Power since some generic changes. - Various other small features and fixes. Thanks to Alexey Kardashevskiy, Andrew Donnellan, Benjamin Gray, Bo Liu, Christophe Leroy, Dan Carpenter, David Binderman, Ira Weiny, Joel Stanley, Kajol Jain, Kautuk Consul, Liang He, Luis Chamberlain, Masahiro Yamada, Michael Neuling, Nathan Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Nick Desaulniers, Nysal Jan K.A, Pali Rohár, Paul Gortmaker, Paul Mackerras, Petr Vaněk, Randy Dunlap, Rob Herring, Sachin Sant, Sean Christopherson, Segher Boessenkool, and Timothy Pearson. * tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits) powerpc/64s: Disable pcrel code model on Clang powerpc: Fix merge conflict between pcrel and copy_thread changes powerpc/configs/powernv: Add IGB=y powerpc/configs/64s: Drop JFS Filesystem powerpc/configs/64s: Use EXT4 to mount EXT2 filesystems powerpc/configs: Make pseries_defconfig an alias for ppc64le_guest powerpc/configs: Make pseries_le an alias for ppc64le_guest powerpc/configs: Incorporate generic kvm_guest.config into guest configs powerpc/configs: Add IBMVETH=y and IBMVNIC=y to guest configs powerpc/configs/64s: Enable Device Mapper options powerpc/configs/64s: Enable PSTORE powerpc/configs/64s: Enable VLAN support powerpc/configs/64s: Enable BLK_DEV_NVME powerpc/configs/64s: Drop REISERFS powerpc/configs/64s: Use SHA512 for module signatures powerpc/configs/64s: Enable IO_STRICT_DEVMEM powerpc/configs/64s: Enable SCHEDSTATS powerpc/configs/64s: Enable DEBUG_VM & other options powerpc/configs/64s: Enable EMULATED_STATS powerpc/configs/64s: Enable KUNIT and most tests ...	2023-04-28 16:24:32 -07:00
Linus Torvalds	22b8cc3e78	Add support for new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRK/WIACgkQaDWVMHDJ krAL+RAAw33EhsWyYVkeAtYmYBKkGvlgeSDULtfJKe5bynJBTHkGKfM6RE9MSJIt 5fHWaConGh8HNpy0Us1sDvd/aWcWRm5h7ZcCVD+R4qrgh/vc7ULzM+elXe5jzr4W cyuTckF2eW6SVrYg6fH5q+6Uy/moDtrdkLRvwRBf+AYeepB8gvSSH5XixKDNiVBE pjNy1xXVZQokqD4tjsFelmLttyacR5OabiE/aeVNoFYf9yTwfnN8N3T6kwuOoS4l Lp6NA+/0ux+oBlR+Is+JJG8Mxrjvz96yJGZYdR2YP5k3bMQtHAAjuq2w+GgqZm5i j3/E6KQepEGaCfC+bHl68xy/kKx8ik+jMCEcBalCC25J3uxbLz41g6K3aI890wJn +5ZtfcmoDUk9pnUyLxR8t+UjOSBFAcRSUE+FTjUH1qEGsMPK++9a4iLXz5vYVK1+ +YCt1u5LNJbkDxE8xVX3F5jkXh0G01SJsuUVAOqHSNfqSNmohFK8/omqhVRrRqoK A7cYLtnOGiUXLnvjrwSxPNOzRrG+GAwqaw8gwOTaYogETWbTY8qsSCEVl204uYwd m8io9rk2ZXUdDuha56xpBbPE0JHL9hJ2eKCuPkfvRgJT9YFyTh+e0UdX20k+nDjc ang1S350o/Y0sus6rij1qS8AuxJIjHucG0GdgpZk3KUbcxoRLhI= =qitk -----END PGP SIGNATURE----- Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()	2023-04-28 09:43:49 -07:00
Linus Torvalds	556eb8b791	Driver core changes for 6.4-rc1 Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5 LEGadNS38k5fs+73UaxV =7K4B -----END PGP SIGNATURE----- Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[\|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ...	2023-04-27 11:53:57 -07:00
K V P, Satyanarayana	6467d0740a	vfio/pci: Add DVSEC PCI Extended Config Capability to user visible list. The Designated Vendor-Specific Extended Capability (DVSEC Capability) is an optional Extended Capability that is permitted to be implemented by any PCI Express Function. This allows PCI Express component vendors to use the Extended Capability mechanism to expose vendor-specific registers that can be present in components by a variety of vendors. A DVSEC Capability structure can tell vendor-specific software which features a particular component supports. An example usage of DVSEC is Intel Platform Monitoring Technology (PMT) for enumerating and accessing hardware monitoring capabilities on a device. PMT encompasses three device monitoring features, Telemetry (device metrics), Watcher (sampling/tracing), and Crashlog. The DVSEC is used to discover these features and provide a BAR offset to their registers with the Intel vendor code. The current VFIO driver does not pass DVSEC capabilities to Virtual Machine (VM) which makes PMT not to work inside the virtual machine. This series adds DVSEC capability to user visible list to allow its use with VFIO. VFIO supports passing of Vendor Specific Extended Capability (VSEC) and raw write access to device. DVSEC also passed to VM in the same way as of VSEC. Signed-off-by: K V P Satyanarayana <satyanarayana.k.v.p@intel.com> Link: https://lore.kernel.org/r/20230317082222.3355912-1-satyanarayana.k.v.p@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-04-14 14:03:07 -06:00
Jason Gunthorpe	692d42d411	Merge branch 'iommufd/for-rc' into for-next The following selftest patch requires both the bug fixes and the improvements of the selftest framework. * iommufd/for-rc: iommufd: Do not corrupt the pfn list when doing batch carry iommufd: Fix unpinning of pages when an access is present iommufd: Check for uptr overflow Linux 6.3-rc5 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-04-04 11:04:30 -03:00
Greg Kroah-Hartman	cd8fe5b6db	Merge 6.3-rc5 into driver-core-next We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-04-03 09:33:30 +02:00
Yi Liu	7d12578c5d	vfio: Check the presence for iommufd callbacks in __vfio_register_dev() After making the no-DMA drivers (samples/vfio-mdev) providing iommufd callbacks, __vfio_register_dev() should check the presence of the iommufd callbacks if CONFIG_IOMMUFD is enabled. Link: https://lore.kernel.org/r/20230327093351.44505-7-yi.l.liu@intel.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:32 -03:00
Yi Liu	0a782d15e1	vfio/mdev: Uses the vfio emulated iommufd ops set in the mdev sample drivers This harmonizes the no-DMA devices (the vfio-mdev sample drivers) with the emulated devices (gvt-g, vfio-ap etc.). It makes it easier to add BIND_IOMMUFD user interface which requires to return an iommufd ID to represent the device/iommufd bond. Link: https://lore.kernel.org/r/20230327093351.44505-6-yi.l.liu@intel.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:32 -03:00
Yi Liu	632fda7f91	vfio-iommufd: Make vfio_iommufd_emulated_bind() return iommufd_access ID vfio device cdev needs to return iommufd_access ID to userspace if bind_iommufd succeeds. Link: https://lore.kernel.org/r/20230327093351.44505-5-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:32 -03:00
Yi Liu	4508a533fc	vfio-iommufd: No need to record iommufd_ctx in vfio_device iommufd_ctx is stored in vfio_device for emulated devices per bind_iommufd. However, as iommufd_access is created in bind, no more need to stored it since iommufd_access implicitly stores it. Link: https://lore.kernel.org/r/20230327093351.44505-4-yi.l.liu@intel.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:31 -03:00
Nicolin Chen	54b47585db	iommufd: Create access in vfio_iommufd_emulated_bind() There are needs to created iommufd_access prior to have an IOAS and set IOAS later. Like the vfio device cdev needs to have an iommufd object to represent the bond (iommufd_access) and IOAS replacement. Moves the iommufd_access_create() call into vfio_iommufd_emulated_bind(), making it symmetric with the __vfio_iommufd_access_destroy() call in the vfio_iommufd_emulated_unbind(). This means an access is created/destroyed by the bind()/unbind(), and the vfio_iommufd_emulated_attach_ioas() only updates the access->ioas pointer. Since vfio_iommufd_emulated_bind() does not provide ioas_id, drop it from the argument list of iommufd_access_create(). Instead, add a new access API iommufd_access_attach() to set the access->ioas pointer. Also, set vdev->iommufd_attached accordingly, similar to the physical pathway. Link: https://lore.kernel.org/r/20230327093351.44505-3-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:31 -03:00
Greg Kroah-Hartman	1aaba11da9	driver core: class: remove module * from class_create() The module pointer in class_create() never actually did anything, and it shouldn't have been requred to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-17 15:16:33 +01:00
Kirill A. Shutemov	428e106ae1	mm: Introduce untagged_addr_remote() untagged_addr() removes tags/metadata from the address and brings it to the canonical form. The helper is implemented on arm64 and sparc. Both of them do untagging based on global rules. However, Linear Address Masking (LAM) on x86 introduces per-process settings for untagging. As a result, untagged_addr() is now only suitable for untagging addresses for the current proccess. The new helper untagged_addr_remote() has to be used when the address targets remote process. It requires the mmap lock for target mm to be taken. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com	2023-03-16 13:08:39 -07:00
Alexey Kardashevskiy	a940904443	powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to the generic VFIO which broke POWER: - a coherency capability check; - blocking IOMMU domain - iommu_group_dma_owner_claimed()/... This adds a simple iommu_ops which reports support for cache coherency and provides a basic support for blocking domains. No other domain types are implemented so the default domain is NULL. Since now iommu_ops controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem, iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why explicit iommu_probe_device() is still needed. The previous discussion is here: https://lore.kernel.org/r/20220707135552.3688927-1-aik@ozlabs.ru/ https://lore.kernel.org/r/20220701061751.1955857-1-aik@ozlabs.ru/ Fixes: `e8ae0e140c` ("vfio: Require that devices support DMA cache coherence") Fixes: `70693f4708` ("vfio: Set DMA ownership for VFIO devices") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> [mpe: Fix CONFIG_IOMMU_API=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/2000135730.16998523.1678123860135.JavaMail.zimbra@raptorengineeringinc.com	2023-03-15 00:51:46 +11:00
Alexey Kardashevskiy	9d67c94335	powerpc/iommu: Add "borrowing" iommu_table_group_ops PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: control the ownership, create/set/unset a table the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts table_group.tce32_start and as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> [mpe: Fix CONFIG_IOMMU_API=n build (skiroot_defconfig), & formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/525438831.16998517.1678123820075.JavaMail.zimbra@raptorengineeringinc.com	2023-03-14 23:36:27 +11:00
Yishai Hadas	4928f67bc9	vfio/mlx5: Fix the report of dirty_bytes upon pre-copy Fix the report of dirty_bytes upon pre-copy to include both the existing data on the migration file and the device extra bytes. This gives a better close estimation to what can be passed any more as part of pre-copy. Fixes: `0dce165b1a` ("vfio/mlx5: Introduce vfio precopy ioctl implementation") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230308155723.108218-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-03-13 12:50:59 -06:00
Linus Torvalds	cac85e4616	VFIO updates for v6.3-rc1 - Remove redundant resource check in vfio-platform. (Angus Chen) - Use GFP_KERNEL_ACCOUNT for persistent userspace allocations, allowing removal of arbitrary kernel limits in favor of cgroup control. (Yishai Hadas) - mdev tidy-ups, including removing the module-only build restriction for sample drivers, Kconfig changes to select mdev support, documentation movement to keep sample driver usage instructions with sample drivers rather than with API docs, remove references to out-of-tree drivers in docs. (Christoph Hellwig) - Fix collateral breakages from mdev Kconfig changes. (Arnd Bergmann) - Make mlx5 migration support match device support, improve source and target flows to improve pre-copy support and reduce downtime. (Yishai Hadas) - Convert additional mdev sysfs case to use sysfs_emit(). (Bo Liu) - Resolve copy-paste error in mdev mbochs sample driver Kconfig. (Ye Xingchen) - Avoid propagating missing reset error in vfio-platform if reset requirement is relaxed by module option. (Tomasz Duszynski) - Range size fixes in mlx5 variant driver for missed last byte and stricter range calculation. (Yishai Hadas) - Fixes to suspended vaddr support and locked_vm accounting, excluding mdev configurations from the former due to potential to indefinitely block kernel threads, fix underflow and restore locked_vm on new mm. (Steve Sistare) - Update outdated vfio documentation due to new IOMMUFD interfaces in recent kernels. (Yi Liu) - Resolve deadlock between group_lock and kvm_lock, finally. (Matthew Rosato) - Fix NULL pointer in group initialization error path with IOMMUFD. (Yan Zhao) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmP5GC0bHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiGoMP/Ajgc05dq2HGt0ZdTj3d /2fgFa/8GXv9t/Md4neHkvKppeHsyL6R9s/OlGb2zQMrZ9wTurW5s4pW4fLIcpNV v1vyQSLYMCtj/FT3kG38fZdJwF9NGnC+B+bY4ak+V2rWaKs2vT6fUG6YpzxuBU3T jRD41frtszXIp3i8bIPfaoKt/SydUrx12UJAKSks4eDM4aOlxKhpc3VB1vwaSmHB MgZMRPVQOGUubKJWb3u07tYOd8NHpBpD3HVUb8IlB2//tSqSPgq3GaKr/B25YzH+ 192vgGrm19aKYQ4U0KPLSH4QGG01bia4LqArbVAhBMwzgKK1dE24dk2YBVj+yePx 5XXHWv85gLpkev5aLAxsN75/qCtwhYYYB9vBohp8jhXjQU1GXdj9DAht5+c5I3sk SZcczmtuZ10X2XXT7fA5iRsG7o3Uxg1VikxYLT0Zhu/0DLc+wQrvum+mmu3sKscx qcJyTQXhNTDFzBRRTw6KdyCShbG9gFITysf9Xw/n2y3bxzlfy3Ttf617auYFv6fQ ed3kGiT+S16U/dr2b99qQZyn1eIbzOSkz/oWOXwvCWoBdPTEks9f7pDn9Kk6O641 8tf7qj3vpkOccg71EbVCF6JV5JrhtXDOJVzWIkfQWkoi7qI4ONZ/EdEGTnWY77RY urbhuR4UO1iG0nX+yQIFXhDR =QqPa -----END PGP SIGNATURE----- Merge tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Remove redundant resource check in vfio-platform (Angus Chen) - Use GFP_KERNEL_ACCOUNT for persistent userspace allocations, allowing removal of arbitrary kernel limits in favor of cgroup control (Yishai Hadas) - mdev tidy-ups, including removing the module-only build restriction for sample drivers, Kconfig changes to select mdev support, documentation movement to keep sample driver usage instructions with sample drivers rather than with API docs, remove references to out-of-tree drivers in docs (Christoph Hellwig) - Fix collateral breakages from mdev Kconfig changes (Arnd Bergmann) - Make mlx5 migration support match device support, improve source and target flows to improve pre-copy support and reduce downtime (Yishai Hadas) - Convert additional mdev sysfs case to use sysfs_emit() (Bo Liu) - Resolve copy-paste error in mdev mbochs sample driver Kconfig (Ye Xingchen) - Avoid propagating missing reset error in vfio-platform if reset requirement is relaxed by module option (Tomasz Duszynski) - Range size fixes in mlx5 variant driver for missed last byte and stricter range calculation (Yishai Hadas) - Fixes to suspended vaddr support and locked_vm accounting, excluding mdev configurations from the former due to potential to indefinitely block kernel threads, fix underflow and restore locked_vm on new mm (Steve Sistare) - Update outdated vfio documentation due to new IOMMUFD interfaces in recent kernels (Yi Liu) - Resolve deadlock between group_lock and kvm_lock, finally (Matthew Rosato) - Fix NULL pointer in group initialization error path with IOMMUFD (Yan Zhao) * tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio: (32 commits) vfio: Fix NULL pointer dereference caused by uninitialized group->iommufd docs: vfio: Update vfio.rst per latest interfaces vfio: Update the kdoc for vfio_device_ops vfio/mlx5: Fix range size calculation upon tracker creation vfio: no need to pass kvm pointer during device open vfio: fix deadlock between group lock and kvm lock vfio: revert "iommu driver notify callback" vfio/type1: revert "implement notify callback" vfio/type1: revert "block on invalid vaddr" vfio/type1: restore locked_vm vfio/type1: track locked_vm per dma vfio/type1: prevent underflow of locked_vm via exec() vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR vfio: platform: ignore missing reset if disabled at module init vfio/mlx5: Improve the target side flow to reduce downtime vfio/mlx5: Improve the source side flow upon pre_copy vfio/mlx5: Check whether VF is migratable samples: fix the prompt about SAMPLE_VFIO_MDEV_MBOCHS vfio/mdev: Use sysfs_emit() to instead of sprintf() vfio-mdev: add back CONFIG_VFIO dependency ...	2023-02-25 11:52:57 -08:00
Linus Torvalds	143c7bc649	iommufd for 6.3 Some polishing and small fixes for iommufd: - Remove IOMMU_CAP_INTR_REMAP, instead rely on the interrupt subsystem - Use GFP_KERNEL_ACCOUNT inside the iommu_domains - Support VFIO_NOIOMMU mode with iommufd - Various typos - A list corruption bug if HWPTs are used for attach -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY/TgzQAKCRCFwuHvBreF Ya3AAP4/WxTJIbDvtTyH3Fae3NxTdO8j8gsUvU1vrRYG83zdnAEAxd1yii7GEO8D crkeq9D4FUiPAkFnJ64Exw2FHb060Qg= =RABK -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "Some polishing and small fixes for iommufd: - Remove IOMMU_CAP_INTR_REMAP, instead rely on the interrupt subsystem - Use GFP_KERNEL_ACCOUNT inside the iommu_domains - Support VFIO_NOIOMMU mode with iommufd - Various typos - A list corruption bug if HWPTs are used for attach" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: iommufd: Do not add the same hwpt to the ioas->hwpt_list twice iommufd: Make sure to zero vfio_iommu_type1_info before copying to user vfio: Support VFIO_NOIOMMU with iommufd iommufd: Add three missing structures in ucmd_buffer selftests: iommu: Fix test_cmd_destroy_access() call in user_copy iommu: Remove IOMMU_CAP_INTR_REMAP irq/s390: Add arch_is_isolated_msi() for s390 iommu/x86: Replace IOMMU_CAP_INTR_REMAP with IRQ_DOMAIN_FLAG_ISOLATED_MSI genirq/msi: Rename IRQ_DOMAIN_MSI_REMAP to IRQ_DOMAIN_ISOLATED_MSI genirq/irqdomain: Remove unused irq_domain_check_msi_remap() code iommufd: Convert to msi_device_has_isolated_msi() vfio/type1: Convert to iommu_group_has_isolated_msi() iommu: Add iommu_group_has_isolated_msi() genirq/msi: Add msi_device_has_isolated_msi()	2023-02-24 14:34:12 -08:00
Linus Torvalds	a13de74e47	IOMMU Updates for Linux v6.3: Including: - Consolidate iommu_map/unmap functions. There have been blocking and atomic variants so far, but that was problematic as this approach does not scale with required new variants which just differ in the GFP flags used. So Jason consolidated this back into single functions that take a GFP parameter. This has the potential to cause conflicts with other trees, as they introduce new call-sites for the changed functions. I offered them to pull in the branch containing these changes and resolve it, but I am not sure everyone did that. The conflicts this caused with upstream up to v6.2-rc8 are resolved in the final merge commit. - Retire the detach_dev() call-back in iommu_ops - Arm SMMU updates from Will: - Device-tree binding updates: * Cater for three power domains on SM6375 * Document existing compatible strings for Qualcomm SoCs * Tighten up clocks description for platform-specific compatible strings - Enable Qualcomm workarounds for some additional platforms that need them - Intel VT-d updates from Lu Baolu: - Add Intel IOMMU performance monitoring support - Set No Execute Enable bit in PASID table entry - Two performance optimizations - Fix PASID directory pointer coherency - Fix missed rollbacks in error path - Cleanups - Apple t8110 DART support - Exynos IOMMU: - Implement better fault handling - Error handling fixes - Renesas IPMMU: - Add device tree bindings for r8a779g0 - AMD IOMMU: - Various fixes for handling on SNP-enabled systems and handling of faults with unknown request-ids - Cleanups and other small fixes - Various other smaller fixes and cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmP0hDwACgkQK/BELZcB GuM43RAA0YieShO+X0h6TFGfbK0zVoPd91giZehWBv9rHK7pP4iY8UEtBLBWGx/t CId4t98mmKmC212zz8QxrwAEzyTIRY+2t1yrpG2aVkoTYk8inMb07TU37wganh3O T0QccXN+9b2BS4k8yro5f3uX0d/C1JQVcMowwr53VMb/e73huqP1VTbz06/CIWMH DUhVRCzmNhSvoUOT5n7g6+ZDH+pot8WPZbtHV7FowEsmPCRc7Fj8kXyI9FEwKwrZ hIV5Y+6Lej8nQScgbO8MfblJym3VrBoSoM4GY2w0L0rjQw6m+Xtea5rT0W39YVWy YpiscLTL8TIMPP9zK1dXVygTaABK4J2iWmheHPkpKXIhK0iuH3Dke0Do5p6DNITj 7J2YlaNEB480D5hvNBKsbbGHavgGPT8m529Sz0R7mSC7omRzqiG5Vsb46IXL+2bc 92ojjYNfXb6OCtagIr2LMBLZRL2JCODqF1dUmyZfA8GKOHLP5kZXoMM+sZbQ2aUL 1LOxRZVx+tlb9V4VaH1ZSs/6eM+HLDzjtHeu3PoWYf6mW4AEt4S/yl9SKAkGdBqt jCUErmYB1nU/eefqG1jhWRpQeJabcT3Oe30NZru1pfMoREThhjbAACw1JxWtoe1X ipGpV6lAP7tQUGuRk3/9O1lNqElJuNwC5lVTjS4FJ38vYQhQbao= =ZaZV -----END PGP SIGNATURE----- Merge tag 'iommu-updates-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Consolidate iommu_map/unmap functions. There have been blocking and atomic variants so far, but that was problematic as this approach does not scale with required new variants which just differ in the GFP flags used. So Jason consolidated this back into single functions that take a GFP parameter. - Retire the detach_dev() call-back in iommu_ops - Arm SMMU updates from Will: - Device-tree binding updates: - Cater for three power domains on SM6375 - Document existing compatible strings for Qualcomm SoCs - Tighten up clocks description for platform-specific compatible strings - Enable Qualcomm workarounds for some additional platforms that need them - Intel VT-d updates from Lu Baolu: - Add Intel IOMMU performance monitoring support - Set No Execute Enable bit in PASID table entry - Two performance optimizations - Fix PASID directory pointer coherency - Fix missed rollbacks in error path - Cleanups - Apple t8110 DART support - Exynos IOMMU: - Implement better fault handling - Error handling fixes - Renesas IPMMU: - Add device tree bindings for r8a779g0 - AMD IOMMU: - Various fixes for handling on SNP-enabled systems and handling of faults with unknown request-ids - Cleanups and other small fixes - Various other smaller fixes and cleanups * tag 'iommu-updates-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (71 commits) iommu/amd: Skip attach device domain is same as new domain iommu: Attach device group to old domain in error path iommu/vt-d: Allow to use flush-queue when first level is default iommu/vt-d: Fix PASID directory pointer coherency iommu/vt-d: Avoid superfluous IOTLB tracking in lazy mode iommu/vt-d: Fix error handling in sva enable/disable paths iommu/amd: Improve page fault error reporting iommu/amd: Do not identity map v2 capable device when snp is enabled iommu: Fix error unwind in iommu_group_alloc() iommu/of: mark an unused function as __maybe_unused iommu: dart: DART_T8110_ERROR range should be 0 to 5 iommu/vt-d: Enable IOMMU perfmon support iommu/vt-d: Add IOMMU perfmon overflow handler support iommu/vt-d: Support cpumask for IOMMU perfmon iommu/vt-d: Add IOMMU perfmon support iommu/vt-d: Support Enhanced Command Interface iommu/vt-d: Retrieve IOMMU perfmon capability information iommu/vt-d: Support size of the register set in DRHD iommu/vt-d: Set No Execute Enable bit in PASID table entry iommu/vt-d: Remove sva from intel_svm_dev ...	2023-02-24 13:40:13 -08:00
Linus Torvalds	3822a7c409	- Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K DmxHkn0LAitGgJRS/W9w81yrgig9tAQ= =MlGs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...	2023-02-23 17:09:35 -08:00
Yan Zhao	d649c34cb9	vfio: Fix NULL pointer dereference caused by uninitialized group->iommufd group->iommufd is not initialized for the iommufd_ctx_put() [20018.331541] BUG: kernel NULL pointer dereference, address: 0000000000000000 [20018.377508] RIP: 0010:iommufd_ctx_put+0x5/0x10 [iommufd] ... [20018.476483] Call Trace: [20018.479214] <TASK> [20018.481555] vfio_group_fops_unl_ioctl+0x506/0x690 [vfio] [20018.487586] __x64_sys_ioctl+0x6a/0xb0 [20018.491773] ? trace_hardirqs_on+0xc5/0xe0 [20018.496347] do_syscall_64+0x67/0x90 [20018.500340] entry_SYSCALL_64_after_hwframe+0x4b/0xb5 Fixes: `9eefba8002` ("vfio: Move vfio group specific code into group.c") Cc: stable@vger.kernel.org Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230222074938.13681-1-yan.y.zhao@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-22 10:56:37 -07:00
Jason Gunthorpe	939204e4df	Linux 6.2 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPyoZYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGcE0H/1imH5XOfowBdPQU p06pCJGKQyEsGnn+kXd7UXes9N/uZFQgOzY9sFspS1ZpXfm60zDcWCeJT2l3qatK dtmAGxTEBeZJ8JuevtBiedWy9pJPpvMsfeZd85XzGDRxNUnGT5HgU0/98NpIjysb 9HTPrpJO9HlmoAKkFDu+Z/kLJp+obns1yQOCH5glOREsPY+4SX76bjPjrbSic0oj oDSSBpM2gfdwHWnOKkXhgNuu8zr+hS3LaU1HMj6Kgy3Huz2NjGlgXrRpzutTHEmT cmt3Dl5hdIeUtMCt8LbQcngjTg/rX11rFdWaOp/MOuD6U7cqTCWeEDyVsPicFehH wdsIfgw= =+SoL -----END PGP SIGNATURE----- Merge tag 'v6.2' into iommufd.git for-next Resolve conflicts from the signature change in iommu_map: - drivers/infiniband/hw/usnic/usnic_uiom.c Switch iommu_map_atomic() to iommu_map(.., GFP_ATOMIC) - drivers/vfio/vfio_iommu_type1.c Following indenting change for GFP_KERNEL Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-21 11:11:03 -04:00
Joerg Roedel	bedd29d793	Merge branches 'apple/dart', 'arm/exynos', 'arm/renesas', 'arm/smmu', 'x86/vt-d', 'x86/amd' and 'core' into next	2023-02-18 15:43:04 +01:00
Suren Baghdasaryan	1c71222e5f	mm: replace vma->vm_flags direct modifications with modifier calls Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:39 -08:00
Yishai Hadas	ce06a7000f	vfio/mlx5: Fix range size calculation upon tracker creation Fix range size calculation to include the last byte of each range. In addition, log round up the length of the total ranges to be stricter. Fixes: `c1d050b0d1` ("vfio/mlx5: Create and destroy page tracker object") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230208152234.32370-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:43:06 -07:00
Matthew Rosato	b0d2d5697e	vfio: no need to pass kvm pointer during device open Nothing uses this value during vfio_device_open anymore so it's safe to remove it. Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230203215027.151988-3-mjrosato@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:41:25 -07:00
Matthew Rosato	2b48f52f2b	vfio: fix deadlock between group lock and kvm lock After `51cdc8bc12`, we have another deadlock scenario between the kvm->lock and the vfio group_lock with two different codepaths acquiring the locks in different order. Specifically in vfio_open_device, vfio holds the vfio group_lock when issuing device->ops->open_device but some drivers (like vfio-ap) need to acquire kvm->lock during their open_device routine; Meanwhile, kvm_vfio_release will acquire the kvm->lock first before calling vfio_file_set_kvm which will acquire the vfio group_lock. To resolve this, let's remove the need for the vfio group_lock from the kvm_vfio_release codepath. This is done by introducing a new spinlock to protect modifications to the vfio group kvm pointer, and acquiring a kvm ref from within vfio while holding this spinlock, with the reference held until the last close for the device in question. Fixes: `51cdc8bc12` ("kvm/vfio: Fix potential deadlock on vfio group_lock") Reported-by: Anthony Krowiak <akrowiak@linux.ibm.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230203215027.151988-2-mjrosato@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:41:25 -07:00
Steve Sistare	e592296cd6	vfio: revert "iommu driver notify callback" Revert this dead code: commit `ec5e32940c` ("vfio: iommu driver notify callback") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-8-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Steve Sistare	a5ac1f8165	vfio/type1: revert "implement notify callback" This is dead code. Revert it. commit `487ace1340` ("vfio/type1: implement notify callback") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-7-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Steve Sistare	da4f1c2e1c	vfio/type1: revert "block on invalid vaddr" Revert this dead code: commit `898b9eaeb3` ("vfio/type1: block on invalid vaddr") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-6-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Steve Sistare	90fdd158a6	vfio/type1: restore locked_vm When a vfio container is preserved across exec or fork-exec, the new task's mm has a locked_vm count of 0. After a dma vaddr is updated using VFIO_DMA_MAP_FLAG_VADDR, locked_vm remains 0, and the pinned memory does not count against the task's RLIMIT_MEMLOCK. To restore the correct locked_vm count, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed, add the dma's locked_vm count to the new mm->locked_vm, subject to the rlimit, and subtract it from the old mm->locked_vm. Fixes: `c3cbab24db` ("vfio/type1: implement interfaces to update vaddr") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-5-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Steve Sistare	18e292705b	vfio/type1: track locked_vm per dma Track locked_vm per dma struct, and create a new subroutine, both for use in a subsequent patch. No functional change. Fixes: `c3cbab24db` ("vfio/type1: implement interfaces to update vaddr") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-4-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Steve Sistare	046eca5018	vfio/type1: prevent underflow of locked_vm via exec() When a vfio container is preserved across exec, the task does not change, but it gets a new mm with locked_vm=0, and loses the count from existing dma mappings. If the user later unmaps a dma mapping, locked_vm underflows to a large unsigned value, and a subsequent dma map request fails with ENOMEM in __account_locked_vm. To avoid underflow, grab and save the mm at the time a dma is mapped. Use that mm when adjusting locked_vm, rather than re-acquiring the saved task's mm, which may have changed. If the saved mm is dead, do nothing. locked_vm is incremented for existing mappings in a subsequent patch. Fixes: `73fa0d10d0` ("vfio: Type1 IOMMU implementation") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-3-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Steve Sistare	ef3a3f6a29	vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR Disable the VFIO_UPDATE_VADDR capability if mediated devices are present. Their kernel threads could be blocked indefinitely by a misbehaving userland while trying to pin/unpin pages while vaddrs are being updated. Do not allow groups to be added to the container while vaddr's are invalid, so we never need to block user threads from pinning, and can delete the vaddr-waiting code in a subsequent patch. Fixes: `c3cbab24db` ("vfio/type1: implement interfaces to update vaddr") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-2-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-09 11:39:14 -07:00
Jason Gunthorpe	bed9e516f1	Merge branch 'vfio-no-iommu' into iommufd.git for-next Shared branch with VFIO for the no-iommu support. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-03 15:55:49 -04:00
Jason Gunthorpe	c9a397cee9	vfio: Support VFIO_NOIOMMU with iommufd Add a small amount of emulation to vfio_compat to accept the SET_IOMMU to VFIO_NOIOMMU_IOMMU and have vfio just ignore iommufd if it is working on a no-iommu enabled device. Move the enable_unsafe_noiommu_mode module out of container.c into vfio_main.c so that it is always available even if VFIO_CONTAINER=n. This passes Alex's mini-test: https://github.com/awilliam/tests/blob/master/vfio-noiommu-pci-device-open.c Link: https://lore.kernel.org/r/0-v3-480cd64a16f7+1ad0-iommufd_noiommu_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-03 15:45:23 -04:00
Tomasz Duszynski	168a9c91fe	vfio: platform: ignore missing reset if disabled at module init If reset requirement was relaxed via module parameter, errors caused by missing reset should not be propagated down to the vfio core. Otherwise initialization will fail. Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com> Fixes: `5f6c7e0831` ("vfio/platform: Use the new device life cycle helpers") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20230131083349.2027189-1-tduszynski@marvell.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-02-01 12:25:41 -07:00
Yishai Hadas	f4f0c25e5d	vfio/mlx5: Improve the target side flow to reduce downtime Improve the target side flow to reduce downtime as of below. - Support reading an optional record which includes the expected stop_copy size. - Once the source sends this record data, which expects to be sent as part of the pre_copy flow, prepare the data buffers that may be large enough to hold the final stop_copy data. The above reduces the migration downtime as the relevant stuff that is needed to load the image data is prepared ahead as part of pre_copy. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-30 12:16:15 -07:00
Yishai Hadas	b04e2e86e9	vfio/mlx5: Improve the source side flow upon pre_copy Improve the source side flow upon pre_copy as of below. - Prepare the stop_copy buffers as part of moving to pre_copy. - Send to the target a record that includes the expected stop_copy size to let it optimize its stop_copy flow as well. As for sending the target this new record type (i.e. MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE) we split the current 64 header flags bits into 32 flags bits and another 32 tag bits, each record may have a tag and a flag whether it's optional or mandatory. Optional records will be ignored in the target. The above reduces the downtime upon stop_copy as the relevant data stuff is prepared ahead as part of pre_copy. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-30 12:16:15 -07:00
Shay Drory	caf094b5a1	vfio/mlx5: Check whether VF is migratable Add a check whether VF is migratable. Only if VF is migratable, mark the VFIO device as migration capable. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-30 12:16:15 -07:00
Bo Liu	038ef0a476	vfio/mdev: Use sysfs_emit() to instead of sprintf() Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230129084117.2384-1-liubo03@inspur.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-30 12:16:13 -07:00
Jason Gunthorpe	fd9f2a9122	Merge branch 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu intoiommufd/for-next Jason Gunthorpe says: ==================== iommufd follows the same design as KVM and uses memory cgroups to limit the amount of kernel memory a iommufd file descriptor can pin down. The various internal data structures already use GFP_KERNEL_ACCOUNT to charge its own memory. However, one of the biggest consumers of kernel memory is the IOPTEs stored under the iommu_domain and these allocations are not tracked. This series is the first step in fixing it. The iommu driver contract already includes a 'gfp' argument to the map_pages op, allowing iommufd to specify GFP_KERNEL_ACCOUNT and then having the driver allocate the IOPTE tables with that flag will capture a significant amount of the allocations. Update the iommu_map() API to pass in the GFP argument, and fix all call sites. Replace iommu_map_atomic(). Audit the "enterprise" iommu drivers to make sure they do the right thing. Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are already correct. A follow up series will be needed to capture the allocations made when the iommu_domain itself is allocated, which will complete the job. ==================== * 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/s390: Use GFP_KERNEL in sleepable contexts iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s iommu/intel: Use GFP_KERNEL in sleepable contexts iommu/intel: Support the gfp argument to the map_pages op iommu/intel: Add a gfp parameter to alloc_pgtable_page() iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map() iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous() iommu: Add a gfp parameter to iommu_map_sg() iommu: Remove iommu_map_atomic() iommu: Add a gfp parameter to iommu_map() Link: https://lore.kernel.org/linux-iommu/0-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-01-30 13:54:35 -04:00
Jason Gunthorpe	1369459b2e	iommu: Add a gfp parameter to iommu_map() The internal mechanisms support this, but instead of exposting the gfp to the caller it wrappers it into iommu_map() and iommu_map_atomic() Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/1-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-01-25 11:52:00 +01:00
Yishai Hadas	7658aeda33	vfio/platform: Use GFP_KERNEL_ACCOUNT for userspace persistent allocations Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:30 -07:00
Yishai Hadas	4a6c971a06	vfio/fsl-mc: Use GFP_KERNEL_ACCOUNT for userspace persistent allocations Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:30 -07:00
Yishai Hadas	cb8285b89f	vfio/hisi: Use GFP_KERNEL_ACCOUNT for userspace persistent allocations Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:30 -07:00
Jason Gunthorpe	0886196ca8	vfio: Use GFP_KERNEL_ACCOUNT for userspace persistent allocations Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. The way to find the relevant allocations was for example to look at the close_device function and trace back all the kfrees to their allocations. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:29 -07:00
Yishai Hadas	83ff6095ec	vfio/mlx5: Allow loading of larger images than 512 MB Allow loading of larger images than 512 MB by dropping the arbitrary hard-coded value that we have today and move to use the max device loading value which is for now 4GB. As part of that we move to use the GFP_KERNEL_ACCOUNT option upon allocating the persistent data of mlx5 and rely on the cgroup to provide the memory limit for the given user. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:29 -07:00
Yishai Hadas	c9c4c070e0	vfio/mlx5: Fix UBSAN note Prevent calling roundup_pow_of_two() with value of 0 as it causes the below UBSAN note. Move this code and its few extra related lines to be called only when it's really applicable. UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 15 PID: 1639 Comm: live_migration Not tainted 6.1.0-rc4 #1116 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 ubsan_epilogue+0x5/0x36 __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef ? lock_is_held_type+0x98/0x110 ? rcu_read_lock_sched_held+0x3f/0x70 mlx5vf_create_rc_qp.cold+0xe4/0xf2 [mlx5_vfio_pci] mlx5vf_start_page_tracker+0x769/0xcd0 [mlx5_vfio_pci] vfio_device_fops_unl_ioctl+0x63f/0x700 [vfio] __x64_sys_ioctl+0x433/0x9a0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fixes: `79c3cf2799` ("vfio/mlx5: Init QP based resources for dirty tracking") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:29 -07:00
Christoph Hellwig	8bf8c5ee1f	vfio-mdev: turn VFIO_MDEV into a selectable symbol VFIO_MDEV is just a library with helpers for the drivers. Stop making it a user choice and just select it by the drivers that use the helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Link: https://lore.kernel.org/r/20230110091009.474427-3-hch@lst.de Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:29 -07:00
Angus Chen	7141790b5d	vfio: platform: No need to check res again In function vfio_platform_regions_init(),we did check res implied by using while loop, so no need to check whether res be null or not again. No functional change intended. Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20230107034721.2127-1-angus.chen@jaguarmicro.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-23 11:26:28 -07:00
Jason Gunthorpe	6b1a7a0042	vfio/type1: Convert to iommu_group_has_isolated_msi() Trivially use the new API. Link: https://lore.kernel.org/r/3-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-01-11 16:27:23 -04:00
Niklas Schnelle	895c0747f7	vfio/type1: Respect IOMMU reserved regions in vfio_test_domain_fgsp() Since commit `cbf7827bc5` ("iommu/s390: Fix potential s390_domain aperture shrinking") the s390 IOMMU driver uses reserved regions for the system provided DMA ranges of PCI devices. Previously it reduced the size of the IOMMU aperture and checked it on each mapping operation. On current machines the system denies use of DMA addresses below 2^32 for all PCI devices. Usually mapping IOVAs in a reserved regions is harmless until a DMA actually tries to utilize the mapping. However on s390 there is a virtual PCI device called ISM which is implemented in firmware and used for cross LPAR communication. Unlike real PCI devices this device does not use the hardware IOMMU but inspects IOMMU translation tables directly on IOTLB flush (s390 RPCIT instruction). If it detects IOVA mappings outside the allowed ranges it goes into an error state. This error state then causes the device to be unavailable to the KVM guest. Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA address 0 irrespective of the IOMMUs reserved regions. Even if usually harmless this seems wrong in the general case so instead go through the freshly updated IOVA list and try to find a range that isn't reserved, and fits 2 pages, is PAGE_SIZE * 2 aligned. If found use that for testing for fine grained super pages. Fixes: `af029169b8` ("vfio/type1: Check reserved region conflict and update iova list") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230110164427.4051938-2-schnelle@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-01-10 10:44:37 -07:00
Linus Torvalds	71a7507afb	Driver Core changes for 6.2-rc1 Here is the set of driver core and kernfs changes for 6.2-rc1. The "big" change in here is the addition of a new macro, container_of_const() that will preserve the "const-ness" of a pointer passed into it. The "problem" of the current container_of() macro is that if you pass in a "const ", out of it can comes a non-const pointer unless you specifically ask for it. For many usages, we want to preserve the "const" attribute by using the same call. For a specific example, this series changes the kobj_to_dev() macro to use it, allowing it to be used no matter what the const value is. This prevents every subsystem from having to declare 2 different individual macros (i.e. kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce the const value at build time, which having 2 macros would not do either. The driver for all of this have been discussions with the Rust kernel developers as to how to properly mark driver core, and kobject, objects as being "non-mutable". The changes to the kobject and driver core in this pull request are the result of that, as there are lots of paths where kobjects and device pointers are not modified at all, so marking them as "const" allows the compiler to enforce this. So, a nice side affect of the Rust development effort has been already to clean up the driver core code to be more obvious about object rules. All of this has been bike-shedded in quite a lot of detail on lkml with different names and implementations resulting in the tiny version we have in here, much better than my original proposal. Lots of subsystem maintainers have acked the changes as well. Other than this change, included in here are smaller stuff like: - kernfs fixes and updates to handle lock contention better - vmlinux.lds.h fixes and updates - sysfs and debugfs documentation updates - device property updates All of these have been in the linux-next tree for quite a while with no problems, OTHER than some merge issues with other trees that should be obvious when you hit them (block tree deletes a driver that this tree modifies, iommufd tree modifies code that this tree also touches). If there are merge problems with these trees, please let me know. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe 1Fp53boFaQkGBjl8mGF8 =v+FB -----END PGP SIGNATURE----- Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core and kernfs changes for 6.2-rc1. The "big" change in here is the addition of a new macro, container_of_const() that will preserve the "const-ness" of a pointer passed into it. The "problem" of the current container_of() macro is that if you pass in a "const ", out of it can comes a non-const pointer unless you specifically ask for it. For many usages, we want to preserve the "const" attribute by using the same call. For a specific example, this series changes the kobj_to_dev() macro to use it, allowing it to be used no matter what the const value is. This prevents every subsystem from having to declare 2 different individual macros (i.e. kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce the const value at build time, which having 2 macros would not do either. The driver for all of this have been discussions with the Rust kernel developers as to how to properly mark driver core, and kobject, objects as being "non-mutable". The changes to the kobject and driver core in this pull request are the result of that, as there are lots of paths where kobjects and device pointers are not modified at all, so marking them as "const" allows the compiler to enforce this. So, a nice side affect of the Rust development effort has been already to clean up the driver core code to be more obvious about object rules. All of this has been bike-shedded in quite a lot of detail on lkml with different names and implementations resulting in the tiny version we have in here, much better than my original proposal. Lots of subsystem maintainers have acked the changes as well. Other than this change, included in here are smaller stuff like: - kernfs fixes and updates to handle lock contention better - vmlinux.lds.h fixes and updates - sysfs and debugfs documentation updates - device property updates All of these have been in the linux-next tree for quite a while with no problems" * tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits) device property: Fix documentation for fwnode_get_next_parent() firmware_loader: fix up to_fw_sysfs() to preserve const usb.h: take advantage of container_of_const() device.h: move kobj_to_dev() to use container_of_const() container_of: add container_of_const() that preserves const-ness of the pointer driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion. driver core: fix up missed scsi/cxlflash class.devnode() conversion. driver core: fix up some missing class.devnode() conversions. driver core: make struct class.devnode() take a const * driver core: make struct class.dev_uevent() take a const * cacheinfo: Remove of_node_put() for fw_token device property: Add a blank line in Kconfig of tests device property: Rename goto label to be more precise device property: Move PROPERTY_ENTRY_BOOL() a bit down device property: Get rid of __PROPERTY_ENTRY_ARRAY_ELSIZE() kernfs: fix all kernel-doc warnings and multiple typos driver core: pass a const * into of_device_uevent() kobject: kset_uevent_ops: make name() callback take a const * kobject: kset_uevent_ops: make filter() callback take a const * kobject: make kobject_namespace take a const * ...	2022-12-16 03:54:54 -08:00
Linus Torvalds	785d21ba2f	VFIO updates for v6.2-rc1 - Replace deprecated git://github.com link in MAINTAINERS. (Palmer Dabbelt) - Simplify vfio/mlx5 with module_pci_driver() helper. (Shang XiaoJing) - Drop unnecessary buffer from ACPI call. (Rafael Mendonca) - Correct latent missing include issue in iova-bitmap and fix support for unaligned bitmaps. Follow-up with better fix through refactor. (Joao Martins) - Rework ccw mdev driver to split private data from parent structure, better aligning with the mdev lifecycle and allowing us to remove a temporary workaround. (Eric Farman) - Add an interface to get an estimated migration data size for a device, allowing userspace to make informed decisions, ex. more accurately predicting VM downtime. (Yishai Hadas) - Fix minor typo in vfio/mlx5 array declaration. (Yishai Hadas) - Simplify module and Kconfig through consolidating SPAPR/EEH code and config options and folding virqfd module into main vfio module. (Jason Gunthorpe) - Fix error path from device_register() across all vfio mdev and sample drivers. (Alex Williamson) - Define migration pre-copy interface and implement for vfio/mlx5 devices, allowing portions of the device state to be saved while the device continues operation, towards reducing the stop-copy state size. (Jason Gunthorpe, Yishai Hadas, Shay Drory) - Implement pre-copy for hisi_acc devices. (Shameer Kolothum) - Fixes to mdpy mdev driver remove path and error path on probe. (Shang XiaoJing) - vfio/mlx5 fixes for incorrect return after copy_to_user() fault and incorrect buffer freeing. (Dan Carpenter) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmObfPgbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiDogP/i9GuBKposvZpnfxXWwo oNpKBZSOVMW8wgavNEuryMb+9WoouIghce8XU49MmONoP26kIh5TA14Zpi3XWkLK K+NlpwicESvLeZVHU7f3R8meVqmPtlxIi59jE+CfEHB8BW2HIAsEdwdhkxMwus9C nuiiK/2YYyQWOXYc4LAIkspMzjtGPy6Im5P6AED+dI+TFCEqJAM5qgOLJZFlk4a/ WwZY2xjVKOl6xf5VZXGw+v7fDgz2Ju+j4Bm3X5lx1HgiDrEH83MjXY5h67neAIVb bXrfNLN++MiuO5niGTFMbUjGVUIFxsfmJzBnL9QrLsuj0JrGEKsu/1JEO78g0Km0 ZCChoJ6UyUOgxt6evEymUAZAAkbcKaaht2gdbAXW71tv9p1TripAbBKwVeah1bQp SiHPqy9InKJlhaf+GbXL9eux1WVMfQ6FZccU16bNt7VaV2I8js85z/2gqVD0a5Mw +gnwp5XMUFWNKlJrnc7uVCD0bDExwQhr75OP4rWjMNvvLi9hPXJ2cI2Sg+9OLzQw vm/I+Df+FfXCuGAgX4Lxq76pqWlYGJH0Qxc14Ds6YoXqygBPz9yvTtuBv8mTHJzE KdAl/6DmZZxZ/JFD9lPF80KRiAsJ6iNf6tPTWES7hfDBfIdgQ/DZbXridLWJPNoi xLfaW19yrLTXWKSmR7G2Lsz4 =q9xs -----END PGP SIGNATURE----- Merge tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Replace deprecated git://github.com link in MAINTAINERS (Palmer Dabbelt) - Simplify vfio/mlx5 with module_pci_driver() helper (Shang XiaoJing) - Drop unnecessary buffer from ACPI call (Rafael Mendonca) - Correct latent missing include issue in iova-bitmap and fix support for unaligned bitmaps. Follow-up with better fix through refactor (Joao Martins) - Rework ccw mdev driver to split private data from parent structure, better aligning with the mdev lifecycle and allowing us to remove a temporary workaround (Eric Farman) - Add an interface to get an estimated migration data size for a device, allowing userspace to make informed decisions, ex. more accurately predicting VM downtime (Yishai Hadas) - Fix minor typo in vfio/mlx5 array declaration (Yishai Hadas) - Simplify module and Kconfig through consolidating SPAPR/EEH code and config options and folding virqfd module into main vfio module (Jason Gunthorpe) - Fix error path from device_register() across all vfio mdev and sample drivers (Alex Williamson) - Define migration pre-copy interface and implement for vfio/mlx5 devices, allowing portions of the device state to be saved while the device continues operation, towards reducing the stop-copy state size (Jason Gunthorpe, Yishai Hadas, Shay Drory) - Implement pre-copy for hisi_acc devices (Shameer Kolothum) - Fixes to mdpy mdev driver remove path and error path on probe (Shang XiaoJing) - vfio/mlx5 fixes for incorrect return after copy_to_user() fault and incorrect buffer freeing (Dan Carpenter) * tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio: (42 commits) vfio/mlx5: error pointer dereference in error handling vfio/mlx5: fix error code in mlx5vf_precopy_ioctl() samples: vfio-mdev: Fix missing pci_disable_device() in mdpy_fb_probe() hisi_acc_vfio_pci: Enable PRE_COPY flag hisi_acc_vfio_pci: Move the dev compatibility tests for early check hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions hisi_acc_vfio_pci: Add support for precopy IOCTL vfio/mlx5: Enable MIGRATION_PRE_COPY flag vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error vfio/mlx5: Introduce multiple loads vfio/mlx5: Consider temporary end of stream as part of PRE_COPY vfio/mlx5: Introduce vfio precopy ioctl implementation vfio/mlx5: Introduce SW headers for migration states vfio/mlx5: Introduce device transitions of PRE_COPY vfio/mlx5: Refactor to use queue based data chunks vfio/mlx5: Refactor migration file state vfio/mlx5: Refactor MKEY usage vfio/mlx5: Refactor PD usage vfio/mlx5: Enforce a single SAVE command at a time vfio: Extend the device migration protocol with PRE_COPY ...	2022-12-15 13:12:15 -08:00
Linus Torvalds	08cdc21579	iommufd for 6.2 iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY5ct7wAKCRCFwuHvBreF YZZ5AQDciXfcgXLt0UBEmWupNb0f/asT6tk717pdsKm8kAZMNAEAsIyLiKT5HqGl s7fAu+CQ1pr9+9NKGevD+frw8Solsw4= =jJkd -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd implementation from Jason Gunthorpe: "iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA" For more background, see the extended explanations in Jason's pull request: https://lore.kernel.org/lkml/Y5dzTU8dlmXTbzoJ@nvidia.com/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (62 commits) iommufd: Change the order of MSI setup iommufd: Improve a few unclear bits of code iommufd: Fix comment typos vfio: Move vfio group specific code into group.c vfio: Refactor dma APIs for emulated devices vfio: Wrap vfio group module init/clean code into helpers vfio: Refactor vfio_device open and close vfio: Make vfio_device_open() truly device specific vfio: Swap order of vfio_device_container_register() and open_device() vfio: Set device->group in helper function vfio: Create wrappers for group register/unregister vfio: Move the sanity check of the group to vfio_create_group() vfio: Simplify vfio_create_group() iommufd: Allow iommufd to supply /dev/vfio/vfio vfio: Make vfio_container optionally compiled vfio: Move container related MODULE_ALIAS statements into container.c vfio-iommufd: Support iommufd for emulated VFIO devices vfio-iommufd: Support iommufd for physical VFIO devices vfio-iommufd: Allow iommufd to be used in place of a container fd vfio: Use IOMMU_CAP_ENFORCE_CACHE_COHERENCY for vfio_file_enforced_coherent() ...	2022-12-14 09:15:43 -08:00
Dan Carpenter	70be6f3228	vfio/mlx5: error pointer dereference in error handling This code frees the wrong "buf" variable and results in an error pointer dereference. Fixes: `34e2f27143` ("vfio/mlx5: Introduce multiple loads") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/Y5IKia5SaiVxYmG5@kili Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-12 14:10:12 -07:00
Dan Carpenter	fe3dd71db2	vfio/mlx5: fix error code in mlx5vf_precopy_ioctl() The copy_to_user() function returns the number of bytes remaining to be copied but we want to return a negative error code here. Fixes: `0dce165b1a` ("vfio/mlx5: Introduce vfio precopy ioctl implementation") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/Y5IKVknlf5Z5NPtU@kili Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-12 14:10:12 -07:00
Linus Torvalds	9d33edb20f	Updates for the interrupt core and driver subsystem: - Core: The bulk is the rework of the MSI subsystem to support per device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device. IMS (Interrupt Message Store] is a new specification which allows device manufactures to provide implementation defined storage for MSI messages contrary to the uniform and specification defined storage mechanisms for PCI/MSI and PCI/MSI-X. IMS not only allows to overcome the size limitations of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device. There have been several attempts to glue this into the current MSI code, but after lengthy discussions it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background. When PCI/MSI[-X] support was added around 2003, interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific and while there were attempts to create common infrastructure the commonalities were rudimentary and just providing shared data structures and interfaces so that drivers could be written in an architecture agnostic way. The initial PCI/MSI[-X] support obviously plugged into this model which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors, which are a pure software construct for holding data relevant for a particular MSI interrupt, but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stranglers alive. In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel, which was creating yet another architecture specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an incomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation. At roughly the same time ARM struggled with the ever growing SoC specific extensions which were glued on top of the architected GIC interrupt controller. This resulted in a fundamental redesign of interrupt management and provided the today prevailing concept of hierarchical interrupt domains. This allowed to disentangle the interactions between x86 vector domain and interrupt remapping and also allowed ARM to handle the zoo of SoC specific interrupt components in a sane way. The concept of hierarchical interrupt domains aims to encapsulate the functionality of particular IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The X86 encapsulation looks like this: \|--- device 1 [Vector]---[Remapping]---[PCI/MSI]--\|... \|--- device N where the remapping domain is an optional component and in case that it is not available the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase where it is obviously required to establish the proper parent relation ship in the components of the hierarchy. While in most cases the model is strictly representing the chain of IP blocks and abstracting them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strict a per PCI device entity. Here we took a short cut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains which was possible because the PCI/MSI[-X] handling is uniform across the devices. This also allowed to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in turn made it simple to keep the existing architecture specific management alive. A similar problem was created in the ARM world with support for IP block specific message storage. Instead of going all the way to stack a IP block specific domain on top of the generic MSI domain this ended in a construct which provides a "global" platform MSI domain which allows overriding the irq_write_msi_msg() callback per allocation. In course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB etc. where support for implementation specific message storage was just mindlessly glued into the existing infrastructure. Some of this just works by chance on particular platforms but will fail in hard to diagnose ways when the driver is used on platforms where the underlying MSI interrupt management code does not expect the creative abuse. Another shortcoming of today's PCI/MSI-X support is the inability to allocate or free individual vectors after the initial enablement of MSI-X. This results in an works by chance implementation of VFIO (PCI pass-through) where interrupts on the host side are not set up upfront to avoid resource exhaustion. They are expanded at run-time when the guest actually tries to use them. The way how this is implemented is that the host disables MSI-X and then re-enables it with a larger number of vectors again. That works by chance because most device drivers set up all interrupts before the device actually will utilize them. But that's not universally true because some drivers allocate a large enough number of vectors but do not utilize them until it's actually required, e.g. for acceleration support. But at that point other interrupts of the device might be in active use and the MSI-X disable/enable dance can just result in losing interrupts and therefore hard to diagnose subtle problems. Last but not least the "global" PCI/MSI-X domain approach prevents to utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS is not longer providing a uniform storage and configuration model. The solution to this is to implement the missing step and switch from global PCI/MSI domains to per device PCI/MSI domains. The resulting hierarchy then looks like this: \|--- [PCI/MSI] device 1 [Vector]---[Remapping]---\|... \|--- [PCI/MSI] device N which in turn allows to provide support for multiple domains per device: \|--- [PCI/MSI] device 1 \|--- [PCI/IMS] device 1 [Vector]---[Remapping]---\|... \|--- [PCI/MSI] device N \|--- [PCI/IMS] device N This work converts the MSI and PCI/MSI core and the x86 interrupt domains to the new model, provides new interfaces for post-enable allocation/free of MSI-X interrupts and the base framework for PCI/IMS. PCI/IMS has been verified with the work in progress IDXD driver. There is work in progress to convert ARM over which will replace the platform MSI train-wreck. The cleanup of VFIO, NTB and other creative "solutions" are in the works as well. - Drivers: - Updates for the LoongArch interrupt chip drivers - Support for MTK CIRQv2 - The usual small fixes and updates all over the place -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUsygTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoYXiD/40tXKzCzf0qFIqUlZLia1N3RRrwrNC DVTixuLtR9MrjwE+jWLQILa85SHInV8syXHSd35SzhsGDxkURFGi+HBgVWmysODf br9VSh3Gi+kt7iXtIwAg8WNWviGNmS3kPksxCko54F0YnJhMY5r5bhQVUBQkwFG2 wES1C9Uzd4pdV2bl24Z+WKL85cSmZ+pHunyKw1n401lBABXnTF9c4f13zC14jd+y wDxNrmOxeL3mEH4Pg6VyrDuTOURSf3TjJjeEq3EYqvUo0FyLt9I/cKX0AELcZQX7 fkRjrQQAvXNj39RJfeSkojDfllEPUHp7XSluhdBu5aIovSamdYGCDnuEoZ+l4MJ+ CojIErp3Dwj/uSaf5c7C3OaDAqH2CpOFWIcrUebShJE60hVKLEpUwd6W8juplaoT gxyXRb1Y+BeJvO8VhMN4i7f3232+sj8wuj+HTRTTbqMhkElnin94tAx8rgwR1sgR BiOGMJi4K2Y8s9Rqqp0Dvs01CW4guIYvSR4YY+WDbbi1xgiev89OYs6zZTJCJe4Y NUwwpqYSyP1brmtdDdBOZLqegjQm+TwUb6oOaasFem4vT1swgawgLcDnPOx45bk5 /FWt3EmnZxMz99x9jdDn1+BCqAZsKyEbEY1avvhPVMTwoVIuSX2ceTBMLseGq+jM 03JfvdxnueM3gw== =9erA -----END PGP SIGNATURE----- Merge tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt core and driver subsystem: The bulk is the rework of the MSI subsystem to support per device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device. IMS (Interrupt Message Store] is a new specification which allows device manufactures to provide implementation defined storage for MSI messages (as opposed to PCI/MSI and PCI/MSI-X that has a specified message store which is uniform accross all devices). The PCI/MSI[-X] uniformity allowed us to get away with "global" PCI/MSI domains. IMS not only allows to overcome the size limitations of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device. There have been several attempts to glue this into the current MSI code, but after lengthy discussions it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background. When PCI/MSI[-X] support was added around 2003, interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific and while there were attempts to create common infrastructure the commonalities were rudimentary and just providing shared data structures and interfaces so that drivers could be written in an architecture agnostic way. The initial PCI/MSI[-X] support obviously plugged into this model which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors, which are a pure software construct for holding data relevant for a particular MSI interrupt, but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stragglers alive. In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel, which was creating yet another architecture specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an incomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation. At roughly the same time ARM struggled with the ever growing SoC specific extensions which were glued on top of the architected GIC interrupt controller. This resulted in a fundamental redesign of interrupt management and provided the today prevailing concept of hierarchical interrupt domains. This allowed to disentangle the interactions between x86 vector domain and interrupt remapping and also allowed ARM to handle the zoo of SoC specific interrupt components in a sane way. The concept of hierarchical interrupt domains aims to encapsulate the functionality of particular IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The X86 encapsulation looks like this: \|--- device 1 [Vector]---[Remapping]---[PCI/MSI]--\|... \|--- device N where the remapping domain is an optional component and in case that it is not available the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase where it is obviously required to establish the proper parent relation ship in the components of the hierarchy. While in most cases the model is strictly representing the chain of IP blocks and abstracting them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strict a per PCI device entity. Here we took a short cut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains which was possible because the PCI/MSI[-X] handling is uniform across the devices. This also allowed to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in turn made it simple to keep the existing architecture specific management alive. A similar problem was created in the ARM world with support for IP block specific message storage. Instead of going all the way to stack a IP block specific domain on top of the generic MSI domain this ended in a construct which provides a "global" platform MSI domain which allows overriding the irq_write_msi_msg() callback per allocation. In course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB etc. where support for implementation specific message storage was just mindlessly glued into the existing infrastructure. Some of this just works by chance on particular platforms but will fail in hard to diagnose ways when the driver is used on platforms where the underlying MSI interrupt management code does not expect the creative abuse. Another shortcoming of today's PCI/MSI-X support is the inability to allocate or free individual vectors after the initial enablement of MSI-X. This results in an works by chance implementation of VFIO (PCI pass-through) where interrupts on the host side are not set up upfront to avoid resource exhaustion. They are expanded at run-time when the guest actually tries to use them. The way how this is implemented is that the host disables MSI-X and then re-enables it with a larger number of vectors again. That works by chance because most device drivers set up all interrupts before the device actually will utilize them. But that's not universally true because some drivers allocate a large enough number of vectors but do not utilize them until it's actually required, e.g. for acceleration support. But at that point other interrupts of the device might be in active use and the MSI-X disable/enable dance can just result in losing interrupts and therefore hard to diagnose subtle problems. Last but not least the "global" PCI/MSI-X domain approach prevents to utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS is not longer providing a uniform storage and configuration model. The solution to this is to implement the missing step and switch from global PCI/MSI domains to per device PCI/MSI domains. The resulting hierarchy then looks like this: \|--- [PCI/MSI] device 1 [Vector]---[Remapping]---\|... \|--- [PCI/MSI] device N which in turn allows to provide support for multiple domains per device: \|--- [PCI/MSI] device 1 \|--- [PCI/IMS] device 1 [Vector]---[Remapping]---\|... \|--- [PCI/MSI] device N \|--- [PCI/IMS] device N This work converts the MSI and PCI/MSI core and the x86 interrupt domains to the new model, provides new interfaces for post-enable allocation/free of MSI-X interrupts and the base framework for PCI/IMS. PCI/IMS has been verified with the work in progress IDXD driver. There is work in progress to convert ARM over which will replace the platform MSI train-wreck. The cleanup of VFIO, NTB and other creative "solutions" are in the works as well. Drivers: - Updates for the LoongArch interrupt chip drivers - Support for MTK CIRQv2 - The usual small fixes and updates all over the place" * tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits) irqchip/ti-sci-inta: Fix kernel doc irqchip/gic-v2m: Mark a few functions __init irqchip/gic-v2m: Include arm-gic-common.h irqchip/irq-mvebu-icu: Fix works by chance pointer assignment iommu/amd: Enable PCI/IMS iommu/vt-d: Enable PCI/IMS x86/apic/msi: Enable PCI/IMS PCI/MSI: Provide pci_ims_alloc/free_irq() PCI/MSI: Provide IMS (Interrupt Message Store) support genirq/msi: Provide constants for PCI/IMS support x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X PCI/MSI: Provide prepare_desc() MSI domain op PCI/MSI: Split MSI-X descriptor setup genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN genirq/msi: Provide msi_domain_alloc_irq_at() genirq/msi: Provide msi_domain_ops:: Prepare_desc() genirq/msi: Provide msi_desc:: Msi_data genirq/msi: Provide struct msi_map x86/apic/msi: Remove arch_create_remap_msi_irq_domain() ...	2022-12-12 11:21:29 -08:00
Shameer Kolothum	f2240b4441	hisi_acc_vfio_pci: Enable PRE_COPY flag Now that we have everything to support the PRE_COPY state, enable it. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-5-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:37:11 -07:00
Shameer Kolothum	190125adca	hisi_acc_vfio_pci: Move the dev compatibility tests for early check Instead of waiting till data transfer is complete to perform dev compatibility, do it as soon as we have enough data to perform the check. This will be useful when we enable the support for PRE_COPY. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-4-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:37:11 -07:00
Shameer Kolothum	d9a871e4a1	hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions The saving_migf is open in PRE_COPY state if it is supported and reads initial device match data. hisi_acc_vf_stop_copy() is refactored to make use of common code. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-3-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:37:11 -07:00
Shameer Kolothum	64ffbbb1e9	hisi_acc_vfio_pci: Add support for precopy IOCTL PRECOPY IOCTL in the case of HiSiIicon ACC driver can be used to perform the device compatibility check earlier during migration. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-2-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:37:11 -07:00
Shay Drory	ccc2a52e46	vfio/mlx5: Enable MIGRATION_PRE_COPY flag Now that everything has been set up for MIGRATION_PRE_COPY, enable it. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-15-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Shay Drory	d6e18a4bec	vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error Before a SAVE command is issued, a QUERY command is issued in order to know the device data size. In case PRE_COPY is used, the above commands are issued while the device is running. Thus, it is possible that between the QUERY and the SAVE commands the state of the device will be changed significantly and thus the SAVE will fail. Currently, if a SAVE command is failing, the driver will fail the migration. In the above case, don't fail the migration, but don't allow for new SAVEs to be executed while the device is in a RUNNING state. Once the device will be moved to STOP_COPY, SAVE can be executed again and the full device state will be read. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-14-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	34e2f27143	vfio/mlx5: Introduce multiple loads In order to support PRE_COPY, mlx5 driver transfers multiple states (images) of the device. e.g.: the source VF can save and transfer multiple states, and the target VF will load them by that order. This patch implements the changes for the target VF to decompose the header for each state and to write and load multiple states. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-13-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	81156c2727	vfio/mlx5: Consider temporary end of stream as part of PRE_COPY During PRE_COPY the migration data FD may have a temporary "end of stream" that is reached when the initial_bytes were read and no other dirty data exists yet. For instance, this may indicate that the device is idle and not currently dirtying any internal state. When read() is done on this temporary end of stream the kernel driver should return ENOMSG from read(). Userspace can wait for more data or consider moving to STOP_COPY. To not block the user upon read() and let it get ENOMSG we add a new state named MLX5_MIGF_STATE_PRE_COPY on the migration file. In addition, we add the MLX5_MIGF_STATE_SAVE_LAST state to block the read() once we call the last SAVE upon moving to STOP_COPY. Any further error will be marked with MLX5_MIGF_STATE_ERROR and the user won't be blocked. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-12-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	0dce165b1a	vfio/mlx5: Introduce vfio precopy ioctl implementation vfio precopy ioctl returns an estimation of data available for transferring from the device. Whenever a user is using VFIO_MIG_GET_PRECOPY_INFO, track the current state of the device, and if needed, append the dirty data to the transfer FD data. This is done by saving a middle state. As mlx5 runs the SAVE command asynchronously, make sure to query for incremental data only once there is no active save command. Running both in parallel, might end-up with a failure in the incremental query command on un-tracked vhca. Also, a middle state will be saved only after the previous state has finished its SAVE command and has been fully transferred, this prevents endless use resources. Co-developed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-11-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	0c9a38fee8	vfio/mlx5: Introduce SW headers for migration states As mentioned in the previous patches, mlx5 is transferring multiple states when the PRE_COPY protocol is used. This states mechanism requires the target VM to know the states' size in order to execute multiple loads. Therefore, add SW header, with the needed information, for each saved state the source VM is transferring to the target VM. This patch implements the source VM handling of the headers, following patch will implement the target VM handling of the headers. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	3319d287f4	vfio/mlx5: Introduce device transitions of PRE_COPY In order to support PRE_COPY, mlx5 driver is transferring multiple states (images) of the device. e.g.: the source VF can save and transfer multiple states, and the target VF will load them by that order. The device is saving three kinds of states: 1) Initial state - when the device moves to PRE_COPY state. 2) Middle state - during PRE_COPY phase via VFIO_MIG_GET_PRECOPY_INFO. There can be multiple states of this type. 3) Final state - when the device moves to STOP_COPY state. After moving to PRE_COPY state, user is holding the saving migf FD and can use it. For example: user can start transferring data via read() callback. Also, user can switch from PRE_COPY to STOP_COPY whenever he sees it fits. This will invoke saving of final state. This means that mlx5 VFIO device can be switched to STOP_COPY without transferring any data in PRE_COPY state. Therefore, when the device moves to STOP_COPY, mlx5 will store the final state on a dedicated queue entry on the list. Co-developed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	c668878381	vfio/mlx5: Refactor to use queue based data chunks Refactor to use queue based data chunks on the migration file. The SAVE command adds a chunk to the tail of the queue while the read() API finds the required chunk and returns its data. In case the queue is empty but the state of the migration file is MLX5_MIGF_STATE_COMPLETE, read() may not be blocked but will return 0 to indicate end of file. This is a step towards maintaining multiple images and their meta data (i.e. headers) on the migration file as part of next patches from the series. Note: At that point, we still use a single chunk on the migration file but becomes ready to support multiple. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	8b599d1434	vfio/mlx5: Refactor migration file state Refactor migration file state to be an emum which is mutual exclusive. As of that dropped the 'disabled' state as 'error' is the same from functional point of view. Next patches from the series will extend this enum for other relevant states. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	91454f8b9b	vfio/mlx5: Refactor MKEY usage This patch refactors MKEY usage such as its life cycle will be as of the migration file instead of allocating/destroying it upon each SAVE/LOAD command. This is a preparation step towards the PRE_COPY series where multiple images will be SAVED/LOADED. We achieve it by having a new struct named mlx5_vhca_data_buffer which holds the mkey and its related stuff as of sg_append_table, allocated_length, etc. The above fields were taken out from the migration file main struct, into mlx5_vhca_data_buffer dedicated struct with the proper helpers in place. For now we have a single mlx5_vhca_data_buffer per migration file. However, in coming patches we'll have multiple of them to support multiple images. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	9945a67ea4	vfio/mlx5: Refactor PD usage This patch refactors PD usage such as its life cycle will be as of the migration file instead of allocating/destroying it upon each SAVE/LOAD command. This is a preparation step towards the PRE_COPY series where multiple images will be SAVED/LOADED and a single PD can be simply reused. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Yishai Hadas	0e7caa65d7	vfio/mlx5: Enforce a single SAVE command at a time Enforce a single SAVE command at a time. As the SAVE command is an asynchronous one, we must enforce running only a single command at a time. This will preserve ordering between multiple calls and protect from races on the migration file data structure. This is a must for the next patches from the series where as part of PRE_COPY we may have multiple images to be saved and multiple SAVE commands may be issued from different flows. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:44 -07:00
Jason Gunthorpe	4db52602a6	vfio: Extend the device migration protocol with PRE_COPY The optional PRE_COPY states open the saving data transfer FD before reaching STOP_COPY and allows the device to dirty track internal state changes with the general idea to reduce the volume of data transferred in the STOP_COPY stage. While in PRE_COPY the device remains RUNNING, but the saving FD is open. Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P, which halts P2P transfers while continuing the saving FD. PRE_COPY, with P2P support, requires the driver to implement 7 new arcs and exists as an optional FSM branch between RUNNING and STOP_COPY: RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY A new ioctl VFIO_MIG_GET_PRECOPY_INFO is provided to allow userspace to query the progress of the precopy operation in the driver with the idea it will judge to move to STOP_COPY at least once the initial data set is transferred, and possibly after the dirty size has shrunk appropriately. This ioctl is valid only in PRE_COPY states and kernel driver should return -EINVAL from any other migration state. Compared to the v1 clarification, STOP_COPY -> PRE_COPY is blocked and to be defined in future. We also split the pending_bytes report into the initial and sustaining values, e.g.: initial_bytes and dirty_bytes. initial_bytes: Amount of initial precopy data. dirty_bytes: Device state changes relative to data previously retrieved. These fields are not required to have any bearing to STOP_COPY phase. It is recommended to leave PRE_COPY for STOP_COPY only after the initial_bytes field reaches zero. Leaving PRE_COPY earlier might make things slower. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-06 12:36:43 -07:00
Jason Gunthorpe	e2d5570939	vfio: Fold vfio_virqfd.ko into vfio.ko This is only 1.8k, putting it in its own module is not really necessary. The kconfig infrastructure is still there to completely remove it for systems that are trying for small footprint. Put it in the main vfio.ko module now that kbuild can support multiple .c files. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/5-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-05 12:04:32 -07:00
Jason Gunthorpe	20601c45a0	vfio: Remove CONFIG_VFIO_SPAPR_EEH We don't need a kconfig symbol for this, just directly test CONFIG_EEH in the few places that need it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-05 12:04:32 -07:00
Jason Gunthorpe	e276e25819	vfio: Move vfio_spapr_iommu_eeh_ioctl into vfio_iommu_spapr_tce.c As with the previous patch EEH is always enabled if SPAPR_TCE_IOMMU, so move this last bit of code into the main module. Now that this function only processes VFIO_EEH_PE_OP remove a level of indenting as well, it is only called by a case statement that already checked VFIO_EEH_PE_OP. This eliminates an unnecessary module and SPAPR code in a global header. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-05 12:04:32 -07:00
Jason Gunthorpe	e5c38a203e	vfio/spapr: Move VFIO_CHECK_EXTENSION into tce_iommu_ioctl() The PPC64 kconfig is a bit of a rats nest, but it turns out that if CONFIG_SPAPR_TCE_IOMMU is on then EEH must be too: config SPAPR_TCE_IOMMU bool "sPAPR TCE IOMMU Support" depends on PPC_POWERNV \|\| PPC_PSERIES select IOMMU_API help Enables bits of IOMMU API required by VFIO. The iommu_ops is not implemented as it is not necessary for VFIO. config PPC_POWERNV select FORCE_PCI config PPC_PSERIES select FORCE_PCI config EEH bool depends on (PPC_POWERNV \|\| PPC_PSERIES) && PCI default y So, just open code the call to eeh_enabled() into tce_iommu_ioctl(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-05 12:04:32 -07:00
Jason Gunthorpe	8f8bcc8c72	vfio/pci: Move all the SPAPR PCI specific logic to vfio_pci_core.ko The vfio_spapr_pci_eeh_open/release() functions are one line wrappers around an arch function. Just call them directly. This eliminates some weird exported symbols that don't need to exist. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/1-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-12-05 12:04:32 -07:00
Yi Liu	9eefba8002	vfio: Move vfio group specific code into group.c This prepares for compiling out vfio group after vfio device cdev is added. No vfio_group decode code should be in vfio_main.c, and neither device->group reference should be in vfio_main.c. No functional change is intended. Link: https://lore.kernel.org/r/20221201145535.589687-11-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Yu He <yu.he@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-05 08:56:01 -04:00
Yi Liu	8da7a0e79f	vfio: Refactor dma APIs for emulated devices To use group helpers instead of opening group related code in the API. This prepares moving group specific code out of vfio_main.c. Link: https://lore.kernel.org/r/20221201145535.589687-10-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-05 08:56:01 -04:00
Yi Liu	1334e47ee7	vfio: Wrap vfio group module init/clean code into helpers This wraps the init/clean code of vfio group global variable to be helpers, and prepares for further moving vfio group specific code into separate file. As container is used by group, so vfio_container_init/cleanup() is moved into vfio_group_init/cleanup(). Link: https://lore.kernel.org/r/20221201145535.589687-9-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-05 08:56:01 -04:00
Yi Liu	5c8d3d93f6	vfio: Refactor vfio_device open and close This refactor makes the vfio_device_open() to accept device, iommufd_ctx pointer and kvm pointer. These parameters are generic items in today's group path and future device cdev path. Caller of vfio_device_open() should take care the necessary protections. e.g. the current group path need to hold the group_lock to ensure the iommufd_ctx and kvm pointer are valid. This refactor also wraps the group spefcific codes in the device open and close paths to be paired helpers like: - vfio_device_group_open/close(): call vfio_device_open/close() - vfio_device_group_use/unuse_iommu(): this pair is container specific. iommufd vs. container is selected in vfio_device_first_open(). Such helpers are supposed to be moved to group.c. While iommufd related codes will be kept in the generic helpers since future device cdev path also need to handle iommufd. Link: https://lore.kernel.org/r/20221201145535.589687-8-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-05 08:56:01 -04:00

... 2 3 4 5 6 ...

1290 Commits