The QM_QUE_ISO_CFG macro definition is no longer used
and needs to be deleted from the current driver.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20220926093332.28824-5-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Remove unused function parameters for vf_qm_fun_reset() and
ensure the device is enabled before the reset operation
is performed.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20220926093332.28824-4-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The queue address of the accelerator device should be combined into
a dma address in a way of combining the low and high bits.
The previous combination is wrong and needs to be modified.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20220926093332.28824-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
During the process of compatibility and matching of live migration
device information, if the isolation status of the two devices is
inconsistent, the live migration needs to be exited.
The current driver does not return the error code correctly and
needs to be fixed.
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20220926093332.28824-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The iommu_group comes from the struct device that a driver has been bound
to and then created a struct vfio_device against. To keep the iommu layer
sane we want to have a simple rule that only an attached driver should be
using the iommu API. Particularly only an attached driver should hold
ownership.
In VFIO's case since it uses the group APIs and it shares between
different drivers it is a bit more complicated, but the principle still
holds.
Solve this by waiting for all users of the vfio_group to stop before
allowing vfio_unregister_group_dev() to complete. This is done with a new
completion to know when the users go away and an additional refcount to
keep track of how many device drivers are sharing the vfio group. The last
driver to be unregistered will clean up the group.
This solves crashes in the S390 iommu driver that come because VFIO ends
up racing releasing ownership (which attaches the default iommu_domain to
the device) with the removal of that same device from the iommu
driver. This is a side case that iommu drivers should not have to cope
with.
iommu driver failed to attach the default/blocking domain
WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G W 6.0.0-rc3 #5
Hardware name: IBM 3931 A01 782 (LPAR)
Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
Krnl Code: 000000095bb10d18: c020003d56fc larl %r2,000000095c2bbb10
000000095bb10d1e: c0e50019d901 brasl %r14,000000095be4bf20
#000000095bb10d24: af000000 mc 0,0
>000000095bb10d28: b904002a lgr %r2,%r10
000000095bb10d2c: ebaff0a00004 lmg %r10,%r15,160(%r15)
000000095bb10d32: c0f4001aa867 brcl 15,000000095be65e00
000000095bb10d38: c004002168e0 brcl 0,000000095bf3def8
000000095bb10d3e: eb6ff0480024 stmg %r6,%r15,72(%r15)
Call Trace:
[<000000095bb10d28>] iommu_detach_group+0x70/0x80
([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
[<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
[<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
[<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
pci 0004:00:00.0: Removing from iommu group 4
[<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
[<000000095be5d3b4>] __do_syscall+0x1d4/0x200
[<000000095be6c072>] system_call+0x82/0xb0
Last Breaking-Event-Address:
[<000000095be4bf80>] __warn_printk+0x60/0x68
It indicates that domain->ops->attach_dev() failed because the driver has
already passed the point of destructing the device.
Fixes: 9ac8545199 ("iommu: Fix use-after-free in iommu_release_device")
Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v2-a3c5f4429e2a+55-iommu_group_lifetime_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
All the functions that dereference struct vfio_container are moved into
container.c.
Simple code motion, no functional change.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is a container item.
A following patch will move the vfio_container functions to their own .c
file.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To vfio_container_ioctl_check_extension().
A following patch will turn this into a non-static function, make it clear
it is related to the container.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This miscdev, noiommu driver and a couple of globals are all container
items. Move this init into its own functions.
A following patch will move the vfio_container functions to their own .c
file.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This can all be accomplished using typical IS_ENABLED techniques, drop it
all.
Also rename the variable to vfio_noiommu so this can be made global in
following patches.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This splits up the ioctl of vfio_group_ioctl_set_container() so it
determines the type of file then invokes a type specific attachment
function. Future patches will add iommufd to this function as an
alternative type.
A following patch will move the vfio_container functions to their own .c
file.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To vfio_group_detach_container(). This function is really a container
function.
Fold the WARN_ON() into it as a precondition assertion.
A following patch will move the vfio_container functions to their own .c
file.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-297af71838d2+b9-vfio_container_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
and replace kref. With it a 'vfio-dev/vfioX' node is created under the
sysfs path of the parent, indicating the device is bound to a vfio
driver, e.g.:
/sys/devices/pci0000\:6f/0000\:6f\:01.0/vfio-dev/vfio0
It is also a preparatory step toward adding cdev for supporting future
device-oriented uAPI.
Add Documentation/ABI/testing/sysfs-devices-vfio-dev.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220921104401.38898-16-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With the addition of vfio_put_device() now the names become confusing.
vfio_put_device() is clear from object life cycle p.o.v given kref.
vfio_device_put()/vfio_device_try_get() are helpers for tracking
users on a registered device.
Now rename them:
- vfio_device_put() -> vfio_device_put_registration()
- vfio_device_try_get() -> vfio_device_try_get_registration()
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20220921104401.38898-15-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
ccw is the only exception which cannot use vfio_alloc_device() because
its private device structure is designed to serve both mdev and parent.
Life cycle of the parent is managed by css_driver so vfio_ccw_private
must be allocated/freed in css_driver probe/remove path instead of
conforming to vfio core life cycle for mdev.
Given that use a wait/completion scheme so the mdev remove path waits
after vfio_put_device() until receiving a completion notification from
@release. The completion indicates that all active references on
vfio_device have been released.
After that point although free of vfio_ccw_private is delayed to
css_driver it's at least guaranteed to have no parallel reference on
released vfio device part from other code paths.
memset() in @probe is removed. vfio_device is either already cleared
when probed for the first time or cleared in @release from last probe.
The right fix is to introduce separate structures for mdev and parent,
but this won't happen in short term per prior discussions.
Remove vfio_init/uninit_group_dev() as no user now.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/20220921104401.38898-14-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Implement amba's own vfio_device_ops.
Remove vfio_platform_probe/remove_common() given no user now.
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20220921104401.38898-13-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Move vfio_device_ops from platform core to platform drivers so device
specific init/cleanup can be added.
Introduce two new helpers vfio_platform_init/release_common() for the
use in driver @init/@release.
vfio_platform_probe/remove_common() will be deprecated.
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20220921104401.38898-12-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Also add a comment to mark that vfio core releases device_set if @init
fails.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220921104401.38898-11-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tidy up @probe so all migration specific initialization logic is moved
to migration specific @init callback.
Remove vfio_pci_core_{un}init_device() given no user now.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20220921104401.38898-5-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
mlx5 has its own @init/@release for handling migration cap.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220921104401.38898-4-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Also introduce two pci core helpers as @init/@release for pci drivers:
- vfio_pci_core_init_dev()
- vfio_pci_core_release_dev()
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220921104401.38898-3-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The idea is to let vfio core manage the vfio_device life cycle instead
of duplicating the logic cross drivers. This is also a preparatory
step for adding struct device into vfio_device.
New pair of helpers together with a kref in vfio_device:
- vfio_alloc_device()
- vfio_put_device()
Drivers can register @init/@release callbacks to manage any private
state wrapping the vfio_device.
However vfio-ccw doesn't fit this model due to a life cycle mess
that its private structure mixes both parent and mdev info hence must
be allocated/freed outside of the life cycle of vfio device.
Per prior discussions this won't be fixed in short term by IBM folks.
Instead of waiting for those modifications introduce another helper
vfio_init_device() so ccw can call it to initialize a pre-allocated
vfio_device.
Further implication of the ccw trick is that vfio_device cannot be
freed uniformly in vfio core. Instead, require *EVERY* driver to
implement @release and free vfio_device inside. Then ccw can choose
to delay the free at its own discretion.
Another trick down the road is that kvzalloc() is used to accommodate
the need of gvt which uses vzalloc() while all others use kzalloc().
So drivers should call a helper vfio_free_device() to free the
vfio_device instead of assuming that kfree() or vfree() is appliable.
Later once the ccw mess is fixed we can remove those tricks and
fully handle structure alloc/free in vfio core.
Existing vfio_{un}init_group_dev() will be deprecated after all
existing usages are converted to the new model.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20220921104401.38898-2-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that everything is ready set the driver DMA logging callbacks if
supported by the device.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-11-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Handle async error events and health/recovery flow to safely stop the
tracker upon error scenarios.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Report dirty pages from tracker.
It includes:
Querying for dirty pages in a given IOVA range, this is done by
modifying the tracker into the reporting state and supplying the
required range.
Using the CQ event completion mechanism to be notified once data is
ready on the CQ/QP to be processed.
Once data is available turn on the corresponding bits in the bit map.
This functionality will be used as part of the 'log_read_and_clear'
driver callback in the next patches.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add support for creating and destroying page tracker object.
This object is used to control/report the device dirty pages.
As part of creating the tracker need to consider the device capabilities
for max ranges and adapt/combine ranges accordingly.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Init QP based resources for dirty tracking to be used upon start
logging.
It includes:
Creating the host and firmware RC QPs, move each of them to its expected
state based on the device specification, etc.
Creating the relevant resources which are needed by both QPs as of UAR,
PD, etc.
Creating the host receive side resources as of MKEY, CQ, receive WQEs,
etc.
The above resources are cleaned-up upon stop logging.
The tracker object that will be introduced by next patches will use
those resources.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Introduce the DMA logging feature support in the vfio core layer.
It includes the processing of the device start/stop/report DMA logging
UAPIs and calling the relevant driver 'op' to do the work.
Specifically,
Upon start, the core translates the given input ranges into an interval
tree, checks for unexpected overlapping, non aligned ranges and then
pass the translated input to the driver for start tracking the given
ranges.
Upon report, the core translates the given input user space bitmap and
page size into an IOVA kernel bitmap iterator. Then it iterates it and
call the driver to set the corresponding bits for the dirtied pages in a
specific IOVA range.
Upon stop, the driver is called to stop the previous started tracking.
The next patches from the series will introduce the mlx5 driver
implementation for the logging ops.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The new facility adds a bunch of wrappers that abstract how an IOVA range
is represented in a bitmap that is granulated by a given page_size. So it
translates all the lifting of dealing with user pointers into its
corresponding kernel addresses backing said user memory into doing finally
the (non-atomic) bitmap ops to change various bits.
The formula for the bitmap is:
data[(iova / page_size) / 64] & (1ULL << (iova % 64))
Where 64 is the number of bits in a unsigned long (depending on arch)
It introduces an IOVA iterator that uses a windowing scheme to minimize the
pinning overhead, as opposed to pinning it on demand 4K at a time. Assuming
a 4K kernel page and 4K requested page size, we can use a single kernel
page to hold 512 page pointers, mapping 2M of bitmap, representing 64G of
IOVA space.
An example usage of these helpers for a given @base_iova, @page_size,
@length and __user @data:
bitmap = iova_bitmap_alloc(base_iova, page_size, length, data);
if (IS_ERR(bitmap))
return -ENOMEM;
ret = iova_bitmap_for_each(bitmap, arg, dirty_reporter_fn);
iova_bitmap_free(bitmap);
Each iteration of the @dirty_reporter_fn is called with a unique @iova
and @length argument, indicating the current range available through the
iova_bitmap. The @dirty_reporter_fn uses iova_bitmap_set() to mark dirty
areas (@iova_length) within that provided range, as following:
iova_bitmap_set(bitmap, iova, iova_length);
The facility is intended to be used for user bitmaps representing dirtied
IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
L and S are swapped in the message.
s/VFIO_FLS_MC/VFIO_FSL_MC/
Also use 'ret' instead of 'WARN_ON(ret)' to avoid a duplicated message.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Diana Craciun <diana.craciun@oss.nxp.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/a7c1394346725b7435792628c8d4c06a0a745e0b.1662134821.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The iommu-dma layer is now mostly encapsulated by iommu_dma_ops, with
only a couple more public interfaces left pertaining to MSI integration.
Since these depend on the main IOMMU API header anyway, move their
declarations there, taking the opportunity to update the half-baked
comments to proper kerneldoc along the way.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/9cd99738f52094e6bed44bfee03fa4f288d20695.1660668998.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit 91be0bd6c6cf("vfio/pci: Have all VFIO PCI drivers store the
vfio_pci_core_device in drvdata") introduced a helper function to
retrieve the drvdata but used "hssi" instead of "hisi" for the
function prefix. Correct that and also while at it, moved the
function a bit down so that it's close to other hisi_ prefixed
functions.
No functional changes.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220831085943.993-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This counts the number of devices attached to a vfio_group, ie the number
of items in the group->device_list.
It is only read in vfio_pin_pages(), as some kind of protection against
limitations in type1.
However, with all the code cleanups in this area, now that
vfio_pin_pages() accepts a vfio_device directly it is redundant. All
drivers are already calling vfio_register_emulated_iommu_dev() which
directly creates a group specifically for the device and thus it is
guaranteed that there is a singleton group.
Leave a note in the comment about this requirement and remove the logic.
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v2-d4374a7bf0c9+c4-vfio_dev_counter_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch implements VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
device feature. In the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, if there is
any access for the VFIO device on the host side, then the device will
be moved out of the low power state without the user's guest driver
involvement. Once the device access has been finished, then the host
can move the device again into low power state. With the low power
entry happened through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP,
the device will not be moved back into the low power state and
a notification will be sent to the user by triggering wakeup eventfd.
vfio_pci_core_pm_entry() will be called for both the variants of low
power feature entry so add an extra argument for wakeup eventfd context
and store locally in 'struct vfio_pci_core_device'.
For the entry happened without wakeup eventfd, all the exit related
handling will be done by the LOW_POWER_EXIT device feature only.
When the LOW_POWER_EXIT will be called, then the vfio core layer
vfio_device_pm_runtime_get() will increment the usage count and will
resume the device. In the driver runtime_resume callback, the
'pm_wake_eventfd_ctx' will be NULL. Then vfio_pci_core_pm_exit()
will call vfio_pci_runtime_pm_exit() and all the exit related handling
will be done.
For the entry happened with wakeup eventfd, in the driver resume
callback, eventfd will be triggered and all the exit related handling will
be done. When vfio_pci_runtime_pm_exit() will be called by
vfio_pci_core_pm_exit(), then it will return early.
But if the runtime suspend has not happened on the host side, then
all the exit related handling will be done in vfio_pci_core_pm_exit()
only.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220829114850.4341-6-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently, if the runtime power management is enabled for vfio-pci
based devices in the guest OS, then the guest OS will do the register
write for PCI_PM_CTRL register. This write request will be handled in
vfio_pm_config_write() where it will do the actual register write of
PCI_PM_CTRL register. With this, the maximum D3hot state can be
achieved for low power. If we can use the runtime PM framework, then
we can achieve the D3cold state (on the supported systems) which will
help in saving maximum power.
1. D3cold state can't be achieved by writing PCI standard
PM config registers. This patch implements the following
newly added low power related device features:
- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
- VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY feature will allow the
device to make use of low power platform states on the host
while the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT will prevent
further use of those power states.
2. The vfio-pci driver uses runtime PM framework for low power entry and
exit. On the platforms where D3cold state is supported, the runtime
PM framework will put the device into D3cold otherwise, D3hot or some
other power state will be used.
There are various cases where the device will not go into the runtime
suspended state. For example,
- The runtime power management is disabled on the host side for
the device.
- The user keeps the device busy after calling LOW_POWER_ENTRY.
- There are dependent devices that are still in runtime active state.
For these cases, the device will be in the same power state that has
been configured by the user through PCI_PM_CTRL register.
3. The hypervisors can implement virtual ACPI methods. For example,
in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
resources with _ON/_OFF method, then guest linux OS invokes
the _OFF method during D3cold transition and then _ON during D0
transition. The hypervisor can tap these virtual ACPI calls and then
call the low power device feature IOCTL.
4. The 'pm_runtime_engaged' flag tracks the entry and exit to
runtime PM. This flag is protected with 'memory_lock' semaphore.
5. All the config and other region access are wrapped under
pm_runtime_resume_and_get() and pm_runtime_put(). So, if any
device access happens while the device is in the runtime suspended
state, then the device will be resumed first before access. Once the
access has been finished, then the device will again go into the
runtime suspended state.
6. The memory region access through mmap will not be allowed in the low
power state. Since __vfio_pci_memory_enabled() is a common function,
so check for 'pm_runtime_engaged' has been added explicitly in
vfio_pci_mmap_fault() to block only mmap'ed access.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220829114850.4341-5-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds INTx handling during runtime suspend/resume.
All the suspend/resume related code for the user to put the device
into the low power state will be added in subsequent patches.
The INTx lines may be shared among devices. Whenever any INTx
interrupt comes for the VFIO devices, then vfio_intx_handler() will be
called for each device sharing the interrupt. Inside vfio_intx_handler(),
it calls pci_check_and_mask_intx() and checks if the interrupt has
been generated for the current device. Now, if the device is already
in the D3cold state, then the config space can not be read. Attempt
to read config space in D3cold state can cause system unresponsiveness
in a few systems. To prevent this, mask INTx in runtime suspend callback,
and unmask the same in runtime resume callback. If INTx has been already
masked, then no handling is needed in runtime suspend/resume callbacks.
'pm_intx_masked' tracks this, and vfio_pci_intx_mask() has been updated
to return true if the INTx vfio_pci_irq_ctx.masked value is changed
inside this function.
For the runtime suspend which is triggered for the no user of VFIO
device, the 'irq_type' will be VFIO_PCI_NUM_IRQS and these
callbacks won't do anything.
The MSI/MSI-X are not shared so similar handling should not be
needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
without doing any device-specific config access. When the user performs
any config access or IOCTL after receiving the eventfd notification,
then the device will be moved to the D0 state first before
servicing any request.
Another option was to check this flag 'pm_intx_masked' inside
vfio_intx_handler() instead of masking the interrupts. This flag
is being set inside the runtime_suspend callback but the device
can be in non-D3cold state (for example, if the user has disabled D3cold
explicitly by sysfs, the D3cold is not supported in the platform, etc.).
Also, in D3cold supported case, the device will be in D0 till the
PCI core moves the device into D3cold. In this case, there is
a possibility that the device can generate an interrupt. Adding check
in the IRQ handler will not clear the IRQ status and the interrupt
line will still be asserted. This can cause interrupt flooding.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220829114850.4341-4-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio-pci based drivers will have runtime power management
support where the user can put the device into the low power state
and then PCI devices can go into the D3cold state. If the device is
in the low power state and the user issues any IOCTL, then the
device should be moved out of the low power state first. Once
the IOCTL is serviced, then it can go into the low power state again.
The runtime PM framework manages this with help of usage count.
One option was to add the runtime PM related API's inside vfio-pci
driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
different path and more IOCTL can be added in the future. Also, the
runtime PM will be added for vfio-pci based drivers variant currently,
but the other VFIO based drivers can use the same in the
future. So, this patch adds the runtime calls runtime-related API in
the top-level IOCTL function itself.
For the VFIO drivers which do not have runtime power management
support currently, the runtime PM API's won't be invoked. Only for
vfio-pci based drivers currently, the runtime PM API's will be invoked
to increment and decrement the usage count. In the vfio-pci drivers also,
the variant drivers can opt-out by incrementing the usage count during
device-open. The pm_runtime_resume_and_get() checks the device
current status and will return early if the device is already in the
ACTIVE state.
Taking this usage count incremented while servicing IOCTL will make
sure that the user won't put the device into the low power state when any
other IOCTL is being serviced in parallel. Let's consider the
following scenario:
1. Some other IOCTL is called.
2. The user has opened another device instance and called the IOCTL for
low power entry.
3. The low power entry IOCTL moves the device into the low power state.
4. The other IOCTL finishes.
If we don't keep the usage count incremented then the device
access will happen between step 3 and 4 while the device has already
gone into the low power state.
The pm_runtime_resume_and_get() will be the first call so its error
should not be propagated to user space directly. For example, if
pm_runtime_resume_and_get() can return -EINVAL for the cases where the
user has passed the correct argument. So the
pm_runtime_resume_and_get() errors have been masked behind -EIO.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220829114850.4341-3-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is the last sizable implementation in vfio_group_fops_unl_ioctl(),
move it to a function so vfio_group_fops_unl_ioctl() is emptied out.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Make it clear that this is the body of the ioctl. Fold the locking into
the function so it is self contained like the other ioctls.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/7-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
No reason to split it up like this, just have one function to process the
ioctl. Move the lock into the function as well to avoid having a lockdep
annotation.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
No reason to split it up like this, just have one function to process the
ioctl.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This makes the code clearer and replaces a few places trying to access a
flex array with an actual flex array.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
500 lines is a bit long for a single function, move the bodies of each
ioctl into separate functions and leave behind a switch statement to
dispatch them. This patch just adds the function declarations and does not
fix the indenting. The next patch will restore the indenting.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This only returns 0 or -ERRNO, it should return int like all the other
ioctl dispatch functions.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Only three of these are actually used, simplify to three inline functions,
and open code the if statement in vfio_pci_config.c.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/3-v2-1bd95d72f298+e0e-vfio_pci_priv_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As this is part of the vfio_pci_core component it should be called
vfio_pci_core_register_dev_region() like everything else exported from
this module.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/2-v2-1bd95d72f298+e0e-vfio_pci_priv_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The header in include/linux should have only the exported interface for
other vfio_pci modules to use. Internal definitions for vfio_pci.ko
should be in a "priv" header along side the .c files.
Move the internal declarations out of vfio_pci_core.h. They either move to
vfio_pci_priv.h or to the C file that is the only user.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/1-v2-1bd95d72f298+e0e-vfio_pci_priv_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There's currently a reference count leak on the zero page. We increment
the reference via pin_user_pages_remote(), but the page is later handled
as an invalid/reserved page, therefore it's not accounted against the
user and not unpinned by our put_pfn().
Introducing special zero page handling in put_pfn() would resolve the
leak, but without accounting of the zero page, a single user could
still create enough mappings to generate a reference count overflow.
The zero page is always resident, so for our purposes there's no reason
to keep it pinned. Therefore, add a loop to walk pages returned from
pin_user_pages_remote() and unpin any zero pages.
Cc: stable@vger.kernel.org
Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/166182871735.3518559.8884121293045337358.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We have a cross dependency between KVM and VFIO when using
s390 vfio_pci_zdev extensions for PCI passthrough
To be able to keep both subsystem modular we add a registering
hook inside the S390 core code.
This fixes a build problem when VFIO is built-in and KVM is built
as a module.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Fixes: 09340b2fca ("KVM: s390: pci: add routines to start/stop interpretive execution")
Cc: <stable@vger.kernel.org>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20220819122945.9309-1-pmorel@linux.ibm.com
Message-Id: <20220819122945.9309-1-pmorel@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
If a source file has the same name as a module then kbuild only supports
a single source file in the module.
Rename vfio.c to vfio_main.c so that we can have more that one .c file
in vfio.ko.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220731125503.142683-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Cleanup use of extern in function prototypes (Alex Williamson)
- Simplify bus_type usage and convert to device IOMMU interfaces
(Robin Murphy)
- Check missed return value and fix comment typos (Bo Liu)
- Split migration ops from device ops and fix races in mlx5 migration
support (Yishai Hadas)
- Fix missed return value check in noiommu support (Liam Ni)
- Hardening to clear buffer pointer to avoid use-after-free (Schspa Shi)
- Remove requirement that only the same mm can unmap a previously
mapped range (Li Zhe)
- Adjust semaphore release vs device open counter (Yi Liu)
- Remove unused arg from SPAPR support code (Deming Wang)
- Rework vfio-ccw driver to better fit new mdev framework (Eric Farman,
Michael Kawano)
- Replace DMA unmap notifier with callbacks (Jason Gunthorpe)
- Clarify SPAPR support comment relative to iommu_ops (Alexey Kardashevskiy)
- Revise page pinning API towards compatibility with future iommufd support
(Nicolin Chen)
- Resolve issues in vfio-ccw, including use of DMA unmap callback
(Eric Farman)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmLqvYMbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiHM0P/1n/bszel20PRC7x+NLI
P7b/0aonW4Qtei2HORwowmaznb4NgRE5GCm5RU+a9+AwQKnK44j3lqy0skcfgZXr
f4viFlxOyd0H4blOhUZ+FuPNkUMAyz6HerzvJ9jQFG426pL5vr7UKWBuJPYB5RCT
4jEy3EUTSH8/Zt8ApLysFTyR64xN3Sk7vSUcj9rEhu5T3FWq8t9+jb3tE/HW/Xaw
pMwdC+ctYzYaBD/oA7Ns2IebNS9AUIUjKMXC25oCmc83WGgGOqgLB2mAthQ2NKB5
5capKBYuYl7PWERvpGpsPILEWvR6m+Rxh8r4Pqjcoyfq4k7vp+A/AFKiD7AEYBdy
BtfLWO59w6vuRQ5XXOa6Hu4ef6BcMvH4StrHxlHkKcgI4PJA0QscIXiJPQSt7Crr
m+kCNgPPgrfZDu7lmZTiWbXOYSkJR3Mxkhf2iNHudW9SsJT9pUAVEiGVVA/kC1Y/
fNBziRQeVF6JUW8M4pveXEWEbA8iE1HQeJA6aVRonxAkJk1KBaQgm/GKJlPXCHIR
R6lI90NXZHz/3ndIX1znKOm0qli+8auX/FH8iWUffZxGmtINOGGMYebD6YxFdCCJ
sWalL8vlQNCams2MZdovu/5BowXWtwOMm6KNG9RXSyWIWZEcNVbAzhTr+rrDdHZd
AJiUNCGO9UlO9FZM+ntfQTSr
=4BE8
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.0-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Cleanup use of extern in function prototypes (Alex Williamson)
- Simplify bus_type usage and convert to device IOMMU interfaces (Robin
Murphy)
- Check missed return value and fix comment typos (Bo Liu)
- Split migration ops from device ops and fix races in mlx5 migration
support (Yishai Hadas)
- Fix missed return value check in noiommu support (Liam Ni)
- Hardening to clear buffer pointer to avoid use-after-free (Schspa
Shi)
- Remove requirement that only the same mm can unmap a previously
mapped range (Li Zhe)
- Adjust semaphore release vs device open counter (Yi Liu)
- Remove unused arg from SPAPR support code (Deming Wang)
- Rework vfio-ccw driver to better fit new mdev framework (Eric Farman,
Michael Kawano)
- Replace DMA unmap notifier with callbacks (Jason Gunthorpe)
- Clarify SPAPR support comment relative to iommu_ops (Alexey
Kardashevskiy)
- Revise page pinning API towards compatibility with future iommufd
support (Nicolin Chen)
- Resolve issues in vfio-ccw, including use of DMA unmap callback (Eric
Farman)
* tag 'vfio-v6.0-rc1' of https://github.com/awilliam/linux-vfio: (40 commits)
vfio/pci: fix the wrong word
vfio/ccw: Check return code from subchannel quiesce
vfio/ccw: Remove FSM Close from remove handlers
vfio/ccw: Add length to DMA_UNMAP checks
vfio: Replace phys_pfn with pages for vfio_pin_pages()
vfio/ccw: Add kmap_local_page() for memcpy
vfio: Rename user_iova of vfio_dma_rw()
vfio/ccw: Change pa_pfn list to pa_iova list
vfio/ap: Change saved_pfn to saved_iova
vfio: Pass in starting IOVA to vfio_pin/unpin_pages API
vfio/ccw: Only pass in contiguous pages
vfio/ap: Pass in physical address of ind to ap_aqic()
drm/i915/gvt: Replace roundup with DIV_ROUND_UP
vfio: Make vfio_unpin_pages() return void
vfio/spapr_tce: Fix the comment
vfio: Replace the iommu notifier with a device list
vfio: Replace the DMA unmapping notifier with a callback
vfio/ccw: Move FSM open/close to MDEV open/close
vfio/ccw: Refactor vfio_ccw_mdev_reset
vfio/ccw: Create a CLOSE FSM event
...
* Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
* Rework of the sysreg access from userspace, with a complete
rewrite of the vgic-v3 view to allign with the rest of the
infrastructure
* Disagregation of the vcpu flags in separate sets to better track
their use model.
* A fix for the GICv2-on-v3 selftest
* A small set of cosmetic fixes
RISC-V:
* Track ISA extensions used by Guest using bitmap
* Added system instruction emulation framework
* Added CSR emulation framework
* Added gfp_custom flag in struct kvm_mmu_memory_cache
* Added G-stage ioremap() and iounmap() functions
* Added support for Svpbmt inside Guest
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
x86:
* Permit guests to ignore single-bit ECC errors
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Use try_cmpxchg64 instead of cmpxchg64
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* Allow NX huge page mitigation to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
* Miscellaneous cleanups:
** MCE MSR emulation
** Use separate namespaces for guest PTEs and shadow PTEs bitmasks
** PIO emulation
** Reorganize rmap API, mostly around rmap destruction
** Do not workaround very old KVM bugs for L0 that runs with nesting enabled
** new selftests API for CPUID
Generic:
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmLnyo4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMtQQf/XjVWiRcWLPR9dqzRM/vvRXpiG+UL
jU93R7m6ma99aqTtrxV/AE+kHgamBlma3Cwo+AcWk9uCVNbIhFjv2YKg6HptKU0e
oJT3zRYp+XIjEo7Kfw+TwroZbTlG6gN83l1oBLFMqiFmHsMLnXSI2mm8MXyi3dNB
vR2uIcTAl58KIprqNNsYJ2dNn74ogOMiXYx9XzoA9/5Xb6c0h4rreHJa5t+0s9RO
Gz7Io3PxumgsbJngjyL1Ve5oxhlIAcZA8DU0PQmjxo3eS+k6BcmavGFd45gNL5zg
iLpCh4k86spmzh8CWkAAwWPQE4dZknK6jTctJc0OFVad3Z7+X7n0E8TFrA==
=PM8o
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Quite a large pull request due to a selftest API overhaul and some
patches that had come in too late for 5.19.
ARM:
- Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
- Rework of the sysreg access from userspace, with a complete rewrite
of the vgic-v3 view to allign with the rest of the infrastructure
- Disagregation of the vcpu flags in separate sets to better track
their use model.
- A fix for the GICv2-on-v3 selftest
- A small set of cosmetic fixes
RISC-V:
- Track ISA extensions used by Guest using bitmap
- Added system instruction emulation framework
- Added CSR emulation framework
- Added gfp_custom flag in struct kvm_mmu_memory_cache
- Added G-stage ioremap() and iounmap() functions
- Added support for Svpbmt inside Guest
s390:
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to use TAP interface
- enable interpretive execution of zPCI instructions (for PCI
passthrough)
- First part of deferred teardown
- CPU Topology
- PV attestation
- Minor fixes
x86:
- Permit guests to ignore single-bit ECC errors
- Intel IPI virtualization
- Allow getting/setting pending triple fault with
KVM_GET/SET_VCPU_EVENTS
- PEBS virtualization
- Simplify PMU emulation by just using PERF_TYPE_RAW events
- More accurate event reinjection on SVM (avoid retrying
instructions)
- Allow getting/setting the state of the speaker port data bit
- Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls
are inconsistent
- "Notify" VM exit (detect microarchitectural hangs) for Intel
- Use try_cmpxchg64 instead of cmpxchg64
- Ignore benign host accesses to PMU MSRs when PMU is disabled
- Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
- Allow NX huge page mitigation to be disabled on a per-vm basis
- Port eager page splitting to shadow MMU as well
- Enable CMCI capability by default and handle injected UCNA errors
- Expose pid of vcpu threads in debugfs
- x2AVIC support for AMD
- cleanup PIO emulation
- Fixes for LLDT/LTR emulation
- Don't require refcounted "struct page" to create huge SPTEs
- Miscellaneous cleanups:
- MCE MSR emulation
- Use separate namespaces for guest PTEs and shadow PTEs bitmasks
- PIO emulation
- Reorganize rmap API, mostly around rmap destruction
- Do not workaround very old KVM bugs for L0 that runs with nesting enabled
- new selftests API for CPUID
Generic:
- Fix races in gfn->pfn cache refresh; do not pin pages tracked by
the cache
- new selftests API using struct kvm_vcpu instead of a (vm, id)
tuple"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits)
selftests: kvm: set rax before vmcall
selftests: KVM: Add exponent check for boolean stats
selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test
selftests: KVM: Check stat name before other fields
KVM: x86/mmu: remove unused variable
RISC-V: KVM: Add support for Svpbmt inside Guest/VM
RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap()
RISC-V: KVM: Add G-stage ioremap() and iounmap() functions
KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache
RISC-V: KVM: Add extensible CSR emulation framework
RISC-V: KVM: Add extensible system instruction emulation framework
RISC-V: KVM: Factor-out instruction emulation into separate sources
RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run
RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function
RISC-V: KVM: Fix variable spelling mistake
RISC-V: KVM: Improve ISA extension by using a bitmap
KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()
KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog
KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT
KVM: x86: Do not block APIC write for non ICR registers
...
New driver:
- logicvc
vfio:
- use aperture API
core:
- of: Add data-lane helpers and convert drivers
- connector: Remove deprecated ida_simple_get()
media:
- Add various RGB666 and RGB888 format constants
panel:
- Add HannStar HSD101PWW
- Add ETML0700Y5DHA
dma-buf:
- add sync-file API
- set dma mask for udmabuf devices
fbcon:
- Improve scrolling performance
- Sanitize input
fbdev:
- device unregistering fixes
- vesa: Support COMPILE_TEST
- Disable firmware-device registration when first native driver loads
aperture:
- fix segfault during hot-unplug
- export for use with other subsystems
client:
- use driver validated modes
dp:
- aux: make probing more reliable
- mst: Read extended DPCD capabilities during system resume
- Support waiting for HDP signal
- Port-validation fixes
edid:
- CEA data-block iterators
- struct drm_edid introduction
- implement HF-EEODB extension
gem:
- don't use fb format non-existing planes
probe-helper:
- use 640x480 as displayport fallback
scheduler:
- don't kill jobs in interrupt context
bridge:
- Add support for i.MX8qxp and i.MX8qm
- lots of fixes/cleanups
- Add TI-DLPC3433
- fy07024di26a30d: Optional GPIO reset
- ldb: Add reg and reg-name properties to bindings, Kconfig fixes
- lt9611: Fix display sensing;
- tc358767: DSI/DPI refactoring and DSI-to-eDP support, DSI lane handling
- tc358775: Fix clock settings
- ti-sn65dsi83: Allow GPIO to sleep
- adv7511: I2C fixes
- anx7625: Fix error handling; DPI fixes; Implement HDP timeout via callback
- fsl-ldb: Drop DE flip
- ti-sn65dsi86: Convert to atomic modesetting
amdgpu:
- use atomic fence helpers in DM
- fix VRAM address calculations
- export CRTC bpc via debugfs
- Initial devcoredump support
- Enable high priority gfx queue on asics which support it
- Adjust GART size on newer APUs for S/G display
- Soft reset for GFX 11 / SDMA 6
- Add gfxoff status query for vangogh
- Fix timestamps for cursor only commits
- Adjust GART size on newer APUs for S/G display
- fix buddy memory corruption
amdkfd:
- MMU notifier fixes
- P2P DMA support using dma-buf
- Add available memory IOCTL
- HMM profiler support
- Simplify GPUVM validation
- Unified memory for CWSR save/restore area
i915:
- General driver clean-up
- DG2 enabling (still under force probe)
- DG2 small BAR memory support
- HuC loading support
- DG2 workarounds
- DG2/ATS-M device IDs added
- Ponte Vecchio prep work and new blitter engines
- add Meteorlake support
- Fix sparse warnings
- DMC MMIO range checks
- Audio related fixes
- Runtime PM fixes
- PSR fixes
- Media freq factor and per-gt enhancements
- DSI fixes for ICL+
- Disable DMC flip queue handlers
- ADL_P voltage swing updates
- Use more the VBT for panel information
- Fix on Type-C ports with TBT mode
- Improve fastset and allow seamless M/N changes
- Accept more fixed modes with VRR/DMRRS panels
- Disable connector polling for a headless SKU
- ADL-S display PLL w/a
- Enable THP on Icelake and beyond
- Fix i915_gem_object_ggtt_pin_ww regression on old platforms
- Expose per tile media freq factor in sysfs
- Fix dma_resv fence handling in multi-batch execbuf
- Improve on suspend / resume time with VT-d enabled
- export CRTC bpc settings via debugfs
msm:
- gpu: a619 support
- gpu: Fix for unclocked GMU register access
- gpu: Devcore dump enhancements
- client utilization via fdinfo support
- fix fence rollover issue
- gem: Lockdep false-positive warning fix
- gem: Switch to pfn mappings
- WB support on sc7180
- dp: dropped custom bulk clock implementation
- fix link retraining on resolution change
- hdmi: dropped obsolete GPIO support
tegra:
- context isolation for host1x engines
- tegra234 soc support
mediatek:
- add vdosys0/1 for mt8195
- add MT8195 dp_intf driver
exynos:
- Fix resume function issue of exynos decon driver by calling
clk_disable_unprepare() properly if clk_prepare_enable() failed.
nouveau:
- set of misc fixes/cleanups
- display cleanups
gma500:
- Cleanup connector I2C handling
hyperv:
- Unify VRAM allocation of Gen1 and Gen2
meson:
- Support YUV422 output; Refcount fixes
mgag200:
- Support damage clipping
- Support gamma handling
- Protect concurrent HW access
- Fixes to connector
- Store model-specific limits in device-info structure
- fix PCI register init
panfrost:
- Valhall support
r128:
- Fix bit-shift overflow
rockchip:
- Locking fixes in error path
ssd130x:
- Fix built-in linkage
udl:
- Always advertize VGA connector
ast:
- Support multiple outputs
- fix black screen on resume
sun4i:
- HDMI PHY cleanups
vc4:
- Add support for BCM2711
vkms:
- Allocate output buffer with vmalloc()
mcde:
- Fix ref-count leak
mxsfb/lcdif:
- Support i.MX8MP LCD controller
stm/ltdc:
- Support dynamic Z order
- Support mirroring
ingenic:
- Fix display at maximum resolution
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmLp/7YACgkQDHTzWXnE
hr7NjhAAnefa+72EG42OAqajbwTQMENOtFfqyL3k6ueK2ciYbsj/wklw/xc4Ok3o
DM5kG54t+nA9L1M7UyE7eaO36/XcuvS8Ea0uKKkamWt+3Ux4g1Vo1J37nP5sK5jI
GT/wceKA5sk3nuYly2lBby6mVTGuhAX+3edTAFeOwmd0WvQzzpy4vV+nCAgfshUs
ql4gfQPdQdP+wiovUzCIEu6exCSCAI/Oc944fd3AJi5bZbOPFXRS4rMMOLSrdoXV
9P44EZExPbYrDuVUCx/UaZtN8D9myyyBfZe62CtdgNyTYUHXnHCBYue+7D/s5O+y
GaLWcP128MsqZNmJNhmcWFIlgqowO24YkKUH68JH0UtBLSWich8rfdEsrxIidYED
0ma1jodRapjyZOjrHEJ3N5deKpoflMmqvCMpvIk1Ev6pT8KX9a6u34kLgsOVCV41
2bDEYD+DbRW2FexGR79yB2huXHGSnguco6069ca1oy9RF4q8cX6Pb1w2u42oS7zX
lIgLIashilVR2AYg/qi6IPHavmOQ9ItSXPC+4YasYiMGp/mwePqpmL63b/wkhg0D
nXn6/F8Bm6wle2FFbkLGwo1fF1Hn7RzTHSlqRWDKSEaMLhCus6M09VsobFCB19i0
lO4FNVTL8ZtryR94bgVmgi616w9hOhDhM9A+C0kJ9KBkDnDYUJU=
=HQ9U
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- New driver for logicvc - which is a display IP core.
- EDID parser rework to add new extensions
- fbcon scrolling improvements
- i915 has some more DG2 work but not enabled by default, but should
have enough features for userspace to work now.
Otherwise it's lots of work all over the place. Detailed summary:
New driver:
- logicvc
vfio:
- use aperture API
core:
- of: Add data-lane helpers and convert drivers
- connector: Remove deprecated ida_simple_get()
media:
- Add various RGB666 and RGB888 format constants
panel:
- Add HannStar HSD101PWW
- Add ETML0700Y5DHA
dma-buf:
- add sync-file API
- set dma mask for udmabuf devices
fbcon:
- Improve scrolling performance
- Sanitize input
fbdev:
- device unregistering fixes
- vesa: Support COMPILE_TEST
- Disable firmware-device registration when first native driver loads
aperture:
- fix segfault during hot-unplug
- export for use with other subsystems
client:
- use driver validated modes
dp:
- aux: make probing more reliable
- mst: Read extended DPCD capabilities during system resume
- Support waiting for HDP signal
- Port-validation fixes
edid:
- CEA data-block iterators
- struct drm_edid introduction
- implement HF-EEODB extension
gem:
- don't use fb format non-existing planes
probe-helper:
- use 640x480 as displayport fallback
scheduler:
- don't kill jobs in interrupt context
bridge:
- Add support for i.MX8qxp and i.MX8qm
- lots of fixes/cleanups
- Add TI-DLPC3433
- fy07024di26a30d: Optional GPIO reset
- ldb: Add reg and reg-name properties to bindings, Kconfig fixes
- lt9611: Fix display sensing;
- tc358767: DSI/DPI refactoring and DSI-to-eDP support, DSI lane handling
- tc358775: Fix clock settings
- ti-sn65dsi83: Allow GPIO to sleep
- adv7511: I2C fixes
- anx7625: Fix error handling; DPI fixes; Implement HDP timeout via callback
- fsl-ldb: Drop DE flip
- ti-sn65dsi86: Convert to atomic modesetting
amdgpu:
- use atomic fence helpers in DM
- fix VRAM address calculations
- export CRTC bpc via debugfs
- Initial devcoredump support
- Enable high priority gfx queue on asics which support it
- Adjust GART size on newer APUs for S/G display
- Soft reset for GFX 11 / SDMA 6
- Add gfxoff status query for vangogh
- Fix timestamps for cursor only commits
- Adjust GART size on newer APUs for S/G display
- fix buddy memory corruption
amdkfd:
- MMU notifier fixes
- P2P DMA support using dma-buf
- Add available memory IOCTL
- HMM profiler support
- Simplify GPUVM validation
- Unified memory for CWSR save/restore area
i915:
- General driver clean-up
- DG2 enabling (still under force probe)
- DG2 small BAR memory support
- HuC loading support
- DG2 workarounds
- DG2/ATS-M device IDs added
- Ponte Vecchio prep work and new blitter engines
- add Meteorlake support
- Fix sparse warnings
- DMC MMIO range checks
- Audio related fixes
- Runtime PM fixes
- PSR fixes
- Media freq factor and per-gt enhancements
- DSI fixes for ICL+
- Disable DMC flip queue handlers
- ADL_P voltage swing updates
- Use more the VBT for panel information
- Fix on Type-C ports with TBT mode
- Improve fastset and allow seamless M/N changes
- Accept more fixed modes with VRR/DMRRS panels
- Disable connector polling for a headless SKU
- ADL-S display PLL w/a
- Enable THP on Icelake and beyond
- Fix i915_gem_object_ggtt_pin_ww regression on old platforms
- Expose per tile media freq factor in sysfs
- Fix dma_resv fence handling in multi-batch execbuf
- Improve on suspend / resume time with VT-d enabled
- export CRTC bpc settings via debugfs
msm:
- gpu: a619 support
- gpu: Fix for unclocked GMU register access
- gpu: Devcore dump enhancements
- client utilization via fdinfo support
- fix fence rollover issue
- gem: Lockdep false-positive warning fix
- gem: Switch to pfn mappings
- WB support on sc7180
- dp: dropped custom bulk clock implementation
- fix link retraining on resolution change
- hdmi: dropped obsolete GPIO support
tegra:
- context isolation for host1x engines
- tegra234 soc support
mediatek:
- add vdosys0/1 for mt8195
- add MT8195 dp_intf driver
exynos:
- Fix resume function issue of exynos decon driver by calling
clk_disable_unprepare() properly if clk_prepare_enable() failed.
nouveau:
- set of misc fixes/cleanups
- display cleanups
gma500:
- Cleanup connector I2C handling
hyperv:
- Unify VRAM allocation of Gen1 and Gen2
meson:
- Support YUV422 output; Refcount fixes
mgag200:
- Support damage clipping
- Support gamma handling
- Protect concurrent HW access
- Fixes to connector
- Store model-specific limits in device-info structure
- fix PCI register init
panfrost:
- Valhall support
r128:
- Fix bit-shift overflow
rockchip:
- Locking fixes in error path
ssd130x:
- Fix built-in linkage
udl:
- Always advertize VGA connector
ast:
- Support multiple outputs
- fix black screen on resume
sun4i:
- HDMI PHY cleanups
vc4:
- Add support for BCM2711
vkms:
- Allocate output buffer with vmalloc()
mcde:
- Fix ref-count leak
mxsfb/lcdif:
- Support i.MX8MP LCD controller
stm/ltdc:
- Support dynamic Z order
- Support mirroring
ingenic:
- Fix display at maximum resolution"
* tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm: (1480 commits)
drm/amd/display: Fix a compilation failure on PowerPC caused by FPU code
drm/amdgpu: enable support for psp 13.0.4 block
drm/amdgpu: add files for PSP 13.0.4
drm/amdgpu: add header files for MP 13.0.4
drm/amdgpu: correct RLC_RLCS_BOOTLOAD_STATUS offset and index
drm/amdgpu: send msg to IMU for the front-door loading
drm/amdkfd: use time_is_before_jiffies(a + b) to replace "jiffies - a > b"
drm/amdgpu: fix hive reference leak when reflecting psp topology info
drm/amd/pm: enable GFX ULV feature support for SMU13.0.0
drm/amd/pm: update driver if header for SMU 13.0.0
drm/amdgpu: move mes self test after drm sched re-started
drm/amdgpu: drop non-necessary call trace dump
drm/amdgpu: enable VCN cg and JPEG cg/pg
drm/amdgpu: vcn_4_0_2 video codec query
drm/amdgpu: add VCN_4_0_2 firmware support
drm/amdgpu: add VCN function in NBIO v7.7
drm/amdgpu: fix a vcn4 boot poll bug in emulation mode
drm/amd/amdgpu: add memory training support for PSP_V13
drm/amdkfd: remove an unnecessary amdgpu_bo_ref
drm/amd/pm: Add get_gfx_off_status interface for yellow carp
...
magical no_llseek thing and makes checks consistent. In particular,
ad-hoc "can we do splice via internal pipe" checks got saner (and
somewhat more permissive, which is what Jason had been after, AFAICT)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYug2xgAKCRBZ7Krx/gZQ
6wxWAQDqeg+xMq2FGPXmgjCa+Cp3PXH96Lp6f3hHzakIDx+t8gEAxvuiXAD22Mct
6S1SKuGj0iDIuM4L7hUiWTiY/bDXSAc=
=3EC/
-----END PGP SIGNATURE-----
Merge tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs lseek updates from Al Viro:
"Jason's lseek series.
Saner handling of 'lseek should fail with ESPIPE' - this gets rid of
the magical no_llseek thing and makes checks consistent.
In particular, the ad-hoc "can we do splice via internal pipe" checks
got saner (and somewhat more permissive, which is what Jason had been
after, AFAICT)"
* tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: remove no_llseek
fs: check FMODE_LSEEK to control internal pipe splicing
vfio: do not set FMODE_LSEEK flag
dma-buf: remove useless FMODE_LSEEK flag
fs: do not compare against ->llseek
fs: clear or set FMODE_LSEEK based on llseek function
KVM/s390, KVM/x86 and common infrastructure changes for 5.20
x86:
* Permit guests to ignore single-bit ECC errors
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Cleanups for MCE MSR emulation
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
Generic:
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
x86:
* Use try_cmpxchg64 instead of cmpxchg64
* Bugfixes
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
x86 cleanups:
* Use separate namespaces for guest PTEs and shadow PTEs bitmasks
* PIO emulation
* Reorganize rmap API, mostly around rmap destruction
* Do not workaround very old KVM bugs for L0 that runs with nesting enabled
* new selftests API for CPUID
Most of the callers of vfio_pin_pages() want "struct page *" and the
low-level mm code to pin pages returns a list of "struct page *" too.
So there's no gain in converting "struct page *" to PFN in between.
Replace the output parameter "phys_pfn" list with a "pages" list, to
simplify callers. This also allows us to replace the vfio_iommu_type1
implementation with a more efficient one.
And drop the pfn_valid check in the gvt code, as there is no need to
do such a check at a page-backed struct page pointer.
For now, also update vfio_iommu_type1 to fit this new parameter too.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Eric Farman <farman@linux.ibm.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-11-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Following the updated vfio_pin/unpin_pages(), use the simpler "iova".
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-9-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_pin/unpin_pages() so far accepted arrays of PFNs of user IOVA.
Among all three callers, there was only one caller possibly passing in
a non-contiguous PFN list, which is now ensured to have contiguous PFN
inputs too.
Pass in the starting address with "iova" alone to simplify things, so
callers no longer need to maintain a PFN list or to pin/unpin one page
at a time. This also allows VFIO to use more efficient implementations
of pin/unpin_pages.
For now, also update vfio_iommu_type1 to fit this new parameter too,
while keeping its input intact (being user_iova) since we don't want
to spend too much effort swapping its parameters and local variables
at that level.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Eric Farman <farman@linux.ibm.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-6-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There's only one caller that checks its return value with a WARN_ON_ONCE,
while all other callers don't check the return value at all. Above that,
an undo function should not fail. So, simplify the API to return void by
embedding similar WARN_ONs.
Also for users to pinpoint which condition fails, separate WARN_ON lines,
yet remove the "driver->ops->unpin_pages" check, since it's unreasonable
for callers to unpin on something totally random that wasn't even pinned.
And remove NULL pointer checks for they would trigger oops vs. warnings.
Note that npage is already validated in the vfio core, thus drop the same
check in the type1 code.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20220723020256.30081-2-nicolinc@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Grepping for "iommu_ops" finds this spot and gives wrong impression
that iommu_ops is used in here, fix the comment.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Link: https://lore.kernel.org/r/20220714080912.3713509-1-aik@ozlabs.ru
[aw: convert to multi-line comment]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Instead of bouncing the function call to the driver op through a blocking
notifier just have the iommu layer call it directly.
Register each device that is being attached to the iommu with the lower
driver which then threads them on a linked list and calls the appropriate
driver op at the right time.
Currently the only use is if dma_unmap() is defined.
Also, fully lock all the debugging tests on the pinning path that a
dma_unmap is registered.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Instead of having drivers register the notifier with explicit code just
have them provide a dma_unmap callback op in their driver ops and rely on
the core code to wire it up.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This file does not support llseek, so don't set the flag advertising it.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When doing load/store interpretation, the maximum store block length is
determined by the underlying firmware, not the host kernel API. Reflect
that in the associated Query PCI Function Group clp capability and let
userspace decide which is appropriate to present to the guest.
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220606203325.110625-20-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
The function handle is a system-wide unique identifier for a zPCI
device. With zPCI instruction interpretation, the host will no
longer be executing the zPCI instructions on behalf of the guest.
As a result, the guest needs to use the real function handle in
order for firmware to associate the instruction with the proper
PCI function. Let's provide that handle to the guest.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220606203325.110625-19-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
During vfio-pci open_device, pass the KVM associated with the vfio group
(if one exists). This is needed in order to pass a special indicator
(GISA) to firmware to allow zPCI interpretation facilities to be used
for only the specific KVM associated with the vfio-pci device. During
vfio-pci close_device, unregister the notifier.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/20220606203325.110625-18-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
The current contents of vfio-pci-zdev are today only useful in a KVM
environment; let's tie everything currently under vfio-pci-zdev to
this Kconfig statement and require KVM in this case, reducing complexity
(e.g. symbol lookups).
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/20220606203325.110625-11-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
The parameter of container has been unused for tce_iommu_unuse_page.
So, we should delete it.
Signed-off-by: Deming Wang <wangdeming@inspur.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Link: https://lore.kernel.org/r/20220702064613.5293-1-wangdeming@inspur.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The test isn't going to work if a group doesn't exist. Normally this isn't
a problem since VFIO isn't going to create a device if there is no group,
but the special CONFIG_VFIO_NOIOMMU behavior allows bypassing this
prevention. The new cap test effectively forces a group and breaks this
config option.
Move the cap test to vfio_group_find_or_alloc() which is the earliest time
we know we have a group available and thus are not running in noiommu mode.
Fixes: e8ae0e140c ("vfio: Require that devices support DMA cache coherence")
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-e8934b490f36+f4-vfio_cap_fix_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We do not protect the vfio_device::open_count with group_rwsem elsewhere (see
vfio_device_fops_release as a comparison, where we already drop group_rwsem
before open_count--). So move the group_rwsem unlock prior to open_count--.
This change now also drops group_rswem before setting device->kvm = NULL,
but that's also OK (again, just like vfio_device_fops_release). The setting
of device->kvm before open_device is technically done while holding the
group_rwsem, this is done to protect the group kvm value we are copying from,
and we should not be relying on that to protect the contents of device->kvm;
instead we assume this value will not change until after the device is closed
and while under the dev_set->lock.
Cc: Matthew Rosato <mjrosato@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20220627074119.523274-1-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In function vfio_dma_do_unmap(), we currently prevent process to unmap
vfio dma region whose mm_struct is different from the vfio_dma->task.
In our virtual machine scenario which is using kvm and qemu, this
judgement stops us from liveupgrading our qemu, which uses fork() &&
exec() to load the new binary but the new process cannot do the
VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
This judgement is added in commit 8f0d5bb95f ("vfio iommu type1: Add
task structure to vfio_dma") for the security reason. But it seems that
no other task who has no family relationship with old and new process
can get the same vfio_dma struct here for the reason of resource
isolation. So this patch delete it.
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Link: https://lore.kernel.org/r/20220627035109.73745-1-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
On buffer resize failure, vfio_info_cap_add() will free the buffer,
report zero for the size, and return -ENOMEM. As additional
hardening, also clear the buffer pointer to prevent any chance of a
double free.
Signed-off-by: Schspa Shi <schspa@gmail.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20220629022948.55608-1-schspa@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As iommu_group_set_name() can fail, we should check the return value.
Signed-off-by: Liam Ni <zhiguangni01@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20220625114239.9301-1-zhiguangni01@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio core checks whether the driver sets some migration op (e.g.
set_state/get_state) and accordingly calls its op.
However, currently mlx5 driver sets the above ops without regards to its
migration caps.
This might lead to unexpected usage/Oops if user space may call to the
above ops even if the driver doesn't support migration. As for example,
the migration state_mutex is not initialized in that case.
The cleanest way to manage that seems to split the migration ops from
the main device ops, this will let the driver setting them separately
from the main ops when it's applicable.
As part of that, validate ops construction on registration and include a
check for VFIO_MIGRATION_STOP_COPY since the uAPI claims it must be set
in migration_flags.
HISI driver was changed as well to match this scheme.
This scheme may enable down the road to come with some extra group of
ops (e.g. DMA log) that can be set without regards to the other options
based on driver caps.
Fixes: 6fadb02126 ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220628155910.171454-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Protect mlx5vf_disable_fds() upon close device to be called under the
state mutex as done in all other places.
This will prevent a race with any other flow which calls
mlx5vf_disable_fds() as of health/recovery upon
MLX5_PF_NOTIFY_DISABLE_VF event.
Encapsulate this functionality in a separate function named
mlx5vf_cmd_close_migratable() to consider migration caps and for further
usage upon close device.
Fixes: 6fadb02126 ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220628155910.171454-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As vfio_register_iommu_driver() can fail, we should check the return value.
Signed-off-by: Bo Liu <liubo03@inspur.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20220622045651.5416-1-liubo03@inspur.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use the new interface to check the capabilities for our device
specifically.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4ea5eb64246f1ee188d1a61c3e93b37756932eb7.1656092606.git.robin.murphy@arm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Since IOMMU groups are mandatory for drivers to support, it stands to
reason that any device which has been successfully added to a group
must be on a bus supported by that IOMMU driver, and therefore a domain
viable for any device in the group must be viable for all devices in
the group. This already has to be the case for the IOMMU API's internal
default domain, for instance. Thus even if the group contains devices on
different buses, that can only mean that the IOMMU driver actually
supports such an odd topology, and so without loss of generality we can
expect the bus type of any device in a group to be suitable for IOMMU
API calls.
Furthermore, scrutiny reveals a lack of protection for the bus being
removed while vfio_iommu_type1_attach_group() is using it; the reference
that VFIO holds on the iommu_group ensures that data remains valid, but
does not prevent the group's membership changing underfoot.
We can address both concerns by recycling vfio_bus_type() into some
superficially similar logic to indirect the IOMMU API calls themselves.
Each call is thus protected from races by the IOMMU group's own locking,
and we no longer need to hold group-derived pointers beyond that scope.
It also gives us an easy path for the IOMMU API's migration of bus-based
interfaces to device-based, of which we can already take the first step
with device_iommu_capable(). As with domains, any capability must in
practice be consistent for devices in a given group - and after all it's
still the same capability which was expected to be consistent across an
entire bus! - so there's no need for any complicated validation.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/194a12d3434d7b38f84fa96503c7664451c8c395.1656092606.git.robin.murphy@arm.com
[aw: add comment to vfio_iommu_device_capable()]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The use of 'extern' in function prototypes has been disrecommended in
the kernel coding style for several years now, remove them from all vfio
related files so contributors no longer need to decide between style and
consistency.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/165471414407.203056.474032786990662279.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Console drivers can create conflicts with PCI resources resulting in
userspace getting mmap failures to memory BARs. This is especially
evident when trying to re-use the system primary console for userspace
drivers. Use the aperture helpers to remove these conflicts.
v3:
* call aperture_remove_conflicting_pci_devices()
Reported-by: Laszlo Ersek <lersek@redhat.com>
Suggested-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220622140134.12763-4-tzimmermann@suse.de
- Improvements to mlx5 vfio-pci variant driver, including support
for parallel migration per PF (Yishai Hadas)
- Remove redundant iommu_present() check (Robin Murphy)
- Ongoing refactoring to consolidate the VFIO driver facing API
to use vfio_device (Jason Gunthorpe)
- Use drvdata to store vfio_device among all vfio-pci and variant
drivers (Jason Gunthorpe)
- Remove redundant code now that IOMMU core manages group DMA
ownership (Jason Gunthorpe)
- Remove vfio_group from external API handling struct file ownership
(Jason Gunthorpe)
- Correct typo in uapi comments (Thomas Huth)
- Fix coccicheck detected deadlock (Wan Jiabing)
- Use rwsem to remove races and simplify code around container and
kvm association to groups (Jason Gunthorpe)
- Harden access to devices in low power states and use runtime PM to
enable d3cold support for unused devices (Abhishek Sahu)
- Fix dma_owner handling of fake IOMMU groups (Jason Gunthorpe)
- Set driver_managed_dma on vfio-pci variant drivers (Jason Gunthorpe)
- Pass KVM pointer directly rather than via notifier (Matthew Rosato)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmKPvyMbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsihegP/3XamiYsS0GuA7awAq/X
h9Jahb6kJ+sh0RXL1Gqzc9nxH5X9H/hBcL88VOV3GLwyOhNVNpVjQXGguL3aLaCE
zUrs0+AFEJb990y9H+VgwIDom5BIpgdZ2naG42bz9wUeVGg4daJnkMwOgXwIBzfx
IOddktN6UwuE+DyA57yqL93f+0cTrhYZx9R14sDoLR5lE4uGnbQwIknawEKVtoeR
rEPaCFptxPxCUbqoOSR0Y3bu6rUYSH4iiMZpMviqm2ak3aNn76gru3q4QAnI4gTd
l/w+2OJNFC0U7H5Cz7cdIn2StdJvfSkX0e753+qsFccFsViRCGdnW0Lht/xrYrFC
i8AJxkrq2/bs00LXs7kzcruaD8pJ2UPe2x2+nupHSEsj99K4NraeHRB2CC1uwj0d
gYliOSW5T3//wOpztK48s475VppgXeKWkXGoNY3JJlGjAPyd0vFrH8hRLhVZJ9uI
/eLh6hQnOJuCDz1rQrVNRk6cZi9R1Wpl5dvCBRLqjK519nm569aTlVBra+iNyUCQ
lU5/kN0ym8+X8CweE5ILPGiX2iEXBYMqv+Dm5yOimRUHRJZHYv900FX0GVEnCUCq
23sMDaeHS1hyDCQk//bd2Ig7xjh7mbh7CrKcdJ7pL5Gc/A1zkCXd54hvxViiGwQq
U5KIPTyJy+erpcpxjUApaoP2
=etEI
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.19-rc1' of https://github.com/awilliam/linux-vfio
Pull vfio updates from Alex Williamson:
- Improvements to mlx5 vfio-pci variant driver, including support for
parallel migration per PF (Yishai Hadas)
- Remove redundant iommu_present() check (Robin Murphy)
- Ongoing refactoring to consolidate the VFIO driver facing API to use
vfio_device (Jason Gunthorpe)
- Use drvdata to store vfio_device among all vfio-pci and variant
drivers (Jason Gunthorpe)
- Remove redundant code now that IOMMU core manages group DMA ownership
(Jason Gunthorpe)
- Remove vfio_group from external API handling struct file ownership
(Jason Gunthorpe)
- Correct typo in uapi comments (Thomas Huth)
- Fix coccicheck detected deadlock (Wan Jiabing)
- Use rwsem to remove races and simplify code around container and kvm
association to groups (Jason Gunthorpe)
- Harden access to devices in low power states and use runtime PM to
enable d3cold support for unused devices (Abhishek Sahu)
- Fix dma_owner handling of fake IOMMU groups (Jason Gunthorpe)
- Set driver_managed_dma on vfio-pci variant drivers (Jason Gunthorpe)
- Pass KVM pointer directly rather than via notifier (Matthew Rosato)
* tag 'vfio-v5.19-rc1' of https://github.com/awilliam/linux-vfio: (38 commits)
vfio: remove VFIO_GROUP_NOTIFY_SET_KVM
vfio/pci: Add driver_managed_dma to the new vfio_pci drivers
vfio: Do not manipulate iommu dma_owner for fake iommu groups
vfio/pci: Move the unused device into low power state with runtime PM
vfio/pci: Virtualize PME related registers bits and initialize to zero
vfio/pci: Change the PF power state to D0 before enabling VFs
vfio/pci: Invalidate mmaps and block the access in D3hot power state
vfio: Change struct vfio_group::container_users to a non-atomic int
vfio: Simplify the life cycle of the group FD
vfio: Fully lock struct vfio_group::container
vfio: Split up vfio_group_get_device_fd()
vfio: Change struct vfio_group::opened from an atomic to bool
vfio: Add missing locking for struct vfio_group::kvm
kvm/vfio: Fix potential deadlock problem in vfio
include/uapi/linux/vfio.h: Fix trivial typo - _IORW should be _IOWR instead
vfio/pci: Use the struct file as the handle not the vfio_group
kvm/vfio: Remove vfio_group from kvm
vfio: Change vfio_group_set_kvm() to vfio_file_set_kvm()
vfio: Change vfio_external_check_extension() to vfio_file_enforced_coherent()
vfio: Remove vfio_external_group_match_file()
...
Including:
- Intel VT-d driver updates
- Domain force snooping improvement.
- Cleanups, no intentional functional changes.
- ARM SMMU driver updates
- Add new Qualcomm device-tree compatible strings
- Add new Nvidia device-tree compatible string for Tegra234
- Fix UAF in SMMUv3 shared virtual addressing code
- Force identity-mapped domains for users of ye olde SMMU
legacy binding
- Minor cleanups
- Patches to fix a BUG_ON in the vfio_iommu_group_notifier
- Groundwork for upcoming iommufd framework
- Introduction of DMA ownership so that an entire IOMMU group
is either controlled by the kernel or by user-space
- MT8195 and MT8186 support in the Mediatek IOMMU driver
- Patches to make forcing of cache-coherent DMA more coherent
between IOMMU drivers
- Fixes for thunderbolt device DMA protection
- Various smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmKWCbUACgkQK/BELZcB
GuPHmRAAuoH9iK/jrC3SgrqpBfH2iRN7ovIX8dFvgbQWX27lhXF4gvj2/nYdIvPK
75j/LmdibuzV3Iez4kjbGKNG1AikwK3dKIH21a84f3ctnoamQyL6nMfCVBFaVD/D
kvPpTHyjbGPNf6KZyWQdkJ5DXD1aoG1DKkBnslH5pTNPqGuNqbcnRTg0YxiJFLBv
5w2B6jL06XRzunh+Sp1Dbj+po8ROjLRCEU+tdrndO8W/Dyp6+ZNNuxL9/3BM9zMj
py0M4piFtGnhmJSdym1eeHm7r1YRjkZw+MN+e8NcrcSihmDutEWo7nRRxA5uVaa+
3O2DNERqCvQUYxfNRUOKwzV8v51GYQHEPhvOe/MLgaEQDmDmlF2dHNGm93eCMdrv
m1cT011oU7pa4qHomwLyTJxSsR7FzJ37igq/WeY++MBhl+frqfzEQPVxF+W7GLb8
QvT/+woCPzLVpJbE7s0FUD4nbPd8c1dAz4+HO1DajxILIOTq1bnPIorSjgXODRjq
yzsiP1rAg0L0PsL7pXn3cPMzNCE//xtOsRsAGmaVv6wBoMLyWVFCU/wjPEdjrSWA
nXpAuCL84uxCEl/KLYMsg9UhjT6ko7CuKdsybIG9zNIiUau43uSqgTen0xCpYt0i
m//O/X3tPyxmoLKRW+XVehGOrBZW+qrQny6hk/Zex+6UJQqVMTA=
=W0hj
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Intel VT-d driver updates:
- Domain force snooping improvement.
- Cleanups, no intentional functional changes.
- ARM SMMU driver updates:
- Add new Qualcomm device-tree compatible strings
- Add new Nvidia device-tree compatible string for Tegra234
- Fix UAF in SMMUv3 shared virtual addressing code
- Force identity-mapped domains for users of ye olde SMMU legacy
binding
- Minor cleanups
- Fix a BUG_ON in the vfio_iommu_group_notifier:
- Groundwork for upcoming iommufd framework
- Introduction of DMA ownership so that an entire IOMMU group is
either controlled by the kernel or by user-space
- MT8195 and MT8186 support in the Mediatek IOMMU driver
- Make forcing of cache-coherent DMA more coherent between IOMMU
drivers
- Fixes for thunderbolt device DMA protection
- Various smaller fixes and cleanups
* tag 'iommu-updates-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (88 commits)
iommu/amd: Increase timeout waiting for GA log enablement
iommu/s390: Tolerate repeat attach_dev calls
iommu/vt-d: Remove hard coding PGSNP bit in PASID entries
iommu/vt-d: Remove domain_update_iommu_snooping()
iommu/vt-d: Check domain force_snooping against attached devices
iommu/vt-d: Block force-snoop domain attaching if no SC support
iommu/vt-d: Size Page Request Queue to avoid overflow condition
iommu/vt-d: Fold dmar_insert_one_dev_info() into its caller
iommu/vt-d: Change return type of dmar_insert_one_dev_info()
iommu/vt-d: Remove unneeded validity check on dev
iommu/dma: Explicitly sort PCI DMA windows
iommu/dma: Fix iova map result check bug
iommu/mediatek: Fix NULL pointer dereference when printing dev_name
iommu: iommu_group_claim_dma_owner() must always assign a domain
iommu/arm-smmu: Force identity domains for legacy binding
iommu/arm-smmu: Support Tegra234 SMMU
dt-bindings: arm-smmu: Add compatible for Tegra234 SOC
dt-bindings: arm-smmu: Document nvidia,memory-controller property
iommu/arm-smmu-qcom: Add SC8280XP support
dt-bindings: arm-smmu: Add compatible for Qualcomm SC8280XP
...
Rather than relying on a notifier for associating the KVM with
the group, let's assume that the association has already been
made prior to device_open. The first time a device is opened
associate the group KVM with the device.
This fixes a user-triggerable oops in GVT.
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Zhi Wang <zhi.a.wang@intel.com>
Link: https://lore.kernel.org/r/20220519183311.582380-2-mjrosato@linux.ibm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When the iommu series adding driver_managed_dma was rebased it missed that
new VFIO drivers were added and did not update them too.
Without this vfio will claim the groups are not viable.
Add driver_managed_dma to mlx5 and hisi.
Fixes: 70693f4708 ("vfio: Set DMA ownership for VFIO devices")
Reported-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-f9dfa642fab0+2b3-vfio_managed_dma_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Since asserting dma ownership now causes the group to have its DMA blocked
the iommu layer requires a working iommu. This means the dma_owner APIs
cannot be used on the fake groups that VFIO creates. Test for this and
avoid calling them.
Otherwise asserting dma ownership will fail for VFIO mdev devices as a
BLOCKING iommu_domain cannot be allocated due to the NULL iommu ops.
Fixes: 0286300e60 ("iommu: iommu_group_claim_dma_owner() must always assign a domain")
Reported-by: Eric Farman <farman@linux.ibm.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/0-v1-9cfc47edbcd4+13546-vfio_dma_owner_fix_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently, there is very limited power management support
available in the upstream vfio_pci_core based drivers. If there
are no users of the device, then the PCI device will be moved into
D3hot state by writing directly into PCI PM registers. This D3hot
state help in saving power but we can achieve zero power consumption
if we go into the D3cold state. The D3cold state cannot be possible
with native PCI PM. It requires interaction with platform firmware
which is system-specific. To go into low power states (including D3cold),
the runtime PM framework can be used which internally interacts with PCI
and platform firmware and puts the device into the lowest possible
D-States.
This patch registers vfio_pci_core based drivers with the
runtime PM framework.
1. The PCI core framework takes care of most of the runtime PM
related things. For enabling the runtime PM, the PCI driver needs to
decrement the usage count and needs to provide 'struct dev_pm_ops'
at least. The runtime suspend/resume callbacks are optional and needed
only if we need to do any extra handling. Now there are multiple
vfio_pci_core based drivers. Instead of assigning the
'struct dev_pm_ops' in individual parent driver, the vfio_pci_core
itself assigns the 'struct dev_pm_ops'. There are other drivers where
the 'struct dev_pm_ops' is being assigned inside core layer
(For example, wlcore_probe() and some sound based driver, etc.).
2. This patch provides the stub implementation of 'struct dev_pm_ops'.
The subsequent patch will provide the runtime suspend/resume
callbacks. All the config state saving, and PCI power management
related things will be done by PCI core framework itself inside its
runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
pci_pm_runtime_resume()).
3. Inside pci_reset_bus(), all the devices in dev_set needs to be
runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
care of the runtime resume and its error handling.
4. Inside vfio_pci_core_disable(), the device usage count always needs
to be decremented which was incremented in vfio_pci_core_enable().
5. Since the runtime PM framework will provide the same functionality,
so directly writing into PCI PM config register can be replaced with
the use of runtime PM routines. Also, the use of runtime PM can help
us in more power saving.
In the systems which do not support D3cold,
With the existing implementation:
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0
With runtime PM:
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3hot
So, with runtime PM, the upstream bridge or root port will also go
into lower power state which is not possible with existing
implementation.
In the systems which support D3cold,
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0
With runtime PM:
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3cold
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3cold
So, with runtime PM, both the PCI device and upstream bridge will
go into D3cold state.
6. If 'disable_idle_d3' module parameter is set, then also the runtime
PM will be enabled, but in this case, the usage count should not be
decremented.
7. vfio_pci_dev_set_try_reset() return value is unused now, so this
function return type can be changed to void.
8. Use the runtime PM API's in vfio_pci_core_sriov_configure().
The device can be in low power state either with runtime
power management (when there is no user) or PCI_PM_CTRL register
write by the user. In both the cases, the PF should be moved to
D0 state. For preventing any runtime usage mismatch, pci_num_vf()
has been called explicitly during disable.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-5-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If any PME event will be generated by PCI, then it will be mostly
handled in the host by the root port PME code. For example, in the case
of PCIe, the PME event will be sent to the root port and then the PME
interrupt will be generated. This will be handled in
drivers/pci/pcie/pme.c at the host side. Inside this, the
pci_check_pme_status() will be called where PME_Status and PME_En bits
will be cleared. So, the guest OS which is using vfio-pci device will
not come to know about this PME event.
To handle these PME events inside guests, we need some framework so
that if any PME events will happen, then it needs to be forwarded to
virtual machine monitor. We can virtualize PME related registers bits
and initialize these bits to zero so vfio-pci device user will assume
that it is not capable of asserting the PME# signal from any power state.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-4-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
According to [PCIe v5 9.6.2] for PF Device Power Management States
"The PF's power management state (D-state) has global impact on its
associated VFs. If a VF does not implement the Power Management
Capability, then it behaves as if it is in an equivalent
power state of its associated PF.
If a VF implements the Power Management Capability, the Device behavior
is undefined if the PF is placed in a lower power state than the VF.
Software should avoid this situation by placing all VFs in lower power
state before lowering their associated PF's power state."
From the vfio driver side, user can enable SR-IOV when the PF is in D3hot
state. If VF does not implement the Power Management Capability, then
the VF will be actually in D3hot state and then the VF BAR access will
fail. If VF implements the Power Management Capability, then VF will
assume that its current power state is D0 when the PF is D3hot and
in this case, the behavior is undefined.
To support PF power management, we need to create power management
dependency between PF and its VF's. The runtime power management support
may help with this where power management dependencies are supported
through device links. But till we have such support in place, we can
disallow the PF to go into low power state, if PF has VF enabled.
There can be a case, where user first enables the VF's and then
disables the VF's. If there is no user of PF, then the PF can put into
D3hot state again. But with this patch, the PF will still be in D0
state after disabling VF's since detecting this case inside
vfio_pci_core_sriov_configure() requires access to
struct vfio_device::open_count along with its locks. But the subsequent
patches related to runtime PM will handle this case since runtime PM
maintains its own usage count.
Also, vfio_pci_core_sriov_configure() can be called at any time
(with and without vfio pci device user), so the power state change
and SR-IOV enablement need to be protected with the required locks.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-3-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
According to [PCIe v5 5.3.1.4.1] for D3hot state
"Configuration and Message requests are the only TLPs accepted by a
Function in the D3Hot state. All other received Requests must be
handled as Unsupported Requests, and all received Completions may
optionally be handled as Unexpected Completions."
Currently, if the vfio PCI device has been put into D3hot state and if
user makes non-config related read/write request in D3hot state, these
requests will be forwarded to the host and this access may cause
issues on a few systems.
This patch leverages the memory-disable support added in commit
'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
disabled memory")' to generate page fault on mmap access and
return error for the direct read/write. If the device is D3hot state,
then the error will be returned for MMIO access. The IO access generally
does not make the system unresponsive so the IO access can still happen
in D3hot state. The default value should be returned in this case
without bringing down the complete system.
Also, the power related structure fields need to be protected so
we can use the same 'memory_lock' to protect these fields also.
This protection is mainly needed when user changes the PCI
power state by writing into PCI_PM_CTRL register.
vfio_lock_and_set_power_state() wrapper function will take the
required locks and then it will invoke the vfio_pci_set_power_state().
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that everything is fully locked there is no need for container_users
to remain as an atomic, change it to an unsigned int.
Use 'if (group->container)' as the test to determine if the container is
present or not instead of using container_users.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/6-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Once userspace opens a group FD it is prevented from opening another
instance of that same group FD until all the prior group FDs and users of
the container are done.
The first is done trivially by checking the group->opened during group FD
open.
However, things get a little weird if userspace creates a device FD and
then closes the group FD. The group FD still cannot be re-opened, but this
time it is because the group->container is still set and container_users
is elevated by the device FD.
Due to this mismatched lifecycle we have the
vfio_group_try_dissolve_container() which tries to auto-free a container
after the group FD is closed but the device FD remains open.
Instead have the device FD hold onto a reference to the single group
FD. This directly prevents vfio_group_fops_release() from being called
when any device FD exists and makes the lifecycle model more
understandable.
vfio_group_try_dissolve_container() is removed as the only place a
container is auto-deleted is during vfio_group_fops_release(). At this
point the container_users is either 1 or 0 since all device FDs must be
closed.
Change group->opened to group->opened_file which points to the single
struct file * that is open for the group. If the group->open_file is
NULL then group->container == NULL.
If all device FDs have closed then the group's notifier list must be
empty.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/5-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is necessary to avoid various user triggerable races, for instance
racing SET_CONTAINER/UNSET_CONTAINER:
ioctl(VFIO_GROUP_SET_CONTAINER)
ioctl(VFIO_GROUP_UNSET_CONTAINER)
vfio_group_unset_container
int users = atomic_cmpxchg(&group->container_users, 1, 0);
// users == 1 container_users == 0
__vfio_group_unset_container(group);
container = group->container;
vfio_group_set_container()
if (!atomic_read(&group->container_users))
down_write(&container->group_lock);
group->container = container;
up_write(&container->group_lock);
down_write(&container->group_lock);
group->container = NULL;
up_write(&container->group_lock);
vfio_container_put(container);
/* woops we lost/leaked the new container */
This can then go on to NULL pointer deref since container == 0 and
container_users == 1.
Wrap all touches of container, except those on a performance path with a
known open device, with the group_rwsem.
The only user of vfio_group_add_container_user() holds the user count for
a simple operation, change it to just hold the group_lock over the
operation and delete vfio_group_add_container_user(). Containers now only
gain a user when a device FD is opened.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/4-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The split follows the pairing with the destroy functions:
- vfio_group_get_device_fd() destroyed by close()
- vfio_device_open() destroyed by vfio_device_fops_release()
- vfio_device_assign_container() destroyed by
vfio_group_try_dissolve_container()
The next patch will put a lock around vfio_device_assign_container().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/3-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is not a performance path, just use the group_rwsem to protect the
value.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/2-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Without locking userspace can trigger a UAF by racing
KVM_DEV_VFIO_GROUP_DEL with VFIO_GROUP_GET_DEVICE_FD:
CPU1 CPU2
ioctl(KVM_DEV_VFIO_GROUP_DEL)
ioctl(VFIO_GROUP_GET_DEVICE_FD)
vfio_group_get_device_fd
open_device()
intel_vgpu_open_device()
vfio_register_notifier()
vfio_register_group_notifier()
blocking_notifier_call_chain(&group->notifier,
VFIO_GROUP_NOTIFY_SET_KVM, group->kvm);
set_kvm()
group->kvm = NULL
close()
kfree(kvm)
intel_vgpu_group_notifier()
vdev->kvm = data
[..]
kvm_get_kvm(vgpu->kvm);
// UAF!
Add a simple rwsem in the group to protect the kvm while the notifier is
using it.
Note this doesn't fix the race internal to i915 where userspace can
trigger two VFIO_GROUP_NOTIFY_SET_KVM's before we reach a consumer of
vgpu->kvm and trigger this same UAF, it just makes the notifier
self-consistent.
Fixes: ccd46dbae7 ("vfio: support notifier chain in vfio_group")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/1-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO PCI does a security check as part of hot reset to prove that the user
has permission to manipulate all the devices that will be impacted by the
reset.
Use a new API vfio_file_has_dev() to perform this security check against
the struct file directly and remove the vfio_group from VFIO PCI.
Since VFIO PCI was the last user of vfio_group_get_external_user() and
vfio_group_put_external_user() remove it as well.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Just change the argument from struct vfio_group to struct file *.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Instead of a general extension check change the function into a limited
test if the iommu_domain has enforced coherency, which is the only thing
kvm needs to query.
Make the new op self contained by properly refcounting the container
before touching it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_group_fops_open() ensures there is only ever one struct file open for
any struct vfio_group at any time:
/* Do we need multiple instances of the group open? Seems not. */
opened = atomic_cmpxchg(&group->opened, 0, 1);
if (opened) {
vfio_group_put(group);
return -EBUSY;
Therefor the struct file * can be used directly to search the list of VFIO
groups that KVM keeps instead of using the
vfio_external_group_match_file() callback to try to figure out if the
passed in FD matches the list or not.
Delete vfio_external_group_match_file().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The only caller wants to get a pointer to the struct iommu_group
associated with the VFIO group file. Instead of returning the group ID
then searching sysfs for that string to get the struct iommu_group just
directly return the iommu_group pointer already held by the vfio_group
struct.
It already has a safe lifetime due to the struct file kref, the vfio_group
and thus the iommu_group cannot be destroyed while the group file is open.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that the iommu core takes care of isolation there is no race between
driver attach and container unset. Once iommu_group_release_dma_owner()
returns the device can immediately be re-used.
Remove this mechanism.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-a1e8791d795b+6b-vfio_container_q_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The last user of this function is in PCI callbacks that want to convert
their struct pci_dev to a vfio_device. Instead of searching use the
vfio_device available trivially through the drvdata.
When a callback in the device_driver is called, the caller must hold the
device_lock() on dev. The purpose of the device_lock is to prevent
remove() from being called (see __device_release_driver), and allow the
driver to safely interact with its drvdata without races.
The PCI core correctly follows this and holds the device_lock() when
calling error_detected (see report_error_detected) and
sriov_configure (see sriov_numvfs_store).
Further, since the drvdata holds a positive refcount on the vfio_device
any access of the drvdata, under the device_lock(), from a driver callback
needs no further protection or refcounting.
Thus the remark in the vfio_device_get_from_dev() comment does not apply
here, VFIO PCI drivers all call vfio_unregister_group_dev() from their
remove callbacks under the device_lock() and cannot race with the
remaining callers.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-c841817a0349+8f-vfio_get_from_dev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Having a consistent pointer in the drvdata will allow the next patch to
make use of the drvdata from some of the core code helpers.
Use a WARN_ON inside vfio_pci_core_register_device() to detect drivers
that miss this.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-c841817a0349+8f-vfio_get_from_dev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When the open_device() op is called the container_users is incremented and
held incremented until close_device(). Thus, so long as drivers call
functions within their open_device()/close_device() region they do not
need to worry about the container_users.
These functions can all only be called between open_device() and
close_device():
vfio_pin_pages()
vfio_unpin_pages()
vfio_dma_rw()
vfio_register_notifier()
vfio_unregister_notifier()
Eliminate the calls to vfio_group_add_container_user() and add
vfio_assert_device_open() to detect driver mis-use. This causes the
close_device() op to check device->open_count so always leave it elevated
while calling the op.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that callers have been updated to use the vfio_device APIs the driver
facing group interface is no longer used, delete it:
- vfio_group_get_external_user_from_dev()
- vfio_group_pin_pages()
- vfio_group_unpin_pages()
- vfio_group_iommu_domain()
--
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Every caller has a readily available vfio_device pointer, use that instead
of passing in a generic struct device. Change vfio_dma_rw() to take in the
struct vfio_device and move the container users that would have been held
by vfio_group_get_external_user_from_dev() to vfio_dma_rw() directly, like
vfio_pin/unpin_pages().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Every caller has a readily available vfio_device pointer, use that instead
of passing in a generic struct device. The struct vfio_device already
contains the group we need so this avoids complexity, extra refcountings,
and a confusing lifecycle model.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
All callers have a struct vfio_device trivially available, pass it in
directly and avoid calling the expensive vfio_group_get_from_dev().
Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
IOMMU groups have been mandatory for some time now, so a device without
one is necessarily a device without any usable IOMMU, therefore the
iommu_present() check is redundant (or at best unhelpful).
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/537103bbd7246574f37f2c88704d7824a3a889f2.1649160714.git.robin.murphy@arm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
From Yishai:
This series improves mlx5 live migration driver in few aspects as of
below.
Refactor to enable running migration commands in parallel over the PF
command interface.
To achieve that we exposed from mlx5_core an API to let the VF be
notified before that the PF command interface goes down/up. (e.g. PF
reload upon health recovery).
Once having the above functionality in place mlx5 vfio doesn't need any
more to obtain the global PF lock upon using the command interface but
can rely on the above mechanism to be in sync with the PF.
This can enable parallel VFs migration over the PF command interface
from kernel driver point of view.
In addition,
Moved to use the PF async command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.
Alex, as this series touches mlx5_core we may need to send this in a
pull request format to VFIO to avoid conflicts before acceptance.
Link: https://lore.kernel.org/all/20220510090206.90374-1-yishaih@nvidia.com
Signed-of-by: Leon Romanovsky <leonro@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQT1m3YD37UfMCUQBNwp8NhrnBAZsQUCYntY3AAKCRAp8NhrnBAZ
scRWAP0QzEqg/Xqk/geUAQ3dliFrA2DZJm8v9B3x5tA5nEAazAD9HqC17MvDzY8T
6KBP7G37JNg2NCkxnKnt2gCIT+O4lgA=
=zwWT
-----END PGP SIGNATURE-----
Merge tag 'mlx5-lm-parallel' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into v5.19/vfio/next
Improve mlx5 live migration driver
From Yishai:
This series improves mlx5 live migration driver in few aspects as of
below.
Refactor to enable running migration commands in parallel over the PF
command interface.
To achieve that we exposed from mlx5_core an API to let the VF be
notified before that the PF command interface goes down/up. (e.g. PF
reload upon health recovery).
Once having the above functionality in place mlx5 vfio doesn't need any
more to obtain the global PF lock upon using the command interface but
can rely on the above mechanism to be in sync with the PF.
This can enable parallel VFs migration over the PF command interface
from kernel driver point of view.
In addition,
Moved to use the PF async command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.
Alex, as this series touches mlx5_core we may need to send this in a
pull request format to VFIO to avoid conflicts before acceptance.
Link: https://lore.kernel.org/all/20220510090206.90374-1-yishaih@nvidia.com
Signed-of-by: Leon Romanovsky <leonro@nvidia.com>
Use the PF asynchronous command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.
Link: https://lore.kernel.org/r/20220510090206.90374-5-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Refactor to enable different VFs to run their commands over the PF
command interface in parallel and to not block one each other.
This is done by not using the global PF lock that was used before but
relying on the VF attach/detach mechanism to sync.
Link: https://lore.kernel.org/r/20220510090206.90374-4-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Manage the VF attach/detach callback from the PF.
This lets the driver to enable parallel VFs migration as will be
introduced in the next patch.
As part of this, reorganize the VF is migratable code to be in a
separate function and rename it to be set_migratable() to match its
functionality.
Link: https://lore.kernel.org/r/20220510090206.90374-3-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
drm/i915 feature pull #2 for v5.19:
Features and functionality:
- Add first set of DG2 PCI IDs for "motherboard down" designs (Matt Roper)
- Add initial RPL-P PCI IDs as ADL-P subplatform (Matt Atwood)
Refactoring and cleanups:
- Power well refactoring and cleanup (Imre)
- GVT-g refactor and mdev API cleanup (Christoph, Jason, Zhi)
- DPLL refactoring and cleanup (Ville)
- VBT panel specific data parsing cleanup (Ville)
- Use drm_mode_init() for on-stack modes (Ville)
Fixes:
- Fix PSR state pipe A/B confusion by clearing more state on disable (José)
- Fix FIFO underruns caused by not taking DRAM channel into account (Vinod)
- Fix FBC flicker on display 11+ by enabling a workaround (José)
- Fix VBT seamless DRRS min refresh rate check (Ville)
- Fix panel type assumption on bogus VBT data (Ville)
- Fix panel data parsing for VBT that misses panel data pointers block (Ville)
- Fix spurious AUX timeout/hotplug handling on LTTPR links (Imre)
Merges:
- Backmerge drm-next (Jani)
- GVT changes (Jani)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87bkwbkkdo.fsf@intel.com
IOMMU_CACHE means that normal DMAs do not require any additional coherency
mechanism and is the basic uAPI that VFIO exposes to userspace. For
instance VFIO applications like DPDK will not work if additional coherency
operations are required.
Therefore check IOMMU_CAP_CACHE_COHERENCY like vdpa & usnic do before
allowing an IOMMU backed VFIO device to be created.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
IOMMU_CACHE means "normal DMA to this iommu_domain's IOVA should be cache
coherent" and is used by the DMA API. The definition allows for special
non-coherent DMA to exist - ie processing of the no-snoop flag in PCIe
TLPs - so long as this behavior is opt-in by the device driver.
The flag is mainly used by the DMA API to synchronize the IOMMU setting
with the expected cache behavior of the DMA master. eg based on
dev_is_dma_coherent() in some case.
For Intel IOMMU IOMMU_CACHE was redefined to mean 'force all DMA to be
cache coherent' which has the practical effect of causing the IOMMU to
ignore the no-snoop bit in a PCIe TLP.
x86 platforms are always IOMMU_CACHE, so Intel should ignore this flag.
Instead use the new domain op enforce_cache_coherency() which causes every
IOPTE created in the domain to have the no-snoop blocking behavior.
Reconfigure VFIO to always use IOMMU_CACHE and call
enforce_cache_coherency() to operate the special Intel behavior.
Remove the IOMMU_CACHE test from Intel IOMMU.
Ultimately VFIO plumbs the result of enforce_cache_coherency() back into
the x86 platform code through kvm_arch_register_noncoherent_dma() which
controls if the WBINVD instruction is available in the guest. No other
archs implement kvm_arch_register_noncoherent_dma() nor are there any
other known consumers of VFIO_DMA_CC_IOMMU that might be affected by the
user visible result change on non-x86 archs.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The iommu core and driver core have been enhanced to avoid unsafe driver
binding to a live group after iommu_group_set_dma_owner(PRIVATE_USER)
has been called. There's no need to register iommu group notifier. This
removes the iommu group notifer which contains BUG_ON() and WARN().
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220418005000.897664-11-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
commit 60720a0fc6 ("vfio: Add device tracking during unbind") added the
unbound list to plug a problem with KVM where KVM_DEV_VFIO_GROUP_DEL
relied on vfio_group_get_external_user() succeeding to return the
vfio_group from a group file descriptor. The unbound list allowed
vfio_group_get_external_user() to continue to succeed in edge cases.
However commit 5d6dee80a1 ("vfio: New external user group/file match")
deleted the call to vfio_group_get_external_user() during
KVM_DEV_VFIO_GROUP_DEL. Instead vfio_external_group_match_file() is used
to directly match the file descriptor to the group pointer.
This in turn avoids the call down to vfio_dev_viable() during
KVM_DEV_VFIO_GROUP_DEL and also avoids the trouble the first commit was
trying to fix.
There are no other users of vfio_dev_viable() that care about the time
after vfio_unregister_group_dev() returns, so simply delete the
unbound_list entirely.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220418005000.897664-10-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
As DMA ownership is claimed for the iommu group when a VFIO group is
added to a VFIO container, the VFIO group viability is guaranteed as long
as group->container_users > 0. Remove those unnecessary group viability
checks which are only hit when group->container_users is not zero.
The only remaining reference is in GROUP_GET_STATUS, which could be called
at any time when group fd is valid. Here we just replace the
vfio_group_viable() by directly calling IOMMU core to get viability status.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220418005000.897664-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Claim group dma ownership when an IOMMU group is set to a container,
and release the dma ownership once the iommu group is unset from the
container.
This change disallows some unsafe bridge drivers to bind to non-ACS
bridges while devices under them are assigned to user space. This is an
intentional enhancement and possibly breaks some existing
configurations. The recommendation to such an affected user would be
that the previously allowed host bridge driver was unsafe for this use
case and to continue to enable assignment of devices within that group,
the driver should be unbound from the bridge device or replaced with the
pci-stub driver.
For any bridge driver, we consider it unsafe if it satisfies any of the
following conditions:
1) The bridge driver uses DMA. Calling pci_set_master() or calling any
kernel DMA API (dma_map_*() and etc.) is an indicate that the
driver is doing DMA.
2) If the bridge driver uses MMIO, it should be tolerant to hostile
userspace also touching the same MMIO registers via P2P DMA
attacks.
If the bridge driver turns out to be a safe one, it could be used as
before by setting the driver's .driver_managed_dma field, just like what
we have done in the pcieport driver.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220418005000.897664-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The device creator is supposed to use the dev.groups value to add sysfs
files before device_add is called, not call sysfs_create_files() after
device_add() returns. This creates a race with uevent delivery where the
extra attribute will not be visible.
This was being done because the groups had been co-opted by the mdev
driver, now that prior patches have moved the driver's groups to the
struct device_driver the dev.group is properly free for use here.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220411141403.86980-34-hch@lst.de
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
The last useful member in this struct is the supported_type_groups, move
it to the mdev_driver and delete mdev_parent_ops.
Replace it with mdev_driver as an argument to mdev_register_device()
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220411141403.86980-33-hch@lst.de
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
This is only used by one sample to print a fixed string that is pointless.
In general, having a device driver attach sysfs attributes to the parent
is horrific. This should never happen, and always leads to some kind of
liftime bug as it become very difficult for the sysfs attribute to go back
to any data owned by the device driver.
Remove the general mechanism to create this abuse.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220411141403.86980-32-hch@lst.de
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Now that all mdev drivers directly create their own mdev_device driver and
directly register with the vfio core's vfio_device_ops this is all dead
code.
Delete vfio_mdev.c and the mdev_parent_ops members that are connected to
it.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220411141403.86980-31-hch@lst.de
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
get_pf_vdev() tries to check if a PF is a VFIO PF by looking at the driver:
if (pci_dev_driver(physfn) != pci_dev_driver(vdev->pdev)) {
However now that we have multiple VF and PF drivers this is no longer
reliable.
This means that security tests realted to vf_token can be skipped by
mixing and matching different VFIO PCI drivers.
Instead of trying to use the driver core to find the PF devices maintain a
linked list of all PF vfio_pci_core_device's that we have called
pci_enable_sriov() on.
When registering a VF just search the list to see if the PF is present and
record the match permanently in the struct. PCI core locking prevents a PF
from passing pci_disable_sriov() while VF drivers are attached so the VFIO
owned PF becomes a static property of the VF.
In common cases where vfio does not own the PF the global list remains
empty and the VF's pointer is statically NULL.
This also fixes a lockdep splat from recursive locking of the
vfio_group::device_lock between vfio_device_get_from_name() and
vfio_device_get_from_dev(). If the VF and PF share the same group this
would deadlock.
Fixes: ff53edf6d6 ("vfio/pci: Split the pci_driver code out of vfio_pci_core.c")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v3-876570980634+f2e8-vfio_vf_token_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Register private handler for pci_error_handlers.reset_done and update
state accordingly.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20220308184902.2242-10-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VMs assigned with HiSilicon ACC VF devices can now perform live migration
if the VF devices are bind to the hisi_acc_vfio_pci driver.
Just like ACC PF/VF drivers this VFIO driver also make use of the HiSilicon
QM interface. QM stands for Queue Management which is a generic IP used by
ACC devices. It provides a generic PCIe interface for the CPU and the ACC
devices to share a group of queues.
QM integrated into an accelerator provides queue management service.
Queues can be assigned to PF and VFs, and queues can be controlled by
unified mailboxes and doorbells.
The QM driver (drivers/crypto/hisilicon/qm.c) provides generic
interfaces to ACC drivers to manage the QM.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20220308184902.2242-9-shameerali.kolothum.thodi@huawei.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
HiSilicon ACC VF device BAR2 region consists of both functional register
space and migration control register space. Unnecessarily exposing the
migration BAR region to the Guest has the potential to prevent/corrupt
the Guest migration.
Hence, introduce a separate struct vfio_device_ops for migration support
which will override the ioctl/read/write/mmap methods to hide the
migration region and limit the Guest access only to the functional
register space.
This will be used in subsequent patches when we add migration support
to the driver.
Please note that it is OK to export the entire VF BAR if migration is
not supported or required as this cannot affect the PF configurations.
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20220308184902.2242-6-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
This will be extended in subsequent patches to add support for VFIO
live migration feature.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20220308184902.2242-5-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Register its own handler for pci_error_handlers.reset_done and update
state accordingly.
Link: https://lore.kernel.org/all/20220224142024.147653-16-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Expose vfio_pci_core_aer_err_detected() to be used by drivers as part of
their pci_error_handlers structure.
Next patch for mlx5 driver will use it.
Link: https://lore.kernel.org/all/20220224142024.147653-15-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
This patch adds support for vfio_pci driver for mlx5 devices.
It uses vfio_pci_core to register to the VFIO subsystem and then
implements the mlx5 specific logic in the migration area.
The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.
This patch implements the suspend/resume flows.
Link: https://lore.kernel.org/all/20220224142024.147653-14-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Expose migration commands over the device, it includes: suspend, resume,
get vhca id, query/save/load state.
As part of this adds the APIs and data structure that are needed to manage
the migration data.
Link: https://lore.kernel.org/all/20220224142024.147653-13-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing P2P transactions.
As an optional extension to the mandatory states it is defined as
in between STOP and RUNNING:
STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
For drivers that are unable to support RUNNING_P2P the core code
silently merges RUNNING_P2P and RUNNING together. Unless driver support
is present, the new state cannot be used in SET_STATE.
Drivers that support this will be required to implement 4 FSM arcs
beyond the basic FSM. 2 of the basic FSM arcs become combination
transitions.
Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.
Link: https://lore.kernel.org/all/20220224142024.147653-11-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Replace the existing region based migration protocol with an ioctl based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.
This is the STOP_COPY portion of the new protocol, it defines the 5 states
for basic stop and copy migration and the protocol to move the migration
data in/out of the kernel.
Compared to the clarification of the v1 protocol Alex proposed:
https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
This has a few deliberate functional differences:
- ERROR arcs allow the device function to remain unchanged.
- The protocol is not required to return to the original state on
transition failure. Instead userspace can execute an unwind back to
the original state, reset, or do something else without needing kernel
support. This simplifies the kernel design and should userspace choose
a policy like always reset, avoids doing useless work in the kernel
on error handling paths.
- PRE_COPY is made optional, userspace must discover it before using it.
This reflects the fact that the majority of drivers we are aware of
right now will not implement PRE_COPY.
- segmentation is not part of the data stream protocol, the receiver
does not have to reproduce the framing boundaries.
The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. Defining
the remaining set of old/new device_state transitions as 'combination
transitions' which are naturally defined as taking multiple FSM arcs along
the shortest path within the FSM's digraph allows a complete matrix of
transitions.
A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
defined to replace writing to the device_state field in the region. This
allows returning a brand new FD whenever the requested transition opens
a data transfer session.
The VFIO core code implements the new feature and provides a helper
function to the driver. Using the helper the driver only has to
implement 6 of the FSM arcs and the other combination transitions are
elaborated consistently from those arcs.
A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
report the capability for migration and indicate which set of states and
arcs are supported by the device. The FSM provides a lot of flexibility to
make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
allows for future breaking extensions for scenarios that cannot support
even the basic STOP_COPY requirements.
The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
of the VFIO device.
Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD functions for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.
The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using a read() on the
region.
Link: https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Invoke a new device op 'device_feature' to handle just the data array
portion of the command. This lifts the ioctl validation to the core code
and makes it simpler for either the core code, or layered drivers, to
implement their own feature values.
Provide vfio_check_feature() to consolidate checking the flags/etc against
what the driver supports.
Link: https://lore.kernel.org/all/20220224142024.147653-9-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
If 'vfio_pci_core_device::needs_pm_restore' is set (PCI device does
not have No_Soft_Reset bit set in its PMCSR config register), then the
current PCI state will be saved locally in
'vfio_pci_core_device::pm_save' during D0->D3hot transition and same
will be restored back during D3hot->D0 transition. For reset-related
functionalities, vfio driver uses PCI reset API's. These
API's internally change the PCI power state back to D0 first if
the device power state is non-D0. This state change to D0 will happen
without the involvement of vfio driver.
Let's consider the following example:
1. The device is in D3hot.
2. User invokes VFIO_DEVICE_RESET ioctl.
3. pci_try_reset_function() will be called which internally
invokes pci_dev_save_and_disable().
4. pci_set_power_state(dev, PCI_D0) will be called first.
5. pci_save_state() will happen then.
Now, for the devices which has NoSoftRst-, the pci_set_power_state()
can trigger soft reset and the original PCI config state will be lost
at step (4) and this state cannot be restored again. This original PCI
state can include any setting which is performed by SBIOS or host
linux kernel (for example LTR, ASPM L1 substates, etc.). When this
soft reset will be triggered, then all these settings will be reset,
and the device state saved at step (5) will also have this setting
cleared so it cannot be restored. Since the vfio driver only exposes
limited PCI capabilities to its user, so the vfio driver user also
won't have the option to save and restore these capabilities state
either and these original settings will be permanently lost.
For pci_reset_bus() also, we can have the above situation.
The other functions/devices can be in D3hot and the reset will change
the power state of all devices to D0 without the involvement of vfio
driver.
So, before calling any reset-related API's, we need to make sure that
the device state is D0. This is mainly to preserve the state around
soft reset.
For vfio_pci_core_disable(), we use __pci_reset_function_locked()
which internally can use pci_pm_reset() for the function reset.
pci_pm_reset() requires the device power state to be in D0, otherwise
it returns error.
This patch changes the device power state to D0 by invoking
vfio_pci_set_power_state() explicitly before calling any reset related
API's.
Fixes: 51ef3a004b ("vfio/pci: Restore device state on PM transition")
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220217122107.22434-3-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If 'vfio_pci_core_device::needs_pm_restore' is set (PCI device does
not have No_Soft_Reset bit set in its PMCSR config register), then
the current PCI state will be saved locally in
'vfio_pci_core_device::pm_save' during D0->D3hot transition and same
will be restored back during D3hot->D0 transition.
For saving the PCI state locally, pci_store_saved_state() is being
used and the pci_load_and_free_saved_state() will free the allocated
memory.
But for reset related IOCTLs, vfio driver calls PCI reset-related
API's which will internally change the PCI power state back to D0. So,
when the guest resumes, then it will get the current state as D0 and it
will skip the call to vfio_pci_set_power_state() for changing the
power state to D0 explicitly. In this case, the memory pointed by
'pm_save' will never be freed. In a malicious sequence, the state changing
to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be
run in a loop and it can cause an OOM situation.
This patch frees the earlier allocated memory first before overwriting
'pm_save' to prevent the mentioned memory leak.
Fixes: 51ef3a004b ("vfio/pci: Restore device state on PM transition")
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220217122107.22434-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Variables allocated by kvzalloc should not be freed by kfree.
Because they may be allocated by vmalloc.
So we replace kfree with kvfree here.
Fixes: d6a4c18566 ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Signed-off-by: Jiacheng Shi <billsjc@sjtu.edu.cn>
Link: https://lore.kernel.org/r/20211212091600.2560-1-billsjc@sjtu.edu.cn
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Sparse warns:
sparse warnings: (new ones prefixed by >>)
>> drivers/vfio/pci/vfio_pci_igd.c:146:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned short [addressable] [usertype] val @@ got restricted __le16 [usertype] @@
drivers/vfio/pci/vfio_pci_igd.c:146:21: sparse: expected unsigned short [addressable] [usertype] val
drivers/vfio/pci/vfio_pci_igd.c:146:21: sparse: got restricted __le16 [usertype]
>> drivers/vfio/pci/vfio_pci_igd.c:161:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [addressable] [usertype] val @@ got restricted __le32 [usertype] @@
drivers/vfio/pci/vfio_pci_igd.c:161:21: sparse: expected unsigned int [addressable] [usertype] val
drivers/vfio/pci/vfio_pci_igd.c:161:21: sparse: got restricted __le32 [usertype]
drivers/vfio/pci/vfio_pci_igd.c:176:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned short [addressable] [usertype] val @@ got restricted __le16 [usertype] @@
drivers/vfio/pci/vfio_pci_igd.c:176:21: sparse: expected unsigned short [addressable] [usertype] val
drivers/vfio/pci/vfio_pci_igd.c:176:21: sparse: got restricted __le16 [usertype]
These are due to trying to use an unsigned to store the result of
a cpu_to_leXX() conversion. These are small variables, so pointer
tricks are wasteful and casting just generates different sparse
warnings. Store to and copy results from a separate little endian
variable.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/202111290026.O3vehj03-lkp@intel.com/
Link: https://lore.kernel.org/r/163840226123.138003.7668320168896210328.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Storing a pointer to the MSI descriptor just to track the Linux interrupt
number is daft. Just store the interrupt number and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221815.207838579@linutronix.de
This is to fix incorrect pointer arithmetic which caused wrong
OpRegion version returned, then VM driver got error to get wanted
VBT block. We need to be safe to return correct data, so force
pointer type for byte access.
Fixes: 49ba1a2976 ("vfio/pci: Add OpRegion 2.0+ Extended VBT support.")
Cc: Colin Xu <colin.xu@gmail.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dmitry Torokhov <dtor@chromium.org>
Cc: "Xu, Terrence" <terrence.xu@intel.com>
Cc: "Gao, Fred" <fred.gao@intel.com>
Acked-by: Colin Xu <colin.xu@gmail.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Link: https://lore.kernel.org/r/20211125051328.3359902-1-zhenyuw@linux.intel.com
[aw: line wrap]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio.c abuses (misuses) "/**", which indicates the beginning of
kernel-doc notation in the kernel tree. This causes a bunch of
kernel-doc complaints about this source file, so quieten all of
them by changing all "/**" to "/*".
vfio.c:236: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* IOMMU driver registration
vfio.c:236: warning: missing initial short description on line:
* IOMMU driver registration
vfio.c:295: warning: expecting prototype for Container objects(). Prototype was for vfio_container_get() instead
vfio.c:317: warning: expecting prototype for Group objects(). Prototype was for __vfio_group_get_from_iommu() instead
vfio.c:496: warning: Function parameter or member 'device' not described in 'vfio_device_put'
vfio.c:496: warning: expecting prototype for Device objects(). Prototype was for vfio_device_put() instead
vfio.c:599: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Async device support
vfio.c:599: warning: missing initial short description on line:
* Async device support
vfio.c:693: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* VFIO driver API
vfio.c:693: warning: missing initial short description on line:
* VFIO driver API
vfio.c:835: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Get a reference to the vfio_device for a device. Even if the
vfio.c:835: warning: missing initial short description on line:
* Get a reference to the vfio_device for a device. Even if the
vfio.c:969: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* VFIO base fd, /dev/vfio/vfio
vfio.c:969: warning: missing initial short description on line:
* VFIO base fd, /dev/vfio/vfio
vfio.c:1187: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* VFIO Group fd, /dev/vfio/$GROUP
vfio.c:1187: warning: missing initial short description on line:
* VFIO Group fd, /dev/vfio/$GROUP
vfio.c:1540: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* VFIO Device fd
vfio.c:1540: warning: missing initial short description on line:
* VFIO Device fd
vfio.c:1615: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* External user API, exported by symbols to be linked dynamically.
vfio.c:1615: warning: missing initial short description on line:
* External user API, exported by symbols to be linked dynamically.
vfio.c:1663: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* External user API, exported by symbols to be linked dynamically.
vfio.c:1663: warning: missing initial short description on line:
* External user API, exported by symbols to be linked dynamically.
vfio.c:1742: warning: Function parameter or member 'caps' not described in 'vfio_info_cap_add'
vfio.c:1742: warning: Function parameter or member 'size' not described in 'vfio_info_cap_add'
vfio.c:1742: warning: Function parameter or member 'id' not described in 'vfio_info_cap_add'
vfio.c:1742: warning: Function parameter or member 'version' not described in 'vfio_info_cap_add'
vfio.c:1742: warning: expecting prototype for Sub(). Prototype was for vfio_info_cap_add() instead
vfio.c:2276: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Module/class support
vfio.c:2276: warning: missing initial short description on line:
* Module/class support
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/38a9cb92-a473-40bf-b8f9-85cc5cfc2da4@infradead.org
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Modernize how vfio is creating the group char dev and sysfs presence.
These days drivers with state should use cdev_device_add() and
cdev_device_del() to manage the cdev and sysfs lifetime.
This API requires the driver to put the struct device and struct cdev
inside its state struct (vfio_group), and then use the usual
device_initialize()/cdev_device_add()/cdev_device_del() sequence.
Split the code to make this possible:
- vfio_group_alloc()/vfio_group_release() are pair'd functions to
alloc/free the vfio_group. release is done under the struct device
kref.
- vfio_create_group()/vfio_group_put() are pairs that manage the
sysfs/cdev lifetime. Once the uses count is zero the vfio group's
userspace presence is destroyed.
- The IDR is replaced with an IDA. container_of(inode->i_cdev)
is used to get back to the vfio_group during fops open. The IDA
assigns unique minor numbers.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-2fdfe4ca2cc6+18c-vfio_group_cdev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The next patch adds a struct device to the struct vfio_group, and it is
confusing/bad practice to have two krefs in the same struct. This kref is
controlling the period when the vfio_group is registered in sysfs, and
visible in the internal lookup. Switch it to a refcount_t instead.
The refcount_dec_and_mutex_lock() is still required because we need
atomicity of the list searches and sysfs presence.
Reviewed-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-2fdfe4ca2cc6+18c-vfio_group_cdev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If vfio_create_group() searches the group list and returns an already
existing group it does not put back the iommu_group reference that the
caller passed in.
Change the semantic of vfio_create_group() to not move the reference in
from the caller, but instead obtain a new reference inside and leave the
caller's reference alone. The two callers must now call iommu_group_put().
This is an unlikely race as the only caller that could hit it has already
searched the group list before attempting to create the group.
Fixes: cba3345cc4 ("vfio: VFIO core")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-2fdfe4ca2cc6+18c-vfio_group_cdev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Split vfio_group_get_from_iommu() into __vfio_group_get_from_iommu() so
that vfio_create_group() can call it to consolidate this duplicated code.
Reviewed-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-2fdfe4ca2cc6+18c-vfio_group_cdev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
iommu_group_register_notifier()/iommu_group_unregister_notifier() are
built using a blocking_notifier_chain which integrates a rwsem. The
notifier function cannot be running outside its registration.
When considering how the notifier function interacts with create/destroy
of the group there are two fringe cases, the notifier starts before
list_add(&vfio.group_list) and the notifier runs after the kref
becomes 0.
Prior to vfio_create_group() unlocking and returning we have
container_users == 0
device_list == empty
And this cannot change until the mutex is unlocked.
After the kref goes to zero we must also have
container_users == 0
device_list == empty
Both are required because they are balanced operations and a 0 kref means
some caller became unbalanced. Add the missing assertion that
container_users must be zero as well.
These two facts are important because when checking each operation we see:
- IOMMU_GROUP_NOTIFY_ADD_DEVICE
Empty device_list avoids the WARN_ON in vfio_group_nb_add_dev()
0 container_users ends the call
- IOMMU_GROUP_NOTIFY_BOUND_DRIVER
0 container_users ends the call
Finally, we have IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER, which only deletes
items from the unbound list. During creation this list is empty, during
kref == 0 nothing can read this list, and it will be freed soon.
Since the vfio_group_release() doesn't hold the appropriate lock to
manipulate the unbound_list and could race with the notifier, move the
cleanup to directly before the kfree.
This allows deleting all of the deferred group put code.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-2fdfe4ca2cc6+18c-vfio_group_cdev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.
The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
Link: https://lore.kernel.org/r/20211012124855.52463-1-colin.xu@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
IS_IOMMU_CAP_DOMAIN_IN_CONTAINER just obsfucated the checks being
performed, so open code it in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-16-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The external_domain concept rather misleading and not actually needed.
Replace it with a list of emulated groups in struct vfio_iommu.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-15-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Ensure pgsize_bitmap is always valid by initializing it to PAGE_MASK
in vfio_iommu_type1_open and remove the now pointless update for
the external domain case in vfio_iommu_type1_attach_group, which was
just setting pgsize_bitmap to PAGE_MASK when only external domains
were attached.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210924155705.4258-14-hch@lst.de
[aw: s/ULONG_MAX/PAGE_MASK/ per discussion in link]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Unlike the the type1 IOMMU backend, the SPAPR one does not contain any
support for the magic non-IOMMU backed iommu_group used by mediated
devices, so reject them in ->attach_group.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-13-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pass the group flags to ->attach_group and remove the messy check for
the bus type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-12-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The iommu_device field in struct mdev_device has never been used
since it was added more than 2 years ago.
This is a manual revert of commit 7bd50f0cd2
("vfio/type1: Add domain at(de)taching group helpers").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-11-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Create a new private drivers/vfio/vfio.h header for the interface between
the VFIO core and the iommu drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-10-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The read, write and mmap methods are never implemented, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-9-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reuse the logic in vfio_noiommu_group_alloc to allocate a fake
single-device iommu group for mediated devices by factoring out a common
function, and replacing the noiommu boolean field in struct vfio_group
with an enum to distinguish the three different kinds of groups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-8-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Just pass a noiommu argument to vfio_create_group and set up the
->noiommu flag directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-7-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Split the actual noiommu group creation from vfio_iommu_group_get into a
new helper, and open code the rest of vfio_iommu_group_get in its only
caller. This creates an entirely separate and clear code path for the
noiommu group creation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-6-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Factor out a helper to find or allocate the vfio_group to reduce the
spagetthi code in vfio_register_group_dev a little.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-5-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_noiommu_attach_group has two callers:
1) __vfio_container_attach_groups is called by vfio_ioctl_set_iommu,
which just called vfio_iommu_driver_allowed
2) vfio_group_set_container requires already checks ->noiommu on the
vfio_group, which is propagated from the iommudata in
vfio_create_group
so this check is entirely superflous and can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-4-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Factor out a little helper to make the checks for the noiommu driver less
ugly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-3-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We don't need to hold a reference to the group in the driver as well as
obtain a reference to the same group as the first thing
vfio_register_group_dev() does.
Since the drivers never use the group move this all into the core code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-2-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently when a fsl-mc device is reset, the entire DPRC container
is reset which is very inefficient because the devices within a
container will be reset multiple times.
Add support for individually resetting a device.
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Link: https://lore.kernel.org/r/20210922110530.24736-2-diana.craciun@oss.nxp.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The function prototype is missing an identifier name. Add one.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20210902212631.54260-1-colin.king@canonical.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Fix dma-valid return WAITED implementation (Anthony Yznaga)
- SPDX license cleanups (Cai Huoqing)
- Split vfio-pci-core from vfio-pci and enhance PCI driver matching
to support future vendor provided vfio-pci variants (Yishai Hadas,
Max Gurtovoy, Jason Gunthorpe)
- Replace duplicated reflck with core support for managing first
open, last close, and device sets (Jason Gunthorpe, Max Gurtovoy,
Yishai Hadas)
- Fix non-modular mdev support and don't nag about request callback
support (Christoph Hellwig)
- Add semaphore to protect instruction intercept handler and replace
open-coded locks in vfio-ap driver (Tony Krowiak)
- Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmEvwWkbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsi+1UP/3CRizghroINVYR+cJ99
Tjz7lB/wlzxmRfX+SL4NAVe1SSB2VeCgU4B0PF6kywELLS8OhCO3HXYXVsz244fW
Gk5UIns86+TFTrfCOMpwYBV0P86zuaa1ZnvCnkhMK1i2pTZ+oX8hUH1Yj5clHuU+
YgC7JfEuTIAX73q2bC/llLvNE9ke1QCoDX3+HAH87ttqutnRWcnnq56PTEqwe+EW
eMA+glB1UG6JAqXxoJET4155arNOny1/ZMprfBr3YXZTiXDF/lSzuMyUtbp526Sf
hsvlnqtE6TCdfKbog0Lxckl+8E9NCq8jzFBKiZhbccrQv3vVaoP6dOsPWcT35Kp1
IjzMLiHIbl4wXOL+Xap/biz3LCM5BMdT/OhW5LUC007zggK71ndRvb9F8ptW83Bv
0Uh9DNv7YIQ0su3JHZEsJ3qPFXQXceP199UiADOGSeV8U1Qig3YKsHUDMuALfFvN
t+NleeJ4qCWao+W4VCfyDfKurVnMj/cThXiDEWEeq5gMOO+6YKBIFWJVKFxUYDbf
MgGdg0nQTUECuXKXxLD4c1HAWH9xi207OnLvhW1Icywp20MsYqOWt0vhg+PRdMBT
DK6STxP18aQxCaOuQN9Vf81LjhXNTeg+xt3mMyViOZPcKfX6/wAC9qLt4MucJDdw
FBfOz2UL2F56dhAYT+1vHoUM
=nzK7
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Fix dma-valid return WAITED implementation (Anthony Yznaga)
- SPDX license cleanups (Cai Huoqing)
- Split vfio-pci-core from vfio-pci and enhance PCI driver matching to
support future vendor provided vfio-pci variants (Yishai Hadas, Max
Gurtovoy, Jason Gunthorpe)
- Replace duplicated reflck with core support for managing first open,
last close, and device sets (Jason Gunthorpe, Max Gurtovoy, Yishai
Hadas)
- Fix non-modular mdev support and don't nag about request callback
support (Christoph Hellwig)
- Add semaphore to protect instruction intercept handler and replace
open-coded locks in vfio-ap driver (Tony Krowiak)
- Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe)
* tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio: (37 commits)
vfio/pci: Introduce vfio_pci_core.ko
vfio: Use kconfig if XX/endif blocks instead of repeating 'depends on'
vfio: Use select for eventfd
PCI / VFIO: Add 'override_only' support for VFIO PCI sub system
PCI: Add 'override_only' field to struct pci_device_id
vfio/pci: Move module parameters to vfio_pci.c
vfio/pci: Move igd initialization to vfio_pci.c
vfio/pci: Split the pci_driver code out of vfio_pci_core.c
vfio/pci: Include vfio header in vfio_pci_core.h
vfio/pci: Rename ops functions to fit core namings
vfio/pci: Rename vfio_pci_device to vfio_pci_core_device
vfio/pci: Rename vfio_pci_private.h to vfio_pci_core.h
vfio/pci: Rename vfio_pci.c to vfio_pci_core.c
vfio/ap_ops: Convert to use vfio_register_group_dev()
s390/vfio-ap: replace open coded locks for VFIO_GROUP_NOTIFY_SET_KVM notification
s390/vfio-ap: r/w lock for PQAP interception handler function pointer
vfio/type1: Fix vfio_find_dma_valid return
vfio-pci/zdev: Remove repeated verbose license text
vfio: platform: reset: Convert to SPDX identifier
vfio: Remove struct vfio_device_ops open/release
...
Now that vfio_pci has been split into two source modules, one focusing on
the "struct pci_driver" (vfio_pci.c) and a toolbox library of code
(vfio_pci_core.c), complete the split and move them into two different
kernel modules.
As before vfio_pci.ko continues to present the same interface under sysfs
and this change will have no functional impact.
Splitting into another module and adding exports allows creating new HW
specific VFIO PCI drivers that can implement device specific
functionality, such as VFIO migration interfaces or specialized device
requirements.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-14-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This results in less kconfig wordage and a simpler understanding of the
required "depends on" to create the menu structure.
The next patch increases the nesting level a lot so this is a nice
preparatory simplification.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-13-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If VFIO_VIRQFD is required then turn on eventfd automatically.
The majority of kconfig users of the EVENTFD use select not depends on.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-12-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Expose an 'override_only' helper macro (i.e.
PCI_DRIVER_OVERRIDE_DEVICE_VFIO) for VFIO PCI sub system and add the
required code to prefix its matching entries with "vfio_" in
modules.alias file.
It allows VFIO device drivers to include match entries in the
modules.alias file produced by kbuild that are not used for normal
driver autoprobing and module autoloading. Drivers using these match
entries can be connected to the PCI device manually, by userspace, using
the existing driver_override sysfs.
For example the resulting modules.alias may have:
alias pci:v000015B3d00001021sv*sd*bc*sc*i* mlx5_core
alias vfio_pci:v000015B3d00001021sv*sd*bc*sc*i* mlx5_vfio_pci
alias vfio_pci:v*d*sv*sd*bc*sc*i* vfio_pci
In this example mlx5_core and mlx5_vfio_pci match to the same PCI
device. The kernel will autoload and autobind to mlx5_core but the
kernel and udev mechanisms will ignore mlx5_vfio_pci.
When userspace wants to change a device to the VFIO subsystem it can
implement a generic algorithm:
1) Identify the sysfs path to the device:
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
2) Get the modalias string from the kernel:
$ cat /sys/bus/pci/devices/0000:01:00.0/modalias
pci:v000015B3d00001021sv000015B3sd00000001bc02sc00i00
3) Prefix it with vfio_:
vfio_pci:v000015B3d00001021sv000015B3sd00000001bc02sc00i00
4) Search modules.alias for the above string and select the entry that
has the fewest *'s:
alias vfio_pci:v000015B3d00001021sv*sd*bc*sc*i* mlx5_vfio_pci
5) modprobe the matched module name:
$ modprobe mlx5_vfio_pci
6) cat the matched module name to driver_override:
echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
7) unbind device from original module
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
8) probe PCI drivers (or explicitly bind to mlx5_vfio_pci)
echo 0000:01:00.0 > /sys/bus/pci/drivers_probe
The algorithm is independent of bus type. In future the other buses with
VFIO device drivers, like platform and ACPI, can use this algorithm as
well.
This patch is the infrastructure to provide the information in the
modules.alias to userspace. Convert the only VFIO pci_driver which results
in one new line in the modules.alias:
alias vfio_pci:v*d*sv*sd*bc*sc*i* vfio_pci
Later series introduce additional HW specific VFIO PCI drivers, such as
mlx5_vfio_pci.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # for pci.h
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-11-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is a preparation before splitting vfio_pci.ko to 2 modules.
As module parameters are a kind of uAPI they need to stay on vfio_pci.ko
to avoid a user visible impact.
For now continue to keep the implementation of these options in
vfio_pci_core.c. Arguably they are vfio_pci functionality, but further
splitting of vfio_pci_core.c will be better done in another series
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
igd is related to the vfio_pci pci_driver implementation, move it out of
vfio_pci_core.c.
This is preparation for splitting vfio_pci.ko into 2 drivers.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Split the vfio_pci driver into two logical parts, the 'struct
pci_driver' (vfio_pci.c) which implements "Generic VFIO support for any
PCI device" and a library of code (vfio_pci_core.c) that helps
implementing a struct vfio_device on top of a PCI device.
vfio_pci.ko continues to present the same interface under sysfs and this
change should have no functional impact.
Following patches will turn vfio_pci and vfio_pci_core into a separate
module.
This is a preparation for allowing another module to provide the
pci_driver and allow that module to customize how VFIO is setup, inject
its own operations, and easily extend vendor specific functionality.
At this point the vfio_pci_core still contains a lot of vfio_pci
functionality mixed into it. Following patches will move more of the
large scale items out, but another cleanup series will be needed to get
everything.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_device structure is embedded into the vfio_pci_core_device
structure, so there is no reason for not including the header file in
the vfio_pci_core header as well.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is another preparation patch for separating the vfio_pci driver to
a subsystem driver and a generic pci driver. This patch doesn't change
any logic.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.
The new vfio_pci_core_device structure will be the main structure of the
core driver and later on vfio_pci_device structure will be the main
structure of the generic vfio_pci driver.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20210826103912.128972-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_find_dma_valid is defined to return WAITED on success if it was
necessary to wait. However, the loop forgets the WAITED value returned
by vfio_wait() and returns 0 in a later iteration. Fix it.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Link: https://lore.kernel.org/r/1629736550-2388-1-git-send-email-anthony.yznaga@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The SPDX and verbose license text are redundant, however in this case
the verbose license indicates a GPL v2 only while SPDX specifies v2+.
Remove the verbose license and correct SPDX to the more restricted
version.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/20210824003749.1039-1-caihuoqing@baidu.com
[aw: commit log]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Like vfio_pci_dev_set_try_reset() this code wants to reset all of the
devices in the "reset group" which is the same membership as the device
set.
Instead of trying to reconstruct the device set from the PCI list go
directly from the device set's device list to execute the reset.
The same basic structure as vfio_pci_dev_set_try_reset() is used. The
'vfio_devices' struct is replaced with the device set linked list and we
simply sweep it multiple times under the lock.
This eliminates a memory allocation and get/put traffic and another
improperly locked test of pci_dev_driver().
Reviewed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/10-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_pci_try_bus_reset() is triggering a reset of the entire_dev set if
any device within it has accumulated a needs_reset. This reset can only be
done once all of the drivers operating the PCI devices to be reset are in
a known safe state.
Make this clearer by directly operating on the dev_set instead of the
vfio_pci_device. Rename the function to vfio_pci_dev_set_try_reset().
Use the device list inside the dev_set to check that all drivers are in a
safe state instead of working backwards from the pci_device.
The dev_set->lock directly prevents devices from joining/leaving the set,
or changing their state, which further implies the pci_device cannot
change drivers or that the vfio_device be freed, eliminating the need for
get/put's.
If a pci_device to be reset is not in the dev_set then the reset cannot be
used as we can't know what the state of that driver is. Directly measure
this by checking that every pci_device is in the dev_set - which
effectively proves that VFIO drivers are attached to everything.
Remove the odd interaction around vfio_pci_set_power_state() - have the
only caller avoid its redundant vfio_pci_set_power_state() instead of
avoiding it inside vfio_pci_dev_set_try_reset().
This restructuring corrects a call to pci_dev_driver() without holding the
device_lock() and removes a hard wiring to &vfio_pci_driver.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/9-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
PCI wants to have the usual open/close_device() logic with the slight
twist that the open/close_device() must be done under a singelton lock
shared by all of the vfio_devices that are in the PCI "reset group".
The reset group, and thus the device set, is determined by what devices
pci_reset_bus() touches, which is either the entire bus or only the slot.
Rely on the core code to do everything reflck was doing and delete reflck
entirely.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/8-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Platform simply wants to run some code when the device is first
opened/last closed. Use the core framework and locking for this. Aside
from removing a bit of code this narrows the locking scope from a global
lock.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
FSL uses the internal reflck to implement the open_device() functionality,
conversion to the core code is straightforward.
The decision on which set to be part of is trivially based on the
is_fsl_mc_bus_dprc() and we use a 'struct device *' pointer as the set_id.
The dev_set lock is protecting the interrupts setup. The FSL MC devices
are using MSIs and only the DPRC device is allocating the MSIs from the
MSI domain. The other devices just take interrupts from a pool. The lock
is protecting the access to this pool.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Tested-by: Diana Craciun OSS <diana.craciun@oss.nxp.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently the driver ops have an open/release pair that is called once
each time a device FD is opened or closed. Add an additional set of
open/close_device() ops which are called when the device FD is opened for
the first time and closed for the last time.
An analysis shows that all of the drivers require this semantic. Some are
open coding it as part of their reflck implementation, and some are just
buggy and miss it completely.
To retain the current semantics PCI and FSL depend on, introduce the idea
of a "device set" which is a grouping of vfio_device's that share the same
lock around opening.
The device set is established by providing a 'set_id' pointer. All
vfio_device's that provide the same pointer will be joined to the same
singleton memory and lock across the whole set. This effectively replaces
the oddly named reflck.
After conversion the set_id will be sourced from:
- A struct device from a fsl_mc_device (fsl)
- A struct pci_slot (pci)
- A struct pci_bus (pci)
- The struct vfio_device (everything)
The design ensures that the above pointers are live as long as the
vfio_device is registered, so they form reliable unique keys to group
vfio_devices into sets.
This implementation uses xarray instead of searching through the driver
core structures, which simplifies the somewhat tricky locking in this
area.
Following patches convert all the drivers.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This pairs with vfio_init_group_dev() and allows undoing any state that is
stored in the vfio_device unrelated to registration. Add appropriately
placed calls to all the drivers.
The following patch will use this to add pre-registration state for the
device set.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v4-9ea22c5e6afb+1adf-vfio_reflck_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tag for other trees/branches to pull from in order to have a stable
place to build off of if they want to add new busses for 5.15.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYPkvsg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yl8pwCff+mL/zOaU8M1KwIHrEvNH6LgF58AoKgBAeCb
58Lenmw44ofP5itzU5x3
=J5mO
-----END PGP SIGNATURE-----
Merge tag 'bus_remove_return_void-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into drm-next
Bus: Make remove callback return void tag
Tag for other trees/branches to pull from in order to have a stable
place to build off of if they want to add new busses for 5.15.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
[airlied: fixed up merge conflict in drm]
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patchwork.freedesktop.org/patch/msgid/YPkwQwf0dUKnGA7L@kroah.com
Only a single driver actually sets the ->request method, so don't print
a scary warning if it isn't.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20210726143524.155779-3-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Without this setups with buіlt-in mdev and mdev-drivers fail to
register like this:
[1.903149] Driver 'intel_vgpu_mdev' was unable to register with bus_type 'mdev' because the bus was not initialized.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20210726143524.155779-2-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The only implementation of this in IGD returns a -ERRNO which is
implicitly cast through a size_t and then casted again and returned as a
ssize_t in vfio_pci_rw().
Fix the vfio_pci_regops->rw() return type to be ssize_t so all is
consistent.
Fixes: 28541d41c9 ("vfio/pci: Add infrastructure for additional device specific regions")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/0-v3-5db12d1bf576+c910-vfio_rw_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO_NOIOMMU is supposed to be an element in the VFIO menu, not start
a new menu. Correct this copy-paste mistake.
Fixes: 03a76b60f8 ("vfio: Include No-IOMMU mode")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/0-v1-3f0b685c3679+478-vfio_menuconfig_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
Previous merge is not in a rc kernel yet, so no userspace regression possible.
Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.
Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.
Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmD5TGAACgkQ/lWMcqZw
E8PNgxAApjTYQSfjIBbOZnNraxW6w7/bPea35E9A47EdBQsNGnYftNsFjbrn/mCJ
D+0eRLjCMlg4FF1SHdh9cPJ35py+ygbDeupogboLITfU99eGBth3fM2Xdg9LPcBh
dbni/JLG9R7gIvSlqdJuweN21trfVrV/9FQEilG5DvQcl27Wx5g8VMRZke1EqGKX
7Id09Uq50ky18vhDjQRCveYhRqJAxV+XozBatzHyxpDVzjLQvRhlAAYdvrSMHZ5R
jreGzOfR8awc6Om+w7wx3Jn1oEGmXVZB/VqxEqGtMOr3lpARPucxrqfHsqpam3rv
yIoEKPrkG+k6fsU7Tbg59jNqe/PbCUW3AlpyuBxf55EbnVGgjLDbq4sRRMkehPfA
fhC31ujOXQQnAgaxyeQAaAJFKNFJzA8Cq5ZPfG+zztzuomHCiUVQBRowP65hJMzR
+ZlEDnhUD3STLz39zuO1reZR1ZoPIvKbsokHAA+ZrIwUd6U3D3ia8V51pq+lL5aS
TGDkyMN9jyZ+SO8Z7+2FnJAv9FAOPU/WCLU/fWW46jAvuezwMIwVcjfSqDU2XbZD
e7KgHpHhx3BGxI8TThHKlY7mf6IL2Bm7X1Cv1pdZs/eEn3Udh2ax942uTQZu/YOO
0AT1XchpvYCBNRw05bVI3OlJ+w3I8uV+h+11jHOKeY6cbwdHeKE=
=BUya
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-next-2021-07-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v5.15-rc1:
UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
Previous merge is not in a rc kernel yet, so no userspace regression possible.
Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.
Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.
Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/2d0d2fe8-01fc-e216-c3fd-38db9e69944e@linux.intel.com
The driver core ignores the return value of this callback because there
is only little it can do when a device disappears.
This is the final bit of a long lasting cleanup quest where several
buses were converted to also return void from their remove callback.
Additionally some resource leaks were fixed that were caused by drivers
returning an error code in the expectation that the driver won't go
away.
With struct bus_type::remove returning void it's prevented that newly
implemented buses return an ignored error code and so don't anticipate
wrong expectations for driver authors.
Reviewed-by: Tom Rix <trix@redhat.com> (For fpga)
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com> (For drivers/s390 and drivers/vfio)
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> (For ARM, Amba and related parts)
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Chen-Yu Tsai <wens@csie.org> (for sunxi-rsb)
Acked-by: Pali Rohár <pali@kernel.org>
Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org> (for media)
Acked-by: Hans de Goede <hdegoede@redhat.com> (For drivers/platform)
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Acked-By: Vinod Koul <vkoul@kernel.org>
Acked-by: Juergen Gross <jgross@suse.com> (For xen)
Acked-by: Lee Jones <lee.jones@linaro.org> (For mfd)
Acked-by: Johannes Thumshirn <jth@kernel.org> (For mcb)
Acked-by: Johan Hovold <johan@kernel.org>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> (For slimbus)
Acked-by: Kirti Wankhede <kwankhede@nvidia.com> (For vfio)
Acked-by: Maximilian Luz <luzmaximilian@gmail.com>
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> (For ulpi and typec)
Acked-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> (For ipack)
Acked-by: Geoff Levand <geoff@infradead.org> (For ps3)
Acked-by: Yehezkel Bernat <YehezkelShB@gmail.com> (For thunderbolt)
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> (For intel_th)
Acked-by: Dominik Brodowski <linux@dominikbrodowski.net> (For pcmcia)
Acked-by: Rafael J. Wysocki <rafael@kernel.org> (For ACPI)
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> (rpmsg and apr)
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> (For intel-ish-hid)
Acked-by: Dan Williams <dan.j.williams@intel.com> (For CXL, DAX, and NVDIMM)
Acked-by: William Breathitt Gray <vilhelm.gray@gmail.com> (For isa)
Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de> (For firewire)
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> (For hid)
Acked-by: Thorsten Scherer <t.scherer@eckelmann.de> (For siox)
Acked-by: Sven Van Asbroeck <TheSven73@gmail.com> (For anybuss)
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> (For MMC)
Acked-by: Wolfram Sang <wsa@kernel.org> # for I2C
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20210713193522.1770306-6-u.kleine-koenig@pengutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The VGA arbitration is entirely based on pci_dev structures, so just pass
that back to the set_vga_decode callback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210716061634.2446357-8-hch@lst.de
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
All callers pass NULL as the irq_set_state argument, so remove it and
the ->irq_set_state member in struct vga_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210716061634.2446357-7-hch@lst.de
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
vma_lookup() finds the vma of a specific address with a cleaner interface
and is more readable.
Link: https://lkml.kernel.org/r/20210521174745.2219620-12-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new pci_dev_trylock() helper to simplify our locking.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20210623022824.308041-3-mcgrof@kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_group structure is already defined in vfio module so in order
to improve code readability and for simplicity, rename the vfio_group
structure in vfio_iommu_type1 module to vfio_iommu_group.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20210608112841.51897-1-mgurtovoy@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This allows a mdev driver to opt out of using vfio_mdev.c, instead the
driver will provide a 'struct mdev_driver' and register directly with the
driver core.
Much of mdev_parent_ops becomes unused in this mode:
- create()/remove() are done via the mdev_driver probe()/remove()
- mdev_attr_groups becomes mdev_driver driver.dev_groups
- Wrapper function callbacks are replaced with the same ones from
struct vfio_device_ops
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Link: https://lore.kernel.org/r/20210617142218.1877096-8-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
For some reason the vfio_mdev shim mdev_driver has its own module and
kconfig. As the next patch requires access to it from mdev.ko merge the
two modules together and remove VFIO_MDEV_DEVICE.
A later patch deletes this driver entirely.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Link: https://lore.kernel.org/r/20210617142218.1877096-7-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio core driver is now taking refcount on the provider drivers,
remove redundant parent_module attribute from vfio_platform_device
structure.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20210518192133.59195-3-mgurtovoy@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Remove code duplication and move module refcounting to the subsystem
module.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20210518192133.59195-2-mgurtovoy@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The ->parent_module is the one that use in try_module_get. It should
also be the one the we use in module_put during vfio_platform_open().
Fixes: 32a2d71c4e ("vfio: platform: introduce vfio-platform-base module")
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <20210518192133.59195-1-mgurtovoy@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows
that, in the worst scenario, could lead to heap overflows.
This code was detected with the help of Coccinelle and, audited and
fixed manually.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20210513230155.GA217517@embeddedor>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
zap_vma_ptes() is only available when CONFIG_MMU is set/enabled.
Without CONFIG_MMU, vfio_pci.o has build errors, so make
VFIO_PCI depend on MMU.
riscv64-linux-ld: drivers/vfio/pci/vfio_pci.o: in function `vfio_pci_mmap_open':
vfio_pci.c:(.text+0x1ec): undefined reference to `zap_vma_ptes'
riscv64-linux-ld: drivers/vfio/pci/vfio_pci.o: in function `.L0 ':
vfio_pci.c:(.text+0x165c): undefined reference to `zap_vma_ptes'
Fixes: 11c4cd07ba ("vfio-pci: Fault mmaps to enable vma tracking")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210515190856.2130-1-rdunlap@infradead.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The error code returned from vfio_ext_cap_len() is stored in 'len', not
in 'ret'.
Fixes: 89e1f7d4c6 ("vfio: Add PCI device driver")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <20210515020458.6771-1-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Including:
- Big cleanup of almost unsused parts of the IOMMU API by
Christoph Hellwig. This mostly affects the Freescale PAMU
driver.
- New IOMMU driver for Unisoc SOCs
- ARM SMMU Updates from Will:
- SMMUv3: Drop vestigial PREFETCH_ADDR support
- SMMUv3: Elide TLB sync logic for empty gather
- SMMUv3: Fix "Service Failure Mode" handling
- SMMUv2: New Qualcomm compatible string
- Removal of the AMD IOMMU performance counter writeable check
on AMD. It caused long boot delays on some machines and is
only needed to work around an errata on some older (possibly
pre-production) chips. If someone is still hit by this
hardware issue anyway the performance counters will just
return 0.
- Support for targeted invalidations in the AMD IOMMU driver.
Before that the driver only invalidated a single 4k page or the
whole IO/TLB for an address space. This has been extended now
and is mostly useful for emulated AMD IOMMUs.
- Several fixes for the Shared Virtual Memory support in the
Intel VT-d driver
- Mediatek drivers can now be built as modules
- Re-introduction of the forcedac boot option which got lost
when converting the Intel VT-d driver to the common dma-iommu
implementation.
- Extension of the IOMMU device registration interface and
support iommu_ops to be const again when drivers are built as
modules.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmCMEIoACgkQK/BELZcB
GuOu9xAAvg6aR0uHlxvRq6cgNnHN9Ltp5+t3qFYtRRrauY0iOPMO62k0QQli5shX
CGeczD0e59KAZqI0zNJnQn8hMY5dg7XVkFCC5BrSzuCDCtwJZ0N5Tq3pfUlaV1rw
BJf41t79Fd+jp7kn53tu+vRAfYZ3+sLOx/6U3c15pqKRZSkyFWbQllOtD3J5LnLu
1PyPlfiNpMwCajiS7aQbN+fuJ/lKIFeA2MDPOsCBzhbfxiJUqJxZOKAZO3rOjFfK
feTibqQ+3Zz6MPXt9st1cvPpy8jCosv81OY6Knqvxf/oB5q+fEdi2uNrKISonb/t
Fw331oOIwg2A+HOpwC9MN1AumOIqiHSWWENAMk9SlP+TMIWKQ8kZreyI6IEB23dV
+QvP3DVA+CfLwtNY/Zh0IqKh28D+IHlKbpWNU1m+9AUe468mV/MTjfwxr9Yfffhm
LZ6C0DgFdmtqv8jPuDGUOgo3RNeN8bLnUSEHG9gHibA+RKujl5BWDjKkwILqMQTt
Ysdsu8TiNtFIULomizqCpgqEbQfW8TLFvASXCM1VMQ/PDURxvchZPxFDJonYXy+K
z2HGaG3eUE07YrAdRKH69aMVIbmS+sjEhvmi4xZ1Lh7wWcIE2AZVvO8qNb+Ckcp3
4tLPPDksm/iQngnFf6gdgH3qv4rgbzE4+74GXqeANiQCjY9dSJI=
=qF2C
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Big cleanup of almost unsused parts of the IOMMU API by Christoph
Hellwig. This mostly affects the Freescale PAMU driver.
- New IOMMU driver for Unisoc SOCs
- ARM SMMU Updates from Will:
- Drop vestigial PREFETCH_ADDR support (SMMUv3)
- Elide TLB sync logic for empty gather (SMMUv3)
- Fix "Service Failure Mode" handling (SMMUv3)
- New Qualcomm compatible string (SMMUv2)
- Removal of the AMD IOMMU performance counter writeable check on AMD.
It caused long boot delays on some machines and is only needed to
work around an errata on some older (possibly pre-production) chips.
If someone is still hit by this hardware issue anyway the performance
counters will just return 0.
- Support for targeted invalidations in the AMD IOMMU driver. Before
that the driver only invalidated a single 4k page or the whole IO/TLB
for an address space. This has been extended now and is mostly useful
for emulated AMD IOMMUs.
- Several fixes for the Shared Virtual Memory support in the Intel VT-d
driver
- Mediatek drivers can now be built as modules
- Re-introduction of the forcedac boot option which got lost when
converting the Intel VT-d driver to the common dma-iommu
implementation.
- Extension of the IOMMU device registration interface and support
iommu_ops to be const again when drivers are built as modules.
* tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (84 commits)
iommu: Streamline registration interface
iommu: Statically set module owner
iommu/mediatek-v1: Add error handle for mtk_iommu_probe
iommu/mediatek-v1: Avoid build fail when build as module
iommu/mediatek: Always enable the clk on resume
iommu/fsl-pamu: Fix uninitialized variable warning
iommu/vt-d: Force to flush iotlb before creating superpage
iommu/amd: Put newline after closing bracket in warning
iommu/vt-d: Fix an error handling path in 'intel_prepare_irq_remapping()'
iommu/vt-d: Fix build error of pasid_enable_wpe() with !X86
iommu/amd: Remove performance counter pre-initialization test
Revert "iommu/amd: Fix performance counter initialization"
iommu/amd: Remove duplicate check of devid
iommu/exynos: Remove unneeded local variable initialization
iommu/amd: Page-specific invalidations for more than one page
iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command
iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown
iommu/vt-d: Invalidate PASID cache when root/context entry changed
iommu/vt-d: Remove WO permissions on second-level paging entries
iommu/vt-d: Report the right page fault address
...
pinned_page_dirty_scope is optimized out by commit 010321565a
("vfio/iommu_type1: Mantain a counter for non_pinned_groups"),
but appears again due to some issues during merging branches.
We can safely remove it here.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210412024415.30676-1-zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When mmaping an extra device region verify that the region index
derived from the mmap offset is valid.
Fixes: a15b1883fe ("vfio_pci: Allow mapping extra regions")
Cc: stable@vger.kernel.org
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Message-Id: <20210412214124.GA241759@lisa.in-ulm.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The driver core standard is to pass in the properly typed object, the
properly typed attribute and the buffer data. It stems from the root
kobject method:
ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr,..)
Each subclass of kobject should provide their own function with the same
signature but more specific types, eg struct device uses:
ssize_t (*show)(struct device *dev, struct device_attribute *attr,..)
In this case the existing signature is:
ssize_t (*show)(struct kobject *kobj, struct device *dev,..)
Where kobj is a 'struct mdev_type *' and dev is 'mdev_type->parent->dev'.
Change the mdev_type related sysfs attribute functions to:
ssize_t (*show)(struct mdev_type *mtype, struct mdev_type_attribute *attr,..)
In order to restore type safety and match the driver core standard
There are no current users of 'attr', but if it is ever needed it would be
hard to add in retroactively, so do it now.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <18-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The kobj here is a type-erased version of mdev_type, which is already
stored in the struct mdev_device being passed in. It was only ever used to
compute the type_group_id, which is now extracted directly from the mdev.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <17-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This returns the index in the supported_type_groups array that is
associated with the mdev_type attached to the struct mdev_device or its
containing struct kobject.
Each mdev_device can be spawned from exactly one mdev_type, which in turn
originates from exactly one supported_type_group.
Drivers are using weird string calculations to try and get back to this
index, providing a direct access to the index removes a bunch of wonky
driver code.
mdev_type->group can be deleted as the group is obtained using the
type_group_id.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <11-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
mdev_device->type->parent is the same thing.
The struct mdev_device was relying on the kref on the mdev_parent to also
indirectly hold a kref on the mdev_type pointer. Now that the type holds a
kref on the parent we can directly kref the mdev_type and remove this
implicit relationship.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <10-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This can fail, and seems to be a popular target for syzkaller error
injection. Check the error return and unwind with put_device().
Fixes: 7b96953bc6 ("vfio: Mediated device Core driver")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <9-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Once the memory for the struct mdev_device is allocated it should
immediately be device_initialize()'d and filled in so that put_device()
can always be used to undo the allocation.
Place the mdev_get/put_parent() so that they are clearly protecting the
mdev->parent pointer. Move the final put to the release function so that
the lifetime rules are trivial to understand. Update the goto labels to
follow the normal convention.
Remove mdev_device_free() as the release function via device_put() is now
usable in all cases.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <8-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
struct mdev_type holds a pointer to the kref'd object struct mdev_parent,
but doesn't hold the kref. The lifetime of the parent becomes implicit
because parent_remove_sysfs_files() is supposed to remove all the access
before the parent can be freed, but this is very hard to reason about.
Make it obviously correct by adding the missing get.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <7-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The next patch will use these in mdev_sysfs.c
While here remove the now dead code checks for NULL, a mdev_type can never
have a NULL parent.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <6-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The kobj pointer in mdev_device is actually pointing at a struct
mdev_type. Use the proper type so things are understandable.
There are a number of places that are confused and passing both the mdev
and the mtype as function arguments, fix these to derive the mtype
directly from the mdev to remove the redundancy.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <5-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is only done once, we don't need to generate code to initialize a
structure stored in the ELF .data segment. Fill in the three required
.driver members directly instead of copying data into them during
mdev_register_driver().
Further the to_mdev_driver() function doesn't belong in a public header,
just inline it into the two places that need it. Finally, we can now
clearly see that 'drv' derived from dev->driver cannot be NULL, firstly
because the driver core forbids it, and secondly because NULL won't pass
through the container_of(). Remove the dead code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <4-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The mdev API should accept and pass a 'struct mdev_device *' in all
places, not pass a 'struct device *' and cast it internally with
to_mdev_device(). Particularly in its struct mdev_driver functions, the
whole point of a bus's struct device_driver wrapper is to provide type
safety compared to the default struct device_driver.
Further, the driver core standard is for bus drivers to expose their
device structure in their public headers that can be used with
container_of() inlines and '&foo->dev' to go between the class levels, and
'&foo->dev' to be used with dev_err/etc driver core helper functions. Move
'struct mdev_device' to mdev.h
Once done this allows moving some one instruction exported functions to
static inlines, which in turns allows removing one of the two grotesque
symbol_get()'s related to mdev in the core code.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <3-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is a small race where the parent is NULL even though the kobj has
already been made visible in sysfs.
For instance the attribute_group is made visible in sysfs_create_files()
and the mdev_type_attr_show() does:
ret = attr->show(kobj, type->parent->dev, buf);
Which will crash on NULL parent. Move the parent setup to before the type
pointer leaves the stack frame.
Fixes: 7b96953bc6 ("vfio: Mediated device Core driver")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <2-v2-d36939638fc6+d54-vfio2_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use an explicit enable_nesting method instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Li Yang <leoyang.li@nxp.com>
Link: https://lore.kernel.org/r/20210401155256.298656-17-hch@lst.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The geometry information can be trivially queried from the iommu_domain
struture.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Li Yang <leoyang.li@nxp.com>
Link: https://lore.kernel.org/r/20210401155256.298656-16-hch@lst.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There are no longer any users, so it can go away. Everything is using
container_of now.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <14-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This tidies a few confused places that think they can have a refcount on
the vfio_device but the device_data could be NULL, that isn't possible by
design.
Most of the change falls out when struct vfio_devices is updated to just
store the struct vfio_pci_device itself. This wasn't possible before
because there was no easy way to get from the 'struct vfio_pci_device' to
the 'struct vfio_device' to put back the refcount.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <13-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is the standard kernel pattern, the ops associated with a struct get
the struct pointer in for typesafety. The expected design is to use
container_of to cleanly go from the subsystem level type to the driver
level type without having any type erasure in a void *.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <12-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The macro wrongly uses 'dev' as both the macro argument and the member
name, which means it fails compilation if any caller uses a word other
than 'dev' as the single argument. Fix this defect by making it into
proper static inline, which is more clear and typesafe anyhow.
Fixes: 99e3123e3d ("vfio-mdev: Make mdev_device private and abstract interfaces")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <11-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
mdev gets little benefit because it doesn't actually do anything, however
it is the last user, so move the vfio_init/register/unregister_group_dev()
code here for now.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <10-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
pci already allocates a struct vfio_pci_device with exactly the same
lifetime as vfio_device, switch to the new API and embed vfio_device in
vfio_pci_device.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <9-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_add_group_dev() must be called only after all of the private data in
vdev is fully setup and ready, otherwise there could be races with user
space instantiating a device file descriptor and starting to call ops.
For instance vfio_pci_reflck_attach() sets vdev->reflck and
vfio_pci_open(), called by fops open, unconditionally derefs it, which
will crash if things get out of order.
Fixes: cc20d79990 ("vfio/pci: Introduce VF token")
Fixes: e309df5b0c ("vfio/pci: Parallelize device open and release")
Fixes: 6eb7018705 ("vfio-pci: Move idle devices to D3hot power state")
Fixes: ecaa1f6a01 ("vfio-pci: Add VGA arbiter client")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <8-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_pci_probe() is quite complicated, with optional VF and VGA sub
components. Move these into clear init/uninit functions and have a linear
flow in probe/remove.
This fixes a few little buglets:
- vfio_pci_remove() is in the wrong order, vga_client_register() removes
a notifier and is after kfree(vdev), but the notifier refers to vdev,
so it can use after free in a race.
- vga_client_register() can fail but was ignored
Organize things so destruction order is the reverse of creation order.
Fixes: ecaa1f6a01 ("vfio-pci: Add VGA arbiter client")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <7-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
fsl-mc already allocates a struct vfio_fsl_mc_device with exactly the same
lifetime as vfio_device, switch to the new API and embed vfio_device in
vfio_fsl_mc_device. While here remove the devm usage for the vdev, this
code is clean and doesn't need devm.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <6-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_add_group_dev() must be called only after all of the private data in
vdev is fully setup and ready, otherwise there could be races with user
space instantiating a device file descriptor and starting to call ops.
For instance vfio_fsl_mc_reflck_attach() sets vdev->reflck and
vfio_fsl_mc_open(), called by fops open, unconditionally derefs it, which
will crash if things get out of order.
This driver started life with the right sequence, but two commits added
stuff after vfio_add_group_dev().
Fixes: 2e0d29561f ("vfio/fsl-mc: Add irq infrastructure for fsl-mc devices")
Fixes: f2ba7e8c94 ("vfio/fsl-mc: Added lock support in preparation for interrupt handling")
Co-developed-by: Diana Craciun OSS <diana.craciun@oss.nxp.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <5-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
platform already allocates a struct vfio_platform_device with exactly
the same lifetime as vfio_device, switch to the new API and embed
vfio_device in vfio_platform_device.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <4-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This makes the struct vfio_device part of the public interface so it
can be used with container_of and so forth, as is typical for a Linux
subystem.
This is the first step to bring some type-safety to the vfio interface by
allowing the replacement of 'void *' and 'struct device *' inputs with a
simple and clear 'struct vfio_device *'
For now the self-allocating vfio_add_group_dev() interface is kept so each
user can be updated as a separate patch.
The expected usage pattern is
driver core probe() function:
my_device = kzalloc(sizeof(*mydevice));
vfio_init_group_dev(&my_device->vdev, dev, ops, mydevice);
/* other driver specific prep */
vfio_register_group_dev(&my_device->vdev);
dev_set_drvdata(dev, my_device);
driver core remove() function:
my_device = dev_get_drvdata(dev);
vfio_unregister_group_dev(&my_device->vdev);
/* other driver specific tear down */
kfree(my_device);
Allowing the driver to be able to use the drvdata and vfio_device to go
to/from its own data.
The pattern also makes it clear that vfio_register_group_dev() must be
last in the sequence, as once it is called the core code can immediately
start calling ops. The init/register gap is provided to allow for the
driver to do setup before ops can be called and thus avoid races.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <3-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_device is using a 'sleep until all refs go to zero' pattern for
its lifetime, but it is indirectly coded by repeatedly scanning the group
list waiting for the device to be removed on its own.
Switch this around to be a direct representation, use a refcount to count
the number of places that are blocking destruction and sleep directly on a
completion until that counter goes to zero. kfree the device after other
accesses have been excluded in vfio_del_group_dev(). This is a fairly
common Linux idiom.
Due to this we can now remove kref_put_mutex(), which is very rarely used
in the kernel. Here it is being used to prevent a zero ref device from
being seen in the group list. Instead allow the zero ref device to
continue to exist in the device_list and use refcount_inc_not_zero() to
exclude it once refs go to zero.
This patch is organized so the next patch will be able to alter the API to
allow drivers to provide the kfree.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <2-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_device->group value has a get obtained during
vfio_add_group_dev() which gets moved from the stack to vfio_device->group
in vfio_group_create_device().
The reference remains until we reach the end of vfio_del_group_dev() when
it is put back.
Thus anything that already has a kref on the vfio_device is guaranteed a
valid group pointer. Remove all the extra reference traffic.
It is tricky to see, but the get at the start of vfio_del_group_dev() is
actually pairing with the put hidden inside vfio_device_put() a few lines
below.
A later patch merges vfio_group_create_device() into vfio_add_group_dev()
which makes the ownership and error flow on the create side easier to
follow.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <1-v3-225de1400dfc+4e074-vfio1_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This driver never had any open userspace (which for VFIO would include
VM kernel drivers) that use it, and thus should never have been added
by our normal userspace ABI rules.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <20210326061311.1497642-2-hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The check i > npage at the end of vfio_iommu_type1_unpin_pages is unused
unless npage < 0, but if npage < 0, this function will return npage, which
should return -EINVAL instead. So let's just check the parameter npage at
the start of the function. By the way, replace unpin_exit with break.
Signed-off-by: Shenming Lu <lushenming@huawei.com>
Message-Id: <20210406135009.1707-1-lushenming@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is a spelling mistake in a comment, fix it.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210326083528.1329-5-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There are several spelling mistakes, as follows:
thru ==> through
presense ==> presence
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210326083528.1329-4-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is a spelling mistake in a comment, fix it.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210326083528.1329-3-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There are several spelling mistakes, as follows:
userpsace ==> userspace
Accouting ==> Accounting
exlude ==> exclude
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20210326083528.1329-2-thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Before opregion version 2.0 VBT data is stored in opregion mailbox #4,
but when VBT data exceeds 6KB size and cannot be within mailbox #4
then from opregion v2.0+, Extended VBT region, next to opregion is
used to hold the VBT data, so the total size will be opregion size plus
extended VBT region size.
Since opregion v2.0 with physical host VBT address would not be
practically available for end user and guest can not directly access
host physical address, so it is not supported.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Swee Yee Fonn <swee.yee.fonn@intel.com>
Signed-off-by: Fred Gao <fred.gao@intel.com>
Message-Id: <20210325170953.24549-1-fred.gao@intel.com>
Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This blank line is unnecessary, so remove it.
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Message-Id: <1615808073-178604-1-git-send-email-wangzhou1@hisilicon.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Compiling the nvlink stuff relies on the SPAPR_TCE_IOMMU otherwise there
are compile errors:
drivers/vfio/pci/vfio_pci_nvlink2.c:101:10: error: implicit declaration of function 'mm_iommu_put' [-Werror,-Wimplicit-function-declaration]
ret = mm_iommu_put(data->mm, data->mem);
As PPC only defines these functions when the config is set.
Previously this wasn't a problem by chance as SPAPR_TCE_IOMMU was the only
IOMMU that could have satisfied IOMMU_API on POWERNV.
Fixes: 179209fa12 ("vfio: IOMMU_API should be selected")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <0-v1-83dba9768fc3+419-vfio_nvlink2_kconfig_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When vfio_pin_pages_remote() returns with a partial batch consisting of
a single VM_PFNMAP pfn, a subsequent call will unfortunately try
restoring it from batch->pages, resulting in vfio mapping the wrong page
and unbalancing the page refcount.
Prevent the function from returning with this kind of partial batch to
avoid the issue. There's no explicit check for a VM_PFNMAP pfn because
it's awkward to do so, so infer it from characteristics of the batch
instead. This may result in occasional false positives but keeps the
code simpler.
Fixes: 4d83de6da2 ("vfio/type1: Batch page pinning")
Link: https://lkml.kernel.org/r/20210323133254.33ed9161@omen.home.shazbot.org/
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Message-Id: <20210325010552.185481-1-daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vaddr_get_pfns() now returns the positive number of pfns successfully
gotten instead of zero. vfio_pin_page_external() might return 1 to
vfio_iommu_type1_pin_pages(), which will treat it as an error, if
vaddr_get_pfns() is successful but vfio_pin_page_external() doesn't
reach vfio_lock_acct().
Fix it up in vfio_pin_page_external(). Found by inspection.
Fixes: be16c1fd99 ("vfio/type1: Change success value of vaddr_get_pfn()")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Message-Id: <20210308172452.38864-1-daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO_IOMMU_TYPE1 does not compile with !MMU:
../drivers/vfio/vfio_iommu_type1.c: In function 'follow_fault_pfn':
../drivers/vfio/vfio_iommu_type1.c:536:22: error: implicit declaration of function 'pte_write'; did you mean 'vfs_write'? [-Werror=implicit-function-declaration]
So require it.
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <0-v1-02cb5500df6e+78-vfio_no_mmu_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
CONFIG_VFIO_AMBA has a light use of AMBA, adding some inline fallbacks
when AMBA is disabled will allow it to be compiled under COMPILE_TEST and
make VFIO easier to maintain.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <3-v1-df057e0f92c3+91-vfio_arm_compile_test_jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
x86 can build platform bus code too, so vfio-platform and all the platform
reset implementations compile successfully on x86.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <2-v1-df057e0f92c3+91-vfio_arm_compile_test_jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As IOMMU_API is a kconfig without a description (eg does not show in the
menu) the correct operator is select not 'depends on'. Using 'depends on'
for this kind of symbol means VFIO is not selectable unless some other
random kconfig has already enabled IOMMU_API for it.
Fixes: cba3345cc4 ("vfio: VFIO core")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Message-Id: <1-v1-df057e0f92c3+91-vfio_arm_compile_test_jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Some ILP32 architectures support mapping a 32-bit vaddr within a 64-bit
iova space. The unmap-all code uses 32-bit SIZE_MAX as an upper bound on
the extent of the mappings within iova space, so mappings above 4G cannot
be found and unmapped. Use U64_MAX instead, and use u64 for size variables.
This also fixes a static analysis bug found by the kernel test robot running
smatch for ILP32.
Fixes: 0f53afa12b ("vfio/type1: unmap cleanup")
Fixes: c196509953 ("vfio/type1: implement unmap all")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Message-Id: <1614281102-230747-1-git-send-email-steven.sistare@oracle.com>
Link: https://lore.kernel.org/linux-mm/20210222141043.GW2222@kadam
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pinning one 4K page at a time is inefficient, so do it in batches of 512
instead. This is just an optimization with no functional change
intended, and in particular the driver still calls iommu_map() with the
largest physically contiguous range possible.
Add two fields in vfio_batch to remember where to start between calls to
vfio_pin_pages_remote(), and use vfio_batch_unpin() to handle remaining
pages in the batch in case of error.
qemu pins pages for guests around 8% faster on my test system, a
two-node Broadwell server with 128G memory per node. The qemu process
was bound to one node with its allocations constrained there as well.
base test
guest ---------------- ----------------
mem (GB) speedup avg sec (std) avg sec (std)
1 7.4% 0.61 (0.00) 0.56 (0.00)
2 8.3% 0.93 (0.00) 0.85 (0.00)
4 8.4% 1.46 (0.00) 1.34 (0.00)
8 8.6% 2.54 (0.01) 2.32 (0.00)
16 8.3% 4.66 (0.00) 4.27 (0.01)
32 8.3% 8.94 (0.01) 8.20 (0.01)
64 8.2% 17.47 (0.01) 16.04 (0.03)
120 8.5% 32.45 (0.13) 29.69 (0.01)
perf diff confirms less time spent in pup. Here are the top ten
functions:
Baseline Delta Abs Symbol
78.63% +6.64% clear_page_erms
1.50% -1.50% __gup_longterm_locked
1.27% -0.78% __get_user_pages
+0.76% kvm_zap_rmapp.constprop.0
0.54% -0.53% vmacache_find
0.55% -0.51% get_pfnblock_flags_mask
0.48% -0.48% __get_user_pages_remote
+0.39% slot_rmap_walk_next
+0.32% vfio_pin_map_dma
+0.26% kvm_handle_hva_range
...
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Get ready to pin more pages at once with struct vfio_batch, which
represents a batch of pinned pages.
The struct has a fallback page pointer to avoid two unlikely scenarios:
pointlessly allocating a page if disable_hugepages is enabled or failing
the whole pinning operation if the kernel can't allocate memory.
vaddr_get_pfn() becomes vaddr_get_pfns() to prepare for handling
multiple pages, though for now only one page is stored in the pages
array.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vaddr_get_pfn() simply returns 0 on success. Have it report the number
of pfns successfully gotten instead, whether from page pinning or
follow_fault_pfn(), which will be used later when batching pinning.
Change the last check in vfio_pin_pages_remote() for consistency with
the other two.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
follow_pfn() doesn't make sure that we're using the correct page
protections, get the pte with follow_pte() so that we can test
protections and get the pfn from the pte.
Fixes: 5cbf3264bc ("vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In case we're running on s390 system always expose the capabilities for
configuration of zPCI devices. In case we're running on different
platform, continue as usual.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
linux/kthread.h is included more than once, remove the one that isn't
necessary.
Fixes: 898b9eaeb3 ("vfio/type1: block on invalid vaddr")
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In case allocation fails, we must behave correctly and exit with error.
Fixes: e6b817d4b8 ("vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO")
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
All amba drivers return 0 in their remove callback. Together with the
driver core ignoring the return value anyhow, it doesn't make sense to
return a value here.
Change the remove prototype to return void, which makes it explicit that
returning an error value doesn't work as expected. This simplifies changing
the core remove callback to return void, too.
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org> # for drivers/memory
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com> # for hwtracing/coresight
Acked-By: Vinod Koul <vkoul@kernel.org> # for dmaengine
Acked-by: Guenter Roeck <linux@roeck-us.net> # for watchdog
Acked-by: Wolfram Sang <wsa@kernel.org> # for I2C
Acked-by: Takashi Iwai <tiwai@suse.de> # for sound
Acked-by: Vladimir Zapolskiy <vz@mleia.com> # for memory/pl172
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20210126165835.687514-5-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
vfio_platform_remove_common() cannot return non-NULL in
vfio_amba_remove() as the latter is only called if vfio_amba_probe()
returned success.
Diagnosed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20210126165835.687514-4-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Zdev static functions do not use vdev argument. Remove it.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The pci user accessors return negative errno's on error.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[aw: drop Fixes tag, pcibios_err_to_errno() behaves correctly for -errno]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With this counter, we never need to traverse all groups to update
pinned_scope of vfio_iommu.
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_sanity_check_pfn_list() is used to check whether pfn_list and
notifier are empty when remove the external domain, so it makes a
wrong assumption that only external domain will use the pinning
interface.
Now we apply the pfn_list check when a vfio_dma is removed and apply
the notifier check when all domains are removed.
Fixes: a54eb55045 ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a group with non-pinned-page dirty scope is detached with dirty
logging enabled, we should fully populate the dirty bitmaps at the
time it's removed since we don't know the extent of its previous DMA,
nor will the group be present to trigger the full bitmap when the user
retrieves the dirty bitmap.
Fixes: d6a4c18566 ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Block translation of host virtual address while an iova range has an
invalid vaddr.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Implement a notify callback that remembers if the container's file
descriptor has been closed.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Define a vfio_iommu_driver_ops notify callback, for sending events to
the driver. Drivers are not required to provide the callback, and
may ignore any events. The handling of events is driver specific.
Define the CONTAINER_CLOSE event, called when the container's file
descriptor is closed. This event signifies that no further state changes
will occur via container ioctl's.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Implement VFIO_DMA_UNMAP_FLAG_VADDR, VFIO_DMA_MAP_FLAG_VADDR, and
VFIO_UPDATE_VADDR. This is a partial implementation. Blocking is
added in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Modify the iteration in vfio_dma_do_unmap so it does not depend on deletion
of each dma entry. Add a variant of vfio_find_dma that returns the entry
with the lowest iova in the search range to initialize the iteration. No
externally visible change, but this behavior is needed in the subsequent
update-vaddr patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Minor changes in vfio_dma_do_unmap to improve readability, which also
simplify the subsequent unmap-all patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* PSCI relay at EL2 when "protected KVM" is enabled
* New exception injection code
* Simplification of AArch32 system register handling
* Fix PMU accesses when no PMU is enabled
* Expose CSV3 on non-Meltdown hosts
* Cache hierarchy discovery fixes
* PV steal-time cleanups
* Allow function pointers at EL2
* Various host EL2 entry cleanups
* Simplification of the EL2 vector allocation
s390:
* memcg accouting for s390 specific parts of kvm and gmap
* selftest for diag318
* new kvm_stat for when async_pf falls back to sync
x86:
* Tracepoints for the new pagetable code from 5.10
* Catch VFIO and KVM irqfd events before userspace
* Reporting dirty pages to userspace with a ring buffer
* SEV-ES host support
* Nested VMX support for wait-for-SIPI activity state
* New feature flag (AVX512 FP16)
* New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
* Selftest improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
=ISud
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Much x86 work was pushed out to 5.12, but ARM more than made up for it.
ARM:
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
s390:
- memcg accouting for s390 specific parts of kvm and gmap
- selftest for diag318
- new kvm_stat for when async_pf falls back to sync
x86:
- Tracepoints for the new pagetable code from 5.10
- Catch VFIO and KVM irqfd events before userspace
- Reporting dirty pages to userspace with a ring buffer
- SEV-ES host support
- Nested VMX support for wait-for-SIPI activity state
- New feature flag (AVX512 FP16)
- New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
- Selftest improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: SVM: fix 32-bit compilation
KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
KVM: SVM: Provide support to launch and run an SEV-ES guest
KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
KVM: SVM: Provide support for SEV-ES vCPU loading
KVM: SVM: Provide support for SEV-ES vCPU creation/loading
KVM: SVM: Update ASID allocation to support SEV-ES guests
KVM: SVM: Set the encryption mask for the SVM host save area
KVM: SVM: Add NMI support for an SEV-ES guest
KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
KVM: SVM: Do not report support for SMM for an SEV-ES guest
KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
KVM: SVM: Add support for EFER write traps for an SEV-ES guest
KVM: SVM: Support string IO operations for an SEV-ES guest
KVM: SVM: Support MMIO for an SEV-ES guest
KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
KVM: SVM: Create trace events for VMGEXIT processing
...
- Fix uninitialized list walk in error path (Eric Auger)
- Use io_remap_pfn_range() (Jason Gunthorpe)
- Allow fallback support for NVLink on POWER8 (Alexey Kardashevskiy)
- Enable mdev request interrupt with CCW support (Eric Farman)
- Enable interface to iommu_domain from vfio_group (Lu Baolu)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJf2mBlAAoJECObm247sIsi4Y0P/2+beHwcZfSn5HSSSeqFR+uj
HfiZYN05bFcVirHPIPfY56kbe8+XNqwHAUPqch+c2iuGhLHuUIXkHwFq8VqvBDY4
xVqu9ZB23e881dF7NPzMAcK2UMDDwtlmeaf+7oRADVmTn58hErU9qt813i7a+x1S
5p8UpCTf+X+sh434cV882/TIV4hWgaFgn1/Gy3l8GDaJUW1lQb3MHZ5XY+o9KOxt
4R7VLIYZCDwEcByUsz8p4RLD036YpSS2Ir+CHXqeArtKrwRjjM62cSwDCOuM6ewZ
GJ7O3YzPp9FQ75F3oorL3D+ojPY6AU1QjZKg0+gAQS3kucewwkv+vI+RDhSc8Xx7
PFTW2bLk4cu9LHQxUT64uH5Qoa0NtfPBUGvgsR4kXPCRClk71ZcGZgaQD3CWJBhE
CTPI1OLHeJgp7MGXAArGRzFOtf0nux5oxOmcmT5fg4icG+x7BgcvDd7dVhyGAjn7
Gp87OOtJ3itDhWIlO1aTJVHEt42b1eezLkkyIKHfPlDLJmyHEfCOTjDLftr7Rmma
2IlyDJs83MCahQFcjqbPJWqh2Ttda8+ItutiklwBJRHe6EC+4WML6JaHgXe71CDi
9Y9HwKLxYFr1pwUpQP6bxzEDlTCPasWyhBmOAUnRtZdU/daX+KGp4WqSFBYkPIgy
ERmTCiJGJ3p41V08qOki
=W/GT
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.11-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Fix uninitialized list walk in error path (Eric Auger)
- Use io_remap_pfn_range() (Jason Gunthorpe)
- Allow fallback support for NVLink on POWER8 (Alexey Kardashevskiy)
- Enable mdev request interrupt with CCW support (Eric Farman)
- Enable interface to iommu_domain from vfio_group (Lu Baolu)
* tag 'vfio-v5.11-rc1' of git://github.com/awilliam/linux-vfio:
vfio/type1: Add vfio_group_iommu_domain()
vfio-ccw: Wire in the request callback
vfio-mdev: Wire in a request handler for mdev parent
vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU
vfio-pci: Use io_remap_pfn_range() for PCI IO memory
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
Add the API for getting the domain from a vfio group. This could be used
by the physical device drivers which rely on the vfio/mdev framework for
mediated device user level access. The typical use case like below:
unsigned int pasid;
struct vfio_group *vfio_group;
struct iommu_domain *iommu_domain;
struct device *dev = mdev_dev(mdev);
struct device *iommu_device = mdev_get_iommu_device(dev);
if (!iommu_device ||
!iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
return -EINVAL;
vfio_group = vfio_group_get_external_user_from_dev(dev);
if (IS_ERR_OR_NULL(vfio_group))
return -EFAULT;
iommu_domain = vfio_group_iommu_domain(vfio_group);
if (IS_ERR_OR_NULL(iommu_domain)) {
vfio_group_put_external_user(vfio_group);
return -EFAULT;
}
pasid = iommu_aux_get_pasid(iommu_domain, iommu_device);
if (pasid < 0) {
vfio_group_put_external_user(vfio_group);
return -EFAULT;
}
/* Program device context with pasid value. */
...
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Switch to use new platform_get_mem_or_io() instead of home grown analogue.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20201209203642.27648-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While performing some destructive tests with vfio-ccw, where the
paths to a device are forcible removed and thus the device itself
is unreachable, it is rather easy to end up in an endless loop in
vfio_del_group_dev() due to the lack of a request callback for the
associated device.
In this example, one MDEV (77c) is used by a guest, while another
(77b) is not. The symptom is that the iommu is detached from the
mdev for 77b, but not 77c, until that guest is shutdown:
[ 238.794867] vfio_ccw 0.0.077b: MDEV: Unregistering
[ 238.794996] vfio_mdev 11f2d2bc-4083-431d-a023-eff72715c4f0: Removing from iommu group 2
[ 238.795001] vfio_mdev 11f2d2bc-4083-431d-a023-eff72715c4f0: MDEV: detaching iommu
[ 238.795036] vfio_ccw 0.0.077c: MDEV: Unregistering
...silence...
Let's wire in the request call back to the mdev device, so that a
device being physically removed from the host can be (gracefully?)
handled by the parent device at the time the device is removed.
Add a message when registering the device if a driver doesn't
provide this callback, so a clue is given that this same loop
may be encountered in a similar situation, and a message when
this occurs instead of the awkward silence noted above.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We execute certain NPU2 setup code (such as mapping an LPID to a device
in NPU2) unconditionally if an Nvlink bridge is detected. However this
cannot succeed on POWER8NVL machines as the init helpers return an error
other than ENODEV which means the device is there is and setup failed so
vfio_pci_enable() fails and pass through is not possible.
This changes the two NPU2 related init helpers to return -ENODEV if
there is no "memory-region" device tree property as this is
the distinction between NPU and NPU2.
Tested on
- POWER9 pvr=004e1201, Ubuntu 19.04 host, Ubuntu 18.04 vm,
NVIDIA GV100 10de:1db1 driver 418.39
- POWER8 pvr=004c0100, RHEL 7.6 host, Ubuntu 16.10 vm,
NVIDIA P100 10de:15f9 driver 396.47
Fixes: 7f92891778 ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Cc: stable@vger.kernel.org # 5.0
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
commit f8f6ae5d07 ("mm: always have io_remap_pfn_range() set
pgprot_decrypted()") allows drivers using mmap to put PCI memory mapped
BAR space into userspace to work correctly on AMD SME systems that default
to all memory encrypted.
Since vfio_pci_mmap_fault() is working with PCI memory mapped BAR space it
should be calling io_remap_pfn_range() otherwise it will not work on SME
systems.
Fixes: 11c4cd07ba ("vfio-pci: Fault mmaps to enable vma tracking")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In case an error occurs in vfio_pci_enable() before the call to
vfio_pci_probe_mmaps(), vfio_pci_disable() will try to iterate
on an uninitialized list and cause a kernel panic.
Lets move to the initialization to vfio_pci_probe() to fix the
issue.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Fixes: 05f0c03fba ("vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive")
CC: Stable <stable@vger.kernel.org> # v4.7+
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Bypass the IGD initialization when -ENODEV returns,
that should be the case if opregion is not available for IGD
or within discrete graphics device's option ROM,
or host/lpc bridge is not found.
Then use of -ENODEV here means no special device resources found
which needs special care for VFIO, but we still allow other normal
device resource access.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Xiong Zhang <xiong.y.zhang@intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Fred Gao <fred.gao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
pm_runtime_get_sync() will increment pm usage counter even it
failed. Forgetting to call pm_runtime_put will result in
reference leak in vfio_platform_open, so we should fix it.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The ioeventfd is called under spinlock with interrupts disabled,
therefore if the memory lock is contended defer code that might
sleep to a thread context.
Fixes: bc93b9ae01 ("vfio-pci: Avoid recursive read-lock usage")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209253#c1
Reported-by: Ian Pilcher <arequipeno@gmail.com>
Tested-by: Ian Pilcher <arequipeno@gmail.com>
Tested-by: Justin Gatzen <justin.gatzen@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixed compiler warning:
drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c:16:5: warning: no previous
prototype for function 'vfio_fsl_mc_irqs_allocate' [-Wmissing-prototypes]
^
drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c:16:1: note: declare 'static'
if the function is not intended to be used outside of this translation unit
int vfio_fsl_mc_irqs_allocate(struct vfio_fsl_mc_device *vdev)
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
My static analsysis tool complains that the "index" can be negative.
There are some checks in do_mmap() which try to prevent underflows but
I don't know if they are sufficient for this situation. Either way,
making "index" unsigned is harmless so let's do it just to be safe.
Fixes: 6724728968 ("vfio/fsl-mc: Allow userspace to MMAP fsl-mc device MMIO regions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The copy_to_user() function returns the number of bytes remaining to be
copied, but this code should return -EFAULT.
Fixes: df747bcd5b ("vfio/fsl-mc: Implement VFIO_DEVICE_GET_REGION_INFO ioctl call")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When attaching a new group to the container, let's use the new helper
vfio_iommu_find_iommu_group() to check if it's already attached. There
is no functional change.
Also take this chance to add a missing blank line.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- New fsl-mc vfio bus driver supporting userspace drivers of objects
within NXP's DPAA2 architecture (Diana Craciun)
- Support for exposing zPCI information on s390 (Matthew Rosato)
- Fixes for "detached" VFs on s390 (Matthew Rosato)
- Fixes for pin-pages and dma-rw accesses (Yan Zhao)
- Cleanups and optimize vconfig regen (Zenghui Yu)
- Fix duplicate irq-bypass token registration (Alex Williamson)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJfkcCjAAoJECObm247sIsi2XIP/j7NL4glPrWU37mesz9dd5nx
SmZhcmxnOqZSQkOCnu+hNFZ9e+tdQjuX+jATOZaYz5l55bLAFmBlBj1Dv8HWaCVI
mTbJ6xXUwdOvNSxbFH6BIUkJg8otR0iEkefVyJLNlF84FsaDknH4yZxx0vdeczjF
wTkkk3+4VmH+4klvPIa9v0eL7yeKeFmgls9nQViVE5kDWUF4us/z/oHlVm9wR+mL
2r3DEjHyz4L2hwVEkhZk7ytR6szdhuhF2l7NoMmaSEXRXjBzJoO6I3P9Y2W4i+su
MFgTfiQ+OpIfVuiR8GzGev+/SrjWGX0Hvb2sYriKOELjhyedkE2kmxacbqMZ/UE+
SRAhFf64C1rzJ4g1IW//Gg+9ObIPqlkqU52VDbOZdCED0AquwSyVmdwIUAK6qF+I
HLOyZXhMI8EZ+w063cS+aKLJIvQTBbfIdMmPZkopVZhwWB3N3BjdvBKA+rPpPoTx
0DpeUo891+zyeEE4aunUmCB8HFnBPgUa+XZqg2juq9MxjScsqgTzA0WEZg7jV4oj
tORQrqoAKJgSk9oVL3EvAnr+IJix3ScRTqYymESORkz/lRCk2hFX48qdeW+qiSP8
W1DHOnivFb1+JzhuZyaRKFWy1mK0EQQWTsE2b2ymPMKJbFhi+pVxaksmeG5x+4Q9
SAp+Qma8Aj3UtBKcj/S+
=LDPo
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.10-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- New fsl-mc vfio bus driver supporting userspace drivers of objects
within NXP's DPAA2 architecture (Diana Craciun)
- Support for exposing zPCI information on s390 (Matthew Rosato)
- Fixes for "detached" VFs on s390 (Matthew Rosato)
- Fixes for pin-pages and dma-rw accesses (Yan Zhao)
- Cleanups and optimize vconfig regen (Zenghui Yu)
- Fix duplicate irq-bypass token registration (Alex Williamson)
* tag 'vfio-v5.10-rc1' of git://github.com/awilliam/linux-vfio: (30 commits)
vfio iommu type1: Fix memory leak in vfio_iommu_type1_pin_pages
vfio/pci: Clear token on bypass registration failure
vfio/fsl-mc: fix the return of the uninitialized variable ret
vfio/fsl-mc: Fix the dead code in vfio_fsl_mc_set_irq_trigger
vfio/fsl-mc: Fixed vfio-fsl-mc driver compilation on 32 bit
MAINTAINERS: Add entry for s390 vfio-pci
vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO
vfio/fsl-mc: Add support for device reset
vfio/fsl-mc: Add read/write support for fsl-mc devices
vfio/fsl-mc: trigger an interrupt via eventfd
vfio/fsl-mc: Add irq infrastructure for fsl-mc devices
vfio/fsl-mc: Added lock support in preparation for interrupt handling
vfio/fsl-mc: Allow userspace to MMAP fsl-mc device MMIO regions
vfio/fsl-mc: Implement VFIO_DEVICE_GET_REGION_INFO ioctl call
vfio/fsl-mc: Implement VFIO_DEVICE_GET_INFO ioctl
vfio/fsl-mc: Scan DPRC objects on vfio-fsl-mc driver bind
vfio: Introduce capability definitions for VFIO_DEVICE_GET_INFO
s390/pci: track whether util_str is valid in the zpci_dev
s390/pci: stash version in the zpci_dev
vfio/fsl-mc: Add VFIO framework skeleton for fsl-mc devices
...
pfn is not added to pfn_list when vfio_add_to_pfn_list fails.
vfio_unpin_page_external will exit directly without calling
vfio_iova_put_vfio_pfn. This will lead to a memory leak.
Fixes: a54eb55045 ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Xiaoyang Xu <xuxiaoyang2@huawei.com>
[aw: simplified logic, add Fixes]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The eventfd context is used as our irqbypass token, therefore if an
eventfd is re-used, our token is the same. The irqbypass code will
return an -EBUSY in this case, but we'll still attempt to unregister
the producer, where if that duplicate token still exists, results in
removing the wrong object. Clear the token of failed producers so
that they harmlessly fall out when unregistered.
Fixes: 6d7425f109 ("vfio: Register/unregister irq_bypass_producer")
Reported-by: guomin chen <guomin_chen@sina.com>
Tested-by: guomin chen <guomin_chen@sina.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_fsl_mc_reflck_attach function may return, on success path,
an uninitialized variable. Fix the problem by initializing the return
variable to 0.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: f2ba7e8c94 ("vfio/fsl-mc: Added lock support in preparation for interrupt handling")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The preceding patches have ensured that core dumping properly takes the
mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
its users.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Static analysis discovered that some code in vfio_fsl_mc_set_irq_trigger
is dead code. Fixed the code by changing the conditions order.
Fixes: cc0ee20bd9 ("vfio/fsl-mc: trigger an interrupt via eventfd")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The FSL_MC_BUS on which the VFIO-FSL-MC driver is dependent on
can be compiled on other architectures as well (not only ARM64)
including 32 bit architectures.
Include linux/io-64-nonatomic-hi-lo.h to make writeq/readq used
in the driver available on 32bit platforms.
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Define a new configuration entry VFIO_PCI_ZDEV for VFIO/PCI.
When this s390-only feature is configured we add capabilities to the
VFIO_DEVICE_GET_INFO ioctl that describe features of the associated
zPCI device and its underlying hardware.
This patch is based on work previously done by Pierre Morel.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently only resetting the DPRC container is supported which
will reset all the objects inside it. Resetting individual
objects is possible from the userspace by issueing commands
towards MC firmware.
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The software uses a memory-mapped I/O command interface (MC portals) to
communicate with the MC hardware. This command interface is used to
discover, enumerate, configure and remove DPAA2 objects. The DPAA2
objects use MSIs, so the command interface needs to be emulated
such that the correct MSI is configured in the hardware (the guest
has the virtual MSIs).
This patch is adding read/write support for fsl-mc devices. The mc
commands are emulated by the userspace. The host is just passing
the correct command to the hardware.
Also the current patch limits userspace to write complete
64byte command once and read 64byte response by one ioctl.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch allows to set an eventfd for fsl-mc device interrupts
and also to trigger the interrupt eventfd from userspace for testing.
All fsl-mc device interrupts are MSIs. The MSIs are allocated from
the MSI domain only once per DPRC and used by all the DPAA2 objects.
The interrupts are managed by the DPRC in a pool of interrupts. Each
device requests interrupts from this pool. The pool is allocated
when the first virtual device is setting the interrupts.
The pool of interrupts is protected by a lock.
The DPRC has an interrupt of its own which indicates if the DPRC
contents have changed. However, currently, the contents of a DPRC
assigned to the guest cannot be changed at runtime, so this interrupt
is not configured.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds the skeleton for interrupt support
for fsl-mc devices. The interrupts are not yet functional,
the functionality will be added by subsequent patches.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Only the DPRC object allocates interrupts from the MSI
interrupt domain. The interrupts are managed by the DPRC in
a pool of interrupts. The access to this pool of interrupts
has to be protected with a lock.
This patch extends the current lock implementation to have a
lock per DPRC.
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow userspace to mmap device regions for direct access of
fsl-mc devices.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Expose to userspace information about the memory regions.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow userspace to get fsl-mc device info (number of regions
and irqs).
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The DPRC (Data Path Resource Container) device is a bus device and has
child devices attached to it. When the vfio-fsl-mc driver is probed
the DPRC is scanned and the child devices discovered and initialized.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
DPAA2 (Data Path Acceleration Architecture) consists in
mechanisms for processing Ethernet packets, queue management,
accelerators, etc.
The Management Complex (mc) is a hardware entity that manages the DPAA2
hardware resources. It provides an object-based abstraction for software
drivers to use the DPAA2 hardware. The MC mediates operations such as
create, discover, destroy of DPAA2 objects.
The MC provides memory-mapped I/O command interfaces (MC portals) which
DPAA2 software drivers use to operate on DPAA2 objects.
A DPRC is a container object that holds other types of DPAA2 objects.
Each object in the DPRC is a Linux device and bound to a driver.
The MC-bus driver is a platform driver (different from PCI or platform
bus). The DPRC driver does runtime management of a bus instance. It
performs the initial scan of the DPRC and handles changes in the DPRC
configuration (adding/removing objects).
All objects inside a container share the same hardware isolation
context, meaning that only an entire DPRC can be assigned to
a virtual machine.
When a container is assigned to a virtual machine, all the objects
within that container are assigned to that virtual machine.
The DPRC container assigned to the virtual machine is not allowed
to change contents (add/remove objects) by the guest. The restriction
is set by the host and enforced by the mc hardware.
The DPAA2 objects can be directly assigned to the guest. However
the MC portals (the memory mapped command interface to the MC) need
to be emulated because there are commands that configure the
interrupts and the isolation IDs which are virtual in the guest.
Example:
echo vfio-fsl-mc > /sys/bus/fsl-mc/devices/dprc.2/driver_override
echo dprc.2 > /sys/bus/fsl-mc/drivers/vfio-fsl-mc/bind
The dprc.2 is bound to the VFIO driver and all the objects within
dprc.2 are going to be bound to the VFIO driver.
This patch adds the infrastructure for VFIO support for fsl-mc
devices. Subsequent patches will add support for binding and secure
assigning these devices using VFIO.
More details about the DPAA2 objects can be found here:
Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The count of dirtied pages is not only determined by count of copied
pages, but also by the start offset.
e.g. if offset = PAGE_SIZE - 1, and *copied=2, the dirty pages count
is 2, instead of 1 or 0.
Fixes: d6a4c18566 ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When error occurs, need to put vfio group after a successful get.
Fixes: 95fc87b441 ("vfio: Selective dirty page tracking if IOMMU backed device pins pages")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
While it is true that devices with is_virtfn=1 will have a Memory Space
Enable bit that is hard-wired to 0, this is not the only case where we
see this behavior -- For example some bare-metal hypervisors lack
Memory Space Enable bit emulation for devices not setting is_virtfn
(s390). Fix this by instead checking for the newly-added
no_command_memory bit which directly denotes the need for
PCI_COMMAND_MEMORY emulation in vfio.
Fixes: abafbc551f ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit 492855939b ("vfio/type1: Limit DMA mappings per container")
added the ability to limit the number of memory backed DMA mappings.
However on s390x, when lazy mapping is in use, we use a very large
number of concurrent mappings. Let's provide the current allowable
number of DMA mappings to userspace via the IOMMU info chain so that
userspace can take appropriate mitigation.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Page pinning is used both to translate and pin device mappings for DMA
purpose, as well as to indicate to the IOMMU backend to limit the dirty
page scope to those pages that have been pinned, in the case of an IOMMU
backed device.
To support this, the vfio_pin_pages() interface limits itself to only
singleton groups such that the IOMMU backend can consider dirty page
scope only at the group level. Implement the same requirement for the
vfio_group_pin_pages() interface.
Fixes: 95fc87b441 ("vfio: Selective dirty page tracking if IOMMU backed device pins pages")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now we regenerate vconfig for all the BARs via vfio_bar_fixup(), every
time any offset of any of them are read. Though BARs aren't re-read
regularly, the regeneration can be avoided if no BARs had been written
since they were last read, in which case vdev->bardirty is false.
Let's return immediately in vfio_bar_fixup() if bardirty is false.
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It was added by commit 137e553135 ("vfio/pci: Add sriov_configure
support") but duplicates a forward declaration earlier in the file.
Remove it.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To keep naming consistent we should stick with *iotlb*. This patch
renames a few remaining functions.
Signed-off-by: Tom Murphy <murphyt7@tcd.ie>
Link: https://lore.kernel.org/r/20200817210051.13546-1-murphyt7@tcd.ie
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The vfio_iommu_replay() function does not currently unwind on error,
yet it does pin pages, perform IOMMU mapping, and modify the vfio_dma
structure to indicate IOMMU mapping. The IOMMU mappings are torn down
when the domain is destroyed, but the other actions go on to cause
trouble later. For example, the iommu->domain_list can be empty if we
only have a non-IOMMU backed mdev attached. We don't currently check
if the list is empty before getting the first entry in the list, which
leads to a bogus domain pointer. If a vfio_dma entry is erroneously
marked as iommu_mapped, we'll attempt to use that bogus pointer to
retrieve the existing physical page addresses.
This is the scenario that uncovered this issue, attempting to hot-add
a vfio-pci device to a container with an existing mdev device and DMA
mappings, one of which could not be pinned, causing a failure adding
the new group to the existing container and setting the conditions
for a subsequent attempt to explode.
To resolve this, we can first check if the domain_list is empty so
that we can reject replay of a bogus domain, should we ever encounter
this inconsistent state again in the future. The real fix though is
to add the necessary unwind support, which means cleaning up the
current pinning if an IOMMU mapping fails, then walking back through
the r-b tree of DMA entries, reading from the IOMMU which ranges are
mapped, and unmapping and unpinning those ranges. To be able to do
this, we also defer marking the DMA entry as IOMMU mapped until all
entries are processed, in order to allow the unwind to know the
disposition of each entry.
Fixes: a54eb55045 ("vfio iommu type1: Add support for mediated devices")
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Tested-by: Zhiyi Guo <zhguo@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A down_read on memory_lock is held when performing read/write accesses
to MMIO BAR space, including across the copy_to/from_user() callouts
which may fault. If the user buffer for these copies resides in an
mmap of device MMIO space, the mmap fault handler will acquire a
recursive read-lock on memory_lock. Avoid this by reducing the lock
granularity. Sequential accesses requiring multiple ioread/iowrite
cycles are expected to be rare, therefore typical accesses should not
see additional overhead.
VGA MMIO accesses are expected to be non-fatal regardless of the PCI
memory enable bit to allow legacy probing, this behavior remains with
a comment added. ioeventfds are now included in memory access testing,
with writes dropped while memory space is disabled.
Fixes: abafbc551f ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Tested-by: Zhiyi Guo <zhguo@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch refactors the vfio_iommu_type1_ioctl() to use switch instead of
if-else, and each command got a helper function.
Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The current generation of Intel® QuickAssist Technology devices
are not designed to run in an untrusted environment because of the
following issues reported in the document "Intel® QuickAssist Technology
(Intel® QAT) Software for Linux" (document number 336211-014):
QATE-39220 - GEN - Intel® QAT API submissions with bad addresses that
trigger DMA to invalid or unmapped addresses can cause a
platform hang
QATE-7495 - GEN - An incorrectly formatted request to Intel® QAT can
hang the entire Intel® QAT Endpoint
The document is downloadable from https://01.org/intel-quickassist-technology
at the following link:
https://01.org/sites/default/files/downloads/336211-014-qatforlinux-releasenotes-hwv1.7_0.pdf
This patch adds the following QAT devices to the denylist: DH895XCC,
C3XXX and C62X.
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Fiona Trahe <fiona.trahe@intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add denylist of devices that by default are not probed by vfio-pci.
Devices in this list may be susceptible to untrusted application, even
if the IOMMU is enabled. To be accessed via vfio-pci, the user has to
explicitly disable the denylist.
The denylist can be disabled via the module parameter disable_denylist.
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Fiona Trahe <fiona.trahe@intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
No need to release and immediately re-acquire igate while clearing
out the eventfd ctxs.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.
There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.
[1] Documentation/core-api/pin_user_pages.rst
[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit c5e6688752 ("vfio/type1: Add conditional rescheduling")
missed a "cond_resched()" in vfio_iommu_map if iommu map failed.
This is a very tiny optimization and the case can hardly happen.
Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Intel document 333717-008, "Intel® Ethernet Controller X550
Specification Update", version 2.7, dated June 2020, includes errata
#22, added in version 2.1, May 2016, indicating X550 NICs suffer from
the same implementation deficiency as the 700-series NICs:
"The Interrupt Status bit in the Status register of the PCIe
configuration space is not implemented and is not set as described
in the PCIe specification."
Without the interrupt status bit, vfio-pci cannot determine when
these devices signal INTx. They are therefore added to the nointx
quirk.
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
SR-IOV VFs do not implement the memory enable bit of the command
register, therefore this bit is not set in config space after
pci_enable_device(). This leads to an unintended difference
between PF and VF in hand-off state to the user. We can correct
this by setting the initial value of the memory enable bit in our
virtualized config space. There's really no need however to
ever fault a user on a VF though as this would only indicate an
error in the user's management of the enable bit, versus a PF
where the same access could trigger hardware faults.
Fixes: abafbc551f ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The next use of the device will generate an underflow from the
stale reference.
Cc: Qian Cai <cai@lca.pw>
Fixes: 1518ac272e ("vfio/pci: fix memory leaks of eventfd ctx")
Reported-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).
Also give the functions a kthread_ prefix to better document the use case.
[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cover the newly merged use_mm/unuse_mm caller in vfio
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200416053158.586887-2-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the last few remaining mmap_sem rwsem calls to use the new mmap
locking API. These were missed by coccinelle for some reason (I think
coccinelle does not support some of the preprocessor constructs in these
files ?)
[akpm@linux-foundation.org: convert linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Support for userspace to send requests directly to the on-chip GZIP
accelerator on Power9.
- Rework of our lockless page table walking (__find_linux_pte()) to make it
safe against parallel page table manipulations without relying on an IPI for
serialisation.
- A series of fixes & enhancements to make our machine check handling more
robust.
- Lots of plumbing to add support for "prefixed" (64-bit) instructions on
Power10.
- Support for using huge pages for the linear mapping on 8xx (32-bit).
- Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound driver.
- Removal of some obsolete 40x platforms and associated cruft.
- Initial support for booting on Power10.
- Lots of other small features, cleanups & fixes.
Thanks to:
Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Andrey Abramov,
Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent Abali, Cédric Le
Goater, Chen Zhou, Christian Zigotzky, Christophe JAILLET, Christophe Leroy,
Dmitry Torokhov, Emmanuel Nicolet, Erhard F., Gautham R. Shenoy, Geoff Levand,
George Spelvin, Greg Kurz, Gustavo A. R. Silva, Gustavo Walbon, Haren Myneni,
Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Kees Cook, Leonardo
Bras, Madhavan Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael
Neuling, Michal Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao,
Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram
Pai, Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler, Wolfram
Sang, Xiongfeng Wang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl7aYZ8THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPiKD/9zNCuZLFMAFrIdbm0HlYA2RGYZFT75
GUHsqYyei1pxA7PgM3KwJiXELVODsBv0eQbgNh1tbecKrxPRegN/cywd1KLjPZ7I
v5/qweQP8MvR0RhzjbhvUcO0jq/f8u2LbJr5mUfVzjU6tAvrvcWo3oZqDElsekCS
kgyOH3r1vZ2PLTMiGFhb0gWi2iqc+6BHU1AFCGPCMjB1Vu5d5+54VvZ/6lllGsOF
yg9CBXmmVvQ+Bn6tH4zdEB78FYxnAIwBqlbmL79i5ca+HQJ0Sw6HuPRy9XYq35p6
2EiXS4Wrgp7i7+1TN3HO362u5Onb8TSyQU7NS6yCFPoJ6JQxcJMBIw6mHhnXOPuZ
CrjgcdwUMjx8uDoKmX1Epbfuex2w+AysW+4yBHPFiSgl3klKC3D0wi95mR485w2F
rN8uzJtrDeFKcYZJG7IoB/cgFCCPKGf9HaXr8q0S/jBKMffx91ul3cfzlfdIXOCw
FDNw/+ZX7UD6ddFEG12ZTO+vdL8yf1uCRT/DIZwUiDMIA0+M6F4nc7j3lfyZfoO1
65f9UlhoLxScq7VH2fKH4UtZatO9cPID2z1CmiY4UbUIPtFDepSuYClgLF+Duf4b
rkfxhKU0+Ja1zNH5XNc+L+Bc5/W4lFiJXz02dYIjtHoUpWkc1aToOETVwzggYFNM
G3PXIBOI0jRgRw==
=o0WU
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Support for userspace to send requests directly to the on-chip GZIP
accelerator on Power9.
- Rework of our lockless page table walking (__find_linux_pte()) to
make it safe against parallel page table manipulations without
relying on an IPI for serialisation.
- A series of fixes & enhancements to make our machine check handling
more robust.
- Lots of plumbing to add support for "prefixed" (64-bit) instructions
on Power10.
- Support for using huge pages for the linear mapping on 8xx (32-bit).
- Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
driver.
- Removal of some obsolete 40x platforms and associated cruft.
- Initial support for booting on Power10.
- Lots of other small features, cleanups & fixes.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
Wolfram Sang, Xiongfeng Wang.
* tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
powerpc/pseries: Make vio and ibmebus initcalls pseries specific
cxl: Remove dead Kconfig options
powerpc: Add POWER10 architected mode
powerpc/dt_cpu_ftrs: Add MMA feature
powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
powerpc: Add support for ISA v3.1
powerpc: Add new HWCAP bits
powerpc/64s: Don't set FSCR bits in INIT_THREAD
powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
powerpc/64s: Don't let DT CPU features set FSCR_DSCR
powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
powerpc/module_64: Consolidate ftrace code
powerpc/32: Disable KASAN with pages bigger than 16k
powerpc/uaccess: Don't set KUEP by default on book3s/32
powerpc/uaccess: Don't set KUAP by default on book3s/32
powerpc/8xx: Reduce time spent in allow_user_access() and friends
...
Fixes sparse warnings by adding '__user' in typecast for
copy_[from,to]_user()
Fixes: d6a4c18566 ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes compilation error with ARCH=i386.
Error fixed by this commit:
ld: drivers/vfio/vfio_iommu_type1.o: in function `vfio_dma_populate_bitmap':
>> vfio_iommu_type1.c:(.text+0x666): undefined reference to `__udivdi3'
Fixes: d6a4c18566 ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
kobject_init_and_add() takes reference even when it fails.
If this function returns an error, kobject_put() must be called to
properly clean up the memory associated with the object. Thus,
replace kfree() by kobject_put() to fix this issue. Previous
commit "b8eb718348b8" fixed a similar problem.
Fixes: 7b96953bc6 ("vfio: Mediated device Core driver")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added a check such that only singleton IOMMU groups can pin pages.
>From the point when vendor driver pins any pages, consider IOMMU group
dirty page scope to be limited to pinned pages.
To optimize to avoid walking list often, added flag
pinned_page_dirty_scope to indicate if all of the vfio_groups for each
vfio_domain in the domain_list dirty page scope is limited to pinned
pages. This flag is updated on first pinned pages request for that IOMMU
group and on attaching/detaching group.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added migration capability in IOMMU info chain.
User application should check IOMMU info chain for migration capability
to use dirty page tracking feature provided by kernel module.
User application must check page sizes supported and maximum dirty
bitmap size returned by this capability structure for ioctls used to get
dirty bitmap.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
DMA mapped pages, including those pinned by mdev vendor drivers, might
get unpinned and unmapped while migration is active and device is still
running. For example, in pre-copy phase while guest driver could access
those pages, host device or vendor driver can dirty these mapped pages.
Such pages should be marked dirty so as to maintain memory consistency
for a user making use of dirty page tracking.
To get bitmap during unmap, user should allocate memory for bitmap, set
it all zeros, set size of allocated memory, set page size to be
considered for bitmap and set flag VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
- Start dirty pages tracking while migration is active
- Stop dirty pages tracking.
- Get dirty pages bitmap. Its user space application's responsibility to
copy content of dirty pages from source to destination during migration.
To prevent DoS attack, memory for bitmap is allocated per vfio_dma
structure. Bitmap size is calculated considering smallest supported page
size. Bitmap is allocated for all vfio_dmas when dirty logging is enabled
Bitmap is populated for already pinned pages when bitmap is allocated for
a vfio_dma with the smallest supported page size. Update bitmap from
pinning functions when tracking is enabled. When user application queries
bitmap, check if requested page size is same as page size used to
populated bitmap. If it is equal, copy bitmap, but if not equal, return
error.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Fixed error reported by build bot by changing pgsize type from uint64_t
to size_t.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Calculate and cache pgsize_bitmap when iommu->domain_list is updated
and iommu->external_domain is set for mdev device.
Add iommu->lock protection when cached pgsize_bitmap is accessed.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The PCI Code and ID Assignment Specification changed capability ID 0
from reserved to a NULL capability in the v1.1 revision. The NULL
capability is defined to include only the 16-bit capability header,
ie. only the ID and next pointer. Unfortunately vfio-pci creates a
map of config space, where ID 0 is used to reserve the standard type
0 header. Finding an actual capability with this ID therefore results
in a bogus range marked in that map and conflicts with subsequent
capabilities. As this seems to be a dummy capability anyway and we
already support dropping capabilities, let's hide this one rather than
delving into the potentially subtle dependencies within our map.
Seen on an NVIDIA Tesla T4.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express. The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host. Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.
To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled. We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel. Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device. Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access. In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.
Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register. Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS. This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.
Fixes: CVE-2020-12888
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings. The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma. Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use follow_pfn() to get the PFN of a PFNMAP VMA instead of assuming that
vma->vm_pgoff holds the base PFN of the VMA. This fixes a bug where
attempting to do VFIO_IOMMU_MAP_DMA on an arbitrary PFNMAP'd region of
memory calculates garbage for the PFN.
Hilariously, this only got detected because the first "PFN" calculated
by vaddr_get_pfn() is PFN 0 (vma->vm_pgoff==0), and iommu_iova_to_phys()
uses PA==0 as an error, which triggers a WARN in vfio_unmap_unpin()
because the translation "failed". PFN 0 is now unconditionally reserved
on x86 in order to mitigate L1TF, which causes is_invalid_reserved_pfn()
to return true and in turns results in vaddr_get_pfn() returning success
for PFN 0. Eventually the bogus calculation runs into PFNs that aren't
reserved and leads to failure in vfio_pin_map_dma(). The subsequent
call to vfio_remove_dma() attempts to unmap PFN 0 and WARNs.
WARNING: CPU: 8 PID: 5130 at drivers/vfio/vfio_iommu_type1.c:750 vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ...
CPU: 8 PID: 5130 Comm: sgx Tainted: G W 5.6.0-rc5-705d787c7fee-vfio+ #3
Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.D00.X119.B49.1803010910 03/01/2018
RIP: 0010:vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
Code: <0f> 0b 49 81 c5 00 10 00 00 e9 c5 fe ff ff bb 00 10 00 00 e9 3d fe
RSP: 0018:ffffbeb5039ebda8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9a55cbf8d480 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a52b771c200
RBP: 0000000000000000 R08: 0000000000000040 R09: 00000000fffffff2
R10: 0000000000000001 R11: ffff9a51fa896000 R12: 0000000184010000
R13: 0000000184000000 R14: 0000000000010000 R15: ffff9a55cb66ea08
FS: 00007f15d3830b40(0000) GS:ffff9a55d5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561cf39429e0 CR3: 000000084f75f005 CR4: 00000000003626e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
vfio_remove_dma+0x17/0x70 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0x9e3/0xa7b [vfio_iommu_type1]
ksys_ioctl+0x92/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4c/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15d04c75d7
Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
Fixes: 73fa0d10d0 ("vfio: Type1 IOMMU implementation")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
add parentheses to avoid possible vaddr overflow.
Fixes: a54eb55045 ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
instead of calling __copy_to/from_user(), use copy_to_from_user() to
ensure vaddr range is a valid user address range before accessing them.
Fixes: 8d46c0cca5 ("vfio: introduce vfio_dma_rw to read/write a range of IOVAs")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reported-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When we try to get an MSI cookie for a VFIO device, that can fail if
CONFIG_IOMMU_DMA is not set. In this case iommu_get_msi_cookie() returns
-ENODEV, and that should not be fatal.
Ignore that case and proceed with the initialisation.
This fixes VFIO with a platform device on the Calxeda Midway (no MSIs).
Fixes: f6810c15cf ("iommu/arm-smmu: Clean up early-probing workarounds")
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Older versions of skiboot only provide a single value in the device
tree property "ibm,mmio-atsd", even when multiple Address Translation
Shoot Down (ATSD) registers are present. This prevents NVLink2 devices
(other than the first) from being used with vfio-pci because vfio-pci
expects to be able to assign a dedicated ATSD register to each NVLink2
device.
However, ATSD registers can be shared among devices. This change
allows vfio-pci to fall back to sharing the register at index 0 if
necessary.
Fixes: 7f92891778 ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The cleanup is getting a tad long.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It currently results in messages like:
"vfio-pci 0000:03:00.0: vfio_pci: ..."
Which is quite a bit redundant.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With the VF Token interface we can now expect that a vfio userspace
driver must be in collaboration with the PF driver, an unwitting
userspace driver will not be able to get past the GET_DEVICE_FD step
in accessing the device. We can now move on to actually allowing
SR-IOV to be enabled by vfio-pci on the PF. Support for this is not
enabled by default in this commit, but it does provide a module option
for this to be enabled (enable_sriov=1). Enabling VFs is rather
straightforward, except we don't want to risk that a VF might get
autoprobed and bound to other drivers, so a bus notifier is used to
"capture" VFs to vfio-pci using the driver_override support. We
assume any later action to bind the device to other drivers is
condoned by the system admin and allow it with a log warning.
vfio-pci will disable SR-IOV on a PF before releasing the device,
allowing a VF driver to be assured other drivers cannot take over the
PF and that any other userspace driver must know the shared VF token.
This support also does not provide a mechanism for the PF userspace
driver itself to manipulate SR-IOV through the vfio API. With this
patch SR-IOV can only be enabled via the host sysfs interface and the
PF driver user cannot create or remove VFs.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VFIO_DEVICE_FEATURE ioctl is meant to be a general purpose, device
agnostic ioctl for setting, retrieving, and probing device features.
This implementation provides a 16-bit field for specifying a feature
index, where the data porition of the ioctl is determined by the
semantics for the given feature. Additional flag bits indicate the
direction and nature of the operation; SET indicates user data is
provided into the device feature, GET indicates the device feature is
written out into user data. The PROBE flag augments determining
whether the given feature is supported, and if provided, whether the
given operation on the feature is supported.
The first user of this ioctl is for setting the vfio-pci VF token,
where the user provides a shared secret key (UUID) on a SR-IOV PF
device, which users must provide when opening associated VF devices.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If we enable SR-IOV on a vfio-pci owned PF, the resulting VFs are not
fully isolated from the PF. The PF can always cause a denial of service
to the VF, even if by simply resetting itself. The degree to which a PF
can access the data passed through a VF or interfere with its operation
is dependent on a given SR-IOV implementation. Therefore we want to
avoid a scenario where an existing vfio-pci based userspace driver might
assume the PF driver is trusted, for example assigning a PF to one VM
and VF to another with some expectation of isolation. IOMMU grouping
could be a solution to this, but imposes an unnecessarily strong
relationship between PF and VF drivers if they need to operate with the
same IOMMU context. Instead we introduce a "VF token", which is
essentially just a shared secret between PF and VF drivers, implemented
as a UUID.
The VF token can be set by a vfio-pci based PF driver and must be known
by the vfio-pci based VF driver in order to gain access to the device.
This allows the degree to which this VF token is considered secret to be
determined by the applications and environment. For example a VM might
generate a random UUID known only internally to the hypervisor while a
userspace networking appliance might use a shared, or even well know,
UUID among the application drivers.
To incorporate this VF token, the VFIO_GROUP_GET_DEVICE_FD interface is
extended to accept key=value pairs in addition to the device name. This
allows us to most easily deny user access to the device without risk
that existing userspace drivers assume region offsets, IRQs, and other
device features, leading to more elaborate error paths. The format of
these options are expected to take the form:
"$DEVICE_NAME $OPTION1=$VALUE1 $OPTION2=$VALUE2"
Where the device name is always provided first for compatibility and
additional options are specified in a space separated list. The
relation between and requirements for the additional options will be
vfio bus driver dependent, however unknown or unused option within this
schema should return error. This allow for future use of unknown
options as well as a positive indication to the user that an option is
used.
An example VF token option would take this form:
"0000:03:00.0 vf_token=2ab74924-c335-45f4-9b16-8569e5b08258"
When accessing a VF where the PF is making use of vfio-pci, the user
MUST provide the current vf_token. When accessing a PF, the user MUST
provide the current vf_token IF there are active VF users or MAY provide
a vf_token in order to set the current VF token when no VF users are
active. The former requirement assures VF users that an unassociated
driver cannot usurp the PF device. These semantics also imply that a
VF token MUST be set by a PF driver before VF drivers can access their
device, the default token is random and mechanisms to read the token are
not provided in order to protect the VF token of previous users. Use of
the vf_token option outside of these cases will return an error, as
discussed above.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This currently serves the same purpose as the default implementation
but will be expanded for additional functionality.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow bus drivers to provide their own callback to match a device to
the user provided string.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_group_pin_pages() and vfio_group_unpin_pages() are introduced to
avoid inefficient search/check/ref/deref opertions associated with VFIO
group as those in each calling into vfio_pin_pages() and
vfio_unpin_pages().
VFIO group is taken as arg directly. The callers combine
search/check/ref/deref operations associated with VFIO group by calling
vfio_group_get_external_user()/vfio_group_get_external_user_from_dev()
beforehand, and vfio_group_put_external_user() afterwards.
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_dma_rw will read/write a range of user space memory pointed to by
IOVA into/from a kernel buffer without enforcing pinning the user space
memory.
TODO: mark the IOVAs to user space memory dirty if they are written in
vfio_dma_rw().
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
external user calls vfio_group_get_external_user_from_dev() with a device
pointer to get the VFIO group associated with this device.
The VFIO group is checked to be vialbe and have IOMMU set. Then
container user counter is increased and VFIO group reference is hold
to prevent the VFIO group from disposal before external user exits.
when the external user finishes using of the VFIO group, it calls
vfio_group_put_external_user() to dereference the VFIO group and the
container user counter.
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Since commit 7723f4c5ec ("driver core: platform: Add an error
message to platform_get_irq*()"), platform_get_irq() calls dev_err()
on an error. As we enumerate all interrupts until platform_get_irq()
fails, we now systematically get a message such as:
"vfio-platform fff51000.ethernet: IRQ index 3 not found" which is
a false positive.
Let's use platform_get_irq_optional() instead.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org # v5.3+
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Tested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.
Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().
2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().
Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is probably
more accurate.
As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]
[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de
Link: http://lkml.kernel.org/r/20200107224558.2362728-20-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.
Also, remove an unnessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().
Also, now that the the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site. So
remove it.
Link: http://lkml.kernel.org/r/20200107224558.2362728-8-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The create attribute is not exported, so make it
static to avoid the following sparse warning:
drivers/vfio/mdev/mdev_sysfs.c:77:1: warning: symbol 'mdev_type_attr_create' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Mmgrab was introduced in commit f1f1007644 ("mm: add new mmgrab()
helper") and most of the kernel was updated to use it. Update a
remaining file.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
<smpl>
@@ expression e; @@
- atomic_inc(&e->mm_count);
+ mmgrab(e);
</smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Mmgrab was introduced in commit f1f1007644 ("mm: add new mmgrab()
helper") and most of the kernel was updated to use it. Update a
remaining file.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
<smpl>
@@ expression e; @@
- atomic_inc(&e->mm_count);
+ mmgrab(e);
</smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The nvlink2 subdriver for IBM Witherspoon machines preregisters
GPU memory in the IOMMI API so KVM TCE code can map this memory
for DMA as well. This is done by mm_iommu_newdev() called from
vfio_pci_nvgpu_regops::mmap.
In an unlikely event of failure the data->mem remains NULL and
since mm_iommu_put() (which unregisters the region and unpins memory
if that was regular memory) does not expect mem=NULL, it should not be
called.
This adds a check to only call mm_iommu_put() for a valid data->mem.
Fixes: 7f92891778 ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
ioremap has provided non-cached semantics by default since the Linux 2.6
days, so remove the additional ioremap_nocache interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl3leXUUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vyY3g/9FAVVdPEaadNtAhQ/zIxcjozDovKq
0q7yOA3aTBTUoNEinm88an6p0dcC4gNKtGukXmzVH2Hhxm9kLRdtpZGYY00tpLUB
9rI7XsgwwHa+hLwsHbIs507sKGFGy5FLr0ChTTGLDEMppnEvjA2hZooYmcB/OgrC
LlFcwbNKGOk/Si9u2bF2nLO0JDoVHnwzpF99saew/nqc7Lfj9e9IPZFom+VjPBUh
AOvRp2H7uBN+WQlpLeFeMDDoeXh34lX0kYqIV/cVkXVnknDGYKV2CBTg2aeX7jd0
QiPHZh6zlW8zNQgaCZRiBAbatVEOnRMRJ++yiqB8hBYp1LMXm6kJ01YSQpXkugoY
Vp9dtzzTARWV/XkKwD4brw9ZEmIDnO+Ed2x2VbUkPJVcXAvzSQWAx82IU0Iuqmcb
9qr6U2Zf/Xk5aFlGPYVH8QOG+QqzIbZNRQ7NlhDlITyW4P6QPu0mw374yYP2wDGL
sP5YSS3YGa0sQcEgDtVnd4z+WTZI4AwXLPaeaLkDhdfHp2FsERUY4TrPs33J99xw
og4EyokVFzjYzlnBPU6WWn7LL+jj5ccXkL3MA4DR4FJOnNGHh7NXfQUH56rrgsq7
F9/8shL5DuTbQkde1uSyUG9Iq/RigVLlV5DQavFm3dSXvZi0E16t5alC5URNTzk7
at8Bogn53QhlmYc=
=uUXw
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Warn if a host bridge has no NUMA info (Yunsheng Lin)
- Add PCI_STD_NUM_BARS for the number of standard BARs (Denis
Efremov)
Resource management:
- Fix boot-time Embedded Controller GPE storm caused by incorrect
resource assignment after ACPI Bus Check Notification (Mika
Westerberg)
- Protect pci_reassign_bridge_resources() against concurrent
addition/removal (Benjamin Herrenschmidt)
- Fix bridge dma_ranges resource list cleanup (Rob Herring)
- Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control
the MMIO and prefetchable MMIO window sizes of hotplug bridges
independently (Nicholas Johnson)
- Fix MMIO/MMIO_PREF window assignment that assigned more space than
desired (Nicholas Johnson)
- Only enforce bus numbers from bridge EA if the bridge has EA
devices downstream (Subbaraya Sundeep)
- Consolidate DT "dma-ranges" parsing and convert all host drivers to
use shared parsing (Rob Herring)
Error reporting:
- Restore AER capability after resume (Mayurkumar Patel)
- Add PoisonTLPBlocked AER counter (Rajat Jain)
- Use for_each_set_bit() to simplify AER code (Andy Shevchenko)
- Fix AER kernel-doc (Andy Shevchenko)
- Add "pcie_ports=dpc-native" parameter to allow native use of DPC
even if platform didn't grant control over AER (Olof Johansson)
Hotplug:
- Avoid returning prematurely from sysfs requests to enable or
disable a PCIe hotplug slot (Lukas Wunner)
- Don't disable interrupts twice when suspending hotplug ports (Mika
Westerberg)
- Fix deadlocks when PCIe ports are hot-removed while suspended (Mika
Westerberg)
Power management:
- Remove unnecessary ASPM locking (Bjorn Helgaas)
- Add support for disabling L1 PM Substates (Heiner Kallweit)
- Allow re-enabling Clock PM after it has been disabled (Heiner
Kallweit)
- Add sysfs attributes for controlling ASPM link states (Heiner
Kallweit)
- Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl"
sysfs files (Heiner Kallweit)
- Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on
USB 2.0 or 1.1 connect events (Kai-Heng Feng)
- Move power state check out of pci_msi_supported() (Bjorn Helgaas)
- Fix incorrect MSI-X masking on resume and revert related nvme quirk
for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan)
- Always return devices to D0 when thawing to fix hibernation with
drivers like mlx4 that used legacy power management (previously we
only did it for drivers with new power management ops) (Dexuan Cui)
- Clear PCIe PME Status even for legacy power management (Bjorn
Helgaas)
- Fix PCI PM documentation errors (Bjorn Helgaas)
- Use dev_printk() for more power management messages (Bjorn Helgaas)
- Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas)
- Convert xen-platform from legacy to generic power management (Bjorn
Helgaas)
- Removed unused .resume_early() and .suspend_late() legacy power
management hooks (Bjorn Helgaas)
- Rearrange power management code for clarity (Rafael J. Wysocki)
- Decode power states more clearly ("4" or "D4" really refers to
"D3cold") (Bjorn Helgaas)
- Notice when reading PM Control register returns an error (~0)
instead of interpreting it as being in D3hot (Bjorn Helgaas)
- Add missing link delays required by the PCIe spec (Mika Westerberg)
Virtualization:
- Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn
Helgaas)
- Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code
previously didn't recognize that) (Kuppuswamy Sathyanarayanan)
- Allow VFs to use PASID (the PF PASID capability is shared by the
VFs, but the code previously didn't recognize that) (Kuppuswamy
Sathyanarayanan)
- Disconnect PF and VF ATS enablement, since ATS in PFs and
associated VFs can be enabled independently (Kuppuswamy
Sathyanarayanan)
- Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan)
- Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas)
- Consolidate ATS declarations in linux/pci-ats.h (Krzysztof
Wilczynski)
- Remove unused PRI and PASID stubs (Bjorn Helgaas)
- Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID
interfaces that are only used by built-in IOMMU drivers (Bjorn
Helgaas)
- Hide PRI and PASID state restoration functions used only inside the
PCI core (Bjorn Helgaas)
- Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski)
- Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut)
- Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George
Cherian)
- Fix the UPDCR register address in the Intel ACS quirk (Steffen
Liebergeld)
- Unify ACS quirk implementations (Bjorn Helgaas)
Amlogic Meson host bridge driver:
- Fix meson PERST# GPIO polarity problem (Remi Pommarel)
- Add DT bindings for Amlogic Meson G12A (Neil Armstrong)
- Fix meson clock names to match DT bindings (Neil Armstrong)
- Add meson support for Amlogic G12A SoC with separate shared PHY
(Neil Armstrong)
- Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe
combo PHY (Neil Armstrong)
- Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong)
- Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT
(Neil Armstrong)
Broadcom iProc host bridge driver:
- Invalidate iProc PAXB address mapping before programming it
(Abhishek Shah)
- Fix iproc-msi and mvebu __iomem annotations (Ben Dooks)
Cadence host bridge driver:
- Refactor Cadence PCIe host controller to use as a library for both
host and endpoint (Tom Joseph)
Freescale Layerscape host bridge driver:
- Add layerscape LS1028a support (Xiaowei Bao)
Intel VMD host bridge driver:
- Add VMD bus 224-255 restriction decode (Jon Derrick)
- Add VMD 8086:9A0B device ID (Jon Derrick)
- Remove Keith from VMD maintainer list (Keith Busch)
Marvell ARMADA 3700 / Aardvark host bridge driver:
- Use LTSSM state to build link training flag since Aardvark doesn't
implement the Link Training bit (Remi Pommarel)
- Delay before training Aardvark link in case PERST# was asserted
before the driver probe (Remi Pommarel)
- Fix Aardvark issues with Root Control reads and writes (Remi
Pommarel)
- Don't rely on jiffies in Aardvark config access path since
interrupts may be disabled (Remi Pommarel)
- Fix Aardvark big-endian support (Grzegorz Jaszczyk)
Marvell ARMADA 370 / XP host bridge driver:
- Make mvebu_pci_bridge_emul_ops static (Ben Dooks)
Microsoft Hyper-V host bridge driver:
- Add hibernation support for Hyper-V virtual PCI devices (Dexuan
Cui)
- Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan
Cui)
- Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui)
Mobiveil host bridge driver:
- Change mobiveil csr_read()/write() function names that conflict
with riscv arch functions (Kefeng Wang)
NVIDIA Tegra host bridge driver:
- Fix Tegra CLKREQ dependency programming (Vidya Sagar)
Renesas R-Car host bridge driver:
- Remove unnecessary header include from rcar (Andrew Murray)
- Tighten register index checking for rcar inbound range programming
(Marek Vasut)
- Fix rcar inbound range alignment calculation to improve packing of
multiple entries (Marek Vasut)
- Update rcar MACCTLR setting to match documentation (Yoshihiro
Shimoda)
- Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual
(Yoshihiro Shimoda)
- Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon
Horman)
Rockchip host bridge driver:
- Make rockchip 0V9 and 1V8 power regulators non-optional (Robin
Murphy)
Socionext UniPhier host bridge driver:
- Set uniphier to host (RC) mode always (Kunihiko Hayashi)
Endpoint drivers:
- Fix endpoint driver sign extension problem when shifting page
number to phys_addr_t (Alan Mikhak)
Misc:
- Add NumaChip SPDX header (Krzysztof Wilczynski)
- Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski)
- Remove unused includes (Krzysztof Wilczynski)
- Removed unused sysfs attribute groups (Ben Dooks)
- Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas)
- Add PCIe Link Control 2 register field definitions to replace magic
numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas)
- Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and
Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas)
- Use pcie_capability_read_word() instead of pci_read_config_word()
in AMDGPU and Radeon CIK/SI (Frederick Lawler)
- Remove unused pci_irq_get_node() Greg Kroah-Hartman)
- Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig
(Palmer Dabbelt, Michal Simek)
- Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe)
- Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn
Helgaas)
- Fix bridge emulation big-endian support (Grzegorz Jaszczyk)
- Fix dwc find_next_bit() usage (Niklas Cassel)
- Fix pcitest.c fd leak (Hewenliang)
- Fix typos and comments (Bjorn Helgaas)
- Fix Kconfig whitespace errors (Krzysztof Kozlowski)"
* tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits)
PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist
asm-generic: Make msi.h a mandatory include/asm header
Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
PCI/MSI: Fix incorrect MSI-X masking on resume
PCI/MSI: Move power state check out of pci_msi_supported()
PCI/MSI: Remove unused pci_irq_get_node()
PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer
PCI: hv: Change pci_protocol_version to per-hbus
PCI: hv: Add hibernation support
PCI: hv: Reorganize the code in preparation of hibernation
MAINTAINERS: Remove Keith from VMD maintainer
PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code
PCI/ASPM: Add sysfs attributes for controlling ASPM link states
PCI: Fix indentation
drm/radeon: Prefer pcie_capability_read_word()
drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
drm/radeon: Correct Transmit Margin masks
drm/amdgpu: Prefer pcie_capability_read_word()
PCI: uniphier: Set mode register to host mode
drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
...
Since irq_bypass_register_producer() is called after request_irq(), we
should do tear-down in reverse order: irq_bypass_unregister_producer()
then free_irq().
Specifically free_irq() may release resources required by the
irqbypass del_producer() callback. Notably an example provided by
Marc Zyngier on arm64 with GICv4 that he indicates has the potential
to wedge the hardware:
free_irq(irq)
__free_irq(irq)
irq_domain_deactivate_irq(irq)
its_irq_domain_deactivate()
[unmap the VLPI from the ITS]
kvm_arch_irq_bypass_del_producer(cons, prod)
kvm_vgic_v4_unset_forwarding(kvm, irq, ...)
its_unmap_vlpi(irq)
[Unmap the VLPI from the ITS (again), remap the original LPI]
Signed-off-by: Jiang Yi <giangyi@amazon.com>
Cc: stable@vger.kernel.org # v4.4+
Fixes: 6d7425f109 ("vfio: Register/unregister irq_bypass_producer")
Link: https://lore.kernel.org/kvm/20191127164910.15888-1-giangyi@amazon.com
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
[aw: commit log]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
Each of these drivers has a copy of the same trivial helper function to
convert the pointer argument and then call the native ioctl handler.
We now have a generic implementation of that, so use it.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Currently, no hugepage split code can transfer the reserved bit
from head to tail during the split, so checking the head can't make
a difference in a racing condition with hugepage spliting.
The buddy wouldn't allow a driver to allocate an hugepage if any
subpage is reserved in the e820 map at boot, if any driver sets the
reserved bit of head page before mapping the hugepage in userland,
it needs to set the reserved bit in all subpages to be safe.
Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
compiling vfio:
drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The warning is a false positive, because the call to iommu_get_msi_cookie()
only happens when vfio_iommu_has_sw_msi() returned true. And that only
happens when it also set resv_msi_base.
But initialize the variable anyway to get rid of the warning.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Code that iterates over all standard PCI BARs typically uses
PCI_STD_RESOURCE_END. However, that requires the unusual test
"i <= PCI_STD_RESOURCE_END" rather than something the typical
"i < PCI_STD_NUM_BARS".
Add a definition for PCI_STD_NUM_BARS and change loops to use the more
idiomatic C style to help avoid fencepost errors.
Link: https://lore.kernel.org/r/20190927234026.23342-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190927234308.23935-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190916204158.6889-3-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Sebastian Ott <sebott@linux.ibm.com> # arch/s390/
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> # video/fbdev/
Acked-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com> # pci/controller/dwc/
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> # scsi/pm8001/
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> # scsi/pm8001/
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # memstick/
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
vaddr_get_pfn() uses provided user pointers for vma lookups, which can
only by done with untagged pointers.
Untag user pointers in this function.
Link: http://lkml.kernel.org/r/87422b4d72116a975896f2b19b00f38acbd28f33.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace PAGE_SHIFT + compound_order(page) with the new page_shift()
function. Minor improvements in readability.
[akpm@linux-foundation.org: fix build in tce_page_is_contained()]
Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%25lkp@intel.com
Link: http://lkml.kernel.org/r/20190721104612.19120-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix spapr iommu error case case (Alexey Kardashevskiy)
- Consolidate region type definitions (Cornelia Huck)
- Restore saved original PCI state on release (hexin)
- Simplify mtty sample driver interrupt path (Parav Pandit)
- Support for reporting valid IOVA regions to user (Shameer Kolothum)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJdhS76AAoJECObm247sIsipU0QAI4eCfwy4R4nV6eeTUymniVO
jxYNwE2ZARQp3YW6DzupuesCIDDuhPL81vTzhFuIP6lHixeWM/D4eS8pKw4CH7UH
SAOhiNJ7j1NryXK6SnZMaE8zZH/Ry4j8QvfmlJpDi7EBSQI/IRykRZOA2qEULFJm
eI1ZwBR03njbqKPwKy9a6RVifjFT+yAnQrX/ADdQY1T1C9qPkqJZnYwTviwOM6G/
pXHkpLgc3eYmz6ZAk7TpE4kxx7Zji3MenKHxSE0ZD9xqixzzntZ3gG9dDXy98IRW
/oWQSM0gusF4louG/iXb22QzlzcRPZXJhQ+02fjWfuR+Ak3W5amI5eIPzPyjvbZw
As54My1HedLMWnKvQKjTS5HYyHqgCACYjdAEocx79LvU40txpcEUOqoZeRhSiEyN
zcc2xaYSs/aw7AZd3MwNKrqeg25LspD9T1FBoqou/C4EzW5Q63gClVJqbfpyrmVX
SPYuGtFzkWqI4eLkgJ8ACAm20CMBtH0nVPhxpE7fOMxwzuUoWLxB+t6x5eEDKUhd
kHhtdUJrfYSsmjgnWc0IoB9OVLaj/EsfQrv08V+tNUjt8JoGb9A6A5jWebf506hI
JhlVS0aNN80GATDgu3nPiBMaS/3ErTDrTNh/zmYQnSDFOKTDw1nj+w3rczkwVY7C
1VFtP4vrLSjwJV49zAIR
=siL2
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.4-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Fix spapr iommu error case case (Alexey Kardashevskiy)
- Consolidate region type definitions (Cornelia Huck)
- Restore saved original PCI state on release (hexin)
- Simplify mtty sample driver interrupt path (Parav Pandit)
- Support for reporting valid IOVA regions to user (Shameer Kolothum)
* tag 'vfio-v5.4-rc1' of git://github.com/awilliam/linux-vfio:
vfio_pci: Restore original state on release
vfio/type1: remove duplicate retrieval of reserved regions
vfio/type1: Add IOVA range capability support
vfio/type1: check dma map request is within a valid iova range
vfio/spapr_tce: Fix incorrect tce_iommu_group memory free
vfio-mdev/mtty: Simplify interrupt generation
vfio: re-arrange vfio region definitions
vfio/type1: Update iova list on detach
vfio/type1: Check reserved region conflict and update iova list
vfio/type1: Introduce iova list and add iommu aperture validity check
- Initial support for running on a system with an Ultravisor, which is software
that runs below the hypervisor and protects guests against some attacks by
the hypervisor.
- Support for building the kernel to run as a "Secure Virtual Machine", ie. as
a guest capable of running on a system with an Ultravisor.
- Some changes to our DMA code on bare metal, to allow devices with medium
sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
- Support for firmware assisted crash dumps on bare metal (powernv).
- Two series fixing bugs in and refactoring our PCI EEH code.
- A large series refactoring our exception entry code to use gas macros, both
to make it more readable and also enable some future optimisations.
As well as many cleanups and other minor features & fixups.
Thanks to:
Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2EtEcTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPfsD/9uXyBXn3anI/H08+mk74k5gCsmMQpn
D442CD/ByogZcccp23yBTlhawtCE03hcHnCLygn0Xgd8a4YvHts/RGHUe3fPHqlG
bEyZ7jsLVz5ebNZQP7r4eGs2pSzCajwJy2N9HJ/C1ojf15rrfRxoVJtnyhE2wXpm
DL+6o2K+nUCB3gTQ1Inr3DnWzoGOOUfNTOea2u+J+yfHwGRqOBYpevwqiwy5eelK
aRjUJCqMTvrzra49MeFwjo0Nt3/Y8UNcwA+JlGdeR8bRuWhFrYmyBRiZEKPaujNO
5EAfghBBlB0KQCqvF/tRM/c0OftHqK59AMobP9T7u9oOaBXeF/FpZX/iXjzNDPsN
j9Oo2tKLTu/YVEXqBFuREGP+znANr1Wo4CFyOG8SbvYz0HFjR6XbtRJsS+0e8GWl
kqX5/ZhYz3lBnKSNe9jgWOrh/J0KCSFigBTEWJT3xsn4YE8x8kK2l9KPqAIldWEP
sKb2UjGS7v0NKq+NvShH88Q9AeQUEIjTcg/9aDDQDe6FaRQ7KiF8bUxSdwSPi+Fn
j0lnF6i+1ATWZKuCr85veVi7C5qoe/+MqalnmP7MxULyzgXLLxUgN0SzEYO6QofK
LQK/VaH2XVr5+M5YAb7K4/NX5gbM3s1bKrCiUy4EyHNvgG7gricYdbz6HgAjKpR7
oP0rHfgmVYvF1g==
=WlW+
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"This is a bit late, partly due to me travelling, and partly due to a
power outage knocking out some of my test systems *while* I was
travelling.
- Initial support for running on a system with an Ultravisor, which
is software that runs below the hypervisor and protects guests
against some attacks by the hypervisor.
- Support for building the kernel to run as a "Secure Virtual
Machine", ie. as a guest capable of running on a system with an
Ultravisor.
- Some changes to our DMA code on bare metal, to allow devices with
medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
DMA space.
- Support for firmware assisted crash dumps on bare metal (powernv).
- Two series fixing bugs in and refactoring our PCI EEH code.
- A large series refactoring our exception entry code to use gas
macros, both to make it more readable and also enable some future
optimisations.
As well as many cleanups and other minor features & fixups.
Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
Lendacky, Vasant Hegde"
* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
powerpc/mm/mce: Keep irqs disabled during lockless page table walk
powerpc: Use ftrace_graph_ret_addr() when unwinding
powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
ftrace: Look up the address of return_to_handler() using helpers
powerpc: dump kernel log before carrying out fadump or kdump
docs: powerpc: Add missing documentation reference
powerpc/xmon: Fix output of XIVE IPI
powerpc/xmon: Improve output of XIVE interrupts
powerpc/mm/radix: remove useless kernel messages
powerpc/fadump: support holes in kernel boot memory area
powerpc/fadump: remove RMA_START and RMA_END macros
powerpc/fadump: update documentation about option to release opalcore
powerpc/fadump: consider f/w load area
powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
powerpc/fadump: improve how crashed kernel's memory is reserved
powerpc/fadump: consider reserved ranges while releasing memory
powerpc/fadump: make crash memory ranges array allocation generic
...
Invalidating a TCE cache entry for each updated TCE is quite expensive.
This makes use of the new iommu_table_ops::xchg_no_kill()/tce_kill()
callbacks to bring down the time spent in mapping a huge guest DMA window.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-4-aik@ozlabs.ru
vfio_pci_enable() saves the device's initial configuration information
with the intent that it is restored in vfio_pci_disable(). However,
the commit referenced in Fixes: below replaced the call to
__pci_reset_function_locked(), which is not wrapped in a state save
and restore, with pci_try_reset_function(), which overwrites the
restored device state with the current state before applying it to the
device. Reinstate use of __pci_reset_function_locked() to return to
the desired behavior.
Fixes: 890ed578df ("vfio-pci: Use pci "try" reset interface")
Signed-off-by: hexin <hexin15@baidu.com>
Signed-off-by: Liu Qi <liuqi16@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As we now already have the reserved regions list, just pass that into
vfio_iommu_has_sw_msi() fn.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This allows the user-space to retrieve the supported IOVA
range(s), excluding any non-relaxable reserved regions. The
implementation is based on capability chains, added to
VFIO_IOMMU_GET_INFO ioctl.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This checks and rejects any dma map request outside valid iova
range.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The @tcegrp variable is used in 1) a loop over attached groups
2) it stores a pointer to a newly allocated tce_iommu_group if 1) found
nothing. However the error handler does not distinguish how we got there
and incorrectly releases memory for a found+incompatible group.
This fixes it by adding another error handling case.
Fixes: 0bd971676e ("powerpc/powernv/npu: Add compound IOMMU groups")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Get a copy of iova list on _group_detach and try to update the list.
On success replace the current one with the copy. Leave the list as
it is if update fails.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This retrieves the reserved regions associated with dev group and
checks for conflicts with any existing dma mappings. Also update
the iova list excluding the reserved regions.
Reserved regions with type IOMMU_RESV_DIRECT_RELAXABLE are
excluded from above checks as they are considered as directly
mapped regions which are known to be relaxable.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This introduces an iova list that is valid for dma mappings. Make
sure the new iommu aperture window doesn't conflict with the current
one or with any existing dma mappings during attach.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To permit batching of TLB flushes across multiple calls to the IOMMU
driver's ->unmap() implementation, introduce a new structure for
tracking the address range to be flushed and the granularity at which
the flushing is required.
This is hooked into the IOMMU API and its caller are updated to make use
of the new structure. Subsequent patches will plumb this into the IOMMU
drivers as well, but for now the gathering information is ignored.
Signed-off-by: Will Deacon <will@kernel.org>
Commit add02cfdc9 ("iommu: Introduce Interface for IOMMU TLB Flushing")
added three new TLB flushing operations to the IOMMU API so that the
underlying driver operations can be batched when unmapping large regions
of IO virtual address space.
However, the ->iotlb_range_add() callback has not been implemented by
any IOMMU drivers (amd_iommu.c implements it as an empty function, which
incurs the overhead of an indirect branch). Instead, drivers either flush
the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
invalidation during ->unmap().
Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
two major issues:
1. The page size used to map the region in the page-table is not known,
and so it is not generally possible to issue TLB flushes in the most
efficient manner.
2. The only mutable state passed to the callback is a pointer to the
iommu_domain, which can be accessed concurrently and therefore
requires expensive synchronisation to keep track of the outstanding
flushes.
Remove the callback entirely in preparation for extending ->unmap() and
->iotlb_sync() to update a token on the caller's stack.
Signed-off-by: Will Deacon <will@kernel.org>
- Static symbol cleanup in mdev samples (Kefeng Wang)
- Use vma help in nvlink code (Peng Hao)
- Remove unused code in mbochs sample (YueHaibing)
- Send uevents around mdev registration (Alex Williamson)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJdLNocAAoJECObm247sIsiB6cQAK6DjFawXoXNq3RGZ/mivikf
tQO4iHVrSAsChv2egPEBAimeVdIzdRx/iueBt3bZPmIdMI19efUUmCNL0S7OZnwb
/6b2I1bDgLRoqjxvhGe+QyCRb4Rdrx3avYOe9kasGK5/O/uzBUu+Eu1Lm5xd0KL3
DSiI7tTZXAoYLlB1AGoJLQVc7f4lsWMwx4xDss8tsD0JOjGqDPvgEmTrJjRxzVI9
RCi+sw1zK5gtfiEHcKk1MxTZ30BopGTOE/zZ3D+h+WwVHsBimiUL/E3s7jHn0rsi
j+LE/zgk2isfIs0EyttlqJIYqTtoZYbMz+pkz8SCsgPJbxXI/BgrLjknYdoIbn2E
/Zn1+cXuRGegryM6KvhMecUZyyrv7KdxynJ5c2/DFSewDp+FQvyeArcV39qfw8DU
KurYjIRkkwxrkIevYwV9y8rb57AY36d0aTR589i/ivq146DC4pmQc90DSeH2rDdn
GRt+M/ZVhcfrYXIzUVwiwOJ25lk330jmf/s294dOSctdw1aerlnUSF+uBKk8BJWn
xzkhtm3fKaGYepwYekg3dCT7LbDKqHBDkcnSwT33aM6nMd970hnGzYFYuz8DGg/Q
yjuqF7T8UwiP7UqmRLyY+YmMKvoJhXd/tW8CJzwLFs4tomRSMNFWTwlYQUBDXMLT
Gvwcus6b0p/WS0j5Btpo
=2tVO
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.3-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Static symbol cleanup in mdev samples (Kefeng Wang)
- Use vma help in nvlink code (Peng Hao)
- Remove unused code in mbochs sample (YueHaibing)
- Send uevents around mdev registration (Alex Williamson)
* tag 'vfio-v5.3-rc1' of git://github.com/awilliam/linux-vfio:
mdev: Send uevents around parent device registration
sample/mdev/mbochs: remove set but not used variable 'mdev_state'
vfio: vfio_pci_nvlink2: use a vma helper function
vfio-mdev/samples: make some symbols static
Merge more updates from Andrew Morton:
"VM:
- z3fold fixes and enhancements by Henry Burns and Vitaly Wool
- more accurate reclaimed slab caches calculations by Yafang Shao
- fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
Christoph Hellwig
- !CONFIG_MMU fixes by Christoph Hellwig
- new novmcoredd parameter to omit device dumps from vmcore, by
Kairui Song
- new test_meminit module for testing heap and pagealloc
initialization, by Alexander Potapenko
- ioremap improvements for huge mappings, by Anshuman Khandual
- generalize kprobe page fault handling, by Anshuman Khandual
- device-dax hotplug fixes and improvements, by Pavel Tatashin
- enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V
- add pte_devmap() support for arm64, by Robin Murphy
- unify locked_vm accounting with a helper, by Daniel Jordan
- several misc fixes
core/lib:
- new typeof_member() macro including some users, by Alexey Dobriyan
- make BIT() and GENMASK() available in asm, by Masahiro Yamada
- changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
code generation, by Alexey Dobriyan
- rbtree code size optimizations, by Michel Lespinasse
- convert struct pid count to refcount_t, by Joel Fernandes
get_maintainer.pl:
- add --no-moderated switch to skip moderated ML's, by Joe Perches
misc:
- ptrace PTRACE_GET_SYSCALL_INFO interface
- coda updates
- gdb scripts, various"
[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
fs/select.c: use struct_size() in kmalloc()
mm: add account_locked_vm utility function
arm64: mm: implement pte_devmap support
mm: introduce ARCH_HAS_PTE_DEVMAP
mm: clean up is_device_*_page() definitions
mm/mmap: move common defines to mman-common.h
mm: move MAP_SYNC to asm-generic/mman-common.h
device-dax: "Hotremove" persistent memory that is used like normal RAM
mm/hotplug: make remove_memory() interface usable
device-dax: fix memory and resource leak if hotplug fails
include/linux/lz4.h: fix spelling and copy-paste errors in documentation
ipc/mqueue.c: only perform resource calculation if user valid
include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
scripts/gdb: add helpers to find and list devices
scripts/gdb: add lx-genpd-summary command
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
kernel/pid.c: convert struct pid count to refcount_t
drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
...
locked_vm accounting is done roughly the same way in five places, so
unify them in a helper.
Include the helper's caller in the debug print to distinguish between
callsites.
Error codes stay the same, so user-visible behavior does too. The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.
[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
[sfr@canb.auug.org.au: fix mm/util.c]
Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Alan Tull <atull@kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Moritz Fischer <mdf@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Wu Hao <hao.wu@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are lots of documents under Documentation/*.txt and a few other
orphan documents elsehwere that belong to the driver-API book.
Move them to their right place.
Reviewed-by: Cornelia Huck <cohuck@redhat.com> # vfio-related parts
Acked-by: Logan Gunthorpe <logang@deltatee.com> # switchtec
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>