mirror of
				https://github.com/qemu/qemu.git
				synced 2025-10-20 20:54:05 +00:00 
			
		
		
		
	docs: add pvrdma device documentation.
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
This commit is contained in:
		
							parent
							
								
									06329ccecf
								
							
						
					
					
						commit
						edab56321a
					
				
							
								
								
									
										255
									
								
								docs/pvrdma.txt
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										255
									
								
								docs/pvrdma.txt
									
									
									
									
									
										Normal file
									
								
							| @ -0,0 +1,255 @@ | |||||||
|  | Paravirtualized RDMA Device (PVRDMA) | ||||||
|  | ==================================== | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 1. Description | ||||||
|  | =============== | ||||||
|  | PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device. | ||||||
|  | It works with its Linux Kernel driver AS IS, no need for any special guest | ||||||
|  | modifications. | ||||||
|  | 
 | ||||||
|  | While it complies with the VMware device, it can also communicate with bare | ||||||
|  | metal RDMA-enabled machines and does not require an RDMA HCA in the host, it | ||||||
|  | can work with Soft-RoCE (rxe). | ||||||
|  | 
 | ||||||
|  | It does not require the whole guest RAM to be pinned allowing memory | ||||||
|  | over-commit and, even if not implemented yet, migration support will be | ||||||
|  | possible with some HW assistance. | ||||||
|  | 
 | ||||||
|  | A project presentation accompany this document: | ||||||
|  | - http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 2. Setup | ||||||
|  | ======== | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 2.1 Guest setup | ||||||
|  | =============== | ||||||
|  | Fedora 27+ kernels work out of the box, older distributions | ||||||
|  | require updating the kernel to 4.14 to include the pvrdma driver. | ||||||
|  | 
 | ||||||
|  | However the libpvrdma library needed by User Level Software is still | ||||||
|  | not available as part of the distributions, so the rdma-core library | ||||||
|  | needs to be compiled and optionally installed. | ||||||
|  | 
 | ||||||
|  | Please follow the instructions at: | ||||||
|  |   https://github.com/linux-rdma/rdma-core.git | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 2.2 Host Setup | ||||||
|  | ============== | ||||||
|  | The pvrdma backend is an ibdevice interface that can be exposed | ||||||
|  | either by a Soft-RoCE(rxe) device on machines with no RDMA device, | ||||||
|  | or an HCA SRIOV function(VF/PF). | ||||||
|  | Note that ibdevice interfaces can't be shared between pvrdma devices, | ||||||
|  | each one requiring a separate instance (rxe or SRIOV VF). | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 2.2.1 Soft-RoCE backend(rxe) | ||||||
|  | =========================== | ||||||
|  | A stable version of rxe is required, Fedora 27+ or a Linux | ||||||
|  | Kernel 4.14+ is preferred. | ||||||
|  | 
 | ||||||
|  | The rdma_rxe module is part of the Linux Kernel but not loaded by default. | ||||||
|  | Install the User Level library (librxe) following the instructions from: | ||||||
|  | https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home | ||||||
|  | 
 | ||||||
|  | Associate an ETH interface with rxe by running: | ||||||
|  |    rxe_cfg add eth0 | ||||||
|  | An rxe0 ibdevice interface will be created and can be used as pvrdma backend. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 2.2.2 RDMA device Virtual Function backend | ||||||
|  | ========================================== | ||||||
|  | Nothing special is required, the pvrdma device can work not only with | ||||||
|  | Ethernet Links, but also Infinibands Links. | ||||||
|  | All is needed is an ibdevice with an active port, for Mellanox cards | ||||||
|  | will be something like mlx5_6 which can be the backend. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 2.2.3 QEMU setup | ||||||
|  | ================ | ||||||
|  | Configure QEMU with --enable-rdma flag, installing | ||||||
|  | the required RDMA libraries. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 3. Usage | ||||||
|  | ======== | ||||||
|  | Currently the device is working only with memory backed RAM | ||||||
|  | and it must be mark as "shared": | ||||||
|  |    -m 1G \ | ||||||
|  |    -object memory-backend-ram,id=mb1,size=1G,share \ | ||||||
|  |    -numa node,memdev=mb1 \ | ||||||
|  | 
 | ||||||
|  | The pvrdma device is composed of two functions: | ||||||
|  |  - Function 0 is a vmxnet Ethernet Device which is redundant in Guest | ||||||
|  |    but is required to pass the ibdevice GID using its MAC. | ||||||
|  |    Examples: | ||||||
|  |      For an rxe backend using eth0 interface it will use its mac: | ||||||
|  |        -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC> | ||||||
|  |      For an SRIOV VF, we take the Ethernet Interface exposed by it: | ||||||
|  |        -device vmxnet3,multifunction=on,mac=<RoCE eth MAC> | ||||||
|  |  - Function 1 is the actual device: | ||||||
|  |        -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port> | ||||||
|  |    where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4) | ||||||
|  |  Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC. | ||||||
|  |  The rules of conversion are part of the RoCE spec, but since manual conversion | ||||||
|  |  is not required, spotting problems is not hard: | ||||||
|  |     Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a | ||||||
|  |              MAC: 7c:fe:90:cb:74:3a | ||||||
|  |     Note the difference between the first byte of the MAC and the GID. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 4. Implementation details | ||||||
|  | ========================= | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 4.1 Overview | ||||||
|  | ============ | ||||||
|  | The device acts like a proxy between the Guest Driver and the host | ||||||
|  | ibdevice interface. | ||||||
|  | On configuration path: | ||||||
|  |  - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request | ||||||
|  |    a resource from the backend interface, maintaining a 1-1 mapping | ||||||
|  |    between the guest and host. | ||||||
|  | On data path: | ||||||
|  |  - Every post_send/receive received from the guest will be converted into | ||||||
|  |    a post_send/receive for the backend. The buffers data will not be touched | ||||||
|  |    or copied resulting in near bare-metal performance for large enough buffers. | ||||||
|  |  - Completions from the backend interface will result in completions for | ||||||
|  |    the pvrdma device. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 4.2 PCI BARs | ||||||
|  | ============ | ||||||
|  | PCI Bars: | ||||||
|  | 	BAR 0 - MSI-X | ||||||
|  |         MSI-X vectors: | ||||||
|  | 		(0) Command - used when execution of a command is completed. | ||||||
|  | 		(1) Async - not in use. | ||||||
|  | 		(2) Completion - used when a completion event is placed in | ||||||
|  | 		  device's CQ ring. | ||||||
|  | 	BAR 1 - Registers | ||||||
|  |         -------------------------------------------------------- | ||||||
|  |         | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC | | ||||||
|  |         -------------------------------------------------------- | ||||||
|  | 		DSR - Address of driver/device shared memory used | ||||||
|  |               for the command channel, used for passing: | ||||||
|  | 			    - General info such as driver version | ||||||
|  | 			    - Address of 'command' and 'response' | ||||||
|  | 			    - Address of async ring | ||||||
|  | 			    - Address of device's CQ ring | ||||||
|  | 			    - Device capabilities | ||||||
|  | 		CTL - Device control operations (activate, reset etc) | ||||||
|  | 		IMG - Set interrupt mask | ||||||
|  | 		REQ - Command execution register | ||||||
|  | 		ERR - Operation status | ||||||
|  | 
 | ||||||
|  | 	BAR 2 - UAR | ||||||
|  |         --------------------------------------------------------- | ||||||
|  |         | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag | | ||||||
|  |         --------------------------------------------------------- | ||||||
|  | 		- Offset 0 used for QP operations (send and recv) | ||||||
|  | 		- Offset 4 used for CQ operations (arm and poll) | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 4.3 Major flows | ||||||
|  | =============== | ||||||
|  | 
 | ||||||
|  | 4.3.1 Create CQ | ||||||
|  | =============== | ||||||
|  |     - Guest driver | ||||||
|  |         - Allocates pages for CQ ring | ||||||
|  |         - Creates page directory (pdir) to hold CQ ring's pages | ||||||
|  |         - Initializes CQ ring | ||||||
|  |         - Initializes 'Create CQ' command object (cqe, pdir etc) | ||||||
|  |         - Copies the command to 'command' address | ||||||
|  |         - Writes 0 into REQ register | ||||||
|  |     - Device | ||||||
|  |         - Reads the request object from the 'command' address | ||||||
|  |         - Allocates CQ object and initialize CQ ring based on pdir | ||||||
|  |         - Creates the backend CQ | ||||||
|  |         - Writes operation status to ERR register | ||||||
|  |         - Posts command-interrupt to guest | ||||||
|  |     - Guest driver | ||||||
|  |         - Reads the HW response code from ERR register | ||||||
|  | 
 | ||||||
|  | 4.3.2 Create QP | ||||||
|  | =============== | ||||||
|  |     - Guest driver | ||||||
|  |         - Allocates pages for send and receive rings | ||||||
|  |         - Creates page directory(pdir) to hold the ring's pages | ||||||
|  |         - Initializes 'Create QP' command object (max_send_wr, | ||||||
|  |           send_cq_handle, recv_cq_handle, pdir etc) | ||||||
|  |         - Copies the object to 'command' address | ||||||
|  |         - Write 0 into REQ register | ||||||
|  |     - Device | ||||||
|  |         - Reads the request object from 'command' address | ||||||
|  |         - Allocates the QP object and initialize | ||||||
|  |             - Send and recv rings based on pdir | ||||||
|  |             - Send and recv ring state | ||||||
|  |         - Creates the backend QP | ||||||
|  |         - Writes the operation status to ERR register | ||||||
|  |         - Posts command-interrupt to guest | ||||||
|  |     - Guest driver | ||||||
|  |         - Reads the HW response code from ERR register | ||||||
|  | 
 | ||||||
|  | 4.3.3 Post receive | ||||||
|  | ================== | ||||||
|  |     - Guest driver | ||||||
|  |         - Initializes a wqe and place it on recv ring | ||||||
|  |         - Write to qpn|qp_recv_bit (31) to QP offset in UAR | ||||||
|  |     - Device | ||||||
|  |         - Extracts qpn from UAR | ||||||
|  |         - Walks through the ring and does the following for each wqe | ||||||
|  |             - Prepares the backend CQE context to be used when | ||||||
|  |               receiving completion from backend (wr_id, op_code, emu_cq_num) | ||||||
|  |             - For each sge prepares backend sge | ||||||
|  |             - Calls backend's post_recv | ||||||
|  | 
 | ||||||
|  | 4.3.4 Process backend events | ||||||
|  | ============================ | ||||||
|  |     - Done by a dedicated thread used to process backend events; | ||||||
|  |       at initialization is attached to the device and creates | ||||||
|  |       the communication channel. | ||||||
|  |     - Thread main loop: | ||||||
|  |         - Polls for completions | ||||||
|  |         - Extracts QEMU _cq_num, wr_id and op_code from context | ||||||
|  |         - Writes CQE to CQ ring | ||||||
|  |         - Writes CQ number to device CQ | ||||||
|  |         - Sends completion-interrupt to guest | ||||||
|  |         - Deallocates context | ||||||
|  |         - Acks the event to backend | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 5. Limitations | ||||||
|  | ============== | ||||||
|  | - The device obviously is limited by the Guest Linux Driver features implementation | ||||||
|  |   of the VMware device API. | ||||||
|  | - Memory registration mechanism requires mremap for every page in the buffer in order | ||||||
|  |   to map it to a contiguous virtual address range. Since this is not the data path | ||||||
|  |   it should not matter much. If the default max mr size is increased, be aware that | ||||||
|  |   memory registration can take up to 0.5 seconds for 1GB of memory. | ||||||
|  | - The device requires target page size to be the same as the host page size, | ||||||
|  |   otherwise it will fail to init. | ||||||
|  | - QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached, | ||||||
|  |   so it can't work with huge pages. The limitation will be addressed in the future, | ||||||
|  |   however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough huge | ||||||
|  |   pages available, QEMU will use them. QEMU will fail to init if the requirements | ||||||
|  |   are not met. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 6. Performance | ||||||
|  | ============== | ||||||
|  | By design the pvrdma device exits on each post-send/receive, so for small buffers | ||||||
|  | the performance is affected; however for medium buffers it will became close to | ||||||
|  | bare metal and from 1MB buffers and  up it reaches bare metal performance. | ||||||
|  | (tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device) | ||||||
|  | 
 | ||||||
|  | All the above assumes no memory registration is done on data path. | ||||||
		Loading…
	
		Reference in New Issue
	
	Block a user
	 Marcel Apfelbaum
						Marcel Apfelbaum