Signed-off-by: Han Han <hhan@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200807101736.3544506-1-hhan@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>

Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, which allows memory
over-commit. Migration is not implemented yet, but it will be possible with
some HW assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but is not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links but also with InfiniBand links.
All that is needed is an ibdevice with an active port. For Mellanox cards
this will be something like mlx5_6, which can be used as the backend.


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag after installing
the required RDMA libraries.



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


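As a rough illustration of the memory settings in section 3.1, the Python
sketch below assembles the three options for a given size. The helper name
pvrdma_memory_args and its defaults are hypothetical, and keeping -m equal to
the backend size is an assumption drawn from the example, where both are 1G.

```python
def pvrdma_memory_args(size="1G", backend_id="mb1"):
    """Return QEMU arguments for shared, memory-backed guest RAM.

    Assumption: the -m size and the memory backend size are kept equal,
    as in the example in section 3.1.
    """
    return [
        "-m", size,
        "-object", f"memory-backend-ram,id={backend_id},size={size},share",
        "-numa", f"node,memdev={backend_id}",
    ]


# Print the options as they would appear on a QEMU command line.
print(" ".join(pvrdma_memory_args("2G")))
```

Generating the options from one size value avoids the two sizes drifting
apart when the guest memory is changed.
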
3.2 MAD Multiplexer
===================
MAD Multiplexer is a service that exposes a MAD-like interface to VMs in
order to overcome the limitation that only a single entity can register with
the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux, run:
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded; otherwise the rdmacm-mux service will fail to start.

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of RDMA device to register with
-s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so
for example for device mlx5_0 on port 2 the socket /var/run/rdmacm-mux-mlx5_0-2
will be created.

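The socket naming rule for rdmacm-mux can be sketched in Python as follows.
The function name mux_socket_path is hypothetical; the joining rule and the
defaults are taken from the option list in section 3.2.

```python
def mux_socket_path(device, port=1, base="/var/run/rdmacm-mux"):
    """Compose the UNIX socket name from the -s, -d and -p values."""
    return f"{base}-{device}-{port}"


# Device mlx5_0 on port 2, as in the example given in section 3.2.
print(mux_socket_path("mlx5_0", 2))
```

The same rule yields /var/run/rdmacm-mux-rxe0-1 for device rxe0 on the
default port, which is the path used by the chardev in section 3.6.
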
pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
Control over the RDMA device's GID table is exercised by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also holds, i.e.
whenever an address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of the
backend Ethernet device.

pvrdma requires the libvirt service to be up.


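To make the relation between addresses and GID entries more concrete, the
Python sketch below derives GID values from a MAC address and from an IPv4
address. The mapping shown (link-local prefix plus modified EUI-64 for the
MAC, IPv4-mapped IPv6 for the IPv4 address) is the convention commonly used
for RoCE on Linux and is an assumption here; section 3.3 does not spell it
out. The function names are hypothetical.

```python
import ipaddress


def gid_from_mac(mac):
    """Assumed MAC-based GID: fe80::/64 plus the modified EUI-64 of the MAC."""
    b = bytes(int(part, 16) for part in mac.split(":"))
    # Flip the universal/local bit and insert ff:fe in the middle.
    eui64 = bytes([b[0] ^ 0x02]) + b[1:3] + b"\xff\xfe" + b[3:6]
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + eui64)


def gid_from_ipv4(addr):
    """Assumed IPv4-based GID: the IPv4-mapped IPv6 form ::ffff:a.b.c.d."""
    packed = ipaddress.IPv4Address(addr).packed
    return ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + packed)


# MAC address taken from the libvirt example in section 3.6.
print(gid_from_mac("56:b4:44:e9:62:dc"))
```

An IPv6 address needs no conversion, since a GID has the same 128-bit
layout as an IPv6 address.
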
3.4 PCI devices settings
========================
A RoCE device exposes two functions: an Ethernet function and an RDMA function.
To support this, the pvrdma device is composed of two PCI functions in the
same slot: an Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA
device on PCI function 1. The Ethernet function can be used for other
Ethernet purposes such as IP.


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the Ethernet
  device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp:      Maximum number of QPs.
- dev-caps-max-cq:      Maximum number of CQs.
- dev-caps-max-mr:      Maximum number of MRs.
- dev-caps-max-pd:      Maximum number of PDs.
- dev-caps-max-ah:      Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have
  defaults.
- The parameters prefixed by dev-caps define the upper limits, but the
  final values are adjusted by the backend device's limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)


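The sysfs lookup mentioned in the notes of section 3.5 can be sketched in
Python as follows. The function name netdevs_for_ibdev and the sysfs_root
parameter are hypothetical; the directory layout is the one given in the
notes.

```python
import os


def netdevs_for_ibdev(ibdev, sysfs_root="/sys/class/infiniband"):
    """List the Ethernet interface names found under <ibdev>/device/net/."""
    path = os.path.join(sysfs_root, ibdev, "device", "net")
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        # The ibdev does not exist or exposes no net directory.
        return []


# On a host with an mlx5_0 device this prints its Ethernet interface(s).
print(netdevs_for_ibdev("mlx5_0"))
```

An empty result means the netdev has to be determined another way, for
example from the interface that was passed to rxe_cfg.
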
3.6 Example
===========
Define a bridge device with a vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define the pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>



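The value of the -device argument in the example of section 3.6 can be
composed from the parameters of section 3.5. The Python sketch below is one
possible way to do it; the function name and its keyword handling are
hypothetical, and only the parameter names come from this document.

```python
def pvrdma_device_arg(netdev, ibdev, mad_chardev, addr=None, ibport=None,
                      **dev_caps):
    """Build the value passed to -device for a pvrdma device.

    dev_caps keys use underscores, e.g. max_qp=64 becomes dev-caps-max-qp=64.
    """
    mandatory = (("netdev", netdev), ("ibdev", ibdev),
                 ("mad-chardev", mad_chardev))
    for name, value in mandatory:
        if not value:
            raise ValueError(f"{name} is mandatory")
    parts = ["pvrdma"]
    if addr is not None:
        parts.append(f"addr={addr}")
    parts += [f"ibdev={ibdev}", f"netdev={netdev}",
              f"mad-chardev={mad_chardev}"]
    if ibport is not None:
        parts.append(f"ibport={ibport}")
    for key, value in dev_caps.items():
        parts.append(f"dev-caps-{key.replace('_', '-')}={value}")
    return ",".join(parts)


# Reproduces the -device value used in section 3.6.
print(pvrdma_device_arg("bridge0", "rxe0", "mads", addr="10.1"))
```

Rejecting empty mandatory values early gives a clearer error than a QEMU
startup failure.
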
4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
   request a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and host.
On data path:
 - Every post_send/receive received from the guest will be converted into
   a post_send/receive for the backend. The buffers' data will not be touched
   or copied, resulting in near bare-metal performance for large enough buffers.
 - Completions from the backend interface will result in completions for
   the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
        BAR 0 - MSI-X
                MSI-X vectors:
                (0) Command - used when execution of a command is completed.
                (1) Async - not in use.
                (2) Completion - used when a completion event is placed in
                    the device's CQ ring.
        BAR 1 - Registers
                --------------------------------------------------------
                | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
                --------------------------------------------------------
                DSR - Address of driver/device shared memory used
                      for the command channel, used for passing:
                            - General info such as driver version
                            - Address of 'command' and 'response'
                            - Address of async ring
                            - Address of device's CQ ring
                            - Device capabilities
                CTL - Device control operations (activate, reset etc)
                IMR - Set interrupt mask
                REQ - Command execution register
                ERR - Operation status

        BAR 2 - UAR
                ---------------------------------------------------------
                | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
                ---------------------------------------------------------
                - Offset 0 used for QP operations (send and recv)
                - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
    - Guest driver
        - Allocates pages for CQ ring
        - Creates page directory (pdir) to hold CQ ring's pages
        - Initializes CQ ring
        - Initializes 'Create CQ' command object (cqe, pdir etc)
        - Copies the command to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from the 'command' address
        - Allocates CQ object and initializes CQ ring based on pdir
        - Creates the backend CQ
        - Writes operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register

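The Create CQ handshake of section 4.3.1 can be pictured with a toy model in
Python. This is a sketch for illustration only and not the QEMU
implementation: the class, the dictionary-based command object and the error
codes (0 for success, 1 for failure) are all assumptions.

```python
class ToyDevice:
    """Toy model of the command channel between guest driver and device."""

    def __init__(self):
        self.command = None    # stands in for the 'command' address
        self.err = None        # stands in for the ERR register
        self.interrupts = []   # MSI-X vectors raised towards the guest
        self.cqs = {}          # CQ handle -> ring description

    def write_req(self, value):
        """A write to REQ makes the device execute the pending command.

        The written value is ignored in this toy model.
        """
        cmd = self.command
        if cmd["op"] == "create_cq" and cmd["cqe"] > 0:
            handle = len(self.cqs)
            self.cqs[handle] = {"cqe": cmd["cqe"], "pdir": cmd["pdir"]}
            self.err = 0       # assumed success code
        else:
            self.err = 1       # assumed failure code
        self.interrupts.append(0)  # vector 0: Command


def guest_create_cq(dev, cqe, pdir):
    """Guest side: place the command, write 0 to REQ, read ERR."""
    dev.command = {"op": "create_cq", "cqe": cqe, "pdir": pdir}
    dev.write_req(0)
    return dev.err


dev = ToyDevice()
print(guest_create_cq(dev, cqe=64, pdir=0x1000))
```

The real device also creates the backend CQ before reporting the status;
that step is left out here because it needs an actual ibdevice.
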
4.3.2 Create QP
===============
    - Guest driver
        - Allocates pages for send and receive rings
        - Creates page directory (pdir) to hold the ring's pages
        - Initializes 'Create QP' command object (max_send_wr,
          send_cq_handle, recv_cq_handle, pdir etc)
        - Copies the object to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from 'command' address
        - Allocates the QP object and initializes
            - Send and recv rings based on pdir
            - Send and recv ring state
        - Creates the backend QP
        - Writes the operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register

4.3.3 Post receive
==================
    - Guest driver
        - Initializes a wqe and places it on recv ring
        - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
    - Device
        - Extracts qpn from UAR
        - Walks through the ring and does the following for each wqe
            - Prepares the backend CQE context to be used when
              receiving completion from backend (wr_id, op_code, emu_cq_num)
            - For each sge prepares backend sge
            - Calls backend's post_recv

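The UAR write in the post receive flow of section 4.3.3 can be sketched in
Python as follows. The receive bit (31) and the QP and CQ offsets come from
this document; the 24-bit handle mask is an assumption that should be checked
against the driver's device API header, and the function names are
hypothetical.

```python
UAR_QP_OFFSET = 0             # BAR 2: QP operations (send and recv)
UAR_CQ_OFFSET = 4             # BAR 2: CQ operations (arm and poll)
UAR_QP_RECV = 1 << 31         # qp_recv_bit (31)
UAR_HANDLE_MASK = 0x00FFFFFF  # assumption: low 24 bits carry the number


def recv_doorbell(qpn):
    """Value the guest writes at UAR_QP_OFFSET to signal a posted receive."""
    return (qpn & UAR_HANDLE_MASK) | UAR_QP_RECV


def decode_qp_doorbell(value):
    """Device side: split a QP doorbell into (qpn, is_recv)."""
    return value & UAR_HANDLE_MASK, bool(value & UAR_QP_RECV)


print(hex(recv_doorbell(5)))
```

Packing the number and the flag into one 32-bit write lets the guest ring
the doorbell with a single MMIO access.
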
4.3.4 Process backend events
============================
    - Done by a dedicated thread used to process backend events;
      at initialization it is attached to the device and creates
      the communication channel.
    - Thread main loop:
        - Polls for completions
        - Extracts emu_cq_num, wr_id and op_code from context
        - Writes CQE to CQ ring
        - Writes CQ number to device CQ
        - Sends completion-interrupt to guest
        - Deallocates context
        - Acks the event to backend



5. Limitations
==============
- The device is limited by the subset of the VMware device API that the guest
  Linux driver implements.
- The memory registration mechanism requires an mremap call for every page in
  the buffer in order to map it to a contiguous virtual address range. Since
  this is not the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to
  0.5 seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page size;
  otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
  so it can't work with huge pages. The limitation will be addressed in the future;
  however, QEMU allocates guest RAM with MADV_HUGEPAGE, so if there are enough huge
  pages available, QEMU will use them. QEMU will fail to init if the requirements
  are not met.



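The memory registration cost in section 5 can be put into numbers with a
short calculation. The Python sketch below assumes a 4 KiB page size and
reads "1GB" as 2^30 bytes; neither is stated in the text, and the per-call
figure is only an upper bound derived from the 0.5 second figure.

```python
def mremap_calls(buffer_bytes, page_size=4096):
    """Number of per-page mremap calls needed to register a buffer."""
    return -(-buffer_bytes // page_size)  # ceiling division


GIB = 1 << 30
calls = mremap_calls(GIB)             # one call per 4 KiB page
per_call_us = 0.5 / calls * 1e6       # upper bound in microseconds

print(calls, round(per_call_us, 2))
```

With these assumptions a 1GB registration needs 262144 calls, or a little
under 2 microseconds per page.
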
6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small buffers
the performance is affected; however, for medium buffers it becomes close to
bare metal, and from 1MB buffers and up it reaches bare-metal performance.
(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)

All the above assumes no memory registration is done on the data path.