diff --git a/docs/interop/vfio-user.rst b/docs/interop/vfio-user.rst index 0b06f026b0..303fc13d6e 100644 --- a/docs/interop/vfio-user.rst +++ b/docs/interop/vfio-user.rst @@ -76,7 +76,7 @@ structures used in the ``ioctl()`` implementation. Messages derived from the ``ioctl()`` will have a name derived from the ``ioctl()`` command name. E.g., the ``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a ``VFIO_USER_DEVICE_GET_INFO`` message. The purpose of this reuse is to share as -much code as feasible with the ``ioctl()`` implementation``. +much code as feasible with the ``ioctl()`` implementation. Connection Initiation ^^^^^^^^^^^^^^^^^^^^^ @@ -201,12 +201,32 @@ A server can serve: 1) one or more clients, and/or 2) one or more virtual devices, belonging to one or more clients. -The current protocol specification requires a dedicated socket per -client/server connection. It is a server-side implementation detail whether a -single server handles multiple virtual devices from the same or multiple -clients. The location of the socket is implementation-specific. Multiplexing -clients, devices, and servers over the same socket is not supported in this -version of the protocol. +The current protocol specification requires dedicated sockets per +client/server connection. Commands in the client-to-server direction are +handled on the main communication socket which the client connects to, and +replies to these commands are passed on the same socket. Commands sent in the +other direction from the server to the client as well as their corresponding +replies can optionally be passed across a separate socket, which is set up +during negotiation (AF_UNIX servers just pass the file descriptor). + +Using separate sockets for each command channel avoids introducing an +artificial point of synchronization between the channels. This simplifies +implementations since it obviates the need to demultiplex incoming messages +into commands and replies and interleave command handling and reply processing. +Note that it is still illegal for implementations to stall command or reply +processing indefinitely while waiting for replies on the other channel, as this +may lead to deadlocks. However, since incoming commands and requests arrive on +different sockets, it's possible to meet this requirement e.g. by running two +independent request processing threads that can internally operate +synchronously. It is expected that this is simpler to implement than fully +asynchronous message handling code. Implementations may still choose a fully +asynchronous, event-based design for other reasons, and the protocol fully +supports it. + +It is a server-side implementation detail whether a single server handles +multiple virtual devices from the same or multiple clients. The location of the +socket is implementation-specific. Multiplexing clients, devices, and servers +over the same socket is not supported in this version of the protocol. Authentication -------------- @@ -364,6 +384,9 @@ Name Command Request Direction ``VFIO_USER_DMA_WRITE`` 12 server -> client ``VFIO_USER_DEVICE_RESET`` 13 client -> server ``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server +``VFIO_USER_DEVICE_FEATURE`` 16 client -> server +``VFIO_USER_MIG_DATA_READ`` 17 client -> server +``VFIO_USER_MIG_DATA_WRITE`` 18 client -> server ====================================== ========= ================= Header @@ -486,45 +509,48 @@ format: Capabilities: -+--------------------+---------+------------------------------------------------+ -| Name | Type | Description | -+====================+=========+================================================+ -| max_msg_fds | number | Maximum number of file descriptors that can be | -| | | received by the sender in one message. | -| | | Optional. If not specified then the receiver | -| | | must assume a value of ``1``. | -+--------------------+---------+------------------------------------------------+ -| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | -| | | see `Read and Write Operations`_. Optional, | -| | | with a default value of 1048576 bytes. | -+--------------------+---------+------------------------------------------------+ -| pgsizes | number | Page sizes supported in DMA map operations | -| | | or'ed together. Optional, with a default value | -| | | of supporting only 4k pages. | -+--------------------+---------+------------------------------------------------+ -| max_dma_maps | number | Maximum number DMA map windows that can be | -| | | valid simultaneously. Optional, with a | -| | | value of 65535 (64k-1). | -+--------------------+---------+------------------------------------------------+ -| migration | object | Migration capability parameters. If missing | -| | | then migration is not supported by the sender. | -+--------------------+---------+------------------------------------------------+ -| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages | -| | | are supported if the value is ``true``. | -+--------------------+---------+------------------------------------------------+ ++--------------------+---------+-----------------------------------------------+ +| Name | Type | Description | ++====================+=========+===============================================+ +| max_msg_fds | number | Maximum number of file descriptors that can | +| | | be received by the sender in one message. | +| | | Optional. If not specified then the receiver | +| | | must assume a value of ``1``. | ++--------------------+---------+-----------------------------------------------+ +| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; | +| | | see `Read and Write Operations`_. Optional, | +| | | with a default value of 1048576 bytes. | ++--------------------+---------+-----------------------------------------------+ +| max_dma_maps | number | Maximum number DMA map windows that can be | +| | | valid simultaneously. Optional, with a | +| | | value of 65535 (64k-1). | ++--------------------+---------+-----------------------------------------------+ +| pgsizes | number | Page sizes supported in DMA map operations | +| | | or'ed together. Optional, with a default | +| | | value of supporting only 4k pages. | ++--------------------+---------+-----------------------------------------------+ +| twin_socket | object | Parameters for twin-socket mode, which | +| | | handles server-to-client commands and their | +| | | replies on a separate socket. Optional. | ++--------------------+---------+-----------------------------------------------+ +| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages | +| | | are supported if the value is ``true``. | ++--------------------+---------+-----------------------------------------------+ -The migration capability contains the following name/value pairs: +The ``twin_socket`` capability object holds these name/value pairs: -+-----------------+--------+--------------------------------------------------+ -| Name | Type | Description | -+=================+========+==================================================+ -| pgsize | number | Page size of dirty pages bitmap. The smallest | -| | | between the client and the server is used. | -+-----------------+--------+--------------------------------------------------+ -| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` | -| | | and ``VFIO_DMA_UNMAP`` messages. Optional, | -| | | with a default value of 256MB. | -+-----------------+--------+--------------------------------------------------+ ++-----------+---------+--------------------------------------------------------+ +| Name | Type | Description | ++===========+=========+========================================================+ +| supported | boolean | Indicates whether the sender supports twin-socket | +| | | mode. Optional, defaults to false. | ++-----------+---------+--------------------------------------------------------+ +| fd_index | number | Specifies an index in the file descriptor array | +| | | included with the message. The designated file | +| | | descriptor is a socket which is to be used for the | +| | | server-to-client command channel. Optional, only valid | +| | | in the reply message. | ++-----------+---------+--------------------------------------------------------+ Reply ^^^^^ @@ -532,6 +558,38 @@ Reply The same message format is used in the server's reply with the semantics described above. +If and only if the client has indicated support for twin-socket mode by setting +``twin_socket.supported`` to true in its capabilities, the server may optionally +set up a separate command channel for server-to-client commands and their +replies. The server enables twin-socket mode as follows: + +* Create a fresh socket pair. +* Keep the server end of the socket pair and pass the client end in the file + descriptor array included with the reply message. +* Set ``twin_socket.supported`` to true in the reply. +* Indicate the index in the file descriptor array by the + ``twin_socket.fd_index`` capability field in the reply, so the client can + identify the correct file descriptor to use. + +A client requesting twin-socket mode must examine the ``twin_socket`` capability +in the reply: + +* If ``twin_socket.supported`` is false, the field is missing, or the entire + ``twin_socket`` object is absent, the server does not support twin-socket mode + or decided not to enable it. The client can choose whether it wants to proceed + without twin-socket mode, or close the connection if not. +* If ``twin_socket.supported`` is true and ``twin_socket.fd_index`` is present + and refers to a valid file descriptor, twin-socket mode negotiation has + succeeded. The client monitors the provided file descriptor for commands from + the server. +* Otherwise, the reply from the server is inconsistent. The client must abort + and close the connection since it is potentially unable to receive commands + from the server. + +The twin-socket feature is optional, so some servers may not support it. +However, for server implementations that do send server-to-client commands it is +strongly recommended to implement twin-socket support. + ``VFIO_USER_DMA_MAP`` --------------------- @@ -1335,7 +1393,9 @@ Reply ----------------------- If the client has not shared mappable memory, the server can use this message to -read from guest memory. +read from guest memory. This message and its reply are passed over the separate +server-to-client socket if twin-socket mode has been negotiated during +connection setup. Request ^^^^^^^ @@ -1373,7 +1433,9 @@ Reply ----------------------- If the client has not shared mappable memory, the server can use this message to -write to guest memory. +write to guest memory. This message and its reply are passed over the separate +server-to-client socket if twin-socket mode has been negotiated during +connection setup. Request ^^^^^^^ @@ -1466,6 +1528,402 @@ Reply * *wr_cnt* is the number of device writes completed. +``VFIO_USER_DEVICE_FEATURE`` +---------------------------- + +This command is analogous to ``VFIO_DEVICE_FEATURE``. It is used to get, set, or +probe feature data of the device. + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. + ++-------+--------+--------------------------------+ +| Name | Offset | Size | ++=======+========+================================+ +| argsz | 0 | 4 | ++-------+--------+--------------------------------+ +| flags | 4 | 4 | ++-------+--------+--------------------------------+ +| | +---------+---------------------------+ | +| | | Bit | Definition | | +| | +=========+===========================+ | +| | | 0 to 15 | Feature index | | +| | +---------+---------------------------+ | +| | | 16 | VFIO_DEVICE_FEATURE_GET | | +| | +---------+---------------------------+ | +| | | 17 | VFIO_DEVICE_FEATURE_SET | | +| | +---------+---------------------------+ | +| | | 18 | VFIO_DEVICE_FEATURE_PROBE | | +| | +---------+---------------------------+ | ++-------+--------+--------------------------------+ +| data | 8 | variable | ++-------+--------+--------------------------------+ + +* *argsz* is the maximum size of the reply payload. + +* *flags* defines the action to be performed by the server and upon which + feature: + + * The feature index consists of the least significant 16 bits of the flags + field, and can be accessed using the ``VFIO_DEVICE_FEATURE_MASK`` bit mask. + + * ``VFIO_DEVICE_FEATURE_GET`` instructs the server to get the data for the + given feature. + + * ``VFIO_DEVICE_FEATURE_SET`` instructs the server to set the feature data to + that given in the ``data`` field of the payload. + + * ``VFIO_DEVICE_FEATURE_PROBE`` instructs the server to probe for feature + support. If ``VFIO_DEVICE_FEATURE_GET`` and/or ``VFIO_DEVICE_FEATURE_SET`` + are also set, the probe will only return success if all of the indicated + methods are supported. + + ``VFIO_DEVICE_FEATURE_GET`` and ``VFIO_DEVICE_FEATURE_SET`` are mutually + exclusive, except for use with ``VFIO_DEVICE_FEATURE_PROBE``. + +* *data* is specific to the particular feature. It is not used for probing. + +This part of the request is analogous to VFIO's ``struct vfio_device_feature``. + +Reply +^^^^^ + +The reply payload must be the same as the request payload for setting or +probing a feature. For getting a feature's data, the data is added in the data +section and its length is added to ``argsz``. + +Device Features +^^^^^^^^^^^^^^^ + +The only device features supported by vfio-user are those related to migration, +although this may change in the future. They are a subset of those supported in +the VFIO implementation of the Linux kernel. + ++----------------------------------------+---------------+ +| Name | Feature Index | ++========================================+===============+ +| VFIO_DEVICE_FEATURE_MIGRATION | 1 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE | 2 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_START | 6 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP | 7 | ++----------------------------------------+---------------+ +| VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT | 8 | ++----------------------------------------+---------------+ + +``VFIO_DEVICE_FEATURE_MIGRATION`` +""""""""""""""""""""""""""""""""" + +This feature indicates that the device can support the migration API through +``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``. If ``GET`` succeeds, the ``RUNNING`` +and ``ERROR`` states are always supported. Support for additional states is +indicated via the flags field; at least ``VFIO_MIGRATION_STOP_COPY`` must be +set. + +There is no data field of the request message. + +The data field of the reply message is structured as follows: + ++-------+--------+---------------------------+ +| Name | Offset | Size | ++=======+========+===========================+ +| flags | 0 | 8 | ++-------+--------+---------------------------+ +| | +-----+--------------------------+ | +| | | Bit | Definition | | +| | +=====+==========================+ | +| | | 0 | VFIO_MIGRATION_STOP_COPY | | +| | +-----+--------------------------+ | +| | | 1 | VFIO_MIGRATION_P2P | | +| | +-----+--------------------------+ | +| | | 2 | VFIO_MIGRATION_PRE_COPY | | +| | +-----+--------------------------+ | ++-------+--------+---------------------------+ + +These flags are interpreted in the same way as VFIO. + +``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE`` +"""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_SET``, execute a migration state change on the VFIO +device. The new state is supplied in ``device_state``. The state transition must +fully complete before the reply is sent. + +The data field of the reply message, as well as the ``SET`` request message, is +structured as follows: + ++--------------+--------+------+ +| Name | Offset | Size | ++==============+========+======+ +| device_state | 0 | 4 | ++--------------+--------+------+ +| data_fd | 4 | 4 | ++--------------+--------+------+ + +* *device_state* is the current state of the device (for ``GET``) or the + state to transition to (for ``SET``). It is defined by the + ``vfio_device_mig_state`` enum as detailed below. These states are the states + of the device migration Finite State Machine. + ++--------------------------------+-------+---------------------------------------------------------------------+ +| Name | State | Description | ++================================+=======+=====================================================================+ +| VFIO_DEVICE_STATE_ERROR | 0 | The device has failed and must be reset. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_STOP | 1 | The device does not change the internal or external state. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RUNNING | 2 | The device is running normally. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_STOP_COPY | 3 | The device internal state can be read out. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RESUMING | 4 | The device is stopped and is loading a new internal state. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_RUNNING_P2P | 5 | (not used in vfio-user) | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_PRE_COPY | 6 | The device is running normally but tracking internal state changes. | ++--------------------------------+-------+---------------------------------------------------------------------+ +| VFIO_DEVICE_STATE_PRE_COPY_P2P | 7 | (not used in vfio-user) | ++--------------------------------+-------+---------------------------------------------------------------------+ + +* *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and + ``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data + transport. + +Direct State Transitions +"""""""""""""""""""""""" + +The device migration FSM is a Mealy machine, so actions are taken upon the arcs +between FSM states. The following transitions need to be supported by the +server, a subset of those defined in ```` +(``enum vfio_device_mig_state``). + +* ``RUNNING -> STOP``, ``STOP_COPY -> STOP``: Stop the operation of the device. + The ``STOP_COPY`` arc terminates the data transfer session. + +* ``RESUMING -> STOP``: Terminate the data transfer session. Complete processing + of the migration data. Stop the operation of the device. If the delivered data + is found to be incomplete, inconsistent, or otherwise invalid, fail the + ``SET`` command and optionally transition to the ``ERROR`` state. + +* ``PRE_COPY -> RUNNING``: Terminate the data transfer session. The device is + now fully operational. + +* ``STOP -> RUNNING``: Start the operation of the device. + +* ``RUNNING -> PRE_COPY``, ``STOP -> STOP_COPY``: Begin the process of saving + the device state. The device operation is unchanged, but data transfer begins. + ``PRE_COPY`` and ``STOP_COPY`` are referred to as the "saving group" of + states. + +* ``PRE_COPY -> STOP_COPY``: Continue to transfer migration data, but stop + device operation. + +* ``STOP -> RESUMING``: Start the process of restoring the device state. The + internal device state may be changed to prepare the device to receive the + migration data. + +The ``STOP_COPY -> PRE_COPY`` transition is explicitly not allowed and should +return an error if requested. + +``ERROR`` cannot be specified as a device state, but any transition request can +be failed and then move the state into ``ERROR`` if the server was unable to +execute the requested arc AND was unable to restore the device into any valid +state. To recover from ``ERROR``, ``VFIO_USER_DEVICE_RESET`` must be used to +return back to ``RUNNING``. + +If ``PRE_COPY`` is not supported, arcs touching it are removed. + +Complex State Transitions +""""""""""""""""""""""""" + +The remaining possible transitions are to be implemented as combinations of the +above FSM arcs. As there are multiple paths, the path should be selected based +on the following rules: + +* Select the shortest path. + +* The path cannot have saving group states as interior arcs, only start/end + states. + +``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` / ``VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP`` +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_SET``, start/stop DMA logging. These features can +also be probed to determine whether the device supports DMA logging. + +When DMA logging is started, a range of IOVAs to monitor is provided and the +device can optimize its logging to cover only the IOVA range given. Only DMA +writes are logged. + +The data field of the ``SET`` request is structured as follows: + ++------------+--------+----------+ +| Name | Offset | Size | ++============+========+==========+ +| page_size | 0 | 8 | ++------------+--------+----------+ +| num_ranges | 8 | 4 | ++------------+--------+----------+ +| reserved | 12 | 4 | ++------------+--------+----------+ +| ranges | 16 | variable | ++------------+--------+----------+ + +* *page_size* hints what tracking granularity the device should try to achieve. + If the device cannot do the hinted page size then it's the driver's choice + which page size to pick based on its support. On output the device will return + the page size it selected. + +* *num_ranges* is the number of IOVA ranges to monitor. A value of zero + indicates that all writes should be logged. + +* *ranges* is an array of ``vfio_user_device_feature_dma_logging_range`` + entries: + ++--------+--------+------+ +| Name | Offset | Size | ++========+========+======+ +| iova | 0 | 8 | ++--------+--------+------+ +| length | 8 | 8 | ++--------+--------+------+ + + * *iova* is the base IO virtual address + * *length* is the length of the range to log + +Upon success, the response data field will be the same as the request, unless +the page size was changed, in which case this will be reflected in the response. + +``VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT`` +"""""""""""""""""""""""""""""""""""""""""" + +Upon ``VFIO_DEVICE_FEATURE_GET``, returns the dirty bitmap for a specific IOVA +range. This operation is only valid if logging of dirty pages has been +previously started by setting ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START``. + +The data field of the request is structured as follows: + ++-----------+--------+------+ +| Name | Offset | Size | ++===========+========+======+ +| iova | 0 | 8 | ++-----------+--------+------+ +| length | 8 | 8 | ++-----------+--------+------+ +| page_size | 16 | 8 | ++-----------+--------+------+ + +* *iova* is the base IO virtual address + +* *length* is the length of the range + +* *page_size* is the unit of granularity of the bitmap, and must be a power of + two. It doesn't have to match the value given to + ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` because the driver will format its + internal logging to match the reporting page size possibly by replicating bits + if the internal page size is lower than requested + +The data field of the response is identical, except with the bitmap added on +the end at offset 24. + +The bitmap is an array of u64s that holds the output bitmap, with 1 bit +reporting a *page_size* unit of IOVA. The bits outside of the requested range +must be zero. + +The mapping of IOVA to bits is given by: + +``bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))`` + +``VFIO_USER_MIG_DATA_READ`` +--------------------------- + +This command is used to read data from the source migration server while it is +in a saving group state (``PRE_COPY`` or ``STOP_COPY``). + +This command, and ``VFIO_USER_MIG_DATA_WRITE``, are used in place of the +``data_fd`` file descriptor in ```` +(``struct vfio_device_feature_mig_state``) to enable all data transport to use +the single already-established UNIX socket. Hence, the migration data is +treated like a stream, so the client must continue reading until no more +migration data remains. + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. + ++-------+--------+------+ +| Name | Offset | Size | ++=======+========+======+ +| argsz | 0 | 4 | ++-------+--------+------+ +| size | 4 | 4 | ++-------+--------+------+ + +* *argsz* is the maximum size of the reply payload. + +* *size* is the size of the migration data to read. + +Reply +^^^^^ + +The reply payload for this message is a structure of the following format. + ++-------+--------+----------+ +| Name | Offset | Size | ++=======+========+==========+ +| argsz | 0 | 4 | ++-------+--------+----------+ +| size | 4 | 4 | ++-------+--------+----------+ +| data | 8 | variable | ++-------+--------+----------+ + +* *argsz* is the size of the above structure, including the size of the data. + +* *size* indicates the size of returned migration data. If this is less than the + requested size, there is no more migration data to read. + +* *data* contains the migration data. + +``VFIO_USER_MIG_DATA_WRITE`` +---------------------------- + +This command is used to write data to the destination migration server while it +is in the ``RESUMING`` state. + +As above, this replaces the ``data_fd`` file descriptor for transport of +migration data, and as such, the migration data is treated like a stream. + +Request +^^^^^^^ + +The request payload for this message is a structure of the following format. + ++-------+--------+----------+ +| Name | Offset | Size | ++=======+========+==========+ +| argsz | 0 | 4 | ++-------+--------+----------+ +| size | 4 | 4 | ++-------+--------+----------+ +| data | 8 | variable | ++-------+--------+----------+ + +* *argsz* is the maximum size of the reply payload. + +* *size* is the size of the migration data to be written. + +* *data* contains the migration data. + +Reply +^^^^^ + +There is no reply payload for this message. Appendices ==========