| ================================== | 
 | vfio-ccw: the basic infrastructure | 
 | ================================== | 
 |  | 
 | Introduction | 
 | ------------ | 
 |  | 
 | Here we describe the vfio support for I/O subchannel devices for | 
 | Linux/s390. The motivation for vfio-ccw is to pass subchannels through | 
 | to a virtual machine, with vfio as the means. | 
 |  | 
 | Unlike other hardware architectures, s390 defines a unified I/O access | 
 | method, the so-called Channel I/O. It has its own access patterns: | 
 |  | 
 | - Channel programs run asynchronously on a separate (co)processor. | 
 | - The channel subsystem will access any memory designated by the caller | 
 |   in the channel program directly, i.e. there is no iommu involved. | 
 |  | 
 | Thus, when we introduce vfio support for these devices, we implement it | 
 | as a mediated device (mdev). The vfio mdev is added to an iommu group | 
 | so that it can be managed by the vfio framework. We also add read/write | 
 | callbacks for special vfio I/O regions to pass the channel programs from | 
 | the mdev to its parent device (the real I/O subchannel device), which | 
 | does further address translation and performs the I/O instructions. | 
 |  | 
 | This document does not intend to explain the s390 I/O architecture in | 
 | every detail. Further information can be found here: | 
 |  | 
 | - A good starting point for Channel I/O in general: | 
 |   https://en.wikipedia.org/wiki/Channel_I/O | 
 | - s390 architecture: | 
 |   s390 Principles of Operation manual (IBM Form. No. SA22-7832) | 
 | - The existing QEMU code which implements a simple emulated channel | 
 |   subsystem could also be a good reference. It makes it easier to follow | 
 |   the flow. | 
 |   qemu/hw/s390x/css.c | 
 |  | 
 | For the vfio mediated device framework: | 
 |  | 
 | - Documentation/driver-api/vfio-mediated-device.rst | 
 |  | 
 | Motivation of vfio-ccw | 
 | ---------------------- | 
 |  | 
 | Typically, a guest virtualized via QEMU/KVM on s390 only sees | 
 | paravirtualized virtio devices via the "Virtio Over Channel I/O | 
 | (virtio-ccw)" transport. This makes virtio devices discoverable via | 
 | standard operating system algorithms for handling channel devices. | 
 |  | 
 | However, this is not enough. Most devices on s390 use the standard | 
 | Channel I/O based mechanism, and we also need to be able to pass them | 
 | through to a QEMU virtual machine. This includes devices that don't | 
 | have a virtio counterpart (e.g. tape drives) or that have specific | 
 | characteristics which guests want to exploit. | 
 |  | 
 | For passing a device to a guest, we want to use the same interface as | 
 | everybody else, namely vfio. We implement this vfio support for channel | 
 | devices via the vfio mediated device framework and the subchannel device | 
 | driver "vfio_ccw". | 
 |  | 
 | Access patterns of CCW devices | 
 | ------------------------------ | 
 |  | 
 | The s390 architecture implements a so-called channel subsystem, which | 
 | provides a unified view of the devices physically attached to the | 
 | system. Although the s390 hardware platform knows about a huge variety | 
 | of peripheral attachments, such as disk devices (a.k.a. DASDs), tapes, | 
 | and communication controllers, they can all be accessed by a | 
 | well-defined access method, and they all present I/O completion in a | 
 | unified way: I/O interruptions. | 
 |  | 
 | All I/O requires the use of channel command words (CCWs). A CCW is an | 
 | instruction to a specialized I/O channel processor. A channel program is | 
 | a sequence of CCWs which are executed by the I/O channel subsystem. To | 
 | issue a channel program to the channel subsystem, the operating system | 
 | must build an operation request block (ORB), which specifies the format | 
 | of the CCWs and other control information. The operating system then | 
 | signals the I/O channel subsystem to begin executing the channel | 
 | program with a SSCH (start subchannel) instruction. The central | 
 | processor is then free to proceed with non-I/O instructions until | 
 | interrupted. The I/O completion result is received by the interrupt | 
 | handler in the form of an interruption response block (IRB). | 
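
As a rough illustration of what a channel program looks like in memory,
the following sketch models a format-1 CCW and walks a chain by following
the chaining flags. The struct and the helper ``ccw_chain_len()`` are
illustrative names for this document only, not a kernel interface. The
sketch ignores branching (TIC) and byte order (s390 is big-endian).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CCW flag bits used below. */
#define CCW_FLAG_DC 0x80 /* data chaining: next CCW continues this transfer */
#define CCW_FLAG_CC 0x40 /* command chaining: next CCW is a new command */

/* Illustrative format-1 CCW layout (8 bytes). */
struct ccw1 {
	uint8_t  cmd_code; /* device-specific command code */
	uint8_t  flags;    /* CCW_FLAG_* */
	uint16_t count;    /* byte count of the data area */
	uint32_t cda;      /* data address */
};

_Static_assert(sizeof(struct ccw1) == 8, "a CCW is one doubleword");

/*
 * Count the CCWs in a channel program by following the chaining flags.
 * Returns 0 if no final CCW is found within max entries.
 */
static size_t ccw_chain_len(const struct ccw1 *cp, size_t max)
{
	assert(cp != NULL);
	for (size_t i = 0; i < max; i++) {
		if (!(cp[i].flags & (CCW_FLAG_CC | CCW_FLAG_DC)))
			return i + 1; /* last CCW: no chaining requested */
	}
	return 0;
}
```

A three-entry program whose first two CCWs set ``CCW_FLAG_CC`` is
reported as a chain of length 3 by this helper.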
 |  | 
 | Back to vfio-ccw, in short: | 
 |  | 
 | - ORBs and channel programs are built in guest kernel (with guest | 
 |   physical addresses). | 
 | - ORBs and channel programs are passed to the host kernel. | 
 | - Host kernel translates the guest physical addresses to real addresses | 
 |   and starts the I/O by issuing a privileged Channel I/O instruction | 
 |   (e.g. SSCH). | 
 | - Channel programs run asynchronously on a separate processor. | 
 | - I/O completion is signaled to the host with I/O interruptions. The | 
 |   result is copied as an IRB to user space, which passes it back to the | 
 |   guest. | 
 |  | 
 | Physical vfio ccw device and its child mdev | 
 | ------------------------------------------- | 
 |  | 
 | As mentioned above, we realize vfio-ccw with an mdev implementation. | 
 |  | 
 | Channel I/O does not have IOMMU hardware support, so the physical | 
 | vfio-ccw device does not have an IOMMU level translation or isolation. | 
 |  | 
 | Subchannel I/O instructions are all privileged instructions. When | 
 | handling the I/O instruction interception, vfio-ccw polices and | 
 | translates the channel program in software before it is sent to the | 
 | hardware. | 
 |  | 
 | Within this implementation, we have two drivers for two types of | 
 | devices: | 
 |  | 
 | - The vfio_ccw driver for the physical subchannel device. | 
 |   This is an I/O subchannel driver for the real subchannel device. It | 
 |   implements a group of callbacks and registers with the mdev framework | 
 |   as a parent (physical) device. As a consequence, mdev provides vfio_ccw | 
 |   with a generic interface (sysfs) to create mdev devices. A vfio mdev | 
 |   can then be created by vfio_ccw and added to the mediated bus. It is | 
 |   this vfio device that is added to an IOMMU group and a vfio group. | 
 |   vfio_ccw also provides an I/O region to accept channel program | 
 |   requests from user space and to store I/O interrupt results for user | 
 |   space to retrieve. To notify user space of an I/O completion, it | 
 |   offers an interface to set up an eventfd for asynchronous signaling. | 
 |  | 
 | - The vfio_mdev driver for the mediated vfio ccw device. | 
 |   This is provided by the mdev framework. It is a vfio device driver for | 
 |   the mdev that is created by vfio_ccw. | 
 |   It implements a group of vfio device driver callbacks, adds itself to | 
 |   a vfio group, and registers itself with the mdev framework as an mdev | 
 |   driver. | 
 |   It uses a vfio iommu backend that uses the existing map and unmap | 
 |   ioctls, but rather than programming them into an IOMMU for a device, | 
 |   it simply stores the translations for use by later requests. This | 
 |   means that a device programmed in a VM with guest physical addresses | 
 |   can have the vfio kernel convert that address to process virtual | 
 |   address, pin the page and program the hardware with the host physical | 
 |   address in one step. | 
 |   For an mdev, the vfio iommu backend will not pin the pages during the | 
 |   VFIO_IOMMU_MAP_DMA ioctl. The mdev framework only maintains a database | 
 |   of the iova<->vaddr mappings in this operation. The vfio iommu backend | 
 |   exports the vfio_pin_pages and vfio_unpin_pages interfaces, which the | 
 |   physical device driver uses to pin and unpin pages on demand. | 
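
The idea of recording iova<->vaddr mappings at map time and resolving
them only on demand can be sketched in plain user-space C as follows.
This is a toy model under simplifying assumptions (a fixed-size table,
no overlap checks, no pinning); ``dma_map_db``, ``db_map()`` and
``db_translate()`` are hypothetical names and do not correspond to the
data structures of the real vfio iommu backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS 16

/* One iova<->vaddr range, recorded at map time (nothing is pinned). */
struct dma_mapping {
	uint64_t iova;
	uint64_t vaddr;
	uint64_t size;
};

struct dma_map_db {
	struct dma_mapping map[MAX_MAPPINGS];
	size_t nr;
};

/* Record a mapping; returns 0 on success, -1 if the table is full. */
static int db_map(struct dma_map_db *db, uint64_t iova, uint64_t vaddr,
		  uint64_t size)
{
	if (db->nr == MAX_MAPPINGS)
		return -1;
	db->map[db->nr++] = (struct dma_mapping){ iova, vaddr, size };
	return 0;
}

/* Resolve an iova on demand; returns 0 and sets *vaddr, or -1 if unmapped. */
static int db_translate(const struct dma_map_db *db, uint64_t iova,
			uint64_t *vaddr)
{
	for (size_t i = 0; i < db->nr; i++) {
		const struct dma_mapping *m = &db->map[i];

		if (iova >= m->iova && iova - m->iova < m->size) {
			*vaddr = m->vaddr + (iova - m->iova);
			return 0;
		}
	}
	return -1;
}
```

In the real implementation the lookup is followed by pinning the page,
which this sketch leaves out.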
 |  | 
 | Below is a high-level block diagram:: | 
 |  | 
 |  +-------------+ | 
 |  |             | | 
 |  | +---------+ | mdev_register_driver() +--------------+ | 
 |  | |  Mdev   | +<-----------------------+              | | 
 |  | |  bus    | |                        | vfio_mdev.ko | | 
 |  | | driver  | +----------------------->+              |<-> VFIO user | 
 |  | +---------+ |    probe()/remove()    +--------------+    APIs | 
 |  |             | | 
 |  |  MDEV CORE  | | 
 |  |   MODULE    | | 
 |  |   mdev.ko   | | 
 |  | +---------+ | mdev_register_device() +--------------+ | 
 |  | |Physical | +<-----------------------+              | | 
 |  | | device  | |                        |  vfio_ccw.ko |<-> subchannel | 
 |  | |interface| +----------------------->+              |     device | 
 |  | +---------+ |       callback         +--------------+ | 
 |  +-------------+ | 
 |  | 
 | The process of how these work together: | 
 |  | 
 | 1. vfio_ccw.ko drives the physical I/O subchannel, and registers the | 
 |    physical device (with callbacks) with the mdev framework. | 
 |    When vfio_ccw probes the subchannel device, it registers the device | 
 |    pointer and callbacks with the mdev framework. Mdev-related file nodes | 
 |    are then created under the device node in sysfs for the subchannel | 
 |    device, namely 'mdev_create', 'mdev_destroy' and | 
 |    'mdev_supported_types'. | 
 | 2. Create a mediated vfio ccw device. | 
 |    Using the 'mdev_create' sysfs file, we need to manually create one | 
 |    (and only one for our case) mediated device. | 
 | 3. vfio_mdev.ko drives the mediated ccw device. | 
 |    vfio_mdev is also the vfio device driver. It will probe the mdev and | 
 |    add it to an iommu_group and a vfio_group. Then we can pass the mdev | 
 |    through to a guest. | 
 |  | 
 |  | 
 | VFIO-CCW Regions | 
 | ---------------- | 
 |  | 
 | The vfio-ccw driver exposes MMIO regions to accept requests from and return | 
 | results to userspace. | 
 |  | 
 | vfio-ccw I/O region | 
 | ------------------- | 
 |  | 
 | An I/O region is used to accept channel program requests from user | 
 | space and to store I/O interrupt results for user space to retrieve. The | 
 | definition of the region is:: | 
 |  | 
 |   struct ccw_io_region { | 
 |   #define ORB_AREA_SIZE 12 | 
 | 	  __u8    orb_area[ORB_AREA_SIZE]; | 
 |   #define SCSW_AREA_SIZE 12 | 
 | 	  __u8    scsw_area[SCSW_AREA_SIZE]; | 
 |   #define IRB_AREA_SIZE 96 | 
 | 	  __u8    irb_area[IRB_AREA_SIZE]; | 
 | 	  __u32   ret_code; | 
 |   } __packed; | 
 |  | 
 | This region is always available. | 
 |  | 
 | When starting an I/O request, orb_area should be filled with the | 
 | guest ORB, and scsw_area should be filled with the SCSW of the Virtual | 
 | Subchannel. | 
 |  | 
 | irb_area stores the I/O result. | 
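
A minimal user-space sketch of staging a start request in this region
might look like the following. It mirrors the region layout with
``<stdint.h>`` types; the helper name ``io_region_prepare()`` is
hypothetical. The ORB and SCSW contents are treated as opaque bytes here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE 12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE 96

/* Same layout as the region definition, using <stdint.h> types. */
struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];
	uint8_t  scsw_area[SCSW_AREA_SIZE];
	uint8_t  irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

_Static_assert(sizeof(struct ccw_io_region) == 124, "unexpected region size");

/* Stage a start request: copy the guest ORB and the virtual subchannel SCSW. */
static void io_region_prepare(struct ccw_io_region *r,
			      const uint8_t orb[ORB_AREA_SIZE],
			      const uint8_t scsw[SCSW_AREA_SIZE])
{
	memset(r, 0, sizeof(*r)); /* clear irb_area and ret_code */
	memcpy(r->orb_area, orb, ORB_AREA_SIZE);
	memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

The caller would then write the structure to the device file descriptor
at the offset reported by VFIO_DEVICE_GET_REGION_INFO, for example with
``pwrite()``.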
 |  | 
 | ret_code stores a return code for each access of the region. The following | 
 | values may occur: | 
 |  | 
 | ``0`` | 
 |   The operation was successful. | 
 |  | 
 | ``-EOPNOTSUPP`` | 
 |   The orb specified transport mode or an unidentified IDAW format, or the | 
 |   scsw specified a function other than the start function. | 
 |  | 
 | ``-EIO`` | 
 |   A request was issued while the device was not in a state ready to accept | 
 |   requests, or an internal error occurred. | 
 |  | 
 | ``-EBUSY`` | 
 |   The subchannel was status pending or busy, or a request is already active. | 
 |  | 
 | ``-EAGAIN`` | 
 |   A request was being processed, and the caller should retry. | 
 |  | 
 | ``-EACCES`` | 
 |   The channel path(s) used for the I/O were found to be not operational. | 
 |  | 
 | ``-ENODEV`` | 
 |   The device was found to be not operational. | 
 |  | 
 | ``-EINVAL`` | 
 |   The orb specified a chain longer than 255 ccws, or an internal error | 
 |   occurred. | 
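
One possible way for a user-space program to act on these return codes
is sketched below. The three-way split (done, retry, fail) is only an
assumption about a reasonable policy; how a failure is reflected to the
guest is up to the user-space program. ``io_ret_action()`` is a
hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum io_action {
	IO_DONE,  /* request was accepted */
	IO_RETRY, /* issue the same request again */
	IO_FAIL,  /* give up and report an error */
};

/* ret_code is stored as __u32 in the region; interpret it as signed. */
static enum io_action io_ret_action(uint32_t raw_ret_code)
{
	int32_t ret = (int32_t)raw_ret_code;

	switch (ret) {
	case 0:
		return IO_DONE;
	case -EAGAIN:
		return IO_RETRY; /* a request was being processed */
	default:
		return IO_FAIL;  /* -EBUSY, -EIO, -ENODEV, ... */
	}
}
```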
 |  | 
 |  | 
 | vfio-ccw cmd region | 
 | ------------------- | 
 |  | 
 | The vfio-ccw cmd region is used to accept asynchronous instructions | 
 | from userspace:: | 
 |  | 
 |   #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0) | 
 |   #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1) | 
 |   struct ccw_cmd_region { | 
 |          __u32 command; | 
 |          __u32 ret_code; | 
 |   } __packed; | 
 |  | 
 | This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD. | 
 |  | 
 | Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region. | 
 |  | 
 | command specifies the command to be issued; ret_code stores a return code | 
 | for each access of the region. The following values may occur: | 
 |  | 
 | ``0`` | 
 |   The operation was successful. | 
 |  | 
 | ``-ENODEV`` | 
 |   The device was found to be not operational. | 
 |  | 
 | ``-EINVAL`` | 
 |   A command other than halt or clear was specified. | 
 |  | 
 | ``-EIO`` | 
 |   A request was issued while the device was not in a state ready to accept | 
 |   requests. | 
 |  | 
 | ``-EAGAIN`` | 
 |   A request was being processed, and the caller should retry. | 
 |  | 
 | ``-EBUSY`` | 
 |   The subchannel was status pending or busy while processing a halt request. | 
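
As a sketch, a user-space program could stage a halt or clear request as
shown below. The local check simply mirrors the documented ``-EINVAL``
case so that invalid commands are caught before the region is written;
``cmd_region_prepare()`` is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
#define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

/* Same layout as the region definition, using <stdint.h> types. */
struct ccw_cmd_region {
	uint32_t command;
	uint32_t ret_code;
} __attribute__((packed));

/* Stage a halt or clear request; anything else yields -EINVAL. */
static int cmd_region_prepare(struct ccw_cmd_region *r, uint32_t command)
{
	if (command != VFIO_CCW_ASYNC_CMD_HSCH &&
	    command != VFIO_CCW_ASYNC_CMD_CSCH)
		return -EINVAL;
	r->command = command;
	r->ret_code = 0;
	return 0;
}
```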
 |  | 
 | vfio-ccw schib region | 
 | --------------------- | 
 |  | 
 | The vfio-ccw schib region is used to return Subchannel-Information | 
 | Block (SCHIB) data to userspace:: | 
 |  | 
 |   struct ccw_schib_region { | 
 |   #define SCHIB_AREA_SIZE 52 | 
 |          __u8 schib_area[SCHIB_AREA_SIZE]; | 
 |   } __packed; | 
 |  | 
 | This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB. | 
 |  | 
 | Reading this region triggers a STORE SUBCHANNEL to be issued to the | 
 | associated hardware. | 
 |  | 
 | vfio-ccw crw region | 
 | ------------------- | 
 |  | 
 | The vfio-ccw crw region is used to return Channel Report Word (CRW) | 
 | data to userspace:: | 
 |  | 
 |   struct ccw_crw_region { | 
 |          __u32 crw; | 
 |          __u32 pad; | 
 |   } __packed; | 
 |  | 
 | This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW. | 
 |  | 
 | Reading this region returns a CRW if one that is relevant for this | 
 | subchannel (e.g. one reporting changes in channel path state) is | 
 | pending, or all zeroes if not. If multiple CRWs are pending (including | 
 | possibly chained CRWs), reading this region again will return the next | 
 | one, until no more CRWs are pending and zeroes are returned. This is | 
 | similar to how STORE CHANNEL REPORT WORD works. | 
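
The read-until-zero behaviour can be sketched as a small loop. To keep
the example self-contained, the region read is abstracted behind a
callback, and ``fake_crw_read()`` stands in for an actual read of the
region; all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the region definition, using <stdint.h> types. */
struct ccw_crw_region {
	uint32_t crw;
	uint32_t pad;
} __attribute__((packed));

/* Reads the region once into *r; returns 0 on success. */
typedef int (*crw_read_fn)(void *ctx, struct ccw_crw_region *r);

/* Collect CRWs until zeroes come back; returns the count, or -1 on error. */
static int crw_drain(crw_read_fn read_region, void *ctx, uint32_t *out,
		     size_t max)
{
	struct ccw_crw_region r;
	size_t n = 0;

	while (n < max) {
		if (read_region(ctx, &r) != 0)
			return -1;
		if (r.crw == 0)
			break; /* no (more) CRWs pending */
		out[n++] = r.crw;
	}
	return (int)n;
}

/* Stand-in for the real region: hands out queued CRWs, then zeroes. */
struct fake_crw_source {
	const uint32_t *crws;
	size_t nr;
	size_t pos;
};

static int fake_crw_read(void *ctx, struct ccw_crw_region *r)
{
	struct fake_crw_source *src = ctx;

	r->crw = src->pos < src->nr ? src->crws[src->pos++] : 0;
	r->pad = 0;
	return 0;
}
```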
 |  | 
 | vfio-ccw operation details | 
 | -------------------------- | 
 |  | 
 | vfio-ccw follows what vfio-pci did on the s390 platform and uses | 
 | vfio-iommu-type1 as the vfio iommu backend. | 
 |  | 
 | * CCW translation APIs | 
 |   A group of APIs (prefixed with ``cp_``) that do CCW translation. The | 
 |   CCWs passed in by a user space program are organized with their guest | 
 |   physical memory addresses. These APIs copy the CCWs into kernel | 
 |   space, and assemble a runnable kernel channel program by replacing the | 
 |   guest physical addresses with their corresponding host physical addresses. | 
 |   Note that we have to use IDALs even for direct-access CCWs, as the | 
 |   referenced memory can be located anywhere, including above 2G. | 
 |  | 
 | * vfio_ccw device driver | 
 |   This driver utilizes the CCW translation APIs and introduces | 
 |   vfio_ccw, which is the driver for the I/O subchannel devices you want | 
 |   to pass through. | 
 |   vfio_ccw implements the following vfio ioctls:: | 
 |  | 
 |     VFIO_DEVICE_GET_INFO | 
 |     VFIO_DEVICE_GET_IRQ_INFO | 
 |     VFIO_DEVICE_GET_REGION_INFO | 
 |     VFIO_DEVICE_RESET | 
 |     VFIO_DEVICE_SET_IRQS | 
 |  | 
 |   This provides an I/O region, so that the user space program can pass a | 
 |   channel program to the kernel, which does further CCW translation | 
 |   before issuing it to a real device. | 
 |   This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event | 
 |   notifier that notifies the user space program of I/O completion | 
 |   asynchronously. | 
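
To illustrate why the CCW translation step described above needs IDALs,
the following sketch builds a list of 64-bit IDAWs with 4K blocks for
one guest buffer. It is a simplified model, not the ``cp_`` code: the
translation hook ``gpa_to_hpa_fn`` and the helpers ``idal_nr_words()``
and ``idal_build()`` are names chosen for this example, and the hook is
assumed to map block-aligned guest addresses to block-aligned host
addresses. ``demo_xlate()`` is a fake translation that places all guest
memory above 4G.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDA_BLOCK_SIZE 4096ULL

/* Hypothetical hook: guest physical -> host physical address. */
typedef uint64_t (*gpa_to_hpa_fn)(uint64_t gpa);

/* Number of 4K-block IDAWs needed to cover [addr, addr + len). */
static size_t idal_nr_words(uint64_t addr, uint64_t len)
{
	return (size_t)(((addr & (IDA_BLOCK_SIZE - 1)) + len +
			 (IDA_BLOCK_SIZE - 1)) / IDA_BLOCK_SIZE);
}

/*
 * Fill idaws[] for a guest buffer. The first IDAW may point into the
 * middle of a block; each later one points at a block boundary.
 * Returns the number of IDAWs written, or 0 if idaws[] is too small.
 */
static size_t idal_build(uint64_t gpa, uint64_t len, gpa_to_hpa_fn xlate,
			 uint64_t *idaws, size_t max)
{
	size_t nr = idal_nr_words(gpa, len);
	uint64_t block = gpa & ~(IDA_BLOCK_SIZE - 1);

	if (nr == 0 || nr > max)
		return 0;
	/* Translate block by block: host pages need not be contiguous. */
	idaws[0] = xlate(block) + (gpa & (IDA_BLOCK_SIZE - 1));
	for (size_t i = 1; i < nr; i++)
		idaws[i] = xlate(block + i * IDA_BLOCK_SIZE);
	return nr;
}

/* Fake translation for demonstration: guest memory starts at 4G. */
static uint64_t demo_xlate(uint64_t gpa)
{
	return gpa + 0x100000000ULL;
}
```

A 4K buffer starting at guest address 0x1800 crosses a block boundary
and therefore needs two IDAWs, even though the guest CCW addressed it
directly.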
 |  | 
 | The use of vfio-ccw is not limited to QEMU, but QEMU is a good example | 
 | for understanding how this code works. Here is a little more detail on | 
 | how an I/O request triggered by the QEMU guest is handled (without | 
 | error handling). | 
 |  | 
 | Explanation: | 
 |  | 
 | - Q1-Q7: QEMU side process. | 
 | - K1-K6: Kernel side process. | 
 |  | 
 | Q1. | 
 |     Get I/O region info during initialization. | 
 |  | 
 | Q2. | 
 |     Setup event notifier and handler to handle I/O completion. | 
 |  | 
 | ... ... | 
 |  | 
 | Q3. | 
 |     Intercept a ssch instruction. | 
 | Q4. | 
 |     Write the guest channel program and ORB to the I/O region. | 
 |  | 
 |     K1. | 
 | 	Copy from guest to kernel. | 
 |     K2. | 
 | 	Translate the guest channel program to a host kernel space | 
 | 	channel program, which becomes runnable for a real device. | 
 |     K3. | 
 | 	With the necessary information contained in the orb passed in | 
 | 	by QEMU, issue the ccwchain to the device. | 
 |     K4. | 
 | 	Return the ssch CC code. | 
 | Q5. | 
 |     Return the CC code to the guest. | 
 |  | 
 | ... ... | 
 |  | 
 |     K5. | 
 | 	Interrupt handler gets the I/O result and writes the result to | 
 | 	the I/O region. | 
 |     K6. | 
 | 	Signal QEMU to retrieve the result. | 
 |  | 
 | Q6. | 
 |     Get the signal; the event handler reads the result from the I/O | 
 |     region. | 
 | Q7. | 
 |     Update the irb for the guest. | 
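
The asynchronous notification in steps K6 and Q6 relies on an eventfd.
The sketch below shows only the eventfd mechanics on Linux; registering
the descriptor with VFIO_DEVICE_SET_IRQS is omitted, and
``notifier_signal()`` is a user-space stand-in for the signal that the
kernel would raise. All three helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Q2: create the eventfd that would be handed to the kernel. */
static int notifier_create(void)
{
	return eventfd(0, EFD_CLOEXEC);
}

/* Stand-in for K6: signal one I/O completion. */
static int notifier_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Q6: block until signaled; returns the number of signals consumed, or -1. */
static int64_t notifier_wait(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

After ``notifier_wait()`` returns, the event handler would read the
I/O region to fetch the IRB.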
 |  | 
 | Limitations | 
 | ----------- | 
 |  | 
 | The current vfio-ccw implementation focuses on supporting the basic | 
 | commands needed to implement block device functionality (read/write) of | 
 | DASD/ECKD devices only. Some commands may need special handling in the | 
 | future, for example, anything related to path grouping. | 
 |  | 
 | DASD is a kind of storage device, while ECKD is a data recording format. | 
 | More information on DASD and ECKD can be found here: | 
 | https://en.wikipedia.org/wiki/Direct-access_storage_device | 
 | https://en.wikipedia.org/wiki/Count_key_data | 
 |  | 
 | Together with the corresponding work in QEMU, we can now bring a | 
 | passed-through DASD/ECKD device online in a guest and use it as a block | 
 | device. | 
 |  | 
 | The current code allows the guest to start channel programs via | 
 | START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL, | 
 | and STORE SUBCHANNEL. | 
 |  | 
 | Currently all channel programs are prefetched, regardless of the | 
 | p-bit setting in the ORB.  As a result, self modifying channel | 
 | programs are not supported.  For this reason, IPL has to be handled as | 
 | a special case by a userspace/guest program; this has been implemented | 
 | in QEMU's s390-ccw bios as of QEMU 4.1. | 
 |  | 
 | vfio-ccw supports classic (command mode) channel I/O only. Transport | 
 | mode (HPF) is not supported. | 
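
A user-space program could inspect the relevant ORB control bits before
submitting a request, roughly as sketched below. The bit positions are
taken from the ORB layout in the Principles of Operation as understood
here (word 1, bits numbered from the most significant bit) and should be
checked against the manual; the macro and function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ORB_AREA_SIZE 12

/* Byte 5 of the ORB holds bits 8-15 of word 1 (bit 0 is the MSB). */
#define ORB_B5_FMT  0x80 /* bit 8: format-1 CCWs */
#define ORB_B5_PFCH 0x40 /* bit 9: prefetch control (p-bit) */
#define ORB_B5_TM   0x04 /* bit 13: transport mode requested */

/* Transport mode ORBs are not supported by vfio-ccw. */
static bool orb_is_transport_mode(const uint8_t orb[ORB_AREA_SIZE])
{
	return (orb[5] & ORB_B5_TM) != 0;
}

/* True if the guest allowed prefetching of the channel program. */
static bool orb_prefetch_allowed(const uint8_t orb[ORB_AREA_SIZE])
{
	return (orb[5] & ORB_B5_PFCH) != 0;
}
```

A program whose ORB has the p-bit cleared is still prefetched by the
current implementation, so a caller may want to warn in that case.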
 |  | 
 | QDIO subchannels are currently not supported. Classic devices other than | 
 | DASD/ECKD might work, but have not been tested. | 
 |  | 
 | Reference | 
 | --------- | 
 | 1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832) | 
 | 2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204) | 
 | 3. https://en.wikipedia.org/wiki/Channel_I/O | 
 | 4. Documentation/s390/cds.rst | 
 | 5. Documentation/driver-api/vfio.rst | 
 | 6. Documentation/driver-api/vfio-mediated-device.rst |