[Rust-VMM] Requirements for out-of-process device emulation

Stefan Hajnoczi stefanha at redhat.com
Fri Oct 9 16:18:15 UTC 2020


I just posted the following on my blog to outline the requirements that
have been discussed over the past few months around out-of-process
device emulation (vhost-user, vfio-user, etc). I hope it's helpful for
covering various angles of out-of-process device emulation.

It's long, so no worries if you don't want to join the discussion.

Stefan
---
Requirements for out-of-process device emulation
================================================
Over the past months I have participated in discussions about
out-of-process device emulation. This post describes the requirements
that have become apparent. I hope this will be a useful guide to
understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?
----------------------------------------
Device emulation is traditionally implemented in the program that
executes guest code. This approach is natural because accesses to device
registers are trapped as part of the CPU run loop that sits at the core
of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in
separate processes. For example, software-defined network switches can
minimize data copies by emulating network cards directly in the switch
process. Out-of-process device emulation also enables privilege
separation and tighter sandboxing for security.

Why are these requirements important?
-------------------------------------
When emulated devices are implemented in the VMM they use common VMM
APIs. Adding new devices is relatively easy because the APIs are already
there and the developer can focus on the device specifics.
Out-of-process device emulation potentially leaves developers without
APIs since the device emulation program is a separate program that
literally starts from main(). Developers want to focus on implementing
their specific device, not on solving general problems related to
out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device
completely from scratch, but there is also a risk of developing the
wrong solution because some subtleties of device emulation are not
obvious at first glance.

I hope sharing these requirements will help in the creation of common
infrastructure so it's easy to implement high-quality out-of-process
devices.

Not all use cases have the full set of requirements. Therefore it's best
if requirements are addressed in separate, reusable libraries so that
device implementors can pick the ones that are relevant to them.

Device emulation
----------------
Device resources
````````````````
Devices provide resources that drivers interact with such as hardware
registers, memory, or interrupts. The fundamental requirement of
out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The most basic device emulation operation is the hardware register
access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO)
access to the device. A read loads a value from a device register. A
write stores a value to a device register. These operations are
synchronous because the vCPU is paused until completion.
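
As a rough sketch of what a register access API could look like, the
following models a device with two 32-bit registers. The trait name
RegisterAccess, the ToyDevice type, the register offsets, and the
all-ones value for unimplemented registers are assumptions made for
illustration. They do not come from vhost-user, vfio-user, or any
existing crate.

```rust
/// Hypothetical trait for a device with 32-bit hardware registers.
/// The names are illustrative and do not come from an existing crate.
pub trait RegisterAccess {
    /// Load a value from the register at `offset`.
    fn read(&mut self, offset: u64) -> u32;
    /// Store `value` to the register at `offset`.
    fn write(&mut self, offset: u64, value: u32);
}

pub const REG_ID: u64 = 0x0; // read-only identification register
pub const REG_SCRATCH: u64 = 0x4; // read/write scratch register
pub const DEVICE_ID: u32 = 0x1234_5678;

/// Toy device used only to illustrate the trait.
pub struct ToyDevice {
    scratch: u32,
}

impl ToyDevice {
    pub fn new() -> Self {
        ToyDevice { scratch: 0 }
    }
}

impl RegisterAccess for ToyDevice {
    fn read(&mut self, offset: u64) -> u32 {
        match offset {
            REG_ID => DEVICE_ID,
            REG_SCRATCH => self.scratch,
            // Assumption: unimplemented registers read as all-ones.
            _ => u32::MAX,
        }
    }

    fn write(&mut self, offset: u64, value: u32) {
        match offset {
            REG_SCRATCH => self.scratch = value,
            // Writes to read-only or unimplemented registers are ignored.
            _ => {}
        }
    }
}
```

In an out-of-process setting the VMM would forward each trapped access
to calls like these and keep the vCPU paused until the call returns.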
Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the
device that new requests are ready for processing. The vCPU does not
need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous
doorbells.
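
The posted-write behaviour can be modelled in a few lines. This sketch
does not use ioeventfd or eventfd at all: a standard library channel
stands in for the notification object, and the Doorbell and
DoorbellListener names are hypothetical. It only shows that the writer
returns immediately while the device side picks up notifications later.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Writer side of the doorbell (the vCPU side in a real system).
pub struct Doorbell {
    tx: Sender<u16>,
}

/// Device side of the doorbell.
pub struct DoorbellListener {
    rx: Receiver<u16>,
}

/// Create a connected doorbell and listener.
pub fn doorbell_pair() -> (Doorbell, DoorbellListener) {
    let (tx, rx) = channel();
    (Doorbell { tx }, DoorbellListener { rx })
}

impl Doorbell {
    /// Posted write: queue the notification and return without waiting
    /// for the device to process it.
    pub fn ring(&self, queue_index: u16) {
        // If the listener has gone away the notification is dropped.
        let _ = self.tx.send(queue_index);
    }
}

impl DoorbellListener {
    /// Collect all notifications that are pending right now, in order,
    /// without blocking.
    pub fn drain(&self) -> Vec<u16> {
        self.rx.try_iter().collect()
    }
}
```

Unlike this channel, an eventfd is a counter, so several rings may be
coalesced into a single wakeup.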

Shared device memory
~~~~~~~~~~~~~~~~~~~~
Devices may have memory-like regions that the CPU can access (such as
PCI Memory BARs). The device emulation process therefore needs to share
a region of its memory space with the VMM so the guest can access it.
This mechanism also allows device emulation to busy wait (poll) instead
of using synchronous MMIO/PIO accesses or asynchronous doorbells for
notifications.

Direct Memory Access (DMA)
~~~~~~~~~~~~~~~~~~~~~~~~~~
Devices often require read and write access to a memory address space
belonging to the CPU. This allows network cards to transmit packet
payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest
RAM. This allowed DMA to any guest physical memory address. More advanced
IOMMU and address space identifier mechanisms are now becoming
ubiquitous. Therefore, new out-of-process device emulation interfaces
should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant
access to a region of memory so the device emulation process can read
from and/or write to it.
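
A minimal sketch of that requirement, assuming a simple table of
granted regions: the VMM side calls map() and unmap(), and the device
emulation code calls check() before every DMA access. DmaMap, Perm,
and DmaError are hypothetical names, and a real interface would also
have to translate addresses and handle overlapping regions.

```rust
/// Access rights granted by the VMM for one region.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Perm {
    pub read: bool,
    pub write: bool,
}

struct Mapping {
    iova: u64,
    len: u64,
    perm: Perm,
}

#[derive(Debug, PartialEq)]
pub enum DmaError {
    NotMapped,
    PermissionDenied,
}

/// Table of regions the device emulation process may access.
pub struct DmaMap {
    mappings: Vec<Mapping>,
}

impl DmaMap {
    pub fn new() -> Self {
        DmaMap { mappings: Vec::new() }
    }

    /// Grant access to the range starting at `iova` of `len` bytes.
    pub fn map(&mut self, iova: u64, len: u64, perm: Perm) {
        self.mappings.push(Mapping { iova, len, perm });
    }

    /// Revoke every grant that starts at `iova`.
    pub fn unmap(&mut self, iova: u64) {
        self.mappings.retain(|m| m.iova != iova);
    }

    /// Check that an access lies entirely inside one granted region and
    /// that the region allows the requested direction.
    pub fn check(&self, addr: u64, len: u64, write: bool) -> Result<(), DmaError> {
        let end = addr.checked_add(len).ok_or(DmaError::NotMapped)?;
        for m in &self.mappings {
            let m_end = m.iova.saturating_add(m.len);
            if addr >= m.iova && end <= m_end {
                let allowed = if write { m.perm.write } else { m.perm.read };
                return if allowed { Ok(()) } else { Err(DmaError::PermissionDenied) };
            }
        }
        Err(DmaError::NotMapped)
    }
}
```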

Interrupts
~~~~~~~~~~
Devices notify the CPU using interrupts. An interrupt is simply a
message sent by the device emulation process to the VMM. Interrupt
configuration is flexible on modern devices, meaning the driver may be
able to select the number of interrupts and the mapping of event sources
to interrupts (for example, sharing one interrupt between multiple event
sources). This can be implemented using the Linux eventfd mechanism or
via in-band device emulation protocol messages, for example.
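
The mapping between event sources and interrupts might be tracked along
these lines. InterruptRouter is a hypothetical type, and the vector of
signalled interrupts is only a placeholder for the real delivery
mechanism (an eventfd write or a protocol message).

```rust
use std::collections::HashMap;

/// Routes device event sources to interrupt vectors chosen by the driver.
pub struct InterruptRouter {
    num_vectors: u16,
    routes: HashMap<u16, u16>, // event source -> interrupt vector
    signalled: Vec<u16>,       // placeholder for real interrupt delivery
}

impl InterruptRouter {
    pub fn new(num_vectors: u16) -> Self {
        InterruptRouter {
            num_vectors,
            routes: HashMap::new(),
            signalled: Vec::new(),
        }
    }

    /// Driver configuration: deliver `source` on `vector`.
    /// Returns false if the vector does not exist.
    pub fn route(&mut self, source: u16, vector: u16) -> bool {
        if vector >= self.num_vectors {
            return false;
        }
        self.routes.insert(source, vector);
        true
    }

    /// The device raises an event. Returns the vector that was
    /// signalled, or None if the source has no route.
    pub fn raise(&mut self, source: u16) -> Option<u16> {
        let vector = *self.routes.get(&source)?;
        self.signalled.push(vector);
        Some(vector)
    }

    /// Take the list of vectors signalled so far and clear it.
    pub fn take_signalled(&mut self) -> Vec<u16> {
        std::mem::take(&mut self.signalled)
    }
}
```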

Extensibility for new bus types
```````````````````````````````
It should be possible to support multiple bus types. vhost-user only
supports vhost devices. VFIO is more extensible but currently focussed
on PCI devices. It is likely that QEMU SysBus devices will be desirable
for implementing ad-hoc out-of-process devices (especially for
System-on-Chip target platforms).

Bus-level APIs, not protocol bindings
`````````````````````````````````````
Developers should not need to learn the out-of-process device emulation
protocol (vfio-user, etc). APIs should focus on bus-level concepts such
as defining VIRTIO or PCI devices rather than protocol bindings for
dealing with protocol messages, file descriptor passing, and shared
memory.

In other words, developers should be thinking in terms of the problem
domain, not worrying about how out-of-process device emulation is
implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning
``````````````````````````````````````````
Threading issues arise often in device emulation because asynchronous
requests or multi-queue devices can be implemented using threads.
Therefore it is necessary to clearly document what threading models are
supported and how device lifecycle operations like reset interact with
in-flight requests.

Live migration, live upgrade, and crash recovery
------------------------------------------------
There are several related issues around device state and restarting the
device emulation program without disrupting the guest.

Live migration
``````````````
Live migration transfers the state of a device from one device emulation
process to another (typically running on another host). This requires
the following functionality:

Quiescing the device
~~~~~~~~~~~~~~~~~~~~
Some devices can be live migrated at any point in time without any
preparation, while others must be put into a quiescent state to avoid
issues. An example is a storage controller that has a write request in
flight. It is not safe to live migrate until the write request has
completed or been canceled. Failure to wait might result in data
corruption if the write takes effect after the destination has resumed
execution.

Therefore it is necessary to quiesce a device. After this point there is
no further device activity and no guest-visible changes will be made by
the device.
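
One way to picture quiescing is a small state machine that counts
in-flight requests. This is a sketch under the assumption that every
request passes through start_request() and complete_request().
RequestTracker and its State enum are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum State {
    Running,
    /// Quiesce was requested but requests are still in flight.
    Quiescing,
    /// No device activity and no further guest-visible changes.
    Quiesced,
}

pub struct RequestTracker {
    state: State,
    in_flight: usize,
}

impl RequestTracker {
    pub fn new() -> Self {
        RequestTracker { state: State::Running, in_flight: 0 }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Returns false if the device is not running, in which case the
    /// request must not be started.
    pub fn start_request(&mut self) -> bool {
        if self.state != State::Running {
            return false;
        }
        self.in_flight += 1;
        true
    }

    /// Record that a request completed or was canceled.
    pub fn complete_request(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
        if self.state == State::Quiescing && self.in_flight == 0 {
            self.state = State::Quiesced;
        }
    }

    /// Stop accepting requests. The device is only quiesced once the
    /// last in-flight request has finished.
    pub fn quiesce(&mut self) -> State {
        self.state = if self.in_flight == 0 {
            State::Quiesced
        } else {
            State::Quiescing
        };
        self.state
    }

    /// Resume normal operation, for example after a failed migration.
    pub fn resume(&mut self) {
        self.state = State::Running;
    }
}
```

Device state should only be saved once state() reports Quiesced.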

Saving/loading device state
~~~~~~~~~~~~~~~~~~~~~~~~~~~
It must be possible to save and load device state. Device state includes
the contents of hardware registers as well as device-internal state
necessary for resuming operation.

It is typically necessary to determine whether the device emulation
processes on the migration source and destination are compatible before
attempting migration. This avoids migration failure when the destination
tries to load the device state and discovers it doesn't support it. It
may be desirable to support loading device state that was generated by a
different implementation of the same device type (for example, two
virtio-net implementations).
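
To make the compatibility check concrete, here is a sketch of a
serialized state format with a magic number and a format version in
the header. The layout, the "DEVS" magic, and the DeviceState fields
are invented for this example and are not taken from any existing
migration format.

```rust
/// Toy device state: one register and one piece of internal state.
#[derive(Debug, PartialEq)]
pub struct DeviceState {
    pub status: u32,
    pub queue_size: u16,
}

#[derive(Debug, PartialEq)]
pub enum LoadError {
    Truncated,
    BadMagic,
    UnsupportedVersion(u16),
}

const MAGIC: [u8; 4] = *b"DEVS";
pub const FORMAT_VERSION: u16 = 1;
const ENCODED_LEN: usize = 12; // magic + version + status + queue_size

impl DeviceState {
    /// Serialize as: magic, format version, then little-endian fields.
    pub fn save(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(ENCODED_LEN);
        out.extend_from_slice(&MAGIC);
        out.extend_from_slice(&FORMAT_VERSION.to_le_bytes());
        out.extend_from_slice(&self.status.to_le_bytes());
        out.extend_from_slice(&self.queue_size.to_le_bytes());
        out
    }

    /// Check the header before touching the payload so that an
    /// incompatible source is rejected with a clear error.
    pub fn load(buf: &[u8]) -> Result<Self, LoadError> {
        if buf.len() < 6 {
            return Err(LoadError::Truncated);
        }
        if &buf[..4] != &MAGIC[..] {
            return Err(LoadError::BadMagic);
        }
        let version = u16::from_le_bytes([buf[4], buf[5]]);
        if version != FORMAT_VERSION {
            return Err(LoadError::UnsupportedVersion(version));
        }
        if buf.len() != ENCODED_LEN {
            return Err(LoadError::Truncated);
        }
        Ok(DeviceState {
            status: u32::from_le_bytes([buf[6], buf[7], buf[8], buf[9]]),
            queue_size: u16::from_le_bytes([buf[10], buf[11]]),
        })
    }
}
```

In practice the version would be exchanged before migration begins, so
that an incompatible destination is detected before any state is sent.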

Dirty memory logging
~~~~~~~~~~~~~~~~~~~~
Pre-copy live migration starts with an iterative phase where dirty
memory pages are copied from the migration source to the destination
host. Devices need to participate in dirty memory logging so that all
written pages are transferred to the destination and no pages are
"missed".

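Dirty memory logging on the device side can be sketched as a bitmap
with one bit per guest page, which the device updates on every DMA
write. DirtyLog is a hypothetical type and the 4 KiB page size is an
assumption. A real implementation would share the bitmap with the VMM
instead of returning a list of page numbers.

```rust
pub const PAGE_SIZE: u64 = 4096;

/// One bit per guest page, set when the device writes to that page.
pub struct DirtyLog {
    bitmap: Vec<u64>,
    num_pages: u64,
}

impl DirtyLog {
    /// Create a log covering `mem_size` bytes of guest memory.
    pub fn new(mem_size: u64) -> Self {
        let num_pages = (mem_size + PAGE_SIZE - 1) / PAGE_SIZE;
        let words = ((num_pages + 63) / 64) as usize;
        DirtyLog { bitmap: vec![0; words], num_pages }
    }

    /// Record a DMA write of `len` bytes at guest address `addr`.
    /// Pages beyond the end of the logged memory are ignored.
    pub fn mark_written(&mut self, addr: u64, len: u64) {
        if len == 0 || self.num_pages == 0 {
            return;
        }
        let first = addr / PAGE_SIZE;
        let last = (addr.saturating_add(len - 1) / PAGE_SIZE).min(self.num_pages - 1);
        for page in first..=last {
            self.bitmap[(page / 64) as usize] |= 1u64 << (page % 64);
        }
    }

    /// Return the dirty page numbers in ascending order and clear the
    /// log, as one pre-copy iteration would.
    pub fn take_dirty(&mut self) -> Vec<u64> {
        let mut pages = Vec::new();
        for (i, word) in self.bitmap.iter_mut().enumerate() {
            for bit in 0..64u64 {
                if *word & (1u64 << bit) != 0 {
                    pages.push(i as u64 * 64 + bit);
                }
            }
            *word = 0;
        }
        pages
    }
}
```

A write that straddles a page boundary marks both pages, which is the
case that is easy to miss when logging is added as an afterthought.
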
Crash recovery
``````````````
If the device emulation process crashes it should be possible to restart
it and resume device emulation without disrupting the guest (aside from
a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware
registers, etc) outside the device emulation process. This way the state
remains even if the process crashes and it can be resumed when a new
process starts.

Live upgrade
````````````
It must be possible to upgrade the device emulation process and the VMM
without disrupting the guest. Upgrading the device emulation process is
similar to crash recovery in that the process terminates and a new one
resumes with the previous state.

Device versioning
`````````````````
The guest-visible aspects of the device must be versioned. In the
simplest case the device emulation program would have a
--compat-version=N command-line option that controls which version of
the device the guest sees. When guest-visible changes are made to the
program the version number must be increased.
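
A sketch of how such an option might gate guest-visible behaviour. The
latest version number and the feature bits are made up for
illustration. Only the --compat-version=N option name comes from the
text above.

```rust
/// Newest guest-visible device version this program implements
/// (hypothetical value).
pub const LATEST_VERSION: u32 = 3;

/// Find --compat-version=N in the arguments. Defaults to the latest
/// version when the option is absent.
pub fn parse_compat_version<I, S>(args: I) -> Result<u32, String>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut version = LATEST_VERSION;
    for arg in args {
        if let Some(value) = arg.as_ref().strip_prefix("--compat-version=") {
            let n: u32 = value
                .parse()
                .map_err(|_| format!("invalid version: {}", value))?;
            if n == 0 || n > LATEST_VERSION {
                return Err(format!("unsupported version: {}", n));
            }
            version = n;
        }
    }
    Ok(version)
}

/// Feature bits shown to the guest for a given device version.
/// Each guest-visible addition is tied to the version that introduced it.
pub fn feature_bits(version: u32) -> u64 {
    let mut bits = 0b001; // present since version 1
    if version >= 2 {
        bits |= 0b010; // added in version 2
    }
    if version >= 3 {
        bits |= 0b100; // added in version 3
    }
    bits
}
```

Running a newer program with --compat-version=2 then presents exactly
the same feature bits that the older program did.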

Giving the user control over the guest-visible device behavior makes it
possible to save/load and live migrate reliably. Otherwise loading
device state in a newer device emulation program could affect the
running guest. Guest drivers typically are not prepared for the device
to change underneath them and doing so could result in guest crashes or
data corruption.

Security
--------
The trust model
```````````````
The VMM must not trust the device emulation program. This is key to
implementing privilege separation and the principle of least privilege.
If a compromised device emulation program is able to gain control of the
VMM then out-of-process device emulation has failed to provide isolation
between devices.

The device emulation program must not trust the VMM to the extent that
this is possible. For example, it must validate inputs so that the VMM
cannot gain control of the device emulation process through memory
corruptions or other bugs. This makes it so that even if the VMM has
been compromised, access to device resources and associated system calls
still requires further compromising the device emulation process.
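
Input validation of this kind often comes down to bounds checks that
cannot overflow. The sketch below validates an untrusted offset and
length against a device memory region before using them. The function
names and the AccessError type are hypothetical.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
pub enum AccessError {
    /// offset + len does not fit in 64 bits.
    Overflow,
    /// The access extends past the end of the region.
    OutOfBounds,
}

/// Validate an untrusted (offset, len) pair against a region of
/// `region_len` bytes and return the byte range to access.
pub fn validate_access(
    region_len: usize,
    offset: u64,
    len: u64,
) -> Result<Range<usize>, AccessError> {
    // checked_add rejects values chosen to wrap around to a small end.
    let end = offset.checked_add(len).ok_or(AccessError::Overflow)?;
    if end > region_len as u64 {
        return Err(AccessError::OutOfBounds);
    }
    // Both values are at most region_len here, so the casts are lossless.
    Ok(offset as usize..end as usize)
}

/// Read from a region using values supplied by the other process.
pub fn read_region(region: &[u8], offset: u64, len: u64) -> Result<&[u8], AccessError> {
    let range = validate_access(region.len(), offset, len)?;
    Ok(&region[range])
}
```

The same check applies in both directions of the trust model, since
neither side should index memory with values it has not validated.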

Unprivileged operation
``````````````````````
The device emulation program should run unprivileged to the extent that
this is possible. If special permissions are required to access hardware
resources then these resources can sometimes be provided via file
descriptor passing by a more privileged parent process.

Sandboxing
``````````
Operating system sandboxing mechanisms can be applied to device
emulation processes more effectively than monolithic VMMs. Seccomp can
limit the Linux system calls that may be invoked. SELinux can restrict
access to system resources.

Sandboxing is a common task that most device emulation programs need.
Therefore it is a good candidate for a library or launcher tool that is
shared by device emulation programs.

Management
----------
Command-line interface
``````````````````````
A common command-line interface should be defined where possible. For
example, vhost-user's standard --socket-path=PATH argument makes it easy
to launch any vhost-user device backend. Protocol-specific options (e.g.
socket path) and device type-specific options (e.g. virtio-net) can be
standardized.

Some options are necessarily specific to the device emulation program
and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt
can launch the device emulation programs without further user
configuration.

RPC interface
`````````````
It may be necessary to issue commands at runtime. Examples include
adjusting throttling limits, enabling/disabling logging, etc. These
operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization
software. Adopting a widely-used RPC protocol and standardizing commands
is beneficial because it makes it easy to communicate with the software
and management tools can support them relatively easily.

Conclusion
----------
This was largely a brain dump but I hope it is useful food for thought
as out-of-process device emulation interfaces are designed and
developed. There is a lot more to it than simply implementing a protocol
for device register accesses and guest RAM DMA. Developing open source
libraries in Rust and C that can be used as needed will ensure that
out-of-process devices are high-quality and easy for users to deploy.