The Input-Output Memory Management Unit (IOMMU), sometimes referred to as a System MMU (SMMU), is a system-level Memory Management Unit (MMU) that connects direct-memory-access-capable Input/Output (I/O) devices to system memory.
For each I/O device connected to the system through an IOMMU, software can
configure at the IOMMU a device context, which associates with the device a
specific virtual address space and other per-device parameters. By giving
each device its own separate device context at an IOMMU, each device can be
individually configured for a different software master, usually a guest OS or
the main (host) OS. On every memory access made from a device, hardware
indicates to the IOMMU the originating device by some form of unique device
identifier, which the IOMMU uses to locate the appropriate device context
within data structures supplied by software. For PCIe, for example, the
originating device may be identified by the unique 16-bit triplet of PCI bus
number (8-bit), device number (5-bit), and function number (3-bit)
(collectively known as routing identifier or RID) and optionally up to 8-bit
segment number when the IOMMU supports multiple Hierarchies. This specification
refers to such unique device identifier as device_id
and supports up to
24-bit wide identifiers.
Note
|
A Hierarchy is a PCI Express I/O interconnect topology, wherein the Configuration Space addresses, referred to as the tuple of Bus/Device/Function Numbers, are unique. In some contexts, a Hierarchy is also called a Segment, and in Flit Mode, the Segment number is sometimes included in the ID of a Function. |
Some devices may support shared virtual addressing which is the ability to
share process address spaces with devices. Sharing process address spaces with
devices allows to rely on core kernel memory management for DMA, removing some
complexity from application and device drivers. After binding to a device,
applications can instruct it to perform DMA on application statically or
dynamically allocated buffers. To support such addressing, software can
configure one or more process contexts into the device context. On every memory
access made from a device, the hardware indicates to the IOMMU a unique process
identifier, which the IOMMU uses in conjunction with the unique device
identifier to locate the appropriate process context configured by software in
the device context. For PCIe, for example, the process context may be identified
by the unique 20-bit process address space identifier (PASID). This
specification refers to such unique process identifiers as process_id
and
supports up to 20-bit wide identifiers.
Using the same S/VS-stage and G-stage page table formats in IOMMU for address translation and protections as the CPU’s MMU removes some complexity from the core kernel memory management for DMA. Use of an identical format also allows the same G and S/VS-stage tables to be used by both MMU and the IOMMU.
DMA address translation in the IOMMU has certain performance implications for
DMA accesses as DMA access time may be lengthened due to the time required to
resolve the supervisor physical address using software provided data structures.
Similar overheads in the CPU MMU are mitigated typically through the use of a
translation look-aside buffer (TLB) to cache these address translations such
that they may be re-used to reduce the translation overhead on subsequent
accesses. The IOMMU may employ similar address translation caches, referred as
IOMMU Address Translation Cache (IOATC). The IOMMU provides mechanisms for
software to synchronize the IOATC with the memory resident data structures used
for address translation when they are modified. Software may configure the
device context with a software defined context identifier called guest
soft-context identifier (GSCID
) to indicate that a collection of devices are
assigned to the same VM and thus access a common virtual address space.
Software may configure the process context with a software defined context
identifier called process soft-context identifier (PSCID
) to identify a
collection of process identifier that share a common virtual address space.
The IOMMU may use the GSCID
and PSCID
to tag entries in the IOATC to avoid
duplication and simplify invalidation operations.
Some devices may participate in the translation process and provide a device side ATC (DevATC) for its own memory accesses. By providing a DevATC, the device shares the translation caching responsibility and thereby reduce probability of "thrashing" in the IOATC. The DevATC may be sized by the device to suit its unique performance requirements and may also be used by the device to optimize latency by prefetching translations. Such mechanisms require close cooperation of the device and the IOMMU using a protocol. For PCIe, for example, the Address Translation Services (ATS) protocol may be used by the device to request translations to cache in the DevATC and to synchronize it with updates made by software address translation data structures. The device participating in the address translation process also enables the use of I/O page faults to avoid the core kernel memory manager from having to make all physical memory that may be accessed by the device resident at all times. For PCIe, for example, the device may implement the Page Request Interface (PRI) to dynamically request the memory manager to make a page resident if it discovers the page for which it requested a translation was not available. An IOMMU may support specialized software interfaces and protocols with the device to enable services such as PCIe ATS and PCIe PRI.
In systems built with an Incoming Message-Signaled Interrupt Controller (IMSIC), the IOMMU may be programmed by the hypervisor to direct message-signaled interrupts (MSI) from devices controlled by the guest OS to a guest interrupt file in an IMSIC. Because MSIs from devices are simply memory writes, they would naturally be subject to the same address translation that an IOMMU applies to other memory writes. However, the Advanced Interrupt Architecture requires that IOMMUs treat MSIs directed to virtual machines specially, in part to simplify software, and in part to allow optional support for memory-resident interrupt files. The device context is configured by software with parameters to identify memory writes as MSI and to be translated using a MSI address translation table configured by software in the device context.
Term | Definition |
---|---|
AIA |
Advanced Interrupt Architecture. |
ATS / PCIe ATS |
Address Translation Services: A PCIe protocol to support DevATC. |
CXL |
Compute Express Link bus standard. |
DC / Device Context |
A hardware representation of state that identifies a device and the VM to which the device is assigned. |
DDT |
Device-directory-table: A radix-tree structure traversed using the unique device identifier to locate the Device Context structure. |
DDI |
Device-directory-index: A sub-field of the unique device identifier used as a index into a leaf or non-leaf DDT structure. |
Device ID |
An identification number that is up to 24-bits to identify the source of a DMA or interrupt request. For PCIe devices this is the routing identifier (RID). |
DevATC |
An address translation cache at the device. |
DMA |
Direct Memory Access. |
GPA |
Guest Physical Address: An address in the virtualized physical memory space of a virtual machine. |
GSCID |
Guest soft-context identifier: An identification number used by software to uniquely identify a collection of devices assigned to a virtual machine. An IOMMU may tag IOATC entries with the GSCID. Device contexts programmed with same GSCID must also be programmed with identical G-stage page tables. |
Guest |
Software in a virtual machine. |
HPM |
Hardware Performance Monitor. |
Hypervisor |
Software entity that controls virtualization. |
ID |
Identifier. |
IMSIC |
Incoming Message-signaled Interrupt Controller. |
IOATC |
IOMMU Address Translation Cache: cache in IOMMU that caches data structures used for address translations. |
IOVA |
I/O Virtual Address: Virtual address for DMA by devices. |
MSI |
Message Signaled Interrupts. |
OS |
Operating System. |
PASID |
Process Address Space Identifier: It identifies the address space of a process. The PASID value is provided in the PASID TLP prefix of the request. |
PBMT |
Page-Based Memory Types. |
PPN |
Physical Page Number. |
PRI |
Page Request Interface - a PCIe protocol that enables devices to request OS memory manager services to make pages resident. |
PC |
Process Context. |
PCIe |
Peripheral Component Interconnect Express bus standard. |
PDI |
Process-directory-index: a sub field of the unique process identifier used to index into a leaf or non-leaf PDT structure. |
PDT |
Process-directory-table: A radix tree data structure traversed using the unique Process identifier to locate the process context structure. |
PMA |
Physical Memory Attributes. |
PMP |
Physical Memory Protection. |
PPN |
Physical Page Number. |
PRI |
Page Request Interface - a PCIe protocol that enables devices to request OS memory manager services to make pages resident. |
Process ID |
An identification number that is up to 20-bits to identify a process context. For PCIe devices this is the PASID. |
PSCID |
Process soft-context identifier: An identification number used by software to identify a unique address space. The IOMMU may tag IOATC entries with PSCID. |
PT |
Page Table. |
PTE |
Page Table Entry. A leaf or non-leaf entry in a page table. |
Reserved |
A register or data structure field reserved for future use. Reserved fields in data structures must be set to 0 by software. Software must ignore reserved fields in registers and preserve the value held in these fields when writing values to other fields in the same register. |
RID / PCIe RID |
PCIe routing identifier. |
RO |
Read-only - Register bits are read-only and cannot be altered
by software. Where explicitly defined, these bits are used
to reflect changing hardware state, and as a result bit
values can be observed to change at run time. |
RW |
Read-Write - Register bits are read-write and are permitted
to be either Set or Cleared by software to the desired
state. |
RW1C |
Write-1-to-clear status - Register bits indicate status when
read. A Set bit indicates a status event which is Cleared by
writing a 1b. Writing a 0b to RW1C bits has no effect. |
RW1S |
Read-Write-1-to-set - register bits indicate status when
read. The bit may be Set by writing 1b. Writing a 0b to RW1S
bits has no effect. |
SOC |
System on a chip, also referred as system-on-a-chip and system-on-chip. |
SPA |
Supervisor Physical Address: Physical address used to to access memory and memory-mapped resources. |
TLP |
Transaction Layer Packet. |
VA |
Virtual Address. |
VM |
Virtual Machine: An efficient, isolated duplicate of a real computer system. In this specification it refers to the collection of resources and state that is accessible when a RISC-V hart supporting the hypervisor extension executes with the virtualization mode set to 1. |
VMM |
Virtual Machine Monitor. Also referred to as hypervisor. |
VS |
Virtual Supervisor: Supervisor privilege in virtualization mode. |
WARL |
Write Any values, Reads Legal values: Attribute of a register field that is only defined for a subset of bit encodings, but allow any value to be written while guaranteeing to return a legal value whenever read. |
WPRI |
Writes Preserve values, Reads Ignore values: Attribute of a register field that is reserved for future use. |
A non-virtualized OS may use the IOMMU for the following significant system-level functionalities:
-
Protect the operating system from bad memory accesses from errant devices
-
Support 32-bit devices in 64-bit environment (avoidance of bounce buffers)
-
Support mapping of contiguous virtual addresses to an underlying fragmented physical addresses (avoidance of scatter/gather lists)
-
Dynamic redirection of interrupts
-
Support shared virtual addressing
In the absence of an IOMMU, a device driver must program devices with Physical Addresses, which implies that DMA from a device could be used to access any memory, such as privileged memory, and cause malicious or unintended corruptions. This may be caused by hardware bugs, device driver bugs, or by malicious software/hardware.
The IOMMU offers a mechanism for the OS to defend against such unintended corruptions by limiting the memory that can be accessed by devices using DMA. Indeed, as depicted in Device isolation in non-virtualized OS diagram the Operating System configures the IOMMU to use the S-stage page table to translate IOVA to SPA and thereby limit the addresses that may be accessed.
The OS may also use the MSI address translation capability to dynamically redirect interrupts from one RISC-V hart to another without needing to reprogram the devices themselves.
Legacy 32-bit devices cannot access the memory above 4 GiB. The integration of the IOMMU, through its address remapping capability, offers a simple mechanism for the DMA to directly access any address in the system (with appropriate access permission). Without an IOMMU, the OS must resort to copying data through buffers (also known as bounce buffers) allocated in memory below 4 GiB. In this scenario the IOMMU improves the system performance.
The IOMMU can be useful as it permits to allocate large regions of memory without the need to be contiguous in physical memory. Indeed, a contiguous virtual address range can be mapped to a fragmented physical addresses.
The IOMMU can be used to support shared virtual addressing which is the ability to share process address space with devices. Sharing process address spaces with devices allows to rely on core kernel memory management for DMA, removing some complexity from application and device drivers.
+----------------+ +--------------+ | non−privileged | | privileged | | memory | | memory | | | | | | ^ | | | +-------|--------+ +--------------+ | +-------|-------------+ | | IOMMU | | +------------+ | | | device | | | | S−stage PT | | | +------------+ | | ^ | +-------|-------------+ | +--------+ | Device | +--------+
IOMMU makes it possible for a guest operating system, running in a virtual machine, to be given direct control of an I/O device with only minimal hypervisor intervention.
A guest OS with direct control of a device will program the device with guest physical addresses, because that is all the OS knows. When the device then performs memory accesses using those addresses, an IOMMU is responsible for translating those guest physical addresses into supervisor physical addresses, referencing address-translation data structures supplied by the hypervisor.
DMA translation to enable direct device assignment diagram illustrates the concept. The device D1 is directly assigned to VM-1 and device D2 is directly assigned to VM-2. The VMM configures the G-stage page table to be used by each device and restricts the memory that can be accessed by D1 to VM-1 associated memory and from D2 to VM-2 associated memory.
+----------------+ +----------------+ | VM−1 | | VM−2 | | memory | | memory | | ^ | | ^ | +------|---------+ +-------|--------+ | | +------|-------------------|--------+ | | IOMMU | | | +------------+ +------------+ | | | device D1 | | device D2 | | | | G−stage PT | | G−stage PT | | | +------------+ +------------+ | | ^ ^ | +------|-------------------|--------+ | | +-----------+ +-----------+ | Device D1 | | Device D2 | +-----------+ +-----------+
To handle MSIs from a device controlled by a guest OS, the hypervisor configures an IOMMU to redirect those MSIs to a guest interrupt file in an IMSIC (see MSI address translation to direct guest programmed MSI to IMSIC guest interrupt files) or to a memory-resident interrupt file. The IOMMU is responsible to use the MSI address-translation data structures supplied by the hypervisor to perform the MSI redirection. Because every interrupt file, real or virtual, occupies a naturally aligned 4-KiB page of address space, the required address translation is from a virtual (guest) page address to a physical page address, the same as supported by regular RISC-V page-based address translation.
+-----------------------+ |IMSIC | | +-------------------+ | | | M−level int. file | | | +-------------------+ | | | | +-------------------+ | | | S−level int. file | | | +-------------------+ | | | | +-------------------+ | +----------+ | | Guest int. file 1 | | | IOMMU | +---------------+ | +-------------------+ | | | | | | | +-------+ MSI | +------+ | MSI | IO Bridge | | +-------------------+ | |Device +-----------|MSI PT|----------------------------------->| Guest int. file 2 | | +-------+ Write | +------+ | Write | | | +-------------------+ | (GPA) | | (SPA) +---------------+ | ,,, | +----------+ | +-------------------+ | | | Guest int. file N | | | +-------------------+ | +-----------------------+
The presence of an IOMMU allows each device to be individually configured for a different software master, usually a guest OS or the main (host) OS.
On implementations of the IOMMU that support two stages of translation (VS-stage and G-stage), the G-stage translation (or second stage of translation) is intended to virtualize device DMA to the guest OS physical address space. Devices can be assigned to guest OS which can directly program the device to do DMA with its Guest Physical Addresses (GPA). The hypervisor or host OS will set up and configure the IOMMU to perform GPA to PA translation using G-stage page tables. The use of the G-stage page tables limits the physical memory accessible by a device controlled by the guest OS to the memory allocated to its virtual machine.
The hypervisor may then provide a virtual IOMMU facility, through hardware emulation or by enlightening the guest OS to use a software interface with the Hypervisor (also known as para-virtualization). The guest OS may then use the facilities provided by the virtual IOMMU to avail the same benefits as those discussed for a non-virtualized OS. The guest OS employs a page table, really a VS-stage page table, to perform similar configurations for the device in a non-virtualized OS.
With two-stage address translations enabled, the IOVA may be first translated to a GPA using the VS-stage page tables managed by the guest OS and the GPA translated to a SPA using the G-stage page tables managed by the hypervisor.
Address translation in IOMMU for Guest OS diagram illustrates the concept. The IOMMU is configured to perform two-stage address translation (VS-stage and G-stage) for device D1 and to perform G-stage only translation for device D2. The host OS or hypervisor may also retain a device, such as D3, for its own use and configure the IOMMU to perform a single-stage (S-stage) translation.
+---------------------------------------------------+ | Main memory | | | | | | ^ ^ ^ | +------|------------------|-----------------|-------+ | | | +------|------------------|-----------------|-------+ | | IOMMU | | | | +------------+ +------------+ | | | | device D1 | | device D2 | | | | | G−stage PT | | G−stage PT | | | | +------------+ +------------+ | | | ^ ^ | | | | | | | | +------------+ | +------------+ | | | device D1 | | | device D3 | | | | VS−stage PT| | | S−stage PT | | | +------------+ | +------------+ | | ^ | ^ | +------|------------------|-----------------|-------+ | | | +-----------+ +-----------+ +-----------+ | Device D1 | | Device D2 | | Device D3 | +-----------+ +-----------+ +-----------+
The hypervisor may use the MSI address translation capability to dynamically redirect interrupts from guest controlled devices to the guest assigned interrupt register file of an IMSIC in the RISC-V hart.
Example of IOMMUs integration in SoC. shows an example of a typical system on a chip (SOC) with RISC-V hart(s). The SOC incorporates memory controllers and several IO devices. This SOC also incorporates two instances of the IOMMU. The device may be directly connected to the IO Bridge and the system interconnect or may be connected through a Root Port when a IO protocol transaction to system interconnect transaction translation is required. In case of PCIe, for example, the Root Port is a PCIe port that maps a portion of a hierarchy through an associated virtual PCI-PCI bridge and maps the PCIe IO protocol transactions to the system interconnect transactions.
The first instance, IOMMU 0 (associated with the IO Bridge 0), interfaces a Root Port to the system fabric/interconnect. One or more endpoint devices are interface to the SoC through this Root Port. In case of PCIe, the Root Port incorporates an ATS interface to the IOMMU that is used to support the PCIe ATS protocol by the IOMMU. The example shows an endpoint device with a device side ATC (DevATC) that holds translations obtained by the device from IOMMU 0 using the PCIe ATS protocol.
When such IO protocol to system fabric protocol translation using a Root Port is not required, the devices may interface directly with the system fabric. The second instance, IOMMU 1 (associated with the IO Bridge 1), illustrates interfacing devices (IO Devices A and B) to the system fabric without the use of a Root Port.
The IO Bridge is placed between the device(s) and the system interconnect to process device originated DMA transactions. IO Devices may perform DMA transactions using IO Virtual Addresses (VA, GVA or GPA). The IO Bridge invokes the associated IOMMU to translate the IOVA to a Supervisor Physical Addresses (SPA).
The IOMMU is not invoked for outbound transactions.
The IOMMU is invoked by the IO Bridge for address translation and protection for inbound transactions. The data associated with the inbound transactions is not processed by the IOMMU. The IOMMU behaves like a look-aside IP to the IO Bridge and has several interfaces (see IOMMU interfaces.):
-
Host interface: it is a slave interface to the IOMMU for the harts to access its memory-mapped registers and perform global configuration and/or maintenance operations.
-
Device Translation Request interface: it is a slave interface, which receives the translation requests from the IO Bridge. On this interface the IO Bridge provides information about the request such as:
-
The hardware identities associated with transaction - the
device_id
and if applicable theprocess_id
and its validity. The IOMMU uses the hardware identities to retrieve the context information to perform the requested address translations. -
The IOVA and the type of the transaction (Translated or Untranslated).
-
Whether the request is for a read, write, execute, or an atomic operation.
-
Execute requested must be explicitly associated with the request (e.g., using a PCIe PASID). When not explicitly requested, the default must be 0.
-
-
The privilege mode associated with the request. When a privilege mode is not explicitly associated with the request (e.g., using a PCIe PASID), the default privilege mode must be User.
-
The number of bytes accessed by the request.
-
The IO Bridge may also provide some additional opaque information (e.g. tags) that are not interpreted by the IOMMU but returned along with the response from the IOMMU to the IO Bridge. As the IOMMU is allowed to complete translation requests out of order, such information may be used by the IO Bridge to correlate completions to previous requests.
-
-
Data Structure interface: it is used by the IOMMU for implicit access to memory. It is a master interface to the IO Bridge and is used to fetch the required data structure from main memory. This interface is used to access:
-
The device and process directories to get the context information and translation rules.
-
The G-stage and/or S/VS-stage page table entries to translate the IOVA.
-
The in-memory queues (command-queue, fault-queue, and page-request-queue) used to interface with software.
-
-
Device Translation Completion interface: it is a master interface which provides the completion response from the IOMMU for previously requested address translations. The completion interface may provide information such as:
-
The status of the request, indicating if the request completed successfully or a fault occurred.
-
If the request was completed successfully; the Supervisor Physical Address (SPA).
-
Opaque information (e.g. tags), if applicable, associated with the request.
-
The page-based memory types (PBMT), if Svpbmt is supported, obtained from the IOMMU address translation page tables. When two-stage address translation is performed the IOMMU provides the page-based memory type as resolved between the G-stage and VS-stage page table entries.
-
-
ATS interface: The ATS interface, if the optional PCIe ATS capability is supported by the IOMMU, is used to communicate with ATS capable endpoints through the PCIe Root Port. This interface is used to:
-
To receive ATS translation requests from the endpoints and to return the completions to the endpoints. The Root Port may provide an indication if the endpoint originating the request is a CXL type 1 or type 2 device.
-
To send ATS "Invalidation Request" messages to the endpoints and to receive the "Invalidation Completion" messages from the endpoints.
-
To receive "Page Request" and "Stop Marker" messages from the endpoints and to send "Page Request Group Response" messages to the endpoints.
-
Similar to the RISC-V harts, physical memory attributes (PMA) and physical memory protection (PMP) checks must be completed on any inbound IO transactions even when the IOMMU is in bypass (bare state). The placement and integration of the PMA and PMP checkers is a platform choice.
PMA and PMP checkers reside outside the IOMMU. The example above is showing them in the IO Bridge.
Implicit accesses by the IOMMU itself through the Data Structure interface are checked by the PMA checker. PMAs are tightly tied to a given physical platform’s organization, many details are inherently platform-specific.
The memory accesses performed by the IOMMU using the Data Structure interface need not be ordered in general with the device initiated memory accesses.
Note
|
IOMMU may generate implicit memory accesses on the Data Structure interface to access data structures needed to perform the address translations. Such accesses must not be blocked by the original device initiated memory access. IO bridge may perform ordering of memory accesses on the Data Structure interface to satisfy the necessary hazard checks and other rules as defined by the IO bridge and the system interconnect. |
The IOMMU provides the resolved PBMT (PMA, IO, NC) along with the translated address on the device translation completion interface to the IO Bridge. The PMA in IO Bridge may use the provided PBMT to override the PMA(s) for the associated memory pages.
The PMP may use the hardware ID of the bus master to determine physical memory access privileges. As the IOMMU itself is a bus master for its implicit accesses, the IOMMU hardware ID may be used by the PMP to select the appropriate access control rules.
Note
|
The IOMMU does not validate the authenticity of the hardware IDs provided by the IO bridge. The IO bridge and/or the root ports must include suitable mechanisms to authenticate the hardware IDs. In some SOC this may be trivially achieved as a property of the devices being integrated into the SOC and their IDs being immutable. For PCIe, for example, the PCIe defined Access Control Services (ACS) Source Validation capabilities may be used to authenticate the hardware IDs. Other implementation specific methods in the IO bridge may be provided to perform such authentication. |
The version 1.0 of the RISC-V IOMMU specification supports the following features:
-
Memory-based device context to locate parameters and address translations structures. The device context is located using the hardware provided unique
device_id
. The supporteddevice_id
width may be up to 24-bit. -
Memory-based process context to locate parameters and address translation structures using hardware provide unique
process_id
. The supportedprocess_id
may be up to 20-bit. -
16-bit GSCIDs and 20-bit PSCIDs.
-
Single stage and two stage address translation.
-
VS/S-stage and G-stage virtual-memory system as specified by the RISC-V privileged specification to allow software flexibility to use a common page table for the CPU MMU as well as the IOMMU or to use a separate page table for the IOMMU.
-
Up to 57-bit virtual-address width, 56-bit system-physical-address, and 59-bit guest-physical-address width.
-
Hardware updating of PTE Accessed and Dirty bits.
-
Identifying MSI writes and MSI address translation, using MSI page tables, to redirect MSIs to interrupt files in an IMSIC using MSI PTEs in write-through mode and redirecting MSI to memory-resident-interrupt-files using MSI PTEs in MRIF mode.
-
Svnapot and Svpbmt extension.
-
PCIe ATS and PRI services. Support for translating an IOVA to a GPA instead of a SPA in response to a translation request.
-
A hardware performance monitor (HPM).
-
MSI and wire-based-interrupts to request service from software.
-
A register interface for software to request an address translation to support debug.
Supported features may be discovered using the capabilities
register [CAP].