# A Least-Privilege Memory Protection Model for Modern Hardware

Reto Achermann, Nora Hossle, Lukas Humbel, Daniel Schwyn, David Cock, Timothy Roscoe

Systems Group, Department of Computer Science, ETH Zurich

## Abstract

We present a new least-privilege-based model of addressing on which to base memory management functionality in an OS for modern computers like phones or server-based accelerators. Existing software assumptions do not account for heterogeneous cores with different views of the address space, leading to the related problems of numerous security bugs in memory management code (for example programming IOMMUs), and an inability of mainstream OSes to securely manage the complete set of hardware resources on, say, a phone System-on-Chip.

Our new work is based on a recent formal model of address translation hardware which views the machine as a configurable network of address spaces. We refine this to capture existing address translation hardware from modern SoCs and accelerators at a sufficiently fine granularity to model minimal rights both to access memory and configure translation hardware. We then build an executable specification in Haskell, which expresses the model and metadata structures in terms of partitioned capabilities. Finally, we show a fully functional implementation of the model in C created by extending the capability system of the Barrelfish research OS.

Our evaluation shows that our unoptimized implementation has performance comparable to (and in some cases better than) that of the Linux virtual memory system, despite both capturing all the functionality of modern hardware addressing and enabling least-privilege, decentralized authority to access physical memory and devices.

## 1. Introduction

Both modern, fully-verified operating systems and traditional production-quality kernels rely on a model of memory addressing and protection so simple it is rarely remarked on: RAM and devices reside at unique addresses in a single, shared physical address space, and all cores have homogeneous memory management units which translate from a virtual address space into these physical addresses. These MMUs are all configured by a single monolithic kernel.

Unfortunately, this model bears little relation to modern hardware. Modern platforms like phone SoCs violate the assumption of a single physical address space. A modern computer is, in reality, a network of address spaces with ad-hoc address translation functions between them, many configurable by sufficiently-privileged system software. Access to memory is performed by a variety of heterogeneous cores and I/O devices from different points in this network. Simply configuring a given platform correctly to maintain the assumption of a single physical address space, on which the verification is based, is by itself a complex and error-prone process.

The result is that traditional kernels suffer from numerous (and continuing) security bugs arising from incorrect assumptions about memory addressing in the system, while correctness proofs for verified kernels are cast into doubt by the existence of "cross-SoC" attacks.

Moreover, centralized authority over all memory access does not accommodate features like the secure co-processors and management engines standard in modern PCs as well as phone platforms. Authority to grant access to memory and devices needs to be decentralized, and this decentralization represented in the specification of the OS itself.

This paper develops an alternative model of hardware memory addressing, protection, and authorization which captures the richness, complexity, and diversity of modern hardware platforms. Our model can serve as a basis for formal verification of system software but also as an informal basis for designing correct memory management functionality.

The model is based on two guiding principles: *completeness*, meaning that we capture the full semantics of real addressing hardware without simplifying assumptions, and *least-privilege*, meaning that we represent individual authority to both access memory and modify translations at as fine a granularity as allowed by the hardware.

In the next section we elaborate on the mismatch between modern hardware and OS designs, and existing efforts to address it. In Section 3 we review the recent related work on which this paper builds, and lay out our methodology.

Our first contribution, in Section 4, is development of the model itself. We start from an abstract model of memory addressing and progressively refine it until it captures the salient features of modern memory hardware, including the rights to modify translations in multi-level page tables and custom protection units. We build an executable spec in Haskell which serves as a basis for an implementation in a real OS.

In Section 5 we describe how Linux might be extended with a subset of our model (foregoing least-privilege and centralizing authority in the kernel), and present a full implementation of the model by extending the capability system in the Barrelfish research OS. Our implementation runs on real hardware and can manage protection rights on a variety of hardware platforms. We also discuss the minimal overhead it incurs for metadata and bookkeeping.

arXiv:1908.08707v1 [cs.OS] 23 Aug 2019

In Section 6 we evaluate the performance of this memory system and show that, despite the richer and more faithful view of hardware it embodies, it provides comparable performance to the highly optimized, but less functional, Linux virtual memory system on identical hardware.

This paper also has a set of *non-goals*. First, we do not present any formally-verified OS software; our goal is rather to show a model which can be used as a replacement for the over-simplified addressing models currently used in the proofs for verified systems like seL4 and CertiKOS.

Second, we develop no new memory subsystem for Unix-like OSes. As we note below, reasoning about the correctness and security of a modern computer requires going beyond a Linux kernel to capture co-processors and intelligent devices. We sketch in Section 5.1 how a simplified version of our model might be retrofitted to Linux.

Finally, this is not a bug-finding paper. We do not aim to find problems in existing OS code (though we cite numerous examples from other work). Instead, we lay the foundations for a more faithful view of the hardware on which to base better system software.

## 2. Motivation and Related Work

We first review the implicit model of memory addressing used by existing OSes, and then explain with concrete examples why it no longer reflects hardware reality. We discuss the implications of this for both new, formally verified OSes and traditional kernels like Linux, and the limitations of existing approaches to the problem in both kinds of OS.

#### 2.1. The traditional view of memory

Address translation is a fundamental technique in computing, enabling relocation, demand paging, machine virtualization (either via processes or full virtual machines), shared memory, inter-process protection, and much other functionality.

Typically, the key abstraction employed is a *virtual address space*, accesses to which are translated into addresses within a unique, machine-wide *physical address space* by hardware mechanisms (TLBs, multi-level page tables, etc.).

Physical memory addresses can thus be used as unambiguous, system-wide identifiers for memory and devices, and so are also used to keep track of access rights: Linux maintains such a data structure for each page or frame of physical memory, while some microkernels like seL4 [14, 23] and Barrelfish [11, 15] use a capability system [26] to represent physical memory regions with access rights.

An OS must configure translation hardware and maintain these data structures to ensure correct and secure operation. For example, user programs should only be able to load and store to physical resources (memory, or memory-mapped I/O devices) the OS has granted them access rights to.

#### 2.2. Hardware doesn't conform to this view

Unfortunately, modern hardware platforms violate the assumptions in the traditional model above. They are composed of multiple, heterogeneous cores and devices, each of which can issue accesses to byte-addressable memory resources such as DRAM, non-volatile memory or device registers. Worse, there is no single "reference" physical address space [16]. Instead, a network of address spaces or buses is connected by address translation units which "route" accesses through the network.

This breaks most of the assumptions of the classical model: different cores and devices translate their virtual addresses into different physical address spaces, physical addresses can no longer be used as global identifiers without further scoping, address aliasing is not only possible but likely, and finally, software with access to translation units can reconfigure the *physical* address space underneath the system's MMUs.

For example, the Xeon Phi co-processor [21] implements a "system memory page table" which further translates physical (post-MMU) addresses from the accelerator cores into the host's PCI address space using a single, shared register array where each register controls the translation of a fixed 16GB page in the Xeon Phi's "physical" address space.
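
The extra translation step described above can be sketched in a few lines of Haskell. This is only an illustration: the names `Smpt` and `translateSmpt` are hypothetical, the register array is modelled as a sparse map, and the sketch assumes that the register index is simply the post-MMU address divided by the page size (reading "16GB" as 16 GiB), which may not match the real register layout.

```haskell
import qualified Data.Map as Map

-- Size of one system-memory page (assumption: 16 GiB).
smptPageSize :: Integer
smptPageSize = 16 * 1024 ^ 3

-- Hypothetical register array: register index -> base address in the
-- host's PCI address space. Unprogrammed registers are simply absent.
type Smpt = Map.Map Integer Integer

-- Translate a post-MMU ("physical") address from an accelerator core.
-- The register is selected by the page index; the offset is preserved.
translateSmpt :: Smpt -> Integer -> Maybe Integer
translateSmpt smpt pa
  | pa < 0    = Nothing
  | otherwise = do
      hostBase <- Map.lookup (pa `div` smptPageSize) smpt
      pure (hostBase + pa `mod` smptPageSize)

-- Made-up register contents, for illustration only.
exampleSmpt :: Smpt
exampleSmpt = Map.fromList [(0, 0x100000000), (1, 0x2000000000)]

main :: IO ()
main = do
  print (translateSmpt exampleSmpt 0x400001000)      -- falls in page 1
  print (translateSmpt exampleSmpt (5 * smptPageSize)) -- page 5 is unmapped
```

Because every core shares the one register array, reprogramming a single entry changes what a "physical" address means for all of them at once, which is the property the traditional model cannot express.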

Such additional layers of translation are commonplace in phone Systems-on-Chip like the NXP iMX8 [33], Texas Instruments OMAP [39], and NVIDIA Parker [31] processors. Such SoCs contain a variety of different processors with different physical address spaces, which overlap and intersect [16]. This is a deliberate, rational design choice – for example, it is important that a secure co-processor holding encryption keys has private memory that cannot be accessed from application cores, even in kernel mode.

I/O memory management units (IOMMUs, or System MMUs) translate addresses generated by accelerators and DMA-capable devices into a "canonical" system-wide physical address space. This allows user-space programs to share a virtual address space with a context on the device, but imposes a further complexity burden on the underlying OS, which must now ensure that IOMMUs are always correctly programmed. This code is fraught with complexity and consequent bugs and vulnerabilities, as it is also intended to provide protection from malicious memory accesses [29, 30, 28, 27]. The problem is likely to get worse with the proliferation of IOMMU designs built into GPUs, co-processors, and intelligent NICs.

OpenCL's Shared Virtual Memory extends the global memory region into the host memory region using three different types [22]. Similarly, NVIDIA's CUDA [32] and HSA [19] provide a unified view of memory. The same concerns apply here: the complexity of maintaining a shared virtual address space is pushed to system software, but remains.

Even memory controllers can violate the traditional model. Hillenbrand *et al.* [18] reconfigure memory controller configurations from system software to provide DRAM aliases for mitigating the performance effects of channel and bank interleaving. Proposals for "in-memory" or "near-data" processing [34] raise further questions for OS abstractions [9] and require a way to unambiguously refer to memory regardless of which module accesses it.

#### 2.3. Implications for current OS designs

Correctness arguments about OS code therefore rely on assumptions about the hardware that no longer hold. Proofs for the seL4 microkernel [23] assume a single, fixed, physical address space without other translation hardware, and provide no guarantees of safety in the presence of other cores or incorrectly programmed DMA devices. CertiKOS [17] proves functional correctness based on a model of memory accesses to abstract regions of private, shared or atomic memory, but again provides no proof in the presence of other translation units and heterogeneous cores. Even work on verifying memory consistency in the presence of translation only considers the simple case of virtual-to-physical mappings [36].

Proofs aside, the difficulty of getting complex memory addressing right in an OS is shown by the steady stream of related bugs and vulnerabilities in Linux [20], for example ignoring holes in huge pages (CVE-2017-16994), miscalculation of the number of affected pages (CVE-2014-3601), access rights for data pages (CVE-2014-9888), interactions of virtually mapped stack with DMA scatter lists (CVE-2017-8061), and handling of shadow page tables (CVE-2016-3960). Moreover, miscalculations, misinterpretations or underflows of addresses and offsets (Linux commits 9d8c3af3160, 7655739143, 29a90b708 and 5016bdb79), mixing up memory addresses with MSI-X interrupt ranges (Linux: 17f5b569e09cf), and IOMMU address space allocations (Linux: a15a519ed6e) cause unexpected behavior, crashes or memory corruption.

Faced with the complexity of hardware, a number of ad-hoc point solutions have appeared for specific cases, primarily GPUs, such as VAST [25], which uses compiler support to dynamically copy memory to and from the GPU, and Mosaic [8], which provides support for multiple sizes of page translation in a shared virtual address space between CPU and GPU. In DVMT [3], applications request physical frames from the OS that have specified properties. The system allows applications to customize how the virtual-to-physical mapping is set up by registering a TLB miss handler for the special DVMT range. The CBuf [35] system globally manages virtual and physical memory, focusing on efficient sharing and moving data between protection domains. CBuf unifies shared memory, memory allocation and system-wide physical memory allocation.

All these approaches aim to simplify user code, at the cost of OS complexity. In contrast, our work is a response to this complexity: the central OS abstraction of a single, shared, global physical address space, combined with straightforward translations to it from virtual address spaces, is inadequate for a secure and reliable OS running on modern hardware. We need a richer model of addressing, and this paper is based on one which views address spaces as nodes in a network of translation units.

## 3. Methodology

Our new model builds on the existing *decoding net* model of Achermann *et al.* [1, 2], which has been shown to provide a precise formal model of many of the sorts of systems we consider in this work: Multi-socket NUMA systems, ARM SoCs, plug-in accelerators, etc.

Achermann *et al.* model the addressing structure of a system as a directed graph, where nodes represent (virtual or physical) address spaces or devices (including RAM), and edges the translation of *AS-local* addresses into other ASs or devices. The graph is a set of nodes, defined as an abstract datatype so:

$$\begin{aligned}
name &= Name \ nodeid \ address \\
node &= Node \quad accept :: \{address\} \\
     &\phantom{{}= Node \quad} translate :: address \rightarrow \{name\}
\end{aligned}$$

Their model distinguishes *local* names (*address*), relative to some address space, and *global* names (*name*), which qualify a local name with its enclosing address space. Each node may **accept** a set of (local) addresses, and/or **translate** them to one or more global names (addresses in other address spaces).
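
To make the datatype concrete, the following is a minimal Haskell sketch of a decoding net and of resolving a name through it. It is an illustration rather than the formalization of [1, 2]: `resolve`, `exampleNet`, the step bound and the use of a predicate for **accept** (instead of an explicit set of addresses) are assumptions made here for brevity.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type NodeId  = Int
type Address = Integer

-- A global name: a local address qualified by its enclosing address space.
data Name = Name NodeId Address
  deriving (Eq, Ord, Show)

-- A node may accept local addresses and/or translate them to other names.
data Node = Node
  { accept    :: Address -> Bool
  , translate :: Address -> Set.Set Name
  }

type Net = Map.Map NodeId Node

-- Follow translations from a name and collect every name at which the
-- access is accepted. The step bound guards against cyclic nets.
resolve :: Int -> Net -> Name -> Set.Set Name
resolve fuel net name@(Name nid addr)
  | fuel <= 0 = Set.empty
  | otherwise = case Map.lookup nid net of
      Nothing   -> Set.empty
      Just node ->
        let here    = if accept node addr then Set.singleton name else Set.empty
            onwards = Set.unions [ resolve (fuel - 1) net n
                                 | n <- Set.toList (translate node addr) ]
        in Set.union here onwards

-- Node 0: a virtual address space mapping one 4 KiB page into node 1.
-- Node 1: a RAM-like node that accepts addresses below 0x10000.
exampleNet :: Net
exampleNet = Map.fromList
  [ (0, Node (const False) vspace)
  , (1, Node (\a -> a >= 0 && a < 0x10000) (const Set.empty))
  ]
  where
    vspace a
      | a >= 0x1000 && a < 0x2000 = Set.singleton (Name 1 (a + 0x8000))
      | otherwise                 = Set.empty

main :: IO ()
main = do
  print (Set.toList (resolve 8 exampleNet (Name 0 0x1234)))  -- mapped page
  print (Set.toList (resolve 8 exampleNet (Name 0 0x5000)))  -- unmapped
```

Returning a set of names rather than a single one reflects that **translate** may yield several global names for the same local address.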

This existing model is a long way from being a basis for an operational system. In Section 4.1 we add two important features: dynamic configuration of the **translate** function which captures how real translation units can be programmed, and *rights* corresponding to the ability for software processes to configure such units. We model the complex network of interacting address spaces, identify and label the necessary divisions of authority as finely as possible, following the principle of least privilege.

We adopt a methodology strongly influenced by the successful combination of *refinement* and *executable specification* used in the seL4 project.

Specifically, we begin by identifying all relevant *objects* (page tables, address spaces, ...), the *subjects* that manipulate them (processes, the kernel, devices, ...), and which *authority* each subject exercises over an object (e.g. in mapping a frame to a virtual address). These are expressed in an *access-control matrix* (following Lampson [24]) which forms our *abstract specification*, analogous to the high-level *security* policy (integrity) shown to be refined (correctly implemented) all the way down to compiled binaries for seL4 [38].

Again, as in seL4 [12], we next develop an executable specification in Haskell (see Section 4.2), expressing subjects, objects, and authority as first-class objects, permitting rapid prototyping without giving up strong formal semantics. Correspondence between abstract and executable models is thus far by inspection and careful construction.

Finally, we show (again with precedent [40]) that the executable model (and hence the abstract model) permits multiple high-performance implementations: In the Barrelfish OS (expressing *rows* of the access matrix with capabilities, see Section 5.2), and in Linux (collapsing distinct authorities held by the kernel, and taking *columns* as access-control lists, see Section 5.1). Barrelfish and seL4 have closely related capability-based resource management and authorization systems and our implementation transfers naturally to seL4; Barrelfish is *currently* a better platform for our work, due to its support for multiprocessing and heterogeneous hardware.

By adopting a proven methodology, we can be confident that the resulting artifact is compatible with an seL4-style verification, and could thus serve as a more accurate replacement for the hardware model underlying the seL4 or CertiKOS proofs. Simultaneously, by careful selection of an abstract model (the access-control matrix) and through the use of refinement, our model is not specific to a particular implementation.

## 4. Model

We derive our abstract, formal model from the existing decoding net model in two steps. First, we extend the model to include dynamic behavior (updating translations), and express the required authority using an access-control matrix. Second, we build a (still relatively abstract) executable specification in Haskell, allowing us to reason concretely about implementation trade-offs.

#### 4.1. Authority and Dynamic Behaviour

Decoding nets are static: they represent the *current* state of the system. To describe the dynamic behavior of a system, we add an abstraction above the decoding net, consisting of a set of (dynamic) *address spaces*. The state of the system is then expressed as a function from each address space to the mapping node representing its current configuration:

$$configuration = address \ space \rightarrow node$$

We can then express the *configuration space* of an address space, as a set of possible configurations:

$$config \ space = address \ space \to \{node\}$$

The configuration space of a page table in a system with 4 KiB translation granularity would, for example, only include nodes that map all addresses in any naturally-aligned 4 KiB region contiguously. We will use the configuration space to express allowable system states according to a security property.
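
The two definitions above can be sketched in Haskell as follows. This is a simplified illustration under stated assumptions: a node is reduced to a list of contiguous regions (the hypothetical `Region` type), a configuration space is represented by a membership predicate instead of an explicit set of nodes, and overlapping regions are not checked.

```haskell
import qualified Data.Map as Map

-- Hypothetical simplified node: a list of contiguous mapped regions.
data Region = Region
  { srcBase :: Integer  -- base address local to this address space
  , len     :: Integer  -- length in bytes
  , dstBase :: Integer  -- base address in the target address space
  } deriving (Eq, Show)

type Node          = [Region]
type AddressSpace  = String

-- configuration = address space -> node
type Configuration = Map.Map AddressSpace Node

-- config space = address space -> {node}, with the set of nodes
-- represented by its membership predicate.
type ConfigSpace   = AddressSpace -> Node -> Bool

pageSize :: Integer
pageSize = 4096

-- A node belongs to the page-table configuration space if every region
-- is non-empty and its source, length and target are 4 KiB aligned.
pageTableNode :: Node -> Bool
pageTableNode = all ok
  where
    ok (Region s n d) = n > 0 && all aligned [s, n, d]
    aligned x         = x `mod` pageSize == 0

-- Every address space in this sketch is treated as a page table.
pageTableSpace :: ConfigSpace
pageTableSpace _ = pageTableNode

-- A configuration is allowed if each address space's current node lies
-- within that address space's configuration space.
allowed :: ConfigSpace -> Configuration -> Bool
allowed space cfg = and [ space name node | (name, node) <- Map.toList cfg ]

main :: IO ()
main = do
  let good = Map.fromList [("vspace", [Region 0x1000 0x2000 0x80000])]
      bad  = Map.fromList [("vspace", [Region 0x1000 0x0800 0x80000])]
  print (allowed pageTableSpace good)  -- page-aligned mapping
  print (allowed pageTableSpace bad)   -- 2 KiB mapping, not expressible
```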

At this level of abstraction, state transitions are simply changes in the current configuration of the address spaces:

**Authority** Consider Figure 1, representing the general case of an update to an intermediate address space (for example the intermediate physical address, IPA, in a two-stage translation system). We identify two distinct rights (authorities): the **map** right, or the right to change the meaning of an IPA by changing its mapping; and the **grant** right, or the right to grant access (by mapping) to some range of physical addresses.

Figure 1: Mappings between address spaces showing grant and map rights of mapped segments.

Figure 2: Address spaces in a system with two PCI devices

These two rights do not necessarily go together.

Consider Figure 2, showing the address-space structure of a system with two PCI devices: a DMA engine and an Intel Xeon Phi co-processor. Imagine that we wish to establish a shared mapping to allow a process on a Xeon Phi core to receive DMA transfers (e.g. network packets) into a buffer allocated to it in the on-board GDDR.

The process 'owns' the buffer, and has the ability to call recv(), triggering a DMA transfer. We interpret this as the process having the right to **grant** access (temporarily) to the DMA core. The user-level process, however, clearly should not have the ability to modify the IOMMU mappings of the DMA core at will (or its own, for that matter). That is, it does not have the **map** right on the relevant address space.

What is needed is some agent (hereafter a *subject*, in standard authority-control terminology) with both the **grant** right on the buffer *object*, and the **map** right on the address space *object*. In a traditional monolithic kernel, both these rights are held (implicitly) by the kernel, which exercises them on behalf of the subjects. It is up to the kernel to maintain accurate bookkeeping to determine whether any such request is safe, typically using an ACL (access-control list) i.e. authority tied to the *subject*.

In a microkernel such as seL4 or Barrelfish, these rights are represented by capabilities, handed explicitly to one *subject*, in order to authorize the operation. In this case, authority is tied to the *object*. These are equivalent from the perspective of access control, differing only in implementation detail: in both cases, the same two basic authorities are present.

| subject/object   | DMA IOMMU | buffer |
|------------------|-----------|--------|
| IOMMU driver     | map       |        |
| Xeon Phi process |           | grant  |

Table 1: Access control matrix.

| Right      | Definition                                              |
|------------|---------------------------------------------------------|
| R1 (Grant) | The right to insert this object into some address space |
| R2 (Map)   | The right to insert some object into this address space |

Note that the 'virtual' and 'physical' address spaces of Figure 1 can be viewed as special cases of an intermediate address space: a top-level 'virtual' address space is simply one to which *nobody* has a **grant** right, and a 'physical' address space, e.g. DRAM, is one to which there exists no **map** right.

The standard representation of authority in systems is an access control matrix [24], such as that of Table 1. This can be read in rows: The IOMMU driver has the **map** *capability* to the IOMMU address space, and the process the **grant** capability to the buffer. Alternatively, reading down the columns gives the ACLs: the IOMMU records **map** *permission* for the driver, and for the buffer is recorded a **grant** *permission* for the process.

This access control matrix on maps and grants is our abstract model. A system is correct (secure) *statically*, if its current configuration is consistent with the access control matrix. It is secure *dynamically* if any possible transition, beginning in a secure state, must leave the system in a secure state.
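
The matrix and its two readings can be sketched in Haskell. The encoding below is an assumption made for illustration: the matrix is a set of (subject, object, authority) triples, and "consistent with the access control matrix" is interpreted as requiring that every installed mapping was authorized by a subject holding **grant** on the mapped object and a subject holding **map** on the address space. The `Mapping` record and all function names are hypothetical.

```haskell
import qualified Data.Set as Set

data Authority = Grant | Map
  deriving (Eq, Ord, Show)

type Subject = String
type Object  = String

-- The access-control matrix as a set of (subject, object, authority) cells.
type Matrix = Set.Set (Subject, Object, Authority)

-- The two cells of Table 1.
table1 :: Matrix
table1 = Set.fromList
  [ ("IOMMU driver",     "DMA IOMMU", Map)
  , ("Xeon Phi process", "buffer",    Grant)
  ]

-- Reading a row: the capabilities held by one subject.
capabilities :: Matrix -> Subject -> [(Object, Authority)]
capabilities m s = [ (o, a) | (s', o, a) <- Set.toList m, s' == s ]

-- Reading a column: the ACL attached to one object.
acl :: Matrix -> Object -> [(Subject, Authority)]
acl m o = [ (s, a) | (s, o', a) <- Set.toList m, o' == o ]

-- An installed mapping, recording who supplied each authority.
data Mapping = Mapping
  { granter :: Subject  -- subject exercising grant on the object
  , mapper  :: Subject  -- subject exercising map on the address space
  , object  :: Object
  , aspace  :: Object
  }

authorized :: Matrix -> Mapping -> Bool
authorized m (Mapping g p obj as') =
  Set.member (g, obj, Grant) m && Set.member (p, as', Map) m

-- Static security under the interpretation stated above.
staticallySecure :: Matrix -> [Mapping] -> Bool
staticallySecure m = all (authorized m)

main :: IO ()
main = do
  print (capabilities table1 "IOMMU driver")
  print (acl table1 "buffer")
  -- The process grants, the driver maps: authorized.
  print (staticallySecure table1
           [Mapping "Xeon Phi process" "IOMMU driver" "buffer" "DMA IOMMU"])
  -- The process tries to exercise map itself: not authorized.
  print (staticallySecure table1
           [Mapping "Xeon Phi process" "Xeon Phi process" "buffer" "DMA IOMMU"])
```

Dynamic security would then amount to showing that every transition between configurations preserves `staticallySecure`.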

#### 4.2. Executable Specification

Thus far we have expanded upon the existing decoding net model, giving us a dynamic access-control matrix formulation of the system's correctness property. Next, we implement a (still abstract) *reference monitor* [4] in Haskell, to aid in rapid prototyping of both model and implementation, and as an intermediate step in the process of *refinement* from abstract specification to operational, high-performance implementation. In this, we again take our example from the seL4 approach, which used just such an *executable specification* [13] to prototype the kernel prior to implementation in C.

Given our target environments of Linux and Barrelfish, operations and data structures for the reference monitor are named in a manner suggestive of an OS kernel, although other implementations would be possible. The most important detail added at this stage is to make **translation structures** explicitly visible. The reason for this is to allow us to express the fact that the translation state of the system depends, in a deterministic manner, on the contents of RAM and device registers (e.g. segment registers). This in turn allows us to express the invariant (necessary for integrity of the reference monitor) that no such objects are ever made accessible (i.e. mapped) outside the monitor itself:



Figure 3: Object Type Hierarchy and possible rights (green).

```
mappingTrace :: (Operation KernelState)
mappingTrace = do
    ...
    -- retype a RAM object to a Frame
    res <- retype RAM Disp Frame Disp
    -- retype another RAM object to a translation structure
    res <- retype RAM2 Disp TStructure Disp
    -- map the frame into the translation structure
    mapping1 <- Model.map TStructure Frame Disp
    ...
```

Figure 4: Mapping a RAM object

| Invariant             | Statement                                    |
|-----------------------|----------------------------------------------|
| I1 (Never Accessible) | Subjects can never access unmappable objects |

Note that (in contrast to the seL4 executable specification), the details of the translation structures are kept opaque at this point—we merely record that they exist at certain locations by dividing the mappable address spaces into *objects* (with terminology borrowed from Barrelfish):

Objects form a hierarchy (Figure 3) which defines how objects can be *derived* from each other. For example, translation structures (TStructure) are created by retyping RAM objects. The previous invariant now reduces to stating that no object of type TStructure is ever mapped. RAM is the base type for untyped memory, and a Frame is RAM that has been retyped to be mappable.
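
The derivation rules named in this paragraph can be written down directly. The sketch below covers only the three object types mentioned in the text (Figure 3 may contain more), and the function names `canRetype`, `mappable` and `invariantI1` are hypothetical.

```haskell
-- Only the object types named in the text; the full hierarchy of
-- Figure 3 may contain further types.
data ObjType = RAM | Frame | TStructure
  deriving (Eq, Show)

-- Permissible derivations: RAM is the base type, and both frames and
-- translation structures are created by retyping RAM.
canRetype :: ObjType -> ObjType -> Bool
canRetype RAM Frame      = True
canRetype RAM TStructure = True
canRetype _   _          = False

-- Only frames may be inserted into an address space.
mappable :: ObjType -> Bool
mappable Frame = True
mappable _     = False

-- Invariant I1 reduced to types: given the types of all currently
-- mapped objects, none of them may be unmappable (e.g. a TStructure).
invariantI1 :: [ObjType] -> Bool
invariantI1 = all mappable

main :: IO ()
main = do
  let types = [RAM, Frame, TStructure]
  print [ (from, to) | from <- types, to <- types, canRetype from to ]
  print (invariantI1 [Frame, Frame])
  print (invariantI1 [Frame, TStructure])
```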

In addition, the set of translation structures defines (again in an implementation-specific manner), the set of address spaces:

    AddressSpaceOf :: TStructure -> AddressSpace

Authority is likewise stored as explicit rights:

The monitor (kernel) state is a set of subjects (the term dispatcher being borrowed from Barrelfish), a *mapping database* (MDB) recording the derivation relation between objects, and a set of active address spaces:

```
data KernelState
    = KernelState (Set Dispatcher) MDB (Set AddrSpace)
```

The monitor is exercised (as for the seL4 specification) by direct calls to its API, such as in Figure 4. These are implemented within a state monad: Operation. Thus changes to the system's state are a sequence of API calls e.g. retype or map:

    data Operation a = Operation (State -> (a, State))
    instance Monad (Operation) where ...

Traces are thus sequences of such operations, corresponding to an observed sequence of KernelStates. Each of these states defines a static configuration of the decoding net. Operations include:

- retype converts an existing object into an object of a permissible subtype.
- map installs a mapping in a translation structure.
- copy copies the rights from one subject to another.

Contained within the set of all possible traces $T$, there is a set of correct traces $CT \subseteq T$ that correspond to sequences of consistent KernelStates. All other traces indicate that execution had to be aborted at some point, since an operation was applied that would otherwise have led to a transition to an inconsistent or disallowed system state.
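
The distinction between correct and aborted traces can be illustrated with a small, self-contained state monad. This is a sketch under assumptions, not the specification itself: the `Operation` type here adds an `Either String` result to model an aborted trace, the state `KState` is a cut-down stand-in for KernelState, dispatchers and rights are omitted, and `retype` and `mapFrame` take plain integer object identifiers.

```haskell
import qualified Data.Map as Map

data ObjType = RAM | Frame | TStructure deriving (Eq, Show)
type ObjId = Int

-- Cut-down monitor state: object types plus installed mappings.
data KState = KState
  { objects  :: Map.Map ObjId ObjType
  , mappings :: [(ObjId, ObjId)]   -- (translation structure, frame)
  } deriving Show

-- A state monad in which an operation may abort the trace.
newtype Operation a =
  Operation { runOp :: KState -> Either String (a, KState) }

instance Functor Operation where
  fmap f (Operation g) = Operation $ \s -> case g s of
    Left e        -> Left e
    Right (a, s') -> Right (f a, s')

instance Applicative Operation where
  pure a = Operation $ \s -> Right (a, s)
  Operation f <*> Operation g = Operation $ \s -> case f s of
    Left e        -> Left e
    Right (h, s') -> case g s' of
      Left e         -> Left e
      Right (a, s'') -> Right (h a, s'')

instance Monad Operation where
  Operation g >>= k = Operation $ \s -> case g s of
    Left e        -> Left e
    Right (a, s') -> runOp (k a) s'

abort :: String -> Operation a
abort e = Operation $ \_ -> Left e

typeOf :: ObjId -> Operation ObjType
typeOf o = Operation $ \s -> case Map.lookup o (objects s) of
  Nothing -> Left ("no such object " ++ show o)
  Just t  -> Right (t, s)

-- retype: only RAM may be converted, and only to a derived type.
retype :: ObjId -> ObjType -> Operation ()
retype o new = do
  old <- typeOf o
  if old == RAM && new /= RAM
    then Operation $ \s -> Right ((), s { objects = Map.insert o new (objects s) })
    else abort ("cannot retype " ++ show old ++ " to " ++ show new)

-- map: install a Frame in a TStructure; anything else aborts the trace.
mapFrame :: ObjId -> ObjId -> Operation ()
mapFrame ts fr = do
  tsTy <- typeOf ts
  frTy <- typeOf fr
  if tsTy == TStructure && frTy == Frame
    then Operation $ \s -> Right ((), s { mappings = (ts, fr) : mappings s })
    else abort "map requires a TStructure and a Frame"

initial :: KState
initial = KState (Map.fromList [(1, RAM), (2, RAM)]) []

goodTrace, badTrace :: Operation ()
goodTrace = retype 1 Frame      >> retype 2 TStructure >> mapFrame 2 1
badTrace  = retype 1 TStructure >> retype 2 TStructure >> mapFrame 2 1

main :: IO ()
main = do
  print (fmap snd (runOp goodTrace initial))  -- a correct trace
  print (fmap snd (runOp badTrace initial))   -- aborted: maps a TStructure
```

In this encoding a trace belongs to $CT$ exactly when running it yields a `Right` result; `badTrace` is rejected because its final step would map a translation structure, violating invariant I1.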

## 5. Implementation

We first describe how a subset of our model might be implemented in the Linux monolithic kernel, and then present a full implementation based on the open-source Barrelfish OS [11]. We refer to this new implementation as *Barrelfish/MAS*, where MAS refers to Multiple Address Spaces.

#### 5.1. Implementation in a Monolithic Kernel

Using Linux as an example, we describe how one could implement the least-privilege model and add support for multiple address spaces in a monolithic kernel.

The Linux kernel acts as the reference monitor and therefore assumes authority over all address spaces in the system: it can change address space mappings and grant access to memory at will. This happens mostly as a reaction to user-space requests such as mmap, but may also originate from policy decisions inside the kernel e.g. demand paging or page caches where the kernel decides to unmap memory from a process.

A possible way to achieve separation is through intercepting updates to the translation tables, which can be done using the para-virtualization subsystem. Whenever a translation table is changed, this gets converted into an API call to the reference monitor. This gives some form of separation, but without proper virtualization cannot be strictly enforced.

User-space processes may share memory by creating shared memory objects, which are implemented as files in a ramfs. Linux manages access to that shared memory object (and file-based objects in general) using standard UNIX permissions, which represent an access control list: for each object (resource) in the system there is a list of subjects and their rights. Consequently, every process with a matching user or group ID can access the shared memory object. A file can be opened, which gives the process a file descriptor that can be mmapped; hence a read right on a file can be seen as a grant right to the memory described by that file. After opening a file, the file descriptor can also be passed around, so the file descriptor also represents a grant right.

Most memory used by applications is not file backed, and hence referred to as anonymous memory. User-space processes have the access right to mapped anonymous memory. The process cannot explicitly hand over the grant right to anonymous memory to another process other than forking itself, where the child process inherits rights on resources from its parent.

A process can request memory to be mapped and unmapped from its address space. It may supply hints on what type of memory it would like, but in the end the Linux kernel decides where to map and what memory to grant.

Apart from tracking rights, our design also requires an understanding of multiple address spaces, and rights that refer to qualified names instead of addresses. In order for Linux to do this, we have to make sure that the physical frames are correctly identified in the presence of multiple address spaces (e.g. the kernel sees a RAM region at a different address than, say, a DMA engine). Each frame is identified by a physical frame number (PFN). We can use this PFN as the canonical name for the frame itself and the sparse memory model in Linux [5] to implement multiple address spaces holding physical resources as memory sections. For each frame of memory, Linux maintains a data structure tracking its use. We can augment the data structure to include type information to implement different memory object types (Linux already distinguishes between user and kernel objects). The relationship between PFNs and local physical addresses would need to be changed from a fixed offset to one that depends on the current configuration of translation hardware plus the currently executing core.

In conclusion, the Linux kernel acts as a central authority holding the grant and map rights to all address spaces and resources. A separation is possible when using the para-virtualization subsystem to intercept updates to translation tables. The kernel data structures could be modified to support the notion of multiple address spaces. Because it is very hard to support the full granularity of our model in a monolithic kernel, we chose to implement and evaluate it in a capability system, which is described in the next section.

#### 5.2. Implementation in Barrelfish/MAS

We chose the open-source Barrelfish OS [11] as the basis for our implementation because it uses an seL4-style capability system for authorization and resource management, but in contrast to seL4 has support for heterogeneous platforms and has drivers for IOMMUs and the Intel Xeon Phi co-processor, thus providing a real-world example of complex addressing.

We describe the relevant parts of our implementation in *Barrelfish/MAS*: the capability system that supports multiple address spaces (§ 5.2.1), runtime support implemented by generating code for known translations and maintaining a graph of configurable nodes (§ 5.2.2), and finally the adaptation of user-space device drivers (§ 5.2.3).

**5.2.1. Capability System** Barrelfish manages physical resources using a capability system for naming, access control, and accounting of objects in a single physical address space. We describe the *Barrelfish/MAS* capability system as a whole here, since a clear description of the original Barrelfish capability system has not been published. As in seL4 [14], capabilities are *typed* to indicate what can be done with the memory they refer to; rules dictate valid *retype* operations (e.g. retyping RAM to a Frame).

*Barrelfish/MAS* builds on Barrelfish by adding multiple address spaces and having capabilities which refer to memory objects hold the object's canonical base name, the size of the object they are referring to, as well as its type and rights.

*Barrelfish/MAS* is a *partitioned capability system*: capabilities are themselves stored in memory-resident objects, but these are *unmappable*, ensuring that no user-space process can forge capabilities by writing to memory locations. A process holding a capability obtains a certain set of rights on the object referred to by the capability. These rights can be exercised by invoking the reference monitor API, which is implemented as a system call interface.

Capabilities encode the canonical names of the objects they refer to, implemented as a struct with two fields: the address space identifier (ASID) and the address within the address space. An optimized variant packs both values into a 64-bit integer, providing support for a 16-bit ASID and a 48-bit address, which is sufficient for current platforms.
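The packed variant can be illustrated with a minimal Python sketch. The field widths (16-bit ASID, 48-bit address) come from the text; placing the ASID in the upper bits, as well as the names `CanonicalName`, `pack` and `unpack`, are assumptions made for illustration and not taken from the *Barrelfish/MAS* sources.

```python
from typing import NamedTuple

ASID_BITS = 16
ADDR_BITS = 48
ASID_MASK = (1 << ASID_BITS) - 1
ADDR_MASK = (1 << ADDR_BITS) - 1


class CanonicalName(NamedTuple):
    asid: int  # address space identifier
    addr: int  # address within that address space


def pack(name: CanonicalName) -> int:
    """Pack a canonical name into one 64-bit word (ASID in the upper bits: an assumption)."""
    if not 0 <= name.asid <= ASID_MASK:
        raise ValueError("ASID does not fit in 16 bits")
    if not 0 <= name.addr <= ADDR_MASK:
        raise ValueError("address does not fit in 48 bits")
    return (name.asid << ADDR_BITS) | name.addr


def unpack(word: int) -> CanonicalName:
    """Recover the (ASID, address) pair from a packed 64-bit word."""
    return CanonicalName(asid=(word >> ADDR_BITS) & ASID_MASK,
                         addr=word & ADDR_MASK)
```

With this particular layout, comparing two packed words as integers gives the same result as comparing (ASID, address) pairs lexicographically, which is convenient for an ordered index.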

ASIDs nevertheless are a limited resource, and their allocation must be managed accordingly to avoid ASID exhaustion. We use a dedicated capability to manage ASIDs, where a new range of ASIDs can be allocated by retyping a larger range of ASIDs.
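As a rough illustration of allocating ASIDs by retyping, the following Python sketch carves sub-ranges out of a larger range. The first-fit policy and the names `AsidRange`, `retype` and `AsidExhausted` are hypothetical; the text only states that a new range is obtained by retyping a larger one.

```python
from dataclasses import dataclass, field
from typing import List


class AsidExhausted(Exception):
    """Raised when a range has no room left for the requested sub-range."""


@dataclass
class AsidRange:
    base: int
    count: int
    children: List["AsidRange"] = field(default_factory=list)

    def retype(self, count: int) -> "AsidRange":
        """Carve a non-overlapping sub-range of `count` ASIDs out of this range (first fit)."""
        next_free = self.base
        for child in sorted(self.children, key=lambda c: c.base):
            if child.base - next_free >= count:
                break  # the gap before this child is large enough
            next_free = child.base + child.count
        if next_free + count > self.base + self.count:
            raise AsidExhausted("no room for the requested ASID range")
        child = AsidRange(next_free, count)
        self.children.append(child)
        return child
```

Because every sub-range is derived from exactly one parent range, exhaustion is detected locally at the point of the retype.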

There may be multiple capabilities pointing to the same object, but there is always at least one capability for every given byte in memory.

**The mapping database:** *Barrelfish/MAS* manages a *mapping database*, a data structure that allows efficient lookup of all related capabilities given the name of the object they refer to. The mapping database is a balanced tree structure of all capabilities present in the database.

The mapping database stores the capabilities in a canonical ordering, allowing efficient lookup and range query operators such as "overlap" and "contains". The canonical ordering of the capabilities is defined on their canonical name (address space and address), size and type. Capabilities to objects with a smaller name appear first. If the base names of two capabilities are equal, then the larger object comes first. Finally, all other attributes being equal, the type of the capability defines the order: types higher up in the hierarchy come first. This ordering is important because the *descendant* relation can be defined on top of it. We say a capability B is a descendant of capability A if A is smaller than B in the canonical ordering and B is fully contained in the range covered by A:

$$\mathit{descendant}\ c_1\ c_2 \leftrightarrow c_1 \cap c_2 = c_2 \wedge c_1.\mathit{type} \leq c_2.\mathit{type}$$

The mapping database can therefore be traversed to find the descendants of a capability (successors) and ancestors (predecessors) efficiently.
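The ordering and the descendant relation can be made concrete with a small Python sketch. A sorted list stands in for the balanced tree, and the three-entry type hierarchy `TYPE_RANK` is a hypothetical placeholder (lower rank meaning higher up in the hierarchy); neither is taken from the actual implementation.

```python
import bisect
from typing import List, NamedTuple

# Hypothetical type hierarchy: a lower rank is higher up in the hierarchy.
TYPE_RANK = {"RAM": 0, "Frame": 1, "Mapping": 2}


class Cap(NamedTuple):
    asid: int
    base: int
    size: int
    type: str


def order_key(c: Cap):
    """Smaller name first, then larger object first, then higher-up type first."""
    return (c.asid, c.base, -c.size, TYPE_RANK[c.type])


def is_descendant(c1: Cap, c2: Cap) -> bool:
    """True if c2 lies within the range of c1 and c1's type is not below c2's."""
    contained = (c1.asid == c2.asid
                 and c1.base <= c2.base
                 and c2.base + c2.size <= c1.base + c1.size)
    return contained and TYPE_RANK[c1.type] <= TYPE_RANK[c2.type]


def descendants(db: List[Cap], anc: Cap) -> List[Cap]:
    """Scan the successors of `anc` in a list sorted by order_key."""
    start = bisect.bisect_right([order_key(c) for c in db], order_key(anc))
    found = []
    for c in db[start:]:
        if c.asid != anc.asid or c.base >= anc.base + anc.size:
            break  # past the range covered by anc: no further descendants
        if is_descendant(anc, c):
            found.append(c)
    return found
```

The scan can stop at the first capability whose base lies beyond the ancestor's range, which is what makes the canonical ordering useful for range queries.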

It is important that the ordering relation is in line with the retype operation: if B can be retyped from A, B must come after A in the canonical ordering. Our definition fulfils this, since a retype can increase the name, decrease the size or change the type to a subtype.

With the help of the mapping database, we can efficiently find all the ancestors and descendants of a particular object.

**Page tables and address spaces:** *Barrelfish/MAS* has a distinct capability type for each hardware-defined translation table, e.g. one for each of the four page-table levels of the x86\_64 architecture. Each of these capability types is a translation structure in the sense of the executable spec.

User processes can construct their own page tables through capability invocations. This is safe, because the invocations only allow operations resulting in correct-by-construction page tables, and processes can only map resources for which they hold a capability with the grant right to it.

Since a page table defines an address space, we can *derive* an *address space* capability from a page table. This address space represents the input address space of the translation table. For each translation table, the spanning address space can only be derived once.

When we delete a page table, we use the ASID stored with it to query the mapping database for address space capabilities and start a recursive deletion. This ensures that upon deletion of the page table, the address space is deleted including all *segments* within it. This is equivalent to *revoking* all descendants of the address space capability and then deleting it.

**Tracking mappings:** When access to an object is revoked, all positions where this object has been mapped must be found and removed. We manage this bookkeeping using the capability system. For each mappable object there exists a corresponding *mapping* capability. The mapping capability is a descendant of (retyped from) the mapped object, and hence we can find all locations where an object is mapped by walking the mapping database in ascending order. Each mapping capability indicates the page table object and slot range where the object has been mapped.

The same technique is used to track mappings of multi-level page tables. For each valid entry in a page table there exists a mapping capability. When the last mapping capability is deleted, the page table entry is invalidated.
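The bookkeeping can be sketched in Python as follows. For brevity, a dictionary keyed by the object's name stands in for the walk over the mapping database, and all class and method names (`PageTable`, `MappingCap`, `MappingTracker`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageTable:
    # slot index -> canonical name of the object mapped in that slot
    entries: Dict[int, int] = field(default_factory=dict)


@dataclass
class MappingCap:
    obj: int          # canonical name of the mapped object
    table: PageTable  # page table holding the mapping
    first_slot: int
    num_slots: int


class MappingTracker:
    def __init__(self) -> None:
        # Stand-in for walking the mapping database (an assumption of this sketch).
        self._by_object: Dict[int, List[MappingCap]] = {}

    def map(self, obj: int, table: PageTable,
            first_slot: int, num_slots: int) -> MappingCap:
        """Install a mapping and record it in a mapping capability."""
        slots = range(first_slot, first_slot + num_slots)
        if any(s in table.entries for s in slots):
            raise ValueError("slot range already in use")
        for s in slots:
            table.entries[s] = obj
        cap = MappingCap(obj, table, first_slot, num_slots)
        self._by_object.setdefault(obj, []).append(cap)
        return cap

    def revoke(self, obj: int) -> int:
        """Remove every mapping of `obj`; return the number of cleared slots."""
        cleared = 0
        for cap in self._by_object.pop(obj, []):
            for s in range(cap.first_slot, cap.first_slot + cap.num_slots):
                del cap.table.entries[s]
                cleared += 1
        return cleared
```

Revocation never has to search page tables: every place an object is mapped is reachable from its mapping capabilities.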

**5.2.2. Runtime Support** In Figure 2 we draw a diagram of the different address spaces present in a heterogeneous multiprocessor system. To acquire the access right to a particular memory object, a sequence of translations needs to be set up. Which address spaces need to be configured depends on the system topology, which may only be discovered at runtime.

**SoC-Platforms** The topology of SoC platforms is typically fixed and known at compile time. We can therefore enumerate all address spaces of the SoC and pre-compute all fixed translations and store a graph of the topology consisting of configurable and leaf address spaces in the kernel. We can *generate* core-specific translation functions that convert local addresses to global names and vice versa. The name can then be resolved by walking the translation structures of the configurable address spaces until it reaches an accepting address space or there is no translation. We evaluate this scenario in § 6.4.

**Device Discovery** In general, the information about the hardware topology and its address spaces may be incomplete and must be discovered during runtime. For instance, the presence of an IOMMU is known after parsing the ACPI tables and the Xeon Phi co-processor of our example (Figure 2) is discovered by PCI, and lastly the size of the GDDR available on the co-processor is known by the driver. The state of the model is therefore populated by multiple sources of information.

In Barrelfish, there exists the system knowledge base (SKB) [37] which stores information about the system. The SKB in a nutshell is a database storing facts about the system which can be queried using Prolog. We implement the model inside the SKB. During device discovery, processes insert information about the discovered address spaces and how they are connected with each other.

**Model Queries** Device drivers must configure translation units to enable devices to access memory. Booting a core on the Xeon Phi co-processor is a particular example: application modules to be run on the co-processor may reside in host RAM. To make this accessible from the co-processor, the IOMMU and the SMPT must be configured accordingly. This information can be obtained by querying the SKB, which returns a list of address spaces that must be configured. The query is based on a shortest path algorithm between the address space of the Xeon Phi core and the address space where host RAM resides.

Running the queries in the SKB is costly (§ 6.3). We provide a library that caches the graph representation of configurable address spaces and runs a shortest-path search on it.
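A minimal Python sketch of such a query is shown below, using Dijkstra's algorithm over an adjacency matrix with a parent array. The node numbering, the unit edge weights and the function names are illustrative assumptions rather than the library's actual interface.

```python
from typing import List, Optional, Set

INF = float("inf")  # marks "no edge" in the adjacency matrix


def shortest_path(adj: List[List[float]], src: int, dst: int) -> Optional[List[int]]:
    """Dijkstra over an adjacency matrix; returns the node sequence or None."""
    n = len(adj)
    dist = [INF] * n
    parent = [-1] * n
    done = [False] * n
    dist[src] = 0
    for _ in range(n):
        # Pick the closest node that has not been finalized yet.
        u = -1
        for v in range(n):
            if not done[v] and (u == -1 or dist[v] < dist[u]):
                u = v
        if u == -1 or dist[u] == INF:
            break  # remaining nodes are unreachable
        done[u] = True
        if u == dst:
            break
        for v in range(n):
            w = adj[u][v]
            if w != INF and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                parent[v] = u
    if dist[dst] == INF:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return path[::-1]


def nodes_to_configure(adj: List[List[float]], configurable: Set[int],
                       src: int, dst: int) -> Optional[List[int]]:
    """Configurable address spaces on the shortest path from src to dst."""
    path = shortest_path(adj, src, dst)
    if path is None:
        return None
    return [node for node in path if node in configurable]
```

For a chain such as co-processor core, SMPT, IOMMU and system bus, with the SMPT and IOMMU marked configurable, the query returns exactly those two nodes in traversal order.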

The result of the query is a list of address spaces that need to be configured to make the memory object accessible from the source address space. This blueprint is then converted by the user-space process into a sequence of capability operations to allocate memory, set up translation structures and perform the relevant mappings. The model queries only provide a 'hint' on what needs to be configured, while the capability system enforces the authorization required to perform the required mappings. We evaluate the latency of this scenario in § 6.2.

**Address Resolution** While the SKB stores the address space topology of the system, it does not store the actual translations of configurable address spaces. An address can be fully resolved by performing the previous query and, instead of changing the configuration of the address spaces, using the translation structures to calculate where each address space translates the address.
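Resolving an address by following translation structures can be sketched as below. The `Region` record, the dictionary of per-address-space translations and the hop limit are assumptions of this Python sketch; the real translation structures are the page-table-like objects managed through capabilities.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Region(NamedTuple):
    base: int      # first input address covered by this translation
    size: int
    dst_asid: int  # address space the region translates into
    dst_base: int  # output address corresponding to `base`


def resolve(translations: Dict[int, List[Region]], accepting: Set[int],
            asid: int, addr: int, max_hops: int = 32) -> Optional[Tuple[int, int]]:
    """Follow translations until an accepting address space is reached.

    Returns the final (asid, address) pair, or None if some address space on
    the way has no translation for the address (or the hop limit is hit).
    """
    for _ in range(max_hops):
        if asid in accepting:
            return (asid, addr)
        for r in translations.get(asid, []):
            if r.base <= addr < r.base + r.size:
                asid, addr = r.dst_asid, r.dst_base + (addr - r.base)
                break
        else:
            return None  # no region of this address space covers the address
    return None
```

The hop limit guards against cyclic configurations, which a network of configurable address spaces does not rule out by construction.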

**5.2.3. Device Driver Adaptation** We adapt the user-space device drivers in Barrelfish to use the runtime support described above when configuring their devices and allocating in-memory data structures. In *Barrelfish/MAS*, device drivers run in user-space. They are started by a device manager which passes a set of capabilities including a capability to the device registers and the IOMMU IPC endpoint. The driver can then use capability operations to map the device registers into its address space or program the IOMMU translation through the IOMMU IPC endpoint. Drivers for devices with additional memory, such as the Xeon Phi with GDDR, receive a capability to the leaf address space, from which the driver can then retype new RAM capabilities.

Memory accesses from the device might be translated by the IOMMU. To set up a shared buffer between the driver and the device, the driver needs to: *i*) allocate memory, *ii*) map the memory into the driver's own address space, *iii*) query the graph to determine the necessary configuration steps, and *iv*) follow the result to map the memory into the device's address space. For an evaluation of these steps see § 6.2.

To set up the IOMMU, we implemented two alternatives: *i*) an RPC to the IOMMU reference monitor that manages the translations, or *ii*) direct capability invocations on the translation table used by the IOMMU for this device. This is safe, because the capability system enforces that only memory for which the driver has a capability can be mapped.

## 6. Evaluation

We evaluate our implementation by showing memory management performance comparable to Linux ( $\S$  6.1) and applicability to a real-world scenario using co-processors ( $\S$  6.2). We also show the scaling behavior of the model queries ( $\S$  6.3) and demonstrate how the model can also be used in pathological topologies using simulators ( $\S$  6.4). Finally, we analyze the space-time overheads of our implementation ( $\S$  6.5).

All performance evaluations use a dual-socket Intel Xeon E5 v2 2600 ("Ivy Bridge") with 256GB of main memory. There are 10 cores per socket with HyperThreading, Turbo-Boost, and speed stepping disabled, and the system runs in "performance" mode. The system also has two Intel Xeon Phi co-processors ("Knights Corner"). All Linux experiments use Ubuntu 18.04LTS, with kernel version 4.15 and the latest patches for mitigating Meltdown and Spectre attacks.

#### 6.1. Memory operations

We compare the performance of *Barrelfish/MAS*'s memory subsystem against Linux with Spectre/Meltdown mitigation both enabled and disabled, using two microbenchmarks. *Barrelfish/MAS* has no mitigation measures.

Figure 5: Appel-Li benchmark on *Barrelfish/MAS* and Linux with and without Spectre/Meltdown mitigation (NS).

**6.1.1. The Appel and Li benchmark** [6] tests operations relevant to garbage collection and other non-paging tasks by measuring time to protect, and trap-and-unprotect pages of memory.

We run the benchmark with working sets of less than 2MB (512 pages). We measure Linux with four configurations: *i*) default TLB flush heuristic, and *ii*) always full TLB flush, each with Spectre/Meltdown mitigation both enabled and disabled. We benchmark *Barrelfish/MAS* in two ways: *i*) direct invocation of the mapping capability and *ii*) protecting the page through user-level data structures tracking the mapping. Note that *Barrelfish/MAS* does not support selective TLB flushing.

The results are shown in Figure 5. We observe that *Barrelfish/MAS* is consistently faster than Linux in all cases. The Spectre/Meltdown mitigation incurs a 45-53% slowdown. For both multi-page (*protN-trap-unprot*) and single page (*prot1-trap-unprot*) protect-trap-unprotect, *Barrelfish/MAS* is up to 4x faster than Linux. We observe a slight increase in execution time when full TLB flushes are enabled. The *Barrelfish/MAS* "Direct" results use the kernel primitives directly. This enables us to isolate the cost of user-space accounting, which accounts for 10-17% of the execution time.

**6.1.2. The map/protect/unmap benchmark** measures the performance of the primitive operations map, protect and unmap with respect to an increasing buffer size.

The benchmark works as follows: *i*) allocate a region of virtual memory and fault on it to map memory, *ii*) write-protect the entire virtual region, and *iii*) unmap the virtual memory region again. We time each operation separately. We measured different ways to map memory on Linux using mmap, shmat and shmfd and compare *Barrelfish/MAS* against the best performance we obtained on Linux for each operation and page size. For mapping and unmapping 4kB pages, this was passing a file descriptor obtained through shm\_open to mmap. For map/unmap with larger page sizes, shared memory segments (shmat, shmdt) performed best. Changing page protection was always fastest using mprotect. Again, we benchmark Linux with and without Spectre/Meltdown mitigation enabled. If possible, we do not measure the time for memory allocation as this is dominated by memset. On *Barrelfish/MAS* we use the high-level interfaces to include user-space book-keeping in the measurements.

Figure 6 shows the execution time per page of the three operations for an increasing buffer size and three page sizes. Enabling Spectre/Meltdown mitigation results in a slowdown of up to 2x for small page numbers. In all cases, the cost per page decreases as the number of pages increases, amortizing the system call cost.

*Map:* *Barrelfish/MAS* is able to match or outperform Linux in all but one case, with a significant difference when using large and huge pages.

*Protect:* The results are in line with the Appel and Li benchmarks above; *Barrelfish/MAS* outperforms Linux in all configurations.

*Unmap:* We observe very similar performance characteristics here. For small buffer sizes Linux is slightly faster; for larger buffers *Barrelfish/MAS* slightly outperforms Linux.

From these two microbenchmarks, we conclude that *Barrelfish/MAS* memory operations are competitive: capabilities and fast traps allow an efficient virtual memory interface despite splitting up larger mappings into multiple capability operations and syscalls. It is possible to build a fast and competitive memory system which still fully implements our fine-grained, least-privilege model.

#### 6.2. Complex hardware

We now profile the support for address space networks in *Barrelfish/MAS*, including memory mappings and model queries. We put the cost in the context of related operations a device driver has to perform.

We profile the boot process of the Xeon Phi Co-Processor on our server platform. All accesses from the co-processor to host RAM are translated multiple times, most notably:

$$CoreMMU \rightarrow SMPT \rightarrow IOMMU \rightarrow SystemBus$$

Each step must be configured correctly. We adapted the existing drivers for the co-processor, the system memory page table (SMPT), and the IOMMU to use our new capabilities and model queries. The MMU is managed by the kernel running on the co-processor cores. Resources are managed using the capability system which allows safe programming of translation tables.

First, we allocate 6MB from host RAM, and map this into the device driver's address space (equivalent to performing an anonymous mmap in Linux). Then we copy the boot image into this allocated buffer. We then query the model representation to determine which translation units must be reprogrammed. We map the buffer into the device's IOMMU address space, and then map the resulting segment into the SMPT address space. We compare the IOMMU mapping in two cases. In the first, we ask the IOMMU driver to perform the mapping; this ad-hoc approach corresponds to the current state of the art. In the second, enabled by our model, we perform the mapping directly with capability invocations.



Figure 6: Comparison of memory operations on *Barrelfish/MAS* and Linux with and without Spectre/Meltdown mitigation (NS). Execution time per page in  $\mu$ s. Buffer sizes in powers of two from 4kB to 64GB.



Figure 7: Profiling Configuration Time for a Xeon Phi Co-Processor comparing Local Syscalls and RPCs to Perform IOMMU Mappings with Linux mmap'ing a buffer of the same size for perspective.

As Figure 7 shows, the cost is dominated by memory allocation, which takes about  $625\mu$ s and involves an RPC to a memory server. Writing the buffer content using memcpy takes  $224\mu$ s. Determining the units to be configured to make the buffer available to the device takes  $71\mu$ s, using the C graph implementation. Setting up the IOMMU mapping is  $2\mu$ s (or  $32\mu$ s, when using RPC). Mapping the segment into the SMPT using a kernel driver takes  $5\mu$ s.

For comparison we also show the cost in Linux to mmap an anonymous 6MB region in a userspace process, which is equivalent in our implementation to allocating and mapping the buffer in the driver. We perform this operation slightly faster than Linux, but pay an additional $71\mu$s to dynamically determine the nodes that have to be configured. A less flexible approach might pre-compute or memoize this step, avoiding the latency at map time. Compared with the cost of allocating and writing the memory, the cost of setting up the IOMMU and SMPT mappings (together $7\mu$s) is negligible. Note that *in any system* an untrusted agent will have to perform some sort of invocation (such as a system call) to install these mappings. Despite our fine-grained rights and dynamic implementation, performance is comparable to Linux.

#### 6.3. Scaling

We now turn to the scaling properties of the model representation with respect to the system complexity. In real systems, we see ever-increasing numbers of cores and DMA-capable devices, but the diameter of the decoding net representation grows much more slowly, and rarely exceeds 10. This is true not only for x86 systems, but also for all the ARM SoCs we have encountered to date.

Figure 8: Cost of determining mappable nodes on an x86 system with a growing number of PCI devices.

We write a synthetic benchmark that simulates a system with an increasing number of PCIe devices, each of which has its own address space and translation unit, much like the Intel Xeon Phi described in the previous section. This grows the model state in two ways: the total number of address spaces, as well as the number of these that are configurable. Both grow linearly with the number of PCIe devices. We measure the time it takes to determine the configurable address spaces between a PCIe device and the system bus, a typical setup operation for a device driver that has to set up IOMMU and device-local translation structures. We evaluate two implementations: *i*) a Prolog implementation of the model using the EclipseCLP interpreter in Barrelfish, and *ii*) our C implementation based on a graph represented as an adjacency matrix. Both use Dijkstra's algorithm on the graph representation.

Figure 8 shows that, due to internal memory allocations, the performance of the EclipseCLP implementation scales linearly in the number of devices. The native C implementation, in contrast, shows almost constant performance. Small linear factors stem from walking the adjacency matrix and initializing the parent array.

We conclude that the cost of determining configurable nodes remains almost independent of the system complexity, as long as the graph is of low diameter and is maintained in an efficient data structure, suggesting that the routing calculation is feasible for modern hardware.

#### 6.4. Correctness on simulated platforms

In this qualitative evaluation we show that the model implementation is functional and performant even when run on simulated platforms with unusual address space topologies not supported by other systems. While these topologies are extreme, their envelope includes other real systems (such as those with secure co-processors) which are not handled by current systems.

We wrote a series of system descriptions for the ARM Fast Models simulator [7]. We use these descriptions to *i*) configure the simulator and *ii*) extract the topology of the memory subsystem. We then use this information to populate the address space model, which is used at compile time to generate operating system code and at runtime to query information about the memory system as in the previous evaluation. We consider four configurations, each consisting of two ARM Cortex-A57 clusters, with each cluster having its own memory map connecting to DRAM and other devices. The memory map is configured as follows:

1. *Uniform*: Uniform memory map between all clusters.
2. *Swapped*: Memory map contains two areas whose addresses are swapped (exchanged) between the two clusters.
3. *Private*: Each cluster has its own private memory region.
4. *Private Swapped*: A combination of the Private and Swapped configurations.

We know of no other current OS designs which can manage memory globally in all these cases. Popcorn Linux [10] and Barrelfish have limited support for case 3; while regular Linux and seL4 only support case 1.

*Barrelfish/MAS* is able to boot and manage memory on all platforms without modifications, regardless of the topology, by virtue of the capabilities used to refer to memory containing the canonical name of the object. Whenever an object is accessed, this canonical name is converted into a local address using a generated function.
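To illustrate what such a generated function might look like for the *Swapped* configuration, consider the following Python sketch. The region base addresses and size are made up, and taking cluster 0's view as the canonical name is an assumption of the sketch; the functions actually generated from the system descriptions may differ.

```python
# Hypothetical layout: two equally sized regions whose addresses are exchanged
# between the two clusters. Cluster 0's view is taken as canonical (assumption).
REGION_SIZE = 0x1000_0000
REGION_A = 0x8000_0000
REGION_B = 0x9000_0000


def _swap(addr: int) -> int:
    """Exchange addresses in region A with those in region B, and vice versa."""
    if REGION_A <= addr < REGION_A + REGION_SIZE:
        return addr - REGION_A + REGION_B
    if REGION_B <= addr < REGION_B + REGION_SIZE:
        return addr - REGION_B + REGION_A
    return addr  # addresses outside both regions are seen uniformly


def local_to_canonical(cluster: int, local_addr: int) -> int:
    """Convert an address as seen by `cluster` into the canonical name."""
    return local_addr if cluster == 0 else _swap(local_addr)


def canonical_to_local(cluster: int, name: int) -> int:
    """Convert a canonical name into the address `cluster` must use."""
    return name if cluster == 0 else _swap(name)
```

Since the swap is its own inverse, the two directions coincide here; with private regions or multi-step translations they would generally be distinct functions.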

#### 6.5. Space and time overheads

Finally, we analyze the time and space complexity of managing the physical resources of the system using capabilities in Barrelfish.

We are interested in the space overhead to store the capabilities, managing the lookup of capabilities in the mapping database, and creating new mappings.

In the implementation, capabilities occupy 64 bytes each. There is typically less than one capability for each frame of memory, as each capability can represent up to  $2^{64} - 1$  bytes of memory. The number of capabilities should grow sub-linearly with the size of available RAM. For each mapping, *Barrelfish/MAS* creates a capability for bookkeeping. The number of these mapping capabilities also grows sub-linearly in the total number of mapped frames. Large frames result in one mapping capability per page table that is spanned by the mapping. Since the 64-byte capability representation also includes all the pointers necessary to index the mapping database, the latter incurs no additional overhead. The index itself is a balanced tree; lookups are logarithmic in the total number of capabilities.

Overall, keeping track of memory resources with capabilities incurs a space overhead which grows at worst linearly in the available physical memory. Furthermore, the mapping database can be implemented and queried efficiently. Assuming the *Barrelfish/MAS* worst case of one capability per 4kB frame, this accounts for a 1.5% total memory overhead. In comparison, Linux manages a struct page per physical frame of up to 80 bytes in size, an overhead of almost 2%.
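The two percentages follow directly from the sizes quoted above, as the short Python check below shows. It only restates the arithmetic (64 or 80 bytes of metadata per 4kB frame); the constant and function names are illustrative.

```python
FRAME_SIZE = 4096        # bytes per 4kB frame
CAP_SIZE = 64            # bytes per capability (worst case: one per frame)
STRUCT_PAGE_SIZE = 80    # upper bound quoted for Linux's struct page


def overhead_percent(metadata_bytes: int, frame_bytes: int = FRAME_SIZE) -> float:
    """Metadata size as a percentage of the memory it describes."""
    return 100.0 * metadata_bytes / frame_bytes


print(f"capabilities: {overhead_percent(CAP_SIZE):.2f}%")          # 64 / 4096
print(f"struct page:  {overhead_percent(STRUCT_PAGE_SIZE):.2f}%")  # 80 / 4096
```

The worst-case figure assumes every 4kB frame has its own capability; larger capabilities covering many frames reduce the overhead accordingly.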

## 7. Conclusion

In this paper we have built on existing work in modelling the complex interacting address spaces in modern hardware by adopting the proven methodology of the seL4 project to produce a rigorous, no-stone-left-unturned model of memory management. Our model applies well-known concepts in access control, giving an abstract model amenable to implementation in capability-based systems (e.g. Barrelfish), as well as ACL-based systems such as Linux.

We have shown that it is possible to implement the model efficiently in an operating system delivering excellent memory management performance while at the same time offering a clean and safe way to deal with the complexity of the allocation and enforcement problem.

We have shown that the model can be used to configure real, complex (even pathological) systems, scales well, and introduces little overhead. Our model is a sound foundation for both fully verified systems and more reliable memory management in existing systems.

### References

- [1] Reto Achermann, Lukas Humbel, David Cock, and Timothy Roscoe. Formalizing Memory Accesses and Interrupts. In *Proceedings of the 2nd Workshop on Models for Formal Analysis of Real Systems*, MARS 2017, pages 66–116, 2017.
- [2] Reto Achermann, Lukas Humbel, David Cock, and Timothy Roscoe. Physical Addressing on Real Hardware in Isabelle/HOL. In *Interactive Theorem Proving*, ITP'18, pages 1–19, Oxford, United Kingdom, 2018. Springer International Publishing.
- [3] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. Do-It-Yourself Virtual Memory Translation. In *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ISCA '17, pages 457–468, New York, NY, USA, 2017. ACM.
- [4] James P. Anderson. Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, Vol. I, AD-758 206, Electronic Systems Division, Deputy for Command and Management Systems HQ Electronic Systems Division (AFSC), L. G. Hanscom Field, Bedford, Massachusetts 01730, USA, October 1972.
- [5] Andy Whitcroft. Sparsemem Memory Model. https://lwn.net/Articles/134804/, Aug 2019.
- [6] Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 96–107, New York, NY, USA, 1991. ACM.
- [7] ARM Ltd. Development Tools and Software: Fast Models. https://www.arm.com/products/development-tools/simulation/fast-models, August 2019.

- [8] Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, and Onur Mutlu. Mosaic: A GPU Memory Manager with Application-transparent Support for Multiple Page Sizes. In *Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO-50 '17, pages 136–150, New York, NY, USA, 2017. ACM.
- [9] Antonio Barbalace, Anthony Iliopoulos, Holm Rauchfuss, and Goetz Brasche. It's Time to Think About an Operating System for Near Data Processing Architectures. In *Proceedings of the 16th Workshop on Hot Topics in Operating Systems*, HotOS '17, pages 56–61, New York, NY, USA, 2017. ACM.
- [10] Antonio Barbalace, Marina Sadini, Saif Ansary, Christopher Jelesnianski, Akshay Ravichandran, Cagil Kendir, Alastair Murray, and Binoy Ravindran. Popcorn: Bridging the Programmability Gap in heterogeneous-ISA Platforms. In *Proceedings of the Tenth European Conference on Computer Systems*, EuroSys '15, pages 29:1–29:16, New York, NY, USA, 2015. ACM.
- [11] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In *Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles*, SOSP '09, pages 29–44, New York, NY, USA, 2009. ACM.
- [12] David Cock, Gerwin Klein, and Thomas Sewell. Secure Microkernels, State Monads and Scalable Refinement. In *Proceedings of the 21st International Conference on Theorem Proving in Higher Order Logics*, TPHOLs '08, pages 167–182, Berlin, Heidelberg, 2008. Springer-Verlag.
- [13] Philip Derrin, Kevin Elphinstone, Gerwin Klein, David Cock, and Manuel M. T. Chakravarty. Running the Manual: An Approach to High-assurance Microkernel Development. In *Proceedings of the 2006 ACM SIGPLAN Workshop on Haskell*, Haskell '06, pages 60–71, New York, NY, USA, 2006. ACM.
- [14] Dhammika Elkaduwe, Gerwin Klein, and Kevin Elphinstone. Verified Protection Model of the seL4 Microkernel. In Proceedings of the 2nd International Conference on Verified Software: Theories, Tools, Experiments, VSTTE '08, pages 99–114, Berlin, Heidelberg, 2008. Springer-Verlag.
- [15] Simon Gerber. Authorization, Protection, and Allocation of Memory in a Large System. PhD thesis, ETH Zurich, 2018.
- [16] Simon Gerber, Gerd Zellweger, Reto Achermann, Kornilios Kourtis, Timothy Roscoe, and Dejan Milojicic. Not Your Parents' Physical Address Space. In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HOTOS'15, pages 16–16, Berkeley, CA, USA, 2015. USENIX Association.
- [17] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 653–669, Berkeley, CA, USA, 2016. USENIX Association.
- [18] Marius Hillenbrand, Mathias Gottschlag, Jens Kehne, and Frank Bellosa. Multiple Physical Mappings: Dynamic DRAM Channel Sharing and Partitioning. In *Proceedings of the 8th Asia-Pacific Workshop on Systems*, APSys '17, pages 21:1–21:9, Mumbai, India, 2017.
- [19] HSA Foundation. HSA Runtime Programmer's Reference Manual, version: 1.1.4 edition, Oct 2016.
- [20] Jian Huang, Moinuddin K. Qureshi, and Karsten Schwan. An Evolutionary Study of Linux Memory Management for Fun and Profit. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '16, pages 465–478, Berkeley, CA, USA, 2016. USENIX Association.
- [21] Intel Corporation. Intel Xeon Phi Coprocessor System Software Developers Guide, 2014.
- [22] Khronos OpenCL Working Group. The OpenCL Specification, version: 2.0, document revision: 29 edition, July 2015.
- [23] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 207–220, New York, NY, USA, 2009. ACM.
- [24] Butler W Lampson. Protection. ACM SIGOPS Operating Systems Review, 8(1):18–24, 1974.
- [25] Janghaeng Lee, Mehrzad Samadi, and Scott Mahlke. VAST: The Illusion of a Large Memory Space for GPUs. In *Proceedings of the 23rd International Conference on Parallel Architectures and Compilation*, PACT '14, pages 443–454, New York, NY, USA, 2014. ACM.

- [26] Henry M. Levy. Capability-Based Computer Systems. Butterworth-Heinemann, Newton, MA, USA, 1984.
- [27] A Theodore Markettos, Colin Rothwell, Brett F Gutstein, Allison Pearce, Peter G Neumann, Simon W Moore, and Robert NM Watson. Thunderclap: Exploring Vulnerabilities in Operating System IOMMU Protection via DMA from Untrustworthy Peripherals. In NDSS, 2019.
- [28] Alex Markuze, Adam Morrison, and Dan Tsafrir. True IOMMU Protection from DMA Attacks: When Copy is Faster Than Zero Copy. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, pages 249–262, New York, NY, USA, 2016. ACM.
- [29] Benoît Morgan, Eric Alata, Vincent Nicomette, and Mohamed Kaaniche. Bypassing IOMMU Protection against I/O Attacks. In 2016 Seventh Latin-American Symposium on Dependable Computing (LADC), pages 145–150, Oct 2016.
- [30] Benoît Morgan, Eric Alata, Vincent Nicomette, and Mohamed Kaaniche. IOMMU Protection Against I/O Attacks: A Vulnerability and a Proof of Concept. *Journal of the Brazilian Computer Society*, 24(1):2, Jan 2018.
- [31] NVIDIA. NVIDIA Parker Series SoC Technical Reference Manual, v.1.0p edition, June 2017.
- [32] NVIDIA Corporation. Unified Memory in CUDA 6, Nov 2013. https://devblogs.nvidia.com/unified-memory-in-cuda-6/.
- [33] NXP. i.MX 8DualXPlus/8QuadXPlus Applications Processor Reference Manual, January 2019. REV 1, www.nxp.com/docs/en/user-guide/IMX8QXPMEKHUG.pdf.
- [34] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. A Case for Intelligent RAM. *IEEE Micro*, 17(2):34–44, March 1997.
- [35] Yuxin Ren, Gabriel Parmer, Teo Georgiev, and Gedare Bloom. CBufs: Efficient, System-wide Memory Management and Sharing. In Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, ISMM 2016, pages 68–77, New York, NY, USA, 2016. ACM.
- [36] Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J. Sorin. Specifying and Dynamically Verifying Address Translation-aware Memory Consistency. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 323–334, New York, NY, USA, 2010. ACM.
- [37] Adrian Schüpbach, Andrew Baumann, Timothy Roscoe, and Simon Peter. A Declarative Language Approach to Device Configuration. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 119–132, New York, NY, USA, 2011. ACM.
- [38] Thomas Sewell, Simon Winwood, Peter Gammie, Toby Murray, June Andronick, and Gerwin Klein. seL4 Enforces Integrity. In Marko van Eekelen, Herman Geuvers, Julien Schmaltz, and Freek Wiedijk, editors, *Interactive Theorem Proving*, pages 325–340, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
- [39] Texas Instruments. OMAP44xx Multimedia Device Technical Reference Manual, April 2014. Version AB, www.ti.com/lit/ug/swpu235ab/swpu235ab.pdf.
- [40] Simon Winwood, Gerwin Klein, Thomas Sewell, June Andronick, David Cock, and Michael Norrish. Mind the Gap. In Proceedings of the 22nd International Conference on Theorem Proving in Higher Order Logics, TPHOLs '09, pages 500–515, Berlin, Heidelberg, 2009. Springer-Verlag.