# arXiv:1910.05398v2 [cs.OS] 8 Nov 2019

# Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Reto Achermann<sup>1,2</sup> Ashish Panwar<sup>1,3</sup> Abhishek Bhattacharjee<sup>4</sup> Timothy Roscoe<sup>2</sup> Jayneel Gandhi<sup>1</sup>

<sup>1</sup>VMware Research <sup>2</sup>ETH Zurich <sup>3</sup>IISc Bangalore <sup>4</sup>Yale University

reto.achermann@inf.ethz.ch, ashishpanwar@iisc.ac.in, abhishek@cs.yale.edu, troscoe@inf.ethz.ch, gandhij@vmware.com

# Abstract

Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance.

We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating and migrating page-tables across sockets without application changes. This reduces the frequency of accesses to remote NUMA nodes when performing page-table walks. Mitosis uses two components: (i) a mechanism to enable efficient page-table replication and migration; and (ii) policies for processes to efficiently manage and control page-table replication and migration.

We implement Mitosis in Linux and evaluate its benefits on real hardware. Mitosis improves performance for large-scale multi-socket workloads by up to 1.34x by replicating page-tables across sockets. Moreover, it improves performance by up to 3.24x in cases when the OS migrates a process across sockets by enabling cross-socket page-table migration.

# 1. Introduction

In this paper, we investigate the performance issues in large NUMA systems caused by the sub-optimal placement not of program data, but of page-tables, and show how to mitigate them by replicating and migrating page-tables across sockets.

The importance of good data placement across sockets for performance on NUMA machines is well-known [29, 32, 41, 46]. However, the increase in main memory size is outpacing the growth of TLB capacity. Thus, TLB coverage (i.e. the size of memory that TLBs map) is stagnating and is causing more TLB misses [21, 50, 58, 59]. Unfortunately, the performance penalty due to a TLB miss is significant (up to 4 memory accesses on x86-64). Moreover, this penalty will grow to 5 memory accesses with Intel's new 5-level page-tables [43].

Our <u>first</u> contribution in this paper (§ 3) is to show by experimental measurements on a real system that *page-table* placement in large-memory NUMA machines poses performance challenges: a page-table walk may require multiple remote DRAM accesses on a TLB miss and such misses are increasingly frequent. We show this effect due to page-table placement on a large-memory machine in two scenarios. The first is a *multi-socket scenario* (§ 3.1), where large-scale multithreaded workloads execute across all sockets. In this case, the page-table is distributed across sockets by the OS as it sees fit. Such page placement results in multiple remote page-table accesses, degrading performance. We show the percentage of remote/local page-table entries (PTEs) on a TLB miss as observed from each socket in the top left table of Figure 1 for one workload (Canneal) from the multi-socket scenario. We observe that some sockets experience longer TLB misses since up to 86% of leaf PTEs are located remotely. Large-memory workloads like key-value stores and databases that stress TLB capacity are particularly susceptible to this behavior.

Figure 1: *Top Table:* Percentage of remote/local PTEs as observed from each socket on a TLB miss and *Bottom Graph:* Normalized runtime, for two workloads showing multi-socket (left) and workload migration (right) scenarios with their respective improvement using *Mitosis*.

Our second analysis configuration focuses on a workload *migration scenario* (§ 3.2), where the OS decides to migrate a workload from one socket to another. Such behavior arises for many reasons: the need to load balance, consolidate, improve cache behavior, or save power/energy [3, 31, 61]. A key question with migration is what happens to the data that the workload accesses. Existing NUMA policies in commodity OSes migrate data pages to the target socket where the workload has been migrated. Unfortunately, page-table migration is not supported [56], making future TLB misses expensive. Such misplacement of page-tables leads to performance degradation for the workload since 100% of TLB misses require remote memory access, as shown in the top right table of Figure 1 for one workload (GUPS) from the workload migration scenario. Workload migration is common in environments where virtual machines or containers are consolidated on large systems [3]. Ours is the first study to show this problem of sub-optimal page-table placement on NUMA machines using these two commonly occurring scenarios.

Our <u>second</u> contribution (§ 4) is a technique, *Mitosis*, which replicates and migrates page-tables to reduce this effect. *Mitosis* works entirely within the OS and requires no change to application binaries. The design consists of a mechanism to enable efficient page-table replication and migration (§ 5), and associated policies for processes to effectively manage page-table replication and migration (§ 6). *Mitosis* builds on widely-used OS mechanisms like page-faults and system calls and is hence applicable to most commodity OSes.

Our <u>third</u> contribution (§ 5, 6) is an implementation of *Mitosis* for an x86-64 Linux kernel. Instead of substantially re-writing the memory subsystem, we extend the Linux PV-Ops [9] interface to page-tables and provide policy extensions to Linux's standard user-level NUMA library, allowing users to control migration and replication of page-tables, and selectively enable it on a per-process basis. When a process is scheduled to run on a core, we load the core's page-table pointer with the physical address of the local page-table replica for the socket. When the OS modifies the page-table, we ensure that the updates are propagated to all replicas efficiently and that page-table reads return consistent values based on all replicas.

An important feature of *Mitosis* is that it requires no changes to applications or hardware, and is easy to use on a per-application basis. For this reason, *Mitosis* is readily deployable and complementary to emerging hardware techniques to reduce address translation overheads like segmentation [21, 49], PTE coalescing [58, 59] and user-managed virtual memory [16]. We will release our implementation of *Mitosis* to enable future research on page-table placement and plan to upstream our changes to Linux.

Our <u>final</u> contribution (§ 8) is a performance evaluation of *Mitosis* on real hardware. We show the effects of page-table replication and migration on a large-memory machine in the same two scenarios used before to analyze page-table placement. In the first, *multi-socket scenario*, we had observed that page-table placement results in multiple remote memory accesses, degrading performance for many workloads. The graph on the bottom left of Figure 1 shows the performance of a commonly used "first-touch" allocation policy which allocates data pages local to the socket that touches the data first. This policy is not ideal as it cannot allocate page-tables locally for all sockets. *Mitosis* replicates page-tables across sockets to improve performance by up to 1.34x in this scenario. These gains come at a mere cost of 0.6% memory overhead compared to the exorbitant memory cost of data replication.

In the second, *workload migration scenario*, we had observed that page-table migration is not supported, which makes TLB misses expensive for workloads after their migration across sockets. The graph on the bottom right in Figure 1 quantifies the worst-case performance impact of misplacing page-tables on memory that is remote with respect to the application socket (see remote (interfere) bar). The local bar shows the ideal execution time with locally allocated page-tables. *Mitosis* improves this situation by enabling cross-socket page-table migration, and boosts performance by up to 3.24x.

# 2. Background

# 2.1. Virtual Memory

Translation Lookaside Buffers (TLBs) enable fast address translation and are key to the performance of a virtual memory based system. Unfortunately, TLBs only cover a tiny fraction of physical memory available on modern systems while workloads consume all memory for storing their large datasets. Hence, memory-intensive workloads incur frequent costly TLB misses requiring page-table lookup by hardware.

Research has shown that TLB miss processing is prohibitively expensive [21, 24, 25, 26, 38, 53] as walking page-tables (e.g., 4-level radix tree on x86-64) requires multiple memory accesses. Even worse, virtualized systems need two levels of page-table lookups which can result in much higher TLB miss processing overheads (24 memory accesses instead of four on x86-64). Consequently, address translation overheads of 10-40% are not unusual [21, 24, 25, 39, 40, 50], and will worsen with emerging 5-level page-tables [43].
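The 24-access figure follows from a standard counting argument for nested (two-dimensional) page walks: each guest page-table level yields a guest-physical address that must itself be translated by a full host walk before the guest entry can be read, and the final guest-physical data address needs one more host walk. The sketch below only illustrates that arithmetic; the function names are ours and do not come from any kernel or hypervisor API.

```c
#include <assert.h>

/* Memory accesses for a native walk of a radix page-table with `levels`
 * levels, ignoring MMU caches: one access per level. */
unsigned native_walk_accesses(unsigned levels)
{
    return levels;
}

/* Memory accesses for a nested walk with `guest_levels` guest levels and
 * `host_levels` host levels, ignoring MMU caches.
 * Each guest level costs a host walk plus the guest entry read itself,
 * and the final guest-physical address costs one more host walk. */
unsigned nested_walk_accesses(unsigned guest_levels, unsigned host_levels)
{
    assert(guest_levels > 0 && host_levels > 0);
    return guest_levels * (host_levels + 1) + host_levels;
}
```

With four levels on both sides this gives 4 × 5 + 4 = 24 accesses, against 4 for a native walk; the same formula gives 35 if both guest and host move to 5-level page-tables.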

In response, many research proposals improve address translation by reducing the frequency of TLB misses and/or accelerating page-table walks. Use of large pages to increase TLB-coverage [34, 35, 36, 55, 57, 63, 64, 66] and additional MMU structures to cache multiple levels of the page-tables [19, 24, 26] are some of the techniques widely adopted in commercial systems. In addition, researchers have also proposed TLB-speculation [20, 60], prefetching translations [47, 53, 62], eliminating or devirtualizing virtual memory [42], or exposing virtual memory system to applications to make the case for application-specific address translation [16].

We observe that prior works studied address translation on single-socket systems. However, page-tables are often placed across remote and local memories in large-memory systems. Given the sensitivity of large page placement on such systems [41], we were intrigued by the question of how page-table placement affects overall performance. In this paper, we present compelling evidence to show that optimizing page-table placement is as crucial as optimizing data placement.

# 2.2. NUMA Architectures

Multi-socket architectures, where CPUs are connected via a cache-coherent interconnect, offer scalable memory bandwidth even at high capacity and are frequently used in modern data centers and cloud deployments. Looking forward, this trend will only increase; large-memory (1-100 TBs) machines are integrating even more devices with different performance characteristics like Intel's Optane memory [6]. Furthermore, emerging architectures using chiplets and multi-chip modules [17, 33, 44, 45, 48, 54, 65, 67] will drive the multi-socket and NUMA paradigm: accessing memory attached to the local socket will have higher bandwidth and lower latency than accessing memory attached to a remote socket. Note that accessing remote memory can incur 2-4x higher latency than accessing local memory [1]. Given the non-uniformity of access latency and bandwidth, optimizing data placement in NUMA systems has been an active area of research.

# 2.3. Data Placement in NUMA machines

Modern OSes provide generic support for optimizing data placement on NUMA systems through various allocation and migration policies. For example, Linux provides first-touch vs. interleaved allocation to control the initial placement of data, and additionally employs AutoNUMA to migrate pages across sockets in order to place data closer to the threads accessing it. To further optimize data placement, Carrefour [32] proposed data-page replication along with migration. In addition, data replication has also been proposed at data structure level [29] and via NUMA-aware memory allocators [46] to further reduce the frequency of remote memory accesses. In contrast, our work focuses on page-table pages, not data pages.

Some prior research has proposed replicated data structures for address spaces. RadixVM [30] manages the process' address space using replicated radix trees to improve the scalability of virtual memory operations in the research-grade xv6 OS [15]. However, *RadixVM* does not replicate page-tables. Similarly, Corey [28] divides the address space into shared and private per-core regions where these explicitly shared regions share the page-table. In contrast, we use replication to manage NUMA effects of page-table walks in an industry-grade OS.

**Techniques for data vs. page-table pages:** One may expect prior migration and replication techniques to extend readily to page-tables. In reality, subtle distinctions between data and page-table pages merit some discussion. First, data pages are replicated by simple bytewise copying of data, without any special reasoning about the contents of the pages. Page-table pages, however, require more care and cannot rely simply on bytewise copying: to semantically replicate virtual-to-physical mappings, upper page-table levels must hold pointers (physical addresses) to their replicated, lower-level page-tables, which differ from replica to replica except at the leaf level. Moreover, data replication has high memory overheads and maintaining consistency across replicated pages (especially for write-intensive pages) can outweigh the benefits of replication. While data replication has its value, we show that page-table replication is equally important: it incurs negligible memory overhead, can be implemented efficiently and delivers substantial performance improvement.
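The difference between bytewise and semantic replication can be illustrated with a toy, user-space model of a radix page-table. This is a sketch under simplifying assumptions, not the Mitosis implementation: `struct pt_page`, `pt_alloc` and `pt_replicate` are hypothetical names, non-leaf entries hold ordinary C pointers instead of physical addresses with flag bits, and the `node` field merely records where a page would be placed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PT_ENTRIES 512

/* Toy page-table page. An entry of 0 means "not present". */
struct pt_page {
    int level;                    /* 1 = leaf level, larger = upper levels */
    int node;                     /* NUMA node this page notionally lives on */
    uintptr_t entry[PT_ENTRIES];  /* leaf: data frame; non-leaf: child page */
};

struct pt_page *pt_alloc(int level, int node)
{
    assert(level >= 1);
    struct pt_page *p = calloc(1, sizeof(*p));
    if (p == NULL)
        abort();
    p->level = level;
    p->node = node;
    return p;
}

/* Build a replica of the subtree rooted at `src` on `node`. */
struct pt_page *pt_replicate(const struct pt_page *src, int node)
{
    struct pt_page *dst = pt_alloc(src->level, node);

    for (int i = 0; i < PT_ENTRIES; i++) {
        if (src->entry[i] == 0)
            continue;
        if (src->level == 1) {
            /* Leaf PTEs map data frames: the value is the same in
             * every replica, so a plain copy is correct here. */
            dst->entry[i] = src->entry[i];
        } else {
            /* Upper-level entries must point at this replica's own
             * copy of the child, so the stored value differs. */
            const struct pt_page *child =
                (const struct pt_page *)src->entry[i];
            dst->entry[i] = (uintptr_t)pt_replicate(child, node);
        }
    }
    return dst;
}
```

A `memcpy` of the root page would leave the replica's upper-level entries pointing back into the original tree, so a walk that starts at the replica would fall back to remote pages after the first level.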

# 3. Page-Table Placement Analysis

In this section, we first present an analysis of page-table distributions when running memory-intensive workloads on a large-memory machine (*multi-socket scenario* § 3.1) and then quantify the impact of NUMA effects on page-table walks (*workload migration scenario* § 3.2). Our experimental platform is a 4-socket Intel Xeon E7-4850v3 with 512 GB physical memory (more detailed machine configuration in § 8).



Figure 2: An illustration of current page-table and data placement for a multi-socket workload on a 4-socket system.

### 3.1. Multi-Socket Scenario

We focus on page-table distributions where workloads use almost all resources in a multi-socket system. Consider the example in Figure 2. If a core in socket 0 has a TLB miss for data "D", which is local to the socket, it has to perform up to 4 remote accesses to resolve the TLB miss to ultimately discover that data was actually local to its socket. Even though MMU caches [19] help reduce some of the accesses, at least leaf-level PTEs have to be accessed. Since big-data workloads have large page-tables that are absent from the caches, system memory accesses are often unavoidable [25].
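To make the "up to 4 remote accesses" concrete, the sketch below splits a 48-bit x86-64 virtual address into its four 9-bit table indices (4KB pages) and counts how many of the four page-table pages touched by a walk reside on a socket other than that of the faulting core. The function names and the `node_of_level` array are hypothetical, and MMU caches, which can skip upper levels, are ignored.

```c
#include <assert.h>
#include <stdint.h>

#define PT_LEVELS 4

/* Index into the page-table page at `level` (4 = root, 1 = leaf):
 * 9 bits per level, stacked above a 12-bit page offset. */
unsigned pt_index(uint64_t vaddr, int level)
{
    assert(level >= 1 && level <= PT_LEVELS);
    return (unsigned)((vaddr >> (12 + 9 * (level - 1))) & 0x1ff);
}

/* node_of_level[l - 1] is the socket holding the level-l page-table page
 * visited by this walk. Returns the number of remote accesses as seen
 * from a core on `cpu_node`. */
int remote_walk_accesses(const int node_of_level[PT_LEVELS], int cpu_node)
{
    int remote = 0;

    for (int l = 0; l < PT_LEVELS; l++)
        if (node_of_level[l] != cpu_node)
            remote++;
    return remote;
}
```

In the situation of Figure 2, every entry of `node_of_level` differs from socket 0, so the walk costs four remote accesses even though the data page itself is local.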

**Methodology.** We are interested in the distribution of pages for each level in the page-table; i.e., which sockets page-tables are allocated on. We write a kernel module that walks the page-table of a process and dumps the PTEs, including the value of the page-table root register (CR3), to a file. The kernel module is then invoked every 30 seconds while a multi-socket workload (e.g., Memcached) runs, producing a stream of page-table snapshots over time. We use 30-second intervals as page-table allocation occurs relatively infrequently and smaller intervals do not change the results significantly. We use first-touch or interleaved data allocation policy while enabling/disabling AutoNUMA [2] data page migration with different page sizes for multi-socket workloads in Table 1.

| Workload  | Description                                                                                       | MS    | WM   |
|-----------|---------------------------------------------------------------------------------------------------|-------|------|
| Memcached | a commercial distributed in-memory object caching system [8]                                      | 350GB | -    |
| Graph500  | a benchmark for generation, compression and search of large graphs [5]                            | 420GB | -    |
| HashJoin  | a benchmark for hash-table probing used in database applications and other large applications     | 480GB | 17GB |
| Canneal   | a benchmark for simulated cache-aware annealing to optimize routing cost of a chip design [10]    | 382GB | 32GB |
| XSBench   | a key computational kernel of the Monte Carlo neutronics application [14]                         | 440GB | 85GB |
| BTree     | a benchmark for index lookups used in database and other large applications                       | 145GB | 35GB |
| LibLinear | a linear classifier for data with millions of instances<br>and features [7]                       | -     | 67GB |
| PageRank  | a benchmark for page rank used to rank pages in search engines [23]                               | -     | 69GB |
| GUPS      | a HPC Challenge benchmark to measure the rate of integer random updates of memory [11]            | -     | 64GB |
| Redis     | a commercial in-memory key-value store [12]                                                       | -     | 75GB |

Table 1: Workloads used for analysis in multi-socket (MS) and workload migration (WM) scenarios.

| Level | Socket 0                      | Socket 1                       | Socket 2                  | Socket 3                  |
|-------|-------------------------------|--------------------------------|---------------------------|---------------------------|
| L4    | 0 [0 0 0 0] (0%)              | 1 [8 3 0 1] (75%)              | 0 [0 0 0 0] (0%)          | 0 [0 0 0 0] (0%)          |
| L3    | 1 [56 66 40 37] (72%)         | 3 [33 43 26 26] (66%)          | 0 [0 0 0 0] (0%)          | 0 [0 0 0 0] (0%)          |
| L2    | 89 [11k 11k 11k 11k] (75%)    | 109 [13k 13k 13k 13k] (75%)    | 66 [8k 8k 8k 8k] (75%)    | 63 [7k 7k 7k 8k] (75%)    |
| L1    | 40k [6M 4M 4M 4M] (67%)       | 40k [4M 6M 4M 4M] (67%)        | 40k [4M 4M 6M 4M] (67%)   | 40k [4M 4M 4M 6M] (67%)   |

Figure 3: Analysis of page-table pointers from a page-table dump for a multi-socket workload: Memcached.



Figure 4: Percentage of remote leaf PTEs as observed from each socket for our multi-socket workloads.

**Analysis.** We analyze the distribution of page-tables for each snapshot in time. For each page-table level, we summarize the number of per-socket physical pages and the number of valid PTEs pointing to page-table pages (or data frames) residing on a local or remote socket. From these snapshots, we collect a distribution of leaf PTEs and which sockets they are located on. We focus on leaf PTEs as there are orders of magnitude more of them than non-leaf PTEs and because they generally determine address translation performance (upper-level PTEs can be cached in MMU caches [25]). These distributions indicate how many local and remote sockets a page-table walk may visit before resolving a TLB miss.

**Results.** Due to space limitations, we show a single, processed snapshot of the page-table for Memcached in Figure 3. This snapshot was collected using 4KB pages, local allocation, and AutoNUMA disabled. We studied 2MB pages as well and present observations from them later. The processed dump shows the distribution of all four levels of the page-table (L4 being the root, and L1 the leaf). The dump is organized in four columns representing the four sockets in this system. In each cell, the first number is the total number of physical pages at that level-socket combination (e.g. socket 1 has the only L4 page-table page). Next, in square brackets, is the distribution of pointers of the valid PTEs at this level/socket (e.g. L4 on socket 1 has 8 pointers to L3 on socket 0, 3 pointers locally, and 1 pointer to socket 3). The percentages in parentheses are the fraction of valid PTEs pointing to remote physical pages.
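The percentage in each cell can be recomputed from the bracketed pointer counts. The helper below is an illustrative sketch (the name `remote_pct` and its signature are ours, not part of the kernel module described above): it sums the pointers that leave the socket holding the page-table pages and divides by the total.

```c
#include <assert.h>

/* ptrs[s] is the number of valid PTEs, stored on socket `home`, that
 * point to a page on socket s. Returns the percentage that point to a
 * socket other than `home`, or 0 if there are no valid PTEs. */
double remote_pct(const unsigned long *ptrs, int n_sockets, int home)
{
    assert(n_sockets > 0 && home >= 0 && home < n_sockets);

    unsigned long total = 0, remote = 0;
    for (int s = 0; s < n_sockets; s++) {
        total += ptrs[s];
        if (s != home)
            remote += ptrs[s];
    }
    return total ? 100.0 * (double)remote / (double)total : 0.0;
}
```

For the L4 cell on socket 1, `[8 3 0 1]`, 9 of the 12 valid entries point away from socket 1, which gives the 75% shown in Figure 3; the L3 cell on socket 0, `[56 66 40 37]`, gives 143/199, or about 72%.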

Figure 4 shows the percentage of remote leaf PTEs observed by a thread running on each socket. Each workload's cluster has per-socket values representing the percentage of remote leaf PTEs in the page-table. We made these observations from the page-table dumps and distribution of leaf PTEs:

- 1. Page-table pages are allocated on the socket initializing the first data structures that the page-table pages point to. This is similar to data frame allocation but has important unintended performance consequences. Consider that each page-table page has 512 entries. This means that the choice of where to allocate a page-table page is entirely dependent upon which of the 512 entries in the page-table page gets allocated first, and which socket the allocating thread runs on. If, subsequently, other entries in the page-table page are used for threads on another socket, remote memory references for page-table walks become common.
- 2. With the first-touch policy, the number of page-tables tends to be skewed towards a single socket (e.g. socket 1 for Graph500 in Figure 4). This is especially the case when a single thread allocates and initializes all memory.
- 3. The interleaved policy evenly distributes page-table pages across all sockets.
- 4. While we observed data pages being migrated with AutoNUMA, page-table pages were never migrated. The fraction of data pages migrated over time depends on the workload and its access locality.
- 5. On all levels, a significant fraction of page-table entries points to remote sockets. In the case of the interleaved policy, this is $\frac{N-1}{N}$ for an *N*-socket system.
- 6. Due to the skew in page-table allocation, some sockets experience longer TLB misses since up to 99% of leaf PTEs are located remotely.
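The remote fraction quoted for the interleaved policy follows from a one-line calculation. Assume, as a simplification, that the page holding a pointer and the page it points to are each equally likely to reside on any of the $N$ sockets, independently of one another. The pointer then stays on its own socket with probability $1/N$, so

$$
\Pr[\text{remote}] = 1 - \frac{1}{N} = \frac{N-1}{N},
\qquad
N = 4 \;\Rightarrow\; \frac{3}{4} = 75\%.
$$

For the 4-socket machine used here this is 75%, the same value that the evenly spread L2 rows of Figure 3 report.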

**Summary.** On multi-socket systems, page-table page allocation is skewed towards sockets that initialize the data structures. While data pages are migrated by default OS policies, page-table pages remain on the socket where they were allocated. Consequently, remote page-table walks are inevitable and multi-socket workloads suffer from longer TLB misses as their associated page-table walks require remote memory accesses.

# 3.2. Workload Migration Scenario

We now focus on the impact of NUMA on page-table walks in scenarios where a process on a single socket is migrated to another. Such situations arise frequently in commercial cloud deployments due to the need for load balancing and improving process-data affinity [52, 27]. Particularly, the prevalence of virtual machines and containers that rely on hypervisors and NUMA-aware schedulers to consolidate workloads in data centers is making inter-socket process migrations increasingly common. For example, VMware ESXi may migrate processes as often as every 2 seconds [3]. Today, data can be migrated across sockets but page-tables cannot, compromising performance.

**Configurations.** We run each workload in isolation while tightly controlling and changing *i*) the allocation policies for data pages and page-table pages, *ii*) whether or not the sockets are idle and *iii*) whether transparent 2MB large pages (THP) are enabled. We disable NUMA migration. To study page-table allocations in a controlled manner, we modified the Linux kernel to force page-table allocations on a fixed socket. We use the configurations shown in Table 2 and visualized in Figure 5. We use the STREAM benchmark [13] running on the socket indicated by interference to create a worst-case scenario of co-locating a memory-bandwidth heavy workload. Memory allocation and processor affinity are controlled by numactl.

Figure 5: Different configurations for workload migration scenario. We show only 6 out of 7 configurations here. The 7th configuration (RPI-RDI) can be easily created from (ii) by running another process on Socket 0.

| Config.    | Workload | Page-Table   | Data           | Interference            |
|------------|----------|--------------|----------------|-------------------------|
| (T)LP-LD   | A        | A: Local PT  | A: Local Data  | -                       |
| (T)LP-RD   | A        | A: Local PT  | B: Remote Data | -                       |
| (T)RP-LD   | A        | B: Remote PT | A: Local Data  | -                       |
| (T)RP-RD   | A        | B: Remote PT | B: Remote Data | -                       |
| (T)RPI-LD  | A        | B: Remote PT | A: Local Data  | B: Interfere on PT      |
| (T)LP-RDI  | A        | A: Local PT  | B: Remote Data | B: Interfere on Data    |
| (T)RPI-RDI | A        | B: Remote PT | B: Remote Data | B: Interfere on PT&Data |

Table 2: Configurations for workload migration scenario, where A and B denote different sockets. T denotes if THP in Linux is used for 2MB pages. Interference is another process that runs on a specified socket and hogs its local memory bandwidth. Figure 5 shows the 2-socket case.

**Measurements.** We use perf to obtain performance counter values such as execution cycles and TLB load and store miss walk cycles (i.e., the cycles that the page walker is active for).

**Results.** We then run our workloads for all seven configurations. Figure 6 shows the normalized run times with a 4KB page size. The base case is the LP-LD configuration where both page-tables and data pages are local and the system is idle. For each configuration, the hashed part of the bar denotes the fraction of time spent on page-table walks. We observe the following from this experiment:

- 1. All workloads spend a significant fraction of execution cycles (up to 90%) performing page-table walks. Parts of these walks may be overlapped with other work; nevertheless, they present a performance impediment.
- 2. LP-LD runs most efficiently for 4KB page size.
- 3. The local page-table, remote data case (LP-RD and LP-RDI) suffers a 3x slowdown versus the baseline. This is not surprising and has motivated prior research on data migration techniques in large-memory NUMA machines.
- 4. More surprisingly, the remote page-table, local data case (RP-LD and RPI-LD) suffers a 3.3x slowdown. This slowdown can be even more severe than that of remote data accesses.
- 5. When both page-tables and data pages are placed remotely (RP-RD and RPI-RDI), the slowdown is 3.6x and is the worst placement possible for all workloads.
- 6. With 2MB page size (figure omitted for space), TLB reach improves and the number of memory accesses for a page-table walk decreases to 3 rather than 4. These two factors reduce the fraction of execution cycles devoted to page-table walks. Even so, overall performance is still vulnerable to remote page-table placement.

**Summary.** The NUMA node on which page-table pages are placed significantly impacts performance. Remote page-tables can cause a similar, and in some cases even worse, slowdown than remote data page accesses. Moreover, the slowdown is visible even with large pages.

# 4. Design Concept

*Mitosis*' key concept is a mechanism and its policies to replicate and migrate page-tables and reduce the frequency of remote memory accesses in page-table walks. *Mitosis* requires two components: *i*) a mechanism to support low-overhead page-table replication and migration and *ii*) policies for processes to efficiently manage and control page-table replication and migration. Figure 7 illustrates these concepts. Our discussion focuses on the multi-socket and workload migration scenarios used before in § 3.

### 4.1. Multi-socket Scenario

We showed in § 3.1 that multi-socket workloads will, assuming a uniform distribution of page-table pages, have a fraction $\frac{N-1}{N}$ of PTEs pointing to remote pages for an *N*-socket system. Page-tables may be distributed among the sockets in a skewed fashion. Figure 7 (a)(i) shows a scenario where threads of the same workload running on different sockets have to make remote memory accesses during page-table walks.

Figure 6: Normalized runtime of our workloads in workload migration scenario with 4KB page size. The lower hashed part of each bar is time spent in walking the page-tables. All configurations are shown in Table 2.

Figure 7: Mitosis: Page-table migration and replication on large-memory machines

From Figure 7 (a)(i) we can see that if a thread in socket 0 has a TLB miss for data "D" (which is local to the socket), it has to perform up to 4 remote accesses to resolve the TLB miss to only find out that the data was local to its socket.

With *Mitosis*, we replicate the page-tables on each socket where the process is running (shown in Figure 7 (a)(ii)). This results in up to 4 local accesses to the page-table, precluding the need for remote memory accesses in page-table walks.
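One way to picture how a core ends up walking its local replica is a per-process table of page-table roots indexed by socket, consulted when the process is scheduled. The sketch below is illustrative only: `struct pt_roots`, `root_for_socket` and the fallback to the original root when a socket has no replica are our assumptions, not a description of the Linux data structures used by *Mitosis*.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SOCKETS 8

/* Hypothetical per-process state: one page-table root per socket.
 * A replica root of 0 means "no replica on this socket". */
struct pt_roots {
    uint64_t primary;               /* root of the original page-table */
    uint64_t replica[MAX_SOCKETS];  /* physical address of each socket's root */
};

/* Value to load into the page-table base register (CR3 on x86-64) when
 * the process is scheduled on a core belonging to `socket`. */
uint64_t root_for_socket(const struct pt_roots *r, int socket)
{
    assert(socket >= 0 && socket < MAX_SOCKETS);
    return r->replica[socket] ? r->replica[socket] : r->primary;
}
```

With a replica on every socket where the process runs, each of the up to 4 accesses of a walk starts from, and stays within, socket-local page-table pages.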

### 4.2. Workload Migration Scenario

Single-socket workloads suffer performance loss when processes are migrated across sockets while page-tables are not (shown in Figure 7 (b)(ii)). When the process is migrated from socket 0 to socket 1, the NUMA memory manager transparently migrates data pages, but page-table pages remain on socket 0. In contrast, *Mitosis* migrates the page-tables along with the data (Figure 7 (b)(iii)). This eliminates remote memory accesses for page-table walks, improving performance.

# 5. Mechanism

Replication and migration are inherently similar. We first describe the building blocks which are required to support page-table replication and later show how we can leverage the replication infrastructure to achieve page-table migration.

*Mitosis* enables per-process replication; the virtual memory subsystem needs to maintain multiple copies of page-tables for a single process. Efficient replication of page-tables can be divided into three sub-tasks: *i*) strict memory allocation to hold the replicated page-tables, *ii*) managing and keeping the replicas consistent, and *iii*) using replicas when the process is scheduled. We now describe each sub-task in detail by providing a generalized design and our Linux implementation. We also discuss how *Mitosis* handles accessed and dirty bits.

# 5.1. Allocating Memory for Storing Replicas

**General design:** All page-table allocations are performed by the OS on a page-fault; an explicit mapping request can be viewed as an eager call to the page-fault handler for the given memory area. *Mitosis* extends the same mechanism to allocate memory across sockets for different replicas.

Such allocation is strict, i.e. it has to occur on a particular list of sockets at allocation time. It is, therefore, possible that it may fail due to the unavailability of memory on those sockets. There are multiple ways to sidestep this problem. One is to reserve pages on each socket for page-table allocations using a per-socket *page-cache*. These pages can be explicitly reserved through a system call or automatically when a process allocates a virtual memory region. Alternatively, the OS can reclaim physical memory through demand paging mechanisms or by evicting a data page onto another socket.

**Linux implementation:** We rely on the existing page allocation functionality in Linux to implement *Mitosis*. When allocating page-table pages, we explicitly supply the list of target sockets for page-table replication. Since strict allocation can fail, we implemented per-socket *page-caches* to reserve pages for page-table allocations. The size of this page-cache is explicitly controlled using a sysctl interface.
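The idea of a per-socket reserve that backs strict allocation can be sketched in user space as follows. Everything here is a simplified model with hypothetical names (`struct pt_page_cache`, `pt_cache_put`, `pt_cache_get`, `pt_alloc_replicas`); the real implementation sits on top of the Linux page allocator, and the fixed capacity stands in for the size configured through sysctl. The all-or-nothing helper assumes that the socket list contains no duplicates.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SOCKETS 8
#define CACHE_CAP   64

/* Hypothetical per-socket reserve of page frames for page-table pages. */
struct pt_page_cache {
    unsigned long frame[MAX_SOCKETS][CACHE_CAP];
    size_t count[MAX_SOCKETS];
};

/* Add a frame known to reside on `socket` to that socket's reserve.
 * Returns 0 on success, -1 if the reserve is full. */
int pt_cache_put(struct pt_page_cache *c, int socket, unsigned long pfn)
{
    assert(socket >= 0 && socket < MAX_SOCKETS);
    if (c->count[socket] == CACHE_CAP)
        return -1;
    c->frame[socket][c->count[socket]++] = pfn;
    return 0;
}

/* Strict allocation: hand out a frame from exactly `socket`.
 * Returns 0 on success, -1 if that socket's reserve is empty. */
int pt_cache_get(struct pt_page_cache *c, int socket, unsigned long *pfn)
{
    assert(socket >= 0 && socket < MAX_SOCKETS);
    if (c->count[socket] == 0)
        return -1;
    *pfn = c->frame[socket][--c->count[socket]];
    return 0;
}

/* Allocate one frame on every socket in `sockets[0..n-1]`, all or
 * nothing: if any reserve is empty, no frame is consumed. */
int pt_alloc_replicas(struct pt_page_cache *c, const int *sockets, size_t n,
                      unsigned long *out)
{
    for (size_t i = 0; i < n; i++)
        if (c->count[sockets[i]] == 0)
            return -1;
    for (size_t i = 0; i < n; i++)
        (void)pt_cache_get(c, sockets[i], &out[i]);
    return 0;
}
```

Checking every reserve before taking any frame keeps a failed replication from leaving a partially allocated set of replicas behind.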

### 5.2. Management of Updates to Replicas

**General design:** For security, OSes usually do not allow user processes to directly manage their own page-tables. Instead, OSes export an interface through which page-table modifications are handled, e.g. map/unmap/protect of pages. *Mitosis* extends the same interfaces for updates to page-tables to keep all replicas consistent. One way to implement this is to *eagerly* update all replicas at the same time via this standard interface when an update to the page-table is performed on any replica.

On an eager update, the OS finds the physical location to update in the local replica by walking the local replica of the page-table. Without further support, it must also walk every other replica to locate the corresponding entries so that all replicas are updated at the same time. Therefore, an N-socket system on x86\_64 will need 4N memory accesses on a page-table update with replication: 4 memory accesses to walk the page-table on each of the N sockets. To reduce this overhead, we designed a circular linked-list of all replicas. The metadata about each physical page is utilized to store the pointers to the next physical page holding the replica of the page-table. Figure 8 shows an illustration with 4-way replication. This allows updates to proceed without walking the page-tables to perform the update. With this optimization, the update of all N replicas takes 2N memory references (N for updating the N replicas and N for reading the pointers to the next replica).

Figure 8: Circular linked list to locate all replicas efficiently (implemented in Linux with struct page).
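The circular replica list can be illustrated with a small user-space sketch. It is not the kernel implementation: `struct pt_page`, `link_replicas` and `set_entry_all_replicas` are hypothetical names, and the `next_replica` pointer stands in for the field that the paper stores in the per-page metadata. The sketch counts memory references to show where the 2N figure comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one page-table page plus its per-page metadata. */
struct pt_page {
    uint64_t entries[512];         /* a 4KB page-table page: 512 8-byte entries */
    struct pt_page *next_replica;  /* next page in the circular replica list */
};

/* Link n page-table pages into a ring; a single page points to itself. */
static void link_replicas(struct pt_page **pages, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        pages[i]->next_replica = pages[(i + 1) % n];
}

/*
 * Eagerly write `value` at `index` in every replica, starting from any member
 * of the ring. Returns the number of memory references performed: one entry
 * write plus one pointer read per replica, i.e. 2N for N replicas.
 */
static size_t set_entry_all_replicas(struct pt_page *start, size_t index,
                                     uint64_t value)
{
    assert(index < 512);
    size_t refs = 0;
    struct pt_page *p = start;
    do {
        p->entries[index] = value;   /* update this replica */
        refs++;
        p = p->next_replica;         /* follow the ring */
        refs++;
    } while (p != start);
    return refs;
}
```

Because the list is circular, the update can start from whichever replica the caller happens to hold, which is typically the local one.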

**Linux implementation:** We implemented eager updates to the replica page-tables in Linux. This required intercepting any writes to the page-tables and propagating the updates accordingly. Instead of revamping the full memory subsystem in Linux, we used a different interface, PV-Ops [9], which is required to support para-virtualization environments such as Xen [18]. The Linux kernel shipped with distributions like Ubuntu has para-virtualization support enabled by default.

Conceptually, PV-Ops dispatches each operation through an indirect call to the native or Xen handler function. In practice, the indirect calls are patched with direct calls once the subsystem is initialized. The PV-Ops subsystem interface consists of functions to allocate and free page-tables of any level, to read and write the translation base register (CR3 on x86\_64), and to write page-table entries. The PV-Ops interface can be seen in Listing 1.

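```c
/* Load a new translation base (root page-table address) into CR3. */
void write_cr3(unsigned long x);
/* Notify the backend that frame pfn is now used as a page-table page of mm. */
void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn);
/* Notify the backend that frame pfn is no longer used as a page-table page. */
void paravirt_release_pte(unsigned long pfn);
/* Write the page-table entry at ptep. */
void set_pte(pte_t *ptep, pte_t pte);
```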

Listing 1: Excerpt of the PV-Ops interface

We implemented *Mitosis* as a new backend for PV-Ops alongside the native and Xen backends. When the kernel is compiled with *Mitosis*, the default PV-Ops backend is switched to the *Mitosis* backend. We implemented the *Mitosis* backend with great care to ensure identical behavior to the native backend when *Mitosis* is turned off. Note also that replication is not enabled by default, so the default behavior is the same as with the native interface.

The PV-Ops subsystem provides an efficient way for *Mitosis* to track any writes to the page-tables in the system. Propagating those updates efficiently requires a fast way to find the replica page-tables based solely on the information provided through the PV-Ops interface (Listing 1) i.e. using a kernel virtual address (KVA) or a physical frame number (PFN).

We augment the page metadata to keep track of replicas with our circular linked list. The Linux kernel keeps track of each 4KB physical frame in the system using struct page. Moreover, each frame has a unique KVA and PFN. Linux provides functions to convert between a struct page and its corresponding KVA/PFN. These conversions typically add, subtract or shift the respective values and are hence efficient operations. We can, therefore, obtain the struct page directly from the information passed through the PV-Ops interface and update all replicas efficiently.
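The following sketch illustrates why locating the next replica from a KVA or PFN is cheap. It models the per-frame metadata as a flat array indexed by PFN; `frame_table`, `struct frame_meta`, the helper names and the `DIRECT_MAP_BASE` constant are assumptions made for this example and do not reproduce the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12     /* 4KB frames */
#define NUM_FRAMES 1024   /* toy physical memory of 4MB */

/* Hypothetical direct-map base: KVA = DIRECT_MAP_BASE + physical address. */
#define DIRECT_MAP_BASE 0xffff888000000000ULL

/* Hypothetical per-frame metadata, standing in for struct page. */
struct frame_meta {
    uint64_t next_replica_pfn;   /* PFN of the next replica in the ring */
};

static struct frame_meta frame_table[NUM_FRAMES];   /* one entry per frame */

/* Conversions are a subtraction/addition and a shift. */
static uint64_t kva_to_pfn(uint64_t kva)
{
    return (kva - DIRECT_MAP_BASE) >> PAGE_SHIFT;
}

static uint64_t pfn_to_kva(uint64_t pfn)
{
    return DIRECT_MAP_BASE + (pfn << PAGE_SHIFT);
}

static struct frame_meta *pfn_to_meta(uint64_t pfn)
{
    assert(pfn < NUM_FRAMES);
    return &frame_table[pfn];
}

/*
 * Given the KVA of an entry in one replica, return the KVA of the same entry
 * (same offset within the page) in the next replica of the ring.
 */
static uint64_t next_replica_entry(uint64_t entry_kva)
{
    uint64_t offset = entry_kva & ((1ULL << PAGE_SHIFT) - 1);
    uint64_t next_pfn = pfn_to_meta(kva_to_pfn(entry_kva))->next_replica_pfn;
    return pfn_to_kva(next_pfn) + offset;
}
```

Starting from the pointer passed to a `set_pte`-style call, repeated use of `next_replica_entry` visits the matching entry in every replica without any page-table walk.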

### 5.3. Efficiently Utilizing Page-Table Replicas

**General design:** When the OS schedules a process or task, it performs a context switch, restores processor registers and resumes execution of the new process or task. The context switch involves programming the page-table base register of the MMU with the base address of the process' page-table and flushing the TLB. With *Mitosis*, we extend the context switch functionality, to select and set the base address of the socket's local page-table replica efficiently. This enables a task or process to use the local page-table replica if present.

**Linux implementation:** For each process, we maintain an array of root page-table pointers which allows directly selecting the local replica by indexing this array using the socket id. Initializing this array with pointers to the very same root page-table is equivalent to the native behavior.
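As a rough illustration of the per-socket root array, the sketch below keeps one root pointer per socket and selects the local one at context-switch time. `struct addr_space`, `MAX_SOCKETS` and the function names are hypothetical; the kernel data structures differ.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SOCKETS 4   /* assumption: a four-socket machine */

/* Hypothetical per-process address-space descriptor. */
struct addr_space {
    uint64_t root[MAX_SOCKETS];   /* physical address of the root table per socket */
};

/* Without replication, every slot refers to the very same root table. */
static void init_roots(struct addr_space *as, uint64_t root_pa)
{
    for (int s = 0; s < MAX_SOCKETS; s++)
        as->root[s] = root_pa;
}

/* Record the root of a replica that lives on `socket`. */
static void set_replica_root(struct addr_space *as, int socket, uint64_t root_pa)
{
    assert(socket >= 0 && socket < MAX_SOCKETS);
    as->root[socket] = root_pa;
}

/* Value a context switch on `socket` would load into the base register. */
static uint64_t select_root(const struct addr_space *as, int socket)
{
    assert(socket >= 0 && socket < MAX_SOCKETS);
    return as->root[socket];
}
```

Selecting the replica is a single array lookup, so the context-switch path gains no branches that depend on whether replication is enabled.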

### 5.4. Handling of Bits Written by Hardware

**General design:** A page-table is mostly managed by software (the OS) and read by the hardware (on a TLB miss). On x86, however, the hardware, namely the page-walker, reports whenever a page has been accessed or written to by setting the accessed and dirty bits in the PTEs. In other words, the page-table is modified without direct OS involvement. Thus, accessed and dirty bits do not use the standard software interface to update the PTE and cannot be replicated easily without hardware support. Note that these two bits are typically set by the hardware and reset by the OS. They are used by the OS for system-level operations like swapping or writing back memory-mapped files if they are modified in memory. With *Mitosis*, when page-tables are replicated, we logically OR the accessed and dirty bits of all the replicas when they are read by the OS.

**Linux implementation:** We need to read accessed/dirty bits from all replicas as well as reset them in all replicas. Unfortunately, the PV-Ops interface does not provide functions to read a page-table entry. Worse, we have found code in the Linux kernel that writes to the page-table entry without going through the PV-Ops interface. We augmented PV-Ops with the corresponding get functions, which consult all copies of a page-table entry and make sure the flags are returned correctly. The new functions read all the replicas and OR their accessed and dirty bits to get the correct information.
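A minimal sketch of the read and reset paths follows, assuming the replicas of one entry are reachable through an array of pointers. The function names are hypothetical; the bit positions (accessed = bit 5, dirty = bit 6) are those of x86 page-table entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED (1ULL << 5)   /* x86 accessed bit */
#define PTE_DIRTY    (1ULL << 6)   /* x86 dirty bit */
#define PTE_AD_MASK  (PTE_ACCESSED | PTE_DIRTY)

/*
 * Read an entry as the OS should see it: the first copy, plus the accessed
 * and dirty bits that hardware may have set in any of the replicas.
 */
static uint64_t read_pte_merged(uint64_t *const *replicas, size_t n)
{
    assert(n > 0);
    uint64_t pte = *replicas[0];
    for (size_t i = 1; i < n; i++)
        pte |= *replicas[i] & PTE_AD_MASK;
    return pte;
}

/* Clear the requested accessed/dirty bits in every replica. */
static void clear_ad_bits(uint64_t *const *replicas, size_t n, uint64_t bits)
{
    for (size_t i = 0; i < n; i++)
        *replicas[i] &= ~(bits & PTE_AD_MASK);
}
```

Resetting must touch every replica; otherwise a stale bit in one copy would reappear in the next merged read.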

### 5.5. Page-Table Migration

We use replication to perform migration in the following way: we use *Mitosis* to replicate the page-table on the socket to which the process has been migrated. The first replica can be eagerly freed after migration. Alternatively, it can be kept up-to-date in case the process gets migrated back, and lazily deallocated when physical memory becomes scarce.

# 6. Policy

The policies we implement with *Mitosis* control when page-tables are replicated and determine the processes and sockets for which replicas are created. As with NUMA policies, page-table replication policies can be applied system-wide or upon user request. We discuss both in this section.

### 6.1. System-wide Policies

**General design:** System-wide policies can range from simple on/off knobs for all processes to policies that actively monitor performance counter events provided by the hardware to dynamically enable or disable *Mitosis*.

Event-based triggers can be developed for page-table migration and replication within the OS. For instance, the OS can obtain TLB miss rates or cycles spent walking page-tables through performance counters that are available on modern processors and then apply policy decisions automatically. A high TLB miss rate suggests that a process can benefit from page-table replication or migration. The ratio between the time spent serving TLB misses and the number of TLB misses can indicate a replication candidate. Processes with a low TLB miss rate may not benefit from replication.
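One possible shape of such a trigger is sketched below. It is purely illustrative: the paper leaves a counter-based policy to future work, and `struct walk_sample`, the function name and both thresholds are hypothetical values that would need tuning on a real machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter sample for one process over a monitoring interval. */
struct walk_sample {
    uint64_t total_cycles;   /* cycles elapsed in the interval */
    uint64_t walk_cycles;    /* cycles spent walking page-tables */
    uint64_t tlb_misses;     /* TLB misses that triggered a walk */
};

/* Illustrative thresholds only, not values from the paper. */
#define MIN_WALK_PERCENT     10   /* share of runtime spent in page walks */
#define MIN_CYCLES_PER_WALK 100   /* average cost of a single walk */

/*
 * A process is a candidate if page walks take a noticeable share of its
 * runtime and each walk is expensive on average (time spent serving TLB
 * misses divided by the number of TLB misses).
 */
static bool replication_candidate(const struct walk_sample *s)
{
    assert(s != NULL);
    if (s->total_cycles == 0 || s->tlb_misses == 0)
        return false;
    uint64_t walk_percent = s->walk_cycles * 100 / s->total_cycles;
    uint64_t cycles_per_walk = s->walk_cycles / s->tlb_misses;
    return walk_percent >= MIN_WALK_PERCENT &&
           cycles_per_walk >= MIN_CYCLES_PER_WALK;
}
```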

Even if the OS makes a decision to migrate or replicate the page-tables, it may be costly to copy the entire page-table, as big-memory workloads easily reach page-tables of multiple GB in size. By using additional threads or even DMA engines on modern processors, the creation of a replica can happen in the background, and the application regains full performance when the replication or migration has completed.

The target applications of *Mitosis* are long-running, big-memory workloads with high TLB pressure. We therefore disable page-table replication for short-running processes, since the performance and memory cost of the replicated page-tables cannot be amortized for them (§ 8.3).

**Linux implementation:** We support a straightforward, system-wide policy with four states: *i*) completely disable *Mitosis*, *ii*) enable *Mitosis* on a per-process basis, *iii*) fix the allocation of page-tables on a particular socket, and *iv*) enable *Mitosis* for all processes in the system. This system-wide policy can be set through the sysctl interface of Linux. We leave it as future work to implement an automatic, counter-based approach.

### 6.2. User-controlled Policies

**General design:** System-wide policies usually imply a one-size-fits-all approach for all processes, but user-controlled policies allow programmers to use their understanding of their workloads and to select policies explicitly. These user-defined replication and migration policies can be combined with data and process placement primitives. Such policies can be selected when starting the program by defining the CPU set and replication set, or at runtime using corresponding system calls to set affinities and replication policies. All of these policies can be set per-process so that users have fine-grained control over replication and migration.

**Linux implementation:** We implement user-defined policies as an additional API call to libnuma and corresponding parameters of numactl. Similar to setting the allocation policy, we can supply a node-mask or a list of sockets on which to replicate the page-tables (Listing 2). Applications can thus select the replication policy at runtime, or we can use numactl to select the policy without changing the program.

numactl [--pgtablerepl= | -r <sockets>]
void numa\_set\_pgtable\_replication\_mask(struct bitmask \*);

Listing 2: Additions to libnuma and numactl

Both libnuma and numactl use two additional system calls to set and get the page-table replication bitmask. Whenever a new mask is set, *Mitosis* will walk the existing page-table and create replicas according to the new bitmask. The bitmask effectively specifies the replication factor: *N* bits set corresponds to copies on *N* sockets. Passing an empty bitmask restores the default behavior.
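The relation between the bitmask and the replication factor can be sketched as follows. For brevity the example uses a plain 64-bit integer rather than libnuma's `struct bitmask`, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SOCKETS 64   /* assumption: one bit per socket in a 64-bit mask */

/* Replication factor: the number of sockets selected in the mask. */
static int replication_factor(uint64_t mask)
{
    int n = 0;
    for (; mask != 0; mask &= mask - 1)   /* clear the lowest set bit */
        n++;
    return n;
}

/* Write the ids of the selected sockets to `sockets`; returns how many were written. */
static int replica_sockets(uint64_t mask, int *sockets, int capacity)
{
    assert(sockets != 0 && capacity >= 0);
    int n = 0;
    for (int s = 0; s < MAX_SOCKETS && n < capacity; s++)
        if (mask & (1ULL << s))
            sockets[n++] = s;
    return n;
}
```

A mask of `0x5`, for example, selects sockets 0 and 2 and therefore yields two copies, while an empty mask yields a factor of zero, which corresponds to the default, non-replicated behavior.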

# 7. Discussion

### 7.1. Why Linux Implementation?

As a proof-of-concept, we implement *Mitosis* in the widely-used Linux OS. Choosing Linux as our testbed allows us to prototype our ideas on a complex and complete OS, where the subtle interactions of many system features with *Mitosis* stress-test its evaluation. Specifically, we use mainline Linux kernel v4.17 and implement *Mitosis* for the x86\_64 architecture. We plan to release this implementation for everyone to use and plan to upstream the changes to the Linux kernel.

### 7.2. Applicability to Library OS

We have chosen to implement the prototype of *Mitosis* in Linux. However, the concept of *Mitosis* is applicable to other operating systems. Microkernels, for instance, push most of their memory management functionality into user-space libraries or processes while the kernel enforces security and isolation. In Barrelfish [22], for example, processes manage their own address space by explicit capability invocations to update page-tables with new mappings.

In such a system, one could implement *Mitosis* purely in user-space by linking to a *Mitosis*-enabled library OS, and the kernel itself would not need to be modified at all. The library can keep track of the address space, including page-tables, replicas etc. Those data-structures can easily be enhanced to include an array of page-table capabilities instead of a single such table. This would allow policies to be defined at application level by using an appropriate policy library. Updates to page-tables might need to be converted to explicit update messages to other sockets, which avoids the need for global locks and propagates updates lazily. On a page-fault, updates can be processed and applied accordingly in the page-fault handling routine. We leave such an implementation to future work, but believe it to be straightforward.

### 7.3. Huge/Large Pages Support?

Larger page sizes help reduce address translation overheads by increasing the amount of memory that each TLB entry maps by orders of magnitude. Even with 2MB and 1GB page size support in x86-64 on an Intel Haswell processor, the TLB reach is still less than 1%, assuming 1TB of main memory for any page size. Moreover, many commodity processors provide limited numbers of large page TLB entries, especially 1GB TLB entries, which limits their benefit [21, 39, 49]. Additionally, huge pages are not always the best choice [41].

Since address translation overheads remain non-negligible with larger page sizes, workloads using them are still susceptible to NUMA effects on page-table walks. Thus, our implementation of *Mitosis* supports larger page sizes, and we evaluate them. We extend transparent huge pages (THP), i.e., the 2MB page size in Linux, which requires coalescing smaller pages into a large page and splitting large pages into smaller ones. *Mitosis* is implemented to replicate the page-tables even in the presence of such mechanisms.

### 7.4. Applicability to Virtualized Systems?

Virtualized systems widely use hardware-based nested paging to virtualize memory [37]. This requires two-levels of page-table translation:

1. gVA to gPA: guest virtual address to guest physical address via a per-process guest OS page-table (gPT)
2. gPA to hPA: guest physical address to host physical address via a per-VM nested page-table (nPT)

In the best case, the virtualized address translation hits in the TLB to directly translate from gVA to hPA with no overheads. In the worst case, a TLB miss needs to perform a 2D page walk that multiplies overheads vis-a-vis native, because accesses to the guest page-table also require translation by the nested page-table. For x86-64, a nested page-table walk requires up to 24 memory accesses. This 2D page-table walk comes with additional hardware complexity.
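The figure of 24 accesses can be reconstructed with the usual counting argument for 2D walks; the symbols $n$ and $m$ are introduced here for illustration only. With $n$ guest page-table levels and $m$ nested levels, each of the $n$ guest page-table pointers and the final guest physical address needs an $m$-step nested walk, and each guest level adds one access to read the guest entry itself:

$$(n+1)\,m + n = (n+1)(m+1) - 1, \qquad n = m = 4 \;\Rightarrow\; 5 \cdot 4 + 4 = 24.$$

For comparison, a native 4-level walk needs only 4 accesses.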

Understanding page-table placement in virtualized systems is a major undertaking and requires a separate study. We believe we can extend *Mitosis*' design to replicate both guest page-tables and nested page-tables independently if the underlying NUMA architecture is exposed to the guest OS to improve performance of applications. To extend the design, we can rely on the nested page-table walk hardware, available since Haswell [4], setting accessed and dirty bits at both gPT and nPT. Thus, we can extend our OS extension for OR-ing the accessed and dirty bits across replicas to get the correct information at both levels independently. However, the main issue is that most cloud systems prefer not to expose the underlying architecture to the guest OS, making a case for novel approaches to replicate and migrate both levels of page-tables in a virtualized environment.

### 7.5. Consistency across page-table replicas?

Coherence between hardware TLBs is maintained by the OS with the help of TLB flush IPIs, and updates to the page-table are already thread-safe as they are performed within a critical section. In Linux, a lock is taken whenever the page-table of a process is modified, thus ensuring mutual exclusion. The updates to the page-table structure are made visible after releasing the lock. When an entry is modified, its effect is made visible to other cores through a global TLB flush, as the old entry might still be cached.

With *Mitosis*, we currently keep the same consistency guarantees by updating all page-table replicas eagerly while inside the critical section. Thus, only one thread can modify the page-table at a time. Hardware may read the page-table while updates are being carried out. The critical section ensures correctness while serving the page-fault, and the global TLB flush again ensures consistency after an entry is modified in case a core has cached the old one.

# 8. Evaluation

We evaluate *Mitosis* using a set of big-memory workloads and micro-benchmarks. We show: (1) how multi-threaded programs benefit from *Mitosis* (§ 8.1), (2) how *Mitosis* eliminates NUMA effects of page-walks when page-tables are placed on remote sockets due to task migration (§ 8.2), and (3) the memory and runtime overheads of *Mitosis* (§ 8.3).

**Hardware Configuration:** We used a four-socket Intel Xeon E7-4850v3 with 14 cores and 128GB memory per socket (512GB total memory) with 2-way hyper-threading running at 2.20GHz. The L3 cache is 35MB in size and the processor has a per-core two-level TLB with 64+1024 entries. Accessing memory on the local NUMA socket has a latency of about 280 cycles and a throughput of 28GB/s. For a remote NUMA socket, this is 580 cycles and 11GB/s, respectively.

### 8.1. Multi-socket Scenario

In this part of the evaluation, we focus on multi-threaded workloads running in parallel on all sockets in the system. For a machine with *N* NUMA sockets, in expectation  $\frac{N-1}{N}$  of page-table accesses will be remote while the remote sockets are busy themselves. We evaluate six workloads (see § 3.1), for all commonly used configurations that influence data and page-table placement (see Table 3). Performance is presented as an average of three runs, excluding the initialization phase.

The results are shown in Figure 9a for 4KB pages and Figure 9b for 2MB large pages. All bars are normalized to the 4KB first-touch allocation policy (bar: F). Bars with the same allocation policy are grouped in boxes for comparison. The number on top of *Mitosis* bars (green) shows the improvement over the corresponding non-*Mitosis* bars (purple) within a box. Note that data allocation policy impacts performance and is shown across boxes for each workload. The results for 2MB pages are normalized to 4KB (bar: F) to show the performance impact of the increase in page size.

Figure 9: Normalized performance with *Mitosis* for multi-socket workloads with 4KB and 2MB page size. The lower hashed part of each bar is execution time spent in walking the page-tables.

| Config.  | Data pages                                   | Page-table pages                     |
|----------|----------------------------------------------|--------------------------------------|
| (T)F     | First-touch allocation                       | First-touch allocation (bar: purple) |
| (T)F+M   | First-touch allocation                       | Mitosis replication (bar: green)     |
| (T)F-A   | First-touch allocation + Auto page migration | First-touch allocation (bar: purple) |
| (T)F-A+M | First-touch allocation + Auto page migration | Mitosis replication (bar: green)     |
| (T)I     | Interleaved allocation                       | Interleaved allocation (bar: purple) |
| (T)I+M   | Interleaved allocation                       | Mitosis replication (bar: green)     |

Table 3: Configurations for multi-socket scenario where workload runs on all sockets. T denotes Linux with THP. M denotes the corresponding data allocation policy with *Mitosis*.

We observe that with 4KB pages, up to 40% of the total runtime is spent in servicing TLB misses. *Mitosis* reduces the overall runtime for all applications with the best-case improvement of 1.34x for Canneal. Most of the improvements can be noted in the reduction of page-walk cycles due to replication of page-tables.

Large pages can significantly reduce translation overheads for many workloads. However, NUMA effects of page-table walks are still noticeable, even if all workload memory is backed by large pages. Hence, *Mitosis* provides significant speedup, e.g., 1.14x, 1.13x, 1.06x and 1.07x for Canneal, Memcached, XSBench and BTree, respectively. Note that the use of large pages can lead to decreased performance on NUMA systems, and large pages are still not used in many systems [41]. Using various data page placement policies improves performance for our workloads as expected. In combination with all policies, *Mitosis* consistently improves performance.

We have provided evidence that highly parallel workloads experience NUMA effects of remote-memory accesses due to page-table walks. Yet, running a workload concurrently means we cannot inspect a thread in isolation: a TLB miss on one core may populate the cache with the PTE needed to serve the TLB miss on another core of the same socket. Moreover, accessing a remote last-level cache may be faster than accessing DRAM. Nevertheless, we have shown that *Mitosis* is still able to improve multi-threaded workloads by up to 1.34x, and that it does so for both page sizes. Again, *Mitosis* does not cause any slowdown.

### 8.2. Workload Migration Scenario

As we observed in § 3.2, NUMA schedulers can move processes from one socket to another under various constraints. In this part of the evaluation, we show that *Mitosis* eliminates NUMA effects of page-walks originating from data and threads migrating to a different socket while page-tables remain fixed on the socket where the workload was first initialized.

We execute the same workloads used for the workload migration scenario in § 3.2. As an additional configuration, we enabled *Mitosis* when the page-table is allocated on a remote socket. Recall that we disabled Linux' AutoNUMA migration, and pre-allocated and initialized the working set (17-85GB).

The results are shown in Figure 10a and Figure 10b with 4KB and 2MB page sizes, respectively. Table 2 in § 3.2 showed the configurations used for evaluation: LP-LD (Local PT - Local Data) and RPI-LD (Remote PT with interference - Local Data). RPI-LD+M shows the improvement with page-table migration enabled by *Mitosis* when the RPI-LD case arises in the system. The boxes denote the bars to compare to see the improvement due to page-table migration. The number on top of the bar denotes the improvement due to *Mitosis* (green bar) as compared to the non-*Mitosis* bar (purple bar) within the same box. All bars are normalized to the 4KB LP-LD configuration. The results for 2MB pages are normalized to 4KB (bar: LP-LD) to show the performance impact of the increase in page size.

With 4KB pages (Figure 10a), remote page-tables cause 1.4x to 3.2x slowdown (bar: RPI-LD) relative to the baseline (LP-LD). *Mitosis* can mitigate this overhead and has the same performance as the baseline by migrating the page-tables with process migration.

With 2MB large pages (Figure 10b), the page-walk overheads are comparatively lower; nevertheless, we observe a slowdown of up to 2.3x for the TRPI-LD configuration over TLP-LD. Again, *Mitosis* can mitigate this overhead and has the same performance as the TLP-LD configuration. Note that for certain workloads the page-tables are cached well in the CPU caches and thus there is no difference in runtime. For example, in the case of GUPS, we observe roughly one TLB miss per data access, i.e., two cache-line requests in total per data array access. Breaking this down, each leaf page-table cache-line covers about 16MB of memory, which corresponds to 256k cache-lines of the data array. Therefore, each page-table cache-line is accessed 256k times more often than a data array cache-line, and there are fewer than 500k page-table cache-lines, which can easily be cached in the L3 cache of the socket. In summary, page-table entries are likely to be present in the socket's processor cache.
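The 16MB and 256k figures follow from simple arithmetic, sketched below under the assumption of 64-byte cache-lines and 8-byte page-table entries, as on x86-64. The function names are chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* x86-64 cache-line size */
#define PTE_BYTES         8   /* x86-64 page-table entry size */

/* Bytes of memory translated by the entries in one leaf page-table cache-line. */
static uint64_t coverage_per_pt_line(uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return (CACHE_LINE_BYTES / PTE_BYTES) * page_bytes;
}

/* Number of data cache-lines that share a single leaf page-table cache-line. */
static uint64_t data_lines_per_pt_line(uint64_t page_bytes)
{
    return coverage_per_pt_line(page_bytes) / CACHE_LINE_BYTES;
}
```

With 2MB pages, one cache-line holds 8 entries and thus covers 8 x 2MB = 16MB, which in turn spans 16MB / 64B = 256k data cache-lines. With 4KB pages the same cache-line covers only 32KB.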

**Memory Fragmentation:** Physical memory fragmentation limits the availability of large pages as the system ages, leading to higher page-walk overheads [51, 56]. Figure 11 shows the performance of *Mitosis* under heavy fragmentation while using THP in Linux with 2MB page size. We observe that all workloads, including those that did not show performance improvement with *Mitosis* while using 2MB pages in Figure 10b, show dramatic improvement with *Mitosis* in this case. This is due to workloads falling back to 4KB pages under fragmentation, which we have already shown to be susceptible to NUMA effects of page-table walks. Note that we present this experiment under heavy fragmentation to demonstrate that even if large pages are enabled, page-walk overheads can approach those of 4KB pages. In practice, the actual state of memory fragmentation may depend on several factors, and these overheads will be proportional to the failure rate of large page allocations.

Figure 11: Performance of *Mitosis* in workload migration scenario with 2MB pages under heavy memory fragmentation.

**Summary:** With this evaluation, we have shown that *Mitosis* completely avoids the overheads that result from page-tables being misplaced on remote NUMA sockets. In none of the cases did *Mitosis* result in a slowdown of the workload.

### 8.3. Space and Runtime Overheads

Enabling *Mitosis* implies maintaining replicas which consume memory and use CPU cycles to be kept consistent. We evaluate these overheads by estimating the additional memory requirement, and then perform micro-benchmarks on the virtual memory operations and wrap up by running applications end-to-end to set those overheads into perspective.

**8.3.1. Memory Overheads** We estimate the overhead of the additional memory used to store the page-table replicas when *Mitosis* is enabled. We define the two-dimensional function

$$\mathit{mem\_overhead}(\mathit{Footprint}, \mathit{Replicas}) = \mathit{Overhead}\%$$

that calculates the memory overhead relative to the single page-table baseline, and evaluate it using different values for the application's memory footprint and the number of replicas. For this estimation, we assume 4-level x86 paging with a compact address space, e.g., the application uses addresses 0..*Footprint*. Each level has at least one page-table allocated and a page-table is 4KB in size.

Table 4 shows the memory overheads of *Mitosis* for small to large applications using up to 16 replicas. We use the single page-table case as the baseline. The page-table accounts for about 0.19% of the total footprint, except for the 1MB case where it accounts for 1.5%. For larger memory footprints, *Mitosis* requires about 2.9% of additional memory for 16 replicas, whereas our four-socket machine used just 0.6% additional memory.

The page-tables use a small fraction of the total memory footprint of the application. For small programs, the fraction is higher because there is a hard minimum of at least 16KB of page-tables–a 4KB page for each level. This is reflected by the large 23.1% increase in memory consumption for small programs. However, putting this into perspective we advocate not to use *Mitosis* in this case as the 1MB memory footprint falls within the TLB coverage.

In summary, we showed that even with a 16-socket NUMA machine, *Mitosis* adds just 2.9% memory overhead and this overhead drops to 0.6% for our four-socket machine.
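The estimate can be reproduced with a short sketch that follows the stated assumptions (4-level paging, 4KB pages and page-tables, compact address space, at least one table per level). The overhead is expressed as a ratio, as in Table 4; `page_table_bytes` is a helper name introduced for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES        4096ULL
#define ENTRIES_PER_TABLE 512ULL
#define LEVELS            4

/* Bytes of page-tables needed to map addresses [0, footprint) with 4KB pages. */
static uint64_t page_table_bytes(uint64_t footprint)
{
    assert(footprint > 0);
    uint64_t units = (footprint + PAGE_BYTES - 1) / PAGE_BYTES;   /* data pages */
    uint64_t tables = 0;
    for (int level = 0; level < LEVELS; level++) {
        /* One table per 512 units of the level below, and at least one. */
        units = (units + ENTRIES_PER_TABLE - 1) / ENTRIES_PER_TABLE;
        tables += units;
    }
    return tables * PAGE_BYTES;
}

/* Total memory with `replicas` page-table copies, relative to a single copy. */
static double mem_overhead(uint64_t footprint, unsigned replicas)
{
    double pt = (double)page_table_bytes(footprint);
    return ((double)footprint + replicas * pt) / ((double)footprint + pt);
}
```

For a 1MB footprint this yields 16KB of page-tables and a ratio of about 1.231 with 16 replicas; for 1GB it yields about 2.01MB of page-tables and a ratio of about 1.006 with 4 replicas, in line with Table 4.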

| Footprint | PT Size | 1 replica | 2 replicas | 4 replicas | 8 replicas | 16 replicas |
|-----------|---------|-----------|------------|------------|------------|-------------|
| 1 MB      | 0.02 MB | 1.0       | 1.015      | 1.046      | 1.108      | 1.231       |
| 1 GB      | 2.01 MB | 1.0       | 1.002      | 1.006      | 1.014      | 1.029       |
| 1 TB      | 2.00 GB | 1.0       | 1.002      | 1.006      | 1.014      | 1.029       |
| 16 TB     | 32.0 GB | 1.0       | 1.002      | 1.006      | 1.014      | 1.029       |

Table 4: Memory footprint overhead for Mitosis

| Operation | 4KB region | 8MB region | 4GB region |
|-----------|------------|------------|------------|
| mmap      | 1.021x     | 1.008x     | 1.006x     |
| mprotect  | 1.121x     | 3.238x     | 3.279x     |
| munmap    | 1.043x     | 1.354x     | 1.393x     |

Table 5: Runtime overhead of *Mitosis* for virtual memory operation system calls using 4-way Replication.

**8.3.2. VMA Operation Overheads** In this part of the evaluation, we are interested in understanding the overheads of self-replicating page-tables for common virtual memory operations such as *mmap*, *mprotect* and *munmap*.

We conducted a micro-benchmark that repeatedly calls the VMA operations and measured the time to complete the corresponding system calls. For each operation, we enforce that the page-table modifications are carried out, e.g., by passing the MAP\_POPULATE flag to mmap. We varied the number of affected pages from a single page to a large region of memory of multiple GB in size. We ran the micro-benchmark with *Mitosis* enabled and disabled on an otherwise idle system. We use 4KB pages and 4-way replication.
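A micro-benchmark of this kind could look roughly like the sketch below. It is not the benchmark used in the paper: `struct vma_timing`, `now_ns` and `time_vma_ops` are names invented for this example, it measures wall-clock nanoseconds rather than CPU cycles, and it performs a single iteration per call. It is Linux-specific because of MAP\_POPULATE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

struct vma_timing { uint64_t mmap_ns, mprotect_ns, munmap_ns; };

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Time one mmap/mprotect/munmap cycle over len bytes; 0 on success, -1 on error. */
static int time_vma_ops(size_t len, struct vma_timing *out)
{
    assert(out != NULL && len > 0);

    uint64_t t0 = now_ns();
    /* MAP_POPULATE forces the page-table entries to be created during the call. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    uint64_t t1 = now_ns();
    if (p == MAP_FAILED)
        return -1;

    int rc_protect = mprotect(p, len, PROT_READ);   /* rewrites populated entries */
    uint64_t t2 = now_ns();

    int rc_unmap = munmap(p, len);                  /* tears the mapping down */
    uint64_t t3 = now_ns();

    out->mmap_ns = t1 - t0;
    out->mprotect_ns = t2 - t1;
    out->munmap_ns = t3 - t2;
    return (rc_protect == 0 && rc_unmap == 0) ? 0 : -1;
}
```

Running the function for region sizes such as 4KB, 8MB and 4GB, once with replication enabled and once without, and dividing the two timings gives ratios of the kind reported in Table 5.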

The results of this micro-benchmark are shown in Table 5. We measured the CPU cycles required to perform each operation on a memory region of size 4KB, 8MB, or 4GB with *Mitosis* on and off. The table reports the overhead of *Mitosis*, calculated by dividing the 4-way replicated case (*Mitosis* on) by the base case (*Mitosis* off). For mmap, we observe an overhead of about 2% or less. For munmap, the overhead grows to 35-39% for the larger regions, while *Mitosis* adds more than 3x overhead for mprotect.

With 4-way replication, there are four sets of page-tables that need to be updated, resulting in four times the work. We attribute the rather low overhead for mmap to the allocation and zeroing of new data pages during the system call. Likewise, when performing munmap, the freed pages are handed back to the allocator but not zeroed, resulting in less work per page and thus a higher relative overhead of replication. *Mitosis* experiences a large overhead for mprotect, which is still smaller than the replication factor. The mprotect operation does a read-modify-write cycle on the affected page-table entries. This process is efficient with no replicas as it results in sequential access within a page-table. However, with the PV-Ops interface, all replicas are updated for each written entry, which destroys locality. This can be avoided by either changing the PV-Ops interface or implementing lazy updates.

**8.3.3. No End-to-End Slowdown** We now set the VMA operations micro-benchmark of the previous section into the perspective of real-world applications. We show that our modifications to the Linux kernel to support *Mitosis* have negligible end-to-end overhead for applications.

| Workload | Mitosis Off   | Mitosis On    | Overhead |
|----------|---------------|---------------|----------|
| GUPS     | 270.93 (0.43) | 272.18 (0.00) | 0.46%    |
| Redis    | 633.94 (0.34) | 636.31 (0.86) | 0.37%    |

Table 6: Runtimes with LP-LD setting, including initialization with and without *Mitosis*. Standard Deviation in Brackets.

We compare the execution time of the single-threaded benchmarks. We run those benchmarks with and without *Mitosis* and measure overall execution time, including allocation and initialization phase. We use the LP-LD configuration, i.e. everything is locally allocated. THP is deactivated.

The results are shown in Table 6. We observe that in both cases, GUPS and Redis, the overheads of *Mitosis* are less than half a percent, which is small compared to the improvements we have demonstrated earlier.

# 9. Conclusion

We presented *Mitosis*: a technique that transparently replicates page-tables on large-memory machines, and provides the first platform to systematically evaluate page-table allocation policies inside the OS. With strong empirical evidence, we made the case for taking the allocation and placement of page-tables to a first-class consideration, in turn, optimizing performance on NUMA systems. We also demonstrated the benefits of replicating page-tables in large-memory machines for various use-cases, while observing negligible memory and runtime overheads. We plan to open-source the tools used in this work to inspire further research on optimizing page-table placement. Moreover, we plan to work with the Linux community to get *Mitosis* integrated into the mainline kernel.

# References

- [1] "Amd epyc infinity fabric latency ddr4 2400 v 2666: A snapshot," https://www.servethehome.com/amd-epyc-infinity-fabriclatency-ddr4-2400-v-2666-a-snapshot/.
- [2] "AutoNUMA: the other approach to NUMA scheduling," https://lwn.net/articles/488709/.
- [3] "Extreme Performance Series: vSphere Compute & Memory Schedulers," https://static.rainfocus.com/vmware/vmworldus17/sess/1489512432328001AfWH/finalpresentationPDF/SER2343BU\_FORMATTED\_FINAL\_1507912874739001gpDS.pdf.
- [4] "FOUR NEW VIRTUALIZATION TECHNOLOGIES ON THE LATEST INTEL® XEON," https://software.intel.com/en-us/blogs/2014/09/08/four-new-virtualization-technologies-on-the-latest-intelxeon-are-you-ready-to.
- [5] "Graph500 | large scale benchmarks," https://graph500.org.
- [6] "Intel's Enterprise Extravaganza 2019: Launching Cascade Lake, Optane DCPMM, Agilex FPGAs, 100G Ethernet, and Xeon D-1600," https://www.anandtech.com/show/14155/intels-enterpriseextravaganza-2019-roundup.
- [7] "Liblinear a library for large linear classification," https://www.csie.ntu.edu.tw/~cjlin/liblinear/.
- [8] "memcached: a distributed memory object caching system," https://memcached.org.
- [9] "Paravirt\_ops," https://www.kernel.org/doc/Documentation/virtual/paravirt\_ops.txt.
- [10] "Parsec benchmark suite," https://parsec.cs.princeton.edu/overview.htm.
- [11] "RandomAccess: GUPS (Giga Updates Per Second)," https://icl.utk.edu/projectsfiles/hpcc/RandomAccess/.
- [12] "Redis," https://redis.io.
- [13] "STREAM: Sustainable Memory Bandwidth in High Performance Computers," https://www.cs.virginia.edu/stream/.
- [14] "XSBench: The Monte Carlo Macroscopic Cross Section Lookup Benchmark," https://github.com/ANL-CESAR/XSBench.
- [15] "Xv6, a simple Unix-like teaching operating system," https://pdos.csail.mit.edu/6.828/2012/xv6.html.
- [16] H. Alam, T. Zhang, M. Erez, and Y. Etsion, "Do-it-yourself virtual memory translation," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ser. ISCA '17, 2017, pp. 457–468.

- [17] AMD, "The Next Generation AMD Enterprise Server Product Architecture," https://www.hotchips.org/wp-content/uploads/hc\_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.921-EPYC-Lepak-AMD-v2.pdf.
- [18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization," in *Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles*, ser. SOSP '03. Bolton Landing, NY, USA: ACM, 2003, pp. 164–177. [Online]. Available: http://doi.acm.org/10.1145/945445.945462
- [19] T. W. Barr, A. L. Cox, and S. Rixner, "Translation Caching: Skip, Don't Walk (the Page Table)," in *Proceedings of the 37th Annual International Symposium on Computer Architecture*, ser. ISCA '10, Saint-Malo, France, 2010, pp. 48–59. [Online]. Available: http://doi.acm.org/10.1145/1815961.1815970
- [20] T. W. Barr, A. L. Cox, and S. Rixner, "SpecTLB: A mechanism for speculative address translation," in *2011 38th Annual International Symposium on Computer Architecture (ISCA)*, June 2011, pp. 307–317.
- [21] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, "Efficient Virtual Memory for Big Memory Servers," in *Proceedings of the 40th Annual International Symposium on Computer Architecture*, ser. ISCA '13, Tel-Aviv, Israel, 2013, pp. 237–248. [Online]. Available: http://doi.acm.org/10.1145/2485922.2485943
- [22] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, "The Multikernel: A New OS Architecture for Scalable Multicore Systems," in *Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles*, ser. SOSP '09, Big Sky, Montana, USA, 2009, pp. 29–44. [Online]. Available: http://doi.acm.org/10.1145/1629575.1629579
- [23] S. Beamer, K. Asanovic, and D. A. Patterson, "The GAP benchmark suite," *CoRR*, vol. abs/1508.03619, 2015. [Online]. Available: http://arxiv.org/abs/1508.03619
- [24] A. Bhattacharjee, "Large-reach Memory Management Unit Caches," in *Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture*, ser. MICRO-46, Davis, California, 2013, pp. 383–394. [Online]. Available: http://doi.acm.org/10.1145/2540708.2540741
- [25] A. Bhattacharjee, "Translation-Triggered Prefetching," in *Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS '17, Xi'an, China, 2017, pp. 63–76. [Online]. Available: http://doi.acm.org/10.1145/3037697.3037705
- [26] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared Last-level TLBs for Chip Multiprocessors," in *Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture*, ser. HPCA '11, 2011, pp. 62–63. [Online]. Available: http://dl.acm.org/citation.cfm?id=2014698.2014896
- [27] J. Bouron, S. Chevalley, B. Lepers, W. Zwaenepoel, R. Gouicem, J. Lawall, G. Muller, and J. Sopena, "The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS," in *Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference*, ser. USENIX ATC '18. Berkeley, CA, USA: USENIX Association, 2018, pp. 85–96. [Online]. Available: http://dl.acm.org/citation.cfm?id=3277355.3277364
- [28] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, "Corey: An Operating System for Many Cores," in *Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation*, ser. OSDI'08. San Diego, California: USENIX Association, 2008, pp. 43–57. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855741.1855745
- [29] I. Calciu, S. Sen, M. Balakrishnan, and M. K. Aguilera, "Black-box Concurrent Data Structures for NUMA Architectures," in *Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS '17, Xi'an, China, 2017, pp. 207–221. [Online]. Available: http://doi.acm.org/10.1145/3037697.3037721
- [30] A. T. Clements, M. F. Kaashoek, and N. Zeldovich, "RadixVM: Scalable Address Spaces for Multithreaded Applications," in *Proceedings of the 8th ACM European Conference on Computer Systems*, ser. EuroSys '13, Prague, Czech Republic, 2013, pp. 211–224.
- [31] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi, "Albatross: Lightweight elasticity in shared storage databases for the cloud using live data migration," in *Proceedings of the 2011 VLDB Endowment*, ser. VLDB '11, 2011.

- [32] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth, "Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems," in *Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS '13, Houston, Texas, USA, 2013, pp. 381–394. [Online]. Available: http://doi.acm.org/10.1145/2451116.2451157
- [33] Y. Demir, Y. Pan, S. Song, N. Hardavellas, J. Kim, and G. Memik, "Galaxy: A High-performance Energy-efficient Multi-chip Architecture Using Photonic Interconnects," in *Proceedings of the 28th ACM International Conference on Supercomputing*, ser. ICS '14, Munich, Germany, 2014, pp. 303–312. [Online]. Available: http://doi.acm.org/10.1145/2597652.2597664
- [34] Y. Du, M. Zhou, B. R. Childers, D. Mossé, and R. Melhem, "Supporting Superpages in Non-Contiguous Physical Memory," in *2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)*, Feb 2015, pp. 223–234.
- [35] Z. Fang, L. Zhang, J. B. Carter, W. C. Hsieh, and S. A. McKee, "Reevaluating Online Superpage Promotion with Hardware Support," in *Proceedings of the 7th International Symposium on High-Performance Computer Architecture*, ser. HPCA '01, 2001, pp. 63–. [Online]. Available: http://dl.acm.org/citation.cfm?id=580550.876428
- [36] N. Ganapathy and C. Schimmel, "General Purpose Operating System Support for Multiple Page Sizes," in *Proceedings of the Annual Conference on USENIX Annual Technical Conference*, ser. ATEC '98, New Orleans, Louisiana, 1998, pp. 8–8. [Online]. Available: http://dl.acm.org/citation.cfm?id=1268256.1268264
- [37] J. Gandhi, M. D. Hill, and M. M. Swift, "Agile Paging for Efficient Memory Virtualization," *IEEE Micro*, vol. 37, no. 3, pp. 80–86, 2017.
- [38] J. Gandhi, V. Karakostas, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. S. Ünsal, "Range Translations for Fast Virtual Memory," *IEEE Micro*, vol. 36, no. 3, pp. 118–126, May 2016.
- [39] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, "Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks," in *Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture*, ser. MICRO-47, Cambridge, United Kingdom, 2014, pp. 178–189. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2014.37
- [40] J. Gandhi, M. D. Hill, and M. M. Swift, "Agile Paging: Exceeding the Best of Nested and Shadow Paging," in *Proceedings of the 43rd International Symposium on Computer Architecture*, ser. ISCA '16, Seoul, Republic of Korea, 2016, pp. 707–718. [Online]. Available: https://doi.org/10.1109/ISCA.2016.67
- [41] F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, and V. Quéma, "Large Pages May Be Harmful on NUMA Systems," in *Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference*, ser. USENIX ATC'14, Philadelphia, PA, 2014, pp. 231–242. [Online]. Available: http://dl.acm.org/citation.cfm?id=2643634.2643659
- [42] S. Haria, M. D. Hill, and M. M. Swift, "Devirtualizing memory in heterogeneous systems," in *Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS '18. New York, NY, USA: ACM, 2018, pp. 637–650. [Online]. Available: http://doi.acm.org/10.1145/3173162.3173194
- [43] Intel Corp., "5-Level Paging and 5-Level EPT," https://software.intel.com/sites/default/files/managed/2b/80/5level\_paging\_white\_paper.pdf.
- [44] Intel Corp., "New Intel Core Processor Combines High-Performance CPU with Custom Discrete Graphics from AMD to Enable Sleeker, Thinner Devices," https://newsroom.intel.com/editorials/new-intelcore-processor-combine-high-performance-cpu-discrete-graphicssleek-thin-devices/.
- [45] S. S. Iyer, "Heterogeneous Integration for Performance and Scaling," *IEEE Transactions on Components, Packaging and Manufacturing Technology*, vol. 6, no. 7, pp. 973–982, July 2016.
- [46] S. Kaestle, R. Achermann, T. Roscoe, and T. Harris, "Shoal: Smart Allocation and Replication of Memory for Parallel Programs," in *Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference*, ser. USENIX ATC '15, Santa Clara, CA, 2015, pp. 263–276. [Online]. Available: http://dl.acm.org/citation.cfm?id=2813767.2813787
- [47] G. B. Kandiraju and A. Sivasubramaniam, "Going the Distance for TLB Prefetching: An Application-driven Study," in *Proceedings of the 29th Annual International Symposium on Computer Architecture*, ser. ISCA '02, 2002, pp. 195–206. [Online]. Available: http://dl.acm.org/citation.cfm?id=545215.545237

- [48] A. Kannan, N. E. Jerger, and G. H. Loh, "Enabling interposer-based disintegration of multi-core processors," in *2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*, Dec 2015, pp. 546–558.
- [49] V. Karakostas, O. S. Unsal, M. Nemirovsky, A. Cristal, and M. Swift, "Performance analysis of the memory management unit under scale-out workloads," in *2014 IEEE International Symposium on Workload Characterization (IISWC)*, Oct 2014, pp. 1–12.
- [50] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. Ünsal, "Redundant Memory Mappings for Fast Access to Large Memories," in *Proceedings of the 42nd Annual International Symposium on Computer Architecture*, ser. ISCA '15, Portland, Oregon, 2015, pp. 66–78. [Online]. Available: http://doi.acm.org/10.1145/2749469.2749471
- [51] Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel, "Coordinated and efficient huge page management with ingens," in *Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation*, ser. OSDI'16. Berkeley, CA, USA: USENIX Association, 2016, pp. 705–721. [Online]. Available: http://dl.acm.org/citation.cfm?id=3026877.3026931
- [52] J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Quéma, and A. Fedorova, "The linux scheduler: A decade of wasted cores," in *Proceedings of the Eleventh European Conference on Computer Systems*, ser. EuroSys '16. New York, NY, USA: ACM, 2016, pp. 1:1–1:16. [Online]. Available: http://doi.acm.org/10.1145/2901318.2901326
- [53] D. Lustig, A. Bhattacharjee, and M. Martonosi, "TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs," *ACM Trans. Archit. Code Optim.*, vol. 10, no. 1, pp. 2:1–2:38, Apr. 2013. [Online]. Available: http://doi.acm.org/10.1145/2445572.2445574
- [54] Marvell Corporation, "MoChi Architecture," http://www.marvell.com/architecture/mochi/.
- [55] J. Navarro, S. Iyer, P. Druschel, and A. Cox, "Practical, Transparent Operating System Support for Superpages," *SIGOPS Oper. Syst. Rev.*, vol. 36, no. SI, pp. 89–104, Dec. 2002. [Online]. Available: http://doi.acm.org/10.1145/844128.844138
- [56] A. Panwar, A. Prasad, and K. Gopinath, "Making huge pages actually useful," in *Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS '18. New York, NY, USA: ACM, 2018, pp. 679–692. [Online]. Available: http://doi.acm.org/10.1145/3173162.3173203
- [57] M. Papadopoulou, X. Tong, A. Seznec, and A. Moshovos, "Prediction-based superpage-friendly TLB designs," in *2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)*, Feb 2015, pp. 210–222.
- [58] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, "Increasing TLB reach by exploiting clustering in page translations," in *2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)*, Feb 2014, pp. 558–567.
- [59] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, "CoLT: Coalesced Large-Reach TLBs," in *Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture*, ser. MICRO-45, Vancouver, B.C., CANADA, 2012, pp. 258–269. [Online]. Available: https://doi.org/10.1109/MICRO.2012.32
- [60] B. Pham, J. Veselý, G. H. Loh, and A. Bhattacharjee, "Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?" in *Proceedings of the 48th International Symposium on Microarchitecture*, ser. MICRO-48, Waikiki, Hawaii, 2015, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/2830772.2830773
- [61] K. Rangan, G.-Y. Wei, and D. Brooks, "Thread motion: Fine-grained power management for multi-core systems," in *Proceedings of the 2009 International Symposium on Computer Architecture*, ser. ISCA '09, 2009.
- [62] A. Saulsbury, F. Dahlgren, and P. Stenström, "Recency-based TLB Preloading," in *Proceedings of the 27th Annual International Symposium on Computer Architecture*, ser. ISCA '00, Vancouver, British Columbia, Canada, 2000, pp. 117–127. [Online]. Available: http://doi.acm.org/10.1145/339647.339666
- [63] A. Seznec, "Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB," *IEEE Trans. Comput.*, vol. 53, no. 7, pp. 924–927, Jul. 2004. [Online]. Available: https://doi.org/10.1109/TC.2004.21
- [64] M. Swanson, L. Stoller, and J. Carter, "Increasing TLB Reach Using Superpages Backed by Shadow Memory," in *Proceedings of the 25th Annual International Symposium on Computer Architecture*, ser. ISCA '98, Barcelona, Spain, 1998, pp. 204–213. [Online]. Available: https://doi.org/10.1145/279358.279388

- [65] Taiwan Semiconductor Manufacturing Company, "CoWoS Services," http://www.tsmc.com/english/dedicatedFoundry/services/cowos.htm.
- [66] M. Talluri and M. D. Hill, "Surpassing the TLB Performance of Superpages with Less Operating System Support," in *Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS VI, San Jose, California, USA, 1994, pp. 171–182. [Online]. Available: http://doi.acm.org/10.1145/195473.195531
- [67] J. Yin, Z. Lin, O. Kayiran, M. Poremba, M. S. B. Altaf, N. E. Jerger, and G. H. Loh, "Modular Routing Design for Chiplet-Based Systems," in *2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)*, June 2018, pp. 726–738.