RDMA@EPFL


Remote Direct Memory Access (RDMA) is a feature of modern networking hardware that enables a computer to access the memory of a remote computer without involving the remote CPU. RDMA provides low-latency, high-throughput communication by bypassing the operating system kernel and by implementing several layers of the network stack in hardware.
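To make the one-sided nature of RDMA concrete, the sketch below posts an RDMA read with the widely used libibverbs API; the remote CPU runs no code to serve the request. Connection setup (device, protection domain, queue pair) is omitted for brevity, and the helper name and parameters are ours, for illustration only.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Post a one-sided RDMA read: fetch `len` bytes from a remote
     * address into a locally registered buffer. The remote NIC serves
     * the request directly from memory; the remote CPU is not involved.
     * Assumes an already-connected queue pair `qp` and a registered
     * local memory region `local_mr`. */
    static int rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                         void *local_buf, uint64_t remote_addr,
                         uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_RDMA_READ;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;   /* ask for a completion event */
        wr.wr.rdma.remote_addr = remote_addr; /* as advertised by the peer */
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
    }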

In recent years, the goal of our work has been to explore the opportunities and challenges raised by RDMA, at both the theoretical and the practical level. From a theoretical standpoint, RDMA enables us to think of physically separate machines as both sharing memory and passing messages, a hybrid of the conventional message-passing and shared-memory models. As we have shown in our work, this hybrid model leads to interesting results that are impossible in either message passing or shared memory taken alone. Specifically, we have shown that the hybrid model enables agreement algorithms with higher fault tolerance and/or lower communication complexity than are achievable in either model separately.

Our theoretical results are highly relevant to practical distributed systems. As we demonstrate through our work on state-machine replication over RDMA, system designs that leverage the unique semantics of RDMA deliver considerably better performance, both on and off the critical path, than existing, conventional designs.


Publications & Preprints


We have published a series of papers on the topic:

Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, Athanasios Xygkis, and Igor Zablotchi. Microsecond Consensus for Microsecond Applications. OSDI 2020.

Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, and Igor Zablotchi. The Impact of RDMA on Agreement. PODC 2019.

Marcos K. Aguilera, Naama Ben-David, Irina Calciu, Rachid Guerraoui, Erez Petrank, and Sam Toueg. Passing Messages while Sharing Memory. PODC 2018.


Ongoing work

Project funded by Huawei
Distributed Index for Remote Shared Memory

Project summary: With the advent of faster datacenter interconnects, such as InfiniBand, microsecond-scale computing is emerging as a must [1]. When these interconnects, which are capable of microsecond latencies and throughput beyond 100 Gbps, are paired with memory access technologies like RDMA, a microsecond-scale application can be expected to process a request in about 10 microseconds. Areas where software systems care about microsecond performance include finance (e.g., trading systems), embedded computing (e.g., control systems), and microservices (e.g., key-value stores). Some of these areas are critical, and it is desirable to provide high performance, robustness under heavy load, and fault tolerance in the presence of malfunctioning computing nodes. To achieve these characteristics, microsecond applications usually rely on distributed coordination among the computing nodes.
In this project, the abstract notion of distributed coordination is expressed via a distributed index, i.e., a data structure that handles concurrent modification and loosely resembles a key-value store. The primary purpose of an index is to store and retrieve the location of data based on an assigned identifier. The fact that the index is distributed means that the “identifier-to-data” mappings are stored on multiple physical nodes inside the datacenter. The identifiers may be structured or unstructured, so existing SQL-based technologies are not appropriate. At the same time, the index is expected to be too large to fit on a single physical node, which further justifies the need for this service to be distributed.
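As a rough illustration of the intended service, the sketch below shows what a client-side interface to such a distributed index could look like, together with a hash-based placement rule mapping each identifier to the node that owns it. All names and signatures here are hypothetical, not a committed API of the project.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical client-side interface of the distributed index.
     * Keys ("identifiers") map to the location of the data; the
     * mappings are sharded across physical nodes. */
    struct dindex;                          /* opaque client handle */

    struct dindex *dindex_connect(const char *const *node_addrs,
                                  size_t n_nodes);
    int dindex_put(struct dindex *idx, const void *key, size_t klen,
                   uint64_t node_id, uint64_t offset); /* key -> location */
    int dindex_get(struct dindex *idx, const void *key, size_t klen,
                   uint64_t *node_id, uint64_t *offset);
    int dindex_del(struct dindex *idx, const void *key, size_t klen);

    /* One simple placement rule: hash the key (FNV-1a) to pick the
     * node that stores its mapping. */
    static inline size_t dindex_shard(const void *key, size_t klen,
                                      size_t n_nodes)
    {
        const unsigned char *p = key;
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = 0; i < klen; i++)
            h = (h ^ p[i]) * 1099511628211ULL;
        return (size_t)(h % n_nodes);
    }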
The specific hardware that this project relies upon creates new challenges and research opportunities. Even though distributed indexes have been built in the past [11, 12], our project aims to leverage the latest advances in datacenter technology made by Huawei, namely its upcoming high-speed interconnect targeting resource disaggregation (similar in spirit to CXL, Gen-Z, OpenCAPI, and CCIX). The interconnect is integrated with the CPU and enables it to perform memory operations both locally and remotely. In other words, it offers assembly-level instructions that seamlessly perform memory operations (reads/writes) on the RAM modules local to the node’s CPU, as well as on remote RAM modules attached to the CPUs of other physical nodes. Although comparable technology already exists in competing products (e.g., Intel CPUs paired with Mellanox RDMA NICs), Huawei is developing a fully integrated solution and is therefore in a position to gain performance improvements from this integration.
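Conceptually, once the interconnect exposes remote RAM in the local address space, accessing it looks like an ordinary load or store rather than an explicit send/receive. The sketch below assumes a hypothetical remote_map primitive standing in for whatever mapping interface the hardware and OS actually provide.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: map `len` bytes of a remote node's RAM into the
     * local address space. The real primitive is provided by the
     * interconnect and its system software, not by us. */
    extern void *remote_map(int node_id, uint64_t remote_offset, size_t len);

    uint64_t read_remote_counter(int node)
    {
        /* A plain load; the CPU and interconnect turn it into a remote
         * memory operation, with no explicit send/receive calls. */
        volatile uint64_t *ctr = remote_map(node, /*offset=*/0, sizeof(*ctr));
        return *ctr;
    }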
In a nutshell, the goal of this project is to develop a service rather than a bare-bones data structure. The service will store data across multiple machines, deal with the problems stemming from the distribution of data (such as concurrent access), reply to client requests, and recover from crashes.