Review of "FaRM: Fast Remote Memory"

22 Sep 2015

Traditionally, distributed main-memory systems have been built on TCP/IP. As the price of DRAM drops, it becomes affordable to keep all data in memory for fast access, but in a distributed system network communication remains the bottleneck. FaRM, a platform built on RDMA, improves both latency and throughput by an order of magnitude over TCP/IP on the same network hardware.

Through a small set of simple APIs, FaRM uses RDMA both to access data directly in the shared address space and for fast messaging. RDMA provides reliable user-level reads and writes of remote memory. It achieves low latency and high throughput because it bypasses the kernel, avoids the overhead of complex protocol stacks, and transfers data without involving the remote CPU. Built on top of RDMA with some optimizations to the NIC driver, FaRM can therefore be much more efficient than a traditional TCP/IP-based distributed memory system.

FaRM implements a unidirectional channel as a circular buffer. The buffer is stored on the receiver, and each sender/receiver pair has its own buffer. Unused parts of the buffer are all zeros. The receiver polls the word at the "Head" position to detect new messages; a non-zero value there gives the length of the incoming message. The receiver uses this length to locate the message trailer and polls it until it becomes non-zero. Because RDMA writes are performed in increasing address order, a non-zero trailer means the entire message has arrived. The receiver then passes the message to the application layer, and once the application has consumed it, the receiver zeroes the space the message occupied and advances the head so that the space can be reused.
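
The polling protocol described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not FaRM's implementation: the class name RingChannel, the word-sized list standing in for RDMA-registered memory, the frame layout (length word, payload, trailer word), and the shared "used" counter that replaces real flow control between sender and receiver are all hypothetical choices made for illustration.

```python
class RingChannel:
    """Toy model of a one-way channel over a zero-initialised circular buffer.

    Assumed frame layout (hypothetical): [frame_length, payload..., trailer=1].
    """

    def __init__(self, size):
        self.buf = [0] * size  # conceptually lives in the receiver's memory
        self.head = 0          # receiver side: next slot to poll
        self.tail = 0          # sender side: next slot to write
        self.used = 0          # simplification: shared free-space accounting

    def send(self, payload):
        """Stand-in for the sender's remote write, done in increasing address order."""
        frame = [len(payload) + 2, *payload, 1]  # length word is never zero
        n = len(self.buf)
        if len(frame) > n - self.used:
            return False  # not enough free space; caller may retry later
        for i, word in enumerate(frame):
            self.buf[(self.tail + i) % n] = word
        self.tail = (self.tail + len(frame)) % n
        self.used += len(frame)
        return True

    def poll(self):
        """Receiver side: return one payload, or None if nothing is complete yet."""
        n = len(self.buf)
        length = self.buf[self.head]
        if length == 0:
            return None  # head word still zero: no new message
        if self.buf[(self.head + length - 1) % n] == 0:
            return None  # trailer still zero: message only partly written
        payload = [self.buf[(self.head + i) % n] for i in range(1, length - 1)]
        for i in range(length):
            self.buf[(self.head + i) % n] = 0  # zero the consumed slots
        self.head = (self.head + length) % n   # advance past the message
        self.used -= length
        return payload
```

In this sketch a payload may contain zero words, because only the head word and the trailer word are used as arrival signals. Zeroing the consumed slots is what lets the receiver keep treating a zero head word as "no message" after the buffer wraps around.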

FaRM exposes an event-based API, which scales better than a thread-based concurrency model. It provides strictly serializable ACID transactions, built on optimistic concurrency control and two-phase commit, to help applications maintain consistency. When performance matters more, it also offers lock-free read-only operations that are serializable with transactions. Replicated logging is used to provide high availability and strict serializability.

Object locations in the memory cluster are tracked with consistent hashing, which maps a 32-bit region identifier to the physical machine that stores the region. To access an object, FaRM either issues an RDMA request using the object's offset and size or, if the object is stored locally, performs a local memory access, which is much faster than RDMA.
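
The lookup path described above can be sketched as follows. This is a generic consistent-hashing ring, not FaRM's actual code: the names HashRing and read_object, the use of SHA-256, the virtual-node count, and the dictionary standing in for each machine's memory are all assumptions made for illustration, and the "rdma" branch only labels the path rather than performing a real remote read.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hashing ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, machines, vnodes=64):
        points = sorted(
            (_hash(f"{m}#{v}"), m) for m in machines for v in range(vnodes)
        )
        self._keys = [p for p, _ in points]
        self._owners = [m for _, m in points]

    def owner(self, region_id):
        """Return the machine responsible for a 32-bit region identifier."""
        if not 0 <= region_id < 2**32:
            raise ValueError("region identifier must fit in 32 bits")
        i = bisect.bisect_right(self._keys, _hash(f"region:{region_id}"))
        return self._owners[i % len(self._keys)]  # wrap around the ring


def read_object(ring, local_machine, stores, region_id, offset, size):
    """Pick the access path for an object and return (path, bytes).

    `stores` maps machine -> {region_id: bytes} and stands in for real memory.
    """
    machine = ring.owner(region_id)
    path = "local" if machine == local_machine else "rdma"
    data = stores[machine][region_id][offset:offset + size]
    return path, data
```

The property that makes consistent hashing attractive here is that removing a machine only remaps the regions that machine owned; every other region keeps its owner, so most cached location information stays valid.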

Will the technology be influential in 10 years? Maybe. It proposes a more efficient way of building distributed memory systems, one that achieves an order of magnitude better throughput and latency than a similar system built on TCP/IP over the same physical network. It would be even better if it also offered transparent support for existing applications, so that they could benefit without modification.