Review of "Flat Datacenter Storage"

20 Sep 2015


FDS is a fast, fault-tolerant blob storage system for data centers, built on a CLOS network topology.

Because bandwidth is limited (or the network is oversubscribed), remote data access is slow. Even if edge bandwidth (within a rack) is large, the bandwidth at the root (the core switches) is shared by many servers sitting at the edge, so a computer talking to a "remote" host gets only a small slice of it. Because of this, earlier distributed file systems like GFS force applications built on them to reason about locality, which adds a layer of complexity.

But when affordable CLOS topologies came out (small commodity switches plus equal-cost multi-path routing, which load-balances traffic across multiple equal-cost paths and greatly increases usable bandwidth), the assumption that remote communication is slow suddenly changed. With a full bisection bandwidth network, there is no longer a local/remote disk distinction, and you can have a simpler programming model.

In FDS, data is logically stored in blobs. A blob has a 128-bit GUID, which can be application-generated or system-generated. Reads and writes to blobs are done in units called tracts, which are 8 MB each. Tractservers manage raw disks (no filesystem) and serve read and write requests from clients. Tractserver calls are asynchronous, which allows deep pipelining, much like a TCP sliding window.
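To make the pipelining point concrete, here is a minimal sketch that keeps a window of tract reads in flight. It assumes a hypothetical async client object exposing a `read_tract(blob_guid, i)` coroutine (not the paper's actual API), and the window size is an arbitrary tunable.

```python
# Hedged sketch: deep pipelining of tract reads with a hypothetical async client.
# fds_client.read_tract(blob_guid, i) is an assumed API, not the paper's exact one;
# the point is that many 8 MB tract requests stay outstanding at once.

import asyncio

TRACT_SIZE = 8 * 1024 * 1024   # 8 MB, per the paper
WINDOW = 32                    # number of in-flight requests (tunable assumption)

async def read_blob(fds_client, blob_guid: bytes, num_tracts: int) -> list[bytes]:
    sem = asyncio.Semaphore(WINDOW)          # cap in-flight requests, like a sliding window

    async def read_one(i: int) -> bytes:
        async with sem:
            return await fds_client.read_tract(blob_guid, i)

    # Issue all tract reads up front; the semaphore keeps WINDOW of them outstanding.
    return await asyncio.gather(*(read_one(i) for i in range(num_tracts)))
```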

It uses a deterministic data placement mechanism called the Tract Locator Table (TLT). The TLT index for tract number i of the blob with GUID g is Tract Locator = (Hash(g) + i) mod len(TLT). Blob metadata is also distributed: it lives at TLT[(Hash(g) - 1) mod len(TLT)], an entry specifically reserved for it. One benefit of using a hash instead of keeping state is that it gives a satisfying distribution without the bookkeeping complexity that state would bring.
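A minimal sketch of these lookup formulas; the choice of hash function here is an assumption, the paper does not depend on a particular one.

```python
# Sketch of FDS tract lookup, following the formulas above.
# The TLT itself is just a list of entries naming one or more tractservers.

import hashlib

def blob_hash(guid: bytes) -> int:
    """Hash the 128-bit blob GUID to an integer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(guid).digest(), "big")

def tract_locator(guid: bytes, i: int, tlt_len: int) -> int:
    """TLT row for tract i of blob guid: (Hash(g) + i) mod len(TLT)."""
    return (blob_hash(guid) + i) % tlt_len

def metadata_locator(guid: bytes, tlt_len: int) -> int:
    """Blob metadata lives at the row for 'tract -1': (Hash(g) - 1) mod len(TLT)."""
    return (blob_hash(guid) - 1) % tlt_len
```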

The TLT is built by taking m permutations of the list of tractservers, so that read and write requests are spread across different disks. The TLT is served to clients by a lightweight metadata server (still a single point of failure?). It only needs to be updated when the cluster changes, which the design assumes is infrequent, so clients can cache the TLT aggressively.
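A hedged sketch of the permutation idea, ignoring the server weighting and replication details the paper layers on top:

```python
# Toy TLT construction with no replication: concatenate m random permutations of
# the tractserver list so that consecutive tracts land on different disks.

import random

def build_tlt(tractservers: list[str], m: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    tlt = []
    for _ in range(m):
        perm = tractservers[:]   # copy, then shuffle into one permutation
        rng.shuffle(perm)
        tlt.extend(perm)
    return tlt
```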

Replication information is also stored in the tract locator table and is used when a client performs operations: Create, Delete, and Extend go only to the primary, which propagates them to the replicas; Write goes to all replicas; Read goes to a random replica. Recovery is very fast because there is no locality concern: the system simply copies the surviving replicas onto other tractservers to stand in for the lost ones. Since all of this happens in parallel, recovery finishes quickly; recovering 92 GB of lost data takes only 6.2 s.
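As a toy illustration of how the replica list in a TLT row might be used, with in-memory dicts standing in for tractservers (the real system issues RPCs to remote machines):

```python
# Hedged sketch of replica handling driven by one TLT row: a write is sent to
# every tractserver in the row, a read goes to one replica chosen at random.

import random

def write_tract(tlt_row, blob_guid: bytes, i: int, data: bytes) -> None:
    for server in tlt_row:                      # write lands on all replicas
        server[(blob_guid, i)] = data

def read_tract(tlt_row, blob_guid: bytes, i: int) -> bytes:
    replica = random.choice(tlt_row)            # any replica can serve the read
    return replica[(blob_guid, i)]

# Usage: three replica "servers" in one TLT row.
row = [{}, {}, {}]
write_tract(row, b"g" * 16, 0, b"tract data")
assert read_tract(row, b"g" * 16, 0) == b"tract data"
```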

Will the work be influential in 10 years? I think so. As ECMP and full bisection bandwidth networks become more and more mature, the performance gains presented in this paper are certainly something storage people should think about.