Review of "The Google File System"

20 Sep 2015


Faced with the needs of its own infrastructure, Google developed its own distributed file system, the Google File System (GFS). It differs from earlier distributed file systems like NFS (Network File System) and AFS (Andrew File System) because it is designed for: 1) inexpensive commodity components that fail often, 2) a modest number of large files, each typically 100 MB or larger, 3) large streaming reads and small random reads, 4) large, sequential writes that append data to files, 5) concurrent appends to the same file, and 6) high sustained throughput rather than low latency.

A typical GFS deployment consists of many clients, a single master, and multiple (hundreds of) chunkservers. The master manages all the metadata, including namespaces, access control information, the mapping from files to chunks, and the locations of chunk replicas. All metadata is served from memory; everything except chunk locations is also persisted, whereas chunk locations are polled from the chunkservers. They chose not to persist chunk locations because polling is simpler when chunkservers come and go so often. A very interesting fact here is that GFS has only one master, which somewhat constrains scalability and makes high availability a challenge, but it makes the implementation much simpler; to keep the master from becoming a bottleneck, they make the chunk size relatively large (64 MB), so clients contact the master less often.
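For concreteness, here is a minimal sketch (mine, not code from the paper) of the chunk addressing this implies: a client turns a file byte offset into a chunk index and would then ask the master for that chunk's handle and replica locations.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as in the paper


def chunk_index(byte_offset: int) -> int:
    """A client translates a byte offset within a file into a chunk
    index; it sends (filename, chunk index) to the master, which
    replies with the chunk handle and replica locations."""
    return byte_offset // CHUNK_SIZE


# An offset 200 MB into a file falls in the fourth chunk (index 3),
# since 200 MB // 64 MB == 3.
print(chunk_index(200 * 1024 * 1024))
```

Because chunks are so large, a client can cache this small amount of mapping information and stream many megabytes without talking to the master again.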

The consistency model defined by GFS works like this. A region is consistent if all clients always see the same data, regardless of which replica they read from. A region is defined after a mutation if it is consistent and clients see what the mutation wrote in its entirety. A failed write or append leaves the region inconsistent; concurrent successful writes leave it consistent but undefined; serial successful writes leave it defined; and a successful record append, whether serial or concurrent, leaves the region defined but possibly interspersed with inconsistent regions (such as padding or duplicated records).
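The outcomes above can be summarized as a lookup table (my restatement of the paper's guarantees, keyed by mutation type and outcome):

```python
# Region state after a mutation, per the GFS consistency guarantees.
REGION_STATE = {
    ("write", "serial success"):             "defined",
    ("write", "concurrent success"):         "consistent but undefined",
    ("write", "failure"):                    "inconsistent",
    ("record append", "serial success"):     "defined, interspersed with inconsistent",
    ("record append", "concurrent success"): "defined, interspersed with inconsistent",
    ("record append", "failure"):            "inconsistent",
}

print(REGION_STATE[("write", "concurrent success")])
```

Note that record append gives the same guarantee whether appends are serial or concurrent, which is why GFS applications favor appends over random writes.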

A very interesting fact about the client's behavior is that it pushes data to each replica individually. My question here is: why shouldn't it write to the primary replica first and let the primary propagate the data to the others? What is the reason behind this?
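For context, the paper separates data flow from control flow: the client pushes data to all replicas (each buffers it without applying), and only then sends the write request to the primary, which picks a serial order, applies the mutation, and forwards the request to the secondaries. A toy sketch of that flow (class and function names are mine, for illustration only):

```python
class Replica:
    """Hypothetical chunkserver replica, for illustration only."""

    def __init__(self, name: str):
        self.name = name
        self.buffered = None  # data pushed by the client, not yet applied
        self.applied = []     # mutations applied, in serial order

    def push(self, data: bytes) -> None:
        # Data flow: buffer the data; do not apply it yet.
        self.buffered = data

    def apply(self, serial: int) -> None:
        # Control flow: apply the buffered data at the given serial number.
        self.applied.append((serial, self.buffered))
        self.buffered = None


def write(data: bytes, primary: Replica, secondaries: list) -> None:
    # Step 1 (data flow): the client pushes data to every replica.
    # (The paper actually pipelines this along a chain of chunkservers.)
    for r in [primary] + secondaries:
        r.push(data)
    # Step 2 (control flow): the client sends the write request to the
    # primary, which assigns a serial number, applies the mutation
    # locally, and forwards the request to the secondaries.
    serial = len(primary.applied)
    primary.apply(serial)
    for s in secondaries:
        s.apply(serial)
```

All replicas end up applying the same data in the same serial order, even though the data itself never flowed through the primary.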

Will the work be influential in 10 years? Probably yes: it provides a detailed design for another flavor of distributed file system, alongside AFS and NFS, with strong support for large, append-mostly files.