Review of "Bigtable: A Distributed Storage System for Structured Data"

05 Oct 2015

Review of "Bigtable: A Distributed Storage System for Structured Data"

Bigtable is a flexible, high-performance distributed storage system for structured data that serves both bulk-processing and real-time data-serving workloads; it is widely used at Google by products such as web indexing, Google Earth, and Google Finance.

Conceptually, a Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map. It is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes. Physically, it consists of a library that is linked into every client, a single master server, and many tablet servers. It is built on top of the Google File System (GFS) and operates in a shared pool of machines that run a wide variety of other distributed applications. Bigtable data is stored in the Google SSTable file format, which provides fast lookups and can be mapped into memory to avoid extra disk seeks. Bigtable also relies heavily on Chubby, a highly available and persistent distributed lock service that uses the Paxos algorithm to keep its replicas consistent; Chubby clients communicate with the service through stateful sessions.
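To make the data model concrete, here is a minimal Python sketch of my own (not Bigtable's actual API; all class and method names are made up) of a sparse map from (row key, column key, timestamp) to uninterpreted bytes, with rows kept in lexicographic order:

```python
# Toy sketch of the conceptual data model: a sparse map from
# (row key, column key, timestamp) -> uninterpreted bytes, readable in
# sorted row order. Illustration only, not Bigtable's real interface.
class ToyBigtableMap:
    def __init__(self):
        self._cells = {}  # {(row, column, timestamp): bytes}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the value with the latest timestamp for this row/column,
        # mirroring Bigtable's newest-first timestamp ordering.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self, start_row, end_row):
        # Rows are lexicographically ordered, so a range scan is a sorted
        # walk over keys whose row falls in [start_row, end_row).
        for key in sorted(self._cells):
            if start_row <= key[0] < end_row:
                yield key, self._cells[key]
```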

Tablets are the units that hold the table's data. Bigtable uses a three-level location hierarchy that is capable of addressing 2^34 tablet locations. To keep load off the single master, most clients never need to communicate with it; they cache tablet locations and eagerly prefetch them to lower latency. One trade-off is the complexity of the client library, which has to be considerably smarter than a typical client library for a database like Oracle or PostgreSQL. In exchange, the design gracefully sidesteps funneling every request through a centralized master, as many distributed database setups built around MongoDB or PostgreSQL do, and thereby scales much further.
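The toy client below, with entirely made-up names and fake location data, sketches how a client-side location cache in front of the three-level hierarchy (whose root location lives in Chubby) keeps the common-case lookup local; for simplicity it caches by exact row key rather than by tablet key range:

```python
class ToyLocationClient:
    """Illustrative three-level tablet location lookup with a client-side
    cache; every name here is invented for the sketch."""

    def __init__(self, root_tablet, metadata_tablets):
        self._root = root_tablet           # level 1: table -> METADATA tablet id
        self._metadata = metadata_tablets  # level 2: {tablet id: [(end_row, server)]}
        self._cache = {}                   # {(table, row_key): tablet server}

    def locate(self, table, row_key):
        server = self._cache.get((table, row_key))
        if server is not None:
            return server                  # common case: zero network round trips

        # Cache miss: consult the root level, then the METADATA tablet, to
        # find the tablet server responsible for this row.
        meta_id = self._root[table]
        for end_row, candidate in sorted(self._metadata[meta_id]):
            if row_key <= end_row:
                server = candidate
                break
        self._cache[(table, row_key)] = server
        return server

# Usage with fake location data:
client = ToyLocationClient(
    root_tablet={"webtable": "meta-0"},
    metadata_tablets={"meta-0": [("m", "tabletserver-17"), ("zzzz", "tabletserver-42")]},
)
print(client.locate("webtable", "com.example.www"))  # -> tabletserver-17
```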

Updates to a tablet are first committed to a commit log that stores redo records. The most recently committed updates live in an in-memory memtable, while older ones are stored in a sequence of SSTables. When the memtable reaches a threshold size, it is frozen and converted into an SSTable. This both shrinks the tablet server's memory usage and reduces the amount of data that has to be replayed from the commit log during recovery if the server dies. A major compaction is scheduled periodically to produce a single SSTable that contains no deletion entries or deleted data.
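Here is a simplified sketch of that write path, using placeholder file formats and thresholds rather than Bigtable's actual ones: append a redo record to the commit log for durability, apply the update to the memtable, and flush the memtable into an immutable SSTable-like run once it grows past a limit:

```python
import json

class ToyTablet:
    MEMTABLE_LIMIT = 4  # flush after this many entries (the real limit is a byte size)

    def __init__(self, log_path):
        self._log_path = log_path
        self._memtable = {}   # most recent committed updates
        self._sstables = []   # older updates, frozen as sorted runs

    def write(self, key, value):
        # 1. Durability first: append a redo record to the commit log.
        with open(self._log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Apply the update to the in-memory memtable.
        self._memtable[key] = value
        # 3. Minor compaction: freeze the memtable into an SSTable-like run
        #    when full, bounding memory use and shortening log replay on recovery.
        if len(self._memtable) >= self.MEMTABLE_LIMIT:
            self._sstables.append(sorted(self._memtable.items()))
            self._memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self._memtable:
            return self._memtable[key]
        for sstable in reversed(self._sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```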

Bigtable uses two levels of caching plus Bloom filters to improve read performance. The Scan Cache caches key-value pairs returned by the SSTable interface, and the Block Cache caches SSTable blocks read from GFS. Bloom filters let a read skip SSTables that cannot contain the requested row/column pair, which reduces the number of disk accesses.
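A toy Bloom filter, with arbitrary sizes and hash choices of my own, illustrates the idea: before touching an SSTable on disk, ask the filter whether the row/column pair might be present, and skip the SSTable entirely on a definite "no":

```python
import hashlib

class ToyBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self._bits = [False] * num_bits
        self._num_hashes = num_hashes

    def _positions(self, key):
        # Derive several bit positions from independent hashes of the key.
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % len(self._bits)

    def add(self, key):
        for pos in self._positions(key):
            self._bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self._bits[pos] for pos in self._positions(key))

# Usage: one filter per SSTable, keyed by "row:column".
bloom = ToyBloomFilter()
bloom.add("com.example.www:contents")
assert bloom.might_contain("com.example.www:contents")
if not bloom.might_contain("com.other:anchor"):
    pass  # definite miss: skip the disk read for this SSTable
```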

Will this paper be influential in 10 years? I think so. Bigtable provides a highly available, high-performance solution for simple key-value storage needs. It does fall short when you have more complex query requirements, but for the majority of applications those are not the most commonly used features, and you can always layer a data processing framework such as Spark or Storm on top to derive richer reports.