Review of "Cassandra - A Decentralized Structured Storage System"

05 Oct 2015

Cassandra is a highly available distributed storage system that runs across many commodity servers. Like Bigtable, it does not provide a full relational data model; instead, it gives clients a simple data model that supports dynamic control over data layout and format. It is designed to handle high write throughput without sacrificing read efficiency. It combines the distributed-systems techniques of Dynamo with the data model of Bigtable.

In Cassandra, every node participates in the cluster as a peer. Nodes share nothing with one another and can be added or removed as needed, so the cluster can scale linearly. Data is replicated across nodes, so there is no single point of failure. Records are placed by applying consistent hashing to the primary key: nodes are organized into a ring, and each node is responsible for a range of hash values. Each record is also replicated to a number of other nodes, controlled by the replication factor.
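
The ring placement described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Cassandra's actual partitioner: the `Ring` class, the `token` helper, the node names, and the use of MD5 are all hypothetical, and each node gets a single position on the ring.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    """Toy consistent-hashing ring: one token per node, successor-based replication."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node is responsible for the hash range that ends at its own token.
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas(self, key):
        """Return the node that owns `key`, followed by its successors on the ring."""
        positions = [position for position, _ in self.tokens]
        # First node whose token is >= the key's token; wrap around past the end.
        start = bisect.bisect_left(positions, token(key)) % len(self.tokens)
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")  # three distinct nodes, owner first
```

Because a key's position depends only on its hash, adding or removing a node changes ownership only for the ranges adjacent to that node, which is what lets nodes join and leave without reshuffling all the data.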

A client only needs to send a write to a single node. The write is first appended to an append-only commit log and then applied to an in-memory memtable, like the one used in Bigtable. At that point the node can acknowledge to the client that the write has finished. When the memtable grows to a certain size, it is flushed to disk with sequential writes. The advantage of sequential writes is that they are very fast on both hard disks and SSDs, since the data is written in large contiguous blocks rather than as scattered random I/O. Once the memtable has been flushed to disk as an SSTable, it is immutable. If a later update arrives, Cassandra does not modify the previous record in place; instead, it picks the newest update as the final value. How, then, does Cassandra deal with the duplicated, inconsistent values this leaves behind? It borrows the notion of compaction from Bigtable: it essentially performs a merge sort over the SSTables and collapses the redundant values into the final value.
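
To make the write path concrete, here is a toy, in-memory sketch of the steps above: commit log, memtable, flush to an immutable SSTable, and compaction. Everything in it is an assumption made for illustration (the `ToyNode` class, the `memtable_limit` threshold, and the caller-supplied integer timestamps); it does not reflect Cassandra's real on-disk format or flush policy.

```python
import heapq


class ToyNode:
    """Toy single-node write path: commit log -> memtable -> SSTables -> compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory map: key -> (timestamp, value)
        self.sstables = []     # immutable sorted runs standing in for on-disk files
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # durability first
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return "ack"  # the write is acknowledged without touching old SSTables

    def flush(self):
        # A tuple of (key, (timestamp, value)) pairs sorted by key; never modified again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        versions = []
        if key in self.memtable:
            versions.append(self.memtable[key])
        for table in self.sstables:
            versions.extend(v for k, v in table if k == key)
        # The version with the newest timestamp wins.
        return max(versions, key=lambda v: v[0])[1] if versions else None

    def compact(self):
        if not self.sstables:
            return
        merged = []
        # heapq.merge performs the merge step of a merge sort over the sorted runs.
        for key, version in heapq.merge(*self.sstables):
            if merged and merged[-1][0] == key:
                merged[-1] = (key, version)  # later in merge order means newer timestamp
            else:
                merged.append((key, version))
        self.sstables = [tuple(merged)]


node = ToyNode(memtable_limit=2)
node.write("a", "v1", 1)
node.write("b", "v1", 2)   # memtable is full: flushed as the first SSTable
node.write("a", "v2", 3)
node.write("c", "v1", 4)   # flushed as the second SSTable
tables_before = len(node.sstables)
latest_a = node.read("a")
node.compact()
```

Before compaction, both versions of key `a` sit in separate SSTables and the read has to reconcile them by timestamp; after compaction, only the newest version of each key remains in a single sorted run.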

Reads are performed in a coordinated manner. As with Bigtable, Cassandra clients are aware of every node. When a read arrives at a node, that node becomes the coordinator for the read. The coordinator does not have to own the requested data; it can fetch the data from the nodes that do. Reads and writes can be given fine-grained consistency levels, such as QUORUM (a majority, i.e. more than 50%, of replicas must acknowledge), LOCAL_QUORUM (a majority of the replicas in the local data center must acknowledge), LOCAL_ONE, TWO, and ALL. These provide the flexibility to accommodate different consistency needs.
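
The arithmetic behind these levels is small enough to write down. The sketch below is an assumption-laden illustration rather than Cassandra's API: `quorum`, `acks_required`, and `overlaps` are hypothetical helper names, and only the single-data-center levels are modeled (the LOCAL_ variants apply the same arithmetic to the replicas in the local data center). The overlap rule, R + W > N, is the standard quorum condition used in Dynamo-style systems.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def acks_required(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements a consistency level waits for."""
    if level == "QUORUM":
        return quorum(replication_factor)
    fixed = {"ONE": 1, "TWO": 2, "ALL": replication_factor}
    return fixed[level]


def overlaps(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when R + W > N, so every read contacts at least one replica with the latest write."""
    reads = acks_required(read_level, replication_factor)
    writes = acks_required(write_level, replication_factor)
    return reads + writes > replication_factor
```

With a replication factor of 3, `quorum(3)` is 2, so QUORUM reads paired with QUORUM writes overlap (2 + 2 > 3), while ONE paired with ONE does not (1 + 1 is not greater than 3) and trades consistency for lower latency.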

Will this paper be influential in 10 years? Yes, I think so. Cassandra cleverly combines Bigtable and Dynamo, preserving the good features of Bigtable without needing a master server or a distributed locking service, which makes high scalability and high availability a breeze.