Review of "SparkSQL: Relational Data Processing in Spark"

Big data applications require a mix of processing techniques, data sources and storage formats. On the one hand, people need the flexibility of a procedural programming interface so they can express logic like ETL and machine learning; on the other hand, once the data is structured, they often want the more productive, declarative experience provided by systems like Pig, Hive, and Dremel. This is where SparkSQL comes onto the stage.

SparkSQL provides a lazily evaluated DataFrame API, which allows relational operations on both external data sources and Spark's built-in distributed collections. It also provides a highly extensible optimization engine called Catalyst. A DataFrame is equivalent to a table in a relational database and can be manipulated in much the same way as an RDD, but it also keeps track of its schema and supports various optimized relational operations.

SparkSQL uses a nested data model based on Hive for tables and DataFrames, which supports all major SQL data types. Relational operations such as join, where, and groupBy can be performed on DataFrames through a domain-specific language. Catalyst is developed using Scala's functional programming constructs; it contains a general library for representing evaluation trees and applying rules to manipulate them.
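
To make the DSL concrete, here is a minimal sketch using Spark's Scala DataFrame API; the input files and column names (userId, country) are hypothetical, and the point is that the whole expression builds a lazy logical plan that Catalyst optimizes before any data is touched.

```scala
// Minimal DataFrame DSL sketch; "users.json", "events.json" and their columns are assumed.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object DataFrameDslExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dsl-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val users  = spark.read.json("users.json")    // assumed input
    val events = spark.read.json("events.json")   // assumed input

    // Relational operations are expressed lazily; Catalyst optimizes the whole plan.
    val report = users
      .join(events, "userId")
      .where($"country" === "US")
      .groupBy($"userId")
      .agg(count("*").as("numEvents"))

    report.show()
    spark.stop()
  }
}
```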

Will this paper be influential in 10 years? Yes, I think so. It provides another way to manipulate data, which is a feature in real demand for data processing frameworks like Spark and MapReduce. And thanks to the tight integration of SparkSQL with Spark, developers can enjoy the rich features of the relational model without losing the flexibility of procedural processing.

Review of "Cassandra - A Decentralized Structured Storage System"

Cassandra is a high-availability distributed storage system running across many commodity servers. Like Bigtable, it does not provide a full relational data model; instead it gives clients a simple data model that supports dynamic control over data layout and format. It is designed to handle high write throughput without sacrificing read efficiency, and it combines the distributed systems techniques of Dynamo with the data model of Bigtable.

In Cassandra, all nodes participate in a cluster, share nothing among themselves, can be added or removed as needed, and scale linearly. Cassandra is fully replicated and has no single point of failure. The primary key is used to place records via consistent hashing: nodes are organized into a ring, and each node is responsible for a hash range. Data is replicated to a number of other nodes controlled by the replication factor.
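
As a rough illustration of this placement scheme, here is a toy consistent-hashing ring in Scala; the hash function, node names, and key format are made up, and real Cassandra partitioners and replication strategies (racks, datacenters) are far more involved.

```scala
// Toy consistent-hashing ring: a key lands on the first node at or after its hash,
// and the next (replicationFactor - 1) distinct nodes, walking clockwise, hold replicas.
import scala.collection.immutable.TreeMap

object ConsistentHashRing {
  private def hash(s: String): Long = s.hashCode.toLong & 0xffffffffL  // toy hash, not Cassandra's

  final case class Ring(positions: TreeMap[Long, String]) {
    def replicasFor(key: String, replicationFactor: Int): Seq[String] = {
      val start = hash(key)
      // Walk clockwise from the key's position, wrapping around the ring.
      val ordered = positions.valuesIteratorFrom(start).toSeq ++ positions.values.toSeq
      ordered.distinct.take(replicationFactor)
    }
  }

  def main(args: Array[String]): Unit = {
    val nodes = Seq("nodeA", "nodeB", "nodeC", "nodeD")
    val ring  = Ring(TreeMap(nodes.map(n => hash(n) -> n): _*))
    println(ring.replicasFor("user:42", replicationFactor = 3))
  }
}
```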

Clients only need to write to a single node. Data is written to an append-only commit log first, then put into a memtable like the one used in Bigtable, at which point the node can acknowledge the write to the client. When the memtable grows to a certain size, it is flushed to disk with sequential writes, which are fast on both hard disks and SSDs. Once the memtable is flushed to disk as an SSTable it is immutable, so a later update does not modify the previous record in place; instead, Cassandra treats the newest update as the final value. How, then, does Cassandra deal with duplicated, inconsistent values? It borrows the notion of compaction from Bigtable: it essentially performs a merge sort over the SSTables and compacts redundant values down to the final value.
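
The compaction idea can be sketched as a merge that keeps only the newest timestamped value per key; this is a deliberately simplified, in-memory stand-in (no tombstones, no on-disk format), not Cassandra's actual implementation.

```scala
// Toy compaction: merge several immutable, keyed SSTables; for each key the cell
// with the newest timestamp wins.
object CompactionSketch {
  final case class Cell(value: String, timestamp: Long)
  type SSTable = Map[String, Cell]   // in-memory stand-in for an on-disk SSTable

  def compact(tables: Seq[SSTable]): SSTable =
    tables.flatten                   // all (key, cell) pairs across tables
      .groupBy(_._1)                 // group duplicates by key
      .map { case (key, cells) => key -> cells.map(_._2).maxBy(_.timestamp) }  // newest wins

  def main(args: Array[String]): Unit = {
    val older = Map("k1" -> Cell("v1", 100L), "k2" -> Cell("v2", 120L))
    val newer = Map("k1" -> Cell("v1-updated", 200L))
    println(compact(Seq(older, newer)))   // k1 -> v1-updated (ts 200), k2 -> v2
  }
}
```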

Reads are performed in a coordinated manner. As in Bigtable, Cassandra clients are aware of every node. When a read arrives at a node, that node becomes the coordinator for the read; it does not have to own the data itself, but can fetch it from the nodes that do. Reads and writes can be issued with fine-grained consistency levels, such as QUORUM (a majority of replicas acknowledge), LOCAL_QUORUM (a majority of replicas in the local DC acknowledge), LOCAL_ONE, TWO, and ALL. These provide the flexibility to accommodate different consistency needs.
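
A small sketch of how such consistency levels could translate into a required number of acknowledgements for a given replication factor; this is illustrative only, since real Cassandra evaluates LOCAL_* levels against per-datacenter replication settings.

```scala
// Illustrative mapping from consistency level to required replica acks.
object ConsistencyLevels {
  sealed trait Level
  case object One    extends Level
  case object Two    extends Level
  case object Quorum extends Level
  case object All    extends Level

  def requiredAcks(level: Level, replicationFactor: Int): Int = level match {
    case One    => 1
    case Two    => 2
    case Quorum => replicationFactor / 2 + 1   // strict majority of replicas
    case All    => replicationFactor
  }

  def main(args: Array[String]): Unit = {
    println(requiredAcks(Quorum, replicationFactor = 3))  // 2
    println(requiredAcks(All, replicationFactor = 3))     // 3
  }
}
```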

Will this paper be influential in 10 years? Yes, I think so. Cassandra cleverly combines Bigtable and Dynamo, preserving the good features of Bigtable without needing a master server or a distributed lock service, making high scalability and high availability a breeze.

Review of "Bigtable: A Distributed Storage System for Structured Data"

Bigtable is a flexible, high-performance distributed storage solution for structured data that serves both bulk processing and real-time serving requirements, and it is widely used at Google for web indexing, Google Earth, and Google Finance.

Conceptually, a Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. It is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes. Physically, it consists of a library linked into every client, a master server, and many tablet servers. It is built on top of the Google File System and operates in a shared pool of machines that run a wide variety of other distributed applications. The Google SSTable file format is used to store Bigtable data; it provides high-performance lookups and can be mapped into memory to avoid extra disk accesses. Bigtable also relies heavily on a highly available and persistent distributed lock service called Chubby, which uses the Paxos algorithm to maintain consistency; Chubby clients use stateful sessions to communicate with the Chubby service.
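
The logical data model can be sketched as a sorted map keyed by (row, column, timestamp); the row and column names below follow the paper's webtable example, while the code itself is a single-machine toy, not Bigtable's storage layout.

```scala
// Toy sketch of the logical model: a sorted map from (row, column, timestamp)
// to an uninterpreted array of bytes, with a lookup for the newest cell version.
import scala.collection.immutable.TreeMap

object BigtableModelSketch {
  type Key   = (String, String, Long)          // (row key, column key, timestamp)
  type Table = TreeMap[Key, Array[Byte]]       // kept sorted by key, as in Bigtable

  /** Return the newest version of a cell, if any. */
  def latest(table: Table, row: String, column: String): Option[Array[Byte]] = {
    val versions = table.collect { case ((r, c, ts), v) if r == row && c == column => ts -> v }
    if (versions.isEmpty) None else Some(versions.maxBy(_._1)._2)
  }

  def main(args: Array[String]): Unit = {
    val table: Table = TreeMap(
      ("com.cnn.www", "contents:", 1L) -> "<html>v1</html>".getBytes,
      ("com.cnn.www", "contents:", 3L) -> "<html>v3</html>".getBytes
    )
    latest(table, "com.cnn.www", "contents:").foreach(bytes => println(new String(bytes)))
  }
}
```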

Tablets are the units in which the structured data is stored. Bigtable uses a three-level location hierarchy capable of addressing 2^34 tablets (each METADATA row is roughly 1 KB, so 128 MB METADATA tablets hold about 2^17 entries per level). To offload the single master, most clients never need to communicate with it: they cache tablet server locations and prefetch them eagerly to lower latency. One trade-off here is the complexity of the client library, which has to be more involved than a typical database client library for Oracle or PostgreSQL. But this design gracefully sidesteps routing every request through a centralized server, as many distributed database setups built around MongoDB or PostgreSQL do, and thus greatly improves scalability.

Updates to a tablet are first committed to a commit log that stores redo records. The most recently committed updates are kept in an in-memory memtable; older ones are stored in a sequence of SSTables. When the memtable grows to a threshold size, it is frozen and converted into an SSTable. This technique both shrinks the memory usage of the tablet server and reduces the amount of data that has to be read from the commit log during recovery if the server dies. A major compaction is scheduled regularly to produce an SSTable that contains no deletion entries or deleted data.

Bigtable uses two levels of caching and Bloom filters to improve read performance. The Scan Cache caches key-value pairs returned by the SSTable interface, and the Block Cache caches SSTable blocks read from GFS. Bloom filters let a tablet server ask whether an SSTable might contain data for a given row/column pair, so reads can skip SSTables that definitely do not, reducing the number of disk accesses.
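
Here is a minimal Bloom filter sketch showing the kind of membership test used on the read path; the hash construction and sizes are illustrative, not what Bigtable actually uses.

```scala
// Minimal Bloom filter: a "no" answer means the key is definitely absent, so the
// corresponding SSTable (and its disk seek) can be skipped; "yes" means maybe present.
import scala.util.hashing.MurmurHash3

final class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  private def positions(key: String): Seq[Int] =
    (0 until numHashes).map(i => (MurmurHash3.stringHash(key, i) & Int.MaxValue) % numBits)

  def add(key: String): Unit = positions(key).foreach(i => bits.set(i))

  /** False means definitely absent; true means possibly present. */
  def mightContain(key: String): Boolean = positions(key).forall(i => bits.get(i))
}

object BloomFilterDemo {
  def main(args: Array[String]): Unit = {
    val filter = new BloomFilter(numBits = 1 << 16, numHashes = 3)
    filter.add("com.cnn.www/contents:")
    println(filter.mightContain("com.cnn.www/contents:")) // true
    println(filter.mightContain("com.example/anchor:"))   // almost certainly false
  }
}
```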

Will this paper be influential in 10 years? I think so. Bigtable provides a highly available, high-performance solution for simple key-value storage requirements. It does fall short when you have more complex query needs, but for the majority of applications those are not the most common access patterns, and you can use data processing frameworks like Spark or Storm to derive more meaningful reports anyway.

Review of "Sparrow: Distributed, Low Latency Scheduling"

As large-scale data analytics shifts toward shorter task durations and larger degrees of parallelism to provide low latency, fast job scheduling becomes more and more important. As an example, a cluster containing ten thousand 16-core machines and running 100ms tasks may require 1 million scheduling decisions per second. Against this background comes Sparrow, a low-latency decentralized scheduler.

Sparrow takes a radically different approach from traditional schedulers. First, it assumes that a long-running executor process for each framework is already running on every worker machine, so Sparrow only needs to send a short task description to launch a task. It uses batch sampling together with late binding to achieve low-latency, stateless scheduling. More concretely, to schedule a job of m tasks, the scheduler maintains no state about the cluster; instead it sends probe RPCs to d*m worker machines. Each probed worker places a reservation for the job in its task queue and holds its response until the reservation reaches the front of the queue, at which point the scheduler replies with either a task description, if it still has tasks to schedule, or a no-op if all tasks have already been placed. Sparrow also uses proactive cancellation to cancel outstanding reservations once a job has been fully scheduled, instead of waiting for those workers to respond.
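
A toy, single-process sketch of batch sampling with late binding: for m tasks, probe d*m randomly chosen workers and give tasks to the first m reservations that would reach the front of a queue. Queue wait is modeled crudely as current queue length; the real protocol is asynchronous RPC between schedulers and workers.

```scala
// Simplified batch sampling + late binding: probe d*m workers, assign the m tasks
// to the probed workers whose reservations reach a queue front first; others get a NOP.
import scala.util.Random

object BatchSamplingSketch {
  final case class Worker(id: Int, queueLength: Int)

  def schedule(tasks: Seq[String], workers: Seq[Worker], d: Int): Map[String, Int] = {
    val probes  = Random.shuffle(workers).take(d * tasks.size)  // batch sampling
    val winners = probes.sortBy(_.queueLength).take(tasks.size) // late binding (modeled)
    tasks.zip(winners.map(_.id)).toMap                          // remaining probes get a NOP
  }

  def main(args: Array[String]): Unit = {
    val cluster   = (1 to 20).map(i => Worker(i, queueLength = Random.nextInt(10)))
    val placement = schedule(Seq("t1", "t2", "t3"), cluster, d = 2)
    println(placement)   // e.g. Map(t1 -> 7, t2 -> 13, t3 -> 2)
  }
}
```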

Scheduling policies and constraints are also handled through sampling. For per-job constraints, such as a job that requires a GPU on the worker, Sparrow selects from the d*m candidate workers only those that satisfy the constraints. It also handles jobs with per-task constraints, such as data locality constraints, since different tasks may have different locality preferences; for each task it probes two machines drawn from the set of machines the task is constrained to run on. One question here is: without aggregated cluster information, how much data locality can Sparrow actually achieve with this sampling technique? For resource allocation, Sparrow supports strict priorities and weighted fair sharing, just like other schedulers including the Hadoop MapReduce scheduler.

Will this paper be influential in 10 years? I think so; shorter tasks with tight latency requirements are becoming more and more prevalent. The solution proposed by this paper is creative in that it needs no centralized store of aggregated cluster state, yet still achieves good resource utilization and low-latency scheduling.

Review of "Omega: flexible, scalable schedulers for large compute clusters"

Similar to Mesos, Omega is designed to improve resource management in data centers. But unlike Mesos's two-level resource-offering model, which hides the overall cluster resources from frameworks, Omega proposes a shared-state model.

According to the paper, most (> 80%) jobs are batch jobs, but the majority of resources (55-80%) are allocated to service jobs, which usually run for much longer than batch jobs. Omega is designed to accommodate both kinds of jobs. One thing to notice is that fairness is not one of Omega's main concerns; its design is driven more by business requirements.

Omega takes a shared-state approach to scheduling. All schedulers are granted access to the whole cluster, compete in a free-for-all manner, and use optimistic concurrency control to mediate clashes when they update the cluster state. For example, when an individual scheduler places new tasks on some nodes, it tries to push this update to the central cell state, but the commit may fail because of a conflict (another scheduler might already have scheduled tasks on the same nodes); in that case the scheduler simply retries. (According to the paper, conflicts are rare events in Google's production environment.) Optimistic concurrency gives schedulers a very high level of parallelism, but the trade-off is that when the optimistic assumption turns out to be wrong, the scheduler has to redo its work. There is no central resource allocator in Omega; all resource allocation decisions take place in the schedulers. A resilient master copy of the resource allocations in the cluster is maintained, and each scheduler is given a private, local, frequently updated copy of the cell state that it uses for making scheduling decisions.
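
The optimistic-concurrency pattern can be sketched as a compare-and-set retry loop over a versioned cell state; the AtomicReference below is a single-process stand-in for Omega's replicated master copy, and the field names are illustrative.

```scala
// Optimistic concurrency sketch: decide against a local (possibly stale) copy,
// then commit with compare-and-set; on conflict, redo the work and retry.
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

object SharedStateSchedulingSketch {
  final case class CellState(version: Long, assignments: Map[String, String]) // task -> node

  val sharedState = new AtomicReference(CellState(0L, Map.empty))

  @tailrec
  def commitPlacement(task: String, chooseNode: CellState => String): Unit = {
    val observed = sharedState.get()                   // local copy of cell state
    val node     = chooseNode(observed)                // scheduling decision
    val proposed = observed.copy(
      version     = observed.version + 1,
      assignments = observed.assignments + (task -> node)
    )
    // Succeeds only if no other scheduler changed the state in the meantime.
    if (!sharedState.compareAndSet(observed, proposed))
      commitPlacement(task, chooseNode)                // conflict: redo and retry
  }

  def main(args: Array[String]): Unit = {
    commitPlacement("job1/task0", _ => "node-17")
    println(sharedState.get())
  }
}
```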

Will this paper be influential in 10 years? Maybe. Its shared-state scheduling method lets many specialized schedulers work on the same cluster in parallel, which makes scheduling both flexible and efficient.

Review of "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types"

Dominant Resource Fairness (DRF) is a generalization of max-min fairness to multiple heterogeneous resources.

What is max-min fair sharing? For example, in a three-user system each user is entitled to one third of the resource; if one user asks for less than that, say 20%, the request is satisfied in full and the remaining 80% is split evenly between the other two users.

While previous work on weighted max-min fairness has focused on a single resource, the advent of cloud computing and multi-core systems has increased the need for allocation policies that handle multiple resources and heterogeneous user demands. Existing fair schedulers for clusters, such as Quincy and the Hadoop Fair Scheduler, ignore this heterogeneity and allocate resources at the granularity of slots, where a slot is a fixed fraction of a node. As a consequence, the allocation often results in either under-utilization or over-subscription of resources, neither of which is what users expect.

DRF takes a different approach. It proposes that a user's allocation should be determined by the user's dominant share, which is the maximum share that the user has been allocated of any resource. The strength of DRF is that it satisfies the following properties. First, it provides a sharing incentive: every user is at least as well off as under a static partition of the cluster into N equal slices, where N is the number of users. Second, it is strategy-proof: a user cannot get a better allocation by lying about her resource demands. Third, it is Pareto-efficient: it allocates all available resources subject to satisfying the other properties, and without preempting existing allocations. Finally, it is envy-free: no user prefers the allocation of another user.
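
The allocation loop itself (Algorithm 1 in the paper) is short enough to sketch; the capacities and per-task demands below reproduce the paper's running example of 9 CPUs and 18 GB with user A needing <1 CPU, 4 GB> per task and user B needing <3 CPUs, 1 GB>.

```scala
// DRF sketch: repeatedly pick the user with the smallest dominant share and,
// if the cluster can still fit one more of that user's tasks, allocate it.
object DrfSketch {
  type Resources = Map[String, Double]

  final case class User(name: String, demand: Resources, var allocated: Resources)

  def dominantShare(user: User, capacity: Resources): Double =
    capacity.map { case (r, total) => user.allocated.getOrElse(r, 0.0) / total }.max

  def fits(demand: Resources, used: Resources, capacity: Resources): Boolean =
    capacity.forall { case (r, total) =>
      used.getOrElse(r, 0.0) + demand.getOrElse(r, 0.0) <= total
    }

  def allocate(users: Seq[User], capacity: Resources): Unit = {
    var used: Resources = capacity.map { case (r, _) => r -> 0.0 }
    var progress = true
    while (progress) {
      progress = false
      val next = users.minBy(dominantShare(_, capacity))   // smallest dominant share first
      if (fits(next.demand, used, capacity)) {
        next.allocated = capacity.keys.map { r =>
          r -> (next.allocated.getOrElse(r, 0.0) + next.demand.getOrElse(r, 0.0))
        }.toMap
        used = capacity.keys.map(r => r -> (used(r) + next.demand.getOrElse(r, 0.0))).toMap
        progress = true
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val capacity = Map("cpu" -> 9.0, "mem" -> 18.0)
    val a = User("A", Map("cpu" -> 1.0, "mem" -> 4.0), Map.empty)
    val b = User("B", Map("cpu" -> 3.0, "mem" -> 1.0), Map.empty)
    allocate(Seq(a, b), capacity)
    println(s"A: ${a.allocated}, B: ${b.allocated}")  // A gets 3 tasks, B gets 2, as in the paper
  }
}
```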

In practice, DRF is implemented in the Mesos resource manager, and it yields fairer allocations and higher utilization than existing solutions that allocate identical resource slices to all tasks. Will this paper be influential in 10 years? Yes, because it provides a well-analyzed allocation policy, with proofs of its properties and experiments showing better resource utilization, in the increasingly relevant cloud computing environment.


References

From Wikipedia: Pareto efficiency, or Pareto optimality, is a state of resource allocation in which it is impossible to make any one individual better off without making at least one other individual worse off.

Review of "Apache Hadoop YARN: Yet Another Resource Negotiator"

YARN is the next generation of the Hadoop compute platform. It departs from the original monolithic architecture by separating resource management functions from the programming model and delegating many scheduling-related functions to per-job components.

To address some of the multi-tenancy issues, Yahoo! developed Hadoop on Demand (HOD), which used Torque and Maui to allocate Hadoop clusters on a shared pool of hardware and worked together with the JobTracker and TaskTracker. But HOD suffered from middling resource utilization because it had too little information to make intelligent decisions about resource allocation. The JobTracker also did a poor job on fault tolerance and recovery, which further motivated the development of YARN. Other notable goals of YARN include secure and auditable operation, support for multiple diverse frameworks (similar to Mesos), a flexible resource model, and backward compatibility to avoid "second version syndrome".

The main components of YARN are one Resource Manager (RM), a set of Node Managers (NM), one per node, and a set of framework-specific Application Masters (AM) which negotiate resources with the RM. The RM runs as a daemon on a dedicated machine and acts as the central authority arbitrating resources among competing applications in the cluster, which enables fairness, capacity guarantees, and locality across tenants. To obtain containers, which are logical bundles of resources like <2GB RAM, 1 CPU>, the AM issues resource requests to the RM; these requests include the number of required containers, locality preferences, and the priority of the request within the application. The RM attempts to satisfy the requests and generates a lease for the resources, which is pulled by a subsequent AM heartbeat; the AM then presents the corresponding token to an NM to obtain the container. Every container in YARN (including the AM's own) is described by a container launch context (CLC), which includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NM services, and the command necessary to create the process; this serves as the descriptor the NM uses to start the container. Similar to Mesos, YARN does not do much about fault tolerance itself, mainly delegating it to the application-specific AMs.
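
Schematically, a resource request and a CLC carry roughly the following information; this is a hypothetical model for illustration, not the actual YARN API or its field names.

```scala
// Schematic (hypothetical) model of an AM's resource request and a container launch context.
object YarnRequestSketch {
  final case class Resource(memoryMb: Int, vcores: Int)

  final case class ResourceRequest(
    numContainers: Int,
    capability: Resource,
    localityPreferences: Seq[String],   // e.g. host or rack names
    priority: Int
  )

  final case class ContainerLaunchContext(
    environment: Map[String, String],
    remoteDependencies: Seq[String],    // URIs in remotely accessible storage
    securityTokens: Array[Byte],
    command: Seq[String]                // command used to create the process
  )

  def main(args: Array[String]): Unit = {
    val request = ResourceRequest(
      numContainers = 4,
      capability = Resource(memoryMb = 2048, vcores = 1),  // the <2GB RAM, 1 CPU> bundle above
      localityPreferences = Seq("rack-12"),
      priority = 1
    )
    println(request)
  }
}
```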

Will this paper be influential in 10 years? Maybe, considering the large market share of Hadoop, and YARN does solve a lot of issues identified in Hadoop 1.0.

Review of "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center"

Mesos, often referred to as an operating system for the data center, is a platform for sharing commodity clusters between multiple diverse cluster computing frameworks such as Hadoop and MPI. It achieves near-optimal data locality when sharing the cluster among diverse frameworks, scales to 50,000 (emulated) nodes, and is resilient to failures.

Mesos acts as an additional abstraction layer over the cluster's computing power. Instead of having each framework's scheduler tied to a fixed set of physical machines, Mesos mediates between frameworks and machines, so it can optimize how tasks are spread across the cluster and flexibly scale different computations according to their needs.

A Mesos framework is a distributed system with its own scheduler, like an app running on the cluster. The resource allocation mechanism works like this: Mesos learns the total hardware resources of the cluster through a registration process; when a framework needs to run, Mesos picks resources from the pool and offers them to the framework, which can accept or reject the offer. (The key reason Mesos does not let frameworks specify resource requirements up front is to keep Mesos simple while still supporting arbitrarily complex framework constraints. This can cause efficiency problems, because a framework might have to wait a long time before the resources it needs are offered; Mesos mitigates this with filters, which act as hints about which resources a framework will accept.) Resource allocation is delegated to a pluggable allocation module, so it can be tailored to different needs and evolve separately. Similarly, resource isolation is pluggable; currently Mesos isolates resources through OS container technologies such as Linux Containers and Solaris Projects.
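
The offer round-trip can be sketched as follows; the types and method names are illustrative, not the real Mesos API, and the allocation module and filters are omitted.

```scala
// Toy sketch of the offer mechanism: the master offers resources on a node and the
// framework scheduler either accepts (returning tasks to launch) or rejects (returning Nil).
object ResourceOfferSketch {
  final case class Offer(node: String, cpus: Double, memGb: Double)
  final case class Task(name: String, cpus: Double, memGb: Double)

  trait FrameworkScheduler {
    /** Return tasks to launch on the offered node, or Nil to reject the offer. */
    def resourceOffer(offer: Offer): Seq[Task]
  }

  // A framework that only accepts offers big enough for its fixed task size.
  class SimpleFramework(taskCpus: Double, taskMemGb: Double) extends FrameworkScheduler {
    def resourceOffer(offer: Offer): Seq[Task] =
      if (offer.cpus >= taskCpus && offer.memGb >= taskMemGb)
        Seq(Task(s"task-on-${offer.node}", taskCpus, taskMemGb))
      else Nil
  }

  def main(args: Array[String]): Unit = {
    val framework = new SimpleFramework(taskCpus = 2.0, taskMemGb = 4.0)
    val launched  = framework.resourceOffer(Offer("node-3", cpus = 4.0, memGb = 8.0))
    println(if (launched.isEmpty) "offer rejected" else s"launching: $launched")
  }
}
```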

Fault tolerance in Mesos is achieved with hot-standby master replicas coordinated through ZooKeeper, and node failures are propagated to the affected frameworks so they can react. A framework can also register an alternative scheduler that is notified if its main scheduler fails.

Mesos is written in about 10,000 lines of C++ on top of libprocess, which provides an actor-based concurrency model using efficient asynchronous I/O mechanisms (epoll, kqueue, etc.), and ZooKeeper is used for leader election. One thing to note is that Mesos also enabled the creation of Spark, whose efficiency owes a lot to Mesos.

Will this paper be influential in 10 years? I think so. As more and more distributed systems come out, there is a need for an OS-like layer to manage all of the different resources. Mesos, with its simplicity, efficiency, and rich ecosystem (Spark, Marathon, Chronos, etc.), will certainly benefit the industry in the long run.