Summary of "Spark: Cluster Computing with Working Sets"

07 Sep 2015


MapReduce and its variants have been very successful for big data analysis. They achieve locality-aware scheduling, fault tolerance, and load balancing by requiring users to express computations as acyclic data flow graphs. While this model is useful for a large class of applications, it is inefficient for workloads that reuse a working set of data across multiple parallel operations, such as iterative jobs and interactive analytics. Spark addresses this with the resilient distributed dataset (RDD): a read-only collection of objects partitioned across a set of machines that can be explicitly cached in memory and rebuilt if a partition is lost. Because it is implemented in Scala, Spark can also be used interactively, which the authors believe makes it the first system of its kind.

Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets. Spark lets programmers construct RDDs in four ways (sketched in Scala after the list):

  1. From a file in a shared file system such as Hadoop Distributed File System (HDFS).
  2. By "parallelizing" a Scala collection in the driver program.
  3. By transforming an existing RDD through operations such as flatMap, map, filter, etc.
  4. By changing the persistence of an existing RDD, i.e., via the cache and save actions on datasets.
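
Below is a minimal Scala sketch of these four construction methods, written in the style of the paper's examples. The package name, Mesos master URL, file paths, and dataset contents are assumptions, and later Spark releases moved or renamed some of these calls.

```scala
import spark.SparkContext

object RddConstruction {
  def main(args: Array[String]): Unit = {
    // The driver program connects to the cluster (the Mesos master URL is a placeholder).
    val sc = new SparkContext("mesos://master:5050", "RddConstruction")

    // 1. From a file in a shared file system such as HDFS.
    val lines = sc.textFile("hdfs://namenode/logs/app.log")

    // 2. By "parallelizing" a Scala collection in the driver program.
    val numbers = sc.parallelize(1 to 1000)

    // 3. By transforming an existing RDD (flatMap, map, filter, ...).
    val errors = lines.filter(_.contains("ERROR"))
    val words  = lines.flatMap(_.split(" "))

    // 4. By changing persistence: cache() keeps the dataset in memory after it
    //    is first computed; save() writes it out to the shared file system.
    val cachedErrors = errors.cache()
    cachedErrors.save("hdfs://namenode/output/errors")
  }
}
```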

Spark supports several parallel operations: reduce, collect, and foreach. Unlike MapReduce, it does not yet support a grouped parallel reduce; reduce results are collected only at the driver. Programmers invoke operations like map, filter, and reduce by passing closures to Spark. Closures can access variables in the scope where they are defined; Spark makes this work by copying those variables to the workers. Two restricted types of shared variables are also supported. Broadcast variables are used when a large read-only piece of data (e.g., a lookup table) is needed by multiple operations, since it is better to distribute it to the workers once rather than with every closure. Accumulators are similar to MapReduce counters, providing a more imperative syntax for parallel sums.
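
The sketch below shows the three parallel operations together with a broadcast variable and an accumulator, continuing with the hypothetical `sc` and `lines` values from the previous example; the log format and lookup table contents are made up.

```scala
// Broadcast: ship a large read-only lookup table to each worker only once.
val statusNames = sc.broadcast(Map(200 -> "ok", 404 -> "not found", 500 -> "error"))
// Accumulator: an add-only counter, analogous to a MapReduce counter.
val unknown = sc.accumulator(0)

// Assume each log line ends with a numeric status code.
val codes = lines.map(_.split(" ").last.toInt)

// foreach: run a side-effecting closure on every element (here it updates the accumulator).
codes.foreach(c => if (!statusNames.value.contains(c)) unknown += 1)

// reduce: combine elements with an associative function; the result is delivered only to the driver.
val sum = codes.reduce(_ + _)

// collect: bring the whole dataset back to the driver as a local Scala collection.
val localCodes = codes.collect()

println("sum = " + sum + ", unknown status codes = " + unknown.value)
```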

Spark is built on top of Mesos, a "cluster operating system" that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster. This simplifies Spark's implementation and lets it run alongside other frameworks. Internally, each RDD is stored as a chain of objects capturing its lineage: every dataset holds a pointer to its parent along with information about how it was derived from that parent. Each RDD object implements three common operations (sketched after the list):

  1. getPartitions, which returns a list of partition IDs.
  2. getIterator(partition), which iterates over a partition.
  3. getPreferredLocations(partition), for locality-aware scheduling.
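
A rough Scala sketch of what this per-RDD interface and lineage chain might look like; the names follow the paper, but the real classes and signatures differ.

```scala
// Hypothetical interface; each concrete RDD captures how to derive its data from its parent.
trait RDD[T] extends Serializable {
  // The IDs of the partitions making up this dataset.
  def getPartitions: Seq[Int]
  // Iterate over one partition, recomputing it from the parent if necessary.
  def getIterator(partition: Int): Iterator[T]
  // Nodes where the partition's data is local (e.g., HDFS block locations),
  // used for locality-aware scheduling.
  def getPreferredLocations(partition: Int): Seq[String]
}

// Example lineage link: a mapped dataset stores only its parent and the function applied to it,
// so a lost partition can be rebuilt by re-applying `f` to the parent's partition.
class MappedRDD[T, U](parent: RDD[T], f: T => U) extends RDD[U] {
  def getPartitions: Seq[Int] = parent.getPartitions
  def getIterator(partition: Int): Iterator[U] = parent.getIterator(partition).map(f)
  def getPreferredLocations(partition: Int): Seq[String] = parent.getPreferredLocations(partition)
}
```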

When a parallel operation is invoked on a dataset, Spark creates a task to process each partition of the dataset and sends these tasks to worker nodes, using delay scheduling to optimize for locality. Once launched on a worker, each task calls getIterator to start reading its partition.
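
A toy sketch of the delay-scheduling policy (not Spark's actual scheduler code): when a worker asks for work, a data-local task is preferred, and a task only loses its locality preference after it has waited longer than some threshold.

```scala
// Hypothetical task descriptor; `preferredNodes` would come from getPreferredLocations.
case class Task(partition: Int, preferredNodes: Seq[String], createdAt: Long)

def pickTask(pending: Seq[Task], workerNode: String, maxDelayMs: Long): Option[Task] = {
  val now = System.currentTimeMillis()
  pending.find(_.preferredNodes.contains(workerNode))          // prefer a task local to this worker
    .orElse(pending.find(t => now - t.createdAt > maxDelayMs)) // otherwise, one that has waited too long
}
```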

Closures are shipped to workers through Java serialization, since in Scala a closure is also a Java object. Shared variables are implemented with custom serialization hooks.
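
As a toy illustration of one such hook: a broadcast variable can mark its payload as transient, write the value once to a shared store when serialized, and read it back on the worker when deserialized, so the value never travels inside every closure. The `SharedStore` object below is a hypothetical in-process stand-in for the shared file system the paper actually uses.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a shared file system such as HDFS; an in-process map
// keeps the sketch self-contained (it would not work across JVMs).
object SharedStore {
  private val data = scala.collection.concurrent.TrieMap[Long, Any]()
  def put(id: Long, v: Any): Unit = data(id) = v
  def get(id: Long): Any = data(id)
}

// Toy broadcast variable: the payload is transient, so Java serialization ships
// only the ID; the worker re-fetches the value from the shared store on arrival.
class Broadcast[T](@transient private var payload: T, val id: Long) extends Serializable {
  def value: T = payload

  private def writeObject(out: ObjectOutputStream): Unit = {
    SharedStore.put(id, payload)   // publish the value once
    out.defaultWriteObject()       // serialize only the non-transient fields (the ID)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    payload = SharedStore.get(id).asInstanceOf[T]  // re-read the value on the worker
  }
}
```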

The Scala interpreter operates by compiling a class for each line of the user's input. To make the interpreter work with Spark, it was changed to output the classes it defines to a shared file system, so that workers can load them with a custom class loader. The singleton object generated for each line was also changed to reference the singleton objects for previous lines directly, rather than going through a getInstance() method.

In the evaluation, Spark performs comparably to frameworks such as Hadoop on the first query or iteration (it can even be slower, due to Scala execution speed), but subsequent iterations run roughly 10x faster thanks to its caching mechanism.

Spark has gained wide adoption in industry. With higher-level components such as Spark Streaming, MLlib, GraphX, and Spark SQL, it will attract even more people from different areas. I think Spark will continue to dominate the big data processing space for the foreseeable future.