Review of "SparkSQL: Relational Data Processing in Spark"

07 Oct 2015

Review of "SparkSQL: Relational Data Processing in Spark"

Big data applications require a mix of processing techniques, data sources, and storage formats. On one hand, people need the flexibility of a procedural programming interface so that they can express logic such as ETL and machine learning; on the other hand, once the data is structured, people often want the more productive experience offered by systems like Pig, Hive, and Dremel. This is how SparkSQL came onto the stage.

SparkSQL provides a lazily evaluated DataFrame API, which allows relational operations on both external data sources and Spark's built-in distributed collections. It also provides a highly extensible optimization engine called Catalyst. A DataFrame is equivalent to a table in a relational database and can be manipulated in ways similar to RDDs, but at the same time it keeps track of its schema and supports various optimized relational operations.
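
To make this concrete, here is a minimal sketch of what such DataFrame code looks like, assuming a Spark 1.x setup with a SQLContext and a hypothetical input file people.json; it is illustrative rather than an example taken from the paper.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "dataframe-sketch")
    val sqlContext = new SQLContext(sc)

    // Reading an external source yields a DataFrame that carries a schema.
    val people = sqlContext.read.json("people.json") // hypothetical file

    // These relational operations only build up a logical plan;
    // nothing is computed yet because DataFrames are lazily evaluated.
    val byAge = people
      .where(people("age") >= 21)
      .groupBy("age")
      .count()

    // An action such as show() triggers Catalyst optimization and execution.
    byAge.show()

    sc.stop()
  }
}
```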

SparkSQL uses a nested data model based on Hive for tables and DataFrames, which supports all major SQL data types. Relational operations such as join, where, and groupBy can be performed on DataFrames through a domain-specific language. Catalyst is developed using Scala's functional programming constructs and contains a general library for representing evaluation trees and applying rules to manipulate them.
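
To illustrate what rule-based tree manipulation looks like, below is a toy, self-contained sketch in the same spirit as the paper's constant-folding example; Expr, Literal, Attribute, and Add here are made-up classes for illustration, not Catalyst's actual ones.

```scala
// Toy expression tree, standing in for Catalyst's tree node hierarchy.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFoldingSketch {
  // A "rule" expressed with Scala pattern matching: rewrite the tree
  // bottom-up, collapsing Add(Literal, Literal) into a single Literal.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(fold(tree)) // prints Add(Attribute(x),Literal(3))
  }
}
```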

Will this paper be influential in 10 years? Yes, I think so. It provides another way to manipulate data, one that is in high demand for data processing frameworks like Spark and MapReduce. Thanks to the tight integration between SparkSQL and Spark, developers can enjoy the rich features of the relational model while at the same time not losing the flexibility of procedural processing.