Review of "Dremel: Interactive Analysis of Web-Scale Datasets"

07 Oct 2015

Dremel, the precursor of later systems such as Cloudera Impala and Apache Drill, was designed for interactive analysis of web-scale datasets. It addresses the need for efficient ad hoc analysis of very large data.

Dremel uses a columnar storage representation for nested data. Unlike traditional row-oriented storage, it places the values of the same field from different records together, so a query only has to read the columns it touches. Combined with a tree-structured schema, this layout supports strongly typed nested records: each stored value carries a repetition level and a definition level that encode where it sits in the nested structure. The cost is that records must be reassembled efficiently from the column layout. Dremel solves this with a finite state machine.
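To make the column layout concrete, here is a minimal sketch of the idea for a deliberately simplified schema: a required `doc_id` and a single repeated integer field `links`. The schema, the function names `stripe` and `assemble`, and the sample records are my own assumptions for illustration. The paper's algorithm handles arbitrary nesting and reassembles records with a finite state machine; the loop below only covers this one-level case.

```python
from typing import Dict, List, Optional, Tuple

# One column entry: (value, repetition level, definition level).
Entry = Tuple[Optional[int], int, int]


def stripe(records: List[Dict]) -> Tuple[List[int], List[Entry]]:
    """Split records into a doc_id column and a links column."""
    doc_ids: List[int] = []
    links: List[Entry] = []
    for rec in records:
        doc_ids.append(rec["doc_id"])
        values = rec.get("links", [])
        if not values:
            # Empty list: a placeholder with definition level 0 keeps
            # the record boundary visible in the column.
            links.append((None, 0, 0))
        for i, value in enumerate(values):
            # Repetition level 0 starts a new record, 1 continues the list.
            links.append((value, 0 if i == 0 else 1, 1))
    return doc_ids, links


def assemble(doc_ids: List[int], links: List[Entry]) -> List[Dict]:
    """Rebuild the records from the two columns."""
    records: List[Dict] = []
    ids = iter(doc_ids)
    for value, rep, definition in links:
        if rep == 0:
            records.append({"doc_id": next(ids), "links": []})
        if definition == 1:
            records[-1]["links"].append(value)
    return records


records = [
    {"doc_id": 10, "links": [20, 40, 60]},
    {"doc_id": 20, "links": []},
    {"doc_id": 30, "links": [80]},
]
doc_ids, links = stripe(records)
print(links)
# [(20, 0, 1), (40, 1, 1), (60, 1, 1), (None, 0, 0), (80, 0, 1)]
print(assemble(doc_ids, links) == records)  # True
```

A query that only needs `doc_id` never touches the `links` column, which is where the savings of the columnar layout come from.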

Dremel's query language is based on SQL and is designed to be efficiently implementable on columnar nested storage. Each SQL statement takes one or more nested tables and their schemas as input and produces a nested table and its output schema. The language also supports nested subqueries, inter- and intra-record aggregation, top-k, joins, etc. Queries are executed using a multi-level serving tree. A root server receives incoming queries, reads metadata from the tables, and routes the queries to the next level in the serving tree. The leaf servers communicate with the storage layer to access the data, and partial results are aggregated on the way back up to the root.
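The serving tree can be sketched in a few lines. This is a toy, single-process model under my own assumptions: the class names `Leaf` and `Node`, the in-memory `tablet` lists, and the count-and-sum query are all hypothetical stand-ins, and there is no query rewriting, parallelism, or fault tolerance as in the real system. It only shows how a query fans out to the leaves and how partial aggregates are combined on the way back.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Predicate = Callable[[int], bool]


@dataclass
class Leaf:
    """Leaf server: scans its own slice (tablet) of one column."""
    tablet: List[int]

    def execute(self, predicate: Predicate) -> Tuple[int, int]:
        matching = [v for v in self.tablet if predicate(v)]
        return len(matching), sum(matching)


@dataclass
class Node:
    """Root or intermediate server: forwards the query, merges results."""
    children: List[Union["Node", Leaf]]

    def execute(self, predicate: Predicate) -> Tuple[int, int]:
        count = total = 0
        for child in self.children:
            child_count, child_total = child.execute(predicate)
            count += child_count
            total += child_total
        return count, total


# Roughly: SELECT COUNT(v), SUM(v) FROM T WHERE v % 2 = 0
root = Node([
    Node([Leaf([1, 2, 3]), Leaf([4, 5])]),
    Node([Leaf([6, 7, 8, 9])]),
])
print(root.execute(lambda v: v % 2 == 0))  # (4, 20)
```

Count and sum work here because they can be merged from partial results; an aggregate such as an exact median would need a different strategy.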

Will this paper be influential in 10 years? I think so. It innovatively combines an SQL query language with big data and makes analytics over large amounts of data much more efficient. Under the influence of this paper, we now have Apache Drill, Google BigQuery, Spark SQL, Apache Hive, and many other interactive query frameworks.