Review of "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data"

18 Oct 2015

Review of "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data"

Modern data analytics applications usually involve computing aggregates over a large number of records. Traditionally, such queries have been executed using sequential scans over a large fraction of a database. But increasingly, some applications require near real-time response rates. Like updating ADs based on trends in social networks. In such cases, the applications can usually tolerate certain amount of error, but the requires a fast response. BlinkDB is a massively parallel approximate query engine for running interactive SQL queries on large volumes of data which provides fast response rate with bounded errors. It allows the user to balance response rate with response time flexibly.

It consists of two main modules, 1) Sample Creation and 2) Sample Selection. Sample creation module creates stratified samples on the most frequently used QCSs(query column sets, the columns used by WHERE, GROUP BY, and HAVING) to ensure efficient execution for queries on rare values. Sample creation is formulated into an optimization problem. BlinkDB choose a collection of stratified samples with total storage costs below some user configurable storage threshold which are neither over- nor under -specialized for the query workload.

BlinkDB extends Hive framework by adding two major components, 1) an offline sampling module that creates and maintains samples over time, and 2) a run-time sample selection module that creates an Error-Latency Profile(ELP) for queries. Samples are decided by QCSs that appear in queries, once the QCSs is decided, BlinkDB do distributed reservoir sampling or binomial sampling techniques to create a range of uniform and stratified samples samples across a number of dimensions. At run-time, ELP decides the sample to run the query after considering the error and response time requirements.

Will this paper to influential in 10 years? Yes, it provides a flexible mechanism to balance response time and error rate which is needed in the presence a more and more big data systems.