Review of "Impala: A Modern, Open-Source SQL Engine for Hadoop"

07 Oct 2015

Impala is an interactive SQL query engine built on top of Hadoop. It is 5-65x faster than Apache Hive, returning responses in seconds instead of minutes. It runs natively on Hadoop/HBase storage and metadata, so there is no need to duplicate or synchronize data between multiple systems.

Impala supports most of the SQL-92 SELECT statement syntax, plus additional SQL-2003 analytic functions and most of the standard scalar data types.

An Impala deployment comprises three services. The daemon service, impalad, has a dual role: it accepts queries from client processes and orchestrates their execution across the cluster, and it executes individual query fragments on behalf of other Impala daemons. Running an impalad on each machine that also runs a datanode process allows Impala to take advantage of data locality. statestored is Impala's pub-sub service, which disseminates cluster-wide metadata to all Impala processes. Finally, catalogd serves as Impala's catalog repository and metadata access gateway.
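To make the pub-sub role of statestored concrete, here is a minimal toy sketch in Python. It only illustrates the idea of one hub pushing a metadata update to every subscribed daemon; the class names (ToyStatestore, ToyDaemon), the "catalog" topic, and the update format are all made up for illustration and do not reflect Impala's actual protocol or code.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str, dict], None]


class ToyStatestore:
    """Hypothetical topic-based pub-sub hub (not Impala's real statestored)."""

    def __init__(self) -> None:
        # topic name -> callbacks of the processes subscribed to it
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, update: dict) -> int:
        """Push an update to every subscriber; return how many received it."""
        for callback in self._subscribers[topic]:
            callback(topic, update)
        return len(self._subscribers[topic])


class ToyDaemon:
    """Hypothetical stand-in for an impalad holding a local metadata copy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.metadata: Dict[str, dict] = {}

    def on_update(self, topic: str, update: dict) -> None:
        # Merge the pushed update into this daemon's local view of the topic.
        self.metadata.setdefault(topic, {}).update(update)


statestore = ToyStatestore()
daemons = [ToyDaemon(f"impalad-{i}") for i in range(3)]
for daemon in daemons:
    statestore.subscribe("catalog", daemon.on_update)

# One published update reaches all three daemons.
delivered = statestore.publish("catalog", {"sales": {"version": 1}})
```

The point of the design is that each daemon keeps its own copy of the cluster-wide metadata, so it can plan and coordinate queries without asking a central service on every request.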

Applications communicate with Impala through an ODBC or JDBC interface. The impalad that receives a request uses the query planner to find an efficient query plan and then executes that plan across the nodes. Each node processes its local data to avoid network bottlenecks.
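The "local processing" idea can be sketched as a scatter-gather aggregation: each node pre-aggregates the rows it stores, and only the small partial results travel to the coordinator for merging. The Python below is a toy model under that assumption; the function names (execute_fragment, coordinate) and the sample data are hypothetical, and it runs in a single process rather than across a real cluster.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (group key, value), e.g. (region, revenue)


def execute_fragment(local_rows: List[Row]) -> Counter:
    """Runs on one node: sum values per key over that node's local rows only."""
    partial: Counter = Counter()
    for key, value in local_rows:
        partial[key] += value
    return partial


def coordinate(blocks_by_node: Dict[str, List[Row]]) -> Dict[str, int]:
    """Runs on the coordinator: merge the per-node partial sums."""
    total: Counter = Counter()
    for _node, rows in blocks_by_node.items():
        # In a real cluster only this small partial result would cross the
        # network; the raw rows never leave the node that stores them.
        total.update(execute_fragment(rows))
    return dict(total)


# Hypothetical data layout: which rows live on which node.
blocks_by_node = {
    "node-1": [("us", 10), ("eu", 5)],
    "node-2": [("us", 7), ("asia", 3)],
    "node-3": [("eu", 1), ("eu", 4)],
}

# Roughly: SELECT region, SUM(revenue) ... GROUP BY region
result = coordinate(blocks_by_node)
```

Here six input rows shrink to at most three partial entries per node before merging; on real data the reduction in network traffic is what makes pushing work to the data worthwhile.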

Will this paper be influential in 10 years? Yes, I think so. It provides a much more efficient way to do analytics on large-scale data, and since it is built on top of Hadoop, it can reach the large part of the big data community that is already using Hadoop MapReduce.