Review of "A Sampling Algebra for Aggregate Estimation"

18 Oct 2015

First, sampling tables is generally useful when debugging a large query or when speeding up queries by trading off some accuracy. Two major issues arise: 1) how to express such sampling queries in SQL, and 2) how to estimate how accurate the result is. This paper defines the notion of Second Order Analytical equivalence (SOA-equivalence), an equivalence relationship between query plans that is strong enough to allow quantile analysis yet weak enough to ensure commutativity of sampling and relational operators. It also defines the Generalized Uniform Sampling (GUS) operator, which models a wide range of sampling schemes and commutes with most relational operators under SOA-equivalence. The paper then develops an algebra over GUS and relational operators that allows deriving SOA-equivalent plans from which moments, and hence quantile estimates, can be computed.
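To make the accuracy trade-off concrete, here is a minimal sketch (mine, not the paper's algebra) of the kind of estimator such queries produce: a Bernoulli sample that keeps each row with probability p, with sampled values scaled by 1/p so the SUM estimate is unbiased.

```python
import random

# Illustrative only: estimate SUM(v) from a Bernoulli sample that keeps
# each row with probability p. Scaling each sampled value by 1/p makes
# the estimator unbiased; its variance shrinks as p grows.
def estimate_sum(rows, p):
    sample = [v for v in rows if random.random() < p]
    return sum(v / p for v in sample)

rows = list(range(1, 10001))          # true SUM = 50005000
print(estimate_sum(rows, p=0.01))     # noisy but unbiased estimate
```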

The high-level goal of this paper is to introduce a tool that computes confidence bounds for estimates based on sampling. It takes a query plan and transforms it into a Second Order Analytically (SOA) equivalent plan with a particular structure: all relational operators except the final aggregate form a subtree that feeds a single GUS sampling operator, and the GUS operator in turn feeds the aggregate that produces the final result. Confidence intervals are the end goal, and preserving the expected value and variance of the estimate is enough to guarantee the same confidence interval under both the CLT and Chebyshev methods. The paper therefore defines two query plans to be equivalent when their results have the same expected value and variance, as the sketch below illustrates. Although this equivalence enables significant progress, it says nothing about intermediate results, so the paper extends the notion across the classic relational algebra.
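The claim that matching the first two moments suffices follows because both interval constructions depend only on those moments. A small sketch (names are illustrative, not from the paper):

```python
import math

def clt_interval(est, var, z=1.96):
    # ~95% interval via the normal approximation
    half = z * math.sqrt(var)
    return est - half, est + half

def chebyshev_interval(est, var, coverage=0.95):
    # Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2, so pick k = sqrt(1/(1-coverage))
    k = math.sqrt(1.0 / (1.0 - coverage))
    half = k * math.sqrt(var)
    return est - half, est + half

print(clt_interval(100.0, 25.0))        # (90.2, 109.8)
print(chebyshev_interval(100.0, 25.0))  # wider, distribution-free bound
```

Two plans with identical (est, var) pairs feed identical inputs to both functions, so they yield identical intervals regardless of the underlying distributions.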

This paper also provides a theoretical foundation for GUS quasi-operators and lays out the interactions between GUS and relational operators; a small simulation of that commutativity appears below. Will this paper be influential in 10 years? I think so: it lays a fairly solid foundation for efficient sampling in relational databases with computable confidence levels, an area that is growing as systems collect ever more data, making exact queries impractical.
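As a toy illustration of the commutativity the paper formalizes (a Monte Carlo sketch of mine, not the paper's proof technique): pushing a selection below a Bernoulli sample, or pulling it above, yields estimators with matching mean and variance, i.e. SOA-equivalent plans.

```python
import random
import statistics

# Plan A: sample first, then filter the sampled rows.
def sample_then_filter(rows, p, pred):
    return sum(v / p for v in rows if random.random() < p and pred(v))

# Plan B: filter first, then sample the surviving rows.
def filter_then_sample(rows, p, pred):
    kept = [v for v in rows if pred(v)]
    return sum(v / p for v in kept if random.random() < p)

rows, p, pred = list(range(1000)), 0.1, lambda v: v % 2 == 0
a = [sample_then_filter(rows, p, pred) for _ in range(5000)]
b = [filter_then_sample(rows, p, pred) for _ in range(5000)]
print(statistics.mean(a), statistics.pvariance(a))
print(statistics.mean(b), statistics.pvariance(b))  # both approximately equal
```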