Review of "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing"

15 Sep 2015

Review of "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing"

None of the existing distributed data processing frameworks fully covers certain use cases. Batch systems such as MapReduce (and its Hadoop variants, including Pig and Hive), FlumeJava, and Spark suffer from latency problems because they must collect all input data into a batch before processing it. For many streaming systems it is unclear how they remain fault-tolerant at scale, and frameworks like Storm, while providing scalability and fault tolerance, fall short on the expressiveness and correctness axes. Others, such as Spark Streaming and Trident, are limited in their windowing support. Spark Streaming and MillWheel are both sufficiently scalable, fault-tolerant, and low-latency to act as reasonable substrates, but they lack high-level programming models that make computing event-time sessions straightforward. The major shortcoming shared by all of these systems is that they treat input data (whether unbounded or bounded) as something that will at some point become complete. To overcome this, a system must provide simple but powerful tools for balancing the amounts of correctness, latency, and cost appropriate for the specific use case at hand. The paper provides 1) a windowing model that supports unaligned event-time windows, 2) a triggering model that binds the output times of results to runtime characteristics of the pipeline, 3) an incremental processing model that integrates retractions and updates into the windowing and triggering models, and 4) a scalable implementation on top of MillWheel and FlumeJava. A sketch of how the windowing and triggering pieces compose is given below.
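To make the model concrete, here is a minimal sketch of a per-user session-sum pipeline combining unaligned event-time windows (sessions), early/late trigger firings, and accumulating refinements. The class and method names follow the Apache Beam Java SDK, the open-source descendant of this model, rather than the paper's own pseudocode, so treat the exact signatures as illustrative; the paper additionally describes an accumulating-and-retracting mode that this sketch does not show.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SessionSums {
  // Per-user session sums: unaligned event-time windows (sessions with a
  // 30-minute gap), speculative early results every minute while the
  // watermark has not yet passed the end of the window, one refinement per
  // late record afterwards, and accumulating panes so each firing updates
  // the previously emitted result.
  static PCollection<KV<String, Integer>> sessionSums(
      PCollection<KV<String, Integer>> input) {
    return input
        .apply(Window.<KV<String, Integer>>into(
                Sessions.withGapDuration(Duration.standardMinutes(30)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardHours(1))
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
  }
}
```

The point of the example is the separation of concerns the paper argues for: the window (what event-time slices to compute), the trigger (when in processing time to emit results), and the accumulation mode (how successive results for the same window relate) are specified independently, which is what lets one pipeline trade off correctness, latency, and cost.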

The windowing API and the underlying model presented in this paper seem both powerful and easy to use. Although the paper does not discuss fault tolerance or performance at all, it lays down several fundamental insights for designing the next generation of big data processing frameworks: discarding the notion of completeness, remaining flexible for future use cases, encouraging clarity of implementation, supporting robust analysis of data in the domain in which it occurred, and so on. These guidelines distill the experience the community has accumulated over years of designing data processing frameworks, and they will certainly influence later development.