Review of "Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing"

13 Sep 2015

Review of "Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing"

Before Spark Streaming, previous streaming systems employed the same record-at-a-time processing model. In this model, a streaming computation is divided into a set of long-lived stateful operators, each of which processes records as they arrive by updating internal state. The stateful nature of this model makes fault tolerance and recovery hard to handle, and stragglers even harder. Spark Streaming took a radically different approach to real-time processing: it runs each streaming computation as a series of deterministic batch computations over small time intervals. These intervals can be as short as half a second, providing sub-second end-to-end latency. The design also brings clear consistency semantics, a simple API, and unification with batch processing.
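
To make the micro-batch model concrete, here is a minimal Spark Streaming word-count sketch. The object name, the local master, the socket source on localhost:9999, and the 1-second interval are all illustrative choices of mine, not code from the paper; the point is that the batch interval passed to StreamingContext is exactly the "small time interval" over which each deterministic batch computation runs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    // Local master and socket source are illustrative placeholders.
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")

    // The batch interval is the "small time interval" from the paper:
    // each second of input becomes one deterministic batch job over an RDD.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```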

D-Streams sacrifice a small amount of efficiency for easy handling of fault tolerance and recovery. The discretized design choice also lets Spark Streaming leverage all of the features of RDDs and Spark core in a transparent way. The added latency can be mitigated by tuning the batch interval (the "sampling window") over the input data, which I think would suffice for a lot of use cases.
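
The "transparent reuse of RDD features" claim is worth illustrating. In the sketch below (my own example, not the paper's; the function names are hypothetical), a function written against plain RDDs is applied unchanged to every micro-batch of a stream via DStream.transform, the same way it would be applied to a historical RDD in a one-off batch job:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object BatchStreamUnification {
  // A batch-style RDD function, written once...
  def topWords(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.sortBy(_._2, ascending = false)

  // ...applied to each micro-batch of a live stream without modification.
  def topWordsPerBatch(counts: DStream[(String, Int)]): DStream[(String, Int)] =
    counts.transform(topWords _)
}
```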

Although the latency of D-Streams might be greater than that of systems like Storm, I feel the benefit of bringing batch processing, stream processing, and interactive queries together into a unified platform is much greater, as more and more "big data" systems not only produce a summary report after some time but also offer richer interaction with end users. Thus, I think the technique this paper proposes (micro-batching / stream discretization) will be influential in 10 years.