Review of "High-throughput, low-latency, and exactly-once stream processing with Apache Flink"

14 Sep 2015

In streaming system designs, micro-batching (discretized streaming) has three inherent problems:

1. It changes the programming model from pure streaming to micro-batching, which means users cannot compute over windows whose period is not a multiple of the micro-batch (checkpoint) interval, and it cannot support the count-based windows required by many applications (see the sketch below).
2. It is inherently hard for micro-batching systems to deal with back-pressure.
3. The micro-batch window imposes a hard lower bound on latency.

For these reasons the author points to two designs for the next generation of streaming systems: transactional updates, represented by Google Cloud Dataflow, and distributed snapshots, represented by Apache Flink.
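To make point 1 concrete, here is a minimal sketch using Flink's DataStream API of a time window whose period is independent of any batch interval, next to a count-based window. The socket source, window sizes, and class name are placeholders I made up for illustration, not anything from the reviewed post.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source: one token per line from a local socket.
        DataStream<Tuple2<String, Integer>> counts = env
                .socketTextStream("localhost", 9999)
                .map(token -> Tuple2.of(token, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT));

        // Time window: 7-second tumbling windows, a period that need not line
        // up with any micro-batch interval.
        counts.keyBy(t -> t.f0)
              .window(TumblingProcessingTimeWindows.of(Time.seconds(7)))
              .sum(1)
              .print();

        // Count window: emit a result every 100 records per key, something a
        // time-sliced micro-batch model cannot express directly.
        counts.keyBy(t -> t.f0)
              .countWindow(100)
              .sum(1)
              .print();

        env.execute("window sketch");
    }
}
```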

Google Cloud Dataflow, which follows the transactional-updates paradigm, logs record deliveries together with updates to the state, so that after a failure both state and record deliveries can be replayed from the log. Apache Flink, on the other hand, follows the distributed-snapshots paradigm: it periodically draws a snapshot of all the state the streaming computation is in, including in-flight records and operator state, and when a failure happens it restores from the latest snapshot kept in durable storage. The implementation is based on the Chandy-Lamport algorithm and guarantees exactly-once stream processing.
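For illustration, a minimal sketch of how barrier-based snapshotting is switched on in a Flink job. The 10-second interval, the placeholder pipeline, and the class name are my own invented values; where the snapshots are stored is configured separately.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject checkpoint barriers at the sources every 10 seconds and take
        // an exactly-once snapshot of operator state as the barriers flow
        // through the dataflow graph.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // The durable storage for the snapshots (state backend / checkpoint
        // storage) is configured separately, e.g. in flink-conf.yaml.

        // Trivial placeholder pipeline so the job graph is not empty.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpointing sketch");
    }
}
```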

Although it claims high throughput, Flink still has to balance throughput against recovery time: the higher the desired throughput, the larger the "barrier" (snapshot) interval needs to be, which means more records and operator state accumulate between snapshots and more work has to be redone after a failure.
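A hypothetical back-of-envelope (all numbers invented, not taken from the post) of what a longer barrier interval costs at recovery time:

```java
public class RecoveryEstimate {
    public static void main(String[] args) {
        // Purely illustrative figures.
        long recordsPerSecond = 1_000_000L;   // sustained ingestion rate
        double checkpointIntervalSec = 60.0;  // barrier/snapshot interval

        // On failure, everything since the last completed snapshot must be
        // replayed: on average about half an interval's worth of input.
        double avgReplayedRecords = recordsPerSecond * checkpointIntervalSec / 2.0;

        System.out.printf("~%.0f records to replay after an average failure%n",
                avgReplayedRecords);
    }
}
```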

Will Flink still be influential in 10 years? I don't know yet; I would need to use it in an actual project to see how it works in practice.