Review of "Naiad: A Timely Dataflow System"
Many distributed dataflow systems exist in the big data ecosystem. Some provide high-throughput batch processing, some provide low-latency stream processing, and others offer iterative computation. But many applications require all of these features at once, in which case developers usually have to write code on several different platforms, which brings high development and maintenance costs. Naiad is a distributed system for executing data-parallel cyclic dataflow programs. It offers high-throughput batch processing, low-latency stream processing, and the ability to perform iterative and incremental computations, all in one framework.
It achieves this with a new computational model, timely dataflow. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. Timely dataflow supports the following three features (see the sketch after this list):
- structured loops allowing feedback in the dataflow.
- stateful dataflow vertices capable of consuming and producing records without global coordination.
- notifications for vertices once they have received all records for a given round of input or loop iteration.
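To make these features concrete, here is a minimal Python sketch of a logical timestamp and a stateful vertex driven by notifications. This is my own illustration under simplifying assumptions; names such as `Timestamp`, `on_recv`, `on_notify`, and `could_result_in` are hypothetical and this is not Naiad's actual C# API.

```python
from dataclasses import dataclass
from typing import Tuple

# A timely-dataflow-style timestamp: an input epoch plus one counter per
# enclosing loop. For simplicity, both timestamps are assumed to have the same
# loop structure and are compared pointwise (the "could-result-in" order).
@dataclass(frozen=True)
class Timestamp:
    epoch: int
    loop_counters: Tuple[int, ...] = ()

    def could_result_in(self, other: "Timestamp") -> bool:
        return (self.epoch <= other.epoch
                and len(self.loop_counters) == len(other.loop_counters)
                and all(a <= b for a, b in zip(self.loop_counters, other.loop_counters)))

# A stateful vertex: on_recv handles records as they arrive, with no global
# coordination; on_notify fires once no record with timestamp <= t can still arrive.
class CountByTimestamp:
    def __init__(self, send):
        self.counts = {}   # vertex-local state
        self.send = send   # callback delivering output to a downstream edge

    def on_recv(self, message, t: Timestamp) -> None:
        self.counts[t] = self.counts.get(t, 0) + 1

    def on_notify(self, t: Timestamp) -> None:
        # Safe to emit: the round of input or loop iteration identified by t is complete.
        self.send(self.counts.pop(t, 0), t)
```

The point of splitting `on_recv` from `on_notify` is that a vertex can process records with low latency as they stream in, while the notification tells it when a round of input or a loop iteration is complete, which is what lets batch-style and iterative computation coexist with streaming in one model.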
We can see that Naiad makes several trade-offs in order to achieve these goals. First, it forces application developers to interact with its complicated graph and timestamp system. Second, in order to achieve higher performance, all vertices are stateful, which adds complexity to fault tolerance and makes straggler handling even trickier. Under this framework, to achieve fault tolerance, application developers have to implement checkpoint logic themselves, which increases their burden, and what's worse, this burden has nothing to do with the application's business logic (see the sketch below)!

To mitigate stragglers, Naiad has to deal with network hardware and protocol optimization, .NET garbage collection, and contention on vertex state. It is doubtful that this is effective: if one machine in a giant cluster is a bit sluggish, how will these mechanisms prevent it from holding everything up? This is certainly not a big issue in a smaller cluster where all machines are under close control, but for many users it could be a problem.

Lastly, the authors do not mention anything about data locality optimization. What they do instead is use very fast switches to connect a relatively small number of workers. Given a design that ships messages across the network rather than scheduling computation near the data, data locality optimization is probably not realistic.
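To illustrate the checkpointing burden mentioned above, here is a hypothetical sketch of what per-vertex checkpoint plumbing might look like; the method names are my own, not Naiad's actual C# checkpoint interface.

```python
import pickle

# Hypothetical stateful vertex: the summing logic is the business logic,
# while checkpoint/restore exist purely for fault tolerance.
class WindowedSum:
    def __init__(self):
        self.partial_sums = {}  # vertex-local state keyed by timestamp

    def on_recv(self, value, t) -> None:
        self.partial_sums[t] = self.partial_sums.get(t, 0) + value

    # Fault-tolerance plumbing the developer must write and maintain:
    def checkpoint(self) -> bytes:
        return pickle.dumps(self.partial_sums)

    def restore(self, blob: bytes) -> None:
        self.partial_sums = pickle.loads(blob)
```

Even in this toy example the checkpoint code is pure plumbing; for a vertex with larger and more intricate state, getting it wrong silently breaks recovery.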
It is really desirable to have high throughput, low latency, and iterative computation all integrated in one framework. Built on Spark, Spark Streaming adds low-latency stream processing, making Spark able to achieve all three goals mentioned by the paper, even though, according to the paper, its latency is not comparable with Naiad's. But Spark certainly provides a much simpler programming interface. Besides, Naiad is tightly bound to the .NET platform, which has a poor track record when competing with the JVM platform in the enterprise computing market. Thus, I doubt that the technology introduced in the paper will still be influential in 10 years. Anyway, that is just my thought.