Review of "GraphX: Graph Processing in a Distributed Dataflow Framework"
Distributed dataflow frameworks are generally slower than specialized graph processing systems which provide tailored programming abstractions and accelerated the execution of iterative graph algorithms. But graph analysis is usually just a part of a larger analytics process, the problem of using several systems together to solve one problem is not very charming. And graph processing systems often abandon fault tolerance in favor of snapshot recovery. In contrast, general purpose dataflow frameworks are good at analyzing unstructured and tabular data, but fall short on iterative graph algorithms which requires multiple stages of complex joins. And they often miss the opportunity of leveraging the common patterns of structure in iterative graph algorithms. This paper argues that many of the specialized graph processing systems can be recovered in a modern general purpose distributed dataflow system. GraphX is introduced as an embedded graph processing framework built on top of Spark. To overcome the performance issues, GraphX recasts graph optimizations as distributed join optimizations and materialized view maintenance. And since it's built on Spark, it's automatically fault tolerant.
GraphX leverages normalized representation of property graph as a pair of vertex and edge property collections to embed graphs in a distributed dataflow framework. Graph parallel computation is the process of computing aggregate properties of the neighborhood of each vertex. It can be expressed in a distributed dataflow framework as a sequence of join stages and group-by stages punctuated by map operations. In join, vertex and edge properties are joined to form the triplets view consisting of each edge and its corresponding source and dest vertex properties. And in group-by stage, the triplets are grouped by source or destination vertex to construct the neighborhood of each vertex and compute aggregates.
Will this paper be influential in 10 years? Yes, I think so. We have already have so many processing frameworks, it's good to integrate graph analytics and tabular data analytics into one framework. And this paper also introduces a general way of building graph processing engines on other dataflow frameworks too.