Summary of "MapReduce: Simplified Data Processing on Large Clusters"

03 Sep 2015

First, what is MapReduce? According to the authors, "MapReduce is a programming model and an associated implementation for processing and generating large data sets." It places some restrictions on programming style, and in exchange the runtime automatically provides massive parallelization, fault tolerance, locality optimization, and load balancing for the underlying tasks the program is trying to perform.

MapReduce enforces a programming model of the following form. In Google's C++ implementation, all keys and values are strings; in Hadoop's Java implementation, typed wrappers such as IntWritable serve the same role.

map (k1, v1) -> list(k2, v2)

reduce (k2, list(v2)) -> list(v2)
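The paper's canonical example of this model is word count; the original is C++ pseudocode, but the same two functions can be sketched in Python:

```python
def map_fn(key, value):
    """map(k1, v1) -> list(k2, v2): for a (document name, contents)
    pair, emit (word, "1") for every word in the contents."""
    return [(word, "1") for word in value.split()]

def reduce_fn(key, values):
    """reduce(k2, list(v2)) -> list(v2): sum all the "1" counts
    emitted for a single word."""
    return [str(sum(int(v) for v in values))]
```

Counts are strings here only to mirror the string-typed primitives of the C++ implementation mentioned above.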

The underlying framework automatically partitions the input data into M pieces (typically 16MB to 64MB per piece, controllable by the user program). Each piece is handed to a worker from a large worker pool (the pool is usually much smaller than the partition count M, which allows more flexible scheduling and load balancing). The selected worker runs the map function on its piece; the framework tries to place each map task on a machine that already holds a copy of that data, to minimize network traffic. The map function generates intermediate (k2, v2) pairs, which are periodically written to local disk, with their locations reported to the reducers through the master node (the node that controls all scheduling and coordination of tasks). When a reducer is notified, it uses RPC to read the intermediate results, sorts and groups the values by intermediate key to form (k2, list(v2)) pairs, and passes each pair to the user's reduce function; the generated output is appended to a final output file for that reduce partition. There are R reduce tasks, specified by the user, and each generates one output file.
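The flow the framework automates can be imitated in a toy, single-process sketch: map over each input split, hash-partition the intermediate pairs into R reduce partitions, sort and group by key, then reduce (this is an illustration of the data flow only, with none of the distribution or fault tolerance):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R):
    """Toy single-process version of the MapReduce execution flow.
    Returns one list of (key, value) results per reduce partition,
    standing in for the R output files."""
    # Map phase: hash-partition intermediate pairs into R buckets,
    # mimicking the R regions a real map worker writes to local disk.
    partitions = [defaultdict(list) for _ in range(R)]
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            partitions[hash(k2) % R][k2].append(v2)
    # Reduce phase: sort keys, group values, call reduce once per key.
    outputs = []
    for part in partitions:
        out = []
        for k2 in sorted(part):
            for v in reduce_fn(k2, part[k2]):
                out.append((k2, v))
        outputs.append(out)
    return outputs
```

In the real system the sort/group step happens on the reduce workers after they fetch the map outputs over RPC; here it is simply an in-memory dictionary per partition.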

MapReduce provides fault tolerance through periodic health-check pings. If no response is received from a worker within a certain amount of time, the master marks that worker's tasks as idle so they can be rescheduled to other workers. Even completed map tasks must be re-executed when their worker fails, because their results are stored on that worker's local disk (completed reduce tasks need no re-execution, since their output lives in the global file system). For master node failure there is currently no fault tolerance; the implementation simply aborts the computation.
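The master's failure-handling rule can be sketched as follows (the data structures and the 10-second timeout are assumptions for illustration, not details from the paper):

```python
PING_TIMEOUT = 10.0  # assumed seconds without a ping before a worker is presumed dead

def check_workers(last_ping, task_state, now):
    """Sketch of the master's failure handling: any task owned by a worker
    that has not pinged within PING_TIMEOUT is reset to 'idle' so it can be
    rescheduled -- including *completed* map tasks, whose output lived on the
    dead worker's local disk. Completed reduce tasks are left alone, since
    their output is already in the global file system."""
    for task, info in task_state.items():
        worker = info["worker"]
        if worker is None:
            continue  # task not currently assigned to anyone
        if now - last_ping.get(worker, 0.0) > PING_TIMEOUT:
            if info["kind"] == "map" or info["state"] == "in_progress":
                info["state"] = "idle"
                info["worker"] = None
    return task_state
```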

Another optimization in MapReduce is the use of backup tasks to mitigate the "straggler" effect. If a cluster contains some "straggler" machines, they tend to produce a long tail in the overall execution time. MapReduce solves this by having other workers execute duplicates of the remaining in-progress tasks when the job is close to completion; whichever copy finishes first marks the task as completed.
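The backup-task trigger can be sketched like this (the paper only says backups are scheduled when the operation is "close to completion"; the 90% threshold here is an assumption):

```python
def backup_candidates(tasks, total, threshold=0.9):
    """Sketch of backup-task selection: once at least `threshold` of all
    tasks have completed (90% is an assumed cutoff, not from the paper),
    return the still-in-progress tasks so the master can launch duplicate
    executions. Whichever copy of a task finishes first wins."""
    done = sum(1 for state in tasks.values() if state == "completed")
    if done / total < threshold:
        return []  # job not yet near completion; no backups scheduled
    return [name for name, state in tasks.items() if state == "in_progress"]
```

The paper reports that this mechanism cost only a few percent extra resources while cutting the sort benchmark's completion time substantially.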

Other features provided by MapReduce include the ability to skip a bad record on the next run; a small internal HTTP server that displays the cluster's metrics so people can see what is going on; and a distributed counter facility for sanity checks.
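The counter facility's key subtlety is that backup and re-executed tasks must not be double-counted; a minimal sketch of master-side aggregation (the class and its interface are illustrative assumptions, not the paper's API):

```python
class CounterAggregator:
    """Sketch of the distributed counter facility: workers piggyback
    per-task counter values on their pings, and the master keeps only
    one copy per (task, counter) pair so duplicate or backup executions
    of the same task are not counted twice."""
    def __init__(self):
        self.reports = {}  # (task_id, counter_name) -> latest reported value

    def report(self, task_id, counters):
        # A re-report from a duplicate execution simply overwrites
        # the earlier value for the same task, rather than adding to it.
        for name, value in counters.items():
            self.reports[(task_id, name)] = value

    def totals(self):
        # Global counter values: sum across tasks, once per task.
        agg = {}
        for (_, name), value in self.reports.items():
            agg[name] = agg.get(name, 0) + value
        return agg
```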

The paper also provides some basic performance numbers for the framework. On a cluster of roughly 1,800 machines, each with two 2GHz Intel Xeon processors, 4GB of memory, and two 160GB IDE disks, grepping through 1TB of data takes approximately 150 seconds, and sorting 1TB of data takes approximately 891 seconds.

The paper also discusses related work such as MPI and Bulk Synchronous Programming; the key difference is that MapReduce exploits a restricted programming model, and by doing so automatically provides parallelization and the other benefits above. MapReduce also relies on Google's in-house cluster management system, which is similar to systems such as Condor.