Review of "Sparrow: Distributed, Low Latency Scheduling"

30 Sep 2015

Large-scale data analytics is shifting towards shorter task durations and larger degrees of parallelism in order to provide low latency, so fast job scheduling is becoming more and more important. As an example, a cluster containing ten thousand 16-core machines and running 100ms tasks may require over 1 million scheduling decisions per second. Against this background comes Sparrow, a low-latency decentralized scheduler.
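The scheduling rate quoted above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the function name is made up for illustration, and it assumes that every core runs back-to-back tasks of exactly the given duration.

```python
def scheduling_decisions_per_second(machines, cores_per_machine, task_duration_ms):
    """Task launches per second needed to keep every core busy.

    Assumption (not from the paper): each core runs one task at a time,
    back to back, and every task lasts exactly task_duration_ms.
    """
    total_cores = machines * cores_per_machine
    tasks_per_core_per_second = 1000 / task_duration_ms
    return total_cores * tasks_per_core_per_second


rate = scheduling_decisions_per_second(10_000, 16, 100)
print(f"{rate:,.0f} scheduling decisions per second")
```

With 160,000 cores each finishing ten tasks per second, the fully loaded cluster needs on the order of a million or more task placements every second.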

Sparrow takes a radically different approach from traditional schedulers. First, it assumes that a long-running executor process is already running on each worker machine for each framework, so Sparrow only needs to send a short task description to launch a task. It uses batch sampling together with late binding to achieve low-latency, stateless scheduling. More concretely, if a job has m tasks to schedule, the scheduler doesn't maintain any information about the cluster. Instead, it sends RPC probes to dm worker machines, where d is the number of probes per task. Each worker that receives a probe puts a reservation for the job in its task queue, and nothing more happens for that probe until the reservation reaches the front of the queue. At that point the worker asks the scheduler for a task, and the scheduler replies with either a task description if it still has tasks to schedule, or a NOP if all tasks have already been launched. Sparrow also uses proactive cancellation: once all of a job's tasks have been launched, the scheduler cancels the outstanding reservations so the remaining workers do not need to ask at all.
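To make batch sampling with late binding concrete, here is a minimal single-process sketch. It is not Sparrow's implementation: the `Scheduler`, `Worker`, and `submit` names are hypothetical, RPCs are replaced by direct method calls, tasks finish instantly, and proactive cancellation is left out.

```python
import random
from collections import deque


class Scheduler:
    """Holds the unlaunched tasks of one job; keeps no cluster state."""

    def __init__(self, tasks):
        self.pending = deque(tasks)

    def get_task(self):
        # Late binding: a task is bound to a worker only when that worker
        # is ready to run it. None stands in for the NOP reply.
        return self.pending.popleft() if self.pending else None


class Worker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()  # reservations, in arrival order
        self.ran = []         # tasks this worker ended up running

    def reserve(self, scheduler):
        self.queue.append(scheduler)

    def step(self):
        """Serve the reservation at the front of the queue, if any."""
        if not self.queue:
            return
        scheduler = self.queue.popleft()
        task = scheduler.get_task()
        if task is not None:
            self.ran.append(task)


def submit(scheduler, workers, d, rng):
    """Batch sampling: place d * m reservations on randomly chosen workers."""
    m = len(scheduler.pending)
    probed = rng.sample(workers, min(d * m, len(workers)))
    for worker in probed:
        worker.reserve(scheduler)
    return probed


rng = random.Random(0)
workers = [Worker(f"w{i}") for i in range(10)]
job = Scheduler(["t0", "t1", "t2"])
probed = submit(job, workers, d=2, rng=rng)

for worker in workers:
    worker.step()

launched = [task for worker in workers for task in worker.ran]
print(len(probed), "probes,", len(launched), "tasks launched")
```

In this toy run the job's three tasks go to the first three probed workers that reach their reservation, and the other three probed workers receive the NOP. In a real cluster the workers with the shortest queues reach their reservations first, which is what makes late binding useful.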

Scheduling policies and constraints are also handled by sampling. For per-job constraints, for example a job that requires workers with a GPU, Sparrow selects the dm workers to probe from those that satisfy the constraint. It also handles jobs with per-task constraints, such as data locality constraints, where different tasks may have different locality preferences: for each task it selects two machines to probe from the set of machines the task is constrained to run on. One question here is how much data locality Sparrow can exploit with this sampling technique, given that it has no aggregated view of the cluster. For resource allocation, Sparrow supports strict priorities and weighted fair sharing, just like other schedulers including the Hadoop MapReduce scheduler.
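The per-task case can be sketched as follows. This is an illustration only: the function name, the task ids, and the worker names are all invented, and the `allowed` mapping stands in for whatever locality information the framework supplies.

```python
import random


def probes_for_constrained_tasks(allowed, d=2, rng=random):
    """Pick up to d machines to probe for each task.

    `allowed` maps a task id to the machines that task may run on
    (for example, the machines holding its input data).
    """
    return {
        task: rng.sample(machines, min(d, len(machines)))
        for task, machines in allowed.items()
    }


allowed = {
    "map-0": ["w1", "w4", "w7"],
    "map-1": ["w2", "w4"],
    "map-2": ["w3", "w5", "w8", "w9"],
}
probes = probes_for_constrained_tasks(allowed, d=2, rng=random.Random(1))
for task, machines in probes.items():
    print(task, "->", machines)
```

Each task is probed only on machines it is allowed to run on, so the choice is limited to d candidates per task rather than the whole batch of dm probes.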

Will this paper be influential in 10 years? I think so. Smaller tasks with strict low-latency requirements are becoming more and more prevalent. The solution proposed by this paper is very creative in that it needs no centralized store of aggregated cluster state, yet still achieves good resource utilization and provides low-latency scheduling.