Review of "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center"

26 Sep 2015

Review of "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center"

Mesos, often referred as the operating system for data centers, is a platform for sharing commodity clusters between multiple diverse cluster computing frameworks such as Hadoop and MPI which can achieve near optimal data locality when sharing the cluster among diverse frameworks, and can scale out to 50,000 (emulated) nodes and be resilient.

Mesos acts as another abstraction level of computation power. Instead of having the framework schedulers send tasks to the physical machines, Mesos takes over the task and distributes it to machines. By doing this, Mesos can do optimization on task distribution and flexibly scale different computations according to different needs.

A Mesos framework is a distributed system that has a scheduler, like an app of the cluster. The resource allocation mechanism works like this. First, Mesos has the total knowledge of all the hardware resources of the cluster, through a registration process. When a new framework needs to run, Mesos takes a look at the resource pool, selects a resource and offer to the framework, the framework can reject this offer or accept it. (The key reason why Mesos doesn't allow frameworks to specify resource requirements is to keep Mesos simple but still capable of supporting arbitrarily complex resource constraints, but this might bring efficiency problems, because a framework might need to wait for a lot of time to get the required resources. Mesos solves this through a filters optimization, which is like a hint to Mesos what resource will be accepted.) Resource allocation is delegated to a pluggable allocation module, so that it can be tailored for different needs and evolve separately. Similar to allocation, resource isolation is also pluggable, currently it isolates resource through OS container technologies like Linux Containers and Solaris Projects.

Fault Tolerance of Mesos is achieved by hot standby master replications with ZooKeeper. And node failures is propagated to the specific framework for them to react. Schedulers can also register an alternative schedulers, so that it can be notified after the main scheduler failed.

Mesos is written in about 10,000 lines of C++, built on top of libprocess which provides actor-based concurrency model using efficient asynchronous I/O mechanism (epoll, kqueue, etc). And ZooKeeper is used to perform leader election. One thing to note is that Mesos also enables the creation of a Spark which is known for efficiency thanks to Mesos.

Will this paper be influential in 10 years? I think so. As more and more distributed systems coming out, there's a need to have an OS to manage all of the different resources. Mesos, with its simplicity and efficiency and the rich ecosystem – Spark, Marathon, Cronos, etc. will certainly benefit the industry in the long run.