Review of "Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing"

Before Spark Streaming, previous streaming systems employed the same record-at-a-time processing model. In this model, a streaming computation is divided into a set of long-lived stateful operators, each of which processes records as they arrive by updating internal state. The stateful nature of this model makes fault tolerance and recovery hard to handle, and stragglers even harder. Spark Streaming took a radically different approach to real-time processing: it runs each streaming computation as a series of deterministic batch computations on small time intervals. These intervals can be as small as half a second, giving end-to-end latency below one second. The design also brings clear consistency semantics, a simple API, and unification with batch processing.
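
As a concrete illustration of the micro-batch API, here is a minimal word-count sketch in Scala against the Spark Streaming API (the one-second batch interval, local master, and socket source are choices of mine for the example, not from the paper):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
        // Each batch covers a one-second interval; the stream becomes a series of small RDD jobs.
        val ssc = new StreamingContext(conf, Seconds(1))
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }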

D-Streams sacrifice a small amount of efficiency in exchange for easy fault tolerance and recovery. The discretized design also lets them leverage all of the features of RDDs and the Spark core transparently. The added latency can be mitigated by tuning the batch interval (the "sampling window") of the input data, which I think would suffice for a lot of use cases.

Although the latency of D-Streams may be higher than that of systems like Storm, I feel the benefit of bringing batch processing, stream processing, and interactive queries together into a unified platform is much greater, as more and more "big data" systems not only produce a summary report after some time but also offer richer interaction with end users. Thus, I think the technology this paper proposes (micro-batching / stream discretization) will still be influential in 10 years.

Review of "Naiad: A Timely Dataflow System"

Many distributed dataflow systems exist in the big data ecosystem. Some provide high throughput, some provide low-latency stream processing, and others offer iterative computation. But many applications require all of these features at once, in which case developers usually have to write code for several different platforms, with high development and maintenance cost. Naiad is a distributed system for executing data-parallel cyclic dataflow programs. It offers high-throughput batch processing, low-latency stream processing, and the ability to perform iterative and incremental computations, all in one framework.

It achieves this with a new computational model, timely dataflow. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. Timely dataflow supports the following three features (a sketch of the resulting vertex interface follows the list):

  1. Structured loops allowing feedback in the dataflow.
  2. Stateful dataflow vertices capable of consuming and producing records without global coordination.
  3. Notifications for vertices once they have received all records for a given round of input or loop iteration.
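
The paper exposes these features through a small vertex interface; the following is a rough Scala transliteration of it (the paper presents it in C#-style pseudocode, and the Edge and timestamp types here are placeholders of my own):

    // Callbacks a timely dataflow vertex implements, plus the system calls it may invoke.
    final case class Edge(src: Int, dst: Int)   // placeholder identifier for a dataflow edge

    trait TimelyVertex[Msg, Time] {
      // Invoked by the system when a batch of records arrives on an edge at logical time `time`.
      def onRecv(edge: Edge, message: Msg, time: Time): Unit
      // Invoked once no more records with timestamps at or before `time` can arrive (the notification).
      def onNotify(time: Time): Unit

      // Calls the vertex makes back into the system from the callbacks above.
      def sendBy(edge: Edge, message: Msg, time: Time): Unit
      def notifyAt(time: Time): Unit
    }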

We can see that it makes several trade-offs in order to achieve its goals. First, it forces application developers to interact with its complicated graph and timestamp system. Second, in order to achieve higher performance, all vertices are stateful, which makes fault tolerance more complex and straggler handling even trickier. Under this framework, to achieve fault tolerance, application developers have to implement checkpoint logic themselves, which increases their burden; worse, this burden has nothing to do with the application's business logic. To mitigate straggler effects, Naiad has to resort to network hardware and protocol optimizations, .NET garbage collection tuning, and reducing contention on vertex state, and it is doubtful how effective these are: if a machine in a giant cluster is simply a bit sluggish, how will these mechanisms prevent it from delaying the computation? This is not a big issue in a smaller cluster where all machines are under close control, but for many users it could be a problem. Lastly, the authors do not mention anything about data locality optimization; instead they use very fast switches to connect a relatively small number of workers. Given the design of central message dispatching, it is probably not realistic to optimize for data locality.

It is really desirable to have high throughput, low latency, and iterative computation all integrated in one framework. Built on Spark, Spark Streaming adds low-latency stream processing, making Spark able to achieve all three goals mentioned by the paper, although according to the paper its latency is not comparable with Naiad's. But Spark certainly provides a much simpler programming interface. Besides, Naiad is tightly bound to the .NET platform, which has a poor track record when competing with the JVM platform in the enterprise computing market. Thus, I doubt the technology introduced in the paper will still be influential in 10 years. Anyway, just my thought.

Summary of "Spark: Cluster Computing with Working Sets"

MapReduce and its variants have been very successful for big data analysis. They achieve locality-aware scheduling, fault tolerance, and load balancing by requiring users to express their computations as acyclic dataflow graphs. While this model is useful for a large class of applications, the restriction makes it inefficient to reuse a working set of data across multiple parallel operations, which is exactly what use cases like iterative jobs and interactive analytics need. Spark solves this with the resilient distributed dataset (RDD), a read-only collection of objects partitioned across a set of machines that can be explicitly cached in memory and can be rebuilt if a partition is lost. Because it is implemented in Scala, Spark can also be used interactively, which the authors claim is a first for a system of this kind.

Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets. Spark lets programmers construct RDDs in four ways (a code sketch follows the list):

  1. From a file in a shared file system such as Hadoop Distributed File System (HDFS).
  2. By "parallelizing" a Scala collection in the driver program.
  3. By transforming an existing RDD through operations such as flatMap, map, filter, etc.
  4. By changing the persistence of an existing RDD, i.e., via the cache and save actions on datasets.
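
A minimal Scala sketch of the four construction routes (the file path and master URL are placeholders of my own, and this uses the current SparkContext API rather than the 2010 prototype's):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("RddSketch"))

    // 1. From a file in a shared file system (hypothetical HDFS path).
    val lines  = sc.textFile("hdfs://namenode:9000/data/events.log")

    // 2. By "parallelizing" a Scala collection in the driver program.
    val nums   = sc.parallelize(1 to 1000)

    // 3. By transforming an existing RDD.
    val errors = lines.filter(_.contains("ERROR")).map(_.toLowerCase)

    // 4. By changing persistence: cache keeps the dataset in memory after it is first computed.
    val cachedErrors = errors.cache()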

Spark supports several parallel operations: reduce, collect, and foreach. Unlike MapReduce, it does not support a grouped parallel reduce; only the driver can collect reduce results. Programmers invoke operations like map, filter, and reduce by passing closures to Spark. Closures can access variables in the scope where they are defined; Spark achieves this by copying those variables to the workers. Two restricted types of shared variables are also supported. Broadcast variables are used when a large read-only piece of data (e.g., a lookup table) is used by multiple operations, because it is better to distribute it once rather than with every closure. Accumulators are similar to MapReduce counters, providing a more imperative syntax for parallel sums.
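
A small sketch of the two shared-variable types in Scala (using the current Spark API rather than the paper's prototype; the data and lookup table are made up for the example):

    val data   = sc.parallelize(Seq("a", "b", "c", "a"))
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // shipped to each worker once, not per closure
    val misses = sc.longAccumulator("misses")            // workers only add; the driver reads the total

    val total = data.map { key =>
      if (!lookup.value.contains(key)) misses.add(1)     // accumulator used as an imperative counter
      lookup.value.getOrElse(key, 0)
    }.reduce(_ + _)                                      // the reduce result is collected at the driver

    println(s"total=$total, misses=${misses.value}")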

Spark is built on top of Mesos, a "cluster operating system" that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on the cluster. This simplifies Spark's implementation and also makes it possible for Spark to run alongside other frameworks. Each RDD is stored as a chain of objects capturing its lineage: each dataset has a pointer to its parent and information about how it was derived from that parent. Every RDD object implements three common operations (sketched as a trait below):

  1. getPartitions, which returns a list of partition IDs.
  2. getIterator(partition), which iterates over a partition.
  3. getPreferredLocations(partition), for locality-aware scheduling.
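
A rough Scala rendering of that common interface (the trait name and types are illustrative, not Spark's actual internals):

    trait RddSketch[T] {
      def getPartitions: Seq[Int]                              // list of partition IDs
      def getIterator(partition: Int): Iterator[T]             // iterate over one partition
      def getPreferredLocations(partition: Int): Seq[String]   // hosts preferred for locality-aware scheduling
    }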

When a parallel operation is invoked on a dataset, Spark creates a task for each partition of the dataset and sends these tasks to worker nodes. Delay scheduling is used to achieve locality optimization. Once launched on a worker, each task calls getIterator to start reading its partition.

Closures are shipped to workers using Java serialization, since in Scala a closure is itself a Java object. The shared variables are implemented with custom serialization hooks.

The Scala interpreter operates by compiling a class for each line of user input. To make the interpreter work with Spark, it was changed to output the classes it defines to a shared file system, so that workers can load them with a custom class loader. The singleton object generated for each line was also changed to reference the singleton objects for previous lines directly, rather than going through getInstance() methods.
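
Roughly, the effect of the change can be pictured like this (an illustrative Scala sketch, not the interpreter's actual generated code): each input line gets its own singleton holding that line's definitions, and later lines reference earlier singletons directly instead of going through a getInstance() lookup.

    object Line1 { val x = 5 }
    object Line2 { val y = Line1.x * 2 }   // direct reference to the earlier line's singleton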

In the evaluation, Spark performs comparably to frameworks such as Hadoop on the first query or iteration, partly because of Scala execution overhead, but subsequent iterations run up to 10x faster thanks to its caching mechanism.

Spark has won wide adoption in industry. With higher-level components such as Spark Streaming, MLlib, GraphX, and Spark SQL, it will attract even more people from different areas. I think that for the foreseeable future, Spark is going to dominate the big data processing space.

Summary of "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks"

Dryad is a "general-purpose, high performance distributed execution engine. It's built under the inspiration of shader languages developed for GPU, Google's MapReduce and parallel databases. It focuses more on simplicity of the programming model and reliability, efficiency and scalability of the applications while side-stepped problems like high-latency and unreliable wide-area networks, control of resources by separate federated or competing entities and ACL, etc. It provides task scheduling, concurrency optimization in a computer level, fault tolerance and data distribution. One of the unique feature provided by Dryad is the flexibility of fine control of an application's data flow graph. This gives programmer the opportunity to optimize trade offs between parallelism and data distribution overhead thus gives "excellent performance" according to the paper.

A Dryad job is a DAG where each vertex is a program and each edge is a data channel; a channel can be shared memory, a TCP pipe, or a temporary file. In contrast to MapReduce, Dryad does not force data through a serialization format: from a vertex program's perspective, what arrives is a heap object passed from the previous vertex, which saves a lot of data-parsing headaches.

A Dryad job is coordinated by a process called the job manager, which can run either within the compute cluster or on a remote workstation that has access to the cluster. The job manager constructs the DAG and schedules work across the available resources. To discover those resources, each computer in the cluster runs a proxy daemon that registers with a central name server; the job manager queries the name server to get the list of available computers.

Dryad uses C++ operator overloading and method calls to create a DSL for describing DAGs. It supports vertex creation, edge creation, and graph-merging operations. One interesting property is that Dryad can turn a whole graph G into a single vertex V(G), essentially the composite design pattern, which greatly improves reusability. The primary API for writing Dryad vertices is C++-based, but it is straightforward to implement API wrappers for other languages such as C#. At runtime, a vertex receives a closure from the job manager describing the vertex to be run and URIs for its inputs and outputs. Dryad also supports an event-based programming style within a vertex for writing concurrent code, which can potentially make a single vertex execution more efficient.
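
This is not Dryad's actual C++ API, but the composite idea can be sketched in a few lines of Scala: a subgraph is wrapped so the rest of the graph-building code can treat it exactly like a single vertex.

    // Illustrative only: both a single program and an encapsulated subgraph are "nodes",
    // so composition code does not need to distinguish them (the composite pattern).
    sealed trait Node
    final case class Vertex(program: String) extends Node
    final case class SubGraph(nodes: Seq[Node], edges: Seq[(Node, Node)]) extends Node

    // Connecting two nodes works the same whether either side is a vertex or a whole subgraph.
    def connect(from: Node, to: Node): SubGraph = SubGraph(Seq(from, to), Seq(from -> to))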

In Dryad, a scheduler inside the job manager tracks the state of each vertex. If every vertex finishes successfully, the whole job is finished. If a vertex fails, it is re-run, but only up to a threshold number of times; beyond that, the entire job is failed. Dryad also provides a visualizer and a web interface for monitoring cluster state. Fault detection relies on the proxy daemons communicating with the job manager; if a proxy fails, a timeout is triggered in the job manager indicating that its vertices have failed. Dryad also provides a backup-task mechanism for vertices that run slower than their peers, similar to the one used in MapReduce.

Dryad's DAG-based data parallelism makes it more expressive for solving a variety of large-scale problems. The dynamic graph refinement it provides also makes it efficient in many cases. For a hand-coded read-only query, its performance is clearly superior to that of a commercial database system. One caveat is that only one job can run in a cluster at a time, because the job manager assumes exclusive control over all computers in the cluster.

Summary of "MapReduce: Simplified Data Processing on Large Clusters"

First, what is MapReduce? According to the authors, "MapReduce is a programming model and an associated implementation for processing and generating large data sets." It restricts the programming style, and by doing so it provides automatic parallelization, fault tolerance, locality optimization, and load balancing for the underlying tasks the program aims to perform.

MapReduce enforces a programming model with the following signatures; a word-count sketch follows them. In the C++ implementation all of the primitives are strings (in Hadoop's Java implementation you can use typed wrappers such as IntWritable).

map (k1, v1) -> list(k2, v2)

reduce (k2, list(v2)) -> list(v2)
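
Here is what these signatures look like for word count, sketched in Scala rather than the paper's C++ (function and variable names are mine):

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the document.
    def map(docName: String, contents: String): List[(String, Int)] =
      contents.split("\\s+").toList.filter(_.nonEmpty).map(word => (word, 1))

    // reduce(k2, list(v2)) -> list(v2): sum the counts collected for one word.
    def reduce(word: String, counts: List[Int]): List[Int] =
      List(counts.sum)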

The underlying framework automatically partitions the input data into M pieces (usually 16MB to 64MB per piece, controllable by the user program). Each piece is passed to a worker from a worker pool (usually much smaller than the number of partitions M, to allow more flexible scheduling), and the selected worker runs the map function on that partition (usually the data lives on the same machine that performs the map operation, to minimize network traffic). The map function generates the list(k2, v2) pairs, which are periodically stored to local disk, with their locations reported to the reducers through the master node (the node controlling all scheduling and coordination of tasks). When a reducer is notified, it uses RPC to read the map output, sorts and groups the values by intermediate key into (k2, list(v2)) pairs, and then passes each of them to the user's reduce function, whose output is appended to a final output file for that reduce partition. There are R reduce tasks specified by the user, and each generates one output file.
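
The split of intermediate keys across the R reduce tasks is done by a partitioning function, hash(key) mod R by default according to the paper; a one-line Scala sketch:

    // Default partitioning: every occurrence of a key lands on the same one of the R reduce tasks.
    def partitionFor(key: String, r: Int): Int = Math.floorMod(key.hashCode, r)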

MapReduce provides fault tolerance through health-check pings: if no response is received from a worker within a certain amount of time, the master marks the worker's tasks as idle so that they can be rescheduled on other workers. Even completed map tasks have to be re-executed if their worker fails, because their results are stored on local disk. For master node failure there is currently no fault tolerance.

Another optimization in MapReduce is the use of backup tasks to reduce the "straggler" effect. If there are some straggler machines in the cluster, they tend to create a long tail in overall execution time. MapReduce solves this by having another worker process the same task near the end of the job; whichever copy finishes first marks the task as completed.

Other features provided by MapReduce include the ability to skip bad records on subsequent attempts; a small internal HTTP server displaying cluster metrics so people know what is going on; and a distributed counter facility for sanity checks.

The paper also provides some basic performance numbers. On a cluster of 1800 machines with 2GHz Intel Xeon processors, 4GB of memory, and 160GB IDE disks, grepping through 1TB of data takes approximately 150 seconds, and sorting 1TB of data takes approximately 891 seconds.

The paper also mentions related work such as MPI and Bulk Synchronous Programming; the key difference is that MapReduce exploits a restricted programming model and can therefore provide parallelization and the other benefits automatically. MapReduce also relies on Google's in-house cluster management system, which is similar to other systems such as Condor.
