
Spark: Cluster Computing with Working Sets

Matei Zaharia, N. M. Mosharaf Chowdhury, Michael Franklin, Scott Shenker and Ion Stoica

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2010-53
May 7, 2010

http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.pdf

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on clusters of unreliable machines. However, most of these systems are built around an acyclic data flow programming model that is not suitable for other popular applications. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis environments. We propose a new framework called Spark that supports these applications while maintaining the scalability and fault-tolerance properties of MapReduce. To achieve these goals, Spark introduces a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
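The two ideas the abstract highlights can be illustrated with a small sketch: a read-only, partitioned collection kept in memory and reused across parallel operations, where each partition remembers how to rebuild itself if lost. This is a toy model in plain Python, not Spark's actual API; the class name `ToyRDD` and its methods are invented for illustration.

```python
# Toy sketch (NOT Spark's API) of the working-set idea: cache a partitioned
# dataset in memory once, reuse it across parallel operations, and rebuild
# any lost partition from its lineage (the pure function that produced it).

from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """A read-only, partitioned collection. Each partition can be rebuilt
    from `build_partition` (its lineage) if its cached copy is lost."""

    def __init__(self, build_partition, num_partitions):
        self._build = build_partition          # lineage: index -> list of elements
        self._parts = [None] * num_partitions  # in-memory cache of partitions

    def _partition(self, i):
        if self._parts[i] is None:             # rebuild lazily if missing or lost
            self._parts[i] = self._build(i)
        return self._parts[i]

    def map_reduce(self, f, combine, zero):
        """Apply f to every element in parallel (one task per partition),
        then combine the per-partition results."""
        def work(i):
            acc = zero
            for x in self._partition(i):
                acc = combine(acc, f(x))
            return acc

        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(work, range(len(self._parts))))
        result = zero
        for p in partials:
            result = combine(result, p)
        return result


# A dataset of 0..99 split into 4 partitions, reused across operations.
data = ToyRDD(lambda i: list(range(i * 25, (i + 1) * 25)), num_partitions=4)
total = data.map_reduce(lambda x: x, lambda a, b: a + b, 0)        # 4950
data._parts[2] = None                        # simulate losing one partition
total_again = data.map_reduce(lambda x: x, lambda a, b: a + b, 0)  # rebuilt: 4950
```

The point of the sketch is the contrast with an acyclic data-flow model: a second operation over the same data hits the in-memory cache rather than re-reading the input, and fault tolerance comes from recomputing a lost partition from lineage rather than replicating it.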


BibTeX citation:

@techreport{Zaharia:EECS-2010-53,
    Author = {Zaharia, Matei and Chowdhury, N. M. Mosharaf and Franklin, Michael and Shenker, Scott and Stoica, Ion},
    Title = {Spark: Cluster Computing with Working Sets},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2010},
    Month = {May},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html},
    Number = {UCB/EECS-2010-53},
    Abstract = {MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on clusters of unreliable machines. However, most of these systems are built around an acyclic data flow programming model that is not suitable for other popular applications. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis environments. We propose a new framework called Spark that supports these applications while maintaining the scalability and fault-tolerance properties of MapReduce. To achieve these goals, Spark introduces a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.}
}

EndNote citation:

%0 Report
%A Zaharia, Matei
%A Chowdhury, N. M. Mosharaf
%A Franklin, Michael
%A Shenker, Scott
%A Stoica, Ion
%T Spark: Cluster Computing with Working Sets
%I EECS Department, University of California, Berkeley
%D 2010
%8 May 7
%@ UCB/EECS-2010-53
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html
%F Zaharia:EECS-2010-53