Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Shark: SQL and Rich Analytics at Scale

Reynold Shi Xin, Joshua Rosen, Matei Zaharia, Michael Franklin, Scott Shenker and Ion Stoica

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2012-214
November 26, 2012

http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.


BibTeX citation:

@techreport{Xin:EECS-2012-214,
    Author = {Xin, Reynold Shi and Rosen, Joshua and Zaharia, Matei and Franklin, Michael and Shenker, Scott and Stoica, Ion},
    Title = {Shark: SQL and Rich Analytics at Scale},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2012},
    Month = {Nov},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.html},
    Number = {UCB/EECS-2012-214},
    Abstract = {Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.}
}

EndNote citation:

%0 Report
%A Xin, Reynold Shi
%A Rosen, Joshua
%A Zaharia, Matei
%A Franklin, Michael
%A Shenker, Scott
%A Stoica, Ion
%T Shark: SQL and Rich Analytics at Scale
%I EECS Department, University of California, Berkeley
%D 2012
%8 November 26
%@ UCB/EECS-2012-214
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.html
%F Xin:EECS-2012-214