Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Improving MapReduce Performance in Heterogeneous Environments

Matei Zaharia, Andrew Konwinski, Anthony D. Joseph, Randy H. Katz and Ion Stoica

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2008-99
August 19, 2008

http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-99.pdf

MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is both simple and highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in 200-node clusters on EC2.


BibTeX citation:

@techreport{Zaharia:EECS-2008-99,
    Author = {Zaharia, Matei and Konwinski, Andrew and Joseph, Anthony D. and Katz, Randy H. and Stoica, Ion},
    Title = {Improving MapReduce Performance in Heterogeneous Environments},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2008},
    Month = {Aug},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-99.html},
    Number = {UCB/EECS-2008-99},
    Abstract = {MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is both simple and highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in 200-node clusters on EC2.}
}

EndNote citation:

%0 Report
%A Zaharia, Matei
%A Konwinski, Andrew
%A Joseph, Anthony D.
%A Katz, Randy H.
%A Stoica, Ion
%T Improving MapReduce Performance in Heterogeneous Environments
%I EECS Department, University of California, Berkeley
%D 2008
%8 August 19
%@ UCB/EECS-2008-99
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-99.html
%F Zaharia:EECS-2008-99