Optimizing Parallel Job Performance in Data-Intensive Clusters

Ganesh Ananthanarayanan

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2014-9
January 26, 2014

http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-9.pdf

Extensive data analysis has become the enabler for diagnostics and decision making in many modern systems. These analyses have both competitive and social benefits. To cope with the deluge in data that is growing faster than Moore's law, computation frameworks have resorted to massive parallelization of analytics jobs into many fine-grained tasks. These frameworks promised to provide efficient and fault-tolerant execution of these tasks. However, meeting this promise in clusters spanning hundreds of thousands of machines is challenging and a key departure from earlier work on parallel computing.

A simple but key aspect of parallel jobs is the all-or-nothing property: unless all tasks of a job are provided equal improvement, there is no speedup in the completion of the job. The all-or-nothing property is critical for the promise of efficient and fault-tolerant parallel computations on large clusters. Meeting this promise in clusters of these scales is challenging and a key departure from prior work on distributed systems. This talk will look at the execution of a job from first principles and propose techniques spanning the software stack of data analytics systems such that its tasks achieve homogeneous performance while overcoming the various heterogeneities.
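The all-or-nothing property can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the report: every task of a job runs in parallel in a single wave, so the job finishes when its slowest task finishes. The function names and the task durations below are hypothetical and chosen only for illustration.

```python
def job_completion_time(task_durations):
    """Completion time of a single-wave parallel job (simplifying assumption):
    the job is done only when its slowest task is done."""
    return max(task_durations)


def speed_up(task_durations, indices, factor):
    """Return a copy in which the tasks at `indices` run `factor` times faster."""
    return [d / factor if i in indices else d
            for i, d in enumerate(task_durations)]


# Hypothetical job with four equally long tasks (arbitrary time units).
durations = [10.0, 10.0, 10.0, 10.0]

baseline_time = job_completion_time(durations)

# Speed up three of the four tasks: the remaining slow task still gates the job.
partial_time = job_completion_time(speed_up(durations, {0, 1, 2}, 2.0))

# Speed up all four tasks equally: only now does the job itself finish sooner.
full_time = job_completion_time(speed_up(durations, {0, 1, 2, 3}, 2.0))

print(baseline_time, partial_time, full_time)
```

Under this model, improving three of the four tasks leaves the completion time unchanged, while improving all four halves it. Real jobs with multiple waves or stages behave less cleanly, so the sketch conveys only the intuition.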

To that end, we will propose techniques for (i) caching and cache replacement for parallel jobs, which outperform even Belady's MIN (which uses an oracle), (ii) data locality, and (iii) straggler mitigation. Our analyses and evaluation are performed using workloads from Facebook and Bing production datacenters. Along the way, we will also describe how we broke the myth of disk-locality's importance in datacenter computing.

Advisor: Ion Stoica


BibTeX citation:

@phdthesis{Ananthanarayanan:EECS-2014-9,
    Author = {Ananthanarayanan, Ganesh},
    Title = {Optimizing Parallel Job Performance in Data-Intensive Clusters},
    School = {EECS Department, University of California, Berkeley},
    Year = {2014},
    Month = {Jan},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-9.html},
    Number = {UCB/EECS-2014-9},
    Abstract = {Extensive data analysis has become the enabler for diagnostics and decision making in many modern systems. These analyses have both competitive and social benefits. To cope with the deluge in data that is growing faster than Moore's law, computation frameworks have resorted to massive parallelization of analytics jobs into many fine-grained tasks. These frameworks promised to provide efficient and fault-tolerant execution of these tasks. However, meeting this promise in clusters spanning hundreds of thousands of machines is challenging and a key departure from earlier work on parallel computing.

A simple but key aspect of parallel jobs is the all-or-nothing property: unless all tasks of a job are provided equal improvement, there is no speedup in the completion of the job. The all-or-nothing property is critical for the promise of efficient and fault-tolerant parallel computations on large clusters. Meeting this promise in clusters of these scales is challenging and a key departure from prior work on distributed systems. This talk will look at the execution of a job from first principles and propose techniques spanning the software stack of data analytics systems such that its tasks achieve homogeneous performance while overcoming the various heterogeneities.

To that end, we will propose techniques for (i) caching and cache replacement for parallel jobs, which outperform even Belady's MIN (which uses an oracle), (ii) data locality, and (iii) straggler mitigation. Our analyses and evaluation are performed using workloads from Facebook and Bing production datacenters. Along the way, we will also describe how we broke the myth of disk-locality's importance in datacenter computing.}
}

EndNote citation:

%0 Thesis
%A Ananthanarayanan, Ganesh
%T Optimizing Parallel Job Performance in Data-Intensive Clusters
%I EECS Department, University of California, Berkeley
%D 2014
%8 January 26
%@ UCB/EECS-2014-9
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-9.html
%F Ananthanarayanan:EECS-2014-9