Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Design Insights for MapReduce from Diverse Production Workloads

Yanpei Chen, Sara Alspaugh and Randy H. Katz

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2012-17
January 25, 2012

http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-17.pdf

In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year’s worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include the following: input data forms up to 77% of all bytes; 90% of jobs access KB- to GB-sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.
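
Two of the metrics named in the abstract, the peak-to-median job submission rate and task-seconds-per-byte, can be illustrated with a minimal Python sketch. The abstract does not give the report's exact definitions, so everything here is an assumption made for illustration: the function names, the one-hour counting window, the inclusion of empty windows in the median, and the treatment of "bytes" as a single caller-supplied total.

```python
from collections import Counter
from statistics import median


def peak_to_median_submission_rate(submit_times_s, window_s=3600):
    """Ratio of the busiest window's job count to the median window count.

    submit_times_s: job submission timestamps in seconds (hypothetical input).
    window_s: counting window in seconds; one hour is an assumption.
    Windows with no submissions between the first and last job count as zero.
    """
    if not submit_times_s:
        raise ValueError("need at least one submission time")
    counts = Counter(int(t // window_s) for t in submit_times_s)
    first, last = min(counts), max(counts)
    per_window = [counts.get(w, 0) for w in range(first, last + 1)]
    med = median(per_window)
    if med == 0:
        # More than half of the windows are empty; the ratio is unbounded.
        return float("inf")
    return max(per_window) / med


def task_seconds_per_byte(task_seconds, bytes_moved):
    """Total task time divided by total bytes moved (assumed definition)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return task_seconds / bytes_moved
```

Under these assumptions, a trace with one job in each of two hours and nine jobs in a third hour has a peak-to-median ratio of 9, which is the lower bound the abstract reports for the production workloads.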


BibTeX citation:

@techreport{Chen:EECS-2012-17,
    Author = {Chen, Yanpei and Alspaugh, Sara and Katz, Randy H.},
    Title = {Design Insights for MapReduce from Diverse Production Workloads},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2012},
    Month = {Jan},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-17.html},
    Number = {UCB/EECS-2012-17},
    Abstract = {In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year’s worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include the following: input data forms up to 77% of all bytes; 90% of jobs access KB- to GB-sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.}
}

EndNote citation:

%0 Report
%A Chen, Yanpei
%A Alspaugh, Sara
%A Katz, Randy H.
%T Design Insights for MapReduce from Diverse Production Workloads
%I EECS Department, University of California, Berkeley
%D 2012
%8 January 25
%@ UCB/EECS-2012-17
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-17.html
%F Chen:EECS-2012-17