DCR: Replay-Debugging for the Datacenter
Gautam Altekar and Ion Stoica
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2010-33
March 21, 2010
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-33.pdf
We’ve built a tool for debugging non-deterministic failures in production datacenter applications. Our system, called DCR, is the first to efficiently record and replay large-scale, distributed, and data-intensive systems such as HDFS/GFS, HBase/Bigtable, and Hadoop/MapReduce. The enabling idea behind DCR is that debugging doesn’t require a precise replica of the original datacenter run. Instead, it suffices to produce some run that exhibits the original control-plane behavior. This report details the design and implementation of DCR and provides preliminary results.
BibTeX citation:
@techreport{Altekar:EECS-2010-33,
Author = {Altekar, Gautam and Stoica, Ion},
Title = {DCR: Replay-Debugging for the Datacenter},
Institution = {EECS Department, University of California, Berkeley},
Year = {2010},
Month = {Mar},
URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-33.html},
Number = {UCB/EECS-2010-33},
Abstract = {We’ve built a tool for debugging non-deterministic failures
in production datacenter applications. Our system,
called DCR, is the first to efficiently record and replay
large-scale, distributed, and data-intensive systems such
as HDFS/GFS, HBase/Bigtable, and Hadoop/MapReduce.
The enabling idea behind DCR is that debugging
doesn’t require a precise replica of the original datacenter
run. Instead, it suffices to produce some run that exhibits
the original control-plane behavior. This report details
the design and implementation of DCR and provides preliminary
results.}
}
EndNote citation:
%0 Report
%A Altekar, Gautam
%A Stoica, Ion
%T DCR: Replay-Debugging for the Datacenter
%I EECS Department, University of California, Berkeley
%D 2010
%8 March 21
%@ UCB/EECS-2010-33
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-33.html
%F Altekar:EECS-2010-33
