Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Replay Debugging for Distributed Applications

Dennis Michael Geels

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2006-163
December 8, 2006

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-163.pdf

Researchers in networks and computer systems have developed exciting new distributed applications in recent years; however, adoption of real-world prototypes has been slow. The development of stable, usable services has been hindered by the tremendous effort required to debug distributed applications that are deployed across the Internet. We believe that more powerful debugging tools are needed to address this problem. This dissertation presents the progress we have made on this front, in the form of two new tools, Liblog and Friday.

The first, Liblog, is a replay debugging library for libc- and POSIX-based distributed applications. It logs the execution of deployed application processes and replays them deterministically, faithfully reproducing race conditions and non-deterministic failures, enabling careful offline analysis.

To our knowledge, Liblog is the first replay tool to address the requirements of large distributed systems: lightweight support for long-running programs, consistent replay of arbitrary subsets of application nodes, and operation in a mixed environment of logging and non-logging processes. In addition, it runs on generic Linux/x86 computers without special hardware or kernel patches and supports unmodified application executables.

The second tool, Friday, combines the deterministic replay provided by Liblog with the power of symbolic, low-level debugging and a simple language for expressing higher-level distributed conditions and actions. Friday allows the programmer to understand the collective state and dynamics of a distributed collection of coordinated application components, as part of the debugging process.

This dissertation presents the design of Liblog and Friday, an evaluation of the performance overhead that they impose at runtime, and a set of case studies that illustrate the new functionality enabled for real distributed applications.
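
The log-then-replay idea summarized in the abstract can be illustrated with a small sketch. This is not Liblog's implementation or interface: Liblog works at the libc/POSIX level on unmodified executables, whereas the sketch below is a toy in Python, and the names ReplayLog, call, and application are hypothetical and introduced here only for illustration. It assumes that every source of non-determinism (here, the clock and a random draw) is routed through a single wrapper that either records the result or returns the previously recorded one.

```python
import random
import time


class ReplayLog:
    """Records results of non-deterministic calls, or replays them in order.

    Hypothetical helper for illustration only; not part of Liblog.
    """

    def __init__(self, entries=None):
        # No entries means logging mode; a list of entries means replay mode.
        self.replaying = entries is not None
        self.entries = list(entries) if entries is not None else []
        self._cursor = 0

    def call(self, name, fn, *args):
        if self.replaying:
            # Return the recorded value instead of invoking the real function.
            logged_name, value = self.entries[self._cursor]
            self._cursor += 1
            if logged_name != name:
                raise RuntimeError(
                    f"replay diverged: expected {logged_name}, got {name}"
                )
            return value
        # Logging mode: run the real call and remember its result.
        value = fn(*args)
        self.entries.append((name, value))
        return value


def application(log):
    """Toy application whose result depends on the clock and a random draw."""
    started = log.call("time", time.time)
    draw = log.call("randint", random.randint, 1, 1_000_000)
    return started, draw


# Live run: results are non-deterministic and get logged.
live_log = ReplayLog()
live_result = application(live_log)

# Replay run: the same code path sees exactly the logged values.
replay_result = application(ReplayLog(live_log.entries))
```

In this sketch the replayed run returns the same pair of values as the live run, because every non-deterministic input comes from the log rather than from the environment. A real tool additionally has to deal with threads, signals, and network messages, which this toy ignores.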

Advisor: Ion Stoica


BibTeX citation:

@phdthesis{Geels:EECS-2006-163,
    Author = {Geels, Dennis Michael},
    Title = {Replay Debugging for Distributed Applications},
    School = {EECS Department, University of California, Berkeley},
    Year = {2006},
    Month = {Dec},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-163.html},
    Number = {UCB/EECS-2006-163},
    Abstract = {Researchers in networks and computer systems have developed exciting new distributed applications in recent years; however, adoption of real-world prototypes has been slow.  The development of stable, usable services has been hindered by the tremendous effort required to debug distributed applications that are deployed across the Internet. We believe that more powerful debugging tools are needed to address this problem.  This dissertation presents the progress we have made on this front, in the form of two new tools, Liblog and Friday.

The first, Liblog, is a \emph{replay debugging} library for libc- and POSIX-based distributed applications.  It logs the execution of deployed application processes and replays them deterministically, faithfully reproducing race conditions and non-deterministic failures, enabling careful offline analysis.

To our knowledge, Liblog is the first replay tool to address the requirements of large distributed systems: lightweight support for long-running programs, consistent replay of arbitrary subsets of application nodes, and operation in a mixed environment of logging and non-logging processes.  In addition, it runs on generic Linux/x86 computers without special hardware or kernel patches and supports unmodified application executables.

The second tool, Friday, combines the deterministic replay provided by Liblog with the power of symbolic, low-level debugging and a simple language for expressing higher-level distributed conditions and actions. Friday allows the programmer to understand the collective state and dynamics of a distributed collection of coordinated application components, as part of the debugging process.

This dissertation presents the design of Liblog and Friday, an evaluation of the performance overhead that they impose at runtime, and a set of case studies that illustrate the new functionality enabled for real distributed applications.}
}

EndNote citation:

%0 Thesis
%A Geels, Dennis Michael
%T Replay Debugging for Distributed Applications
%I EECS Department, University of California, Berkeley
%D 2006
%8 December 8
%@ UCB/EECS-2006-163
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-163.html
%F Geels:EECS-2006-163