Why Do Internet Services Fail, and What Can Be Done About It?

David Oppenheimer

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-02-1185
May 2002

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2002/CSD-02-1185.pdf

We describe the architecture, operational practices, and failure characteristics of three very large-scale Internet services. Our research on architecture and operational practices took the form of interviews with architects and operations staff at those (and several other) services. Our research on component and service failure took the form of examining the operations problem tracking databases from two of the services and a log of service failure post-mortem reports from the third.

Architecturally, we find convergence on a common structure: division of nodes into service front-ends and back-ends, multiple levels of redundancy and load-balancing, and use of custom-written software for both production services and administrative tools. Operationally, we find a thin line between service developers and operators, and a need to coordinate problem detection and repair across administrative domains. With respect to failures, we find that operator errors are their primary cause, operator error is the most difficult type of failure to mask, service front-ends are responsible for more problems than service back-ends but fewer minutes of unavailability, and that online testing and more thoroughly exposing and detecting component failures could reduce system failure rates for at least one service.


BibTeX citation:

@techreport{Oppenheimer:CSD-02-1185,
    Author = {Oppenheimer, David},
    Title = {Why Do Internet Services Fail, and What Can Be Done About It?},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2002},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2002/5260.html},
    Number = {UCB/CSD-02-1185},
    Abstract = {We describe the architecture, operational practices, and failure characteristics of three very large-scale Internet services. Our research on architecture and operational practices took the form of interviews with architects and operations staff at those (and several other) services. Our research on component and service failure took the form of examining the operations problem tracking databases from two of the services and a log of service failure post-mortem reports from the third. Architecturally, we find convergence on a common structure: division of nodes into service front-ends and back-ends, multiple levels of redundancy and load-balancing, and use of custom-written software for both production services and administrative tools. Operationally, we find a thin line between service developers and operators, and a need to coordinate problem detection and repair across administrative domains. With respect to failures, we find that operator errors are their primary cause, operator error is the most difficult type of failure to mask, service front-ends are responsible for more problems than service back-ends but fewer minutes of unavailability, and that online testing and more thoroughly exposing and detecting component failures could reduce system failure rates for at least one service.}
}

EndNote citation:

%0 Report
%A Oppenheimer, David
%T Why Do Internet Services Fail, and What Can Be Done About It?
%I EECS Department, University of California, Berkeley
%D 2002
%@ UCB/CSD-02-1185
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2002/5260.html
%F Oppenheimer:CSD-02-1185