Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

   

2008 Research Summary

Applying Datapath Tracing to Network Service Failure Detection

George Porter, Saul Edwards and Randy H. Katz

In large campus and enterprise networks, applications are becoming increasingly complex, relying more and more on the functionality provided by component building blocks distributed across the network. While leading to more powerful applications, this shift poses a problem for system administrators. Factors such as network link failure and packet loss, middlebox interference, and the co-location of different applications on the same network lead to unexpected and difficult-to-diagnose problems. The transient nature of some of these faults means that by the time diagnostic action is taken, the cause of the problem might no longer be present. Detecting and diagnosing these faults can be a time-intensive task, often requiring different ad-hoc approaches.

To address this problem, we propose a diagnostic tool that identifies correctness and performance faults in distributed applications by analyzing collected end-to-end traces of application datapaths. A datapath trace is a record of the set of components and hosts that a distributed network service transited, as well as application, systems, and other log data associated with components along that path. Without seeing the end-to-end datapath in context and without knowing which components a user's operation transited, it is difficult to detect poor application performance or operation failure. Our approach consists of instrumenting the components of the distributed system with the X-Trace tracing framework (http://www.x-trace.net), which can capture and store the end-to-end datapath across different components, different network layers, and different protocols. We then analyze these traces offline to detect the occurrence of faults. We attempt to localize the faults to a subset of the machines affected and identify their underlying root causes. Given this information, the network operator can begin to proactively fix the problem, rather than waiting for end users to notify them of the fault.
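As a rough illustration of the offline analysis step, the following Python sketch groups trace events by operation, reconstructs each datapath, and flags paths that stop early or take too long. It is not the X-Trace API or the project's actual analysis: the Event record, its field names, the expected_last and max_latency parameters, and the restriction to linear (non-branching) paths are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical trace record; the fields are assumptions, not X-Trace's format."""
    task_id: str               # one end-to-end user operation
    op_id: str                 # this event within the operation
    parent_id: Optional[str]   # op_id of the preceding event, None at the root
    host: str
    component: str
    timestamp: float           # seconds


def group_by_task(events):
    """Collect the events that belong to each end-to-end operation."""
    tasks = defaultdict(list)
    for event in events:
        tasks[event.task_id].append(event)
    return tasks


def datapath(task_events):
    """Order events by following parent links from the root (linear paths only)."""
    child_of = {e.parent_id: e for e in task_events}
    path, current = [], child_of.get(None)
    # The length bound guards against malformed traces that contain a cycle.
    while current is not None and len(path) < len(task_events):
        path.append(current)
        current = child_of.get(current.op_id)
    return path


def find_faults(events, expected_last, max_latency):
    """Return {task_id: (fault_kind, suspected_host)} for faulty operations."""
    faults = {}
    for task_id, task_events in group_by_task(events).items():
        path = datapath(task_events)
        if not path:
            faults[task_id] = ("no_root", None)
        elif path[-1].component != expected_last:
            # The operation never reached its final component.
            faults[task_id] = ("incomplete", path[-1].host)
        elif len(path) > 1 and path[-1].timestamp - path[0].timestamp > max_latency:
            # Attribute the delay to the host at the end of the slowest hop.
            hops = zip(path, path[1:])
            slowest = max(hops, key=lambda hop: hop[1].timestamp - hop[0].timestamp)
            faults[task_id] = ("slow", slowest[1].host)
    return faults


def suspect_hosts(faults):
    """Count how often each host is implicated, as a crude localization step."""
    return Counter(host for _, host in faults.values() if host is not None)
```

Counting implicated hosts across many faulty operations is only a crude stand-in for localization; a real tool would also need to handle branching datapaths and clock skew between hosts.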