Predicting performance problems in data centers

Moises Goldszmidt
Microsoft Research

Abstract

LUMOS is an approach for characterizing and predicting performance problems in data centers using statistical machine learning techniques. LUMOS builds signatures of performance problems and uses them to build prediction models. It continuously updates these models as it receives new data and feedback on its own performance. Our evaluation plan simulates the deployment of a tool based on this approach: every 15 minutes it decides whether to alert the operator to a performance problem expected within the next hour, and it updates its models sequentially as new data arrives. In our results so far, on 219 days drawn from two years of operational data from a real data center, LUMOS successfully predicts between 41 and 50 of the 64 performance crises in that period. The number of false alarms increases with the detection rate but stays below one false alarm per day.
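
The evaluation protocol described in the abstract can be sketched roughly as follows. This is only an illustration, not the LUMOS implementation: the names replay, evaluate_alerts, and false_alarms_per_day are hypothetical, as are the predict and update callbacks standing in for whatever model is used. The scoring rule (an alert counts as correct if a crisis starts within the next four 15-minute epochs) is an assumed reading of "within the next hour".

```python
EPOCHS_PER_DAY = 96   # 24 hours of 15-minute decision epochs
HORIZON_EPOCHS = 4    # "within the next hour" = 4 epochs (assumed reading)


def replay(epochs, predict, update):
    """Replay historical data one epoch at a time.

    epochs:  iterable of (features, crisis_now) pairs, in time order.
    predict: hypothetical callback, features -> truthy if an alert should fire.
    update:  hypothetical callback that feeds the observed outcome back to the model.
    """
    alerts = []
    for features, crisis_now in epochs:
        alerts.append(bool(predict(features)))  # decide before seeing the outcome
        update(features, crisis_now)            # then update the model sequentially
    return alerts


def evaluate_alerts(alerts, crisis_epochs, horizon=HORIZON_EPOCHS):
    """Return (number of crises predicted, number of false alarms).

    An alert at epoch t is credited to every crisis starting in (t, t + horizon];
    an alert with no crisis in that window counts as one false alarm.
    """
    crisis_epochs = set(crisis_epochs)
    detected = set()
    false_alarms = 0
    for t, alert in enumerate(alerts):
        if not alert:
            continue
        hits = {c for c in crisis_epochs if t < c <= t + horizon}
        if hits:
            detected |= hits
        else:
            false_alarms += 1
    return len(detected), false_alarms


def false_alarms_per_day(false_alarms, n_epochs):
    """Normalize a false-alarm count by the length of the replayed period."""
    return false_alarms / (n_epochs / EPOCHS_PER_DAY)
```

Under this scoring, the figures quoted above (41 to 50 of 64 crises) correspond to a detection rate of roughly 64% to 78%.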

Joint work with Peter Bodik and Armando Fox of UC Berkeley.

Moises Goldszmidt is a principal researcher with Microsoft Research. His research interests include probabilistic reasoning, graphical models, statistical machine learning, and systems. Prior to Microsoft he held similar positions with Hewlett-Packard Labs, SRI International, and Rockwell Science Center, and was a principal scientist with Peakstone Corporation, a start-up. Dr. Goldszmidt received his PhD in computer science from the University of California, Los Angeles, in 1992. For lists of papers and patents, see http://research.microsoft.com/users/moises/.