Mining Full Text Information for Large-Scale Software Systems
Wei Xu, Armando Fox and David A. Patterson
Developing and operating large software systems generates a great deal of textual data, including console logs, source code (and its comments), bug reports, documentation, and version-control change logs. These data provide important insight into the system because they are written by system experts, usually the developers. Unfortunately, existing analysis and monitoring tools largely ignore them, mainly because they are highly unstructured. Worse, human operators ignore them as well, because there is too much text and it is tedious to read. In addition, each type of information is written at a different level of detail and abstraction, which makes bug reports from operators hard for developers to reproduce or analyze.
This project aims to build a tool, mainly for operators, that analyzes unstructured textual information and pinpoints the parts that require attention, using information retrieval and text mining techniques. Currently, we focus on combining console logs (printf logs) with source code information. By analyzing the source code, we automatically infer the structure of the console logs and can thus extract machine-friendly data from textual logs. We also analyze the source code execution paths revealed by the console logs and use statistical methods to identify abnormal paths, which might lead to system failures. Figure 1 below gives an overview of the system and its applications for operators and developers.
Figure 1: System overview and applications
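To make the idea of inferring log structure from source code more concrete, the following is a minimal sketch, not the project's actual implementation. It assumes the printf-style format strings have already been collected from the source code, and it turns each one into a regular expression that recovers the variable fields from a matching log line. The format strings, the names template_to_regex and parse_line, and the restriction to the %d and %s specifiers are all illustrative assumptions.

```python
import re

# Hypothetical format strings, as they might be recovered from printf-style
# calls in source code. They are illustrative and not taken from a real system.
FORMAT_STRINGS = [
    "Starting transfer of block %d to %s",
    "Transfer of block %d failed after %d retries",
]

# Assumption: only %d and %s appear. A real tool would need to handle
# many more conversion specifiers and language-specific logging calls.
SPECIFIER_PATTERNS = {
    "%d": r"(-?\d+)",
    "%s": r"(\S+)",
}


def template_to_regex(fmt):
    """Turn a printf-style format string into a compiled, anchored regex."""
    # Splitting on a capturing group keeps the specifiers in the result.
    parts = re.split(r"(%[ds])", fmt)
    pattern = "".join(
        SPECIFIER_PATTERNS[p] if p in SPECIFIER_PATTERNS else re.escape(p)
        for p in parts
    )
    return re.compile("^" + pattern + "$")


TEMPLATES = [(fmt, template_to_regex(fmt)) for fmt in FORMAT_STRINGS]


def parse_line(line):
    """Return (format string, list of variable values) or None if no match."""
    for fmt, regex in TEMPLATES:
        match = regex.match(line)
        if match:
            return fmt, list(match.groups())
    return None


if __name__ == "__main__":
    for log_line in [
        "Starting transfer of block 42 to node7",
        "Transfer of block 42 failed after 3 retries",
        "an unrelated message",
    ]:
        print(parse_line(log_line))
```

Each parsed line yields a message type (the format string) and its variable values, which is the kind of machine-friendly record that later statistical analysis of execution paths could consume.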