A General Framework for Workload Evaluation
Kristal Sauer, Charles Reiss, Alexandra Constantin, Archana Sulochana Ganapathi, Armando Fox, David A. Patterson and Michael Jordan
The aim of this project is to employ statistical machine learning to create a framework for workload characterization that can be used to simulate an application. Simulators are of interest for answering questions such as: how would the application behave if the hardware changed (e.g., if the memory were doubled)? Or, what would happen if the application were loaded at a much larger scale, or with a workload following a different distribution? A key feature of this framework is that we abstract away from the original data, which is important since it enables us to protect any user-sensitive information in the traces, thus making it easier for industry to share data with researchers in academia.
Once we have an abstracted form of the data, we will build a Hidden Markov Model (HMM) over it, obtaining a set of states (or, in workload terms, equivalence classes), a qualitative description of each equivalence class (e.g., high CPU utilization), and a transition matrix characterizing the distribution over states. This representation carries no information about the original data; it is not even necessary to know what type of system the data was derived from. Sampling from the HMM gives us a sequence of states, which corresponds to a sequence of request types. The resulting distribution would be realistic since it is derived from real data; this overcomes a major weakness of much systems research, namely that synthetic workloads follow artificial distributions.
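To make the sampling step concrete, the following is a minimal sketch of drawing a state sequence from a learned transition matrix. The state names, initial distribution, and transition probabilities here are hypothetical placeholders; in the actual framework they would be estimated from the abstracted traces.

```python
import random

# Hypothetical equivalence classes (in practice, learned from traces).
STATES = ["low_cpu", "high_cpu", "io_heavy"]

# Hypothetical initial distribution and transition matrix; the real
# framework would estimate these during HMM training.
INITIAL = [0.6, 0.3, 0.1]
TRANSITION = [
    [0.7, 0.2, 0.1],  # from low_cpu
    [0.3, 0.5, 0.2],  # from high_cpu
    [0.2, 0.3, 0.5],  # from io_heavy
]

def sample_state_sequence(length, rng=random):
    """Sample a sequence of equivalence-class labels from the chain."""
    seq = []
    state = rng.choices(range(len(STATES)), weights=INITIAL)[0]
    for _ in range(length):
        seq.append(STATES[state])
        state = rng.choices(range(len(STATES)), weights=TRANSITION[state])[0]
    return seq

if __name__ == "__main__":
    random.seed(0)
    print(sample_state_sequence(10))
```

Each sampled label stands in for one request type, so a long enough sequence reproduces the learned distribution over request types without exposing any of the original trace data.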
We would feed the state sequence into a ghost application that consumes resources according to the qualitative equivalence-class descriptions. The key phase of the project would be validating that this ghost application is indeed equivalent to the original application with respect to resource consumption; this would be a convincing way to show that we have produced a realistic simulator. Then, we can use our simulator to make predictions about the application's behavior under conditions different from those observed. A subproblem is ensuring that the set of equivalence classes obtained during the modeling phase best explains the data. A key difficulty with using HMMs to identify a state set from a sequence of observations is that the number of states must be specified a priori. In our case, however, we would like a framework in which this need not be the case. Therefore, we will explore the use of Hierarchical Dirichlet Processes (HDPs) for identifying a state set of appropriate cardinality.
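A minimal sketch of the ghost-application idea follows. The mapping from state labels to resource-consumption behaviors is hypothetical; the real mapping would come from the qualitative descriptions produced in the modeling phase.

```python
import time

def burn_cpu(iters):
    """Busy-loop that consumes CPU; the work itself is meaningless."""
    acc = 0
    for i in range(iters):
        acc += i * i
    return acc

def ghost_step(state):
    """Consume resources according to a state's qualitative description.
    The state names and intensities here are illustrative placeholders."""
    if state == "high_cpu":
        burn_cpu(200_000)
    elif state == "low_cpu":
        burn_cpu(20_000)
    elif state == "io_heavy":
        time.sleep(0.01)  # stand-in for disk or network activity

def run_ghost(state_sequence):
    """Replay a sampled state sequence as resource consumption."""
    for state in state_sequence:
        ghost_step(state)

run_ghost(["low_cpu", "high_cpu", "io_heavy"])
```

Validation would then compare resource-usage traces of this ghost application against the original application under matched state sequences.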