Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

PERCU: A Holistic Method for Evaluating High Performance Computing Systems

William TC Kramer

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2008-143
November 5, 2008

http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-143.pdf

PERCU is a comprehensive evaluation methodology for large-scale systems that expands Performance analysis to include Effective work dispatching, Reliability, Consistency, and Usability. The PERCU approach and its components can be used for initial system assessment as well as for ongoing quality assurance of High Performance Computing (HPC) and other systems. PERCU leverages work that has to be done in traditional benchmarking and acquisition approaches, compositing existing data to gain additional insights.

A key contribution is the Sustained System Performance (SSP) concept, which uses time-to-solution to assess the productive work potential of systems for an arbitrary set of applications. The SSP provides a fair way to compare systems deployed at different times and a method to assess sustained price performance in a comprehensive manner. This work also discusses the Effective System Performance (ESP) test, developed to encourage and assess improved job launching and resource management, both important aspects of a productive HPC system. Reliability is the third characteristic of a productive system. This work explores the major causes of failure for very large systems and suggests improved methods for a priori assessment of the reliability of HPC systems. Consistent execution of programs is often overlooked in assessments but is a key service quality feature. This work shows how lack of consistency impacts quality of service and defines approaches for assessing and improving consistency. Usability is discussed for completeness and as future work.

PERCU can be used in whole or in part, at any level of detail and effort. At its simplest, it is a framework for holistic evaluation. In its detail, it introduces a set of methods for measuring key parameters that impact quality of service on HPC systems. The use and impact of each PERCU element are documented for multiple systems, mostly systems evaluated at the National Energy Research Scientific Computing (NERSC) Facility.

Advisors: David E. Culler and James Demmel


BibTeX citation:

@phdthesis{Kramer:EECS-2008-143,
    Author = {Kramer, William TC},
    Title = {PERCU: A Holistic Method for Evaluating High Performance Computing Systems},
    School = {EECS Department, University of California, Berkeley},
    Year = {2008},
    Month = {Nov},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-143.html},
    Number = {UCB/EECS-2008-143},
    Abstract = {PERCU is a comprehensive evaluation methodology for large-scale systems that expands Performance analysis to include Effective work dispatching, Reliability, Consistency, and Usability. The PERCU approach and its components can be used for initial system assessment as well as for ongoing quality assurance of High Performance Computing (HPC) and other systems. PERCU leverages work that has to be done in traditional benchmarking and acquisition approaches, compositing existing data to gain additional insights.

A key contribution is the Sustained System Performance (SSP) concept, which uses time-to-solution to assess the productive work potential of systems for an arbitrary set of applications. The SSP provides a fair way to compare systems deployed at different times and a method to assess sustained price performance in a comprehensive manner. This work also discusses the Effective System Performance (ESP) test, developed to encourage and assess improved job launching and resource management, both important aspects of a productive HPC system. Reliability is the third characteristic of a productive system. This work explores the major causes of failure for very large systems and suggests improved methods for a priori assessment of the reliability of HPC systems. Consistent execution of programs is often overlooked in assessments but is a key service quality feature. This work shows how lack of consistency impacts quality of service and defines approaches for assessing and improving consistency. Usability is discussed for completeness and as future work.

PERCU can be used in whole or in part, at any level of detail and effort. At its simplest, it is a framework for holistic evaluation. In its detail, it introduces a set of methods for measuring key parameters that impact quality of service on HPC systems. The use and impact of each PERCU element are documented for multiple systems, mostly systems evaluated at the National Energy Research Scientific Computing (NERSC) Facility.}
}

EndNote citation:

%0 Thesis
%A Kramer, William TC
%T PERCU: A Holistic Method for Evaluating High Performance Computing Systems
%I EECS Department, University of California, Berkeley
%D 2008
%8 November 5
%@ UCB/EECS-2008-143
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-143.html
%F Kramer:EECS-2008-143