BAFL: Bottleneck Analysis of Fine-Grain Parallelism

Brian Fields, Han-yin Chen, Mark Hill1, and Mary Vernon2
(Professor Rastislav Bodik)
IBM Faculty Partnership Award, Intel, Miscrosoft, (NSF) EIA-0103670, (NSF) EIA-9971256, (NSF) CDA-9623632, (NSF) CCR-0105721, (NSF) CAREER CCR-0093275, NSF Graduate Research Fellowship, and Sun

It used to be that understanding microprocessors was easy. Testing sufficed to verify their correctness, and linear formulas accurately explained their performance. Today, processors baffle their own creators. The reason is Moore's Law: to obtain speedup, processor designers turn the growing supply of transistors into growing parallelism, at a growing number of levels (e.g., out-of-order execution, pipelined memory hierarchy, multi-threading). While the effects of parallelism on verification have already been recognized (e.g., via model checking), the problem of performance complexity has been attacked only with ad hoc methods.

The overall goal of the BAFL project is to develop a robust foundation for guiding micro-architectural innovations as transistor counts surpass one billion. Specifically, we are developing methods for finding and eliminating bottlenecks--program instructions and processor resources responsible for lost performance and wasted power consumption. This task is complex because the more parallel the machine, the harder it is to tell which execution events (e.g., cache misses, ALU operations, message transactions) constrained the execution, and which had their execution times tolerated.

Our framework is built on dependence-graph analysis of a program's execution, implementable entirely in hardware. The framework allows a qualitatively new way of thinking about performance. For example, by representing micro-execution events and dependences as a suitable dependence graph, its critical path automatically determines which processor stage (e.g., fetch, execute, or commit) is a bottleneck, and also for which dynamic instructions.

Our solutions attack performance understanding problems for which no (or no efficient) methods existed. These problems span the entire processor life cycle, for example:

Processor policies: a processor capable of discerning which instructions (would) hurt performance can schedule instructions and allocate resources to avoid stalls, thus increasing its raw performance;

Power consumption: a processor capable of analyzing which of its resources are not bottlenecks at a given moment can reconfigure itself, scaling down power-hungry units;

Feedback-directed optimizations: performance monitoring hardware aware of parallelism is able to determine the actual contribution of cache misses and other "bad" events to the execution time, enabling accurate performance-tuning tools and machine-aware compiler optimizations; and

Balanced processor design: the ability to measure the contribution of any resources will help design and size the resources so that none is excessively slower than the other, reducing the human design cost by avoiding the design-space search used today.

1Professor, University of Wisconsin-Madison
2Professor, University of Wisconsin-Madison

More information ( or

Send mail to the author : (

Edit this abstract