It used to be that understanding microprocessors was easy. Testing sufficed to verify their correctness, and linear formulas accurately explained their performance. Today, processors baffle their own creators. The reason is Moore's Law: to obtain speedup, processor designers turn the growing supply of transistors into growing parallelism, at a growing number of levels (e.g., out-of-order execution, pipelined memory hierarchy, multi-threading). While the effects of parallelism on verification have already been recognized (e.g., via model checking), the problem of performance complexity has been attacked only with ad hoc methods.
The overall goal of the BAFL project is to develop a robust foundation for guiding micro-architectural innovations as transistor counts surpass one billion. Specifically, we are developing methods for finding and eliminating bottlenecks--program instructions and processor resources responsible for lost performance and wasted power consumption. This task is complex because the more parallel the machine, the harder it is to tell which execution events (e.g., cache misses, ALU operations, message transactions) constrained the execution, and which had their execution times tolerated.
Our framework is built on dependence-graph analysis of a program's execution, implementable entirely in hardware. The framework allows a qualitatively new way of thinking about performance. For example, by representing micro-execution events and dependences as a suitable dependence graph, its critical path automatically determines which processor stage (e.g., fetch, execute, or commit) is a bottleneck, and also for which dynamic instructions.
Our solutions attack performance understanding problems for which no (or no efficient) methods existed. These problems span the entire processor life cycle, for example:
Processor policies: a processor capable of discerning which instructions (would) hurt performance can schedule instructions and allocate resources to avoid stalls, thus increasing its raw performance;
Power consumption: a processor capable of analyzing which of its resources are not bottlenecks at a given moment can reconfigure itself, scaling down power-hungry units;
Feedback-directed optimizations: performance monitoring hardware aware of parallelism is able to determine the actual contribution of cache misses and other "bad" events to the execution time, enabling accurate performance-tuning tools and machine-aware compiler optimizations; and
Balanced processor design: the ability to measure the contribution of any resources will help design and size the resources so that none is excessively slower than the other, reducing the human design cost by avoiding the design-space search used today.
1Professor, University of Wisconsin-Madison
2Professor, University of Wisconsin-Madison
Microprocessors, both general purpose and application-specific, are inherently limited by a small amount of instruction-level parallelism (due to branches, precise interrupts, etc.) and increase in performance primarily through scaling of the core clock frequency. Although partially mitigated by improved process technologies, each increase in performance comes with a corresponding increase in power consumption.
This research will investigate hardware-based general purpose computing without any use of the von Neumann software architecture. The Berkeley href="http://bwrc.eecs.berkeley.edu">BWRC, will be used to develop the methodologies for computing purely with hardware and investigate how to maximally benefit from parallelism in the absence of a sequential instruction stream. Emphasis will be placed on translating the well-understood semantics of sequential programming (such as function libraries, processes/threads, and address spaces) into their corresponding constructs in hardware, as well as evaluating flexible and scalable computing fabrics (such as ALU meshes and neural networks) which map well into the incremental configuration features of FPGAs.
The speed gap between the memory and processor core grows at an exponential rate due to the difference between their time constants for their exponential speedup. Therefore, memory is becoming increasingly more important in a system. The slower speedup rate for memory is fundamentally limited due to the communication required over the entire memory space. So, architectural advancement in memory is essential for filling this gap and for continuing the computer speedup. The first step in achieving this is to construct a formal model that can describe the existing memory architectures and to build an environment based on this model that can allow the architect to easily explore the interesting portion of the tradeoff spaces. Once the formal memory model and environment are developed, the memory design space can then be formally and hopefully automatically explored by leveraging the semantics behind the model. Determining the formal memory model and constructing a memory design exploration framework is the focus of my research.
In order to formalize the memory architecture, we must address the following issues. (1) How to model the memory structure uniformly across the levels of hierarchy. This includes the organization detail of the memory. (2) How to model the interactions across the memory hierarchy. This includes how read and write is done across the hierarchy and how pre-fetch behaves as a function of access history. (3) How to model the consistency issues across the hierarchy and multiprocessor memories. Currently within our memory model, these issues are models with tables and accessors that communicate by function call semantics. The consistency methods are modeled by arbitrators that orders concurrent incoming function calls. A framework is built on top of the model by providing a library of simple parameterized memory elements. This allows the designer to easily build a complex memory system by stacking together and configuring the simple elements. Simulation is available through the Ptolemy II kernel. A much faster simulation engine is in development by connecting to the Mescal simulator in the PE domain.
A formal memory exploration framework requires the following: a formal model of the specification, a formal description of the design space, a formal description of the cost functions, and a formal list of design space constraints. The formal memory model provides a partial language to describe these. Once these are fully determined, the exploration can be formulized as an optimization problem, where the goal is to find the best design according the cost function that in the valid design space and satisfies the specification. Since the design process is resource constrained, for example, time, heuristics may be required to maximize the chance to find the optimal design. Some examples of these heuristics are genetics algorithm and gradient descend. We will study when and which heuristics are required and how effective they are.
The final goal in the research is to use the above system to explore interesting memory architecture design space. It is hoped with the formality and automated method, a better memory design can be found in faster time.
The slow mechanical nature of I/O devices, specifically disks, compared to the speed of electronic processing has long been recognized. In order to keep the processor supplied with data, systems rely on aggressive I/O optimization techniques that can be tuned to specific workloads. But as the improvement in processor performance continues to far exceed the improvement in disk access time, the I/O bottleneck is increasingly an issue. We now resort more and more to expensive measures for increasing I/O performance such as configuring systems with large amounts of memory as the I/O cache or using many more disks than storage requirements warrant. As systems continue to grow in complexity over and beyond our ability to cost-effectively manage them, what is really needed is a storage system that delivers good performance without requiring a lot of resources and time to configure and tune, even as the workloads evolve.
In this research, we explore mechanizable techniques for improving I/O performance by dynamically optimizing disk block layout in response to the actual usage pattern. Our techniques are based on the observation that physically, it is much more efficient to read a large contiguous chunk of data than many small chunks scattered throughout the disk. Users, however, typically have only limited knowledge and control of how their data are laid out on the disk, and most would rather not be thus burdened. The file system or application can guess how blocks are likely to be used based on static logical information such as name space relationships, but such information may not accurately reflect the actual dynamic usage pattern. On the other hand, technology trends are such that disk space and processing power are increasingly available for performing sophisticated optimizations without user involvement.
Therefore, we propose automatic locality-improving storage (ALIS), a storage system that automatically replicates and reorganizes selected disk blocks to increase the spatial locality of reference and allow it to effectively fetch data in larger chunks. Using extensive trace-driven simulations, we demonstrate that ALIS can dramatically improve performance despite efforts already made by the database and file system to optimize the layout of data. Our results show that ALIS is able to far outperform prior reorganization techniques in both an environment where the storage system consists of only disks and low-end disk adaptors, and one where there is a large outboard controller.
We are investigating automatic compilation for a hybrid processing architecture based on the SCORE compute model [1]. SCORE targets scalable reconfigurable hardware, enabling applications to automatically exploit the additional hardware of next-generation devices, without recompilation. SCORE implements a paged virtualization model, where a spatial computation is partitioned into compute pages (analogous to virtual memory pages) to be automatically loaded into available physical pages by a run-time scheduler. We have developed a parallel programming model for SCORE that supports arbitrarily large spatial computations, and we are presently developing a compiler to map this programming model to a paged hardware execution model.
We have developed an intermediate language, called TDF, for describing SCORE computations as dataflow graphs of communicating finite state machines (FSMs). The graphical form provides an explicit specification of task-level parallelism and stream-based communication. The programming model is equivalent, in power, to dynamic dataflow models such as BDF [2]. We have developed a SCORE compiler, consisting presently of a TDF front-end, limited compiler optimizations, and a back-end that emits executable C++ simulation code. We are presently extending the compiler to include more sophisticated optimizations (to exploit instruction-level parallelism) and automatic page generation.
The goal of automatic page generation is to partition and/or cluster TDF FSM operators into fixed-size hardware pages under area and I/O constraints, and in a manner that is robust to hardware virtualization. Under virtualization, a pair of communicating pages may not necessarily execute simultaneously, requiring stream communication to be buffered in memory. Partitioning for efficient execution under virtualization involves a tradeoff between several desirable properties: (1) minimized FSM latency, (2) minimized stream IO bandwidth, and (3) avoidance of inter-page feedback loops. Our approach involves first hoisting control-independent datapath operations out of FSMs, where possible, and partitioning/clustering the resulting pipelines using known techniques for synchronous circuits. We are presently exploring techniques for partitioning the remaining control-intensive FSMs and their associated datapaths.
Fixed frequency FPGAs, where any valid, mapped design runs at the intrinsic clock rate of the FPGA, offer two key advantages over conventional FPGAs: vastly increased computational throughput and significantly simpler interfacing to other logic blocks such as microprocessors. Yet previous attempts have not been successful because of either a lack of a general switched interconnect or an inability to efficiently map datapaths.
We are investigating structures which allow for efficient, pipelined FPGAs. Our primary focus is on a different form of Manhattan interconnect, a "corner turning" interconnect. This interconnect structure can be efficiently pipelined for high throughput operation and has fast polynomial-time routing algorithms.
We have defined a fixed-frequency architecture based on the Xilinx Virtex FPGA and completed a toolflow which accepts placed designs produced by the Xilinx toolflow. We have developed both a router and automatic retimer and are currently constructing the layout for a routing channel in order to estimate performance and area costs for this architecture.