We are investigating automatic compilation for a hybrid processing architecture based on the SCORE compute model. SCORE targets scalable reconfigurable hardware, enabling applications to automatically exploit the additional hardware of next-generation devices without recompilation. SCORE implements a paged virtualization model, in which a spatial computation is partitioned into compute pages (analogous to virtual memory pages) that are automatically loaded into available physical pages by a run-time scheduler. We have developed a parallel programming model for SCORE that supports arbitrarily large spatial computations, and we are presently developing a compiler to map this programming model to a paged hardware execution model.
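The paged virtualization idea can be illustrated with a minimal sketch: a device with fewer physical pages than the application has compute pages, and a run-time scheduler that time-multiplexes the virtual pages onto the physical ones. This is not the SCORE scheduler itself; the round-robin policy, function names, and page labels below are all illustrative assumptions.

```python
# Toy model of paged hardware virtualization (illustrative, not the SCORE
# implementation): more virtual compute pages than physical pages, with a
# round-robin scheduler rotating pages onto the device.

from collections import deque

def schedule(compute_pages, num_physical, steps):
    """Load up to num_physical pages, run them for one scheduling step,
    then evict and rotate so every virtual page eventually makes progress."""
    waiting = deque(compute_pages)
    timeline = []
    for _ in range(steps):
        resident = [waiting.popleft()
                    for _ in range(min(num_physical, len(waiting)))]
        timeline.append(list(resident))   # these pages execute this step
        waiting.extend(resident)          # evict; they rejoin the queue
    return timeline

# Four virtual pages on a two-page device: each page runs every other step.
tl = schedule(["A", "B", "C", "D"], num_physical=2, steps=4)
```

The same application maps unchanged onto a larger device (`num_physical=4`), where all pages become resident simultaneously; this is the "no recompilation" property in miniature.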
We have developed an intermediate language, called TDF, for describing SCORE computations as dataflow graphs of communicating finite state machines (FSMs). The graph form provides an explicit specification of task-level parallelism and stream-based communication. The programming model is equivalent in power to dynamic dataflow models such as BDF. We have developed a SCORE compiler, presently consisting of a TDF front-end, limited compiler optimizations, and a back-end that emits executable C++ simulation code. We are presently extending the compiler to include more sophisticated optimizations (to exploit instruction-level parallelism) and automatic page generation.
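The flavor of the model can be sketched as follows. This is a hypothetical API, not TDF syntax: operators are FSMs that interact only through FIFO streams, and an operator fires when its inputs are available.

```python
# Illustrative sketch of the programming model (hypothetical API, not TDF):
# operators communicate exclusively through FIFO streams, exposing
# task-level parallelism explicitly in the graph structure.

from collections import deque

class Stream:
    """FIFO channel connecting two operators."""
    def __init__(self):
        self.q = deque()
    def put(self, v):
        self.q.append(v)
    def get(self):
        return self.q.popleft()
    def ready(self):
        return bool(self.q)

def double(inp, out):
    """A one-state FSM operator: fire whenever input is available, emit 2*x."""
    while inp.ready():
        out.put(2 * inp.get())

a, b = Stream(), Stream()
for v in [1, 2, 3]:
    a.put(v)
double(a, b)          # dataflow firing: consumes from a, produces on b
result = list(b.q)
```

Real TDF operators carry multiple states and data-dependent firing conditions, which is what lifts the model to the expressive power of dynamic dataflow.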
The goal of automatic page generation is to partition and/or cluster TDF FSM operators into fixed-size hardware pages under area and I/O constraints, in a manner that is robust to hardware virtualization. Under virtualization, a pair of communicating pages may not execute simultaneously, requiring their stream communication to be buffered in memory. Partitioning for efficient execution under virtualization involves a tradeoff among several desirable properties: (1) minimized FSM latency, (2) minimized stream I/O bandwidth, and (3) avoidance of inter-page feedback loops. Our approach involves first hoisting control-independent datapath operations out of FSMs, where possible, then partitioning/clustering the resulting pipelines using known techniques for synchronous circuits. We are presently exploring techniques for partitioning the remaining control-intensive FSMs and their associated datapaths.
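One simple heuristic in this family can be sketched as follows. This is not the SCORE compiler's algorithm, only a hedged illustration: grow each page greedily along stream edges until an area budget is exhausted, so that as many streams as possible stay on-page (reducing inter-page bandwidth). The I/O constraint and feedback-loop avoidance are omitted here for brevity.

```python
# Hedged sketch of a greedy clustering heuristic (illustrative only):
# grow pages along stream edges under an area budget so that stream
# traffic stays on-page where possible.

def cluster(ops, area, edges, page_area):
    """ops: operator names; area: dict op -> area units;
    edges: list of (src, dst) streams; page_area: capacity per page."""
    adj = {o: set() for o in ops}
    for s, d in edges:
        adj[s].add(d)
        adj[d].add(s)
    pages, placed = [], set()
    for seed in ops:
        if seed in placed:
            continue
        page, used, frontier = [], 0, [seed]
        while frontier:
            op = frontier.pop()
            if op in placed or used + area[op] > page_area:
                continue          # already placed, or would overflow the page
            page.append(op)
            placed.add(op)
            used += area[op]
            frontier.extend(adj[op] - placed)   # grow along stream edges
        pages.append(page)
    return pages

# A four-operator pipeline f->g->h->k packed into pages of capacity 4.
ops = ["f", "g", "h", "k"]
pages = cluster(ops, {"f": 2, "g": 2, "h": 3, "k": 3},
                [("f", "g"), ("g", "h"), ("h", "k")], page_area=4)
```

Here `f` and `g` share a page, keeping their stream internal, while the larger operators `h` and `k` each occupy their own page; only two streams cross page boundaries.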
Fixed-frequency FPGAs, in which any valid, mapped design runs at the intrinsic clock rate of the FPGA, offer two key advantages over conventional FPGAs: vastly increased computational throughput and significantly simpler interfacing to other logic blocks such as microprocessors. Yet previous attempts have not succeeded, owing either to the lack of a general switched interconnect or to an inability to map datapaths efficiently.
We are investigating structures that allow for efficient, pipelined FPGAs. Our primary focus is on a different form of Manhattan interconnect, a "corner-turning" interconnect. This interconnect structure can be efficiently pipelined for high-throughput operation and admits fast polynomial-time routing algorithms.
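As a toy illustration of routing in such a fabric (not the authors' algorithm): in a corner-turning Manhattan interconnect, a connection can run along its source row, turn once at a corner, and run along its destination column. If every wire segment of a fixed length is registered for pipelining, the latency of a route is simply its register count, which downstream retiming must accommodate. The grid coordinates and the segment length `seg` below are assumptions.

```python
# Toy dimension-ordered (one-corner) routing in a pipelined Manhattan grid.
# Illustrative assumptions: tiles addressed as (row, col); one pipeline
# register per `seg` tiles of wire.

def route(src, dst, seg=4):
    """Return the corner tile of a row-then-column route and the number
    of pipeline registers the route passes through."""
    (r0, c0), (r1, c1) = src, dst
    corner = (r0, c1)                    # travel along the row, then turn
    hops = abs(c1 - c0) + abs(r1 - r0)   # Manhattan wire length in tiles
    registers = -(-hops // seg)          # ceiling division: one reg per seg
    return corner, registers

corner, regs = route((0, 0), (6, 9))     # 15 tiles of wire, one corner turn
```

Because each connection commits to a single turn, route search degenerates to simple arithmetic per net plus channel-capacity bookkeeping, which is one intuition for why polynomial-time routing is achievable in such structures.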
We have defined a fixed-frequency architecture based on the Xilinx Virtex FPGA and completed a toolflow that accepts placed designs produced by the Xilinx toolflow. We have developed both a router and an automatic retimer, and we are currently constructing the layout for a routing channel in order to estimate performance and area costs for this architecture.
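The core transformation an automatic retimer applies is the classic Leiserson–Saxe retiming, a known technique for synchronous circuits (the sketch below is that textbook formulation, not this project's retimer): assigning each node a lag r(v) changes the register count on edge u→v to w'(u→v) = w(u→v) + r(v) − r(u), redistributing registers without altering I/O behavior.

```python
# Sketch of the Leiserson-Saxe retiming transformation: a lag assignment
# r moves registers across nodes, changing each edge's register count to
# w'(u->v) = w(u->v) + r(v) - r(u) while preserving circuit behavior.

def retime(edges, w, r):
    """edges: list of (u, v); w: dict edge -> register count;
    r: dict node -> integer lag. Returns the retimed edge weights."""
    return {(u, v): w[(u, v)] + r[v] - r[u] for (u, v) in edges}

# Balance a path a->b->c that starts with both registers on one edge:
# lagging b by -1 pulls one register forward across it.
edges = [("a", "b"), ("b", "c")]
w2 = retime(edges, {("a", "b"): 2, ("b", "c"): 0},
            {"a": 0, "b": -1, "c": 0})
```

A legal retiming must also keep every edge weight nonnegative; a production retimer searches for a lag assignment meeting that constraint while balancing register counts against the fabric's pipelined wire delays.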