The embedded system design process is characterized by three challenges. One challenge is the task of modeling heterogeneous concurrent applications. Another is the task of finding a programming model for heterogeneous multiprocessor architectural platforms. Compounding each of these challenges is the task of implementing heterogeneous applications on heterogeneous architectures. We believe that the key problem underlying each of these challenges is the modeling of concurrency, and the key to modeling concurrency is to capture concurrent communication formally in models of computation.
The aim of this research is to develop a disciplined approach to the design and implementation of communication structures in embedded applications. Our approach combines the network-on-chip paradigm with the models of computation paradigm. We use models of computation to capture the communication requirements of an application as well as to abstract the capabilities of a communication architecture. Then, application requirements and architectural capabilities are matched using a discipline based on network-on-chip principles.
Our design methodology is based on the Y-chart. The Y-chart is an iterative strategy for design space exploration that enables the co-evolution of hardware and software. We extend the multiple views methodology to provide designers with the right abstractions for implementing heterogeneous, concurrent applications on heterogeneous multiprocessor programmable platforms.
Network processors exploit task- and packet-level parallelism to achieve large throughputs. This has resulted in a huge diversity of architectures for similar applications. From an ease-of-programmability perspective, these architectures can be classified along three main axes: the type and organization of processing elements; the scheduling schemes between processing elements and other resources, such as I/O interface buffers; and the communication between different resources, such as memory operations as well as bus configurations. Each of these aspects significantly influences the ease of programmability.
A primary goal of this work is to implement one or more benchmarks on different network processors and understand the complex interplay between different architectural components and programmability. In the process, we hope to define a broader abstraction for programming such heterogeneous application-specific multiprocessors. Such an abstraction can be seen as a hardware abstraction layer across multiple network processors or as a starting point for exploring new heterogeneous ASIPs. It is important to note that this work differs from a typical programming model, where the emphasis is on capturing application requirements rather than on ease of mapping or ease of porting across heterogeneous multiprocessor ASIPs. In this work, we have currently implemented a 16-port IPv4 benchmark on an Intel IXP1200 and a Motorola C-Port C-5 and are investigating the different architectural influences on both implementations.
Building blocks for application-specific programmable system-on-chip architectures can be classified into (multi-) processors, memory hierarchies, interconnection networks, and peripherals. While the design process puts a great deal of emphasis on the first three, peripherals, of which communication interfaces form the largest subset, are often underestimated and neglected.
Our analysis for the multimedia and network domains clearly shows that (1) peripherals and communication interfaces account for a significant share of the system (between 30 and 40 percent of die area), (2) there is a large number of heterogeneous peripheral functions even on a single chip, and (3) a single function can be even more complex than a processor core.
The main goal of this research is to achieve a reduction in the diversity of peripherals, thereby making the design process more regular and potentially simpler. We propose to achieve the goal by means of a programmable peripheral processor that can be targeted at multiple peripheral functions. The advantages of such a system are not only the unified design process, but also an increase in robustness of the overall system due to a regular software interface for device drivers.
We have carefully examined the speed gap between ASIC and custom processors. Typical ASIC designs are a factor of 3 to 8 slower than comparable custom designs. The contributing factors to this gap are microarchitecture, timing overhead, logic style, logic design, cell design and wire sizing, layout, and process variation and accessibility. With careful attention to reducing these contributions, some ASICs can close the speed gap to only a factor of 2 to 3.
Power is also a significant concern to designers. The heat dissipated by a chip greatly impacts the cost of packaging and cooling methods (ranging from passive dissipation to refrigeration). The power consumption limits battery life and adds to the cost of running an instrument. Particularly for low cost embedded applications, power consumption directly limits the maximum performance.
Custom logic styles such as dynamic domino logic are perceived as consuming more power than static logic. However, because dynamic circuits achieve higher maximum speeds, they can be downsized further and can operate at a reduced supply voltage to meet slower speed targets. Custom logic styles typically have less capacitance in the PMOS pull-up networks of gates, due to alternative logic styles and skewed drive strengths. As the gate loads are smaller, the drive strengths required are smaller, doubly impacting the dynamic power consumed in switching capacitances within the circuit.
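The compounding effect of smaller gate loads and smaller drive strengths can be illustrated with the standard dynamic power relation P = alpha * C * V^2 * f. The sketch below is purely illustrative; the capacitance, voltage, and activity numbers are invented, not measurements from any design.

```python
# Dynamic switching power: P = alpha * C * V^2 * f
def dynamic_power(alpha, c_load, vdd, freq):
    return alpha * c_load * vdd ** 2 * freq

# Hypothetical static-CMOS baseline: 100 fF switched load at 1.2 V, 500 MHz,
# 10% switching activity.
p_static = dynamic_power(0.1, 100e-15, 1.2, 500e6)

# Hypothetical custom design at the same speed target: 30% less load
# capacitance (lighter PMOS networks and downsized drivers), and the speed
# margin lets the supply drop to 1.0 V.
p_custom = dynamic_power(0.1, 0.7 * 100e-15, 1.0, 500e6)

print(p_custom / p_static)  # capacitance and voltage savings multiply
```

The ratio is 0.7 * (1.0/1.2)^2, i.e. the capacitance reduction and the quadratic voltage reduction compound, which is the "doubly impacting" effect described above.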
Given equivalent speed targets for ASIC and custom methodologies, custom approaches can exploit their speed advantage to substantially reduce power consumption. We quantify the contribution of each factor to the power gap between ASIC and custom designs. We then illustrate approaches to reducing ASIC power consumption.
Some approaches to transistor sizing have used convex posynomial formulations to guarantee a solution that is globally optimal [1,2]. The convex posynomials are fit to experimental data characterizing a gate. More recent circuit optimization methods have generally avoided posynomial models, due to perceived inaccuracies in the models. We wish to find an optimization formulation that includes sizing with threshold voltage and gate voltage assignment.
Previous heuristic approaches [3-5] to transistor threshold voltage assignment to minimize power have not considered simultaneous sizing of transistors and threshold changes. We extend previous work by including supply voltage and performing simultaneous optimization of all gate parameters.
We chose posynomials to formulate the problem of simultaneous transistor sizing with supply voltage and threshold voltage assignment for combinational circuits. Using convex posynomials guarantees a global minimum and avoids heuristic approaches in which circuit variables are optimized independently. The convex posynomials are used for static timing analysis and power analysis. Power analysis includes all components of circuit power: switching capacitance, short-circuit power through a gate when it switches, and leakage current for the dominant leakage state (primarily subthreshold leakage). Posynomials are fit to a continuous range of supply voltage, threshold voltage, drive strength ratio, and drive strength of a gate for a range of input slew and output load capacitance conditions. The fidelity of the posynomial models for delay and power analysis is verified against SPICE circuit simulations.
Given the globally optimal solution from the continuous sizing, continuous supply voltage, and continuous threshold voltage formulation, we discretize supply voltage and threshold voltage. Upper and lower bounds on the minimum power for a circuit can be easily determined from the posynomial formulation. A limited branch and bound during the iterative discretization ensures a discretized solution near the global optimum of the continuous problem.
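To give a flavor of why posynomial formulations are attractive, consider a single gate with a hypothetical posynomial delay model d(w) = a/w + b*w: power grows monotonically with size w, so the minimum-power size is the smallest w that meets the delay target, and on the decreasing branch of d(w) that point can be found by simple bisection. This is a toy instance of the convex structure, not the paper's actual formulation; all coefficients are invented.

```python
import math

# Hypothetical posynomial delay model for one gate: d(w) = a/w + b*w
# (a: load-driving term that shrinks with size, b: self-loading term).
A, B, D_MAX = 4.0, 1.0, 5.0

def delay(w):
    return A / w + B * w

# Power is monotone increasing in w, so minimum power = smallest feasible w.
# d(w) is decreasing on (0, sqrt(A/B)]; bisect for delay(w) == D_MAX there.
def min_feasible_size(lo=1e-6, hi=None, iters=100):
    hi = hi if hi is not None else math.sqrt(A / B)  # bottom of delay curve
    assert delay(hi) <= D_MAX, "delay target infeasible"
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delay(mid) <= D_MAX:
            hi = mid                      # feasible: try an even smaller gate
        else:
            lo = mid
    return hi

w_opt = min_feasible_size()
print(w_opt)  # -> close to 1.0, the smaller root of w^2 - 5w + 4 = 0
```

In the full formulation the same log-transform that makes posynomials convex handles many gates, supply voltages, and threshold voltages jointly, which is what guarantees the global optimum used as the starting point for discretization.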
Low power design methods are becoming more important as power consumption increases. The effects of higher power consumption include: (1) increased costs, because the extra heat generated must be combated by expensive packaging technology; (2) reduced reliability because of physical effects like electromigration; and (3) shortened overall battery life in portable devices. In our methodology, we propose to take advantage of advances in manufacturing technologies. Advanced manufacturing technology allows the production of circuits with dual supply and threshold voltage levels. Multiple voltage levels can be exploited to trade off power consumption and circuit speed. Because most combinational circuits contain many sub-critical paths, the delay of some circuit components can safely be increased, which results in reduced power consumption. Our research aims to develop a set of efficient algorithms to assign threshold and supply voltages to gates and perform gate sizing in a way that preserves the overall performance of the circuit while maximally reducing power. Our current approach is based on a heuristic that uses linear programming (LP) and integer-linear programming (ILP) to perform the assignment.
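To make the assignment problem concrete, the toy sketch below enumerates dual-Vt choices for a short chain of gates, minimizing total leakage power subject to a path delay budget. A real flow would use the LP/ILP heuristic described above rather than enumeration, and every gate number here is invented for illustration.

```python
from itertools import product

# Hypothetical (delay, leakage_power) options per gate: low-Vt (fast, leaky)
# vs high-Vt (slow, low-leakage), for a 4-gate chain.
GATES = [
    {"low_vt": (1.0, 10.0), "high_vt": (1.5, 2.0)},
    {"low_vt": (2.0, 15.0), "high_vt": (3.0, 3.0)},
    {"low_vt": (1.2, 8.0),  "high_vt": (1.8, 1.5)},
    {"low_vt": (0.8, 6.0),  "high_vt": (1.3, 1.0)},
]
DELAY_BUDGET = 6.5  # all-high-Vt (7.6) violates this, so some gates must speed up

def best_assignment():
    best = None
    for choice in product(("low_vt", "high_vt"), repeat=len(GATES)):
        delay = sum(GATES[i][c][0] for i, c in enumerate(choice))
        power = sum(GATES[i][c][1] for i, c in enumerate(choice))
        if delay <= DELAY_BUDGET and (best is None or power < best[0]):
            best = (power, choice)
    return best

power, choice = best_assignment()
print(power, choice)
```

The point of the LP/ILP heuristic is exactly to avoid this exponential enumeration while still selecting which sub-critical gates can afford the slower, lower-leakage option.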
Nanometer designs are creating a variety of new challenges for the semiconductor industry in the areas of manufacturing, design tools, chip architectures, speed, reliability, power management, variability, and verification. Coupled with these technology challenges, there is an increasing demand for products that are low power, high performance, reliable, and portable. This has resulted not only in increased functionality on a single chip but also in a more complex design cycle, as many physical design and manufacturing issues must be considered when making design tradeoffs at earlier stages of the design process. Of these tradeoffs, a significant portion of design decisions are concerned with minimizing power consumption and power dissipation.
Recently, a number of circuit design and manufacturing techniques to manage and minimize power have been presented: (1) use of multiple supply voltages to minimize dynamic power; (2) use of multiple threshold voltages to minimize static power; (3) use of post-synthesis gate resizing to recover power on non-critical paths; (4) a combination of (1), (2), and (3); and (5) selective phase-shifting to provide richer design tradeoff space for synthesis and mapping.
The goal of this research project is to orchestrate all of these approaches simultaneously to maximally leverage them in automated design methodologies for application-specific standard products (ASSPs) and application-specific integrated circuits (ASICs). Specifically, the desired results of the research are to arrive at new algorithms and design flows that enable designers to create low power, high performance circuits in an automated fashion. Research is focused on both front-end and back-end VLSI CAD flows.
It is difficult to verify that the components of programmable systems will behave correctly with respect to either high-level specifications or the intent of the designers who have implemented them. To focus on a concrete example, microprocessors traditionally use the notion of an instruction set (IS), specified at a high level, to form a boundary between the hardware designers and the compiler authors. While there are many variations on this theme, it has proved a long-lived approach to the complexities of programming and designing microprocessors. However, it is still difficult to verify that a particular implementation of a microprocessor implements a given IS, and the more hardware details the IS hides, the more difficult it is to use the IS as the sole input for an optimizing compiler. Thus, compiler authors must often be familiar with all of the low-level hardware details of a processor, and then carefully construct code using the hidden semantics of the IS with respect to a given processor or processor class to achieve good performance.
VLIW-style instruction sets are an attempt to reduce this semantic gap by exposing more of the behavior of the processor to the compiler. The research question posed here is: "What is required to completely close the semantic gap between the hardware-level description of a programmable component and the exported semantics of the same component?" The tentative answer is that, as a first step, it requires a more formal description of the device, its programmability, and its concurrency than is offered by traditional HDLs. Furthermore, given this new description, it is necessary to reformulate much of the abstraction/mapping tool chain from the hardware description to the high-level programmer's model of the device. One way to explore this issue is to attempt to build such a system. The current approach combines a functional hardware description language (sharing properties with other concurrent constraint languages such as Oz and Mercury) with the power of complex propositional reasoning (using high-performance SAT solvers) and additional theories (a la CVC), including Presburger arithmetic and any others that prove useful or necessary. Currently, work is underway to develop a high-performance SAT+Presburger solver while pushing the design of simply controlled (similar to a horizontal VLIW) heterogeneous processors to the level of automatic compiler generation. It is expected that the solver will be useful for the implementation of an efficient retargetable compiler, because Presburger arithmetic can encode the semantics of the constraint language that is currently used to model both the hardware and the intermediate forms (at the compiler IR level) of the software.
Current designs of network processors reveal manifold micro-architectures. Even processors targeted at the same application domain show a variety of differences in micro-architecture topology. It is unclear whether any of the current solutions actually represents a natural and efficient implementation of the required functionality for a given application such as IPv4 forwarding.
In order to enable the systematic evaluation of different existing innovative designs in a reasonable time frame, we investigate the feasibility of a recently developed analytical performance estimation approach to capture inherent characteristics of the design space and to automatically guide the designer toward optimal solutions in the network processing domain.
We focus in particular on the following issues: (1) extension of the analytical approach to model and automatically explore heterogeneous communication architectures; (2) comparison of results from the analytical model with simulation-based results; (3) design space pruning by incorporating technology constraints as well as individual constraints defined by the designer; and (4) incorporation of domain-specific design decisions, such as the partitioning of application tasks between microprocessor threads, into an automated exploration framework.
The traditional approach to worst-case static timing analysis uses deterministic delays for gates and wires. These fixed delay values are extracted from best- and worst-case process parameter sets obtained through repeated device simulations. However, with the advent of deep-submicron technology, the intra-chip variations of process components and interconnect geometries are expected to increase by as much as 65%. This, coupled with the presence of variable correlations between delay elements, causes further deviation from the fixed delay model. As a consequence, typical worst-case timing analysis leads to overly conservative solutions, as it assumes worst-case delay characteristics and fails to recognize the inherent statistical variation. The end product suffers due to lost performance and expensive over-design.
In order to reliably design modern ICs, the timing model must be freed from deterministic approximations and the underlying randomness must be exposed through a coherent probabilistic abstraction. The objective of this research is to propose a framework and methodology for statistical timing analysis. In particular, there are two driving questions: (1) choosing an appropriate probabilistic model to describe delays while trading off between accuracy and computational complexity and (2) implementing an efficient algorithm to compute delay distributions of combinational circuits based on the chosen model. The incorporation of such a probabilistic framework for timing analysis reduces the gap between timing constraints predicted by tools and the actual silicon performance, obviates over-design, and relieves the designer of unnecessary iterations to attain timing closure.
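As a minimal illustration of the probabilistic view, the sketch below runs a Monte Carlo statistical timing analysis on a tiny combinational netlist with Gaussian gate delays: the mean of the resulting circuit-delay distribution sits above the nominal critical-path delay, which is precisely the effect a deterministic analysis misses. The netlist, delay means, and sigmas are all invented, and Monte Carlo is just one possible delay-distribution computation, not necessarily the model chosen in this research.

```python
import random

# Tiny DAG: each gate has a (mean, sigma) Gaussian delay; arrival times
# propagate forward in topological order, taking max over fanins.
GATES = {"g1": (1.9, 0.3), "g2": (2.0, 0.3), "g3": (1.5, 0.2), "out": (0.5, 0.05)}
FANIN = {"g1": [], "g2": [], "g3": ["g1", "g2"], "out": ["g3", "g2"]}
ORDER = ["g1", "g2", "g3", "out"]  # topological order

def sample_circuit_delay(rng):
    arrival = {}
    for g in ORDER:
        mu, sigma = GATES[g]
        start = max((arrival[f] for f in FANIN[g]), default=0.0)
        arrival[g] = start + rng.gauss(mu, sigma)
    return arrival["out"]

rng = random.Random(42)
samples = [sample_circuit_delay(rng) for _ in range(20000)]
mean_delay = sum(samples) / len(samples)

# A deterministic nominal analysis reports the max path of mean delays:
nominal = 2.0 + 1.5 + 0.5  # g2 -> g3 -> out
print(mean_delay, nominal)  # the statistical mean exceeds the nominal value
```

The gap arises because the max of two nearly tied random arrivals (g1 vs. g2 at g3) has a mean larger than either of their means, so a fixed-delay model is neither the mean nor a tight worst case.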
The past five years have witnessed over 30 attempts at programmable solutions for packet processing. With these architectures, network processor designers have employed a large variety of hardware techniques to accelerate packet processing, including parallel processing, special-purpose hardware, memory architectures, on-chip communication mechanisms, and use of peripherals. However, despite this architectural innovation, relatively little effort has been made to make these architectures easily programmable. The current practice for programming network processors is to use assembly language or a subset of C. Such low-level programming places a large burden on the programmer, who must understand fine details of the architecture just to implement a packet processing application, let alone optimize it. We believe the programmer should be presented with an abstraction of the underlying hardware, or programming model, which exposes just enough detail to write efficient code for that platform. This work focuses on a programming model for network processors. Our approach starts with an abstraction based on Click, a domain-specific language for packet processing. We have demonstrated the promise of our programming model by implementing an IPv4 router on the Intel IXP1200, a common network processor.
Detecting overlap of switching windows between coupled nets is a key static technique for accurately locating crosstalk noise. However, due to the mutual dependency between switching windows, the computation requires iteration to converge. In this research, we discuss the issues and provide answers to the important questions involved in convergence and numerical properties, including the effect of coupling models, multiple convergence points, convergence rate, computational complexity, non-monotonicity, continuity, and the effectiveness of bounds. Numerical fixed-point computation results are used to explain these properties. Our contribution builds a theoretical foundation for static crosstalk noise analysis.
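The mutual dependency can be pictured with two capacitively coupled nets whose late arrival edges depend on whether the other net can switch at the same time: starting from the uncoupled windows and iterating reaches a fixed point. The window model below is a deliberately simplified invention, meant only to show the fixed-point structure, not any particular coupling model from this work.

```python
# Two coupled nets: each has a base switching window (early, late).
# If the other net's window overlaps this net's, coupling adds extra delay
# to the late edge -- which can in turn create new overlaps, hence the
# iterative fixed-point computation.
BASE = {"a": (2.0, 4.0), "b": (3.5, 5.0)}
COUPLING_DELAY = 1.0

def overlaps(w1, w2):
    return w1[0] <= w2[1] and w2[0] <= w1[1]

def update(windows):
    new = {}
    for net, other in (("a", "b"), ("b", "a")):
        early, late = BASE[net]
        if overlaps(windows[net], windows[other]):
            late += COUPLING_DELAY
        new[net] = (early, late)
    return new

# Start from the uncoupled (optimistic) windows and iterate to convergence.
windows = dict(BASE)
for _ in range(10):
    nxt = update(windows)
    if nxt == windows:
        break
    windows = nxt
print(windows)
```

Questions such as whether this iteration is monotone, how fast it converges, and whether different starting points reach different fixed points are exactly the properties the research above analyzes.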
Coupling effects have become a serious concern in UDSM chip designs. Static timing analysis must take these effects into account to correctly ensure signal integrity. However, conventional approaches give pessimistic results for noise calculation due to an over-conservative assumption about the switching windows.
For crosstalk noise calculation, computing the switching windows of a net helps identify noise sources accurately. Traditional approaches use a single continuous switching window per net. Under this model, signal switching is assumed to happen at any time within the window. Although conservative and sound, this model can result in too much pessimism, since in reality the exact timing of signal switching is determined by a path delay up to the net; i.e., the underlying circuit structure does not always allow signal switching at an arbitrary time within the continuous switching window. To address this inherent inaccuracy of the continuous switching window, we propose a refinement of the traditional approaches in which signal switching is characterized by a set of discontinuous switching windows instead of a single continuous window. Each continuous switching window is divided into multiple windows, called time slots, and the signal switching activity of each slot is analyzed separately to calculate the maximum noise more accurately. By controlling the size of a time slot we can trade off accuracy against runtime, which makes this approach highly scalable. We have confirmed with experiments on industrial circuits that up to 90% of the noise violations detected by the traditional approach can be spurious.
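The pessimism reduction can be seen in a toy example: an aggressor whose actual switching activity occupies only two time slots inside its continuous window does not, under the slotted model, overlap a victim window that falls between them, while the continuous model would still flag a violation. All window values below are invented.

```python
# Continuous model: one window per net; slotted model: a set of time slots.
def overlaps(w1, w2):
    return w1[0] <= w2[1] and w2[0] <= w1[1]

# Aggressor's continuous window spans [0, 10], but the path structure only
# allows switching in two discontinuous slots; the victim is sensitive in [4, 6].
aggressor_continuous = (0.0, 10.0)
aggressor_slots = [(0.0, 2.0), (8.0, 10.0)]
victim = (4.0, 6.0)

pessimistic = overlaps(aggressor_continuous, victim)
refined = any(overlaps(slot, victim) for slot in aggressor_slots)

print(pessimistic, refined)  # continuous model reports a spurious violation
```

Shrinking the slot size tightens the slots toward the true switching times (more accuracy, more slots to check); growing it degenerates toward the single continuous window, which is the accuracy/runtime trade-off described above.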
We propose a method for describing programmable platforms in a constraint-based pure functional language. This description coupled with a well-defined model of computation allows us to constructively generate and restrict multiple views of the programmable platform. Our multiple views enable designers to abstract and refine the design while our system maintains consistency between the views.
Programmable solutions on network processors for packet processing have relied mostly on assembly or subsets of C as an interface. For practical implementations, programmers must effectively utilize hardware parallelism, arbitrate between shared resources, divide code among threads, lay out shared data, and interface with special function units and peripherals. This is significantly removed from the natural language with which application writers describe their designs. To bridge the gap between an application language and the language for a target architecture, a programming model is necessary that hides the unimportant details of the target architecture and exposes those that allow the programmer to harness the full power of the device.
The current programming model we have developed for network processors requires the programmer to specify thread boundaries, lay out shared data, and manually eliminate redundant accesses and computation across elements. Since this presents a potentially enormous design space, this work will attempt to find optimal strategies and heuristics to automate traversing it. Furthermore, there are potential optimizations that may be found in a high-level language but are not easily found once assembler or C is generated. This work will explore finding such optimizations automatically.
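One piece of that design space, assigning pipeline elements to threads, can be sketched as a greedy load-balancing heuristic: place each element (here, a Click-style element with an estimated per-packet cycle cost) on the currently least-loaded thread. The element names and costs below are invented, and a real mapper would also have to account for inter-element communication and shared-data layout.

```python
import heapq

# Hypothetical per-packet cycle costs for Click-like elements of an IPv4 path.
ELEMENTS = {"CheckIPHeader": 120, "LookupIPRoute": 400, "DecIPTTL": 60,
            "Fragmenter": 250, "ARPQuerier": 180, "Queue": 90}
NUM_THREADS = 3

def greedy_partition(elements, num_threads):
    # Min-heap of (load, thread_id); assign the heaviest elements first.
    heap = [(0, t) for t in range(num_threads)]
    assignment = {}
    for name, cost in sorted(elements.items(), key=lambda kv: -kv[1]):
        load, tid = heapq.heappop(heap)
        assignment[name] = tid
        heapq.heappush(heap, (load + cost, tid))
    return assignment

assignment = greedy_partition(ELEMENTS, NUM_THREADS)
loads = [0] * NUM_THREADS
for name, tid in assignment.items():
    loads[tid] += ELEMENTS[name]
print(sorted(loads))
```

Greedy balancing is only one heuristic among many for this space; comparing such heuristics against optimal strategies is part of the work proposed above.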
The speed gap between memory and the processor core grows at an exponential rate because the two improve exponentially with different time constants. Memory is therefore becoming increasingly important in a system. The slower rate of memory speedup is fundamentally limited by the communication required over the entire memory space, so architectural advancement in memory is essential for filling this gap and for sustaining overall computer speedup. The first step in achieving this is to construct a formal model that can describe existing memory architectures and to build an environment, based on this model, that allows the architect to easily explore the interesting portions of the tradeoff space. Once the formal memory model and environment are developed, the memory design space can be formally, and hopefully automatically, explored by leveraging the semantics behind the model. Determining the formal memory model and constructing a memory design exploration framework is the focus of my research.
In order to formalize the memory architecture, we must address the following issues. (1) How to model the memory structure uniformly across the levels of the hierarchy, including the organizational details of the memory. (2) How to model the interactions across the memory hierarchy, including how reads and writes are performed across the hierarchy and how prefetching behaves as a function of access history. (3) How to model consistency issues across the hierarchy and across multiprocessor memories. Currently, within our memory model, these issues are modeled with tables and accessors that communicate by function-call semantics. The consistency methods are modeled by arbitrators that order concurrent incoming function calls. A framework is built on top of the model by providing a library of simple parameterized memory elements. This allows the designer to easily build a complex memory system by stacking together and configuring the simple elements. Simulation is available through the Ptolemy II kernel. A much faster simulation engine is in development that connects to the Mescal simulator in the PE domain.
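The table-and-accessor style can be sketched as simple parameterized elements composed by function calls: a cache element keeps a bounded table and forwards misses to whatever element is stacked below it. The classes, parameters, and FIFO replacement below are hypothetical simplifications for illustration, not the actual Mescal model.

```python
# Minimal sketch: memory elements expose read(); a cache element keeps a
# bounded table and forwards misses to the element stacked below it.
class Backing:
    """Bottom of the hierarchy: always hits, e.g. main memory."""
    def __init__(self):
        self.accesses = 0
    def read(self, addr):
        self.accesses += 1
        return f"data@{addr}"

class Cache:
    def __init__(self, capacity, below):
        self.capacity, self.below = capacity, below
        self.table = {}          # addr -> data; FIFO replacement for simplicity
        self.hits = self.misses = 0
    def read(self, addr):
        if addr in self.table:
            self.hits += 1
            return self.table[addr]
        self.misses += 1
        data = self.below.read(addr)     # function-call semantics downward
        if len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))  # evict oldest entry
        self.table[addr] = data
        return data

# Stack a tiny L1 over a larger L2 over backing memory.
mem = Backing()
l2 = Cache(capacity=8, below=mem)
l1 = Cache(capacity=2, below=l2)

for addr in [0, 1, 0, 2, 0, 1]:
    l1.read(addr)
print(l1.hits, l1.misses, mem.accesses)
```

Because every element exposes the same accessor interface, complex hierarchies are built by stacking and configuring elements, which is the composition style the framework above provides.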
A formal memory exploration framework requires the following: a formal model of the specification, a formal description of the design space, a formal description of the cost functions, and a formal list of design space constraints. The formal memory model provides a partial language for describing these. Once they are fully determined, the exploration can be formulated as an optimization problem whose goal is to find the best design, according to the cost function, that lies in the valid design space and satisfies the specification. Since the design process is resource-constrained (by time, for example), heuristics may be required to maximize the chance of finding the optimal design. Examples of such heuristics are genetic algorithms and gradient descent. We will study when and which heuristics are required and how effective they are.
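A minimal sketch of such heuristic exploration: hill climbing over a tiny two-parameter memory design space with an invented cost function trading miss cycles against area, compared to the exhaustive optimum. The design space, cost model, and coefficients are all fabricated for illustration.

```python
from itertools import product

# Hypothetical design space: (log2 of cache KB, log2 of line bytes).
SPACE = list(product(range(1, 8), range(3, 8)))

def cost(p):
    size, line = p
    miss = 100.0 / (size * line)    # bigger cache/line -> fewer miss cycles
    area = 0.4 * size + 0.6 * line  # bigger structures cost area/energy
    return miss + area

def hill_climb(start):
    cur = start
    while True:
        size, line = cur
        neighbors = [(size + ds, line + dl) for ds in (-1, 0, 1)
                     for dl in (-1, 0, 1) if (size + ds, line + dl) in SPACE]
        best = min(neighbors, key=cost)
        if cost(best) >= cost(cur):
            return cur
        cur = best

found = hill_climb((1, 3))
optimum = min(SPACE, key=cost)    # exhaustive search for comparison
print(found, optimum)
```

On this smooth toy cost the greedy search reaches the exhaustive optimum; the research question above is when real, rugged memory cost surfaces need stronger heuristics such as genetic algorithms.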
The final goal of this research is to use the above system to explore interesting regions of the memory architecture design space. It is hoped that, with this formality and automation, better memory designs can be found more quickly.
Static timing analysis is the most widely used approach to critical path detection in combinational circuits. Since the traditional criteria used in standard static timing analysis depend on the circuit delay assignment, the analysis has to be repeated after each timing optimization step. A new approach to static timing analysis has recently been proposed. The cornerstone of the method is grouping paths into three categories: strongly perceptible, imperceptible, and weakly perceptible. This categorization allows a definite distinction between paths that impact the circuit delay, and therefore need to be sped up, and those that do not, thus reducing the timing optimization effort. Furthermore, this new approach has the advantage of being independent of the circuit delay assignment; therefore, static timing analysis need not be applied over and over after each timing optimization step. The main drawback of the method is that the criteria provided by the author for checking which category a path belongs to are partially ambiguous and not easy to compute. We identified new criteria for this path categorization based on the flattening of a circuit into an equivalent normal form (ENF). Our criteria, while being straightforward to check, clear up the previous ambiguity, thus allowing the full power of this new path categorization to be captured.