(Professors Robert W. Brodersen and Borivoje Nikolic)

(MARCO) CMU 2001-CT-888 and (MARCO) GSRC 98-DT-660

This work relates the potential energy savings to the energy profile of a circuit. These savings are obtained by using gate sizing and supply voltage optimization to minimize energy consumption subject to a delay constraint [1,2]. The sensitivity of energy to delay is derived from a linear delay model extended to multiple supplies. The optimizations are applied to a range of examples that span typical circuit topologies, including inverter chains, SRAM decoders, and adders. At a delay 20% larger than the minimum, energy savings of 40% to 70% are possible, indicating that achieving peak performance is expensive in terms of energy. The analysis is extended to register files, energy minimization across pipeline stages, and optimal parallelism.
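One common way to write the sensitivity of energy to delay is sketched below. It is an illustration under assumed notation, not necessarily the exact formulation of [1,2]: $E$ is energy, $D$ is delay, $x$ is any tuning variable (a gate size $W_i$ or the supply $V_{DD}$), and $x_0$ is the current design point.

```latex
S_x \;=\; \left. \frac{\partial E / \partial x}{\partial D / \partial x} \right|_{x = x_0},
\qquad
S_{W_1} = \dots = S_{W_n} = S_{V_{DD}}
\quad \text{at the delay-constrained energy minimum.}
```

The equality follows from the Lagrange conditions of the constrained problem and holds for variables that are not pinned at a bound. Intuitively, if one variable could trade energy for delay more cheaply than another, shifting effort toward it would lower energy at the same delay.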

- [1] V. Stojanovic, D. Markovic, B. Nikolic, M. Horowitz, and R. Brodersen, "Energy-Delay Tradeoffs in Combinational Logic Using Gate Sizing and Supply Voltage Optimization," *Proc. European Solid-State Circuits Conf.,* Florence, Italy, September 2002.
- [2] R. Brodersen, M. Horowitz, D. Markovic, B. Nikolic, and V. Stojanovic, "Methods for True Power Minimization," *Proc. ICCAD,* San Jose, CA, November 2002.

(Professors Paul R. Gray and Borivoje Nikolic)

MICRO

This project investigates the use of digital signal processing techniques to enhance pipelined A/D converter performance. Specifically, we are currently applying the Wiener filtering concept to the correction of analog errors. With a slow but accurate helper A/D and a back-end FIR digital filter, we have shown in simulation that capacitor mismatch, finite opamp gain, and various offset errors can be eliminated through digital filtering. The analog signal paths involved are open-loop. Correction is performed solely in the digital domain, without feedback to the pipeline A/D to tweak analog parameters. The system is further made adaptive to track slow environmental changes (power supply voltage drift, ambient temperature change, etc.) by means of an LMS algorithm. The adaptation rate can be adjusted depending on the speed of the slow helper. With this approach, we are targeting a very high conversion speed (> 200 MS/s) and high accuracy (>= 10 bits), with the stringent requirements on analog circuit components relaxed with the aid of digital techniques. In the longer term, we will also investigate applications of Volterra filtering to correct nonlinearities and bandwidth limitations in analog circuits, where the complexity of the digital filters will increase geometrically. The driving force behind this, however, is the inexorable power of scaling in digital CMOS technology. If leveraged strategically, it will revolutionize the performance and design of traditional analog circuits in the near future.
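The adaptive correction idea can be illustrated with a minimal LMS sketch. This is a toy model under stated assumptions, not the project's filter: the fast converter's errors are reduced to a single gain error and offset, the corrector has only two coefficients rather than a full FIR filter, and the names `lms_correct`, `mu`, and `helper_every` are hypothetical.

```python
import random


def lms_correct(raw, helper, mu=0.05, helper_every=4):
    """Adapt a gain/offset corrector with LMS.

    raw          -- samples from the fast, inaccurate converter
    helper       -- accurate reference values; only every `helper_every`-th
                    one is used, mimicking a slow helper converter
    mu           -- LMS step size (assumed value)
    """
    w_gain, w_off = 1.0, 0.0          # start from "no correction"
    corrected = []
    for n, y in enumerate(raw):
        x_hat = w_gain * y + w_off    # purely digital, feed-forward correction
        corrected.append(x_hat)
        if n % helper_every == 0:     # a reference sample is available
            err = helper[n] - x_hat
            w_gain += mu * err * y    # LMS update for the gain coefficient
            w_off += mu * err         # LMS update for the offset coefficient
    return corrected, (w_gain, w_off)


# Toy data: 5% gain error and a small offset (both assumed for illustration).
random.seed(0)
true_x = [random.uniform(-1.0, 1.0) for _ in range(20000)]
raw = [0.95 * x + 0.02 for x in true_x]
corrected, (w_gain, w_off) = lms_correct(raw, true_x)
```

In this noiseless toy case the weights settle near the inverse of the modeled error (gain close to 1/0.95). Lowering `helper_every` or raising `mu` speeds up tracking, which mirrors the link between adaptation rate and helper speed described above.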

(Professors Borivoje Nikolic and Kannan Ramchandran)

In this project, we propose and analyze the list Viterbi algorithm (LVA) with an arithmetic coding-based continuous error detection (CED) scheme for transmission on inter-symbol interference (ISI) channels. We have shown that the system localizes error occurrences to the end of the transmission and that the proportion of bits where errors are likely to occur asymptotically approaches zero as the inverse of the number of transmitted bits. We analytically derived an upper bound on the bit error rate (BER) of the LVA-CED system as a function of the redundancy added by the CED code, the number of bits transmitted, and the BER when the standard maximum likelihood detector is used on the same channel. We analytically derived the number of paths required by the LVA for a target error rate. To show the benefits of these theoretical results in a practical setting, we have applied the system to high-order partial-response magnetic recording channels. Simulations show that this system results in a 2 dB improvement over maximum transition run (MTR) encoded EEPR4 partial response decoding with maximum likelihood sequence detection (EEPRML) at a BER of 2 x 10^{-6} in additive white Gaussian noise, and confirm the theoretical predictions.

Figure 1: Proposed LVA-CED system for magnetic recording

- [1] D. Petrovic, B. Nikolic, and K. Ramchandran, "List Viterbi Decoding with Continuous Error Detection for Magnetic Recording Channels," *Proc. IEEE Global Conf. Communications, Globecom,* San Antonio, TX, November 2001.

(Professor Borivoje Nikolic)

A maskless lithography system can replace expensive masks with a reusable, electronic mask. The maskless system requires extremely high throughput, around 10 Tbps, to match the wafer write speed of conventional lithography. We are focusing research on advanced data decompression techniques and a high-speed analog interface to the mask-writing mechanism.

Previous research has shown that Lempel-Ziv (LZ) and Burrows-Wheeler (BW) can compress layout data well. We analyzed these algorithms for their suitability for hardware decompression. Lempel-Ziv is a pattern-matching encoder followed by a Huffman encoder. Burrows-Wheeler is a block sort followed by a locally adaptive compression algorithm. Both algorithms are limited by the memory available for compression and decompression: the history buffer size for LZ and the block size for BW. We found that we could stretch the effectiveness of limited memory in both algorithms by precompressing with a simple run-length encoding (RLE). RLE fares well because layout data typically has homogeneous blocks. When sufficient memory is available, RLE has little effect on the compression ratio, but for memory-limited compression, RLE significantly improves it. Burrows-Wheeler requires more memory than Lempel-Ziv to achieve the same compression ratio, and it also requires a more complex decompression architecture. Thus, we determined that RLE followed by LZ is best suited for hardware decompression of layout data.
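The RLE-then-LZ pipeline can be sketched in a few lines of Python. This is an illustration only: zlib's DEFLATE (an LZ77 variant followed by Huffman coding) stands in for the LZ coder, its smallest window setting (`wbits=9`, at most 512 bytes of history) stands in for a memory-limited decompressor, and the byte-pair RLE format and synthetic `layout` data are assumptions.

```python
import zlib


def rle_encode(data: bytes) -> bytes:
    """Encode as (run_length, value) byte pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Invert rle_encode by expanding each (run_length, value) pair."""
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)


def lz_compress(data: bytes, wbits: int = 9) -> bytes:
    """DEFLATE with a deliberately small history window."""
    comp = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=wbits)
    return comp.compress(data) + comp.flush()


def lz_decompress(data: bytes) -> bytes:
    """The default window is large enough for any smaller-window stream."""
    return zlib.decompress(data)


# Synthetic "layout" rows made of long homogeneous runs (assumed data).
row = b"\x00" * 300 + b"\xff" * 200 + b"\x00" * 524
layout = row * 64

packed = lz_compress(rle_encode(layout))
restored = rle_decode(lz_decompress(packed))
```

Because RLE shrinks each homogeneous run to a few bytes, a fixed-size history window spans far more of the original layout after precompression. Comparing `len(lz_compress(layout))` with `len(packed)` at different `wbits` values is one way to explore this effect on real data.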

(Professor Borivoje Nikolic)

This research focuses on architectures and algorithms for iterative decoders for error correction codes. Iterative decoding algorithms for turbo codes and low-density parity check (LDPC) codes have recently been shown to achieve performance close to theoretical capacity bounds. These algorithms are based on message passing between modules using soft-input-soft-output decoding.

Many of the decoders implemented in silicon to date are based on algorithms that are serial in nature. Examples include the BCJR algorithm and the soft-output Viterbi algorithm, which are needed to decode convolutional codes or partial-response channels. In many applications, such as magnetic recording, both a source decoder and a channel decoder are necessary. While a parallel LDPC decoder can now be used for outer source decoding, channel decoding becomes the bottleneck in the system.

We investigate parallel descriptions of algorithms previously described serially. With the increasing number of transistors available on a single chip, it is now possible to investigate direct implementation of these parallel algorithms. As an example, this research investigates joint MAP and LDPC decoding algorithms and their implementations on a single chip.

(Professor Borivoje Nikolic)

This research addresses the algorithms and implementations of iterative decoders for error control in communication applications. The iterative codes are based on various concatenated schemes of convolutional codes [1] and on low-density parity check (LDPC) codes [2]. The decoding algorithms are instances of message-passing or belief propagation [3] algorithms, which rely on iterative cooperation between soft-decoding modules known as soft-input-soft-output (SISO) decoders.

Implementation constraints imposed on iterative decoders applying the message-passing algorithms are investigated. Serial implementations similar to traditional microprocessor datapaths are contrasted against architectures with multiple processing elements that exploit the inherent parallelism in the decoding algorithm. Turbo codes and low-density parity check codes, in particular, are evaluated in terms of their suitability for VLSI implementation in addition to the performance as measured by bit-error rate as a function of SNR.

In this research, the computational hardware and memory requirements of magnetic storage applications [4] provide a platform for evaluation of the iterative decoders. Past accomplishments include modification of known algorithms to accentuate the physical design considerations. A VLSI implementation of a soft-output Viterbi decoder suitable for high throughput Turbo applications has been demonstrated [5]. The ongoing efforts continue to study and demonstrate the traits of particular low-density parity check codes that lend themselves to efficient mapping on hardware architectures.

- [1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo Codes," *Proc. IEEE Int. Conf. Communications,* Geneva, Switzerland, May 1993.
- [2] R. G. Gallager, "Low Density Parity Check Codes," *IRE Trans. Information Theory,* Vol. 8, January 1962.
- [3] S. M. Aji and R. J. McEliece, "The Generalized Distributive Law," *IEEE Trans. Information Theory,* Vol. 46, No. 2, March 2000.
- [4] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, "VLSI Architectures for Iterative Decoders in Magnetic Recording Channels," *IEEE Trans. Magnetics,* Vol. 37, No. 2, March 2001.
- [5] E. Yeo, S. Augsburger, W. R. Davis, and B. Nikolic, "500 Mb/s Soft Output Viterbi Decoder," *Proc. IEEE European Solid-State Circuits Conf.,* Florence, Italy, September 2002.

(Professor Borivoje Nikolic)

In variable-throughput digital systems, power dissipation can be reduced by adjusting the operating frequency, supply voltage, or MOSFET threshold voltage, so that the system throughput never exceeds the requirements. Supply voltage scaling (VS) has been one of the most effective power-reduction techniques [1,2]. Threshold voltage scaling (TS) has also been proposed to effectively curtail the leakage power of the system [3]. Minimizing the power dissipation for a given throughput requires a careful balance of active and static power contributions, which can be achieved by simultaneous control of both supply and threshold. This research investigates several power reduction scenarios through different technology generations, logic depths, and switching activities, and demonstrates the effectiveness of each power reduction technique on both an inverter chain-based calculation model and through simulation of a 20-bit adder circuit. A typical variable-throughput system, an inverse discrete cosine transformer for an MPEG decoder, is also designed for hardware demonstration of the effectiveness of supply and threshold voltage control.
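The balance between active and static power can be illustrated with a normalized toy model. Everything here is an assumption made for illustration: an alpha-power-law style delay expression, a dynamic term proportional to activity, capacitance, supply squared, and frequency, a leakage term that falls exponentially with threshold, and all parameter values and function names. It is not the calculation model used in this research.

```python
def delay(vdd, vth, alpha=1.3, k=1.0):
    """Normalized gate delay: grows as the supply approaches the threshold."""
    return k * vdd / (vdd - vth) ** alpha


def power(vdd, vth, freq, act=0.1, cap=1.0, i0=1.0, swing=0.1):
    """Normalized total power = dynamic (switching) + static (leakage)."""
    dynamic = act * cap * vdd ** 2 * freq
    leakage = vdd * i0 * 10 ** (-vth / swing)   # subthreshold-style leakage
    return dynamic + leakage


def best_operating_point(freq, vdd_grid, vth_grid):
    """Grid-search the (supply, threshold) pair with the lowest power that
    still meets the required throughput, i.e. delay <= 1 / freq."""
    best = None
    for vdd in vdd_grid:
        for vth in vth_grid:
            if vdd <= vth or delay(vdd, vth) > 1.0 / freq:
                continue                         # too slow, or not switching
            p = power(vdd, vth, freq)
            if best is None or p < best[0]:
                best = (p, vdd, vth)
    return best


vdd_grid = [0.4 + 0.05 * i for i in range(17)]   # 0.40 .. 1.20 (normalized volts)
vth_grid = [0.1 + 0.05 * i for i in range(9)]    # 0.10 .. 0.50
freq = 0.36                                      # roughly half the nominal peak
best_power, best_vdd, best_vth = best_operating_point(freq, vdd_grid, vth_grid)
nominal_power = power(1.2, 0.3, freq)
```

Lowering only the supply cuts the dynamic term but leaves leakage set by the threshold, while raising only the threshold cuts leakage but slows the circuit. Searching both together is a simple stand-in for the simultaneous supply and threshold control discussed above.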

- [1] T. D. Burd et al., "A Dynamic Voltage Scaled Microprocessor System," *IEEE J. Solid-State Circuits,* Vol. 35, No. 11, November 2000.
- [2] A. Chandrakasan et al., "Data Driven Signal Processing: An Approach for Energy Efficient Computing," *Proc. Int. Symp. Low Power Electronics and Design,* Monterey, CA, August 1996.
- [3] K. Nose et al., "VTH-Hopping Scheme for 82% Power Saving in Low-Voltage Processors," *Proc. Custom Integrated Circuits Conf.,* San Diego, CA, May 2001.

(Professor Borivoje Nikolic)

Simple, accurate short-channel MOSFET current and delay models are useful in low-power digital design for rapidly evaluating the effect of changing transistor width, supply voltage, and threshold voltage. As device channel lengths have scaled, effects such as mobility degradation and velocity saturation have made the Shockley square-law model insufficient for accurate characterization. Two accurate short-channel current models have been presented in [1] and [2]. In this research, it is shown that the second model can be simplified so that only three extracted parameters are necessary to model the velocity saturation and mobility degradation behavior, covering both the triode and saturation operating regions, and that these parameters can be easily extracted from transistor I-V curves. This simplified model is demonstrated on both a commercial 0.13 µm process technology and a simulated 20 nm FinFET technology. Furthermore, it is demonstrated that the form of both models can be used in delay expressions that accurately capture inverter delays across a range of supply voltages and fan-outs.

- [1] T. Sakurai and A. Newton, "Alpha-Power Law MOSFET Model and Its Applications to CMOS Inverter Delay and Other Formulas," *IEEE J. Solid-State Circuits,* Vol. 25, No. 2, April 1990.
- [2] C. Sodini, P. Ko, and J. Moll, "The Effect of High Fields on MOS Device and Circuit Performance," *IEEE Trans. Electron Devices,* Vol. 31, No. 10, October 1984.

(Professor Borivoje Nikolic)

MARCO

We are investigating the performance and power-area-delay tradeoffs for CMOS arithmetic circuits in deep submicron technologies. This exploration is done on an example of high-performance 64-bit adders. A number of high-speed adder designs have been reported that increase speed and reduce power by: (1) architectural or logic transformations of carry look-ahead equations, and (2) advanced circuit styles in combination with advanced timing methodologies. The goal of this project is to determine minimum achievable delays for given adder topologies for varying output loads, to minimize the delay under energy and area constraints and to minimize energy and/or area under delay constraints. The main design knobs are gate sizes, supply voltage, and transistor threshold voltage. Furthermore, an optimum adder topology will be found for the given set of constraints. This goal is accomplished through a common method that allows performance comparison between different adder architectures (topologies) in early phases of the design. The methodology can be extended to optimization of various digital building blocks in the energy-delay space.

(Professor Borivoje Nikolic)

MARCO

As the continuous miniaturization of solid-state devices increases the chip operating frequency and circuit density, it presents the circuit designer with a slew of new problems concerning the optimal design and robustness of high-speed circuits. Among these problems are increasingly pressing device variation and matching concerns, and the fact that an increasing fraction of the clock cycle is required for non-computational tasks such as clock skew/jitter compensation and latch/flip-flop setup and hold times. One goal of this project is to examine the effect of increasing device variation on the components of high-speed link transceivers used for inter-chip communication, such as the interleaved sampling front ends, clock recovery, and decision circuits. In addition, we are analytically investigating the overall optimization of master-slave flip-flops based on the concept of a "sampling function" to obtain minimal setup and hold times, as a complement to time-consuming optimization via simulation.