## <u>THE SEARCH FOR A PRACTICAL ITERATIVE DETECTOR FOR MAGNETIC RECORDING</u>

Rob Lynch<sup>1</sup>, Erozan Kurtas<sup>2</sup>, Alex Kuznetsov<sup>2</sup>, Engling Yeo<sup>3</sup> and Borivoje Nikolić<sup>3</sup>

<sup>1</sup>Seagate Technology, Fremont, CA; <sup>2</sup>Seagate Research, Pittsburgh, PA; <sup>3</sup>University of California, Berkeley, CA

As the difficulty in increasing areal recording densities rises, more attention is given to improvements available from advanced signal processing. A promising technique to achieve SNR improvement is the use of iterative decoding. Initial investigations of turbo codes for applications in recording channels have created a great deal of interest among researchers in both academia and industry over the past few years. A large number of publications have appeared in the areas of code design and ultimate code performance, but somewhat less attention has been paid to decoding architectures or to implementation and system issues. Although iterative decoders promise large gains over conventional PRML systems, they have not been used in commercial applications so far.

Our thesis is that this situation is due, at least in part, to the difficulty in finding the optimum trade-off between performance and complexity/cost. Disk drive read/write channels have traditionally been very cost sensitive. The area of the silicon chip directly dictates its price, while power dissipation has to be low enough to allow for inexpensive packaging and system cooling. Historically, CMOS scaling has been used to allow more advanced signal processing at increased decoding speed, while the power and area stay roughly the same in each new technology generation. Thus, the challenge is to find the algorithm that achieves the best SNR performance at reasonably high speed while staying within these constraints.

We screen the large number of choices initially available based on some quick, coarse analysis. For example, both BCJR and SOVA algorithms have been considered for the inner (channel) detector. It is generally recognized that any performance benefit that the BCJR possesses is more than offset by the large complexity penalty, so we consider here only the SOVA option. Iterative detectors based on convolutional codes are also recognized to be too complex to justify the extra performance they offer, so we limit this discussion to those based on simple parity checks:

- Random Low-Density-Parity-Check (LDPC) code with rate 8/9 and column weight 3
- Turbo Product Code (TPC) based on single parity checks, with sixteen 16×16 matrices, a precoder and a random interleaver
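As an illustration of the single-parity-check building block behind the TPC option, the following minimal sketch appends an even-parity bit to every row and every column of a binary data block. The function name `spc_product_encode` and the 3×3 example are hypothetical, and the precoder and random interleaver are not modelled. The text does not state whether the 16×16 dimensions count the parity bits, so the sketch works for any block size.

```python
def spc_product_encode(data):
    """Return the data block with an even-parity bit appended to each
    row and each column (single-parity-check product code).

    `data` is a list of equal-length rows of 0/1 values. The name and
    interface are illustrative assumptions, not taken from the paper.
    """
    # Append one parity bit to every row.
    rows = [list(r) + [sum(r) % 2] for r in data]
    # Append one parity row covering every column, including the
    # row-parity column (this gives the parity-on-parity bit).
    parity_row = [sum(col) % 2 for col in zip(*rows)]
    return rows + [parity_row]


# Small hypothetical example: a 3x3 data block becomes a 4x4 codeword.
codeword = spc_product_encode([[1, 0, 1],
                               [0, 1, 1],
                               [1, 1, 1]])
for row in codeword:
    print(row)
```

Every row and every column of the resulting block has even parity, which is the only constraint the soft decoder needs to check.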

In order to place the results in context, we also include performance and complexity estimates for a simple 16-state Viterbi detector and for an advanced PRML channel with noise prediction in the 32-state Viterbi trellis and parity-based post-processing (NPML-PP), as described in [1]. Noise-predictive decoders with parity post-processing can also be used for complexity comparison: at the point of their introduction they occupied 1-2mm<sup>2</sup> of silicon area while dissipating less than 500mW.

The iterative decoding algorithm is similar for the TPC and LDPC codes. The SOVA algorithm is used on the channel trellis, and the Message Passing Algorithm (MPA) is used for soft decoding of the TPC and LDPC codes. Decoding proceeds by iterating between the inner channel decoder and the outer decoder. The LDPC decoder performs four internal (bit-to-check plus check-to-bit) iterations, while the TPC decoder performs only one internal (rows followed by columns) iteration prior to looping back to supply the channel decoder with extrinsic information.
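The "rows followed by columns" pass of the TPC decoder can be sketched for one block of single parity checks. This is a minimal floating-point sketch under several assumptions that the text does not confirm: the check update uses the exact tanh rule (a hardware decoder may use an approximation and fixed-point words), log-likelihood ratios are defined as log P(0)/P(1), and the names `spc_extrinsic` and `tpc_iteration` are hypothetical.

```python
import math


def spc_extrinsic(llrs):
    """Extrinsic LLR of each bit in one even-parity check (tanh rule).

    Assumes LLR = log P(bit=0) / P(bit=1); names are illustrative.
    """
    t = [math.tanh(l / 2.0) for l in llrs]
    limit = 1.0 - 1e-12  # keeps atanh finite for very reliable inputs
    out = []
    for i in range(len(llrs)):
        prod = 1.0
        for j, tj in enumerate(t):
            if j != i:  # exclude the bit's own input (extrinsic only)
                prod *= tj
        prod = max(-limit, min(limit, prod))
        out.append(2.0 * math.atanh(prod))
    return out


def tpc_iteration(channel_llrs):
    """One internal iteration on a block: all rows, then all columns.

    Returns the combined row + column extrinsic LLR for each bit.
    """
    n_rows, n_cols = len(channel_llrs), len(channel_llrs[0])
    row_ext = [spc_extrinsic(row) for row in channel_llrs]
    # Column checks see the channel LLR plus the row extrinsic.
    col_ext = [
        spc_extrinsic([channel_llrs[r][c] + row_ext[r][c]
                       for r in range(n_rows)])
        for c in range(n_cols)
    ]
    return [[row_ext[r][c] + col_ext[c][r] for c in range(n_cols)]
            for r in range(n_rows)]


# Hypothetical 2x2 block: the all-zero codeword with one unreliable bit.
channel = [[4.0, 4.0],
           [4.0, -1.0]]
extrinsic = tpc_iteration(channel)
posterior = [[channel[r][c] + extrinsic[r][c] for c in range(2)]
             for r in range(2)]
```

In this small example the weak negative LLR is outvoted by its row and column neighbours, so its posterior becomes positive. In the full detector the extrinsic values would be passed back to the SOVA channel decoder for the next outer iteration.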

The performance analysis results are summarized in Fig. 1. Using BER = $10^{-5}$ as a reference, the iterative detectors (LDPC and TPC) give an SNR benefit of 4.7dB over the simple Viterbi and 3.2dB over the sophisticated NPML-PP detector. Clearly it is not possible to operate a real system close to the very steep cliff observed for the iterative detectors, so we cannot avail ourselves of the full 3.2dB. The TPC shows evidence of a change to a shallower slope, which further simulations have confirmed continues below $10^{-7}$. On the other hand, we have not found such a slope change for the random LDPC code, even for BER less than $10^{-8}$. From the point of view of performance, then, we look first at the LDPC system.

Unfortunately, the large SNR gain of the LDPC detector comes with a high price in chip area or power. The irregular structure of random LDPC codes dictates only two possible architectures: serial and fully parallel. Parallel implementation of the LDPC decoder directly maps the complete message passing graph onto the silicon [3]. In the present example this consists of 4608 bit nodes, 512 check nodes and the interconnect network for passing the messages. The area of this implementation is estimated to be almost $100 \text{mm}^2$ in current $0.13 \mu \text{m}$ CMOS technology. This is impractically large, although this amount of parallelism would yield low clock frequencies and relatively low power: about 250mW for 5 iterations of decoding. At the opposite end of the spectrum, a serial decoder would use only one processing element for all message-passing calculations, running at the symbol rate. The bottleneck in the serial architecture is the memory, which has to store soft information for all the messages (13824 words, each 5 bits wide). Furthermore, it has to provide 3 operands to the bit nodes and 27 operands to the check nodes in each cycle, which would limit the operation to around 100Mb/s. Also, the difficulty in realizing a multi-port memory means the total latency will be as much as 10 sector times.

<sup>1</sup> Rob Lynch, Seagate Technology, 47010 Kato Rd, Fremont, CA 94538; 510-624-3533, Fax 510-624-3587; rob.lynch@seagate.com
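The message count, check-node degree and code rate quoted for the LDPC graph follow directly from the node counts and the column weight. The short sketch below reproduces that arithmetic. The variable names are illustrative, the rate assumes a full-rank parity-check matrix, and the memory total assumes one stored word per graph edge, as the figure of 13824 words suggests.

```python
from fractions import Fraction

n_bits = 4608      # bit nodes (from the text)
n_checks = 512     # check nodes (from the text)
col_weight = 3     # edges per bit node (column weight, from the text)
word_bits = 5      # width of each stored message (from the text)

edges = n_bits * col_weight                 # messages in the graph
row_weight = edges // n_checks              # operands per check node
user_bits = n_bits - n_checks               # assumes full-rank checks
rate = Fraction(user_bits, n_bits)          # code rate
memory_bits = edges * word_bits             # assumes one word per edge

print(edges, row_weight, user_bits, rate, memory_bits)
```

The results are 13824 messages, 27 operands per check node, 4096 user bits, rate 8/9 and 69120 bits of message memory, matching the sector size and code rate given earlier.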

Structured LDPC codes allow partitioning the large message passing graph into several smaller graphs. TPC codes are an extreme example of this. In the example chosen here, the sector memory is divided into sixteen 16×16 blocks, and the MPA can be implemented for each of these blocks using a single processing element. This allows the use of single-port memories in the interleavers, with a small size of about 0.2mm<sup>2</sup>. Figure 2 summarizes the area-speed tradeoff for the TPC decoder with 1, 2 and 5 iterations and compares it to a commonly used 32-state NPML detector [4]. Figure 3 presents the power-delay tradeoff. The curves in Figures 2 and 3 show that the implementation of high-speed iterative decoders in 0.13µm technology is not feasible. Decoders running at lower speeds (below 500Mb/s) would still be several times larger than NPML decoders. However, implementation in 90nm CMOS technology would lower the area and power requirements by a factor of two. This, together with the 40% improvement in speed, would make high-throughput iterative decoders feasible.

## REFERENCES

1. R. D. Cideciyan, J. D. Coker, E. Eleftheriou, R. L. Galbraith, "Noise Predictive Maximum Likelihood Detection Combined with Parity-Based Post-Processing," *IEEE Trans. Magnetics*, Vol. 37, No. 2, pp. 714-720, March 2001.
2. H. Sawaguchi, S. Mita, and J. Wolf, "A Concatenated Coding Technique for Partial Response Channels," *IEEE Trans. Magnetics*, Vol. 37, No. 2, pp. 695-703, March 2001.
3. A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s 1024-bit rate-1/2 low density parity check code decoder," in *Proc. IEEE CICC*, 2001, pp. 293-296.
4. N. Nazari, "A 500 Mb/s disk drive read channel . . .," ISSCC 2000, pp. 78-79.



Fig. 1. BER vs. input SNR for several detectors. The channel model is a Lorentzian with user density 2.5 and AWGN. Sector size is 4096 user bits for all detectors. The simulated read signal is equalized (in an LMS sense) to the target [5 4 -3 -4 -2] [2].



Fig. 2. Area-speed trade-off for TPC architecture in $0.13 \mu \text{m}$ technology.



Fig. 3. Power-speed trade-off for TPC architecture in $0.13 \mu \text{m}$ technology.