(Professors James W. Demmel and Katherine A. Yelick)

(DOE) DE-FG03-94ER25219, (DOE) DE-FC03-98ER25351, and (LLNL/DOE) W-7405-ENG-48

The overall performance in a variety of scientific computing and information retrieval applications is dominated by a few computational kernels. One important class of such operations are sparse matrix kernels--computations with matrices having relatively few non-zero entries. Performance tuning of sparse kernels is a particularly tedious and time-consuming task because performance is a complicated function of the kernel, machine, and non-zero structure of the matrix: for every machine, a user must carefully choose a data structure and implementation (code and transformations) that minimize the matrix storage while achieving the best possible performance.

The goal of this research project is to generate implementations of particular sparse kernels tuned to a given matrix and machine. Our work builds on Sparsity [1], an early successful prototype for sparse matrix-vector multiply (y = A*x, where A is a sparse matrix and x, y, are dense vectors), in the following ways:

- Architecture-specific performance bounds: by careful analysis, we have shown that the code generated by the Sparsity system is often within 20% of the fastest possible. This places a limit on what additional low-level tuning (e.g., instruction scheduling) will do, and helps identify new opportunities for performance improvements [2]. We have applied a similar analysis to sparse triangular solve [3].
- New optimization techniques: we are currently exploring a large space of techniques to exploit a variety of matrix structures (symmetry, diagonals, bands, variable blocks, etc.) and to increase locality (reordering to create dense structure, cache blocking, multiple vectors, etc.). We will develop automatic techniques for deciding when and in what combinations to apply these methods.
- Extensions to higher-level kernels: we are
exploring new sparse kernels, such as multiplying by A and A
^{T}simultaneously, A*A^{T}*x (for interior point methods and the SVD), R*A*R^{T}(A and R are sparse; used in multigrid methods), and A^{k}*x, among others. - Implementation of an automatically tuned sparse matrix library: we will make this work available as an implementation of the new Sparse BLAS standard [4], augmented by a single routine to facilitate automatic tuning.

This research is being conducted by members of the Berkeley Benchmarking and OPtimization (BeBOP) project [5].

- [1]
- E.-J. Im, "Optimizing the Performance of Sparse Matrix-Vector Multiplication," PhD thesis, UC Berkeley, May 2000.
- [2]
- R. Vuduc, J. Demmel, K. Yelick, et al., "Performance
Optimizations and Bounds for Sparse-Matrix Vector Multiply,"
*Supercomputing,*November 2002. - [3]
- R. Vuduc, S. Kamil, et al., "Automatic Performance Tuning
and Analysis of Sparse Triangular Solve,"
*Int. Conf. Supercomputing, Workshop on Performance Optimization of High-level Languages and Libraries,*June 2002. - [4]
- BLAS Technical Forum: http://www.netlib.org/ blas/blast-forum.
- [5]
- The BeBOP Homepage: http://www.cs.berkeley. edu/~richie/bebop.

Edit this abstract