Automatic Performance Tuning of Sparse Matrix Kernels
Richard Vuduc, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee, and Attila Gyulassy
(Professors James W. Demmel and Katherine A. Yelick)
(DOE) DE-FG03-94ER25219, (DOE) DE-FC03-98ER25351, and (LLNL/DOE) W-7405-ENG-48
The overall performance in a variety of scientific computing and
information retrieval applications is dominated by a few
computational kernels. One important class of such operations
consists of sparse matrix kernels--computations with matrices having
relatively few non-zero entries. Performance tuning of sparse kernels
is a particularly tedious and time-consuming task because performance
is a complicated function of the kernel, machine, and non-zero
structure of the matrix: for every machine, a user must carefully
choose a data structure and implementation (code and transformations)
that minimize the matrix storage while achieving the best possible
performance.
The goal of this research project is to generate implementations of
particular sparse kernels tuned to a given matrix and machine. Our
work builds on Sparsity, an early successful
prototype for sparse matrix-vector multiply (y = A*x, where A is a
sparse matrix and x and y are dense vectors), in the following ways:
- Architecture-specific performance bounds: by careful
analysis, we have shown that the code generated by the Sparsity
system is often within 20% of the fastest possible. This places
a limit on what additional low-level tuning (e.g., instruction
scheduling) will do, and helps identify new opportunities for
optimization. We have applied a similar analysis to sparse triangular solve.
- New optimization techniques: we are currently
exploring a large space of techniques to exploit a variety of matrix
structures (symmetry, diagonals, bands, variable blocks, etc.) and
to increase locality (reordering to create dense structure, cache
blocking, multiple vectors, etc.). We will develop automatic
techniques for deciding when and in what combinations to apply them.
- Extensions to higher-level kernels: we are
exploring new sparse kernels, such as multiplying by A and A^T
simultaneously, A*A^T*x (for interior point methods and the
SVD), R*A*R^T (A and R are sparse; used in multigrid methods),
and A^k*x, among others.
- Implementation of an automatically tuned sparse matrix library:
we will make this work available as an implementation of the new
Sparse BLAS standard, augmented by a single routine
to facilitate automatic tuning.
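
To make the kernels named above concrete, here is a minimal Python sketch of the baseline sparse matrix-vector multiply (y = A*x) and of one higher-level kernel from the list, multiplying by A and A^T simultaneously. It assumes the matrix is stored in compressed sparse row (CSR) form, which is an assumption made for illustration. The function names and the small test matrix are hypothetical, and this is not the code that Sparsity generates.

```python
# Illustrative sketch only: CSR storage and all names are assumptions,
# not the data structure or code produced by the Sparsity system.

def csr_spmv(n_rows, row_ptr, col_idx, vals, x):
    """Baseline y = A*x for A stored in compressed sparse row (CSR) form."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Non-zeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def csr_spmv_and_transpose(n_rows, n_cols, row_ptr, col_idx, vals, x, w):
    """Compute y = A*x and z = A^T*w in a single pass over the non-zeros."""
    y = [0.0] * n_rows
    z = [0.0] * n_cols
    for i in range(n_rows):
        acc = 0.0
        wi = w[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = vals[k]          # each stored entry is read once...
            acc += a * x[j]      # ...and used for the A*x product
            z[j] += a * wi       # ...and for the A^T*w product
        y[i] = acc
    return y, z


# Example matrix: A = [[1, 0, 2],
#                      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0]
w = [1.0, 2.0]

y = csr_spmv(2, row_ptr, col_idx, vals, x)
y2, z = csr_spmv_and_transpose(2, 3, row_ptr, col_idx, vals, x, w)
```

The second function reads each stored non-zero once and uses it for both products, which is the kind of locality gain that motivates fusing the two multiplies into one kernel.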
This research is being conducted by members of the Berkeley
Benchmarking and OPtimization (BeBOP) project.
- E.-J. Im, "Optimizing the Performance of Sparse Matrix-Vector
Multiplication," PhD thesis, UC Berkeley, May 2000.
- R. Vuduc, J. Demmel, K. Yelick, et al., "Performance
Optimizations and Bounds for Sparse Matrix-Vector Multiply,"
Supercomputing, November 2002.
- R. Vuduc, S. Kamil, et al., "Automatic Performance Tuning
and Analysis of Sparse Triangular Solve," Int. Conf.
Supercomputing, Workshop on Performance Optimization of High-level
Languages and Libraries, June 2002.
- BLAS Technical Forum: http://www.netlib.org/
- The BeBOP Homepage: http://www.cs.berkeley.
More information: http://www.cs.berkeley.edu/~richie/bebop