Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply

Benjamin C. Lee, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick, Michael de Lorimier and Lijue Zhong

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-03-1297
2003

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2003/CSD-03-1297.pdf

We present performance optimizations and detailed performance analyses of sparse matrix-vector multiply (SpMV) and its generalization to multiple vectors, SpMM, when the matrix is symmetric. In addition to the traditional benefit of reduced storage from symmetry, we show considerable performance benefits are possible on modern architectures where the cost of memory access dominates the cost of other operations. We analyze the effects of (1) symmetric storage, (2) register-level blocked matrix storage, and (3) register-level vector blocking for multiple vectors, when applied individually and in conjunction. Compared to the best register blocked implementations that ignore symmetry, we show 2.1x speedups for SpMV and 2.6x speedups for SpMM. Compared to the most naive implementation, in which none of the three optimizations are applied, our best implementations are up to 9.9x faster for a dense matrix in sparse format and up to 7.3x faster for a true sparse matrix. We evaluate our implementations with respect to upper bounds on their absolute performance in Mflop/s. Our performance model, which extends our prior bounds for SpMV in the non-symmetric case, bounds performance by considering only the cost of memory operations and by using lower bounds on cache misses. On four different platforms and a suite of twelve symmetric matrices spanning a variety of applications, we find our implementations are within 60% of the upper bounds on average. This fraction is smaller than what we have previously observed for non-symmetric SpMV, suggesting two possible avenues for future work: (1) additional refinements to the performance bounds that explicitly model low-level code generation (e.g., register pressure or instruction scheduling/selection/bandwidth issues), and (2) a new opportunity to apply automated low-level tuning techniques in the spirit of ATLAS/PHiPAC systems to symmetric sparse kernels.


BibTeX citation:

@techreport{Lee:CSD-03-1297,
    Author = {Lee, Benjamin C. and Vuduc, Richard W. and Demmel, James W. and Yelick, Katherine A. and de Lorimier, Michael and Zhong, Lijue},
    Title = {Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2003},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2003/6361.html},
    Number = {UCB/CSD-03-1297},
    Abstract = {We present performance optimizations and detailed performance analyses of sparse matrix-vector multiply (SpMV) and its generalization to multiple vectors, SpMM, when the matrix is symmetric. In addition to the traditional benefit of reduced storage from symmetry, we show considerable performance benefits are possible on modern architectures where the cost of memory access dominates the cost of other operations. We analyze the effects of (1) symmetric storage, (2) register-level blocked matrix storage, and (3) register-level vector blocking for multiple vectors, when applied individually and in conjunction. Compared to the best register blocked implementations that ignore symmetry, we show 2.1x speedups for SpMV and 2.6x speedups for SpMM. Compared to the most naive implementation, in which none of the three optimizations are applied, our best implementations are up to 9.9x faster for a dense matrix in sparse format and up to 7.3x faster for a true sparse matrix. We evaluate our implementations with respect to upper bounds on their absolute performance in Mflop/s. Our performance model, which extends our prior bounds for SpMV in the non-symmetric case, bounds performance by considering only the cost of memory operations and by using lower bounds on cache misses. On four different platforms and a suite of twelve symmetric matrices spanning a variety of applications, we find our implementations are within 60% of the upper bounds on average. This fraction is smaller than what we have previously observed for non-symmetric SpMV, suggesting two possible avenues for future work: (1) additional refinements to the performance bounds that explicitly model low-level code generation (e.g., register pressure or instruction scheduling/selection/bandwidth issues), and (2) a new opportunity to apply automated low-level tuning techniques in the spirit of ATLAS/PHiPAC systems to symmetric sparse kernels.}
}

EndNote citation:

%0 Report
%A Lee, Benjamin C.
%A Vuduc, Richard W.
%A Demmel, James W.
%A Yelick, Katherine A.
%A de Lorimier, Michael
%A Zhong, Lijue
%T Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply
%I EECS Department, University of California, Berkeley
%D 2003
%@ UCB/CSD-03-1297
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2003/6361.html
%F Lee:CSD-03-1297