# Communication-Avoiding QR Decomposition for GPUs

### Michael Anderson, Grey Ballard, James Demmel and Kurt Keutzer

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2010-131

October 8, 2010

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.pdf

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than if we use Intel's Math Kernel Library (MKL) QR factorization.

BibTeX citation:

@techreport{Anderson:EECS-2010-131, Author = {Anderson, Michael and Ballard, Grey and Demmel, James and Keutzer, Kurt}, Title = {Communication-Avoiding QR Decomposition for GPUs}, Institution = {EECS Department, University of California, Berkeley}, Year = {2010}, Month = {Oct}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html}, Number = {UCB/EECS-2010-131}, Abstract = {We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than if we use Intel's Math Kernel Library (MKL) QR factorization.} }

EndNote citation:

%0 Report %A Anderson, Michael %A Ballard, Grey %A Demmel, James %A Keutzer, Kurt %T Communication-Avoiding QR Decomposition for GPUs %I EECS Department, University of California, Berkeley %D 2010 %8 October 8 %@ UCB/EECS-2010-131 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html %F Anderson:EECS-2010-131