Communication-avoiding parallel and sequential QR factorizations
THIS REPORT HAS BEEN WITHDRAWN
James Demmel, Laura Grigori, Mark Frederick Hoemmen and Julien Langou
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2008-74
May 29, 2008
We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case), and data movement between slow and fast memory (in either the sequential or parallel cases).
Our first algorithm, Tall Skinny QR (TSQR), factors $m \times n$ matrices in a one-dimensional (1-D) block cyclic row layout, storing the $Q$ factor (if desired) implictly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires no more than the minimum number of messages $\Theta(\log P)$ between $P$ processors. In the sequential case, TSQR transfers $2mn + o(mn)$ words between slow and fast memory, which is the theoretical lower bound, and performs $\Theta(mn/W)$ block reads and writes (as a function of the fast memory size $W$), which is within a constant factor of the theoretical lower bound. In contrast, the conventional parallel algorithm as implemented in ScaLAPACK requires $\Theta(n \log P)$ messages, a factor of $n$ times more, and the analogous sequential algorithm transfers $\Theta(mn^2)$ words between slow and fast memory, also a factor of $n$ times more. TSQR only uses orthogonal transforms, so it is just as stable as standard Householder QR. Both parallel and sequential performance results show that TSQR outperforms competing methods.
Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization, which both remove a latency bottleneck in ScaLAPACK's current parallel approach, and both bandwidth and latency bottlenecks in ScaLAPACK's out-of-core QR factorization. CAQR achieves modeled speedups of $2.1\times$ on an IBM POWER5 cluster, $3.0\times$ on a future petascale machine, and $3.8\times$ on the Grid.
Author Comments: Replaced by UC Berkeley EECS tech report EECS-2008-89; see http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-89.html
