My thesis project involves developing new algorithms in sparse and dense linear algebra,
with the goal of avoiding communication. Communication includes moving data
between processors (the "parallel" case) and moving data between levels of the
memory hierarchy (the "sequential" case). For several decades, communication
performance has improved exponentially more slowly than arithmetic performance.
Current hardware trends, such as multicore processors and off-chip compute
accelerators, only exacerbate existing limitations in memory and bus bandwidth.
In fact, we cannot expect new developments in hardware to solve the problem:
flops are cheap, bandwidth is money, latency is physics.
Floating-point (and integer) computational performance scales with the number
of transistors on a chip. Bandwidth scales with the cost of hardware (number
of wires, pipes, banks, disks, ...) and with energy consumption. Latency is the
hardest to improve, because it's often limited by physical laws (even the speed
of light is computationally significant).
These hard facts call for new algorithms that work with the physical and
economic laws of hardware. Rearranging algorithms to avoid communication can
often improve performance, even if it requires redundant computation. I've
applied this observation to iterative methods for sparse linear systems and
eigenvalue problems, by rearranging the algorithms so that several steps of the
method can be performed at the same time, for the same latency cost as a single
step of the original algorithm. Performance tuning involves a subtle balance
between latency, bandwidth, redundant computation, and numerical accuracy.
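To illustrate the rearrangement (a minimal NumPy/SciPy sketch, not the actual communication-avoiding kernel; the function name `s_step_arnoldi_basis` and the toy matrix are illustrative): a standard Krylov method performs one sparse matrix-vector product and one latency-bound orthogonalization per step, while an s-step variant generates s basis vectors first and then orthogonalizes them all at once with a single block QR.

```python
import numpy as np
import scipy.sparse as sp

def s_step_arnoldi_basis(A, v, s):
    """Generate s+1 monomial Krylov basis vectors [v, Av, ..., A^s v]
    with s sparse matrix-vector products, then orthogonalize them all
    at once with one block QR, instead of computing latency-bound
    inner products after every matrix-vector product."""
    n = A.shape[0]
    V = np.empty((n, s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):           # s SpMVs, no inner products yet
        V[:, j + 1] = A @ V[:, j]
    Q, R = np.linalg.qr(V)       # one block orthogonalization
    return Q, R

# Toy example: 1-D Laplacian
n = 64
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
Q, R = s_step_arnoldi_basis(A, np.ones(n), 4)
```

In the real algorithms, a matrix powers kernel also computes the s products themselves in a single round of communication, at the cost of some redundant computation on overlapping subdomains, and a better-conditioned basis (e.g., Newton or Chebyshev) replaces the monomial one.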
I've also worked out new preconditioning techniques that may make our algorithms
practically useful for a wide class of problems. On the dense side, I've helped
develop and implement a new QR factorization that communicates asymptotically
less than existing algorithms in LAPACK and ScaLAPACK, and have helped prove
lower bounds on the amount of communication a QR factorization must perform.
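A serial NumPy sketch of the block-row reduction idea behind such a "tall skinny" QR (the `tsqr` function below is a toy, not the LAPACK or ScaLAPACK code): factor each block row locally, then factor the stacked R factors once. A parallel version then needs only one reduction over small R factors, rather than one message round per column.

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Tall Skinny QR by block-row reduction: local QR of each block
    row, then one QR of the stacked R factors.  Assumes each block
    has at least as many rows as A has columns."""
    m, n = A.shape
    blocks = np.array_split(A, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))  # local QRs
    Q2, R = np.linalg.qr(np.vstack(Rs))               # combine step
    # Assemble the explicit Q (distributed codes often keep Q implicit).
    Q = np.vstack([Qi @ Q2[i * n:(i + 1) * n] for i, Qi in enumerate(Qs)])
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
Q, R = tsqr(A)
```

Each local Q has orthonormal columns, so the assembled Q does too, and Q @ R reproduces A; only the small R factors ever need to be communicated.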
Publications and talks
- ``Towards extreme-scale simulations with next-generation
Trilinos: a low Mach fluid application case study.'' Paul Lin,
Matthew Bettencourt, Stefan Domino, Travis Fisher, Mark Hoemmen, et
al. Accepted to the Workshop on Large-Scale Parallel Processing.
- ``Exploiting data representation for fault tolerance.'' James
Elliott, Mark Hoemmen, and Frank Mueller, January 2014. Preprint available online.
- ``Evaluating the impact of SDC on the GMRES iterative solver.''
James Elliott, Mark Hoemmen, and Frank Mueller. Accepted to the
28th IEEE International Parallel and Distributed Processing
Symposium (IPDPS), May 2014. Preprint available online.
- ``Improving the Performance of CA-GMRES on Multicores with
Multiple GPUs.'' Ichitaro Yamazaki, Hartwig Anzt, Stanimire Tomov,
Mark Hoemmen, and Jack Dongarra. Accepted to the 28th IEEE
International Parallel and Distributed Processing Symposium (IPDPS), May 2014.
- ``Fault-tolerant iterative methods via selective reliability.''
Patrick G. Bridges, Kurt B. Ferreira, Michael A. Heroux, and Mark
Hoemmen. Sandia National Laboratories Technical Report, June 2012.
Preprint available online.
- ``Amesos2 and Belos: Direct and iterative solvers for large
sparse linear systems.'' Eric Bavier, Mark Hoemmen, Sivasankaran
Rajamanickam, and Heidi Thornquist. Scientific Programming, 2012.
- ``Communication-optimal parallel and sequential QR and LU
factorizations.'' James Demmel, Laura Grigori, Mark Hoemmen, and
Julien Langou. SIAM Journal on Scientific Computing, Volume 34,
Issue 1, 2012. See also UC Berkeley technical report
(UCB/EECS-2008-89) and LAPACK Working Note #204, May 2008.
- ``Asynchronous, performance-portable Krylov methods on
accelerators.'' Mark Hoemmen and Kurtis Nusbaum. Sandia National
Laboratories Technical Report, January 2012.
- ``Cooperative Application/OS DRAM Fault Recovery.'' Patrick
G. Bridges, Michael A. Heroux, Mark Hoemmen, Kurt Ferreira, Philip
Soltero, and Ronald B. Brightwell. Workshop on Resiliency in
High-Performance Computing (Resilience 2011) in conjunction with the
17th International European Conference on Parallel and Distributed
Computing (Euro-Par 2011), Bordeaux, France, 29 August - 02 September 2011.
- ``A communication-avoiding, hybrid-parallel, rank-revealing
orthogonalization method.'' Mark Hoemmen. IEEE International
Parallel and Distributed Processing Symposium, May 2011.
- ``Communication-avoiding Krylov subspace methods.'' Mark
Hoemmen. PhD thesis, University of California, Berkeley, EECS
Department, March 2010. Also available as a UC Berkeley technical report.
- ``Minimizing Communication in Sparse Matrix Solvers.'' James
Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick.
- ``Non-negative diagonals and high performance on low-profile
matrices from Householder QR.'' James Demmel, Mark Hoemmen, Yozo
Hida, and E. Jason Riedy. SIAM Journal on Scientific Computing,
Volume 31, Issue 4, pp. 2832-2841, July 2009. (See also UC Berkeley
technical report (UCB/EECS-2008-76) and LAPACK Working Note #203.)
- ``Avoiding Communication in Sparse Matrix Computations.'' James
Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine
Yelick. IEEE International Parallel and Distributed Processing
Symposium, April 2008.
- ``Avoiding communication in computing Krylov subspaces.'' James
Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick. UC
Berkeley technical report (UCB/EECS-2007-123), October 2007.
- ``Benchmarking sparse matrix-vector multiply in five
minutes.'' With Hormozd Gahvari. SPEC Benchmark Workshop 2007 (Austin, TX).
I'm interested in numerical linear algebra, performance tuning,
programming languages (especially parallel and embedded languages), and
software engineering for scientific computing -- especially when it
involves ideas from higher-level programming languages. I have a
background in the numerical solution of ordinary and partial differential
equations, as well as in numerical optimization.
Mathematics knows no races or geographic boundaries; for mathematics,
the cultural world is one country.
-- David Hilbert
Last updated 09 July 2009.