My thesis project involves rearranging Krylov subspace methods, such as GMRES and conjugate gradient iteration, in order to exploit locality and avoid communication in the parallel and out-of-core regimes. Trends in hardware over the past few years show that communication latency improves much more slowly than communication bandwidth, which in turn improves much more slowly than floating-point arithmetic performance. In general,
Flops are cheap, bandwidth is money, latency is physics.This means that it's much easier and cheaper to put a lot of transistors on a chip to do arithmetic quickly, than it is to improve bandwidth (add a lot of wires, pipes, banks, or disks in a RAID array, depending on the system) or latency (which has strict and practically reachable physical limits such as the speed of light). This means that rearranging algorithms to avoid communication can often improve performance, even if it requires redundant computation. We've applied this observation to Krylov subspace methods by rearranging them so that several steps of the method can be performed at the same time, for the same latency cost as a single step of the original algorithm. Performance tuning involves a subtle balance between latency, bandwidth, redundant computation, and numerical accuracy. We've also worked out new preconditioning techniques that may make our algorithms practically useful for a wide class of problems. Other related projects include parallel and out-of-core QR factorization algorithms.
I'm interested in numerical linear algebra, performance tuning, programming languages (especially parallel and embedded languages), and software engineering issues for scientific computing -- especially when it involves using ideas from higher-level programming languages. I have background in the numerical solution of ordinary and partial differential equations, as well as numerical optimization.
Mathematics knows no races or geographic boundaries; for mathematics, the cultural world is one country.
-- David Hilbert