2.5D matrix multiplication is an adaptation of the classical 3D algorithm. 2.5D algorithms perform adaptive replication of the input data, reducing inter-processor communication while keeping memory usage under control (excessive memory usage being the main flaw of the 3D algorithm).
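To make the replication idea concrete, here is a serial NumPy sketch of the 2.5D block structure (an illustration only, not our MPI implementation): the processor grid is sqrt(p/c) x sqrt(p/c) x c, each of the c layers works on a replica of A and B, computes a disjoint subset of the block outer products, and the layers' partial results are sum-reduced. The function name and the assumption that c divides sqrt(p/c) are purely for illustration.

```python
import numpy as np

def matmul_25d(A, B, p_side, c):
    # p_side = sqrt(p/c): the 2D grid is p_side x p_side, replicated over c layers.
    # Assumes square n x n matrices with p_side | n and c | p_side (illustrative only).
    n = A.shape[0]
    b = n // p_side                  # block size per processor
    steps_per_layer = p_side // c    # each layer covers a disjoint range of k-blocks
    C = np.zeros_like(A)
    for layer in range(c):
        partial = np.zeros_like(A)   # this layer's replica of C
        for s in range(layer * steps_per_layer, (layer + 1) * steps_per_layer):
            # one block outer product: every (i, j) block accumulates A[i,s] * B[s,j]
            for i in range(p_side):
                for j in range(p_side):
                    partial[i*b:(i+1)*b, j*b:(j+1)*b] += (
                        A[i*b:(i+1)*b, s*b:(s+1)*b] @ B[s*b:(s+1)*b, j*b:(j+1)*b])
        # in the parallel algorithm this is a sum-reduction across the c layers
        C += partial
    return C
```

Because each layer performs only 1/c of the block outer-product steps, the per-processor communication volume shrinks relative to a pure 2D (c = 1) algorithm, at the cost of c-fold replication of the inputs.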

We've also shown how to extend this approach to 2.5D LU, TRSM, and Cholesky. These new 2.5D algorithms achieve an asymptotically lower inter-processor communication complexity than any previously existing algorithms. In fact, in all cases, we prove their communication optimality. Our 2011 Euro-Par paper received a Distinguished Paper award for this work.

2.5D algorithms also map nicely to modern 3D torus architectures. Our implementations have shown excellent strong scaling on the Intrepid BlueGene/P machine at Argonne National Laboratory. Careful mapping techniques allow the code to exploit topology-aware collectives. Our 2011 Supercomputing paper analyzes the implementations and mapping techniques, and develops a performance model.

The parallelization technique used in 2.5D algorithms has also been extended outside the domain of numerical linear algebra. For example, we recently demonstrated how the all-pairs shortest-paths problem can be solved with the same communication complexity as 2.5D LU using a recursive 2.5D algorithm. The theoretical analysis is again matched by scalability results on a large supercomputer, in this case the Cray XE6. For details, see our 2013 IPDPS paper.
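The recursion underlying this approach is the divide-and-conquer (Kleene) formulation of Floyd-Warshall over the (min,+) semiring: each half-sized closure and each semiring matrix product has the same structure as a step of blocked LU, which is what makes the 2.5D parallelization apply. A minimal serial sketch of that recursion (not our distributed implementation) might look like:

```python
import numpy as np

def minplus(A, B):
    # tropical (min,+) "matrix product": C[i, j] = min_k (A[i, k] + B[k, j])
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def apsp(D):
    # Recursive (Kleene) all-pairs shortest paths over the (min,+) semiring.
    # D is the weighted adjacency matrix: 0 on the diagonal, np.inf for no edge.
    n = D.shape[0]
    if n == 1:
        return D
    m = n // 2
    A, B = D[:m, :m], D[:m, m:]
    C, E = D[m:, :m], D[m:, m:]
    A = apsp(A)                      # closure of the top-left block
    B = minplus(A, B)
    C = minplus(C, A)
    E = np.minimum(E, minplus(C, B)) # Schur-complement-like update
    E = apsp(E)                      # closure of the bottom-right block
    C = minplus(E, C)
    B = minplus(B, E)
    A = np.minimum(A, minplus(B, C))
    return np.block([[A, B], [C, E]])
```

The (min,+) products dominate the cost, so the recursion performs the same asymptotic work and admits the same communication-avoiding schedule as 2.5D LU.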

This type of communication analysis has also been applied to Molecular Dynamics (MD). 1.5D parallelization of MD codes allows for a reduction of communication bandwidth and latency in exchange for increased memory usage. Our analysis and performance results for this problem will be presented at IPDPS 2013.

http://ctf.cs.berkeley.edu

State-of-the-art electronic structure calculation codes such as NWChem spend most of their compute time performing tensor contractions. The most popular distributed implementations currently use a global-memory PGAS layer (e.g., Global Arrays) for communication. However, with this approach it is difficult to exploit tensor symmetry or to maintain a well-mapped communication pattern.

We've designed a tensor parallelization framework that uses cyclic operations (hence, Cycl-ops) to compute tensor contractions. By using a cyclic decomposition, we can seamlessly take advantage of tensor symmetry and map the computation onto a structured virtual topology. The framework performs automatic topology-aware mapping of tensors of any dimension, so long as they fit into memory. A technical report preceding our 2013 IPDPS paper on this framework can be found here; the IPDPS paper is here.
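A toy illustration of why a cyclic decomposition suits symmetric (packed) tensors, using a symmetric matrix as the simplest case (this is a sketch, not CTF code): under a cyclic layout every processor receives a nearly equal share of the lower-triangular entries, whereas a blocked layout leaves the processors above the diagonal with nothing to store or compute.

```python
import numpy as np

def triangle_counts(n, p, cyclic=True):
    # Count how many entries of the lower triangle of an n x n symmetric
    # matrix land on each processor of a p x p grid.
    i, j = np.tril_indices(n)
    if cyclic:
        pi, pj = i % p, j % p      # cyclic: entry (i, j) -> proc (i mod p, j mod p)
    else:
        b = n // p                 # blocked: contiguous b x b panels (assumes p | n)
        pi, pj = i // b, j // b
    counts = np.zeros((p, p), dtype=int)
    np.add.at(counts, (pi, pj), 1) # unbuffered scatter-add of ownership counts
    return counts

cyc = triangle_counts(64, 4, cyclic=True)   # nearly uniform load
blk = triangle_counts(64, 4, cyclic=False)  # upper-triangle procs get zero entries
```

The same effect holds in higher dimensions: a cyclic sub-tensor inherits the global tensor's symmetric structure, so packed storage stays load-balanced across the virtual topology.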

We also wrote a "Parallel Sorting" article for David Padua's Encyclopedia of Parallel Computing, and contributed to the formulation of the parallel sorting pattern. In preparation for writing these articles, I surveyed the large body of literature on parallel sorting and assembled a collection of 46 significant papers on the topic.

In the summer of 2010, I interned at D. E. Shaw Research (DESRES). At DESRES, we implemented and heavily optimized a multi-GPU version of the Smooth Particle Mesh Ewald (PME) method. This work was part of an effort to port the Desmond parallel molecular dynamics code to GPU clusters. I also worked on development of the pairlist and near-term GPU kernels.