Currently Active Projects:
Collaborators: James Demmel (UCB), BeBoP group (UCB/LBNL)
Parallel dense linear algorithms typically assign parallel work in matrix blocks.
However, the communication cost can be reduced by using redundant copies of the
matrices. A classical 3D matrix multiplication algorithm has been long known
to achieve the minimal communication cost possible.
2.5D matrix multiplication is an adaptation of the classical 3D algorithm.
2.5D algorithms perform adaptive replication, controlling for communication
as well as memory usage (the main flaw of the 3D algorithm).
We've also shown how to extend this approach to 2.5D LU, TRSM, and Cholesky.
These new 2.5D algorithms achieve an asymptotically lower inter-processor
communication complexity than any previously existing algorithms.
In fact, in all cases, we prove their communication optimality.
2011 Euro-Par paper
received a Distinguished Paper award for this work.
2.5D algorithms also map nicely to modern 3D Torus architectures. Our
implementations have shown excellent strong scaling on the Intrepid BlueGene/P
machine at Argonne National Lab. Careful mapping techniques allow the code
to exploit topology-aware collectives. We analyze the implementations
and mapping techniques, as well as create a performance model in our
2011 Supercomputing paper
The parallelization technique used in 2.5D algorithms has also been extended
outside the domain of numerical linear algebra. For example, we recently
demonstrated how the all-pairs shortest-paths problem can be solved with the
same computational complexity as 2.5D LU using a recursive 2.5D algorithm.
The theoretical analysis is again matches by scalability results on a
large supercomputer, in this case the Cray XE6. For details, see our
2013 IPDPS paper
This type of communication analysis has also been applied to Molecular Dynamics (MD).
1.5D parallelization of MD codes allows for a reduction of communication
bandwidth in latency in exchange for memory usage. Our analysis and performance
of this problem will be presented as IPDPS 2013.
Cyclops Tensor Framework:
Collaborators: Jeff Hammond (ANL), Devin Matthews (UT), James Demmel (UCB)
State of the art electronic structure calculation codes such as NWChem,
spend most of their compute time performing tensor contractions.
The most popular distributed implementations currently use a global-memory
PGAS layer (e.g. Global Arrays) for communication. However, using
this approach it is hard to exploit tensor symmetry or maintain
a well-mapped communication pattern.
We've designed a tensor parallelization framework that uses cyclic operations
(hence, Cycl-ops), to compute tensor contractions. By using a cyclic
decomposition, we can seamlessly take advantage of tensor symmetry and
decompose into a structured virtual topology. The framework performs
automatic topology-aware mapping of tensors of any dimension, so long as
they fit into memory. A technical report preceeding our 2013 IPDPS paper
on this framework can be found here
the IPDPS paper is
Collaborators: Laxmikant V. Kale (UIUC)
Parallel sorting has a long and rich history of algorithm development
and evolution. Yet, some ideas have persisted for decades.
Bitonic Sort was discovered in 1964 by Batcher and used in circuit
sorting networks. Today, it is again being used on GPU architectures.
In fact, choosing the best parallel sorting algorithm depends
on the target architecture.
We considered the problem of parallel sorting on modern supercomputers.
We found that another iterative algorithm (Histogram Sort)
is the best choice when scalability
is needed past 5000 nodes. We modified the Histogram Sort to release dependencies
and aggressively scheduled it. As a result, we achieved full communication
and computation overlap, alleviating the heavy all-to-all data exchange.
For more information, read our 2010 IPDPS
We also wrote a "Parallel Sorting" article for David Padua's encyclopedia,
as well as contributed to the formulation of the parallel sorting pattern.
In preparation for writing these articles, I surveyed the large body of
parallel computing literature. I assembled a
collection of 46 significant papers on the topic
Collaborators: Laxmikant V. Kale (UIUC), Abhinav Bhatele (UIUC),
Michael Bergdorf (DESRES), Charles Rendleman (DESRES)
At University of Illinois, we developed a molecular dynamic simulation
meant to be a light-weight framework for benchmarking NAMD.
We used the patch/compute parallelization scheme, where both the problem
domain and the computational domain get decomposed. We implemented
several optimization layers, including multicasts, pairlists, and
a specialized load balancer.
In summer of 2010, I interned at DE Shaw Research (DESRES). At DESRES,
we implemented and heavily optimized a multi-GPU version of the Smooth
Particle Ewald method. This work was part of an effort to port
parallel molecular dynamics code, to GPU clusters. I also
worked on development of the pairlist and near-term GPU kernels.