Abstract: We consider the problem of communication avoidance in computing interactions between a set of particles in scenarios with and without a cutoff radius for interaction. Our strategy, which we show to be optimal in communication, divides the work over the iteration space rather than simply dividing the particles over processors, so more than one processor may be responsible for computing updates to a single particle. Similar to a force decomposition in molecular dynamics, this approach requires up to √p times more memory than a particle decomposition, but reduces communication costs by factors up to √p and is often faster in practice than a particle decomposition. We examine a generalized force decomposition algorithm that tolerates the memory-limited case, i.e. when memory can only hold c copies of the particles for c = 1, 2, ..., √p. When c = 1, the algorithm degenerates into a particle decomposition; similarly, when c = √p, the algorithm uses a force decomposition. We present a proof that the algorithm is communication-optimal and reduces critical path latency and bandwidth costs by factors of c² and c, respectively. Performance results from experiments on up to 24K cores of Cray XE6 and IBM BlueGene/P machines indicate that the algorithm reduces communication in practice. In some cases, it even outperforms the original force decomposition approach because the right choice of c strikes a balance between the costs of collective and point-to-point communication. Finally, we extend the analysis to include a cutoff radius for direct evaluation of force interactions. We show that with a cutoff, communication optimality still holds. We sketch a generalized algorithm for multi-dimensional space and assess its performance for 1D and 2D simulations on the same systems.
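The claimed savings can be illustrated with a small cost-model sketch (a toy illustration only: the baseline costs are normalized placeholders, not measurements from the paper; the factors c² and c come from the abstract above):

```python
import math

def comm_costs(p, c, base_latency=1.0, base_bandwidth=1.0):
    """Relative critical-path communication costs with c-fold replication,
    normalized to the particle-decomposition (c = 1) baseline.
    Latency drops by a factor of c**2, bandwidth by a factor of c."""
    assert 1 <= c <= math.isqrt(p), "c must lie in 1..sqrt(p)"
    return base_latency / c**2, base_bandwidth / c

p = 64
for c in (1, 2, 4, 8):  # c = sqrt(p) = 8 recovers the full force decomposition
    latency, bandwidth = comm_costs(p, c)
    print(f"c={c}: latency x{latency}, bandwidth x{bandwidth}")
```

Intermediate values of c trade replication memory for communication, which is where the balance between collective and point-to-point costs mentioned above comes into play.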
PDF coming May 2013.
Abstract: High-level, productivity-oriented languages such as Python are becoming increasingly popular in HPC applications as “glue” and prototyping code. The PGAS model offers its own productivity advantages, and combining PGAS and Python is a promising approach for rapid development of parallel applications. We discuss the potential benefits and challenges of a PGAS extension to Python, and we present initial performance results from a prototype implementation called PyGAS.
Available via: PDF
| Semester | Course | Instructor | Writeup |
|---|---|---|---|
| Fall 2011 | CS262A: Advanced Topics in Computer Systems | Prof. E. Brewer | Report (PDF) |
| Fall 2011 | CS294: Vectorizing Compilers | Dr. R. Allen | |
| Spring 2012 | CS267: Applications of Parallel Computers | Prof. J. Demmel | Paper (PDF) |
| Spring 2012 | CS270: Combinatorial Algorithms and Data Structures | Prof. S. Rao | |
| Fall 2012 | CS284: Computer-Aided Geometric Design | Prof. C. Séquin | |
| Fall 2012 | CS294: Program Synthesis | Prof. R. Bodik | Report (PDF) |
| Spring 2013 | CS283: Graduate Computer Graphics | Prof. R. Ramamoorthi | |
PyGAS is a Python extension module that provides a shared-memory abstraction for distributed, SPMD programs. The module represents the core functionality upon which higher-level libraries can be implemented, e.g. distributed array packages, graph abstractions, etc. More information can be found in our short paper. PyGAS is currently under active development and APIs may change rapidly.
In a forthcoming paper, we cast subdivision surface evaluation as sparse matrix-vector multiplication (SpMV). We use Pixar's OpenSubdiv as an implementation vehicle and implement our kernels using high-performance linear algebra libraries.
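The SpMV formulation can be illustrated with a toy example (a sketch only: pure Python, a single uniform cubic B-spline subdivision step on a closed curve; the masks and the CSR kernel shown here are textbook material, not taken from OpenSubdiv or the paper):

```python
def spmv(indptr, indices, data, x):
    """CSR sparse matrix-vector product y = A @ x."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

def subdivision_matrix(n):
    """CSR matrix for one cubic B-spline subdivision step on a closed
    polygon of n control points: 2n output rows, alternating the
    vertex mask (1/8, 6/8, 1/8) and the edge mask (1/2, 1/2)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        # refined position of old vertex i
        indices += [(i - 1) % n, i, (i + 1) % n]
        data += [1 / 8, 6 / 8, 1 / 8]
        indptr.append(len(indices))
        # new point on the edge (i, i+1)
        indices += [i, (i + 1) % n]
        data += [1 / 2, 1 / 2]
        indptr.append(len(indices))
    return indptr, indices, data

# One coordinate channel of 4 control points; each channel reuses the matrix.
indptr, indices, data = subdivision_matrix(4)
refined = spmv(indptr, indices, data, [0.0, 1.0, 2.0, 3.0])
```

Once evaluation is phrased this way, the subdivision matrix can be built once and reapplied with any tuned SpMV kernel, which is the connection to high-performance linear algebra libraries.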
Complex memory layouts like Z-Morton curves aid performance but hinder productivity. To bridge the gap, the Logical Indexing Library reads layout descriptors and generates expressions that map from a simple, 2D space into the complex layout.
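For a concrete sense of the kind of mapping involved, here is a minimal Z-Morton index function (an illustrative sketch, not the library's generated code; production encoders typically use bit-twiddling tables or intrinsics rather than a loop):

```python
def z_index(x, y, bits=16):
    """Map 2D coordinates to a Z-Morton index by interleaving bits:
    bit b of x lands at position 2b, bit b of y at position 2b + 1."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z

# The first cells of the Z-order curve:
# (0,0) -> 0, (1,0) -> 1, (0,1) -> 2, (1,1) -> 3, (2,0) -> 4
```

A library that emits expressions like this lets application code keep addressing data as `a[x][y]` while the storage follows the cache-friendly curve.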