Abstracts for Katherine A. Yelick

The EECS Research Summary for 2003


Automatic Performance Tuning of Sparse Matrix Kernels

Richard Vuduc, Shoaib Kamil1, Rajesh Nishtala2, Benjamin Lee3, and Attila Gyulassy4
(Professors James W. Demmel and Katherine A. Yelick)
(DOE) DE-FG03-94ER25219, (DOE) DE-FC03-98ER25351, and (LLNL/DOE) W-7405-ENG-48

The overall performance of a variety of scientific computing and information retrieval applications is dominated by a few computational kernels. One important class of such operations is sparse matrix kernels--computations on matrices having relatively few non-zero entries. Performance tuning of sparse kernels is a particularly tedious and time-consuming task because performance is a complicated function of the kernel, the machine, and the non-zero structure of the matrix: for every machine, a user must carefully choose a data structure and an implementation (code and transformations) that minimize matrix storage while achieving the best possible performance.

The goal of this research project is to generate implementations of particular sparse kernels tuned to a given matrix and machine. Our work builds on Sparsity [1], an early successful prototype for sparse matrix-vector multiply (y = A*x, where A is a sparse matrix and x and y are dense vectors), extending it to additional kernels such as sparse triangular solve [3], developing performance models and bounds that explain and guide the tuning process [2], and working toward integration with the Sparse BLAS standardization effort [4].
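
To make the tuning problem concrete, the following is a minimal C sketch of sparse matrix-vector multiply over the standard compressed sparse row (CSR) format; the type and variable names are illustrative, not taken from the Sparsity code itself:

    #include <stddef.h>

    /* Sparse matrix in compressed sparse row (CSR) format:
     * row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
     * col_idx[k] gives the column of the k-th nonzero, val[k] its value. */
    typedef struct {
        size_t nrows;
        const size_t *row_ptr;  /* length nrows + 1 */
        const size_t *col_idx;  /* length nnz */
        const double *val;      /* length nnz */
    } csr_matrix;

    /* y = A*x: the indirect, irregular loads of x driven by col_idx are
     * why performance depends so heavily on the matrix's non-zero
     * structure, motivating structure-specific tuning. */
    void spmv_csr(const csr_matrix *A, const double *x, double *y)
    {
        for (size_t i = 0; i < A->nrows; i++) {
            double sum = 0.0;
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->val[k] * x[A->col_idx[k]];
            y[i] = sum;
        }
    }

A Sparsity-style tuned implementation replaces this flat loop with, for example, a register-blocked variant whose block dimensions are selected automatically for the given matrix and machine.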

This research is being conducted by members of the Berkeley Benchmarking and OPtimization (BeBOP) project [5].

[1]
E.-J. Im, "Optimizing the Performance of Sparse Matrix-Vector Multiplication," PhD thesis, UC Berkeley, May 2000.
[2]
R. Vuduc, J. Demmel, K. Yelick, et al., "Performance Optimizations and Bounds for Sparse-Matrix Vector Multiply," Supercomputing, November 2002.
[3]
R. Vuduc, S. Kamil, et al., "Automatic Performance Tuning and Analysis of Sparse Triangular Solve," Int. Conf. Supercomputing, Workshop on Performance Optimization of High-level Languages and Libraries, June 2002.
[4]
BLAS Technical Forum: http://www.netlib.org/blas/blast-forum.
[5]
The BeBOP Homepage: http://www.cs.berkeley.edu/~richie/bebop.
1,2,3,4Undergraduate (EECS)

More information: http://www.cs.berkeley.edu/~richie/bebop, or send mail to the author: richie@cs.berkeley.edu

Titanium: A High-Performance Parallel Java Dialect

Dan Bonachea, Kaushik Datta, Ed Givelberg1, Sabrina Merchant, Geoff Pike2, and Jimmy Su
(Professors Susan L. Graham, Paul N. Hilfinger, and Katherine A. Yelick)
(ASCI-3 LLNL) W-7405-ENG-48 and (NSF) PACI ACI-9619020

Titanium is an explicitly parallel dialect of Java developed at UC Berkeley [1] to support high-performance scientific computing on large-scale multiprocessors, including massively parallel supercomputers and distributed-memory clusters with one or more processors per node. Other language goals include safety, portability, and support for building complex data structures.

The main additions [2] to Java are:

  • an explicitly parallel SPMD model of computation
  • a global address space spanning distributed memory
  • statically checked synchronization constructs
  • a flexible, true multidimensional array facility
  • region-based memory management

Titanium provides a global memory space abstraction (similar to other languages such as Split-C and UPC) whereby all data has a user-controllable processor affinity, but parallel processes may directly reference each other's memory to read and write values or arrange for bulk data transfers. A specific portability result is that Titanium programs can run unmodified on uniprocessors, shared memory machines, and distributed memory machines. Performance tuning may be necessary to arrange an application's data structures for distributed memory, but the functional portability allows for development on shared memory machines and uniprocessors.

Titanium is a superset of Java and inherits all the expressiveness, usability, and safety properties of that language. Titanium augments Java's safety features by providing checked synchronization that prevents certain classes of synchronization bugs. To support complex data structures, Titanium uses the object-oriented class mechanism of Java along with the global address space to allow for large shared structures. Titanium's multidimensional array facility adds support for high-performance hierarchical and adaptive grid-based computations.

Our compiler research focuses on the design of program analysis techniques and optimizing transformations for Titanium programs, and on developing a compiler and runtime system that exploit these techniques. Because Titanium is an explicitly parallel language, new analyses are needed even for standard code motion transformations. The compiler analyzes both synchronization constructs and shared variable accesses. Transformations include cache optimizations, overlapping communication with computation, identifying references to objects on the local processor, and replacing runtime memory management overhead with static checking. Our current implementation translates Titanium programs entirely into C, which is then compiled to native code by a C compiler and linked against the Titanium runtime libraries (there is no JVM).
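
Because the compiler emits C, a Titanium global reference cannot be a bare machine pointer. A common implementation technique for global address space languages is to represent it as a (node, local address) pair; the sketch below uses that technique purely as an illustrative assumption (my_node and remote_get are hypothetical helpers, and this is not necessarily Titanium's actual layout):

    #include <stddef.h>
    #include <stdint.h>

    /* One plausible generated-C representation of a global pointer:
     * the owning process plus an address valid in that process's memory.
     * Illustrative sketch only -- not Titanium's actual layout. */
    typedef struct {
        uint32_t node;  /* rank of the owning process */
        void    *addr;  /* address within the owner's memory */
    } global_ptr;

    extern uint32_t my_node(void);                           /* hypothetical */
    extern void remote_get(uint32_t node, void *dst,
                           const void *src, size_t nbytes);  /* hypothetical */

    /* Dereference with an affinity check: a local access is a plain load,
     * a remote access becomes a one-sided communication call. Proving at
     * compile time that p.node == my_node() -- the "local" analysis
     * mentioned above -- removes both the branch and the call overhead. */
    double deref_double(global_ptr p)
    {
        double v;
        if (p.node == my_node())
            v = *(double *)p.addr;                     /* local: one load */
        else
            remote_get(p.node, &v, p.addr, sizeof v);  /* remote: network */
        return v;
    }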

The current implementation runs on a wide range of platforms, including uniprocessors, shared memory multiprocessors, distributed-memory clusters of uniprocessors or SMPs (CLUMPS), and a number of specific supercomputer architectures (Cray T3E, IBM SP, Origin 2000). The distributed memory back-ends can use a wide variety of high-performance network interconnects, including Active Messages, MPI, IBM LAPI, shmem, and UDP.

Titanium is especially well suited to writing grid-based parallel scientific applications, and several major applications have been written in it and continue to be developed ([3,4], and many others).

[1]
Titanium Project home page: http://titanium.cs.berkeley.edu.
[2]
P. Hilfinger et al., Titanium Language Reference Manual, UC Berkeley Computer Science Division, Report No. UCB/CSD 01/1163, November 2001.
[3]
G. Balls and P. Colella, A Finite Difference Domain Decomposition Method Using Local Corrections for the Solution of Poisson's Equation, Lawrence Berkeley National Laboratory Report No. LBNL-45035, 2001.
[4]
G. Pike, L. Semenzato, P. Colella, and P. Hilfinger, "Parallel 3D Adaptive Mesh Refinement in Titanium," Proc. SIAM Conf. Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.
1,2Postdoctoral Researcher

More information: http://titanium.cs.berkeley.edu/, or send mail to the author: bonachea@eecs.berkeley.edu

GASNet--A High-Performance Portable Communication System for Global Address Space Languages

Dan Bonachea, Christian Bell1, and Mike Welcome2
(Professor Katherine A. Yelick)
Department of Energy Office of Science, National Science Foundation, and NSA

GASNet [1] is a network-independent and language-independent high-performance communication interface intended for use in implementing the runtime system for global address space languages, such as Unified Parallel C (UPC) [2] or Titanium [3]. GASNet stands for "Global-Address Space Networking."

Goals of this research include:

  • Language-independence: compatibility with several global-address space languages and compilers
    -UPC, Titanium, Co-array Fortran, and possibly others
    -Hide language- or compiler-specific details, such as shared-pointer representation
  • Hardware independence: support for a variety of parallel architectures and operating systems
    -SMP: Linux/UNIX SMPs, Origin 2000, etc.
    -Clusters of uniprocessors or SMPs: IBM SP, Compaq AlphaServer, Linux/UNIX clusters, etc.
  • Explicit support for multi-threaded and multi-process clients, to support clusters of SMPs
  • Support for many high-performance networks: MPI, Myrinet/GM, Quadrics/elan, IBM/LAPI, Infiniband
  • Ease of implementation on new hardware
    -Allow quick prototype implementations
    -Production implementations can leverage performance features of hardware
  • Provide both portability and high performance

Figure 1 shows the organization of the GASNet communication layer and its relation to the parallel language system.

The GASNet interface is divided into two distinct layers:

GASNet Core API

  • Provides the most basic required network primitives
  • Implemented directly on each platform
    -Minimal set of network functions needed to support a working implementation
    -General enough to implement everything else
  • Based heavily on the active messages paradigm
    -Provides a powerful extensibility mechanism (see the sketch below)
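
As a concrete illustration of the core API's active-message style, here is a minimal ping-pong sketch written against the GASNet 1.x interface documented in [1]; the handler indices and overall shape follow the specification, but this is an illustrative sketch, so exact signatures should be checked against the spec:

    #include <gasnet.h>

    /* Client handler indices; the spec reserves 128-255 for clients. */
    #define PING_HANDLER 201
    #define PONG_HANDLER 202

    static volatile int pong_arrived = 0;

    /* Request handler: runs on the target node and sends a reply. */
    static void ping_handler(gasnet_token_t token) {
        gasnet_AMReplyShort0(token, PONG_HANDLER);
    }

    /* Reply handler: runs back on the requesting node. */
    static void pong_handler(gasnet_token_t token) {
        (void)token;
        pong_arrived = 1;
    }

    int main(int argc, char **argv) {
        gasnet_handlerentry_t handlers[] = {
            { PING_HANDLER, (void (*)()) ping_handler },
            { PONG_HANDLER, (void (*)()) pong_handler },
        };
        gasnet_init(&argc, &argv);
        gasnet_attach(handlers, 2, GASNET_PAGESIZE, 0);

        if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
            gasnet_AMRequestShort0(1, PING_HANDLER);  /* ping node 1 */
            GASNET_BLOCKUNTIL(pong_arrived);          /* poll for the reply */
        }
        gasnet_exit(0);
        return 0;
    }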

GASNet Extended API

  • Wider interface that includes more complicated operations
    -Put, get, barrier, collectives, etc.
    -Flexible and expressive non-blocking operations and synchronization mechanisms
    -Ideal compilation target for aggressive communication optimizations
  • We provide a reference implementation of the extended API in terms of the core API
  • Implementors can choose to directly implement any subset of these for performance, leveraging hardware support for higher-level operations (see the non-blocking put sketch below)
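
A short sketch of the extended API's explicit-handle non-blocking put, again following the names in the GASNet specification [1] (do_independent_work is a hypothetical placeholder for overlapped computation):

    #include <stddef.h>
    #include <gasnet.h>

    extern void do_independent_work(void);  /* hypothetical */

    /* Overlap communication with computation: initiate a one-sided put,
     * compute while the data is in flight, then synchronize before
     * reusing the local buffer. */
    void update_remote(gasnet_node_t peer, void *remote_dst,
                       double *local_buf, size_t nbytes)
    {
        gasnet_handle_t h = gasnet_put_nb(peer, remote_dst, local_buf, nbytes);
        do_independent_work();
        gasnet_wait_syncnb(h);
    }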

GASNet has already been implemented on Myrinet/GM, Quadrics/elan, IBM/LAPI, and MPI (a fully portable implementation), and implementations are under way for Infiniband, Gigabit Ethernet, and other high-performance networks. This research effort includes carefully evaluating the performance characteristics of the target high-performance networks and interfaces [4], and developing strategies for best utilizing their capabilities to implement low-latency, low-overhead, one-sided small puts/gets and zero-copy, high-bandwidth bulk transfers [5].

Figures 2 and 3 show the small put/get "round-trip" latency and large put/get bandwidth achieved by the Myrinet and Quadrics implementations of GASNet in various configurations and on several machines. The leftmost bar for each machine shows the performance of the fully portable GASNet-MPI implementation on that system (MPI-based core API, reference implementation of the extended API). The middle bar for the Quadrics machines shows the performance of the reference extended API over a natively implemented GASNet-Quadrics core API. The rightmost bar for each machine shows the performance of a fully native implementation of the GASNet core and extended APIs on the underlying network. We find this performance to be very competitive with the best achievable on these networks, supporting the claim that GASNet is a very lightweight, high-performance interface.

Figures 4 and 5 show the small put/get "round-trip" latency of the fully native GASNet-Quadrics and GASNet-Myrinet implementations over varying message sizes.

This work is done in conjunction with the Berkeley UPC compiler group, a joint effort between Lawrence Berkeley National Laboratory and UC Berkeley. GASNet is already being used as the communication system for the Berkeley UPC compiler and the UC Berkeley Titanium compiler, and several other global address space language research efforts are considering adopting it.


Figure 1: High-level system organization for a GAS language using GASNet

Figure 2: Small put/get "round-trip" latency performance of GASNet-Myrinet and GASNet-Quadrics in several configurations and machines

Figure 3: Large put/get bandwidth performance of GASNet-Myrinet and GASNet-Quadrics in several configurations and machines

Figure 4: Put/get "round-trip" latency performance of GASNet-Quadrics over varying message size

Figure 5: Put/get "round-trip" latency performance of GASNet-Myrinet over varying message size

[1]
D. Bonachea, GASNet Specification, v1.1, UC Berkeley Computer Science Division, Report No. UCB/CSD 02/1207, October 2002.
[2]
T. El-Ghazawi, W. Carlson, and J. Draper, UPC Language Specification, v1.0, George Washington University, February 2001.
[3]
P. Hilfinger, D. Bonachea, D. Gay, S. Graham, B. Liblit, G. Pike, and K. Yelick, Titanium Language Reference Manual, UC Berkeley Computer Science Division, Report No. UCB/CSD 01/1163, November 2001.
[4]
C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, and K. Yelick, "An Evaluation of Current High-Performance Networks," Int. Parallel and Distributed Processing Symp., 2003 (submitted).
[5]
D. Bonachea and C. Bell, "Efficient DMA Registration Strategies for Pinning-based High Performance Networks," Workshop on Communication Architecture for Clusters, 2003 (submitted).
1,2Visiting Researcher, LBNL

More information: http://upc.nersc.gov, or send mail to the author: bonachea@eecs.berkeley.edu