The overall performance in a variety of scientific computing and information retrieval applications is dominated by a few computational kernels. One important class of such operations are sparse matrix kernels--computations with matrices having relatively few non-zero entries. Performance tuning of sparse kernels is a particularly tedious and time-consuming task because performance is a complicated function of the kernel, machine, and non-zero structure of the matrix: for every machine, a user must carefully choose a data structure and implementation (code and transformations) that minimize the matrix storage while achieving the best possible performance.
The goal of this research project is to generate implementations of particular sparse kernels tuned to a given matrix and machine. Our work builds on Sparsity [1], an early successful prototype for sparse matrix-vector multiply (y = A*x, where A is a sparse matrix and x, y, are dense vectors), in the following ways:
This research is being conducted by members of the Berkeley Benchmarking and OPtimization (BeBOP) project [5].
Titanium is an explicitly parallel dialect of Java developed at UC Berkeley [1] to support high-performance scientific computing on large-scale multiprocessors, including massively parallel supercomputers and distributed-memory clusters with one or more processors per node. Other language goals include safety, portability, and support for building complex data structures.
The main additions [2] to Java are:
Titanium provides a global memory space abstraction (similar to other languages such as Split-C and UPC) whereby all data has a user-controllable processor affinity, but parallel processes may directly reference each other's memory to read and write values or arrange for bulk data transfers. A specific portability result is that Titanium programs can run unmodified on uniprocessors, shared memory machines, and distributed memory machines. Performance tuning may be necessary to arrange an application's data structures for distributed memory, but the functional portability allows for development on shared memory machines and uniprocessors.
Titanium is a superset of Java and inherits all the expressiveness, usability, and safety properties of that language. Titanium augments Java's safety features by providing checked synchronization that prevents certain classes of synchronization bugs. To support complex data structures, Titanium uses the object-oriented class mechanism of Java along with the global address space to allow for large shared structures. Titanium's multidimensional array facility adds support for high-performance hierarchical and adaptive grid-based computations.
Our compiler research focuses on the design of program analysis techniques and optimizing transformations for Titanium programs, and on developing a compiler and run-time system that exploit these techniques. Because Titanium is an explicitly parallel language, new analyses are needed even for standard code motion transformations. The compiler analyzes both synchronization constructs and shared variable accesses. Transformations include cache optimizations, overlapping communication, identifying references to objects on the local processor, and replacing runtime memory management overhead with static checking. Our current implementation translates Titanium programs entirely into C, where they are compiled to native binaries by a C compiler and then linked to the Titanium runtime libraries (there is no JVM).
The current implementation runs on a wide range of platforms, including uniprocessors, shared memory multiprocessors, distributed-memory clusters of uniprocessors or SMPs (CLUMPS), and a number of specific supercomputer architectures (Cray T3E, IBM SP, Origin 2000). The distributed memory back-ends can use a wide variety of high-performance network interconnects, including Active Messages, MPI, IBM LAPI, shmem, and UDP.
Titanium is especially well adapted for writing grid-based scientific parallel applications, and several such major applications have been written and continue to be further developed ([3,4], and many others).
GASNet [1] is a network-independent and language-independent high-performance communication interface intended for use in implementing the runtime system for global address space languages, such as Unified Parallel C (UPC) [2] or Titanium [3]. GASNet stands for "Global-Address Space Networking."
Goals of this research include:
Figure 1 shows the organization of the GASNet communication layer and its relation to the parallel language system.
The GASNet interface is divided into two distinct layers:
GASNet Core API
GASNet Extended API
GASNet has already been implemented on Myrinet/GM, Quadrics/elan, IBM/LAPI, and MPI (a fully portable implementation), and implementations are on the way for Infiniband, Gigabit Ethernet and other high-performance networks. This research effort includes carefully evaluating the performance characteristics for the target high-performance networks and interfaces [4], and developing strategies for best utilizing the network capabilities to implement low-latency, low-overhead, one-sided small put/gets and zero-copy high-bandwidth bulk transfers [5].
Figures 2 and 3 demonstrate the small-put/get "round-trip" latency and large-put/get bandwidth achieved by the Myrinet and Quadrics implementations of GASNet in various configurations and on several machines. The leftmost bar for each machine shows the performance of the fully portable GASNet-MPI implementation on that system (MPI-based core API, reference implementation of extended API). The middle bar for the Quadrics machines shows the performance of using the reference extended API with a natively-implemented GASNet-Quadrics core API. The rightmost bar for each machine shows the performance of a fully-native implementation of the GASNet core and extended API on the underlying network. We find this performance to be very competitive with the best performance achievable on these networks, therefore justifying the claim that GASNet is indeed a very lightweight, high-performance interface.
Figures 4 and 5 show the small-put/get "round-trip" latency for the fully-native GASNet-Quadrics and GASNet-Myrinet implementations over varying message sizes.
This work is done in conjunction with the Berkeley UPC compiler group, a joint effort between Lawrence Berkeley National Laboratory and UC Berkeley. GASNet is already being used as a communication system for the Berkeley UPC compiler and the UC Berkeley Titanium compiler, and several other global address space language research efforts are considering adopting GASNet.

Figure 1: High-level system organization for a GAS language using GASNet

Figure 2: Small put/get "round-trip" latency performance of GASNet-Myrinet and GASNet-Quadrics in several configurations and machines

Figure 3: Large put/get bandwidth performance of GASNet-Myrinet and GASNet-Quadrics in several configurations and machines

Figure 4: Put/get "round-trip" latency performance of GASNet-Quadrics over varying message size

Figure 5: Put/get "round-trip" latency performance of GASNet-Myrinet over varying message size