140 Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
75 PLFS: A Checkpoint Filesystem for Parallel Applications
48 Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems
44 Millisecond-Scale Molecular Dynamics Simulations on Anton
44 I/O Performance Challenges at Leadership Scale
43 The Cat is Out of the Bag: Cortical Simulations with 10^9 Neurons, 10^13 Synapses
43 Auto-Tuning 3-D FFT Library for CUDA GPUs
42 Scalable Work Stealing
40 Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware
36 Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead for Future Exascale Systems
34 42 TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence
33 Minimizing Communication in Sparse Matrix Solvers
30 VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
25 HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks
24 A Massively Parallel Adaptive Fast-Multipole Method on Heterogeneous Architectures
22 Scalable Implicit Finite Element Solver for Massively Parallel Processing with Demonstration to 160K cores
20 Towards a Framework for Abstracting Accelerators in Parallel Applications: Experience with Cell
19 Diagnosing Performance Bottlenecks in Emerging Petascale Applications
19 Future Scaling of Processor-Memory Interfaces
19 GridBot: Execution of Bags of Tasks in Multiple Grids
19 Scalable Massively Parallel I/O to Task-Local Files
18 Increasing Memory Miss Tolerance for SIMD Cores
18 Liquid Water: Obtaining the Right Answer for the Right Reasons
15 PFunc: Modern Task Parallelism for Modern High Performance Computing
15 SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems
14 Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors
14 Scalable Computation of Streamlines on Very Large Datasets
13 Terascale Data Organization for Discovering Multivariate Climatic Trends
12 Sparse Matrix Factorization on Massively Parallel Computers
12 A Configurable Algorithm for Parallel Image-Compositing Applications
12 Autotuning Multigrid with PetaBricks
11 Instruction-Level Simulation of a Cluster at Scale
11 Age Based Scheduling for Asymmetric Multiprocessors
10 Automating the Generation of Composed Linear Algebra Kernels
10 Router Designs for Elastic Buffer On-Chip Networks
10 Allocator Implementations for Network-on-Chip Routers
10 Improving GridFTP Performance Using The Phoebus Session Layer
10 Multi-core Acceleration of Chemical Kinetics for Simulation and Prediction
9 Adaptive and Scalable Metadata Management to Support A Trillion Files
9 On the Design of Scalable, Self-Configuring Virtual Networks
8 Optimal Real Number Codes for Fault Tolerant Matrix Operations
8 FACT: Fast Communication Trace Collection for Parallel Applications through Program Slicing
8 Space-Efficient Time-Series Call-Path Profiling of Parallel Applications
8 Early Performance Evaluation of "Nehalem" Cluster using Scientific and Engineering Applications
7 A Case for Integrated Processor-Cache Partitioning in Chip Multiprocessors
7 Enabling Software Management for Multicore Caches with a Lightweight Hardware Support
7 Predicting the Execution Time of Grid Workflow Applications through Local Learning
7 Indexing Genomic Sequences on the IBM Blue Gene
7 Evaluating Similarity-Based Trace Reduction Techniques for Scalable Performance Analysis
7 Triangular Matrix Inversion on Graphics Processing Units
7 Scalable Temporal Order Analysis for Large Scale Debugging
6 Machine Learning-Based Prefetch Optimization for Data Center Applications
6 A Design Methodology for Domain-Optimized Power-Efficient Supercomputing
6 A 32x32x32, Spatially Distributed 3D FFT in Four Microseconds on Anton
6 Performance Evaluation of NEC SX-9 using Real Science and Engineering Applications
6 Enabling High-Fidelity Neutron Transport Simulations on Petascale Architectures
5 Compact Multi-Dimensional Kernel Extraction for Register Tiling
5 A Microdriver Architecture for Error Correcting Codes inside the Linux Kernel
5 Beyond Homogeneous Decomposition: Scaling Long-Range Forces on Massively Parallel Architectures
4 Evaluating the Impact of Inaccurate Information in Utility-Based Scheduling
4 FALCON: A System for Reliable Checkpoint Recovery in Shared Grid Environments
4 Flexible Cache Error Protection using an ECC FIFO
3 Dynamic Storage Cache Allocation in Multi-Server Architectures
3 Supporting Fault-Tolerance for Time-Critical Events in Distributed Environments
2 SCAMPI: A Scalable Cam-based Algorithm for Multiple Pattern Inspection
2 Efficient Band Approximation of Gram Matrices for Large Scale Kernel Methods on GPUs
2 A Scalable Method for Ab Initio Computation of Free Energies in Nanoscale Systems