51 The 48-Core SCC Processor: The Programmer's View
36 Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
30 Managing Variability in the I/O Performance of Petascale Storage Systems
26 OpenMPC: Extended OpenMP Programming and Tuning for GPUs
20 Characterizing the Influence of System Noise on Large-Scale Applications by Simulation
19 Data Sharing Options for Scientific Workflows on Amazon EC2
15 A Scalable and Distributed Dynamic Formal Verifier for MPI Programs
15 Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures
15 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
14 DASH: a Recipe for a Flash-based Data Intensive Supercomputer
13 An 80-Fold Speedup, 15.0 Tflops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
13 Scalable Earthquake Simulation on Petascale Supercomputers
12 Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework
12 Scaling Hierarchical N-Body Simulations on GPU Clusters
12 Extreme-Scale AMR
12 Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing
12 Scalable Graph Exploration on Multicore Processors
12 Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
12 Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
11 Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures
11 Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
10 Accelerating I/O Forwarding in IBM Blue Gene/P Systems
9 Elastic Cloud Caches for Accelerating Service-Oriented Computations
9 IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination
9 Power-Aware Consolidation of Scientific Workflows in Virtualized Environments
8 190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs
8 Fast PGAS Implementation of Distributed Graph Algorithms
8 Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene's CNK
7 PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications
6 CPM in CMPs: Coordinated Power Management in Chip-Multiprocessors
6 Scalable Identification of Load Imbalance in Parallel Executions using Call Path Profiles
5 FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking
5 JAWS: Job-Aware Workload Scheduling for the Exploration of Turbulence Simulations
5 Circuit-Switched Memory Access in Photonic Interconnection Networks for High-Performance Embedded Computing
5 vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload
5 Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
5 Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support
5 Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses
5 A Multi-Scale Heart Simulation on Massively Parallel Computers
4 Diagnosis, Tuning and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
3 Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations
3 Multiscale Simulation of Cardiovascular flows on the IBM Bluegene/P: Full Heart-Circulation System at Red-Blood Cell Resolution
3 Parallel Fast Gauss Transform
3 A Parallel Implementation of Electron-Phonon Scattering in Nanoelectronic Devices up to 95K Cores
2 Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid
2 An Adaptive Framework for Simulation and Online Remote Visualization of Critical Climate Applications in Resource-Constrained Environments
2 Toward First Principles Electronic Structure Simulations of Excited States and Strong Correlations in Nano- and Materials Science
2 Direct Numerical Simulation of Particulate Flows on 294912 Processor Cores
2 Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers
2 The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches
1 Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories
1 On-Chip Network Evaluation Framework
1 A Block-Oriented Language and Runtime System for Tensor Algebra with Very Large Arrays
1 A Flexible Reservation Algorithm for Advance Network Provisioning
0 Exploiting 162-Nanosecond End-to-End Communication Latency on Anton
0 Automatic Run-time Parallelization and Transformation of I/O
0 Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture