| | |
| --- | --- |
| 51 | The 48-Core SCC Processor: The Programmer's View |
| 36 | Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System |
| 30 | Managing Variability in the I/O Performance of Petascale Storage Systems |
| 26 | OpenMPC: Extended OpenMP Programming and Tuning for GPUs |
| 20 | Characterizing the Influence of System Noise on Large-Scale Applications by Simulation |
| 19 | Data Sharing Options for Scientific Workflows on Amazon EC2 |
| 15 | A Scalable and Distributed Dynamic Formal Verifier for MPI Programs |
| 15 | Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures |
| 15 | 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs |
| 14 | DASH: a Recipe for a Flash-based Data Intensive Supercomputer |
| 13 | An 80-Fold Speedup, 15.0 Tflops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code |
| 13 | Scalable Earthquake Simulation on Petascale Supercomputers |
| 12 | Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework |
| 12 | Scaling Hierarchical N-Body Simulations on GPU Clusters |
| 12 | Extreme-Scale AMR |
| 12 | Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing |
| 12 | Scalable Graph Exploration on Multicore Processors |
| 12 | Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics |
| 12 | Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory |
| 11 | Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures |
| 11 | Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance |
| 10 | Accelerating I/O Forwarding in IBM Blue Gene/P Systems |
| 9 | Elastic Cloud Caches for Accelerating Service-Oriented Computations |
| 9 | IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination |
| 9 | Power-Aware Consolidation of Scientific Workflows in Virtualized Environments |
| 8 | 190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs |
| 8 | Fast PGAS Implementation of Distributed Graph Algorithms |
| 8 | Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene's CNK |
| 7 | PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications |
| 6 | CPM in CMPs: Coordinated Power Management in Chip-Multiprocessors |
| 6 | Scalable Identification of Load Imbalance in Parallel Executions using Call Path Profiles |
| 5 | FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking |
| 5 | JAWS: Job-Aware Workload Scheduling for the Exploration of Turbulence Simulations |
| 5 | Circuit-Switched Memory Access in Photonic Interconnection Networks for High-Performance Embedded Computing |
| 5 | vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload |
| 5 | Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems |
| 5 | Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support |
| 5 | Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses |
| 5 | A Multi-Scale Heart Simulation on Massively Parallel Computers |
| 4 | Diagnosis, Tuning and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method |
| 3 | Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations |
| 3 | Multiscale Simulation of Cardiovascular flows on the IBM Bluegene/P: Full Heart-Circulation System at Red-Blood Cell Resolution |
| 3 | Parallel Fast Gauss Transform |
| 3 | A Parallel Implementation of Electron-Phonon Scattering in Nanoelectronic Devices up to 95K Cores |
| 2 | Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid |
| 2 | An Adaptive Framework for Simulation and Online Remote Visualization of Critical Climate Applications in Resource-Constrained Environments |
| 2 | Toward First Principles Electronic Structure Simulations of Excited States and Strong Correlations in Nano- and Materials Science |
| 2 | Direct Numerical Simulation of Particulate Flows on 294912 Processor Cores |
| 2 | Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers |
| 2 | The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches |
| 1 | Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories |
| 1 | On-Chip Network Evaluation Framework |
| 1 | A Block-Oriented Language and Runtime System for Tensor Algebra with Very Large Arrays |
| 1 | A Flexible Reservation Algorithm for Advance Network Provisioning |
| 0 | Exploiting 162-Nanosecond End-to-End Communication Latency on Anton |
| 0 | Automatic Run-time Parallelization and Transformation of I/O |
| 0 | Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture |