| | |
| --- | --- |
| 140 | Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors |
| 75 | PLFS: A Checkpoint Filesystem for Parallel Applications |
| 48 | Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems |
| 44 | Millisecond-Scale Molecular Dynamics Simulations on Anton |
| 44 | I/O Performance Challenges at Leadership Scale |
| 43 | The Cat is Out of the Bag: Cortical Simulations with 10^9 Neurons, 10^13 Synapses |
| 43 | Auto-Tuning 3-D FFT Library for CUDA GPUs |
| 42 | Scalable Work Stealing |
| 40 | Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware |
| 36 | Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead for Future Exascale Systems |
| 34 | 42 TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence |
| 33 | Minimizing Communication in Sparse Matrix Solvers |
| 30 | VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance |
| 25 | HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks |
| 24 | A Massively Parallel Adaptive Fast-Multipole Method on Heterogeneous Architectures |
| 22 | Scalable Implicit Finite Element Solver for Massively Parallel Processing with Demonstration to 160K cores |
| 20 | Towards a Framework for Abstracting Accelerators in Parallel Applications: Experience with Cell |
| 19 | Diagnosing Performance Bottlenecks in Emerging Petascale Applications |
| 19 | Future Scaling of Processor-Memory Interfaces |
| 19 | GridBot: Execution of Bags of Tasks in Multiple Grids |
| 19 | Scalable Massively Parallel I/O to Task-Local Files |
| 18 | Increasing Memory Miss Tolerance for SIMD Cores |
| 18 | Liquid Water: Obtaining the Right Answer for the Right Reasons |
| 15 | PFunc: Modern Task Parallelism for Modern High Performance Computing |
| 15 | SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems |
| 14 | Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors |
| 14 | Scalable Computation of Streamlines on Very Large Datasets |
| 13 | Terascale Data Organization for Discovering Multivariate Climatic Trends |
| 12 | Sparse Matrix Factorization on Massively Parallel Computers |
| 12 | A Configurable Algorithm for Parallel Image-Compositing Applications |
| 12 | Autotuning Multigrid with PetaBricks |
| 11 | Instruction-Level Simulation of a Cluster at Scale |
| 11 | Age Based Scheduling for Asymmetric Multiprocessors |
| 10 | Automating the Generation of Composed Linear Algebra Kernels |
| 10 | Router Designs for Elastic Buffer On-Chip Networks |
| 10 | Allocator Implementations for Network-on-Chip Routers |
| 10 | Improving GridFTP Performance Using The Phoebus Session Layer |
| 10 | Multi-core Acceleration of Chemical Kinetics for Simulation and Prediction |
| 9 | Adaptive and Scalable Metadata Management to Support A Trillion Files |
| 9 | On the Design of Scalable, Self-Configuring Virtual Networks |
| 8 | Optimal Real Number Codes for Fault Tolerant Matrix Operations |
| 8 | FACT: Fast Communication Trace Collection for Parallel Applications through Program Slicing |
| 8 | Space-Efficient Time-Series Call-Path Profiling of Parallel Applications |
| 8 | Early Performance Evaluation of "Nehalem" Cluster using Scientific and Engineering Applications |
| 7 | A Case for Integrated Processor-Cache Partitioning in Chip Multiprocessors |
| 7 | Enabling Software Management for Multicore Caches with a Lightweight Hardware Support |
| 7 | Predicting the Execution Time of Grid Workflow Applications through Local Learning |
| 7 | Indexing Genomic Sequences on the IBM Blue Gene |
| 7 | Evaluating Similarity-Based Trace Reduction Techniques for Scalable Performance Analysis |
| 7 | Triangular Matrix Inversion on Graphics Processing Units |
| 7 | Scalable Temporal Order Analysis for Large Scale Debugging |
| 6 | Machine Learning-Based Prefetch Optimization for Data Center Applications |
| 6 | A Design Methodology for Domain-Optimized Power-Efficient Supercomputing |
| 6 | A 32x32x32, Spatially Distributed 3D FFT in Four Microseconds on Anton |
| 6 | Performance Evaluation of NEC SX-9 using Real Science and Engineering Applications |
| 6 | Enabling High-Fidelity Neutron Transport Simulations on Petascale Architectures |
| 5 | Compact Multi-Dimensional Kernel Extraction for Register Tiling |
| 5 | A Microdriver Architecture for Error Correcting Codes inside the Linux Kernel |
| 5 | Beyond Homogeneous Decomposition: Scaling Long-Range Forces on Massively Parallel Architectures |
| 4 | Evaluating the Impact of Inaccurate Information in Utility-Based Scheduling |
| 4 | FALCON: A System for Reliable Checkpoint Recovery in Shared Grid Environments |
| 4 | Flexible Cache Error Protection using an ECC FIFO |
| 3 | Dynamic Storage Cache Allocation in Multi-Server Architectures |
| 3 | Supporting Fault-Tolerance for Time-Critical Events in Distributed Environments |
| 2 | SCAMPI: A Scalable Cam-based Algorithm for Multiple Pattern Inspection |
| 2 | Efficient Band Approximation of Gram Matrices for Large Scale Kernel Methods on GPUs |
| 2 | A Scalable Method for Ab Initio Computation of Free Energies in Nanoscale Systems |