34 Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs
34 EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems
30 Refereeing conflicts in hardware transactional memory
30 How GPUs can outperform ASICs for fast LDPC decoding
29 Dynamic topology aware load balancing algorithms for molecular dynamics applications
29 FTL design exploration in reconfigurable high-performance SSD for server applications
28 Adagio: making DVS practical for complex HPC applications
27 Parametric multi-level tiling of imperfectly nested loops
26 Rate-based QoS techniques for cache/memory in CMP platforms
25 Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
24 Using many-core hardware to correlate radio astronomy signals
23 A translation system for enabling data mining applications on GPUs
20 QuakeTM: parallelizing a complex sequential application using transactional memory
20 Fast and scalable list ranking on the GPU
17 High-performance regular expression scanning on the Cell/B.E. Processor
16 Virtualization polling engine (VPE): using dedicated CPU cores to accelerate I/O virtualization
16 Combining thread level speculation helper threads and runahead execution
15 Computer generation of fast Fourier transforms for the Cell Broadband Engine
15 Chunking parallel loops in the presence of synchronization
15 Understanding the interconnection network of SpiNNaker
14 A graph based approach for MPI deadlock detection
14 P-Code: a new RAID-6 code with optimal properties
13 Dynamic parallelization of single-threaded binary programs using speculative slicing
12 Dynamic cache clustering for chip multiprocessors
12 Exploring pattern-aware routing in generalized fat tree networks
11 DBDB: optimizing DMATransfer for the Cell BE architecture
11 Pattern-based sparse matrix representation for memory-efficient SMVM kernels
11 Practice of parallelizing network applications on multi-core architectures
10 Zero-content augmented caches
10 Divide-and-conquer: a bubble replacement for low level caches
10 Limited early value communication to improve performance of transactional memory
9 MPI-aware compiler optimizations for improving communication-computation overlap
9 Single-particle 3D reconstruction from cryo-electron microscopy images on GPU
8 A comprehensive power-performance model for NoCs with multi-flit channel buffers
7 Less reused filter: improving L2 cache performance via filtering less reused lines
7 Evaluating high performance communication: a power perspective
7 R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems
6 Efficient high performance collective communication for the Cell blade
6 Towards 100 Gbit/s Ethernet: multicore-based parallel communication protocol design
6 Creating artificial global history to improve branch prediction accuracy
6 Maximizing MPI point-to-point communication performance on RDMA-enabled clusters with customized protocols
5 A parallel Levenberg-Marquardt algorithm
4 Implementation of a wide-angle lens distortion correction algorithm on the Cell Broadband Engine
4 OhHelp: a scalable domain-decomposing dynamic load balancing for particle-in-cell simulations
4 /scratch as a cache: rethinking HPC center scratch storage
2 Synchronization optimizations for efficient execution on multi-cores
1 Fast memory snapshot for concurrent programming without synchronization