84 A compiler framework for optimization of affine loop nests for GPGPUs
69 Power-aware dynamic placement of HPC applications
67 Efficient computation of sum-products on GPUs through software-managed cache
59 Biomedical image analysis on a cooperative cluster of GPUs and multicores
58 The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer
56 Fast scan algorithms on graphics processors
51 Phasers: a unified deadlock-free construct for collective and point-to-point synchronization
26 Preserving time in large-scale communication traces
26 Analysis of dynamic power management on multi-core processors
26 A regression-based approach to scalability prediction
24 Analyzing memory access intensity in parallel programs on multicore
19 Implementing Wilson-Dirac operator on the Cell Broadband Engine
19 Orchestrating data transfer for the Cell/B.E. processor
19 CUBA: an architecture for efficient CPU/co-processor data communication
18 The shared-thread multiprocessor
18 Evaluating the effect of replacing CNK with Linux on the compute-nodes of Blue Gene/L
17 Soft error vulnerability of iterative linear algebra methods
15 Data mining on the Cell Broadband Engine
13 Timely offloading of result-data in HPC centers
11 Autonomous learning for efficient resource utilization of dynamic VM migration
10 Automatic analysis of speedup of MPI applications
9 Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems
5 Accurate memory signatures and synthetic address traces for HPC applications
5 An approach for adaptive DRAM temperature and power management
4 Optimizing irregular shared-memory applications for clusters
3 CprFS: a user-level file system to support consistent file states for checkpoint and restart
3 Performance portable optimizations for loops containing communication operations
2 Advanced collective communication in Aspen
2 Shifted declustering: a placement-ideal layout scheme for multi-way replication storage architecture
2 Automatic SIMD vectorization of chains of recurrences
2 Exploiting idle register classes for fast spill destination
1 Can software reliability outperform hardware reliability on high performance interconnects?: a case study with MPI over InfiniBand
1 Focused prefetching: performance oriented prefetching based on commit stalls
0 A freespace crossbar for multi-core processors
0 A projection-based optimization framework for abstractions with application to the unstructured mesh domain
0 Three-dimensional Delaunay refinement for multi-core processors
0 Rotating register allocation with multiple rotating branches