|  | Title |
| --- | --- |
| 84 | A compiler framework for optimization of affine loop nests for GPGPUs |
| 69 | Power-aware dynamic placement of HPC applications |
| 67 | Efficient computation of sum-products on GPUs through software-managed cache |
| 59 | Biomedical image analysis on a cooperative cluster of GPUs and multicores |
| 58 | The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer |
| 56 | Fast scan algorithms on graphics processors |
| 51 | Phasers: a unified deadlock-free construct for collective and point-to-point synchronization |
| 26 | Preserving time in large-scale communication traces |
| 26 | Analysis of dynamic power management on multi-core processors |
| 26 | A regression-based approach to scalability prediction |
| 24 | Analyzing memory access intensity in parallel programs on multicore |
| 19 | Implementing Wilson-Dirac operator on the Cell Broadband Engine |
| 19 | Orchestrating data transfer for the Cell/B.E. processor |
| 19 | CUBA: an architecture for efficient CPU/co-processor data communication |
| 18 | The shared-thread multiprocessor |
| 18 | Evaluating the effect of replacing CNK with Linux on the compute-nodes of Blue Gene/L |
| 17 | Soft error vulnerability of iterative linear algebra methods |
| 15 | Data mining on the Cell Broadband Engine |
| 13 | Timely offloading of result-data in HPC centers |
| 11 | Autonomous learning for efficient resource utilization of dynamic VM migration |
| 10 | Automatic analysis of speedup of MPI applications |
| 9 | Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems |
| 5 | Accurate memory signatures and synthetic address traces for HPC applications |
| 5 | An approach for adaptive DRAM temperature and power management |
| 4 | Optimizing irregular shared-memory applications for clusters |
| 3 | CprFS: a user-level file system to support consistent file states for checkpoint and restart |
| 3 | Performance portable optimizations for loops containing communication operations |
| 2 | Advanced collective communication in Aspen |
| 2 | Shifted declustering: a placement-ideal layout scheme for multi-way replication storage architecture |
| 2 | Automatic SIMD vectorization of chains of recurrences |
| 2 | Exploiting idle register classes for fast spill destination |
| 1 | Can software reliability outperform hardware reliability on high performance interconnects?: a case study with MPI over InfiniBand |
| 1 | Focused prefetching: performance oriented prefetching based on commit stalls |
| 0 | A freespace crossbar for multi-core processors |
| 0 | A projection-based optimization framework for abstractions with application to the unstructured mesh domain |
| 0 | Three-dimensional Delaunay refinement for multi-core processors |
| 0 | Rotating register allocation with multiple rotating branches |