| 18 | Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping |
| 18 | Large-scale FFT on GPU clusters |
| 16 | Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations |
| 14 | Overlapping communication and computation by using a hybrid MPI/SMPSs approach |
| 14 | Decomposable and responsive power models for multicore processors using performance counters |
| 12 | Cache oblivious parallelograms in iterative stencil computations |
| 12 | An empirically tuned 2D and 3D FFT library on CUDA GPU |
| 10 | The auction: optimizing banks usage in Non-Uniform Cache Architectures |
| 10 | InterferenceRemoval: removing interference of disk access for MPI programs through data replication |
| 9 | Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine |
| 9 | An experimental approach to performance measurement of heterogeneous parallel applications using CUDA |
| 8 | An approach to resource-aware co-scheduling for CMPs |
| 7 | Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application |
| 6 | A query language for understanding component interactions in production systems |
| 6 | Indemics: an interactive data intensive framework for high performance epidemic simulation |
| 5 | Optimal bucket algorithms for large MPI collectives on torus interconnects |
| 5 | High-throughput Bayesian network learning using heterogeneous multicore computers |
| 5 | Timing local streams: improving timeliness in data prefetching |
| 5 | Static reuse distances for locality-based optimizations in MATLAB |
| 5 | FPGA accelerating double/quad-double high precision floating-point applications for ExaScale computing |
| 4 | ParaLearn: a massively parallel, scalable system for learning interaction networks on FPGAs |
| 4 | How to unleash array optimizations on code using recursive data structures |
| 3 | Making nested parallel transactions practical using lightweight hardware support |
| 3 | SAMS multi-layout memory: providing multiple views of data to boost SIMD performance |
| 3 | Clustering performance data efficiently at massive scales |
| 3 | Speeding up Nek5000 with autotuning and specialization |
| 2 | Handling task dependencies under strided and aliased references |
| 1 | Fast and accurate NCBI BLASTP: acceleration with multiphase FPGA-based prefiltering |
| 1 | A compiler-automated array compression scheme for optimizing memory intensive programs |
| 1 | Small-ruleset regular expression matching on GPGPUs: quantitative performance analysis and optimization |
| 0 | Enigma: architectural and operating system support for reducing the impact of address translation |
| 0 | Adaptive multi-level cache allocation in distributed storage architectures |