18 Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping
18 Large-scale FFT on GPU clusters
16 Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations
14 Overlapping communication and computation by using a hybrid MPI/SMPSs approach
14 Decomposable and responsive power models for multicore processors using performance counters
12 Cache oblivious parallelograms in iterative stencil computations
12 An empirically tuned 2D and 3D FFT library on CUDA GPU
10 The auction: optimizing banks usage in Non-Uniform Cache Architectures
10 InterferenceRemoval: removing interference of disk access for MPI programs through data replication
9 Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine
9 An experimental approach to performance measurement of heterogeneous parallel applications using CUDA
8 An approach to resource-aware co-scheduling for CMPs
7 Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application
6 A query language for understanding component interactions in production systems
6 Indemics: an interactive data intensive framework for high performance epidemic simulation
5 Optimal bucket algorithms for large MPI collectives on torus interconnects
5 High-throughput Bayesian network learning using heterogeneous multicore computers
5 Timing local streams: improving timeliness in data prefetching
5 Static reuse distances for locality-based optimizations in MATLAB
5 FPGA accelerating double/quad-double high precision floating-point applications for ExaScale computing
4 ParaLearn: a massively parallel, scalable system for learning interaction networks on FPGAs
4 How to unleash array optimizations on code using recursive data structures
3 Making nested parallel transactions practical using lightweight hardware support
3 SAMS multi-layout memory: providing multiple views of data to boost SIMD performance
3 Clustering performance data efficiently at massive scales
3 Speeding up Nek5000 with autotuning and specialization
2 Handling task dependencies under strided and aliased references
1 Fast and accurate NCBI BLASTP: acceleration with multiphase FPGA-based prefiltering
1 A compiler-automated array compression scheme for optimizing memory intensive programs
1 Small-ruleset regular expression matching on GPGPUs: quantitative performance analysis and optimization
0 Enigma: architectural and operating system support for reducing the impact of address translation
0 Adaptive multi-level cache allocation in distributed storage architectures