65 Model-driven autotuning of sparse matrix-vector multiply on GPUs
56 An adaptive performance modeling tool for GPU architectures
54 Is transactional programming actually easier?
54 NOrec: streamlining STM by abolishing ownership records
45 Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
35 Fast tridiagonal solvers on the GPU
22 Structure-driven optimizations for amorphous data-parallel programs
21 Scheduling support for transactional memory contention management
19 Featherweight X10: a core calculus for async-finish parallelism
19 A practical concurrent binary search tree
17 PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node
14 Load balancing on speed
13 The LOFAR correlator: implementation and performance analysis
13 Lazy binary-splitting: a run-time adaptive work-stealing scheduler
12 GAMBIT: effective unit testing for concurrency libraries
10 Improving parallelism and locality with asynchronous algorithms
10 Scaling LAPACK panel operations using parallel cache assignment
10 Analyzing lock contention in multithreaded applications
9 CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences
8 Leveraging parallel nesting in transactional memory
8 Scalable communication protocols for dynamic sparse data exchange
6 Debugging programs that use atomic blocks and transactional memory
6 Composable thread coloring
6 Input-driven dynamic execution prediction of streaming applications
5 Compiler aided selective lock assignment for improving the performance of software transactional memory
5 Helper locks for fork-join parallel programming
3 Using data structure knowledge for efficient lock generation and strong atomicity
3 Modeling advanced collective communication algorithms on cell-based systems
2 Thread to strand binding of parallel network applications in massive multi-threaded systems