| 65 | Model-driven autotuning of sparse matrix-vector multiply on GPUs |
| 56 | An adaptive performance modeling tool for GPU architectures |
| 54 | Is transactional programming actually easier? |
| 54 | NOrec: streamlining STM by abolishing ownership records |
| 45 | Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? |
| 35 | Fast tridiagonal solvers on the GPU |
| 22 | Structure-driven optimizations for amorphous data-parallel programs |
| 21 | Scheduling support for transactional memory contention management |
| 19 | Featherweight X10: a core calculus for async-finish parallelism |
| 19 | A practical concurrent binary search tree |
| 17 | PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node |
| 14 | Load balancing on speed |
| 13 | The LOFAR correlator: implementation and performance analysis |
| 13 | Lazy binary-splitting: a run-time adaptive work-stealing scheduler |
| 12 | GAMBIT: effective unit testing for concurrency libraries |
| 10 | Improving parallelism and locality with asynchronous algorithms |
| 10 | Scaling LAPACK panel operations using parallel cache assignment |
| 10 | Analyzing lock contention in multithreaded applications |
| 9 | CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences |
| 8 | Leveraging parallel nesting in transactional memory |
| 8 | Scalable communication protocols for dynamic sparse data exchange |
| 6 | Debugging programs that use atomic blocks and transactional memory |
| 6 | Composable thread coloring |
| 6 | Input-driven dynamic execution prediction of streaming applications |
| 5 | Compiler aided selective lock assignment for improving the performance of software transactional memory |
| 5 | Helper locks for fork-join parallel programming |
| 3 | Using data structure knowledge for efficient lock generation and strong atomicity |
| 3 | Modeling advanced collective communication algorithms on cell-based systems |
| 2 | Thread to strand binding of parallel network applications in massive multi-threaded systems |