227 A NUCA substrate for flexible CMP cache sharing
96 A performance-conserving approach for reducing peak power consumption in server systems
75 System noise, OS clock ticks, and fine-grained parallel applications
74 Optimization of MPI collective communication on BlueGene/L systems
63 Facilitating the search for compositions of program transformations
61 Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation
60 Online performance analysis by statistical sampling of microprocessor performance counters
57 Cache oblivious stencil computations
54 Automatic generation and tuning of MPI collective communication routines
46 Low-overhead call path profiling of unmodified, optimized code
43 Towards automatic translation of OpenMP to MPI
40 Improved automatic testcase synthesis for performance model validation
37 Disk layout optimization for reducing energy consumption
33 An integrated simdization framework using virtual vectors
30 Automatic thread distribution for nested parallelism in OpenMP
30 Think globally, search locally
26 Continuous replica placement schemes in distributed systems
25 Multigrain parallel Delaunay mesh generation: challenges and opportunities for multithreaded architectures
25 Affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system
24 Power-aware resource allocation in high-end systems via online simulation
23 TAPE: a transactional application profiling environment
22 High performance support of parallel virtual file system (PVFS2) over Quadrics
21 Lightweight reference affinity analysis
19 Thread-level speculation on a CMP can be energy efficient
18 What is worth learning from parallel workloads?: a user and session based analysis
17 Generating new general compiler optimization settings
14 Improving the computational intensity of unstructured mesh applications
13 Transparent caching with strong consistency in dynamic content web sites
12 A hybrid hardware/software approach to efficiently determine cache coherence bottlenecks
11 Fast branch misprediction recovery in out-of-order superscalar processors
10 An asymmetric clustered processor based on value content
10 The implications of working set analysis on supercomputing memory hierarchy design
9 The architecture of the HP Superdome shared-memory multiprocessor
8 Characterization of L3 cache behavior of SPECjAppServer2002 and TPC-C
8 Low-power, low-complexity instruction issue using compiler assistance
8 Scaling physics and material science applications on a massively parallel Blue Gene/L system
7 A heterogeneously segmented cache architecture for a packet forwarding engine
5 Tornado warning: the perils of selective replay in multithreaded processors
4 Design of a next generation sampling service for large scale data analysis applications
4 Another approach to backfilled jobs: applying virtual malleability to expired windows
3 Reducing latencies of pipelined cache accesses through set prediction
3 Parallel sparse LU factorization on second-class message passing platforms