342 Benchmarking GPUs to tune dense linear algebra
210 The cost of doing science on the cloud: the Montage example
184 Entering the petaflop era: the architecture and performance of Roadrunner
177 Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
118 High performance discrete Fourier transforms on graphics processors
71 Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
69 Adapting a message-driven parallel application to GPU-accelerated clusters
63 EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks
58 Server-storage virtualization: integration and load balancing in data centers
58 Proactive process-level live migration in HPC environments
51 Toward loosely coupled programming on petascale systems
48 Characterizing application sensitivity to OS interference using kernel-level noise injection
44 A novel migration-based NUCA design for chip multiprocessors
37 Programming the Intel 80-core network-on-a-chip terascale processor
36 Scaling parallel I/O performance through I/O delegate and caching system
33 Communication avoiding Gaussian elimination
33 Nimrod/K: towards massively parallel dynamic grid workflows
31 Parallel I/O prefetching using MPI file caching and I/O signatures
31 Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark
31 Efficient management of data center resources for massively multiplayer online games
27 Hiding I/O latency with pre-execution prefetching for parallel applications
27 Early evaluation of IBM BlueGene/P
26 Dynamically adapting file domain partitioning methods for collective I/O based on underlying parallel file system locking protocols
25 Massively parallel genomic sequence search on the Blue Gene/P architecture
24 Scalable adaptive Mantle Convection Simulation on Petascale Supercomputers
23 Massively parallel volume rendering using 2-3 swap image compositing
22 A dynamic scheduler for balancing HPC applications
21 An adaptive cut-off for task parallelism
21 Lessons learned at 208K: towards debugging millions of cores
21 Scientific application-based performance comparison of SGI Altix 4700, IBM POWER5+, and SGI ICE 8200 supercomputers
19 0.374 Pflop/s Trillion-particle Particle-in-cell Modeling of Laser Plasma Interactions on Roadrunner
19 Feedback-controlled resource sharing for predictable eScience
18 A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories
18 SMARTMAP: operating system support for efficient data sharing among processes on a multi-core processor
17 BitDew: a programmable environment for large-scale data management and distribution
17 Scalable load-balance measurement for SPMD codes
17 PAM: a novel performance/power aware meta-scheduler for multi-core systems
17 High performance multivariate visual data exploration for extremely large data
17 High-frequency Simulations of Global Seismic Wave Propagation using SPECFEM3D_GLOBE on 62K Processors
15 369 Tflop/s Molecular Dynamics Simulations on the Roadrunner General-purpose Heterogeneous Supercomputer
15 Performance prediction of large-scale parallel system and application using macro-level simulation
15 Accelerating configuration interaction calculations for nuclear structure
13 Parallel exact inference on the cell broadband engine processor
12 Capturing performance knowledge for automated analysis
12 Performance optimization of TCP/IP over 10 gigabit ethernet by precise instrumentation
11 Positivity, posynomials and tile size selection
11 Dendro: parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees
10 Using overlays for efficient data transfer over shared wide-area networks
10 New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors
10 Wide-area performance profiling of 10GigE and InfiniBand technologies
9 A multi-level parallel simulation approach to electron transport in nano-scale transistors
9 Asymmetric interactions in symmetric multi-core systems: analysis, enhancements and evaluation
8 A novel domain oriented approach for scientific grid workflow composition
8 Applying double auctions for scheduling of workflows on the Grid
7 Materialized community ground models for large-scale earthquake simulation
7 High-radix crossbar switches enabled by proximity communication
7 An efficient parallel approach for identifying protein families in large-scale metagenomic data sets
7 Efficient auction-based grid reservations using dynamic programming
6 Using server-to-server communication in parallel file systems to simplify consistency and improve performance
5 Analysis of application heartbeats: learning structural and temporal features in time series data for identification of performance problems
4 The role of MPI in development time: a case study
3 Global trees: a framework for linked data structures on distributed memory parallel systems
2 Extending CC-NUMA systems to support write update optimizations
1 Prefetch throttling and data pinning for improving performance of shared caches
0 Linear Scaling Divide-and-conquer Electronic Structure Calculations for Thousand Atom Nanostructures