I'm a sixth-year graduate student working in the AMP Lab at UC Berkeley. I plan to graduate in the Fall of 2012 and will be doing a post-doc with the F1 team at Google starting in January 2013. My research interests broadly include large-scale distributed storage systems and cloud computing. More specifically, my thesis focuses on scale independence, an alternative to traditional cost-based optimization that guarantees predictable performance for applications querying ever-growing datasets. I am advised by Armando Fox, Michael Franklin and David Patterson.
PIQL: Scale Independent Relational Query Processing
Collaborators: Kristal Curtis, Tim Kraska, Nick Lanham, Stephen Tu, Armando Fox, Michael Franklin, David Patterson
Rapidly growing data volumes have led many developers to abandon traditional relational databases in favor of distributed key/value stores and map/reduce programs. While these alternatives often provide trivial scalability, they lack many of the benefits of high-level declarative languages, such as optimization and data independence. Instead, we propose extending the relational model with scale independence, a new type of data independence that ensures consistent performance for all queries in an application, independent of the data size. Our implementation, PIQL, provides a scale-independent relational system on top of existing distributed key/value stores by changing the objective function for optimization and automatically selecting and maintaining required indexes and materialized views. The PIQL system also integrates with the Scala compiler to provide language-integrated schema specification and a LINQ-like query language.
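As a rough illustration of what "independent of the data size" can mean in practice, the toy sketch below computes a static upper bound on the tuples a query plan can touch and treats a plan as scale independent only when such a bound exists. This is not PIQL's optimizer or API: the classes `IndexScan` and `LookupJoin`, the functions `max_tuples` and `is_scale_independent`, and the index names are all hypothetical, and Python is used only for brevity (PIQL itself is written in Scala).

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class IndexScan:
    """Range scan over an index; `limit` caps the tuples read (None = no cap)."""
    index: str
    limit: Optional[int]


@dataclass(frozen=True)
class LookupJoin:
    """For each child tuple, fetch at most `fanout` matches (None = no cap)."""
    child: "Plan"
    fanout: Optional[int]


Plan = Union[IndexScan, LookupJoin]


def max_tuples(plan: Plan) -> Optional[int]:
    """Return a static bound on tuples produced, or None if it grows with the data."""
    if isinstance(plan, IndexScan):
        return plan.limit
    if isinstance(plan, LookupJoin):
        inner = max_tuples(plan.child)
        if inner is None or plan.fanout is None:
            return None
        return inner * plan.fanout
    raise TypeError(f"unknown plan node: {plan!r}")


def is_scale_independent(plan: Plan) -> bool:
    """A plan qualifies only if its work is bounded regardless of table size."""
    return max_tuples(plan) is not None


# Hypothetical queries: a timeline with capped subscriptions and posts per
# subscription, and a scan over every user with no cap at all.
timeline = LookupJoin(IndexScan("subscriptions_by_user", limit=100), fanout=10)
all_users = IndexScan("users", limit=None)

timeline_bound = max_tuples(timeline)
```

Under these assumptions the timeline plan is accepted because its bound is the product of its two caps, while the uncapped scan would be rejected no matter how small the table currently is.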
- Generalized Scale Independence Through Incremental Precomputation (SIGMOD'12)
- PIQL: Success-Tolerant Query Processing in the Cloud (VLDB'11)
- The Case For PIQL: A Performance Insightful Query Language (SOCC'10)
The source code is available in the piql subproject of the SCADS repository on GitHub.
SCADS: Scalable Consistency Adjustable Data Storage
Collaborators: Peter Bodík, Tim Kraska, Nick Lanham, Gene Pang, Beth Trushkowsky, Stephen Tu, Armando Fox, Michael Franklin, David Patterson
SCADS is a research prototype key/value store written in Scala. Built using BDB-JE, its design is focused on modularity and easy deployment for running experiments. The system has served as the storage system for the director (FAST'11), the PIQL execution engine, the RAD Lab Stack, and the multi-datacenter concurrency control project.
The source code is available in the SCADS repository on GitHub.
RAD Lab Stack
Collaborators: Allen Chen, Kristal Curtis, Amber Feng, Karl He, Rean Griffith, Andy Konwinski, Justin Ma, Sunil Pedapudi, Ari Rabkin, Beth Trushkowsky, Matei Zaharia
Before the AMP Lab, I was a member of the Reliable Adaptive Distributed Systems Lab (RAD Lab). The lab's moon-shot vision statement was to enable a single person to design, analyze, deploy and operate the next multi-million-user website in a single weekend. I led the effort to integrate the lab's various projects, including SCADS, PIQL, Mesos, the director, Spark, and deploylib, into a single unified demo stack. At the end-of-project celebration on February 24th, 2011, we demonstrated three web applications written by undergrads, including one completed the previous weekend. Using the stack, we scaled them to 300+ EC2 instances over the course of an afternoon.
The source code is available in the demo branch of the SCADS repository on GitHub.
Deploylib is a Scala DSL for deploying experiments and other software on clusters of machines, including Amazon's EC2. It was used to run the experiments for the PIQL and director papers as well as the RAD Lab Final Demo. It provides developers with the following constructs:
- Concise syntax for executing ssh commands on remote machines through scripts or from the Scala console
- Automatic handling of transient failures
- Typed functions for running common commands like ps, ls, jps, etc.
- Intelligent distribution of files using NFS or S3
- Parallel extensions for common collection operations (pforeach, pmap, pflatMap)
- A Mesos framework for deploying services that run on the JVM
- Automatic dependency management for projects built with SBT
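
Two of the constructs above, retrying transient failures and running an operation across many machines in parallel, can be sketched in a few lines. The code below is only an illustration of the pattern and does not reproduce Deploylib's API: `TransientError`, `with_retries`, this `pmap`, the host names, and the simulated `run_uptime` command are all assumptions made for the example, and Python stands in for Scala.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TransientError(Exception):
    """Stand-in for a recoverable failure such as a dropped ssh connection."""


def with_retries(action, attempts=3, delay_s=0.0):
    """Call `action`, retrying on TransientError up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            time.sleep(delay_s)


def pmap(items, fn, max_workers=8):
    """Apply `fn` to every item in parallel, returning results in input order."""
    items = list(items)
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(items))) as pool:
        return list(pool.map(fn, items))


# Simulated cluster: each hypothetical host fails once before succeeding.
hosts = ["node1", "node2", "node3"]
failures_left = {host: 1 for host in hosts}


def run_uptime(host):
    """Pretend to run a remote command; the first call per host fails."""
    if failures_left[host] > 0:
        failures_left[host] -= 1
        raise TransientError(f"connection to {host} dropped")
    return f"{host}: ok"


results = pmap(hosts, lambda host: with_retries(lambda: run_uptime(host)))
```

Keeping the retry wrapper separate from the parallel map means either can be used on its own, for example retrying a single flaky command from an interactive console.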