Sketching Big Data with Spark: Closing talk in Strata's Hardcore Data Science track. Discusses randomized and sketch algorithms for large-scale data analytics, including Bloom filters, frequent items, and stratified sampling. Strata NYC, Oct 2015.
Introduction to Spark. Guest lecture at Stanford's CS347 (Parallel and Distributed Data Management). May 2015.
Big Data Analytics Systems: What Goes Around Comes Around. Guest lecture at Berkeley's CS186 (Database Systems). Apr 2015.
Spark in 2015 and Beyond. Opening talk for Spark Forum at ApacheCon. Apr 2015.
Spark DataFrames for Large-scale Data Science. DataFrame introduction at Bay Area Spark User Meetup. Feb 2015.
Interfaces, Interfaces, Interfaces. On interface design at Databricks Retreat.
Readings in Databases: I maintain an online list of papers essential to understanding database systems.
Tungsten: Bringing Spark Closer to Bare Metal: A re-architecture of Spark's execution engine to substantially improve memory and CPU efficiency.
DataFrames: Rethinking how we can make Spark 100X easier to use for data scientists and engineers.
Apache Spark: The leading next-generation distributed data processing engine.
Shark: An open source SQL query engine. It uses Spark as the physical execution engine and can run Hive QL queries up to 100x faster without losing the fault-tolerance and scale-out properties of MapReduce. Shark has been subsumed by Spark SQL.
GraphX: Proposing a new way to think about graph computation.
CrowdDB: A pioneering database system that incorporates crowd-sourced query processing. The project presents a vision in which humans are simply resources that database systems can use to answer queries.