Pre-prints / In submission


Parallel double greedy submodular maximization.
Xinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan
Advances in Neural Information Processing Systems (NIPS) 27, 2014.
December 2014.
[ abstract | bib ]
Many machine learning algorithms can be reduced to the maximization of submodular functions. Although well understood in the serial setting, the parallel maximization of submodular functions remains an open area of research with recent results only addressing monotone functions. The optimal algorithm for maximizing the more general class of non-monotone submodular functions was introduced by Buchbinder et al. (2012) and follows a strongly serial double-greedy logic and program analysis. In this work, we propose two methods to parallelize the double-greedy algorithm. The first, coordination-free approach emphasizes speed at the cost of a weaker approximation guarantee. The second, concurrency control approach guarantees a tight 1/2-approximation, at the quantifiable cost of additional coordination and reduced parallelism. As a consequence we explore the tradeoff space between guaranteed performance and objective optimality. We implement and evaluate both algorithms on multi-core hardware and billion edge graphs, demonstrating both the scalability and tradeoffs of each approach.
@article{pan-etal.nips2014doublegreedy1, author = {Pan, Xinghao and Jegelka, Stefanie and Gonzalez, Joseph E and Bradley, Joseph K and Jordan, Michael I}, title = {Parallel Double Greedy Submodular Maximization}, booktitle = {Advances in Neural Information Processing Systems 27}, year = {2014}, month = {December} }

Optimistic concurrency control for distributed unsupervised learning.
Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan
Advances in Neural Information Processing Systems (NIPS) 26, 2013.
December 2013.
[ abstract | bib | pdf | arXiv ]
Research on distributed machine learning algorithms has focused primarily on one of two extremes - algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this "optimistic concurrency control" paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment.
@article{pan-etal.nips2013occ1, author = {Pan, Xinghao and Gonzalez, Joseph E and Jegelka, Stefanie and Broderick, Tamara and Jordan, Michael I}, title = {Optimistic concurrency control for distributed unsupervised learning}, booktitle = {Advances in Neural Information Processing Systems 26}, pages = {1403--1411}, url = {} year = {2013}, month = {December} }

MLI: An API for distributed machine learning.
Evan Sparks, Ameet Talwalkar, Virginia Smith, Xinghao Pan, Joseph Gonzalez, Tim Kraska, Michael I. Jordan, and Michael J. Franklin
IEEE International Conference on Data Mining (ICDM).
December 2013.
[ abstract | bib | pdf ]
MLI is a Application Programming Interface de- signed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric com- puting. Its primary goal is to simplify the development of high- performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
@inproceedings{sparks-etal.icdm2013mli1, author = {Sparks, Evan and Talwalkar, Ameet and Smith, Virginia and Pan, Xinghao and Gonzalez, Joseph and Kraska, Tim and Jordan, Michael I and Franklin, Michael J}, title = {MLI: An API for distributed machine learning}, booktitle = {International Conference on Data Mining (ICDM)}, month = {December}, year = {2013}, organization = {IEEE} }

City-scale traffic estimation from a roving sensor network.
Javed Aslam, Sejoon Lim, Xinghao Pan, and Daniela Rus.
In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems (SenSys).
November 2012.
[ abstract | bib | pdf ]
Traffic congestion, volumes, origins, destinations, routes, and other road-network performance metrics are typically collected through survey data or via static sensors such as traffic cameras and loop detectors. This information is often out-of-date, difficult to collect and aggregate, difficult to analyze and quantify, or all of the above. In this paper we conduct a case study that demonstrates that it is possible to accurately infer traffic volume through data collected from a roving sensor network of taxi probes that log their locations and speeds at regular intervals. Our model and inference procedures can be used to analyze traffic patterns and conditions from historical data, as well as to infer current patterns and conditions from data collected in real-time. As such, our techniques provide a powerful new sensor network approach for traffic visualization, analysis, and urban planning.
@inproceedings{aslam-etal.sensys2012city1, author = {Aslam, Javed and Lim, Sejoon and Pan, Xinghao and Rus, Daniela}, title = {City-scale traffic estimation from a roving sensor network}, booktitle = {10th ACM Conference on Embedded Network Sensor Systems}, pages = {141--154}, month = {November}, year = {2012}, organization = {ACM} }

A cognitive system for adaptive decision making
Xinghao Pan, Loo Nin Teow, Kheng Hwee Tan, Ji Hua Brian Ang, and Gee Wah Ng.
In Proceedings of the 15th International Conference on Information Fusion (FUSION).
July 2012.
[ abstract | bib | pdf ]
This paper seeks to investigate how the existing knowledge in a cognitive system can be used to help fill-in an incomplete situation picture. This is motivated by the human brain’s innate ability to use new incoming information together with stored knowledge to fill in gaps of the whole picture in the mind. The research focus of this paper is on the inference or retrieval of information that is relevant to the current circumstances, but not directly obtainable from actual observations. The objective is to derive methods to help uncover as complete a situation picture as possible with high degree of confidence. Two computational approaches to enhance situation awareness for adaptive decision making given partial information are proposed. The proposed approaches enhanced an existing system Dynamic Bayesian Reasoning and Advanced Intelligent Network (D’Brain) which is a cognitive based dynamic reasoning system that uses Bayesian network as the underlying knowledge representation. Simulations are carried out to test the feasibility of the proposed approaches. The results show that they are promising approaches to fulfil the objectives in enhancing situation awareness for adaptive decision making.
@inproceedings{pan-etal.fusion2012cogsys1, title = {A cognitive system for adaptive decision making}, author = {Pan, Xinghao and Teow, Loo Nin and Tan, Kheng Hwee and Ang, Ji Hua Brian and Ng, Gee Wah}, booktitle = {15th International Conference on Information Fusion (FUSION)}, pages = {1323--1329}, month = {July}, year = {2012}, organization = {IEEE} }

MLbase: A distributed machine learning wrapper
Ameet Talwalkar, Tim Kraska, Rean Griffith, John Duchi, Joseph Gonzalez, Denny Britz, Xinghao Pan, Virginia Smith, Evan Sparks, Andre Wibisono, Michael J. Franklin, Michael I. Jordan.
In Proceedings of the Big Learning Workshop at NIPS (BigLearn).
December 2012.
[ abstract | bib | pdf ]
Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
@inproceedings{talwalkar-etal.biglearn2012mlbase1, title = {MLbase: A distributed machine learning wrapper}, author = {Talwalkar, Ameet and Kraska, Tim and Griffith, Rean and Duchi, John and Gonzalez, Joseph and Britz, Denny and Pan, Xinghao and Smith, Virginia and Sparks, Evan and Wibisono, Andre and Franklin, Michael J and Jordan, Michael I}, booktitle = {Big Learning Workshop at NIPS}, month = {December}, year = {2012}, organization = {NIPS} }

ASDF: An automated, online framework for diagnosing performance problems
Keith Bare, Soila Kavulya, Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Michael Kasick, Rajeev Gandhi, Priya Narasimhan.
Architecting Dependable Systems VII, Lecture Notes in Computer Science Volume 6420, Springer Berlin Heidelberg.
[ abstract | bib | pdf | tech report ]
Localizing performance problems (or fingerpointing) is essential for distributed systems such as Hadoop that support long-running, parallelized, data-intensive computations over a large cluster of nodes. Manual fingerpointing does not scale in such environments because of the number of nodes and the number of performance metrics to be analyzed on each node. ASDF is an automated, online fingerpointing framework that transparently extracts and parses different time-varying data sources (e.g., sysstat, Hadoop logs) on each node, and implements multiple techniques (e.g., log analysis, correlation, clustering) to analyze these data sources jointly or in isolation. We demonstrate ASDF’s online fingerpointing for documented performance problems in Hadoop, under different workloads; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time, and exhibits average online fingerpointing latencies of less than 1 minute with false-positive rates of less than 1%.
@techreport{bare-etal.pdl2008asdf1, title = {ASDF: An automated, online framework for diagnosing performance problems}, author = {Bare, Keith and Kasick, Mike and Kavulya, Soila and Marinelli, Eugene and Pan, Xinghao and Tan, Jiaqi and Gandhi, Rajeev and Narasimhan, Priya}, month = {May}, year = {2008}, number = {CMU-PDL-08-104}, institution = {Carnegie Mellon University, Parallel Data Lab} }

Kahuna: Problem diagnosis for mapreduce-based cloud computing environments
Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan.
In Proceedings of Network Operations and Management Symposium (NOMS).
April 2010.
[ abstract | bib | pdf ]
We present Kahuna, an approach that aims to diagnose performance problems in MapReduce systems. Central to Kahuna’s approach is our insight on peer similarly, i.e., that nodes behave alike in the absence of performance problems, and that a node that behaves differently is the likely culprit of a performance problem. Kahuna incorporates techniques to statistically compare black-box (OS-level performance metrics) and white-box (Hadoop-log statistics) data across the different nodes of a MapReduce cluster, in order to identify the faulty node(s). We motivate our peer-similarity observations through concrete evidence from the 4000-processor Yahoo! M45 Hadoop cluster. In addition, we demonstrate Kahuna’s effectiveness through experimental evaluation of its algorithms for a number of reported performance problems, on four different workloads (including Nutch and Pig) in a 100-node Hadoop cluster hosted on Amazon’s EC2 datacenter.
@inproceedings{tan-etal.noms2010kahuna1, author = {Tan, Jiaqi and Pan, Xinghao and Marinelli, Eugene and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {Network Operations and Management Symposium (NOMS)}, pages = {112--119}, month = {April}, year = {2010}, organization = {IEEE} }

Ganesha: Black-box diagnosis of MapReduce systems
Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan.
ACM SIGMETRICS Performance Evaluation Review (PER), Vol 37, number 3, pages 8-13.
January 2010.
[ abstract | bib | pdf ]
Ganesha aims to diagnose faults transparently (in a blackbox manner) in MapReduce systems, by analyzing OS-level metrics. Ganesha's approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. We evaluate Ganesha by diagnosing Hadoop problems for the Gridmix Hadoop benchmark on 10-node and 50-node MapReduce clusters on Amazon's EC2. We also candidly highlight faults that escape Ganesha's diagnosis.
@article{pan-etal.per2010ganesha1, title = {Ganesha: blackBox diagnosis of MapReduce systems}, author = {Pan, Xinghao and Tan, Jiaqi and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, journal = {ACM SIGMETRICS Performance Evaluation Review}, volume = {37}, number = {3}, pages = {8--13}, month = {January}, year = {2010}, publisher = {ACM} }

Mochi: Visual log-analysis based tools for debugging hadoop
Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan.
In Proceedings of Hot Topics in Cloud Computing (HotCloud).
June 2009.
[ abstract | bib | pdf ]
Mochi, a new visual, log-analysis based debugging tool correlates Hadoop’s behavior in space, time and volume, and extracts a causal, unified control- and data-flow model of Hadoop across the nodes of a cluster. Mochi’s analysis produces visualizations of Hadoop’s behavior using which users can reason about and debug performance issues. We provide examples of Mochi’s value in revealing a Hadoop job’s structure, in optimizing real-world workloads, and in identifying anomalous Hadoop behavior, on the Yahoo! M45 Hadoop cluster.
@inproceedings{tan-etal.hotcloud2009mochi1, title={Mochi: Visual log-analysis based tools for debugging hadoop}, author = {Tan, Jiaqi and Pan, Xinghao and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {Hot Topics in Cloud Computing}, month = {June}, year = {2009}, organization = {USENIX Association} }

SALSA: Analyzing logs as state machines
Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan.
In Proceedings of First USENIX Workshop on Analysis of System Logs (WASL).
December 2008.
[ abstract | bib | pdf ]
SALSA examines system logs to derive state-machine views of the sytem's execution, along with control-flow, data-flow models and related statistics. Exploiting SALSA's derived views and statistics, we can effectively construct higher-level useful analyses. We demonstrate SALSA's approach by analyzing system logs generated in a Hadoop cluster, and then illustrate SALSA's value by developing visualization and failure-diagnosis techniques, for three different Hadoop workloads, based on our derived state-machine views and statistics.
@inproceedings{tan-etal.wasl2008salsa1, title = {SALSA: Analyzing logs as state machines}, author = {Tan, Jiaqi and Pan, Xinghao and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {First USENIX Workshop on Analysis of System Logs}, month = {December}, year = {2008}, organization = {USENIX Association} }

Theses / Tech Reports

The blind men and the elephant: Piecing together Hadoop for diagnosis
Xinghao Pan.
Masters Thesis, School of Computer Science, Carnegie Mellon University.
May 2009.
[ abstract | bib | pdf ]
Google's MapReduce framework enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller (Map and Reduce) tasks and a massive data-set into smaller partitions, such that each task processes a different partition in parallel. However, performance problems in a distributed MapReduce system can be hard to diagnose and to localize to a specific node or a set of nodes. On the other hand, the structure of large number of nodes performing similar tasks naturally affords us opportunities for observing the system from multiple viewpoints.
We present a "Blind Men and the Elephant" (BliMeE) framework in which we exploit this structure, and demonstrate how problems in a MapReduce system can be diagnose by corroborating the multiple viewpoints. More specifically, we present algorithms within the BliMeE framework based on OS-level performance counters, on white-box metrics extracted from logs, and on application-level heartbeats. We show that our BliMeE algorithms are able to capture a variety of faults including resource hogs and application hangs, and to localize the fault to subsets of slave nodes in the MapReduce system.
In addition, we discuss how the diagnostic algorithms' outcomes can be further synthesized in a repeated application of the BliMeE approach. We present a simple supervised learning technique which allows us to identify a fault if it has been previously observed.
@mastersthesis{pan.masters2009blind1, title = {The Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis}, author = {Pan, Xinghao}, month = {May}, year = {2009}, school = {Carnegie Mellon University} }

Object recognition tools for educational robots
Xinghao Pan.
Senior Research Thesis, School of Computer Science, Carnegie Mellon University.
May 2008.
[ abstract | bib | pdf ]
SIFT (Scale Invariant Feature Transform) features, developed by David G. Lowe, have been found to be highly robust to translations, rotations and scaling, and have been the solution of choice for many when dealing with problems of robotic vision and object recognition. Object recognition using SIFT features involves detection of SIFT features in an image, and matching those features against features in a structured database of object images. However, to the best of our knowledge, there are currently no open-source tools to conveniently perform these tasks. This research aims to develop SIFT-based object recognition tools for use by students in undergraduate robotics courses. The tools will allow students to gain a basic understanding of SIFT, but also abstract away from the actual implementation. Hence, students will be able to focus on solving higher level robotics problems.
@techreport{pan.seniorthesis2008object1, title = {Object recognition tools for educational robots}, author = {Pan, Xinghao}, month = {May}, year = {2008}, institution = {Carnegie Mellon University} }

Talks / Posters

Optimistic concurrency control for distributed unsupervised learning.
Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan.
Talk / Poster @ Bay Area Machine Learning Symposium (BayLearn).
August 28, 2013.
[ talk | poster ]

© 2013 Xinghao Pan
Template design by Andreas Viklund / Best hosted at