Publications.

Pre-prints / In submission


Refereed

Scaling up correlation clustering through parallelism and concurrency control.
Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan.
NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML).
December 2014.
[ abstract | bib ]
Given a similarity graph between items, correlation clustering (CC) aims to group similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: a simple peeling scheme that offers a 3-approximation ratio. Unfortunately, KwikCluster is inherently sequential and can require a large number of peeling rounds. This can be a significant bottleneck when scaling up to big graphs. Recent proposals to parallelize KwikCluster encounter challenges in scaling up, and sometimes sacrifice the 3-approximation factor. We present C4, a parallel CC algorithm that obtains a 3-approximation ratio, has limited overheads, and gracefully scales up to billion-edge graphs. The main idea behind C4 is running multiple peeling threads concurrently, while ensuring consistency among them with little overhead. We enforce consistency through concurrency control, a popular paradigm in database research. We provide extensive experiments and demonstrate that C4 can scale up to billion-edge graphs, where it outputs a clustering in a few seconds.
@inproceedings{pan-etal.nips2014corrclus1, title = {Scaling up correlation clustering through parallelism and concurrency control}, author = {Pan, Xinghao and Papailiopoulos, Dimitris and Recht, Benjamin and Ramchandran, Kannan and Jordan, Michael I}, booktitle = {NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML)}, year = {2014}, month = {December} }


Parallel double greedy submodular maximization.
Xinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan.
Advances in Neural Information Processing Systems (NIPS) 27.
December 2014.
[ abstract | bib | pdf ]
Many machine learning algorithms can be reduced to the maximization of submodular functions. Although well understood in the serial setting, the parallel maximization of submodular functions remains an open area of research with recent results only addressing monotone functions. The optimal algorithm for maximizing the more general class of non-monotone submodular functions was introduced by Buchbinder et al. (2012) and follows a strongly serial double-greedy logic. In this work, we propose two methods to parallelize the double-greedy algorithm. The first, coordination-free approach emphasizes speed at the cost of a weaker approximation guarantee. The second, concurrency control approach guarantees a tight 1/2-approximation, at the quantifiable cost of additional coordination and reduced parallelism. As a consequence we explore the tradeoff space between guaranteed performance and objective optimality. We implement and evaluate both algorithms on multi-core hardware and billion-edge graphs, demonstrating both the scalability and tradeoffs of each approach.
@incollection{pan-etal.nips2014doublegreedy1, title = {Parallel double greedy submodular maximization}, author = {Pan, Xinghao and Jegelka, Stefanie and Gonzalez, Joseph E and Bradley, Joseph K and Jordan, Michael I}, booktitle = {Advances in Neural Information Processing Systems 27}, year = {2014}, month = {December} }


Optimistic concurrency control for distributed unsupervised learning.
Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan.
Advances in Neural Information Processing Systems (NIPS) 26.
December 2013.
[ abstract | bib | pdf | arXiv ]
Research on distributed machine learning algorithms has focused primarily on one of two extremes: algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and, if conflicts do arise, a conflict-resolution protocol is invoked. We view this "optimistic concurrency control" paradigm as particularly appropriate for large-scale machine learning algorithms, especially in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment.
@incollection{pan-etal.nips2013occ1, title = {Optimistic concurrency control for distributed unsupervised learning}, author = {Pan, Xinghao and Gonzalez, Joseph E and Jegelka, Stefanie and Broderick, Tamara and Jordan, Michael I}, booktitle = {Advances in Neural Information Processing Systems 26}, pages = {1403--1411}, url = {http://papers.nips.cc/paper/5038-optimistic-concurrency-control-for-distributed-unsupervised-learning.pdf}, year = {2013}, month = {December} }


MLI: An API for distributed machine learning.
Evan Sparks, Ameet Talwalkar, Virginia Smith, Xinghao Pan, Joseph Gonzalez, Tim Kraska, Michael I. Jordan, and Michael J. Franklin.
IEEE International Conference on Data Mining (ICDM).
December 2013.
[ abstract | bib | pdf ]
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
@article{sparks-etal.icdm2013mli1, title = {MLI: An API for distributed machine learning}, author = {Sparks, Evan and Talwalkar, Ameet and Smith, Virginia and Pan, Xinghao and Gonzalez, Joseph and Kraska, Tim and Jordan, Michael I and Franklin, Michael J}, journal = {2013 IEEE 13th International Conference on Data Mining}, volume = {0}, issn = {1550-4786}, month = {December}, year = {2013}, pages = {1187-1192}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA} }


MLbase: A distributed machine learning wrapper.
Ameet Talwalkar, Tim Kraska, Rean Griffith, John Duchi, Joseph Gonzalez, Denny Britz, Xinghao Pan, Virginia Smith, Evan Sparks, Andre Wibisono, Michael J. Franklin, and Michael I. Jordan.
Big Learning Workshop at NIPS (BigLearn).
December 2012.
[ abstract | bib | pdf ]
Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
@inproceedings{talwalkar-etal.biglearn2012mlbase1, title = {MLbase: A distributed machine learning wrapper}, author = {Talwalkar, Ameet and Kraska, Tim and Griffith, Rean and Duchi, John and Gonzalez, Joseph and Britz, Denny and Pan, Xinghao and Smith, Virginia and Sparks, Evan and Wibisono, Andre and Franklin, Michael J and Jordan, Michael I}, booktitle = {Big Learning Workshop at NIPS}, month = {December}, year = {2012}, organization = {NIPS} }


City-scale traffic estimation from a roving sensor network.
Javed Aslam, Sejoon Lim, Xinghao Pan, and Daniela Rus.
10th ACM Conference on Embedded Network Sensor Systems (SenSys).
November 2012.
[ abstract | bib | pdf ]
Traffic congestion, volumes, origins, destinations, routes, and other road-network performance metrics are typically collected through survey data or via static sensors such as traffic cameras and loop detectors. This information is often out-of-date, difficult to collect and aggregate, difficult to analyze and quantify, or all of the above. In this paper we conduct a case study that demonstrates that it is possible to accurately infer traffic volume through data collected from a roving sensor network of taxi probes that log their locations and speeds at regular intervals. Our model and inference procedures can be used to analyze traffic patterns and conditions from historical data, as well as to infer current patterns and conditions from data collected in real-time. As such, our techniques provide a powerful new sensor network approach for traffic visualization, analysis, and urban planning.
@inproceedings{aslam-etal.sensys2012city1, title = {City-scale traffic estimation from a roving sensor network}, author = {Aslam, Javed and Lim, Sejoon and Pan, Xinghao and Rus, Daniela}, booktitle = {10th ACM Conference on Embedded Network Sensor Systems}, pages = {141--154}, month = {November}, year = {2012}, organization = {ACM} }


A cognitive system for adaptive decision making.
Xinghao Pan, Loo Nin Teow, Kheng Hwee Tan, Ji Hua Brian Ang, and Gee Wah Ng.
15th International Conference on Information Fusion (FUSION).
July 2012.
[ abstract | bib | pdf ]
This paper investigates how the existing knowledge in a cognitive system can be used to help fill in an incomplete situation picture. This is motivated by the human brain’s innate ability to use new incoming information together with stored knowledge to fill in gaps of the whole picture in the mind. The research focus of this paper is on the inference or retrieval of information that is relevant to the current circumstances, but not directly obtainable from actual observations. The objective is to derive methods to help uncover as complete a situation picture as possible with a high degree of confidence. Two computational approaches to enhance situation awareness for adaptive decision making given partial information are proposed. The proposed approaches enhance an existing system, Dynamic Bayesian Reasoning and Advanced Intelligent Network (D’Brain), a cognitive-based dynamic reasoning system that uses Bayesian networks as the underlying knowledge representation. Simulations are carried out to test the feasibility of the proposed approaches. The results show that they are promising approaches for enhancing situation awareness for adaptive decision making.
@inproceedings{pan-etal.fusion2012cogsys1, title = {A cognitive system for adaptive decision making}, author = {Pan, Xinghao and Teow, Loo Nin and Tan, Kheng Hwee and Ang, Ji Hua Brian and Ng, Gee Wah}, booktitle = {15th International Conference on Information Fusion (FUSION)}, pages = {1323--1329}, month = {July}, year = {2012}, organization = {IEEE} }


ASDF: An automated, online framework for diagnosing performance problems.
Keith Bare, Soila Kavulya, Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Michael Kasick, Rajeev Gandhi, and Priya Narasimhan.
Architecting Dependable Systems VII, Lecture Notes in Computer Science Volume 6420, Springer Berlin Heidelberg.
2010.
[ abstract | bib | pdf | tech report ]
Localizing performance problems (or fingerpointing) is essential for distributed systems such as Hadoop that support long-running, parallelized, data-intensive computations over a large cluster of nodes. Manual fingerpointing does not scale in such environments because of the number of nodes and the number of performance metrics to be analyzed on each node. ASDF is an automated, online fingerpointing framework that transparently extracts and parses different time-varying data sources (e.g., sysstat, Hadoop logs) on each node, and implements multiple techniques (e.g., log analysis, correlation, clustering) to analyze these data sources jointly or in isolation. We demonstrate ASDF’s online fingerpointing for documented performance problems in Hadoop, under different workloads; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time, and exhibits average online fingerpointing latencies of less than 1 minute with false-positive rates of less than 1%.
@incollection{bare-etal.pdl2008asdf1, title = {ASDF: An automated, online framework for diagnosing performance problems}, author = {Bare, Keith and Kavulya, Soila P. and Tan, Jiaqi and Pan, Xinghao and Marinelli, Eugene and Kasick, Michael and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {Architecting Dependable Systems VII}, volume = {6420}, series = {Lecture Notes in Computer Science}, publisher = {Springer Berlin Heidelberg}, pages = {201-226}, year = {2010}, isbn = {978-3-642-17244-1}, language = {English} }


Kahuna: Problem diagnosis for MapReduce-based cloud computing environments.
Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan.
Network Operations and Management Symposium (NOMS).
April 2010.
[ abstract | bib | pdf ]
We present Kahuna, an approach that aims to diagnose performance problems in MapReduce systems. Central to Kahuna’s approach is our insight on peer similarity, i.e., that nodes behave alike in the absence of performance problems, and that a node that behaves differently is the likely culprit of a performance problem. Kahuna incorporates techniques to statistically compare black-box (OS-level performance metrics) and white-box (Hadoop-log statistics) data across the different nodes of a MapReduce cluster, in order to identify the faulty node(s). We motivate our peer-similarity observations through concrete evidence from the 4000-processor Yahoo! M45 Hadoop cluster. In addition, we demonstrate Kahuna’s effectiveness through experimental evaluation of its algorithms for a number of reported performance problems, on four different workloads (including Nutch and Pig) in a 100-node Hadoop cluster hosted on Amazon’s EC2 datacenter.
@inproceedings{tan-etal.noms2010kahuna1, title = {Kahuna: Problem diagnosis for {M}ap{R}educe-based cloud computing environments}, author = {Tan, Jiaqi and Pan, Xinghao and Marinelli, Eugene and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {Network Operations and Management Symposium (NOMS)}, pages = {112--119}, month = {April}, year = {2010}, organization = {IEEE} }


Ganesha: Black-box diagnosis of MapReduce systems.
Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan.
ACM SIGMETRICS Performance Evaluation Review (PER), vol. 37, no. 3, pages 8-13.
January 2010.
[ abstract | bib | pdf ]
Ganesha aims to diagnose faults transparently (in a blackbox manner) in MapReduce systems, by analyzing OS-level metrics. Ganesha's approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. We evaluate Ganesha by diagnosing Hadoop problems for the Gridmix Hadoop benchmark on 10-node and 50-node MapReduce clusters on Amazon's EC2. We also candidly highlight faults that escape Ganesha's diagnosis.
@article{pan-etal.per2010ganesha1, title = {Ganesha: Black-box diagnosis of {M}ap{R}educe systems}, author = {Pan, Xinghao and Tan, Jiaqi and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, journal = {ACM SIGMETRICS Performance Evaluation Review}, volume = {37}, number = {3}, pages = {8--13}, month = {January}, year = {2010}, publisher = {ACM} }


Mochi: Visual log-analysis based tools for debugging Hadoop.
Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan.
Hot Topics in Cloud Computing (HotCloud).
June 2009.
[ abstract | bib | pdf ]
Mochi, a new visual, log-analysis based debugging tool, correlates Hadoop’s behavior in space, time and volume, and extracts a causal, unified control- and data-flow model of Hadoop across the nodes of a cluster. Mochi’s analysis produces visualizations of Hadoop’s behavior, using which users can reason about and debug performance issues. We provide examples of Mochi’s value in revealing a Hadoop job’s structure, in optimizing real-world workloads, and in identifying anomalous Hadoop behavior, on the Yahoo! M45 Hadoop cluster.
@inproceedings{tan-etal.hotcloud2009mochi1, title = {Mochi: Visual log-analysis based tools for debugging {H}adoop}, author = {Tan, Jiaqi and Pan, Xinghao and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {Hot Topics in Cloud Computing}, month = {June}, year = {2009}, organization = {USENIX Association} }


SALSA: Analyzing logs as state machines.
Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan.
First USENIX Workshop on Analysis of System Logs (WASL).
December 2008.
[ abstract | bib | pdf ]
SALSA examines system logs to derive state-machine views of the system's execution, along with control-flow and data-flow models and related statistics. Exploiting SALSA's derived views and statistics, we can effectively construct higher-level useful analyses. We demonstrate SALSA's approach by analyzing system logs generated in a Hadoop cluster, and then illustrate SALSA's value by developing visualization and failure-diagnosis techniques, for three different Hadoop workloads, based on our derived state-machine views and statistics.
@inproceedings{tan-etal.wasl2008salsa1, title = {SALSA: Analyzing logs as state machines}, author = {Tan, Jiaqi and Pan, Xinghao and Kavulya, Soila and Gandhi, Rajeev and Narasimhan, Priya}, booktitle = {First USENIX Workshop on Analysis of System Logs}, month = {December}, year = {2008}, organization = {USENIX Association} }



Theses / Tech Reports

The blind men and the elephant: Piecing together Hadoop for diagnosis.
Xinghao Pan.
Master's Thesis, School of Computer Science, Carnegie Mellon University.
May 2009.
[ abstract | bib | pdf ]
Google's MapReduce framework enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller (Map and Reduce) tasks and a massive data-set into smaller partitions, such that each task processes a different partition in parallel. However, performance problems in a distributed MapReduce system can be hard to diagnose and to localize to a specific node or a set of nodes. On the other hand, the structure of a large number of nodes performing similar tasks naturally affords us opportunities for observing the system from multiple viewpoints.
We present a "Blind Men and the Elephant" (BliMeE) framework in which we exploit this structure, and demonstrate how problems in a MapReduce system can be diagnosed by corroborating the multiple viewpoints. More specifically, we present algorithms within the BliMeE framework based on OS-level performance counters, on white-box metrics extracted from logs, and on application-level heartbeats. We show that our BliMeE algorithms are able to capture a variety of faults, including resource hogs and application hangs, and to localize the fault to subsets of slave nodes in the MapReduce system.
In addition, we discuss how the diagnostic algorithms' outcomes can be further synthesized in a repeated application of the BliMeE approach. We present a simple supervised learning technique which allows us to identify a fault if it has been previously observed.
@mastersthesis{pan.masters2009blind1, title = {The Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis}, author = {Pan, Xinghao}, month = {May}, year = {2009}, school = {Carnegie Mellon University} }


Object recognition tools for educational robots.
Xinghao Pan.
Senior Research Thesis, School of Computer Science, Carnegie Mellon University.
May 2008.
[ abstract | bib | pdf ]
SIFT (Scale Invariant Feature Transform) features, developed by David G. Lowe, have been found to be highly robust to translations, rotations and scaling, and have been the solution of choice for many when dealing with problems of robotic vision and object recognition. Object recognition using SIFT features involves detection of SIFT features in an image, and matching those features against features in a structured database of object images. However, to the best of our knowledge, there are currently no open-source tools to conveniently perform these tasks. This research aims to develop SIFT-based object recognition tools for use by students in undergraduate robotics courses. The tools will allow students to gain a basic understanding of SIFT, but also abstract away from the actual implementation. Hence, students will be able to focus on solving higher level robotics problems.
@techreport{pan.seniorthesis2008object1, title = {Object recognition tools for educational robots}, author = {Pan, Xinghao}, month = {May}, year = {2008}, institution = {Carnegie Mellon University} }



Talks / Posters

Parallel double greedy submodular maximization.
Xinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan.
Talk / Poster @ Bay Area Machine Learning Symposium (BayLearn).
October 21, 2014.
[ talk | poster ]

Optimistic concurrency control for distributed unsupervised learning.
Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan.
Talk / Poster @ Bay Area Machine Learning Symposium (BayLearn).
August 28, 2013.
[ talk | poster ]


© 2013 Xinghao Pan