# Learning from Subsampled Data: Active and Randomized Strategies

### Fabian Wauthier

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2013-94

May 17, 2013

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-94.pdf

In modern statistical applications, we are often faced with situations where there is either too little or too much data. Both extremes can be troublesome: Interesting models can only be learnt when sufficient amounts of data are available, yet these models tend to become intractable when data is abundant. An important thread of research addresses these difficulties by subsampling the data prior to learning a model. Subsampling can be active (i.e. active learning) or randomized. While both of these techniques have a long history, a direct application to novel situations is in many cases problematic. This dissertation addresses some of these issues.

We begin with an active learning strategy for spectral clustering when the cost of assessing individual similarities is substantial or prohibitive. We give an active spectral clustering algorithm which iteratively adds similarities based on information gleaned from a partial clustering and which improves over common alternatives.

Next, we consider active learning in Bayesian models. Complex Bayesian models often require an MCMC-based method for inference, which makes a naive application of common active learning strategies intractable. We propose an approximate active learning method which reuses samples from an existing MCMC chain in order to speed up the computations.

Our third contribution looks at the effects of randomized subsampling on Gaussian process models that make predictions about outliers and rare events. Randomized subsampling risks making outliers even rarer, which, in the context of Gaussian process models, can lead to overfitting. We show that Heavy-tailed stochastic processes can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions.

Finally, we turn to a theoretical evaluation of randomized subsampling for the purpose of inferring rankings of objects. We present two simple algorithms that predict a total order over n objects from a randomized subsample of binary comparisons. In expectation, the algorithms match an Omega(n) lower bound on the sample complexity for predicting a permutation with fixed expected Kendall tau distance. Furthermore, we show that given O(nlog(n)) samples, one algorithm recovers the true ranking with uniform quality, while the other predicts the ranking more accurately near the top than the bottom. Due to their simple form, the algorithms can be easily extended to online and distributed settings.

**Advisor:** Michael Jordan

BibTeX citation:

@phdthesis{Wauthier:EECS-2013-94, Author = {Wauthier, Fabian}, Title = {Learning from Subsampled Data: Active and Randomized Strategies}, School = {EECS Department, University of California, Berkeley}, Year = {2013}, Month = {May}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-94.html}, Number = {UCB/EECS-2013-94}, Abstract = {In modern statistical applications, we are often faced with situations where there is either too little or too much data. Both extremes can be troublesome: Interesting models can only be learnt when sufficient amounts of data are available, yet these models tend to become intractable when data is abundant. An important thread of research addresses these difficulties by subsampling the data prior to learning a model. Subsampling can be active (i.e. active learning) or randomized. While both of these techniques have a long history, a direct application to novel situations is in many cases problematic. This dissertation addresses some of these issues. We begin with an active learning strategy for spectral clustering when the cost of assessing individual similarities is substantial or prohibitive. We give an active spectral clustering algorithm which iteratively adds similarities based on information gleaned from a partial clustering and which improves over common alternatives. Next, we consider active learning in Bayesian models. Complex Bayesian models often require an MCMC-based method for inference, which makes a naive application of common active learning strategies intractable. We propose an approximate active learning method which reuses samples from an existing MCMC chain in order to speed up the computations. Our third contribution looks at the effects of randomized subsampling on Gaussian process models that make predictions about outliers and rare events. Randomized subsampling risks making outliers even rarer, which, in the context of Gaussian process models, can lead to overfitting. We show that Heavy-tailed stochastic processes can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. Finally, we turn to a theoretical evaluation of randomized subsampling for the purpose of inferring rankings of objects. We present two simple algorithms that predict a total order over n objects from a randomized subsample of binary comparisons. In expectation, the algorithms match an Omega(n) lower bound on the sample complexity for predicting a permutation with fixed expected Kendall tau distance. Furthermore, we show that given O(nlog(n)) samples, one algorithm recovers the true ranking with uniform quality, while the other predicts the ranking more accurately near the top than the bottom. Due to their simple form, the algorithms can be easily extended to online and distributed settings.} }

EndNote citation:

%0 Thesis %A Wauthier, Fabian %T Learning from Subsampled Data: Active and Randomized Strategies %I EECS Department, University of California, Berkeley %D 2013 %8 May 17 %@ UCB/EECS-2013-94 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-94.html %F Wauthier:EECS-2013-94