Random Projection Trees and Low Dimensional Manifolds

Sanjoy Dasgupta



The curse of dimensionality has traditionally been the bane of nonparametric statistics (histograms, kernel density estimation, nearest neighbor search, and so on), as reflected in running times and convergence rates that are exponentially bad in the dimension.

Recently the field has been rejuvenated by the realization that a lot of data that seems high-dimensional in fact has low "intrinsic" dimension in the sense of lying close to a low-dimensional manifold. Many algorithms have been proposed for learning such manifolds from data.

I'll exhibit a way to benefit from intrinsic low dimensionality without having to go to the trouble of explicitly learning its structure. Specifically, I'll present a simple variant of the ubiquitous k-d tree (a spatial data structure) that is provably adaptive to low dimensional structure. We call this a "random projection tree" (RP tree).

Along the way, I'll discuss different notions of intrinsic dimension -- motivated by manifolds, by local statistics, and by analysis on metric spaces -- and relate them. I'll then prove that RP trees require resources that depend only on these dimensions rather than the dimension of the space in which the data happens to be situated.

This is work with Yoav Freund (UC San Diego).