Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Parallel Algorithms for Hierarchical Clustering

Clark F. Olson

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-94-786
December 1993

http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-786.pdf

Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. O( n^2) algorithms, where n is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe O( n) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an n node CRCW PRAM and O( n log n) algorithms for these metrics (except average link and complete link) on n\log n node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.


BibTeX citation:

@techreport{Olson:CSD-94-786,
    Author = {Olson, Clark F.},
    Title = {Parallel Algorithms for Hierarchical Clustering},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {1993},
    Month = {Dec},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html},
    Number = {UCB/CSD-94-786},
    Abstract = {Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem.  This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.}
}

EndNote citation:

%0 Report
%A Olson, Clark F.
%T Parallel Algorithms for Hierarchical Clustering
%I EECS Department, University of California, Berkeley
%D 1993
%@ UCB/CSD-94-786
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html
%F Olson:CSD-94-786