Parallel Algorithms for Hierarchical Clustering

Clark F. Olson

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-94-786
December 1993

http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-786.pdf

Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. O( n^2) algorithms, where n is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe O( n) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an n node CRCW PRAM and O( n log n) algorithms for these metrics (except average link and complete link) on n\log n node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.


BibTeX citation:

@techreport{Olson:CSD-94-786,
    Author = {Olson, Clark F.},
    Title = {Parallel Algorithms for Hierarchical Clustering},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {1993},
    Month = {Dec},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html},
    Number = {UCB/CSD-94-786},
    Abstract = {Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem.  This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.}
}

EndNote citation:

%0 Report
%A Olson, Clark F.
%T Parallel Algorithms for Hierarchical Clustering
%I EECS Department, University of California, Berkeley
%D 1993
%@ UCB/CSD-94-786
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html
%F Olson:CSD-94-786