# Parallel Algorithms for Hierarchical Clustering

### Clark F. Olson

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/CSD-94-786

December 1993

### http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-786.pdf

Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces.
*O*(
*n*^2) algorithms, where
*n* is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe
*O*(
*n*) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an
*n* node CRCW PRAM and
*O*(
*n* log
*n*) algorithms for these metrics (except average link and complete link) on
*n*\log n node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.

BibTeX citation:

@techreport{Olson:CSD-94-786, Author = {Olson, Clark F.}, Title = {Parallel Algorithms for Hierarchical Clustering}, Institution = {EECS Department, University of California, Berkeley}, Year = {1993}, Month = {Dec}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html}, Number = {UCB/CSD-94-786}, Abstract = {Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.} }

EndNote citation:

%0 Report %A Olson, Clark F. %T Parallel Algorithms for Hierarchical Clustering %I EECS Department, University of California, Berkeley %D 1993 %@ UCB/CSD-94-786 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html %F Olson:CSD-94-786