This paper investigates the applicability of a distributed clustering technique called RACHET [1] to organizing large sets of distributed text data. Although the authors of RACHET claim that the algorithm generates quality clusters for massive, high-dimensional data sets, it has not yet been evaluated on a well-known academic data set. This paper presents a performance analysis of the algorithm and tests its suitability for distributed document clustering. We use three widely known hierarchical clustering algorithms to generate local clusters at each distributed repository, and RACHET is then applied to merge the distributed cluster hierarchies. We run our own tests of the algorithm on standard document corpora [2] using popular cluster evaluation measures [3, 4], and we discuss important implementation details.

KEY WORDS
Distributed hierarchical clustering, document clustering.
Debzani Deb, M. Muztaba Fuad, Rafal A. Angryk