We describe an algorithm for clustering using a similarity graph. The algorithm (a) runs in O(n log^3 n + m log n) time on graphs with n vertices and m edges, and (b) with high probability, finds all "large enough" clusters in a random graph generated according to the planted partition model. We provide lower bounds implying that our "large enough" constraint cannot be improved much, even by a computationally unbounded algorithm. We describe experiments running the algorithm and a few related algorithms on random graphs whose partitions are generated using a Chinese Restaurant Process, and we report some results of applying the algorithm to cluster DBLP titles.
Nader H. Bshouty, Philip M. Long
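
The random-graph setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard form of the planted partition model (each within-cluster pair is joined independently with probability p, each cross-cluster pair with probability q < p) and draws the cluster assignment from a Chinese Restaurant Process with concentration parameter alpha. The function names and the parameter values n = 200, alpha = 5.0, p = 0.7, q = 0.1 are illustrative assumptions, not the settings used in the experiments, and the sketch only generates inputs; it does not implement the clustering algorithm itself.

```python
import random


def crp_partition(n, alpha, rng):
    """Assign n items to clusters with a Chinese Restaurant Process.

    Item i joins existing cluster k with probability sizes[k] / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha).
    """
    labels = []
    sizes = []
    for i in range(n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        choice = len(sizes)  # default: open a new cluster
        for k, size in enumerate(sizes):
            acc += size
            if r < acc:
                choice = k
                break
        if choice == len(sizes):
            sizes.append(0)
        sizes[choice] += 1
        labels.append(choice)
    return labels


def planted_partition_graph(labels, p, q, rng):
    """Sample an undirected graph: edge probability p inside a cluster, q across."""
    n = len(labels)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if labels[u] == labels[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return edges


# Illustrative parameters (assumptions, not taken from the paper).
rng = random.Random(0)
labels = crp_partition(200, alpha=5.0, rng=rng)
edges = planted_partition_graph(labels, p=0.7, q=0.1, rng=rng)
```

The quadratic loop over vertex pairs keeps the sketch short; for large n a sparse sampling scheme would be preferable. Because a Chinese Restaurant Process tends to produce a few large clusters and many small ones, such inputs exercise the "large enough" cluster condition directly.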