Common document clustering algorithms use models that either divide a corpus into smaller clusters or gather individual documents into clusters. Hierarchical Agglomerative Clustering, a common gathering algorithm, runs in O(n^2) to O(n^3) time, depending on the linkage of documents. In contrast, Bisecting K-Means Clustering has been shown to run in linear time with respect to the number of documents to cluster, although other factors significantly affect run time. We propose a clustering algorithm based on an inverted-index matrix of terms and an inverted term tree model.
Casey Bartman, Jamal R. Alsabbagh