Improving Web Clustering by Cluster Selection

15 years 7 months ago


Web page clustering groups semantically related web pages and is useful for categorizing, organizing, and refining search results. When clustering using only textual information, Suffix Tree Clustering (STC) outperforms other clustering algorithms by making use of phrases and allowing clusters to overlap. One problem with STC and similar algorithms is how to select a small set of clusters to display to the user from a very large set of generated clusters. The cluster selection method used in STC is flawed in that it does not handle overlapping clusters appropriately. This paper introduces a new cluster scoring function and a new cluster selection algorithm to overcome the problems with overlapping clusters; combined with STC, they form a new clustering algorithm, ESTC. The paper's experiments show that ESTC significantly outperforms STC and that, even with less data, ESTC performs similarly to a commercial clustering search engine.
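To make the overlap problem concrete, here is a minimal sketch of one generic way to choose a few clusters from many overlapping ones. It is not the ESTC scoring function or selection algorithm, which the abstract does not specify. It assumes each cluster is simply a set of page identifiers and greedily picks the cluster that adds the most pages not already covered. The function name `select_clusters`, the cluster labels, and the page ids are all hypothetical.

```python
from typing import Dict, List, Set


def select_clusters(clusters: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedily pick up to k clusters, each adding the most uncovered pages.

    Illustrative stand-in only: NOT the selection algorithm from the paper.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(clusters)
    while remaining and len(chosen) < k:
        # Marginal gain = pages in the cluster not covered by earlier picks.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # every remaining cluster is fully redundant
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical clusters: the first two overlap heavily.
clusters = {
    "jaguar car": {"p1", "p2", "p3", "p4"},
    "jaguar cars dealer": {"p1", "p2", "p3"},
    "jaguar animal": {"p5", "p6"},
}
selected = select_clusters(clusters, k=2)
print(selected)
```

Ranking these hypothetical clusters by size alone would show two near-duplicate "car" clusters and hide the "animal" topic. Counting only newly covered pages is one simple way to avoid that kind of redundancy.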

Daniel Crabtree, Xiaoying Gao, Peter Andreae
