Sciweavers

IJCAI
2007

Semantic Smoothing of Document Models for Agglomerative Clustering

In this paper, we argue that agglomerative clustering with the vector cosine similarity measure performs poorly for two reasons. First, the nearest neighbors of a document belong to different classes in many cases, since any pair of documents shares many “general” words. Second, the sparsity of class-specific “core” words leads to documents with the same class label being grouped into different clusters. Both problems can be resolved by suitably smoothing the document models and using the Kullback-Leibler divergence between two smoothed models as the pairwise document distance. Inspired by recent work in information retrieval, we propose a novel context-sensitive semantic smoothing method that automatically identifies multiword phrases in a document and then statistically maps those phrases to individual document terms. We evaluate the new model-based similarity measure on three datasets using the complete linkage criterion for agglomerative clustering and find that it significantly improves th...
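The pipeline the abstract describes (smooth each document model, use Kullback-Leibler divergence as the pairwise distance, then cluster with complete linkage) can be illustrated with a minimal sketch. This is not the authors' method: the context-sensitive semantic smoothing with phrase-to-term mapping is not reproduced here, and plain Jelinek-Mercer smoothing against a corpus background model is swapped in as a stand-in. The function names, the mixing weight `lam`, the toy documents, and the choice to symmetrise the divergence by summing both directions are all assumptions made for illustration.

```python
import math
from collections import Counter


def smoothed_model(tokens, background, lam=0.5):
    """Unigram model smoothed with a background model (Jelinek-Mercer).

    Stand-in for the paper's semantic smoothing; lam is a hypothetical weight.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (1 - lam) * counts[w] / total + lam * p_bg
            for w, p_bg in background.items()}


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def sym_kl(p, q):
    """Symmetrised divergence (an assumption, so the distance is symmetric)."""
    return kl(p, q) + kl(q, p)


def complete_linkage(dist, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest, until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy corpus (hypothetical): two "astronomy" and two "finance" documents.
docs = [
    "the star orbits the galaxy".split(),
    "the galaxy contains the star".split(),
    "the bank raised the rate".split(),
    "the rate hurt the bank".split(),
]

# Background model: relative word frequencies over the whole corpus.
corpus_counts = Counter(w for d in docs for w in d)
corpus_total = sum(corpus_counts.values())
background = {w: c / corpus_total for w, c in corpus_counts.items()}

models = [smoothed_model(d, background) for d in docs]
dist = [[sym_kl(p, q) for q in models] for p in models]
clusters = complete_linkage(dist, k=2)
print(clusters)
```

Because every smoothed model assigns non-zero probability to every vocabulary word, the divergence is always finite, which is what makes it usable as a pairwise distance in this sketch.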
Xiaohua Zhou, Xiaodan Zhang, Xiaohua Hu
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Where IJCAI
Authors Xiaohua Zhou, Xiaodan Zhang, Xiaohua Hu