Sciweavers

ESANN
2007

Kernel PCA based clustering for inducing features in text categorization

We study dimensionality reduction, or feature selection, in the text document categorization problem. We focus on the first step in building a text categorization system: choosing an efficient numerical representation of natural language text, which is then used by machine learning algorithms. We propose a representation based on word clusters. We build a kernel matrix from the distribution of words over the different categories and apply kernel PCA to extract a low-dimensional representation of the words. On this low-dimensional representation we use K-means clustering to group the words into clusters, and then use these clusters in the document categorization task. We show that kernel PCA based clustering gives performance better than or comparable to several advanced clustering methods when applied to the standard Reuters corpus.
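The pipeline described in the abstract (word distribution over categories, kernel matrix, kernel PCA, K-means, cluster features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which kernel is used, so the RBF kernel, the `gamma` value, the toy counts, and all function names below are assumptions made for illustration only.

```python
import numpy as np

def word_category_distribution(counts):
    """Row-normalize a (words x categories) count matrix so each row sums to 1."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1e-12)

def rbf_kernel(X, gamma=1.0):
    """RBF kernel between rows of X (assumed kernel; not stated in the abstract)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(K, n_components):
    """Project the training points onto the top components of the centered kernel."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = J @ K @ J                           # kernel centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.maximum(eigvals[order], 0.0)
    return eigvecs[:, order] * np.sqrt(lam)  # (n_words x n_components)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cluster_features(doc_word_counts, labels, k):
    """Sum each document's word counts within every word cluster."""
    membership = np.eye(k)[labels]           # (n_words x k) one-hot matrix
    return np.asarray(doc_word_counts, dtype=float) @ membership

# Toy data (hypothetical): 6 words counted over 2 categories.
word_counts = [[9, 1], [8, 2], [10, 0], [1, 9], [0, 10], [2, 8]]
P = word_category_distribution(word_counts)
Z = kernel_pca(rbf_kernel(P, gamma=1.0), n_components=1)
labels = kmeans(Z, k=2)

# Two toy documents given as word-count vectors over the same 6 words.
docs = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 2, 1, 0]]
features = cluster_features(docs, labels, k=2)
```

In this sketch each document ends up as a vector with one entry per word cluster, which could then be passed to any standard classifier. The number of clusters and of kernel PCA components are free parameters here and would need to be tuned on real data.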
Zsolt Minier, Lehel Csató
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Where ESANN
Authors Zsolt Minier, Lehel Csató