ICDE
2007
IEEE

Document Representation and Dimension Reduction for Text Clustering

Increasingly large text datasets and the high dimensionality associated with natural language pose a major challenge in text mining. This research presents a systematic study of three document representation methods for text, combined with three dimension reduction techniques (DRT), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model: word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on document frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representation, ICA generally gives better result...
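As a rough illustration of two of the ingredients named in the abstract, the sketch below builds word and character N-gram representations in the vector space model and then applies feature selection by document frequency. It is not the authors' implementation: the abstract does not say how DF selection was configured, so keeping the top-k terms by DF is an assumption, and the helper names (`word_tokens`, `char_ngrams`, `select_by_df`, `vectorize`) and the toy documents are hypothetical.

```python
from collections import Counter


def word_tokens(text):
    """Lowercased, whitespace-delimited word tokens."""
    return text.lower().split()


def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def document_frequency(tokenized_docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return df


def select_by_df(tokenized_docs, k):
    """Keep the k terms with the highest DF (assumed rule; ties broken alphabetically)."""
    df = document_frequency(tokenized_docs)
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k]


def vectorize(tokenized_docs, vocabulary):
    """Term-count vectors over a fixed vocabulary (vector space model)."""
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocabulary])
    return vectors


# Hypothetical toy corpus, for illustration only.
docs = [
    "text clustering with k-means",
    "dimension reduction for text clustering",
    "k-means clustering of documents",
]

# Word representation, reduced to the 3 most widespread terms.
word_docs = [word_tokens(d) for d in docs]
vocab = select_by_df(word_docs, k=3)
word_vectors = vectorize(word_docs, vocab)

# Character 3-gram representation, reduced to 5 features.
ngram_docs = [char_ngrams(d, n=3) for d in docs]
ngram_vocab = select_by_df(ngram_docs, k=5)
ngram_vectors = vectorize(ngram_docs, ngram_vocab)
```

The reduced vectors could then be passed to any k-means implementation. ICA and LSI differ from DF selection in that they project onto new derived dimensions rather than keeping a subset of the original terms, so they would replace the `select_by_df` step rather than extend it.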
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where ICDE
Authors M. Mahdi Shafiei, Singer Wang, Roger Zhang, Evangelos E. Milios, Bin Tang, Jane Tougas, Raymond J. Spiteri