ICML
2004
IEEE

Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5

Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed. We describe a class of text categorization problems that are characterized by many redundant features. Even though most of these features are relevant, the underlying concepts can be concisely captured using only a few features, while keeping all of them has a substantially detrimental effect on categorization accuracy. We develop a novel measure that captures feature redundancy, and use it to analyze a large collection of datasets. We show that for problems plagued by numerous redundant features the performance of C4.5 is significantly superior to that of SVM, while aggressive feature selection allows SVM to beat C4.5 by a narrow m...
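As a rough illustration of the idea in the abstract, the sketch below builds bag-of-words representations for a tiny corpus, ranks words by information gain, and keeps only the top few before any classifier would be trained. The truncated abstract does not say which selection criterion the authors used, so information gain is an assumption here, and the toy documents, labels, and helper names (`bag_of_words`, `information_gain`, `select_top_features`, `project`) are hypothetical. The classifier step (SVM or C4.5) is deliberately left out.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as the set of lowercase tokens it contains."""
    return set(text.lower().split())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from splitting documents on word presence."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    remainder = sum(
        len(part) / n * entropy(part)
        for part in (with_word, without_word)
        if part
    )
    return entropy(labels) - remainder


def select_top_features(docs, labels, k):
    """Aggressive selection: keep only the k highest-scoring words."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(
        vocab,
        key=lambda w: information_gain(docs, labels, w),
        reverse=True,
    )
    return ranked[:k]


def project(doc, features):
    """Binary feature vector restricted to the selected words."""
    return [1 if w in doc else 0 for w in features]


# Hypothetical toy corpus: several words are equally informative, so most
# of them are redundant once one or two have been kept.
texts = [
    "wheat harvest grain export",
    "grain wheat price farm",
    "bank loan interest rate",
    "interest bank credit rate",
]
labels = ["grain", "grain", "money", "money"]

docs = [bag_of_words(t) for t in texts]
selected = select_top_features(docs, labels, k=2)
vectors = [project(d, selected) for d in docs]
```

In this toy corpus five of the eleven words separate the two classes perfectly, so keeping two of them already yields vectors that distinguish the classes. The reduced vectors in `vectors` are what would be passed on to a classifier.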
Evgeniy Gabrilovich, Shaul Markovitch
Added 17 Nov 2009
Updated 17 Nov 2009
Type Conference
Year 2004
Where ICML