Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5

Download www.cs.technion.ac.il

Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed. We describe a class of text categorization problems that are characterized by many redundant features. Even though most of these features are relevant, the underlying concepts can be concisely captured using only a few features, while keeping all of them has a substantially detrimental effect on categorization accuracy. We develop a novel measure that captures feature redundancy, and use it to analyze a large collection of datasets. We show that for problems plagued with numerous redundant features the performance of C4.5 is significantly superior to that of SVM, while aggressive feature selection allows SVM to beat C4.5 by a narrow m...
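As a rough illustration of what "aggressive feature selection" over a bag-of-words representation can look like, the sketch below ranks words by information gain and keeps only a small fraction of the vocabulary. The abstract does not name the selection criterion, so information gain is an assumption here, and the function names, the `keep_fraction` parameter, and the toy corpus are all hypothetical. Training the classifier (SVM or C4.5) on the reduced feature set is left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_top_features(docs, labels, keep_fraction=0.01):
    """Keep only the highest-scoring fraction of the vocabulary (at least one word).

    `keep_fraction` is a hypothetical knob; a small value means aggressive selection.
    """
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda w: information_gain(docs, labels, w),
        reverse=True,
    )
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]


# Toy corpus (made up): each document is a set of words, recording presence only.
docs = [
    {"goal", "match", "team", "the"},
    {"goal", "score", "league", "the"},
    {"vote", "election", "party", "the"},
    {"vote", "senate", "bill", "the"},
]
labels = ["sport", "sport", "politics", "politics"]

selected = select_top_features(docs, labels, keep_fraction=0.2)
print(selected)
```

In this toy corpus many words carry some signal about the class, yet two of them are enough to separate the classes, which is the kind of redundancy the abstract describes. A word such as "the", which appears in every document, scores zero and is dropped.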

Evgeniy Gabrilovich, Shaul Markovitch

Categorization Accuracy | ICML 2004 | Machine Learning | Text Categorization Algorithms | Text Categorization Problems |

Post Info

Added	17 Nov 2009
Updated	17 Nov 2009
Type	Conference
Year	2004
Where	ICML
Authors	Evgeniy Gabrilovich, Shaul Markovitch
