Text categorization by boosting automatically extracted concepts



Term-based representations of documents have found widespread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The utilized concepts are automatically extracted from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing Methods; H.3.3 [Information Storage and Retrieval]: Inf...
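To make the idea of boosting over both feature types concrete, here is a minimal sketch. It is not the authors' implementation. It assumes discrete AdaBoost with threshold stumps as weak hypotheses, a toy corpus of six documents, and concept scores that stand in for precomputed PLSA outputs P(z|d); all names and values are hypothetical.

```python
import math

def stump(row, j, thr, sign):
    """Weak hypothesis: predict `sign` if feature j >= thr, else -sign."""
    return sign if row[j] >= thr else -sign

def train_adaboost(X, y, rounds):
    """Discrete AdaBoost; each round picks the stump with lowest weighted error."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # entries: (alpha, feature index, threshold, sign)
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    err = sum(wi for wi, row, yi in zip(w, X, y)
                              if stump(row, j, thr, sign) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, sign))
        # Re-weight: misclassified documents gain weight, then normalize.
        w = [wi * math.exp(-alpha * yi * stump(row, j, thr, sign))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(a * stump(row, j, t, s) for a, j, t, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data. Term features: [has "car", has "engine"].
# Concept feature: assumed PLSA score P(z_auto | d), not computed here.
terms = [[1, 0], [0, 1], [0, 0], [0, 0], [1, 0], [0, 0]]
concepts = [[0.9], [0.8], [0.7], [0.1], [0.2], [0.15]]
y = [1, 1, 1, -1, -1, -1]  # +1 = automotive category

X_combined = [t + c for t, c in zip(terms, concepts)]
combined = train_adaboost(X_combined, y, rounds=5)
terms_only = train_adaboost(terms, y, rounds=5)
```

In this toy setup the third document is on topic but contains neither indexed term (imagine it says "automobile"), so its term vector is identical to that of two off-topic documents and no term-only ensemble can separate them. The ensemble trained on the combined features can use the concept score for such cases, which mirrors the robustness to word-usage variation that the abstract describes.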

Lijuan Cai, Thomas Hofmann
