
ECIR
2003
Springer

Taming Wild Phrases

Abstract. In this paper the suitability of different document representations for automatic document classification is compared, investigating a whole range of representations between bag-of-words and bag-of-phrases. We look at some of their statistical properties, and determine for each representation the optimal choice of classification parameters and the effect of Term Selection. Phrases are represented by an abstraction called Head/Modifier pairs. Rather than just throwing phrases and keywords together, we start with pure HM pairs and gradually add more keywords to the document representation. We use the classification on keywords as the baseline, which we compare with the contribution of the pure HM pairs to classification accuracy, and the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline. We conclude that even the most careful term select...
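As an illustration only, the range of representations described in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes the Head/Modifier pairs have already been extracted by some parser, and the function name `build_terms`, the mode names, the `head|modifier` term encoding, and the sample pairs are all hypothetical.

```python
from collections import Counter

def build_terms(words, hm_pairs, mode="pairs"):
    """Build a bag of terms for one document.

    words    -- list of keyword tokens (assumed already tokenized)
    hm_pairs -- list of (head, modifier) tuples (assumed already extracted)
    mode     -- which representation to build (hypothetical names)
    """
    # Each HM pair becomes a single term; the "head|modifier" encoding is an assumption.
    pair_terms = Counter(f"{head}|{mod}" for head, mod in hm_pairs)
    heads = Counter(head for head, _ in hm_pairs)
    modifiers = Counter(mod for _, mod in hm_pairs)

    if mode == "words":                  # keyword baseline
        return Counter(words)
    if mode == "pairs":                  # pure HM pairs
        return pair_terms
    if mode == "pairs+heads":            # HM pairs plus their heads as keywords
        return pair_terms + heads
    if mode == "pairs+modifiers":        # HM pairs plus their modifiers as keywords
        return pair_terms + modifiers
    if mode == "pairs+heads+modifiers":  # HM pairs plus both components
        return pair_terms + heads + modifiers
    if mode == "words+pairs":            # all words and all HM pairs combined
        return Counter(words) + pair_terms
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical document: the decomposition into pairs is made up for illustration.
words = ["automatic", "document", "classification"]
hm_pairs = [("classification", "document"), ("classification", "automatic")]

for mode in ("words", "pairs", "pairs+heads", "words+pairs"):
    print(mode, dict(build_terms(words, hm_pairs, mode)))
```

Each mode yields a different term vocabulary for the same document, which is what would then be fed to a classifier and compared against the keyword baseline.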
Cornelis H. A. Koster, Marc Seutter
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where ECIR
Authors Cornelis H. A. Koster, Marc Seutter