Taming Wild Phrases

15 years 8 months ago

Abstract. In this paper the suitability of different document representations for automatic document classification is compared, investigating a whole range of representations between bag-of-words and bag-of-phrases. We look at some of their statistical properties, and determine for each representation the optimal choice of classification parameters and the effect of Term Selection. Phrases are represented by an abstraction called Head/Modifier (HM) pairs. Rather than just throwing phrases and keywords together, we start with pure HM pairs and gradually add more keywords to the document representation. We use the classification on keywords as the baseline, which we compare with the contribution of the pure HM pairs to classification accuracy, and the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline. We conclude that even the most careful term select...

Cornelis H. A. Koster, Marc Seutter
