Sciweavers

CLIN
2001

Accurate Stemming of Dutch for Text Classification

This paper investigates the use of stemming for the classification of Dutch (email) texts. We introduce a stemmer that combines dictionary lookup (implemented efficiently as a finite state automaton) with a rule-based backup strategy, and show that it outperforms the Dutch Porter stemmer in accuracy without being substantially slower. For text classification, the most important property of a stemmer is the number of words it (correctly) reduces to the same stem; here, too, the dictionary-based system outperforms Porter. However, evaluating a Bayesian text classification system with no stemming, the Porter stemmer, or the dictionary-based stemmer on an email classification task and a newspaper topic classification task does not reveal significant differences in accuracy. We conclude with an analysis of why this is the case.
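The idea of a dictionary lookup with a rule-based backup can be illustrated with a minimal sketch. This is not the authors' stemmer: the lexicon entries, the suffix list, and the names `LEXICON`, `SUFFIX_RULES`, `rule_based_stem`, `stem`, and `conflation_classes` are all hypothetical, and a plain Python dict stands in for the finite state automaton mentioned in the abstract. The last function groups words by stem, which is the property the abstract singles out as most relevant for classification.

```python
from collections import defaultdict

# Hypothetical toy lexicon mapping inflected Dutch forms to a stem.
# A real system would compile a large dictionary into a finite state
# automaton; a dict is used here only to keep the sketch short.
LEXICON = {
    "huizen": "huis",
    "huisje": "huis",
    "lopen": "loop",
    "loopt": "loop",
    "liep": "loop",
}

# Hypothetical, deliberately crude suffix rules (longer suffixes first).
# These are NOT the Dutch Porter rules.
SUFFIX_RULES = ["en", "je", "t", "e", "s"]


def rule_based_stem(word, min_stem_len=3):
    """Backup strategy: strip the first matching suffix, if enough remains."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Try the dictionary first; fall back to the rules for unknown words."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return rule_based_stem(word)


def conflation_classes(words):
    """Group words by stem, i.e. the sets of words reduced to the same stem."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w.lower())
    return dict(groups)
```

In this sketch the irregular form `liep` is only conflated with `lopen` because the lexicon lists it; suffix stripping alone would leave it untouched, which is one intuition for why a dictionary-based approach can group more words correctly.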
Tanja Gaustad, Gosse Bouma
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2001
Where CLIN
Authors Tanja Gaustad, Gosse Bouma