Classifying with Co-stems - A New Representation for Information Filtering

14 years 10 months ago


Besides the content, the writing style is an important discriminator in information filtering tasks. Ideally, the solution of a filtering task employs a text representation that models both kinds of characteristics. In this respect, word stems are clearly content-capturing, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part-of-speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation for a broad range of information filtering tasks to provide new insights into the adequacy and task-specificity of text representation models. Altogether we observe that co-stem-based representat...
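To make the stem and co-stem split concrete, here is a minimal Python sketch. It is not the authors' implementation: `toy_stem`, its suffix list, `co_stem`, and `represent` are hypothetical names introduced only for illustration, and the toy stemmer stands in for whatever stemming algorithm is actually used. The sketch assumes a co-stem is whatever remains of a word once its stem is removed, so that a remaining prefix and suffix are concatenated.

```python
from collections import Counter

# Hypothetical suffix list for a toy stemmer (an assumption, not a real algorithm).
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def co_stem(word, stem):
    """Return the stem-reduced word: everything in `word` outside `stem`."""
    start = word.find(stem)
    if start < 0:
        # The stemmer rewrote characters; fall back to dropping the common prefix.
        k = 0
        while k < min(len(word), len(stem)) and word[k] == stem[k]:
            k += 1
        return word[k:]
    # Concatenate what precedes and what follows the stem.
    return word[:start] + word[start + len(stem):]


def represent(text):
    """Build two bags of features: stem counts and co-stem counts."""
    stems, co_stems = Counter(), Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        if not word:
            continue
        stem = toy_stem(word)
        stems[stem] += 1
        rest = co_stem(word, stem)
        if rest:  # words identical to their stem contribute no co-stem
            co_stems[rest] += 1
    return stems, co_stems
```

Calling `represent("Filtering filtered filters quickly")` yields one bag of stems and one bag of co-stems, which a classifier could use separately or combined. Keeping the two bags apart is a design choice of this sketch, made so that content features and style features can be weighted independently.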

Nedim Lipka, Benno Stein
