Scalable Term Selection for Text Categorization

15 years 8 months ago

In text categorization, term selection is an important step for both categorization accuracy and computational efficiency. Different dimensionalities are expected under different practical restrictions on time or space. Traditionally, the same scoring or ranking criterion, such as χ² or information gain (IG), is adopted for all target dimensionalities; such a criterion considers both the discriminability and the coverage of a term. In this paper, the poor accuracy at a low dimensionality is attributed to the small average vector length of the documents. Scalable term selection is proposed to optimize the term set at a given dimensionality according to an expected average vector length. Discriminability and coverage are measured separately; by adjusting the ratio of their weights in a combined criterion, the expected average vector length can be reached, which means a good compromise between the specificity and the exhaustivity of the term subset. Experiments show that the...
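
The idea of tuning a weight ratio until a term subset reaches an expected average vector length can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the measures or search procedure of the paper: discriminability is taken as the largest shift from the class prior P(c) to P(c | term), coverage as the fraction of documents containing the term, the combined criterion as a log-linear mix controlled by a hypothetical weight `lam`, and the weight is found by a simple grid scan. All function names are hypothetical.

```python
import math
from collections import Counter

def term_stats(docs, labels):
    """Return {term: (discriminability, coverage)} for tokenized docs."""
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()            # document frequency per term
    df_by_class = Counter()   # (term, label) -> document frequency
    for doc, label in zip(docs, labels):
        for term in set(doc):
            df[term] += 1
            df_by_class[(term, label)] += 1
    stats = {}
    for term, freq in df.items():
        # Assumed discriminability: largest shift from P(c) to P(c | term).
        disc = max(abs(df_by_class[(term, c)] / freq - class_sizes[c] / n)
                   for c in class_sizes)
        # Assumed coverage: fraction of documents containing the term.
        stats[term] = (disc, freq / n)
    return stats

def select_terms(stats, k, lam):
    """Top-k terms by lam * log(disc) + (1 - lam) * log(coverage)."""
    eps = 1e-12  # avoids log(0) for terms with zero discriminability

    def score(term):
        disc, cov = stats[term]
        return lam * math.log(disc + eps) + (1 - lam) * math.log(cov + eps)

    # Ties are broken alphabetically so the result is deterministic.
    return set(sorted(stats, key=lambda t: (-score(t), t))[:k])

def average_vector_length(docs, terms):
    """Mean number of distinct selected terms per document."""
    return sum(len(set(doc) & terms) for doc in docs) / len(docs)

def scalable_selection(docs, labels, k, target_avl, steps=21):
    """Scan lam over [0, 1]; keep the k-term set whose AVL is closest to target."""
    stats = term_stats(docs, labels)
    best = None
    for i in range(steps):
        lam = i / (steps - 1)
        terms = select_terms(stats, k, lam)
        gap = abs(average_vector_length(docs, terms) - target_avl)
        if best is None or gap < best[0]:
            best = (gap, lam, terms)
    return best[1], best[2]

docs = [["ball", "game", "the"], ["ball", "team", "the"],
        ["vote", "law", "the"], ["vote", "party", "the"]]
labels = ["sport", "sport", "politics", "politics"]
lam, terms = scalable_selection(docs, labels, k=2, target_avl=1.5)
```

In this sketch, `lam = 0` ranks purely by coverage (favoring exhaustive, frequent terms such as "the"), while `lam = 1` ranks purely by discriminability (favoring specific terms); intermediate values trade one for the other at a fixed dimensionality `k`. A grid scan is used rather than a bisection because the average vector length is not guaranteed to change monotonically with the weight under these assumed measures.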

Jingyang Li, Maosong Sun
