In this paper, we report on the fusion of simple retrieval strategies with thesaural resources in order to perform large-scale text categorization tasks. Unlike most related systems, which rely on training data in order to infer text-to-concept relationships, our approach can be applied with any controlled vocabulary and does not use any training data. The first classification module uses a traditional vector-space retrieval engine, which has been fine-tuned for the task, while the second classifier is based on regular variations of the concept list. For evaluation purposes, the system uses a sample of MedLine and the Medical Subject Headings (MeSH) terminology as collection of concepts. Preliminary results show that performances of the hybrid system are significantly improved as compared to each single system. For top returned concepts, the system reaches performances comparable to machine learning systems, while genericity and scalability issues are clearly in favor of the learn...
Patrick Ruch, Robert H. Baud, Antoine Geissbü