We describe the subword approach we used in the 2006 ImageCLEF Medical Image Retrieval task. It is based on the assumption that neither fully inflected nor automatically stemmed words constitute the appropriate granularity for lexicalized content description. We therefore introduce subwords as morphologically meaningful word units. Subwords are organized in language-specific lexica, which were generated partly manually and partly automatically and currently cover six European languages. They are linked via a multilingual thesaurus. Using subwords instead of full words significantly reduces the number of lexical entries needed to sufficiently cover a specific language and domain. A further benefit of the approach is its independence from the underlying retrieval system, which makes it usable with any search engine. In this year's test runs we combined MorphoSaurus with the open-source search engine Lucene and achieved precision gains of up to 25% over the b...
Philipp Daumke, Jan Paetzold, Kornél G. Mar
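
To make the idea of subword indexing concrete, the following is a minimal sketch and not the MorphoSaurus implementation. It assumes a greedy longest-match segmenter and tiny toy lexica; the lexicon entries, the identifiers such as `#stomach`, and the function names `segment` and `to_concepts` are all hypothetical and chosen only for illustration. It shows how terms from different languages could be reduced to the same language-independent identifiers before being handed to any search engine.

```python
# Toy subword lexica (hypothetical entries, not taken from the real system).
# Each subword maps to a language-independent identifier, standing in for
# an entry of the multilingual thesaurus.
LEXICA = {
    "en": {
        "gastr": "#stomach",
        "stomach": "#stomach",
        "itis": "#inflamm",
        "inflamm": "#inflamm",
    },
    "de": {
        "magen": "#stomach",
        "gastr": "#stomach",
        "entzuend": "#inflamm",
        "itis": "#inflamm",
    },
}


def segment(word, lexicon):
    """Greedy longest-match segmentation of a word into known subwords.

    This strategy is an assumption made for the sketch; characters that
    do not start any known subword are skipped.
    """
    word = word.lower()
    subwords = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                subwords.append(word[i:j])
                i = j
                break
        else:
            i += 1  # no subword starts here; skip one character
    return subwords


def to_concepts(text, lang):
    """Map free text to the identifiers of the subwords it contains."""
    lexicon = LEXICA[lang]
    return [lexicon[s] for w in text.split() for s in segment(w, lexicon)]


if __name__ == "__main__":
    print(to_concepts("gastritis", "en"))
    print(to_concepts("Magenentzuendung", "de"))
    print(to_concepts("stomach inflammation", "en"))
```

In this sketch all three inputs reduce to the same identifier sequence, so a query in one language could match a document in another. Because the transformation happens before indexing and querying, the retrieval engine itself needs no modification, which is the kind of engine independence the abstract refers to.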