Cross-language information retrieval (CLIR) today is dominated by techniques that use token-to-token mappings from bilingual dictionaries. Yet, state-of-the-art statistical translation models (e.g., using Synchronous Context-Free Grammars) are far richer, capturing multi-term phrases, term dependencies, and contextual constraints on translation choice. We present a novel CLIR framework that is able to reach inside the translation “black box” and exploit these sources of evidence. Experiments on the TREC-5/6 EnglishChinese test collection show this approach to be promising. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
Ferhan Türe, Jimmy J. Lin, Douglas W. Oard