An empirical study of tokenization strategies for biomedical information retrieval

Because biological names vary widely in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, little work has evaluated tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on retrieval performance. As expected, our experimental results show that tokenization can significantly affect retrieval accuracy; appropriate tokenization can improve performance by up to 96%, as measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for ce...
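To make the two central notions in the abstract concrete, the following minimal Python sketch shows (a) how two tokenization choices split the same biological name differently and (b) how mean average precision is conventionally computed from ranked results. The function names, the example name "MIP-1-alpha", and the two splitting rules are illustrative assumptions; they are not taken from the paper and do not reproduce the heuristics it evaluates.

```python
import re


def tokenize_whitespace(text):
    """Hypothetical heuristic: lowercase and split on whitespace only,
    so a hyphenated name such as 'MIP-1-alpha' stays a single token."""
    return text.lower().split()


def tokenize_break_nonalnum(text):
    """Hypothetical heuristic: lowercase and split on every run of
    non-alphanumeric characters, so 'MIP-1-alpha' becomes three tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def average_precision(ranked_ids, relevant_ids):
    """Standard average precision for one query: the mean of the precision
    values at each rank where a relevant document is retrieved, divided by
    the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP over queries; `runs` is a list of (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Under the first rule a query term "alpha" would not match a document containing only "MIP-1-alpha", while under the second it would; differences of this kind are one way tokenization can change the ranked list and therefore the MAP score.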
Jing Jiang, ChengXiang Zhai
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2007
Where IR
Authors Jing Jiang, ChengXiang Zhai