

Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture th...
Added 17 Dec 2010
Updated 17 Dec 2010
Type Journal
Year 2002
Authors Fuchun Peng, Xiangji Huang, Dale Schuurmans, Nick Cercone