Sciweavers

EMNLP
2010

Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

Almost all Chinese language processing tasks begin with word segmentation of the input, so robust and reliable segmentation techniques are required for those tasks to perform well. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) have often been used to segment Chinese texts. Compared with traditional lexicon-driven models, machine-learned models achieve higher F-measure scores, but they depend heavily on their training materials. Although they process texts from the same domain as the training texts effectively, they perform relatively poorly on texts from new domains. In this paper, we propose to use χ² statistics when training an SVM-HMM based segmentation model to improve its ability to recall out-of-vocabulary (OOV) words, and then to use bootstrapping strategies to maintain its ability to recall in-vocabulary (IV) words. Experiments show the approach proposed in this paper enhances the do...
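As a rough illustration of the kind of statistic the abstract refers to, the sketch below computes the standard Pearson chi-square score for an adjacent character pair from a 2x2 contingency table of bigram counts. The abstract does not say how the authors compute the statistic or feed it into the SVM-HMM model, so the function name `chi_square_bigram`, its inputs, and the choice of adjacent-character counts are all assumptions made for illustration only.

```python
from collections import Counter


def chi_square_bigram(text, x, y):
    """Pearson chi-square association score for the adjacent pair (x, y).

    Hypothetical helper: a generic 2x2 contingency-table computation,
    not the feature definition used in the paper.
    """
    pairs = list(zip(text, text[1:]))  # all adjacent character pairs
    n = len(pairs)
    counts = Counter(pairs)

    a = counts[(x, y)]  # x followed by y
    # x followed by something other than y
    b = sum(k for (left, right), k in counts.items() if left == x and right != y)
    # something other than x followed by y
    c = sum(k for (left, right), k in counts.items() if left != x and right == y)
    d = n - a - b - c  # neither x on the left nor y on the right

    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so the score is undefined; use 0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # In "abab", "a" is always followed by "b", so the score reaches its maximum n = 3.
    print(chi_square_bigram("abab", "a", "b"))
```

A high score means the two characters co-occur far more (or less) often than independence would predict, which is one plausible signal that they belong to the same word even when that word never appeared in the training lexicon.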
Baobao Chang, Dongxu Han
Added 11 Feb 2011
Updated 11 Feb 2011
Type Journal
Year 2010
Where EMNLP
Authors Baobao Chang, Dongxu Han