FLAIRS
2007

Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation

This paper describes a hybrid model that combines machine learning with linguistic heuristics to integrate unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component, which annotates each character in a sentence with a POC tag indicating its position in a word, and a merging component, which transforms a POC-tagged character sequence into a word-segmented sentence. The tagging component uses a support vector machine based tagger to produce an initial tagging of the text and a transformation-based tagger to improve that initial tagging. In addition to the POC tags assigned to the characters, the merging component incorporates a number of linguistic and statistical heuristics to detect words with regular internal structures, recognize long words, and filter non-words. Experiments show that, without resorting to a separate unknown word identification mechanism, the model achieves an F-score of 95.0% f...
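As a rough illustration of the merging step described in the abstract, the sketch below turns a POC-tagged character sequence into a list of words. The tag names (B, M, E, S) and the function name `merge_poc_tags` are assumptions made for this example and are not taken from the paper. The sketch covers only the tag-driven baseline and leaves out the linguistic and statistical heuristics that the paper's merging component adds.

```python
from typing import List, Tuple

# Hypothetical tag set (an assumption, not the paper's own labels):
# B = word-initial, M = word-medial, E = word-final, S = single-character word.


def merge_poc_tags(tagged: List[Tuple[str, str]]) -> List[str]:
    """Merge (character, POC tag) pairs into a list of words."""
    words: List[str] = []
    current = ""
    for char, tag in tagged:
        if tag in ("B", "S") and current:
            # A new word starts here, so close any word left open
            # by an inconsistent tag sequence.
            words.append(current)
            current = ""
        current += char
        if tag in ("E", "S"):
            # This character ends a word.
            words.append(current)
            current = ""
    if current:
        # Flush a trailing word that never received an end tag.
        words.append(current)
    return words


tagged = [("我", "S"), ("喜", "B"), ("欢", "E"),
          ("语", "B"), ("言", "M"), ("学", "E")]
print(" ".join(merge_poc_tags(tagged)))
```

Closing an open word whenever a new word-initial tag appears is one simple way to tolerate inconsistent tagger output. A heuristic layer like the one the abstract describes would presumably act on these candidate words afterwards.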
Xiaofei Lu
Type Conference
Year 2007
Where FLAIRS
Authors Xiaofei Lu