Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation



This paper describes a hybrid model that combines machine learning with linguistic heuristics to integrate unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component that annotates each character in a sentence with a POC tag indicating its position in a word, and a merging component that transforms a POC-tagged character sequence into a word-segmented sentence. The tagging component uses a support vector machine-based tagger to produce an initial tagging of the text and a transformation-based tagger to improve the initial tagging. In addition to the POC tags assigned to the characters, the merging component incorporates a number of linguistic and statistical heuristics to detect words with regular internal structures, recognize long words, and filter non-words. Experiments show that, without resorting to a separate unknown word identification mechanism, the model achieves an F-score of 95.0% f...
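
The merging step described in the abstract can be illustrated with a minimal sketch. It covers only the basic conversion from a POC-tagged character sequence to words; the linguistic and statistical heuristics mentioned in the abstract are not modeled. The tag names `B`, `M`, `E`, `S` and the function `merge_poc` are hypothetical, chosen for illustration, since the abstract does not list the tag set the paper uses.

```python
from typing import List, Tuple

# Hypothetical four-tag POC scheme (an assumption, not taken from the paper):
#   "B" = word-initial, "M" = word-medial, "E" = word-final,
#   "S" = single-character word.

def merge_poc(tagged: List[Tuple[str, str]]) -> List[str]:
    """Merge (character, POC tag) pairs into a list of words."""
    words: List[str] = []
    current = ""
    for char, tag in tagged:
        # A word-initial or single-character tag starts a new word,
        # so flush anything left over from an inconsistent tag sequence.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += char
        # A word-final or single-character tag closes the current word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    # Flush a trailing word that never received a closing tag.
    if current:
        words.append(current)
    return words
```

For example, `merge_poc([("中", "B"), ("国", "E"), ("人", "S")])` yields two words, `中国` and `人`. The flush steps are a design choice of this sketch: they keep every character in the output even when the tagger emits an inconsistent sequence.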

Xiaofei Lu


Artificial Intelligence | FLAIRS 2007 | POC Tags | Unknown Word Identification | Word Segmentation


Post Info

Added	02 Oct 2010
Updated	02 Oct 2010
Type	Conference
Year	2007
Where	FLAIRS
Authors	Xiaofei Lu

