ACIIDS
2010
IEEE

An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation

This paper addresses two main topics: (i) recognizing Vietnamese words and segmenting sentences into words using probabilistic models; (ii) constructing the optimum probabilistic model through an unsupervised learning process. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of the statistical functions, which in turn improves the recognition of new words. Hence, the probabilistic model will converge to the optimum one. Our experimental corpus is generated from about 250,000 online news articles, which consist of about 19,000,000 sentences. The accuracy of the segmentation algorithm is over 90%. Our Vietnamese word and phrase dictionary contains more than 150,000 entries.
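The iterative loop described in the abstract (score adjacent syllables, link the strongly associated ones, recompute the statistics, repeat) can be illustrated with a minimal sketch. The abstract does not name the statistical functions the authors use, so this example substitutes pointwise mutual information (PMI) as an assumed association measure. The function names `link_pass` and `segment`, the `threshold` and `min_count` parameters, and the toy corpus are all hypothetical and chosen only for illustration; this is not the authors' algorithm.

```python
import math
from collections import Counter


def link_pass(sentences, threshold, min_count):
    """One pass: link adjacent tokens whose PMI exceeds `threshold`.

    PMI is an assumed stand-in for the paper's statistical functions.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    def pmi(a, b):
        p_ab = bigrams[(a, b)] / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        return math.log(p_ab / (p_a * p_b))

    # Pairs that are frequent enough and strongly associated become links.
    linked = {
        pair
        for pair, count in bigrams.items()
        if count >= min_count and pmi(*pair) > threshold
    }

    # Greedy left-to-right merge of linked pairs into single tokens.
    new_sentences = []
    for sent in sentences:
        out = []
        i = 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in linked:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        new_sentences.append(out)
    return new_sentences, linked


def segment(sentences, threshold=1.0, min_count=2, max_rounds=5):
    """Repeat linking passes until no new pair is linked (or max_rounds)."""
    for _ in range(max_rounds):
        sentences, linked = link_pass(sentences, threshold, min_count)
        if not linked:
            break
    return sentences


# Toy corpus of syllable sequences (illustrative only, not from the paper).
toy_corpus = [
    ["tôi", "yêu", "việt", "nam"],
    ["việt", "nam", "rất", "đẹp"],
    ["tôi", "ở", "việt", "nam"],
    ["anh", "yêu", "em"],
]
segmented = segment(toy_corpus)
```

Because each pass recomputes the counts over the already-linked tokens, a merged unit such as `việt_nam` can itself be linked to a neighbor in a later round, which is one simple way multi-syllable words and phrases could emerge. The loop stops when a pass links nothing new, a crude proxy for the convergence the abstract describes.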
Hieu Le Trung, Vu Le Anh, Kien Le Trung
Added 10 Jul 2010
Updated 10 Jul 2010
Type Conference
Year 2010
Where ACIIDS