A Hierarchical EM Approach to Word Segmentation


We propose a simple two-level hierarchical probability model for unsupervised word segmentation. By treating words as strings composed of morphemes/phonemes, which are themselves composed of character/phone strings, we use EM to first identify the important morphemes/phonemes in a corpus, and then use a second level of EM to identify words given a lower-level morpheme/phoneme segmentation. To further improve the performance of the basic method, we employ a mutual information criterion to eliminate long word agglomerations and reduce the size of the inferred lexicon while moving EM out of poor local maxima. Experiments on the Brown corpus show that our method accurately recovers hidden word boundaries using less training data than current MDL-based approaches, even though our method is trained only on raw unsupervised data.
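To make the EM step concrete, the following is a minimal sketch of a single level of EM over a unigram lexicon. It is not the authors' implementation. It assumes a plain unigram model over candidate substrings, and the names `em_lexicon`, `viterbi_segment`, and `max_len` are hypothetical. The second EM level and the mutual information pruning described in the abstract are omitted.

```python
import math
from collections import defaultdict


def em_lexicon(corpus, max_len=4, iterations=10):
    """Estimate unigram probabilities for candidate units with EM.

    Assumption: every substring of length <= max_len is a candidate unit,
    and an utterance's probability is the sum over all of its segmentations
    of the product of unit probabilities.
    """
    candidates = set()
    for s in corpus:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                candidates.add(s[i:j])
    probs = {w: 1.0 / len(candidates) for w in candidates}

    for _ in range(iterations):
        counts = defaultdict(float)
        for s in corpus:
            n = len(s)
            # Forward pass: alpha[i] sums over all segmentations of s[:i].
            alpha = [0.0] * (n + 1)
            alpha[0] = 1.0
            for i in range(1, n + 1):
                for k in range(max(0, i - max_len), i):
                    alpha[i] += alpha[k] * probs.get(s[k:i], 0.0)
            # Backward pass: beta[i] sums over all segmentations of s[i:].
            beta = [0.0] * (n + 1)
            beta[n] = 1.0
            for i in range(n - 1, -1, -1):
                for j in range(i + 1, min(i + max_len, n) + 1):
                    beta[i] += probs.get(s[i:j], 0.0) * beta[j]
            if alpha[n] == 0.0:
                continue  # no segmentation is possible under the current lexicon
            # E-step: expected count of each unit at each position.
            for i in range(n):
                for j in range(i + 1, min(i + max_len, n) + 1):
                    w = s[i:j]
                    counts[w] += alpha[i] * probs.get(w, 0.0) * beta[j] / alpha[n]
        # M-step: renormalise expected counts into probabilities.
        total = sum(counts.values())
        probs = {w: c / total for w, c in counts.items() if c > 0.0}
    return probs


def viterbi_segment(s, probs, max_len=4):
    """Return the most probable segmentation of s under the learned lexicon."""
    n = len(s)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for k in range(max(0, i - max_len), i):
            p = probs.get(s[k:i], 0.0)
            if p > 0.0 and best[k] > -math.inf:
                score = best[k] + math.log(p)
                if score > best[i]:
                    best[i] = score
                    back[i] = k
    # Follow back-pointers; an unreachable end falls back to the whole string.
    units = []
    i = n
    while i > 0:
        units.append(s[back[i]:i])
        i = back[i]
    return units[::-1]


corpus = ["thedog", "thecat", "adog", "acat"]
probs = em_lexicon(corpus)
segmentation = viterbi_segment("thedog", probs)
```

Left unchecked, plain maximum likelihood of this kind tends to favor long units, which is the agglomeration problem the abstract addresses with a mutual information criterion. In the two-level setting, the units learned here would play the role of morphemes/phonemes, and a second EM pass would group them into words.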

Fuchun Peng, Dale Schuurmans

Level Morpheme/phoneme Segmentation | Natural Language Processing | NLPRS 2001 | Two-level Hierarchical Probability | Unsupervised Word Segmentation |


Added	30 Jul 2010
Updated	30 Jul 2010
Type	Conference
Year	2001
Where	NLPRS
Authors	Fuchun Peng, Dale Schuurmans
