Sciweavers

EMNLP
2010

An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy to constrain the hypothesis space, so that it can efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation on corpora in Japanese, Thai, and English, together with the "CHILDES" corpus for research in language development, shows that the algorithm achieves accuracy comparable to that of state-of-the-art methods in unsupervised word segmentation in significantly less computation time.
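
The two quantities named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the fixed context length `n`, and the way the lexicon is coded (by character frequency over distinct word types) are assumptions made only for illustration. Branching entropy measures how unpredictable the next character is after a given context, and the two-part code length adds the cost of the lexicon to the cost of the corpus encoded with that lexicon.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, n):
    """Entropy (in bits) of the character that follows each n-character context."""
    followers = defaultdict(Counter)
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def description_length(tokens):
    """Two-part code length in bits: lexicon cost plus corpus cost.

    Assumption: the lexicon is coded character by character using the
    character frequencies observed over the distinct word types.
    """
    token_counts = Counter(tokens)
    n_tokens = len(tokens)
    # Cost of the corpus, given the word probabilities.
    corpus_bits = -sum(c * math.log2(c / n_tokens) for c in token_counts.values())
    # Cost of spelling out each distinct word type once.
    char_counts = Counter(ch for word in token_counts for ch in word)
    n_chars = sum(char_counts.values())
    lexicon_bits = -sum(c * math.log2(c / n_chars) for c in char_counts.values())
    return lexicon_bits + corpus_bits
```

Under these assumptions, a high branching entropy after a context suggests a plausible word boundary, and comparing `description_length` for two candidate segmentations of the same text indicates which one gives the shorter code.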
Valentin Zhikov, Hiroya Takamura, Manabu Okumura
Added 11 Feb 2011
Updated 11 Feb 2011
Type Journal
Year 2010
Where EMNLP
Authors Valentin Zhikov, Hiroya Takamura, Manabu Okumura