Unsupervised Segmentation of Chinese Text by Use of Branching Entropy

We propose an unsupervised segmentation method based on an assumption about language data: that a point at which the entropy of successive characters increases is the location of a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data, and a precision of 90% was attained with a recall of around 80%. Moreover, we found that the precision was stable at around 90% regardless of the size of the training data.
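The boundary assumption in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal forward-only version on a toy corpus, and the function names (`build_successor_counts`, `branching_entropy`, `segment`), the `max_n` context limit, and the `min_rise` threshold are all assumptions made for illustration. It estimates the entropy of the next character for each context and places a boundary wherever that entropy rises as the context grows by one character.

```python
import math
from collections import Counter, defaultdict


def build_successor_counts(corpus, max_n):
    """For every context of length 1..max_n, count the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(corpus)):
        for n in range(1, max_n + 1):
            if i + n < len(corpus):
                counts[corpus[i:i + n]][corpus[i + n]] += 1
    return counts


def branching_entropy(counts, context):
    """Entropy (in bits) of the next character given the context; 0.0 if unseen."""
    followers = counts.get(context)
    if not followers:
        return 0.0
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


def segment(text, counts, max_n, min_rise=0.0):
    """Place a boundary after a context whose entropy exceeds that of its prefix."""
    boundaries = set()
    for start in range(len(text)):
        prev = None
        for n in range(1, max_n + 1):
            end = start + n
            if end >= len(text):
                break
            h = branching_entropy(counts, text[start:end])
            if prev is not None and h - prev > min_rise:
                boundaries.add(end)
            prev = h
    words, last = [], 0
    for b in sorted(boundaries):
        words.append(text[last:b])
        last = b
    words.append(text[last:])
    return words


# Toy corpus (an assumption): the "words" ab and cd concatenated without spaces.
corpus = "abcdababcdcdab"
counts = build_successor_counts(corpus, max_n=2)
print(segment("cdabcd", counts, max_n=2))  # ['cd', 'ab', 'cd']
```

In this toy corpus the character after "a" is always "b" (entropy 0), while the character after "ab" may be "a" or "c", so the entropy rises at the end of "ab" and a boundary is placed there. Real data would need longer contexts and a tuned threshold.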

Zhihui Jin, Kumiko Tanaka-Ishii

ACL 2006 | ACL 2007 | Language Data | Unsegmented Training Data | Unsupervised Segmentation Method

» Unsupervized Word Segmentation the Case for Mandarin Chinese

» The Markov Expert for Finding Episodes in Time Series

Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2006
Where	ACL
Authors	Zhihui Jin, Kumiko Tanaka-Ishii
