Toward a unified approach to statistical language modeling for Chinese


This paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all using the maximum likelihood principle, which is consistent with the trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
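
The segmentation step mentioned in the abstract can be illustrated with a small sketch. This page does not describe the paper's actual procedure, so what follows is only an assumption-laden simplification: it scores candidate segmentations with unigram word probabilities (not the trigram model the paper trains) and picks the most likely one by dynamic programming. The function name `segment`, the `TOY_LEXICON` entries, their probabilities, and the `unknown_logprob` penalty are all invented for illustration.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (values are made up).
TOY_LEXICON = {
    "研究": 0.02,
    "研究生": 0.005,
    "生命": 0.01,
    "命": 0.001,
    "的": 0.05,
    "起源": 0.008,
}


def segment(text, lexicon_probs, max_word_len=4, unknown_logprob=-20.0):
    """Return the word sequence with the highest unigram likelihood.

    Characters not covered by the lexicon are allowed as single-character
    words with a fixed log-probability penalty, so a segmentation always exists.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-likelihood of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in lexicon_probs:
                score = math.log(lexicon_probs[word])
            elif len(word) == 1:
                score = unknown_logprob  # fallback for unknown characters
            else:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(segment("研究生命的起源", TOY_LEXICON))
```

With these made-up probabilities the sketch prefers 研究 / 生命 / 的 / 起源 over 研究生 / 命 / 的 / 起源, because the product of word probabilities is higher for the first reading. A real system would estimate the probabilities from data and, as the abstract indicates, score with a trigram model.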

Jianfeng Gao, Joshua Goodman, Mingjing Li, Kai-Fu Lee


Language Model | TALIP 2002 | Training Data | Unified Approach


Post Info

Added	23 Dec 2010
Updated	23 Dec 2010
Type	Journal
Year	2002
Where	TALIP
Authors	Jianfeng Gao, Joshua Goodman, Mingjing Li, Kai-Fu Lee
