
NLPRS
2001
Springer

A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling

We describe a simple improvement to n-gram language models where we estimate the distribution over closed-class (function) words separately from the conditional distribution of open-class words given function words. In English, function words account for about 30% of written language, and also form a natural skeleton for most sentences. By factoring a language model into a function word model and a conditional model over open-class words given function words, we largely avoid the problem of sparse training data in the first phase, and localize the need for sophisticated smoothing techniques primarily to the second conditional model. We test our factored approach on the Brown and Wall Street Journal corpora and observe a 3.5% to 25.2% improvement in perplexity over standard methods, depending on the particular smoothing method and test set used. Compared to other proposals for improving n-gram language models, our factorization has the advantage of inherent simplicity and efficiency, a...
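The abstract does not give the exact form of the two models, so the following is only a rough sketch of how such a factorization might look, not the authors' method. It assumes a bigram model over a sentence "skeleton" in which each open-class word is replaced by a placeholder, plus a second model that predicts the actual open-class word from the preceding skeleton token. The function-word list, the `<OPEN>` and `<s>` tokens, the `FactoredBigram` class, and the add-one smoothing are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical, tiny closed-class list; a real list would be far larger.
FUNCTION_WORDS = {"the", "a", "of", "in", "on", "and", "to", "is"}
OPEN = "<OPEN>"  # placeholder standing in for any open-class word (assumption)
BOS = "<s>"      # sentence-start marker (assumption)


def skeleton(tokens):
    """Replace every open-class token with the OPEN placeholder."""
    return [t if t in FUNCTION_WORDS else OPEN for t in tokens]


class FactoredBigram:
    """Skeleton bigram model plus an open-class word model (illustrative only)."""

    def __init__(self, sentences):
        self.skel_bigram = Counter()   # counts of (previous slot, slot)
        self.skel_context = Counter()  # counts of previous slot
        self.open_pair = Counter()     # counts of (previous slot, open-class word)
        self.open_context = Counter()  # counts of previous slot before an open word
        self.open_vocab = set()
        for tokens in sentences:
            prev = BOS
            for word, slot in zip(tokens, skeleton(tokens)):
                self.skel_bigram[(prev, slot)] += 1
                self.skel_context[prev] += 1
                if slot == OPEN:
                    self.open_pair[(prev, word)] += 1
                    self.open_context[prev] += 1
                    self.open_vocab.add(word)
                prev = slot

    def log_prob(self, tokens):
        """Sum of skeleton and open-class log probabilities, add-one smoothed."""
        skel_size = len(FUNCTION_WORDS) + 1   # function words plus OPEN
        open_size = len(self.open_vocab) + 1  # seen open words plus one unknown bucket
        total = 0.0
        prev = BOS
        for word, slot in zip(tokens, skeleton(tokens)):
            # Phase 1: small closed vocabulary, so counts are comparatively dense.
            total += math.log(
                (self.skel_bigram[(prev, slot)] + 1)
                / (self.skel_context[prev] + skel_size)
            )
            # Phase 2: the sparse part, where smoothing matters most.
            if slot == OPEN:
                total += math.log(
                    (self.open_pair[(prev, word)] + 1)
                    / (self.open_context[prev] + open_size)
                )
            prev = slot
        return total
```

For instance, `FactoredBigram([["the", "cat", "sat", "on", "the", "mat"]]).log_prob(["the", "dog", "sat"])` scores an unseen sentence. The add-one smoothing here is only a stand-in; the abstract suggests that more sophisticated smoothing would be applied mainly to the second, open-class model.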
Fuchun Peng, Dale Schuurmans
Added 30 Jul 2010
Updated 30 Jul 2010
Type Conference
Year 2001
Where NLPRS
Authors Fuchun Peng, Dale Schuurmans