COLT 2004, Springer

Concentration Bounds for Unigrams Language Model

Abstract. We show several PAC-style concentration bounds for learning a unigram language model. One interesting quantity is the total probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a PAC bound of approximately $O\!\left(\frac{k}{\sqrt{m}}\right)$. We improve its dependency on k to $O\!\left(\frac{\sqrt[4]{k}}{\sqrt{m}} + \frac{k}{m}\right)$. We also analyze the empirical frequencies estimator, showing that its PAC error bound is approximately $O\!\left(\frac{1}{k} + \frac{\sqrt{k}}{m}\right)$. We derive a combined estimator, which has an error of approximately $O\!\left(m^{-2/5}\right)$ for any k. A standard measure of the quality of a learning algorithm is its expected per-word log-loss. We show that the leave-one-out method can be used to estimate the log-loss of the unigram model with a PAC error of approximately $O\!\left(\frac{1}{\sqrt{m}}\right)$, for any distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
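
The estimators compared in the abstract are straightforward to compute from the sample's count-of-counts statistics. Below is a minimal Python sketch, assuming a tokenized sample; the function names and the add-one smoothing used in the leave-one-out estimator are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter

def count_of_counts(sample):
    """n_k: how many distinct words appear exactly k times in the sample."""
    return Counter(Counter(sample).values())

def good_turing_mass(sample, k):
    """Good-Turing estimate of the total probability of words that
    appear exactly k times in a sample of size m: (k + 1) * n_{k+1} / m."""
    n = count_of_counts(sample)
    return (k + 1) * n[k + 1] / len(sample)

def empirical_mass(sample, k):
    """Empirical-frequencies estimate of the same mass: k * n_k / m."""
    n = count_of_counts(sample)
    return k * n[k] / len(sample)

def leave_one_out_log_loss(sample, vocab_size):
    """Leave-one-out estimate of the expected per-word log-loss.
    Each word w_i is scored by a model trained on the other m - 1 words;
    add-one smoothing is an illustrative choice here, not the paper's."""
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for w in sample:
        c = counts[w] - 1                   # count of w among the rest
        p = (c + 1) / (m - 1 + vocab_size)  # smoothed held-out probability
        total -= math.log(p)
    return total / m

sample = "the cat sat on the mat the cat".split()
print(good_turing_mass(sample, 1))         # mass of once-seen words, GT
print(empirical_mass(sample, 1))           # same mass, empirical estimate
print(leave_one_out_log_loss(sample, 10))  # per-word log-loss estimate
```

On this toy sample the two estimators already disagree: the Good-Turing estimate of the once-seen mass is 2 * n_2 / m = 0.25, while the empirical estimate is n_1 / m = 0.375, which is the kind of gap the paper's error bounds quantify.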
Evgeny Drukh, Yishay Mansour
Added 01 Jul 2010
Updated 01 Jul 2010
Type Conference
Year 2004
Where COLT
Authors Evgeny Drukh, Yishay Mansour