Good-Turing adjustments of word frequencies are an important tool in natural language modeling. In particular, for any sample of words, there is a set of words not occurring in that sample. The total probability mass of the words not in the sample is the so-called missing mass. Good showed that the fraction of the sample consisting of words that occur only once in the sample is a nearly unbiased estimate of the missing mass. Here, we give a PAC-style high-probability confidence interval for the actual missing mass. More generally, for k ≥ 0, we give a confidence interval for the true probability mass of the set of words occurring k times in the sample.
David A. McAllester, Robert E. Schapire
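As a minimal illustration of the estimator described in the abstract, the sketch below computes the Good-Turing estimate of the probability mass of words occurring exactly k times (for k = 0, the missing-mass estimate, i.e. the fraction of the sample made up of words that occur only once). The function name and the toy sample are illustrative, not taken from the paper.

```python
from collections import Counter

def good_turing_mass_estimate(sample, k=0):
    """Good-Turing estimate of the total probability mass of words
    occurring exactly k times in the sample.

    For k = 0 this is the classic missing-mass estimate: the fraction
    of the sample consisting of words that occur only once.
    """
    n = len(sample)
    counts = Counter(sample)                    # word -> frequency in the sample
    freq_of_freq = Counter(counts.values())     # frequency -> number of distinct words with it
    n_k_plus_1 = freq_of_freq.get(k + 1, 0)     # number of distinct words seen exactly k+1 times
    return (k + 1) * n_k_plus_1 / n             # Good-Turing mass estimate for count k

# Example: estimate the missing mass (k = 0) from a toy sample.
sample = "the cat sat on the mat with the cat".split()
print(good_turing_mass_estimate(sample, k=0))   # 4/9: four words occur exactly once
```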