In domains with insufficient matched training data, language models are often constructed by interpolating component models trained on partially matched corpora. Since the n-grams from such corpora may not be equally relevant to the target domain, we propose an n-gram weighting technique that adjusts the component n-gram probabilities based on features derived from readily available segmentation and metadata information for each corpus. Using a log-linear combination of such features, the resulting model achieves up to a
Bo-June Paul Hsu, James R. Glass
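
The abstract does not spell out the model form, so the following is only a minimal sketch of one way such n-gram weighting could look. It assumes that each component probability is scaled by a factor exp(theta · features) before linear interpolation, and that the result is renormalized over the vocabulary for a fixed history. The function name `weighted_interpolation` and the parameters `thetas` and `lambdas` are hypothetical and chosen for illustration.

```python
import math


def weighted_interpolation(component_probs, features, thetas, lambdas):
    """Combine component n-gram probabilities for a single history h.

    component_probs[i][w]: p_i(w | h) from component model i
    features[i][w]:        feature vector of the n-gram (h, w) in corpus i
    thetas[i]:             feature weights for component i (assumed parameters)
    lambdas[i]:            interpolation weight for component i
    """
    scores = {}
    for w in component_probs[0]:
        total = 0.0
        for p, f, theta, lam in zip(component_probs, features, thetas, lambdas):
            # Assumed log-linear weighting factor: exp(theta . features)
            beta = math.exp(sum(t * x for t, x in zip(theta, f[w])))
            total += lam * beta * p[w]
        scores[w] = total
    # Renormalize so the weighted scores form a distribution over words
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


# Toy example: two component models, one feature per n-gram.
probs = weighted_interpolation(
    component_probs=[{"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}],
    features=[{"a": [1.0], "b": [0.0]}, {"a": [0.0], "b": [0.0]}],
    thetas=[[math.log(2.0)], [0.0]],
    lambdas=[0.5, 0.5],
)
```

With all feature weights set to zero, every weighting factor equals 1 and the sketch reduces to ordinary linear interpolation of the component models.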