In this paper, we describe CALM, a method for building statistical language models for the Web. CALM addresses several challenges unique to Web content. First, CALM does not require the whole corpus to be available before building the language model. Instead, we design CALM to adapt itself progressively as Web chunks are made available by the crawler. Second, given the dynamic and dramatic changes in Web content, CALM is designed to quickly enrich its lexicon and N-grams as new vocabulary and phrases are discovered. To reduce the heuristics and human intervention typically needed for model adaptation, we derive an information-theoretic formula for CALM that makes the adaptation optimal in the maximum a posteriori (MAP) sense. Testing against a collection of Web chunks in which new vocabulary and phrases are dominant, we show that CALM achieves a comparable and satisfactory model as measured by perplexity. We also show that CALM is robust against overtraining and the ini...
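The incremental adaptation described above can be illustrated with a minimal sketch: an existing model is linearly interpolated with counts from each newly crawled chunk, and quality is measured by perplexity on held-out text. This is an illustrative toy, not the paper's CALM implementation; the fixed weight `lam` is a hypothetical stand-in for the MAP-optimal adaptation factor the paper derives, and the model here is unigram-only for brevity.

```python
import math
from collections import Counter

class IncrementalUnigramLM:
    """Toy incremental unigram LM: p_new = lam * p_old + (1 - lam) * p_chunk."""

    def __init__(self, lam=0.5, vocab_size=10_000):
        self.lam = lam                # weight kept on the existing model
        self.vocab_size = vocab_size  # assumed vocabulary size for smoothing
        self.probs = {}               # current adapted unigram probabilities

    def adapt(self, chunk_tokens):
        """Fold a newly crawled chunk into the model; new words enter the lexicon."""
        counts = Counter(chunk_tokens)
        total = sum(counts.values())
        chunk_probs = {w: c / total for w, c in counts.items()}
        if not self.probs:
            # First chunk simply initializes the model.
            self.probs = chunk_probs
            return
        vocab = set(self.probs) | set(chunk_probs)
        self.probs = {
            w: self.lam * self.probs.get(w, 0.0)
               + (1 - self.lam) * chunk_probs.get(w, 0.0)
            for w in vocab
        }

    def prob(self, word):
        # Additive smoothing so unseen words get a small nonzero probability.
        return (self.probs.get(word, 0.0) + 1e-6) / (1 + 1e-6 * self.vocab_size)

    def perplexity(self, tokens):
        logp = sum(math.log(self.prob(w)) for w in tokens)
        return math.exp(-logp / len(tokens))

lm = IncrementalUnigramLM(lam=0.5)
lm.adapt("the web grows fast".split())
lm.adapt("the web changes daily".split())  # new vocabulary enters the model
print(round(lm.perplexity("the web changes fast".split()), 2))
```

In a full N-gram setting, the same interpolation would apply per history, and the MAP derivation replaces the hand-tuned `lam` with a weight computed from the data itself.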