Focusing on novelty: a crawling strategy to build diverse language models


Word prediction performed by language models plays an important role in many tasks, such as word sense disambiguation, speech recognition, handwriting recognition, query spelling and query segmentation. Recent research has exploited the textual content of the Web to create language models. In this paper, we propose a new focused crawling strategy for collecting Web pages that prioritizes novelty in order to create diverse language models. In each crawling cycle, the crawler tries to fill the gaps in the current language model, built from previous cycles, by avoiding pages whose vocabulary is already well represented in the model. It relies on an information-theoretic measure to identify these gaps and then learns link patterns that lead to pages in these regions in order to guide its visitation policy. To handle constantly evolving domains, a key feature of our crawler is its ability to adjust its focus as the crawl progresses. We evaluate our approach in two different scena...
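The abstract does not name the information-theoretic measure it uses. As a rough illustration of the idea only, the sketch below assumes per-token cross-entropy under an add-one smoothed unigram model: pages whose words the current model predicts poorly score higher and would be the ones a novelty-focused crawler prefers. The function names, the smoothing scheme and the toy data are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count word occurrences seen in previous crawl cycles (hypothetical helper)."""
    return Counter(tokens)


def novelty_score(page_tokens, counts, vocab_size):
    """Per-token cross-entropy (in bits) of a page under the unigram model.

    Assumption: add-one smoothing over a vocabulary of `vocab_size` words.
    Higher scores mean the page's vocabulary is less well represented.
    """
    if not page_tokens:
        return 0.0
    total = sum(counts.values())
    log_prob = 0.0
    for token in page_tokens:
        prob = (counts.get(token, 0) + 1) / (total + vocab_size)
        log_prob += math.log2(prob)
    return -log_prob / len(page_tokens)


# Toy model built from text collected so far (made-up data).
model = train_unigram("the cat sat on the mat".split())
vocab_size = len(model) + 1  # one extra slot for unseen words

familiar_page = "the cat sat".split()
novel_page = "quantum crawler entropy".split()

familiar = novelty_score(familiar_page, model, vocab_size)
novel = novelty_score(novel_page, model, vocab_size)
print(f"familiar page: {familiar:.3f} bits/token")
print(f"novel page:    {novel:.3f} bits/token")
```

In this toy setting the page made entirely of unseen words receives the higher score, which is the kind of gap the abstract describes the crawler as trying to fill.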

Luciano Barbosa, Srinivas Bangalore


CIKM 2011 | Information Technology | Language Models | Link Patterns | Word Sense Disambiguation |


Post Info

Added	13 Dec 2011
Updated	13 Dec 2011
Type	Conference
Year	2011
Where	CIKM
Authors	Luciano Barbosa, Srinivas Bangalore
