Sciweavers

WWW
2010
ACM

A pattern tree-based approach to learning URL normalization rules

14 years 6 months ago
A pattern tree-based approach to learning URL normalization rules
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs to a canonical form using a set of rewrite rules. Nowadays URL normalization has attracted significant attention as it is lightweight and can be flexibly integrated into both the online (e.g. crawling) and the offline (e.g. index compression) parts of a search engine. To deal with a large scale of websites, automatic approaches are highly desired to learn rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern treebased approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rules by iteratively inducing local duplicate pairs to more general forms, and inevitably suffer from noisy training data and are practically inefficient. Given a training set of URLs p...
Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodon
Added 14 May 2010
Updated 14 May 2010
Type Conference
Year 2010
Where WWW
Authors Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodong Fan, Lei Zhang
Comments (0)