De-duping URLs via rewrite rules

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content. In order to express such transformation rules, we propose a simple framework that...
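To make the idea of applying rewrite rules without fetching content concrete, here is a minimal sketch in Python. It is not the framework proposed in the paper (the abstract above is truncated before describing it); the regex-based rule format, the `sid` session parameter, the `example.com` URLs, and the function names `canonicalize` and `dedupe` are all assumptions made for illustration. The rules are written by hand here, whereas the paper is about learning them from equivalence classes of URLs.

```python
import re
from collections import defaultdict

# Hypothetical, hand-written rewrite rules as (pattern, replacement) pairs.
# These are illustrative assumptions, not rules taken from the paper.
REWRITE_RULES = [
    # Drop a session-id query parameter named "sid" (assumed to be irrelevant).
    (re.compile(r"([?&])sid=[^&]*&?"), r"\1"),
    # Treat a leading "www." host prefix as irrelevant.
    (re.compile(r"^(https?://)www\."), r"\1"),
    # Strip a dangling "?" or "&" left behind by the rules above.
    (re.compile(r"[?&]$"), ""),
]


def canonicalize(url):
    """Apply every rewrite rule in order and return the rewritten URL."""
    for pattern, replacement in REWRITE_RULES:
        url = pattern.sub(replacement, url)
    return url


def dedupe(urls):
    """Group URLs by canonical form, using only the URL strings."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    crawl_frontier = [
        "http://www.example.com/page?sid=abc123",
        "http://example.com/page",
        "http://example.com/page?id=7&sid=zz",
        "http://example.com/page?sid=zz&id=7",
    ]
    for canonical, members in dedupe(crawl_frontier).items():
        print(canonical, "<-", members)
```

Under these assumed rules, the first two URLs collapse to one canonical form and the last two to another, so a crawler could skip fetching the second member of each group. Note that the sketch does not reorder query parameters; it only removes the ones a rule marks as irrelevant.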

Anirban Dasgupta, Ravi Kumar, Amit Sasturkar

Data Mining | Duplicate URLs | KDD 2008 | URL Rewrite Patterns | URL Rewrite Rules

Post Info

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2008
Where	KDD
Authors	Anirban Dasgupta, Ravi Kumar, Amit Sasturkar
