Learning URL patterns for webpage de-duplication

The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these rules for de-duplication based on URL strings alone, without fetching the content explicitly. Our technique mines the crawl logs and uses clusters of similar pages to extract transformation rules, which are then used to normalize the URLs belonging to each cluster. Preserving every mined rule for de-duplication is not efficient because of the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our app...
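
The idea of mining rules from duplicate clusters and using them to normalize URLs can be illustrated with a small sketch. This is not the paper's algorithm: it is a toy example that assumes duplicate clusters are already known, that all URLs in a cluster share a host and path, and that the only rule type is "drop this query parameter on this host". The function names `mine_irrelevant_params` and `normalize` and the `example.com` URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def mine_irrelevant_params(clusters):
    """Toy rule miner (an assumption, not the paper's method).

    For each host, collect query parameters whose value differs, or is
    missing in some URLs, inside a cluster of known duplicate pages.
    """
    rules = defaultdict(set)
    for cluster in clusters:
        parsed = [urlsplit(u) for u in cluster]
        hosts = {p.netloc for p in parsed}
        if len(hosts) != 1:
            continue  # this sketch only handles single-host clusters
        host = hosts.pop()
        queries = [dict(parse_qsl(p.query)) for p in parsed]
        for key in set().union(*queries):
            # A parameter that varies across duplicates does not affect content.
            if len({q.get(key) for q in queries}) > 1:
                rules[host].add(key)
    return rules


def normalize(url, rules):
    """Apply the mined rules: drop irrelevant parameters, sort the rest."""
    parts = urlsplit(url)
    drop = rules.get(parts.netloc, set())
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in drop)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical cluster of two URLs known to serve the same page.
clusters = [[
    "http://example.com/item?id=7&sid=abc",
    "http://example.com/item?id=7&sid=xyz",
]]
rules = mine_irrelevant_params(clusters)
print(normalize("http://example.com/item?sid=q1&id=9", rules))
# http://example.com/item?id=9
```

Two unseen URLs that normalize to the same string can then be treated as likely duplicates without fetching either page. A rule set of this per-host, per-parameter form grows quickly, which is the motivation the abstract gives for generalizing rules rather than storing each one.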

Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal

Core Building Blocks | Data Mining | URLs | World Wide Web | WSDM 2010 |

Post Info

Added	18 May 2010
Updated	18 May 2010
Type	Conference
Year	2010
Where	WSDM
Authors	Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar
