Sciweavers

WWW
2010
ACM

The paths more taken: matching DOM trees to search logs for accurate webpage clustering

An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are “important,” such distance functions might discriminate between similar pages based on trivial features (e.g., differing number of reviews on two product pages), or club together distinct types of pages based on superficial features present in the DOM trees of both (e.g., matching footer/copyright), leading to poor clustering performance. We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of ...
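The contrast the abstract draws, between a plain DOM-tree distance and one that knows which paths matter, can be illustrated with a minimal sketch. It is not the authors' method: the abstract is truncated before it explains how paths are scored from search logs, so the weights below are hand-assigned placeholders. The names `PathCollector`, `dom_paths`, `weighted_jaccard_distance`, the sample pages, and the weight values are all hypothetical, and weighted Jaccard over tag paths is simply one convenient distance chosen for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths in an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def weighted_jaccard_distance(paths_a, paths_b, weights=None, default=1.0):
    """1 - (weight of shared paths) / (weight of all paths)."""
    weights = weights or {}
    total = sum(weights.get(p, default) for p in paths_a | paths_b)
    if total == 0:
        return 0.0
    shared = sum(weights.get(p, default) for p in paths_a & paths_b)
    return 1.0 - shared / total


# Hypothetical pages: two product pages (one with a review block) and an
# unrelated page that shares only the footer.
FOOTER = "<footer><span>copyright</span></footer>"
product_a = ("<html><body><div><h1>Widget</h1></div>"
             "<div><p>Review</p></div>" + FOOTER + "</body></html>")
product_b = "<html><body><div><h1>Gadget</h1></div>" + FOOTER + "</body></html>"
about = "<html><body><ul><li>Team</li></ul>" + FOOTER + "</body></html>"

# Assumed weights: in the paper these would come from search logs; here
# they are set by hand to up-weight the title and ignore footer and reviews.
assumed_weights = {
    "html/body/div/h1": 5.0,
    "html/body/div/p": 0.0,
    "html/body/footer": 0.0,
    "html/body/footer/span": 0.0,
}

pa, pb, pc = dom_paths(product_a), dom_paths(product_b), dom_paths(about)
plain_same = weighted_jaccard_distance(pa, pb)
plain_diff = weighted_jaccard_distance(pa, pc)
weighted_same = weighted_jaccard_distance(pa, pb, assumed_weights)
weighted_diff = weighted_jaccard_distance(pa, pc, assumed_weights)
```

With these assumed weights, the two product pages move closer together (the extra review block no longer counts against them) and the unrelated page moves further away (the shared footer no longer counts in its favor), which is the effect the abstract attributes to knowing the important paths.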
Deepayan Chakrabarti, Rupesh R. Mehta
Added: 14 May 2010
Updated: 14 May 2010
Type: Conference
Year: 2010
Where: WWW
Authors: Deepayan Chakrabarti, Rupesh R. Mehta