

Hierarchical topic segmentation of websites

In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.
Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms Algorithms, Experimentation, Measurements
Keywords Website Hierarchy, Website Segmen...
Ravi Kumar, Kunal Punera, Andrew Tomkins
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2006
Where KDD
Authors Ravi Kumar, Kunal Punera, Andrew Tomkins