WWW
2005
ACM

Finding the boundaries of information resources on the web

In recent years, many Web algorithms have been developed that work with information units distinct from individual web pages. These include segments of web pages and aggregations of web pages into web communities. Using these logical information units has been shown to improve the performance of many web algorithms. In this paper, we focus on a type of logical information unit called a compound document. We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former supervising the latter. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on...
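
The abstract does not say which learner or clustering method the authors use, so the following is only an illustrative sketch of the general idea of a learned model supervising a clustering step. It assumes, hypothetically, that each hyperlink between two pages is described by two features (`same_directory`, a structural feature, and `content_similarity`, a content feature), that a small logistic-regression model is trained on hand-labeled page pairs, and that linked pages are merged with union-find when the model scores them at or above a threshold. All function names, features, and data values are invented for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict


def train_logistic(samples, labels, epochs=2000, lr=0.5):
    """Fit a tiny logistic-regression model with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def same_doc_probability(model, x):
    """Estimated probability that two linked pages belong to one compound document."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def cluster_pages(pages, edges, model, threshold=0.5):
    """Merge linked pages whose learned score reaches the threshold (union-find)."""
    parent = {p: p for p in pages}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (a, b), features in edges.items():
        if same_doc_probability(model, features) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in pages:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical labeled pairs: [same_directory, content_similarity] -> same document?
train_x = [[1, 0.9], [1, 0.8], [1, 0.7], [0, 0.1], [0, 0.2], [0, 0.3], [1, 0.1]]
train_y = [1, 1, 1, 0, 0, 0, 0]
model = train_logistic(train_x, train_y)

# Hypothetical site: three lecture-note pages plus an unrelated page they link to.
pages = ["a", "b", "c", "d"]
edges = {("a", "b"): [1, 0.95], ("b", "c"): [1, 0.95], ("c", "d"): [0, 0.05]}
docs = cluster_pages(pages, edges, model)
print(docs)
```

Each inner list in `docs` is one candidate compound document. The supervised step decides which links count as "within a document", and the clustering step only follows those links, which is one simple way to read "the former supervising the latter".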
Pavel Dmitriev, Carl Lagoze, Boris Suchkov
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where WWW