As we may perceive: inferring logical documents from hypertext

In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages and aggregations of web pages into web communities. Such logical information units improve a variety of web algorithms and provide the building blocks for the construction of organized information spaces such as digital libraries. In this paper, we focus on a type of logical information unit called “compound documents”. We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former supervising the latter. We also propose a new method for evaluating qualit...
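
The abstract gives no implementation details, so the following is only a rough sketch of the general idea of a learned model supervising a clustering step. It is not the authors' method. The three pair features (same directory, direct hyperlink, word overlap), the page dictionary keys `url`, `links`, and `text`, and the function names are all hypothetical. Logistic regression and single-link grouping are stand-ins chosen for brevity.

```python
import math
from itertools import combinations
from urllib.parse import urlparse


def pair_features(a, b):
    """Hypothetical features for a pair of pages: two structural, one content."""
    dir_a = urlparse(a["url"]).path.rsplit("/", 1)[0]
    dir_b = urlparse(b["url"]).path.rsplit("/", 1)[0]
    same_dir = 1.0 if dir_a == dir_b else 0.0
    linked = 1.0 if b["url"] in a["links"] or a["url"] in b["links"] else 0.0
    words_a, words_b = set(a["text"].split()), set(b["text"].split())
    union = words_a | words_b
    overlap = len(words_a & words_b) / len(union) if union else 0.0
    return [same_dir, linked, overlap]


def score(w, b, x):
    """Linear score; positive means 'same compound document'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient ascent.

    examples: list of (feature_vector, label), label 1 if the pair of
    pages belongs to the same compound document, else 0.
    """
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in examples:
            err = y - 1.0 / (1.0 + math.exp(-score(w, b, x)))
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi + lr * g for wi, g in zip(w, grad_w)]
        b += lr * grad_b
    return w, b


def cluster(pages, w, b):
    """Group pages whose learned pair score is positive (single-link)."""
    parent = list(range(len(pages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(pages)), 2):
        if score(w, b, pair_features(pages[i], pages[j])) > 0:
            parent[find(i)] = find(j)

    groups = {}
    for i, page in enumerate(pages):
        groups.setdefault(find(i), []).append(page["url"])
    return sorted(groups.values())
```

In this sketch, `train` would be given feature vectors from page pairs labeled by hand, and `cluster` would then be run on unlabeled pages from a site. Each returned group is a candidate compound document.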

Pavel Dmitriev, Carl Lagoze, Boris Suchkov

Compound Documents | HT 2005 | Information Units | Web Pages |

» HySpirit: A Probabilistic Inference Engine for Hypermedia Retrieval in Large Databases

» Multiple Source Internet Tomography

» Fine-grained structured configuration management for web projects

Post Info

Added	26 Jun 2010
Updated	26 Jun 2010
Type	Conference
Year	2005
Where	HT
Authors	Pavel Dmitriev, Carl Lagoze, Boris Suchkov
