As we may perceive: finding the boundaries of compound documents on the web

This paper considers the problem of identifying on the Web compound documents (cDocs): groups of web pages that in aggregate constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several HTML pages, or a set of pages describing the specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications, including web and intranet search, user navigation, automated collection generation, and information extraction. In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles, and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their appl...

Pavel Dmitriev

CDoc Discovery | CDoc Identification Purposes | Internet Technology | Machine Learning-based Algorithms | WWW 2008

» Finding the boundaries of information resources on the web

» Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

» Factors impeding Wiki use in the enterprise: a case study

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2008
Where	WWW
Authors	Pavel Dmitriev
