The discoverability of the web


Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But...
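The maximum cover formulation and the overhead measure in the abstract can be illustrated with a small sketch. It assumes a toy link graph (the page names and the `links` mapping are hypothetical) and uses the standard greedy heuristic for maximum cover, which is not necessarily the algorithm the authors recommend. Overhead is computed here as fetches divided by new pages discovered, which is one reading of the definition given above.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(links: Dict[str, Set[str]], budget: int) -> Tuple[List[str], Set[str]]:
    """Pick up to `budget` known pages to fetch, greedily maximizing new pages found.

    `links` maps each known page to the set of new pages it links to
    (hypothetical data; in the perfect-foreknowledge setting this is known exactly).
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(budget):
        # Choose the page whose links add the most not-yet-discovered new pages.
        best = max(links, key=lambda p: len(links[p] - covered), default=None)
        if best is None or not (links[best] - covered):
            break  # no remaining page contributes anything new
        chosen.append(best)
        covered |= links[best]
    return chosen, covered


def overhead(fetches: int, discovered: int) -> float:
    """Fetches spent per new page discovered (assumed reading of the abstract)."""
    return fetches / discovered


# Toy example: four known pages, six new pages in total.
links = {
    "hub": {"n1", "n2", "n3", "n4"},
    "blog": {"n3", "n4", "n5"},
    "news": {"n5", "n6"},
    "stale": set(),
}
chosen, covered = greedy_cover(links, budget=2)
print(chosen, sorted(covered), overhead(len(chosen), len(covered)))
```

In this toy graph two fetches discover all six new pages. In the more realistic setting the abstract describes, the exact link sets would be replaced by estimates derived from historical statistics.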

Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, Andrew Tomkins


General Terms: Algorithms | Internet Technology | Maximum Cover Formulation | Perfect Foreknowledge | WWW 2007


» Discovering informative content blocks from Web documents

» SCCM ServiceOriented Community Coordinated Multimedia Architecture

» Information Gathering During Planning for Web Service Composition

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2007
Where	WWW
Authors	Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, Andrew Tomkins
