Distributed Hypertext Resource Discovery Through Examples

15 years 11 months ago

We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as “find the number of links from an environmental protection page to a page about oil and natural gas over the last year.” A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based “find similar” search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead we rely on two properties: pages tend to cite pages with related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of th...
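To make the example query in the abstract concrete, here is a minimal sketch of how such a question might be posed against a relational store. The schema (a `page` table with a single `topic` label and a `link` table with a `first_seen` date), the function names, and the sample URLs are all assumptions made for illustration; the abstract does not specify the system's actual schema or how topics are assigned.

```python
import sqlite3
from datetime import date


def build_demo_db():
    """Create an in-memory database with a hypothetical page/link schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (url TEXT PRIMARY KEY, topic TEXT);
        CREATE TABLE link (src TEXT, dst TEXT, first_seen TEXT);
        """
    )
    # Invented sample pages; topic labels stand in for classifier output.
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [
            ("env1.example", "environmental protection"),
            ("env2.example", "environmental protection"),
            ("oil1.example", "oil and natural gas"),
            ("oil2.example", "oil and natural gas"),
        ],
    )
    # Invented sample links with ISO-8601 dates (these sort correctly as text).
    conn.executemany(
        "INSERT INTO link VALUES (?, ?, ?)",
        [
            ("env1.example", "oil1.example", "1999-01-10"),  # counted
            ("env2.example", "oil1.example", "1998-12-01"),  # counted
            ("env1.example", "oil2.example", "1997-03-05"),  # too old
            ("oil1.example", "env1.example", "1999-02-01"),  # wrong direction
            ("env1.example", "env2.example", "1999-03-01"),  # wrong target topic
        ],
    )
    return conn


def count_topic_links(conn, src_topic, dst_topic, since):
    """Count links from pages on src_topic to pages on dst_topic seen on or after `since`."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM link
        JOIN page AS s ON s.url = link.src
        JOIN page AS d ON d.url = link.dst
        WHERE s.topic = ? AND d.topic = ? AND link.first_seen >= ?
        """,
        (src_topic, dst_topic, since.isoformat()),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_demo_db()
    n = count_topic_links(
        conn, "environmental protection", "oil and natural gas", date(1998, 6, 1)
    )
    print(n)
```

The query joins the link table to the page table twice, once for each endpoint, so that content-derived labels and hyperlink structure can be filtered together. This is only meaningful once the database holds pages on the relevant topics, which is the resource discovery problem the abstract goes on to address.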

Soumen Chakrabarti, Martin van den Berg, Byron Dom
