DeepBot: a focused crawler for accessing hidden web content


Download www.tic.udc.es

Today's crawler engines cannot reach most of the information contained in the Web. A great deal of valuable information is "hidden" behind the query forms of online databases and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web focused crawler able to access such content. DeepBot receives a set of domain definitions as input, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it with several real-world data collection tasks.

Categories and Subject Descriptors: H.2.5 [Database Management]: Heterogeneous Databases. H.2.8 [Database Management]: Database Applications - Data mining. H.3.4 [Information Storage and Ret...
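As a rough illustration of the idea in the abstract (a domain definition describes a data-collecting task, and the crawler decides which query forms are relevant to it), here is a minimal Python sketch. It is not DeepBot's actual method, which this listing does not describe: the `Attribute` and `DomainDefinition` structures, the substring matching of aliases against form labels, and the 0.5 relevance threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    """One piece of data the task is interested in (hypothetical structure)."""
    name: str
    aliases: List[str] = field(default_factory=list)


@dataclass
class DomainDefinition:
    """Describes one data-collecting task (hypothetical structure)."""
    name: str
    attributes: List[Attribute]


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def match_fields(domain: DomainDefinition, form_labels: List[str]) -> Dict[str, str]:
    """Map each attribute name to the first form label containing its name or an alias.

    Plain substring matching is a crude stand-in for real label analysis.
    """
    matches: Dict[str, str] = {}
    for attr in domain.attributes:
        candidates = [normalize(c) for c in [attr.name, *attr.aliases]]
        for label in form_labels:
            if any(c in normalize(label) for c in candidates):
                matches[attr.name] = label
                break
    return matches


def is_relevant(domain: DomainDefinition, form_labels: List[str],
                threshold: float = 0.5) -> bool:
    """Treat a form as relevant when enough attributes find a matching label.

    The threshold value is an assumption chosen for illustration.
    """
    if not domain.attributes:
        return False
    matched = match_fields(domain, form_labels)
    return len(matched) / len(domain.attributes) >= threshold


if __name__ == "__main__":
    books = DomainDefinition(
        name="books",
        attributes=[
            Attribute("title", ["book title"]),
            Attribute("author", ["writer"]),
            Attribute("isbn"),
        ],
    )
    print(is_relevant(books, ["Book Title:", "Author name", "Publisher"]))
    print(is_relevant(books, ["Email", "Password"]))
```

In this sketch a search form labelled with "Book Title:" and "Author name" matches two of the three attributes and is judged relevant, while a login form matches none. A real crawler would also need to learn how to fill in and submit the matched fields, which the sketch does not attempt.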



Database Management | DEEC 2007 | Hidden Web | Hidden-web Focused Crawler | Information Management |


» Downloading textual hidden web content through keyword queries

» WebKhoj: Indian language IR from multiple character encodings

» Service Class Driven Dynamic Data Source Discovery with DynaBot

» Exposing the hidden web for chemical digital libraries

» Purely URL-based topic classification

» Learning search tasks in queries and web pages via graph regularization

» Extracting Relevant Snippets for Web Navigation

» ReAlignerV: Web-based genomic alignment tool with high specificity and robustness estimated ...

Post Info

Added	02 Jun 2010
Updated	02 Jun 2010
Type	Conference
Year	2007
Where	DEEC
Authors	Manuel Álvarez, Juan Raposo, Alberto Pan, Fidel Cacheda, Fernando Bellas, Victor Carneiro
