SIGIR
2004
ACM

Query-related data extraction of hidden web documents

The greater part of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.

Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]: Online Information Services – Web-based services.

General Terms: Performance, Experimentation.
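The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a text segment is template content when the same text, with the same tags immediately before and after it, also appears on another result page from the same database; whatever remains is treated as query-related data. The names `SegmentParser`, `segments_of`, and `extract_query_related` are hypothetical.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collects (previous_tag, text, next_tag) triples from an HTML page."""

    def __init__(self):
        super().__init__()
        self.segments = []     # completed triples
        self._last_tag = None  # most recent tag seen before the current text
        self._pending = None   # (previous_tag, text) still waiting for its next tag

    def _close_pending(self, tag):
        # A tag (or end of input) ends the current text segment, if there is one.
        if self._pending is not None:
            prev_tag, text = self._pending
            self.segments.append((prev_tag, text, tag))
            self._pending = None
        self._last_tag = tag

    def handle_starttag(self, tag, attrs):
        self._close_pending("<%s>" % tag)

    def handle_endtag(self, tag):
        self._close_pending("</%s>" % tag)

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._pending is None:
            self._pending = (self._last_tag, text)
        else:
            prev_tag, earlier = self._pending
            self._pending = (prev_tag, earlier + " " + text)

    def close(self):
        super().close()
        self._close_pending(None)  # flush text that runs to the end of the page


def segments_of(html):
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def extract_query_related(page, other_pages):
    """Return text segments of `page` not shared (text plus adjacent tags) with other pages."""
    template = set()
    for other in other_pages:
        template.update(segments_of(other))
    return [text for (prev, text, nxt) in segments_of(page)
            if (prev, text, nxt) not in template]


# Two made-up result pages that share a header and a footer.
page_a = ("<html><body><h1>Search results</h1>"
          "<ul><li>Hidden Web crawling</li></ul>"
          "<p>Contact us</p></body></html>")
page_b = ("<html><body><h1>Search results</h1>"
          "<ul><li>Template detection</li></ul>"
          "<p>Contact us</p></body></html>")

print(extract_query_related(page_a, [page_b]))  # ['Hidden Web crawling']
```

Comparing the adjacent tags along with the text means that a phrase repeated in a different structural position is not mistaken for template content. A realistic system would need more than an exact match, for example to cope with template text that varies slightly from page to page.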
Yih-Ling Hedley, Muhammad Younas, Anne E. James, Mark Sanderson
Added: 30 Jun 2010
Updated: 30 Jun 2010
Type: Conference
Year: 2004
Where: SIGIR
Authors: Yih-Ling Hedley, Muhammad Younas, Anne E. James, Mark Sanderson