Most of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.

Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]: Online Information Services – Web-based services.

General Terms: Performance, Experimentation.
Yih-Ling Hedley, Muhammad Younas, Anne E. James, M