Sciweavers

WWW
2011
ACM

Identifying primary content from web pages and its application to web search ranking

Web pages are usually highly structured documents. In some documents, content with different functionality is laid out in blocks, some of which merely support the main discourse. In other documents, there may be several blocks of unrelated main content. Indexing a web page as if it were a linear document can cause problems because of the diverse nature of its content. If the retrieval function treats all blocks of the web page equally, without attention to structure, it may produce irrelevant query matches. In this paper, we describe how the content quality of different blocks of a web page can be used to improve a retrieval function. Our method is based on segmenting a web page into semantically coherent blocks and learning a predictor of segment content quality. We also describe how to use segment content quality estimates as weights in the BM25F formulation. Experimental results show that our method improves the relevance of retrieved results by as much as 4.5% compared to BM25F that treats the ...
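As a rough illustration of the idea of using segment quality estimates as weights in BM25F, the sketch below treats each page segment as a field whose weight is its predicted quality. It is not the paper's exact formulation: the function name `bm25f_score`, the single length-normalization parameter `b`, the default values of `k1`, `b`, and `avg_len`, and the quality scores themselves are all assumptions made for the example. The quality predictor is taken as given.

```python
def bm25f_score(query_terms, segments, idf, k1=1.2, b=0.75, avg_len=100.0):
    """Score one page with a simplified BM25F-style formula.

    segments: list of (tokens, quality_weight) pairs, one per page block.
    idf:      dict mapping a term to its inverse document frequency.
    All parameter defaults are illustrative assumptions, not tuned values.
    """
    # Page length over all segments, used for length normalization.
    doc_len = sum(len(tokens) for tokens, _ in segments)
    norm = 1.0 - b + b * doc_len / avg_len

    score = 0.0
    for term in query_terms:
        # Term frequency accumulated across segments, each occurrence
        # scaled by the (assumed) quality weight of its segment.
        weighted_tf = sum(w * tokens.count(term) for tokens, w in segments) / norm
        # Standard BM25 saturation applied to the weighted frequency.
        score += idf.get(term, 0.0) * weighted_tf / (k1 + weighted_tf)
    return score


# Hypothetical pages: the query term sits in the main block of page A
# but only in a low-quality navigation block of page B.
page_a = [(["camera", "review", "lens"], 0.9), (["home", "about", "contact"], 0.1)]
page_b = [(["lens", "review", "photo"], 0.9), (["camera", "about", "contact"], 0.1)]
idf = {"camera": 2.0}

score_a = bm25f_score(["camera"], page_a, idf, avg_len=6.0)
score_b = bm25f_score(["camera"], page_b, idf, avg_len=6.0)
```

With uniform weights the two hypothetical pages would score the same, since each contains the query term once. With quality weights, the page that matches in its main block ranks higher.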
Srinivas Vadrevu, Emre Velipasaoglu
Added 29 May 2011
Updated 29 May 2011
Type Journal
Year 2011
Where WWW
Authors Srinivas Vadrevu, Emre Velipasaoglu