Title extraction from bodies of HTML documents and its application to web page retrieval

This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification for HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracte...
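The largest-font baseline named in the abstract can be sketched roughly as below. This is an illustration only, not the authors' implementation: the `Line` record, its fields, and the function `largest_font_title` are hypothetical names, and the sketch assumes the page body has already been rendered into text lines with known font sizes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One rendered text line of a page body (hypothetical structure)."""
    text: str
    font_size: float   # assumed unit: points
    bold: bool         # font weight, unused by the baseline
    position: int      # 0-based index of the line within the body


def largest_font_title(lines: List[Line]) -> str:
    """Baseline: join, in document order, all lines set in the largest font size."""
    if not lines:
        return ""
    max_size = max(line.font_size for line in lines)
    ordered = sorted(lines, key=lambda line: line.position)
    return " ".join(line.text for line in ordered if line.font_size == max_size)


if __name__ == "__main__":
    body = [
        Line("Home | About | Contact", 10.0, False, 0),
        Line("Title extraction from bodies", 24.0, True, 1),
        Line("of HTML documents", 24.0, True, 2),
        Line("This paper is concerned with ...", 12.0, False, 3),
    ]
    print(largest_font_title(body))
```

The learned approach described in the abstract would instead feed features such as font size, position, and font weight to a supervised classifier; the baseline looks at font size alone.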

Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, Hang Li

Extracted Titles | HTML Document | HTML Titles | SIGIR 2005

» Extracting context to improve accuracy for HTML content extraction

» DOM-based content extraction of HTML documents

» Identifying primary content from web pages and its application to web search ranking

» Extracting Content Structure for Web Pages Based on Visual Representation

» Thresher: automating the unwrapping of semantic content from the World Wide Web

» Recognition of Common Areas in a Web Page Using Visual Information: a possible application ...

» Discovering informative content blocks from Web documents

» Scalable Web Mining with Newistic

Post Info

Added	26 Jun 2010
Updated	26 Jun 2010
Type	Conference
Year	2005
Where	SIGIR
Authors	Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, Hang Li
