A fast and robust method for web page template detection and removal



The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates, obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and cla...
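The two-step idea in the abstract (detect once on a small sample, then strip the result from every other page) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how detection works, so the sketch substitutes a simple stand-in that treats any whitespace-separated term appearing in most sample pages as template content. The function names, the `min_fraction` threshold, and the toy pages are all assumptions made for illustration. The `f_measure` helper shows how detected template terms could be scored against a hand-labelled set, in the spirit of the evaluation the abstract mentions.

```python
from collections import Counter


def detect_template_terms(sample_pages, min_fraction=0.9):
    """Step 1 (stand-in heuristic): terms present in at least
    min_fraction of the sample pages are treated as template terms."""
    doc_freq = Counter()
    for page in sample_pages:
        # Count each term once per page (document frequency).
        doc_freq.update(set(page.split()))
    threshold = min_fraction * len(sample_pages)
    return {term for term, count in doc_freq.items() if count >= threshold}


def remove_template(page, template_terms):
    """Step 2: drop previously detected template terms from one page,
    without repeating the detection step."""
    return " ".join(t for t in page.split() if t not in template_terms)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of terms."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy pages: "Home News Contact" plays the role of the template.
sample = [
    "Home News Contact cheap flights to Lisbon",
    "Home News Contact hotel reviews in Porto",
    "Home News Contact train schedules for Faro",
]
template = detect_template_terms(sample)
cleaned = remove_template("Home News Contact weather in Braga", template)
score = f_measure(template, {"Home", "News", "Contact", "About"})
```

In this sketch the detection cost depends only on the size of the sample, while removal is a single pass over each remaining page, which is the source of the performance gain the abstract describes.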

Karane Vieira, Altigran Soares da Silva, Nick Pint


CIKM 2006 | Information Management | Template Detection | Web Mining Methods | Web Page


» A densitometric approach to web page segmentation

» Fast Head Tilt Detection for Human-Computer Interaction

Post Info

Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2006
Where	CIKM
Authors	Karane Vieira, Altigran Soares da Silva, Nick Pinto, Edleno Silva de Moura, João M. B. Cavalcanti, Juliana Freire
