Sciweavers

KDD
2003
ACM

Eliminating noisy information in Web pages for data mining

15 years 25 days ago
Eliminating noisy information in Web pages for data mining
A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks that are not the main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Eliminating these noises is thus of great importance. In this paper, we propose a noise elimination technique based on the following observation: In a given Web site, noisy blocks usually share some common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site,...
Lan Yi, Bing Liu, Xiaoli Li
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2003
Where KDD
Authors Lan Yi, Bing Liu, Xiaoli Li
Comments (0)