Sciweavers

WWW
2005
ACM

Web data cleansing for information retrieval using key resource page selection

With the explosive growth of pages on the WWW, covering more useful information with limited storage and computation resources has become increasingly important in web IR research. Using an analysis of non-content features of web pages, we propose a clustering-based method to select high-quality pages from the whole page set. Although the resulting page set contains only 44.3% of the whole collection, it is associated with more than 98% of links and covers about 90% of key information. Link properties and retrieval effects are also examined, and experimental results show that the key resource selection method is more suitable for data cleansing: the resulting page set outperforms the whole collection, with a smaller size and better retrieval performance.
Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms Experimentation
Keywords Web data cleansing, Non-content feature, Web IR
Yiqun Liu, Canhui Wang, Min Zhang, Shaoping Ma
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2005
Where WWW
Authors Yiqun Liu, Canhui Wang, Min Zhang, Shaoping Ma