Web data cleansing for information retrieval using key resource page selection

16 years 7 months ago

Download www.www2005.org

With the explosive growth of pages on the WWW, covering more useful information with limited storage and computation resources is becoming increasingly important in web IR research. Using an analysis of non-content features of web pages, we propose a clustering-based method to select high-quality pages from the whole page set. Although the resulting page set contains only 44.3% of the whole collection, it is related to more than 98% of links and covers about 90% of key information. Link properties and retrieval effects are also examined, and experimental results show that the key resource selection method is more suitable for data cleansing: the resulting page set outperforms the whole collection, with smaller size and better retrieval performance.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Experimentation

Keywords: Web data cleansing, Non-content feature, Web IR
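The abstract does not spell out the features or the clustering algorithm, so the following is only a minimal sketch of the general idea of clustering pages on non-content features and keeping one cluster as the key resource set. Everything specific here is an assumption made for illustration: the two features (in-link count and URL length), the use of plain k-means with two clusters, the seeding rule, the rule that the cluster with the higher in-link centroid is the one kept, and the example URLs and numbers.

```python
import math


def normalize(rows):
    """Scale each feature column to [0, 1] so no single feature dominates."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means with caller-supplied initial centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for i, p in enumerate(points):
            nearest = min(
                range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]),
            )
            clusters[nearest].append(i)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            [sum(points[i][d] for i in members) / len(members)
             for d in range(len(points[0]))]
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids


def select_key_pages(pages):
    """Return the URLs in the cluster assumed to hold key resource pages.

    `pages` maps URL -> (in_link_count, url_length); both features are
    hypothetical stand-ins for the paper's non-content features.
    """
    urls = list(pages)
    points = normalize([pages[u] for u in urls])
    # Deterministic seeding: the pages with the fewest and the most in-links.
    low = min(points, key=lambda p: p[0])
    high = max(points, key=lambda p: p[0])
    clusters, centroids = kmeans(points, [low, high])
    # Assumption: the cluster with the higher in-link centroid is kept.
    best = max(range(len(centroids)), key=lambda c: centroids[c][0])
    return sorted(urls[i] for i in clusters[best])


# Made-up example data: (in-link count, URL length in characters).
pages = {
    "http://example.org/": (120, 19),
    "http://example.org/products/": (85, 28),
    "http://example.org/a/b/c/item?id=1": (2, 34),
    "http://example.org/a/b/c/item?id=2": (1, 34),
    "http://example.org/tmp/print.html": (0, 33),
}
key_pages = select_key_pages(pages)
```

On this toy input the two heavily linked, short-URL pages end up in the kept cluster and the three deep, rarely linked pages are dropped. A real pipeline would need many more features and a principled way to decide which cluster counts as high quality.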

Yiqun Liu, Canhui Wang, Min Zhang, Shaoping Ma
