CEAS
2007
Springer

Characterizing Web Spam Using Content and HTTP Session Analysis

Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus – a collection of about 350,000 web spam pages. Our content analysis results are consistent with the hypothesis that web spam pages are different from normal web pages, showing far more duplication of physical content and URL redirections. An analysis of session information collected during the crawling of the Webb Spam Corpus shows significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values. These findings suggest that content and HTTP session analysis may contribute a great deal towards future efforts to automatically distinguish web spam pages from normal web pages.
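The two kinds of measurement the abstract mentions, content duplication and concentration of hosting IP addresses, could be approximated along the following lines. This is only an illustrative sketch and not the authors' method: the function names `duplicate_fraction` and `ip_range_histogram` are hypothetical, exact SHA-256 hashing stands in for whatever duplicate detection the paper actually uses, and the /8 prefix length is an arbitrary assumption for grouping addresses into ranges.

```python
import hashlib
from collections import Counter
from ipaddress import ip_network


def duplicate_fraction(pages):
    """Return the fraction of pages whose exact content repeats an earlier page.

    Assumption: "duplication" here means byte-identical content, detected by hashing.
    """
    counts = Counter(hashlib.sha256(p.encode("utf-8")).hexdigest() for p in pages)
    total = sum(counts.values())
    # Every page beyond the first copy of each distinct hash counts as a duplicate.
    return (total - len(counts)) / total if total else 0.0


def ip_range_histogram(ips, prefix_len=8):
    """Count hosting IPv4 addresses per network of the given prefix length.

    Assumption: a fixed prefix length is a reasonable stand-in for "narrow ranges".
    """
    return Counter(
        str(ip_network(f"{ip}/{prefix_len}", strict=False)) for ip in ips
    )


if __name__ == "__main__":
    # Made-up sample data, not taken from the Webb Spam Corpus.
    sample_pages = ["buy now", "buy now", "cheap pills", "buy now"]
    sample_ips = ["192.0.2.10", "192.0.2.77", "198.51.100.5"]
    print(duplicate_fraction(sample_pages))
    print(ip_range_histogram(sample_ips).most_common())
```

A histogram dominated by one or two networks would be the kind of concentration the abstract describes; near-duplicate detection (for example, shingling) would need a different technique than the exact hashing assumed here.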
Steve Webb, James Caverlee, Calton Pu
Added 07 Jun 2010
Updated 07 Jun 2010
Type Conference
Year 2007
Where CEAS
Authors Steve Webb, James Caverlee, Calton Pu