Characterizing Web Spam Using Content and HTTP Session Analysis

16 years 26 days ago

Download www.ceas.cc

Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus – a collection of about 350,000 web spam pages. Our content analysis results are consistent with the hypothesis that web spam pages are different from normal web pages, showing far more duplication of physical content and URL redirections. An analysis of session information collected during the crawling of the Webb Spam Corpus shows significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values. These findings suggest that content and HTTP session analysis may contribute a great deal towards future efforts to automatically distinguish web spam pages from normal web pages.

Steve Webb, James Caverlee, Calton Pu
