WWW
2006
ACM

Detecting spam web pages through content analysis

In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).

Categories and Subject Descriptors: H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia; K.4.m [Computers and Society]: Miscellaneous; H.4.m [Information Systems]: Miscellaneous

General Terms: Measurement, Experimentation, Algorithms

Keywords: Web characterization, web pages, web spam, data mining
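
The percentages quoted in the abstract can be checked from the raw counts. The following is a minimal sketch of that arithmetic, not code from the paper; the variable names are illustrative. The split of the 526 errors into missed spam and wrongly flagged non-spam is an inference from the abstract's wording, on the assumption that 526 counts both kinds of error together.

```python
# Counts quoted in the abstract.
total_pages = 17_168    # judged collection
spam_pages = 2_364      # pages judged to be spam
spam_detected = 2_037   # spam pages correctly identified
misidentified = 526     # spam and non-spam pages combined


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


recall = pct(spam_detected, spam_pages)        # share of spam that was caught
spam_share = pct(spam_pages, total_pages)      # share of the collection that is spam
error_rate = pct(misidentified, total_pages)   # share of all pages misclassified

# Assumption: the 526 errors are missed spam plus non-spam flagged as spam.
missed_spam = spam_pages - spam_detected
false_alarms = misidentified - missed_spam

print(recall, spam_share, error_rate)  # 86.2 13.8 3.1
print(missed_spam, false_alarms)       # 327 199
```

Under that assumption, 327 spam pages were missed and 199 non-spam pages were flagged as spam.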
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2006
Where WWW
Authors Alexandros Ntoulas, Marc Najork, Mark Manasse, Dennis Fetterly