Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages


The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call “web spam”, that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are hig...
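The idea of flagging pages whose properties sit far from the bulk of the distribution can be illustrated with a minimal sketch. It is not the authors' method, which the truncated abstract does not spell out. It assumes a single numeric property per page (for example, a count of outgoing links) and uses a median-based modified z-score as a stand-in outlier test. The function name `flag_outliers`, the threshold of 3.5, and the sample values are all hypothetical.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values that lie far from the median.

    Assumption: a modified z-score based on the median absolute
    deviation (MAD) is used here purely for illustration; the paper
    may use a different statistical test.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # More than half the values equal the median, so any value
        # that differs from it is treated as an outlier.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales the MAD so the score is comparable to a
    # standard z-score under a normal distribution.
    return [
        i
        for i, d in enumerate(deviations)
        if 0.6745 * d / mad > threshold
    ]


# Hypothetical per-page property values (made-up data).
links_per_page = [10, 12, 11, 13, 12, 11, 500]
suspects = flag_outliers(links_per_page)
```

A median-based score is chosen here because a single extreme page does not shift the median the way it would shift a mean, so the extreme value cannot mask itself. Flagged pages would be candidates for further inspection, not confirmed spam.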

Dennis Fetterly, Mark Manasse, Marc Najork


Internet Technology | Search Engines | Web Pages | Web Spam | WEBDB 2004 |


» Detecting image spam using visual features and near duplicate detection

» An economic model of the worldwide web

» Counting triangles in realworld networks using projections

Post Info

Added	02 Jul 2010
Updated	02 Jul 2010
Type	Conference
Year	2004
Where	WEBDB
Authors	Dennis Fetterly, Mark Manasse, Marc Najork
