Sciweavers

AIRWEB
2006
Springer

Tracking Web Spam with Hidden Style Similarity

Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as are less legitimate spamdexing attempts (e.g. link farms, fake directories...). Pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more closely related to stylometry. In this paper, we present a (hidden) style similarity measure based on extra-textual features in HTML source code. We also describe a method to cluster a large collection of documents according to this measure. Because the clustering algorithm is based on fingerprints, we also review fingerprinting. By conveniently sorting the generated clusters, one can efficiently track back instances of a particular automatic content generation method among web pages collected ...
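The abstract does not spell out the measure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that "extra-textual features" can be approximated by deleting every letter and digit from the HTML source, that similarity is the Jaccard index over character n-grams of what remains, and that a fingerprint is a small min-hash signature. The names `style_skeleton`, `shingles`, `style_similarity` and `fingerprint`, and the parameters `n` and `k`, are hypothetical.

```python
import hashlib
import re


def style_skeleton(html):
    """Keep only non-alphanumeric characters (assumed proxy for hidden style)."""
    stripped = re.sub(r"[A-Za-z0-9]+", "", html)
    return re.sub(r"\s+", " ", stripped)


def shingles(text, n=4):
    """Return the set of overlapping character n-grams of text."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def style_similarity(html_a, html_b, n=4):
    """Jaccard index between the n-gram sets of the two style skeletons."""
    a = shingles(style_skeleton(html_a), n)
    b = shingles(style_skeleton(html_b), n)
    return len(a & b) / len(a | b)


def fingerprint(html, n=4, k=8):
    """Min-hash signature: for each of k salted hashes, keep the smallest value."""
    grams = shingles(style_skeleton(html), n)
    return tuple(
        min(hashlib.md5((str(salt) + g).encode("utf-8")).hexdigest() for g in grams)
        for salt in range(k)
    )


# Two pages from the same (hypothetical) template, and one from another.
page_a = '<div class="post"><h1>Hello</h1><p>First text.</p></div>'
page_b = '<div class="post"><h1>World</h1><p>Other words.</p></div>'
page_c = '<table><tr><td>Hello</td></tr></table>'

same_template = style_similarity(page_a, page_b)
other_template = style_similarity(page_a, page_c)
```

Under these assumptions, pages that differ only in their visible text share the same skeleton and therefore the same fingerprint, so grouping documents by fingerprint (or by agreeing fingerprint components) would be one way to form candidate clusters without comparing every pair.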
Tanguy Urvoy, Thomas Lavergne, Pascal Filoche
Added 20 Aug 2010
Updated 20 Aug 2010
Type Conference
Year 2006
Where AIRWEB
Authors Tanguy Urvoy, Thomas Lavergne, Pascal Filoche