Hardening Fingerprinting by Context

Download www.ceas.cc

Near-duplicate detection is not only an important pre- and post-processing task in Information Retrieval but also an effective spam-detection technique. Among the different approaches to near-replica detection, methods based on document signatures are particularly attractive because they scale to massive document collections and can handle high throughput rates. Their weakness lies in the potential brittleness of signatures to small changes in content, which makes them vulnerable to various types of noise. In the important spam-filtering application, this vulnerability can also be exploited by dedicated attackers aiming to maximally fragment signatures corresponding to the same email campaign. We focus on the I-Match algorithm and present a method of strengthening it by considering the usage context when deciding which portions of a document should affect signature generation. This substantially (almost 100-fold in some cases) increases the difficulty of dedicated att...
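To make the brittleness concrete, here is a minimal sketch of a basic I-Match-style signature as the algorithm is commonly described: keep only the document's unique terms that appear in a lexicon, sort them, and hash the result. It does not implement the context-sensitive hardening proposed in the paper. The function name `imatch_signature`, the tokenizer, and the tiny hand-picked `LEXICON` are assumptions made for illustration; a real lexicon would be derived from collection statistics.

```python
import hashlib
import re


def imatch_signature(text, lexicon):
    """Return a SHA-1 hex digest over the sorted set of lexicon terms in text.

    Returns None when no lexicon term survives filtering, since an empty
    term set would give every such document the same signature.
    """
    # Crude tokenizer (an assumption): lowercase alphanumeric runs.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    kept = sorted(tokens & lexicon)
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


# Hypothetical lexicon, chosen by hand for this example only.
LEXICON = {"mortgage", "rates", "refinance", "approval", "offer"}

original = "Refinance now: low mortgage rates, instant approval!"
reordered = "Instant approval!!! Low MORTGAGE rates - refinance now"
attacked = original + " special offer"

sig_original = imatch_signature(original, LEXICON)
sig_reordered = imatch_signature(reordered, LEXICON)
sig_attacked = imatch_signature(attacked, LEXICON)

# Reordering, case changes and punctuation do not alter the signature.
print(sig_original == sig_reordered)
# A single appended lexicon term produces an unrelated signature.
print(sig_original == sig_attacked)
```

In this sketch the signature depends only on which lexicon terms are present, so an attacker who appends even one such term to each copy of a message can split a single campaign across many signatures. That is the kind of fragmentation the abstract describes.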

Aleksander Kolcz, Abdur Chowdhury
