Sciweavers

CEAS
2006
Springer

Fast Uncertainty Sampling for Labeling Large E-mail Corpora

14 years 3 months ago
Fast Uncertainty Sampling for Labeling Large E-mail Corpora
One of the biggest challenges in building effective anti-spam solutions is designing systems to defend against the everevolving bag of tricks spammers use to defeat them. Because of this, spam filters that work well today may not work well tomorrow. The adversarial nature of the spam problem makes large, up-to-date, and diverse e-mail corpora critical for the development and evaluation of new anti-spam filtering technologies. Gathering large collections of messages can actually be quite easy, especially in the context of a large, corporate or ISP environment. The challenge is not necessarily in collecting enough mail, however, but in collecting a representative distribution of mail types as seen "in the wild" and in then accurately labeling the hundreds of thousands or millions of accumulated messages as spam or non-spam. In the field of machine learning Uncertainty Sampling is a well-known Active Learning algorithm which uses a collaborative model to minimize the human effo...
Richard Segal, Ted Markowitz, William Arnold
Added 20 Aug 2010
Updated 20 Aug 2010
Type Conference
Year 2006
Where CEAS
Authors Richard Segal, Ted Markowitz, William Arnold
Comments (0)