We present a novel spam detection technique that relies on neither content nor reputation analysis. This work investigates the discriminatory power of email transport-layer characteristics, i.e., the TCP packet stream. From a corpus of messages and their corresponding packets, we extract per-email TCP features. While legitimate mail flows are well-behaved, we observe small congestion windows, frequent retransmissions, packet loss, and large latencies in spam traffic. To learn and exploit these differences, we build "SpamFlow." Using machine learning feature selection, SpamFlow identifies the most discriminative flow properties, thereby adapting to different networks and users. In addition to achieving greater than 90% classification accuracy, SpamFlow correctly identifies 78% of the false negatives from a popular content filter. Because it exploits the need to source large quantities of spam from resource-constrained hosts and networks, SpamFlow is not easily subverted.
Robert Beverly, Karen R. Sollins
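
To make the idea of per-email TCP features concrete, the following is a minimal sketch of how a few flow properties of the kind named in the abstract (retransmissions and latency) might be derived from the data segments of one SMTP connection. It is an illustration under stated assumptions, not SpamFlow's actual feature extractor: the `Segment` record, the `flow_features` function, and the retransmission heuristic are all hypothetical, and the sketch ignores sequence-number wraparound and treats reordered segments as retransmissions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Segment:
    """One TCP segment from the remote mail sender (hypothetical record type)."""
    time: float   # capture timestamp, in seconds
    seq: int      # sequence number of the first payload byte
    length: int   # payload length, in bytes


def flow_features(segments: Iterable[Segment], handshake_rtt: float) -> Dict[str, float]:
    """Summarize one mail flow as a small feature dictionary.

    Assumption: a data segment that ends at or before the highest byte
    already seen is counted as a retransmission. Real traces also contain
    reordering and wraparound, which this sketch does not handle.
    """
    highest_end = None
    data_segments = 0
    retransmissions = 0
    max_gap = 0.0
    prev_time = None

    for seg in sorted(segments, key=lambda s: s.time):
        if seg.length == 0:
            continue  # skip pure ACKs; only payload-carrying segments count
        data_segments += 1
        end = seg.seq + seg.length
        if highest_end is not None and end <= highest_end:
            retransmissions += 1
        else:
            highest_end = end
        if prev_time is not None:
            max_gap = max(max_gap, seg.time - prev_time)
        prev_time = seg.time

    rate = retransmissions / data_segments if data_segments else 0.0
    return {
        "data_segments": float(data_segments),
        "retransmissions": float(retransmissions),
        "retransmission_rate": rate,
        "max_inter_segment_gap": max_gap,
        "handshake_rtt": handshake_rtt,
    }


# Illustrative input only: four data segments, the third repeating the second.
example = [
    Segment(time=0.0, seq=1, length=100),
    Segment(time=0.25, seq=101, length=100),
    Segment(time=0.75, seq=101, length=100),
    Segment(time=1.0, seq=201, length=50),
]
features = flow_features(example, handshake_rtt=0.2)
```

A feature dictionary of this shape could then be passed, one per message, to whatever feature selection and classification method is used; the choice of features here is an assumption made for illustration.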