The standard method for combating spam, either in email or on the web, is to train a classifier on manually labeled instances. As the spammers change their tactics, the performanc...
Deepak Chinavle, Pranam Kolari, Tim Oates, Tim Fin...
The process of extracting useful knowledge from large datasets has become one of the most pressing problems in today’s society. The problem spans entire sectors, from scientists...
In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued s...
Hannaneh Hajishirzi, Wen-tau Yih, Aleksander Kolcz
e-Transformation technologies, for the past few years, have been evolving towards the goal of information integration and system interoperability. While there is no doubt that int...
Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous dat...