Large-scale bot detection for search engines



In this paper, we propose a semi-supervised learning approach for distinguishing program-generated (bot) web search traffic from that of genuine human users. The work is motivated by the challenge that the enormous amount of search data poses to traditional approaches that rely on fully annotated training samples. We propose a semi-supervised framework that addresses the problem on multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs a large set of training samples with initial labels. Directly using these training data is problematic, however, because the data thus sampled are biased. To tackle this problem, we further develop a semi-supervised learning algorithm that takes advantage of the unlabeled data to improve the classification performance. These two proposed algorithms can be seamlessly combined and are very cost-efficient for scaling the training process. In our experiment, the proposed approach showed significant (i.e. 2 : 1) improvement...
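The abstract does not spell out the learning algorithm, so the following is only a minimal sketch of one generic semi-supervised technique, self-training, and not the authors' method. It assumes a toy nearest-centroid classifier and two hypothetical, unscaled features per session (queries per minute, click-through rate); all function names, the margin threshold, and the sample values are illustrative assumptions.

```python
from math import dist


def centroid(points):
    """Mean of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit(labeled):
    """Nearest-centroid model: one centroid per label."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}


def predict(model, x):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    ranked = sorted((dist(x, c), y) for y, c in model.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(seed, unlabeled, min_margin=1.0, max_rounds=10):
    """Self-training: pseudo-label confident unlabeled points, refit, repeat."""
    labeled, pool = list(seed), list(unlabeled)
    model = fit(labeled)
    for _ in range(max_rounds):
        confident, rest = [], []
        for x in pool:
            y, margin = predict(model, x)
            (confident if margin >= min_margin else rest).append((x, y))
        if not confident:
            break  # nothing new was labeled, so the model cannot change
        labeled += confident
        pool = [x for x, _ in rest]
        model = fit(labeled)
    return model


# Hypothetical data: a tiny, biased seed set (only an extreme bot and a
# light human user) plus unlabeled sessions. Values are made up.
seed = [((50.0, 0.0), "bot"), ((1.0, 0.6), "human")]
unlabeled = [(40.0, 0.05), (30.0, 0.1), (2.0, 0.5), (3.0, 0.4), (25.5, 0.1)]

seed_only_model = fit(seed)
model = self_train(seed, unlabeled)
```

In this toy setup the seed-only model and the self-trained model disagree on a moderately busy session such as `(20.0, 0.1)`, which illustrates how unlabeled data can shift a decision boundary learned from a biased sample. A real system would need scaled features and a stronger base classifier.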

Hongwen Kang, Kuansan Wang, David Soukal, Fritz Behr, Zijian Zheng


Internet Technology | Semi-supervised Learning | Semi-supervised Learning Approach | Training Samples | WWW 2010 |


» A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

» Template detection for large scale search engines

» SDD: high performance code clone detection system for large scale source code

» DOCODE-Lite: A Meta-Search Engine for Document Similarity Retrieval

» Experiments in Terabyte Searching, Genomic Retrieval and Novelty Detection for TREC 2004

» Identifying web spam with user behavior analysis

» Detecting Link Hijacking by Web Spammers

» Characterizing typical and atypical user sessions in clickstreams

Post Info

Added	14 May 2010
Updated	14 May 2010
Type	Conference
Year	2010
Where	WWW
Authors	Hongwen Kang, Kuansan Wang, David Soukal, Fritz Behr, Zijian Zheng


Sciweavers
