Fast webpage classification using URL features

16 years 1 days ago

Download www.comp.nus.edu.sg

We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is magnitudes faster than typical web page classification, as the pages themselves do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting binary features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness in binary, multi-class and hierarchical classification. Our results show that, in certain scenarios, URL-based methods approach and sometime exceeds the performance of full-text and link-based methods. We also use these features to predict the prestige of a webpage (as modeled by Pagerank), and show that it can be predicted with an average error of less than one point (on a ten-point scale) in a topical set of web pages. Categories and Subject Descriptors H.3.1 [Information Storage a...

Min-Yen Kan, Hoang Oanh Nguyen Thi

Real-time Traffic