Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted combination of the contents of neighbors to generate a better virtual document for classification. In addition, we break pages into fields, finding that a weighted combination of text from the target and fields of neighboring pages is able to reduce classification error by more than a third. We demonstrate performance on a large dataset of pages from the Open Directory Project and validate the approach using pages from a crawl from the Stanford WebBase. Interestingly, we find no value in anchor text and unexpected value in page titles (and especially titles of parent pages) in the virtual document. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Information filtering; I....
Xiaoguang Qi, Brian D. Davison