Classifiers without borders: incorporating fielded text from neighboring web pages

Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted combination of the contents of neighbors to generate a better virtual document for classification. In addition, we break pages into fields, finding that a weighted combination of text from the target and fields of neighboring pages is able to reduce classification error by more than a third. We demonstrate performance on a large dataset of pages from the Open Directory Project and validate the approach using pages from a crawl from the Stanford WebBase. Interestingly, we find no value in anchor text and unexpected value in page titles (and especially titles of parent pages) in the virtual document.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Information filtering; I....
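The virtual-document idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the field names (`title`, `body`), the neighbor relations (`parent`, `child`, `sibling`), the weight values, and the whitespace tokenizer are all hypothetical. The sketch only shows the general shape of combining weighted term counts from the target page with weighted, fielded term counts from its neighbors.

```python
from collections import Counter

# Hypothetical weights, chosen only for illustration; not values from the paper.
NEIGHBOR_WEIGHTS = {"parent": 0.3, "child": 0.1, "sibling": 0.1}
FIELD_WEIGHTS = {"title": 2.0, "body": 1.0}
TARGET_WEIGHT = 1.0


def term_counts(text):
    """Return lowercase, whitespace-tokenized term frequencies."""
    return Counter(text.lower().split())


def fielded_vector(page):
    """Combine the fields of one page into a single weighted term vector."""
    vec = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term, count in term_counts(page.get(field, "")).items():
            vec[term] += weight * count
    return vec


def virtual_document(target, neighbors):
    """Build a weighted term vector from a target page and its neighbors.

    `neighbors` maps a relation name (e.g. "parent") to a list of pages.
    Each relation's contribution is averaged over its pages, then scaled
    by that relation's weight.
    """
    virtual = Counter()
    for term, value in fielded_vector(target).items():
        virtual[term] += TARGET_WEIGHT * value
    for relation, pages in neighbors.items():
        if not pages:
            continue
        weight = NEIGHBOR_WEIGHTS.get(relation, 0.0)
        for page in pages:
            for term, value in fielded_vector(page).items():
                virtual[term] += weight * value / len(pages)
    return dict(virtual)


if __name__ == "__main__":
    target = {"title": "Jaguar", "body": "fast cat"}
    neighbors = {"parent": [{"title": "Big cats", "body": "lion tiger"}]}
    print(virtual_document(target, neighbors))
```

The resulting dictionary could be passed to any text classifier that accepts term-weight features. In practice the weights would presumably be tuned on held-out data rather than fixed by hand as they are here.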

Xiaoguang Qi, Brian D. Davison

Information Technology | Page | SIGIR 2008 | Virtual Document | Weighted Combination |

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Conference
Year	2008
Where	SIGIR
Authors	Xiaoguang Qi, Brian D. Davison
