An N-Gram Based Approach to Automatically Identifying Web Page Genre



The research reported in this paper is the first phase of a larger project on the automatic classification of web pages by genre, using n-gram representations of the web pages. In this study, the textual content of web pages is used to create feature sets consisting of the most frequent n-grams and their associated frequencies. We present three methods, each of which uses a distance measure to determine the dissimilarity between two feature sets. Each method forms a feature set for every web page in the test set; however, the formation of feature sets from the training set differs between methods: we experiment with one feature set per web page, one feature set per genre, and a combination of genre-based feature sets supplemented by subgenre feature sets. We present results for a balanced corpus of seven genres (blog, eshop, FAQs, front page, listing, home page, and search page). Initial results are encouraging.
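The general idea of the abstract (profiles of the most frequent n-grams, compared with a distance measure) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say whether the n-grams are character or word n-grams, which distance measure is used, or what values of n and profile size were chosen. The sketch assumes character n-grams, a relative-difference dissimilarity of the kind used in n-gram profile classification, and one feature set per genre; the function names `ngram_profile`, `profile_distance`, and `classify` and the defaults `n=3` and `top_k=500` are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=500):
    """Build a feature set: the top_k most frequent character n-grams,
    mapped to their normalized frequencies (assumed representation)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(top_k)}


def profile_distance(p1, p2):
    """Assumed dissimilarity: sum over all n-grams in either profile of
    (2 * (f1 - f2) / (f1 + f2)) ** 2. Identical profiles give 0."""
    dist = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        dist += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return dist


def classify(page_text, genre_profiles, n=3, top_k=500):
    """Assign the genre whose profile is least dissimilar to the page's."""
    page = ngram_profile(page_text, n, top_k)
    return min(genre_profiles,
               key=lambda genre: profile_distance(page, genre_profiles[genre]))


# Toy training texts, invented purely for illustration.
faq_text = "what is this? how do i reset my password? what is that?"
eshop_text = "add to cart. buy now. price and checkout. view cart."
genre_profiles = {
    "FAQs": ngram_profile(faq_text),
    "eshop": ngram_profile(eshop_text),
}
predicted = classify(faq_text, genre_profiles)
```

In this per-genre variant, each genre profile would be built from the concatenated text of that genre's training pages. The per-page variant mentioned in the abstract would instead keep one profile per training page and take the genre of the nearest page.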

Jane E. Mason, Michael A. Shepherd, Jack Duffy


Biometrics | Feature Sets | Genre-based Feature Sets | HICSS 2009 | System Sciences | Web Page


Post Info

Added	19 May 2010
Updated	19 May 2010
Type	Conference
Year	2009
Where	HICSS
Authors	Jane E. Mason, Michael A. Shepherd, Jack Duffy
