The research reported in this paper is the first phase of a larger project on the automatic classification of web pages by genre, using n-gram representations of the web pages. In this study, the textual content of each web page is used to create a feature set consisting of its most frequent n-grams and their associated frequencies. We present three methods, each of which uses a distance measure to determine the dissimilarity between two feature sets. Every method forms one feature set for each web page in the test set; the methods differ in how feature sets are formed from the training set: one feature set per web page, one feature set per genre, or a combination of genre-based feature sets supplemented by subgenre feature sets. We present results for a balanced corpus of seven genres (blog, eshop, FAQs, front page, listing, home page, and search page). Initial results are encouraging.
Jane E. Mason, Michael A. Shepherd, Jack Duffy
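
The general idea summarized in the abstract, building a feature set of the most frequent n-grams and comparing two feature sets with a distance measure, can be illustrated with a minimal sketch. Several details here are assumptions rather than the paper's stated choices: the abstract does not say whether n-grams are taken over characters or words (character n-grams are assumed), it does not name the distance measure (a squared relative-difference measure is used as a stand-in), and the function names `ngram_profile`, `dissimilarity`, and `classify`, together with the defaults `n=3` and `top_k=500`, are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=500):
    """Build a feature set: the top_k most frequent character n-grams
    mapped to their normalized frequencies (assumed representation)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.most_common(top_k)}


def dissimilarity(profile_a, profile_b):
    """Sum of squared relative frequency differences over the union of
    n-grams. An n-gram missing from a profile has frequency 0.
    This measure is an assumed stand-in, not taken from the abstract."""
    total = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        # fa + fb > 0 because the n-gram occurs in at least one profile.
        total += ((fa - fb) / ((fa + fb) / 2)) ** 2
    return total


def classify(page_text, genre_profiles, n=3, top_k=500):
    """Assign the genre whose feature set is least dissimilar to the page's."""
    page_profile = ngram_profile(page_text, n, top_k)
    return min(
        genre_profiles,
        key=lambda genre: dissimilarity(page_profile, genre_profiles[genre]),
    )
```

In this sketch, a per-genre feature set could be formed by passing the concatenated text of a genre's training pages to `ngram_profile`, while a per-page variant would keep one profile per training page and label the test page with the genre of its nearest profile. Both are illustrative readings of the training-set variations named in the abstract, not a reproduction of the authors' procedure.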