
ICTAI
2007
IEEE

Webpage Genre Identification Using Variable-Length Character n-Grams

An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers). Webpage genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages, including word and HTML-tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve the best reported results in both cases.
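To make the representation described in the abstract concrete, here is a minimal sketch of counting character n-grams of several lengths in a page's text and pairing them with HTML-tag counts. It is not the authors' method: the abstract does not say which existing approach builds the variable-length feature sets, so this sketch simply counts every n-gram of each length in an assumed range (3 to 5). The names `char_ngrams` and `page_features` and the `ng:`/`tag:` key prefixes are hypothetical, and the choice of the most frequent tags across a corpus is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Collects opening-tag counts and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        self.tags[tag] += 1

    def handle_data(self, data):
        # Note: script and style contents are not filtered out here.
        self.text_parts.append(data)


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram with n_min <= n <= n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def page_features(html, n_min=3, n_max=5):
    """Combine n-gram counts and tag counts into one feature dictionary."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    # Join text fragments with spaces and collapse runs of whitespace.
    text = " ".join(" ".join(parser.text_parts).split())
    features = {"ng:" + g: c for g, c in char_ngrams(text, n_min, n_max).items()}
    for tag, c in parser.tags.items():
        features["tag:" + tag] = c
    return features
```

Calling `page_features(html_string)` on each page gives sparse dictionaries that could be fed to a classifier after a feature-selection step. That step is where a variable-length approach would decide which n-grams to keep.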
Ioannis Kanaris, Efstathios Stamatatos
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where ICTAI
Authors Ioannis Kanaris, Efstathios Stamatatos