Webpage Genre Identification Using Variable-Length Character n-Grams


Download www.icsd.aegean.gr

An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers, etc.). Webpage genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages, including word and HTML tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve on the best reported results in both cases.
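The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the approach used to select variable-length n-grams, so the sketch simply counts all character n-grams in an assumed length range (3 to 5) and keeps the most frequent ones. The function names (`char_ngrams`, `html_tag_counts`, `page_features`), the length range, the cut-off values, and the regex-based tag handling are all illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: a simple regex is enough to spot opening tags in this sketch.
# It skips closing tags and comments; a real system would use an HTML parser.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def html_tag_counts(html):
    """Count opening HTML tags by lowercase tag name."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def strip_tags(html):
    """Replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]*>", " ", html)


def page_features(html, top_ngrams=5, top_tags=3):
    """Combine the most frequent n-grams and tags into one feature dict.

    Frequency-based selection is a stand-in here; the paper relies on an
    existing selection approach that the abstract does not specify.
    """
    ngrams = char_ngrams(strip_tags(html))
    tags = html_tag_counts(html)
    features = {f"ng:{g}": c for g, c in ngrams.most_common(top_ngrams)}
    features.update({f"tag:{t}": c for t, c in tags.most_common(top_tags)})
    return features
```

Calling `page_features(html)` on a page's raw HTML returns a dictionary whose `ng:` keys hold character n-gram counts and whose `tag:` keys hold tag counts; such a dictionary could then be vectorized and passed to any standard classifier.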

Ioannis Kanaris, Efstathios Stamatatos


Artificial Intelligence | ICTAI 2007 | Traditional Topic-based Queries | Webpage | Webpage Genre Identification


Post Info

Added	03 Jun 2010
Updated	03 Jun 2010
Type	Conference
Year	2007
Where	ICTAI
Authors	Ioannis Kanaris, Efstathios Stamatatos
