DOCENG
2007
ACM

Elimination of junk document surrogate candidates through pattern recognition

A surrogate is an object that stands for a document and enables navigation to that document. Hypermedia is often represented with textual surrogates, even though studies have shown that image and text surrogates facilitate the formation of mental models and overall understanding. Surrogates may be formed by breaking a document down into a set of smaller elements, each of which is a surrogate candidate. While processing these surrogate candidates from an HTML document, relevant information may appear together with less useful junk material, such as navigation bars and advertisements. This paper develops a pattern recognition based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements, and uses classification algorithms to make selection decisions based on these features. For the purpose of defining features in surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined Document Object M...
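
The feature-then-classify idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Candidate` type, the two features (link density and a scaled word count), and the nearest-centroid classifier are hypothetical stand-ins for whatever features and classification algorithms the authors actually evaluated.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Candidate:
    """Hypothetical surrogate candidate taken from an HTML element."""
    text: str        # visible text of the element
    link_chars: int  # number of text characters that sit inside hyperlinks


def features(c: Candidate) -> tuple:
    """Assumed feature vector: (link density, word count scaled to 0..1)."""
    n_chars = max(len(c.text), 1)
    link_density = min(c.link_chars / n_chars, 1.0)
    word_score = min(len(c.text.split()) / 50.0, 1.0)
    return (link_density, word_score)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def train(labeled):
    """labeled: list of (Candidate, is_junk) pairs. Returns one centroid per class."""
    junk = [features(c) for c, junk_flag in labeled if junk_flag]
    keep = [features(c) for c, junk_flag in labeled if not junk_flag]
    return {"junk": centroid(junk), "keep": centroid(keep)}


def is_junk(model, c: Candidate) -> bool:
    """Nearest-centroid decision: junk if closer to the junk centroid."""
    f = features(c)
    return dist(f, model["junk"]) < dist(f, model["keep"])


def select_candidates(model, candidates):
    """Keep only the candidates that are not classified as junk."""
    return [c for c in candidates if not is_junk(model, c)]


# Tiny hand-labeled training set (illustrative data, not from the paper).
TRAINING = [
    (Candidate("Home | About | Contact | Login", 22), True),
    (Candidate("Buy now! Click here", 19), True),
    (Candidate("A surrogate is an object that stands for a document "
               "and enables navigation to that document.", 0), False),
    (Candidate("Surrogates may be formed by breaking a document down "
               "into a set of smaller elements.", 9), False),
]

model = train(TRAINING)
```

Calling `select_candidates(model, elements)` on a list of new `Candidate` objects would return only those closer to the "keep" centroid. A nearest-centroid rule is used here only because it is short and has no dependencies; any classifier that maps a feature vector to a keep/junk decision would fit the same structure.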
Added: 14 Aug 2010
Updated: 14 Aug 2010
Type: Conference
Year: 2007
Where: DOCENG
Authors: Eunyee Koh, Daniel Caruso, Andruid Kerne, Ricardo Gutierrez-Osuna