We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amo...
Acquiring knowledge from the Web to build domain ontologies has become a common practice in the Ontological Engineering field. The vast amount of freely available information allo...
Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A mul...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
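The abstract above is cut off before it explains the computation, so the following is only a minimal sketch of what a per-line tag ratio might look like. It assumes the ratio is the number of non-tag characters on a line divided by the number of HTML tags on that line, with a tag count of zero treated as one. The names `tag_ratio` and `tag_ratios` and the regex-based tag matching are illustrative assumptions, not taken from the source.

```python
import re

# Assumption: a "tag" is anything between '<' and '>'. A real HTML parser
# would be more robust; this regex is only for illustration.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(line: str) -> float:
    """Return non-tag character count divided by tag count for one line.

    Assumed definition: a line with no tags is divided by 1, so its
    ratio equals its text length.
    """
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    return text_length / max(tag_count, 1)


def tag_ratios(html: str) -> list[float]:
    """Compute the ratio for every non-blank line of an HTML document."""
    return [tag_ratio(line) for line in html.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = (
        '<div><a href="/">Home</a></div>\n'
        "<p>This is a long sentence of article text.</p>"
    )
    # Navigation-like lines tend to score low, text-heavy lines high.
    print(tag_ratios(sample))
```

Under these assumptions, lines dominated by markup (menus, link lists) yield small ratios and lines dominated by running text yield large ones, which is the signal a content extractor could then threshold or cluster.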
The emergence of social web services such as YouTube[1] and Flickr[2] is constantly transforming the way we share our lifestyles with family, friends and colleagues. The significance of...
Simo Hosio, Fahim Kawsar, Jukka Riekki, Tatsuo Nak...