Querying data from presentation formats like HTML, for purposes such as information extraction, requires the consideration of tree structures as well as the consideration of spati...
The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth experienced by the Net has gener...
Abstract. Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can be easily adapted ...
Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A mul...
This paper presents a case-study of automatic construction of a hypertext from a large full-text document. The document we used as input of the automatic authoring process is a we...