A web-portal providing access to over 250.000 scanned and OCRed cultural heritage documents is analyzed. The collection consists of the complete Dutch Hansard from 1917 to 1995. E...
Indexing and retrieval of speech content in various forms such as broadcast news, customer care data and on-line media has gained a lot of interest for a wide range of application...
Dogan Can, Erica Cooper, Arnab Ghoshal, Martin Jan...
The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many system...
A tremendous amount of semi-structured data is available today on the web but is not necessarily in a form which is suitable for a user's tasks. For example, a website may sh...
We introduce a unified graph representation of the Web, which includes both structural and usage information. We model this graph using a simple union of the Web's hyperlink ...
Barbara Poblete, Carlos Castillo, Aristides Gionis