The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from...
Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wa...
Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, ...
The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many system...
With the flourishing of the Web, online reviews are becoming an increasingly useful and important information resource for people. As a result, automatic review mining and summarizing ...
Web pages (and resources, in general) can be characterized according to their geographical locality. For example, a web page with general information about wildflowers could be c...
Luis Gravano, Vasileios Hatzivassiloglou, Richard ...