The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from...
Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wa...
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...
In this paper we consider the competition on the Internet between information providers to maximise their exposure to a relevant audience. Spammers and search engine gamers adopt a...
Three join algorithms are evaluated in an environment with distributed main-memory based mediators and data sources. A streamed ship-out join ships bulks of tuples to a mediator ne...
Today’s Web browsers allow users to open links in new windows or tabs. This action, which we call ‘branching’, is sometimes performed on search results when the user plans t...