XML is fast becoming the standard format to store, exchange and publish over the web, and is getting embedded in applications. Two challenges in handling XML are its size (the XML...
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini...
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis ove...
Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchron...
Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey...
The World Wide Web is emerging not only as an infrastructure for data, but also for a broader variety of resources that are increasingly being made available as Web services. Rele...
Abhijit A. Patil, Swapna A. Oundhakar, Amit P. She...
The World Wide Web is growing at such a pace that even the biggest centralized search engines are able to index only a small part of the available documents on the Internet. The d...