The success of many innovative Web applications is not based on the content they produce, but on how they combine and link existing content. Older Web Engineering methods lack fl...
In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued s...
Hannaneh Hajishirzi, Wen-tau Yih, Aleksander Kolcz
In this article, we describe the XML storage system used in the WebContent project. We begin by advocating the use of an XML database in order to store WebContent documents, and w...
The phenomenal growth of the world-wide web has made it the most popular Internet application today. Web caching and content distribution services have been recognized as valuable...
Chengdu Huang, Seejo Sebastine, Tarek F. Abdelzahe...
Recently, along with the rapid growth of the Web, preservation efforts have also increased. As a consequence, large amounts of past Web data are stored in Web archives. This h...