This work proposes a technique for the reliable identification of very similar filled-in forms, with a reject option. The method is based on the automatic detecti...
The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, not taking into acco...
Maria Halkidi, Benjamin Nguyen, Iraklis Varlamis, ...
In large-scale distributed environments such as the Internet, users are interested in monitoring changes to a particular web page (XML or HTML). There are many instanc...
This paper addresses the issue of Web document summarization. As the textual content of Web documents is often scarce or irrelevant and existing summarization techniques are based on ...
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an init...
Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe ...