This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algo...
Most prior work on information extraction has focused on extracting information from text in digital documents. Often, however, the most important information being reported in an...
In many information systems, the databases that make up the system are distributed across different modules or branch offices according to the requirements of the business enterprise. ...
We present an evolutionary clustering method which can be applied to multi-relational knowledge bases storing resource annotations expressed in the standard languages for the Sema...
We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database...