Broder et al.’s [3] shingling algorithm and Charikar’s [4] random projection based approach are considered “state-of-the-art” algorithms for finding near-duplicate web pag...
Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. A...
David D. Lewis, Gady Agam, Shlomo Argamon, Ophir F...
In this work, we study similarity measures for text-centric XML documents based on an extended vector space model, which considers both document content and structure. Experimenta...
We introduce a novel approach to combining rankings from multiple retrieval systems. We use a logistic regression model or an SVM to learn a ranking from pairwise document prefere...
The primary aim of XML element retrieval is to return to users XML elements, rather than whole documents. This poster describes a small study, in which we elicited users’ expect...