Empirical modeling of the score distributions associated with retrieved documents is an essential task for many retrieval applications. In this work, we propose modeling the releva...
The discipline of narratology has long recognized the need to classify documents as instances of different text types. We have discovered that classification is as applicable to h...
This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of “more like this” queries in the life sciences domain. ...
We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the ...
In this paper, we discuss how the focus in document analysis, generally speaking, and in graphics recognition more specifically, has moved from re-engineering problems to indexin...