We describe our participation in the TREC 2004 Web and Terabyte tracks. For the web track, we employ mixture language models based on document full-text, incoming anchortext, and...
Authentication of digital documents is an important concern as digital documents are replacing the traditional paper-based documents for official and legal purposes. This is espec...
We present in this paper a transformation model for structured documents. TransM is a new model that deals with specified documents, where the structure conforms to a predefined ...
Nouhad Amaneddine, Jean Paul Bahsoun, Jean-Paul Bo...
Nowadays people have to deal with an increasing amount of information contained in electronic documents available from numerous heterogeneous, widely distributed sources. Keeping ...
Our research focuses on identifying and representing the semantic structures of multimedia documents. The semantic structure of a multimedia document ...
Entity annotation involves attaching a label such as 'name' or 'organization' to a sequence of tokens in a document. All the current rule-based and machine learning-based...
Much of the data on the Web takes the form of XML documents. An XML document is an unranked labelled tree. A schema for XML documents (for instance a DTD) is the specification of their internal structu...
This paper reports a statistical identification technique that differentiates scripts and languages in degraded and distorted document images. We identify scripts and languages th...
Results of queries by personal names often contain documents related to several people because of the namesake problem. In order to differentiate documents related to different...
The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the ...