The goal of the DARPA MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Program is to automatically convert foreign language text images into Englis...
Most current sentence alignment approaches adopt sentence length and cognate as the alignment features; and they are mostly trained and tested in the documents with the same style...
This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree align...
An ad hoc data format is any non-standard, semi-structured data format for which robust data processing tools are not available. In this paper, we present ANNE, a new kind of mark...
Dotplot is a technique for visualizing patterns of string matches in millions of lines of text and code. Patterns may be explored interactively or detected automatically. Applicat...