This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree align...
We present a cut and paste based text summarizer, which uses operations derived from an analhuman written abstracts. The summarizer edits extracted sentences, using reduction to r...
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, ...
Most current sentence alignment approaches adopt sentence length and cognate as the alignment features; and they are mostly trained and tested in the documents with the same style...
An overwhelming number of legal documents is available in digital form. However, most of the texts are usually only provided in a semi-structured form, i.e. the documents are stru...