The number of documents published directly by end users is increasing along with the growth of Web 2.0. Such documents often contain spoken-style expressions, which are difficult...
We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the stru...
Annotating the regions, text lines, and characters of document images is an important but tedious and expensive task. A ground-truthing tool may largely alleviate the human burden...
Accumulating spoken documents on the web is highly important in the knowledge society. However, because of the high redundancy of spontaneous speech, the transcribed text in i...
An approach is presented to guide the benchmarking of invoice analysis systems, a specific, applied subclass of document analysis systems. The state of the art of benchma...