Motivated by the real-world application of categorizing system log messages into defined situation categories, this paper describes an interactive text categorization method, PICC...
Current approaches to script identification rely on hand-selected features and often require processing a significant part of the document to achieve reliable identification. We p...
For tasks that arise from information overload on the Internet, such as classifying Web pages and filtering email spam, a new technique called co-training has been shown...
Co-training is a semi-supervised technique that allows classifiers to learn with fewer labelled documents by taking advantage of the more abundant unclassified documents. However, ...
This paper presents a new representation and evaluation procedure for page segmentation algorithms and uses it to analyze six widely used layout analysis algorithms. The ...