We investigate the problem of evaluating the performance of text processing algorithms on inputs that contain errors as a result of optical character recognition. A new hierarchical paradigm is proposed based on approximate string matching, allowing each stage in the processing pipeline to be tested, the error effects analyzed, and possible solutions suggested.

Categories and Subject Descriptors
I.7.5 [Document and Text Processing]: Document Capture—document analysis

General Terms
algorithms, measurement, performance

Keywords
performance evaluation, optical character recognition, sentence boundary detection, tokenization, part-of-speech tagging
Daniel P. Lopresti