We describe a methodology for retrieving document images from large extremely diverse collections. First we perform content extraction, that is the location and measurement of regions containing handwriting, machineprinted text, photographs, blank space, etc, in documents represented as bilevel, greylevel, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (of, e.g., 80
Michael A. Moll, Henry S. Baird