In this paper, we report on our experience creating an automated, human-assisted process to extract metadata from documents in a large (>100,000 documents), dynamically growing collection. Such a collection can be expected to be heterogeneous in two ways: statically (it contains documents in a variety of formats) and dynamically (it is likely to acquire new documents in formats unlike any prior acquisitions). Eventually, we hope to fully automate metadata extraction for 80% of the documents and to reduce by 80% the time needed to generate metadata for the remaining documents. We describe a process that first classifies documents into equivalence classes and then applies a rule-based approach to extract metadata from each class. Our rule-based approach differs from others in that it separates the rule-interpreting engine from a template of rules. The templates vary among classes, but the engine is the same. We have evaluated our approa...
Jianfeng Tang, Kurt Maly, Steven J. Zeil, Mohammad