Digital libraries will hold huge amounts of text and other forms of information. For the collections to be maximally useful, they must be highly organized, with useful indexes and intra- and inter-document linkages. This creates a demand for ever-better methods for the automated analysis of text to build those indexes and links, which requires turning implicit information, "encrypted in natural language," into explicit information. We discuss approaches to this automation task built on the techniques of corpus linguistics. This paper focuses on word classification as an example of the utility of corpus methods. Results are presented for the syntactic and semantic classification of words from a biological corpus. The word classes identified can then be used for indexing, query expansion, and syntactic analysis, and for linking separate library collections by aligning word senses. The paper also discusses derivative objects, diagram analysis, and authoring tools. Finally, we outline a new ...
Robert P. Futrelle, Xiaolan Zhang, Yumiko Sek