This paper describes an approach to providing lexical information for natural language processing in unrestricted domains. A system of approximately 1200 morphological rules is used to extend a core lexicon of 39,000 words to provide lexical coverage that exceeds that of a lexicon of 80,000 words or 150,000 word forms. The morphological system is described, and lexical coverage is evaluated for random words chosen from a previously unanalyzed corpus.

1 Motivation

Many applications of natural language processing have a need for a large vocabulary lexicon. However, no matter how large a lexicon one starts with, most applications will encounter terms that are not covered. This paper describes an approach to the lexicon problem that emphasizes recognition of morphological structure in unknown words in order to extend a relatively small core lexicon to allow robust natural language processing in unrestricted domains. This technique, which extends functionality originally developed for the ...
William A. Woods