This research explores the interaction of textual and photographic information in document understanding. The problem of performing general-purpose vision without a priori knowledge is very difficult at best. The use of collateral information in scene understanding has been explored in computer vision systems that use general scene context in the task of object identification. The work described here extends this notion by defining visual semantics, namely, techniques for systematically extracting picture-specific information from text accompanying a photograph. Specifically, this paper discusses the multi-stage processing of textual captions with the following objectives: (i) predicting which objects (implicitly or explicitly mentioned in the caption) are present in the picture, and (ii) generating several types of constraints useful in locating/identifying these objects. The implementation and use of a lexicon specifically designed for the integration of linguistic and visual information is ...
Rohini K. Srihari, Debra T. Burhans
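To make objectives (i) and (ii) concrete, the sketch below illustrates, under simplifying assumptions, how a caption might be mapped to a set of hypothesized picture objects and locative constraints. It is not the authors' system: the `VISUAL_LEXICON` and `SPATIAL_RELATIONS` tables and the keyword/preposition heuristic are hypothetical stand-ins for the paper's multi-stage caption processing and visual lexicon.

```python
import re
from dataclasses import dataclass, field

# Hypothetical mini-lexicon mapping caption words to visually detectable object types
# (a stand-in for the paper's lexicon integrating linguistic and visual information).
VISUAL_LEXICON = {
    "president": "person",
    "wife": "person",
    "man": "person",
    "woman": "person",
    "children": "person",
    "dog": "animal",
    "house": "building",
}

# Hypothetical spatial relations signalled by prepositional phrases in the caption.
SPATIAL_RELATIONS = {
    "left of": "left-of",
    "right of": "right-of",
    "above": "above",
    "below": "below",
    "behind": "behind",
}

@dataclass
class CaptionHypothesis:
    objects: list = field(default_factory=list)      # objects predicted to appear in the picture
    constraints: list = field(default_factory=list)  # relations constraining their locations

def interpret_caption(caption: str) -> CaptionHypothesis:
    """Predict which objects are present and generate locative constraints."""
    text = caption.lower()
    hyp = CaptionHypothesis()
    # Objective (i): predict objects explicitly mentioned in the caption.
    for word in re.findall(r"[a-z]+", text):
        if word in VISUAL_LEXICON:
            hyp.objects.append((word, VISUAL_LEXICON[word]))
    # Objective (ii): generate constraints useful in locating/identifying those objects.
    for phrase, relation in SPATIAL_RELATIONS.items():
        if phrase in text:
            hyp.constraints.append(relation)
    return hyp

if __name__ == "__main__":
    h = interpret_caption("The president, left of his wife, waves to the children.")
    print(h.objects)      # [('president', 'person'), ('wife', 'person'), ('children', 'person')]
    print(h.constraints)  # ['left-of']
```

In the paper's setting, the predicted objects and constraints would then guide a vision module in locating and identifying the corresponding regions of the photograph; here they are simply returned as data.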