Robust joint visual attention is necessary for achieving a common frame of reference between humans and robots that interact multimodally in order to work together on real-world spatial tasks involving objects. We present a comprehensive examination of one component of this process that is otherwise often implemented in an ad hoc fashion: the ability to correctly determine the object referent from deictic reference, including pointing gestures and speech. We develop a modular spatial reasoning framework based on the decomposition and resynthesis of speech and gesture into a language of pointing and object labeling. The framework supports multimodal and unimodal access in both real-world and mixed-reality workspaces, accounts for the need to discriminate and sequence identical and proximate objects, assists in overcoming the inherent precision limitations of deictic gesture, and assists in the extraction of those gestures. We further discuss an implementation of the framework that has been deployed on two h...
Andrew G. Brooks, Cynthia Breazeal