Identifying objects that are referred to verbally and non-verbally is an important aspect of human-robot interaction. Most importantly, it is essential for achieving a joint focus of attention and, thus, natural interaction behavior. In this contribution, we introduce a saliency-based model that reflects how multi-modal referring acts influence visual search, i.e., the task of finding a specific object in a scene. To this end, we combine positional information obtained from pointing gestures with contextual knowledge about the visual appearance of the referred-to object obtained from language. The available information is then integrated into a biologically motivated saliency model that forms the basis for visual search. We demonstrate the feasibility of the proposed approach by presenting the results of an experimental evaluation.
Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding-Perceptual reasoning; I.4.8 [Image Processing and Computer Vision]: Scene ...
Boris Schauerte, Gernot A. Fink