A number of recent studies have demonstrated that groups benefit considerably from access to shared visual information. This is due, in part, to the communicative efficiencies provided by the shared visual context. However, a large gap exists between our current theoretical understanding and our existing models. We address this gap by developing a computational model that integrates linguistic cues with visual cues in a way that effectively models reference during tightly-coupled, task-oriented interactions. The results demonstrate that an integrated model significantly outperforms existing language-only and visual-only models. The findings can be used to inform and augment the development of conversational agents, applications that dynamically track discourse and collaborative interactions, and dialogue managers for natural language interfaces.

Author Keywords
Shared visual information, multimodal interaction, language use, discourse, communication, and modeling.

ACM Classification Keywords