We investigate the impact of listener gaze on predicting reference resolution in situated interactions. We extend an existing probabilistic model that predicts which entity in the environment a listener will resolve a referring expression (RE) to, and which relies on a basic set of features monitoring the listener's movements in a virtual environment. Our extended model adds features that capture the listener's visual behavior: which objects were looked at and for how long. Gaze proves especially beneficial in complex referential scenes, where objects near the target are also plausible referents, as it helps decipher the listener's intention. We evaluate performance at several prediction times before the listener performs an action and obtain a highly significant gain in accuracy.
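As a rough illustration only (this is not the paper's actual model or feature set), the following Python sketch shows how per-object gaze features, such as the proportion of fixation time and the fixation count up to a given prediction time, could be computed and combined in a simple log-linear scorer over candidate referents. The `Fixation` format, the feature names, and the weights are all assumptions made for illustration.

```python
# A minimal, hypothetical sketch of gaze features for reference resolution.
# Not the authors' implementation: the fixation format, feature names, and
# the log-linear scorer are illustrative assumptions.
from dataclasses import dataclass
from collections import defaultdict
import math

@dataclass
class Fixation:
    obj: str       # id of the object being looked at
    start: float   # seconds from RE onset
    end: float

def gaze_features(fixations, t):
    """Per-object fixation duration (as a proportion) and count up to time t."""
    dur = defaultdict(float)
    cnt = defaultdict(int)
    for f in fixations:
        if f.start >= t:
            continue                          # fixation begins after prediction time
        dur[f.obj] += min(f.end, t) - f.start # clip fixations that straddle t
        cnt[f.obj] += 1
    total = sum(dur.values()) or 1.0
    return {o: {"fix_prop": dur[o] / total, "fix_count": cnt[o]} for o in dur}

def resolve(candidates, fixations, t, weights):
    """Score each candidate referent with a simple log-linear model and
    return a probability distribution over the candidates."""
    feats = gaze_features(fixations, t)
    scores = {}
    for obj in candidates:
        f = feats.get(obj, {"fix_prop": 0.0, "fix_count": 0})
        scores[obj] = sum(weights[k] * v for k, v in f.items())
    z = sum(math.exp(s) for s in scores.values())
    return {o: math.exp(s) / z for o, s in scores.items()}

# Example: predict 1.5 s after RE onset with hand-set (hypothetical) weights.
fixes = [Fixation("lamp", 0.2, 0.9), Fixation("vase", 1.0, 1.4)]
print(resolve(["lamp", "vase", "chair"], fixes, t=1.5,
              weights={"fix_prop": 2.0, "fix_count": 0.5}))
```

In this toy setup, varying `t` mirrors the evaluation at several prediction times before the listener acts: earlier cutoffs see less gaze evidence, so the distribution over candidates is flatter.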