Under natural viewing conditions, human observers use shifts in gaze to allocate processing resources to subsets of the visual input. Many computational models attempt to predict these shifts of gaze and attention. Although the important role of high-level stimulus properties (e.g., semantic information) stands undisputed, most models are based solely on low-level image properties. We here demonstrate that a combined model of high-level object detection and low-level saliency significantly outperforms a low-level saliency model in predicting the locations humans fixate. The data are based on eye-movement recordings of humans observing photographs of natural scenes, which contained one of the following high-level stimuli: faces, text, scrambled text, or cell phones. We show that observers, even when not instructed to look for anything in particular, fixate on a face with a probability of over 80% within their first two fixations, and on text and scrambled text with a probabili...
Moran Cerf, E. Paxon Frady, Christof Koch
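
To make the combined model concrete, the sketch below shows one plausible way to fuse a low-level saliency channel with a high-level face-detection channel into a single fixation-prediction map. It is an illustration under stated assumptions, not the authors' implementation: the center-surround contrast here is a crude stand-in for a full Itti-Koch-style saliency model, the Viola-Jones cascade shipped with OpenCV stands in for whatever detector the paper used, and the function names and the weight w_face are hypothetical.

import cv2
import numpy as np

def lowlevel_saliency(gray):
    # Crude center-surround contrast as a stand-in for a full
    # Itti-Koch saliency model (an assumption, not the paper's code):
    # difference between fine- and coarse-scale Gaussian blurs.
    img = gray.astype(np.float32)
    fine = cv2.GaussianBlur(img, (0, 0), 2)
    coarse = cv2.GaussianBlur(img, (0, 0), 16)
    sal = np.abs(fine - coarse)
    return sal / (sal.max() + 1e-8)

def face_channel(gray):
    # High-level channel: Viola-Jones face detector bundled with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    chan = np.zeros(gray.shape, dtype=np.float32)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 4):
        chan[y:y + h, x:x + w] = 1.0
    # Blur detections so the map is graded rather than binary.
    chan = cv2.GaussianBlur(chan, (0, 0), 8)
    return chan / (chan.max() + 1e-8) if chan.max() > 0 else chan

def combined_map(image_bgr, w_face=0.5):
    # Weighted sum of the two channels; in practice w_face would be
    # fit to the fixation data rather than fixed by hand.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return (1 - w_face) * lowlevel_saliency(gray) + w_face * face_channel(gray)

Model comparison then reduces to asking whether observed fixation locations score higher under combined_map than under lowlevel_saliency alone, e.g., via an ROC analysis over the map values at fixated versus non-fixated pixels.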