Input multimodality combining speech and hand gestures has motivated numerous usability studies. In contrast, issues relating to the design and ergonomic evaluation of multimodal output messages combining speech with visual modalities have not yet been addressed extensively. The experimental study presented here addresses one of these issues. Its aim is to assess the actual efficiency and usability of oral system messages that include brief spatial information to help users locate objects on crowded displays rapidly and without effort. Target presentation mode, scene spatial structure and task difficulty were chosen as independent variables. Two conditions were defined: the visual target presentation mode (VP condition) and the multimodal target presentation mode (MP condition). Each participant carried out two blocks of visual search tasks (120 tasks per block, one block per condition). Target presentation mode, scene structure and task difficulty were found to b...