We describe an algorithm for generating multimodal referring expressions, based on empirical data. The main novelties are (1) a decision to point that weighs the efficiency of pointing (Fitts' law) against the inefficiency of a full linguistic description, (2) explicit tracking of the 'focus of attention', and (3) a three-dimensional notion of salience incorporating linguistic salience, focus salience and inherent salience.
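
As a rough illustration of novelty (1), the decision to point can be read as a comparison of two costs. The sketch below is not the algorithm itself: it assumes the Shannon formulation of Fitts' index of difficulty, ID = log2(D/W + 1), and a hypothetical linguistic cost equal to the number of properties a full description would need. The function names, the cost measure and the direct comparison of the two costs are all illustrative assumptions.

```python
import math


def fitts_index_of_difficulty(distance, width):
    """Fitts' index of difficulty in bits (Shannon formulation).

    distance: distance from the pointing device to the target
    width:    size of the target along the axis of movement
    """
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    return math.log2(distance / width + 1)


def linguistic_cost(properties):
    """Hypothetical cost of a full description: one unit per property."""
    return len(properties)


def should_point(distance, width, properties):
    """Point when pointing is cheaper than describing (assumed rule)."""
    pointing_cost = fitts_index_of_difficulty(distance, width)
    return pointing_cost < linguistic_cost(properties)


# A nearby, large target that needs three properties to describe: point.
near_decision = should_point(30, 10, ["large", "red", "leftmost"])

# A distant, small target that one property identifies: describe instead.
far_decision = should_point(150, 10, ["red"])
```

Under these assumptions, pointing is preferred for targets that are close or large relative to the length of the description needed to single them out, which matches the trade-off stated above.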