The use of video and audio features for automated annotation of audio-visual data is becoming widespread. A major limitation of many current methods is that the stored indexing features are too low-level: they relate directly to properties of the data. In this work we apply a further stage of processing that associates the feature measurements with real-world objects or events. The outputs, which we call "cues", denote the probability that the object is present in the scene. A further advantage of this approach is that cues derived from different types of features are represented in a homogeneous way.
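
As a rough illustration of the idea (not the actual model used in this work), the sketch below maps low-level feature measurements to cue probabilities with a simple logistic model. All names, feature vectors, and parameter values are hypothetical placeholders; the point is only that cues from heterogeneous feature types (audio, video) reduce to the same probabilistic representation.

```python
import numpy as np

def cue_probability(features, weights, bias):
    """Map a low-level feature vector to a 'cue': the probability
    that a real-world object or event is present in the scene.
    A logistic model stands in here for whatever trained associator
    a real system would use."""
    score = np.dot(weights, features) + bias
    return 1.0 / (1.0 + np.exp(-score))

# Two feature vectors of different types and dimensionalities
# (illustrative values only):
audio_features = np.array([0.8, 0.1, 0.3])  # e.g. spectral measurements
video_features = np.array([0.2, 0.9])       # e.g. colour-histogram measurements

# Both reduce to scalar probabilities, so cues derived from
# different feature types share one homogeneous representation.
speech_cue = cue_probability(audio_features, np.array([1.2, -0.4, 0.7]), -0.5)
face_cue = cue_probability(video_features, np.array([0.9, 1.5]), -1.0)

print(f"P(speech present) = {speech_cue:.2f}")
print(f"P(face present)   = {face_cue:.2f}")
```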