A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research has been devoted to low-level perceptual abilities, such as audio feature extraction and grouping, which have been translated into successful signal processing techniques. However, little work has been done on modeling knowledge and context in sound recognition, even though this information is necessary to identify a sound event, as opposed to merely separating its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. We then show that the use of knowledge in a dynamic network model can improve automatic sound identification by reducing the search space of the low-level audio features. Furthermore, context information resolves ambiguities that arise when a sound event admits multiple interpretations.
Maria E. Niessen, Leendert van Maanen, Tjeerd C. Andringa
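To make the role of context concrete, below is a minimal sketch of the general idea, not the paper's dynamic network model: contextual knowledge is treated as a prior over candidate interpretations of a sound event, so that identical low-level evidence (e.g., a "bang") is identified differently in different contexts. All class names and probabilities here are hypothetical.

```python
# A minimal sketch (not the paper's model): context acts as a prior over
# event hypotheses, narrowing the interpretations compatible with the
# low-level evidence. All classes and numbers are illustrative.

CONTEXT_PRIOR = {
    "street":   {"car door": 0.5, "gunshot": 0.1, "firework": 0.4},
    "new year": {"car door": 0.1, "gunshot": 0.1, "firework": 0.8},
}

# Likelihood of the observed low-level features under each hypothesis,
# e.g. from an acoustic classifier; deliberately near-uniform here to
# show that the features alone cannot separate the candidates.
FEATURE_LIKELIHOOD = {"car door": 0.33, "gunshot": 0.33, "firework": 0.34}

def identify(context: str) -> str:
    """Combine the context prior with the feature likelihood
    (unnormalised Bayes) and return the best-scoring hypothesis."""
    scores = {h: CONTEXT_PRIOR[context][h] * FEATURE_LIKELIHOOD[h]
              for h in FEATURE_LIKELIHOOD}
    return max(scores, key=scores.get)

print(identify("street"))    # -> 'car door'
print(identify("new year"))  # -> 'firework'
```

The same acoustic evidence yields different identifications once the context prior is applied, which is the disambiguation effect the abstract describes; the paper's dynamic network realizes this interaction between knowledge and low-level features in a richer form.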