Classifying laughter and speech using audio-visual feature prediction

15 years 7 months ago


This study presents a system that discriminates laughter from speech by modelling the relationship between audio and visual features. The underlying assumption is that this relationship differs between speech and laughter. Neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings for both classes. Classification of a new frame is performed via prediction: every network produces a prediction of the expected audio or visual features, and the network with the best prediction, i.e., the model which best describes the audiovisual feature relationship, provides its label to the input frame. When trained on a simple dataset and tested on a hard dataset, the proposed approach outperforms audiovisual feature-level fusion, resulting in a 10.9% and 6.4% absolute increase in the F1 rate for laughter and in the classification rate, respectively. This indicates that classification based on prediction can produce a good model even when the available da...
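The prediction-based classification described above can be illustrated with a minimal sketch. It is not the authors' implementation: linear least-squares maps stand in for the neural networks, the data is synthetic, and summing the audio-to-visual and visual-to-audio errors into a single score is an assumption made here for simplicity. All function names and feature dimensions are hypothetical.

```python
import numpy as np

def fit_linear_map(X, Y):
    """Least-squares map from X to Y with a bias term (stand-in for a neural network)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

def train_class_models(audio, visual):
    """One audio-to-visual and one visual-to-audio predictor per class."""
    return {"a2v": fit_linear_map(audio, visual),
            "v2a": fit_linear_map(visual, audio)}

def classify_frames(models, audio, visual):
    """Label each frame with the class whose predictors give the lowest error."""
    labels = list(models)
    errors = []
    for label in labels:
        m = models[label]
        err_v = np.mean((predict(m["a2v"], audio) - visual) ** 2, axis=1)
        err_a = np.mean((predict(m["v2a"], visual) - audio) ** 2, axis=1)
        errors.append(err_v + err_a)  # assumption: the two errors are simply summed
    best = np.argmin(np.stack(errors), axis=0)
    return [labels[i] for i in best]

# Synthetic data: each class has a different audio-visual relationship.
rng = np.random.default_rng(0)
A_speech = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
A_laughter = -A_speech

def make_frames(A, n):
    audio = rng.normal(size=(n, 3))                      # hypothetical 3-D audio features
    visual = audio @ A + 0.05 * rng.normal(size=(n, 2))  # hypothetical 2-D visual features
    return audio, visual

models = {
    "speech": train_class_models(*make_frames(A_speech, 200)),
    "laughter": train_class_models(*make_frames(A_laughter, 200)),
}

test_audio, test_visual = make_frames(A_laughter, 100)
predicted = classify_frames(models, test_audio, test_visual)
accuracy = sum(p == "laughter" for p in predicted) / len(predicted)
```

In this sketch a frame receives the label of the class whose pair of predictors reconstructs it best, which mirrors the idea that the best-predicting model describes the frame's audiovisual relationship.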

Stavros Petridis, Ali Asghar, Maja Pantic
