In this paper we address the problem of estimating who is speaking from automatically extracted low-resolution visual cues in group meetings. Traditionally, the task of speech/non-speech detection or speaker diarization tries to find “who speaks and when” from audio features alone. Here, we investigate more systematically how speaking status can be estimated from low-resolution video. We exploit the synchrony of a group’s head and hand motion to learn correspondences between speaking status and visual activity. We also carry out experiments to evaluate how context, observed through group behaviour and task-oriented activities, can help to improve estimates of speaking status. We test on 105 minutes of natural meeting data with unconstrained conversations and compare with state-of-the-art audio-only methods.
Hayley Hung, Sileye O. Ba