In this paper, we propose a novel correlation-based method for speech-video synchronization and relationship classification. The method uses the envelope of the speech signal and data extracted from lip movements. Firstly, a nonlinear time-varying model is considered in which the speech signal is represented as a sum of amplitude- and frequency-modulated (AM-FM) signals. Each AM-FM signal in this sum models a single speech formant frequency. Using a Taylor series expansion, the model is formulated in a way that characterizes the relation between the speech amplitude and the instantaneous frequency of each AM-FM signal with respect to lip movements. Secondly, the envelope of the speech signal is estimated and then correlated with signals generated from lip movements. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real cases and the results show that it is able to (i) ...
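The envelope-correlation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names, the FFT-based Hilbert envelope, and the synthetic Gaussian-enveloped tone standing in for speech and lip-movement signals are all our own assumptions.

```python
import numpy as np

def speech_envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based
    Hilbert transform); assumes an even-length real input."""
    n = len(x)
    h = np.zeros(n)
    h[0] = h[n // 2] = 1.0   # DC and Nyquist bins kept once
    h[1:n // 2] = 2.0        # positive frequencies doubled
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def estimate_delay(speech, lip_signal, fs):
    """Cross-correlate the speech envelope with a lip-movement
    signal; return the delay (in seconds) of the lip signal
    relative to the speech. Both are sampled at rate fs."""
    e = speech_envelope(speech)
    e = e - e.mean()                      # remove means so the
    l = lip_signal - lip_signal.mean()    # peak reflects co-variation
    corr = np.correlate(l, e, mode="full")
    lag = int(np.argmax(corr)) - (len(e) - 1)
    return lag / fs

# Synthetic check: a Gaussian-enveloped 20 Hz tone plays the role
# of speech; the "lip" signal is the same envelope delayed by 0.3 s.
fs = 100
t = np.arange(500) / fs
env = np.exp(-((t - 2.5) ** 2) / (2 * 0.3 ** 2))
speech = env * np.sin(2 * np.pi * 20.0 * t)
lip = np.roll(env, 30)  # 30 samples = 0.3 s delay
print(estimate_delay(speech, lip, fs))
```

The peak location of the full cross-correlation, offset by `len(e) - 1`, gives the lag in samples; dividing by the sampling rate converts it to seconds.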
Amar A. El-Sallam, Ajmal S. Mian