In this paper, we describe a system that reliably localizes the speaker’s face and mouth in videophone sequences. A statistical scheme based on a subspace method is presented for detecting human faces under varying poses. We propose a new matching criterion based on the Generalized Likelihood Ratio. The criterion is optimized efficiently with respect to similarity, affine, or perspective transformation parameters using a coarse-to-fine search strategy combined with a simulated annealing algorithm. Moreover, we propose to extract a vector of geometrical features (four points) on the outline of the mouth. The extraction is performed by analyzing amplitude projections in the mouth region. All computations are carried out on H.263-coded frames at QCIF spatial resolution. To this end, we propose algorithms adapted to the poor quality of the images and suited to a subsequent real-time application.
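As a concrete illustration of the coarse-to-fine optimization mentioned above, the following minimal Python sketch minimizes an arbitrary matching cost (for instance, a negative generalized-likelihood-ratio score) over similarity transform parameters with simulated annealing; the cooling schedule, step sizes, and function names are assumptions for illustration, not the authors’ exact procedure.

```python
import math
import random

def anneal_similarity(cost, init, steps_per_level=200,
                      levels=((8.0, 0.2, 0.2), (2.0, 0.05, 0.05))):
    """Coarse-to-fine simulated annealing over similarity transform
    parameters (tx, ty, scale, angle). `cost` maps a parameter list to a
    scalar to be minimized; step sizes shrink from level to level."""
    state = list(init)
    cur = cost(state)
    best, best_cost = list(state), cur
    for dt, ds, da in levels:                      # coarse level, then fine
        temp = 1.0
        for _ in range(steps_per_level):
            cand = [state[0] + random.uniform(-dt, dt),   # tx
                    state[1] + random.uniform(-dt, dt),   # ty
                    state[2] + random.uniform(-ds, ds),   # scale
                    state[3] + random.uniform(-da, da)]   # angle (rad)
            cand_cost = cost(cand)
            delta = cand_cost - cur
            # Always accept downhill moves; accept uphill moves with
            # Boltzmann probability so the search can escape local minima.
            if delta < 0 or random.random() < math.exp(-delta / max(temp, 1e-9)):
                state, cur = cand, cand_cost
                if cur < best_cost:
                    best, best_cost = list(state), cur
            temp *= 0.95                           # geometric cooling
    return best, best_cost
```

Likewise, the extraction of mouth points from amplitude projections can be sketched as follows, assuming a grayscale mouth region in which the lips are darker than the surrounding skin; the thresholding rule and the definition of the four points are illustrative assumptions rather than the paper’s exact procedure.

```python
import numpy as np

def mouth_outline_points(mouth_region, frac=0.5):
    """Estimate four points on the mouth outline from amplitude
    (integral) projections of a grayscale mouth region."""
    region = mouth_region.astype(np.float64)
    row_proj = region.sum(axis=1)   # intensity summed along each row
    col_proj = region.sum(axis=0)   # intensity summed along each column

    def dark_span(proj):
        # Rows/columns darker than a fraction of the projection range are
        # attributed to the (dark) lip area.
        thresh = proj.min() + frac * (proj.max() - proj.min())
        idx = np.flatnonzero(proj < thresh)
        if idx.size == 0:                     # flat projection fallback
            idx = np.array([int(np.argmin(proj))])
        return int(idx[0]), int(idx[-1])

    top, bottom = dark_span(row_proj)         # vertical mouth extent
    left, right = dark_span(col_proj)         # horizontal mouth extent
    x_mid, y_mid = (left + right) // 2, (top + bottom) // 2

    # Four characteristic (x, y) points: left corner, right corner,
    # upper lip, lower lip.
    return [(left, y_mid), (right, y_mid), (x_mid, top), (x_mid, bottom)]
```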
Charles Kervrann, Franck Davoine, P. Pérez,