Kate Saenko, Trevor Darrell, James R. Glass

ABSTRACT
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel SVM classifiers to extract different articulatory attributes from the input images, and then to combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.

Categories and Subject Descriptors
I.4 [Image Processing and Computer Vision]

General Terms
Algorithms, Design, Experimentation.

Keywords
Multimodal interfaces, audio-visual speech recognition, speechreading, visual feature extraction.
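To make the combination idea concrete, the following is a minimal sketch, not the authors' implementation, of parallel SVM attribute classifiers whose outputs are merged into a viseme decision. The attribute names, the viseme inventory, and the use of scikit-learn are illustrative assumptions.

```python
# Sketch of parallel articulatory-attribute SVMs combined into a viseme decision.
# Attribute set, viseme definitions, and library choice are assumptions for
# illustration only.
import numpy as np
from sklearn.svm import SVC

ATTRIBUTES = ["lip-opening", "lip-rounding", "labio-dental"]  # hypothetical attributes

# Hypothetical viseme inventory: each viseme is described by a binary
# vector of articulatory attribute values (same order as ATTRIBUTES).
VISEMES = {
    "p":  np.array([0, 0, 0]),   # lips closed, unrounded, not labio-dental
    "f":  np.array([1, 0, 1]),   # lips open, unrounded, labio-dental
    "ao": np.array([1, 1, 0]),   # lips open, rounded, not labio-dental
}

def train_attribute_classifiers(X, attribute_labels):
    """Train one binary SVM per articulatory attribute.

    X: (n_frames, n_dims) image feature vectors
    attribute_labels: (n_frames, n_attributes) binary targets
    """
    classifiers = []
    for a in range(attribute_labels.shape[1]):
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, attribute_labels[:, a])
        classifiers.append(clf)
    return classifiers

def classify_viseme(classifiers, x):
    """Combine the parallel attribute decisions into a viseme by choosing
    the viseme whose attribute vector is closest to the predicted scores."""
    scores = np.array([clf.predict_proba(x.reshape(1, -1))[0, 1]
                       for clf in classifiers])
    return min(VISEMES, key=lambda v: np.sum((VISEMES[v] - scores) ** 2))
```

Word-level decoding could then be built on top of the per-frame viseme (or attribute) scores, but that step is omitted here.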