This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that uses a segment-based modeling strategy. To support this research, we collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 hours of read speech from 223 different speakers. This corpus was used to evaluate our new AVSR system, which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when visual information is incorporated into the speech recognition process.

Categories and Subject Descriptors
I.2.M [Artificial Intelligence]: Miscellaneous

General Terms
Algorithms, Design, Experimentation.

Keywords
Audio-visual speech recognition, audio-visual corpora.
Timothy J. Hazen, Kate Saenko, Chia-Hao La, James