A phone-viseme dynamic Bayesian network for audio-visual automatic speech recognition

This work extends and improves a recently introduced (Dec. 2007) dynamic Bayesian network (DBN) based audio-visual automatic speech recognition (AVASR) system. That system models the audio and visual components of speech as being composed of the same sub-word units, an assumption that is not psycholinguistically accurate. We extend the system to model the audio and visual streams as being composed of separate, yet related, sub-word units. We also introduce a novel stream weighting structure incorporated into the model itself. In doing so, our system makes improvements in word error rate (WER) and overall recognition accuracy in a large vocabulary continuous speech recognition (LVCSR) task. The "best" performing proposed system attains a WER of 66.71% whereas the "best" baseline system performs at a WER of 64.30%. The proposed system also improves accuracy to 45.95% from 39.40%.
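For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The abstract does not say which scoring tool or exact definition the authors used, so the following is only a minimal sketch of that common definition; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for each
    substitution, deletion, and insertion. This is the textbook
    definition, not necessarily the scoring used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, a WER computed this way can exceed 100% when the hypothesis contains many extra words.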

Louis H. Terry, Aggelos K. Katsaggelos

Automatic Speech Recognition | Computer Vision | ICPR 2008 | Overall Recognition Accuracy | Speech Recognition Task |

Added	05 Nov 2009
Updated	05 Nov 2009
Type	Conference
Year	2008
Where	ICPR
Authors	Louis H. Terry, Aggelos K. Katsaggelos
