In this paper, we present a systems approach for channel modeling of an Automatic Speech Recognition (ASR) system. This can have implications in improving speech recognition components, such as through discriminative language modeling. We simulate the ASR corruption using a phrase-based machine translation system trained between the reference phoneme and output phoneme sequences of a real ASR. We demonstrate that local optimization on the quality of phoneme-to-phoneme mappings does not directly translate to overall improvement of the entire model. However, we are still able to capitalize on contextual information of the phonemes which a simple acoustic distance model is not able to accomplish. Hence we show that the use of longer context results in a significantly improved model of the ASR channel.
Qun Feng Tan, Kartik Audhkhasi, Panayiotis G. Geor