This paper addresses the problem of selecting between candidate pronunciations for out-of-vocabulary words in speech processing tasks. We introduce a simple, unsupervised method that outperforms the conventional supervised method of forced alignment with a reference. The success of this method is independently demonstrated using three metrics from large-scale speech tasks: word error rates for large vocabulary continuous speech recognition, decision error tradeoff curves for spoken term detection, and phone error rates compared to a handcrafted pronunciation lexicon. The experiments were conducted using state-of-the-art recognition, indexing, and retrieval systems. The results were compared across many terms, hundreds of hours of speech, and well-known data sets.
Christopher M. White, Abhinav Sethy, Bhuvana Ramab