For TV and radio shows containing narrowband speech, speech-to-text (STT) accuracy on the narrowband audio can be improved by using an acoustic model trained on acoustically matched data. To selectively apply it, one must first be able to accurately detect which audio segments are narrowband. The present paper explores two different bandwidth classification approaches: a traditional Gaussian mixture model (GMM) approach and a spline-based classifier that categorizes audio segments based on their power spectra. We focus on shows found in the DARPA GALE Mandarin training and test sets, where the ratio of wideband to narrowband shows is very large. In this setting, the spline-based classifier reduces the number of misclassified wideband segments by up to 95% relative to the GMM-based classifier for the same number of misclassified narrowband segments.
Mark C. Fuhs, Qin Jin, Tanja Schultz