Missing-data methods aim to make speech recognition more robust by distinguishing between reliable and unreliable data in the time-frequency domain. Such methods require a binary mask that labels each time-frequency region of a noisy speech signal as reliable if it contains more speech energy than noise energy, and as unreliable otherwise. Current methods for estimating the mask rely mainly on bottom-up speech separation cues such as harmonicity, and they produce labeling errors that degrade recognition performance. We propose a two-stage recognition system to improve mask estimation and produce better recognition results. First, an n-best lattice consistent with the speech separation mask is generated. The lattice is then re-scored by expanding the mask using a model-based hypothesis test to determine the reliability of individual time-frequency regions. Systematic evaluations show significant improvement in recognition performance compared to that using speech se...
Soundararajan Srinivasan, DeLiang L. Wang
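
The labeling rule stated in the abstract (a time-frequency region is reliable if it contains more speech energy than noise energy, and unreliable otherwise) can be illustrated with a minimal sketch. It assumes oracle access to separate speech and noise energies, which a recognizer working from the noisy mixture does not have; in practice the mask has to be estimated. The function name `ideal_binary_mask` and the list-of-rows layout (one row per time frame, one entry per frequency channel) are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List


def ideal_binary_mask(
    speech_energy: List[List[float]],
    noise_energy: List[List[float]],
) -> List[List[int]]:
    """Label each time-frequency unit as reliable (1) or unreliable (0).

    Assumption: both inputs are oracle energies with the same shape,
    indexed as [time_frame][frequency_channel]. A unit is reliable only
    if speech energy strictly exceeds noise energy; ties are unreliable.
    """
    if len(speech_energy) != len(noise_energy):
        raise ValueError("inputs must have the same number of time frames")

    mask = []
    for speech_row, noise_row in zip(speech_energy, noise_energy):
        if len(speech_row) != len(noise_row):
            raise ValueError("inputs must have the same number of channels")
        mask.append([1 if s > n else 0 for s, n in zip(speech_row, noise_row)])
    return mask


# Toy energies: 2 time frames by 3 frequency channels (made-up values).
speech = [[4.0, 0.5, 2.0], [0.1, 3.0, 1.0]]
noise = [[1.0, 1.0, 2.0], [0.2, 0.5, 1.0]]
print(ideal_binary_mask(speech, noise))
```

Units where the two energies are equal are marked unreliable here, following the "more speech energy than noise energy" wording; this sketch does not attempt the mask estimation or lattice re-scoring that the abstract proposes.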