In missing feature based automatic speech recognition (ASR), the role of the spectro-temporal mask in providing an accurate description of the relationship between target speech and environmental noise is critical for minimizing the degradation in ASR word accuracy (WAC) as the signal-to-noise ratio (SNR) decreases. This paper demonstrates the importance of accurate characterization of instantaneous acoustic background for mask estimation in data imputation approaches to missing feature based ASR, especially in the presence of non-stationary background noise. Mask estimation relies on a hypothesis test designed to detect the presence of speech in time-frequency spectral bins under rapidly varying noise conditions. Masked melfrequency filter bank energies are reconstructed using a minimum mean squared error (MMSE) based data imputation procedure. The impact of this mask estimation approach is evaluated in the context of MMSE based data imputation under multiple background conditions ov...
Shirin Badiezadegan, Richard C. Rose