Drawing on recent progress in auditory neuroscience, we present a novel speech feature analysis technique based on localized spectrotemporal cepstral analysis. We proceed by extracting localized 2D patches from the spectrogram and projecting them onto a 2D discrete cosine transform (2D-DCT) basis. For each time frame, a speech feature vector is then formed by concatenating low-order 2D-DCT coefficients from the set of corresponding patches. We argue that our framework has significant advantages over standard one-dimensional MFCC features. In particular, we find that our features are more robust to noise and better capture the temporal modulations important for recognizing plosive sounds. We evaluate the performance of the proposed features on a TIMIT classification task in clean conditions as well as in pink and babble noise, and show that our feature analysis outperforms traditional features based on MFCCs.
Jake V. Bouvrie, Tony Ezzat, Tomaso Poggio
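As a rough illustration of the pipeline the abstract describes, the following Python sketch extracts overlapping localized patches from a spectrogram, projects each onto a 2D-DCT basis, and concatenates the low-order coefficients into a per-frame feature vector. The patch dimensions, frequency hop, and number of retained coefficients (patch_freq, patch_time, hop_freq, n_keep) are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of localized 2D-DCT spectrotemporal feature extraction.
# All sizes below are illustrative assumptions, not the paper's settings.
import numpy as np
from scipy.fftpack import dct

def dct2(patch):
    """Orthonormal type-II 2D-DCT, applied along both patch axes."""
    return dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")

def spectrotemporal_features(spectrogram, patch_freq=8, patch_time=8,
                             hop_freq=4, n_keep=3):
    """Per-frame features from low-order 2D-DCT coefficients of patches.

    spectrogram : (n_freq, n_frames) log-magnitude spectrogram
    patch_freq, patch_time : patch extent in frequency bins / time frames
    hop_freq : spacing between adjacent patches along frequency
    n_keep : retain the n_keep x n_keep upper-left (low-order) block
    """
    n_freq, n_frames = spectrogram.shape
    features = []
    for t in range(n_frames - patch_time + 1):
        frame_coeffs = []
        # Tile the frequency axis with overlapping localized patches.
        for f in range(0, n_freq - patch_freq + 1, hop_freq):
            patch = spectrogram[f:f + patch_freq, t:t + patch_time]
            coeffs = dct2(patch)[:n_keep, :n_keep]  # low-order block
            frame_coeffs.append(coeffs.ravel())
        # Concatenate coefficients from all patches covering this frame.
        features.append(np.concatenate(frame_coeffs))
    return np.asarray(features)  # shape: (n_valid_frames, feature_dim)
```

Keeping only the upper-left block of each patch's 2D-DCT retains the slowest spectral and temporal modulations, which is the sense in which the features are low-order and localized; the exact truncation used in the paper may differ from this sketch.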