This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and modeling studies suggesting that humans make use of temporal changes in speech. Specifically, we develop and assess signal processing schemes that exploit temporal change of pitch as a basis for formant estimation. Our methods are cast in a generalized framework of two-dimensional processing of speech and, under certain conditions, show quantitative improvements over traditional representations derived from linear prediction and cepstral analysis. We conclude by highlighting potential benefits of our representations for speaker recognition, where preliminary results indicate a closing of the gender gap in performance on subsets of the TIMIT corpus.
Tao T. Wang, Thomas F. Quatieri