A new framework for context- and speaker-independent recognition of emotions from voice, based on a richer and more natural representation of the speech signal, is proposed. The utterance is viewed as a series of voiced segments rather than as a single object. The voiced segments are first identified and then described using statistical measures of spectral shape, intensity, and pitch contours, computed at both the segment and the utterance level. Utterance classification is performed by combining the segment-level classification decisions using a fixed combination scheme. The performance of two learning algorithms, Support Vector Machines and K-Nearest Neighbors, is compared. The proposed approach yields an overall classification accuracy of 87% for five emotions, outperforming previous results on a similar database.
Mohammad T. Shami, Mohamed S. Kamel
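To make the pipeline in the abstract concrete, the following is a minimal sketch of segment-level classification followed by a fixed combination of the per-segment decisions. The feature set, the sum-of-posteriors combination rule, and all function names are illustrative assumptions; the abstract only states that statistical spectral, intensity, and pitch features are used and that a fixed combination scheme merges segment decisions, and it does not publish this code.

```python
# Illustrative sketch (assumed implementation, not the authors' code).
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def segment_features(segment, sr):
    """Toy statistical descriptors of one voiced segment (placeholder features)."""
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9))  # spectral-shape proxy
    energy = float(np.mean(segment ** 2))                                   # intensity proxy
    return np.array([centroid, energy])

def classify_utterance(voiced_segments, sr, clf):
    """Classify each voiced segment, then combine the per-segment posteriors
    by summing them (one common fixed combination rule)."""
    feats = np.vstack([segment_features(s, sr) for s in voiced_segments])
    posteriors = clf.predict_proba(feats)  # one row of class probabilities per segment
    return clf.classes_[np.argmax(posteriors.sum(axis=0))]

# The two classifiers compared in the paper; training-data loading is omitted here.
# svm = SVC(probability=True).fit(X_train, y_train)
# knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# label = classify_utterance(segments_of_one_utterance, sr=16000, clf=svm)
```

Summing the per-segment class posteriors is just one plausible fixed combination scheme; a majority vote over segment labels would be an equally simple alternative under the same framework.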