Diane J. Litman, Katherine Forbes-Riley

We examine the utility of speech and lexical features for predicting student emotions in computer-human spoken tutoring dialogues. We first annotate student turns for negative, neutral, positive and mixed emotions. We then extract acoustic-prosodic features from the speech signal, and lexical items from the transcribed or recognized speech. We compare the results of machine learning experiments that use these features alone or in combination to predict various categorizations of the annotated student emotions. Our best results yield a 19-36% relative reduction in error over a baseline. Finally, we compare our results with emotion prediction in human-human tutoring dialogues.
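To make the reported metric concrete, the sketch below shows one way such an experiment could be assembled: hypothetical acoustic-prosodic features are concatenated with bag-of-words lexical features, a classifier is trained on annotated turns, and relative error reduction over a majority-class baseline is computed as (baseline error - model error) / baseline error. This is a minimal sketch using scikit-learn; the data, feature names, and choice of learner are illustrative assumptions, not the paper's actual setup.

```python
# A minimal sketch, not the authors' actual pipeline: concatenate
# hypothetical acoustic-prosodic features with bag-of-words lexical
# features, train a classifier, and score relative error reduction
# over a majority-class baseline. All data and values are illustrative.
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative per-turn data: a transcript plus acoustic-prosodic
# measurements such as pitch, energy, and duration statistics.
turns = [
    "i do not understand this at all",
    "oh i see that makes sense now",
    "um the force equals mass times acceleration",
    "yes exactly right",
    "this problem is really confusing",
    "okay good i got it",
]
prosody = np.array([
    # [mean f0 (Hz), RMS energy, turn duration (s)] -- hypothetical values
    [210.0, 45.0, 2.8],
    [180.0, 60.0, 1.2],
    [175.0, 50.0, 3.5],
    [190.0, 70.0, 0.9],
    [215.0, 40.0, 3.1],
    [185.0, 65.0, 1.0],
])
labels = np.array(["negative", "positive", "neutral",
                   "positive", "negative", "positive"])

# Lexical items as a bag of words, concatenated with the prosodic features.
lexical = CountVectorizer().fit_transform(turns).toarray()
X = np.hstack([prosody, lexical])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5,
                                          random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Majority-class baseline: always predict the most frequent training label.
majority = Counter(y_tr).most_common(1)[0][0]
baseline_error = np.mean(y_te != majority)
model_error = np.mean(clf.predict(X_te) != y_te)

# Relative error reduction: (baseline error - model error) / baseline error.
if baseline_error > 0:
    print(f"relative error reduction: "
          f"{(baseline_error - model_error) / baseline_error:.0%}")
```

The toy data here serve only to make the arithmetic of the metric concrete; the 19-36% figures in the abstract come from the paper's own experiments on annotated tutoring dialogues.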