We examine the utility of multiple types of turn-level and contextual linguistic features for automatically predicting student emotions in human-human spoken tutoring dialogues. We first annotate student turns in our corpus for negative, neutral, and positive emotions. We then automatically extract features representing acoustic-prosodic and other linguistic information from the speech signal and the associated transcriptions. We compare the results of machine learning experiments that use different feature sets to predict the annotated emotions. Our best-performing feature set contains both acoustic-prosodic and other types of linguistic features, extracted from both the current turn and a context of previous student turns, and yields a prediction accuracy of 84.75%, a 44% relative error reduction over a baseline. Our results suggest that the intelligent tutoring spoken dialogue system we are developing can be enhanced to automatically predict and adapt to student emotions.
Katherine Forbes-Riley, Diane J. Litman
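As a rough illustration of the experimental comparison described in the abstract, the following is a minimal Python sketch of training and cross-validating a 3-class emotion classifier over several feature sets. The feature names, corpus size, data values, and choice of scikit-learn with a boosted classifier are all illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of the feature-set comparison: predict
# negative/neutral/positive student emotions from turn-level features.
# All data below is synthetic placeholder data, not the paper's corpus.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_turns = 200  # placeholder number of annotated student turns

# Placeholder feature matrices: acoustic-prosodic features (e.g., pitch,
# energy, duration statistics) vs. other linguistic features (e.g.,
# lexical items from the transcriptions). Dimensions are arbitrary.
feature_sets = {
    "acoustic-prosodic": rng.normal(size=(n_turns, 12)),
    "other-linguistic": rng.normal(size=(n_turns, 8)),
}
# Combined set: current-turn features plus (hypothetically) features
# drawn from a context window of previous student turns.
feature_sets["combined+context"] = np.hstack(
    [feature_sets["acoustic-prosodic"], feature_sets["other-linguistic"]]
)

# Emotion labels: 0 = negative, 1 = neutral, 2 = positive.
y = rng.integers(0, 3, size=n_turns)

# Compare mean cross-validation accuracy per feature set.
for name, X in feature_sets.items():
    scores = cross_val_score(AdaBoostClassifier(), X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

With real annotated data in place of the random placeholders, the same loop reproduces the shape of the comparison the abstract reports: per-feature-set accuracies that can be measured against a baseline to compute relative error reduction.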