This article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework for studying the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized data sets (NIST RT) and show a consistent relative improvement of about 30% in diarization error rate compared to the best system presented at the NIST evaluation in 2007. This result was also verified on a wider set of 21 meetings from previous evaluations, which we call CombDev. Since the prosodic and long-term features were selected using a diarization-independent speaker-discriminability study, we are confident that the same features are able to improve other systems that perform similar...
Gerald Friedland, Oriol Vinyals, C. Yan Huang, Chr