Professional manual transcription of speech is an expensive and time consuming process. This paper focuses on the problem of combining noisy transcriptions from multiple non-expert transcribers, where the quality of work from each worker varies. Computing transcriber reliability is a difficult task in the absence of gold standard reference transcripts. Three simple metrics for quantifying this reliability without using a gold standard are proposed. We create a database of 1000 Mexican Spanish broadcast news audio clips transcribed by five transcribers each through Amazon Mechanical Turk. Combination of multiple noisy transcripts using these reliability scores improves the word error rate of the combined transcript with respect to the LDC gold standard by 8 % relative, and the sentence error rate by 4.1 % relative, when compared with a combination without any reliability information.
Kartik Audhkhasi, Panayiotis G. Georgiou, Shrikant