

The Czech Broadcast Conversation Corpus

Abstract. This paper presents the final version of the Czech Broadcast Conversation Corpus, released through the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, which yield about 33 hours of transcribed conversational speech from 128 speakers. The release includes not only verbatim transcripts and speaker information but also structural metadata (MDE) annotation, which covers labeling of sentence-like unit boundaries, marking of non-content words such as filled pauses and discourse markers, and annotation of speech disfluencies. The annotation is based on the LDC’s MDE annotation standard for English, with changes made to accommodate phenomena that are specific to Czech. Beyond its value for research in speech recognition, speaker diarization, and structural metadata extraction, the corpus is also useful for linguistic analysis of conversational Czech.
Jáchym Kolář, Jan Švec
Added 27 May 2010
Updated 27 May 2010
Type Conference
Year 2009
Where TSD