Sciweavers

LREC
2008

Thai Broadcast News Corpus Construction and Evaluation

Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 17 hours of speech data, while the text corpus was transcribed from around 35 hours of television broadcast news. The characteristics of the corpus are analyzed and presented in the paper. The speech corpus was split according to the evaluation focus conditions used in the DARPA Hub-4 evaluation. As a preliminary experiment, an 18k-word Thai speech recognition system was set up and tested on this speech corpus. Acoustic model adaptation was performed to improve system performance. The best system yielded a word error rate of about 20% for clean and planned speech, and below 30% for the overall condition.
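For readers unfamiliar with the metric quoted in the abstract, word error rate is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the scoring tool used by the authors. The function name `word_error_rate` is hypothetical, and the code assumes both transcripts have already been segmented into whitespace-separated words, a step this listing does not describe.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: both strings are already segmented into whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a reference of five words recognized with one substitution and one deletion gives a rate of 2/5, or 40%.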
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where LREC
Authors Markpong Jongtaveesataporn, Chai Wutiwiwatchai, Koji Iwano, Sadaoki Furui