Sciweavers

INTERSPEECH
2010

Rapid Bootstrapping of Five Eastern European Languages Using the Rapid Language Adaptation Toolkit

This paper presents our latest efforts toward large vocabulary continuous speech recognition (LVCSR) systems for five Eastern European languages (Bulgarian, Croatian, Czech, Polish, and Russian) using our Rapid Language Adaptation Toolkit (RLAT) [1]. We investigated crawling large quantities of text from the Internet, which is very cheap but requires text post-processing because the quality of the text varies. The goal of this study is to determine the best strategy for optimizing the language model on the given domain in a short time and with minimal human effort. Our results show that an initial automatic speech recognition (ASR) system can be built for each of these five languages in only twenty days using RLAT. On the multilingual GlobalPhone speech corpus [2], we achieved a word error rate (WER) of 16.9% for Bulgarian, 32.8% for Croatian, 23.5% for Czech, 20.4% for Polish, and 36.2% for Russian.
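The abstract reports its results as word error rates. For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words). It is not taken from the paper or from RLAT: the function name `word_error_rate` and the example sentences are illustrative assumptions, and real evaluations usually also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; not part of RLAT.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the")
# against a six-word reference.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.1%}")  # 33.3%
```

Under this definition, the 16.9% reported for Bulgarian corresponds to roughly 17 word errors per 100 reference words.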
Added 18 May 2011
Updated 18 May 2011
Type Conference
Year 2010
Where INTERSPEECH
Authors Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, Tanja Schultz