Sciweavers

NAACL
1994

On Using Written Language Training Data for Spoken Language Modeling

We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training outside the text distribution. We found that increasing the lexicon from 20,000 words to 40,000 words reduced the percentage of words outside the vocabulary from over 2% to just 0.2%, thereby decreasing the error rate substantially. The error rate on words already in the vocabulary did not increase substantially. We modified the language model training text by applying rules to simulate the differences between the training text and what people actually said. Finally, we found that using another three years of training text, even without the appropriate preprocessing, substantially improved the language model. We also tested these approaches on spontaneous news dictation and found similar improve...
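
The out-of-vocabulary percentage mentioned in the abstract can be illustrated with a minimal sketch. It assumes the lexicon is simply the N most frequent words of the training text, which the abstract does not state; the function names build_lexicon and oov_rate and the toy sentences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_lexicon(training_tokens, size):
    """Return the `size` most frequent words in the training text.

    Assumption: frequency-based selection; the paper's actual
    lexicon-building procedure is not described in the abstract.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_tokens, lexicon):
    """Fraction of test tokens that do not appear in the lexicon."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in lexicon)
    return missing / len(test_tokens)


# Toy data, for illustration only.
training = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the rug".split()

small_lexicon = build_lexicon(training, 3)
large_lexicon = build_lexicon(training, 7)

small_rate = oov_rate(test, small_lexicon)
large_rate = oov_rate(test, large_lexicon)
print(f"OOV rate with small lexicon: {small_rate:.3f}")
print(f"OOV rate with large lexicon: {large_rate:.3f}")
```

On real data the same comparison would be run with lexicon sizes such as 20,000 and 40,000 words against transcripts of what speakers actually said.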
Added 02 Nov 2010
Updated 02 Nov 2010
Type Conference
Year 1994
Where NAACL
Authors Richard M. Schwartz, Long Nguyen, Francis Kubala, George Chou, George Zavaliagkos, John Makhoul