Sciweavers

EACL
2003
ACL Anthology

Learning to Identify Fragmented Words in Spoken Discourse

Disfluent speech makes spoken language utterances harder to process. In this paper we concentrate on identifying one disfluency phenomenon: fragmented words. Our data, drawn from the Spoken Dutch Corpus, comprises nearly 45,000 sentences of human discourse, ranging from spontaneous chat to media broadcasts. We classify each lexical item in a sentence as either a completely uttered word or an incompletely uttered, i.e. fragmented, word. The task is carried out by both the IB1 and RIPPER machine learning algorithms, trained on a variety of features with an extensive optimization strategy. Our best classifier achieves an F-score of 74.9%, a significant improvement over the baseline. We discuss why memory-based learning is more successful than rule induction at correctly classifying fragmented words.
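As a rough illustration of the kind of setup the abstract describes, the sketch below shows a memory-based (nearest-neighbour) classifier using a plain, unweighted overlap distance, together with an F-score computation for the "fragment" class. It is not the paper's implementation: the feature names, the toy examples, and the function names are all hypothetical, and IB1 as typically used also supports feature weighting, which is left out here.

```python
from collections import Counter

# Hypothetical symbolic features per word: (length class, final sound, onset of next word).
# These are illustrative assumptions, not the feature set used in the paper.
MEMORY = [
    (("short", "no_vowel_end", "next_same_onset"), "fragment"),
    (("short", "vowel_end", "next_diff_onset"), "complete"),
    (("long", "vowel_end", "next_diff_onset"), "complete"),
    (("short", "no_vowel_end", "next_diff_onset"), "fragment"),
]


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(memory, instance, k=1):
    """Label an instance by majority vote among its k nearest stored examples."""
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def f_score(gold, predicted, positive="fragment"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    unseen = ("long", "vowel_end", "next_same_onset")
    print(knn_classify(MEMORY, unseen))
    print(f_score(["fragment", "complete", "fragment", "complete"],
                  ["fragment", "fragment", "complete", "complete"]))
```

Because fragmented words are rare relative to complete ones, an F-score on the fragment class is more informative than raw accuracy, which a classifier could inflate by always predicting "complete".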
Piroska Lendvai
Type Conference
Year 2003
Where EACL
Authors Piroska Lendvai