
LREC
2010

Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance

An important feature of spoken language corpora is the existence of different spelling variants of words in transcriptions. This poses a problem for linguists who work with large spoken corpora: how can all variants of a word be found without annotating them manually? Our work describes a search engine that finds different spelling variants (true positives) in a corpus of spoken language and efficiently reduces the number of false positives returned during the search. The search engine uses a generalized variant of the edit distance algorithm that allows text-specific string-to-string transformations to be defined in addition to the default edit operations of edit distance. We have extended the algorithm with the capability to block transformations in specific substrings of search words: the user can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes
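The idea summarized above (default edit operations, plus user-defined string-to-string transformations, plus blocked regions in the search word) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rule format (a dict mapping `(source_substring, target_substring)` to a cost), the unit cost for default operations, and the rule that an insertion is forbidden only strictly inside a blocked region are all assumptions made for illustration.

```python
from math import inf


def generalized_edit_distance(source, target, transformations=None,
                              blocked=frozenset()):
    """Edit distance from the search word `source` to `target`.

    transformations: hypothetical rule format, {(src, dst): cost}.
    blocked: indices of `source` where no edit operation may apply.
    """
    transformations = transformations or {}
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0

    def relax(i, j, cost):
        if cost < dist[i][j]:
            dist[i][j] = cost

    for i in range(n + 1):
        for j in range(m + 1):
            cur = dist[i][j]
            if cur == inf:
                continue
            editable = i < n and i not in blocked
            # Exact match is always allowed, even inside a blocked region.
            if i < n and j < m and source[i] == target[j]:
                relax(i + 1, j + 1, cur)
            if editable:
                relax(i + 1, j, cur + 1)          # default deletion
                if j < m:
                    relax(i + 1, j + 1, cur + 1)  # default substitution
            # Default insertion; assumed forbidden strictly inside a block.
            if j < m and not (i in blocked and i - 1 in blocked):
                relax(i, j + 1, cur + 1)
            # User-defined transformations, skipped if they touch a block.
            if editable:
                for (src, dst), cost in transformations.items():
                    end = i + len(src)
                    if (src and source.startswith(src, i)
                            and target.startswith(dst, j)
                            and blocked.isdisjoint(range(i, end))):
                        relax(end, j + len(dst), cur + cost)
    return dist[n][m]


def find_variants(query, words, max_distance, transformations=None,
                  blocked=frozenset()):
    """Return the words whose distance from `query` is within the threshold."""
    return [w for w in words
            if generalized_edit_distance(query, w, transformations, blocked)
            <= max_distance]
```

With a made-up rule such as `{("ks", "x"): 0.5}`, the query `taksi` reaches `taxi` at cost 0.5 instead of 2, so a low threshold can still retrieve that variant. Blocking indices 2 and 3 of the query (the `ks`) makes the same target unreachable, which is how blocked regions can cut down false positives.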
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Siim Orasmaa, Reina Käärik, Jaak Vilo, Tiit Hennoste