A study in machine learning from imbalanced data for sentence boundary detection in speech

Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed an HMM system to detect sentence boundaries that uses both the prosodic and textual information. In this system, the sentence boundaries are detected by building a classifier in which at each interword boundary, a decision is made as to whether or not it ends a sentence. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora, using both the reference transcription and the recognition output. In the pilot study, w...
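The combination of sampling and bagging mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes random downsampling of the majority class as one possible sampling approach, stands in a depth-1 decision stump for the decision tree prosody model, and uses a single hypothetical feature (pause duration in seconds) with made-up toy data. All function names are illustrative.

```python
import random

def downsample(examples, rng):
    """Randomly drop majority-class examples until both classes are equal in size."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))

def train_stump(examples):
    """Fit a depth-1 tree: predict 'boundary' when the feature is >= a threshold."""
    best = None
    for threshold in sorted({x for x, _ in examples}):
        errors = sum((x >= threshold) != bool(y) for x, y in examples)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def train_bagged(examples, n_bags, seed=0):
    """Train one stump per independently downsampled (balanced) training set."""
    rng = random.Random(seed)
    return [train_stump(downsample(examples, rng)) for _ in range(n_bags)]

def predict_proba(thresholds, x):
    """Fraction of stumps that vote 'sentence boundary' for feature value x."""
    return sum(x >= t for t in thresholds) / len(thresholds)

# Toy, made-up data: (pause_duration, label), where label 1 marks a sentence
# boundary. Non-boundaries outnumber boundaries 9 to 1.
data_rng = random.Random(1)
data = [(data_rng.uniform(0.0, 0.3), 0) for _ in range(90)]
data += [(data_rng.uniform(0.4, 1.0), 1) for _ in range(10)]

model = train_bagged(data, n_bags=5)
print(predict_proba(model, 0.1), predict_proba(model, 1.0))
```

Each stump sees a balanced sample, so no single classifier is dominated by the nonsentence-boundary class, while averaging over several samples recovers some of the majority-class data that any one downsampled set discards.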

Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizabeth Shriberg, Andreas Stolcke

Automated Reasoning | CSL 2006 | Original Training | Pilot Study | Sentence Boundaries |

» Automatic Labeling Inconsistencies Detection and Correction for Sentence Unit Segmentation...

» Applying Support Vector Machines to Imbalanced Datasets

» Construction of ChunkAligned Bilingual Lecture Corpus for Simultaneous Machine Translation

» Mixed Type Audio Classification with Support Vector Machine

» Concept boundary detection for speeding up SVMs

» What Characterizes a Shadow Boundary under the Sun and Sky

» Information Extraction for Clinical Data Mining A Mammography Case Study

» Morphological Richness Offsets Resource Demand Experiences in Constructing a POS Tagger f...

Post Info

Added	11 Dec 2010
Updated	11 Dec 2010
Type	Journal
Year	2006
Where	CSL
Authors	Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizabeth Shriberg, Andreas Stolcke
