Sciweavers

LREC
2010

Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development

The development of technologies for machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust, and it is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper will present a detailed description of the design and implementation of LDC's collection system, the technical challenges and solutions to large scale broadcast data col...
Kevin Walker, Christopher Caruso, Denise DiPersio
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Kevin Walker, Christopher Caruso, Denise DiPersio