A Two-Level Structure for Compressing Aligned Bitexts


A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies and on the use of biwords, pairs of associated words, one from each language, as the basic symbols to be encoded with an ETDC [2] compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineers need to perform on it. In this paper we discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
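The idea of encoding biwords with ETDC (End-Tagged Dense Code) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the word alignment is already given as a list of (source word, target word) pairs, and it models the two-level vocabulary in one plausible way: a word list per language at the first level, and a biword table of index pairs at the second level. The function names and the sample English/Spanish words are hypothetical.

```python
from collections import Counter


def etdc_encode(rank):
    """Return the ETDC codeword (bytes) for a 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte: high bit set marks the codeword end
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)    # earlier bytes keep the high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(data):
    """Decode a concatenation of ETDC codewords back into a list of ranks."""
    ranks, acc = [], 0
    for b in data:
        if b < 128:               # continuation byte
            acc = acc * 128 + b + 1
        else:                     # tagged byte closes the current codeword
            ranks.append(acc * 128 + (b - 128))
            acc = 0
    return ranks


def compress(biwords):
    """Build the (assumed) two-level vocabulary and the ETDC payload."""
    # Level 1: one word vocabulary per language.
    left = sorted({l for l, _ in biwords})
    right = sorted({r for _, r in biwords})
    left_id = {w: i for i, w in enumerate(left)}
    right_id = {w: i for i, w in enumerate(right)}
    # Level 2: biwords as index pairs, ordered by decreasing frequency
    # so that the most frequent biwords receive the shortest codewords.
    ranked = [bw for bw, _ in Counter(biwords).most_common()]
    biword_vocab = [(left_id[l], right_id[r]) for l, r in ranked]
    rank_of = {bw: i for i, bw in enumerate(ranked)}
    payload = b"".join(etdc_encode(rank_of[bw]) for bw in biwords)
    return left, right, biword_vocab, payload


def decompress(left, right, biword_vocab, payload):
    """Recover the aligned word pairs from the vocabularies and payload."""
    pairs = []
    for rank in etdc_decode(payload):
        li, ri = biword_vocab[rank]
        pairs.append((left[li], right[ri]))
    return pairs


# Hypothetical aligned bitext, already split into biwords.
sample = [("the", "la"), ("house", "casa"), ("the", "la"), ("green", "verde")]
model = compress(sample)
restored = decompress(*model)
```

Because every codeword ends in a byte with the high bit set, codeword boundaries can be found without decoding from the start of the payload, which is the kind of property that makes searching directly over the compressed text practical.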



Bilingual Parallel Corpus | Information Retrieval | Linguistic Engineerings | Mutual Translations | SPIRE 2009 |


Post Info

Added	27 May 2010
Updated	27 May 2010
Type	Conference
Year	2009
Where	SPIRE
Authors	Joaquín Adiego, Nieves R. Brisaboa, Miguel A. Martínez-Prieto, Felipe Sánchez-Martínez
