Mining the Web to Create Minority Language Corpora

15 years 11 months ago


The Web is a valuable source of language-specific resources, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages...
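As a rough illustration of the odds-ratio idea mentioned in the abstract, the sketch below scores terms by how much more likely they are to occur in documents labeled relevant (target language) than in documents labeled irrelevant, then picks inclusion and exclusion terms. It is not the authors' implementation: the function names, the add-0.5 smoothing, the whitespace tokenization, and the `+term -term` query syntax are all assumptions made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
    """Return {term: log odds ratio} using document frequencies.

    Hypothetical helper: smoothing keeps probabilities strictly
    between 0 and 1 so the logarithm is always defined.
    """
    # Count in how many documents of each class a term occurs.
    rel_df = Counter(t for d in relevant_docs for t in set(d.lower().split()))
    irr_df = Counter(t for d in irrelevant_docs for t in set(d.lower().split()))
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)

    scores = {}
    for term in set(rel_df) | set(irr_df):
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irr_df[term] + smoothing) / (n_irr + 2 * smoothing)
        scores[term] = math.log(p_rel * (1 - p_irr) / ((1 - p_rel) * p_irr))
    return scores


def build_query(relevant_docs, irrelevant_docs, k=1):
    """Pick the k highest-scoring terms to include and k lowest to exclude."""
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    # Ties are broken alphabetically so the result is deterministic.
    include = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    exclude = sorted(scores, key=lambda t: (scores[t], t))[:k]
    # Assumed query syntax: '+' requires a term, '-' forbids it.
    query = " ".join(["+" + t for t in include] + ["-" + t for t in exclude])
    return include, exclude, query


if __name__ == "__main__":
    # Toy documents: the labels stand in for the language classifier's output.
    relevant = ["to je dober dan", "dan je lep", "je to res"]
    irrelevant = ["this is a good day", "the day is nice", "is this true"]
    print(build_query(relevant, irrelevant, k=1))
```

In a loop of the kind the abstract describes, the documents returned for each query would be labeled by the language classifier and added to the two sets before the scores are recomputed.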

Rayid Ghani, Rosie Jones, Dunja Mladenic
