NAACL
2010

Language Identification: The Long and the Short of the Matter

Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection.
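To illustrate how language identification can proceed without explicit character encoding detection, the sketch below tokenises documents into raw byte n-grams and assigns the language whose training profile is most similar. It is only a minimal illustration under stated assumptions, not the models evaluated in the paper: the function names, the choice of bigrams, and the cosine-similarity nearest-profile rule are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams; no decoding step is needed."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples: dict, n: int = 2) -> dict:
    """Build one aggregate n-gram profile per language label.

    `samples` maps a language label to a list of raw byte strings,
    which may be in different encodings (an assumption of this sketch).
    """
    profiles = {}
    for lang, docs in samples.items():
        profile = Counter()
        for doc in docs:
            profile.update(byte_ngrams(doc, n))
        profiles[lang] = profile
    return profiles


def identify(doc: bytes, profiles: dict, n: int = 2) -> str:
    """Return the label whose profile is most similar to the document."""
    grams = byte_ngrams(doc, n)
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))
```

Because the features are bytes rather than decoded characters, training documents in UTF-8 and Latin-1 can be mixed freely. The trade-off is that the same language in two encodings yields partly different n-grams, so each language and encoding pairing needs to be represented in the training data.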
Timothy Baldwin, Marco Lui
Added 14 Feb 2011
Updated 14 Feb 2011
Type Journal
Year 2010
Where NAACL