This paper addresses the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communication such as Short Message Service (SMS), Twitter, and chat. A typical statistical machine translation system is trained on parallel text comprising noisy and clean sentences. In contrast, we propose an unsupervised method for translating noisy text to clean text. The method has two steps. First, for a given noisy sentence, a weighted list of possible clean tokens is obtained for each noisy token. Second, the clean sentence is obtained by maximizing the product of the weights from these lists and the language model scores.
Danish Contractor, Tanveer A. Faruquie, L. Venkata
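
The two-step decoding described in the abstract can be illustrated with a minimal sketch. It assumes a bigram language model and a Viterbi-style search, neither of which the abstract specifies. The function names `decode` and `toy_bigram_logprob`, the candidate weights, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy bigram probabilities P(cur | prev); unseen pairs get a small floor.
TOY_BIGRAMS = {
    ("<s>", "see"): 0.2,
    ("<s>", "sea"): 0.05,
    ("see", "you"): 0.5,
    ("you", "tomorrow"): 0.3,
}


def toy_bigram_logprob(prev, cur):
    """Log-probability of `cur` following `prev` under the toy model."""
    return math.log(TOY_BIGRAMS.get((prev, cur), 0.001))


def decode(candidates, bigram_logprob):
    """Pick one clean token per noisy token.

    candidates: one dict per noisy token, mapping clean token -> weight (> 0).
    Maximizes the product of candidate weights and bigram scores,
    computed as a sum of logs with a Viterbi-style dynamic program.
    """
    # best[token] = (log score of the best path ending in token, that path)
    best = {"<s>": (0.0, [])}
    for options in candidates:
        nxt = {}
        for tok, weight in options.items():
            score, path = max(
                (
                    (s + bigram_logprob(prev, tok) + math.log(weight), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            nxt[tok] = (score, path + [tok])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Step 1 (assumed output): weighted clean candidates for the noisy sentence "c u 2moro".
candidates = [
    {"sea": 0.6, "see": 0.4},
    {"you": 0.9, "u": 0.1},
    {"tomorrow": 1.0},
]

# Step 2: combine the candidate weights with the language model scores.
clean = decode(candidates, toy_bigram_logprob)
print(" ".join(clean))
```

In this toy setup the candidate "sea" has the higher weight for the first token, but the language model scores favor "see you", so the combined objective selects "see". This is the kind of interaction the product of weights and language model scores is meant to capture.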