Tokenization is one of the first steps in almost any text processing task. It is not generally regarded as challenging for English monolingual systems, but it r...
Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce...
Language modeling assigns an a priori probability to a sequence of words, a key component of many natural language applications such as speech recognition and statistic...
We introduce the corpus of United States Congressional bills from 1947 to 1998 for use by language research communities. The U.S. Policy Agenda Legislation Corpus Volume 1 (USPALC...
We propose to model the development of language as a series of formal grammars, accounting for the linguistic capacity of children at the very early stages of mastering language. T...