We introduce the corpus of United States Congressional bills from 1947 to 1998 for use by language research communities. The U.S. Policy Agenda Legislation Corpus Volume 1 (USPALC...
The collection and transcription of speech data is typically an expensive and time-consuming task. Voice over IP and cloud computing are poised to greatly reduce this impediment t...
Ian McGraw, Chia-ying Lee, I. Lee Hetherington, St...
Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly...
Patrick Pantel, Eric Crestan, Arkady Borkovsky, An...
More and more users are contributing and sharing more and more contents on the Web via the use of content hosting sites and social media services. These user-generated contents ...
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an init...
Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe ...