A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an init...
Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe ...
We propose to demonstrate LiquidXML, a platform for managing large corpora of XML documents in large-scale P2P networks. All LiquidXML peers may publish XML documents to be shared...
Text categorization involves mapping of documents to a fixed set of labels. A similar but equally important problem is that of assigning labels to large corpora. With a deluge of ...
For many languages there are no large, general-language corpora available. Until the web, all but the richest institutions could do little but shake their heads in dismay as corpu...
This paper presents a system capable of automatically acquiring subcategorization frames (SCFs) for French verbs from the analysis of large corpora. We applied the system to a lar...
— In this paper, we show experimentally that learning filters are able to classify large corpora of spam and legitimate email messages with a high degree of accuracy. The corpor...