A Statistical Method for Extracting Uninterrupted and Interrupted Collocations from Very Large Corpora

In order to extract rigid expressions with a high frequency of use, a new algorithm that can efficiently extract both uninterrupted and interrupted collocations from very large corpora has been proposed. The statistical method recently proposed for calculating N-grams of arbitrary N can be applied to the extraction of uninterrupted collocations. However, this method extracts such a large volume of fractional and unnecessary expressions that it was impossible to extract interrupted collocations by combining the results. To solve this problem, this paper proposes a new algorithm that restrains the extraction of unnecessary substrings. This is followed by the proposal of a method that enables the extraction of interrupted collocations. The new methods are applied to Japanese newspaper articles comprising 8.92 million characters. In the case of uninterrupted collocations with a string length of 2 or more characters and a frequency of appearance of 2 or more times, there were 4.4 million types of e...
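The idea of counting character substrings and then restraining fractional ones can be sketched roughly as follows. This is an illustrative sketch only, not the authors' algorithm: the function names `count_substrings` and `drop_fractional`, the `max_len` cap, and the suppression rule (drop a substring when a candidate one character longer that contains it is equally frequent) are all assumptions made for the example, and the brute-force counting would not scale to a corpus of millions of characters.

```python
from collections import Counter


def count_substrings(text, min_len=2, max_len=6, min_freq=2):
    """Count every character substring of length min_len..max_len
    and keep those that occur at least min_freq times."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_freq}


def drop_fractional(counts):
    """Drop a substring when a candidate one character longer that
    contains it is just as frequent (assumed rule, for illustration)."""
    redundant = set()
    for s, c in counts.items():
        # The two substrings obtained by trimming one character off either end.
        for part in (s[:-1], s[1:]):
            if counts.get(part) == c:
                redundant.add(part)
    return {s: c for s, c in counts.items() if s not in redundant}


candidates = count_substrings("abcd_abcd")
print(drop_fractional(candidates))  # {'abcd': 2}
```

In this toy input, "ab", "bc", "cd", "abc", and "bcd" each occur twice but only ever as part of "abcd", so only "abcd" survives. A substring that also occurs independently (and is therefore more frequent than any longer candidate containing it) would be kept.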

Satoru Ikehara, Satoshi Shirai, Hajime Uchino

COLING 1996 | COLING 2008 | Method Posed Problems | Statistical Method | Uninterrupted Collocations |

» Aiding Web Searches by Statistical Classification Tools

» Unsupervised knowledge acquisition for Extracting Named Entities from speech

» Design and Prototype of a Large-Scale and Fully Sense-Tagged Corpus

Post Info

Added	02 Nov 2010
Updated	02 Nov 2010
Type	Conference
Year	1996
Where	COLING
Authors	Satoru Ikehara, Satoshi Shirai, Hajime Uchino
