A Figure of Merit for the Evaluation of Web-Corpus Randomness

In this paper, we present an automated, quantitative, knowledge-poor method for evaluating the randomness of a collection of documents (a corpus) with respect to a number of biased partitions. The method compares the word frequency distribution of the target corpus with word frequency distributions from corpora built in deliberately biased ways. We apply the method to the task of building a corpus via queries to Google. Our results indicate that this approach can reliably discriminate between biased and unbiased document collections and can be used to choose the most appropriate query terms.
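
The comparison of word frequency distributions described in the abstract can be illustrated with a minimal sketch. The abstract does not say which distance measure or tokenization the authors use, so everything here is an assumption made for illustration: whitespace tokenization, Jensen-Shannon divergence as the distance, and the hypothetical helper names `word_distribution`, `js_divergence`, and `distances_to_biased`. The toy documents are invented and are not data from the paper.

```python
from collections import Counter
import math


def word_distribution(docs):
    """Relative word frequencies over a list of document strings.

    Assumption: lowercased whitespace tokenization, chosen for brevity.
    """
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Assumption: this measure stands in for whatever distance the
    paper actually uses. It is symmetric and bounded in [0, 1].
    """
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / m)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / m)
    return div


def distances_to_biased(target_docs, biased_corpora):
    """Distance from the target corpus to each deliberately biased corpus."""
    target = word_distribution(target_docs)
    return {
        name: js_divergence(target, word_distribution(docs))
        for name, docs in biased_corpora.items()
    }


# Invented toy data, for illustration only.
target_docs = ["the match ended late", "stock prices rose today"]
biased_corpora = {
    "sports": ["the match ended in a draw", "the team won the match"],
    "finance": ["stock prices fell", "the market rose today"],
}
distances = distances_to_biased(target_docs, biased_corpora)
```

In this sketch, `distances` maps each biased corpus name to its divergence from the target corpus. How such distances are combined into a single figure of merit is not stated in the abstract, so that step is left out.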

Massimiliano Ciaramita, Marco Baroni

EACL 2006 | Natural Language Processing | Target Corpus | Unbiased Document Collections | Word Frequency Distributions

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2006
Where	EACL
Authors	Massimiliano Ciaramita, Marco Baroni
