
ECIR
2007
Springer

Entropy-Based Authorship Search in Large Document Collections

The purpose of authorship search is to identify documents written by a particular author, or in a particular style, in large document collections. Standard search engines match documents to queries based on topic and so are not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory. We propose relative entropy of style markers as the ranking methodology, inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found within large collections. Although effectiveness does degrade as collection size increases, even with 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and is much more scalable than state-of-the-art approaches in term...
Ying Zhao, Justin Zobel
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Where ECIR
Authors Ying Zhao, Justin Zobel
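
The ranking idea in the abstract (scoring documents by the relative entropy of style markers) can be illustrated with a minimal sketch. This is not the authors' implementation. The marker list `STYLE_MARKERS`, the additive smoothing parameter `alpha`, the direction of the divergence (author model against document model), and all function names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical style markers: a tiny list of English function words.
# The abstract only says "simple style markers"; this list is an assumption.
STYLE_MARKERS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]


def marker_distribution(text, markers=STYLE_MARKERS, alpha=1.0):
    """Return an additively smoothed probability distribution over the markers.

    Additive smoothing (alpha) is an assumed choice; it keeps every
    probability non-zero so the divergence below is always finite.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in markers)
    total = sum(counts.values()) + alpha * len(markers)
    return {m: (counts[m] + alpha) / total for m in markers}


def relative_entropy(p, q):
    """Relative entropy (KL divergence) D(p || q) in bits over shared markers."""
    return sum(p[m] * math.log2(p[m] / q[m]) for m in p)


def rank_by_authorship(author_samples, documents):
    """Rank document ids by increasing divergence from the author's marker model."""
    author_model = marker_distribution(" ".join(author_samples))
    scores = {
        doc_id: relative_entropy(author_model, marker_distribution(text))
        for doc_id, text in documents.items()
    }
    # Smaller divergence means the document's marker usage is closer to the author's.
    return sorted(scores, key=scores.get)
```

Under these assumptions, `rank_by_authorship(samples, collection)` takes a list of texts known to be by the author and a dictionary mapping document ids to texts, and returns the ids ordered from most to least similar in marker usage. A realistic system would need a much larger marker set and a smoothing scheme suited to short documents.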