N-Gram Feature Selection for Authorship Identification

Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to representing text for stylistic purposes, since they can capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors showing that an increase in perfor...
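To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of ranking character n-grams by information gain. It is not the authors' implementation and does not reproduce the proposed variable-length selection method. The function names (`char_ngrams`, `entropy`, `information_gain`, `top_ngrams`), the n-gram lengths 3 to 5, and the binary "n-gram occurs in the document" feature are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def char_ngrams(text, n_values=(3, 4, 5)):
    """Count every character n-gram of the given lengths (lengths are an assumption)."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(labels):
    """Shannon entropy, in bits, of a list of class (author) labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, ngram):
    """Information gain of the binary feature 'ngram occurs in the document'."""
    present = [lab for doc, lab in zip(docs, labels) if ngram in doc]
    absent = [lab for doc, lab in zip(docs, labels) if ngram not in doc]
    total = len(labels)
    conditional = ((len(present) / total) * entropy(present)
                   + (len(absent) / total) * entropy(absent))
    return entropy(labels) - conditional


def top_ngrams(docs, labels, k=10, n_values=(3, 4, 5)):
    """Return the k n-grams with the highest information gain (ties sorted alphabetically)."""
    vocab = set()
    for doc in docs:
        vocab.update(char_ngrams(doc, n_values))
    ranked = sorted(vocab, key=lambda g: (-information_gain(docs, labels, g), g))
    return ranked[:k]
```

Called as `top_ngrams(texts, authors, k=100)`, where `texts` is a list of strings and `authors` a parallel list of author labels, this returns the highest-scoring n-grams. Mixing several lengths in one vocabulary is only the simplest way to obtain a variable-length feature set. It scores each n-gram independently and ignores the overlap between an n-gram and its longer or shorter neighbours.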

John Houvardas, Efstathios Stamatatos

AIMSA 2006 | Artificial Intelligence | Authorship Identification | Character N-grams | Variable-length N-gram Approach |

Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2006
Where	AIMSA
Authors	John Houvardas, Efstathios Stamatatos
