Raising the baseline for high-precision text classifiers

Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions against the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which we show to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.

General Terms: Algorithms

Keywords: high precision text classification, Naive Bayes, low false positiv...
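The document length normalization mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-class multinomial Naive Bayes with equal class priors and Laplace smoothing, and the names `train_log_ratios` and `nb_score` are hypothetical. Treating each term occurrence as an equally weighted expert in a logarithmic opinion pool amounts to averaging the per-term log-likelihood ratios rather than summing them, which is the same as dividing the Naive Bayes score by the document length.

```python
import math
from collections import Counter


def train_log_ratios(spam_docs, ham_docs, alpha=1.0):
    """Per-term log P(t|spam) - log P(t|ham), with Laplace smoothing (assumed)."""
    spam_counts = Counter(t for doc in spam_docs for t in doc)
    ham_counts = Counter(t for doc in ham_docs for t in doc)
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + alpha * len(vocab)
    ham_total = sum(ham_counts.values()) + alpha * len(vocab)
    return {
        t: math.log((spam_counts[t] + alpha) / spam_total)
        - math.log((ham_counts[t] + alpha) / ham_total)
        for t in vocab
    }


def nb_score(doc, log_ratios, normalize=False):
    """Naive Bayes log-odds of spam (equal priors assumed, so no prior term).

    normalize=False: plain sum over term occurrences.
    normalize=True:  mean over term occurrences, i.e. a logarithmic opinion
                     pool in which every occurrence gets weight 1/length.
    """
    known = [t for t in doc if t in log_ratios]  # ignore out-of-vocabulary terms
    if not known:
        return 0.0
    total = sum(log_ratios[t] for t in known)
    return total / len(known) if normalize else total


# Toy data, invented purely for illustration.
spam_docs = [["free", "money", "now"], ["free", "offer"]]
ham_docs = [["meeting", "now"], ["project", "offer", "notes"]]
log_ratios = train_log_ratios(spam_docs, ham_docs)

short_doc = ["free", "money"]
long_doc = short_doc * 10  # same content repeated ten times

raw_short = nb_score(short_doc, log_ratios)
raw_long = nb_score(long_doc, log_ratios)
norm_short = nb_score(short_doc, log_ratios, normalize=True)
norm_long = nb_score(long_doc, log_ratios, normalize=True)
```

In this sketch the unnormalized score of the repeated document is ten times that of the short one, while the normalized scores coincide, so a single decision threshold behaves consistently across documents of different lengths.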

Aleksander Kolcz, Wen-tau Yih

Appropriate Document Representation | Data Mining | Fine-tuning Document Representation | KDD 2007 | Naive Bayes

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2007
Where	KDD
Authors	Aleksander Kolcz, Wen-tau Yih
