This paper introduces the SpamBayes classification engine and outlines the most important features and techniques which contribute to its success. The importance of using the indeterminate ‘unsure’ classification produced by the chi-squared combining technique is explained. It outlines a Robinson/Woodhead/Peters technique of ‘tiling’ unigrams and bigrams to produce better results than relying solely on either or other methods of using both unigrams and bigrams. It discusses methods of training the classifier, and evaluates the success of different methods. The paper focuses on highlighting techniques that might aid other classification systems rather than attempting to demonstrate the effectiveness of the SpamBayes classification engine.
Tony A. Meyer, Brendon Whateley