
TrainOnErrorsAndUnsures



1. Train on Errors and Unsures

By training only on messages classified as unsure and on messages which are clearly mistakes, you gain several advantages over training on all mail. First, presumably you will have "approved" every message added to your training database, so it is less likely to accumulate mistakes. Second, since you'll be creating your training database more or less manually, it will likely be much smaller. Scoring messages against a smaller database is likely to be faster, and if you do encounter mistakes, it should be fairly easy to recover from them, even if you simply delete your current training database and start over. Third, some of the messages you receive may be nearly identical and arrive in large quantities (e.g. CVS logs for software developers or, more generally, automatically generated messages). Training on all such messages skews the scores of the words they contain.
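The basic rule can be sketched in a few lines of Python. This is only an illustration, not SpamBayes code: the function names are made up for this example, the cutoff values 0.20 and 0.90 are assumed defaults, and the handling of scores exactly on a cutoff is an arbitrary choice.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'.

    The cutoff values and boundary handling are assumptions for this sketch.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def should_train(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return True if the message is an unsure or was misclassified."""
    verdict = classify(score, ham_cutoff, spam_cutoff)
    if verdict == "unsure":
        return True
    actual = "spam" if is_spam else "ham"
    # Train on mistakes only: false positives and false negatives.
    return verdict != actual
```

A correctly classified message (say, a ham scoring 0.05) is skipped, while the same score on a spam is a false negative and gets trained.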

Should you train on all messages which are classified as unsure? Probably not. Here are a couple of reasons why you might want to be careful about blindly training on all unsures:

So, which unsures should you train on? How many? There are no obviously correct answers to these questions. I have had reasonable success by training on nearly all unsures which are hams (all but those which are extremely close to my ham_cutoff) and only training on the lowest-scoring unsures which are spam. My reasoning is

SkipMontanaro
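One possible reading of the selection policy described above, as a hedged Python sketch. The page gives no numbers, so `ham_margin` (how close to ham_cutoff counts as "extremely close") and `max_spam` (how many of the lowest-scoring spam unsures to keep) are hypothetical parameters invented for this example, as is the function name.

```python
def select_unsures(unsures, ham_cutoff=0.20, ham_margin=0.02, max_spam=5):
    """Pick which unsure messages to train on.

    unsures: list of (score, is_spam) pairs, all from the unsure band.
    ham_margin and max_spam are hypothetical knobs, not SpamBayes options.
    """
    # Ham unsures: keep all except those sitting right next to ham_cutoff.
    hams = [(score, is_spam) for score, is_spam in unsures
            if not is_spam and score - ham_cutoff > ham_margin]
    # Spam unsures: keep only the lowest-scoring few.
    spams = sorted((pair for pair in unsures if pair[1]),
                   key=lambda pair: pair[0])[:max_spam]
    return hams + spams
```

With `max_spam=2`, a batch containing hams at 0.21 and 0.40 and spams at 0.30, 0.50 and 0.85 would yield the ham at 0.40 plus the spams at 0.30 and 0.50.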


This combined training approach seems to work well. Sample stats are here:

This method seems to keep my training set relatively balanced for long periods of time, since few messages are trained after the initial training session. It also ensures that my initial training data is a "current" sub-sample of my actual email mix. I currently get 0% false positives, 94.4% correct spam capture rate, and about 3.8% unsures, the vast majority of which are spam and not ham. I only "miss" 1.4% of spam, which is pretty good, I think. You might even consider lowering the spam threshold with this method to improve the capture rate and decrease the number of unsures. Here are my statistical results using this method.

RyanMalayter


The pure form of this tactic (without an initial train-on-everything phase) is represented by the fpfnunsure regime for the incremental test harness. While it does train on a nicely low number of messages, it suffers from significantly higher error and unsure rates than either TrainingIdeas/TrainOnEverything or TrainingIdeas/TrainOnLots. Have some pictures:

http://www.wolfskeep.com/~popiel/spambayes/plots/fpfnunsure.png
http://www.wolfskeep.com/~popiel/spambayes/plots/fpfnunsurespan.png
http://www.wolfskeep.com/~popiel/spambayes/plots/fpfnunsuretrained.png
http://www.wolfskeep.com/~popiel/spambayes/plots/fpfnunsurespantrained.png

See AlexDemoTestSet for details on the test data.

AlexPopiel
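For readers who want a feel for what a train-only-on-fp/fn/unsure loop looks like, here is a minimal Python sketch. It is not the incremental test harness: `ToyClassifier` is a hypothetical stand-in that averages per-token spam fractions, `run_fpfnunsure` is a name made up for this example, and the cutoffs are assumed values.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (not the SpamBayes one)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average per-token spam fraction; unseen tokens count as 0.5.
        fractions = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            fractions.append(s / (s + h) if s + h else 0.5)
        return sum(fractions) / len(fractions) if fractions else 0.5


def run_fpfnunsure(messages, clf, ham_cutoff=0.20, spam_cutoff=0.90):
    """Score (tokens, is_spam) messages in order, training only on
    false positives, false negatives, and unsures. Returns outcome counts."""
    counts = Counter()
    for tokens, is_spam in messages:
        p = clf.score(tokens)
        if p < ham_cutoff:
            outcome = "fn" if is_spam else "ok"
        elif p >= spam_cutoff:
            outcome = "ok" if is_spam else "fp"
        else:
            outcome = "unsure"
        counts[outcome] += 1
        if outcome != "ok":
            clf.train(tokens, is_spam)
            counts["trained"] += 1
    return counts
```

Starting from an empty classifier, the first messages all land in the unsure band and get trained; once a message scores confidently and correctly, it is no longer added to the training set, which is why the trained count stays low.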