TrainOnEverything

1. Train on Everything

Training on every message may be the easiest approach, especially if your setup lets you automatically train on messages that are scored as ham or spam. That leaves only the unsure messages for you to handle manually. You will still have to at least skim all your spam, though. If a ham is mistakenly scored as spam, not only might you miss it, but it will also be trained as spam. That adds a mistake to your training database and makes it more likely that similar messages received in the future will be misscored as well, which in turn adds still more mistakes to the database. As you can see, this can be a difficult cycle to break.
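To make the flow above concrete, here is a minimal sketch of a train-on-everything loop. It is not SpamBayes code: `classify`, `train_on_everything`, and the `train` and `ask_user` callbacks are hypothetical names, and the 0.20 and 0.90 cutoffs are only assumed example values that you would replace with whatever your own setup uses.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to 'ham', 'unsure', or 'spam'.

    The cutoff values are assumptions made for this sketch.
    """
    if score < ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "unsure"
    return "spam"


def train_on_everything(scored_messages, train, ask_user):
    """Train on every message, asking the user only about unsure ones.

    scored_messages: iterable of (message, score) pairs.
    train: hypothetical callback, train(message, is_spam).
    ask_user: hypothetical callback, ask_user(message) -> True if spam.
    """
    for message, score in scored_messages:
        label = classify(score)
        if label == "unsure":
            is_spam = ask_user(message)
        else:
            # Automatic training: a misscored ham lands here and is
            # trained as spam unless someone reviews the spam folder.
            is_spam = (label == "spam")
        train(message, is_spam)
```

Note that the automatic branch trusts the score completely, which is exactly where the feedback cycle described above comes from.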

A further problem with training on all messages is that your ham and spam databases can end up badly out of balance in size. SpamBayes works best when your training database has similar numbers of ham and spam. You'll need to keep an eye on this aspect of the system if you train on all mail you receive. As of late 2003, very little work has been done to alert users to out-of-balance conditions, and even less on ways to keep the database from getting out of balance in the first place.
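A balance check of the kind described above could be as simple as the following sketch. The function name `balance_warning` and the `max_ratio` threshold of 3.0 are assumptions chosen for illustration; they are not taken from SpamBayes, and a sensible threshold may well differ for your mail.

```python
def balance_warning(nham, nspam, max_ratio=3.0):
    """Return a warning string if the trained counts look lopsided.

    nham, nspam: number of ham and spam messages trained so far.
    max_ratio: assumed threshold for larger/smaller before warning.
    Returns None when the counts are acceptably balanced or both zero.
    """
    if nham == 0 and nspam == 0:
        return None  # nothing trained yet, nothing to report
    smaller, larger = sorted((nham, nspam))
    if smaller == 0 or larger / smaller > max_ratio:
        return "training data out of balance: %d ham vs %d spam" % (nham, nspam)
    return None
```

Running a check like this after each training batch would at least tell you when it is time to stop training on one class for a while.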

SkipMontanaro

Train on Everything learns fairly quickly and has reasonably good accuracy over extended periods, though its accuracy does seem to decay over time. Here are some pictures, generated with the incremental test harness using the (soon to be renamed) perfect regime:

Details of the test set are in AlexDemoTestSet.

AlexPopiel

