TrainingIdeas


1. Training Strategies

Unlike the core classifier's algorithms, approaches to training the classifier have not been well tested. Different people train differently for a number of reasons, including ease of use, effectiveness and curiosity. There seem to be two main training strategies used by most people: train on everything and train on mistakes. The second subsumes train on mistakes and unsures if you treat unsures as mistakes (which, strictly speaking, they aren't).

SkipMontanaro
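
To make the two strategies concrete, here is a minimal sketch in Python. It assumes a SpamBayes-style classifier exposing learn(tokens, is_spam) and spamprob(tokens) (the method names follow spambayes.classifier, but treat the exact interface, and the stub tokenizer, as assumptions):

    HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90   # the default SpamBayes thresholds

    def tokenize(msg):
        # Stand-in for the real SpamBayes tokenizer.
        return msg.split()

    def train_on_everything(bayes, messages):
        # messages is a list of (msg, is_spam) pairs; every manually
        # classified message goes into the database.
        for msg, is_spam in messages:
            bayes.learn(tokenize(msg), is_spam)

    def train_on_mistakes(bayes, messages, include_unsures=False):
        # Train only on misclassified messages; include_unsures=True gives
        # the "train on mistakes and unsures" variant mentioned above.
        for msg, is_spam in messages:
            score = bayes.spamprob(tokenize(msg))
            if score >= SPAM_CUTOFF:
                verdict = "spam"
            elif score < HAM_CUTOFF:
                verdict = "ham"
            else:
                verdict = "unsure"
            mistake = (verdict == "spam" and not is_spam) or \
                      (verdict == "ham" and is_spam)
            if mistake or (include_unsures and verdict == "unsure"):
                bayes.learn(tokenize(msg), is_spam)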

1.1. Some Training Aphorisms

I post these here with no proof that they are true, just that they work for me and seem to have been used by more of the developers than just me (SkipMontanaro). Oh, and Tim told me to do it.

SkipMontanaro

1.2. Training Tactics

Skip has laid out some good precepts and what seems to work best based on a lot of collective experience, even if Tim did make him do it. Now let's list all the current possibilities for training and then discuss them in more detail. If nothing else, more people will understand what works, what doesn't, and why. These are listed in order of the number of new messages that require training, from smallest to largest.

For all of the above (except TrainToExhaustion), there is the optional but recommended step of:

There is an ongoing debate about the effects of database size on performance. So let's arbitrarily say the database size philosophies are:

Assuming there is no mechanism for pruning the token databases (more on that later), here are some approaches for when, and if, to terminate training. For all of these, the initial training set size and the training tactic selected determine the quality of the result.

For all of these, there is the additional option to periodically adjust the ham and spam thresholds. Finally, if performance degrades over time, start over.

SethGoodman
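
For reference, the ham and spam thresholds mentioned above live in the [Categorization] section of the SpamBayes options file (e.g. bayescustomize.ini), with defaults of 0.2 and 0.9. A hand-tuned file might look like this (the values shown are purely illustrative, not recommendations):

    [Categorization]
    # Scores below ham_cutoff are ham, at or above spam_cutoff are spam,
    # and anything in between is unsure. The defaults are 0.2 and 0.9.
    ham_cutoff: 0.30
    spam_cutoff: 0.85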

This is turning into a great outline of the various training strategies. Is anyone out there willing to put code/time into pulling out some numbers? I believe that the timcv.py script does (effectively) the 'train on everything' strategy. There's also a script to test 'incremental' training, which could probably be used as the basis for testing some of these other ideas (the ones that work with the current database setup). It would be great if someone would look into that and write up a recipe for testing that others can follow.

TonyMeyer

1.3. Picking the Initial Training Set

Most of us just collect what we think is a reasonable number of manually classified ham and spam into two folders and train on them. Depending on how you choose these, your mileage will vary a lot. Aside from the number of ham and spam to choose (equal numbers seem to work better, I'm told), which ones you choose is more critical than you might think. The goal is to get the token databases to each have good coverage of the tokens we expect to see in our incoming mail stream, with correct probabilities. In fact, the choice of the message corpus to train on actually defines the training tactic.

After picking a training tactic, which determines which messages can go in the corpus, and picking a corpus size (how many messages), there are still different ways to select a training set out of the message corpus you just constructed. The most obvious approach is to train on every message in the corpus. Aside from taking a long time and creating giant databases, this is not necessarily optimal, for the reasons Skip gives in several sections of this wiki. As with any problem of this type, a much smaller subset of the messages can give an equally good statistical representation of the message corpus, with all the advantages of smaller databases.

So how do we identify this message subset to train on? Skip created a terrific, but labor-intensive, algorithm, which he posted to the mailing list. For anyone who missed it, here is SkipsRecursiveTrainingSetSelectionAlgorithm, shamelessly repeated without his permission. This is an incredibly good algorithm. The key insight is that adding a particular message to the database affects the classification of many other messages in ways that are not always obvious. You train on a very small initial set, use the result to classify a larger message corpus, train a few of the messages that classify the worst, and repeat until you can't see straight. You will end up with a terrific, yet small, training set. I did this with the Outlook plug-in rather than Skip's scripts, but the result is the same. Though it's only been a few days, it is performing much better than the random method I previously used, and the database is much smaller. It also had better classification performance on a corpus of 6,000 previous messages than I imagined possible: no errors and 7 unsures. For anyone using the Outlook plug-in, SkipsRecursiveTrainingSetSelectionForOutlook is an adaptation of Skip's algorithm for Outlook.
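
A rough sketch of this selection loop in Python (same assumed classifier interface as the sketch in section 1, with an illustrative badness threshold and batch size; Skip's actual procedure differs in detail):

    def select_training_set(bayes, corpus, seed, batch=10, max_rounds=50):
        # corpus and seed are lists of (msg, is_spam) pairs. Train the
        # small seed first, then repeatedly train the worst-scoring
        # messages until everything classifies comfortably.
        trained = []
        for msg, is_spam in seed:
            bayes.learn(tokenize(msg), is_spam)
            trained.append((msg, is_spam))

        def badness(item):
            # Distance of the score from the correct extreme
            # (1.0 for spam, 0.0 for ham).
            msg, is_spam = item
            score = bayes.spamprob(tokenize(msg))
            return (1.0 - score) if is_spam else score

        for _ in range(max_rounds):
            remaining = [m for m in corpus if m not in trained]
            remaining.sort(key=badness, reverse=True)
            worst = [m for m in remaining[:batch] if badness(m) > 0.1]
            if not worst:
                break   # nothing classifies badly any more; stop
            for msg, is_spam in worst:
                bayes.learn(tokenize(msg), is_spam)
                trained.append((msg, is_spam))
        return trained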

Update May 19, 2004: I retrained on January 8, 2004 using SkipsRecursiveTrainingSetSelectionForOutlook with a training set of 640 ham + 640 spam out of a corpus of about 7,000 messages. Since then I have added a few messages per day when they are unsure, mistakes, or classify 'too close' to their respective thresholds. Performance has remained fairly consistent, with the following overall stats: unsures = 3.9% (284), false positives = 0.0% (none), false negatives = 0.3% (13), based on 5,032 incoming messages. The training set currently has 858 ham and 1,023 spam.

SethGoodman

1.4. When and If To Terminate Training

It's unlikely that you will ever completely stop training. Over time, the spammers try different tricks, they promote new come-ons, and they move from ISP to ISP. Your training database probably needs to evolve to adapt.

1.5. Manual Pruning of Databases

Currently, if the database grows out of bounds, the recipe is to throw it away and replace it with a new one.

Q: I'd like to see a user interface for manual maintenance of the database: see statistics, see the tokens and their ratings, sort them, and remove tokens with a low count (hapaxes).

While most training ideas focus on automatic database adaptation, my proposal is manual maintenance. Let me explain in a bit more detail:

A: You can do this now if you want to. Use the sb_dbexpimp.py script to convert the database to CSV. Open the CSV file in (e.g.) Excel. Sort the columns. Remove or change the counts as you like. Save the file. Use sb_dbexpimp.py to convert the database back to whatever format it was in before.

However, this is not a good idea. SpamBayes assumes that tokens are added to or removed from the database in groups (i.e. whole messages). If you remove individual tokens, rather than complete messages, you may very well stuff up the calculations. --TonyMeyer

Q: Thanks for this tip. I tried sb_dbexpimp.py and it turned out well.

Converting the database to CSV, I realized that only the total number of trained ham/spam messages is saved; there is no record of which tokens came from the same message.

My idea was to keep the strong hints but prune the weak ones. Manually pruning all tokens with count < 3 reduced my database from 120,000 tokens to 600. The number of unsure mails increased for a while, but with very little retraining the filter again works satisfactorily within my limited resources. -- GuenterMilde

A: There has been quite a lot of discussion about database pruning, including pruning hapaxes, on the lists (spambayes@python.org and spambayes-dev@python.org); a Google search of the list archives for those terms is a good start. It would be easier to continue this discussion there (and you'd get the benefit of the opinions of others who might not read this wiki), if that's OK with you. --TonyMeyer

OK -- GuenterMilde
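
For anyone who wants to reproduce this kind of count-based pruning without Excel, here is a minimal sketch. It assumes the database has already been exported to CSV with sb_dbexpimp.py (check its --help for the exact flags) and that each data row holds a token followed by its spam and ham counts; verify the column layout of your own export before running anything like this, and keep Tony's caveat above in mind:

    import csv

    # Drop tokens whose combined count is below MIN_COUNT. The
    # (token, spamcount, hamcount) column layout is an assumption --
    # inspect your own sb_dbexpimp.py export first.
    MIN_COUNT = 3

    with open("bayes-export.csv", newline="") as src, \
         open("bayes-pruned.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))   # first row: trained ham/spam totals
        for token, spamcount, hamcount in reader:
            if int(spamcount) + int(hamcount) >= MIN_COUNT:
                writer.writerow([token, spamcount, hamcount])

The pruned file can then be imported back into a fresh database with sb_dbexpimp.py.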

1.6. Self-Pruning Databases

If you keep adding to your training set, it will probably grow larger than you want over time. In addition, if the characteristics of your message stream change over time, the older messages in the training set are no longer a good representation of the current message stream. This motivates schemes for expiring tokens and/or messages to keep the training set to a reasonable size and representative of the current message stream. This is somewhat controversial, and several people are experimenting with it to see whether real gains can be had and whether it is broadly applicable.

For more info, see AdaptiveTraining.

SethGoodman
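
As a rough illustration of expiring whole messages (rather than individual tokens, which has the problem Tony describes above), here is a sketch that keeps a FIFO window of trained messages. It assumes the classifier exposes unlearn() as well as learn(), as spambayes.classifier does; the window size is arbitrary:

    from collections import deque

    class ExpiringTrainer:
        # Keep at most max_messages trained messages; once the window is
        # full, untrain the oldest message before training the newest, so
        # the database tracks the recent message stream.
        def __init__(self, bayes, max_messages=2000):
            self.bayes = bayes
            self.max_messages = max_messages
            self.window = deque()

        def train(self, msg, is_spam):
            if len(self.window) >= self.max_messages:
                old_msg, old_is_spam = self.window.popleft()
                self.bayes.unlearn(tokenize(old_msg), old_is_spam)
            self.bayes.learn(tokenize(msg), is_spam)
            self.window.append((msg, is_spam))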

1.7. Cluster Algorithm

We all know the technique of "spelling variations" like via-gra to avoid easily catchable keywords. A clustering algorithm could be run on a mature database to find clusters of related tokens. The tokens would then be replaced by a cluster representative, plus a new function to compare message tokens to these clusters.

Ratings are then based on the distance of the probed token to the cluster token, times the ham/spam value of the cluster. Besides keeping the database smaller, this also catches new variations. -- GuenterMilde

How would you define "related"? Tokens that appear together? Tokens that substitute for each other? (If the latter, how do you figure that out?) Tokens that have a small edit distance? --TonyMeyer

There are several ways to calculate a distance between two strings. In PHP (unfortunately, not yet in Python), I found some ready-made functions.
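
For what it's worth, Python's standard library does ship difflib, whose SequenceMatcher gives a usable similarity ratio between two strings; it is one candidate for the small-edit-distance notion of "related" above:

    import difflib

    def similarity(a, b):
        # Ratio in [0, 1]; 1.0 means identical. Rough but workable for
        # flagging obfuscated spellings as related tokens.
        return difflib.SequenceMatcher(None, a, b).ratio()

    print(similarity("viagra", "via-gra"))   # ~0.92
    print(similarity("viagra", "v1agra"))    # ~0.83
    print(similarity("viagra", "mortgage"))  # low -- unrelated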

1.8. Incremental Training

Some comments on this feature seem in order, to add to the SpamBayes documentation.

Incremental training options are turned on by default in the Manager. When an imbalance occurs in the trained messages, it may be useful to intervene using this feature.

One approach would be to move new ham out of the Inbox and then select "Recover from Spam", which could improve the balance of trained ham and spam.