1. Train on Errors and Unsures
By training only on messages classified as unsure and on messages which are clearly mistakes, you gain several advantages over training on all mail. First, presumably you will have "approved" all messages added to your training database, so it is less likely to accumulate mistakes. Second, since you'll be creating your training database more or less manually, it will likely be much smaller. Scoring messages against a smaller database is likely to be faster, and if you do encounter mistakes, it should be fairly easy to recover from them, even if you simply delete your current training database and start over. Third, some of the messages you receive may be nearly identical and arrive in large quantities (e.g. CVS logs for software developers, or more generally, automatically generated messages). Training on all such messages pollutes the scores of the words they contain.
Should you train on all messages which are classified as unsure? Probably not. Here are a few reasons why you might want to be careful about blindly training on all unsures:
-
Some messages are truly difficult to classify. Consider a message sent to the spambayes mailing list asking about a spam which is attached. There are clearly going to be strong ham and spam clues in the message. It's probably better to just live with that sort of stuff landing in your unsure mailbox than "polluting" your training database with a questionable message.
-
Spam comes in bunches. When a spammer gets a new email account, it won't take very long for it to be closed after he begins sending mail, so he has to send a lot in a hurry. Consequently, you're likely to get multiple spams at about the same time that contain several correlating clues (content and/or transport clues). Training any one of that bunch of messages as spam will likely push the rest of the bunch into or very near to the spam group. Also, some types of spam seem to be sent in cycles. Right now (late 2003), I'm starting to see lots of Christmas gift spam. That will dry up shortly after the new year and I won't see much of it again until late 2004.
-
After you have a reasonable training database built, most unsures will actually be spam. If you train on all of them, you (once again) run the risk of seeing your training database get out-of-whack.
So, which unsures should you train on? How many? There are no obviously correct answers to these questions. I have had reasonable success by training on nearly all unsures which are hams (all but those which are extremely close to my ham_cutoff) and only training on the lowest scoring unsures which are spam. My reasoning is:
-
I get more spam than ham, so it makes sense that I will have to do something to try keeping my ham counts and spam counts close, so I want to be more selective about adding spams to my training database. (As I write this, my training database has 368 spams and 215 hams.)
-
By training on a single low-scoring spam from the unsure population, I am quite likely to push one or more higher scoring spams from the unsure population toward or into the spam population.
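The selection policy described above can be sketched roughly as follows. This is only an illustration, not SpamBayes code: the function name, the tuple layout, the 0-100 score scale, the cutoff values, and the `ham_margin` and `max_spam` parameters are all assumptions made for the example.

```python
# Assumed cutoffs on a 0-100 scale (hypothetical values for illustration).
HAM_CUTOFF = 20.0
SPAM_CUTOFF = 80.0


def pick_unsures_to_train(unsures, ham_margin=2.0, max_spam=1):
    """Choose which hand-reviewed unsure messages to train on.

    unsures: list of (msg_id, score, is_spam) tuples, where is_spam is the
    human verdict and score lies between HAM_CUTOFF and SPAM_CUTOFF.
    """
    hams = [u for u in unsures if not u[2]]
    spams = [u for u in unsures if u[2]]

    # Train on nearly all ham unsures, skipping only those that already
    # score extremely close to the ham cutoff.
    train_ham = [u for u in hams if u[1] > HAM_CUTOFF + ham_margin]

    # Train only on the lowest-scoring spam unsures; the hope is that this
    # pushes the higher-scoring ones toward the spam group.
    train_spam = sorted(spams, key=lambda u: u[1])[:max_spam]

    return train_ham, train_spam


unsures = [
    ("h1", 21.0, False),
    ("h2", 35.0, False),
    ("s1", 45.0, True),
    ("s2", 70.0, True),
    ("s3", 78.0, True),
]
train_ham, train_spam = pick_unsures_to_train(unsures)
```

With this sample data, the ham sitting right next to the cutoff is skipped and only the single lowest-scoring spam is selected. After training, you would rescore the remaining unsures before deciding whether any more spam needs to be added.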
This combined training approach seems to work well. Sample stats are here:
-
Initial training is "train on everything"
-
I save 30 days' worth of spam messages (about 1750) in a folder called "Spam-Quarantine". That might sound like a lot, but I've had the same primary email address for over 7 years now. Also, I'm the mail system administrator for my company, and I test spam filters quite a bit, so for a while I was actually trying to get more spam. Big mistake. But then I ran across SpamBayes, and life has returned to normal.
-
I create an empty temporary folder in Outlook called "Ham for training".
-
I use Outlook's Advanced Find feature to search for all mail messages newer than 30 days that are not in the "Spam-Quarantine" folder. This is easy with Outlook XP or 2003, because you can deselect individual folders from the search. I do not include Sent Items, Deleted Items, or the calendar.
-
I adjust the cutoff date for this search. If there are more than 1750 messages, I make it less than 30 days. If there are fewer than 1750 messages, I move the date the other way. I tweak the date until the number of ham messages returned by the search is about the same as the number of messages in "Spam-Quarantine", in this case 1750.
-
I copy all messages from this search into the "Ham for training" folder by selecting them all and doing a drag-and-drop in Outlook while holding down the Ctrl key. Make sure you copy the messages rather than move them, because it's hard to get them back into the right folders once they have been moved.
-
I open SpamBayes Manager, and train using only my "Ham for training" and "Spam-Quarantine" folders. Make sure the "rebuild database" option is selected.
-
I delete the "Ham for training" folder and all messages in it (which are only copies anyway).
-
Continuing training is "train on errors & unsures"
-
I set my ham and spam thresholds to 20 and 80, respectively. I have never had a ham score above 60, much less 80, since I've been using SpamBayes. Only very rarely does a spam score below 20. These limits ensure a minimal number of "unsures" that require my attention, while letting me feel confident I won't get a false positive. You might want to be more conservative with the spam score, and choose 90 or above.
-
I train on all mis-classified messages, both ham and spam, of which there are very, very few.
-
I train on all unsure messages, both spam and ham, of which I get 1-2 a day out of hundreds of messages.
-
This training method performs well for several months at a time. If I feel performance is getting worse (this comes in the form of lots of unsure messages, not mis-classified messages), I re-do the initial training and start over.
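The "train on errors & unsures" phase above boils down to a simple rule, sketched here. The function names are hypothetical, scores are assumed to be on the 0-100 scale used for the thresholds above, and the treatment of scores exactly equal to a threshold is an assumption rather than documented SpamBayes behaviour.

```python
# Thresholds from the text, on an assumed 0-100 scale.
HAM_THRESHOLD = 20.0
SPAM_THRESHOLD = 80.0


def classify(score):
    """Map a score to 'ham', 'spam', or 'unsure'.

    Assumption: scores exactly on a threshold count as unsure.
    """
    if score < HAM_THRESHOLD:
        return "ham"
    if score > SPAM_THRESHOLD:
        return "spam"
    return "unsure"


def needs_training(score, actual):
    """Return True if the message should be added to the training data.

    actual is the human verdict, 'ham' or 'spam'. Both mis-classified
    messages and unsures differ from the verdict, so one comparison
    covers errors and unsures alike.
    """
    return classify(score) != actual


# A spam that landed in the unsure range gets trained;
# a spam that was already caught does not.
train_unsure_spam = needs_training(50.0, "spam")
train_caught_spam = needs_training(95.0, "spam")
```

Correctly classified messages are never trained under this rule, which is why so few messages get added after the initial session.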
This method seems to keep my training set relatively balanced for long periods of time, since few messages are trained after the initial training session. It also ensures that my initial training data is a "current" sub-sample of my actual email mix. I currently get 0% false positives, 94.4% correct spam capture rate, and about 3.8% unsures, the vast majority of which are spam and not ham. I only "miss" 1.4% of spam, which is pretty good, I think. You might even consider lowering the spam threshold with this method to improve the capture rate and decrease the number of unsures.
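The balancing act from the initial training (tweaking the search cutoff date until the ham count roughly matches the spam count) is done by hand in Outlook, but the idea can be sketched in code. Everything here is hypothetical: the function name, its parameters, and the assumption that you have a list of ham received dates available.

```python
from datetime import date, timedelta


def balanced_cutoff_days(ham_dates, target_count, today, max_days=365):
    """Return the look-back window in days whose ham count is closest
    to target_count (for example, the number of quarantined spams).

    On ties, the shortest window wins, which keeps the sample current.
    """
    best_days, best_gap = None, None
    for days in range(1, max_days + 1):
        cutoff = today - timedelta(days=days)
        # Count hams received on or after the cutoff date.
        count = sum(1 for d in ham_dates if d >= cutoff)
        gap = abs(count - target_count)
        if best_gap is None or gap < best_gap:
            best_days, best_gap = days, gap
    return best_days


# Synthetic example: 10 hams per day for the last 60 days.
today = date(2004, 1, 31)
ham_dates = [today - timedelta(days=k) for k in range(60) for _ in range(10)]
window = balanced_cutoff_days(ham_dates, target_count=200, today=today)
```

A shorter window means the ham sample is more recent, which fits the goal of training on a "current" sub-sample of the real mail mix.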
Here are my statistical results using this method.
The pure form (without an initial train on everything phase) of this tactic is represented by the fpfnunsure regime for the incremental test harness. While it does have a nice low number of messages trained, it suffers from significantly higher error and unsure rates than either TrainingIdeas/TrainOnEverything or TrainingIdeas/TrainOnLots. Have some pictures:
See AlexDemoTestSet for details on the test data.
