
1. Adaptive Training/Self-Pruning Databases

If there were a mechanism for pruning the token databases, many more possibilities would open up. Why bother? Because pruning has the potential to make the program more robust and perhaps also to reduce false negatives. The databases would also be smaller, which is generally a good thing. The main advantage, though, is that the databases could track the message stream and continuously adapt to it in a fairly optimal way.

This has both advantages and disadvantages, as do all adaptive, self-learning systems. First, the advantages. Pilot error (a user trains a message into the wrong class) is very hard to correct by subsequent training, and you often wind up starting over from scratch; some users are more prone to this than others. If old messages were pruned from the databases, mistakes like this would be self-correcting. If the content of your message stream gradually changes over time (friends get crazier, spam mutates), the classifier's performance will gradually deteriorate. Adding new information to the database helps for a while, but as the old information in the database gets further from a good statistical estimate of the incoming message stream, performance will deteriorate. Rather than periodically starting over, a self-pruning, continuous learning database has the potential to track the current message stream.

Now for the disadvantages, mostly mathematical. Our training set selection problem belongs to a class of adaptive algorithms known in the field of digital signal processing as adaptive signal estimation, or more specifically, blind adaptive signal estimation. Since we don't know a priori what the future signal (message stream) will look like, we assume that the message stream statistics are stationary (don't change with time), or at least change only slowly. We then estimate the statistics using a recent sample of the message stream (a message corpus). We filter new messages (tokenize and assign scores) using the estimate, which gives a scalar result, and apply a decision rule to classify the message as ham, spam or unsure. The combined process of sampling, filtering and applying a decision rule (receiving a message, tokenizing, scoring and classifying) is what is known as an estimated classifier. Finally, we do further training to make the process adaptive, which we hope will improve the estimated classifier, or at least maintain its performance, over time.
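The sample/filter/decision-rule chain above can be sketched in Python. This is an illustrative toy only: SpamBayes really combines token probabilities with a chi-squared scheme, while this sketch just averages made-up per-token spam probabilities; the function names, probabilities and cutoffs are all assumptions, not SpamBayes code.

```python
# Toy sketch of the sample -> filter -> decision-rule pipeline described
# above. Not the real SpamBayes math (which uses chi-squared combining).

def tokenize(message):
    """Sampling step: reduce the raw message to tokens."""
    return message.lower().split()

def score(tokens, spam_prob):
    """Filtering step: combine per-token spamminess into one scalar.
    Unknown tokens get a neutral 0.5."""
    probs = [spam_prob.get(t, 0.5) for t in tokens]
    return sum(probs) / len(probs) if probs else 0.5

def classify(s, ham_cutoff=0.2, spam_cutoff=0.9):
    """Decision rule: map the scalar score to ham, spam or unsure."""
    if s <= ham_cutoff:
        return "ham"
    if s >= spam_cutoff:
        return "spam"
    return "unsure"

# Hypothetical per-token estimates taken from a training corpus.
spam_prob = {"viagra": 0.99, "meeting": 0.01, "free": 0.9}
msg_score = score(tokenize("Free viagra"), spam_prob)
print(classify(msg_score))  # -> spam
```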

Since the statistical properties of the message stream do slowly change over time (it is non-stationary), the further in the future we use this estimated classifier, the less correct we can expect it to be. Training tactics, such as TrainOnErrors, are attempts to correct the estimated classifier for statistical changes in the message stream so that it continues to classify future messages correctly. Those same training tactics also correct estimation errors caused by imperfect initial training set selection. The initial training set is only a sample of the current message stream, so the estimated classifier is generally imperfect and will not perform as well for subsequent messages. Because of this, we usually do additional training, so the system is already adaptive. How much additional training we need depends on 1) how closely the message corpus from which we selected the training set represents the statistical properties of the actual message stream, 2) how much the statistical properties of the actual message stream change over time, and 3) our tolerance for errors and unsures.

Another factor to consider is that even if the long-term average statistical properties of the message stream don't change over time (they are stationary), the messages that we receive on any given day are only a small sample of that stream. Since the sample is small, its statistical properties may differ from the long-term average enough to cause the estimated classifier to perform poorly on a different sample from the same message stream. For a stationary message stream and a fixed estimated classifier, the variance of the statistical properties of these daily samples sets an upper bound on the classification performance. If the estimated classifier is not fixed and can adapt, it could perform better, because it will partially track the random walk of the statistical properties of the daily samples of the message stream. Note that the more agile we attempt to make the estimated classifier (the faster it adapts), the closer we bring the entire adaptive system to instability.
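A quick simulation illustrates the point: even with a perfectly stationary stream, small daily samples scatter noticeably around the long-run statistics. The spam rate and sample size below are arbitrary choices for illustration.

```python
import random

random.seed(2)

# Stationary stream: the long-run spam fraction never changes, but each
# "day" we only observe a small sample of messages drawn from it.
TRUE_SPAM_RATE = 0.7    # arbitrary long-run spam fraction
MSGS_PER_DAY = 20       # arbitrary small daily sample

daily_rates = []
for day in range(10):
    sample = [random.random() < TRUE_SPAM_RATE for _ in range(MSGS_PER_DAY)]
    daily_rates.append(sum(sample) / MSGS_PER_DAY)

# Individual days scatter around 0.7 even though the stream is stationary.
print(min(daily_rates), max(daily_rates))
```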

There are a couple of "gotchas" to a continuous learning adaptive approach:

FWIW, having worked with adaptive algorithms for a long time, my guess is that neither of these potential pitfalls will prove to be serious for our problem of classifying a slowly-changing future message stream with moderate sample variance into ham, spam and unsure.

One way to avoid all this is simply to start over and select a new training set whenever performance becomes poor. Though SpamBayes makes starting over easy, there are a lot of variables. The best initial training schemes are labor intensive, though they could be automated; less optimal initial training schemes mean you get spam in your Inbox for a while. Some people don't mind starting over from scratch, some like the idea and some even do it daily. That's a perfectly reasonable solution for many, so this isn't necessarily a problem for everyone. For those of us who would prefer a more automated system, a continuous learning system that prunes the databases introduces considerations we don't currently have to worry about.

First, here are four tactics for pruning a database of the tokens associated with one or more trained messages and two tactics for pruning individual tokens. We prune either periodically or when we want to train a new message into the database, depending on the tactic chosen.

Assuming that we implement one of these pruning tactics, here are some possible implementations:
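As one purely illustrative possibility, here is a minimal sketch of a prune-at-train-time tactic: cap the number of trained messages and untrain the oldest whole message before training a new one. The FIFO policy, the cap, and all names here are assumptions, not SpamBayes code.

```python
from collections import deque

# Hypothetical prune-on-train sketch: keep at most MAX_TRAINED messages,
# untraining the oldest whole message whenever the cap is reached.
MAX_TRAINED = 3

trained = deque()    # (message_id, token_list), oldest first
token_counts = {}    # token -> count across currently trained messages

def train(msg_id, tokens):
    if len(trained) >= MAX_TRAINED:
        _, old_tokens = trained.popleft()      # prune the oldest message
        for t in old_tokens:                   # reverse its contribution
            token_counts[t] -= 1
            if token_counts[t] == 0:
                del token_counts[t]
    trained.append((msg_id, tokens))
    for t in tokens:
        token_counts[t] = token_counts.get(t, 0) + 1

for i, toks in enumerate([["a", "b"], ["b", "c"], ["c", "d"], ["d", "e"]]):
    train(i, toks)

print(sorted(token_counts))  # only tokens from the three newest messages
```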

Note: untraining whole messages, rather than individual tokens, preserves the underlying assumptions of the SpamBayes calculations (whether this really matters in practice is anyone's guess).

When it is time to train a new message into the databases, there are also choices as to how to add it:

The latter choice is easier than it sounds, if we do it in a simple-minded way: add a parameter (in SpamBayes Manager or the .ini file) that says how much longer an atypical message should stay in the database than a typical one. A typical message classifies perfectly (ham = 0, spam = 100). An atypical message does not classify perfectly: it can be close or totally wrong, but not perfect. Depending on how atypical (imperfectly classified) a message is, keep its tokens in the database longer, up to the maximum extra time specified in the parameter. To keep a message around one day longer, for example, set its timestamp one day in the future. The length of extra time a message's tokens stay in the database is determined by some formula (linear, quadratic, logarithmic?) and lies between zero and the maximum extra time parameter.
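Done simple-mindedly, the scheme above might look like this sketch, assuming a linear formula and a hypothetical max_extra_days parameter (neither is part of SpamBayes):

```python
import time

# Hypothetical sketch of "keep atypical messages longer". Atypicality is
# 0.0 for a perfectly classified message (score 0 for ham, 100 for spam)
# and 1.0 for a totally wrong one; the linear formula and max_extra_days
# parameter are assumptions, not SpamBayes options.

def atypicality(score, is_spam):
    """Distance from a perfect score, scaled to 0..1."""
    perfect = 100.0 if is_spam else 0.0
    return abs(score - perfect) / 100.0

def expiry_timestamp(score, is_spam, max_extra_days=30):
    """Stamp the message's tokens up to max_extra_days into the future."""
    extra_days = max_extra_days * atypicality(score, is_spam)  # linear
    return time.time() + extra_days * 86400

print(atypicality(100.0, True))  # perfectly classified spam -> 0.0
print(atypicality(35.0, True))   # quite atypical spam -> 0.65
```

A quadratic or logarithmic formula would just replace the one line marked "linear".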

As the training set changes, so does the contribution of each message. A message that was atypical when it was first trained may become rather typical later. This suggests that an optimal approach is to rescore all trained messages when we need to prune a message and select the most typical one. Since message-based pruning requires saving the list of tokens for each trained message, it's not as bad as tokenizing and scoring every message, but it's still a lot of overhead. It may turn out that TrainToExhaustion accomplishes the same thing.
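A sketch of that rescoring idea, with a toy scoring function standing in for the real classifier; the names, the saved-token-list layout and the distance-from-perfect criterion are all assumptions:

```python
# Hypothetical message-based pruning: when the database is full, rescore
# every trained message's saved token list and prune the one that is now
# most typical (score closest to perfect for its class, on a 0..100 scale).

def most_typical(trained, score):
    """trained: list of (msg_id, tokens, is_spam); return the msg_id whose
    current score is closest to perfect for its class."""
    def distance(entry):
        _, tokens, is_spam = entry
        perfect = 100.0 if is_spam else 0.0
        return abs(score(tokens) - perfect)
    return min(trained, key=distance)[0]

# Toy scoring function: fraction of "spammy" tokens, scaled to 0..100.
spammy = {"pills", "cheap"}
def toy_score(tokens):
    return 100.0 * sum(t in spammy for t in tokens) / len(tokens)

trained = [
    ("s1", ["cheap", "pills"], True),     # scores 100: perfectly typical spam
    ("s2", ["cheap", "meeting"], True),   # scores 50: atypical spam
    ("h1", ["meeting", "agenda"], False), # scores 0: perfectly typical ham
]
print(most_typical(trained, toy_score))  # -> s1 (first perfectly typical one)
```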

Here's what we know so far. TrainOnErrorsAndUnsures, TrainOnAlmostEverything and TrainToExhaustion have all proven to be better tactics than TrainOnEverything or TrainOnErrors. Keeping spam/ham ratios within limits appears to help for most people, though some have reported good results with spam/ham ratios very different from unity. Very large training sets don't seem to perform better than moderate or small-sized ones. The rest is conjecture.

Other ideas and comments are welcome.