TrainOnUnsures


1. Train on Unsures

This training tactic trains only on messages that the classifier could not definitively classify as either ham or spam. It differs from mistake-based training in that none of the messages in the training set were misclassified. Presumably, adding the tokens from these unsure messages to the database will produce fewer unsures in the future (a more bimodal score distribution). If we then select each subsequent training set from the corpus of all unsures produced by the current classifier, we will create classifiers that produce fewer and fewer unsures, up to a point.
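The selection step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the cutoff values, the `Message` representation, the handling of scores that land exactly on a cutoff, and the names `is_unsure` and `select_unsures` are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed cutoffs for illustration; scores are taken to lie in [0.0, 1.0].
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90

# Hypothetical message representation: (text, is_spam).
Message = Tuple[str, bool]


def is_unsure(score: float,
              ham_cutoff: float = HAM_CUTOFF,
              spam_cutoff: float = SPAM_CUTOFF) -> bool:
    """Return True when the score falls in the band between the cutoffs."""
    # Boundary handling (inclusive at the ham cutoff) is an assumption.
    return ham_cutoff <= score < spam_cutoff


def select_unsures(messages: Iterable[Message],
                   score: Callable[[str], float]) -> List[Message]:
    """Keep only the messages the classifier could not call either way."""
    return [msg for msg in messages if is_unsure(score(msg[0]))]
```

The returned list would be the next training set. Scoring the corpus again with the retrained classifier and selecting again gives the iterative procedure described above.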

Since train on unsures by definition excludes from the training set all messages that were correctly classified as well as all messages that were misclassified, the number of false positives and false negatives may increase or decrease; the tactic does not select for either. In practice, probably very few people use this tactic, because training on mistakes seems so intuitive. As a result, more of the people who wish to include unsures in their training end up using train on errors and unsures. If you select a very good initial training set, though, there will be very few errors, and the messages actually trained on will be virtually all unsures.
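To make the difference between the tactics concrete, here is a small sketch of which messages each one would train on. The cutoffs, the `outcome` and `should_train` helpers, and the `TACTICS` table are hypothetical names introduced for this example and are not taken from any real classifier's API.

```python
def outcome(score: float, is_spam: bool,
            ham_cutoff: float = 0.20, spam_cutoff: float = 0.90) -> str:
    """Label a scored message as 'unsure', 'correct', or 'error'."""
    # Cutoff values and boundary handling are assumptions for illustration.
    if ham_cutoff <= score < spam_cutoff:
        return "unsure"
    predicted_spam = score >= spam_cutoff
    return "correct" if predicted_spam == is_spam else "error"


# Which outcomes each tactic feeds back into training.
TACTICS = {
    "train on unsures": {"unsure"},
    "train on errors": {"error"},
    "train on errors and unsures": {"error", "unsure"},
}


def should_train(tactic: str, score: float, is_spam: bool) -> bool:
    """Return True if the given tactic would train on this message."""
    return outcome(score, is_spam) in TACTICS[tactic]
```

Under these assumptions, a ham message scored 0.95 (a false positive) is skipped by train on unsures but picked up by train on errors and unsures, which is why the former does not select for fewer mistakes.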

SethGoodman