
TrainingIdeas/TrainOnLots



1. Train on Errors, Unsures, and non-obvious correct decisions

There are ways in between. RobsSetup describes a training method that can be summarized as: train on almost anything that didn't score 0.00 or 1.00. Since most messages score either 0.00 or 1.00 after a small initial training period, this drastically reduces the database size compared with the train-on-everything strategy (and solves some of its other problems as well). A possible advantage over the train-on-mistakes strategy is that scoring relies much less on words that appear only once in the database (so-called hapaxes).

RobHooft
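
The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not code from SpamBayes or from RobsSetup: the names should_train and select_for_training are hypothetical, scores are assumed to be floats in [0, 1], and "scored 0.00 or 1.00" is assumed to mean the score as displayed to two decimal places. Following the heading of this page, the sketch also trains on outright errors, that is, messages pinned at the wrong extreme.

```python
def should_train(score, is_spam, ndigits=2):
    """Decide whether a message is worth training on (hypothetical helper).

    Assumption: a message is skipped only when its score, rounded to
    `ndigits` decimals, sits at the extreme that matches its true class.
    Everything else (unsures, non-obvious correct decisions, errors)
    is trained.
    """
    rounded = round(score, ndigits)
    obvious_ham = rounded == 0.0 and not is_spam
    obvious_spam = rounded == 1.0 and is_spam
    return not (obvious_ham or obvious_spam)


def select_for_training(scored_messages, ndigits=2):
    """Return the names of the messages that should be trained."""
    return [
        name
        for name, score, is_spam in scored_messages
        if should_train(score, is_spam, ndigits)
    ]


# Made-up example data: (name, score, true class is spam?)
scored = [
    ("ham-obvious", 0.0000003, False),   # displays as 0.00, correct: skip
    ("ham-borderline", 0.37, False),     # unsure region: train
    ("spam-near-edge", 0.996, True),     # displays as 1.00, correct: skip
    ("spam-non-obvious", 0.92, True),    # correct but not at the edge: train
    ("spam-missed", 0.001, True),        # displays as 0.00 but is spam: train
]
to_train = select_for_training(scored)
```

With this made-up sample, only the borderline ham, the non-obvious spam, and the missed spam are selected, which shows how the rule keeps the training database small once most messages score at an extreme.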


This training scheme (represented by the nonedge regime for the incremental harness) does better than TrainingIdeas/TrainOnEverything. In particular, it trains on far fewer messages for slightly greater accuracy. It also doesn't seem to decay as badly over extended periods. Have some pictures:

http://www.wolfskeep.com/~popiel/spambayes/nonedge/nonedge.png
http://www.wolfskeep.com/~popiel/spambayes/nonedge/nonedgespan.png
http://www.wolfskeep.com/~popiel/spambayes/nonedge/nonedgetrained.png
http://www.wolfskeep.com/~popiel/spambayes/nonedge/nonedgespantrained.png

Details of the test set are in AlexDemoTestSet.