RobsSetup

1. Setup

My setup differs quite a bit from most other people's, so I had to go through a number of iterations before I could set up SpamBayes in a satisfactory way. Since this might help other people, I am describing it here.

I have two accounts: a private account and a work account. Both receive about 50-100 spam mails per day at the time of this writing. Both are IMAP setups where I have a login account on a machine with direct file access to the mail folders. One uses mailbox folders, the other one maildir.

Another distinctive property of my setup is that there are at least 4 different ways I can read my e-mail: via Mozilla at home, via Mozilla at work, via Mozilla at a university I visit once per week, and via webmail.

2. Client-side filtering?

With such a setup, it is not practical to run the spam filter on the client side: first, downloading all of that spam is quite expensive, and second, installing filters in all of my Mozillas is impractical and installing one in my webmail is currently not possible.

So I made a setup that uses procmail for both servers.

3. Spambayes configuration

On both systems, my ~/.spambayesrc looks like this:

[Storage]
persistent_use_database=True
persistent_storage_file=~/.hammiedb
[Headers]
header_score_logarithm=True

The "header_score_logarithm" option is for my own enjoyment: for really obvious spam and ham, the score is augmented by a number indicating "how 0" or "how 1" the score really is.
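To give a feel for what such an annotation could look like, here is a minimal sketch. It assumes that the extra number is the base-10 logarithm of the distance to 0 or to 1; the function name `annotate_score` and the 0.005 threshold are made up for illustration and are not taken from the SpamBayes source.

```python
import math


def annotate_score(prob, threshold=0.005):
    """Format a score with two digits, plus a log10 hint for extreme values.

    Hypothetical helper: the threshold and the output layout are assumptions.
    """
    text = "%.2f" % prob
    if 0.0 < prob < threshold:
        # "How 0": orders of magnitude by which the score is below 1
        text += " (%d)" % -math.log10(prob)
    elif 0.0 < 1.0 - prob < threshold:
        # "How 1": orders of magnitude by which (1 - score) is below 1
        text += " (%d)" % -math.log10(1.0 - prob)
    return text
```

With this sketch, a score that rounds to 0.00 or 1.00 still shows how extreme it was, while middling scores are printed unchanged.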

4. Procmail setup

For the system using "maildir", I am using the following .procmailrc setup:

LOGFILE=/home/h/hooft/procmail.log
:0 fw:hamlock
| /home/h/hooft/bin/sb_filter.py

# Messages that are so obviously spam that we should not train on them
:0
* ^X-SpamBayes-Classification: spam; 1.00
.ztrain.obvious-spam/

# Messages that are spam but we might want to train on them
:0
* ^X-SpamBayes-Classification: spam
.ztrain.spam/

# Unsure messages must be copied to the unsure folder for training
:0 c
* ^X-SpamBayes-Classification: unsure
.ztrain.unsure/

# Ham that scores above 0.01 is eligible for training as well
:0 c
* ^X-SpamBayes-Classification: ham; 0.0[2-9]
.ztrain.ham/

:0 c
* ^X-SpamBayes-Classification: ham; 0.1[0-9]
.ztrain.ham/

This is followed by some recipes to split the list mail into folders. As you can see, I use quite a lot of different folders for SpamBayes. There are two more that are not mentioned in the procmail recipes, so the full list looks like this:

Folder	Purpose
ztrain/confirmed-ham	messages for which I manually confirmed ham status
ztrain/confirmed-spam	messages for which I manually confirmed spam status
ztrain/obvious-spam	messages that SpamBayes classified as spam with a score of 1.00
ztrain/spam	spam messages that have scores below 1.00
ztrain/unsure	unsure messages; these are also copied into the inbox
ztrain/ham	ham messages that have scores above 0.01; these are also copied into the inbox
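The routing done by the procmail recipes can be summarized as a small function. This is only a sketch for reasoning about the rules, not something that runs in the mail path; the name `route` and the tuple it returns are my own invention, and it assumes the header value starts with the label and a two-digit score, as in the recipes.

```python
import re

# Label and two-digit score at the start of the X-SpamBayes-Classification value
HEADER = re.compile(r"^(spam|unsure|ham); (\d\.\d\d)")


def route(classification):
    """Return (training_folder, delivered_to_inbox) for a header value.

    Hypothetical helper mirroring the procmail recipes; training_folder is
    None when no recipe matches.
    """
    match = HEADER.match(classification)
    if match is None:
        return (None, True)
    label, score = match.group(1), float(match.group(2))
    if label == "spam":
        # Spam is delivered to a training folder only, never to the inbox
        if score == 1.0:
            return ("ztrain/obvious-spam", False)
        return ("ztrain/spam", False)
    if label == "unsure":
        # The "c" flag keeps a copy moving on towards the inbox
        return ("ztrain/unsure", True)
    # Ham: the recipes match scores 0.02-0.09 and 0.10-0.19
    if 0.02 <= score <= 0.19:
        return ("ztrain/ham", True)
    return (None, True)
```

For example, `route("unsure; 0.45")` gives `("ztrain/unsure", True)`, while `route("ham; 0.01")` gives `(None, True)` because such ham is not worth training on.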

5. Training

Using these folders, I do a modified version of "unsure-based" training.

Initially, I filled the confirmed-ham and confirmed-spam folders with a couple dozen different messages from different sources, and trained on them (see below). From that moment on, spambayes and procmail started filling the other folders.

Regularly, I review the folders "spam", "unsure" and "ham", and move their messages into the confirmed-* folders. After such a move, I retrain spambayes. The "obvious-spam" folder is a dump only; I do not pay more than 0.1 second of attention to it. Since these messages are such obvious giveaways, it does not pay to train on them. The same is true for ham that comes in with scores of 0.00 and 0.01. All other ham is automatically copied into ztrain/ham so that I can train on it as well.

After a few days, about 90% of my spam was caught as obvious-spam, about 8% as spam, and only very few messages came through as unsure.
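Percentages like these can be checked by counting the messages in each training folder. The sketch below assumes the maildir layout, with each folder holding "new" and "cur" subdirectories under a common root such as $HOME/Maildir; the function names are hypothetical.

```python
import os


def count_messages(maildir_folder):
    """Count message files in the new/ and cur/ subdirectories of one folder."""
    total = 0
    for sub in ("new", "cur"):
        path = os.path.join(maildir_folder, sub)
        if os.path.isdir(path):
            # Skip dotfiles, which are not messages
            total += sum(1 for name in os.listdir(path)
                         if not name.startswith("."))
    return total


def folder_shares(maildir_root, folders):
    """Return each folder's share of the combined message count, in percent."""
    counts = {name: count_messages(os.path.join(maildir_root, name))
              for name in folders}
    total = sum(counts.values())
    return {name: (100.0 * n / total if total else 0.0)
            for name, n in counts.items()}
```

Calling `folder_shares(os.path.expanduser("~/Maildir"), [".ztrain.obvious-spam", ".ztrain.spam", ".ztrain.unsure"])` would report how the caught mail is spread over the three folders, provided they have not been emptied in between.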

To train, I am using the following script:

#!/bin/sh
#
# MAKE SURE TO EXPUNGE/COMPACT the folders before running this
#
sb_mboxtrain.py -d "$HOME/.hammiedb" -g "$HOME/Maildir/.ztrain.confirmed-ham" -s "$HOME/Maildir/.ztrain.confirmed-spam"
