JouniKSeppänensSetup

1. Jouni K Seppänen's Setup

This is my SpamBayes setup that includes some tricks that I didn't see documented elsewhere.

I read my email over IMAP from a server running UW IMAPD, using various client programs on various computers. Client-side filtering is therefore out, and while I have shell access to the server, I don't want to run the ImapFilter on it, since some other user might stumble upon the port.

(What port? If you mean the configuration one, it's only served if you use the -l switch or don't use -c or -t. You can also set it to only accept connections from certain IP addresses, or have it use authentication. --TonyMeyer)

Features of the setup include:

SbFilter based filtering
Hashcash stamped email bypasses filtering
semi-automatic training: copy messages to special mailboxes and they are used to train SpamBayes

1.1. Filtering

The following .procmailrc file directs the filtering of incoming mail.

MAILDIR=$HOME/Mail
LOGFILE=$HOME/procmail.log

:0 c
backup

:0 ic
| cd $MAILDIR/backup && rm -f dummy `ls -t msg.* | sed -e 1,64d`

:0 fw:hamlock
| $HOME/bin/sb_filter.py

:0:
* ^X-Hashcash: ([0-9]+:)+\/USERNAME(\+[a-zA-Z0-9]*)?@DOMAIN\.TLD
* $? $HOME/bin/hashcash -cqXdp 1d -b20 -f $HOME/.hashcash.db -r '$MATCH'
$DEFAULT

:0:
* ^X-SpamBayes-Classification: spam
Junk

:0:
* ^X-SpamBayes-Classification: unsure
Unsure

The first two recipes implement the Procmail backup trick described in man procmailex: copy every message before feeding it to SpamBayes or anything else, then delete all but the newest 64 backup copies. (The Mail/backup directory must exist.) The third recipe runs SpamBayes as a filter. The fourth one, copied from http://www.hashcash.org/mail/mfa/procmail/, rescues Hashcash-stamped emails from being classified as spam; the (\+[...])? part allows for addresses like foo+bar@example.com. The final two recipes move spam and unsure messages into their respective folders.
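The pruning step in the second recipe can be tried in isolation. The following is a minimal sketch, not part of the original setup: the function name prune_backups and its two arguments are hypothetical, and the number of copies to keep is a parameter instead of the fixed 64 used in the recipe.

```shell
#!/bin/sh
# Sketch of the pruning step from the second recipe, with the count made
# a parameter. prune_backups and its arguments are hypothetical names.
prune_backups () {
  dir=$1
  keep=$2
  # ls -t lists newest first; sed drops the first $keep names, so only the
  # older files reach rm. "dummy" keeps rm from being called without
  # operands when there is nothing to delete.
  ( cd "$dir" && rm -f dummy $(ls -t msg.* 2>/dev/null | sed -e "1,${keep}d") )
}

# Demonstration on a throwaway directory with five fake messages whose
# modification times are set explicitly so the ordering is unambiguous.
demo=$(mktemp -d)
for i in 1 2 3 4 5; do
  touch -t "20200101000$i" "$demo/msg.$i"
done
prune_backups "$demo" 2
ls "$demo"    # the two newest files, msg.4 and msg.5, remain
rm -rf "$demo"
```

Like the recipe itself, this relies on the backup file names containing no whitespace, which holds for the msg.* names Procmail generates when delivering to a directory.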

1.2. Training

I created two folders named "New Spam" and "New Ham" to which I copy "unsure" and misclassified messages. An hourly cron job runs the following script (trainspam.sh):

#!/bin/sh

basedir=$HOME/spamtrain
RC=$basedir/split.rc
NEWBOX=$basedir/newbox.$$
TARGET=$basedir/target.$$
export TARGET NEWBOX

do_it () {
  touch "$TARGET"
  if lockfile -1 -r 5 "$SOURCE.lock"; then
    formail -s procmail -p -m $RC <"$SOURCE" && mv -f "$NEWBOX" "$SOURCE"
    rm -f "$SOURCE.lock"
  fi

  $HOME/bin/sb_mboxtrain.py $OPT "$TARGET"

  if lockfile -1 -r 5 "$FINAL.lock"; then
    (echo; cat "$TARGET") >> "$FINAL"
    rm -f "$FINAL.lock" "$TARGET"
  fi
}

SOURCE=$HOME/Mail/'New Spam'
FINAL=$basedir/spam
OPT=-s
do_it

SOURCE=$HOME/Mail/'New Ham'
FINAL=$basedir/ham
OPT=-g
do_it
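For the hourly run mentioned above, a crontab entry along the following lines would do. This is only a sketch: the location of trainspam.sh under $HOME/spamtrain and the choice of minute 17 are assumptions, not part of the original setup.

```
# m  h  dom mon dow  command
17   *  *   *   *    $HOME/spamtrain/trainspam.sh
```

Cron typically runs jobs with a minimal PATH, so lockfile, formail and procmail may need to be on that path or be called by their full names in the script.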

What happens is basically that the messages in the New Spam and New Ham mailboxes are moved into "$TARGET" in turn; the sb_mboxtrain.py script is run on that file, which is then appended to $basedir/spam or $basedir/ham. A slight complication is UW IMAPD's internal-data message, which you aren't supposed to remove. The formail -s procmail command handles it, using the split.rc file to direct Procmail:

:0:
* ^Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA$
$NEWBOX

:0fw
| $HOME/spamtrain/headerfix.sh

:0:
$TARGET

Along the way, we remove some headers using the following script called headerfix.sh:

#!/bin/sh
formail -q- -I X-Spam-Flag -I X-Spam-Status -I X-Spam-Level \
  -I X-Spam-Checker-Version -I X-Status -I X-Keywords -I X-UID \
  -I Status -I X-Gnus-Newsgroup -I X-Gnus-Mail-Source -I Xref -I Lines \
  -I X-RAVMilter-Version -I X-DCC-HUTCC-Metrics -I X-RBL-Warning -s

This is just a collection of headers that I have seen in some of my mail, added by either MTAs or MUAs, and that I don't want to accidentally use for training. (Probably not all of them need to be removed.)

(Note that unless you are using the [Tokenizer] basic_header_tokenize option (off by default), none of these headers will generate any tokens. Removing them will have no effect on classification. --TonyMeyer)
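To see which of the headers listed above actually occur in a mailbox before deciding what to strip, something like the following could be used. It is a sketch under assumptions: count_headers is a hypothetical helper, it uses awk instead of formail, and it assumes a plain mbox file in which every message starts with a "From " line and body lines beginning with "From " are escaped.

```shell
#!/bin/sh
# count_headers MBOX HEADER...
# Hypothetical helper: for each header name, print how many header lines
# with that name occur in the mbox file. Body lines are not counted.
count_headers () {
  mbox=$1
  shift
  for h in "$@"; do
    n=$(awk -v name="$h" '
      /^From / { inhdr = 1; next }   # mbox separator: a header block starts
      /^$/     { inhdr = 0; next }   # first blank line: the header block ends
      # Header names are case-insensitive, so compare in lower case.
      inhdr && index(tolower($0), tolower(name) ":") == 1 { c++ }
      END { print c + 0 }' "$mbox")
    printf '%s %s\n' "$n" "$h"
  done
}

# Example call (the mailbox path is the one used in trainspam.sh):
#   count_headers "$HOME/Mail/New Spam" X-Spam-Flag Status Lines
```

Each output line is a count followed by the header name; a header with a count of 0 does not appear in that mailbox at all.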
