1. Jouni K Seppänen's Setup
This is my SpamBayes setup, which includes some tricks that I haven't seen documented elsewhere.
I read my email over IMAP from a server running UW IMAPD, using various client programs on various computers, so client-side filtering is out. I have shell access to the server, but I don't want to run the ImapFilter there, since some other user might stumble upon its port.
(What port? If you mean the configuration one, it's only served if you use the -l switch or don't use -c or -t. You can also set it to only accept connections from certain IP addresses, or have it use authentication. --TonyMeyer)
Features of the setup include:
- SbFilter based filtering
- Hashcash stamped email bypasses filtering
- semi-automatic training: copy messages to special mailboxes and they are used to train SpamBayes
1.1. Filtering
The following .procmailrc file directs the filtering of incoming mail.
MAILDIR=$HOME/Mail
LOGFILE=$HOME/procmail.log

:0 c
backup

:0 ic
| cd $MAILDIR/backup && rm -f dummy `ls -t msg.* | sed -e 1,64d`

:0 fw:hamlock
| $HOME/bin/sb_filter.py

:0:
* ^X-Hashcash: ([0-9]+:)+\/USERNAME(\+[a-zA-Z0-9]*)?@DOMAIN\.TLD
* $? $HOME/bin/hashcash -cqXdp 1d -b20 -f $HOME/.hashcash.db -r '$MATCH'
$DEFAULT

:0:
* ^X-SpamBayes-Classification: spam
Junk

:0:
* ^X-SpamBayes-Classification: unsure
Unsure
The first two recipes make up the Procmail backup trick described in man procmailex: copy every email before feeding it to SpamBayes or anything else, and delete all but the 64 newest backup copies. (The Mail/backup directory must exist.) The third recipe runs SpamBayes as a filter. The fourth, copied from http://www.hashcash.org/mail/mfa/procmail/, rescues Hashcash-stamped emails from being classified as spam by delivering them straight to the default mailbox; the (\+[...])? part allows for addresses like foo+bar@example.com. The final two recipes move spam and unsure messages into their respective folders.
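The pruning pipeline in the second recipe can be tried out safely in a scratch directory. The following is only a sketch: the keep_newest function, the temporary directory, and the limit of 3 are made up for the demonstration and stand in for $MAILDIR/backup and the limit of 64 used above.

```shell
#!/bin/sh
# Demo of the "keep only the newest N backups" pipeline from the second recipe.
# Assumptions: keep_newest is a hypothetical helper, and a scratch directory
# with a limit of 3 stands in for $MAILDIR/backup and the limit of 64.
keep_newest() {
    # $1 = directory, $2 = number of newest msg.* files to keep
    (
        cd "$1" || exit 1
        # ls -t lists the newest files first; sed drops the first $2 names
        # from that list, so only the older names are passed to rm.
        # "dummy" gives rm an argument even when nothing needs deleting.
        rm -f dummy $(ls -t msg.* | sed -e "1,${2}d")
    )
}

demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    # Distinct modification times make the newest-first ordering unambiguous.
    touch -t "20240101000${i}" "$demo_dir/msg.$i"
done

keep_newest "$demo_dir" 3
ls "$demo_dir"   # msg.1 and msg.2, the two oldest files, are gone
```

In the real recipe the same pipeline runs once per incoming message, so the backup directory never grows much beyond the chosen limit.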
1.2. Training
I created two folders named "New Spam" and "New Ham" to which I copy "unsure" and misclassified messages. An hourly cron job runs the following script (trainspam.sh):
#!/bin/sh
basedir=$HOME/spamtrain
RC=$basedir/split.rc
NEWBOX=$basedir/newbox.$$
TARGET=$basedir/target.$$
export TARGET NEWBOX

# Defined without the "function" keyword, which plain /bin/sh may not accept.
do_it () {
    touch "$TARGET"
    # Split the source mailbox while holding its lock.
    if lockfile -1 -r 5 "$SOURCE.lock"; then
        formail -s procmail -p -m $RC <"$SOURCE" && mv -f "$NEWBOX" "$SOURCE"
        rm -f "$SOURCE.lock"
    fi
    # Train on the extracted messages, then archive them.
    $HOME/bin/sb_mboxtrain.py $OPT "$TARGET"
    if lockfile -1 -r 5 "$FINAL.lock"; then
        (echo; cat "$TARGET") >> "$FINAL"
        rm -f "$FINAL.lock" "$TARGET"
    fi
}

SOURCE=$HOME/Mail/'New Spam'
FINAL=$basedir/spam
OPT=-s
do_it

SOURCE=$HOME/Mail/'New Ham'
FINAL=$basedir/ham
OPT=-g
do_it
In short, the contents of the New Spam and New Ham mailboxes are moved into $TARGET in turn, sb_mboxtrain.py is run on that file, and the file is then appended to $basedir/spam or $basedir/ham. A slight complication is UW IMAPD's extra internal-data message, which you aren't supposed to remove. The formail -s procmail command handles this, using the following split.rc file to direct Procmail:
:0:
* ^Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA$
$NEWBOX

:0fw
| $HOME/spamtrain/headerfix.sh

:0:
$TARGET
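To make the effect of split.rc easier to see, here is a rough sketch of the same routing with awk standing in for formail and procmail. It is not what the setup actually runs: split_mbox, the sample messages, and the file names are hypothetical, there is no locking or header cleanup, and it assumes a plain mbox in which every message begins with a "From " line.

```shell
#!/bin/sh
# Sketch of the routing done by split.rc, using awk instead of formail + procmail.
# Assumptions: split_mbox is a hypothetical helper; no locking, no header fixing.
split_mbox() {
    # $1 = source mbox, $2 = file for the internal-data message,
    # $3 = file for all other messages
    awk -v newbox="$2" -v target="$3" '
        function emit() {
            # Write the buffered message to whichever file it belongs in.
            if (msg == "") return
            if (internal) printf "%s", msg > newbox
            else          printf "%s", msg > target
            msg = ""; internal = 0
        }
        /^From / { emit(); in_header = 1 }      # a new message starts here
        in_header && /^$/ { in_header = 0 }     # blank line ends the headers
        # "DON.T" uses a regex dot to avoid an apostrophe inside the quoted program.
        in_header && /^Subject: DON.T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA$/ { internal = 1 }
        { msg = msg $0 "\n" }
        END { emit() }
    ' "$1"
}

work=$(mktemp -d)
printf '%s\n' \
    'From MAILER-DAEMON Mon Jan  1 00:00:00 2024' \
    "Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA" \
    '' \
    '(internal data body)' \
    '' \
    'From spammer@example.com Mon Jan  1 00:01:00 2024' \
    'Subject: buy now' \
    '' \
    'First message to train on.' \
    '' \
    'From friend@example.com Mon Jan  1 00:02:00 2024' \
    'Subject: lunch' \
    '' \
    'Second message to train on.' > "$work/source"

split_mbox "$work/source" "$work/newbox" "$work/target"
grep -c '^From ' "$work/newbox" "$work/target"
```

The internal-data message ends up alone in the newbox file, which is why trainspam.sh can move $NEWBOX back over the source mailbox afterwards: the folder is emptied of real mail but keeps the message UW IMAPD expects.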
Along the way, we remove some headers using the following script called headerfix.sh:
#!/bin/sh
formail -q- -I X-Spam-Flag -I X-Spam-Status -I X-Spam-Level \
    -I X-Spam-Checker-Version -I X-Status -I X-Keywords -I X-UID \
    -I Status -I X-Gnus-Newsgroup -I X-Gnus-Mail-Source -I Xref -I Lines \
    -I X-RAVMilter-Version -I X-DCC-HUTCC-Metrics -I X-RBL-Warning -s
This is just a collection of headers that I have occasionally seen in my mail, added by either MTAs or MUAs, and that I don't want to accidentally use for training. (It is probably not necessary to remove all of these.)
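For readers who want to see what this kind of header removal amounts to, the sketch below does something similar with awk on a single message. It is a stand-in rather than a replacement for formail -I: strip_headers is a hypothetical helper, the header list is shortened, and it handles only one message on standard input.

```shell
#!/bin/sh
# Sketch of header removal for one message, with awk standing in for formail -I.
# Assumption: strip_headers is a hypothetical helper, not part of the setup above.
strip_headers() {
    # $1 = header names separated by "|"; the message is read from stdin
    awk -v names="$1" '
        BEGIN { in_header = 1; pattern = "^(" tolower(names) "):" }
        in_header && /^$/ { in_header = 0 }   # blank line ends the headers
        in_header {
            if ($0 ~ /^[ \t]/) {
                # Continuation line: drop it if its header is being dropped.
                if (skipping) next
            } else {
                skipping = (tolower($0) ~ pattern)
                if (skipping) next
            }
        }
        { print }
    '
}

printf '%s\n' \
    'From a@example.com Mon Jan  1 00:00:00 2024' \
    'Subject: hi' \
    'X-Spam-Status: Yes,' \
    '	score=9' \
    'Status: RO' \
    '' \
    'Status: this body line is kept' |
    strip_headers 'X-Spam-Status|Status'
```

Header names are compared case-insensitively and only the header section is touched, so a body line that happens to look like a header survives.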
(Note that unless you are using the [Tokenizer] basic_header_tokenize option (off by default), then none of these headers will generate any tokens at all. Removing them will have no effect on classification at all. --TonyMeyer)