I've appended a small script I use to help decide which spams and hams that turn up "unsure" I should train on first/next. I run a mailbox through sb_filter.py like so:
sb_filter.py ~/Mail/unsure | python ~/tmp/scan.py
The scan.py script spits out the subject, message-id, date and classification headers sorted by score. By default, it only considers messages classified as "unsure". You can force it to consider any/all combinations though, e.g.:
sb_filter.py ~/Mail/unsure | python ~/tmp/scan.py 'ham|spam|unsure'
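The argument is simply treated as a regular expression that is searched for in the classification header, which is why 'ham|spam|unsure' selects everything. Here is a minimal sketch of that matching step on its own. The helper name `matches` is made up for illustration, and the sample header lines assume the usual "label; score" layout, with the score as the last whitespace-separated token:

```python
import re

# Assumed header layout: "X-Spambayes-Classification: <label>; <score>".
HEADER = "x-spambayes-classification: "

def matches(header_line, scanfor="unsure"):
    """Return the score if the header matches the scanfor regex, else None."""
    if not header_line.lower().startswith(HEADER):
        return None
    if re.search(scanfor, header_line) is None:
        return None
    # The score is the last whitespace-separated token on the line.
    return float(header_line.split()[-1])

for line in ["X-Spambayes-Classification: unsure; 0.43",
             "X-Spambayes-Classification: spam; 0.99"]:
    # First call uses the default ("unsure" only), second accepts all three.
    print(line, "->", matches(line), matches(line, "ham|spam|unsure"))
```

With the default pattern only the first line yields a score; with the combined pattern both do.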
The idea is that you train on one or a few of your lowest scoring spams and/or highest scoring hams, save your unsure file, then run the above again. Any previously "unsure" spams which now show up at the spam end of things get ignored. Lather, rinse, repeat. When you're tired of the cleansing cycle (or your hair is squeaky clean), rename your unsure folder, e.g.:
mv ~/Mail/unsure ~/Mail/unsure.save
then run it through your filter again so the leftovers get reclassified, e.g.:
formail -s procmail < ~/Mail/unsure.save
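To make the "train on the worst offenders first" step of the cycle above concrete, here is a rough sketch of picking the next candidates. It assumes you have already looked at each unsure message and decided for yourself whether it is really ham or spam. The function name `pick_candidates`, the tuple layout and the sample messages are all invented for illustration and are not part of SpamBayes:

```python
def pick_candidates(messages, k=2):
    """Pick the k lowest-scoring spams and the k highest-scoring hams.

    messages: (score, truth, subject) tuples, where truth is your own
    judgment of the message, either "ham" or "spam".
    """
    # Tuples sort by score first, so plain sorted() orders by score.
    spams = sorted(m for m in messages if m[1] == "spam")
    hams = sorted((m for m in messages if m[1] == "ham"), reverse=True)
    return spams[:k], hams[:k]

# Made-up contents of an unsure folder.
unsure = [
    (0.25, "spam", "cheap pills"),
    (0.85, "spam", "you have won"),
    (0.30, "ham", "lunch on friday?"),
    (0.78, "ham", "invoice attached"),
    (0.55, "spam", "act now"),
]

low_spams, high_hams = pick_candidates(unsure, k=1)
print("train as spam:", low_spams)
print("train as ham: ", high_hams)
```

Keeping k small matches the idea of the cycle: train on one or two messages, rescan, and see how many of the remaining unsures have moved on their own.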
The above commands are what I use in my Unix-y/procmail-ish/sb_filter-laden environment. You will obviously have to adjust them according to the needs of your environment, but the basic idea is the same everywhere. I think this process is even easier in the Outlook plugin. Sort your unsure folder by score, move a small number of the most out-of-whack messages where they belong, then reclassify your unsure folder.
Skip
#!/usr/bin/env python
# Read sb_filter.py output (mbox format) on stdin and print the subject, date,
# message-id and classification of each matching message, sorted by score.
# The single-argument print(...) calls behave the same on Python 2 and 3.
import sys, re, getopt

msgid = date = cls = ""
sub = "<no subject>"

# Regex of classifications to report; any arguments are joined with '|'.
scanfor = "unsure"
opts, args = getopt.getopt(sys.argv[1:], "")
if args:
    scanfor = '|'.join(args)

info = []
for line in sys.stdin:
    if line.startswith("From "):
        # mbox separator: a new message begins, so reset what we collected.
        msgid = date = cls = ""
        sub = "<no subject>"
    elif line.lower().startswith("subject: "):
        sub = line.strip()
    elif line.lower().startswith("message-id: "):
        msgid = line.strip()
    elif line.lower().startswith("date: "):
        date = line.strip()
    elif line.lower().startswith("x-spambayes-classification: "):
        cls = line.strip()
        if re.search(scanfor, cls) is not None:
            # The score is the last whitespace-separated token of the header.
            prob = float(cls.split()[-1])
            info.append((prob, (sub, date, msgid, cls)))
            date = msgid = cls = ""
            sub = "<no subject>"

# Lowest score (most ham-like) first.
info.sort()
for (prob, (sub, date, msgid, cls)) in info:
    print("")
    print(sub)
    print(date)
    print(msgid)
    print(cls)
