SkipsRecursiveTrainingSetSelectionAlgorithm

I really doubt that anyone needs to train on every single spam message which comes through in a 30-day period. Most spam probably comes from a small handful of cretins, and spam from the same cretin seems to arrive in bunches (gotta make full use of a new account before it gets shut off). Consequently, training on a single spammy unsure message is often sufficient to nudge several messages out of the unsure region and into spam territory.

I've appended a small script I use to help decide which of the spams and hams that turn up "unsure" I should train on next. I run a mailbox through sb_filter.py and pipe the result to the script, like so:

    sb_filter.py ~/Mail/unsure | python ~/tmp/scan.py

The scan.py script spits out the subject, message-id, date and classification headers sorted by score. By default, it only considers messages classified as "unsure". You can force it to consider any/all combinations though, e.g.:

    sb_filter.py ~/Mail/unsure | python ~/tmp/scan.py 'ham|spam|unsure'
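
To make the argument handling concrete, here is a rough sketch of how a classification header can be split into a label and a score, and how the arguments act as regular expression alternatives. It assumes the header looks like `X-Spambayes-Classification: unsure; 0.45` (label, semicolon, score). The names `parse_classification` and `wanted` are hypothetical helpers for illustration, not part of SpamBayes or of scan.py.

```python
import re


def parse_classification(header_line):
    """Split a classification header into (label, score).

    Assumes the form 'X-Spambayes-Classification: <label>; <score>'.
    Returns None if the line does not look like that.
    """
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() != "x-spambayes-classification":
        return None
    parts = value.split(";")
    if len(parts) != 2:
        return None
    try:
        return parts[0].strip(), float(parts[1])
    except ValueError:
        return None


def wanted(label, patterns):
    """True if the label matches any of the '|'-joined alternatives."""
    return re.search("|".join(patterns), label) is not None


# Sample header line (made up for this example).
sample = "X-Spambayes-Classification: unsure; 0.45"
label, score = parse_classification(sample)
print(label, score, wanted(label, ["unsure"]), wanted(label, ["ham", "spam"]))
```

Matching against the label alone, rather than the whole header line, avoids accidental hits on the header name itself.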

The idea is that you train on one or a few of your lowest scoring spams and/or highest scoring hams, save your unsure file, then run the above again. Any previously "unsure" spams which now show up at the spam end of things get ignored. Lather, rinse, repeat. When you're tired of the cleansing cycle (or your hair is squeaky clean), rename your unsure folder, e.g.:

    mv ~/Mail/unsure ~/Mail/unsure.save

then run it back through the filter so the remaining messages get reclassified, e.g.:

    formail -s procmail < ~/Mail/unsure.save

The above commands are what I use in my Unix-y/procmail-ish/sb_filter-laden environment. You will obviously have to adjust them according to the needs of your environment, but the basic idea is the same everywhere. I think this process is even easier in the Outlook plugin. Sort your unsure folder by score, move a small number of the most out-of-whack messages where they belong, then reclassify your unsure folder.
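
The "most out-of-whack" selection step can be sketched roughly as follows. This is only an illustration under stated assumptions: the message tuples are made-up data, `pick_candidates` is a hypothetical helper, and the `true_label` field stands for your own verdict after eyeballing each unsure message, not the filter's.

```python
def pick_candidates(messages, k=2):
    """Return the k lowest-scoring spams and k highest-scoring hams.

    `messages` is a list of (score, true_label, subject) tuples, where
    true_label is the human verdict ('spam' or 'ham').
    """
    # Spams the filter scored lowest are the ones it got most wrong.
    spams = sorted(m for m in messages if m[1] == "spam")
    # Hams the filter scored highest are likewise the worst misses.
    hams = sorted((m for m in messages if m[1] == "ham"), reverse=True)
    return spams[:k], hams[:k]


# Hypothetical contents of an unsure folder.
unsure = [
    (0.25, "spam", "Cheap meds"),
    (0.85, "spam", "You won"),
    (0.30, "ham", "Lunch?"),
    (0.80, "ham", "Invoice attached"),
    (0.55, "spam", "Hot stocks"),
]

train_spam, train_ham = pick_candidates(unsure, k=1)
print(train_spam)
print(train_ham)
```

After training on the selected messages you would rescore the folder and pick again, since the remaining scores will have shifted.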

Skip

    #!/usr/bin/env python
    """Summarize sb_filter.py output read from stdin, sorted by score."""

    # Lets the print calls below work under both Python 2 and Python 3.
    from __future__ import print_function

    import re
    import sys

    # Classifications to report; any arguments are OR-ed into one regex.
    scanfor = "unsure"
    if len(sys.argv) > 1:
        scanfor = '|'.join(sys.argv[1:])

    msgid = date = ""
    sub = "<no subject>"

    info = []
    for line in sys.stdin:
        lower = line.lower()
        if line.startswith("From "):
            # mbox separator: a new message starts here.
            msgid = date = ""
            sub = "<no subject>"
        elif lower.startswith("subject: "):
            sub = line.strip()
        elif lower.startswith("message-id: "):
            msgid = line.strip()
        elif lower.startswith("date: "):
            date = line.strip()
        elif lower.startswith("x-spambayes-classification: "):
            cls = line.strip()
            # Match against the header value only, not the header name.
            value = cls.split(":", 1)[1]
            if re.search(scanfor, value) is not None:
                # The score is the last whitespace-separated field.
                prob = float(cls.split()[-1])
                info.append((prob, (sub, date, msgid, cls)))
            date = msgid = ""
            sub = "<no subject>"

    # Lowest scores first.
    info.sort()
    for prob, (sub, date, msgid, cls) in info:
        print()
        print(sub)
        print(date)
        print(msgid)
        print(cls)

SkipMontanaro
