RescoreOldSpam

Due to poor training (for a long time I was training only on mislabeled spam), I got to the point where at least one ham fell into the gigantic sea of spam (103k msgs) *and* plenty of spam was still landing in my inbox. I've reset my db and set things up so that I now train on correctly classified ham, unsures, and misclassified spam (i.e. any msgs I see).

Anyway, here's the quick hack that gets me new scores for mountains of messages. I have started running it on all the spam, and I'll take a look at the low-scored messages when it finishes.

No, I didn't really know what I was doing, as you can see where I re-extract the header from the filtered message instead of using some API that would return the message object or even the raw score.
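For what it's worth, the header re-extraction could also be done with Python's standard email parser rather than string searching. This is only a sketch in Python 3, not part of the original script: the function name `classification_from_filtered` and the sample message are made up for illustration, and it assumes the filtered message comes back as raw bytes with the X-Spambayes-Classification header already added.

```python
from email import message_from_bytes

# Header that the filter adds to each message (name taken from the script below).
HEADER = "X-Spambayes-Classification"


def classification_from_filtered(filtered_bytes):
    """Return the classification header value, or None if it is absent.

    Hypothetical helper: parses the filtered message instead of searching
    the raw text, so a matching string in the body is never picked up.
    """
    msg = message_from_bytes(filtered_bytes)
    return msg.get(HEADER)


# Made-up sample message, for illustration only.
sample = (
    b"From: someone@example.com\n"
    b"X-Spambayes-Classification: spam; 1.00\n"
    b"Subject: hello\n"
    b"\n"
    b"The body may mention X-Spambayes-Classification: ham; 0.00 too.\n"
)

print(classification_from_filtered(sample))
```

Unlike a plain index lookup, this returns None when the header is missing instead of raising an exception or matching text in the body.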

This runs at about 10 msgs/sec on my dual P2/400. I expect it to finish in 2h, whereas my former shell loop that ran sb_filter.py on each msg was looking like it would take 33h.

Enjoy-

Drew, drewp@bigasterisk.com

rescanner:

#!/usr/local/bin/python

"""
for efficiency, we'll be talking to a running xmlrpc server, which you
could start up like this:

sb_xmlrpcserver.py -d  0.0.0.0:65000

feed this program filenames like this:

find bayes/ -type f  | rescanner

to get output like this (filename, then the value of the
classification header):

bayes/2003q1/11 spam; 1.00
bayes/2003q1/12 ham; 0.15
bayes/2003q1/13 spam; 1.00


"""

import xmlrpclib
import sys

RPCBASE="http://localhost:65000"

serv = xmlrpclib.ServerProxy(RPCBASE)


def score_message(msg):
   """returns the value for the X-Spambayes-Classification header
   that gets added to this message after we filter it (with your
   current db)"""
   global serv
   filtered_msg = serv.filter(xmlrpclib.Binary(msg)) # returns as string :(
   filtered_msg=filtered_msg.data
   key='X-Spambayes-Classification: '
   i=filtered_msg.index(key)
   return filtered_msg[i+len(key):filtered_msg.index('\n',i)]


for filename in sys.stdin.xreadlines():
   filename=filename.rstrip("\n")
   print filename, score_message(file(filename).read())
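To pick out the low-scored messages afterwards, something like the following could be run over the saved output. It is a rough Python 3 sketch rather than anything I actually ran: the names `parse_result_line` and `low_scored` and the 0.90 threshold are arbitrary choices for illustration, and it assumes every output line looks like the examples in the docstring above (filename, label with a trailing semicolon, then the score).

```python
def parse_result_line(line):
    """Split 'filename label; score' into (filename, label, score).

    Splitting from the right keeps filenames that contain spaces intact.
    """
    path, label, score = line.rstrip("\n").rsplit(None, 2)
    return path, label.rstrip(";"), float(score)


def low_scored(lines, threshold=0.90):
    """Yield (filename, label, score) for lines scoring below threshold.

    The threshold default is an assumption; pick whatever cutoff you use.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        path, label, score = parse_result_line(line)
        if score < threshold:
            yield path, label, score


# Sample lines copied from the docstring of the script above.
sample_output = [
    "bayes/2003q1/11 spam; 1.00\n",
    "bayes/2003q1/12 ham; 0.15\n",
    "bayes/2003q1/13 spam; 1.00\n",
]

for path, label, score in low_scored(sample_output):
    print(path, label, score)
```

With real output you would pass an open file (or sys.stdin) instead of the sample list.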
