Anyway, here's a quick hack that lets me get the new scores for mountains of messages in a reasonable time. I have started running this on all the spam, and I'll be taking a look at the low-scored messages when it finishes running.
No, I didn't really know what I was doing, as you can see where I re-extract the header from the filtered text instead of using some API that would return the message object or even the raw score.
This runs at about 10 msgs/sec on my dual P2/400. I expect it to finish in 2h, whereas my former shell loop that ran sb_filter.py on each msg was looking like it would take 33h.
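As a back-of-envelope check on those figures, the sketch below takes the numbers at face value. It assumes both time estimates cover the same set of messages and that the 10 msgs/sec rate holds for the full 2 hours; neither assumption is stated in the post, and the variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600

new_rate = 10.0    # msgs/sec via the XML-RPC server (figure from the post)
new_hours = 2.0    # expected runtime of the new approach (from the post)
old_hours = 33.0   # projected runtime of the old shell loop (from the post)

# Assumption: both estimates refer to the same pile of messages.
messages = new_rate * new_hours * SECONDS_PER_HOUR
old_rate = messages / (old_hours * SECONDS_PER_HOUR)
speedup = old_hours / new_hours

print("implied corpus size: %d messages" % messages)
print("implied old rate: %.2f msgs/sec" % old_rate)
print("speedup: %.1fx" % speedup)
```

Under those assumptions the corpus is on the order of 72,000 messages, and the old per-message invocation of sb_filter.py was managing well under one message per second.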
Enjoy-
Drew, drewp@bigasterisk.com
rescanner:
#!/usr/local/bin/python
"""
for efficiency, we'll be talking to a running xmlrpc server, which you
could start up like this:

    sb_xmlrpcserver.py -d 0.0.0.0:65000

feed this program filenames like this:

    find bayes/ -type f | rescanner

to get output like this (filename, then the value of the
classification header):

    bayes/2003q1/11 spam; 1.00
    bayes/2003q1/12 ham; 0.15
    bayes/2003q1/13 spam; 1.00
"""
import xmlrpclib
import sys
RPCBASE="http://localhost:65000"
serv = xmlrpclib.ServerProxy(RPCBASE)
def score_message(msg):
    """returns the value for the X-Spambayes-Classification header
    that gets added to this message after we filter it (with your
    current db)"""
    global serv
    filtered_msg = serv.filter(xmlrpclib.Binary(msg)) # returns as string :(
    filtered_msg = filtered_msg.data
    key = 'X-Spambayes-Classification: '
    i = filtered_msg.index(key)
    return filtered_msg[i+len(key):filtered_msg.index('\n',i)]

for filename in sys.stdin.xreadlines():
    filename = filename.rstrip("\n")
    print filename, score_message(file(filename).read())
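
The script above is Python 2 and finds the header by string search. For comparison, here is a minimal Python 3 sketch of the same idea that parses the filtered text with the standard `email` module instead. It is not part of the original post. It assumes, as the script above does, that the server exposes `filter()` and returns the filtered message wrapped in an XML-RPC `Binary` value. The helper names (`extract_classification`, `make_filter`, `rescan`) are made up for illustration.

```python
import email
import xmlrpc.client

# Header name taken from the script above.
HEADER = "X-Spambayes-Classification"


def extract_classification(filtered_text):
    """Return the classification header value, or None if it is absent."""
    msg = email.message_from_string(filtered_text)
    return msg[HEADER]


def make_filter(url="http://localhost:65000"):
    """Build a callable that sends raw message bytes to the XML-RPC server.

    Assumption: the server's filter() accepts and returns a Binary value,
    as the original script implies. No connection is made until the
    returned function is actually called.
    """
    serv = xmlrpc.client.ServerProxy(url)

    def filter_message(raw_bytes):
        result = serv.filter(xmlrpc.client.Binary(raw_bytes))
        # latin-1 maps every byte to a character, so decoding cannot fail.
        return result.data.decode("latin-1")

    return filter_message


def rescan(filenames, filter_message):
    """Yield (filename, classification) for each message file."""
    for filename in filenames:
        with open(filename, "rb") as f:
            raw = f.read()
        yield filename, extract_classification(filter_message(raw))
```

To reproduce the original behaviour, one could loop over `rescan((line.rstrip("\n") for line in sys.stdin), make_filter())` and print each pair. Passing the filter in as an argument also makes it possible to try the header parsing against a stub without a running server. Unlike the `index()` search, a missing header yields `None` rather than raising `ValueError`.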
