1. MultiWayClassification
Having grown to appreciate automatic spam filtering, first via SpamAssassin and now via SpamBayes, I want the same abilities for email classification. This page records what I know about n-way classification, using SpamBayes and other Bayesian tools, and raises some issues.
1.1. Using SpamBayes for MultiWayClassification
- Look at the nway.py script
- Look at POPFile
Most n-way work assumes n-exclusive-classifications, e.g. for sorting email into folders. I am interested in non-exclusive tagging, where multiple tags can apply to the same message.
- Naively, it looks as if maintaining a SpamBayes database for each keyword tag might work.
- Trying to figure out how to take advantage of multiple tags to further refine classification:
  - some tags are exclusive, e.g. spam => not work related
  - thinking about a second pass given a preliminary classification
That is what nway.py does. But since SpamBayes is already slow (at least on my laptop, under Cygwin), I am investigating using a single database.
Q: in what ways is SpamBayes strictly oriented towards spam? Probably in certain of the tokenizations... Would it harm MultiWayClassification to use the same mechanism?
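The naive multiple-database approach above can be sketched in a few lines of Python. This is illustrative only: BinaryTagClassifier and its crude probability combining are stand-ins, not the real SpamBayes classes or its chi-squared combining.

```python
# Sketch: one binary (tag / not-tag) classifier per keyword tag, as
# contrib/nway.py does with one SpamBayes database per class.
from collections import Counter

class BinaryTagClassifier:
    def __init__(self):
        self.tag_counts = Counter()     # token -> messages with the tag
        self.nontag_counts = Counter()  # token -> messages without it
        self.tag_msgs = 0
        self.nontag_msgs = 0

    def train(self, tokens, tagged):
        counts = self.tag_counts if tagged else self.nontag_counts
        for tok in set(tokens):
            counts[tok] += 1
        if tagged:
            self.tag_msgs += 1
        else:
            self.nontag_msgs += 1

    def score(self, tokens):
        # Naive-Bayes-style combination of smoothed per-token probabilities
        # (SpamBayes itself uses chi-squared combining instead).
        p = q = 1.0
        for tok in set(tokens):
            pt = (self.tag_counts[tok] + 1) / (self.tag_msgs + 2)
            pn = (self.nontag_counts[tok] + 1) / (self.nontag_msgs + 2)
            prob = pt / (pt + pn)
            p *= prob
            q *= 1.0 - prob
        return p / (p + q)

def suggest_tags(classifiers, tokens, threshold=0.8):
    # Non-exclusive tagging: every tag's classifier runs independently,
    # so a message can come back with zero, one, or many tags.
    return [tag for tag, c in classifiers.items() if c.score(tokens) > threshold]
```

Training N such classifiers means tokenizing and updating N databases per message, which is exactly the slowness concern above.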
1.1.1. More SpamBayes community discussion about MultiWayClassification
As already mentioned:
- Look at the nway.py script
- Look at POPFile
The Spambayes mailing lists contain several such discussions, such as:
- Anssi Porttikivi, http://mail.python.org/pipermail/spambayes-dev/2003-September/001251.html, asks how to do not just n-way classification, but to also automatically discover the clusters. One reply admonishes "look up the literature on clustering". GLEW COMMENT:
  - I don't need automatically discovering the clusters at the moment. I'm more interested in non-exclusive classification; or, possibly, a fixed set of classes, but automatic discovery of which classifications are exclusive and which are nonexclusive.
  - Cluster analysis pretty much implies exclusive classifications. Non-exclusive clustering is an interesting concept... I'm not familiar with it having been done, but IANAPS (I am not a professional statistician).
  - You'd do typical clustering using some technique. You might then try to decompose the clusters found into overlapping convex regions.
  - This probably means that you would not want to use an initial clustering algorithm that implicitly assumes convexity. Some do so explicitly, e.g. calculating convex hulls, but some assume convexity of the clusters implicitly, e.g. KMeans.
  - I.e. you would probably want to use a metric that includes proximity to any member of the cluster as well as proximity to a centroid.
  - Come to think of it, drawing convex hulls around clusters found via other means is a good way of determining membership in multiple clusters.
- Richard Prosser asks about using the spambayes classifier for other than email, in http://mail.python.org/pipermail/spambayes-bugs/2004-March/002016.html. It is agreed that this can and has been done, but that the tokenizer is email targeted. The discussion also touches on n-way.
  - Kenny Pitt replies: "Also keep in mind that SpamBayes is only designed to support two categories. It doesn't matter what the two categories are as long as you give it the training data to define the characteristics of each. However, if you need to send things to 3 or more categories as POPFile does then the SpamBayes code won't support that."
  - GLEW COMMENT:
    - I'm trying to figure out in exactly what ways SpamBayes is limited to 2 categories. And whether that is intrinsic, fundamental, or just a property of the code.
    - E.g. the database currently supports two scores, ham-count and spam-count. Why could it not be extended to support more?
    - Are the statistical tests specifically two-way? And even if so, why is that a problem for n-way-nonexclusive classification?
    - I suspect that exclusive n-way classification requires a completely different algorithm.
    - I suspect that I would want to use the SpamBayes tokenizer.
    - I suspect that I would want to use the SpamBayes chi-squared code.
    - I'd probably like to benefit from any changes to the above that SpamBayes makes - but I don't terribly care about the Windows/Outlook interface issues.
    - I have EMACS Gnus keys bound to train SpamBayes ham and spam interactively, and they are already so damnably slow that I really need to make them train offline, in the background.
    - I envisage the n-way classification user interface as being interactive:
      - I press a key, EMACS suggests to me a list of keywords to be tagged to the message and/or folders for the message to be saved into,
      - I edit the suggestions,
      - and train immediately.
    - True, this may be premature optimization... And, basically, I suspect that training N different databases is ...
- Gerritt Holl asked about n-way again in http://mail.python.org/pipermail/spambayes/2003-October/008904.html
  - Bill Yerazunis says "Done-and-tested in CRM114. ... I'm not using it personally, but it was a feature request and the user made happy gushy noises after he tested it."
    - I'm not sure what CRM114 is (a release? just nway?)
  - Janne Sinkkonen discusses, with Skip Montanaro (the original author of nway.py), how Janne hacked nway a bit to predict what emails he was going to reply to: http://mail.python.org/pipermail/spambayes/2003-October/008955.html
  - Janne also says "It would be nice if several filters, n-way or two-way, could be configured in an integrated fashion." So I'm not the only person to think about it.
1.1.2. Binary vs. MultiWayClassification
SpamBayes returns a ternary result for a binary scale. I.e. we want spam/ham, or spam/non-spam. SpamBayes delivers spam/unsure/non-spam.
With non-exclusive MultiWayClassification, how should this be represented?
- We probably still want spam/unsure/non-spam.
- But it might be perfectly correct for a message to be tagged with both computer-hardware AND compilers.
- Essentially, each keyword is on a binary scale tag/non-tag.
This raises issues both with the user interface and with the statistical model.
It appears to me that SpamBayes' chief advantage over other Bayesian spam filters is that it scores "spam" and "ham" separately and independently. A message can have a high spam score and a low ham score, vice versa, or it could have both a high spam score and a high ham score. Or both low.
Q: is the SpamBayes scoring specifically binary spam/ham? I.e. does the chi-square test answer a specifically binary question? Or is it comparing two independent scores, ham and spam? Or could the Spambayes database be extended to handle multiple tags?
If the latter, it may be possible to persuade the Spambayes code to answer questions like
- Binary Exclusive Test: "tell me if this message is more like tag1 or tag2", where spam/ham is just a particular setting.
- N-way Exclusive Test: "tell me if this message is more like tag1 or tag2 or tag3"
- 1-way Test: "does the *score* for this message wrt tagX exceed some threshold?" Does the "scoring" require comparison to a different tag/metric?
I.e. I am trying to ask whether MultiWayClassification with SpamBayes really needs a database with tagN/non-tagN scores, one such pair for each tag, i.e.
token | spam | ham | tag1 | non-tag1 | tag2 | not-tag2 |
or
token | spam | ham | tag1 | tag2 |
I ask this question in part because I do not see the value of tag1/unsure/non-tag1 in a nonexclusive multiway classification.
- I'm perfectly happy with spam/unsure/ham.
- For exclusive n-way sorting, folder1/folder2/folder3/.../folderN/unsure is fine; it corresponds to having an "unsorted" folder that you need to check.
But if I am just doing non-exclusive tagging, I think the user interface would just suggest keywords that exceeded a threshold - a 1-way test for each. I am having trouble imagining what the user interface for "unsure" would look like; perhaps you would sort the keywords by score.
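A sketch of what that per-tag 1-way test with sorted suggestions might look like. The 0.20/0.90 cutoff values below are illustrative placeholders, not SpamBayes' actual configured defaults.

```python
# Sketch: per-tag ternary results, SpamBayes-style, for non-exclusive tagging.
def classify_tag(score, ham_cutoff=0.20, spam_cutoff=0.90):
    if score >= spam_cutoff:
        return "tag"
    if score <= ham_cutoff:
        return "non-tag"
    return "unsure"

def suggest(scores, ham_cutoff=0.20, spam_cutoff=0.90):
    # Drop confident non-tags, and sort the rest by score so the UI can
    # present confident tags first and "unsure" tags as weaker suggestions.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(t, scores[t], classify_tag(scores[t], ham_cutoff, spam_cutoff))
            for t in ranked
            if classify_tag(scores[t], ham_cutoff, spam_cutoff) != "non-tag"]
```

The user would then edit this ranked list, rather than being shown a single spam/unsure/ham verdict.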
1.2. Original
I originally posted this on FrontPage, but then decided to refactor into a separate discussion. Here's the original discussion:
Is this supposed to be a WishList? If so, then... I wish there was a way to use Spambayes to do N-way sorting, not just spam/ham, but spam/ham/proprietary/personal/public/... I vaguely remember seeing this before, but haven't had much luck finding it now I need it. -- AndyGlew
No, this isn't really a good place to put feature requests. Suggestions about improvements for the classification system are ok, but if you have a feature request then it would be much better to put it on Sourceforge. SpamBayes is not designed to be an n-way classifier. If you want to try it anyway, look at the n-way.py script in the contrib directory of the source release. If you want n-way classified mail, then I recommend POPfile. --TonyMeyer
Fair enough... I expect you'll delete this soon, although it might be appropriate to refactor this into a separate wiki page on n-way - maybe MultiWayClassification. My old spambayes installation did not have n-way.py, so I'll update and look. POPFile does not seem to be what I want: it does n-exclusive-ways, whereas I am interested in suggesting n nonexclusive keywords for classification (like the keywords on gmail). Sourceforge seems to contain just an issue tracking system; I am more interested in discussion about the issues involved in MultiWayClassification. -- AndyGlew
I was wrong. I found contrib/nway.py. It does the "obvious" thing of using multiple databases... -- AndyGlew
1.3. Multiway Email Sorting
The following I originally posted on my blog, which is on a company intranet and not externally visible.
I am getting tired of sorting my email into folders. I have grown dependent on automatic spam filtering via spambayes... I want the same for sorting.
I could very easily go over to search based email handling, e.g. using Google desktop. But:
- Google desktop doesn't handle my saved MH email (one day I should write the plugin).
- On gmail I find the "classification tag" feature useful...
- I've learned the hard way that I absolutely, absolutely, have to sort my email into:
  - proprietary,
  - personal, as well as
  - public (company related, but not proprietary).
- And I also like:
  - technical, related to computer architecture,
  - other: politics, personal, fun, shopping, etc.
- I.e. search based is good, but classifications help - but I want to automate the generation of such tags.
Spambayes is binary, spam/ham. I've heard about people doing it N-way, but cannot find any examples right now.
I can imagine the N-way being done:
- linearly, with a set order - e.g. first filter out spam, then everything else,
- as N independent binary tests (see discussion of combinations below),
- but the priority order matters.
POPFile does naive bayesian N-way. But it is N exclusive ways.
It occurs to me that if I am applying classification tags they are not mutually exclusive. Therefore, running something like a Spambayes binary test N times, suggesting N tags, is not unreasonable. Since spambayes reports high-probability-yes, high-probability-no, unsure, I could imagine an interface that reported all tags, and allowed the user to correct them.
Some tags are mutually exclusive, others not. E.g. "spam" probably isn't associated with any ham tags - there isn't much spam that is also comp-arch-technical. However... I *do* enjoy Nigerian fraud emails, so spam spam-nigerian might be amusing.
Basically, I am wondering about how to handle various combinations of tags. I want to support multiple tags, but I want as few extraneous tag combinations as possible.
e.g. possibly train a final filter on the tags suggested by the initial filter. If the initial filter suggested (spam,technical-comp-arch), then the final filter might map this to (spam) only. But if (spam,family-recreation) usually means that it is family related, then it might elide the spam and map to (family-recreation).
i.e. we might be able to use Bayesian learning to automatically figure out which combinations of tags are meaningful.
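A minimal sketch of such a second-pass "composite" filter. Here it is just a frequency table from the tag set suggested by the first pass to the most commonly corrected final tag set, rather than a full Bayesian model; a real version might smooth counts or back off to subsets. All names are hypothetical.

```python
# Sketch: learn, from user corrections, which first-pass tag combinations
# should map to which final tag sets (e.g. (spam, technical-comp-arch) -> (spam)).
from collections import Counter, defaultdict

class CompositeFilter:
    def __init__(self):
        # suggested tag set -> Counter of corrected tag sets seen in training
        self.table = defaultdict(Counter)

    def train(self, suggested, corrected):
        self.table[frozenset(suggested)][frozenset(corrected)] += 1

    def refine(self, suggested):
        seen = self.table.get(frozenset(suggested))
        if not seen:
            return set(suggested)  # no evidence: pass suggestions through
        # Map to the final tag set most often confirmed by the user.
        return set(seen.most_common(1)[0][0])
```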
The various spam filtering tools like Spambayes and POPFile integrate with mailers in many different ways. E.g. POPFile is a POP server. Spambayes is more of a library that is integrated into proxy servers, but also into email tools that filter various folders, etc.
Maybe what I really want is a generic tool that maps a file containing an email to a classification, returning the strings of the tags. I could then use this as a suggestion in Gnus under EMACS, and edit out the unwanted tags.
Given such suggestions, various other tools could place them in mail headers, or in the subject line, or whatever.
Basic structure:
- Apply N independent Bayesian filters.
- Take that tuple, and apply it to a composite filter on the tags.
- Separate spam, and possible spam.
- Ham handling:
  - Suggest this tag tuple to the user. Allow user to edit.
  - Retrain if different from proposed initial and composite filters.
- Spam handling:
  - Moved into a separate folder.
  - User checks later.
  - Same sort of retraining, if user wants classified differently.
ISSUE: this implies 2N separate databases: one for each tag, on the raw message, and one for each tag, given the classifications. Since Spambayes is already slow, maybe it would be better to have a unified database, especially since all probably use the same features (tokens).
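A sketch of what such a unified database might look like: one row per token, with a (tagged, untagged) count pair for every keyword tag, so a single tokenization pass updates all tags at once. The schema and names are hypothetical, not the SpamBayes storage format.

```python
# Sketch: unified per-token count table replacing N separate databases.
from collections import defaultdict

class UnifiedTagDB:
    def __init__(self, tags):
        self.tags = list(tags)
        # token -> {tag: [tagged_count, untagged_count]}
        self.rows = defaultdict(lambda: {t: [0, 0] for t in self.tags})
        # per-tag message totals, same [tagged, untagged] layout
        self.msgs = {t: [0, 0] for t in self.tags}

    def train(self, tokens, applied_tags):
        # One tokenization pass per message, however many tags exist --
        # the hoped-for speedup over N independent SpamBayes databases.
        applied = set(applied_tags)
        for t in self.tags:
            self.msgs[t][0 if t in applied else 1] += 1
        for tok in set(tokens):
            row = self.rows[tok]
            for t in self.tags:
                row[t][0 if t in applied else 1] += 1
```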
1.3.1. Usage Model
I used to use procmail rules to sort my mail into different folders, and then use gnus to take me to folders by different priorities.
I stopped doing this with Outlook, because Outlook mail sorting rules are so broken. Now that I have abandoned Outlook I could fall back to procmail, but I am still reluctant to maintain a ruleset. Hence the blurb about using N-way spambayes sorting.
When I sorted all incoming mail into folders, the folder I was in suggested what folder any mail I generated should go to. I.e. my replies got stored in the same folder as the mail I was replying to.
That's still valid. Unfortunately, right now what I do is apply spam/ham classification, but receive all mail into my MH +inbox. So all of my replies stay in my inbox. And sorting out my replies is even more annoying than classifying incoming mail.
I suppose the general rule still applies: classify a reply like the email it is replying to. But I still probably want to generate content based classification for email I send. I mean, how often does a proprietary discussion turn into an invitation for lunch?
For email that I generate I can generate the classification tags immediately, and edit them. I don't want to do much of this manual editing, but it would be tolerable if the suggestions were close enough.
I also record a lifestream of all email. Duplicating emails in different folders is unfortunate. A unified store, with tags, would be nice, so long as the proprietary/nonproprietary distinction were maintained.
2. Full vs Partial Populations
Thinking about this has led me to the following conclusion: only one "column", or set of counts, is necessary for a particular tag, if that tag classification has been performed for the entire population, and if all messages not tagged are, well, not in the classification.
In this case, probabilities would be calculated by taking any of the (feature, tag) counts and dividing by the total size of the population, which is separately known.
This might be okay if you establish your classification scheme for email at the beginning of time, and have run all email through it. However, if you are like me your classification scheme evolves; you create new categories and discard old ones. So, although you may have a history of 100,000 messages, there may be only 10 that have been trained wrt the new classification tag.
To handle such evolving classification schemes, one must have explicit tag and not-tag counts (and assume unknown for features that have neither tag nor not-tag associated with them). Perhaps, when the size of the total population and the size of the population for which tag/not-tag training has been done approach each other, the not-tag counting can be discarded. More importantly, when the populations almost completely overlap, more queries can be supported.
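The two cases can be sketched directly: with full coverage the not-tag counts are derivable as (total minus tagged), while with partial coverage they must be stored explicitly. The add-one smoothing below is illustrative, not what SpamBayes does.

```python
# Sketch: P(token | tag) and P(token | not-tag) under full vs partial coverage.
def token_probs(tag_count, tag_msgs, nontag_count=None, nontag_msgs=None,
                total_count=None, total_msgs=None):
    if nontag_count is None:
        # Full-coverage case: every untagged message counts as not-tag,
        # so not-tag counts are derivable from the known population totals.
        nontag_count = total_count - tag_count
        nontag_msgs = total_msgs - tag_msgs
    p_tag = (tag_count + 1) / (tag_msgs + 2)           # P(token | tag)
    p_nontag = (nontag_count + 1) / (nontag_msgs + 2)  # P(token | not-tag)
    return p_tag, p_nontag
```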
Given tag/not-tag feature classification, one can decide tag/not-tag.
Given tag1/not-tag1 and tag2/not-tag2 feature classifications, one can do tag1/tag2 classification, but one must make assumptions:
- one can assume that the populations are completely non-overlapping;
- one can assume that the tags are mutually exclusive;
- or one can assume that the tags are independent, the tags not mutually exclusive, in which case it is really two different tag1/not-tag1 and tag2/not-tag2 classifications,
  - possibly with the requirement that at least one of tag1/not-tag1/tag2/not-tag2 must be asserted, making it a multiway classification - i.e. the multiwayness acting as a tie-breaker.
The basic problem with doing such tag1/tag2 multiway tests, if we do not know to what degree the populations overlap, is that we do not know how independent the variables are: we do not know the size of the population that is tagged tag1*tag2.
We can imagine that the database storing the feature=>classification counts might be augmented to track the various population sizes:
- e.g. the count of messages that were just subject to spam/not-spam classification,
- the count that were subject to spam/not-spam and proprietary classification,
- the count that were subject to spam/not-spam and tag1/tag2/... classification.
If these counts are all available, then when we do a query or test involving tagI and tagJ we will have the population sizes necessary. Assuming stationary distributions would allow us to figure out the appropriate feature=>classification weights.
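Such population-size bookkeeping can be sketched as a table keyed by the set of tags each batch of messages was classified under. The scheme keys and counts below are made up for illustration.

```python
# Sketch: per-scheme population sizes, so a tagI/tagJ query knows how many
# messages were actually trained under both classifications.
populations = {
    frozenset(["spam"]): 100_000,                # spam/not-spam only
    frozenset(["spam", "proprietary"]): 12_000,  # both classifications
    frozenset(["spam", "tag1", "tag2"]): 10,     # the new, partial scheme
}

def shared_population(tags):
    # Messages usable for a query over `tags`: every batch whose
    # classification scheme covered all of the queried tags.
    return sum(n for scheme, n in populations.items() if set(tags) <= scheme)
```

shared_population(["tag1", "tag2"]) then tells you directly how small the overlap population is, which is exactly the "we do not know the size of the population tagged tag1*tag2" problem above.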