1. MultiWayClassification
Having grown to appreciate automatic spam filtering, first via SpamAssassin and now via SpamBayes, I want the same abilities for email classification. This page records what I know about n-way classification, using SpamBayes and other Bayesian tools, and raises some issues.
1.1. Using SpamBayes for MultiWayClassification
- Look at the nway.py script
- Look at POPFile
Most n-way work assumes n-exclusive-classifications, e.g. for sorting email into folders. I am interested in non-exclusive tagging, where multiple tags can apply to the same message.
- Naively, it looks as if maintaining a SpamBayes database for each keyword tag might work.
- Trying to figure out how to take advantage of multiple tags to further refine classification:
  - some tags are exclusive, e.g. spam => not work related
  - thinking about a second pass given a preliminary classification
That is what nway.py does. But since SpamBayes is already slow (at least on my laptop, under Cygwin), I am investigating using a single database.
Q: in what ways is SpamBayes strictly oriented towards spam? Probably in certain of the tokenizations... Would it harm MultiWayClassification to use the same mechanism?
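The naive multiple-database approach above can be sketched in a few lines of Python. This is illustrative only: BinaryTagClassifier and its crude probability combining are stand-ins, not the real SpamBayes classes or its chi-squared combining.

```python
# Sketch: one binary (tag / not-tag) classifier per keyword tag, as
# contrib/nway.py does with one SpamBayes database per class.
from collections import Counter

class BinaryTagClassifier:
    def __init__(self):
        self.tag_counts = Counter()     # token -> messages with the tag
        self.nontag_counts = Counter()  # token -> messages without it
        self.tag_msgs = 0
        self.nontag_msgs = 0

    def train(self, tokens, tagged):
        counts = self.tag_counts if tagged else self.nontag_counts
        for tok in set(tokens):
            counts[tok] += 1
        if tagged:
            self.tag_msgs += 1
        else:
            self.nontag_msgs += 1

    def score(self, tokens):
        # Naive-Bayes-style combination of smoothed per-token probabilities
        # (SpamBayes itself uses chi-squared combining instead).
        p = q = 1.0
        for tok in set(tokens):
            pt = (self.tag_counts[tok] + 1) / (self.tag_msgs + 2)
            pn = (self.nontag_counts[tok] + 1) / (self.nontag_msgs + 2)
            prob = pt / (pt + pn)
            p *= prob
            q *= 1.0 - prob
        return p / (p + q)

def suggest_tags(classifiers, tokens, threshold=0.8):
    # Non-exclusive tagging: every tag's classifier runs independently,
    # so a message can come back with zero, one, or many tags.
    return [tag for tag, c in classifiers.items() if c.score(tokens) > threshold]
```

Training N such classifiers means tokenizing and updating N databases per message, which is exactly the slowness concern above.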
1.1.1. More SpamBayes community discussion about MultiWayClassification
As already mentioned:
- Look at the nway.py script
- Look at POPFile
The Spambayes mailing lists contain several such discussions, such as:
- Anssi Porttikivi, http://mail.python.org/pipermail/spambayes-dev/2003-September/001251.html, asks how to do not just n-way classification, but to also automatically discover the clusters. One reply admonishes "look up the literature on clustering". GLEW COMMENT:
  - I don't need automatically discovering the clusters at the moment. I'm more interested in non-exclusive classification; or, possibly, a fixed set of classes, but automatic discovery of which classifications are exclusive and which are nonexclusive.
  - Cluster analysis pretty much implies exclusive classifications. Non-exclusive clustering is an interesting concept... I'm not familiar with it having been done, but IANAPS (I am not a professional statistician).
  - You'd do typical clustering using some technique. You might then try to decompose the clusters found into overlapping convex regions.
  - This probably means that you would not want to use an initial clustering algorithm that implicitly assumes convexity. Some do so explicitly, e.g. calculating convex hulls, but some assume convexity of the clusters implicitly, e.g. KMeans.
  - I.e. you would probably want to use a metric that includes proximity to any member of the cluster as well as proximity to a centroid.
  - Come to think of it, drawing convex hulls around clusters found via other means is a good way of determining membership in multiple clusters.
- Richard Prosser asks about using the spambayes classifier for other than email, in http://mail.python.org/pipermail/spambayes-bugs/2004-March/002016.html. It is agreed that this can and has been done, but that the tokenizer is email targeted. The discussion also touches on n-way.
  - Kenny Pitt replies: "Also keep in mind that SpamBayes is only designed to support two categories. It doesn't matter what the two categories are as long as you give it the training data to define the characteristics of each. However, if you need to send things to 3 or more categories as POPFile does then the SpamBayes code won't support that."
  - GLEW COMMENT:
    - I'm trying to figure out in exactly what ways SpamBayes is limited to 2 categories. And whether that is intrinsic, fundamental, or just a property of the code.
    - E.g. the database currently supports two scores, ham-count and spam-count. Why could it not be extended to support more?
    - Are the statistical tests specifically two-way? And even if so, why is that a problem for n-way-nonexclusive classification?
    - I suspect that exclusive n-way classification requires a completely different algorithm.
    - I suspect that I would want to use the SpamBayes tokenizer.
    - I suspect that I would want to use the SpamBayes chi-squared code.
    - I'd probably like to benefit from any changes to the above that SpamBayes makes - but I don't terribly care about the Windows/Outlook interface issues.
    - I have EMACS Gnus keys bound to train SpamBayes ham and spam interactively, and they are already so damnably slow that I really need to make them train offline, in the background.
    - I envisage the n-way classification user interface as being interactive:
      - I press a key, EMACS suggests to me a list of keywords to be tagged to the message and/or folders for the message to be saved into,
      - I edit the suggestions,
      - and train immediately.
    - True, this may be premature optimization... And, basically, I suspect that training N different databases is ...
- Gerritt Holl asked about n-way again in http://mail.python.org/pipermail/spambayes/2003-October/008904.html
  - Bill Yerazunis says "Done-and-tested in CRM114. ... I'm not using it personally, but it was a feature request and the user made happy gushy noises after he tested it."
    - I'm not sure what CRM114 is (a release? just nway?)
  - Janne Sinkkonen discusses, with Skip Montanaro (the original author of nway.py), how Janne hacked nway a bit to predict what emails he was going to reply to: http://mail.python.org/pipermail/spambayes/2003-October/008955.html
  - Janne also says "It would be nice if several filters, n-way or two-way, could be configured in an integrated fashion." So I'm not the only person to think about it.
1.1.2. Binary vs. MultiWayClassification
SpamBayes returns a ternary result for a binary scale. I.e. we want spam/ham, or spam/non-spam. SpamBayes delivers spam/unsure/non-spam.
With non-exclusive MultiWayClassification, how should this be represented?
- We probably still want spam/unsure/non-spam.
- But it might be perfectly correct for a message to be tagged with both computer-hardware AND compilers.
- Essentially, each keyword is on a binary scale tag/non-tag.
This raises issues both with the user interface and with the statistical model.
It appears to me that SpamBayes' chief advantage over other Bayesian spam filters is that it scores "spam" and "ham" separately and independently. A message can have a high spam score and a low ham score, vice versa, or it could have both a high spam score and a high ham score. Or both low.
Q: is the SpamBayes scoring specifically binary spam/ham? I.e. does the chi-square test answer a specifically binary question? Or is it comparing two independent scores, ham and spam? Or could the Spambayes database be extended to handle multiple tags?
If the latter, it may be possible to persuade the Spambayes code to answer questions like
- Binary Exclusive Test: "tell me if this message is more like tag1 or tag2", where spam/ham is just a particular setting.
- N-way Exclusive Test: "tell me if this message is more like tag1 or tag2 or tag3"
- 1-way Test: "does the *score* for this message wrt tagX exceed some threshold?" Does the "scoring" require comparison to a different tag/metric?
I.e. I am trying to ask whether MultiWayClassification with SpamBayes really needs a database with tagN/non-tagN scores, one such pair for each tag, i.e.
token | spam | ham | tag1 | non-tag1 | tag2 | not-tag2 |
or
token | spam | ham | tag1 | tag2 |
I ask this question in part because I do not see the value of tag1/unsure/non-tag1 in a nonexclusive multiway classification.
- I'm perfectly happy with spam/unsure/ham.
- For exclusive n-way sorting, folder1/folder2/folder3/.../folderN/unsure is fine; it corresponds to having an "unsorted" folder that you need to check.
But if I am just doing non-exclusive tagging, I think the user interface would just suggest keywords that exceeded a threshold - a 1-way test for each. I am having trouble imagining what the user interface for "unsure" would look like; perhaps you would sort the keywords by score.
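A sketch of what that per-tag 1-way test with sorted suggestions might look like. The 0.20/0.90 cutoff values below are illustrative placeholders, not SpamBayes' actual configured defaults.

```python
# Sketch: per-tag ternary results, SpamBayes-style, for non-exclusive tagging.
def classify_tag(score, ham_cutoff=0.20, spam_cutoff=0.90):
    if score >= spam_cutoff:
        return "tag"
    if score <= ham_cutoff:
        return "non-tag"
    return "unsure"

def suggest(scores, ham_cutoff=0.20, spam_cutoff=0.90):
    # Drop confident non-tags, and sort the rest by score so the UI can
    # present confident tags first and "unsure" tags as weaker suggestions.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(t, scores[t], classify_tag(scores[t], ham_cutoff, spam_cutoff))
            for t in ranked
            if classify_tag(scores[t], ham_cutoff, spam_cutoff) != "non-tag"]
```

The user would then edit this ranked list, rather than being shown a single spam/unsure/ham verdict.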
1.2. Original
I originally posted this on FrontPage, but then decided to refactor into a separate discussion. Here's the original discussion:
Is this supposed to be a WishList? If so, then... I wish there was a way to use Spambayes to do N-way sorting, not just spam/ham, but spam/ham/proprietary/personal/public/... I vaguely remember seeing this before, but haven't had much luck finding it now I need it. -- AndyGlew
No, this isn't really a good place to put feature requests. Suggestions about improvements for the classification system are ok, but if you have a feature request then it would be much better to put it on Sourceforge. SpamBayes is not designed to be an n-way classifier. If you want to try it anyway, look at the n-way.py script in the contrib directory of the source release. If you want n-way classified mail, then I recommend POPfile. --TonyMeyer
Fair enough... I expect you'll delete this soon, although it might be appropriate to refactor this into a separate wiki page on n-way - maybe MultiWayClassification. My old spambayes installation did not have n-way.py, so I'll update and look. POPFile does not seem to be what I want: it does n-exclusive-ways, whereas I am interested in suggesting n nonexclusive keywords for classification (like the keywords on gmail). Sourceforge seems to contain just an issue tracking system; I am more interested in discussion about the issues involved in MultiWayClassification. -- AndyGlew
I was wrong. I found contrib/nway.py. It does the "obvious" thing of using multiple databases... -- AndyGlew
1.3. Multiway Email Sorting
The following I originally posted on my blog, which is on a company intranet and not externally visible.
I am getting tired of sorting my email into folders. I have grown dependent on automatic spam filtering via spambayes... I want the same for sorting.
I could very easily go over to search based email handling, e.g. using Google desktop. But:
- Google desktop doesn't handle my saved MH email (one day I should write the plugin).
- On gmail I find the "classification tag" feature useful...
- I've learned the hard way that I absolutely, absolutely, have to sort my email into:
  - proprietary,
  - personal, as well as
  - public (company related, but not proprietary).
- And I also like:
  - technical, related to computer architecture,
  - other: politics, personal, fun, shopping, etc.
- I.e. search based is good, but classifications help - but I want to automate the generation of such tags.
Spambayes is binary, spam/ham. I've heard about people doing it N-way, but cannot find any examples right now.
I can imagine the N-way being done:
- linearly, with a set order - e.g. first filter out spam, then everything else,
- as N independent binary tests (see discussion of combinations below),
- but the priority order matters.
POPFile does naive bayesian N-way. But it is N exclusive ways.
It occurs to me that if I am applying classification tags they are not mutually exclusive. Therefore, running something like a Spambayes binary test N times, suggesting N tags, is not unreasonable. Since spambayes reports high-probability-yes, high-probability-no, unsure, I could imagine an interface that reported all tags, and allowed the user to correct them.
Some tags are mutually exclusive, others not. E.g. "spam" probably isn't associated with any ham tags - there isn't much spam that is also comp-arch-technical. However... I *do* enjoy Nigerian fraud emails, so spam spam-nigerian might be amusing.
Basically, I am wondering about how to handle various combinations of tags. I want to support multiple tags, but I want as few extraneous tag combinations as possible.
e.g. possibly train a final filter on the tags suggested by the initial filter. If the initial filter suggested (spam,technical-comp-arch), then the final filter might map this to (spam) only. But if (spam,family-recreation) usually means that it is family related, then it might elide the spam and map to (family-recreation).
i.e. we might be able to use Bayesian learning to automatically figure out which combinations of tags are meaningful.
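A minimal sketch of such a second-pass "composite" filter. Here it is just a frequency table from the tag set suggested by the first pass to the most commonly corrected final tag set, rather than a full Bayesian model; a real version might smooth counts or back off to subsets. All names are hypothetical.

```python
# Sketch: learn, from user corrections, which first-pass tag combinations
# should map to which final tag sets (e.g. (spam, technical-comp-arch) -> (spam)).
from collections import Counter, defaultdict

class CompositeFilter:
    def __init__(self):
        # suggested tag set -> Counter of corrected tag sets seen in training
        self.table = defaultdict(Counter)

    def train(self, suggested, corrected):
        self.table[frozenset(suggested)][frozenset(corrected)] += 1

    def refine(self, suggested):
        seen = self.table.get(frozenset(suggested))
        if not seen:
            return set(suggested)  # no evidence: pass suggestions through
        # Map to the final tag set most often confirmed by the user.
        return set(seen.most_common(1)[0][0])
```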
The various spam filtering tools like Spambayes and POPFile integrate with mailers in many different ways. E.g. POPFile is a POP server. Spambayes is more of a library that is integrated into proxy servers, but also into email tools that filter various folders, etc.
Maybe what I really want is a generic tool that maps a file containing an email to a classification, returning the strings of the tags. I could then use this as a suggestion in Gnus under EMACS, and edit out the unwanted tags.
Given such suggestions, various other tools could place them in mail headers, or in the subject line, or whatever.
Basic structure:
- Apply N independent Bayesian filters.
- Take that tuple, and apply it to a composite filter on the tags.
- Separate spam, and possible spam.
- Ham handling:
  - Suggest this tag tuple to the user. Allow user to edit.
  - Retrain if different from proposed initial and composite filters.
- Spam handling:
  - Moved into a separate folder.
  - User checks later.
  - Same sort of retraining, if user wants classified differently.
ISSUE: this implies 2N separate databases: one for each tag, on the raw message, and one for each tag, given the classifications. Since Spambayes is already slow, maybe it would be better to have a unified database, especially since all probably use the same features (tokens).
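A sketch of what such a unified database might look like: one row per token, with a (tagged, untagged) count pair for every keyword tag, so a single tokenization pass updates all tags at once. The schema and names are hypothetical, not the SpamBayes storage format.

```python
# Sketch: unified per-token count table replacing N separate databases.
from collections import defaultdict

class UnifiedTagDB:
    def __init__(self, tags):
        self.tags = list(tags)
        # token -> {tag: [tagged_count, untagged_count]}
        self.rows = defaultdict(lambda: {t: [0, 0] for t in self.tags})
        # per-tag message totals, same [tagged, untagged] layout
        self.msgs = {t: [0, 0] for t in self.tags}

    def train(self, tokens, applied_tags):
        # One tokenization pass per message, however many tags exist --
        # the hoped-for speedup over N independent SpamBayes databases.
        applied = set(applied_tags)
        for t in self.tags:
            self.msgs[t][0 if t in applied else 1] += 1
        for tok in set(tokens):
            row = self.rows[tok]
            for t in self.tags:
                row[t][0 if t in applied else 1] += 1
```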
1.3.1. Usage Model
I used to use procmail rules to sort my mail into different folders, and then use gnus to take me to folders by different priorities.
I stopped doing this with Outlook, because Outlook mail sorting rules are so broken. Now that I have abandoned Outlook I could fall back to procmail, but I am still reluctant to maintain a ruleset. Hence the blurb about using N-way spambayes sorting.
When I sorted all incoming mail into folders, the folder I was in suggested what folder any mail I generated should go to. I.e. my replies got stored in the same folder as the mail I was replying to.
That's still valid. Unfortunately, right now what I do is apply spam/ham classification, but receive all mail into my MH +inbox. So all of my replies stay in my inbox. And sorting out my replies is even more annoying than classifying incoming mail.
I suppose the general rule still applies: classify a reply like the email it is replying to. But I still probably want to generate content based classification for email I send. I mean, how often does a proprietary discussion turn into an invitation for lunch?
For email that I generate I can generate the classification tags immediately, and edit them. I don't want to do much of this manual editing, but it would be tolerable if the suggestions were close enough.
I also record a lifestream of all email. Duplicating emails in different folders is unfortunate. A unified store, with tags, would be nice, so long as the proprietary/nonproprietary distinction were maintained.
2. Full vs Partial Populations
Thinking about this has led me to the following conclusion: only one "column", or set of counts, is necessary for a particular tag, if that tag classification has been performed for the entire population, and if all messages not tagged are, well, not in the classification.
In this case, probabilities would be calculated by taking any of the (feature, tag) counts and dividing by the total size of the population, which is separately known.
This might be okay if you establish your classification scheme for email at the beginning of time, and have run all email through it. However, if you are like me your classification scheme evolves; you create new categories and discard old ones. So, although you may have a history of 100,000 messages, there may be only 10 that have been trained wrt the new classification tag.
To handle such evolving classification schemes, one must have explicit tag and not-tag counts (and assume unknown for features that have neither tag nor not-tag associated with them). Perhaps, when the size of the total population and the size of the population for which tag/not-tag training has been done approach each other, the not-tag counting can be discarded. More importantly, when the populations almost completely overlap, more queries can be supported.
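The two cases can be sketched directly: with full coverage the not-tag counts are derivable as (total minus tagged), while with partial coverage they must be stored explicitly. The add-one smoothing below is illustrative, not what SpamBayes does.

```python
# Sketch: P(token | tag) and P(token | not-tag) under full vs partial coverage.
def token_probs(tag_count, tag_msgs, nontag_count=None, nontag_msgs=None,
                total_count=None, total_msgs=None):
    if nontag_count is None:
        # Full-coverage case: every untagged message counts as not-tag,
        # so not-tag counts are derivable from the known population totals.
        nontag_count = total_count - tag_count
        nontag_msgs = total_msgs - tag_msgs
    p_tag = (tag_count + 1) / (tag_msgs + 2)           # P(token | tag)
    p_nontag = (nontag_count + 1) / (nontag_msgs + 2)  # P(token | not-tag)
    return p_tag, p_nontag
```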
Given tag/not-tag feature classification, one can decide tag/not-tag.
Given tag1/not-tag1 and tag2/not-tag2 feature classifications, one can do tag1/tag2 classification, but one must make assumptions:
- one can assume that the populations are completely non-overlapping;
- one can assume that the tags are mutually exclusive;
- or one can assume that the tags are independent, the tags not mutually exclusive, in which case it is really two different tag1/not-tag1 and tag2/not-tag2 classifications,
  - possibly with the requirement that at least one of tag1/not-tag1/tag2/not-tag2 must be asserted, making it a multiway classification - i.e. the multiwayness acting as a tie-breaker.
The basic problem with doing such tag1/tag2 multiway tests, if we do not know to what degree the populations overlap, is that we do not know how independent the variables are: we do not know the size of the population that is tagged tag1*tag2.
We can imagine that the database storing the feature=>classification counts might be augmented to track the various population sizes:
- e.g. the count of messages that were just subject to spam/not-spam classification,
- the count that were subject to spam/not-spam and proprietary classification,
- the count that were subject to spam/not-spam and tag1/tag2/... classification.
If these counts are all available, then when we do a query or test involving tagI and tagJ we will have the population sizes necessary. Assuming stationary distributions would allow us to figure out the appropriate feature=>classification weights.
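Such population-size bookkeeping can be sketched as a table keyed by the set of tags each batch of messages was classified under. The scheme keys and counts below are made up for illustration.

```python
# Sketch: per-scheme population sizes, so a tagI/tagJ query knows how many
# messages were actually trained under both classifications.
populations = {
    frozenset(["spam"]): 100_000,                # spam/not-spam only
    frozenset(["spam", "proprietary"]): 12_000,  # both classifications
    frozenset(["spam", "tag1", "tag2"]): 10,     # the new, partial scheme
}

def shared_population(tags):
    # Messages usable for a query over `tags`: every batch whose
    # classification scheme covered all of the queried tags.
    return sum(n for scheme, n in populations.items() if set(tags) <= scheme)
```

shared_population(["tag1", "tag2"]) then tells you directly how small the overlap population is, which is exactly the "we do not know the size of the population tagged tag1*tag2" problem above.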