DeAnagraming


In one of John Graham-Cumming's newsletters there was an idea (I think I've seen it before, but can't recall where) that sounded interesting, primarily as a way to counter deliberate misspellings and so-called Cmabirgde Uinersvtiy spam, where the letters within words are scrambled. The idea is that when the filter encounters a word it can add two tokens: the word itself (e.g. viagra) and the word with its letters sorted into alphabetical order (e.g. aagirv).
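
To illustrate the idea outside of SpamBayes (the deanagram name below is just for illustration, not anything in the codebase), sorting a word's letters maps every scrambled variant of that word onto a single token:

def deanagram(word):
    # Sort the letters so that every scrambled variant of a word
    # produces the same token.
    return "".join(sorted(word))

# 'viagra', 'vaigra' and 'viarga' all become 'aagirv':
for w in ("viagra", "vaigra", "viarga"):
    print(w + " -> " + deanagram(w))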

I (TonyMeyer) tried this out, both adding the sorted word as an additional token and replacing the original token with it, in header tokenization only, in body tokenization only, or in both. None gave me results I'd call a win (this was only one corpus, but it didn't look that promising). Another case of stupid beats smart, I guess :) (Replacing the token also makes the token list harder to read, of course.)

The patch for adding the token only in the body (you should be able to work the others out from this) is:

*** tokenizer.py        Wed Jan 19 12:04:21 2005
--- tokenizer2.py       Wed Jan 19 11:59:29 2005
***************
*** 1593,1598 ****
--- 1593,1603 ----
                  n = len(w)
                  # Make sure this range matches in tokenize_word().
                  if 3 <= n <= maxword:
+                     if options["Tokenizer", "x-de-anagram"]:
+                         yield w
+                         w = list(w)
+                         w.sort()
+                         w = "".join(w)
                      yield w

                  elif n >= 3:

*** Options.py  Wed Jan 19 12:04:20 2005
--- Options2.py Wed Jan 19 11:59:14 2005
***************
*** 179,184 ****
--- 179,188 ----
       the ability to reduce the nine tokens to one. (This option has no
       effect if 'Search for Habeas Headers' is False)"""),
       BOOLEAN, RESTORE),
+
+     ("x-de-anagram", "Sort all words into alphabetical order", False,
+      """(EXPERIMENTAL)""",
+      BOOLEAN, RESTORE),
    ),

    # These options are all experimental; it seemed better to put them into
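
If you'd rather not read the diff, here is a rough standalone sketch of the add and replace variants described above (the deanagram_tokens name and the replace flag are made up for illustration; this is not the actual SpamBayes tokenizer code):

def deanagram_tokens(words, replace=False):
    # 'add' variant (replace=False): yield the original word and its
    # alphabetically sorted form.
    # 'replace' variant (replace=True): yield only the sorted form.
    for w in words:
        if not replace:
            yield w
        yield "".join(sorted(w))

print(list(deanagram_tokens(["viagra", "deal"])))                # ['viagra', 'aagirv', 'deal', 'adel']
print(list(deanagram_tokens(["viagra", "deal"], replace=True)))  # ['aagirv', 'adel']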

A table of results is:

-> <stat> tested 4690 hams & 384 spams against 17688 hams & 1539 spams
(etc)

filename:     defaults  deanagram_body  deanagram_header  deanagram_add  deanagram_both
ham:spam:   22378:1923      18166:1923        23454:1923     22378:1923      22378:1923
fp total:            5               7                 5              7               6
fp %:             0.02            0.04              0.02           0.03            0.03
fn total:           23              28                23             20              26
fn %:             1.20            1.46              1.20           1.04            1.35
unsure t:          152             176               156            160             159
unsure %:         0.63            0.88              0.61           0.66            0.65
real cost:     $103.40         $133.20           $104.20        $122.00         $117.80
best cost:      $83.60         $103.20            $83.40         $86.80          $96.00
h mean:           0.12            0.15              0.11           0.13            0.12
h sdev:           2.39            2.94              2.31           2.66            2.44
s mean:          96.29           95.62             96.24          96.56           96.13
s sdev:          14.55           15.50             14.62          13.87           14.77
mean diff:       96.17           95.47             96.13          96.43           96.01
k:                5.68            5.18              5.68           5.83            5.58
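
The real cost figures appear to use the test driver's default weights of $10 per false positive, $1 per false negative, and $0.20 per unsure; for example, for the defaults column:

# Rough check of the 'real cost' row for the defaults column, assuming
# the usual cost weights of $10 per fp, $1 per fn and $0.20 per unsure.
fp, fn, unsure = 5, 23, 152
print("$%.2f" % (fp * 10.0 + fn * 1.0 + unsure * 0.2))   # prints $103.40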

If you'd like the cmp.py results, which tell you much more, let me (TonyMeyer) know and I'll happily provide them.