Tuesday 29 July 2003
Spam
Spam Update

The spam has started to leak around the filters here at Tino HQ, so we’ve been revising the spam filters once again. Our new innovations focus less on forbidden words than on detecting garbage, literally. A lot of spam, a whole lot of it, contains gibberish. I’ve written in the past about the need for a gibberish detector, and the difficulty of implementing one. The trick is to identify garbage like this, which has appeared in actual spams here lately:
Nearly all the spam we get has something like this near the end of the body, and some of it even has it in the subject line. I believe its purpose is to poison Bayesian filters, but it has the advantage of being a nice marker for spam, if you can construct a system to recognize it. We’ve come up with these rules:

First of all, if you use a Q (in your e-mail address, in your subject, or in your body), it has to be followed by a U, or fall at the end of a word, or appear in one of a very few words (like ‘Qantas’) that deviate from the general Qu rule. This rule catches all but the third line above. It might cause some false positives with things that aren’t precisely spam, but messages that use ‘Q’s indiscriminately in this way probably won’t be worth reading anyway.

Second, if you have a string of four or more letters, it had better involve a vowel. This catches the first (‘phkm’) and third (‘dljvfr’) lines. There are a few abbreviations, things like SMTP and HTTP, that are specifically excepted from this rule. E-mail in Welsh would all be filtered out by this rule, but I don’t get any e-mail in Welsh.

Third, you look for two-letter combinations that just don’t occur. This might be difficult if you routinely get a lot of mail in multiple languages, particularly if these include Hungarian or Polish or something, but if, like me, you get almost all of your e-mail in a single language, it can be particularly reliable. Even made-up brand names need to look like words and need to be pronounceable, so this is particularly accurate.

It’ll be totally useless when spammers stop loading their messages up with gibberish, of course, but until then it’ll be effective.

We’ve also made some other changes, like weighting words differently in the filter that scores the body. Preliminary testing indicates that we are back up to catching over 99% of the spam, and the system hasn’t even been tuned yet after a few days in the spam shower.
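The three rules above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Tinotopia filter: the function name `gibberish_reasons`, the reason labels, the sample set of "impossible" letter pairs, and the choice to count Y as a vowel are all assumptions made for the example. Only ‘Qantas’, SMTP and HTTP come from the post as exceptions.

```python
import re

# Exceptions named in the post; a real list would presumably be longer.
Q_EXCEPTIONS = {"qantas"}
VOWELLESS_EXCEPTIONS = {"smtp", "http"}

# Hypothetical sample of letter pairs assumed never to occur in English.
IMPOSSIBLE_BIGRAMS = {"jv", "vq", "zx", "jq"}

# Rule 1: a Q followed by any letter other than U (a word-final Q is allowed).
Q_NOT_U_RE = re.compile(r"q[a-tv-z]")
# Rule 2: four or more consonants in a row (Y is treated as a vowel here).
CONSONANT_RUN_RE = re.compile(r"[b-df-hj-np-tv-xz]{4,}")


def gibberish_reasons(word):
    """Return the list of rules that the given word trips (empty if none)."""
    w = word.lower()
    reasons = []
    if w not in Q_EXCEPTIONS and Q_NOT_U_RE.search(w):
        reasons.append("q-without-u")
    if w not in VOWELLESS_EXCEPTIONS and CONSONANT_RUN_RE.search(w):
        reasons.append("no-vowel-run")
    # Rule 3: any adjacent letter pair from the "never occurs" set.
    if any(w[i:i + 2] in IMPOSSIBLE_BIGRAMS for i in range(len(w) - 1)):
        reasons.append("impossible-bigram")
    return reasons
```

Under these assumptions, ‘phkm’ trips only the vowel rule, ‘dljvfr’ trips both the vowel rule and the letter-pair rule, and ‘Qantas’, ‘question’ and ‘SMTP’ pass cleanly. How the reasons feed into an overall spam score is left open here, since the post does not say.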
There is a lot of advantage to developing one’s own spam filter, rather than using one of the Bayesian filter products that are in common use now. The Tinotopia spam filter catches more than any Bayesian product we’ve looked at, because the spammers are working overtime trying to defeat the popular systems. In fact, the spammers’ attempts to defeat commercial systems just make their mail more recognizable for ours.

Posted by tino at 21:45 29.07.03

Comments
Have you also considered a pattern like [0-9][A-Za-z][A-Za-z][0-9] (i.e. number letter letter number)? Those don’t normally occur either.
Posted by: Paul Johnson at July 30, 2003 12:06 PM

R2D2

The gibberish filter was honed by running a Perl script against a very large corpus of English text, including the Bible, all of Shakespeare, all of Dickens, a lot of (legitimate) mail and Usenet traffic, etc., and seeing what didn’t occur. So far we’ve caught 100% of the spam with no false negatives. The only things I’ve seen so far that could be problematic are certain e-mail addresses. This morning, a message (spam) came in with a return address of something like starwarsfreak106@yahoo.com. The “rsfr” (from “warsfreak”) tripped the bogo detector. As it turns out, the message was spam anyway, but this is an indication of how a perfectly reasonable (if silly) e-mail address can inadvertently look like gibberish to the computer.
Posted by: Tino at July 31, 2003 02:01 PM

I have harvested loads of email addresses from the From and CC fields of spam I have received, and created a rule (in OE) that deletes all emails from the server if the From or CC fields contain any one of those addresses, excepting, of course, my own. This has cut down drastically the amount of spam I receive, and it is a simple matter to add any that do get through. I have also (just now) added a filter for qa, qb, qc, etc., excluding qu, and will see how it goes.
Posted by: Iain C. Purvis at March 1, 2004 11:50 AM
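The corpus approach described in the comments above (scan a large body of known-good English and note which letter pairs never occur) might look something like this. The original was a Perl script that is not shown, so this Python version is only a guess at the general idea. The function names `seen_bigrams` and `unseen_bigrams` are made up for the example, and treating every run of letters as a word is an assumption.

```python
import re
from itertools import product
from string import ascii_lowercase

WORD_RE = re.compile(r"[a-z]+")


def seen_bigrams(corpus_text):
    """Collect every two-letter pair that appears inside a word of the corpus."""
    seen = set()
    for word in WORD_RE.findall(corpus_text.lower()):
        seen.update(word[i:i + 2] for i in range(len(word) - 1))
    return seen


def unseen_bigrams(corpus_text):
    """Return the letter pairs (out of all 26 x 26) that never occur in the corpus."""
    all_pairs = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    return all_pairs - seen_bigrams(corpus_text)
```

The set returned by `unseen_bigrams` is only as good as the corpus behind it. A small corpus will mark many legitimate pairs as impossible, which is presumably why a very large one was used.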