Internet and e-mail policy and practice
including Notes on Internet E-mail


25 Oct 2009

How do you test spam filters?

(Thanks to Chris Lewis for permission to adapt this)

Everyone who uses e-mail needs spam filtering, and some filters definitely work better than others. Some people we know were trying to design tests of filter quality, which turns out to be extremely difficult.

What one might call "filtering quality" assessment should be the very last step, after "does it have the features I want?", "does it install/is it supported/supportable?", "does it crash?", "does it make lots of stupid mistakes?", and "is it likely to compare favorably with what we already have?".

You have to answer those preliminary questions before you assess filtering quality. The preliminaries are relatively easy. Quality assessment is what people keep asking about, and it is the really hard part to do right.

One approach is cloning a real-time stream of mail and feeding it to both the current production filter and the one under test. If you do that, you're constrained to comparing the results of the two filters against each other. Checking those results can be a considerable privacy concern, even if you are able to check every message. On high volume streams it becomes quite difficult to check the differences for validity, especially if you don't have much to go on from production to start with. (Surprisingly, we tend to find that, at least at high volume, automated spam filters are more accurate than humans.)
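To make the comparison step concrete, here is a minimal Python sketch of tallying where two filters agree and disagree on a cloned stream. The message IDs, the "spam"/"ham" labels, and the name compare_verdicts are all assumptions for illustration, not part of any real product. Note that it only shows where the filters differ, not which one is right.

```python
from collections import Counter


def compare_verdicts(production, candidate):
    """Tally how two filters classified the same cloned messages.

    Both arguments map a message ID to "spam" or "ham" (hypothetical labels).
    Only messages seen by both filters are compared.
    """
    shared = production.keys() & candidate.keys()
    # Count each (production verdict, candidate verdict) pair.
    tally = Counter((production[m], candidate[m]) for m in shared)
    # These are the messages somebody would have to look at by hand.
    disagreements = sorted(m for m in shared if production[m] != candidate[m])
    return tally, disagreements


production = {"m1": "spam", "m2": "ham", "m3": "spam", "m4": "ham"}
candidate = {"m1": "spam", "m2": "spam", "m3": "ham", "m4": "ham"}
tally, disagreements = compare_verdicts(production, candidate)
```

Every entry in disagreements still needs a human judgment, which is where the privacy and volume problems come in.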

Worse, there's the difficulty of cloning the stream accurately enough. At the simplest level, do you lose the source IP address due to passing the mail through a server that does the cloning? At a higher level there's the loss of ability to deal with actual SMTP-level interaction details. Filtering techniques that use real-time characteristics of the mail stream are very difficult to clone.

As a case in point: you can't clone to a greylisting or banner delay system and expect useful results. The filtering itself is based on the sending system's reaction to a temporary failure rejection. But a sending system can't react two different ways at once to the two (or more) receiving systems in the same email transaction. Other techniques are equally difficult to clone, such as "nolisting", which uses a fake unreachable primary MX: the MX either is reachable or it isn't, and can't be both.
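A toy greylist in Python shows why this kind of filter depends on the live conversation. This is a sketch under assumptions: the Greylist class, the 300-second delay, and keying on the (IP, sender, recipient) triple are illustrative choices, not a description of any particular implementation.

```python
class Greylist:
    """Toy greylist: defer a new (ip, sender, recipient) triple until it retries."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a sender must wait (assumed value)
        self.first_seen = {}        # triple -> time of the first attempt

    def check(self, ip, sender, rcpt, now):
        key = (ip, sender, rcpt)
        # Remember when this triple first showed up.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 OK"
        return "451 Temporary failure, try again later"


greylist = Greylist()
triple = ("192.0.2.10", "a@example.org", "b@example.com")
first_try = greylist.check(*triple, now=0)      # deferred
early_retry = greylist.check(*triple, now=60)   # still deferred
late_retry = greylist.check(*triple, now=600)   # accepted
```

The verdict depends on whether and when the sender comes back, which only the system that issued the 451 gets to observe. A clone that is handed a copy of the message never sees the retry.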

Another aspect is filter training for filters that adapt to the mail stream. How do you train a real-time cloned Bayes-ish filter if the end-users aren't seeing its results? Imagine testing an end-user-trained Bayes as the cloned-in system. How do you train the thing if your production system is rejecting spam before the user sees it? Even if you can, can you tie in the new system's knobs to what you already have? Many systems can't.

Truly effective filtering systems tend to be a hybrid of many different techniques. Generally at least a few of them won't be amenable to cloning.

I was part of a working group trying to do A/B testing of filtering products. I have enough experience that I was able to pick a lot of holes in the more naive proposals. The various filtering vendors' techies who were also participating found a lot more as we tried various other ideas.

The only thing that works is live production testing. If your environment is large enough, you can split your MXes between the old and new filters, giving separate but, one hopes, similar mail streams to the two, and compare the results. If it's not possible to split, run each candidate for at least a week. The latter was the end-point suggestion we came up with. I think they finally realized that their mail flow was too small for validity, and perhaps it was too scary to try random new filters on production.

In our shop, we roll out new tools to small subsets of users. Since our filters permit us to forward filtered mail out of quarantine, and our filtering method provides feedback for false positives, we get to find out whether something's wrong when it can't do "too much" damage.
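One way to pick a small, stable subset of users is to hash each address into a bucket. The Python sketch below is an assumption about how such a rollout could be wired up, not a description of the setup above; in_pilot and the example addresses are hypothetical.

```python
import hashlib


def in_pilot(address, percent):
    """Return True if this address falls in the first `percent` of 100 buckets."""
    normalized = address.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # The first four bytes of the hash give a stable bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


users = ["alice@example.com", "bob@example.com", "carol@example.com"]
pilot_users = [u for u in users if in_pilot(u, 10)]
```

Because the bucket depends only on the address, a user stays in or out of the pilot from one message to the next, and raising the percentage only ever adds users.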

This is a great idea for environments that can do this.

Environments (or, rather, software) that can't do this are simply ineligible for evaluation due to missing critical features. Or to put it another way, quarantine, forward-out-of-quarantine, and "reject, not bounce, with a message for remediation" features are critical business requirements. While other environments can make do with somewhat less, I feel the last of these at least should be a MUST.

We've decided that certain filtering methodologies are simply unacceptable, such as rejection notification by bounce rather than SMTP reject.

Secondly, we consider the process around "wrong filtering choices" to be just as much a part of the system as the filtering is. Checks and balances on the filtering (e.g., rejection with remediation instructions, quarantine, etc.) are designed in from the beginning as part of the overall system.

As a concrete example, you can be really aggressive in your filtering if you have (a) a way of finding out when you goof and (b) tools to remediate it. Blocking a huge IP range isn't so scary if you know you will find out which spots you shouldn't have blocked, you can unblock those spots as needed, and you can "undo" the filtering simply by forwarding in the applicable hunk of quarantine.
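The block, unblock, and undo cycle can be sketched in Python with the standard ipaddress module. RangeBlock, release, and the quarantine records are hypothetical names invented for this illustration, and the addresses come from the documentation ranges.

```python
import ipaddress


class RangeBlock:
    """A blocked network with holes punched in it for known-good senders."""

    def __init__(self, cidr):
        self.network = ipaddress.ip_network(cidr)
        self.exceptions = []

    def unblock(self, cidr):
        hole = ipaddress.ip_network(cidr)
        if not hole.subnet_of(self.network):
            raise ValueError(f"{hole} is not inside {self.network}")
        self.exceptions.append(hole)

    def is_blocked(self, ip):
        addr = ipaddress.ip_address(ip)
        if addr not in self.network:
            return False
        return not any(addr in hole for hole in self.exceptions)


def release(quarantine, block):
    """Split quarantined messages into those to forward and those to keep."""
    released = [m for m in quarantine if not block.is_blocked(m["ip"])]
    kept = [m for m in quarantine if block.is_blocked(m["ip"])]
    return released, kept


block = RangeBlock("198.51.100.0/24")
quarantine = [
    {"id": "q1", "ip": "198.51.100.7"},
    {"id": "q2", "ip": "198.51.100.200"},
]
block.unblock("198.51.100.0/28")  # we found out this spot was a mistake
released, kept = release(quarantine, block)
```

After the unblock, message q1 is in the released list and can be forwarded to its recipient, while q2 stays in quarantine.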

We're probably FAR more aggressive in filtering than most simply because we can "undo" portions of that aggressiveness. It's routine. Just another designed-in aspect of the filtering system. This sort of thing is notably absent in most vendor offerings.

That said, other installations can do without these features, but should be able to simulate their results with respect to measuring effectiveness and false positives in one way or another.

However, is this something you do during pre-production launches of new products that you know you'll eventually be deploying across the board? Or do you do this as a part of your normal evaluation processes for any product?

This tends to be a final stage of evaluation of new systems or new features. Generally, they go through a test on the trap first to see if they're in the ballpark, primarily based on overall metrics with some spot checking of individual results. The spot checks give you a good idea whether it's worth the risk of putting it in production. The overall metrics tell you whether it's better than what you had before. Few outside products have ever made it to production testing.
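As a rough Python illustration of "overall metrics with some spot checking", the sketch below computes a catch rate and a false positive rate and pulls a few messages for manual review. It assumes you already have a sample where each message's true class is known, which is itself the hard part; summarize and the record fields are hypothetical.

```python
import random


def summarize(results, sample_size=2, seed=0):
    """Return (catch rate, false positive rate, spot-check sample)."""
    spam = [r for r in results if r["actual"] == "spam"]
    ham = [r for r in results if r["actual"] == "ham"]
    caught = sum(1 for r in spam if r["verdict"] == "spam")
    false_pos = sum(1 for r in ham if r["verdict"] == "spam")
    catch_rate = caught / len(spam) if spam else None
    fp_rate = false_pos / len(ham) if ham else None
    # A seeded sample keeps the spot check repeatable between runs.
    rng = random.Random(seed)
    spot = rng.sample(results, min(sample_size, len(results)))
    return catch_rate, fp_rate, spot


results = [
    {"id": "t1", "actual": "spam", "verdict": "spam"},
    {"id": "t2", "actual": "spam", "verdict": "spam"},
    {"id": "t3", "actual": "spam", "verdict": "ham"},
    {"id": "t4", "actual": "spam", "verdict": "spam"},
    {"id": "t5", "actual": "ham", "verdict": "ham"},
    {"id": "t6", "actual": "ham", "verdict": "spam"},
]
catch_rate, fp_rate, spot = summarize(results)
```

Here the candidate catches three of four spam messages and wrongly flags one of two legitimate ones, and the messages in spot are the ones to read by hand.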

  posted at: 00:24


© 2005-2024 John R. Levine.