Mail Filtering - Background

Background

History

Over the years, I've gone through a number of different anti-spam measures.

Bcc: filter

The first filter I deployed against email spam was a simple set of procmail rules that blocked messages which failed to show me as an explicit To: or Cc: recipient. This was in 1997, when I was getting a couple hundred spams per day. Spammers had much more limited resources in those days, and could not afford the CPU time to send separate messages to each recipient. Instead they would send each message Bcc:ed to thousands of recipients, with the individual recipient addresses not appearing in the message's headers as delivered. Blocking such messages cut my spam to near zero for five years. Eventually the spammers started using networks of millions of stolen machines to send their spam, so they could afford to address the messages individually, and this measure stopped working.
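As an illustration only, the test those procmail rules applied can be sketched in Python. This is not the original recipe: the function name `addressed_to_me` and the idea of passing in a set of "my" addresses are assumptions made for the example.

```python
from email import message_from_string
from email.utils import getaddresses


def addressed_to_me(raw_message, my_addresses):
    """Return True if one of my addresses is an explicit To: or Cc: recipient."""
    msg = message_from_string(raw_message)
    # A message may carry several To: or Cc: headers, so collect them all.
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    # getaddresses() turns 'Name <addr>, other@host' lists into (name, addr) pairs.
    listed = {addr.lower() for _name, addr in getaddresses(headers)}
    return bool(listed & {a.lower() for a in my_addresses})
```

A message for which this returns False was most likely Bcc:ed, which is the case the filter blocked.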

SpamAssassin

By 2002 my Bcc: filter was losing effectiveness, so I deployed SpamAssassin. This worked OK for a while, but the spammers were rapidly finding ways around it, and since it's written in Perl it was too hard to improve.

bogofilter

In early 2003 I switched from SpamAssassin to bogofilter. This was so effective it was almost frightening, handling around 1000 spams per day with basically zero errors, and it ran much faster than SpamAssassin.

sendmail.cf hacking

Unfortunately, in late 2003 the first really huge email worms showed up (SoBig.F, MyDoom, Beagle), and my mail load jumped up to about 100,000/day. bogofilter, although very fast, was not fast enough to handle this level of traffic by itself. My creaky old 450MHz / 64MB server started running out of memory and crashing. I started experimenting with sendmail.cf mods to reject mail with certain recognizable virus signatures. I eventually had it reject all mail with "Content-Type: multipart/mixed", which blocked pretty much all viruses (and almost all other spam) very cheaply. The downside was it also blocked legitimate attachments, but I don't get too many of those and it was either that or go off the net.

This was also when I started looking into sendmail resource limits such as MAX_DAEMON_CHILDREN and QUEUE_LA/REFUSE_LA. Towards the end of this period I even deployed a probabilistic firewall rule that rejected mail connections 90% of the time, except for people on a whitelist. These emergency measures allowed me to limp along on my old machine until I could upgrade my hardware.
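The decision that such a probabilistic rule makes can be sketched as follows. This is a Python illustration of the logic only, not the firewall rule itself; the function name, the `rng` parameter, and the sample addresses are hypothetical.

```python
import random


def accept_connection(client_ip, whitelist, reject_fraction=0.9, rng=random.random):
    """Decide whether to let an incoming mail connection through."""
    if client_ip in whitelist:
        return True  # whitelisted senders always get through
    # rng() is uniform on [0, 1), so only (1 - reject_fraction) of draws pass.
    return rng() >= reject_fraction
```

Assuming each attempt is an independent draw, a sender that keeps retrying needs ten attempts on average to get through at a 90% rejection rate.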

milters

In mid-2004, with my new 3.2GHz / 2GB server humming along and rejecting 150,000 crapmails per day, I had plenty of CPU cycles and memory to play with, so I switched out of panic mode and started experimenting to find some long-term solutions to spam. I installed ClamAV, a virus scanner that plugs into sendmail as a mail-filtering plug-in, or milter, and specializes in blocking worms and viruses. It worked quite well and was cheaper to run than bogofilter. This got me interested in milters in general, and I started writing some of my own: blackmilter, graymilter, and spfmilter. In addition I looked into some more of the available sendmail config options, such as greet_pause. And that brings us to the present, with my crapmail load approaching and sometimes exceeding a million per day.

Why So Much?

Why does ACME Labs get so much spam? That's a good question. There are probably two main reasons.

Lots of people use "acme.com" as an example or fake address. It even appears in the HTML specifications. They shouldn't be using my domain name for this. In fact there's an official recommendation for which domain names to use as examples (RFC 2606 reserves example.com, example.net, and example.org for this purpose), but few people follow it.
Acme.com's web site is fairly popular - we get about 25,000 visitors per day. That means our web pages are cached on a lot of people's disks. Well, one way that spammers and viruses find addresses to send to is by looking in those web cache files on machines they have taken over.

Resources

It almost goes without saying that the most expensive resource wasted by spam is your attention, so the top priority for spam filters is to operate automatically and accurately. Aside from that, there are three main resources that spam wastes and that spam filters try to conserve.

CPU Cycles

Your system can only process so much data, and you don't want to waste that capacity on spam. Typically you measure this resource by either the load average (an exponential average of the number of processes waiting to run), or a straight CPU utilization percentage. Filters that run efficiently help conserve this resource.
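To make "exponential average" concrete, here is a minimal sketch of how such an average is updated. The 5-second sampling interval and 60-second window are assumptions modeled on the common one-minute load average; real kernels differ in the details.

```python
import math


def update_load(load, runnable, interval=5.0, window=60.0):
    """Fold one sample of the run-queue length into an exponential average."""
    decay = math.exp(-interval / window)  # weight kept by the old value
    return load * decay + runnable * (1.0 - decay)
```

Fed a steady run-queue length, the average drifts toward that length; old samples fade out rather than dropping off abruptly.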

Memory

Spam uses up memory by causing lots of mail-delivery processes to get created. Each process uses memory for as long as it exists. If you have adequate CPU cycles available, the processes will run quickly and so won't use up much memory averaged over time, unless they have to sit around waiting for something: the disk, a DNS lookup, a database lock, anything. If that happens, you can end up with hundreds of sendmail processes. You measure this resource by both total resident memory and total virtual memory. If resident memory approaches the size of your RAM, you start paging / swapping, which slows things down. If virtual memory approaches the size of your swap file, your system crashes. Filters that run quickly in clock time, as opposed to CPU time, help conserve this resource. Filters that need to wait for some external resource should be considered more expensive. And of course they should not use up a lot of memory themselves.
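The link between clock time and process count can be put in numbers with Little's law: the average number of processes alive equals the arrival rate times how long each one lives. The sketch below assumes a steady arrival rate, which real mail traffic does not have, and the function name is made up for the example.

```python
def concurrent_processes(messages_per_day, seconds_per_message):
    """Average number of mail processes alive at once (Little's law)."""
    arrivals_per_second = messages_per_day / 86_400  # seconds in a day
    return arrivals_per_second * seconds_per_message
```

At 150,000 messages per day, a process that finishes in half a second means less than one process alive on average; if each one instead waits 30 seconds on a slow lookup, the average rises to about 52.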

Bandwidth

A large spam load can really clog your internet connection. If I turned off ACME's filters, our T1 line would be completely full! You measure bandwidth by looking at the bytes or bits per second going through your network and comparing that to the size of your pipe. With spam you typically only need to worry about the inbound bandwidth; the outbound load should be trivial by comparison. To save bandwidth, filters want to block spam before the DATA phase of the mail transaction. Filters that run after the DATA phase should be considered more expensive.
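The comparison itself is simple arithmetic. The sketch below uses the nominal T1 rate of 1.544 Mbit/s; the function and constant names are hypothetical, and protocol overhead is ignored.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def link_utilization(bytes_per_second, link_bits_per_second=T1_BITS_PER_SECOND):
    """Fraction of the link in use, given a measured byte rate."""
    return bytes_per_second * 8 / link_bits_per_second
```

By this measure a T1 is full at roughly 193,000 inbound bytes per second.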

SMTP

To understand why some filters are more expensive to use than others, you have to learn a little about SMTP. First, it stands for Simple Mail Transfer Protocol, and it was originally defined in RFC 821 by Jon Postel, one of the originators of the Internet. It's how pretty much all mail on the net gets sent.

An SMTP transaction, like many internet protocols, is a multi-step back-and-forth conversation. It looks something like this:

Client
	Server
[opens connection]
	220 server.example.net Greetings
HELO client.example.com
	250 Welcome client.example.com
MAIL FROM: <joe@client.example.com>
	250 joe@client.example.com... Sender ok
RCPT TO: <jane@server.example.net>
	250 jane@server.example.net... Recipient ok
DATA
	354 Enter mail
[transmits message]
	250 Message accepted
QUIT
	[closes connection]

At any stage in the conversation, the server can decide that the mail is spam and return a rejection code. The earlier that happens, the better. Rejecting spam early means the sendmail process doesn't sit around as long taking up memory, and also you can avoid using some network bandwidth.

Some filters can run as soon as the client opens the connection, because all they need to know is the client's IP address. Those are the cheapest.

Some filters need to know the mail's "from" address, so they have to wait for that stage of the conversation before they can run. That is still pretty early and cheap.

Then there are all the filters which operate on the full text of the email. They run after the DATA phase, and should be considered the most expensive. They can also be very accurate, though.
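The three cost tiers can be sketched as a small pipeline that tries the cheap stages first. Everything here is hypothetical: the `Transaction` class, the stage names, and the example filters are invented for illustration, and for simplicity the whole transaction is handed over at once rather than filled in as the conversation proceeds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Stages in the order the SMTP conversation reaches them, cheapest first.
STAGES = ("connect", "mail_from", "data")


@dataclass
class Transaction:
    client_ip: str                   # known as soon as the connection opens
    mail_from: Optional[str] = None  # known after MAIL FROM
    body: Optional[str] = None       # known only after DATA


Filter = Callable[[Transaction], bool]  # returns True when the mail looks like spam


def first_rejecting_stage(txn: Transaction,
                          filters: Dict[str, List[Filter]]) -> Optional[str]:
    """Return the earliest stage at which some filter rejects, or None."""
    for stage in STAGES:
        for looks_like_spam in filters.get(stage, []):
            if looks_like_spam(txn):
                return stage
    return None


# Hypothetical example filters, one per stage.
EXAMPLE_FILTERS = {
    "connect": [lambda t: t.client_ip == "203.0.113.9"],
    "mail_from": [lambda t: (t.mail_from or "").endswith("@spam.example")],
    "data": [lambda t: "buy now" in (t.body or "").lower()],
}
```

A message caught at "connect" never costs the bandwidth of its body; one caught only at "data" has already been transmitted in full.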

One technique I use is to leverage a late/expensive filter into an early/cheap filter. For instance, ClamAV uses a database of patterns to identify virus email very accurately, but it needs the full text of the message so it is expensive to run. However, I can take the list of IP addresses identified by ClamAV as being virus-infected and turn that into a short-term blacklist. The blacklist gets checked as soon as a connection is opened, which is very cheap.
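A short-term blacklist of this kind can be sketched as a table of addresses with expiry times. This is an assumption-laden illustration, not how blackmilter or any other real tool is implemented: the class name, the one-hour default lifetime, and the injectable clock are all made up for the example.

```python
import time


class ShortTermBlacklist:
    """Remember offending IP addresses for a limited time."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = {}  # ip -> time at which the entry lapses

    def add(self, ip):
        """Called by the expensive filter when it identifies a bad sender."""
        self._expires[ip] = self._clock() + self._ttl

    def is_blocked(self, ip):
        """Cheap check, suitable for the moment a connection opens."""
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[ip]  # entry has lapsed, so forget it
            return False
        return True
```

The expensive filter calls `add()` whenever it catches something; every new connection then pays only for one dictionary lookup in `is_blocked()`.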

False Positives

The worst thing that a spam filter can do is to block or junk legitimate mail. This is called a false positive - the mail is falsely classified as 'positive' for spam. Real mail is important, and even a single lost real message is a big problem.

The best way to avoid false positives is to use filters which can't block real mail. A good example is graymilter.

The second way to avoid false positives is to try to reject spam during the SMTP transaction. This will generate a bounce message that actually goes to the person who sent the message, so there's a reasonable chance it will be seen and acted on. See below for an explanation of the two different kinds of bounce messages.

Third, messages that get received for processing and then classified as spam should in general be saved in a folder instead of just getting junked. I scan through my spam folder once a day or so, looking for false positives, and do occasionally find one.

Only as a last resort should you receive a message and then junk it silently, without any review. This should be reserved for messages that are absolutely, blatantly spam, and that are so numerous that reviewing them would be impractical.

Bounces

There are two kinds of bounce messages you can generate, one ok and one bad.

If a message gets rejected during the SMTP transaction, then the machine sending the message has the responsibility for generating the bounce message. That means the bounce has a good chance of going to the person who actually sent the mail, which is good.

On the other hand, once the SMTP transaction has ended and your machine has received the message for processing, you should under no circumstances generate a bounce message. These late bounces would go to the sender showing in the message headers, which in spam and viruses is always forged. Bounces going to forged addresses account for just as much wasted bandwidth and confusion as the spam and viruses themselves. Please do not contribute to this problem.
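Taken together with the false-positive guidelines above, these rules amount to a small decision table, sketched here in Python. The function name and the returned labels are hypothetical. Note that no branch ever produces a late bounce.

```python
def disposition(is_spam, in_smtp_transaction, blatant_and_numerous=False):
    """Choose what to do with a message once it has been classified."""
    if not is_spam:
        return "deliver"
    if in_smtp_transaction:
        # Refuse it now; the sending machine is responsible for the bounce.
        return "reject"
    if blatant_and_numerous:
        # Last resort: already received, too many to review by hand.
        return "discard"
    # Already received: keep it for review, and never generate a bounce.
    return "spam_folder"
```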

Note that qmail, an alternative mail transport program, generates post-reception bounce messages in circumstances where other mail transports would have refused the reception. This means every qmail site is basically an open spam relay. For this reason alone, qmail should never be used by anyone.

