Over the years, I've gone through a number of different anti-spam measures.
The first filter I deployed against email spam was a simple set of
procmail rules that blocked messages which failed to show me as an
explicit To: or Cc: recipient.
This was in 1997, when I was getting a couple hundred spams per day.
Spammers had much more limited resources in those days, and could not afford
the CPU time to send separate messages to each recipient.
Instead they would send each message Bcc:ed to thousands of recipients,
with the individual recipient addresses not appearing in the message's
headers as delivered.
Blocking such messages cut my spam to near zero for five years.
Eventually the spammers started using networks of millions of stolen
machines to send their spam, so they could afford to address the
messages individually, and this measure stopped working.
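As a rough illustration of the Bcc: test described above, here is a minimal Python sketch. It is not the original procmail rule set; the function name and the sample addresses are made up for the example, and it assumes the message is available as raw header-plus-body text.

```python
from email import message_from_string
from email.utils import getaddresses


def names_me_explicitly(raw_message, my_addresses):
    """Return True if any of my addresses appears in To: or Cc:.

    A message that reaches me without naming me there was presumably
    Bcc:ed, which is what the filter described above blocked.
    """
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    listed = {addr.lower() for _name, addr in getaddresses(headers)}
    mine = {addr.lower() for addr in my_addresses}
    return bool(listed & mine)


# Hypothetical sample message addressed to someone else entirely.
SAMPLE = (
    "From: a@example.org\n"
    "To: Somebody Else <other@example.org>\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
print(names_me_explicitly(SAMPLE, {"me@example.net"}))
```

A filter built on this would block the message whenever the function returns False.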
By 2002 my Bcc: filter was losing effectiveness, so I deployed
SpamAssassin.
This worked OK for a while, but the spammers were rapidly finding ways
around it, and since it's written in Perl I found it too hard to improve.
In early 2003 I switched from SpamAssassin to
bogofilter.
This was so effective it was almost frightening, handling around 1000
spams per day with basically zero errors, and it ran much faster
than SpamAssassin.
Unfortunately, in late 2003 and early 2004 the first really huge email worms
showed up (SoBig.F, MyDoom, Beagle), and my mail load jumped to about 100,000/day.
bogofilter, although very fast, was not fast enough to handle
this level of traffic by itself.
My creaky old 450MHz / 64MB server started running out of memory and crashing.
I started experimenting with sendmail.cf mods to
reject mail with certain recognizable virus signatures.
I eventually had it reject all mail with "Content-Type: multipart/mixed",
which blocked pretty much all viruses (and almost all other spam) very
cheaply.
The downside was it also blocked legitimate attachments, but
I don't get too many of those and it was either that or go off the net.
This was also when I started looking into sendmail resource limits such as
MAX_DAEMON_CHILDREN
and
QUEUE_LA/REFUSE_LA.
Towards the end of this period I even deployed a probabilistic firewall
rule that rejected mail connections 90% of the time, except for people on
a whitelist.
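The logic of that probabilistic rule can be sketched in a few lines of Python. This only models the decision; the real thing was a firewall rule, and the function name, parameters, and addresses here are hypothetical.

```python
import random


def accept_connection(client_ip, whitelist, reject_probability=0.9,
                      rng=random.random):
    """Decide whether to let a mail connection through.

    Whitelisted clients always get in. Everyone else is turned away
    with the given probability (90% in the text above).
    """
    if client_ip in whitelist:
        return True
    # rng() is uniform in [0, 1), so this is True about 10% of the time.
    return rng() >= reject_probability


whitelist = {"192.0.2.10"}  # hypothetical trusted sender
print(accept_connection("192.0.2.10", whitelist))
```

The idea is that legitimate mail servers retry after a refused connection and eventually get through, while the overall load drops sharply.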
These emergency measures allowed me to limp along on my old machine
until I could upgrade my hardware.
In mid-2004, with my new 3.2GHz / 2GB server humming along and rejecting
150,000 crapmails per day, I had plenty of CPU cycles and memory to play
with so I switched out of panic mode and started experimenting to find
some long term solutions to spam.
I installed ClamAV, an antivirus engine that can run as a sendmail
mail-filtering plug-in, or milter, specialized for blocking worms and viruses.
It worked quite well and was cheaper to run than bogofilter.
This got me interested in milters in general, and I started writing
some of my own:
blackmilter,
graymilter, and
spfmilter.
In addition I looked into some more of the available sendmail config
options, such as
greet_pause.
And that brings us to the present, with my crapmail load approaching
and sometimes exceeding a million per day.
Why does ACME Labs get so much spam?
That's a good question.
There are probably two main reasons.
It almost goes without saying that the most expensive resource wasted
by spam is your attention, therefore the top priority for
spam filters is to operate automatically and accurately.
Aside from that, there are three main resources that spam wastes
and that spam filters try to conserve.
Your system can only process so much data, and you don't want to waste
that capacity on spam.
Typically you measure this resource by either the load average (an exponentially
damped average of the number of processes running or waiting to run) or a
straight CPU utilization percentage.
Filters that run efficiently help conserve this resource.
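To make the load average mentioned above a little more concrete, here is a minimal Python sketch of an exponentially damped average. The 5-second sampling interval and 60-second window are assumptions modeled on typical Unix kernels, not values taken from this article.

```python
import math


def update_load_average(previous, runnable, interval_s=5.0, window_s=60.0):
    """Fold one new sample of the run-queue length into the average.

    Older samples fade by a factor of exp(-interval/window) each step,
    so the average follows the recent past rather than the instant.
    """
    decay = math.exp(-interval_s / window_s)
    return previous * decay + runnable * (1.0 - decay)


# Starting idle, then holding 4 runnable processes for one minute.
load = 0.0
for _ in range(12):  # 12 samples x 5 seconds = 60 seconds
    load = update_load_average(load, runnable=4)
print(round(load, 2))
```

The damping is why a short burst of mail barely moves the number, while a sustained flood pushes it steadily upward.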
Spam uses up memory by causing lots of mail-delivery processes to get created.
Each process uses memory for as long as it exists.
If you have adequate CPU cycles available, the processes will run quickly
and so won't use up much memory averaged over time, unless they have
to sit around waiting for something: the disk, a DNS lookup, a database
lock, anything.
If that happens, you can end up with hundreds of sendmail processes.
You measure this resource by both total resident memory and
total virtual memory.
If resident memory approaches the size of your RAM, you start
paging / swapping which slows things down.
If virtual memory approaches the size of your swap file, your system crashes.
Filters that run quickly in clock time, as opposed to CPU time, help
conserve this resource.
Filters that need to wait for some external resource should be considered
more expensive.
And of course they should not use up a lot of memory themselves.
A large spam load can really clog your internet connection.
If I turned off ACME's filters, our T1 line would be completely full!
You measure bandwidth by looking at the bytes/bits per second going
through your network and comparing that to the size of your pipe
(which you can figure out
here).
With spam you typically only need to worry about the inbound
bandwidth; the outbound load should be trivial by comparison.
To save bandwidth, filters want to block spam before the DATA phase
of the mail transaction.
Filters that run after the DATA phase should be considered more expensive.
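The bandwidth comparison described above is simple arithmetic, sketched here in Python. The T1 rate of 1.544 Mbit/s is standard; the 10 KB average message size is purely a hypothetical figure for the example.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def link_utilization(bytes_transferred, seconds,
                     link_bits_per_second=T1_BITS_PER_SECOND):
    """Fraction of the link consumed by the given traffic."""
    return (bytes_transferred * 8) / (seconds * link_bits_per_second)


# Hypothetical load: a million messages per day at 10 KB each.
messages_per_day = 1_000_000
assumed_bytes_per_message = 10_000
utilization = link_utilization(
    messages_per_day * assumed_bytes_per_message, seconds=86_400
)
print(f"{utilization:.0%} of a T1")
```

Under those assumptions the message bodies alone would take up roughly 60% of the line, which is why rejecting before the DATA phase matters.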
To understand why some filters are more expensive to use than others,
you have to learn a little about SMTP.
First, it stands for Simple Mail Transfer Protocol, and it
was originally defined in
RFC 821
by Jon Postel, one of the originators of the Internet.
It's how pretty much all mail on the net gets sent.
An SMTP transaction, like many internet protocols, is a multi-step
back-and-forth conversation.
It looks something like this:
At any stage in the conversation, the server can decide that the mail
is spam and return a rejection code.
The earlier that happens, the better.
Rejecting spam early means the sendmail process doesn't sit around as
long taking up memory, and also you can avoid using some network bandwidth.
Some filters can run as soon as the client opens the connection,
because all they need to know is the client's IP address.
Those are the cheapest.
Some filters need to know the mail's "from" address, so they have to
wait for that stage of the conversation before they can run.
That is still pretty early and cheap.
Then there are all the filters which operate on the full text of
the email.
They run after the DATA phase, and should be considered the most
expensive.
They can also be very accurate, though.
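The ordering of filters by cost can be sketched as follows. This is a simplified Python model, not how sendmail or any milter actually dispatches filters; the filter names, the transaction fields, and the sample rules are all hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Earliest point in the SMTP conversation at which a filter can run."""
    CONNECT = 1    # only the client's IP address is known
    MAIL_FROM = 2  # the envelope "from" address is known
    DATA = 3       # the full message text has been received


def first_rejection(filters, transaction):
    """Run filters cheapest-stage-first.

    Returns (stage, name) for the first filter that flags the mail,
    or None if every filter passes it.
    """
    for stage, name, looks_like_spam in sorted(filters, key=lambda f: f[0]):
        if looks_like_spam(transaction):
            return stage, name
    return None


# Hypothetical filters, deliberately listed out of order.
FILTERS = [
    (Stage.DATA, "content", lambda t: "buy now" in t["body"].lower()),
    (Stage.CONNECT, "ip-blacklist", lambda t: t["client_ip"] in {"192.0.2.7"}),
    (Stage.MAIL_FROM, "sender", lambda t: t["mail_from"].endswith("@spam.example")),
]

transaction = {
    "client_ip": "198.51.100.1",
    "mail_from": "joe@client.example.com",
    "body": "Buy now!",
}
print(first_rejection(FILTERS, transaction))
```

A message caught at the CONNECT stage never costs a MAIL FROM exchange or a DATA transfer; only mail that passes the cheap checks pays for the expensive one.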
One technique I use is to leverage a late/expensive filter into
an early/cheap filter.
For instance, ClamAV uses a database of patterns to identify virus
email very accurately, but it needs the full text of the message
so it is expensive to run.
However, I can take the list of IP addresses identified by ClamAV as
being virus-infected and turn that into a short-term blacklist.
The blacklist gets checked as soon as a connection is opened,
which is very cheap.
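A short-term blacklist of this kind could look something like the Python sketch below. It is only an illustration of the idea, not the code of blackmilter or any other real tool; the class name, the one-hour lifetime, and the sample address are assumptions.

```python
import time


class ShortTermBlacklist:
    """IP addresses flagged by an expensive filter, remembered briefly."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = {}  # ip -> time at which the entry lapses

    def add(self, ip):
        """Called when the late filter (e.g. a virus scanner) flags ip."""
        self._expires[ip] = self._clock() + self._ttl

    def is_blocked(self, ip):
        """Cheap check to run as soon as a connection is opened."""
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[ip]  # entry has aged out
            return False
        return True


blacklist = ShortTermBlacklist(ttl_seconds=3600)
blacklist.add("192.0.2.7")  # hypothetical infected host
print(blacklist.is_blocked("192.0.2.7"))
```

Keeping the entries short-lived limits the damage if an address is reassigned or the infected machine gets cleaned up.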
The worst thing that a spam filter can do is to block or junk
legitimate mail.
This is called a false positive - the mail is falsely classified
as 'positive' for spam.
Real mail is important, and even a single lost real message is a big problem.
The best way to avoid false positives is to use filters which can't
block real mail.
A good example is
graymilter.
The second way to avoid false positives is to try to reject spam during
the SMTP transaction.
This will generate a bounce message that actually goes to the
person who sent the message, so there's a reasonable chance it will
be seen and acted on.
See
below
for an explanation of the two different kinds of bounce messages.
Third, messages that get received for processing and then classified
as spam should in general be saved in a folder instead of just getting
junked.
I scan through my spam folder once a day or so, looking for
false positives, and do occasionally find one.
Only as a last resort should you receive a message and then junk
it silently, without any review.
This should be reserved for messages that are absolutely, blatantly
spam, and that are so numerous that reviewing them would be impractical.
There are two kinds of bounce messages you can generate, one ok and
one bad.
If a message gets rejected during the SMTP transaction, then
the machine sending the message has the responsibility for generating
the bounce message.
That means the bounce has a good chance of going to the person who
actually sent the mail, which is good.
On the other hand, once the SMTP transaction has ended and your machine has
received the message for processing, you should
under no circumstances generate a bounce message.
These late bounces would go to the sender showing in the message headers,
which in spam and viruses is almost always forged.
Bounces going to forged addresses account for just as much wasted
bandwidth and confusion as the spam and viruses themselves.
Please do not contribute to this problem.
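The rule above boils down to a small decision table, sketched here in Python. The names are hypothetical, and the fallback of saving to a spam folder follows the false-positive advice given earlier; note that generating a bounce after reception is deliberately not one of the possible outcomes.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver normally"
    REJECT_IN_SMTP = "refuse during the SMTP transaction"
    SAVE_TO_SPAM_FOLDER = "accept, then file in a spam folder for review"


def disposition(is_spam, transaction_open):
    """Pick what to do with a message once a filter has ruled on it.

    While the SMTP transaction is still open we can refuse the mail,
    leaving the sending machine responsible for any bounce. Once we
    have accepted it, we never bounce; we keep it for review instead.
    """
    if not is_spam:
        return Action.DELIVER
    if transaction_open:
        return Action.REJECT_IN_SMTP
    return Action.SAVE_TO_SPAM_FOLDER


print(disposition(is_spam=True, transaction_open=False).value)
```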
Note that
qmail,
an alternative mail transport program,
generates post-reception bounce messages in circumstances where other
mail transports would have refused the reception.
This means every qmail site can basically be used as a spam relay, since
its bounces deliver the spam to whatever forged address the spammer chose.
For this reason alone, qmail should never be used by anyone.
Bcc: filter
SpamAssassin
bogofilter
sendmail.cf hacking
milters
Why So Much?
Resources
CPU Cycles
Memory
Bandwidth
SMTP
Client                                 Server
[opens connection]
                                       220 server.example.net Greetings
HELO client.example.com
                                       250 Welcome client.example.com
MAIL FROM: <joe@client.example.com>
                                       250 joe@client.example.com... Sender ok
RCPT TO: <jane@server.example.net>
                                       250 jane@server.example.net... Recipient ok
DATA
                                       354 Enter mail
[transmits message]
                                       250 Message accepted
QUIT
[closes connection]
False Positives
Bounces