Scalable E-Mail Filtering
Methods and Techniques
UC Berkeley Security SIG

Jon Kuroda

UC Berkeley EECS, CUSG

jkuroda[at]EECS[dot]Berkeley[dot]EDU

http://www.EECS.Berkeley.EDU/~jkuroda/talks/mailfiltering/

This is an (X)HTML document in the S5 presentation system, so there are lots of links one can follow. Diagrams are in separate SVG documents, which require browser support; recent Firefox and Opera work.

Some Alternate Titles



"Mail Filtering Deconstructed"

"Content Filtering on the Cheap"

"Content Filtering that Works (Better)"

"Virus Scanning and Spam Tagging That Sucks Less"

"Virus Scanning for a Non-Ideal World"

"Mail Filtering for the Masses"

"Anti-Virus Is Hard. Let's Go Shopping!"

What's this all about?

Scalable
Cheap and Easy to get more capacity (Pay what you want as you go)
Does More. Costs Less. Doesn't Suck (as much)
Flexible
Useful in situations other than my own
Looking for a modular/toolkit approach
Server Based
Primarily interested in the MTA side
Less so in delivery-time systems such as Sieve, .forward/procmail, or MUA filters
Content-Filtering
Anti-Virus/Spam
Data Scrubbing/Retention
Auto-Spell-Checking / Auto-Translation

Topics (and Non-Topics)

What I will be (or have been) talking about
Background/Historical Information
Personal Caveats
How different filtering systems work
Ways to deploy email filtering - including examples
Crazy Ideas and Odds and Ends
What I will not be talking about as much
My (MTA|OS) is better than your (MTA|OS)
Every single implementation detail
Measures outside of a filtering context
What I hope you (and I) will get out of this
Some understanding of how e-mail filters work and how to use them
Some tools and ideas to take home and try
If I am lucky, a good laugh

Caveats

I'm a *nix/sendmail guy who installed anti-virus software
Examples involving these will have the most detail
I don't do MS Exchange, but I will talk about it (a little)
I'm a realist, not an idealist
I don't work in an ideal IT world
I try not to assume one.
There is nothing new here
I actually didn't think this was that novel
No (in my opinion) out of the ordinary ideas
But, as always, the work is in the documentation
There is no spoon. But I have some lovely sporks.
Note for the online readers: I meant to have some plastic sporks to pass out as random prizes for questions, but 1) I forgot to bring them and 2) I had too high a slide/time ratio.

A (Very) Brief History of E-mail Servers

In the beginning ...
Results
Mmmmm, Job Security

Once upon a time ...

It's 2003 in 399 Cory Hall ...


Filters: An (Over) Simplified Look

Some pseudocode (should read like Perl) describing what a filter does.
Note that there can be outcomes and actions separate from I/O — side-effects.

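while (my $line = <INPUT>) {
    if ($line =~ /PATTERN/) {
        $line = mangle($line);        # rewrite the matching input
        print OUTPUT $line;
        some_side_effect();           # e.g. log or notify
    } else {
        print OUTPUT $line;           # pass through unmodified
        some_other_side_effect();
    }
}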

Pretty Diagram

Filters: The Engine

The "brains", does the work of making filtering and other decisions
Anti-virus/spam
"Scrubbing" messages for sensitive data
Anything from rejecting a message to passing it on unmodified
Side-Effects
Logging/Notifications
Updating cached information
Sometimes, all we care about are the side-effects
It may also depend on databases that require periodic* updating
Virus/Spam databases
SpamAssassin Bayesian Analysis (sa-learn)
Spam Host lists
* How periodic? How paranoid are you? More on this later.

The Ins and Outs of Filter I/O

Great, we have a filter engine, but how do we get email in and out of the filter?

Filter I/O: Pre/Post-queue Filtering, a Slide Without a Home

Filter before or after queueing email? Pre-queue filtering lets one reject mail during an SMTP connection, but it can cause timeouts. Postfix has some good notes on the pros and cons of pre-queue filtering. We may come back to those notes later on.

Pre-queue filtering:

220-mail.example.com ESMTP JavaMail 6.2 Mon, 31 Jul 2006 20:01:23 -0700
ehlo poland.example.com
250-mail.example.com Hello poland.example.com [192.0.34.166]
Mail From: <exile@poland.example.com>
250 OK
Rcpt To: <president@example.com>
250 Accepted
data
354 Enter message, ending with "." on a line by itself
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
.
550 DENIED!! Message contains malware (ClamAV:Eicar-Test-Signature)

Versus post-queue, where filtering does not occur until after the email is accepted:

220-post.example.com ESMTP JavaMail 6.2 Mon, 31 Jul 2006 20:02:23 -0700
ehlo siberia.example.com
250-post.example.com Hello siberia.example.com [192.0.34.166]
Mail From: <exile@siberia.example.com>
250 OK
Rcpt To: <president@example.com>
250 Accepted
data
354 Enter message, ending with "." on a line by itself
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
.
250 2.0.0 k713JO1n055810 Message accepted for delivery
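
The difference between the two transcripts comes down to when the scan runs relative to the final reply to DATA. Here is a minimal Python sketch of that ordering. The `looks_infected` check (which only recognises the EICAR test string) and the reply texts are illustrative assumptions, not the behaviour of any particular MTA or scanner.

```python
# Sketch only: looks_infected() and the reply strings are illustrative
# assumptions, not the behaviour of any particular MTA or scanner.

EICAR_MARKER = "EICAR-STANDARD-ANTIVIRUS-TEST-FILE"


def looks_infected(body: str) -> bool:
    """Stand-in for a real scanner: flags only the EICAR test string."""
    return EICAR_MARKER in body


def reply_to_data(body: str, pre_queue: bool, queue: list) -> str:
    """Return the SMTP reply sent after the client's final '.' line."""
    if pre_queue:
        # Pre-queue: scan while the client is still connected, so a bad
        # message can be refused and the sending MTA sees the failure.
        if looks_infected(body):
            return "550 Message contains malware"
        queue.append(body)
        return "250 Message accepted for delivery"
    # Post-queue: accept first, filter later from the queue.
    queue.append(body)
    return "250 Message accepted for delivery"


def filter_queue(queue: list) -> list:
    """Post-queue pass: keep only messages the scanner does not flag."""
    return [body for body in queue if not looks_infected(body)]
```

In the pre-queue case the sender waits for the scan before getting any reply, which is where the timeout risk comes from. In the post-queue case the sender always gets a quick 250 and the filtering cost is paid later.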

Filter I/O: Pre/Post-queue Filtering: Consequences

What implications does this Pre/Post-queue Filtering business have?

Obviously, a heavy-duty, relatively slow filter run pre-queue can lead to SMTP timeouts. Remote MTAs (at least the legitimate ones) then retry delivery, which adds load and leads to a downward spiral. So, just because a filter can be used in a pre-queue fashion doesn't mean that it should be.

Rather, think in terms of what one would want as pre-queue filters:

Candidates for pre-queue filtering

Fast/Light or Cuts down Mail load:
  • DNSBL
  • SPF checks
  • Greylisting
  • SMTP Compliance

"Not so good" Candidates:
  • Anti-Virus/Spam (CPU intensive)
  • Most other message body manipulation
  • CPU or I/O Intensive side-effects
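
To see why a DNSBL check counts as fast and light: the whole test is a single DNS lookup on a name built from the client address, done before any message body is transferred. Here is a small Python sketch of building that name. The zone `dnsbl.example.org` is a placeholder, not a real list.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the name to look up: reversed IPv4 octets plus the list's zone."""
    addr = ipaddress.IPv4Address(client_ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Client address taken from the SMTP transcripts; the zone is hypothetical.
name = dnsbl_query_name("192.0.34.166", "dnsbl.example.org")
print(name)
```

If the resulting name resolves, the client is listed; if the name does not exist, it is not.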

Filter I/O: MTA Plug-in

Designed for a particular MTA (such as Microsoft Exchange w/VSAPI?)
Essentially yields 'full-featured' Standalone SMTP-aware Filter
Operates in MTA's process space
I don't find them that flexible, but these can work well for specific situations.

Filter I/O: SMTP-aware Filter

It speaks SMTP, but do we call it an 'MTA'?
SMTP usually means post-queue filtering only
Examples of COTS SMTP-aware filters
Hardware Boxes Software Only
* denotes what we use in our group or our department, ** what we have used.

Another Pretty Picture

Filter I/O: API/Protocol

MTA and filter are separate and communicate via a defined protocol or API
Like the MTA Plugin, this augments an MTA instead of replacing it one way or another, but, unlike plugins, filters and MTA are separate. For high performance computing types, think shared memory versus message passing.

Two well known API/Protocols:
Lots of for-pay and open/cheap-source options here
Another Pretty Diagram

Filter I/O: SMTP and Milter compared

SMTP - Everyone speaks it, some better than others
Milter - The new cool thing

Filter I/O: Dual MTA and Milter Compared

Dual MTA needs at least one extra* MTA listening on different port/socket
* Postfix lets you get around this (choice of pre- or post-queue), sorta ...

Milter setup only needs one MTA running
a Pretty Diagram comparing basic setups and Another comparing multi-filter setups

Filter I/O: Transparent (or is it Opaque?) Proxy (Bonus Slide)

In many ways, similar to an SMTP-aware filter except the filter "intercepts" traffic at IP/Layer 3 or lower, not at the SMTP layer.

Imagine the filter as an invisible box that watches the network traffic on port 25 and silently edits and rewrites packets so that no one is the wiser on either end. For bonus points, you can do this even lower, at Layer 2 (à la a bridging firewall).

Okay, this all sounds very slick, but I don't know offhand of anything that does this or of anyone who has done this with a homegrown system. Anyone hear of something like this, even homegrown?

Filter: Side Effects and Other Random Bits

Notification E-mails
Saving Viruses/Spam
Logging

"Virus Deleted" Emails: In EECS, we actually send the cleaned e-mails. A default sieve rule on our IMAP server auto-files all such cleaned e-mails to a special folder where users can ignore them or be impressed by how much virus-laden email we're catching for them.

Saving Viruses: A user in our department who was doing work with Windows viruses asked if we had any he could get his hands on. We save viruses mostly for our amusement and to run stats, but we were able to give him a CD of viruses and get paid some T&M for it.

Deployment: What to do with these tools

We now have some pieces that can be combined in many ways, how can we use them?


Install it Everywhere
- The Simple (But Stupid?) Life

MX Filter
- The "Big Guy at the Entrance" way.

Network Filter Service (The Other Other NFS)
- It seems slick, but is it really useful?

Deployment: Install It Everywhere

Pretty self-explanatory

Pros
Cons
This does not count as scalable, mmm'kay?

Deployment: MX Filter in a Nutshell

Essentially an SMTP relay that filters along the way

Major Steps
Optional
A Pretty Diagram

Deployment: MX Filter - Pro and Cons

Pros
Cons

Deployment: Network Filter Service in a Nutshell


Not a Network File System, nor a Number Field Sieve

Major Steps
Optional
A Pretty Diagram

Deployment: Network Filter Service - Pro and Cons

Pros
Cons
It seems cooler, but maybe not better when supporting disjoint heterogeneous mail servers. It may work better in a more uniformly managed environment, say an end-to-end mail service as opposed to "just" protecting someone else's servers.

Deployment: A Detour for Exchange

Microsoft Exchange is, for better or for worse, not going to go away anytime soon. The question is "How best to keep the viruses away from it?"

First, and perhaps only, relevant thing to remember:
You cannot rely upon SMTP filtering as a sole method of anti-virus for Exchange

For example, users can upload files to Exchange via HTTP, from desktops, PDAs, anything. Where have your users' Crackberrys been?

While it is always a good idea to "pre-filter" mail inbound to an Exchange server, you should also make use of Exchange's VSAPI to provide virus scanning of an item whenever a client requests it, not just when the item (message) is accepted and enqueued. Additionally, items are continually rescanned when virus definitions/signatures are updated.

Implementation: Our version

We went with Filtering MXs for deploying our virus filter.

Our guiding principle was "Free Beer Good".

Our tools:

The Obligatory Pretty Diagram

Our Way: Hardware/OS (Free Beer)

The Systems: Solaris on Sparc
Note:
We did not have to spend any extra money to obtain hardware in order to provide virus filtering for our customers. Free Beer Good.

Our Way: Sendmail (Cheap Beer)

Roll our own from source, or use the binaries in Solaris?
Complications:
Building from source only creates a static libmilter by default

Aside from our time, we got this for low/no cost and we learned a bit.

Our Way: Dealing with the SMTP parts

Configure sendmail via /etc/mail/relay-domains to relay email

# domains for which we relay mail
# This file is read in only when sendmail starts or is sent SIGHUP.
#
cool.EECS.Berkeley.EDU
hot.EECS.Berkeley.EDU
rad.EECS.Berkeley.EDU
here.CS.Berkeley.EDU
there.EECS.Berkeley.EDU


Configure sendmail via /etc/mail/mailertable for [e]smtp/lmtp handoff

# domainname	esmtp:[next-hop-server]
# note use of []'s to suppress MX lookup
# pipe into '/usr/sbin/makemap hash /etc/mail/mailertable' or
# run '/usr/sbin/makemap hash /etc/mail/mailertable < /etc/mail/mailertable'
#
cool.EECS.Berkeley.EDU	esmtp:[cool.EECS.Berkeley.EDU]
hot.EECS.Berkeley.EDU   esmtp:[cool.EECS.Berkeley.EDU]
rad.EECS.Berkeley.EDU   esmtp:[awesome.EECS.Berkeley.EDU]
here.CS.Berkeley.EDU	esmtp:[here.CS.Berkeley.EDU]
there.EECS.Berkeley.EDU esmtp:[here.CS.Berkeley.EDU]

May need to enable mailertable in sendmail.mc and rebuild sendmail.cf:

FEATURE(mailertable, `hash -o /etc/mail/mailertable')

Our Way: Dealing with the SMTP parts - meta-config file

A perl script creates /etc/mail/{mailertable,relay-domains} from a config file.

vw-config.cf:
# sourcefile for /etc/mail/{mailertable,relay-domains} on vw systems
#
# format:
# server SERVER-NAME Freeform comments (can have spaces)
#        CLIENT-NAME
# ^^^^^^ whitespace optional, used for readability
#
# "vw-config" command used to generate /etc/mail/{mailertable,relay-domains}
#
server cool.EECS.Berkeley.EDU The Coolest Server in Town
    cool.EECS.Berkeley.EDU
    hot.EECS.Berkeley.EDU
server awesome.EECS.Berkeley.EDU A Pretty Awesome Server
    rad.EECS.Berkeley.EDU
server here.CS.Berkeley.EDU mail/ftp/webserver for the Nowhere Group
    here.CS.Berkeley.EDU
    there.EECS.Berkeley.EDU
...

Note: This was a minimalist approach to things. Other options include using an SQL database, LDAP, or some other setup to store this info, as long as you can get it into the form your MTA (sendmail here) can use.
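
The original generator was a Perl script that is not shown here. As an illustration of the idea only, here is a Python sketch that reads the vw-config.cf format described above and renders both files. The function names are made up for this example, and the output layout simply mirrors the mailertable and relay-domains samples from the earlier slide.

```python
def parse_vw_config(text: str) -> dict:
    """Map each server name to the list of client domains relayed to it."""
    servers = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()              # leading whitespace is optional
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] == "server":
            if len(fields) < 2:
                raise ValueError(f"server line without a name: {raw!r}")
            current = fields[1]         # rest of the line is a free-form comment
            servers[current] = []
        elif current is None:
            raise ValueError(f"client listed before any server: {raw!r}")
        else:
            servers[current].append(fields[0])
    return servers


def render_relay_domains(servers: dict) -> str:
    """One relayed domain per line."""
    return "".join(f"{c}\n" for clients in servers.values() for c in clients)


def render_mailertable(servers: dict) -> str:
    """domain<TAB>esmtp:[next-hop]; the brackets suppress the MX lookup."""
    return "".join(
        f"{client}\tesmtp:[{server}]\n"
        for server, clients in servers.items()
        for client in clients
    )


SAMPLE = """\
server cool.EECS.Berkeley.EDU The Coolest Server in Town
    cool.EECS.Berkeley.EDU
    hot.EECS.Berkeley.EDU
server awesome.EECS.Berkeley.EDU A Pretty Awesome Server
    rad.EECS.Berkeley.EDU
"""
servers = parse_vw_config(SAMPLE)
print(render_mailertable(servers), end="")
```

The rendered mailertable text still has to be run through makemap before sendmail can use it.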

Our Way: Milters - a detour

Before going into specific milters, let's look at how to configure the MTA to use them. Sorry, this is currently Sendmail specific, but here are some notes for Postfix. You should first read the Milter Installation and Configuration page, but here are the important bits for the impatient:

Our Way: Virus Filter Software (Almost Free Beer)

TrendMicro Viruswall (Sendmail Edition)
Campus had a license for TrendMicro VirusWall (later, only our department licensed it), priced by the number of users, not the number of systems. So, essentially, the beer was already paid for.

Our Way: RCPT Verification Software (Almost Free Beer)

SnertSoft milter-ahead
This milter used to be Free Beer until around version 1.x. Now it costs a whole 90€ for a site license, which is still Pretty Cheap Beer.

Our Way: RCPT verification milter and Sendmail

SnertSoft's milter-ahead makes use of Sendmail's own environment:

Our Way: Connecting Sendmail to the Milters

We have a sendmail with milter support and milters, now to connect them.

This in /etc/mail/sendmail.mc:

... [all the usual stuff]
INPUT_MAIL_FILTER(`milter-ahead',`S=unix:/var/run/milter-ahead.sock,F=T,T=C:1m;S:30s;R:6m;E:5m')
INPUT_MAIL_FILTER(`virus',`S=inet:2701@127.0.0.1,F=T,T=S:2m;R:2m;E:5m')
... [rest of your config]

Results in this in /etc/mail/sendmail.cf:

...
Xmilter-ahead, S=unix:/var/run/milter-ahead.sock, F=T, T=C:1m;S:30s;R:6m;E:5m
Xvirus, S=inet:2701@127.0.0.1, F=T, T=S:2m;R:2m;E:5m
...

With milter-ahead listed before the virus filter, we don't have to virus-scan email for bogus recipients. To change the order, use this in sendmail.mc:

define(`confINPUT_MAIL_FILTERS', `virus,milter-ahead')

to get this in sendmail.cf:

O InputMailFilters=virus,milter-ahead
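
For readers new to the X lines in sendmail.cf: S= names the socket, F= says what to do when the filter is unavailable (T for a temporary failure, R for reject), and T= sets the timeouts for connecting (C), sending to the filter (S), reading its reply (R) and waiting for the final reply after end-of-message (E). The following Python sketch just pulls such a line apart. The function name and the returned dictionary layout are invented for illustration and are not part of sendmail.

```python
def parse_milter_equates(spec: str) -> dict:
    """Split 'S=..., F=..., T=C:1m;S:30s' into socket, flags and timeouts."""
    result = {"socket": None, "flags": "", "timeouts": {}}
    for part in spec.split(","):
        key, _, value = part.strip().partition("=")
        if key == "S":
            result["socket"] = value
        elif key == "F":
            result["flags"] = value
        elif key == "T":
            # Timeouts are ';'-separated NAME:DURATION pairs.
            for item in value.split(";"):
                name, _, duration = item.partition(":")
                result["timeouts"][name] = duration
    return result


virus = parse_milter_equates("S=inet:2701@127.0.0.1, F=T, T=S:2m;R:2m;E:5m")
print(virus["socket"], virus["flags"], virus["timeouts"])
```

Note that the virus filter line sets no C timeout, so whatever default sendmail applies for connecting is used there.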

Implementation: The Big Mail Server Farm

You have lots of users, a decent budget, and time to set it up right. Sure.

Some of the big parts:


And of course, another diagram

The Big Mail Server Farm

Wait a minute, how does the MX Farm talk to the Milter Farm?

Well, you can buy a load balancer, but there is a somewhat clever, likely horrible, yet totally approved* hack.

INPUT_MAIL_FILTER(`milter',`S=inet:2701@milterfarm.cs.berkeley.edu,F=T,T=S:2m;R:2m;E:5m')

# in DNS
milterfarm	IN	A	10.0.0.10
			A	10.0.0.11
			A	10.0.0.12
			A	10.0.0.13
			...

That's right, DNS round robin -- the cheap person's load sharing (not balancing!) system. I asked one of the guys (the guy) behind milter, and he said this was how to do it. I don't remember if this made use of something in sendmail (in which case this won't work with Postfix), or it was something internal to milter. There are some alternatives:

Crazy Ideas 1

Okay, so you can load share among a farm of servers running a milter, but how can one failover from a milter accessed over a local/unix socket to one accessed via a network socket?

Well, it sounds a little crazy, but instead of setting up the local milter to use a local unix socket, set it up to run over the loopback interface on 127.0.0.1 and add 127.0.0.1 to your DNS round robin.
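
To make the idea concrete, here is a toy Python model of a round-robin record set that includes the loopback address. It is a sketch under assumptions: the class stands in for a DNS server that rotates its answers, and `pick_milter` assumes the client walks the returned list until something answers. Whether sendmail or the milter library actually behaves that way is exactly the open question from the previous slide.

```python
from collections import deque
from typing import Callable, Iterable, Optional


class RoundRobinRecords:
    """Toy stand-in for a DNS server that rotates its A records per query."""

    def __init__(self, addresses: Iterable[str]) -> None:
        self._records = deque(addresses)

    def resolve(self) -> list:
        answer = list(self._records)
        self._records.rotate(-1)    # next query starts one record later
        return answer


def pick_milter(answer: list, is_up: Callable[[str], bool]) -> Optional[str]:
    """Assumed client behaviour: try addresses in the order returned."""
    for address in answer:
        if is_up(address):
            return address
    return None


farm = RoundRobinRecords(["127.0.0.1", "10.0.0.10", "10.0.0.11"])
local_is_down = lambda address: address != "127.0.0.1"
for _ in range(3):
    print(pick_milter(farm.resolve(), local_is_down))
```

In this model the local milter is just one more entry in the rotation, so it gets an equal share of the traffic rather than being preferred. That is load sharing, not balancing, and not really "local first" failover either.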

I've spent all this time (or money) on a huge milter farm, but it's not getting used enough! What can I do to justify this to my boss?

Well, if you spent a lot of time or money setting up anti-virus, it makes sense to try to use it for other things. I just hope you chose an anti-virus scanning engine that can easily be used for something other than SMTP, like ClamAV, which can be used for a number of other purposes such as a virus-scanning HTTP proxy.

And, of course, there was that slide back during Filter I/O

Odds and Ends

A check_expn rule I wrote a while ago. This also works for use as a check_vrfy rule. It could probably use a little clean up or canonicalizing into Sendmail Standard Form.

Basic logic:
look up foo.domain.com in /etc/mail/access using the F rule and look for an entry like this:

EXPN:webmaster@foo.domain.com		OK
EXPN:postmaster				OK
EXPN:root@bar.domain.com		DENY

If not, use the A rule to look up the IP address or octet based netblock in /etc/mail/access. (Don't think anyone has done CIDR in sendmail.cf yet ...)

EXPN:127.0.0.1		OK # allow localhost to expn
EXPN:128.32.0.0		OK # allow UCB-ETHER to expn

sendmail.mc:
...
LOCAL_RULESETS
Scheck_expn
R$*			$: $>F <$1> <?> <! Expn> <$1>
R<?> <$*>	    	$: <$&{client_addr}> <$1>
R<$*> <$*>		$: $>A <$1> <?> <! Expn> <$2>
R<OK> <$*>		$@ $1
R$*			$#error $@ 5.7.0 $: "502 sorry, we do not allow this operation."

References/Resources

This presentation was done in S5, the xhtml/css presentation system.

Thanks and Acknowledgements

I'd like to thank the following people for their feedback and encouragement: