Previously: part 1, part 2, part 3, part 4, part 5a, part 5b, part 5c, part 6
Message content scanning is vital for blocking viruses and spam
that aren't blocked by DNSBLs. Unfortunately the interface between
MTAs and scanners varies wildly depending on the MTA and the scanner -
there are no good standards in this area. However I'm going to ignore
that cesspit for now, and instead concentrate on when the scanning is
performed relative to other message processing. As you might expect,
most of the time it is done wrong. Fortunately for me the Postfix
documentation has an excellent catalogue of wrong ways that I can
refer to in this article.
An old approach is for the MTA to deliver the message to the
scanner which then re-injects it into the MTA. The MTA needs to
distinguish messages from outside and messages from the scanner so
that the former are scanned and the latter are delivered normally.
The Postfix documentation
describes doing the delivery and re-injection in the "simple" way via
a pipe and the sendmail command, or in the "advanced" way via SMTP.
The usual way to do this with Exim
is to tell Exim to deliver to itself using BSMTP over a pipe, using
the transport filter feature to invoke the scanner. This setup has a
couple of disadvantages that are worth noting. It (at a minimum)
doubles your load because each message is received and delivered
twice. It also makes the logs confusing to read, since the message has
a different queue ID before and after scanning and therefore different
IDs when it is originally received and finally delivered.
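To make the "simple" arrangement concrete, here is a rough Python
sketch of a pipe filter in the style the Postfix documentation
describes: the MTA pipes the message to the filter, with the sender
and recipients as arguments, and the filter re-injects it using the
sendmail command. The scanmail command is a stand-in for whatever
scanner you use; the sendmail flags are the ones the Postfix example
uses.

    #!/usr/bin/env python3
    # Sketch of a "simple" after-queue content filter: scan the
    # message from stdin, then re-inject it with the sendmail
    # command for normal delivery.
    import subprocess
    import sys

    message = sys.stdin.buffer.read()

    # Hypothetical scanner: exit status 0 means the message is clean.
    if subprocess.run(["scanmail"], input=message).returncode != 0:
        sys.exit(69)  # EX_UNAVAILABLE: the MTA bounces the message

    # Re-inject, passing the sender/recipient arguments through.
    # The MTA must be able to tell these submissions apart from
    # outside mail, or every message is scanned forever.
    subprocess.run(["/usr/sbin/sendmail", "-G", "-i"] + sys.argv[1:],
                   input=message, check=True)

This is where the doubled load shows up: the same message is counted
as a reception and a delivery on each side of the filter.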
Another arrangement is MailScanner's
bump-in-the-queue setup. The MTA is configured to leave messages in
the queue after receiving them, instead of delivering them immediately
as it usually would. MailScanner picks them up fairly promptly - it
scans the queue every second or two - and after scanning them, drops
them in a second queue then tells the MTA to deliver the messages from
this second queue. MailScanner has the advantage that it can work in
batch mode, so when load is high (several messages arrive between
incoming queue scans) the scanner startup cost is spread more thinly.
This is useful for old-fashioned scanners that can't be daemonized.
Apart from the scanning itself, its only overhead is moving the
messages between queues. MailScanner also preserves queue IDs, keeping
logs simple. A key disadvantage is that MailScanner needs intimate
knowledge of the MTA's queue format, which is usually considered
private to the MTA. Sendmail and Exim do at least document their queue
formats, though MailScanner is still vulnerable to format changes
(e.g. Exim's recent extension to ACL variable syntax). Postfix is much
more shy of its private parts, so there's a long-standing argument
between people who want to use MailScanner and Wietse Venema who
insists that it is completely wrong to fiddle with the queue in this
way.
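The shape of the bump-in-the-queue arrangement, omitting all the
grubby queue-file parsing that the real MailScanner has to do, is
roughly this. The paths and the scanmail command are hypothetical;
assume the scanner prints the names of the messages that fail.

    #!/usr/bin/env python3
    # Sketch of a bump-in-the-queue scanner: poll a held incoming
    # queue, scan whatever has accumulated as one batch, move the
    # survivors to the outgoing queue, and kick the MTA.
    import os
    import shutil
    import subprocess
    import time

    INCOMING = "/var/spool/mta/hold"        # the MTA receives into here
    OUTGOING = "/var/spool/mta/outgoing"    # and delivers from here
    QUARANTINE = "/var/spool/mta/quarantine"

    while True:
        batch = sorted(os.listdir(INCOMING))
        if batch:
            # One scanner start-up is amortised over the whole batch.
            scan = subprocess.run(["scanmail"] + batch, cwd=INCOMING,
                                  capture_output=True, text=True)
            failed = set(scan.stdout.split())
            for name in batch:
                dest = QUARANTINE if name in failed else OUTGOING
                shutil.move(os.path.join(INCOMING, name), dest)
            subprocess.run(["sendmail", "-q"])  # flush the outgoing queue
        time.sleep(2)  # MailScanner polls every second or two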
So far I have completely ignored the most important problem that
both these designs have. It is too late to identify junk
email after you have accepted responsibility for delivering it. You
can't bounce the junk, because it will have a bogus return path so the
bounce will go to the wrong place. You can't discard it because of the
risk of throwing away legitimate email. You can't quarantine it or
file it in a junk mailbox, because people will not check the
quarantine and the ultimate effect will be the same as discarding.
(Perhaps I exaggerate a bit: If the recipient doesn't get an expected
message promptly, or if the sender contacts them out of band because
they didn't get a reply, the recipient can at least look in the
quarantine for it. However you can only expect people to check their
quarantines for unexpected misclassified email if the volume
of junk in the quarantine is relatively small. Which means the
quarantine should be reserved for the most difficult-to-classify
messages.)
You must design the MTA to scan email during the SMTP conversation,
before it accepts responsibility for the message. It can then reject
messages that smell bad. Software that sends junk will just drop a
rejected message, whereas legitimate software will generate a bounce
to inform the sender of the problem. This minimises spam
backscatter, while legitimate senders still get prompt notification of
false positives. However you become much more vulnerable to overload:
If you scan messages after accepting them, you can deal with an
overload situation by letting a backlog build up, to be dealt with
when the load goes down again. You do not have this latitude with
SMTP-time scanning.
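Here is a sketch of what SMTP-time scanning looks like, using the
third-party Python aiosmtpd package; looks_like_junk() is a stand-in
for a real scanner.

    #!/usr/bin/env python3
    # The scan runs before we answer the end-of-data command, so
    # junk is rejected while the sender is still responsible for it.
    import time
    from aiosmtpd.controller import Controller

    def looks_like_junk(data: bytes) -> bool:
        return b"GTUBE" in data  # placeholder for a real scan

    class ScanningHandler:
        async def handle_DATA(self, server, session, envelope):
            if looks_like_junk(envelope.content):
                # Junkware drops the message; legitimate senders
                # generate the bounce themselves - no backscatter.
                return "550 5.7.1 message rejected by content scan"
            # Only now do we accept responsibility for delivery.
            return "250 OK"

    controller = Controller(ScanningHandler(),
                            hostname="127.0.0.1", port=8025)
    controller.start()
    try:
        while True:
            time.sleep(3600)
    finally:
        controller.stop()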
The Postfix before-queue content filter setup
uses the Postfix smtpd on the front end to do non-content anti-spam
checks (e.g. DNS blacklists and address verification), then passes
each message through the scanner using SMTP (much as in Postfix's
"advanced" after-queue filters), and on to another instance of
smtpd, which inserts the message into the queue. There
is minimal buffering before the scanner, so the whole message must be
scanned in memory as it comes in, which means the scanner's
concurrency is the same as the number of incoming connections. This is
a waste: messages come in over the network slowly; if you buffer them
so that you can pass them to the scanner at full speed, you can handle
the same volume of email with lower scanner concurrency, saving memory
resources or increasing the number of connections you can handle at
once. However you don't want to buffer large messages in memory
because that brings back the problem in another form. You also don't
want to buffer them on disk, since that would add overhead to the
slowest part of the system - unless you use the queue file as
the buffer. This implies that Postfix's before-queue filtering happens
too early, since the writing to disk happens after the message has gone
through the scanner.
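To put rough numbers on that: suppose 200 clients are each trickling
in a 1 MB message at 50 kB/s. Streamed straight through, each scan is
held open for the 20 seconds its message takes to arrive, so the
scanner needs 200 scans in flight at once. Buffered first, the scanner
only has to keep up with the aggregate 10 MB/s, and if it can read a
buffered message at 50 MB/s then a handful of concurrent scans handles
the same load.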
Sendmail's milter API
couples scanners to the MTA in about the same place as Postfix's
before-queue content filter, so it has the same performance problems.
(Actually, in some cases it is worse: If you have a filter that wants
to modify the message body, then with Postfix it can in principle do
so in streaming mode with minimal in-memory buffering, whereas with
Sendmail the milter API forces it to buffer the entire message before
it can start emitting the modified version.) More interesting are
their contrasting approaches to protocol design. Postfix goes for a
simple open standard on-the-wire protocol as the interface to its
scanners. However it misses its target: It speaks a simplified version
of SMTP to the scanner, with a non-standard protocol extension to pass
information about the client through to Postfix's back end. The
simplification means that Postfix cannot offer SMTP extensions such as
BINARYMIME if the scanner does not do so too, which is a bit
crippling. Sendmail goes for an open API, and expects scanners to link
to a library that provides this API. The connection to the MTA is a
private undocumented protocol internal to Sendmail, and subject to
change between versions. This decouples scanners from
the details of SMTP,
but instead couples them to Sendmail. This is terrible for
interoperability - and in practice it's futile to fight against
interoperability by making the protocol private, because people will
create independent implementations of it anyway: 1, 2, 3.
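The pymilter binding is a handy way to see the shape of the milter API
from Python. This sketch (not anyone's production filter) is where the
body-buffering problem I mentioned shows up: body() chunks have to be
accumulated, because replacebody() may only be called from eom(),
after the whole message has arrived.

    #!/usr/bin/env python3
    # Sketch of a body-modifying filter using the third-party
    # pymilter binding of Sendmail's milter API.
    import Milter

    class RewritingMilter(Milter.Base):
        def __init__(self):
            self.chunks = []

        def body(self, chunk):
            self.chunks.append(chunk)  # no choice but to buffer
            return Milter.CONTINUE

        def eom(self):
            data = b"".join(self.chunks)
            data = data.replace(b"\0", b"")  # hypothetical rewrite
            self.replacebody(data)  # only legal here, at end of message
            return Milter.ACCEPT

    Milter.factory = RewritingMilter
    Milter.set_flags(Milter.CHGBODY)  # ask permission to modify bodies
    Milter.runmilter("rewriter", "inet:8894@127.0.0.1", timeout=240)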
So I don't like the Postfix or the Sendmail approaches, both because
of their performance characteristics and because of their bad
interfaces.
Exim is agnostic about its interface to scanners: it has sections
of code that talk directly to each of the popular scanners, e.g.
SpamAssassin, ClamAV, etc. This is rather inefficient in terms of
development resources (though the protocols tend to be simple), and is
succumbing to exactly the Babel that Postfix and Sendmail were trying
to avoid. Exim's approach has the potential to be better from the
performance point of view: It writes the message to disk before
passing it to the scanner at full speed, so in principle the same file
could act as the buffer for the scanner and the queue file for later
delivery. This would mean there are no overheads for buffering
messages that are accepted; if the message is rejected then it will
only hit the disk if the machine is under memory pressure. Sadly the
current implementation formats the message to a second file on disk
before passing it to the scanner(s), instead of formatting it in the
process of piping it to the scanner. The other weakness is that
although there is a limit on the number of concurrent SMTP
connections, you can't separately cap the number of messages being
scanned at once at something smaller. You must instead rely on the
scanners themselves to implement concurrency limits, and avoid
undaemonized scanners that lack them. This is probably adequate for
many setups, but it means the MTA can't make use of its greater
knowledge to do things like prioritize internal traffic over external
traffic in the event of overload.
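As an illustration of how simple these protocols are, here is clamd's
INSTREAM command more or less in its entirety (a sketch; error
handling and response parsing are minimal).

    #!/usr/bin/env python3
    # Sketch of clamd's INSTREAM protocol: send "zINSTREAM\0", then
    # the message as length-prefixed chunks, then a zero-length
    # chunk; clamd replies "stream: OK" or "stream: <name> FOUND".
    import socket
    import struct

    def clamd_scan(message: bytes, host="127.0.0.1", port=3310) -> str:
        with socket.create_connection((host, port)) as sock:
            sock.sendall(b"zINSTREAM\0")
            for i in range(0, len(message), 8192):
                chunk = message[i:i + 8192]
                # Each chunk is prefixed with its length in network order.
                sock.sendall(struct.pack("!I", len(chunk)) + chunk)
            sock.sendall(struct.pack("!I", 0))  # zero length ends the stream
            return sock.recv(4096).decode().rstrip("\0\n")

    print(clamd_scan(b"an innocuous test message"))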
So, having criticised everything in sight, what properties do we
want from the MTA's interface to scanners? In general, we would like
the logistics of passing the message to the scanner to add no
significant overhead - i.e. the cost should be the same as receiving
the message and scanning the message considered separately, with
nothing added to plug these processes together. Furthermore we'd like
to save scanners from having to duplicate functionality that already
exists in the MTA. Specifically:
- Buffer the message in its queue file before scanning, so that the
scanner does not take longer than necessary because it is limited by
the client's sending speed.
- Insulate the scanner from the details of SMTP extensions and wire
formats, without compromising the MTA's support for same. This implies
that any reformatting (e.g. downgrading binary attachments to base64)
needed by the scanner should not pessimize onward delivery.
- Put sensible limits on the concurrency demanded of the scanner, to
maximise its throughput, and use short-term queueing and scheduling (a
few seconds) to handle spikes in load - see the sketch after this
list.
- Cache scanner results.
- Put a security boundary between the MTA and the scanner.
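A sketch of the concurrency and caching requirements: a semaphore as
the concurrency limit, its acquire timeout as the short-term queue,
and verdicts cached by content digest. scan() is a stand-in for a real
scanner, and a real cache would expire its entries.

    #!/usr/bin/env python3
    import hashlib
    import threading

    MAX_SCANS = 4     # concurrency that maximises scanner throughput
    QUEUE_SECS = 5.0  # short-term queueing to ride out spikes in load

    slots = threading.BoundedSemaphore(MAX_SCANS)
    cache = {}        # content digest -> verdict
    cache_lock = threading.Lock()

    def scan(message: bytes) -> str:
        return "clean"  # hypothetical scanner

    def scan_with_limits(message: bytes) -> str:
        digest = hashlib.sha256(message).digest()
        with cache_lock:
            if digest in cache:      # e.g. the client retried
                return cache[digest]
        if not slots.acquire(timeout=QUEUE_SECS):
            return "tempfail"        # overloaded: ask the client to retry
        try:
            verdict = scan(message)
        finally:
            slots.release()
        with cache_lock:
            cache[digest] = verdict
        return verdict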
Notice that these have a certain commonality with callout address
verification, which also needs a results cache, concurrency limits,
and a queue / scheduler. This gives me the idea for what I call
"data callouts" for content scanning, based on a loose
analogy between verifying that the message's addresses are OK and
verifying that the message's contents are OK. Also notice that message
reformatting and security boundaries are requirements for local
delivery. So a "data callout" is essentially a special kind
of local delivery that the MTA performs before sending its
reply to the last of the message data; it's a special kind of delivery
because it is only done to check for success or failure - unlike
normal deliveries the message isn't stored in a mailbox. This design
makes good use of existing infrastructure: The MTA can use its global
scheduler to manage the load on the scanner. There is already lots of
variability in local delivery, so the variability in content scanner
protocols fits in nicely.
The data callout is actually a special case of "early
delivery", i.e. delivering a message before telling the client
that it has been accepted. This feature gives you a massive
performance boost, since you can relay a message without touching disk
at all (except to log!). If you are going to attempt this stunt then
you need a coherent way to deal with
problems caused by the
early delivery taking too long. Probably the best plan is to
ensure that a very slow diskless early delivery can be converted to a
normal on-disk delivery, so that a response can be given to the client
before it times out, and so that the effort spent on delivery so far
is not wasted. This is similar to allowing lengthy callout address
verifications to continue even after the client that triggered them
has gone, so that the callout cache will be populated with a result
that can be returned quickly when the client retries. (I'm not sure if
it's worth doing the same thing with data callouts, or if a slow
scanner more likely indicates some nasty problem that the MTA should
back away from.)
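A sketch of the fallback, with asyncio standing in for the MTA's
scheduler; early_delivery() and write_queue_file() are hypothetical.

    #!/usr/bin/env python3
    # Attempt a diskless relay; if it hasn't finished within
    # FAST_SECS, write a normal queue file and answer the client
    # anyway.  The in-flight delivery is left running, so the
    # effort spent so far isn't wasted.
    import asyncio

    FAST_SECS = 20.0  # well inside the client's SMTP timeout

    async def early_delivery(message: bytes) -> None:
        await asyncio.sleep(0.1)  # stand-in for relaying onwards

    def write_queue_file(message: bytes) -> None:
        pass                      # stand-in for a normal queue write

    async def end_of_data(message: bytes) -> str:
        task = asyncio.create_task(early_delivery(message))
        done, _pending = await asyncio.wait({task}, timeout=FAST_SECS)
        if task in done:
            task.result()           # propagate a failed relay
            return "250 delivered"  # never touched disk (except logs)
        write_queue_file(message)   # convert to an on-disk delivery
        return "250 queued"         # the relay completes in the background

    print(asyncio.run(end_of_data(b"test message")))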
The Postfix and Sendmail filter interfaces have a feature that is
missing from Exim's scanner interface and my data callout idea. The
filters can modify the message, whereas the scanners can only return a
short result (such as a score). Message mangling is not something I
particularly approve of, but it is a popular requirement. Fortunately
my idea can support it, by going back to the old approach of
delivering the message to the scanner which then re-injects it. Early
delivery removes most of the disadvantages from this technique: it
happens before we accept the message, and it doesn't add to disk load.
It adds a new advantage of being able to fall back gracefully from
scan-then-accept to accept-then-scan in the event of overload, if
that's what you want. It still has the disadvantages of log
obfuscation and depending on the scanner to support advanced SMTP
features (though perhaps these can be avoided with a better filter
protocol).
I hope that this convinces you that - as I said in
my last essay -
lots of cool things become possible if you get callouts right. This
essay also serves as a response to iwj10, who complained that my
log-structured queue idea
was a pointless optimisation because early delivery is much more
effective. He wasn't convinced when I said that early delivery was a
separate problem. Even when you have early delivery - so that the
queue only contains very slow or undeliverable messages - the
log-structured queue reduces the effort required to work out which
messages to retry next because the necessary information is stored in
the right order.