Yesterday I wrote about the format of message data in spool files, but
that's only part of what is stored in the spool - and from the
performance point of view it isn't even a very important part, even
though that's what I was writing about. Far more important for performance
is the high-level logistics of spool file management: how they are
created, written, and deleted.
As well as the message data, the spool will contain for each
message all the information needed to route it to its destination(s).
At a minimum this is the SMTP envelope, i.e. the recipient addresses,
the return path (in case the message bounces) and the DSN options.
However sometimes the postmaster wants to make routeing decisions
using more details of the message's origin, such as the client and
server IP addresses, information from the DNS, security details from
TLS and SASL, the time it arrived, and results from anti-spam and
anti-virus scanners. Also, for my cunning wire format scheme you need
to preserve the MIME parse tree and dot-stuffing points.
Almost all of this information never changes after the message has
been received. What does change is the list of recipients that have
successfully received the message, or failed, or which need to be
retried.
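To make the split concrete, here is a rough sketch in C of the
per-message information. The structure and field names are my own
invention for illustration, not taken from any particular MTA.

    #include <stddef.h>
    #include <time.h>

    /* Illustrative only: the envelope proper is written once and never
     * changes; only the per-recipient status changes during delivery. */
    struct recipient {
        char *address;
        enum { RCPT_PENDING, RCPT_DELIVERED, RCPT_FAILED, RCPT_RETRY } status;
    };

    struct envelope {
        /* static after the message has been received */
        char             *return_path;     /* where bounces go */
        char             *client_ip, *server_ip;
        char             *tls_info, *sasl_info;
        time_t            arrival_time;
        char             *scanner_results; /* anti-spam / anti-virus */
        /* the only part that changes as deliveries are performed */
        struct recipient *rcpts;
        size_t            nrcpts;
    };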
Sendmail and Exim have quite similar schemes for their spool files.
See the
Sendmail operations guide page 100 and the Exim
specification page 410. Both store a message in two principal
files, one containing the message body (i.e. the message data
excluding the header) and one containing the envelope and message
header. (Both can make routeing decisions based on arbitrary header
fields.) There are also some short-lived auxiliary files: a
transaction journal which records changes to the recipient list during
a delivery, and a temporary file which is used when writing a new
envelope file which holds the merged data from the old envelope file
and the transaction journal. (Exim also has per-message log files,
which IME are not much use, and can be turned off.)
The split is very slightly different in qmail, though it still has
essentially two files per message: its data file stores the whole
message data, and its envelope file stores a very bare envelope. It
doesn't have a complicated envelope file because qmail is limited to
routeing messages based on just their email addresses.
All of these MTAs are quite profligate in their use of files, and
therefore the amount of filesystem metadata manipulation they cause.
File creation requires modification of the parent directory's data and
inode, and the allocation of a new inode and at least one block for
the file's data. This implies that the disks are going to have to do
quite a lot of seeking to write the message data to stable
(crash-resistant) storage. As a result these MTAs don't get the best
possible performance out of their disk systems.
Postfix is better because it only uses one file per message. However
it has multiple queue directories, and moves messages between them at
different stages in their life cycle. This is wasteful: for example the
Postfix queue documentation says that "whenever the queue manager
is restarted [...] the queue manager moves all the active queue
messages back into the incoming queue, and then uses its normal
incoming queue scan to refill the active queue. The process of moving
all the messages back and forth [...] is expensive. At all costs,
avoid frequent restarts of the queue manager."
My simple spool scheme is as follows:
- You use one file per message, which is divided into three sections:
first the message data (in the same format as it came in on the wire),
then the static envelope (information from the client connection,
security negotiation, parser, scanner, etc), and finally the recipient
addresses and transaction journal.
- You create the spool file when you start receiving the message data,
because that is the first time that it becomes absolutely necessary.
It is never moved after it is created.
- Some of the envelope information depends on the message data, so it is
simpler to write all of the envelope after the data. The envelope is
much smaller than the message data, so it's OK to keep it in
memory.
- The recipient list comes last, because that is the part of the
envelope that changes as deliveries are performed. The transaction
journal is not necessarily just a list of addresses that we no longer
need to deal with: in some
situations you want to add recipients to the message after
receiving it.
- The file is written almost entirely once, in append-only order:
the data and static envelope information never change, and the
transaction journal is append-only. However you need to be able to
find the start of the envelope quickly when re-opening an old message,
so its offset in the file is stored before the message data, and this
offset information must be written after the data has been received.
In fact the presence of the envelope offset can serve as a way to
distinguish between messages that have been correctly received and
those that were only partially written (e.g. because of a crash). When
the client finishes sending you the message data, you append the
envelope, then update the envelope offset to mark the message as
being OK immediately before sending the OK response to the client
(see the sketch after this list).
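Here is a minimal sketch of that commit protocol in C, assuming a
hypothetical fixed-size header at the start of every spool file;
commit_message() and the header layout are illustrative, not any real
MTA's code.

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical header written at offset 0 of each spool file;
     * envelope_offset stays zero until the message is complete. */
    struct spool_header {
        uint64_t envelope_offset;
    };

    /* Called when the client finishes sending the message data:
     * append the envelope after the data, then publish its offset
     * to mark the message as correctly received. */
    int commit_message(int fd, const void *env, size_t envlen)
    {
        off_t off = lseek(fd, 0, SEEK_END);    /* envelope starts here */
        if (off < 0)
            return -1;
        if (write(fd, env, envlen) != (ssize_t)envlen)
            return -1;
        if (fsync(fd) < 0)                     /* envelope on stable storage */
            return -1;
        uint64_t env_off = (uint64_t)off;
        if (pwrite(fd, &env_off, sizeof env_off,
                   offsetof(struct spool_header, envelope_offset))
                != (ssize_t)sizeof env_off)
            return -1;
        return fsync(fd);                      /* now safe to send "250 OK" */
    }

Note that the first fsync() orders the envelope ahead of the offset:
without it, a crash could leave a nonzero offset pointing at garbage.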
Let's see if we can reduce the load on the disks even more.
One obvious cost is the creation and deletion of spool files. You
could keep a pool of unused files, and re-open one of them instead of
creating a new file. When you have finished with a file, you can
truncate it and return it to the unused pool. A small disadvantage of
this is that it removes the connection between a message's spool ID
and its file on disk.
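A minimal sketch of such a pool, assuming a single process and keeping
the descriptors of unused files open so that reuse avoids the create
altogether; spool_get() and spool_put() are hypothetical names.

    #include <fcntl.h>
    #include <unistd.h>

    #define POOL_MAX 64
    static int pool[POOL_MAX];    /* descriptors of empty spool files */
    static int npool;

    /* Get a spool file: reuse a pooled one if possible, else create. */
    int spool_get(const char *path)
    {
        if (npool > 0)
            return pool[--npool];
        return open(path, O_RDWR | O_CREAT, 0600);
    }

    /* Finished with a message: empty the file and keep it for reuse. */
    void spool_put(int fd)
    {
        if (npool < POOL_MAX && ftruncate(fd, 0) == 0)
            pool[npool++] = fd;
        else
            close(fd);
    }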
You are still truncating and extending the files frequently, which is
a lot of work for the filesystem's block allocator. You could skip the
truncation step before returning the file to the unused pool. This
means you have to change your spool file format slightly to add a
marker so that you can tell where the file's current contents end and
the junk from previous messages starts. You have to overwrite the
marker and add a new one whenever writing to the file, and you can no
longer use O_APPEND mode. You also have to mark the file as
unused, by zapping the envelope offset, before returning it to the
pool.
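As a sketch of how such a marker might work: a magic terminator
follows the live contents, and each append overwrites the old
terminator and lays down a new one with pwrite(). A real format would
probably use record lengths or checksums rather than a bare magic
number, which junk from a previous message could collide with.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    static const uint32_t END_MARKER = 0xDEADBEEF;  /* arbitrary magic */

    /* 'end' is the offset of the current marker, tracked in memory;
     * everything beyond the marker is junk from previous messages. */
    int spool_append(int fd, off_t *end, const void *rec, size_t len)
    {
        char buf[4096];
        if (len + sizeof END_MARKER > sizeof buf)
            return -1;
        memcpy(buf, rec, len);
        memcpy(buf + len, &END_MARKER, sizeof END_MARKER);
        /* overwrite the old marker; the new one follows the record */
        if (pwrite(fd, buf, len + sizeof END_MARKER, *end)
                != (ssize_t)(len + sizeof END_MARKER))
            return -1;
        *end += len;
        return 0;
    }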
The remaining work is fsync()ing changes to the files to
ensure they are safely written to disk. A message needs to be synced
when it is received, when you take responsibility for it. You then
need to sync each entry in the transaction journal, which is at least
once per message and at most once per recipient. Because there are
lots of messages in their individual files, these syncs are all over
the disk, which implies lots of costly seeking.
High-performance NNTP servers minimize seeks by having a few large
spool files to which they write messages sequentially. There are a
couple of reasons that this doesn't work for us. Firstly, NNTP servers
generally receive messages streamed from relatively few high-bandwidth
clients, whereas SMTP servers often have many low-bandwidth clients
which send few messages per connection. You want at least one spool
file per client so that one client can write to a file slowly without
holding up another. Secondly, you can't write the messages
contiguously if you are going to keep a message's transaction journal
next to it.
So why not move all the transaction journals into a global transaction
log?
When I first thought of this, I doubted that the benefits would
outweigh the complexity, but it turns out to have wonderful
consequences.
The nice thing about this kind of log-structured storage is that it's
very friendly to disks: the write position moves slowly so you
can get much higher performance than you do with random seeks.
The great difficulty is that you need to clean garbage out of the log so
that it doesn't grow without bound, and this process can easily
cannibalize all the performance you gained from using a log structure
in the first place. Hence my initial skepticism.
Another problem is that the information about a message can end up
scattered throughout the journal as records are appended when its
delivery is tried and retried. You can avoid this by re-writing its
remaining recipient list at the head of the journal whenever it is
retried. (This is a bit like Sendmail and Exim's envelope re-writing,
but less costly because of the log-structured write activity.) The
next time it is tried you only have to look at its most recent entry
in the journal.
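As a sketch of what such a self-contained record might look like, here
is a toy line-oriented journal format of my own devising: the latest
retry record for a message carries its whole remaining recipient list.

    #include <stdio.h>

    /* Append a retry record to the global journal; because it restates
     * all the remaining recipients, readers never need to look at any
     * earlier records for this message. */
    void journal_retry(FILE *log, const char *msgid,
                       char *const rcpts[], size_t n)
    {
        fprintf(log, "R %s %zu", msgid, n);
        for (size_t i = 0; i < n; i++)
            fprintf(log, " %s", rcpts[i]);
        fputc('\n', log);
        fflush(log);    /* a real MTA would fsync(fileno(log)) too */
    }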
One significant difference between an MTA transaction journal and a
database or filesystem is that the MTA's data has a natural maximum
lifetime, and this can actually be quite short. A nice consequence of
the re-writing I just described is that the maximum age of data in the
journal is the same as your maximum retry interval, which is probably
only a few hours - compared to days for the lifetime of the message
data, or years for blocks in a filesystem or database.
So you are effectively driving your journal garbage collector from
your MTA's queue runners. But wait! Turn it around: you can drive the
queue runners from the journal. It has exactly the data your queue
runners need, already sorted into a fair retry order. What's more,
your queue runners no longer need to scan the directories in the spool
and open each file in turn to check whether it needs to be retried.
They also never have to go searching in the global journal for a
message's recipient list. In fact, if you keep enough envelope
information in the journal to do all the routeing for each message, you
only need to re-open a spool file just before it is transmitted - and
not at all if the destination host is still unavailable. Wonderful!
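A queue runner driven this way might look something like the following
sketch, which reads the toy journal format from above in log order;
attempt_delivery() is a hypothetical hook that routes the recipients
from the record and opens the spool file only if the destination is
actually reachable.

    #include <stdio.h>

    /* Hypothetical delivery hook: routes using the envelope data in
     * the record, opening the spool file only just before transmission. */
    static void attempt_delivery(const char *msgid, const char *record)
    {
        printf("would deliver %s\n", msgid);
        (void)record;
    }

    /* Drive retries straight from the journal: records arrive in a
     * fair retry order, so no directory scanning is needed. */
    void run_queue(FILE *log)
    {
        char line[8192];
        while (fgets(line, sizeof line, log)) {
            if (line[0] != 'R')
                continue;    /* skip completion and other record types */
            char msgid[256];
            if (sscanf(line, "R %255s", msgid) != 1)
                continue;
            attempt_delivery(msgid, line);
        }
    }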
This division of the spool into a collection of message data files and
a global log-structured queue is also very amenable to tuned storage
setups. You could reserve a mirrored pair of disks for the queue, so
that these disks almost never have to seek. The message data files can
live on separate disks that have more random access patterns, but
which have a lower load because most work goes on in the queue. The
principal workload is only one fsync() per message file in
the lifetime of a message. The queue gets all the fsync()s to
record deliveries, so a load of less critical writes (such as
re-writing envelopes) get fsync()ed for free.
In this setup the queue is somewhat like an NNTP server's overview
database, and it has the same problems. All your eggs are in one
basket, so it had better be very reliable. When INN introduced
advanced storage layouts that depended on the correctness of the
overview database, any problems with the database were catastrophic -
and they were sadly common. That's not to say it can't be done:
Demon's NewsBorg managed it. And in fact, an MTA's log-structured
queue database which only needs to be accessed in log-structured order
is much simpler than a random-access overview database, so it should
also be more reliable.
I must say, I'm pretty pleased with this :-) Although I came at it
from the direction of performance hackery, it turns out to be a
unifying and simplifying idea of great elegance. And there are still
at least a couple more ways of making it faster...