Yesterday I wrote about the format of message data in spool files, but
that's only part of what is stored in the spool - and from the
performance point of view it isn't even a very important part, even
though that's what I was writing about. Far more important for performance
is the high-level logistics of spool file management: how they are
created, written, and deleted.
As well as the message data, the spool will contain for each
message all the information needed to route it to its destination(s).
At a minimum this is the SMTP envelope, i.e. the recipient addresses,
the return path (in case the message bounces) and the DSN options.
However sometimes the postmaster wants to make routeing decisions
using more details of the message's origin, such as the client and
server IP addresses, information from the DNS, security details from
TLS and SASL, the time it arrived, and results from anti-spam and
anti-virus scanners. Also, for my cunning wire format scheme you need
to preserve the MIME parse tree and dot-stuffing points.
Almost all of this information never changes after the message has
been received. What does change is the list of recipients that have
successfully received the message, or failed, or which need to be
retried.
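To make the split concrete, here is a rough sketch in C of the
per-message information. The structure and field names are my own
invention for illustration, not taken from any particular MTA.

    #include <stddef.h>
    #include <time.h>

    /* Illustrative only: the envelope proper is written once and never
     * changes; only the per-recipient status changes during delivery. */
    struct recipient {
        char *address;
        enum { RCPT_PENDING, RCPT_DELIVERED, RCPT_FAILED, RCPT_RETRY } status;
    };

    struct envelope {
        /* static after the message has been received */
        char             *return_path;     /* where bounces go */
        char             *client_ip, *server_ip;
        char             *tls_info, *sasl_info;
        time_t            arrival_time;
        char             *scanner_results; /* anti-spam / anti-virus */
        /* the only part that changes as deliveries are performed */
        struct recipient *rcpts;
        size_t            nrcpts;
    };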
Sendmail and Exim have quite similar schemes for their spool files.
See the
Sendmail operations guide page 100 and the Exim
specification page 410. Both store a message in two principal
files, one containing the message body (i.e. the message data
excluding the header) and one containing the envelope and message
header. (Both can make routeing decisions based on arbitrary header
fields.) There are also some short-lived auxiliary files: a
transaction journal which records changes to the recipient list during
a delivery, and a temporary file which is used when writing a new
envelope file which holds the merged data from the old envelope file
and the transaction journal. (Exim also has per-message log files,
which IME are not much use, and can be turned off.)
The split is very slightly different in qmail, though it still has
essentially two files per message: its data file stores the whole
message data, and its envelope file stores a very bare envelope. It
doesn't have a complicated envelope file because qmail is limited to
routeing messages based on just their email addresses.
All of these MTAs are quite profligate in their use of files, and
therefore the amount of filesystem metadata manipulation they cause.
File creation requires modification of the parent directory's data and
inode, and the allocation of a new inode and at least one block for
the file's data. This implies that the disks are going to have to do
quite a lot of seeking to write the message data to stable
(crash-resistant) storage. As a result these MTAs don't get the best
possible performance out of their disk systems.
Postfix is better because it only uses one file per message. However
it has multiple queue directories, and moves messages between them at
different stages in their life cycle. This is wasteful: for example the
Postfix queue documentation says that "whenever the queue manager
is restarted [...] the queue manager moves all the active queue
messages back into the incoming queue, and then uses its normal
incoming queue scan to refill the active queue. The process of moving
all the messages back and forth [...] is expensive. At all costs,
avoid frequent restarts of the queue manager."
My simple spool scheme is as follows:
- You use one file per message, which is divided into three sections:
first the message data (in the same format as it came in on the wire),
then the static envelope (information from the client connection,
security negotiation, parser, scanner, etc), and finally the recipient
addresses and transaction journal.
- You create the spool file when you start receiving the message data,
because that is the first time that it becomes absolutely necessary.
It is never moved after it is created.
- Some of the envelope information depends on the message data, so it is
simpler to write all of the envelope after the data. The envelope is
much smaller than the message data, so it's OK to keep it in
memory.
- The recipient list comes last, because that is the part of the
envelope that changes as deliveries are performed. The transaction
journal is not necessarily just a list of addresses that we no longer
need to deal with: in some
situations you want to add recipients to the message after
receiving it.
- The file is written almost entirely once, in append-only order:
the data and static envelope information never change, and the
transaction journal is append-only. However you need to be able to
find the start of the envelope quickly when re-opening an old message,
so its offset in the file is stored before the message data, and this
offset information must be written after the data has been received.
In fact the presence of the envelope offset can serve as a way to
distinguish between messages that have been correctly received and
those that were only partially written (e.g. because of a crash). When
the client finishes sending you the message data, you append the
envelope, then update the envelope offset to mark the message as
being OK immediately before sending the OK response to the client
(see the sketch after this list).
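Here is a minimal sketch of that commit protocol in C, assuming a
hypothetical fixed-size header at the start of every spool file;
commit_message() and the header layout are illustrative, not any real
MTA's code.

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical header written at offset 0 of each spool file;
     * envelope_offset stays zero until the message is complete. */
    struct spool_header {
        uint64_t envelope_offset;
    };

    /* Called when the client finishes sending the message data:
     * append the envelope after the data, then publish its offset
     * to mark the message as correctly received. */
    int commit_message(int fd, const void *env, size_t envlen)
    {
        off_t off = lseek(fd, 0, SEEK_END);    /* envelope starts here */
        if (off < 0)
            return -1;
        if (write(fd, env, envlen) != (ssize_t)envlen)
            return -1;
        if (fsync(fd) < 0)                     /* envelope on stable storage */
            return -1;
        uint64_t env_off = (uint64_t)off;
        if (pwrite(fd, &env_off, sizeof env_off,
                   offsetof(struct spool_header, envelope_offset))
                != (ssize_t)sizeof env_off)
            return -1;
        return fsync(fd);                      /* now safe to send "250 OK" */
    }

Note that the first fsync() orders the envelope ahead of the offset:
without it, a crash could leave a nonzero offset pointing at garbage.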
Let's see if we can reduce the load on the disks even more.
One obvious cost is the creation and deletion of spool files. You
could keep a pool of unused files, and re-open one of them instead of
creating a new file. When you have finished with a file, you can
truncate it and return it to the unused pool. A small disadvantage of
this is that it removes the connection between a message's spool ID
and its file on disk.
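A minimal sketch of such a pool, assuming a single process and keeping
the descriptors of unused files open so that reuse avoids the create
altogether; spool_get() and spool_put() are hypothetical names.

    #include <fcntl.h>
    #include <unistd.h>

    #define POOL_MAX 64
    static int pool[POOL_MAX];    /* descriptors of empty spool files */
    static int npool;

    /* Get a spool file: reuse a pooled one if possible, else create. */
    int spool_get(const char *path)
    {
        if (npool > 0)
            return pool[--npool];
        return open(path, O_RDWR | O_CREAT, 0600);
    }

    /* Finished with a message: empty the file and keep it for reuse. */
    void spool_put(int fd)
    {
        if (npool < POOL_MAX && ftruncate(fd, 0) == 0)
            pool[npool++] = fd;
        else
            close(fd);
    }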
You are still truncating and extending the files frequently, which is
a lot of work for the filesystem's block allocator. You could skip the
truncation step before returning the file to the unused pool. This
means you have to change your spool file format slightly to add a
marker so that you can tell where the file's current contents end and
the junk from previous messages starts. You have to overwrite the
marker and add a new one whenever writing to the file, and you can no
longer use O_APPEND mode. You also have to mark the file as
unused, by zapping the envelope offset, before returning it to the
pool.
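As a sketch of how such a marker might work: a magic terminator
follows the live contents, and each append overwrites the old
terminator and lays down a new one with pwrite(). A real format would
probably use record lengths or checksums rather than a bare magic
number, which junk from a previous message could collide with.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    static const uint32_t END_MARKER = 0xDEADBEEF;  /* arbitrary magic */

    /* 'end' is the offset of the current marker, tracked in memory;
     * everything beyond the marker is junk from previous messages. */
    int spool_append(int fd, off_t *end, const void *rec, size_t len)
    {
        char buf[4096];
        if (len + sizeof END_MARKER > sizeof buf)
            return -1;
        memcpy(buf, rec, len);
        memcpy(buf + len, &END_MARKER, sizeof END_MARKER);
        /* overwrite the old marker; the new one follows the record */
        if (pwrite(fd, buf, len + sizeof END_MARKER, *end)
                != (ssize_t)(len + sizeof END_MARKER))
            return -1;
        *end += len;
        return 0;
    }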
The remaining work is fsync()ing changes to the files to
ensure they are safely written to disk. A message needs to be synced
when it is received, when you take responsibility for it. You then
need to sync each entry in the transaction journal, which is at least
once per message and at most once per recipient. Because there are
lots of messages in their individual files, these syncs are all over
the disk, which implies lots of costly seeking.
High-performance NNTP servers minimize seeks by having a few large
spool files to which they write messages sequentially. There are a
couple of reasons that this doesn't work for us. Firstly, NNTP servers
generally receive messages streamed from relatively few high-bandwidth
clients, whereas SMTP servers often have many low-bandwidth clients
which send few messages per connection. You want at least one spool
file per client so that one client can write to a file slowly without
holding up another. Secondly, you can't write the messages
contiguously if you are going to keep a message's transaction journal
next to it.
So why not move all the transaction journals into a global transaction
log?
When I first thought of this, I doubted that the benefits would
outweigh the complexity, but it turns out to have wonderful
consequences.
The nice thing about this kind of log-structured storage is that it's
very friendly to disks: the write position moves slowly so you
can get much higher performance than you do with random seeks.
The great difficulty is that you need to clean garbage out of the log so
that it doesn't grow without bound, and this process can easily
cannibalize all the performance you gained from using a log structure
in the first place. Hence my initial skepticism.
Another problem is that the information about a message can end up
scattered throughout the journal as records are appended when its
delivery is tried and retried. You can avoid this by re-writing its
remaining recipient list at the head of the journal whenever it is
retried. (This is a bit like Sendmail and Exim's envelope re-writing,
but less costly because of the log-structured write activity.) The
next time it is tried you only have to look at its most recent entry
in the journal.
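As a sketch of what such a self-contained record might look like, here
is a toy line-oriented journal format of my own devising: the latest
retry record for a message carries its whole remaining recipient list.

    #include <stdio.h>

    /* Append a retry record to the global journal; because it restates
     * all the remaining recipients, readers never need to look at any
     * earlier records for this message. */
    void journal_retry(FILE *log, const char *msgid,
                       char *const rcpts[], size_t n)
    {
        fprintf(log, "R %s %zu", msgid, n);
        for (size_t i = 0; i < n; i++)
            fprintf(log, " %s", rcpts[i]);
        fputc('\n', log);
        fflush(log);    /* a real MTA would fsync(fileno(log)) too */
    }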
One significant difference between an MTA transaction journal and a
database or filesystem is that the MTA's data has a natural maximum
lifetime, and this can actually be quite short. A nice consequence of
the re-writing I just described is that the maximum age of data in the
journal is the same as your maximum retry interval, which is probably
only a few hours - compared to days for the lifetime of the message
data, or years for blocks in a filesystem or database.
So you are effectively driving your journal garbage collector from
your MTA's queue runners. But wait! Turn it around: you can drive the
queue runners from the journal. It has exactly the data your queue
runners need, already sorted into a fair retry order. What's more,
your queue runners no longer need to scan the directories in the spool
and open each file in turn to check whether it needs to be retried.
They also never have to go searching in the global journal for a
message's recipient list. In fact, if you keep enough envelope
information in the journal to do all the routeing for each message, you
only need to re-open a spool file just before it is transmitted - and
not at all if the destination host is still unavailable. Wonderful!
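A queue runner driven this way might look something like the following
sketch, which reads the toy journal format from above in log order;
attempt_delivery() is a hypothetical hook that routes the recipients
from the record and opens the spool file only if the destination is
actually reachable.

    #include <stdio.h>

    /* Hypothetical delivery hook: routes using the envelope data in
     * the record, opening the spool file only just before transmission. */
    static void attempt_delivery(const char *msgid, const char *record)
    {
        printf("would deliver %s\n", msgid);
        (void)record;
    }

    /* Drive retries straight from the journal: records arrive in a
     * fair retry order, so no directory scanning is needed. */
    void run_queue(FILE *log)
    {
        char line[8192];
        while (fgets(line, sizeof line, log)) {
            if (line[0] != 'R')
                continue;    /* skip completion and other record types */
            char msgid[256];
            if (sscanf(line, "R %255s", msgid) != 1)
                continue;
            attempt_delivery(msgid, line);
        }
    }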
This division of the spool into a collection of message data files and
a global log-structured queue is also very amenable to tuned storage
setups. You could reserve a mirrored pair of disks for the queue, so
that these disks almost never have to seek. The message data files can
live on separate disks that have more random access patterns, but
which have a lower load because most work goes on in the queue. The
principal workload is only one fsync() per message file in
the lifetime of a message. The queue gets all the fsync()s to
record deliveries, so a load of less critical writes (such as
re-writing envelopes) get fsync()ed for free.
In this setup the queue is somewhat like an NNTP server's overview
database, and it has the same problems. All your eggs are in one
basket, so it had better be very reliable. When INN introduced
advanced storage layouts that depended on the correctness of the
overview database, any problems with the database were catastrophic -
and they were sadly common. That's not to say it can't be done:
Demon's NewsBorg managed it. And in fact, an MTA's log-structured
queue database which only needs to be accessed in log-structured order
is much simpler than a random-access overview database, so it should
also be more reliable.
I must say, I'm pretty pleased with this :-) Although I came at it
from the direction of performance hackery, it turns out to be a
unifying and simplifying idea of great elegance. And there are still
at least a couple more ways of making it faster...