fanf: (Default)

So last week I wrote about some simple "desugaring" transformations that turn a fairly conventional block-structured programming language into the call-by-value lambda calculus with some small extensions. The main problem is that I relied on first-class continuations. This has at least two undesirable consequences:

  • First-class continuations are the goto of functional programming. Actually, that understates the amount of unconstrained tangle that they can unleash - they are far more powerful and dangerous than goto. So one can argue that programmers are better off avoiding first-class continuations and using more constrained control flow, much like we are advised to shun goto in favour of break, return, and exceptions.
  • Continuations can't easily be implemented with a conventional linear stack. When you return a normal value up the stack, the top stack frame(s) can be freed, but if you return a continuation up the stack, they cannot: calling the continuation re-activates these frames. The stack becomes a cactus. A common solution to this problem is to allocate activation frames on the heap, but this seriously stresses the garbage collector and harms locality: whereas a stack works up and down over the same small area of memory, the heap blasts forever upwards.

Fortunately there's a nice solution: multi-return function calls. The idea is to separate continuation arguments from normal arguments, and restrict the continuations so that activation frames are still used in a normal stack-like LIFO manner. What's really neat is that it preserves a lot of the power of first-class continuations.

However, the desugaring transforms become more complicated. The scope of the multi-return continuations is restricted so that you cannot (for example) return from a function by invoking its continuation from an inner block, because the continuation is not available there. Instead you have to add plumbing to pass the early-return continuation into the block along with the block's normal sequential continuation. This plumbing ends up being rather like the standard continuation-passing-style transformation described in the lambda-the-ultimate papers linked to above, but with an extra continuation for each early exit.
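
To make the plumbing concrete, here's a minimal Python sketch (the example and names are invented): the desugared loop body is passed its normal sequential continuation k plus an extra continuation ret for the early exit from the enclosing function.

```python
# Sketch (invented example): a desugared loop in continuation-passing
# style. k is the loop's normal sequential continuation; ret is the
# enclosing function's early-return continuation, threaded in as the
# extra plumbing described above.

def find_first_even(xs):
    def step(i, k, ret):
        if i >= len(xs):
            return k(None)            # loop finished: sequential exit
        if xs[i] % 2 == 0:
            return ret(xs[i])         # early return from the function
        return step(i + 1, k, ret)    # tail call: next iteration

    # The sequential continuation runs the code after the loop;
    # the early-return continuation just delivers the value.
    return step(0, lambda _: "none found", lambda v: v)
```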

This extra-continuation plumbing is also similar to a standard implementation technique for exceptions (which I called "non-trivial" last week, but isn't actually hard to understand). As well as the sequential continuation that is passed around in CPS, you pass around an exception handler continuation which is invoked in order to throw an exception. This is very similar to implementing exceptions in C with setjmp and longjmp, except in C you can keep the current exception handler in a global variable instead of passing it around.
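
A Python sketch of that exception plumbing (names and example invented): alongside the normal continuation, every call threads through a handler continuation, and throwing is just calling it.

```python
# Sketch (invented example): exception plumbing as an extra handler
# continuation. Throwing an exception is just calling the handler
# instead of the normal continuation k.

def safe_div(a, b, k, handler):
    if b == 0:
        return handler("division by zero")   # throw
    return k(a / b)                          # normal sequential return

def demo(a, b):
    # The caller supplies both continuations: one for success, and
    # one acting as the currently installed exception handler.
    return safe_div(a, b, lambda v: ("ok", v), lambda e: ("error", e))
```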

The downside of these techniques is that the plumbing is an overhead. However, it is "pay-as-you-go", in that you only need the plumbing if you use early exits from blocks. (It's less easy to avoid the exception plumbing since exceptions can usually be thrown anywhere.) By contrast, the more complicated infrastructure needed for first-class continuations affects the performance of the whole run-time system.

(The mathematics in the typing of these constructions relates directly to their balance of practicality and power. Constrained continuations like multi-return functions and exceptions are usually used linearly and have an intuitionistic (constructive) typing. Call-with-current-continuation, however, has a classical typing which is not expressible in most practical type systems. This kind of direct relationship between theory and practice is one of the beauties of computer science.)

My employers are currently looking for someone to take over as head of one of the Computing Service's divisions: http://www.cam.ac.uk/cs/jobs/

The University has its own ducts and fibre connecting almost all University sites around the city with gigabit or 10gig ethernet. We host the EastNet POP for UKERNA, which provides connectivity for 8 universities and many 6th form colleges. We're currently replacing the old BT switch with VOIP, which will eventually be handling about 5-10,000 phones. The division also runs the CS's managed workstation service, which includes a large central filestore and public access computer rooms all over the place.

Recently I've read about Lua. It's a nice simple language, with a syntax that's quite close to my ideal for a dynamically-typed imperative language. One of the things it omits for simplicity's sake is first-class continuations. I've been wondering how one might specify a similar syntax if you have a Schemeish infrastructure, so that the usual control structures can be desugared into functions, conditionals, and tail-calls. I also have in mind Modula-3's definition of some control structures in terms of exceptions. In the following I'm not going to talk about the whole syntax; I'll omit things which don't involve blocks.

The basic form of a block expression is:

	name(arguments): do
		statements
	end

The value of this expression is a function which takes some arguments and which when called executes the statements. The scope of its name is just within the block itself; it is used in inner scopes to refer to shadowed variables in outer scopes.

To execute the block immediately, the arguments can be omitted:

	name: do statements; end

is equivalent to

	name(): do statements; end()

that is, it specifies a function that has no arguments and which is called immediately.
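
In Python terms (an analogy, not the proposed syntax), a block is a function that is defined and immediately called, which gives its local variables their own scope:

```python
# Python analogy (not the proposed syntax): a block is a function
# defined and immediately called, so block-local variables get their
# own scope.

def demo():
    x = 1
    def block():       # name(): do ... end
        y = x + 1
        return y
    return block()     # ... end() - called immediately
```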

The name can also be omitted. Doing so just means there's no way to refer to this block's variables if they are shadowed in an inner scope. For example, an anonymous function (lambda expression) is typically

	(args): do statements; end

You might also want to allow a shorter version of

	(args): do return(values); end

for use in expressions, perhaps

	(args): return (values)

A minimal block like the following is just like a Lua block:

	do statements; end

Just before the end of each block there's an implicit return(nil). You can return from a block early with whatever value or list of values you want.

Instead of do...end you can write a conditional:

	if expression then
		true statements
	else
		false statements
	end

If the else part is omitted then the false statements are implicitly return(nil). You can also add elsif expression then alternate statements parts in the usual way.

The basic form of a loop is

	name: do
		statements
	loop

which is equivalent to

	name: do
		statements
		return(name())
	end

that is, loops are implemented using tail recursion. (Loops do not have to be named, but I've written the desugaring in terms of a named block to avoid having to talk about hypothetical unique names.)
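
Here is the tail-recursive desugaring as a Python sketch. Python lacks tail-call optimisation, so a small trampoline stands in for it; all names are invented.

```python
# Python lacks tail-call optimisation, so this sketch adds a small
# trampoline to stand in for it; the loop itself is the same
# "call yourself in tail position" desugaring.

def trampoline(f, *args):
    # Keep calling thunks until a non-callable value appears, turning
    # tail recursion into iteration with constant stack depth.
    r = f(*args)
    while callable(r):
        r = r()
    return r

def count_to(n):
    def block(i):                    # name: do ... loop
        if i >= n:
            return i
        return lambda: block(i + 1)  # return(name()) as a tail call
    return trampoline(block, 0)
```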

	name: while expression do
		statements
	loop

is equivalent to

	name: if expression then
		statements
		return(name())
	end

and

	name: do
		statements
	while expression loop

is equivalent to

	name: do
		statements
		if expression then
			return(name())
		end
	end

You can use until expression instead of while not(expression).

Given the above, you can see that return inside a loop is similar to break in C. It's often nice to be able to break out of more than one nested loop at a time, or return from a function inside a loop. This is where a block's name comes into play. To return from a named outer block, call name.return. For example:

	def bsearch(arr, val): do
		def lo, hi := 1, #arr
		while lo <= hi do
			def mid := math.floor((lo + hi) / 2);
			if arr[mid] == val then
				bsearch.return(mid)
			elsif arr[mid] < val then
				lo := mid + 1
			else
				hi := mid - 1
			end
		loop
	end

A bare return in the above would specify the value of the if block, not the function.
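
One way to mimic name.return in an existing language is to unwind to a per-block exception. This is a Python sketch of the bsearch example, using 0-based indexing rather than the 1-based indexing above:

```python
# Sketch: modelling "bsearch.return" with a per-block exception that
# unwinds to the named block. Uses 0-based indexing, unlike the
# 1-based original above.

class BlockReturn(Exception):
    def __init__(self, value):
        self.value = value

def bsearch(arr, val):
    try:
        lo, hi = 0, len(arr) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if arr[mid] == val:
                raise BlockReturn(mid)   # bsearch.return(mid)
            elif arr[mid] < val:
                lo = mid + 1
            else:
                hi = mid - 1
        return None                      # implicit return(nil)
    except BlockReturn as r:
        return r.value
```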

This kind of return makes it easy to write call-with-current-continuation:

	def callcc(f): do
		def cont(...): do
			callcc.return(...)
		end
		return(f(cont))
	end

However this is unnecessary because return is already an expression whose value is the current continuation. It's just like other function arguments, except that it's implicit (a bit like the self argument in Lua's methods) and can't be assigned to (so that the compiler can do tail-call optimisation). (Actually, what if you *could* assign to return? Eeeeevil!)

There are three forms of statement within a block: definitions, assignments, and function calls. Since a block without arguments is equivalent to a function expression that is immediately called, non-function blocks can be statements. A function definition

	def name(arguments): do ...

is equivalent to

	def name := name(arguments): do ...

For bigger code, it's useful to allow a block's name to be repeated after its end or loop, for example:

	def long_function(arguments): do
		code
		code
		code
	end long_function

You can further desugar definitions, since

	old_name: do
		before statements
		def variables := values
		after statements
	end

is equivalent to

	outer_name: do
		before statements
		inner_name(variables): do
			after statements
		end(values)
	end

except that you need to fix up the after statements so that unqualified returns are qualified with the outer_name, and variables qualified with the old_name are qualified with the outer_name or inner_name as necessary.
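
The same desugaring in Python terms (toy example): a definition followed by more statements becomes an inner function taking the defined variable, applied immediately to its value.

```python
# Sketch: "def b := value; after statements" desugars to an inner
# function taking b, applied immediately to the value.

def demo():
    a = 1                            # before statements
    # def b := a + 1; return(a + b)  desugars to:
    return (lambda b: a + b)(a + 1)
```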

Some things are still missing from this description. A for loop is necessary, probably based on some kind of iterator concept, but that's too far away from the topic of this post to be worth pursuing in detail now. Also, we would probably like a standard exception mechanism; this is not trivial to implement in terms of continuations because the code that handles an exception is determined dynamically (like Perl local) whereas continuations are lexical (like Perl my). Finally, the combination of continuations and concurrency is utterly filthy: imagine what happens if one thread calls a continuation created by another...

I just received a message from David Woodhouse CC:ed from a linux-kernel mailing list thread about unifdef. David incorporated my version of the program into the Linux kernel headers_install process earlier this year. A nice surprise from today's message was the discovery that Mike Kinghan of Symbian Software has massively improved and expanded my code to make sunifdef.

Following on from http://fanf.livejournal.com/65203.html I feel the need to write down some more details of my log-structured queue idea.

My approach has a "crash-only software" architecture. According to the abstract of that paper: "Crash-only programs crash safely and recover quickly. There is only one way to stop such software - by crashing it - and only one way to bring it up - by initiating recovery." So, no shutdown code.

Essentials:

The queue consists of a sequence of records of various types.

The most important record type is a message envelope, which contains everything you need to route a message: at a minimum the recipient addresses, though practicality demands rather more than that. The envelope also contains a reference to the spool file which contains the message data etc.

When a message is delivered the MTA must immediately update its recipient list on stable storage. (It has to be immediate in order to avoid duplicate deliveries when the MTA restarts after being stopped at the wrong point.) We don't want to have to rewrite the whole envelope for every delivery: it is big and complicated, whereas we want delivery records to be small and simple since there will often be more than one of them for each message (especially for bigger envelopes!) and they will be most of the fsync()ed write traffic on the queue. So delivery records are just deltas relative to envelope records.

Once the current set of delivery attempts for a message has been completed, we write a replacement envelope for the message, or if there are no recipients left, we need to record that all work on the message is finished. At this point the old envelope and all its updates have become garbage.

Unless the MTA has only one message in the queue, the old envelope, the delivery records, and the replacement envelope will all be separated from each other in the queue by records related to other messages. Therefore we need some way of linking them together so that we can restart quickly, without scanning the whole queue. The point of this article is how to solve this problem.

Details:

The procedure for delivering a message is as follows. A queue runner scans the queue until it encounters an envelope which is not marked as garbage, is past its retry time, and is not already being worked on. It then kicks off delivery of this message. (There is a queue runner for newly received messages as well as older messages so all deliveries start this way.) First write a start record to the queue to record that this message is being worked on and to link back to its envelope record. These start records are used to recover state after a restart. Then the envelope is marked as garbage. This is done early so that these garbage marks are added sequentially - it's common to finish working on messages in a different order. Then deliver the message, recording the results for each set of recipients (successful or otherwise) as a delivery record in the queue, with an fsync() for each set. When there's nothing left to do for the message, either write an updated envelope record (if it is to be retried) or a finish record. Finally, mark the start record as garbage, which implicitly marks its associated delivery records as garbage.
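
A toy model of this record sequence, with an in-memory list standing in for the fsync()ed queue file; the record shapes are invented for illustration.

```python
# Toy model of the delivery record sequence. Records are tuples whose
# first element is the record kind; a real implementation would write
# and fsync() them to the queue file.

class ToyQueue:
    def __init__(self):
        self.recs = []

    def append(self, rec):
        # A real MTA would fsync() here for delivery records.
        self.recs.append(rec)
        return len(self.recs) - 1

    def mark_garbage(self, pos):
        self.recs[pos] = ("garbage",) + self.recs[pos][1:]

    def start_delivery(self, env_pos):
        sp = self.append(("start", env_pos))  # link back to the envelope
        self.mark_garbage(env_pos)            # mark the old envelope early
        return sp

    def finish(self, start_pos, remaining_rcpts):
        if remaining_rcpts:
            # retry later: write a replacement envelope
            self.append(("envelope", remaining_rcpts))
        else:
            self.append(("finish", start_pos))
        # garbage the start record, implicitly garbaging its deltas
        self.mark_garbage(start_pos)
```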

When we restart, the part of the queue that we can't avoid scanning is the "working suffix", which contains delivery records of messages which have not yet had their replacement envelopes written. The original envelopes of these messages are marked as garbage, but they are still needed until their replacement is written, so they can't yet be discarded. These are the messages that the MTA was working on when it stopped, so we need to recover that working state, or at least the parts that still matter. The start of the working suffix is the earliest start record which has not been marked as garbage.

One thing we can avoid scanning for at startup is the appropriate positions of the queue runners in the queue. Every so often the MTA records the set of queue runner positions for use at a restart. Doing this more frequently means less wasted work at startup, but also means more overhead in normal operation. Similarly, it also records the start of the working suffix. The procedure is to find out the current restart positions, fsync() all the queue files, then write these restart positions to disk. There are multiple queue files because - like log files - the MTA switches to a new file every so often so that when old files become entirely garbage they can be deleted. The queue runners will be marking envelopes as garbage as they go, and since they will not all be scanning the current queue file, they can't benefit from that file's frequent fsync()s, hence the need for the extra fsync() to ensure that the restart positions correctly reflect the on-disk state in case of a crash. This procedure lags slightly behind the activity of the queue runners, which is OK because the restart positions don't need to be accurate: if they are out of date the only harm will be a little more garbage to scan after startup. The important feature is that the older portions of the queue before the restart pointers are consistent.

When the MTA is restarted it can immediately start receiving messages, since this just involves appending new envelope records to the queue. However deliveries have to wait. Having read the queue positions, the MTA starts scanning the working suffix. For each start record that isn't marked as garbage, ensure that the envelope it refers to is marked as garbage (it might not be if there was a crash) then add it to the list of working messages. Remember each delivery record that is associated with a working message. If we encounter a replacement envelope or finish record for a working message (again because of a crash), mark its start record as garbage and remove it from the working list. When we run out of queue, write replacement envelopes for all the working messages, and mark their start records as garbage. The on-disk data has now been made consistent, so we can start the queue runners at their recorded positions.
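
A sketch of that restart scan, with invented record shapes: records are tuples, replacement envelope and finish records carry the position of their start record in their last field, and whatever is left in the working set at the end of the scan needs a replacement envelope written.

```python
# Sketch of the restart scan over the working suffix. Record shapes
# are invented: ("start", env_pos), ("delivered", start_pos, rcpts),
# and ("envelope"/"finish", ..., start_pos_or_None).

def recover(recs, suffix_start):
    working = {}  # start-record position -> recovered delivery state
    for pos in range(suffix_start, len(recs)):
        rec = recs[pos]
        if rec[0] == "start":
            working[pos] = {"envelope": rec[1], "delivered": []}
        elif rec[0] == "delivered" and rec[1] in working:
            working[rec[1]]["delivered"].append(rec[2])
        elif rec[0] in ("envelope", "finish") and rec[-1] in working:
            # its replacement envelope or finish record reached the
            # disk before the crash, so this message is already done
            del working[rec[-1]]
    # whatever is left needs a replacement envelope written now
    return working
```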

There's one remaining mechanism. The earliest queue runner (dealing with the messages with the longest retry interval) leaves behind it a queue that is entirely marked as garbage. However envelopes just behind the queue runner are not true garbage until their replacement envelope has been written. Thus there must be a process which tracks the point where true garbage starts, so that when it rolls past a file boundary we know the older file can safely be removed. This can be done entirely in memory by keeping note of when deliveries triggered by the earliest queue runner complete. No persistent state is needed because after the working suffix has been scanned the garbage markers are all correct.

Optimizations:

No fsync()s are needed for writing garbage flags, start or finish records, or replacement envelopes, because if they are lost in a crash they can be recovered, or other fsync() activity prevents such a loss. These writes are book-keeping - they aren't necessary in a conventional one-file-per-message spool - so it's good that they can be coalesced with other more unavoidable writes.

The envelope of a new message does need to be fsync()ed before we return a 250 to the client that sent it to us, but if we delay a little we can piggy-back it on a delivery fsync().

Delivery records contain sets of recipients when a single delivery has multiple recipients. One delivery record is written when we get the destination server's final response to the message data. Thus we can get less than one fsync() per recipient per message.

I haven't said where to record the restart positions. This could be in the queue, which implies a bit of furtling around to find them, or in a fixed location, which implies extra seeks (similar to mtime seeks).

It might be more efficient to have multiple queues, the key difference being that replacement envelopes are written to a queue with queue runners that have an appropriate retry interval. With the single queue setup the long-interval queue runners are mostly skipping garbage, which wastes IO bandwidth and pollutes the page cache. However multiple queues may increase seeking too much. As chrislightfoot points out, the conventional wisdom isn't a reliable guide in this area. More benchmarks needed.

Edit:

Actually, the garbage bits on the start records are unnecessary: they are only for use by the restart process, but it can (and does!) infer their correct value. When we finish working on a message, instead of frobbing this bit, we just update our current in-memory note of the start-of-working-suffix and end-of-garbage-prefix positions, as appropriate.

A correspondence with Postfix terminology: The working suffix is like Postfix's active queue. The new envelopes being appended to the queue are the incoming queue. The other queue runners are scanning the deferred queue. Postfix has a fairness strategy of adding messages to the active queue alternately from the incoming and deferred queues, and we can apply something similar here by controlling when we ask each queue runner to scan for its next envelope.


Here's an idea I came up with in 1999, when I was working at Demon. It's a design for a scalable log processing infrastructure, which allows you to plug in simple processing scripts which have very easy requirements, but still attain good scalability and reliability.

It's all based on dividing logs into relatively small batches: too large and the log processing latency gets too big; too small and the batch handling overhead gets too big. I decided that one-minute batches were a happy medium. Batches are always aligned to minute boundaries NN:NN:00, so that (assuming your computers' clocks are synchronized) batches generated on different computers start and end at the same time.
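
Since batches are aligned to minute boundaries, a batch is identified by its timestamp truncated to the minute; a trivial sketch:

```python
# A batch is identified by its minute boundary: the batch for a
# timestamp is the timestamp with the seconds truncated, so events in
# the same wall-clock minute on any machine share a batch.

def batch_id(epoch_seconds):
    return epoch_seconds - (epoch_seconds % 60)
```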

Log processing scripts can be simple sequential code. The requirements are that they are idempotent, so that they can be re-run if anything goes wrong; they must depend only on the time stamps in the logs, not the wall clock time; and they must be able to process batches independently.

The log transport protocol must also satisfy these requirements. I decided to use HTTP, because it's already ubiquitous. The PUT method is the one we want, because it's guaranteed to be idempotent (unlike POST), and it pushes data which is more efficient for the log server than pulling (which usually implies polling). HTTP already has lots of useful features (such as security) that we don't have to re-invent. If you use the "chunked" Transfer-Encoding, then you can start sending a batch as it starts which minimizes latency.
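
Here's a sketch of the client side using Python's standard library (the /logs/&lt;minute&gt; path layout is invented): PUT each batch to a URL named after its minute boundary, so retransmitting after a failure overwrites the same resource, and stream it with the chunked transfer encoding.

```python
# Sketch of an idempotent batch upload (invented URL layout): PUT to a
# per-minute resource, streamed with the chunked transfer encoding.

import http.client

def send_batch(host, minute, lines):
    conn = http.client.HTTPConnection(host)
    # An iterable body with encode_chunked=True is sent using the
    # chunked transfer encoding, so sending can begin before the
    # batch is complete.
    conn.request("PUT", "/logs/%d" % minute,
                 body=(line.encode() for line in lines),
                 encode_chunked=True)
    status = conn.getresponse().status
    conn.close()
    return status
```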

This scheme gets its scalability by exploiting parallelism.

A batch of logs may take longer to process than it does to generate if you have lots of log generating machines, or if a naive log processing script is slow - the real-world example from Demon was doing reverse DNS lookups on web logs. In this situation you can start processing the next batch as soon as it becomes available, even if the previous batch is still being processed. This means you can make use of idle CPU (waiting for DNS lookups) or multiple CPUs.

If one machine isn't enough to handle your log load, you can send alternate batches to different machines. For example one can process all the even minutes and one all the odd minutes. You can also have a hierarchy of log processors: perhaps a log collector for each cluster which in turn passes its collected logs to the central processors.
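
The alternate-batch assignment is just arithmetic on the minute number (a sketch; the host numbering is invented):

```python
# Sketch: assign alternate batches to different processors. With two
# hosts, one processes the even minutes and one the odd minutes.

def shard(epoch_seconds, n_hosts):
    return (epoch_seconds // 60) % n_hosts
```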

The reliability comes from a couple of simple things.

Logs are transmitted using TCP, which isn't lossy like traditional UDP-based syslog. Logs are written to disk before being sent to the server, so that if there is a network problem they can be transmitted later. Similarly, the log server can delay processing a batch until all the clients have sent their chunks. The log processor can catch up after an outage by processing the backlog of batches in parallel.

This implies that the log server has to have a list of clients so that it knows how many batches to expect each minute. Alternatively, if a machine comes back after an outage and sends a load of old batches to the log processor, idempotence means that it can re-process those old batches safely. I'm not really sure what is the best approach here - it requires some operational experience.

Yes I know this is a day late, but I have come up with a new (email-related) acronym: "Anti-virus and anti-spam tests" = "avast!".

Yesterday I wrote about the format of message data in spool files, but that's only part of what is stored in the spool - and from the performance point of view it isn't even a very important part, even though that is what I was writing about. Far more important for performance is the high-level logistics of spool file management: how they are created, written, and deleted.

As well as the message data, the spool will contain for each message all the information needed to route it to its destination(s). At a minimum this is the SMTP envelope, i.e. the recipient addresses, the return path (in case the message bounces) and the DSN options. However sometimes the postmaster wants to make routeing decisions using more details of the message's origin, such as the client and server IP addresses, information from the DNS, security details from TLS and SASL, the time it arrived, and results from anti-spam and anti-virus scanners. Also, for my cunning wire format scheme you need to preserve the MIME parse tree and dot-stuffing points.

Almost all of this information never changes after the message has been received. What does change is the list of recipients that have successfully received the message, or failed, or which need to be retried.

Sendmail and Exim have quite similar schemes for their spool files. See the Sendmail operations guide page 100 and the Exim specification page 410. Both store a message in two principal files, one containing the message body (i.e. the message data excluding the header) and one containing the envelope and message header. (Both can make routeing decisions based on arbitrary header fields.) There are also some short-lived auxiliary files: a transaction journal which records changes to the recipient list during a delivery, and a temporary file which is used when writing a new envelope file which holds the merged data from the old envelope file and the transaction journal. (Exim also has per-message log files, which IME are not much use, and can be turned off.)

The split is very slightly different in qmail, though it still has essentially two files per message: its data file stores the whole message data, and its envelope file stores a very bare envelope. It doesn't have a complicated envelope file because qmail is limited to routeing messages based on just their email addresses.

All of these MTAs are quite profligate in their use of files, and therefore the amount of filesystem metadata manipulation they cause. File creation requires modification of the parent directory's data and inode, the allocation of a new inode and at least one block for the file's data. This implies that the disks are going to have to do quite a lot of seeking to write the message data to stable (crash-resistant) storage. As a result these MTAs don't get the best possible performance out of their disk systems.

Postfix is better because it only uses one file per message. However it has multiple queue directories, and moves messages between them at different stages in their life cycle. This is wasteful: for example the Postfix queue documentation says that "whenever the queue manager is restarted [...] the queue manager moves all the active queue messages back into the incoming queue, and then uses its normal incoming queue scan to refill the active queue. The process of moving all the messages back and forth [...] is expensive. At all costs, avoid frequent restarts of the queue manager."

My simple spool scheme is as follows:

  • You use one file per message, which is divided into three sections: first the message data (in the same format as it came in on the wire), then the static envelope (information from the client connection, security negotiation, parser, scanner, etc), and finally the recipient addresses and transaction journal.
  • You create the spool file when you start receiving the message data, because that is the first time that it becomes absolutely necessary. It is never moved after it is created.
  • Some of the envelope information depends on the message data, so it is simpler to write all of the envelope after the data. The envelope is also much smaller than the data, with a much tighter bound on its size, so it's OK to keep it in memory.
  • The recipient list comes last, because that is the part of the envelope that changes as deliveries are performed. The transaction journal is not necessarily just a list of addresses that we no longer need to deal with: in some situations you want to add recipients to the message after receiving it.
  • The file is almost entirely written only once in append-only order: the data and static envelope information never changes, and the transaction journal is append-only. However you need to be able to find the start of the envelope quickly when re-opening an old message, so its offset in the file is stored before the message data, and this offset information must be written after the data has been received.

In fact the presence of the envelope offset can serve as a way to distinguish between messages that have been correctly received and those that were only partially written (e.g. because of a crash). When the client finishes sending you the message data, you append the envelope, then update the envelope offset to mark the message as being OK immediately before sending the OK response to the client.
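
A Python sketch of this layout (field sizes invented): a fixed-width envelope-offset slot, then the message data, then the envelope, with the offset committed last so that a zero offset marks an incomplete message.

```python
# Sketch of the one-file-per-message layout: an 8-byte offset slot,
# then the data, then the envelope. The offset is written last, so a
# zero offset means the message was only partially written.

import os
import struct

def write_spool(path, data, envelope):
    with open(path, "wb") as f:
        f.write(struct.pack(">Q", 0))        # placeholder offset slot
        f.write(data)
        env_off = f.tell()
        f.write(envelope)
        f.flush()
        os.fsync(f.fileno())                 # data and envelope are safe...
        f.seek(0)
        f.write(struct.pack(">Q", env_off))  # ...now commit the offset
        f.flush()
        os.fsync(f.fileno())                 # then tell the client 250

def read_envelope(path):
    with open(path, "rb") as f:
        (env_off,) = struct.unpack(">Q", f.read(8))
        if env_off == 0:
            return None                      # incomplete: crashed mid-write
        f.seek(env_off)
        return f.read()
```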

Let's see if we can reduce the load on the disks even more.

One obvious cost is the creation and deletion of spool files. You could keep a pool of unused files, and re-open one of them instead of creating a new file. When you have finished with a file, you can truncate it and return it to the unused pool. A small disadvantage of this is it removes the connection between a message's spool ID and its file on disk.

You are still truncating and extending the files a lot, which is a lot of work for the filesystem's block allocator. You could skip the truncation step before returning the file to the unused pool. This means you have to change your spool file format slightly to add a marker so that you can tell where the file's current contents end and the junk from previous messages starts. You have to overwrite the marker and add a new one whenever writing to the file, and you can no longer use O_APPEND mode. You also have to mark the file as unused, by zapping the envelope offset, before returning it to the pool.

The remaining work is fsync()ing changes to the files to ensure they are safely written to disk. A message needs to be synced when it is received, when you take responsibility for it. You then need to sync each entry in the transaction journal, which is at least once per message and at most once per recipient. Because there are lots of messages in their individual files, these syncs are all over the disk, which implies lots of costly seeking.

High performance NNTP servers minimize seeks by having a few large spool files to which they write messages sequentially. There are a couple of reasons that this doesn't work for us. Firstly, NNTP servers generally receive messages streamed from relatively few high-bandwidth clients, whereas SMTP servers often have many low-bandwidth clients which send few messages per connection. You want at least one spool file per client so that one client can write to a file slowly without holding up another. Secondly, you can't write the messages contiguously if you are going to keep a message's transaction journal next to it.

So why not move all the transaction journals into a global transaction log?

When I first thought of this, I doubted that the benefits would outweigh the complexity, but it turns out to have wonderful consequences.

The nice thing about this kind of log-structured storage is that it's very friendly to disks: the write position moves slowly so you can get much higher performance than you do with random seeks. The great difficulty is that you need to clean garbage out of the log so that it doesn't grow without bound, and this process can easily cannibalize all the performance you gained from using a log structure in the first place. Hence my initial skepticism.

Another problem is that the information about a message can end up scattered throughout the journal as records are appended when its delivery is tried and retried. You can avoid this by re-writing its remaining recipient list at the head of the journal whenever it is retried. (This is a bit like Sendmail and Exim's envelope re-writing, but less costly because of the log-structured write activity.) The next time it is tried you only have to look at its most recent entry in the journal.

One significant difference between an MTA transaction journal and a database or filesystem is that the MTA's data has a natural maximum lifetime, and this can actually be quite short. A nice consequence of the re-writing I just described is that the maximum age of data in the journal is the same as your maximum retry interval, which is probably only a few hours - compared to days for the lifetime of the message data, or years for blocks in a filesystem or database.

So you are effectively driving your journal garbage collector from your MTA's queue runners. But wait! Turn it around: you can drive the queue runners from the journal. It has exactly the data your queue runners need, already sorted into a fair retry order. What's more, your queue runners no longer need to scan the directories in the spool and open each file in turn to check whether it needs to be retried. They also never have to go searching in the global journal for a message's recipient list. In fact, if you keep enough envelope information in the journal to do all the routing for each message, you only need to re-open a spool file just before it is transmitted - and not at all if the destination host is still unavailable. Wonderful!

This division of the spool into a collection of message data files and a global log-structured queue is also very amenable to tuned storage setups. You could reserve a mirrored pair of disks for the queue, so that these disks almost never have to seek. The message data files can live on separate disks that have more random access patterns, but which have a lower load because most work goes on in the queue. The principal workload is only one fsync() per message file in the lifetime of a message. The queue gets all the fsync()s to record deliveries, so a load of less critical writes (such as re-writing envelopes) get fsync()ed for free.

In this setup the queue is somewhat like an NNTP server's overview database, and it has the same problems. All your eggs are in one basket, so it had better be very reliable. When INN introduced advanced storage layouts that depended on the correctness of the overview database, any problems with the database were catastrophic - and they were sadly common. That's not to say it can't be done: Demon's NewsBorg managed it. And in fact, an MTA's log-structured queue database which only needs to be accessed in log-structured order is much simpler than a random-access overview database, so it should also be more reliable.

I must say, I'm pretty pleased with this :-) Although I came at it from the direction of performance hackery, it turns out to be a unifying and simplifying idea of great elegance. And there are still at least a couple more ways of making it faster...

fanf: (Default)

It's traditional for most MTAs to use something close to the host operating system's standard text file format for storing messages that are queued for delivery. For example, on Unix that means using bare LF for newlines instead of the Internet standard CRLF. Exim, Sendmail, and qmail do this. On other systems the translation may be more extensive, e.g. if the native charset is EBCDIC, or if text files are record-oriented rather than stream-oriented. Postfix is a bit of a counter-example, since it runs on Unix but turns the lines of a message into records, as part of its general IPC scheme that I mentioned in part 2.

This makes sense if you are working in an environment where local email is more common than remote email: messages stay in the system's local format so are less likely to be munged. However, nowadays MTAs (especially on Unix) generally act as relays to and from SMTP (and variants like message submission or LMTP). The mismatch between local text formats and the Internet standard format leads to all sorts of transparency bugs. For example, SMTP was originally specified to be 7 bit clean, with no restrictions on bare CR or LF. However these characters will get mangled when a message is transferred via an MTA that translates newlines. There are descriptions of many similar problems in RFC 2049.

The other problem with all of this translation and re-translation is that it's horribly inefficient: it requires lots of copying back-and-forth, meaning you use several times more memory bus bandwidth than is needed to shift bits between the network and the disk. One of the neat things that the Cyrus IMAP server does is store its data in wire format, so that it can be transferred from disk to network with minimal copying. (UW IMAP, Courier, and Dovecot all use Unix native Berkeley or maildir folders.) Why not do the same in an SMTP server?

The main difficulty with this idea is that SMTP does not have a single wire format. What's worse is that you don't find out a message's outgoing wire format until you have connected to the destination host and are about to transmit it. Even worse, there might not even be a single outgoing wire format if the message has multiple recipients at different hosts with different capabilities.

You can make this problem much more tractable if you store the messages in the format that they are received in, and prepare for any recoding that may be necessary on transmission. In most cases, the incoming and outgoing software will have the same capabilities, so you will be able to use efficient APIs like sendfile(). But what do I mean by "prepare"?

  • You need to parse the MIME structure of the message, and note at least the content-transfer-encoding of each part. This will allow you to downgrade each part in the appropriate way when that is necessary.
  • If you are really keen, then you can also scan the data in each part to (say) choose between quoted-printable or base64 depending on which would be most efficient.
  • You need to note dot-stuffing points: in traditional SMTP, where dots have been inserted, and with chunked SMTP, where they need to be inserted for onward traditional transmission.
  • If you support UTF8SMTP, you have to scan the RFC822 and MIME headers for top-bit-set characters that may need downgrading to RFC2047 or RFC2231 encodings.

This can be done as the message is received, which only requires the message data to go over the memory bus once. (Data goes over the bus twice in a copy - from the old buffer and to the new buffer - so a parse is cheaper than a copy.) When you come to send the message onwards, you can directly sendfile() those parts that do not need to be downgraded and only recode where absolutely necessary.

It's worth noting that translation between chunked and traditional SMTP can be done efficiently (assuming no MIME recoding is necessary). Dot-stuffing is only rarely necessary (e.g. it never happens in base64 attachments) so there will be reasonably large blocks of the message between dots which can sensibly be passed to sendfile(). When removing dots you will be sending blocks with a one byte gap between the file offset of the end of one block and the start of the next, and when adding dots you can use the leader or trailer feature of sendfile() to add the dot. This is good because it means it is cheap to use the more pipelineable BDAT command when it is available. It's also far more efficient than the usual scan-every-byte-of-the-message implementation of dot-stuffing. And I think it is quite cute :-)
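Here's a rough sketch of the receive-time scan and the extent computation (the function names and the list-of-offsets scheme are my invention, not any particular MTA's; for brevity it ignores a stuffed dot on the very first line):

```python
def stuffed_dot_offsets(data: bytes) -> list[int]:
    """Note the dot-stuffing points as the message arrives: the offset
    of each doubled dot after a CRLF in traditionally received data."""
    out, i = [], 0
    while True:
        i = data.find(b"\r\n..", i)
        if i < 0:
            return out
        out.append(i + 2)   # the dot to drop on onward chunked transmission
        i += 4

def unstuff_extents(length: int, dot_offsets):
    """Turn the noted offsets into (offset, count) extents, each of
    which could be handed straight to sendfile(); the one-byte gaps
    between extents are the stuffed dots being skipped."""
    pos = 0
    for off in sorted(dot_offsets):
        if off > pos:
            yield (pos, off - pos)
        pos = off + 1
    if pos < length:
        yield (pos, length - pos)
```

Concatenating the extents reproduces the message with each stuffed dot removed, which is exactly what you want to feed to BDAT.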

Existing MTAs that implement 8BITMIME downgrading (such as Sendmail) generally do the MIME parse and recode when sending the message, and just use base64 because it would be too expensive to do two passes to work out if quoted-printable would be better. Another situation when two passes are necessary is adding a DomainKeys signature, because it appears at the top of the message. In many cases (if you aren't going to alter the message) you can make DK more efficient by calculating the signature as the message arrives so that it's easy to add when the message is transmitted.

Is it worth trying to make this more efficient? For example, if you have a cache of destination host capabilities, you could recode the message as it comes in to save a trip over the memory bus. But you will still have to support late recoding if the cache lookup misses, so the extra complexity probably isn't worth it.

Previously: part 3 - local delivery; part 2 - partitioning for security; part 1 - the sendmail command; message identification.

fanf: (Default)

When SMTP was originally developed in 1981 it was indeed simple: messages were flat-ASCII only, with no support for interesting languages or media or structured messages. But that was OK, because people had been using email without those features for nearly a decade. By the start of the 1990s this was embarrassingly limited, so MIME was developed to support multimedia and multilingual email in a backwards-compatible way. This didn't require any enhancements to SMTP, but as a consequence it was somewhat inefficient. So a mechanism for extending SMTP was developed, which (amongst other things) allows MIME messages to be transmitted more efficiently. Fifteen years later another significant change is being worked on, to move from ASCII-only email addresses to internationalized UTF-8 addresses, while still preserving backwards compatibility where possible. The end result is that even getting the bits from A to B is amazingly complicated, as I shall now explain.

The basic email message format, dating right back to before RFC 561 in September 1973, divides the message data into a header and the body. The header is mostly "protocol" data (i.e. for consumption by computers) whereas the body is "text" (i.e. for consumption by people), according to the terminology of RFC 2277. This distinction is relevant to MIME because the extensions it introduces can change the interpretation of the "text" parts of messages, but they leave the existing protocol parts unchanged. (Of course MIME introduces new "protocol" headers for its own purposes.)

In RFC 2045 MIME defines three classes of data:

  • 7bit data is what may be transmitted by un-extended SMTP without any encoding. The data consists of lines of up to 1000 characters terminated by a CRLF (i.e. a pair of bytes with values 13 and 10). Bytes may only have values between 1 and 127, and values 13 and 10 may only occur as part of a CRLF.
  • 8bit data is the same as 7bit data, except that bytes may have values above 127. Many textual formats count as 8bit data, for example text in ISO 8859 charsets or in UTF-8.
  • Binary data has no restrictions. As well as multimedia data such as images and sounds, some textual formats are binary, for example UTF-16.
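A minimal checker for the 7bit rules might look like this sketch (taking the 1000-character limit to include the terminating CRLF):

```python
def is_7bit(data: bytes) -> bool:
    """Check the 7bit rules sketched above: byte values 1..127 only,
    CR and LF only as CRLF line terminators, and lines of at most 1000
    characters including the CRLF."""
    i = line_start = 0
    while i < len(data):
        b = data[i]
        if not 1 <= b <= 127:
            return False            # NUL or an 8bit byte
        if b == 13:                 # CR must start a CRLF...
            if i + 1 >= len(data) or data[i + 1] != 10:
                return False
            if i + 2 - line_start > 1000:
                return False        # over-long line
            i += 2
            line_start = i
        elif b == 10:
            return False            # ...and LF must not appear bare
        else:
            i += 1
    return i == line_start          # data must end at a line boundary
```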

Since only 7bit data can be transmitted un-encoded, MIME also defines two so-called content-transfer-encodings which parallel the 8bit and binary data classes.

  • The quoted-printable encoding is designed for use with 8bit data that has a small proportion of bytes outside the 7bit gamut. Most bytes are un-encoded, and encoded bytes are expanded to a three byte sequence consisting of an = sign followed by two hex digits. Although it is designed to be reasonably unobtrusive when used with textual data, such that it remains mostly readable even if not decoded, it can handle arbitrary binary data. However, the typical expansion factor for compressed data is 2.25, which is not very efficient.
  • The base64 encoding is designed for binary data. It encodes three data bytes in four bytes of mostly alphanumeric ASCII, for an expansion factor of 1.33.
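You can check those expansion factors with a few lines of Python, modelling compressed data as uniformly distributed bytes and ignoring quoted-printable's soft line breaks (which add a little more overhead):

```python
import base64

def qp_len(data: bytes) -> int:
    # Quoted-printable size, ignoring soft line breaks: printable ASCII
    # (except "="), plus tab and space, cost one byte each; everything
    # else costs three ("=" followed by two hex digits).
    return sum(1 if (33 <= b <= 126 and b != 61) or b in (9, 32) else 3
               for b in data)

# All 256 byte values equally often: a stand-in for compressed data.
sample = bytes(range(256))
qp_factor = qp_len(sample) / len(sample)                  # about 2.26
b64_factor = len(base64.b64encode(sample)) / len(sample)  # about 1.34
```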

I said above that the message header is only "mostly" protocol data. There are some important parts which are human-readable text, including the Subject: and the parts of the From: / To: / CC: headers which give people's names. RFC 2047 defines a sort of mini-MIME for use in these headers. It's "mini" because information about the character set and encoding is bundled in with the encoded text in a compact form, rather than being placed in separate headers in longhand like the message body's MIME metadata. RFC 2047 has "q" and "b" encodings which are similar to quoted-printable and base64.

If that isn't complicated enough, there are bits of header which are "protocol" according to RFC 2277, but nevertheless can't always fit within the limits of ASCII. The principal example of this is attachment filenames. So there is yet another encoding defined in RFC 2231.

So that is a summary of the baroqueness that is MIME encodings. One of the things that ESMTP aims to do is (eventually) make all of this encoding unnecessary. There are three existing extensions for this purpose, and one further extension which is currently in the works. These extensions roughly parallel MIME's encoding features.

However, before listing the extensions I should note that with un-extended SMTP, some encoding is necessary to transmit even a plain 7bit ASCII message. Messages are transmitted as a sequence of lines terminated by a line containing only a dot. Lines in the message which start with a dot have an extra dot added so that the message is not terminated prematurely. This is called dot-stuffing.
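Dot-stuffing is easy to sketch (assuming CRLF line endings and a message that already ends with CRLF):

```python
def dot_stuff(message: bytes) -> bytes:
    """Frame a message for traditional SMTP DATA: double any dot at
    the start of a line, then terminate with a line containing only a
    dot.  Assumes the message already ends with CRLF."""
    body = message.replace(b"\r\n.", b"\r\n..")
    if body.startswith(b"."):
        body = b"." + body
    return body + b".\r\n"

def dot_unstuff(wire: bytes) -> bytes:
    """The inverse: strip the terminator, un-double leading dots."""
    body = wire[:-3]                # drop the final ".\r\n"
    if body.startswith(b".."):
        body = body[1:]
    return body.replace(b"\r\n..", b"\r\n.")
```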

  • The 8BITMIME extension, defined in RFC 1652, allows you to transmit 8bit attachments un-encoded. This is almost universally supported now.
  • The CHUNKING extension, defined in RFC 3030, allows you to use a different form of framing for messages. Instead of dot-stuffing, you can transmit the message in byte-counted chunks. As well as being marginally more efficient than the traditional framing, it makes it possible to use...
  • The BINARYMIME extension, also defined in RFC 3030, allows you to transmit binary attachments un-encoded.
  • The UTF8SMTP extension is still a draft. It's part of the effort to internationalize email addresses, which follows from the introduction of internationalized domain names. One of the side-effects of this extension is that it will be possible to use UTF-8 in message headers un-encoded. This should make RFC 2047 and RFC 2231 un-necessary. (One of the pre-requisites for this is a commonly accepted universal character set, which we now have. UTF-8 has only really achieved this status in the last few years.)

These four extensions are not orthogonal, so they combine to produce eight possible forms of SMTP message transmission, one of which is very unlikely to occur in the real world. The more extensions that the server supports, the less you need to encode. The following list enumerates the possible combinations of extensions, and for each combination it states what data must be encoded. "Headers" means that non-ASCII data in the headers must be encoded according to RFC 2047 or RFC 2231 (otherwise it must be UTF-8); "8bit" means that 8bit data must be encoded with quoted-printable or base64; "binary" means that binary data must be encoded with base64; and "dot-stuffing" means dots at the start of lines must be doubled.

  • No extensions: headers, 8bit, binary, dot-stuffing.
  • 8BITMIME: headers, binary, dot-stuffing.
  • 8BITMIME + UTF8SMTP: binary, dot-stuffing.
  • CHUNKING: headers, 8bit, binary. (unlikely!)
  • CHUNKING + 8BITMIME: headers, binary.
  • CHUNKING + 8BITMIME + UTF8SMTP: binary.
  • CHUNKING + 8BITMIME + BINARYMIME: headers.
  • CHUNKING + 8BITMIME + BINARYMIME + UTF8SMTP: no encoding.
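The list above collapses into a small function; this is an illustrative sketch (a real MTA would fold it into its MIME layer rather than pass strings around):

```python
def required_encodings(extensions):
    """Given the ESMTP extensions a server advertises, return the
    encodings the client must still apply, per the list above."""
    ext = set(extensions)
    needed = set()
    if "UTF8SMTP" not in ext:
        needed.add("headers")       # RFC 2047 / RFC 2231 encoding
    if "8BITMIME" not in ext:
        needed.add("8bit")          # quoted-printable or base64
    if "CHUNKING" not in ext or "BINARYMIME" not in ext:
        needed.add("binary")        # base64
    if "CHUNKING" not in ext:
        needed.add("dot-stuffing")
    return needed
```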

The specifications for these extensions say that when an MTA that supports them is transmitting a message, it must translate un-encoded data into its encoded form if the receiving MTA doesn't support the necessary extensions. In the absence of these extensions, an MTA can pretty much ignore the message data. (The separation of the layers isn't entirely perfect because of things like Received: trace headers and the way delivery failures are reported.) An MTA that does support these extensions has to have a full MIME implementation - the kind of layering violation that makes a mockery of the ISO model :-)

Note that while the specifications require support for downgrading an un-encoded message into encoded form, upgrading is optional. Whether the extra effort is worthwhile or not probably depends on the relative cost of CPU and network.

Obviously this complexity has implications for the design of MTAs, but I will write about that another time.

fanf: (Default)
http://fanf.livejournal.com/44733.html

I just found out an explanation for Outlook's irritating behaviour that I was working around back in January. It turns out that Sendmail has never followed the standard's requirements for the Sender: header (it never adds or fixes it), so the developers of Outlook assumed that you would only see a Sender: header when the message had been transmitted via a mailing list. This makes the "from [list] on behalf of [author]" wording look relatively sensible. However, they should have used the bloody List-ID: which is supposed to be used for exactly this kind of thing.

What is worse is that Exim is out on a limb in this respect. Postfix used to follow the standard but was changed in 2000 to be bug-compatible with Sendmail. This is very irritating, because the standard behaviour is much more useful. Grr.
fanf: (Default)
In my previous entry I made a couple of comments about mailing lists which [livejournal.com profile] james_r asked me about IRL, so here's a follow-up.

There has been a very blurry line between mailing lists and multi-recipient aliases almost forever: the ambiguity is referred to in RFC 821 (dated August 1982). The earliest clear and prominent explanation of the difference is in RFC 1123 (dated October 1989), but that doesn't say anything about what lists do to the message header. Traditionally, during list expansion the Sender: field has been replaced with an address related to the list (often the same one used to replace the return path). Lists may also make other changes (about which more below) but it's this abuse of Sender: that I would like to improve.

As RFC 821 hints, mailing lists have been implemented with varying degrees of integration with the MTA. Ten years ago a common setup was to have Majordomo managing the contents of an aliases file, which was then used directly by Sendmail to do the list expansion. Nowadays it's more common to run something like Mailman which is mostly independent of the MTA. This leads to arguments about what architectural label to put on the list manager (which are probably aggravated by the fact that the labels were invented for X.400): is it part of the Message Transmission System alongside the MTA, or is it an MTS client, i.e. an MUA?

There's another historically ambiguous area, around the word "forwarding". In fact it overlaps with the list/alias blurriness because "forwarding" is sometimes used to mean "aliasing", especially when it is set up with a .forward file or an equivalent such as the Sieve redirect action (as opposed to something like the /etc/aliases mechanism). It's best to call this "aliasing" or "redirecting", not "forwarding" (even though RFC 2822 describes this meaning of "forwarding"). "Forwarding" should be reserved for sending a new message with the forwarded message attached.

There is a third operation which is called "forwarding" by RFC 822. Some MUAs (including Pine) describe it as "bouncing" which is also confusingly ambiguous terminology that should be avoided. RFC 2822 clarified that the old 822 terminology should not be used, so I call this "resending" after the header fields it uses. From the SMTP point of view, resending is very similar to mailing list expansion in that the message gets a new set of recipients, its return path is changed to refer to the resender, and the message header and body mostly remain the same. The difference is that resending is usually more manual, and the Resent- header fields include more information than mailing lists' traditional Sender: manipulation.

One of the improvements that RFC 2822 makes compared to RFC 822 is that it allows a message to be resent multiple times, with an unambiguous record of what happened in the message header. The top of a message includes a sequence of Received: fields which record in blog order (most recent at the top) the MTAs that a message went through. RFC 2822 says that (like Received: lines), Resent- fields should be added to the top of the message, so that they appear at a point in the Received: chain corresponding to the point that the message was resent. (RFC 822 didn't say anything in particular about where Resent- fields should go, so you could only resend a message once unambiguously.)

So that's the background. What I propose is that:
  • Mailing list managers should be considered MUAs, i.e. that the result of list explosion should be considered a new message submission rather than anything like relaying or redirecting;
  • They should use the Resent- convention to indicate that the message was resent via a list, and at what point in the chain of MTAs this occurred;
  • They also should add their List-ID: and other List- fields at the top of the header, next to the Resent- fields.
The advantages of this are that it's much clearer which system is responsible for doing what to a message; the original Sender: field isn't obliterated; and it's much more friendly to cryptographic signature schemes like DKIM which don't like the message header to be arbitrarily modified.

Things it doesn't improve are the controversies over adding list name tags to the message's Subject: and overriding the original sender's choice of Reply-To:. On these matters I'm of the opinion that this mangling is only done because MUAs still don't have decent user-interfaces for the List- fields.
fanf: (Default)
There are at least six forms of identifier in SMTP which identify messages or parts of messages or transmissions of messages.


Message-ID [RFC2822]

The message's primary identifier appears in its Message-ID header field. It's a key part of the way Usenet works, where Message-IDs must be globally unique and are used by servers to decide whether or not they have a copy of a message already. They are much less significant in email, but some servers do per-mailbox de-duplication when delivering a message.


Content-ID [RFC2045,RFC2046]

If a message has several MIME parts then they can cross-reference each other using their Content-IDs. This is most commonly used for embedding images in HTML email. Content-IDs are supposed to be a globally unique identifier for the file, so that you can tell (for example) that various message/external-body parts refer to the same thing.


Resent-Message-ID [RFC2822]

When you re-send a message it gets an extra message ID alongside the extra from/to/date fields. This could be used to distinguish different re-sendings of the same message.

Some MUAs, e.g. Pine, call the re-send operation "bounce", but this causes confusion with delivery failures. Re-sending is more similar to forwarding a message, but without wrapping the original message in a new multi-part message with a covering note. Mailing lists effectively re-send messages but they don't use Resent- headers to explain what they did, which is a shame.


spool ID [RFC2821]

Received: headers include an ID field which is used to record the spool ID (sometimes called the queue ID). This is essentially the filename of the message within the MTA's spool, so the message will get a new spool ID each time it is transmitted. When a message is accepted, the receiver-SMTP usually mentions its spool ID in its final 250 response, so that the sender-SMTP can log it. This makes it easy to correlate the logs for a particular message across multiple MTAs. (MICROS~1 Exchange instead mentions the Message-ID, which is no use at all since the sender already knows it.)


TRANSID [RFC1845]

The transaction ID is part of the SMTP extension for checkpoint/restart, which allows a client to recover gracefully from a lost connection. The sender-SMTP generates a TRANSID for each transmission of a message, so in the normal case they correspond one-to-one with the receiver-SMTP's spool IDs. (There may be more than one spool ID per TRANSID if the sender-SMTP tries to restart a failed transaction for which the receiver-SMTP has discarded the previous spool file.) Not much software implements this extension.


ENVID [RFC3461,RFC3885]

This is something like a cross between the Resent-Message-ID and the TRANSID. It is generated in a similar manner to a TRANSID and also appears in the message envelope, but once a message has an ENVID it is preserved until the message gets re-sent. The idea is that the ENVID is included in a delivery status notification so that mailing list managers (etc.) can correlate the DSN with the message that caused it. A Resent-Message-ID would serve almost as well.


A couple of weeks ago, Ian Christian asked on the Exim-users list "Wouldn't a connection-ID be a useful thing to have?" and I had already come to the conclusion that, yes, it would.

At the moment, Exim generates its spool ID quite late, after it has received the message header. This means that rejected RCPT commands can't always be correlated with messages, though this turns out not to be a problem in practice. The other thing you can't do is tell which messages were sent down the same connection, unless you tell Exim to log more connection details, and even then it isn't immediately obvious.

My design would be to create a connection ID for every incoming connection, which can be used for logging everything related to that connection. When a MAIL command is received, a spool ID is created which has the connection ID as a substring, and this is used for logging all commands related to that transaction. When the message is forwarded onwards, a TRANSID is created which has the spool ID as a substring. Thus the postmaster can grep broadly or selectively by using shorter or longer identifiers.
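A sketch of such a scheme (the ID formats here are entirely made up; the point is only the substring property):

```python
import itertools

_seq = itertools.count(1)

def connection_id() -> str:
    # One ID per incoming SMTP connection, used in all its log lines.
    return "C%06d" % next(_seq)

def spool_id(conn: str) -> str:
    # The spool ID embeds the connection ID, so grepping the logs for
    # the connection ID also finds every transaction on that connection.
    return "%s.M%06d" % (conn, next(_seq))

def trans_id(spool: str) -> str:
    # Likewise the TRANSID embeds the spool ID of the message being
    # forwarded, so a grep for the spool ID finds its transmissions.
    return "%s.T%06d" % (spool, next(_seq))
```

Grepping for a short prefix matches broadly (everything on a connection); grepping for a longer ID matches selectively (one transaction, or one onward transmission).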

It isn't possible to correlate everything easily, though, because if you send multiple messages down the same outgoing connection, they will have unrelated TRANSIDs, so the MTA will still have to log details of outgoing connections separately.
fanf: (Default)
Last year I missed the annual Exim course in Cambridge owing to being on honeymoon at the time - though I still did the setup for the practical sessions, which was slightly hairy since it was the first time we had done a practical in Cambridge and I was not going to be there to fix any bugs; fortunately there were none.

This year I am in Cambridge, and as well as doing a slightly updated version of the practical, I have also given a talk. The slides are now on the web at http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/talks/2006-07-exim-course/
fanf: (Default)
We have some shiny new machines which have 16 disks in a 3U chassis, which is a pretty tight fit. In fact we re-jigged them so that instead of being wired with one block of eight in one RAID set and the other eight in another, we've put alternate disks in different RAID sets so that it matters less if a disk messes with a neighbour when it is being removed. (Two degraded RAID sets rather than one completely shafted one.) We're going to use four of them as the 4th generation Cyrus machines.

As usual for us they have Intel mobos and LSI battery-backed RAID controllers - though unlike our older RAID cards, these ones have been re-badged by Intel and have a more graphical user interface, neither of which is an advantage. My colleague David has been doing our usual paranoid power-loss sanity checks, and it turns out that when these new machines find unwritten data in the write cache at boot time, they say "ooh look! unwritten data. I will log this fact and then throw it away." D'oh!

Further investigation has revealed that this seems to be an incompatibility between the motherboard and the RAID card. The exact nature of the incompatibility is unclear, and it looks like we might have to ditch LSI and replace the cards with 3ware ones instead. This will be a particular pain because some of these new machines have already gone into service, e.g. for our local free software mirror site.

This reminded David that some of our Cyrus machines are now getting a bit old, and are now past the advertized lifetime of the RAID batteries, so he has been doing power-off tests. Fortunately the batteries seem to be surviving well. However he discovered that our 3rd generation Cyrus servers (16 disks in 4U) seem to be suffering from the same RAID/mobo incompatibility as the new machines. They worked OK originally, but required a mobo replacement some time back, and we didn't expect this to stop the RAID working. Restoring these machines is NOT an easy job:
fanf2@cyrus-21.csi.private.cam.ac.uk:~
: 0 ; df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             2.0G  970M  951M  51% /
tmpfs                 1.8G  8.0K  1.8G   1% /dev/shm
/dev/md0              3.3T  1.5T  1.8T  46% /spool
/dev/sdb1             2.0G  289M  1.6G  16% /var
Their networking is 100Mb/s, which means it will take at least 1.5 days to get the data back on to the machine after replacing the RAID card. A gigabit upgrade is now being planned...
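A back-of-envelope check of that figure, assuming ideal throughput (protocol overhead and the disks only push it higher, hence "at least"):

```python
used_bytes = 1.5 * 10**12        # 1.5T used on /spool, per the df output above
link_bits_per_sec = 100 * 10**6  # 100Mb/s networking
seconds = used_bytes * 8 / link_bits_per_sec
days = seconds / 86400           # roughly 1.4 days at best
```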

[livejournal.com profile] brad will sympathize...
fanf: (Default)
Why didn't I think 14 months ago that the withdrawal of insecure access to Hermes would be an ideal time to get rid of the stupid and ugly username@imap.hermes.cam.ac.uk -> username@hermes.cam.ac.uk rewrite rule?
fanf: (Default)
It soon becomes obvious to people who spend much time communicating online that people tend to behave worse than they do in person. This is probably due to the absence of the usual cues you get from being able to see or hear who you are talking to. The problem becomes much worse when you add anonymity into the mix.

We've recently had a dickhead on the exim-users list abusing anonymous remailers to make one of the regular posters appear bad. As a consequence I've blacklisted all the anonymous remailers I could find - only 35 of them, which was fewer than I expected. I did this based on their sender addresses, since they tend to live on dynamic IP addresses or behind legitimate email systems.

I'm feeling very cross because of this bad behaviour.
fanf: (Default)
I've been playing around with Dan Bernstein's cdb recently. We use it quite heavily - most of the configuration tables on our mail servers are cdb files. I'm quite fond of its simplicity. But the code... DJB doesn't really do layers of abstraction. (He even says "Packages that need to read cdb files should incorporate the necessary portions of the cdb library rather than relying on an external cdb library.") He also doesn't believe in libc: the cdb code includes an incorrect declaration for errno, a braindead wrapper around malloc(), inefficient string functions, a dumb reinvented wheel strerror(), etc. etc. When I was reading the algorithm that constructs the hash tables, I added comments as I went. One of them said:
  /* Afterwards c->start[i] points to the *end* of the ith block. */
Nice variable name!

(Admittedly the next loop adjusts the pointers so that the name becomes correct, but still.)
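For reference, the hash function at the heart of it is tiny - here it is transcribed into Python from cdb's documented algorithm:

```python
def cdb_hash(key: bytes) -> int:
    # cdb's hash: start from 5381, then for each byte
    # h = ((h << 5) + h) ^ byte, keeping 32 bits.
    h = 5381
    for b in key:
        h = (((h << 5) + h) ^ b) & 0xFFFFFFFF
    return h

def cdb_slot(key: bytes) -> tuple[int, int]:
    # A lookup uses the low 8 bits of the hash to pick one of the 256
    # hash tables, and the remaining bits to pick the starting slot.
    h = cdb_hash(key)
    return h & 255, h >> 8
```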
fanf: (Default)
Network Major Projects Manager in the University of Cambridge Computing Service

The University Computing Service (UCS) provides central academic Information Technology services and IT infrastructure for both the academic and administrative activities of the University of Cambridge. These services are crucial to the success and reputation of the University. The UCS is responsible for the development, delivery and support of core technology and systems (communication, storage, computers and corporate systems) to staff and students.

A data networking expert with extensive project and staff management experience is required for leading the following major projects:
  • The development of the University's data network to support the conversion of the University's analogue telephone network to Voice-over-IP;
  • The expansion of the University's data network by means of an injection of up to £1M of SRIF3 funding.
The successful candidate will lead the planning and implementation of these and other major projects, working as a deputy alongside the Network Systems Manager.

With a degree or equivalent qualification, the candidate will have a minimum of five years' recent experience in IP network service delivery and a minimum of five years' experience of team management. The candidate's technical knowledge and experience should be at least to the level required for CCIE certification.

This is a permanent appointment at the Senior Computer Officer grade, subject to nine months satisfactory probation period, on a salary scale of £45,417 to £47,978 (under review) with initial salary depending on experience. The University offers many benefits to its staff including a final-salary pension scheme (USS), a variety of facilities and services, and opportunities for training and career development.

Further particulars and an application form are available from the Director, University of Cambridge Computing Service, New Museums Site, Pembroke Street, Cambridge CB2 3QH (e-mail lamk2@cam.ac.uk) or on the web at http://www.cam.ac.uk/cs/jobs. The closing date for applications is Friday 28 July 2006. It is intended that interviews will be held during the week beginning Monday 14 August 2006.

The University follows an equal opportunities policy.
fanf: (Default)
In the course of cleaning up after the departmental email server failure, I've noticed that we have an enormous number of messages on our queues for one user, like about 14 000 out of about 18 000 on each of the machines.

It turns out that this user has a vanity domain which forwards email for any local part at that domain to his Hermes account. This vanity domain is being spammed to death, with the result that his quota is entirely used up with 30 000 spam messages and he has another 70 000 waiting.

What is worse, he isn't even attached to this University any more: his Hermes account seems to be a left-over from when he was a lecturer here. It looks like he's given up on it - he stopped popping his email from it last month - but of course he hasn't got in touch with us about the problem. Aaargh!

Never accept email to invalid local parts! It is just asking for trouble!
fanf: (Default)
http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=396

I can't help thinking of the IETF when reading this article about CORBA, and how its actual behaviour can deviate from its self-image. Some people refer to this disparagingly as the "IVTF" (V for vendor).
fanf: (Default)
Of course, the downside of emergency hacks is that they can't stick around for ever, so I've still got work to do for them. Now that they've decided where to go from here, the clean-up can start - re-jigging forwarding arrangements, recovering lost email. Looks like it'll affect at least 300 users...
fanf: (Default)
A couple of days ago, one of the departmental email servers trashed a disk and it has been down ever since. They seem to be having some trouble getting it going again. I've spent much of today carefully putting together some emergency arrangements for their email.

Firstly, we have to replace the "special route" which sends email to their broken server with a rule that says to just leave the email on the queue. This is so that we can add other rules which do more interesting things, but without changing the basic behaviour. In our configuration, special routes take precedence over most other rules which is why it has to be removed.
kaboom_defer:
  driver		= redirect
  domains		= kaboom.cam.ac.uk
  data			= :defer: email server is currently unavailable
  allow_defer
This changes ppswitch's behaviour slightly, because it will now temporarily reject messages from outside the University instead of accepting and queueing them. So we also need a special router which is less strict about recipient verification for the broken server:
kaboom_accept:
  driver		= accept
  verify_only
  domains		= kaboom.cam.ac.uk
They are quite lucky in that many of their email addresses have the form crsid@kaboom, and many of their users have Hermes accounts. So a useful thing to do is to spot addresses belonging to Hermes users and send the email there rather than to the broken server. However it isn't as simple as that: if the Hermes user has a filter that redirects their email back to the broken machine, this special arrangement would create a loop. So it only sends messages to Hermes if they didn't come from Hermes...
kaboom_hermes:
  driver		= redirect
  domains		= kaboom.cam.ac.uk
  local_parts		= +hermes_active
  condition		= ${if !match{$sender_host_name}{cyrus-[0-9]+[.]csi[.]private[.]cam[.]ac[.]uk} }
  data			= $local_part@hermes.cam.ac.uk
  forbid_blackhole
  forbid_file
  forbid_include
  forbid_pipe
  check_ancestor
  retry_use_local_part
Finally, their head of department wants email sent to hod@kaboom.cam.ac.uk to still work during the outage. Rather than create another special-case hack, I just created a managed mail domain for them (i.e. our standard alias-file-based virtual domain setup) which can define temporary arrangements for their non-crsid addresses. So hod@kaboom now goes to the professor's Hermes account.

What fun!
fanf: (Default)
How depressing. These essays were supposed to turn up reasonably frequently, but I got hung up on the original plan for essay 3 and it still isn't quite writable. In essay 2 I argued that it's very hard indeed to implement strong partitions within a program running on Unix, even when it runs as multiple processes. Essay 3 was supposed to be a more nuanced argument about security engineering, and therefore more positive about partitioning, but it's hard to be opinionated when you don't have a definite answer.

So let's go back to basics and take a look at another public interface of an MTA. In essay 1 I talked about local message submission, in which I argued that every current MTA does it wrong. This time I'm going to look at the local delivery agent - LDA - and again, I think every current MTA does it wrong.

Background

Local delivery on Unix is complicated. The set of valid local parts of email addresses (the bit before the @) is basically the same as the list of local users, defined in the passwd file. But extra valid local parts can be defined in /etc/aliases. Delivery is usually just appending the message to /var/mail/user, but if the user has a .forward file, that may specify an alternative mailbox, or redirection to another user or address, or sending the message to a program, or any combination of the above. And aliases have the same delivery options. Furthermore, special entries in the aliases file can cause magic return-path rewriting, which is used to make delivery errors from mailing lists go to the list managers instead of the person contributing the message to the list. And finally there may be a mechanism for accepting email for otherwise-unknown local parts.
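As a rough illustration, here's a toy Python sketch of that resolution loop. All names are hypothetical; a real MTA also handles :include: files, loop detection, quoting, and much more:

```python
def resolve(local_part, aliases, read_forward):
    """Toy resolver for a traditional Unix local part.

    aliases: dict mapping local parts to lists of targets (/etc/aliases).
    read_forward: function returning the user's .forward targets, or None.
    Returns a list of final delivery instructions.  NB: no loop detection!
    """
    deliveries = []
    queue = [local_part]
    while queue:
        addr = queue.pop()
        if addr.startswith("|") or addr.startswith("/"):
            # Pipe to a program, or append to an explicit mailbox file.
            deliveries.append(addr)
        elif addr in aliases:
            queue.extend(aliases[addr])             # alias expansion
        else:
            forward = read_forward(addr)
            if forward is not None:
                queue.extend(forward)               # .forward redirection
            else:
                deliveries.append("/var/mail/" + addr)  # default mailbox
    return deliveries
```

Even this caricature shows why the real thing is hard to get right: redirections can chain through aliases and forwards arbitrarily before bottoming out in a file or program.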

Trad MTAs

Sendmail and Exim do (almost) all of their email routing and delivery as root. So essentially all of the complicated programming outlined in the previous paragraph must be trusted by every user of the computer, and it must be robust and correct.

The "almost" arises from the fact that some parts of the process may be run under the identity of the destination user, in particular the process of writing the message to a mailbox or delivering it to a program. Interpreting .forward files may also be done as the user, which is particularly important if they contain complicated filtering scripts as for Exim or Sieve. But still, the end result of routing the address is a filename or program, and you have to trust the MTA to get it right.

The qmail approach

There are a couple of differences in the way that qmail does things.

Most interestingly, it adds a couple of security boundaries. The first one occurs between deciding that an address is local and the whole process of local delivery. In this case the qmail-lspawn daemon has the necessary privilege and is responsible for switching to the appropriate user and running the program that delivers the message. The second one occurs at a redirection (including aliases and forwards), in which case the message is re-injected into the queue.

Less interestingly, it has its own way of describing aliases and forwards. The .qmail file has a very slightly simpler format than a .forward file, but essentially the same features. It also has an alternative aliases file format which is reminiscent of a password file, and simultaneously less flexible and more redundant than a normal aliases file. Finally, it falls back to looking at a special "alias" user's .qmail files.

To some extent, this reduces and divides up the complexity of local delivery. But not much. It reduces it by using simpler file formats ("don't parse"), but that doesn't achieve much, since the .qmail processing is done under the identity of the user who created the file. It divides it by separating alias handling from forward handling, but this means the whole thing requires four programs and extra re-injections.

The postfix approach

Postfix is much more compatible with the traditional way of doing things: it understands the standard /etc/aliases and .forward files without qmail's gratuitous wheel-reinvention. And, just as in essay 2, it has less security partitioning than qmail - where qmail's LDA is three programs (not counting re-injection this time), Postfix's LDA is one.

Essentially all of the traditional local delivery logic is implemented inside Postfix's "local" daemon. This implies that it must do a lot of work as root, in particular alias processing: it has to work out which local user the alias redirects to before it can drop privileges. The manual's "security" section is remarkably coy about the details of this, saying only that it is "security sensitive" (duh).

What is worse is that it has to include a lot of support for Postfix's infrastructure: it has lots of options in the configuration file which it must therefore parse, and it also has to support all of Postfix's database types. Some of these can be fiendishly complicated: LDAP, SQL, embedded databases, regex matches, sockets, etc. etc. As a result it is one of the largest parts of Postfix. Are you really happy running all this as root, even for a "secure" MTA?

This mention of other database types hints at the complexity quagmire that email can be. Unix MTAs often have to handle the interoperability between the Internet and "enterprise messaging systems" - which often means LDAP is necessary. And they have to do virtual hosting of various kinds: a virtual domain may be just an aliases file without local users, or it may be a pile of local mailboxes with a local user per domain, or it may be more complicated.

Postfix's LDA has the traditional Unix logic hard-coded. It can't do virtual domains. Postfix implements alias-based virtual domains as part of its main routing logic (but trad aliases are not done this way - huh?), and it has a separate parallel local delivery daemon for domains with mailboxes which duplicates much of the local(8) code (double-huh?).

Step back a bit

So how would we like it to work?

We want all the complicated routing logic to be performed by the main MTA process, running as a non-privileged user (with access only to the email queue and ancillary databases). The LDA does not worry about .forward or /etc/aliases. It does not link to database libraries.

However, we cannot trust the result of this routing: if we did, the MTA could alter any file on the system, so it could gain root privilege by appending an appropriate "message" to the password file. The LDA must have some way of verifying that the MTA's instructions are legitimate.

For Sendmail and Exim, the LDA necessarily trusts the MTA, because the MTA already is root. For qmail and Postfix, the LDA doesn't need to verify anything because it does the work itself, either as the destination user or as root.

Another wrong way

In simple cases it's easy for the LDA to verify its instructions: the MTA might say "deliver to this file as this user - and you know you can, because the file's name appears in the user's .forward!", and with a bit of filesystem permissions checking the LDA can be satisfied. However, it can't do this kind of check against the result of a database query. Even if we allow the LDA to link to the database library, it still can't verify that arbitrary SQL is legitimate.

The fanf way

In most cases, you don't want the LDA to deliver to many places. You might have a standard place for a user's mailboxes, e.g. ~/mail/, which is where the local MUAs and IMAP servers look. Delivery to other places is forbidden. So it should be really easy to tell the LDA where these places are, entirely independently of the MTA configuration: no matter whether the latter uses plain text /etc/aliases or databases or LDAP or whatever. You could just list the allowed paths in a file.

The LDA just gets an instruction from the MTA: deliver this message to this file as this user. The LDA switches to the user, checks that the file is listed as being OK, checks filesystem permissions, and writes the message. It then sends back a success or failure indication to the MTA. Really simple. Works for trad Unix delivery and virtual delivery and weird delivery.

Delivery to programs is similar - and in fact, here there are precedents for an ancillary permitted-destinations list. The Sendmail restricted shell, smrsh (which is unrelated to the James Bond villains), has a list of commands that it is willing to run. This is also a bit like userv.

One problem - easily solved - is that a simple list of destinations is not enough. You need some pattern matching, such as "anything under this directory", or "this file in my home directory, wherever that is". You want a system-wide default so that users don't have to be bothered with this detail, but you still want clever/demanding users to be able to refine it - e.g. "I want my inbox in ~/mailbox so never deliver to the default /var/mail/me". (This is still much simpler than .forward syntax.) You also want a tool to auto-generate the allowed-paths file from a .forward file or aliases file, so the common case remains easy to use.
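A minimal sketch of such a check, in Python with shell-style patterns and entirely hypothetical names (a real LDA would also switch uid, lock the mailbox, and verify file ownership and modes):

```python
import fnmatch
import os.path

def may_deliver(path, user, home, allowed_patterns):
    """Return True if the LDA should write to `path` on behalf of `user`.

    allowed_patterns: the user's (or system-default) allowed-paths list,
    where "%u" stands for the user name and a leading "~" for the user's
    home directory, e.g. ["/var/mail/%u", "~/mail/*"].
    """
    path = os.path.normpath(path)          # foil /var/mail/../etc tricks
    for pattern in allowed_patterns:
        pattern = pattern.replace("%u", user)
        if pattern.startswith("~"):
            pattern = home + pattern[1:]
        if fnmatch.fnmatch(path, pattern):
            return True
    return False
```

Note that fnmatch's `*` matches across directory separators, so `~/mail/*` permits anything under that directory - which is exactly the "anything under this directory" pattern wanted above.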

Tangential issues

It's very common for redirections discovered in the course of local delivery to cause the message to be re-injected. This is bad for performance (which was going to be the topic of essay 4, but may now turn out to be essay 5). With my design, local delivery happens after all redirections have occurred, so the message doesn't have to make another trip through the MTA.

We'd also like to be able to handle redirections from more complicated filtering scripts in this efficient way, but at the same time we would prefer these scripts (written by untrusted users) to be run with the privileges of their authors, not with the privileges of the MTA. A small extension to the MTA->LDA protocol allows the MTA to ask for a script to be run and to get the results back: redirections, rejections, deliveries, etc. It can even do this when verifying recipient addresses at SMTP time, so that scripted rejections don't turn into accept-then-bounce.

Exim does the redirection short-cut, but it lost the filter verification short-cut when its security was improved for version 4. It's nice that my LDA design allows these performance and anti-spam features whilst remaining secure.
fanf: (Default)
A quick note that the "simple authentication and security layer" specification has been updated as RFC 4422.

(Confusingly, Erlang also has a SASL, but in that case it's the "system application support libraries".)

There's also been an enormous upload of new LDAP RFCs, numbered 4510..4533 (24 documents!).
fanf: (Default)
I came up with a new word in the pub this evening. I was trying to remember Peter Van Roy's term "definitive language", which he uses to describe a language which is the ultimate refinement of a popular, well-understood programming paradigm. He argues that Erlang, E, and Oz are working towards the definitive concurrent functional language, and I suppose that C++/Modula-3/Java/C# are working towards a definitive object-oriented language.

But the word I came out with was "eigenlanguage" not "definitive language", and I was told that this has entirely the wrong implication: an eigenlanguage must be a language that is pared down to its essentials. Scheme (not Common Lisp), Forth (not Postscript), BCPL (not C++), etc. The fun thing about eigenlanguages happens when you push them as far as you can. Write code in the lambda calculus, or if that isn't hard-core enough, SK combinators! Single-instruction machine codes! Turing machines! Cellular automata!

I have an interesting eigenlanguage in my head. The "eigen" in this one is related to the data structure (singular) that programs manipulate.

One of the ugly things about Forth is that it has two stacks, one for expression evaluation and one for control flow. It's well-known that you can get significant fun out of making function return addresses explicit (continuation-passing style, and for the serious lunatics, call-with-current-continuation), which in Forth terms means using one stack for both purposes. For more interesting data, Forth descends to low-level array-of-words machine memory.

Lisp, however has nice lists/trees/pairs/do-what-you-like s-expressions. It's well-known that you can use a singly-linked list as a stack. But this stack doesn't have to be just a linear data structure like in Forth: it can be tree-like, or loopy, or whatever you like!

So why not try defining a language with only one data structure? A Lispish directed graph, which you can only manipulate via your single stack pointer, and which you have to code in continuation-passing Forth. If done properly, it can lead to some really elegant pessimizations such as multi-instruction sequences that implement the Forth primitives dup and exch.
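As a toy illustration (not the actual language, just a Python sketch of the idea): cons cells are the only data structure, the stack is a cons list, and the Forth words dup and exch fall out as multi-step sequences of the three primitives:

```python
def cons(a, d): return [a, d]   # the single data structure: a mutable pair
def car(p): return p[0]
def cdr(p): return p[1]

# A stack is just a cons list threaded through the cdrs; None is empty.
def push(stack, x): return cons(x, stack)
def pop(stack): return car(stack), cdr(stack)

# dup: ( a -- a a ), built only from the primitives above.
def dup(stack):
    a, rest = pop(stack)
    return push(push(rest, a), a)

# exch (swap): ( a b -- b a ), again only pops and pushes.
def exch(stack):
    b, rest = pop(stack)
    a, rest = pop(rest)
    return push(push(rest, b), a)
```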

However it is too late for me to go into the details now...
fanf: (Default)
I've started paying attention to the effort to specify fully internationalized email. The main thing that is not yet i14ed is email addresses. There is now a spec for i14ed domain names, so you can have éxample.com (with the accent) if you want, but you can't yet use it in an email address - and even if you could, you still couldn't have an i14ed local part, which is a pain if you are Chinese and want your name to appear in your email address.

Unfortunately, fixing this limitation requires upgrading all of the email infrastructure. Fortunately, this means we can fix the failure to move to 8 bit email when we last upgraded the infrastructure for MIME. At the moment, if you send a message with non-ascii characters in the subject or body of a message, it gets encoded in an inefficient and ugly way in order to remain compatible with the mentality of American computing in the 1970s. Email i18n will at last fix that.
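Concretely, the ugly encoding is MIME's RFC 2047 encoded-words for headers (plus base64 or quoted-printable for bodies). Python's standard library shows what a five-character non-ascii subject turns into on the wire:

```python
from email.header import Header

# A five-character subject balloons into an ASCII-armoured encoded-word,
# something like =?utf-8?b?aMOpbGxv?= - unreadable without decoding.
subject = Header("héllo", "utf-8")
print(subject.encode())
```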

The key question is how to manage the transition gracefully and maintain backwards-compatibility with what's currently deployed. The MIME answer was to upgrade user software but leave the transport as it was. MIME messages are mostly comprehensible to users of old MUAs, and in theory the transport doesn't have to care about message content. The latter is no longer true and all kinds of email software now has to understand MIME.

The email address i18n (EAI) answer is to downgrade an i14ed message when it has to be transported to a host running old code. There's some precedent for this in the SMTP extensions for 8 bit MIME and binary MIME, and (as well as 8 bit headers) what EAI adds to these is email address downgrading.

At the moment the draft specification defines two extra parameters that are sent with an i14ed address. The badly-named ATOMIC parameter has a value which can be y or n, to indicate whether the i14ed address can be algorithmically downgraded - something like the translation from an i14ed domain name into punycode. The optional ALT-ADDRESS parameter has a value which is a trad-ascii address which can be used in place of the i14ed address. Yes, these parameters have overlapping functionality: an i14ed address with an alt-address and with atomic=y can be downgraded in two different ways. Ugh.
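For the domain part, the algorithmic downgrade already exists: IDNA maps a Unicode domain to an ASCII "xn--" form and back losslessly, as Python's built-in codec illustrates:

```python
# Round-trip a non-ASCII domain through its ASCII-compatible encoding.
# (Python's built-in "idna" codec implements the 2003 flavour of IDNA.)
ascii_form = "éxample.com".encode("idna")
unicode_form = ascii_form.decode("idna")
print(ascii_form)       # an ASCII xn--... form, safe for old software
print(unicode_form)     # the original domain, recovered losslessly
```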

I haven't seen a rationale for this - perhaps it is the result of merging two alternative proposals - but apart from being ugly it seems to me that it will be seriously problematic. So I sent the following to the IETF's EAI mailing list :-

Why not just require that downgrading is always possible? This would make both options unnecessary, and I like the idea because it reduces the number of protocol options and I can see some awkward interop and usability problems caused by their existence.

I see problems coming from the fact that the correct values for the ALT-ADDRESS and ATOMIC options must be accurately communicated from the recipient to the sender before the message is sent. In the case of a reply, how can senders extract these values from the messages they are replying to?

In other cases, I imagine that it might be possible to get the values from some structured electronic medium, such as an internationalized vcard or an ldap directory with an internationalized schema or perhaps an extended form of mailto: URI. In many cases the utf8 address will be cut-and-pasted from a document or manually typed in from paper, and there's not the slightest chance that you can expect users to understand the importance of the IMA metadata or to transcribe it correctly. Furthermore, there's no way that any automatic system in the sender's MUA or MSA can correct a mistake.

What happens when there is a mistake? Does bouncing an erroneously downgraded message have the same effect on the sender as bouncing a message because it cannot be downgraded? If a sender's address book mixes up the recipient's alt-address with one of the recipient's unrelated non-utf8 addresses, the incorrect end site may or may not handle the downgraded message sensibly.

Note that if all utf8 addresses are downgradeable, then internationalizing email addresses in vcards or LDAP or URIs etc. is relatively simple: the email address is still just a single field (without new IMA metadata); it just has a more relaxed syntax.
fanf: (Default)
http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf

On pages 11-15 this rant goes off on a tangent about branch cuts in complex arithmetic. It distinguishes between C99-style complex numbers, which have a separate type for bare imaginary numbers, and Fortran-style complex numbers, which do not. The authors assert that it isn't possible for Fortran-style complex numbers to handle signed zeroes correctly and therefore that programs written to use this style of arithmetic necessarily get branch cuts wrong.

This seems unlikely to me, if the programmer has suitable facilities for constructing or decomposing complex numbers from or into two reals. I guess they must be over-simplifying.
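For what it's worth, Python's complex type is Fortran-style (no separate imaginary type) and bears out this intuition: the sign of zero is indeed lost in a literal expression, but the two-real constructor preserves it, and the branch cut then comes out on the correct side:

```python
import cmath
import math

# In the expression -1 - 0j the imaginary part is computed as 0.0 - 0.0,
# which is +0.0: the sign is lost, just as the rant complains.
lost = -1 - 0j
print(math.copysign(1.0, lost.imag))    # 1.0: negative zero gone

# But constructing from two reals preserves the signed zero...
kept = complex(-1.0, -0.0)
print(math.copysign(1.0, kept.imag))    # -1.0: sign preserved

# ...and sqrt then lands on the correct side of the branch cut along
# the negative real axis: approaching from below gives -i, not +i.
print(cmath.sqrt(kept))
print(cmath.sqrt(-1 + 0j))
```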
fanf: (Default)
It seems that ISO 3166 has been changed so that it is now consistent with the DNS: Jersey, Guernsey, the Isle of Man, and the UK now have their own country codes and are no longer bundled together in the catch-all "GB". This means that the GBR annotations in my passport are obsolete.

RIP ftp.dra.hmg.gb.

Link: http://www.circleid.com/posts/gg_im_je_iso3166_good_bye_gb/

AfNOG

2006-05-12 19:41
fanf: (Default)
This is the seventh AfNOG. The workshop side of it - a week of intensive teaching - seems to have grown out of the INET workshops that started back in the 1990s. It's also associated with the "network startup resource center" which grew from Randy Bush's efforts to get the Internet going in the third world. The funding is interesting: it comes from a variety of international development quangos, with a substantial chunk from the US NSF - ostensibly so that US scientists have better infrastructure to support their research abroad. A surprisingly large number of the protagonists work at the University of Oregon Computing Center.

The network held up relatively well, though when the ISP realised how much we were using it, the "nominal" 512kb/s link became strictly 512k, and the queueing delays rose into the seconds. I have also had some irritating problems with the wireless support on this laptop - possibly dodgy drivers in the slightly dodgy FreeBSD-5.3.

My Exim bit seemed to go quite well. We got through the material somewhat faster than I expected, based on what Philip Hazel told me, and the students mostly made swift work of the exercises. There were some good questions about how to deal with spam and viruses, and about how to put together larger email systems. I talked about the current "new" Hermes architecture, and my fellow instructor, Joel Jaeggli from the University of Oregon, talked about their email architecture, which is similar to the old Hermes system.

By the end of it I was pretty zonked, and my throat was slightly sore. I clearly need to learn how to project my voice properly...

I'm leaving tomorrow night, but AfNOG will continue with some half-day tutorials followed by a conference.

Kenya

2006-05-12 19:21
fanf: (weather)
The people here (in the hotel) are friendly, quick to smile and greet you. The porters have a slightly queer habit of leading people to their rooms by the hand.

There are some characteristic local dishes: one that turns up in the hotel buffets quite frequently is mostly shredded spinachy greens, lightly cooked - actually all the veg here is nicely crunchy. There's also a stodge that looks like mashed potato, but which is made from cornflour, and a green beany stodge. The meat is traditionally overcooked and badly butchered, and frequently goat.

Last Friday we went to a local eatery which did the local food in the local way, with local timing. We left the hotel at 20:00ish and ate at about 22:00ish... According to one of the African instructors, "the Europeans have the clocks but the Africans have the time." The meal there started with a fairly simple soup, which was described as "bone soup", i.e. it was essentially just stock. Later on I tried the goat. It was OK, compared to the beef and chicken.

Yesterday we went to a tourist trap called "Carnivore". The meal was almost entirely barbecued meat. (I am consequently even more vegetarian than usual for the next few days...) I ate lamb, pork, beef, chicken, ostrich (which was the best by a long way), crocodile, and camel (nasty). They also served us a cocktail called "dawa" which is apparently Swahili for "medicine". It was lime, vodka, and honey, with ice - so not that different from the whiskey, lemon, and honey that we drink hot.

I still haven't seen very much of the place. However we are going out tomorrow morning at 06:30 for a quick safari, which I am looking forward to. I am currently having a wee dram with the other whisky drinkers before an early-ish bed.
fanf: (weather)
The classrooms are close to being ready now that the PCs are no longer crammed into the staging area where they had FreeBSD installed. The classroom switches are being configured and the last bits of network are being deployed.

There was an interesting discussion at breakfast which I mostly just listened to. Apparently, the Nigerian government is keen to set up one or more local Internet exchange points - but it turns out that it won't be an IXP in the usual sense (organized as a multilateral agreement between ISPs) but a government-run transit provider that buys and sells international bandwidth. Furthermore, the idea came from the government's security people, and there's a background plan to funnel Internet traffic through this organization so that it can be monitored more easily. Pity.

We also talked about PACNOG, which will happen soon. This is the Pacific Islands analogue of AfNOG, and it turns out the ISPs in these sets of countries have similar difficulties - satellite links, poor resources, lack of knowledge.

A lot of the western instructors here are heavily involved in NOG workshops around the world, and the African instructors are key people in their countries' technical Internet communities.

Kooks

2006-05-04 20:19
fanf: (silly)
I note that the prolifically incompetent "mathematician" Eugene Terrell has recently submitted a couple of new Internet-Drafts on the topic of being able to fit more than 128 bits of numbers into 32 bits of space. Go to http://www.watersprings.org/pub/id/index-t.html and look at draft-terrell-* ...
fanf: (weather)
Our "localhost" spent most of today struggling to get shipments through customs, including the routers for {E,F}2 and various other equipment. Most of the instructors are here now, and we are a very international crowd: USA, Canada, Australia, New Zealand, Chile, France, Denmark, Nigeria, Ghana, South Africa, and I'm sure I've forgotten some. A few people represent several countries by themselves :-)

http://www.ws.afnog.org/pictures/index.html

Amanda's pictures from today (4th May) include one of Y.T. editing slides and handouts.

The Guinness brewed under licence in Kenya is 6.5%. The coffee is still excellent.
fanf: (weather)
Weather: comfortable, high teens to low twenties celsius. Warmer when the sun comes out, which is rarely. Thunder and rain. Nairobi is only a little way south of the equator but it's nearly 1700m up, hence the somewhat temperate climate.

The hotel is quite civilized, with a nice restaurant and good coffee :-)

This morning we had meetings to decide the curricula, which get adjusted from year to year as the Internet and its software evolves. There are now four tracks: F2 is the French routing track, E2 is the English routing track, E1 is the services track (which I am helping with), and E0 is the Unix basics track. (E0 was introduced last year because many students did not have the background knowledge necessary for E1 and E2.)

E1 is roughly as follows. First 1.5 days of DNS then 0.5 days of web stuff. This year we are ditching Squid in favour of more about authentication, specifically RADIUS. Then 1 day of SMTP (Exim, presented by me) and 1 day of POP&IMAP. Finally half a day to consolidate the security stuff that has been done in the course of the previous subjects - SSL certificates and so forth.

I am going to have to adjust Philip Hazel's teaching materials so that we can do SMTP AUTH backed with RADIUS. This should be nice and easy, since FreeBSD (our Unix of choice) comes with the libradius client library, which Exim supports.
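The authenticator will presumably have roughly this shape - a sketch only, since the exact condition syntax and expansion variables depend on the Exim version, and it assumes an Exim binary built with RADIUS support pointing at the system's libradius configuration:

```
plain_radius:
  driver           = plaintext
  public_name      = PLAIN
  server_condition = ${if radius{$auth2:$auth3}{yes}{no}}
  server_set_id    = $auth2
```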
fanf: (Default)
At AfNOG. Network connectivity is pretty good, and not as bad as the RTT might suggest.
traceroute to hermes.cam.ac.uk (131.111.8.59), 64 hops max, 40 byte packets
 1  soekris.ws.afnog.org (196.200.222.2)  2.858 ms  2.514 ms  15.464 ms
 2  196.216.67.253 (196.216.67.253)  4.517 ms  3.173 ms  7.452 ms
 3  217.21.112.4.swiftkenya.com (217.21.112.4)  3.905 ms  3.873 ms  4.796 ms
 4  193.220.225.5 (193.220.225.5)  9.375 ms  4.426 ms  5.115 ms
 5  NO-NIT-TN-7.taide.net (193.219.192.7)  530.815 ms  530.397 ms  530.574 ms
 6  NO-NIT-TN-5.taide.net (193.219.193.145)  533.882 ms  530.595 ms  530.928 ms
 7  nb01b11-pos2-0.nb.telenor.net (217.70.230.45)  531.592 ms  533.452 ms  533.100 ms
 8  nb01b12-ge3-3-0.nb.telenor.net (217.70.228.21)  631.788 ms  678.357 ms  570.329 ms
 9  nb21b12-pos4-2.nb.telenor.net (217.70.227.34)  566.925 ms  615.756 ms  556.670 ms
10  linx-gw1.ja.net (195.66.224.15)  556.974 ms  556.369 ms  555.574 ms
11  po1-1.lond-scr4.ja.net (146.97.35.129)  555.848 ms  555.910 ms  583.818 ms
12  po0-0.lond-scr.ja.net (146.97.33.33)  557.343 ms  564.451 ms  555.941 ms
13  po0-0.cambridge-bar.ja.net (146.97.35.10)  560.323 ms  560.377 ms  561.087 ms
14  route-enet-3.cam.ac.uk (146.97.40.50)  562.861 ms  559.136 ms  559.810 ms
15  route-cent-3.cam.ac.uk (192.153.213.194)  559.727 ms  562.341 ms  559.504 ms
16  hermes-v.csi.cam.ac.uk (131.111.8.59)  559.562 ms  559.973 ms  567.100 ms
fanf: (Default)
A quick note that the email message submission specification has been updated as RFC 4409.
fanf: (Default)
Next week I am going to Nairobi for AfNOG, the African Network Operators Group 2006 event. A significant part of this is a workshop teaching the technical skills needed to run an ISP. I'll be going in place of Philip Hazel to teach people about Exim.

One thing I noticed this morning is that the credit cards that I have been issued this year do not have magnetic stripes: they are chip-and-pin only! This is rather inconvenient for international travel. Fortunately I still have an older credit card with a stripe - though it is one we want to get rid of - and my bank is sending me a replacement card which might arrive before I leave.

Anyone else who is planning to travel on new credit cards should check this. Of course banks like you to warn them beforehand anyway, so that their fraud detection systems are prepared.
fanf: (Default)
I sent this to the chat-users list yesterday...

A few people have been wondering what's been going on since my last message in January, so here's an update.

We got approval from the IT Syndicate at the end of February, so the Chat service will be going ahead as planned. Our Hostmaster staff did some tests to ensure that SRV records in the cam.ac.uk DNS would not be a problem, and they passed OK. My colleagues in Unix Support arranged the purchase of two computers to host the Chat service (as part of a larger order) and racked them in the machine room.

During that time I have mostly been working on email-related things, hence the lack of news... However I have been keeping an eye on the Jabber mailing lists. The discussions there have helped me to decide which server software to use. I originally selected jabberd-2, which is nice and clean, but is not quite finished and development is slow. What's worse is its reputation for flakiness and poor integration with multi-user chat. I also considered jabberd-1.4, which is still being actively developed, but it is crufty and lacks some important XMPP features such as privacy lists. Then in February, the JSF announced that jabber.org (the first Jabber server) had switched to ejabberd-1.0, the Erlang Jabber server. It is featureful, implements all of XMPP, and gets some serious performance, clustering, and robustness technology from Erlang.

This week I have been getting to grips with Erlang, and next week I'll tackle Unix Support's new Linux installation service so that I can get Wogan and Parky (the Chat service computers) running. After that I'll be away for two weeks in Kenya, teaching people about Exim. After that I hope to get a prototype Jabber service running.

As ever, there are more details on the Chat wiki at https://wiki.csx.cam.ac.uk/chat/ (for those with raven accounts)
fanf: (Default)
I need a name for a virtual chat host, to go with wogan and parky, my physical hosts. So far I have:
  • Max Headroom - just a bit too long
  • Ananova - virtual newsreader - too confusing
  • Wildfire - virtual mobile phone operator
Any better ideas?
fanf: (Default)
Mike O'Dell's alternative IPv6 architecture, known as 8+8, is neat. His idea of NATting at site boundaries as a matter of course is something I have been wondering about. Nice to see that someone has already worked it out in detail.

http://www.watersprings.org/pub/id/draft-odell-8+8-00.txt

Previously: http://fanf.livejournal.com/53662.html

Update: 8+8 later became the "Global/Site/End-system" addressing architecture: http://ietfreport.isoc.org/idref/draft-ietf-ipngwg-gseaddr/
fanf: (Default)
This message contains a good rant about single sign-on:

The fact that "users don't necessarily want to have to manually authenticate each time some service wants authentication" is not the reason we want to promote single sign-on. We don't want the user to manually authenticate every time because doing so trains the user to supply their credentials so frequently that they will not think it is strange when they are asked to provide them to an attacker. The only way to prevent phishing attacks is to train users that they authenticate only in a small number of circumstances that rarely occur.
fanf: (weather)
[livejournal.com profile] dotaturls had a bit of a hiatus recently because LJ decided it was "too big". Given that it has most of the things I thought worth noting in the last four years, this is not unreasonable! Anyway, the last week's links spooged through this morning and all is now well.
fanf: (Default)
On Wednesday I asked about alternatives to window-based protocols. An answer is "rate-based". For example, see this paper from 1995 which discusses congestion control in ATM.

Of course TCP has an implicit measure of its sending rate, because it maintains both a congestion window and an RTT, so its average sending rate is cwnd/rtt - but this is not explicit in the way it is for ATM. And in fact TCP's rate can be very bursty and is often unstable in adverse conditions. However the cwnd does give you a direct measurement of the amount of buffering you need in case it is necessary to retransmit lost packets.

Edit: There's a really nice summary of the burstiness of TCP in this paper which brilliantly turns what is commonly viewed as a disadvantage into a benefit for dynamic traffic splitting.

Someone on #cl pointed me to XCP, which is a really clever congestion control protocol; it has been implemented to work under TCP but could in principle work with any unicast transport protocol. One thing that I quite like about it is that it agrees with my intuition that the Internet would work better if routers and hosts co-operated more (one two). In XCP, senders annotate their packets with their current RTT and cwnd parameters (measured by TCP in the usual way) plus their desired cwnd increase. Routers along the way can adjust the increase according to how busy the links are, and even make it negative if there is too much traffic. The receiver then returns the feedback to the sender in its acks.

The really brilliant thing is that routers do not need to keep any per-flow state - compare the ATM paper above, which does require per-flow state and is therefore hopelessly unscalable. Imagine the number of concurrent flows and the flow churn rate of a router on the LINX! Furthermore, XCP separates aggregate efficiency from fair allocation of bandwidth between flows. The fairness controller can implement different policies, which could allow for more precise QOS guarantees, or bandwidth allocation based on price paid, etc.

Really really cool, but really really hard to deploy widely. It's the kind of thing that I suppose would fit in with David D. Clark's Future Internet Network Design research programme, which is supposed to come up with new ideas for the long term. "To do that, you have to free yourself from what the world looks like now."
fanf: (Default)
http://www.psc.edu/networking/projects/tcptune/

This article says TCP is "window-based" in a way that implies that there are alternative models. Are there alternative models?
fanf: (Default)
Forgot to mention these last time:
  • Uniform handling of queries across the service: help-desk, postmaster, webmaster, *-support. Access to support queues by all service staff.
  • Webify managed mail domains interface. Perhaps integrate with MZS?
  • Once MMDs and lists are fully webified, deprecate the Hermes terminal interface and menu system in favour of PWF Linux. Benefits would include a non-crippled Pine installation. We should move Pine configurations from the local filesystem to the IMAP server first.
  • More effective use of RSS/Atom. Could be a side-effect of a blogging/forum service, or a veneer on our existing information feeds as in my quick hack http://canvas.csi.cam.ac.uk/atom/
fanf: (Default)
According to Our Glorious Leader, the Computing Service's director, March is "strategy month" and we are supposed to come up with ideas. So, this is my long-term wish list. No restraint has been exerted in its compilation :-)
  • Get the Chat service deployed!
  • Hermes projects:
    • Re-do and improve the anti-spam and anti-virus infrastructure. This has three parts:
      1. Replace MailScanner with exiscan, so that we can scan messages at SMTP time, and reject those we don't like instead of discarding them. If we don't want to rely totally on ClamAV we'd need to replace McAfee with a better commercial AV scanner, because McAfee is not fast enough for non-batch-mode scanning; it's also pretty bad at promptly issuing new signatures. This would allow us to have a default SpamAssassin threshold (without the lost-email consent problems caused by the way MailScanner works) which in turn would help with our AOL problems.
      2. Allow per-user options at SMTP time, e.g. spam threshold, stricter sender verification, greylisting, etc. This requires some data-shifting infrastructure which I have made a small start at, and user-interface extensions to Prayer.
      3. Add a new component for per-user statistical spam filtering. This would live on the message store (with other per-user state), and hook into message delivery for initial filtering, and copying to or from the spam folder for training. Again we need user-interface changes to Prayer to present the results of the filter and allow users to train it - similar functionality to full MUAs like Mac OS X Mail or Thunderbird.
    • Shared folders. Requires a new IMAP proxy (like the Cyrus Murder) and changes to the replication system. Does Cyrus 2.3 have the necessary functionality?
    • Internationalized webmail. Replace (or run alongside) Prayer with some better-known software?
    • Finish the ratelimit project. Will require time to work on a user interface, so that it is sufficiently friendly. My current idea for how it would work requires the double-bounce patch mentioned below.
    • Deploy DKIM (server-side message signatures for anti-spam).
  • Exim projects:
    • Improved hints DB infrastructure. The aim is to remove the performance bottleneck caused by whole-table locking, and make the back-end more pluggable so that we could have distributed hints databases. PPswitch would then act as a cluster with the machines sharing knowledge about other hosts on the net.
    • Concurrent DNS lookups during routing, to improve mailing list performance. I've done some work on this already.
    • Better handling of double bounces. If we can't deliver a bounce to recipient@domain, try postmaster@domain then postmaster@cam instead of freezing the message.
    • Finish the bounce address protection code.
  • Service integration:
    • Kerberos. Proper single-sign-on! Users really hate typing in passwords.
    • Raven-authenticated webmail for more single-sign-on.
    • Mailing lists generated from Lookup groups. Lookup groups generated from data in CamSIS/CHRIS.
    • Web presence: an extension to the chat service which allows other Raven-authenticated sites (such as lookup and webmail) to display a presence icon next to a username. If I am looking at your lookup page and you are on my chat buddy list, then I can see whether you are currently on-line and I can click on the presence icon to chat with you. Depends crucially on transparent single-sign-on. In particular, the default for Raven at the moment is confirmed sign-on, which isn't transparent enough.
    • Short URL service, like go.warwick.ac.uk, which doesn't host content but instead redirects to full URLs. Mainly for use on printed material such as business cards, academic papers, posters, press releases, etc. Namespace divided into ~CRSID for homepages (using data from lookup) and ad-hoc names managed by user-admin. Could include societies.
  • Web stuff, etc.:
    • Blogging service.
    • Web forums.
    • The two could be aspects of the same thing (e.g. Livejournal's blogs and communities).
    • Society information in Lookup, similar to the institution information.
    • More effective use of talks.cam.ac.uk.
    • Shared calendars / meeting manager.
  • Networking:
    • Make it easier for depts and colleges to host conferences. e.g. at the Durham UKUUG I was given a short-term uid which gave me access to their PWF and wifi.
fanf: (Default)
One of the things that made the Bourne Shell notorious (apart from the obvious Bournegol) was its memory management. Its allocator just walked up the address space as needed. If it hit the end of its mapped memory, it would SEGV, drop into its signal handler, ask the kernel for more memory, and return. This caused inordinate problems later when Unix was ported to systems with less amenable signal handling. (See OpenSolaris for a slightly cleaned-up Bourne shell.)

Larry Wall's 1987 IOCCC winner goes something like:
    extern int fd[];
    signal(SIGPIPE, start);
    pipe(fd);
    close(fd[0]);
    for(;;)
        write(fd[1],ptr,n);
The write serves only to trigger SIGPIPE, and the signal handler is repeatedly changed in order to implement the flow of control (which the judges called "amazing").

I thought it would be amusing to combine these ideas.
    extern char *prog[];
    extern int pc;
    for(;;)
        ((void(*)(void))dlsym(NULL,prog[pc]))();
This looks like a pretty boring interpreter if you assume that the prog consists entirely of built-in function names. But what if it doesn't? Then dlsym() will return NULL and the program will SEGV. The signal handler can then push a return address on the interpreter stack, change pc to point to the user-defined function definition, and longjmp() back to the loop.

(Aside: you can have symbolic builtins like +-*/ if you choose function names that have the same ELF hash as the symbols you want, then run sed over the object code to change their names to symbols.)

The problem is that dlsym() isn't very portable - certainly not enough for the IOCCC. So we need a better way of constructing a symbol table, and preferably without having to mention the function name loads of times (e.g. as in Larry's program). The problem is that you can't interleave the array initializers with the function definitions (at least not without non-portable hacks like linker sets). But how about this:
    #define BUILTIN(name, code) \
        void name(void) code { #name, name },
    
    BUILTIN(add,{
        int b = pop();
        int a = pop();
        push(a + b);
    }
    
    BUILTIN(sub,{
        int b = pop();
        int a = pop();
        push(a - b);
    }
    
    BUILTIN(exch,{
        int b = pop();
        int a = pop();
        push(b);
        push(a);
    }
    
    struct {
        const char *name;
        void (*code)(void);
    } builtin[] = {
          )))
        { 0,0 }
    };
fanf: (Default)
http://www.interesting-people.org/archives/interesting-people/200603/msg00179.html

Russ Nelson <nelson@crynwr.com> says "I looked into [tax-deductable code donations] in my role as the executive director of the Public Software Fund. If you own copyright on a work, and you donate that work to a 501(c)(3) [non-profit organization], you can deduct the fair market value of that work [from your US taxes]."

Notable relevant organizations include the FreeBSD Foundation, the Apache Software Foundation, and the Free Software Foundation. The recipient must be an organization that will formally take ownership of the copyright from the donor.

Russ continues: "The difficulty here is that you need to determine the fair market value. [...] So what would a contribution to an existing open source project be worth? [...] It seems to me arguable that you can deduct whatever you could sell the first copy for. When Cygnus Software was an independent company, they would sell the first copy of a gcc port for six or seven figures. A comparable price would be the price of full ownership of any comparable piece of software. A less reliable valuation would be the salary of someone paid to do a comparable work-for-hire."
fanf: (silly)
Some amusing or odd features:
  • Blasphemy and racism are forbidden, along with the usual porn. This probably covers some fortune(1) jokes.
  • You can get permission to look at porn for research purposes, but there is no such exemption for reverse engineering software (so no software security research).
  • You are not allowed to "access, download, store, or process" commercial spam, or anything that causes "annoyance, inconvenience, or needless anxiety".
  • Durham ITS takes a cut of research grants which have provision for computing costs.
  • Durham claims all "intellectual property" you create and must be involved in any commercial exploitation.
  • For the purpose of these regulations, "any remote IT facilities are deemed to be [Durham] University IT facilities".