fanf: (Default)
Thanks for everyone's comments on yesterday's post. Since there's a fair amount of overlap I thought I'd write a consolidated reply here.

Regarding the target audience, [livejournal.com profile] claerwen is right that it is aimed at Computer Officers and other people involved in computer support, plus knowledgeable users or those with unusual requirements who need to know. Obviously most users won't have the time or the inclination to familiarize themselves with all our documentation - see the profusion of links under http://www.cam.ac.uk/cs/email - so we rely on the techlinks to do that for them :-)

[livejournal.com profile] stephdiary: Yes, they will know about the CUDN.
[livejournal.com profile] hsenag: We already have more friendly documentation for our average end-users.

Regarding the various pedant points, I'm being careful (too careful?) in my use of language but the distinctions I am making are often not very important to the reader. So although I make a distinction between message submission servers and smart hosts, for most purposes they can be treated as synonyms for "outgoing SMTP server".

[livejournal.com profile] mouse262: Good point about Google Mail (though they seem to think that my browser isn't in the UK). I've avoided the problem by using the example of a visitor from Oxford who uses Herald, instead of advertising a competitor :-)
[livejournal.com profile] bjh21: Yes, that sentence is a mess, and I probably shouldn't try to editorialize in that way. I have neutralized it.
(Anonymous): I think your suggestion of a summary page would mostly be addressed by a table of contents when it gets HTMLified. The details of configuration for Hermes users aren't really in scope for this document (because they're covered elsewhere), but you are probably right that front-loading the document with waffle isn't helpful so I have re-ordered it.
The actual limit is currently more than 60 but I used that figure because it's a good point to start worrying about the amount of traffic you are generating. At least Unix boxes will generally get SMTP retries right.
fanf: (Default)
Following my earlier rant, I thought that this Bugtraq message was noteworthy.
fanf: (Default)
I'm preparing a document that gives advice on sending email from within Cambridge University. The aim is to correct some confusion about the difference between our message submission server and our smart host, which has not mattered much in the past but will when we disable unauthenticated access to the former. I'm also documenting the forthcoming rate limiting restrictions.

Any comments or questions are welcomed. It'll be announced properly in due course...

http://www.cus.cam.ac.uk/~fanf2/hermes/doc/misc/sending-from-cudn.txt
fanf: (Default)
On Tuesday this week I spent some quality time with the Berkeley DB documentation. The motivation for this was to write a script that would reliably back up a Berkeley DB transactional database. This is simple, because Berkeley DB comes with the necessary utilities, which work independently of whatever client application is built on top of the database: you just checkpoint, archive, and remove cruft. Recovery is supposed to be similarly simple.

However I ended up getting distracted by various funky features, like replication, distributed transactions, running Berkeley DB as an RPC server, etc. One of the things that particularly caught my eye was the page about nested transactions. The documentation describes them as being useful for decomposing large transactions into smaller parts, so that if a child transaction encounters a conflict it can abort without having to unwind the whole parent transaction. This allows complex transactions to proceed even in the face of heavy contention.

However what the documentation doesn't say is that nested transactions make it easier to create simple library APIs without compromising flexibility. A library which provides a function implemented using a Berkeley DB transaction can simply provide a parameter for a parent transaction, then code which uses the library can call the same function in isolation or as part of a larger transaction. In the absence of nested transactions, the library will probably want to provide two versions of the function: one with the rubric for starting, retrying, and committing the transaction, and a raw guts version for use inside larger transactions. This breaks the library's abstraction.

This reminded me of a paper I read in about February this year, called Composable Memory Transactions. The paper was written by two members of the Haskell mafia and two members of the STM mafia (who were all working at MICROS~1 Research in Cambridge at the time), and it combined the two in a mind-blowingly beautiful way.

Software Transactional Memory is mostly a subset of a concurrent programming technique known as lock-free programming. Traditionally, concurrent programs have used locks to ensure that changes to data structures appear atomic - the data moves directly from one consistent state to another, from the point of view of other threads in the program. It is difficult to program correctly with locks, and they do not scale well as the complexity of the program or the number of processors increases. The aim of lock-free programming is to create concurrent data structures which can make effective use of large multi-processor machines, which are becoming increasingly common - most high-end CPUs nowadays are multi-core. It uses the processor's atomic operations (e.g. compare-and-swap) to manipulate the data structure directly. Keir Fraser's PhD thesis "Practical Lock Freedom" contains several examples and is really nicely written.

However lock-free programming at this level is hard: it's possible for a really studly programmer to implement a reasonably simple data structure such as a skip list, but it isn't feasible for mortals to implement lock-free versions of the kind of ad-hoc structures you get in large programs. So we need a higher-level way of programming in this style. What software transactional memory does is to encapsulate the complexities of lock-free programming in a general-purpose way, so that it looks like the transaction processing that is familiar from database programming. You start a transaction, manipulate the data structure, and then commit (or retry if there was a conflict), and the changes you made appear atomic to the rest of the program. Nice and easy.
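
In GHC Haskell (the language the paper uses) that looks something like the following minimal sketch of my own, though atomically, TVar, readTVar and writeTVar are the real STM primitives:

    import Control.Concurrent.STM

    -- Move n units from one shared counter to another. No other thread can
    -- ever observe the intermediate state where the amount has left `from`
    -- but has not yet arrived at `to`.
    transfer :: Int -> TVar Int -> TVar Int -> STM ()
    transfer n from to = do
      x <- readTVar from
      writeTVar from (x - n)
      y <- readTVar to
      writeTVar to (y + n)

    main :: IO ()
    main = do
      from <- newTVarIO 10
      to   <- newTVarIO 0
      atomically (transfer 3 from to)   -- commits as a unit, retried on conflict
      atomically ((,) <$> readTVar from <*> readTVar to) >>= print   -- (7,3)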

What the Composable Memory Transactions paper did was to turn STM into a general-purpose IPC mechanism. Inter-process communication (including inter-thread communication) generally involves two things: data transmission and rendezvous. STM takes care of data transmission via shared data structures, but it does not necessarily provide a mechanism for a thread to block until some condition is satisfied. For example, in a producer-consumer setup, if the queue is empty the consumer needs to wait until the producer puts something in it.

The way Haskell's STM supports blocking is with the retry operator. If the conditions aren't right for the thread to complete its transaction (e.g. the queue is empty), it calls retry. This doesn't retry immediately - that would be pointless - instead it aborts the transaction and waits until something has changed which may allow the thread to complete the transaction successfully. The STM system has a pretty good idea of which somethings to keep an eye on, because it was keeping a transaction log of the variables that the thread read during the transaction, so it can block the thread until some other transaction writes to one of these variables.
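
Here's a minimal sketch of the producer-consumer case, using a TVar holding a list as the queue (the paper uses a proper queue type, but the blocking behaviour is the same):

    import Control.Concurrent.STM

    type Queue a = TVar [a]

    -- Block until the queue is non-empty, then remove and return its head.
    takeItem :: Queue a -> STM a
    takeItem q = do
      xs <- readTVar q
      case xs of
        [] -> retry                     -- abort; sleep until q is written to
        (x:rest) -> do
          writeTVar q rest
          return x

    -- Append an item; a consumer blocked in takeItem becomes runnable again.
    putItem :: Queue a -> a -> STM ()
    putItem q x = do
      xs <- readTVar q
      writeTVar q (xs ++ [x])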

Blocking operations like this are usually the enemies of compositionality. The paper uses the example of two C functions which use select() internally to wait until data is available on a socket; a program that uses these functions can't easily compose them into a higher-level function which waits for either of the inner functions to succeed. The Haskell STM system fixes this problem with an orElse combinator which composes two child transactions. If the first one calls retry, the second is run. If both of them retry then the combined transaction retries. If there is no outer orElse then the thread blocks until any variable read by either of the child transactions is modified.
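
Reusing takeItem from the sketch above, waiting on whichever of two queues has something first is a one-liner:

    -- Try q1; if it retries (i.e. is empty), try q2; if both are empty the
    -- combined transaction blocks until either queue is written to.
    takeEither :: Queue a -> Queue a -> STM a
    takeEither q1 q2 = takeItem q1 `orElse` takeItem q2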

A little thing that really excited me was one of the examples in the paper that showed how you could use these combinators to turn a blocking operation into a polling operation by wrapping it in a function. (With more familiar IPC systems only the opposite is possible.) The following tries a transaction, and if it succeeds immediately it returns true; if the transaction attempts to block, it returns false.
    tryTXN t = do { t; return True }
               `orElse` return False

When I first read the paper I thought it would be cool if the same neat programming interface could be applied to data on disk as well as in memory, via a transactional database such as Berkeley DB; however I didn't pursue the idea. Having been reminded of it recently, I have thought about it a little more - it helps that I know a bit more about Berkeley DB's more advanced features now than I did then. In particular, this style of composable transactions requires support for nested transactions: each alternative of an orElse is run as a nested transaction, so that if it calls retry it can be rolled back independently of the parent transaction, and the other alternative sees the data structure as it was before the orElse.

What else is required? We need to record which database entries are read during a transaction, and we need to be able to block until any of them are written. Now this could be done as a wrapper around the Berkeley DB calls, but that is only sufficient if every program accessing the database uses these wrappers. This is not so good for compositionality. It's also duplicating work, because Berkeley DB's transaction log already records which database entries were read. So I think the feature should be built into Berkeley DB.

I suggest the addition of a DB_TXN->wait method. This unwinds the transaction's log, releases all its locks, and waits until another transaction modifies a database entry that this transaction read. Then it returns DB_RETRY, which indicates that the caller should retry the transaction as if it were recovering from a deadlock. However, if this is a child transaction, DB_TXN->wait does not release the locks and block, but instead it frees the transaction after unwinding it and returns DB_TRYOTHER. This indicates that the caller should return to its parent, which can try an alternative child transaction. When the parent runs out of alternatives, it calls DB_TXN->wait which will either block or tell the parent to return to its parent. In a similar manner to Berkeley DB's existing handling of nested transactions, when DB_TXN->wait returns DB_TRYOTHER, the log of database entries which will unblock this thread is merged from the child transaction into the parent transaction, so that when it blocks, any change that might cause a retried child to succeed will wake up the parent. This is at a somewhat lower level than the Haskell STM, but this is consistent with Berkeley DB's existing APIs, which, for example, require that you explicitly code loops to retry transactions which conflict.

Berkeley DB already has some limited support for IPC rendezvous, but only for producer-consumer relationships implemented via the queue access method. If you call DB->get with DB_CONSUME_WAIT then your thread will block until entries are added to an empty queue. So there is at least a little of the necessary infrastructure to implement the much more general DB_TXN->wait.

All that remains is for someone to write the code...
fanf: (Default)
I've made a few changes to yesterday evening's post about mergesort. I've fixed the algorithm so that the sort is stable. I've added an initial section about where the bottom-up merge sort comes from. And I've adjusted the introductory paragraph to be less rude to Simon :-)
fanf: (Default)
[livejournal.com profile] simontatham has a description of his really nice mergesort algorithm for linked lists which sorts in place with O(1) overhead and only uses forward links. Unfortunately he has described it in the Knuth style - pre-optimised assembler translated into prose - which is possibly the worst way of describing algorithms. So when I implemented it, I only got a rough idea from Simon's description, then I went through a process like the following to reconstruct his algorithm from a simple starting point. Happily, the solution I ended up with was the same as Simon's.

The most basic merge sort goes like this:
    split the list in two
    sort each half
    merge the halves
    ...
    profit!
This requires O(log N) space overhead for the recursive calls in step 2. However it is possible to eliminate the recursion and turn mergesort into an iterative algorithm, and thereby avoid the space overhead. (This is a fairly standard optimization.)

The recursive calls can occur in any order, so for the purposes of illustration, imagine that they occur at the same time, and imagine the call sequence that this produces. The algorithm divides into two phases: a recursive subdivision phase where we split the lists and perform recursive calls; and an accumulation phase where we merge pairs of lists and return upwards. Note that the first phase is not dependent on the data at all: every list of the same length is divided the same way, and at the end of the phase we always end up with a collection of singleton lists. The first phase also gives us some book-keeping information to co-ordinate the merges. But the structure of this information is so simple that we can skip the recursive subdivision and do the book-keeping on the fly while we perform the merges.

This leads to the following bottom-up merge sort:
 1   sort(list) =
 2     for n := 1 step n := n * 2
 3       output := []
 5       for merges := 0 step merges := merges + 1
 6         (a,list) := take(n,list)
 7         (b,list) := take(n,list)
 8         break if empty?(a)
 9         output := append(output,merge(a,b))
10       loop
11       list := output
12       return(list) if merges <= 1
13     loop

In the outer loop (lines 2,13) we treat the list as a sequence of sub-lists of length N, which starts at one and doubles each time round the loop. For each N we make a pass over the list. Inside the inner loop (lines 6,7,9) we take a pair of short lists (which are sorted) and merge them into a double-length sorted list ready for the next pass. The merge() function is standard.

Lines 5,8,12 deal with termination conditions: If there was nothing in the list to move to A, then the inner loop has completed its pass. If we only performed one merge then the list is now completely sorted.

The above version is not very efficient, because there are lots of avoidable O(N) operations. In particular, we are repeatedly appending to the output list as we build it up from the head to the tail. This is easy to optimize away by maintaining a tail pointer which keeps track of where appends occur. The merge operation also wants to build up a list from the head to the tail, so it can be altered to take care of the append too.
 1   sort(list) =
 2     for n = 1 step n = n * 2
 3       output := []
 4       tail := &output
 5       for merges := 0 step merges := merges + 1
 6         (a,list) := take(n,list)
 7         (b,list) := take(n,list)
 8         break if empty?(a)
 9         tail := append_merge(tail,a,b)
10       loop
11       list := output
12       return(list) if merges <= 1
13     loop
14
15   append_merge(tail,a,b) =
16     (a1,a) = take(1,a)
17     (b1,b) = take(1,b)
19     while not null?(a1) or not null?(b1)
20       if null?(b1) or (not null?(a1) and a1 <= b1)
21         tail := append1(tail,a1)
22         (a1,a) = take(1,a)
23       else
24         tail := append1(tail,b1)
25         (b1,b) = take(1,b)
27     loop
28     return(tail)

In the combined append & merge function, we loop until both lists are empty (i.e. both of the variables we used to store the head of a list are undefined - line 19). We add the smaller of the two head variables to the tail of the output list if both are defined, otherwise we just add the rest of the remaining list.

Note that the comparison on line 20 is less-than-or-equal, not strictly-less-than. This is important for ensuring that the sort is stable, i.e. that equal elements in the input list appear in the same order in the output list. Since elements of A came before elements of B in the input list, if the two heads are equal we should add the head of A to the output before the head of B.

The above version still has some slack. In particular it scans the elements in A and B twice, once in the call to take() when creating the lists, and once when merging them. We have to scan A before the merge so that we can scan A and B together during the merge, but this isn't necessary for B. Instead we can merge A with the first N elements of the list directly.
 1   sort(list) =
 2     for n = 1 step n = n * 2
 3       output := []
 4       tail := &output
 5       for merges := 0 step merges := merges + 1
 6         (a,list) := take(n,list)
 8         break if empty?(a)
 9         (tail,list) := append_merge_take(tail,a,list,n)
10       loop
11       list := output
12       return(list) if merges <= 1
13     loop
14
15   append_merge_take(tail,a,b,n) =
16     (a1,a) = take(1,a)
17     (b1,b) = take(1,b)
18     n := n - 1
19     while not null?(a1) or (not null?(b1) and n >= 0)
20       if (null?(b1) or n < 0) or (not null?(a1) and a1 <= b1)
21         tail := append1(tail,a1)
22         (a1,a) = take(1,a)
23       else
24         tail := append1(tail,b1)
25         (b1,b) = take(1,b)
26         n := n - 1
27     loop
28     return(tail,b)

The main difference here is that the tests for the end of the B loop on lines 19,20 are more complicated, because we also have to check for running out of N. Note that we decrement N whenever we take an element from B (lines 17,18 and 25,26).

This is pretty much the algorithm that Simon describes. However he skips the elements of A rather than taking them from the list, which adds complexity to the merge function's end-detection for A similar to that for B. An in-place implementation of take() would have the same effect.
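
For comparison, here is the same bottom-up strategy over ordinary immutable Haskell lists. It isn't in-place - which is the whole point of Simon's version - but the pass structure and the stability argument are the same:

    -- Start with one run per element, then repeatedly merge adjacent pairs of
    -- runs; each pass halves the number of runs, so there are O(log N) passes.
    msort :: Ord a => [a] -> [a]
    msort [] = []
    msort xs = go (map (:[]) xs)
      where
        go [run] = run                    -- one run left: the list is sorted
        go runs  = go (mergePairs runs)

        mergePairs (a:b:rest) = merge a b : mergePairs rest
        mergePairs runs       = runs      -- zero or one run left over this pass

        merge as [] = as
        merge [] bs = bs
        merge (a:as) (b:bs)
          | a <= b    = a : merge as (b:bs)   -- <= not <, so the sort is stable
          | otherwise = b : merge (a:as) bs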
fanf: (Default)
http://cm.bell-labs.com/cm/cs/doc/85/1-05.ps.gz

Earlier this week a technical cockup in one of the insane alternate DNS roots spilled onto the NANOG mailing list, to much amusement. Bill Stewart posted some comments on resistance at Bell Labs to the deployment of the DNS in the mid 1980s, including "The Hideous Name" by Rob Pike and P.J. Weinberger, linked above.

It's very wrong, and it's interesting to understand why because it provides a historical perspective on the state of networking at that time. The paper seems to have an underlying assumption that balkanization is the natural state of networking. This has two corollaries: that you do not need to (and should not) compromise your user interface to accommodate the naming schemes of other networks; and that a global consensus name space is not possible.

The paper's model of a name in computing is of a path, which is intimately related to the network's topology - a topic which is of no interest to people who aren't network engineers. In this model the idea of a relative name makes sense: how do I get there from here. However they massively underestimate the value of globally-valid names, which remain so as they traverse the network and do not need to be rewritten as they move from one namespace into another. (They also massively overestimate the reliability of rewriting.) Globally-useful names are so important that they are created by the users of systems that don't support them: for example in UUCP, people would quote their addresses relative to a few well-known hubs, which effectively acted as the root of the namespace as well as the core of the network.

This illustrates the network effect, which is an idea that is of course absent from the paper since the paper was written at least ten years before the strength of network effects became obvious. Instead the paper invokes standardization as the solution:

It is clear that standards are necessary for electronic mail to be delivered reliably across network boundaries. What needs to be standardized is the interpretation of names, especially at network boundaries. Until such a standard exists; is syntactically and semantically clean; distributes the interpretation of names across the systems that understand them; and is adhered to, the network mail situation will not improve.

This is of course correct, but the solution we have ended up with does not follow their principles of naming at all - and it is of course centred on the Internet and so can't be used to gateway between other networks directly, which they assumed would be necessary.
fanf: (Default)
We were sorry to hear earlier this week that Cyrusoft, the publishers of Mulberry, have gone bankrupt. Cyrus Daboo started writing Mulberry at Cambridge (when I was an undergraduate) with close co-operation from the Computing Service and our Mac users. Mulberry was the first really good GUI IMAP client so it's a shame to see it die an unnaturally early death. (More about the early days of Mulberry here.)

For the above reasons we have in the past recommended Mulberry to our users; we can continue to support it for the rest of this academic year, but we need to find something to replace it. To this end I have been playing around with Thunderbird. I have been sadly disappointed. It is buggy, its user interface is clunky, and it encourages users to send their passwords across the net in the clear. The latter in particular is inexcusable for a modern piece of software.

How not to do it

In which Tony goes off on one about how Thunderbird is far too difficult to configure...

When you start Thunderbird for the first time, the first thing you see is a broken "import settings" dialogue which fails to list anywhere to import settings from. Nice start!

Then Thunderbird puts you through a new account setup wizard. On the first pane you can choose from email, rss, or usenet. Having chosen the first you then go to the "identity" pane where you enter your name and email address.

Next is the slightly confusing server information pane. This allows you to choose between POP and IMAP, and type in the host names of the incoming POP/IMAP server and the outgoing SMTP server. The confusion arises from an option that appears if you select POP (which is the default): it seems to be about whether or not to merge multiple POP feeds into a single INBOX, but the description isn't very clear. What is worse is that there is a horizontal line between the incoming server hostname and this option, and no line between this option and the outgoing server hostname, so the visual cues give you the wrong idea about what it relates to until you flip the POP/IMAP switch.

Note that this pane does not have any security or port number options!

Next is the user names pane in which you type in your user names for the incoming and outgoing servers. The text boxes default to the local part of your email address, which is quite nice. However there are still no security options.

The final pane allows you to give this account a friendly label for the GUI. After this the main window opens and you get asked to type in your password to log into the IMAP server.

At this point I'm pretty suspicious. Surely they can't be so stupid as to send the password in the clear? After all, the IMAP standard requires TLS support. But paranoia means I fire up tcpdump and type in a bogus password. Lo! the password appears in the tcpdump output. I peevishly cancel the login and go looking for the account settings.

The first pane of the preferences dialogue box has a button for configuring how Thunderbird connects to the Internet [sic]. Clicking that reveals that this means proxy configuration, but it doesn't use the correct terminology until you open the subordinate dialogue box. The settings for the actual mail servers are nowhere to be found in the remaining preferences panes. (This is not unprecedented, because Mac OS X Mail also does not include account settings in the preferences dialogue box.) Further digging through the menus reveals that account settings lives in the tools menu.

The primary pane for the account deals with identity information (name, email address, signature, etc.) and, somewhat out of place, the SMTP server. The latter option is in the form of a useless drop-down menu with no direct way to change the detailed settings.

The next pane is for server settings, which is only for the incoming server. This allows you to change the host name, and - mirabile dictu! - the port number. There's a security settings panel which allows you to choose how to use TLS, including the "expose my password" (i.e. "never") and "man-in-the-middle vulnerability" (i.e. "optional") settings. There's also a "secure authentication" tickybox, which like the Internet connection preferences button is clearly a meaningless euphemism for something that the developers think is too technical for users. Googling reveals that it means challenge-response authentication, i.e. CRAM-MD5 - it might also include Kerberos, but since Thunderbird has no documentation I can't easily find out.

So I turn on TLS and continue to look for the SMTP server options. Further down there is a security pane. This reveals itself to be about message encryption, not security in general. Another stupid euphemism. I eventually find the SMTP server options, entirely divorced from the rest of the account settings in a separate sub-tree of the dialogue box. This pane has just a list of one SMTP server host name, which I have to select before clicking the edit button in order to be transferred to a separate dialogue box which - at last - allows me to set the port number and TLS options.

After all that, Thunderbird is finally configured correctly, and indeed it works. However I was put through three panes of the wizard to enter the settings in the first place, which resulted in a dangerously insecure (and broken) configuration, after which I had to dig through four panels of the account settings dialogue box, after going on a wild goose chase through five panels of the preferences dialogue and one in the account settings dialogue.

This is by far the worst MUA configuration user interface I have seen - even Outlook is better. However all of them are bad, because they expose too much complexity to the user and encourage the use of insecure settings.

The right way to do it

In which Tony expounds on secure user interface design...

The following is based on a couple of simple principles: be as secure as possible by default; and never ask the user something that can be found out automatically.

The account setup wizard needs at least the following information:

  • The user's name

  • The user's email address

  • The incoming server name

  • The outgoing server name

  • The user's password.

  • By default the user's login name is the local part of their email address. There must be an option to edit this, but in many cases that will not be necessary.

When this has been entered, the software probes the servers to automatically find out the rest of the information, including port numbers, TLS settings, and authentication protocols. The algorithm goes like this:

(1) Connect to the IMAP port (143) of the incoming server and check for the STARTTLS capability. If it is available, set the incoming server options to IMAP with TLS required.

(2) If STARTTLS is not available, try the IMAPS port (993) instead. If this works, set the incoming server options to IMAPS.

(3) If either (1) or (2) succeed, do the TLS negotiation and cache the server certificate.

(4) If neither IMAP connection works, perform a similar process with the POP3 (110) and POP3S (995) ports.

Note that IMAP is preferred to POP because it works better if you use multiple MUAs, and because if you try out a POP client and forget to set the keep mail on server option, it will delete everything. However if the user's email provider doesn't give them enough storage space they may wish to override this policy; see below.

(5) Look at what authentication protocols are supported. The order of preference should be (a) Kerberos over TLS or in the clear; (b) password auth over TLS; (c) CRAM-MD5 over TLS or in the clear; (d) password auth in the clear.

CRAM-MD5 has dubious security properties, because it is vulnerable to brute-forcing attacks that password-over-TLS is not, and because the server must have a cleartext-equivalent copy of the users' passwords.

(6) Check that login works, but only if secure authentication is possible. This may require prompting for a password.

(7) Perform a similar process with the outgoing SMTP server. This is slightly more complicated because there are more possibilities. First try the submission port 587; if that doesn't work try the SMTPS port 465; if that doesn't work, try port 25. If secure authentication isn't offered, attempt to send a message from the user to the user (but abort before sending the DATA) to see if the server works as an unauthenticated relay.

I've described this process as being sequential, but in practice the software will want to perform all seven connection attempts concurrently to minimise delays that may be caused by firewalls dropping packets.
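
Stripped of the actual networking, the incoming-server decision in steps (1) to (4) boils down to something like this - every name and type here is hypothetical, and the probing, TLS negotiation and certificate caching are all omitted:

    -- Hypothetical summary of what the concurrent probes of the incoming
    -- server found.
    data Probe = Probe
      { imapStartTLS :: Bool   -- port 143 answered and offered STARTTLS
      , imapsWorked  :: Bool   -- port 993 accepted an SSL-on-connect session
      , popStartTLS  :: Bool   -- port 110 answered and offered STLS
      , pop3sWorked  :: Bool   -- port 995 accepted an SSL-on-connect session
      }

    data Incoming = Incoming { proto :: String, port :: Int, tls :: String }
      deriving Show

    -- Prefer IMAP to POP, and STARTTLS to SSL-on-connect, as in steps (1)-(4).
    chooseIncoming :: Probe -> Maybe Incoming
    chooseIncoming p
      | imapStartTLS p = Just (Incoming "IMAP"  143 "TLS required (STARTTLS)")
      | imapsWorked p  = Just (Incoming "IMAPS" 993 "SSL on connect")
      | popStartTLS p  = Just (Incoming "POP3"  110 "TLS required (STARTTLS)")
      | pop3sWorked p  = Just (Incoming "POP3S" 995 "SSL on connect")
      | otherwise      = Nothing    -- nothing secure found: warn, don't guess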

When this is complete, present to the user the results of the probe for confirmation or alteration. This will be the usual account settings dialogue box. This should include the following information and controls:

  • incoming server name, editable

  • incoming server port number, editable

  • incoming server protocol, switchable between POP and IMAP

  • incoming server TLS setting (STARTTLS, SSL-on-connect, optional, none)

  • incoming server user name

  • If secure authentication worked, there should be a friendly OK indicator, and an option for the user to view the cached server TLS certificate. If the certificate is not signed by a CA, the software should present it to the user for checking without being asked.

  • If only insecure authentication was offered, there should be a warning note. This should always appear if the user selects either of the dangerous TLS settings (the last two).

  • There should be a note if a working login has not been confirmed.

  • outgoing server name, editable

  • outgoing server port number, editable

  • outgoing server TLS setting

  • outgoing server user name

  • A tickybox to indicate whether authentication is required; if it is not checked the TLS and user name options are greyed out.

  • The outgoing server has a similar OK/unconfirmed flag and security note as the incoming server. If TLS and AUTH were not offered by the server, the security note should say that roaming may not work, but should not be as scary as for cleartext password authentication.

  • Finally, there is a button to trigger test connections using the given settings to confirm that they are working OK, a button to re-do autoconfiguration from scratch, and the usual OK/Cancel pair (though the latter should probably be greyed out for initial configuration!)


(There should probably also be options for client TLS certificate authentication as an alternative to password auth.)

As well as the initial setup, the software must allow for configuration changes caused by alterations to the server or by roaming, and should assist the user in these cases.

If secure authentication was not previously possible but the server now advertises it, the software should check that it works OK then present a dialogue box to the user saying "Secure authentication is now possible, and your settings have been upgraded to use it." with an OK button and a button that takes the user to the account settings dialogue box. The software must be very paranoid about the server's TLS certificate in this case, in order to ensure that this isn't actually a man-in-the-middle attack.

If secure authentication was previously possible but is not advertised on this connection, the software should warn the user that they may be under attack and abort the connection. The other possibility is that the user's email provider has one of those misguided setups that only allow secure connections from the public Internet, so the warning should also advise them to contact their support staff. The software should not encourage the user to turn on the "optional TLS" setting in this situation, and should cause extra work for email providers with bad setups.

If the TLS certificate changes, the user should be warned gently - particularly gently if the CA signatures on the old and new cert are both trusted.

A few final words...


I have previously considered that perhaps there should be a standard protocol for email software autoconfiguration, such as specially formatted records in the DNS that contain all the necessary information, so that all users would have to do is type in their email address. However, SRV records, which can already solve most of this problem, are not quite enough: they tell you the host name and port number, but they don't tell you about the higher-level security settings. So such a protocol is not entirely trivial; it also has the problem that it would be redundant wrt the in-band feature negotiation of IMAP and SMTP, which might cause objections to the idea on architectural grounds. So the above approach of intelligent autoconfiguration is pretty close to ideal; perhaps you could make it even slicker by guessing the IMAP and SMTP server names based on the user's email address, instead of hoping that the SRV idea gets standardized.

Update: see also the expired Internet-Draft http://www.watersprings.org/pub/id/draft-hall-email-srv-02.txt
fanf: (Default)
Following my previous entry, I emailed <bind-suggest@isc.org> to suggest an improvement which would allow me to do what I want directly, without fiddling with features that aren't quite the right thing. I got a reply from Mark Andrews who says that bind-9.4.0 will do what I want, according to this changelog entry:
  1798. [func] The server syntax has been extended to support a
               range of servers.  [RT #11132]

This will allow me to say the following, which is exactly what I want.
  server 192.168.0.0/16 {
    bogus yes;
  };
fanf: (Default)
There's a rather nasty way of using the DNS to tunnel from the outside into a private network. If you publish an NS record whose target has a private address, and you then cause a nameserver at the private network's border to query that nameserver, it will send the query to the private network. This isn't particularly damaging because the target port is fixed to 53, but it invites abuse because it's a very cool hack and the tools to do it are becoming easier to use.

What I'd like to be able to do is to tell Bind never to send queries to RFC 3330 addresses. Bind has a "bogus server" feature that almost does the trick, but it can only block one IP address at a time and I want to block millions of them. However it also has a "blackhole" feature, which allows you to block whole CIDR address ranges, with the effect of the "bogus server" feature but also blocking queries from the relevant addresses. This latter is a slightly irritating side-effect if you want to accept queries from private addresses but not send queries to those addresses; however I only have one such IP address because my name servers only accept queries from localhost.

The solution I have come up with is:

    acl bogons {
        ! 127.0.0.1;
        0.0.0.0/8;
        10.0.0.0/8;
        127.0.0.0/8;
        169.254.0.0/16;
        172.16.0.0/12;
        192.0.2.0/24;
        192.168.0.0/16;
        198.18.0.0/15;
        224.0.0.0/3;
    };
    
    options {
        listen-on-v6 { none; };
        listen-on port 53 { 127.0.0.1; };
        query-source address * port 53;
    
        allow-query { 127.0.0.1; };
        allow-notify { none; };
        allow-transfer { none; };
        blackhole { bogons; };

        // some other options here
    };
    
    server 127.0.0.1 {
        bogus yes;
    };
fanf: (Default)
Last week, a discussion started on exim-users about how Exim's excessive number of little languages could be rationalized. (link). I have thought about this problem to some extent, so I wrote the following...

There are two sets of languages that are relevant to Exim: configuration languages (of which I count a generous handful) and extension languages (currently 2: Perl via ${perl and C via local_scan() and ${dlfunc). Configuration languages are important because they are the user interface of the program, and everyone has to live with them. Exim's problem is that it has too many sub-languages: two filter languages (Exim's own, plus Sieve); the ACL language; the driver language for routers etc; the list match language; the string expansion language; and regular expressions. This count is rather inflated: it's a bit of a cheat to count regexes separately, because nowadays they're part of every decent language, and list matching and string expansion function as the expression syntax to the other languages' statement syntax. But there's a lot of overlap and non-orthogonality, so plenty of room for improvement.

Time for a bit of terminology. Configuration languages are a subset of "domain-specific languages". The scope of the term is quite broad, and is fairly well illustrated by the "little languages" of the traditional Unix tools: typesetting languages like troff, tbl, pic; compiler generation tools like lex and yacc; text-processing languages like sed, awk; command languages like make and the shell; configuration languages like crontab, inetd.conf, printcap, termcap; etc. These may or may not be usable as general-purpose languages; the point is that they are targeted at a specific domain (i.e. purpose).

There is an observation that DSLs, especially for complicated pieces of software, either need to be programmable, or they become programmable as they accumulate features. The latter has happened to Exim twice (one, two). This leads to the argument that programmability should be designed in from the start; furthermore, if you base the DSL on an existing programming language then you don't have to do the language implementation yourself and can concentrate on the domain-specific code. Hence the idea of "embedded domain-specific languages": DSLs that are implemented within the framework of a programming language. We were speculating about replacing Exim's configuration language with a DSL designed for programmability, and I suggested making it an EDSL. Then we got into an argument about which language should be the host for the embedding.

So what makes a good host language? I think the most important thing is extensible flow control operators. The reason for this is that Exim's declarative configuration style hides quite a lot of flow complexity: many decisions are four-way (accept/reject/defer/pass) or more, and there is implicit short-cutting and iteration over addresses. The EDSL configuration should preserve this hiding of complexity, which means that configuration keywords like drop/deny/defer/accept have to be able to affect the control flow without requiring boilerplate from the user. This is even more important in the routers, where instead of dropping back into Exim's core, you usually want to skip to the next router. It's better to make the whole chain of routers a single routine (rather than one per router) because then the postmaster can code complicated routing decisions beyond the usual sequencing, but this in turn makes difficult demands of the host language.

Tcl is of course famously designed to be a host for EDSLs (such as expect); it isn't a particularly nice language in itself with its clumsy variable assignment and expression evaluation commands, but this is less of a problem if your common commands have rich semantics and it's compensated by Tcl's brilliance at non-standard flow control. This is mainly because it's easy to quote blocks of code for evaluation now or later, so unlike many languages it's trivial to define your own if command because order of evaluation is not rigid. In if [test] {then} {else} the [] specifies evaluation now and the {} specifies evaluation later, so the if command's implementation can just look at the value of its first argument and evaluate its second or third accordingly.

Lisp is another big EDSL host - in fact this is part of the culture of Lisp: when writing a Lisp program you first design an EDSL then you code the solution in your new language. Lisp is less nice than Tcl as the basis for a configuration language, though, because of the irritating superfluous parentheses. Still, Emacs makes a plausible existence proof.

However both these languages suffer from lack of static checking (in the case of Tcl even at the level of basic syntax) which imposes a burden of testing on the postmaster which in an ideal world would be performed automatically. Which is why (apart from personal aesthetic preference) I suggest Haskell as the host language. Like Lisp, it has a culture of EDSLs. However these tend to focus on sets of "combinators" that are used to tie bits of code together - exactly the kind of do-it-yourself flow control we want to be able to do. In addition to that, the "monad" concept is brilliant for tucking away the implicit state manipulation and the short-cutting flow control that Exim does all the time, without cluttering the syntax seen by the user.
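
To give a flavour of what I mean, here's a toy sketch - entirely my own invention, not a design for Exim - of an ACL-like monad in which deny/defer/accept short-circuit the rest of the policy without any plumbing visible to the postmaster:

    -- Each ACL statement either produces a value and continues, or reaches a
    -- verdict, which silently skips everything after it.
    data Verdict = Accept | Deny String | Defer String
      deriving Show

    newtype Acl a = Acl { runAcl :: Either Verdict a }

    instance Functor Acl where
      fmap f (Acl e) = Acl (fmap f e)

    instance Applicative Acl where
      pure = Acl . Right
      Acl ef <*> Acl ex = Acl (ef <*> ex)

    instance Monad Acl where
      Acl (Left v)  >>= _ = Acl (Left v)    -- verdict reached: short-circuit
      Acl (Right x) >>= k = k x

    accept :: Acl a
    accept = Acl (Left Accept)

    deny, defer :: String -> Acl a
    deny  = Acl . Left . Deny
    defer = Acl . Left . Defer

    require :: Bool -> String -> Acl ()
    require ok msg = if ok then pure () else deny msg

    -- The policy itself then reads as straight-line configuration:
    policy :: Bool -> Bool -> Verdict
    policy senderVerified greylisted = either id (const Accept) . runAcl $ do
      require senderVerified "sender verify failed"
      if greylisted then defer "greylisted, try again later" else pure ()
      accept

In a real configuration the monad would also have to carry the message, the connection, the ACL variables and so on, and the router chain would want a similar "pass to the next router" escape - which is exactly the kind of do-it-yourself flow control that combinators and monads are good at.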

Functional programming has a reputation for taking "hair shirt" purity too far, to the extent of being useless for practical purposes. However, at least one plausible Internet server application has been written in Haskell, and Pugs is showing that it isn't a completely indigestible language for Perl hackers.

But at the moment this is just idle speculation - though I do have a cute name for the idea ("Elegant Configuration using Haskell for Internet Mail") - but it's unlikely to actually happen until I find some extra tuits...
fanf: (Default)
One of the few remnants of the old JANET grey book email system that we still support is abbreviated email domains such as in <fanf2@cam>. Exim has two configuration options which we use to implement this: widen_domains = cam.ac.uk : ac.uk and rewrite_headers.

When someone sends email to an abbreviated address, Exim does not recognize the domain as being one of the ones it is explicitly configured to handle, so looks it up in the DNS. This DNS lookup fails, so Exim tries appending each of the widen_domains in turn and re-doing the DNS lookup. Thus @hermes becomes @hermes.cam.ac.uk, which works, and @cam becomes @cam.cam.ac.uk, which doesn't, then @cam.ac.uk, which does.

Abbreviated domains are not valid on the public Internet, so they must be replaced by their un-abbreviated equivalents before the message leaves ppswitch. Email messages have two sets of metadata: the envelope, which determines who receives the message and who gets any delivery failure reports (i.e. the sender); and the header, which is what the user sees. [1] Since Exim is concerned with getting messages from A to B, it generally only looks at the envelope, and leaves the header alone. However when the envelope contains an abbreviated address there is probably a copy of the address in the header, so Exim must fix up both of them. This is specified by the rewrite_headers option.

[1] The two do not necessarily agree: for example, a message sent via a mailing list has the list's address in its To: header, but the list members' addresses in the envelope. However, when a message is submitted the initial envelope is created from the addresses that the message's author specified in the header, so the two sets of addresses start off the same.

That is all well and good, but there are some subtleties. Widening an abbreviated domain is a side-effect of the address being run through Exim's routers. Routing happens in two situations: when Exim is verifying addresses in order to decide if the message is acceptable; and when Exim is working out where to deliver the message after it has been accepted. The rewrite_headers option only has an effect in the latter case, because in the former case Exim has not received the message header yet. (Even if the header had been received, Exim doesn't record these header fix-ups persistently but instead works them out again whenever necessary.)

This is not a problem for recipient addresses, because these are all run through the routers at delivery time. However the sender address is not, because it is not necessary to work out the sender's destination in order to deliver the message to the recipients. The sender address is only run through the routers in order to verify it before the message is accepted. Therefore any fix-ups that may result from expanding the sender address do not happen!

Fortunately no-one configures their software to use abbreviated sender addresses, so we can safely prevent people from making this configuration error and thereby emitting broken messages, without causing disruption. This required a somewhat subtle change to Exim's source code. The idea is to ignore the widen_domains option when verifying a sender address, so that an abbreviated domain remains unrecognized and the address fails to verify, causing the message to be rejected. However if the postmaster has turned off rewrite_headers (for example, because they are working in a walled garden where it is OK for abbreviated domains to propagate) then the original problem doesn't obtain, so widen_domains should not be ignored.

Even more subtly, abbreviated addresses can appear in aliases files, for example <fanf2@ucs.cam.ac.uk> -> <fanf2@cam> which means that if I send a message using the alias, Exim will generate the abbreviated address as a result of sender verification. In this situation, the abbreviated address does not appear in the message header, so it does not need to be fixed up, so it should be permitted. Thus, we ignore widen_domains if we are verifying a sender address, but not if rewrite_headers = false, and only if this is the original sender address (not the result of alias expansion).

I didn't work out the latter "even more subtle" part of this fix until after I had rolled out my new Exim configuration yesterday and after I had received a bug report from a computer officer whose webserver's email was being bounced. Oops! I'll have to try the roll-out again on Monday...
fanf: (Default)
Phil Chambers complains about non-ASCII junk in SMTP envelopes:
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050718/msg00029.html

I note that another common syntax error is to put a space between the : and the <, and I talk about building support into Exim to deal with this misbehaviour:
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050725/msg00102.html

More followups discussing ways of configuring existing versions of Exim to do this, and the wisdom thereof:
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050725/msg00141.html
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050725/msg00147.html
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050725/msg00169.html

(I also note that Exim is already strict in this manner about HELO domains.)

[update]
Using $smtp_command_argument to implement :< strictness:
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050725/msg00184.html
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050725/msg00191.html
fanf: (Default)
Shortly before my wedding there was a discussion on the Exim-users mailing list about Exim's handling of its hints databases, which cache information about the retry status of remote hosts and messages in the queue, callout results, ratelimit state, etc. At the moment Exim just uses a standard Unix DB library for this, e.g. GDBM, with whole-file locking to protect against concurrent access from multiple Exim processes.

There are two disadvantages with this: firstly, the performance isn't great because Exim tends to serialize on the locks, and the DB lock/open/fsync/close/unlock cycle takes a while, limiting the throughput of the whole system; secondly, if you have a cluster of mail servers the information isn't shared between them, so each machine has to populate its own database which implies poor information (e.g. an incomplete view of clients' sending rates) and duplicated effort (e.g. repeated callouts, wasted retries).

The first problem is a bit silly, because the databases are just caches and can be safely deleted, so they don't need to be on persistent storage. In fact some admins mount a ram disk on Exim's hints db directory which avoids the fsync cost and thereby ups the maximum throughput. If you go a step further, you can take the view that Exim is to some extent using the DB as an IPC mechanism.

The traditional solution to the second problem is to slap a standard SQL DB on the back-end, but if you do this the SQL DB becomes a single point of failure. This is bad given that a system like ppswitch which is a cluster of identical machines that relay email does not currently have a SPOF. It also compounds the excessive persistence silliness.

It occurs to me that what I want is something like Splash, which is a distributed masterless database, which uses the Spread toolkit to reliably multicast messages around the cluster. Wonderful! The hard work has already been done for me, so all I need to do is overhaul Exim's hints DB layer for the first time in 10 years - oh, and get a load of other stuff off the top of my to-do list first.

If it's done properly it should greatly improve the ratelimit feature on clustered systems, and make it much easier to write a high-quality greylisting implementation. (A BALGE implementation is liable to cause too many operational problems to be acceptable to us.) It should also be good for Exim even in single-host setups, by avoiding the hints lock bottleneck. The overhaul of the hints DB layer will also allow Exim to make use of other more sophisticated databases as well as Splash, e.g. standard SQL databases or Unix-style libraries that support multiple-reader / single-writer locking.
fanf: (Default)
I've spent a fun evening getting to grips with Philip's Music Writer, the music typesetting software written by my esteemed colleague Philip Hazel. This is for the hymn music on the service sheet for our wedding. So far I have done hymn 2 and hymn 3. The result is quite nice, but the source files remind me strongly of the 1970s Bell Labs Unix typesetting utilities, especially because of the amount of fiddling you have to do to work around quirks in the source language.
fanf: (Default)
We've just taken the third step towards eliminating insecure access to our servers.

When I was an undergraduate, Hermes was too overloaded to be able to cope with crypto, so all access (except for the sysadmins) was via cleartext protocols with exposed passwords. This has not been a problem for many years now, but because of inertia we have continued to allow insecure logins and a very large proportion of users are still using them.

Last summer I introduced secure authenticated message submission, in order to provide better support for roaming users. Since that point we have had secure versions of all protocols. At the start of term a couple of months ago, we gave notice that insecure protocols would be gradually disabled. I added pre- and post-login banners on the Telnet and FTP services warning of this, which probably had little effect because users don't read banners. I also changed the webmail login: in the past insecure logins were accompanied by a small warning which was not effective; I changed it so that users were forced to click through to the insecure login form and I made the warning more prominent. This had an immediate effect, reducing insecure logins by 80%.

This morning we've disabled Telnet and FTP completely, so terminal logins and file transfers must be performed using ssh and sftp. I've also changed webmail so that the insecure front page is redirected to the secure version, and insecure logins are forbidden.

The final stage is going to be a very long slog to gradually get all IMAP and POP users switched to secure configurations, by disabling insecure access for groups of users in stages. We expect this to take at least a year...
fanf: (Default)
My esteemed colleague Philip Hazel recently released pcre-6.0, the latest version of his Perl-compatible regular expression library. The major new feature is an alternative matching algorithm which does not back-track. Unfortunately, Philip has documented this using incorrect terminology from Jeffrey Friedl's otherwise excellent book, Mastering Regular Expressions. This is my rant about Jeffrey Friedl's description of the difference between DFA and NFA, i.e. deterministic or non-deterministic finite automata.

In the correct non-Friedl terminology, DFA and NFA refer to the construction of the FA, not the matching algorithm. When matching a subject string against a DFA, each character determines a unique new automaton state, so you can implement the match with just two variables: a pointer into the string and a pointer into the DFA. In an NFA, each character can determine multiple subsequent automaton states, so to implement the match you need to trace the multiple possible routes through the NFA. Traditionally this has been done with a backtracking depth-first search, in which you keep a history of your state at each decision point so that you can go back and try the alternatives if a match fails. The new PCRE algorithm effectively performs a breadth-first search, by processing the input string a character at a time and keeping track of all the possible NFA states in parallel. (See http://www.cl.cam.ac.uk/Teaching/current/RLFA/ and in particular section 2.2 of the lecture notes for more details.)
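
Here's a minimal sketch of that breadth-first idea - a whole-string matcher over a hand-built NFA, which is the textbook algorithm rather than anything resembling PCRE's actual implementation:

    import Data.List (nub)

    -- A toy NFA: start states, accepting states, and a transition relation
    -- (epsilon transitions are omitted to keep the sketch short).
    data NFA = NFA
      { starts    :: [Int]
      , accepting :: [Int]
      , delta     :: Int -> Char -> [Int]
      }

    -- One pass over the subject, carrying the set of all states the NFA could
    -- be in; no backtracking, and no history to unwind on failure.
    matches :: NFA -> String -> Bool
    matches nfa input = any (`elem` accepting nfa) (foldl step (starts nfa) input)
      where
        step states c = nub [ s' | s <- states, s' <- delta nfa s c ]

    -- The NFA for "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2, with 2 accepting.
    abStarC :: NFA
    abStarC = NFA [0] [2] d
      where
        d 0 'a' = [1]
        d 1 'b' = [1]
        d 1 'c' = [2]
        d _  _  = []

    -- matches abStarC "abbbc" == True; matches abStarC "abcb" == False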

Friedl only talks about the two NFA algorithms, referring to the first as an NFA implementation (fine) and the second as a DFA implementation (no! wrong! totally wrong! where'd he learn this?). His confusion probably comes from the fact that the alternative NFA algorithm behaves the same as if it were implemented using an equivalent DFA, but it's much slower because of the overhead of tracking multiple possible states instead of just a single possible state. With pure regular languages it is possible to construct a DFA from an NFA in order to get faster matching (see the lecture notes); the disadvantage is that the DFA construction can use a lot of space.

The problem with DFAs wrt Unix (and Perl) regexes is that Unix regexes have traditionally been implemented with a backtracking NFA matcher, and people have assumed this implementation when adding features to the regex language. The result is that so-called "regular expressions" no longer match regular languages according to the strict terminology of formal language theory. However when using the alternative not-really-DFA algorithm, PCRE restricts the regex language (round brackets do not capture and back-references are not permitted) so that it can only match regular languages. In principle, PCRE could perform the DFA construction in order to make the new algorithm fast, but Philip says this is unlikely to happen any time soon because PCRE is now on the back burner.

Not all Unix regex implementations are backtracking NFAs. Lex and GNU grep use DFA matching in order to run fast; the cost of constructing the DFA is outweighed by the benefit of its faster running time over lots of data. Lex benefits from the DFA construction because it can keep multiple possible token types in mind (e.g. 123L vs. 123.4 in C) with no extra state and no back-tracking; like PCRE it doesn't support capturing sub-expressions or back-refs. GNU grep is a bit more interesting, because it does support back-refs. It uses a two-stage matcher which performs an initial DFA match (with no back-refs) which is then checked for correct back-ref matching. The latter can be done much more efficiently than a general match, or completely skipped if there are no back-refs at all. Maybe a mythical future PCRE will work the same way?
fanf: (Default)
I've posted my rate limiting patch to the exim-users mailing list. You can see it at:

http://www.cus.cam.ac.uk/~fanf2/hermes/doc/antiforgery/exim-ratelimit.patch
fanf: (Default)
Sometimes flaming is fun. "This draft is worthless. [...] As well as being based on a fundamentally useless idea, the details of the specification are incompetent. [...]"

http://www.imc.org/ietf-smtp/mail-archive/msg01746.html
http://www.imc.org/ietf-smtp/mail-archive/msg01766.html
fanf: (Default)
Following on from http://www.livejournal.com/users/fanf/36178.html
I have updated the document at http://www.cus.cam.ac.uk/~fanf2/hermes/doc/antiforgery/ratelimit.html with a new subsection that does a little analysis of the behaviour of the mathematical model.

On Saturday night at [livejournal.com profile] j4's I gave [livejournal.com profile] hoiho a quick run-down on what it was all about. (We really know how to party in Cambridge!) He suggested looking at the leaky bucket model which is often used in networking circles, because it fairly closely models the way that packet buffers operate. From the mathematical point of view it's probably similar to adjusting the moving average for variable event frequencies as I have done for the exponentially-weighted average.

The reason I didn't look at it initially was that the amount of state it requires increases with the permitted sending rate and burstiness. This is fine if you are a router that is already keeping that state along with the packets, but I have to re-load and re-save the state for each message. However it turns out that the model I picked has pretty much the same intuitive interpretation, where one of the configuration parameters can be thought of as the bucket size, i.e. the maximum size of a burst of messages.
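
To make that concrete, here is a toy sketch of the kind of exponentially-weighted measurement I mean (Python, not the actual implementation, and not necessarily the exact formulae in the document): each message updates a single stored rate, weighting the old average by exp(-interval/period), so the only per-client state is the previous rate and the time of the last message.

  from math import exp

  def update_rate(old_rate, interval, period):
      # Exponentially-weighted moving average of an event rate.
      # 'interval' is the time since the previous message and 'period' is
      # the smoothing constant, both in the same units.
      a = exp(-interval / period)      # weight given to the old average
      instantaneous = 1.0 / interval   # rate implied by this single gap
      return (1 - a) * instantaneous + a * old_rate

  # e.g. a client sending one message a minute, smoothed over an hour:
  rate = 0.0
  for _ in range(10):
      rate = update_rate(rate, interval=60.0, period=3600.0)
  print(rate)   # creeps up towards 1/60 messages per second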

It's nice when a result shows you that your first choice of maths turned out to be a better fit than you expected, and pleasingly neat to boot. It has been fun resurrecting some more of my secondary school knowledge :-)
fanf: (Default)
I've just written a proposal for implementing rate limits on outgoing email via ppswitch (our central email relay). The aim is to detect and stop floods of viruses or spam from a compromised machine on the University's network.

The document includes a description of the simple mathematical model I'm planning to use to compute a client's sending rate. It seems to satisfy my requirements but if anyone has any better ideas then I'm all ears.

http://www.cus.cam.ac.uk/~fanf2/hermes/doc/antiforgery/ratelimit.html
fanf: (Default)
To obfuscate: to obscure, to make confused or unclear. Obfuscated code is a program whose source does not let you understand directly what it does. It's a kind of game (a way of letting off steam?) practised by a few programmers... You can see it as the computing equivalent of poetry, as opposed to the enormous mass of prose produced each year by ordinary developers. A small extract signed Tony Finch: l opt b (S (S I (S (BB (CC B) CI (BB K K make)) (S (BB C (BB C (C (CI qk))) (SS (SS S) (S (BB (BB (S I)) S (BB (BB (CC B) CI) (BB K K) make)) (B (CC B (BB (CC C) (BB (C (CI qk)) (pair (atom qk))) pair)) make)) (B (C (BB B C (C (CI qi)))) make))) make))) (B K make))

http://dvanw.blogspot.com/2005/04/041-obfuscated-code.html

I like this guy's analogies :-)
fanf: (weather)
So this morning I had a remarkably sensible non-lucid (or semi-lucid) dream.

I was remembering a mathematical (or perhaps cryptographic) joke involving two entities, labelled A and 1. (Or maybe it was A and 2 or 1 and B; my memory is vague when asleep at 6 in the morning.) Shortly after I first encountered this joke I found out about typesetting software that auto-numbers sections and lists such as: a. first item; a. second item; a. third item; which is fixed up to a. b. c. You can also provide a style hint such as 1. for numerical lists or A. for capitalized lists or i. or I. for roman numerals. (I think I've seen [livejournal.com profile] jwz writing lists in this style but without the fix up.)

I dreamt I was talking about this to [livejournal.com profile] rmc28 and generalizing it to a list of things that are first in some series (aleph, alpha, red, etc.), to which dream-Rachel responded "Oh, that's Newton's list." (Actually, s/list/set/g. I think the Newton part came from me reading lots of fiction about the English Enlightenment recently.)

After waking it occurred to me that the set wouldn't be all that interesting, though you could argue endlessly about the choice of sequences from which to take first items for inclusion. The Guinness Book of Records might be a good source, but it's far too heavy on people. The sequence of British Prime Ministers is worth including, but what about people who ran a mile in less than 4 minutes? Who came after Roger Bannister? Lists of abstractions are much more fun, mainly because of the obscurity.

Maybe filling in subsequent items from the lists would make a good source of quiz questions :-)
fanf: (Default)
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050328/msg00087.html

This arose from my thoughts about domain-specific languages and the programmability thereof. I felt the need to examine further the computational abilities of Exim's ACLs.

[more]
http://www.exim.org/mail-archives/exim-users/Week-of-Mon-20050328/msg00108.html
fanf: (Default)
One of the weaknesses of the current Hermes architecture is the comparative lack of data security for the message transport queues on ppswitch. These store messages that are in the process of being delivered, or that have been delayed because we can't reach the destination host or because of quota problems for a user on the Cyrus store. The ppswitch machines have SCSI RAID controllers with battery-backed caches, so they're resistant to single disk failures, power outages, etc., but are not safe from filesystem bugs or multiple hardware failures. It would be nice to augment them with some kind of replication technology so that they have comparable reliability to the Cyrus machines.

At the start of last year I was thinking about a new feature called "early delivery", which would allow you to get the MTA to deliver a copy of the message before confirming to the sending machine that it has been safely received (i.e. between the "." at the end of the message data and the server's response). This copy would be kept on another machine, and therefore be safe in case of some nasty failure, while the original would be delivered as usual. I didn't really work out any good way of preserving all the message's metadata (client IP address, authentication, etc.) or of removing copies of messages that have been delivered and so are no longer needed.

David Woodhouse suggested another use for early deliveries, which would be an early attempt at normal delivery rather than a separate copy of the message. His scenario was a border MTA delivering to MTAs inside an organization which might have different security policies - similar to ppswitch, in fact. By trying to deliver early you can pass success or failure directly back to the sender, instead of accepting the message then generating a bounce if there's a problem. As such, it's somewhat like souped-up SMTP call-forward address verification, but applied to the whole message.

http://www.exim.org/pipermail/exim-users/Week-of-Mon-20040105/064531.html

I've recently been thinking a bit about using standard email protocols for communication between the parts of an MTA. For example, rather than the sendmail program being the MTA itself, and having to be setuid in order to switch from the user's security context to the MTA's, it should be a standard SMTP client. It can't quite be as simple as that because sendmail has a number of management functions as well as message submission, so it might use a second protocol for those, or extensions to SMTP. You can preserve some of its security properties (such as knowledge of the user that ran sendmail) by doing SMTP over a Unix domain socket (instead of over TCP), and using AUTH EXTERNAL to tell the SMTP server to use SCM_CREDS to find out the user at the other end of the socket.
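
As a rough illustration of the Unix-domain-socket half of that (using Linux's SO_PEERCRED, which plays the same role as the BSD SCM_CREDS mentioned above; the socket path is invented, and the AUTH EXTERNAL plumbing is left out), the server end can find out who connected like this:

  import socket, struct

  server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  server.bind("/tmp/mta-submission.sock")   # hypothetical socket path
  server.listen(1)
  conn, _ = server.accept()
  # Ask the kernel for the peer's credentials instead of trusting the client.
  creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                          struct.calcsize("3i"))
  pid, uid, gid = struct.unpack("3i", creds)
  print("submission from pid %d, uid %d, gid %d" % (pid, uid, gid))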

Another example would be to detach local delivery from the core of the MTA, again so that the MTA doesn't have to be setuid in order to switch from the MTA's security context to the user's. The MTA could send messages to the local delivery agent using LMTP. Cyrus already does this, but it's less normal to do so for traditional Unix message stores. (Postfix is structured like this but AFAICT its internal protocols are undocumented.)
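
With Python's standard smtplib, for example, the hand-off to a local delivery agent over LMTP is tiny (the socket path and addresses here are invented):

  from smtplib import LMTP

  lda = LMTP("/tmp/lmtpd.sock")   # an absolute path means a Unix socket
  lda.sendmail("sender@example.org", ["localuser@example.org"],
               "From: sender@example.org\r\n\r\nhello over LMTP\r\n")
  lda.quit()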

It occurred to me yesterday that you can combine these ideas in an interesting way, using early delivery to implement content-scanning callouts. In this case the early delivery isn't a delivery as such (the message isn't kept by the content scanner), but rather it's just a means of getting the data from the SMTP server to the scanner. In this respect it's different from other common email filtering architectures where the scanner acts as an SMTP proxy that receives the message from the MTA and re-injects it back into the MTA:

sender -> smtpd -> filter -> smtpd -> queue

Instead it's more of a T shape, which makes it more like OPES in architecture:

sender -> smtpd -> queue
            V
          scanner

Although it's an interesting idea it does involve a fair amount of copying stuff around, which isn't as efficient as it could be - though it does make security boundaries simpler. What would be better is for the filter to read the message straight from the disk, which in turn demands a friendly on-disk queue format. Perhaps (taking a leaf from Cyrus's book) this should be wire format so that efficient APIs (e.g. sendfile) can be used.

Lots of ideas to play with if I ever do some work on Postmaster...
fanf: (Default)
My CSA implementation seems to work OK now (according to my tests). Any more testing, code review, suggestions, bug reports etc. welcome. The following patch is against the code from CVS, but it applies to 4.50 OK.

http://www.cus.cam.ac.uk/~fanf2/hermes/doc/antiforgery/exim-csa.patch
http://www.cus.cam.ac.uk/~fanf2/hermes/doc/antiforgery/exim-csa.docs

For those who like tales of woe, search for "gross" in the patch. That cost me two hours!
fanf: (Default)
Any comments, suggestions, questions about the following?

http://www.cus.cam.ac.uk/~fanf2/hermes/doc/antiforgery/csa.txt
fanf: (Default)
I've updated unifdef with a fix for a fairly stupid bug - it was not ignoring comment markers inside strings (in fact it was completely oblivious of strings). What makes this bug particularly stupid is that the code I started with did recognize strings, though not according to C syntax, and I ripped it out because it was broken. D'oh! The bug report has also prompted me to update the version distributed with FreeBSD, which happened to be missing some additions I made in August 2003. Good thing too, because it's been a while since my last commit.

New code at http://dotat.at/prog/misc/ and http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/unifdef/
fanf: (Default)
So while the rest of the team are having a knees-up at the 62nd IETF in Minneapolis, I'm quietly implementing CSA for Exim. There's about 450 lines of reasonably straightforward new code to do this. You can now say something like the following in an ACL to reject any hosts which are not authorized.
      require verify = csa

In addition to that, the dnsdb lookup type will also handle CSA records. A straight SRV lookup isn't enough because CSA also involves a search for a site policy record. The dnsdb CSA lookup also does some extra checking and pretty-prints the contents of the CSA record for extra friendliness.

There are a few more details to the implementation which I will be writing down tomorrow when I get stuck into the documentation. That will be after the code has been tested (as opposed to just compiled).
fanf: (Default)
Some of my friends think Livejournal should be news though I'm not really convinced. Although using NNTP as the transport protocol has significant efficiency advantages, it doesn't help with the interesting problems of distributing LJ's network effects.

I tend to think that a more modern solution to the efficiency problem would be to create a way for (RSS or Atom etc.) syndication feed updates to be pushed rather than pulled (and polled). Fortunately this is a sufficiently obvious approach that other people are working on it so I can just sit back and let the lazy web do its thing. But what protocol should be used for the notifications?

My answer comes via the Radio Free RFC episodes about XMPP. The Extensible Messaging and Presence Protocol, aka Jabber, is the IETF's open standard for instant messaging. It has an extension known as PubSub which supports generic publication of update notifications and subscription to a notification feed, and there's a profile of PubSub for handling updates to an Atom feed: http://www.xmpp.org/drafts/draft-saintandre-atompub-notify-02.html

All that remains to be worked out is how to handle the interesting access control features of "friends" lists in a distributed and easy-to-use manner...
fanf: (Default)
I've been arguing online quite a lot today, perhaps most prominently on Dave Farber's Interesting People list. I think the EFF are overstating the free speech implications of authenticated email, and that their arguments against new technologies are wrong.

http://www.interesting-people.org/archives/interesting-people/200502/msg00247.html

This post happened to coincide with a discussion on the IETF email message format discussion list about syntax changes to support anonymous messages, e.g. by omitting the From: field.

http://www.imc.org/ietf-822/mail-archive/maillist.html

I think this is all pretty worthless. If you are serious about anonymity you should use serious anonymizing technology like mixmaster. The illusory secrecy provided by hiding your email address (but not your IP address) will just get you in trouble.
fanf: (weather)
Talking with [livejournal.com profile] lusercop and [livejournal.com profile] pjc50 last night about dispute resolution, in particular a dispute over the name of a film. Like trademarks and stage-names these must be registered and reasonably unique. However unlike trademarks, I understand that there's a dispute resolution agency which is reasonable and inexpensive.

This makes me think that state-supported monopolies (trademarks, patents, copyright) should have state-supported dispute resolution, in order to ensure a reasonable balance of power between the deep pockets and the rest of us.
fanf: (weather)
http://www.thestar.com/NASApp/cs/ContentServer?pagename=thestar/Layout/Article_Type1&c=Article&cid=1105614487492&call_pageid=970599119419

http://www.timesonline.co.uk/article/0,,2-1427660,00.html

I'd like to see them try this in Cambridge, if only because the clutter on the streets in the city centre is incredibly ugly. This is assuming, of course, that it has the right effect on taxi drivers and other cretins. But given that our councils have very old-fashioned and rules-bound approaches to road design, I don't have much hope.
fanf: (Default)
http://www.google.com/search?q=apple+%22open+group%22+unix+trademark

Unsurprisingly, given Apple's marketing of OSX and its lack of UNIX certification, the Open Group (keepers of the UNIX trademark) sued Apple for trademark infringement at the beginning of 2002. The most recent news that I can find about this argument was April 2004, at which point the parties were still in dispute with two months left before the court's deadline.

However TOG were making conciliatory noises, and given that Apple is now the largest UNIX vendor it would have been foolish of them to go to court. If Apple won, it would have been bad for TOG and UNIX certification, and worrying for the status of UNIX standardization. However it would have been good for UNIX and Linux marketing because it would allow market analysts to group them together where they belong, ahead of Windows.

So what happened? Anyone know?
fanf: (Default)
The other part of CSV is DNA - Domain Name Accreditation. A slightly unfortunate abbreviation, I think.

DNA consists of two parts. The first is a system for sites to advertise which accreditation services publish information about the site. The idea is that recipients of email can use this information to reduce the number of accreditation queries they need to perform. Of course recipients can't trust information provided by senders, so the sender's list of accreditation services must be tempered by the recipient's assessment of the accreditation services' reliability and augmented by a judicious selection of reputation services.

Accreditation and reputation services are very similar from the technical point of view, but have different contractual arrangements. Accreditation services are paid by the sender to vouch for the sender's good standing, and tend to provide positive ratings: current examples include http://www.bondedsender.com/ and http://www.habeas.com/. Reputation services are paid by the recipient (or not - most of them are run on a not-for-profit basis) to track the behaviours of senders, and tend to provide information about those with negative ratings: current examples include http://www.mail-abuse.com/.

The other part of DNA is a standard way to query accreditation and reputation services. This is very similar to existing RHSBL techniques: the sender's HELO name is concatenated with the name of the service and the resulting domain is looked up in the DNS. Instead of encoding information in an A record like most DNS blacklists, DNA specifies a format for a TXT record.
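
Mechanically it would look much like any other RHSBL query; here is a sketch using dnspython, with an invented accreditation zone name:

  import dns.resolver   # dnspython

  def dna_lookup(helo_name, service_zone):
      # Append the service's zone to the HELO name and fetch any TXT records.
      qname = helo_name.rstrip(".") + "." + service_zone
      try:
          answers = dns.resolver.resolve(qname, "TXT")
      except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
          return []
      return [txt.to_text() for txt in answers]

  print(dna_lookup("mail.example.org", "accredit.example.net"))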

This is all fairly straight-forward, though it's worth noting that DNA depends crucially on an authenticated HELO name in order to work. It can equally well use CSA or traditional HELO verification, though CSA can provide an independent reason to reject a message which traditional HELO verification cannot. IP-based blacklists also depend on authentication: the establishment of the TCP session under SMTP authenticates the sender's IP address.

Without an authenticated HELO name there are only a few heuristic HELO checks that DNA can perform without additional knowledge of the SMTP session. Yes, CSV - the combination of CSA and DNA - can replace all but one of my HELO checks: CSA implements the check for senders using the recipient's domain in HELO, and appropriately cunning DNA services can implement the checks for all-numeric and over-long HELO names, and even the check for randomly-generated invalid domain names. The remaining check requires knowledge of the first recipient's local part, which CSV doesn't have.

But could DNA be better?

If the DNA query includes both the client's HELO name and IP address, then the reputation service has everything it needs to implement CSA or traditional HELO verification, as well as the reputation lookup. This implies that the SMTP server doesn't have to implement these functions, and can just rely on maybe only a single DNS lookup to do all the work. This reduces the DNS load caused by the SMTP server significantly. It can also improve cacheing, because SMTP servers are generally not as centralized as DNS servers so can't usefully cache the results of DNS lookups.

Behind this and the more complicated un-enhanced DNA services I'm thinking of is the idea of a dynamic DNS blacklist service, which - instead of being based on a local database - uses computation and DNS lookups to produce a result. This could plausibly be implemented in Perl by hacking together a Perl DNS server (http://search.cpan.org/dist/Net-DNS-Server/ or http://www.stanford.edu/~riepel/lbnamed/Stanford-DNSserver/) and a fixed Net::DNS query library. (Sadly most DNS server back-end APIs don't allow for decent amounts of concurrency, so an easily hackable server is required.) You'd probably want to put a cacheing DNS server in front of the Perl monstrosity :-)

As well as implementing complicated logic like CSA, this kind of blacklist service could also be used to combine multiple DNS blacklists into a single virtual blacklist so that one query from the SMTP server effectively performs multiple blacklist lookups at the same time. With the enhanced DNA these could be both IP blacklists and HELO blacklists.
fanf: (Default)
I've spent a lot of time recently discussing email server authorization with the CLEAR working group.

The CLEAR WG is not an officially chartered IETF WG yet, but our general aim is to get a couple of fairly uncontroversial email security protocols pinned down. The first of these is CSV, which comprises two protocols: CSA and DNA. CSA retro-fits HELO authentication on to SMTP, and DNA specifies how to discover and use RHSBL[1]-like accreditation services. In the last several weeks we've mostly been working on CSA.

One of the features of SMTP is that the protocol starts off with the client saying HELO followed by its name. This is mostly redundant chatter, because SMTP servers are officially not supposed to reject messages if the HELO name is incorrect, and even if you would like to be strict about it you cannot, because one third of SMTP servers have misconfigured HELO names. (It's only mostly redundant chatter because SMTP clients say EHLO instead of HELO to indicate that they speak the current version of the protocol.) A side effect of this is that malware SMTP clients (spammers, viruses, etc.) lie like bastards in their HELO lines. You can detect a lot of this with some simple heuristics, but there's still a lot of slightly more sophisticated lying which can only be detected with the co-operation of the site that is being joe-jobbed[2].

CSA allows a site admin to state which of the machines are allowed to send email (and which are not) by putting SRV records in the DNS. SRV records have a few fields which CSA abuses to hold the authorization information, plus a target name. When you look up a SRV record the DNS server puts the addresses (A and AAAA records) of the target in the additional data section of the response. So an SMTP server can use a single DNS lookup to check that its client is authorized (from the SRV record) and that it isn't lying about who it is (by comparing the client's IP address with the additional data).
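
In code the client side of the check looks roughly like this (a dnspython sketch; the _client._smtp prefix is taken from the CSV drafts, the SRV fields that carry the actual authorization are glossed over, and a real implementation would take the addresses from the additional section of the one SRV response rather than doing a second lookup):

  import dns.resolver   # dnspython

  def csa_check(helo_name, client_ip):
      qname = "_client._smtp." + helo_name.rstrip(".")
      try:
          srv = dns.resolver.resolve(qname, "SRV")
      except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
          return "unknown"   # no CSA record; go looking for a default record
      target = srv[0].target
      addresses = {a.to_text() for a in dns.resolver.resolve(target, "A")}
      return "authorized" if client_ip in addresses else "not authorized"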

However there's a gaping hole in this scheme. Although you can fill your DNS zone with "not authorized" CSA records, this doesn't prevent attackers from making up a name that doesn't appear in the DNS and using it to bypass the CSA checks. This also undermines the planned accreditation and reputation systems based on the HELO name, because they'll have to deal with enormous numbers of made up names without any clear indication of whether this is the result of malice or misconfiguration. Made up HELO names may also sully the reputation of their purported parent domain.

So to plug this hole we needed a system for specifying that all authorized SMTP clients in a domain have a CSA record, and that any other name (whether present in the DNS or not) is implicitly unauthorized. This also saves DNS admins from filling their zones with CSA clutter. When an SMTP server is checking a HELO name, if the basic CSA lookup fails it can search for a default CSA record in parent domains, and thereby distinguish between implicitly unauthorized clients and clients in domains that don't know about CSA. The result is a solidly secure basis of authentication on which accreditation and reputation systems can stand.
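
The search for a default record is nothing more than walking up the name one label at a time, something like this (a sketch; the real mechanism bounds how far up the tree the search goes):

  def parent_domains(helo_name):
      # Candidate parent domains, from most to least specific.
      labels = helo_name.rstrip(".").split(".")
      for i in range(1, len(labels)):
          yield ".".join(labels[i:])

  print(list(parent_domains("pc123.dept.example.ac.uk")))
  # ['dept.example.ac.uk', 'example.ac.uk', 'ac.uk', 'uk']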

One nice thing that CSA lets you do is implement a more lenient default-unauthorized policy than by blocking port 25 (assuming, of course, that SMTP servers check CSA). If a clueful user wants to bypass your default-unauthorized CSA policy, they just need to configure their SMTP client to use a HELO name that isn't in your domain. The policy that applies is then that of the user's domain and not yours, as is any reputation which accrues. This feature may be particularly attractive to broadband ISPs and their customers.

The process of deciding the details of this was unduly painful - too much miscommunication and arguing at cross-purposes, and it would have been two or three orders of magnitude faster if we had been in the same room. However I think we've settled on a reasonably simple and efficient mechanism which minimises the number of lookups required to find a default record and maximises their cacheability. With any luck a new Internet Draft will be published soon.

It's worth noting that SPF is vulnerable to a variant of this attack, because most domains will only publish SPF records for their mail domains, alongside the MX records. However any domain name with an A record is also valid as a mail domain, so an attacker can send mail "from" any existing host name belonging to their victim. SPF is not vulnerable to purely made-up names, because many recipients perform a simple verification to ensure that the domain has an A or MX record in the DNS. HELO names don't have any equivalent to this, hence the need for CSA default records.

[1] RHSBLs are DNS blacklists that are keyed on domain names (in the case of DNA these are HELO names) instead of IP addresses.
RHS refers to the right-hand-side of an email address.

[2] You have been joe-jobbed if a spammer has sent out loads of email as if it were from you.
fanf: (Default)
Excellent idea from [livejournal.com profile] stephdiary in the pub last night: something like a proxyconfig.pac but for auto-configuring email software. This would be a boon because manual email configuration is too complicated for most users.

[livejournal.com profile] gerald_duck mentioned that he has written a C++ program which frobs the registry in order to achieve something like this for common MUAs running on NT or its descendants.
fanf: (weather)
Over the weekend I read The Alphabet by David Sacks, which is a history (or just-so story) of how our letters came to have their shapes and sounds. It's very readable and informative, and this Guardian review does a good job of summarizing the best bits.

http://books.guardian.co.uk/reviews/referenceandlanguages/0,6121,1112273,00.html

However it is a rather straight-line history. It fails to explore some of the interesting tributaries, such as the divergent shapes of S and Σ or Runic script or the influence of black letter and gothic scripts. The latter is an irritating omission given their brief appearance in the explanation for the shape of lower-case t.

I'd also like to know about the genesis of modern Hebrew and Arabic scripts which have diverged interestingly from our Greek inheritance. And, closer to home, some more comparison between the different sound values used by different languages would have been nice. There's a fair amount about g/j/y (which is required to explain their history) but some more about c/s with maybe something about Cyrillic to give it more context would have been useful.

But perhaps I'm asking a bit too much for a book that's already over 400 pages long.
fanf: (Default)
Five years ago, when I was working with adns, one of the things I played with a bit was a Perl wrapper. adns is absolutely fantastic for bulk log processing - being able to do more than 10,000 concurrent queries so that you're using all your CPU and not blocking on the network is a godsend. However C makes this more painful than it ought to be.

I never finished the perl wrapper because other things became more important, and when I next had the time and the inclination to look at it Net::DNS existed, so I thought there would be little point.

I've been paying gradually more and more attention to SpamAssassin recently, and it uses Net::DNS's background query feature to run all its DNS queries concurrently with its pattern matching. As a result of this I've found out that Net::DNS's background query handling is utterly stupid: it uses a separate socket for each query, rather than stuffing them all down the same socket and using the DNS protocol's query ID field to tie responses to queries.

This causes excessive resource usage which greatly restricts the number of concurrent queries it can handle, even on a sensible OS. On Windows it dies if the concurrency goes above about 350, which occasionally happens with SpamAssassin. http://bugzilla.spamassassin.org/show_bug.cgi?id=3924
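
What I mean by stuffing them all down the same socket is roughly this (a Python sketch using dnspython for the packet encoding, nothing to do with Net::DNS's internals; the resolver address is just an example):

  import socket
  import dns.message

  RESOLVER = ("127.0.0.1", 53)
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  # Fire off all the queries over one socket, remembering them by query ID.
  pending = {}
  for name in ("example.org", "example.net", "example.com"):
      query = dns.message.make_query(name, "A")
      pending[query.id] = query
      sock.sendto(query.to_wire(), RESOLVER)

  # Collect the responses in whatever order they arrive and match them up.
  # (A real implementation would also check the question section and guard
  # against ID collisions and timeouts.)
  while pending:
      wire, _ = sock.recvfrom(65535)
      response = dns.message.from_wire(wire)
      query = pending.pop(response.id, None)
      if query is not None:
          print(query.question[0].name, response.answer)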

So now I have the bit between my teeth. Must f1xx0r!
fanf: (Default)
Counting all offered messages (rejected or not), we saw 1 447 252 different HELO names in the last month. If I count the number of dots in each name, the resulting histogram is as follows. The small end (0-2 dots) is inflated by incompetence and forgery. The big end (>10 dots) is 99.99% abuse.

 25765
450511 .
218188 ..
432343 ...
197647 ....
 33647 ..... 5
 28485 ......
 19790 .......
  4582 ........
  2040 .........
  3069 .......... 10
  7005 ...........
  9483 ............
  7722 .............
  4390 ..............
  1840 ............... 15
   568 ................
   150 .................
    23 ..................
     3 ...................
     1 .................... 20


Of the messages we accept, 274 902 different HELO names were used (19% of the total). If I count the number of dots in each name, the resulting histogram looks like this:

 5723
69182 .
84906 ..
75131 ...
26182 ....
 4723 ..... 5
 4436 ......
 2686 .......
  279 ........
  123 .........
  123 .......... 10
  317 ...........
  447 ............
  320 .............
  211 ..............
   87 ............... 15
   21 ................
    4 .................
    1 ..................


A lot of these are clearly bogus, for example 80 characters of random
words concatenated with an IP address, like

Antigone.meter.ernet.ne.jpsouthparkmail.comnetlane.comlouiskoo.comjpopmail.comtw60.186.213.104

or a random collection of concatenated domain names, like

cave.ngs.ouse.hello.nlsammail.compcmail.com.twsouthparkmail.com

(These should obviously be added to my HELO heuristics!) After removing them, there are 272 890 HELO names. If I count the number of dots in each name, the resulting histogram looks like this:

 5723
69182 .
84905 ..
75130 ...
26176 ....
 4688 ..... 5
 4334 ......
 2521 .......
  179 ........
   47 .........
    0 .......... 10
    2 ...........
    3 ............


This still includes various stupidities. 26631 of the 37272 single dot names ending in com|net|org have no name servers so are invalid. Of the unfiltered list, 208323 of the 288884 com|net|org names are invalid.

Edit: Actually, if you use less-strict DNS validity checking those numbers are 22015 (instead of 26631) and 206556 (instead of 208323).
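
For the record, the histograms above boil down to nothing more than counting dots, something along these lines (a reconstruction, not the actual commands; it expects one HELO name per line on stdin):

  import sys
  from collections import Counter

  histogram = Counter(line.strip().count(".") for line in sys.stdin)
  for dots in range(max(histogram) + 1):
      print("%7d %s" % (histogram[dots], "." * dots))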

fanf: (Default)
I wanted to compile some statistics on the correctness of SMTP clients' HELO domains. This is exclusively for email coming into our MXs, so it doesn't include MUAs, which tend to be very broken in this respect.

Exim in our configuration checks that the HELO domain and the reverse DNS and the forward DNS all match. However I'm also interested in whether a forward lookup on the HELO domain matches the client's IP address, and Exim doesn't record this in the logs. A quick bit of hackery with adns, and a few minutes of 10,000 concurrent DNS queries later, I have my results:

Total rejections: 123921
Failed HELO checks: 101417
Forward DNS correct: 2128

Total accepted: 31754
Failed HELO checks: 13349
Forward DNS correct: 3196

So, today this machine has rejected 80% of incoming messages. According to
SpamAssassin about 15% of the messages we accept are spam so you might
want to adjust the numbers on that basis.

Of the rejected messages, 80% have a completely bad HELO domain, and 2%
have a HELO domain that's correct only in the forward direction.

Of the accepted messages, 32% have a completely bad HELO domain, and 10%
have a HELO domain that's correct only in the forward direction.
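
The forward check itself amounts to something like the following (a stand-in using the standard resolver; the point of adns is that getaddrinfo one query at a time is nowhere near fast enough at this volume):

  import socket

  def helo_forward_ok(helo_name, client_ip):
      # Does a forward lookup on the HELO domain include the client's address?
      try:
          addresses = {ai[4][0] for ai in socket.getaddrinfo(helo_name, None)}
      except socket.gaierror:
          return False
      return client_ip in addresses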

I really like adns :-)
fanf: (Default)
Talking about the irritations of mobile phones in the office, and reminded of Darren's phone which not only makes irritating noises, but also flashes irritatingly multi-coloured...

Imagine what will happen when they perfect hologram technology. Phones will be able to visually irritate at a distance, and will be just as unimaginative as ring tones. "Help me Obi-Wan Kenobi."

I cry out in pained anticipation.
fanf: (Default)
I've just rolled out the as-yet-unreleased Exim 4.44 on our central email servers (ppswitch). We're the first people to run it in anger, and I think it's the first time ppswitch has been quite this close to the forefront of technology :-)

A lot of the features in Exim 4.4x have been requested by me in order to improve the service on ppswitch, especially stuff related to submission mode, callout verification, logging, and DNS blacklisting. This helps to make the configuration nice and tidy as well as implementing the correct functionality.

A particularly nice feature in 4.44 is the ability to reject on the basis that a domain's name servers are operated by known spammers:
  deny message  = The name servers for the domain ${sender_address_domain} \
                  are listed at ${dnslist_domain} ($dnslist_value); \
                  See ${dnslist_text}
       dnslists = sbl.spamhaus.org/${lookup dnsdb{>:a=\
                                   ${lookup dnsdb{>:zns=\
                                     $sender_address_domain}}}}


My configuration file is now over 1000 lines, about half of which are blank or comments. I have another 200 lines of prototype configuration (which implement cryptographic bounce authentication) to test and roll out...
fanf: (photo)
The Abbot Ale / Daily Telegraph perfect pub award:

http://www.abbotale.co.uk/vote_online.php
fanf: (silly)
On the Today programme this morning (24 Nov), the Home Office minister Caroline Flint justified the recently announced crime bills on the grounds that terrorism is a new threat. This is offensively ignorant, and well illustrates the way the Labour government is exaggerating people's fears in order to gain political advantage.
fanf: (silly)
We've been getting lots of cauliflower in the organic veg boxes recently, and in the last couple of weeks they have been getting odder.

Purple cauliflower: This variety apparently resulted from the discovery of a purple-coloured spontaneous mutant plant in a cauliflower field in the late 1980s, and is supposed to have an excellent flavour. The purple colour is related to the colouring of red cabbage. This Aussie distributor has an interesting web site: http://www.purplecauliflower.com.au/

Romanesque Cauliflower or Broccoli: This is even more extravagantly fractal than your usual cauliflower or broccoli. Unfortunately Google's not coming up with a decent web site about this most mathematical of vegetables. So here are some miscellaneous links: http://www.livejournal.com/users/pipu/215266.html http://cbsnewyork.com/tony/tonytanillo_story_283153435.html http://www.artonshow.biz/info/notes/note2pineapple.htm http://en.wikipedia.org/wiki/Broccoli http://alt.venus.co.uk/weed/fractals/romanesco.htm http://www.samcooks.com/relish/cauliflower.htm