fanf: (silly)
This has to be the funniest support request I have seen in a long time. How can I reply without making sarcastic comments about there obviously being no problem because the user's stupid email disclaimer tells the recipient to delete the embarrassing message? Extra points, of course, for sending an "urgent" message after 17:00 on a Friday.

[fx: fanf files off the serial numbers]

Help!

I wonder if you can help me please. I have sent an email to someone by mistake. It has confidential information on it that the recipient shouldn't see.

[...]

This e-mail (together with any files transmitted with it) is intended only for the use of the individual(s) or the organisation to whom it is addressed. It may contain information which is strictly confidential or privileged. If you are not the intended recipient, you are notified that any dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by return e-mail (or telephone) and delete the original message.
fanf: (Default)
One thing I skimmed over in my previous article is addressing. A system's addressing architecture is often a good basis for explaining the rest of its architecture.

The Internet's addressing architecture was originally very simple. There were straightforward mappings between host names and IP addresses, and between service names and port numbers. The general model was that of academic computing, where a large central host provides a number of different services to its users.

However it isn't completely clean: port numbers aren't just used to identify services, they are also used for multiplexing. Furthermore, multi-homing adds complexity to the host addressing model.

This simplicity didn't survive beyond the mid 1990s, because it is too limiting when you get away from mainframes. Nowadays it is common for multiple host names to map to the same IP address, or for a host name to map to multiple IP addresses. We often run multiple instances of the same service on a host, rather than single instances of different services. A set of related services (such as IMAP/POP/SMTP) is often run on different (but related) hosts.

One thing that the Internet does have now that it didn't then is a well-developed application-level addressing system - the Uniform Resource Identifier. (Probably the most interesting early application-level address is the email address, followed by pre-URL ftp locators.) One consequence of the over-simple foundation that URIs are built on is that they end up being somewhat redundant: e.g. the www in <http://www.cam.ac.uk/> or the second imap in <imap://fanf2@imap.hermes.cam.ac.uk/>.

In my model I divide the problem into addressing, routing, and multiplexing. Addresses are used to establish a session, including selection of the service, and they are only loosely-coupled to the route to the server. Routing gets packets between the programs at either end, so there are multiple routing endpoints per host to support concurrent sessions. Multiplexing within the session is no longer muddled with service selection: it just divides the packets into requests or streams etc.

In the previous article I said that if you squint you can view DNS as a vestigial session layer, which does the mapping from application-level addresses to routes. Note that in most cases the DNS lookup doesn't include any mention of the service, which is why it gets encoded in host names as I pointed out above. Some applications make more advanced use of the DNS and avoid the problem, which is why you can have email addresses and Jabber IDs like <fanf2@cam.ac.uk> rather than <fanf2@mx.cam.ac.uk> or <fanf2@chat.cam.ac.uk>.
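
For example, the service indirection lives in the DNS rather than in the address. The records look something like these (a sketch - the values are illustrative, not the real ones):

	cam.ac.uk.                    IN MX   10 mx.cam.ac.uk.
	_xmpp-server._tcp.cam.ac.uk.  IN SRV   0 0 5269 chat.cam.ac.uk.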

The full session layer I have in mind is much more dynamic than this, though. It ought to be an elegant replacement for the routing and reliability hacks that we currently use, such as round-robin DNS, load-balancing routers, application-level redirecting proxies, BGP anycast, etc. etc. Think of something like IMAP mailbox referrals or HTTP redirects, but implemented in an application-agnostic manner.

All very pie/sky...
fanf: (Default)
The Internet architecture is beautifully simple and fantastically successful. But it is wrong.

The Internet architecture was born of an argument between circuit switching and packet switching. It correctly argues that packet switching is more fundamental and therefore more powerful: it is easier to implement data streams efficiently on top of a network optimized for packet switching than it is to implement datagrams efficiently on top of a network optimized for circuit switching. A crucial part of the Internet's design is the end-to-end argument, which says that reliable communication can be completely correctly implemented only at the endpoints of a link; any reliability features of the link itself are optimizations, not fundamentals. Hence the idea of intelligent endpoints that actively participate in maintaining a connection, rather than relying on the network to do all the work. However the Internet takes datagram fundamentalism too far, and at the same time it fails to take full advantage of intelligent endpoints.

In particular, Internet hosts - endpoints - are completely ignorant of the details of routing, hence architecture diagrams that depict the network as a cloud. Packets are shoved in and magically pop out wherever their addressing says they should. One consequence of this is that the network must know how to reach every host. The Internet makes this feasible by assuming that hosts with similar addresses have similar routing: although there are hundreds of millions of hosts on the Internet, core routers only have to deal with a couple of hundred thousand routes. However the fact remains that any site which has multiple connections to the Internet must run at least one router which keeps track of practically all the routes on the Internet. This severely limits the complexity of the Internet. (And it isn't fixed by IPv6.)

This is not such a problem for an Internet of static devices or of intermittently connected devices, but if you want to support properly mobile devices which maintain communications seamlessly whilst changing their connectivity, you have a problem. The job of core routers now scales according to the number of devices not the number of organizations, and our technique ("CIDR") for aggregating routes based on topology no longer works. The topology changes too fast and is too fine-grained. So mobility on the Internet uses a new routing layer above the basic Internet infrastructure, to work around the scalability problem.

Even in the absence of mobility, core Internet routers have an extremely difficult job. Not only do they have to forward packets at tens of gigabits per second, but they must also maintain a dynamic routing table which affects every packet forwarding action, and they must communicate with other routers to keep this table up-to-date. Routers in circuit-switched networks are much simpler, and therefore cheaper and easier to manage. RFC 3439 has a good discussion of the complexity and cost trade-offs. It isn't an Internet hagiography.

An important corollary of the end-to-end argument is that security must be implemented end-to-end - after all, security is to a large extent a reliability problem. But as a consequence, whereas the Internet relies too much on the network for routing, it relies too much on the host for security. (This is partly, but not entirely, a consequence of the core protocols being mostly concerned with working at all, let alone working securely, and all the users being trusted in the first two decades.) So IP provides us with no help with managing access to the network or auditing network usage. It has no place for a trusted third party or mediated connectivity.

That does not mean that these are impossible to implement on the Internet - but it does mean they break things. Firewalls and NATs simplify routing and management, but they have to implement work-arounds for higher-level protocols which assume end-to-end connectivity. And "higher-level" can be as low-level as TCP: firewalls often break path MTU discovery by blocking crucial ICMP messages, and NATs often break TCP connections that stay idle too long.

Which (sort of) brings us to the upper levels of the protocol stack, where end-to-end security is implemented. This is where I get to my point about the need for a session layer. The particular features I am concerned with are security and multiplexing. You can compose them either way around, and the Internet uses both orders.

In HTTP, multiplexing relies on raw TCP: you use multiple concurrent TCP connections to get multiple concurrent HTTP requests. Each connection is secured using TLS, and above that, application-level functionality is used to authenticate the user. Similar models are used for DNS(SEC) and SIP.

In SSH, the TCP connection is secured and authenticated first, and this foundation is used as the basis for application-level multiplexing of streams over the connection. Similar models are used for Jabber and BEEP.

The problem with HTTP is that re-securing and re-authenticating each connection is costly, so complexity is added to mitigate these costs: TLS session caches shorten connection start-up, HTTP/1.1 allows multiple requests per connection, and techniques like cookies and session keys in URLs avoid the need to re-authenticate for each request.

The problem with SSH and BEEP is that multiplexing streams requires a windowing mechanism so that one busy stream doesn't monopolize the link and starve quieter streams. However TCP already has a windowing mechanism, and in the event of poor connectivity this interferes with the upper layers. TCP-over-TCP is a bad idea but similar arguments apply to other upper layers.

What is missing is a proper session layer, which is used for authentication and to establish a security context, but which is agnostic about multiplexing - datagrams, streams, reliable or not, concurrent or not. Every Internet application protocol has had to re-invent a session layer: mapped to a TCP connection, as in SSH, or mapped to an authentication token, as in HTTP. This goes right back to the early days: in FTP, the session corresponds to the control connection, and multiplexing is handled by the data connections.

As well as managing security and multiplexing, a session layer can manage performance too. At the moment, we rely on TCP's informal congestion control features: the Internet works because practically everyone implements them. However in the mid-1990s, people were seriously worried that this wouldn't be sufficient. The rise of HTTP meant that bulk data transfer was happening in shorter connections which didn't give TCP enough time to measure the available bandwidth, so it would tend to over-shoot. HTTP/1.1 and the dot-com overspend helped, but the problem is still there and is once more rearing its head in the form of multimedia streaming protocols. A session can share its measurement of network properties across all its constituent traffic.

My assumption is that sessions will be relatively heavyweight to set up and relatively long-lived: more like SSH than HTTP. The shortest session one typically sees is downloading a page (plus its in-line images) from a web site, which is long enough to justify the setup costs - after all, it's enough to justify HTTP/1.1 pipelining which isn't as good as the multiplexing I have in mind. But what about really short transactions? I do not believe they occur in isolation, so it's reasonable to require them to be performed within a session.

But what about the DNS? In fact I see it as a vestigial bit of session layer. The endpoint identifiers we use in practice are domain names, but to talk to one we must first establish connectivity to it, which requires a DNS lookup. Admittedly this is a bit of a stretch, since the DNS lookup doesn't involve an end-to-end handshake, but it can involve a fair amount of latency and infrastructure. The first one is especially heavyweight.

And establishing connectivity brings me back to routing. Why not use a more distributed on-demand model for routing? More like email than usenet? More like the DNS than hosts.txt? Then your router should be able to scale according to just your levels of traffic and complexity of connectivity, instead of according to the Internet as a whole.

When you set up a session with a remote host, you would establish not only a security context, but also a routing context. You can take on some responsibility for routing to make the network's job easier. Perhaps it would be simpler if addresses were no longer end-to-end, but instead were more like paths. Routers could simply forward packets by examining a pre-established path rather than based on a dynamic routing lookup - and this would be secure because of the session's security context. Separate infrastructure for session set-up would deal with the changing connectivity of the network, instead of routers. Because you participate in routing, you can co-operate actively as well: if you are mobile you can reconfigure your route as you move, without breaking the session.

I quite like this idea, but I really don't know how it could be implemented. See how long people have been working on better infrastructure at similar levels: IPv6, DNSSEC. Maybe it could be built on top of IPv4 instead of replacing it. SCTP, a replacement for TCP, has many of the multiplexing and multihoming features, but it doesn't address routing. And speaking of that, I'm not sure how to manage the millions of sessions flowing through a backbone router without requiring it to know about them all. Sessions would have to be handled in aggregate (according to local topology) but you still have to allocate bandwidth between them fairly...

Anyway, a fun idea to discuss in the pub.
fanf: (Default)
A couple of weeks ago I was working on the SMTP QUICKSTART specification, which optimizes the start of the SMTP connection.

http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-smtp-quickstart.html

I have now written a specification for SMTP transaction replay, which allows you to use the existing CHUNKING and PIPELINING extensions to achieve the maximum speed possible for bulk data transfer, whilst still behaving correctly if the connection is lost. I imagine this will be most useful for clients on slow intermittent links, like mobile wireless users or people on the wrong end of a satellite link.

http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-smtp-replay.html

Gosh, over 5000 words...
fanf: (photo)
I've just been talking to the features editor of Varsity, one of the University's student newspapers. She sent us an email on Saturday asking if we would talk to her about Hermes, because "it is such an essential part of student life yet most people know nothing about it. It just is." I said yes, since I try to project a friendly face for the Computing Service.

So if you get a copy of this Friday's Varsity, expect to read about how we deal with dead users, how Hermes has only caught fire once, growing pains, and speculation about life before email.
fanf: (Default)
http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/antiforgery/draft-fanf-smtp-quickstart.txt

This version has a basic subset which should be easy to implement, and which brings the MAIL command forward from the client's 9th packet to its 5th in typical use.

Full QUICKSTART allows the client to pipeline commands before the server greeting and after the TLS handshake in a way that should be safe, albeit not so trivial to implement. If all goes well, the MAIL command appears in the client's 3rd packet, which I think is pretty good :-) If the client has to recover from a cache miss, it should be no slower than basic QUICKSTART.
fanf: (Default)
http://lists.oarci.net/pipermail/dns-operations/2006-February/000122.html

At the moment there are a lot of DNS-based attacks going on. They generally rely on spoofed queries, where an attacker sends a forged DNS query to an open resolver (the reflector) which sends a large response (amplification) to the victim. A lot of people are saying that wider implementation of BCP38 would significantly reduce the problem, because that requires ISPs to filter spoofed packets at their borders. However the DNS relies on referrals from one name server to another, which can be used for reflecting and amplifying attacks even when UDP forgery is prevented.
fanf: (Default)
Our email volume is following a strange curve. The following numbers come from a week in mid-February (specifically the 11th-17th) for the last four years:
        messages     GB
2003   1 143 641    30.97
2004   2 480 215    62.11
2005   2 199 303   117.86
2006   2 925 334   162.70
Why the drop in message count in 2005? Architectural changes to our email systems meant that fewer messages were going through ppswitch more than once; this year the count has increased again because of the unbundling of the mailing list system. It may also be a result of changes in the behaviour of email viruses (which contribute to the count).
fanf: (Default)
I've been discussing my ideas on the Lemonade mailing list.

Unsurprisingly, the gurus there didn't like my attempt to rehabilitate smtps, mainly for political reasons. The RCPTHDR extension probably only has a future as input to the RFC 2822 or RFC 2476 updates.

People seemed fairly keen on QUICKSTART. I have updated it to shave off a couple of round trips from connections that use STARTTLS.

Further analysis of the BURL draft revealed that its pipelining changes aren't totally correct - see my message on the subject. The upshot is that you can't save a round trip for AUTH.

However, a closer look at the CHUNKING spec revealed that it has a very desirable property: it completely eliminates synchronization points once you have got the connection going and you are sending messages. With basic ESMTP, there is a synchronization point at the DATA command, which has the effect of limiting your sending speed to at most one message per round-trip. CHUNKING allows you to send at the maximum speed TCP can sustain. Nice. The CHUNKING RFC needs a little clarification about its interaction with PIPELINING, but its author has a draft update in the works which will do the trick.
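
A rough sketch of what that looks like on the wire (addresses and sizes made up): with PIPELINING and CHUNKING the client can keep sending envelopes and BDAT chunks back-to-back, and just read off the server's replies as they arrive:

	C: MAIL FROM:<a@example.org>
	C: RCPT TO:<b@example.com>
	C: BDAT 1234 LAST
	C: (1234 octets of message data)
	C: MAIL FROM:<a@example.org>
	C: RCPT TO:<c@example.com>
	C: BDAT 2345 LAST
	C: (2345 octets of message data)
	S: 250 sender ok
	S: 250 recipient ok
	S: 250 message accepted
	S: 250 sender ok
	S: 250 recipient ok
	S: 250 message accepted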

I have more to say about sending email efficiently in future essays on how not to design MTAs.
fanf: (Default)
There was recently (mostly on the 19th and 20th) a thread on the IETF discussion list about whether round-trip times are still a concern. The combination of that, my recent thinking about message submission, and the LEMONADE working group's efforts to streamline it led to the following.

At the moment, a message submission packet trace to a system like Hermes goes like this: ( very long packet trace )
That's TWELVE round-trips, which can easily take over a second if you are any distance from the server. Obviously, there's a lot of scope for streamlining.

draft-fanf-smtp-tls-on-connect attempts to resurrect the old smtps port, which is still frequently used in practice, but is frowned on by the IETF. This saves three round trips.

draft-fanf-smtp-quickstart slightly streamlines the ESMTP startup, to save another round trip.

draft-ietf-lemonade-burl, amongst other things, allows a client to pipeline single-exchange AUTH commands, and the message envelope and data. This saves two round trips.

draft-fanf-smtp-rcpthdr allows a client to avoid re-stating email addresses, which doesn't save round trips, but does reduce the chance that the client's submission will overflow TCP's initial window and thereby incur an extra round trip delay.

The result looks like this: ( shorter trace )
That's six round trips, so half the elapsed time of message submission according to the current specifications.

I wonder if anyone will like my draft specs...

Edit: Actually you can pipeline the message data and quit command, which reduces the round-trip counts to eleven and five. See also draft-fanf-submission-streamlined.
fanf: (silly)
I just noticed that there's a Swedish Haskell researcher called Björn Lisper...
fanf: (Default)
The kind of security I am concerned with in this article is total compromise. The other major security problem is denial of service, which I'll cover separately.

Both problems arise from buggy code, typically buggy string handling code, so anything that reduces the likelihood of string handling bugs is a good thing. The most effective thing to do is not to use traditional low-level C style: no pointer arithmetic, no fixed-size buffers (especially on the stack!). Instead, use higher-level constructs so that you can write your code as if you were using a scripting language.
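
For instance (just a sketch, nothing from any particular MTA), instead of sprintf() into a fixed stack buffer, let the library allocate whatever the result needs. asprintf() is a GNU/BSD extension, but an equivalent is only a few lines of vsnprintf():

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <stdlib.h>

	/* Build an SMTP-style reply line; the buffer is allocated to fit,
	 * so an unexpectedly long text argument can't smash the stack. */
	char *make_reply(int code, const char *text) {
		char *reply;
		if (asprintf(&reply, "%03d %s\r\n", code, text) < 0)
			return NULL;	/* allocation failed */
		return reply;		/* caller free()s it when done */
	}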

However vulnerabilities will remain, so we should consider further.

For example, DJB has a maxim "Don't parse". He argues that parsers should be reserved for user interfaces, and that good program-to-program interfaces should not need parsers. But this is impossible: he actually means that they should only need the simplest possible parsers.

Most program-to-program interfaces involve some kind of protocol, and all protocols need parsing. This is true in the DJB sense for many Internet protocols (including SMTP and HTTP) which are designed to be friendly to humans as well as programs, but it is also true for protocols that are designed only for software, such as ASN.1. Binary protocols are just as vulnerable to catastrophic implementation errors as textual protocols, but less amenable to our huge stable of text-handling tools. So perhaps DJB's dichotomy of "good interfaces" and "user interfaces" should be joined with "bad interfaces".

Not only do protocols need parsers, but other requirements often mean that you can't take the DJB approach of paring the parser to the bone. Full implementations of SMTP need a fair amount of parsing, not just of commands and responses, but also of the message (especially for message submission). Furthermore, these days spam and viruses are a much bigger problem with email than buggy MTAs, so an MTA needs adequate defences against them, and these defences should be deployed as early in the message handling sequence as possible. You want to minimize effort wasted handling the junk and other bad effects such as collateral spam.

This is a lot of code to expose to the big bad net. Can't we partition it in an attempt to keep vulnerabilities contained? Yes, but this approach has limitations. It isn't enough to just separate the MTA into multiple programs or processes: if they are running under the same UID they are within the same security boundary, and even if the other processes can't be compromised through bugs, they can be through ptrace().

For example, except for its privileged parts Postfix runs under a single unprivileged UID, so it does not have any internal partitions. It allows you to run various of its daemons in a chroot, but this does not increase safety much. Code insertion attacks via ptrace() work between any programs running under the same UID, in the chroot or not, so they can be used by a compromised program to escape from its chroot even without root privilege.

Partitioning increases complexity, because you have to invent a protocol for the partitions to communicate with each other. For a modern MTA this protocol can't be DJB-bare-bones-simple, because of the features you must support. For example, the SMTP server must be able to verify addresses, which is less complicated than delivering a message, but cannot be deferred. You need something more than just dropping a file in a queue directory.

Given that you have written a robust protocol engine fit for exposing to the big bad world, it's tempting to re-use it for the MTA's internal communications. This would be a mistake. Bugs in this engine are the ones that will lead to compromise, so if you can compromise the server's front end you can probably use the same bug to hop the next security boundary into the MTA's core.

For example, Postfix has a single record format used for queue files and IPC. Postfix's sendmail command generates a queue file in the context of the calling user and drops it in the queue using a privileged program. A (hypothetical) serious bug in Postfix's record handling code could be exploited by a malicious user who crafts a file that triggers the bug and thereby gains control of the drop directory. It's likely that the same bug could be used to compromise the rest of the MTA from that beachhead, via a queue file or via IPC. This attack is much easier because Postfix exposes its internal communications protocol - if it didn't, the user couldn't do anything useful with the crafted file.

So, to summarize, if you are going to partition for security:
  • use different UIDs for each partition;
  • don't use the same UID inside and outside a chroot;
  • use different protocols across different trust boundaries;
That last suggestion requires enormous faff, but in fact it happens as a matter of course for much of the code we are worried about: for example, separate anti-virus and anti-spam daemons such as SpamAssassin will have their own IPC protocols. This extends to the MTA's routing engine too, with more protocols for querying the DNS, or LDAP or a SQL database etc.

So the question is whether these boundaries are adequate, or if it makes sense to further partition the MTA. There are essentially two routes via which malicious people can talk to us - the SMTP server and the SMTP client - so we might want to partition them off from the routing code. The SMTP server is both the most vulnerable and the most complicated, but it still needs to talk to other software on the system. So it's difficult to significantly reduce our exposure, it would cost a lot in complexity, and therefore it's probably not worth it - and in fact only DJB thinks it's worth having more than one UID for his MTA.
fanf: (Default)
One of the problems with SMTP is that there is only one server reply to the message data, regardless of the number of recipients. This means that you can't apply per-recipient message data policies (such as different SpamAssassin score thresholds) without ugly hacks.

CMU Cyrus uses a special protocol for local delivery called LMTP, the Local Mail Transfer Protocol. It is the same as SMTP except that the initial client command is LHLO instead of EHLO, and the server gives multiple replies to the message data, one per recipient. The reason Cyrus wants this functionality is quota checks: it wants to be able to deliver the message to the recipients that are below quota, but defer delivery to other recipients so that retries are all handled by the MTA.
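
A sketch of the interesting part (addresses and wording made up): there is one reply per accepted RCPT after the final dot, so the server can deliver to one user and defer the other in the same transaction:

	C: LHLO client.example.org
	S: 250 mail.example.org (capability list elided)
	C: MAIL FROM:<sender@example.net>
	S: 250 ok
	C: RCPT TO:<alice@example.org>
	S: 250 ok
	C: RCPT TO:<bob@example.org>
	S: 250 ok
	C: DATA
	S: 354 go ahead
	C: (message text)
	C: .
	S: 250 2.1.5 alice@example.org delivered
	S: 452 4.2.2 bob@example.org over quota, try again later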

But, as I explained in the first paragraph, this kind of feature would be useful for inter-domain mail transfer as well as mail delivery. There's no particular reason that you couldn't use LMTP as it is in this context, and fall back to ESMTP if the LHLO command is rejected. But this is gratuitously ignoring SMTP's extension architecture, and apart from being ugly, it costs an extra round-trip at the start of a connection.

So I have been thinking of knocking up a quick I-D to re-cast LMTP as a proper SMTP extension and advocate its use everywhere. But, wonderful lazyweb! it has already been done: http://www.courier-mta.org/draft-varshavchik-exdata-smtpext.txt (though I would have done some of the details differently).

Of course this is of dubious utility if the spammers don't also use it...
fanf: (Default)
(I say part 1, but don't expect the sequels to arrive quickly.)

The sendmail command is the de facto standard API for submitting email on unix, whether or not it is implemented by Sendmail. All other MTAs have a sendmail command that is compatible with Sendmail for important functions. (Notice how I pedantically use the uc/lc distinction.)

The traditional implementation

Sendmail and its early successors (including Exim) have been setuid root programs that implement all of the MTA functions. They are also decentralized, in that each instance of Sendmail or Exim does (mostly) the whole job of delivering a message without bothering much about what else is going on on the system. The combination of these facts is bad:

(1) A large setuid root program is a serious vulnerability waiting to happen. Sendmail has a long history of problems; Exim is lucky owing to conscientiousness rather than to good architecture.

(2) Particularly subtle problems arise from the effects of what sendmail inherits from its parent process, such as the environment and file descriptors. For example, consider sendmail invoked by a CGI. If the web server is careless and doesn't mark its listening socket close-on-exec, the socket is inherited by the CGI and thence sendmail, which may then take ages to deliver the message. You can't restart the web server while this is going on, because a sendmail process is still listening on port 80, which means you can't restart the web server at all if the CGI is popular.

(3) Independent handling of messages makes load management very difficult. This is not only the load on the local machine, but also the load it imposes on the recipients. Sendmail and Exim lack the IPC necessary to be able to find out if the load is a problem before it is too late.

The qmail approach

Message submission in qmail is done with the qmail-inject program. This performs some header fix-ups, and it can extract the message envelope from the header rather than taking the envelope as separate arguments (like sendmail -t as opposed to plain sendmail). It then calls the setuid qmail-queue to add the message to the queue.

(4) The simple, braindead qmail-queue program does not impose any policy checks on messages it accepts, because that would be too complicated and therefore liable to errors. The fix-ups performed by qmail-inject are within the user's security boundary, not the MTA's, so they are a courtesy rather than a requirement.

The Postfix approach

Postfix is very similar to qmail as far as message submission is concerned, except that rather than fixing up a message, its sendmail command transforms the message into Postfix's internal form before handing it to postdrop, which drops it in a queue. The fix-ups are performed later by the cleanup program, which also operates on messages received over the network. Which brings us to:

(5) Sendmail, qmail, and Postfix do not have an idea of message submission versus message relay. For example they tend to fix up all messages, wherever they come from - or in qmail's case, not fix them at all.

Step back a bit

So what are the requirements?

(a) A clear security boundary between local users and the MTA. Note that all the MTAs rely on setuid or setgid programs that insert messages directly into the queue. Postfix and qmail ensure they are relatively small and short-lived, but they are still bypassing the most security-conscious part of the MTA, i.e. the smtp server. This opens up an extra avenue for attack - albeit only for local users. But why do they need special privileges?

(b) Policy checks on email from local users along the same lines as those from remote MUAs. Is this message submission or message relay? Does it need to be scanned for viruses? What are the size limits? Does address verification imply this user (e.g. nobody) cannot send email at all?

If you have a sophisticated system for smtp server policy checks, why bypass that for local messages? Exim can sort-of do what I want, but it retro-fits the policy checks onto the wrong architecture.

The fanf approach

The sendmail program is a very simple RFC 2476 message submission client: it talks SMTP to a server and expects the server to do the necessary fix-ups. It doesn't need any special privilege: from the server's point of view it is just another client.

It's not quite as simple as that. You need to authenticate the local user to the server, because users should not be able to fake Sender: fix-ups, and there are situations when you will want to treat users differently, e.g. email from a mailing list manager. So instead of SMTP over TCP, talk it over a unix domain socket, which allows unforgeable transmission of the client user ID.
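
A minimal sketch of the server's side of that (Linux spells it SO_PEERCRED; the BSDs have getpeereid() or LOCAL_PEERCRED instead):

	#define _GNU_SOURCE
	#include <sys/types.h>
	#include <sys/socket.h>

	/* After accept()ing a connection on an AF_UNIX socket, ask the
	 * kernel which local user is on the other end; the client can't
	 * forge this, so submission policy can safely depend on it. */
	int peer_uid(int fd, uid_t *uid) {
		struct ucred cred;
		socklen_t len = sizeof(cred);
		if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
			return -1;
		*uid = cred.uid;
		return 0;
	}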

Problem (1) solved: no setuid or setgid programs.
Problem (2) solved: client process is short-lived and synchronous.
Problem (3) solved: messages all go through the same channel.
Problem (4) solved: messages all go through the same policy engine.
Problem (5) solved: the policy engine is powerful enough to know when submission-mode fix-ups are necessary.

One thing you do lose by this approach is that the sendmail command only works when the SMTP listener is running, which is not a problem with the other designs. But I'm not convinced this is a serious difficulty, and in fact it can be viewed as an advantage - it doesn't let email silently disappear into an unserviced queue.

A question arises with this architecture - one which also arises for remote MUAs - namely, where is the best place to generate the message envelope, i.e. the transport-level sender and recipient addresses? RFC 2476 says that the client does this job; however no-one has written a decent sendmail -t replacement, and even "serious" MUAs get this job wrong. Furthermore, the server still has to parse the header and perform various checks and fix-ups, so why shouldn't it generate the envelope too? Hence draft-fanf-smtp-rcpthdr, which also has the best description of how to handle submission-time fix-ups of re-sent messages.
fanf: (Default)
I just sent out an announcement about a new facility to help with the withdrawal of insecure access to Hermes. Computing support staff in the University can now add users to the notification schedule via a web page instead of having to go through us.

Within 10 minutes, three people had tested the form...
fanf: (Default)
Virus naming is generally a matter of consensus between the AV vendors, but occasionally that breaks down. A good example is "blackworm" aka "blackmal" aka "blueworm" aka "mywife" aka "nyxem". Our email AV system keys off the virus name to decide whether to delete a message or mangle it. (Sadly our current system can't reject messages at SMTP time.) This depends on us getting a reasonably unique name from the virus scanners, so that we treat messages appropriately. Sadly at the moment there's something nasty going around which McAfee is calling "the Generic Malware.a!zip trojan" and ClamAV is calling "Worm.VB-9". Can I have a proper name please so I can delete it and stop irritating people with mangled junk?
fanf: (Default)
In early drafts of this announcement I had written "there is a significant number of unofficial Hermes webmail login forms", which a couple of my colleagues corrected to "there are", so I checked. I had thought that my version was pedantic but correct, but I may have got that impression from an over-simplified version of the rule. It turns out (according to the Economist style guide and other sources) that the rule is "the number is" but "a number are". Subtle.

The best guideline is to rephrase away from controversial matters of style, which is what I did :-)
fanf: (Default)
So we got a message from reception about some unusual forwarded email. It turns out that these mediaevalists had been confused by some misaddressed email and blamed a virus. This caused a phone call to CS reception, which our staff followed up with an email asking for the problem messages to be forwarded to us for analysis. The messages were forwarded ON PAPER with a scribbled note.
fanf: (Default)
http://www.bbc.co.uk/radio4/connect/pip/dk7o2/

I just heard a pretty amazing programme about technology and the postal service. The thing that struck me was the way they have mechanized bulk sorting of letters. There are 73 major sorting centres in Britain which between them have 200-300 letter sorting machines, which do the obvious jobs of working out the letter's orientation and photographing it for OCR. (Really fast - 30,000 items per hour per machine.) What surprised me is that the OCR is not done on site, but instead the photos are transmitted over the post office's data network to a single centralized data centre which contains all the clever computers. Of course they aren't so clever that they can deal with all letters, so - second surprise - unrecognized letters are handled by sending the images to offices full of people who type in post codes all day.

No significant sorting intelligence is on the same site as the sorting machines.

Hmm, perhaps this is the manufacturer of the machines: http://www.abprecision.co.uk/businessunits/hsp/postalservices.htm

And perhaps this is a press release about the data centres: http://www.prnewswire.co.uk/cgi/news/release?id=59114
fanf: (Default)
So the Information Technology Syndicate Technical Committee Meeting happened this afternoon. I prepared a handout which (in the absence of any guidelines about what it should contain) was basically a brain dump. I quoted Blaise Pascal - "I didn't have time to make it shorter" - which (surprisingly) got a chuckle from the attendees. Rather than going through it in detail I talked briefly about the motivation for the project and the reasons for choosing Jabber and the things that I hope will make it popular. There were a few questions, the most penetrating of which were about whether we'll provide any kind of role addresses (the answer being no, because it requires support from the protocol which isn't really there yet) and whether spam will be a problem (maybe in the future if Jabber becomes very popular, but not yet, and at least Jabber is resistant to spoofing). All in all a fairly easy ride.

So maybe I'll finally (after three months of faff) be allowed to make progress... Next job: finish racking the irritating bastards Wogan and Parky.
fanf: (Default)
Hate.

The 1U dual Opteron servers for the Chat service (which will be called wogan and parky) have arrived and we tried to rack them this afternoon. Their chassis is designed so that the lid is mounted in the rack, and the base (containing the guts of the machine) slides in and out. The tops have screw threads near the back to which a couple of L brackets attach, which in turn tie the rear of the machine to the rack's rear verticals. These screw threads stick out enough to make the whole thing wider than the rack's verticals. Furthermore, the top has no slack within its nominal 1U, which makes it near impossible to mount machines next to each other.

TOTAL PAIN IN THE ARSE

We shall be having another go at them on Monday...
fanf: (Default)
I just found out about these cool Google Maps hacks which can show your Jabber presence and location. Neat!

http://map.butterfat.net/
http://jobble.uaznia.net/map
fanf: (Default)
http://www.cus.cam.ac.uk/~fanf2/hermes/doc/talks/2006-01-techlinks/

I did a talk this afternoon mostly about withdrawing insecure access to Hermes. Surprisingly few questions...

I did the slides using Eric Meyer's S5 which seems quite fun, though it suffers from the browser compatibility farce that plagues the web. (see also...)
fanf: (Default)
Following http://fanf.livejournal.com/47657.html

We successfully moved the backup server today, and it is now joined with its second EonStor. The first one has 16 x 250GB SATA-1 disks in 3U for 4TB raw capacity. (link) The second one has 24 x 500GB SATA-2 disks in 4U for 12TB raw capacity. (link) The system also has 1U of fibrechannel switch, 2U of PC, and 4U of tape robot, for a total 16TB backed-up disk in 14U of space. (Compare the 26U of space for the old 0.5TB NetApp system!) These disks store the third (warm) copy of everyone's email, conveniently close to the tapes so that we can spool off the fourth (cold, off-site) copy quickly.

The EonStors, like the NetApp disk shelves, aren't mounted on sliding rails: they're just bolted to the front vertical rails of the rack. This makes mounting them a three person job (two to hold either side of the unit while the third has a good screw). Best done with all the disks removed because that makes the units MUCH lighter. Most racking jobs can be done comfortably with two people; our tool-less Intel servers can be racked by one. The crucial thing that makes this possible is the pegs on the sides of the server that drop into the extended rails, which is much easier than having rails on the sides which (after careful alignment) slide into the rails in the rack.
fanf: (Default)
http://www.f-secure.com/weblog/#00000782

An example of why universal accept-then-discard virus handling is bad.
fanf: (Default)
On Tuesday we spent a good chunk of the afternoon emptying the last remaining two racks of the old Hermes system, including the rack that held the two NetApp F740s and 0.5TB of disk. (pictures) The NetApps have now gone where much old CS hardware eventually ends up: under a desk in Unix Support's office.

For the last 18 months the NetApps have been off, reserving space for the future expansion of Hermes. We now have a job for the space, and two shiny new racks to occupy it. This afternoon we attempted to move our backup server (on the right) to its new home, where it will be joined by a fibrechannel switch and another 12TB of disk.

However after removing the tape robot from its old rack and wheeling it across the machine room, we discovered that the vertical rails on the new rack had been set too close together. We rapidly gave up and put the box back where it came from. This machine has to have THE MOST IRRITATING rack mount kit ever, with lots of fiddly screws and small bits of metal that like to fall through holes in the floor. A complete pain in the neck, especially in comparison with the utter joy of toolless rackmount kits.

We will have to do the move next week, after the racks have been adjusted.
fanf: (Default)
http://www.ietf.org/internet-drafts/draft-kuhn-leapsecond-00.txt

<cam user=mgk25> has written up his smoothed leap seconds proposal as an Internet-Draft. This idea, which he originally described in October 2000, is a suggestion for reconciling POSIX time and UTC with a minimum of nasty side-effects.

The essential problem is that the two are incompatible: POSIX says that each day has exactly 86400 seconds, but UTC says a day may have 86399, 86400, or 86401 seconds. This isn't just a POSIX problem: it isn't possible to accurately model UTC with only a count of seconds no matter how you fudge it. Consider the difference between UTC and TAI, i.e. atomic time with and without leap seconds. They are both based on the same seconds, so they will both be represented the same way if you simply count those seconds. The count omits any information about which seconds are leap seconds, so it needs some additional information to be able to model UTC. POSIX leaves no space for this additional information in its basic time APIs.
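
A concrete illustration, using the leap second that has just gone by (the arithmetic is 13149 days of 86400 seconds each):

	2006-01-01T00:00:00Z  ->  POSIX time 1136073600
	2005-12-31T23:59:60Z  ->  no POSIX value of its own;
	                          implementations typically replay 1136073599

So a bare count of seconds cannot tell the leap second apart from the ordinary second before it.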

Markus's fudge is fairly well thought through, and since practicalities force us to have some kind of fudge, his is probably the best. However I can't stop being irritated that this is a fudge on top of a fudge, and if you remove the underlying fudge they both become unnecessary.

The underlying fudge is UTC itself, which is also trying to reconcile two incompatible standards: atomic time and astronomical time. The result is a rather complicated timescale which makes it hard to do common human-scale time-related computations, because it exposes some of the complexity of time synchronization to all the higher levels instead of encapsulating it. Ugh.

I'm a big fan of the effort to abolish leap seconds and instead use pure atomic time for civil time. Sub-second synchronization with the rotation of the earth isn't necessary for civil time-keeping tasks, as you can tell from the width of the time zones and their twice-annual adjustments. If atomic time gets too far from synchronization with astronomical time - maybe it'll be an hour out in a few thousand years - we can easily adjust the time zones, which is a common and simple operation (we do it twice a year). But I also appreciate Markus's pessimism about the likelihood of this abolition succeeding, and his defensive engineering in anticipation of the failure.

As a quick example of the problems caused by leap seconds, here's a nice bit of code for converting a Gregorian date into a Modified Julian Day number:
	int mjd(int y, int m, int d) {
		/* treat Jan and Feb as months 11 and 12 of the previous year,
		 * so each year effectively starts in March, after any leap day */
		y -= m < 3;
		m += m < 3 ? 10 : -2;
		return y/400 - y/100 + y/4 + y*365	/* days in whole years (Gregorian leap rule) */
		    + m*367/12 + d - 678912;	/* days in earlier months, day of month,
						 * minus an offset so 1858-11-17 is day 0 */
	}

You can easily adapt this to produce POSIX time values, by adjusting the epoch and multiplying by 86400. If you want to handle leap seconds, however, you have to add code to read a table of leap seconds, find the date in the table, and adjust the result accordingly. It all becomes at least ten times longer and a hundred times less portable.
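
For example, something like this turns the function above into a POSIX timestamp for midnight UTC (40587 is the MJD of 1970-01-01); handling leap seconds correctly would need the table lookup described above:

	/* POSIX time of 00:00:00 UTC on the given Gregorian date. */
	long long posix_midnight(int y, int m, int d) {
		return (long long)(mjd(y, m, d) - 40587) * 86400;
	}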
fanf: (Default)
http://www.ietf.org/html.charters/widex-charter.html

I don't understand the point of this. What does it give us that we can't already do with AJAX?
fanf: (Default)
http://www.livejournal.com/users/fanf/46715.html

Turns out this is a bug in jabberd2. What my server sends to Google is <presence xmlns='jabber:client' from='dot@dotat.at' to='tony.finch@gmail.com' type='subscribe'/>. Note that the XML namespace is set to jabber:client despite the fact that this is a server-to-server connection. D'oh!

http://j2.openaether.org/bugzilla/show_bug.cgi?id=159

In any case, Google have added an interop work-around and I can now communicate between dot@dotat.at and any @gmail.com Jabber user. Sweet.
fanf: (Default)
http://ralphm.net/blog/2006/01/17/gtalk_s2s

And in fact fanf@jabber.org and tony.finch@gmail.com can talk to each other quite happily.

However, when my dotat.at jabberd2 talks to Google, it completes dialback, then it responds to my first message or presence stanza with <stream:error> <unsupported-stanza-type xmlns="urn:ietf:params:xml:ns:xmpp-streams"/> </stream:error>.
fanf: (Default)
I just got a phone call as a follow-up to today's IT Syndicate meeting. This was the meeting at which my paper on the Chat service was presented. I have been asked to give a talk to the IT Syndicate Technical Committee in two weeks to "enlighten them about Jabber", whatever that means. I've asked them to give me some specific questions they would like answered or to indicate which parts of my briefing paper that they would like me to expand on - I don't know if they want a speaking-to-managers or a speaking-to-techies talk.

But in any case, Bah! and Faugh! How long does this have to take? This started as a skunk works project in October, and I've now been waiting nearly three months to get permission to put _xmpp-{client,server}._tcp.cam.ac.uk SRV records in the DNS.

Update: Looks like it'll be a speaking-to-techies talk, probably including a protocol overview and stuff like that.
fanf: (Default)
Following on from <http://www.livejournal.com/users/fanf/44881.html>, <cam user=jpk28> kindly lent me a couple of non-Mac PCI graphics cards to see if I could get any further with them.

(2) 1995-vintage S3 Vision968

This was amusing. The card is only a little bit more recent than my 1993-1994 gap year at inmos, where at the time they made palette-DAC chips for graphics cards and had a respectable share of the market. However this card has an IBM DAC, not an inmos SGS-Thomson STMicroelectronics one.

Xorg -configure took rather a long time to run, and this turned out to be because it thought there were 110 S3 cards in the machine and enormous numbers of PCI buses. I edited the generated configuration file to be something more reasonable and tried starting X. The machine wasn't very happy about this: X sort of hung (I can't remember if I managed to kill it or if I had to reboot) and the ethernet card lost its interrupts. Not much success there at all. Since the card has a practically useless 2MB vRAM, I gave up fairly quickly.

(3) 1997-vintage ATI 3D Rage Pro PCI

By the time of this card, separate palette-DAC chips were a thing of the past. It has 4MB vRAM which is just barely tolerable.

The Xorg ATI driver claims that it should recognize this card as a Mach64 series card, but it doesn't, and X instead falls back to the VESA driver. DDC manages to get useful information from the monitor, which is good, but it autoconfigures with too many pixels to be able to maintain a decent refresh rate. 60Hz is nasty.

Multi-head X almost worked, except that when I moved the pointer to the secondary screen it disappeared. Juggling things around (virtual positions of screens, primary/secondary numbering) didn't improve matters - sometimes the VESA screen would have a corrupted display, sometimes X would get confused about where the boundary between the screens was (mouse pointer appearing 1280 pixels from the left of the 1600 pixel screen). I could run X on one screen at a time fine, but not both together.
fanf: (Default)
At the UKUUG winter spring conference I'm going to be presenting a paper on my email rate-limiting work. This gives me some incentive to work a bit more on its deployment :-)

I've been discovering that it's very hard to set a limit which minimizes the inconvenience (e.g. admin work maintaining the list of ratelimit exemptions; false positives because people don't realize they need to warn us about bulk email beforehand) but at the same time provides decent protection against unwanted floods. The spam incident last term illustrated this well: I had (foolishly) assumed that spam would typically be one recipient per message, but the spammers managed to find a hole that allowed them many recipients per message, so my ratelimiting system didn't spot the flood.

This problem is similar to the problem of setting an appropriate work factor for anti-spam proof-of-work systems: http://www.cl.cam.ac.uk/~rnc1/proofwork.pdf

So I'm now experimenting with per-recipient limits in conjunction with per-message limits, to see how awkward it is to set the limits for that - probably just as bad, but per-recipient limits are closer to what we actually care about.

I've also had an idea about making the countermeasures less irritating for the user. Rejecting the message (even with a 450 try later code) is likely to cause problems for shitty mailmerge software that can't retry. So what we can probably do instead is accept the message and freeze it on ppswitch's queue, using the control = freeze ACL modifier, and after we have been alerted to the problem, we can check the messages and thaw them with exim -Mt. This is a fair amount of admin faff, so I'll probably develop a web interface to move the work to the users.
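
The shape of the ACL I have in mind is roughly this (the limit, the key, and the wording are illustrative only):

	warn  ratelimit   = 200 / 1h / per_rcpt / $authenticated_id
	      control     = freeze
	      log_message = submission rate exceeded; message frozen for review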
fanf: (Default)
Current blacklists have a number of (not entirely orthogonal) problems:

(1) obscure editorial policies
(2) centralized, so vulnerable to attack
(3) not easily extended to identities other than IP addresses
(4) difficult to contribute to

A while back I was thinking about anti-spam blacklists and wondering how one could derive one algorithmically based on input from lots of email servers whilst being resistant to gaming. There is (or used to be - I can't find it any more) a DNSBL which consolidated multiple local blacklists, but AFAIK there isn't anything clever about it, so it's utterly untrustworthy. I imagined that each email server (or server of any other kind) could gossip with its peers and thereby automatically find out what they think of each other. Each server could seed its opinions with data from anti-spam or anti-virus scanners or from protocol fingerprinting heuristics. The algorithm would establish each server's reputation, which would include not only the desirability of communicating with it but also the trustworthiness of its gossip. However I'm not enough of a mathematician to solve this problem.

However, and not particularly surprisingly, this kind of thing has already been investigated, albeit in another field. I stumbled across an article by Raph Levien when searching for opinions about Jabber/XMPP versus BXXP/BEEP, which led to some interesting articles from the Chandler development list, one of which included a reference to <http://www.levien.com/free/tmetric-HOWTO.html>. I wonder if this could be applied to my idea.
fanf: (Default)
http://www.theregister.co.uk/2006/01/10/lawsuit_started-by_email_is_valid/

This is a significant case, even if it is only directly relevant to maritime arbitration. The precedent is likely to be applied in other cases where the status of email as written communication is in doubt.

Note that the last paragraph of the article says:

Scottish court actions cannot be served by email. In England, email service is possible but only when there is written consent to this from the other party in advance, according to the Civil Procedure Rules. Accordingly, if a British business receives a court action "out of the blue" by email, it could generally argue that service has not been affected.

Even so, this underlines the responsibility of employees to treat email as seriously as they do the dead tree post.

I wonder when legal papers will first be served by IM :-)
fanf: (Default)
I had a surprisingly productive day today, considering I'm still suffering a bit from the tail end of a fever I had over the weekend. The main things I wanted to do were to upgrade the Exim configuration on ppswitch with a few minor changes, disable insecure access to Hermes for those people who have not been insecure since the middle of November, and prepare an announcement about it.

The Exim change became slightly more interesting than it might otherwise have been following a complaint from a user who was being irritated by blank messages. I investigated this a bit and became somewhat confused by Exim's handling of its $message_size variable. It turned out that all that is required to stop junk blank messages is to put deny condition = ${if ={0}{$message_size} } in the right place, but this is by no means obvious since $message_size has four slightly different meanings in various circumstances.
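
For the record, the fix amounts to something like this in the DATA-time ACL (the wording is illustrative):

	deny  message   = blank messages are not accepted
	      condition = ${if ={0}{$message_size} }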

It became even more "interesting" when, during the roll-out of the new config, one of the changes alerted me to a long-standing bug. PPswitch performs two levels of address verification: either a basic check of the plausibility of the mail domain after the @, or a full call-out check which aims to validate the local part before the @ too. The latter is a bit problematic because of the quantity of legitimate but misconfigured email out there, so we have a list of domains for which we do callouts, which includes domains that are well-configured and frequent victims of forged spam, such as aol.com.

The change was supposed to add Cambridge domains to this list, because part of the long-standing bug was that we weren't thoroughly checking email for systems like CUS which ppswitch doesn't know everything about. However, after the upgrade ppswitch started doing call-out verification for all sender addresses! The problem I hadn't noticed was that when we were doing sender verification, we were checking the recipient's domain against the list rather than the sender's domain; since the recipient is always (in this context) a Cambridge address the callout was always happening. This was caused by using the domains condition instead of the sender_domains condition: a common mistake...
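
In other words, the ACL needed something like the second of these rather than the first (the named domain list is illustrative):

	# wrong: tests the recipient's domain, which here is always a Cambridge one
	deny  domains        = +callout_domains
	      !verify        = sender/callout

	# right: tests the domain of the sender address being verified
	deny  sender_domains = +callout_domains
	      !verify        = sender/callout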

My checks in preparation for removing people from the insecure list threw up another lurking bug. Our audit script which checks for misconfigured users was failing to notice any ~/mail users since the start of the new year. This turned out to be because my regex for extracting yesterday's log lines assumed dates in the form Jan 04 whereas they were actually in the form Jan 4. Bah! syslogd really is the pits. I had to write a script to retro-analyse 8 days' data, which required a fair amount of care, and I also had to ensure that this did not cause me to falsely class people as safe to remove from the list.

Anyway, all that meant I didn't manage to get the announcement out before the end of the afternoon. However, rather than spodding, I spent a little time fiddling with my Hermesified authentication module for jabberd-2.0. Hermes uses cdb files for most of its important configuration tables, including the password files, so I ripped the cdb code out of Exim and tidied it up a bit so that jabberd2 could use the same password files. This evening I brought mmap-based cdb reading back from the dead, which may improve its performance slightly (though probably to an immeasurable degree). After I've done some more testing on the dotat.at Jabber server it should be ready to contribute to the jabberd2 maintainers.
fanf: (Default)
Yesterday I upgraded my workstation with gigabit ethernet and a better graphics card, which is actually capable of driving my LCD monitor at 1600x1200 over DVI with a decent refresh rate. I had a cunning idea over the hols that I could keep my old graphics card to make a dual-head setup, but of course my PC has only one AGP slot. Bah.

However when wandering around the office I noticed a PCI graphics card lurking on a shelf. Aha! In it went, and I booted up. Xorg -configure spotted the two cards and auto-generated a dual-head configuration. Nice! However when I ran X with the new configuration file, the first screen started up OK but the second did not. It complained of being unable to find the card's video BIOS, and that the card had 0KB of video RAM, and that none of the standard video modes would fit in the available memory.

Much faffing ensued, including trying to run X on only the second head (no luck), trying to force the video RAM size (crash!), swapping between three monitors (no improvement) and eventually trying the PC without the AGP card (no image at all). Bah!

At this point I was feeling rather vexed by the computer, so I left it to look around the office some more to see if I could find another PCI graphics card. No luck, but I did eventually find a blue+white G3 Powermac. These machines have no AGP, so I hauled it out from under the desk to see if I could filch its graphics card. A missing blanking plate taunted me from the location where the video card had been.

I realized that the card I had been trying to make work had originally come from the blue+white Mac, and the reason it didn't work was that it had Open Firmware, not PC firmware. D'oh!



This evening has also been irritating. I finally got the motivation to run a Jabber server on my workstation for my vanity domain dotat.at and for other testing purposes.

This helped me to find a couple of lurking bugs in my custom authentication module. The API has a number of functions, including one for checking the existence of a user and one for checking the validity of a password. Most of them are required to return 1 for an error and 0 for success, except for the user existence checker, which has the opposite logic - an anomaly I had not noticed, so that function in my module needed fixing...

Once I got a client talking to the server OK, I tried to get it to exchange presence subscriptions between my dot@dotat.at and fanf@jabber.org accounts. No dice: the server-to-server dialback authentication timed out. Huh. I eventually remembered about our wonderful new port blocking setup which is intended to improve the security of MICROS~1 crapware. Of course, all firewalls do is break things, so I have asked our network admins to exempt my workstation from this pointless hindrance.
fanf: (Default)
Outlook has an odd idea about email headers, probably because it isn't an Internet email client (though it has had support for Internet Standards retro-fitted). For example, it uses [] to delimit addresses instead of <> and it uses semicolons as list separators instead of commas, both of which are syntax errors. Furthermore, unlike all other MUAs, it insists on showing the user both the From: and Sender: headers if they are different - most MUAs only show you the Sender: if you ask to see the full headers. This means that email from most of our users appears in Outlook like "from Tony Finch [fanf2@hermes.cam.ac.uk] on behalf of Tony Finch [fanf2@cam.ac.uk]" which is rather irritating.

From Hermes's point of view, its native email domain is (obviously) hermes.cam.ac.uk, and most of our other domains are just aliases files (with a bit of magic, especially for cam.ac.uk itself). There's nothing preventing the determined person from directing their @cam email to someone else's Hermes account - and in fact Professor V. Important may quite legitimately want all @cam email to be handled by a secretary. In addition to that, Hermes users are allowed to send email "from" role addresses or vanity addresses, such as <mail-support@ucs.cam.ac.uk> or <dot@dotat.at>. So in order to be clear about who sent a message, Hermes adds a Sender: header containing the authenticated user's @hermes address if the message's From: header is not @hermes - which is in line with decades of tradition in email software.
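
In terms of raw headers, the effect is roughly this (the exact form of the added header may vary):

    From: Tony Finch <dot@dotat.at>
    Sender: fanf2@hermes.cam.ac.uk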

Hermes always did this for locally-submitted email, via Pine or Webmail, and since we introduced authenticated message submission it has done it for remote submissions too. We're trying to force everyone to configure securely authenticated IMAP and SMTP, so the slightly-redundant Sender: header is going to come to the notice of Outlook users more and more. I believe it was first complained about by Robin Walker just over a year ago, and it has caused another complaint this week.

I've recently made a few changes to ppswitch which improve the situation. When Exim is routing an email address belonging to a Hermes user (resolving aliases etc.) it eventually reaches that user's @hermes address. PPswitch now uses this to work out on the fly that <bursar@botolph.cam.ac.uk> is actually spqr1, and passes this result back up to the access control logic. This mapping of email addresses to users means we can do two things: we can make it a bit harder to spoof email, because ppswitch knows when a spotty undergrad is sending email "from" Professor V. Important; we can also say that spqr1@botolph and spqr1@cam are sufficiently similar to spqr1@hermes that email "from" those addresses doesn't need a disambiguating Sender: header.
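
A rough sketch of the kind of ACL logic this enables (hypothetical - the plumbing in the real configuration is more involved): the router that resolves an address to its Hermes user leaves the canonical username behind via address_data, so after sender verification the ACL can compare it with the authenticated user:

    # hypothetical sketch: after "verify = sender" the canonical Hermes
    # username left behind by the routers is in $sender_address_data;
    # if it matches the authenticated submitter there is no need to add
    # a disambiguating Sender: header
    warn  !condition = ${if eqi{$sender_address_data}{$authenticated_id} }
          add_header = Sender: $authenticated_id@hermes.cam.ac.uk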

The second part of the fix is to make this work for local submission as well as remote submissions. Nowadays, messages from Pine and Webmail go through the same submission process on ppswitch as messages from remote MUAs, except that the SMTP client is actually Exim on the machine hermes-1 or hermes-2. Behind the menu system these are fairly traditional multi-user Unix systems, so Exim still likes to add a Sender: header if a message's From: header doesn't match the local user's address - but this is now redundant and can be turned off. Webmail also likes to add a Sender: header in some circumstances, but this is less easy to turn off; fortunately it is less of a problem because Webmail has some knowledge about the mapping between @hermes and @cam - it doesn't know about other domains but Webmail users are less likely to reconfigure their From: line than others.
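
The knob for the traditional behaviour is Exim's local_from_check main option. Whether that is the only change needed on hermes-1 and hermes-2 is another matter, but the basic switch looks like this:

    # stop Exim adding a Sender: header when an untrusted local user's
    # From: line doesn't match their login
    local_from_check = false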

One notable thing that these changes don't handle is "friendly name" addresses, like <tony.finch@ucs.cam.ac.uk> which are particularly popular in the Judge Business School, but handling these is an AI-complete problem that I don't propose to fix :-)
fanf: (Default)
While upgrading the Exim configuration on ppswitch, I noticed that the Computer Lab is suffering from a fairly vicious joe-job. It started just over a week ago and seems to be centred on three addresses; fortunately all of them are invalid. It has upped their junk email counts by just over a factor of ten and it's still increasing. The following are daily rejections for the last fortnight in blog order (most recent first)...
.00     528711 (8 messages per second)
.01     281682
.02     128430
.03     124632 (1 Jan 2006)
.04     169562
.05     166818
.06     190821
.07     204235
.08      67480
.09      29091
.10      31859 (Newton's birthday)
.11      30676
.12      42042
.13      39324
.14      44769
fanf: (Default)
Since Monday I have been working on an admin script to help with the withdrawal of insecure access to Hermes. We have 9500 users to reconfigure, so it is going to be a fearsome job. The script is going to provide ongoing information to IT support staff in the University so that they know who needs to fix their setups.

I also had my annual(ish) appraisal. I did the paperwork on Tuesday, which caused me to procrastinate by turning a couple of the Computing Service's announcement streams into Atom feeds. They are now being syndicated by LJ: see [livejournal.com profile] cam_ucs_netnews and [livejournal.com profile] cam_ucs_ann
fanf: (Default)
A couple of announcements have recently been posted to the standards-JIG mailing list about the Jingle specifications. This is noteworthy because Jingle is the XMPP extension for peer-to-peer session initiation, including Voice Over IP, and it is essentially the protocol used by Google Talk. It will be interesting to see non-Google clients implementing the protocol, and to see how soon gateways between SIP, H.323, and Jingle get implemented.

http://mail.jabber.org/pipermail/standards-jig/2005-December/009315.html
http://mail.jabber.org/pipermail/standards-jig/2005-December/009316.html

JEP = "Jabber enhancement proposal"
JIG = "Jabber interest group"
SIP = the "session initiation protocol", the IETF's VOIP standard
H.323 = the ITU's VOIP standard
VOIP = voice over IP
XMPP = the standard core of Jabber
fanf: (Default)
Progress on the chat service is still slow. I sent the following to cs-chat-users:

At the moment I'm still waiting for final approval for the name "chat.cam.ac.uk" which I should get after the next IT Syndicate meeting on the 17th Jan. I'm not going to do much work on the project until then. (Hermes will keep me busy!) However I aim to make a prototype server available soon after that date.

After that there may be some further delays caused by Jabber's need for SRV records in the DNS. Jabber uses these to route messages between servers, similar to MX records for email. The University's DNS does not currently have any SRV records, so our hostmasters will go through a period of compatibility testing before putting them in the main zone. This won't affect use of the chat service between Cambridge users, but may delay our ability to communicate with other Jabber servers.
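
For example, the records would look something like this (the target host name here is invented; 5222 and 5269 are the standard client and server-to-server ports):

    _xmpp-client._tcp.chat.cam.ac.uk. IN SRV 0 0 5222 jabber.csi.cam.ac.uk.
    _xmpp-server._tcp.chat.cam.ac.uk. IN SRV 0 0 5269 jabber.csi.cam.ac.uk.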
fanf: (Default)
Today I have been doing some project support work for the Chat service. (Its domain name is chat.cam.ac.uk, and people might not know what Jabber is, so "the Chat service" is how we will refer to it, unless anyone comes up with a better idea.)

I have a role address for support matters, chat-support@ucs.cam.ac.uk.

I have a mailing list for interested people. That link only works for Cambridge people, but anyone can send a subscribe message to cs-chat-users-request@lists.cam.ac.uk (though I doubt it'll be of much interest to outsiders).

I have a wiki kindly set up for me by [livejournal.com profile] furrfu (again the link is restricted to Cambridge people).

I have a logo!
fanf: (Default)
Following http://www.livejournal.com/users/fanf/42585.html

Bob presented a revised version of my proposal to the SMT today, which included some changes as a result of the previous attempt - essentially, upgrading the perfunctory "oh, and maybe we could do these things" paragraphs to "we will do these". In particular, the web front-end and MSN gateway.

The main result was that I have the go-ahead!

I will have to write a paper for the IT Syndicate (which oversees the Computing Service) to get full approval for a significant new service. I will also have to get it working by the summer, complete with MUC, MSN, web, etc.

The oddest thing was some quibbling about the usability of the MSN gateway, based on no practical experience at all. Sigh. Still, no harm done.

If anyone wants to be added to my interest list, please email <fanf2@cam.ac.uk>. If anyone knows of other Universities in the UK which have Jabber services, I would be interested - I know of Portsmouth and Cardiff.
fanf: (Default)
The UKUUG is holding its Spring Conference in Durham in March, and the Call for Participation closes in a couple of weeks. I've been preparing an abstract for a paper and talk about the ratelimit feature I implemented for Exim, and our experiences of deploying it in Cambridge.

Any comments on the following?

Cambridge University has pretty good email security, but even so we have a couple of incidents each year when a security breach results in a flood of spam from our network. In order to protect against this in the future, we needed a system for throttling these floods before they cause damage, such as Cambridge being blacklisted by AOL.

I implemented a general-purpose rate-limiting facility for Exim 4.52. It is extremely flexible and allows you to specify almost any policy you want. It can measure the rate of messages, recipients, SMTP commands, or bytes of data from a particular sender; and senders can be identified by IP address, authenticated username, or almost anything else.

I deployed this facility on the central email systems in Cambridge. It ran in logging-only mode for several weeks while I tuned the policy to minimize the disruption to legitimate email. This exposed the slightly surprising extent of bulk email usage in the University, and a number of particularly problematic cases. An important task was to communicate the change in policy to less technical users.

I will describe Exim's ratelimit facility and report on our deployment experiences.
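
For anyone who hasn't seen the feature, its simplest use looks something like this in an ACL (the limit here is an arbitrary example, and the key defaults to the client's IP address):

    # reject when this client's rate exceeds 100 messages per hour;
    # "strict" means the rate keeps being updated even while we reject
    deny  message   = Sender rate $sender_rate exceeds $sender_rate_limit per $sender_rate_period
          ratelimit = 100 / 1h / strict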

fanf: (Default)
So, while I was away seeing my dad and going to EuroBSDcon in Basel, my boss Bob presented my Jabber proposal to the senior management team. It was sent back to me for further work.

There were a few tactical errors, I think. The original reason for writing the proposal was that I had got to the stage of needing real DNS records in order to do some realistic testing. Bob said that I couldn't have them without SMT approval, hence the proposal write-up. (For good reasons there is a bit of a hurdle for the creation of domains immediately under cam.ac.uk.)

What I failed to emphasize, to Bob or in my proposal, was that there had been no discussion of requirements or of development/release milestones beyond what I had sketched out for myself. So I was perfectly happy for the SMT to say "do things in this order please, and we won't formally announce until X, Y, and Z are done". Our Glorious Leader demanded a web interface with Raven authentication, and our Deputy Director thought that we'd need an MSN gateway to maximize take-up. There was also some discussion about how we could gauge that the service was successful. This was all fine by me.

However there was much more of an argument than was warranted, especially over the Raven requirement. This is irritating because I'd prefer to make progress than fight political battles over unimportant details. Maybe we'll be successful with a revised proposal at the SMT next week.

The Raven requirement does have some slightly awkward implications which are worth noting.

The security model for Raven requires that Raven passwords are only typed into Raven, which means that they cannot be used by native Jabber clients. So my original plan to use the Hermes password file will still go ahead. But we've been told that any web-based Jabber client must use Raven authentication, which means that Jabber users must use different passwords in different contexts, which is a bit ugly. It's also different from the way Hermes passwords are used for email - though OGL says that if our webmail service was being done now it would also use Raven.

It implies some customization of the software, rather than just installation and configuration of the standard version. The Jabber HTTP binding uses clever AJAX techniques to get much less latency and bandwidth than a simple polling solution, and the protocol tunnels most of the Jabber protocol through to the web client. So I may have to make modifications to the web client part as well as the web server part. Fitting an AJAX application into the Raven model may be tricky too.

Then there's the problem of turning Raven authentication at the web server into SASL authentication at the Jabber server. The easiest way to do this is for the Jabber server to trust the web server, and for them to use SASL EXTERNAL authentication. The EXTERNAL mechanism just states the username, and the server uses some implicit context to authenticate it. It is designed for lifting TLS client authentication up to the SASL layer, but it also works for turning trust defined by the system administrator into a standardized protocol. This kind of trust is not great for security, so I'd prefer something better; we'll see if Jon Warbrick, the author of Raven, has any bright ideas.
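
For reference, the web server's half of such an exchange on its connection to the Jabber server would look roughly like this (the content is the base64-encoded authorization identity, here "spqr1"):

    <auth xmlns='urn:ietf:params:xml:ns:xmpp-sasl'
          mechanism='EXTERNAL'>c3BxcjE=</auth>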
fanf: (Default)
Foolish, I know, but I'll get stale if I do nothing but email. This week I are been mostly writing a draft proposal to be given to my senior management team, suggesting that I should implement a Jabber service for Cambridge University. All comments and suggestions welcome!

http://www.cus.cam.ac.uk/~fanf2/hermes/doc/jabber/proposal.txt
fanf: (Default)
Once more, Pat Stewart does my dirty work. (I'm amused that none of my colleagues suggested toning down the sarcastic bit!)

http://www.cam.ac.uk/cgi-bin/wwwnews?grp=ucam.comp-serv.announce&art=1431
Since the summer, insecure access to Hermes has been forbidden to new users. We are planning to extend this rule to the whole University by next summer. In preparation for the next step, this term we have been monitoring usage of Hermes to make a list of easy cases.

Next Monday, 14th November, we will withdraw insecure access to Hermes from those users who have not used insecure configurations - that is, from those users who should not be affected by this change.

We are also deprecating the ~/mail folder name prefix. On Monday we will also disable the backwards-compatibility support for those who do not need it.

Although these changes should not affect too many users, our list of insecure users is still growing so this change will affect some people. This is inevitable whatever time we pick.

After Monday we will have somewhat over 9,000 users to reconfigure, which is a rather daunting prospect. We are proposing to sort them by affiliation before working through them, so that users in a department or college will be dealt with together, rather than in dribs and drabs. We will send a list of affected users to the relevant support staff a reasonable amount of time before they will have to change: although the timetable is ambitious we don't want to turn it into a mad rush.

We hope this process seems fair to computer officers and techlinks. If you have any suggestions for ways in which we can make it less painful, please contact <postmaster@hermes.cam.ac.uk> - though if anyone suggests not doing it at all, we will consider giving them a particularly strict timetable. The details are deliberately vague at the moment because we expect to refine them based on experience.

For background information and links to the documentation of the correct settings for Hermes, see http://www.cam.ac.uk/cs/email/securehermes.html
fanf: (Default)
So I've been playing around with Jabber recently, and I wanted a client for testing and admin purposes. I wasn't keen on Psi because installing a pre-built package would have required upgrading the world, and building it and Qt takes for ever. So I thought I'd try CJC, the console jabber client, which is appealingly retro.

CJC is written in Python and has an associated library called PyXMPP which does the protocol end of things. (In fact this is another reason for choosing CJC - in the fullness of time I'll want to write some glue software to hook Jabber into other stuff and PyXMPP seems like a plausible choice of foundation.) PyXMPP needs a few add-on libraries in addition to the standard Python install.

Building Python and persuading it to link to the correct libraries wasn't too much trouble. DNSpython builds in the standard Python way without any difficulty at all. LibXML2 was also not too bad - though because it depends on the Python install to compile its Python bindings, while part of the Python install (viz. PyXMPP) depends on libxml2, I couldn't separate it from Python properly. A bit scruffy, but *shrug*.

The problems are almost entirely to do with M2Crypto, which provides Python bindings for OpenSSL. M2Crypto uses SWIG to do most of the work of hooking the two together. The nice thing about SWIG is that it can be built to support loads of scripting languages without having any of them installed, so (unlike libxml2) I could divorce it from the Python install.

M2Crypto uses SWIG in a rather grotty way, resulting in lots of "Warning(121): %name is deprecated. Use %rename instead." I guess that the reason for this deprecation is that the %name directive requires the SWIG interface file to include a copy of the C function's declaration. This caused the M2Crypto build to explode when pointed at OpenSSL-0.9.7h, because there have been some constness changes in the OpenSSL header files which caused M2Crypto's idea of some function types to be wrong. Sigh.

So I put together an evil hack to patch this bug, and got the whole thing built successfully. However when I tried to connect to jabber.org with cjc, I got a lovely stack traceback:
    File ".../pyxmpp/stream.py", line 1206, in _make_tls_connection
      ctx=SSL.Context('tlsv1')
    File ".../M2Crypto/SSL/Context.py", line 41, in __init__
      map()[self.ctx] = self
    File ".../M2Crypto/SSL/Context.py", line 20, in __setitem__
      self.map[key] = value
  TypeError: unhashable type

M2Crypto tries to keep a hash table of SSL context structures. (I don't know why; it looks redundant to me because that bit of the code could just use self.ctx instead.) The SSL context structure is a SWIG wrapper around the corresponding OpenSSL type, and SWIG doesn't define a tp_hash function in its wrappers, so Python bitches as you can see above.

Working this out required far too much effort. At first I thought it might be a problem with version skew, so I tried using older Pythons and older SWIGs and newer cjc/pyxmpp snapshots, and the bug remained. So I looked at the code more closely and consulted friendly Python experts and came to the conclusion that the code couldn't possibly work. (But then how were others managing to use it?) I then thrashed the SSL context hash table to within an inch of its life by using repr(key) instead of key, so that a stringified version of the context pointer was used to index the table, thus bypassing the problem.
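
The workaround amounted to something like this in M2Crypto's SSL/Context.py (a sketch from memory rather than the exact patch):

    # M2Crypto keeps a module-level map from SSL context pointers to
    # Context objects; the SWIG pointer wrapper isn't hashable, so use
    # its repr() - a plain string - as the dictionary key instead
    class _ctxmap:
        def __init__(self):
            self.map = {}

        def __getitem__(self, key):
            return self.map[repr(key)]

        def __setitem__(self, key, value):
            self.map[repr(key)] = value

        def __delitem__(self, key):
            del self.map[repr(key)]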

After that, all I was left with was a minor certificate verification problem, which was solved by teaching M2Crypto about the many benefits of SSL_CTX_set_default_verify_paths() and getting PyXMPP to invoke the new M2Crypto feature appropriately. And finding the CA certificates required to verify the signing chain of jabber.org's dodgy certificate.

So maybe now I can make some real progress, instead of shaving yaks.
fanf: (Default)
I have found some time to update http://www.cus.cam.ac.uk/~fanf2/photos.html with some help from [livejournal.com profile] furrfu