fanf: (Default)
On Thursday evenings I go to the pub for a few beers with friends. The route I walk from work goes across Jesus Green from Lower Park St to Carlyle Rd. This evening it was darker than usual, because of cloud cover and the lamp on the path being broken. There are usually a few people walking across the park, but this evening it was quiet.

As I was approaching the Jesus Lock end at about 20:30, I saw three youths cutting across from the right-hand avenue onto the path. They walked towards me spread widely abreast, so that two of them were on the grass and I had plenty of space to pass between them. As I did so, one of them pushed me over onto the grass. I said "cunt!" or something like that, and he started to have a bit of a go with feet and fists, saying "giz ya phone". He wasn't hitting hard, so I told him it was worthless to him. I tried briefly to get up and fight back, but this made them hit harder (one or two of the others joined in), and I found that curling up on the ground and protecting my head didn't excite them. Soon they lost their bottle and ran off towards Park St.

They didn't get my phone, or the £60 in my wallet, and only bruised me slightly: a sore cheek and forearm, and my back is fine despite taking most of the blows. I was more annoyed by them than scared - it seemed that Dickhead #1 was just showing off, and not really intent on theft or injury.

What a bore. I was going to have to call the police to report the crime, even though I wasn't hurt and hadn't lost anything; it was just enough to utterly disrupt my evening. Chiz.

At this point I should have dialled 999 on the phone that the chavs were too crap to steal properly. However I wasn't hurt and I did not see enough of them to be able to identify them (dark clothes, dark night), so I called Rachel to get the Police non-emergency number. I didn't get through to anyone until about 10 minutes after the mugging, at which point I found out that the Police wanted to take it much more seriously, and if I had called 999 they might have been able to see something useful with their CCTV cameras. Silly me.

I walked home to reassure Rachel that I was fine, and soon a couple of coppers came to take a verbal statement and give me an incident number. They classified it as common assault (but not attempted robbery, since they were just being dickheads).

After that I made myself some dinner and Rachel and I settled down to watch yesterday's episode of Heroes. But Charles managed to have a suck on the DVR buttons (which we usually try to prevent) and they didn't work. Faugh, I say! Faugh and thrice faugh!

Time to find another beer and see if half an hour in the airing cupboard has fixed the buttons.
fanf: (weather)
Happy new year! Cambridge has 10,000 extra enthusiastic, intelligent, (young, attractive) people again, and internal email volumes are up 50%. Summer's having its last blast, as it usually does in October, making the place look its best for the new intake.

I was interested to hear a news item this morning about SOCA busting a load of 419ers. The odd thing that struck me is that they are calling the crime "mass marketing fraud" instead of "spam". Also, I think that the SOCA web site looks like a Dr Who secret agency web site - for example, compare Geocomtex with SOCA.

One sad/amusing aspect of 419 spam is that it's mostly sent manually, unlike the botnet-driven techno-spam sent largely by Americans with the help of the Russian mafia. Bruce Schneier and Time Magazine write about the frightening sophistication of the Storm Worm.
fanf: (Default)
I've been thinking about teaching children to program (or helping them to learn), especially since it came up in conversation at a recent party. (This is slightly inspired by my son, but he's only a year old so he won't be getting to grips with computers for some years yet.) The discussion at the party was based on our experience of starting with 1980s micros, which we think was a good way to learn programming, so how could we provide something similar to the children of today? I think this kind of micro must be:

* safe for playing with minimal supervision
* very simple (or apparently so)
* good for fun as well as for programming

More detailed requirements are a consequence of the above:

* The hardware has to be robust, so no moving parts. The drivers have to be robust too, so you can yank the power or plug/unplug peripherals without bad consequences.
* As cheap as possible, so it doesn't matter if it dies from abuse. This implies it should be off-the-shelf, and able to use standard peripherals, e.g. a hand-me-down display.
* No high-level networking. Low level features are useful for older children to learn network programming, but for most usage it'll have an air gap for safety.
* No complicated user interface. It'll run one full-screen task at a time, such as a game or a programmer's read/eval/print loop.
* Easy to do fun things with graphics, sounds, and robotics.

I think the right solution is to use a "thin client": one of those computers that's designed to provide a keyboard and display to a remote server. You can get plausible ones for about £100. However, instead of booting off the net, it must be able to boot off flash (built-in or USB). At the moment a couple of good options are the TinyTuxBox 4 or the Linutop, both based on the same AMD Geode CPU, though the latter has the advantage of using LinuxBIOS.

It might not be easy or sensible to reflash the BIOS so that these computers can boot straight into BASIC for that authentic 1980s experience, so it's probably best to use USB flash drives as if they were Atari or Nintendo cartridges. Then it should be easy to ship software as disk images - with perhaps collections of games for fun as well as a programming language.

I'm not actually keen on BASIC as a language or environment. Its only surviving descendant is Visual BASIC, which is too complicated, volatile, and proprietary for my purposes, and I don't see any advantage in resurrecting an older variant. Logo has plenty of teaching materials, which is handy, and it's a reasonably nice language with a fair amount of power. Or maybe Lua - though perhaps that only seems simple to people who already know how to program. I think it's also worth providing a text editor for programming, not just a command prompt, since we aren't constrained by memory in the way the eight-bits were, and the children will already have some familiarity with word processors.
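
For a sense of scale, here's the kind of first program I'm imagining, in plain Lua - nothing beyond print, and about as little ceremony as a 1980s micro:

    -- a seven times table
    for i = 1, 10 do
        print(i .. " times 7 is " .. i * 7)
    end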

Beneath the simple language and simple editor, it is unfortunately difficult to stay simple without expending a great deal of effort. It's a challenge just to get a modern CPU to run without toasting itself to a cinder. On the x86 platform you have 25 years of accumulated expedient hacks to work around in software. Doing general-purpose IO over USB is just a bit more complicated than a latch wired to your data bus. Given all that, it seems that the most sensible plan is to start with a Linux or BSD kernel, and put a simplified userland on top.

It's difficult to choose how to support the key features - graphics for example. X11 is not simple, but it is very well supported and it's easy to make it look simple (if you throw away the display manager and window manager etc.) On the other hand it ought to be simple to build on top of the framebuffer device, but then (since I don't want to re-invent many wheels) I'd still end up pulling in layers of complexity (DirectFB and associated libraries) some of which are explicitly experimental and incomplete. But since it's all sitting on a Unix kernel I suppose I've already given up the idea of making its innards easily comprehensible.

Another example of the appearance of simplicity despite enormous amounts of hidden complexity has to be robotics. Phidgets look like fun hardware interfaces, but of course they depend on USB so they only appear to be simple...

Before signing off I should note that the OLPC XO is not what I'm after. It has more user interface and more networking than I want. I don't want a general-purpose computer, I want a programmable toy.
fanf: (Default)
There's been a moderately interesting thread on the main IETF list recently, especially the part starting with a message from Karl Auerbach which reminded me of a post I wrote last year. Especially interesting is the link posted by Lars Eggert to a paper by Bryan Ford (of PEG parser fame). I need a less long-winded summary of my session layer idea, so this is it.

I'm starting from the observation that lots of application protocols have some idea of a session, but the Internet architecture provides no support for sessions, so each application that needs one has re-invented the idea differently. The earliest example is the FTP control connection which manages multiple data connections. HTTP requests are mostly independent of each other, but it has had various feeble attempts at sessions built on top of it. TLS has a session cache to slightly reduce reconnect latency. BEEP and ssh use a single TCP connection per session and multiplex concurrent activities down the same pipe. Etc. There's obviously a need here waiting to be fulfilled.

A session is essentially a user's login to an application, so the two should have the same lifetime. Login implies that a session's core service is a cryptographic association between the client and server, which has to provide the usual integrity, confidentiality, and authenticity features. (As such it should be able to replace IPsec, TLS, SASL, etc.)

On top of this is built a layer for multiplexing streams (and perhaps datagrams). Since it already has a security association, it can avoid many performance problems caused by multiplexing using concurrent TCP connections:

* it can omit the TCP three-way handshake and the TCP+TLS 6-way handshake, to get T/TCP startup performance without its security problems;

* it can avoid slow-start restart delays and related problems by doing congestion control at the session layer instead of per-stream (see also RFC 2140);

* it can avoid the problems that ssh and BEEP have with multiple layers of windowing and head-of-line blocking on packet loss, by making each stream run fully independently on top of the datagram layer;

* if it supports both reliable streams and best-effort datagrams in the same session then it'll support media apps (SIP, Jingle) well even when there is asymmetrical connectivity (NATs or firewalls - assuming they allow a session to be established in the first place);

* the crypto provides the endpoints with a secure identity for the session that's independent of lower-level addressing or routing, so they can re-establish the session if either end moves and can re-locate the other, without interrupting the higher-level data streams - mobility for free.

This is all very similar to Bryan Ford's SST design, so I'm really pleased that the lazyweb has turned my dreaming into something real! However SST's channels are lower-level than my sessions: all apps communicating between the same pair of hosts use one channel. I'm not sure how this affects higher-level authentication (SASL and perhaps X.509) - you would at least need some way to cryptographically bind app-level auth to the SST channel, but can other apps sharing the channel spoof this binding, and does it matter?

Last year's article also had some wild speculation about fixing the problems of NAT and global routing scalability, but that part is not relevant to the current discussion and too vague to be summarized more concisely, so I won't try (even though I still quite like the idea).
fanf: (silly)
Really tiny ARM computer the size of an RJ45 socket:
http://www.digi.com/products/embeddedsolutions/digiconnectmespecs.jsp

Really tiny GPS module (also contains an ARM):
http://www.rfsolutions.co.uk/acatalog/GPS_Module.html

I think all you'd need to add is a power converter from PoE (48V) to the 3.3V wanted by the modules.
fanf: (Default)

I've come to the conclusion that most programming languages' idea of a string - an array or list of characters - is wrong, or at least too simplistic.

Most languages designed before Unicode have the idea that characters are bytes, which is hopelessly inadequate. Languages designed before Unicode was mature (e.g. Java) have the idea that characters are 16 bits, which is still inadequate. The next reasonable step is to use a 32-bit type for characters, but that wastes at least a third of the memory you use for strings, since Unicode needs at most 21 bits.

If your language's character type is too narrow then you have to use an encoding scheme (such as a Unicode Transformation Format or ISO-2022) to fit a larger repertoire into the available space. However, when you have done that your char is no longer a character. This causes a number of problems:

  • strlen() no longer counts characters.
  • random access into strings can lead to nonsense, e.g. if you miss a code shift sequence or you don't land on the start of a UTF-8 sequence or if you accidentally strip off a BOM.

In fact, even in the absence of encodings you have similar problems, e.g.

  • the length of a string is not the same as the number of glyphs, because of combining characters.
  • even in a fixed width font, the number of glyphs is not the same as the width of the string on the display, because some characters are double-width.
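
A short Lua sketch makes the first of these problems concrete (Lua because its strings are plain byte strings; the \195\175 escapes below are the UTF-8 encoding of ï):

    -- the # operator counts bytes, not characters
    s = "na\195\175ve"        -- "naïve": 5 characters, 6 bytes
    print(#s)                 --> 6

    -- count codepoints by skipping UTF-8 continuation bytes (128..191)
    n = 0
    for i = 1, #s do
        local b = string.byte(s, i)
        if b < 128 or b > 191 then n = n + 1 end
    end
    print(n)                  --> 5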

Even ASCII is not simple enough to be immune from problems:

  • random access into strings can produce garbage if the string contains escape sequences.
  • some logical characters (e.g. newline) are not always represented as a single byte (e.g. as CR LF).
  • in some contexts you can use typewriter-style backspace/overstrike to implement something like combining characters.

On the other hand, if your language's string type is not based on bytes then you have the problem that strings need to be transcoded between their external form and whatever the libraries prefer internally, since UTF-8 is winning the encoding wars.

I think that the solution to this problem is to embrace it, instead of trying to hide behind an inadequate abstraction. The essence of the bug is that, while it is meaningful to talk about a unicode codepoint in isolation, a codepoint is not a character and a string is not an array of codepoints: there is a layer of encoding between the conceptual sequence of codepoints and the representation of strings. (This implies that, although D is refreshingly agnostic about the size of its characters it still falls into the array trap and is chauvinistic towards a subset of Unicode transformation formats.)

What this means is that the language should not have a character or string type, but instead a type for binary blobs like Erlang's. The programmer manipulates binary values using pattern matching (it should probably support Erlang-style matching and some flavour of regular expressions) and some reasonably efficient way of construction by concatenation (as well as its syntax for binary comprehensions, Erlang has the concept of IO lists which support scatter/gather IO). The idea is to shift from thinking of sequences of discrete characters to thinking about strings as structured mass data.

Note that (so far) this design has no built-in preference for any particular string encoding, which should help to make it flexible enough to live gracefully in environments that favour all sorts of textual data, including ones not designed yet. However you need to support string constants reasonably gracefully, which either means that your source encoding is the same as the encoding you deal with at run time (so that the blobs in the source become identical blobs in the object) or you need a good way to transcode them. Ideally transcoding should only happen once, preferably at compile time, but it's probably OK to do it during static initialization. (Note that similar considerations apply to compiling regular expressions.)

To a large extent, what I'm doing is lifting the string representation chauvinism problem from the language level to the library level, and libraries are just as good as languages at growing ugly calluses around poorly-designed features - though you usually have to do less reimplementation when switching libraries. I'm also aiming to encourage a style of string manipulation more like perl's than C's. To succeed properly, though, it'll have to go further with encouraging good use of library support for international text - but at that point I'm stepping on to thin ice.

fanf: (Default)

I've recently been reading about Erlang's bit syntax, which is a pretty nice way of serializing and deserializing binary data. The Erlang documentation includes some programming examples as well as the reference description. Although it is quite simple, it gets a lot of bang per buck by using essentially the same syntax to parse binaries using pattern matching as it does to construct them. It can also be implemented quite efficiently.

More recently the Erlang implementers have been working on a syntax for binary comprehensions which makes it even easier to implement really fiddly binary data manipulation code incredibly concisely and efficiently. It's one of the greatest inventions in programming language syntax in the last ten years.

However all is not sweetness and light, and the problem is that Erlang has incomplete (or broken) support for differing endianness.

Endianness affects the behaviour of two basic operations: splitting a large number into a sequence of smaller numbers; and joining a sequence of small numbers to form a larger number. (I'm being deliberately vague talking about "numbers" because they can be any size, smaller than bytes as well as larger.) These split and join operations are the fundamental building blocks of higher-level serialization and deserialization operations.
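
To make split and join concrete, here is a minimal sketch restricted to whole bytes (in Lua, purely because it needs no bit operators; real serializers generalize the same two operations to other sizes):

    -- split a non-negative integer into n bytes, in either byte order
    local function split(x, n, little)
        local bytes = {}
        for i = 1, n do
            local b = x % 256
            x = math.floor(x / 256)
            if little then bytes[i] = b else bytes[n - i + 1] = b end
        end
        return bytes
    end

    -- join an array of bytes back into an integer: the inverse operation
    local function join(bytes, little)
        local x = 0
        for i = 1, #bytes do
            local j = little and (#bytes - i + 1) or i
            x = x * 256 + bytes[j]
        end
        return x
    end

    -- split(0x12345678, 4, false) --> {0x12, 0x34, 0x56, 0x78}  (big endian)
    -- split(0x12345678, 4, true)  --> {0x78, 0x56, 0x34, 0x12}  (little endian)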

The simplest view of endianness is that it is a choice of byte order, where the larger numbers are words comprising a whole number of bytes, which are the smaller numbers. Serialization is splitting and deserialization is joining, because the programmer wants to deal with words but they must be represented as bytes. Erlang allows you to control the byte order of numbers when they are serialized into or deserialized from a binary.

C's bitfield support illustrates how endianness is more complicated than just byte order. C allows you to specify the width in bits of integer fields in a structure. The compiler packs sequences of bitfields into (typically) word-sized storage units (for efficient manipulation in the processor's registers). In this situation, serialization is joining and deserialization is splitting - the opposite of the previous paragraph! - because the programmer wants to deal with a sequence of small numbers but these must be represented as words. However, it gets more complicated because byte order affects how words are represented in memory, so there are two layers of endianness in effect: how the compiler joins bitfields into words and how the processor splits words into bytes. The compiler agrees with the processor on endianness, so that a sequence of 8-bit-wide bitfields has the same representation as a sequence of char fields.

Erlang's bit syntax deals directly with byte strings, without restricting the alignment of bitfield boundaries to allow simple word-by-word manipulation as C does. However you still get both splitting and joining when serializing (and when deserializing), since large numbers must be split into bytes and small numbers must be joined to form bytes. Small numbers that overlap byte boundaries must be both split and joined!

The bug is that Erlang allows you (when serializing) to control the byte order of splitting, but not the endianness it uses to join small bit fields to form bytes - the latter is always big endian. This means that Erlang bit syntax cannot elegantly handle bitfields in a C struct on a little-endian machine. For example, this is the structure of the program counter on a 26-bit little-endian ARM processor:

	struct {
		unsigned mode : 2;
		unsigned pc : 24;
		unsigned imask : 2;
		unsigned flags : 4;
	};

The layout in memory is as follows. I have numbered the bytes on the left-hand side and the bits along the top, according to their power-of-two values. I have also indicated how the program counter bits end up split between the four bytes.

	    7   6   5   4   3   2   1   0
	  ------------------------+-------+
	0   5         pc        0 |  mode |
	  ------------------------+-------+
	1  13            pc             6
	  ---------------------------------
	2  21               pc          14
	  +---------------+-------+--------
	3 | N   Z   C   V | I   F | 23  22
	  +---------------+-------+--------

It would be nice if you could write in Erlang:

	<< Mode:2/little, PC:24/little, IMask:2/little, Flags:4/little >> 

However, what you get is completely crazy:

	    7   6   5   4   3   2   1   0
	  +-------+------------------------
	0 | mode  | 7         pc        2
	  +-------+------------------------
	1   1   0   15        pc       10
	  ---------------------------------
	2   9   8   23        pc       18
	  --------+-------+---------------+
	3  17  16 | I   F | N   Z   C   V |
	  --------+-------+---------------+

That is, the endianness flags have had no effect on the small fields: they have been joined to form bytes in big endian order. The program counter has been split into bytes in little-endian order, and then this three-byte string has been shifted in a big-endian fashion to fit into its unaligned space. As a result its bits are shuffled into an utterly baffling order.

To represent little-endian structures like this one in Erlang, you must deal with layout details manually, defeating the expressiveness of Erlang's bit syntax. In the example, the logically-contiguous PC field gets split up, and its value must be recovered with an auxiliary expression.

	<< PCa:6, Mode:2, PCb:8, PCc:8, Flags:4, IMask:2, PCd:2 >> 
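	%% recover the PC from the fragments (my reconstruction of the
	%% auxiliary expression mentioned above):
	PC = (PCd bsl 22) bor (PCc bsl 14) bor (PCb bsl 6) bor PCa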

So, how to fix this bug?

One possibility that I considered was to have a little-endian version of the binary append operation, which causes small bitfields to be joined into bytes in little-endian order, and larger fields to be shifted and split in little-endian fashion when they are appended on non-byte boundaries. This is nice and straightforward, but it can still lead to craziness if you mix big- and little-endian operations. Appending binaries is expressed in Erlang by the comma inside << >> brackets; if, instead of putting the endianness flag on each element, you put it on the binary as a whole, e.g. << mode:2, pc:24, imask:2, flags:4 >>/little, then it can simultaneously affect splitting and joining, which helps to ensure the results are consistent, not crazy, and it also reduces verbosity.

However I'm not sure this is entirely sufficient. Erlang used to require that binaries were a whole number of bytes, so sub-byte fragments could only appear as part of a larger << >> expression. However the new generalized bit syntax allows binaries to end on arbitrary bit boundaries, which opens the question of whether the sub-byte tail of a binary is represented in big- or little-endian fashion. The same problem can occur at the start of a binary if it has been created by splitting a larger binary on a non-byte boundary, and the new binary is represented by reference to the original instead of as a new copy.

Therefore I think binaries should know their endianness, which would determine: the layout of non-byte-aligned boundaries at the start and/or end of the binary; how the binary is re-aligned (shifted) so that it can be appended to a binary with different alignment; and the byte order of large fields. (Shifting left moves big-endians towards the start and little-endians towards the end.) The append operation can also check that binaries have compatible endianness so that you get a sensible error instead of crazy bit shuffling.

Binaries actually need three kinds of endianness: big, little, and indeterminate. The latter can only be used for whole-byte binaries and would (for example) apply to binaries freshly read from the outside world. It seems reasonable to allow whole-byte binaries to have their endianness coerced. Appending binaries would cause an error if they are not a whole number of bytes and they are not the same endianness; if they are a whole number of bytes and of differing endianness then the result has indeterminate endianness; otherwise the result has the same endianness as the arguments.

So, to conclude: Erlang's bit syntax is fantastic but not quite finished. The run-time system needs to annotate binary objects with their endianness, and endianness flags should be attached to whole bit syntax expressions instead of each element.

fanf: (Default)
(Thanks to Debbie for pointing out this one.)

COSMOS System Manager and Relativity Group Computer Officer in the Department of Applied Maths and Theoretical Physics

The COSMOS Supercomputer is a national level high performance computing facility, dedicated to cosmological research, involving a collaboration of 28 investigators from 10 UK Universities. Professor Hawking is the principal investigator of the COSMOS consortium which has been in existence for 10 years. At present the consortium operates a 152 core SGI Altix 4700, with 456 GB globally shared memory and over 20 TB storage, and applications are pending for a state-of-the-art 10 Tflop system.

The Relativity & Gravitation Group is a large research group comprising at present six Professors, three Readers, one lecturer, three teaching fellows, six postdoctoral researchers and twenty-two PhD students. It has a group secretary, plus a computer officer/administrator shared with COSMOS.

We are seeking a System Administrator to manage the COSMOS supercomputer facility and inter-University consortium, including configuration and troubleshooting of the physical system, software maintenance and development, user administration, support and training. The role holder will also provide administration of COSMOS financial matters, including grant preparation, procurement and budgeting.

The role holder will also provide system management of Relativity & Gravitation Group computer resources, including procurement, system configuration, maintenance and development, user support and training. The role holder also maintains contact with computer support staff at a Departmental and University level where this is required to further the Group's research objectives, and occasionally provides support and training outside the Group context where the Department requires additional support or developmental effort.

You should hold a good first degree, preferably in a science or computer science subject, and at least 2 years' experience with networked, unix-based computers, in particular Linux, at system administration level, while supporting a large scientific user base. At least 1 year's experience of administering a large supercomputer facility is strongly preferred. Familiarity with high performance computing resources, such as the configuration and maintenance of clusters, storage area networks and numerical software, would also be useful. You will need to communicate clearly with individuals for training purposes and with representatives of external organisations.

You will have the ability to balance a budget, the ability to write coherently for applications and scientific reports using TeX/LaTeX, and the ability to write HTML web pages for user documentation and public interest.

This post is available for 2 years with the possibility of further extension, subject to funding. More information can be found on the DAMTP web site. (They don't say what grade the role has, but I'd expect 9ish.)
fanf: (Default)
2 Software Development Specialists

The University Computing Service is forming a small team of software developers to work on a variety of projects. We are recruiting two people to work in this group.

The successful applicants will develop a variety of new applications with the majority of projects being developed in a modern web application framework based on Java, with associated unit tests, build tools and documentation. There will also be enhancement and customisation work on existing applications written in a variety of languages. The successful candidates will be able to adapt rapidly to new environments and technologies. Most of the development will be done on Linux platforms with occasional work on Mac OS X or Windows servers.

Much of the development will use open source software and improvements and modifications to the software will be fed back to the international development teams responsible for these pieces of software.

Candidates must have:

* A good degree in a numerate discipline, or equivalent experience.
* Substantial experience of software development environments.
* Substantial experience in web application development.
* Good knowledge of Java, preferably in a web context.
* Good knowledge of at least one interpreted language.
* Good communication skills, both verbal and written.

These are permanent appointments at Computer Officer Grade 8, subject to a nine-month satisfactory probation period, on a salary scale of £30,012-£40,335 with initial salary depending on experience. More information can be found on the CS's web site. The closing date for applications is Friday 5 October 2007. It is intended that interviews will be held during the week beginning Monday 15 October 2007.
fanf: (Default)
DSpace Developer

The University of Cambridge's institutional data and document repository system uses DSpace, a complex Java application running on Tomcat/PostgreSQL. We are looking for an experienced software analyst/developer to maintain and develop this system and its client tools to ensure that it continues to provide for the needs of the University and deliver appropriately enhanced functionality for the future.

Candidates must have considerable experience of object-oriented software development, preferably with Java Web and SQL database applications in a Unix/Linux environment, and must be self-motivated and capable of prioritising their own work. They must have experience in managing complex software development projects, preferably in an open source context, and must have excellent problem-solving and troubleshooting skills. The candidate will also be expected to be capable of presenting their work to a wider national and international audience and must be a good communicator and collaborator.

This is a fixed-term appointment for 5 years at Computer Officer Grade 8, subject to a nine-month satisfactory probation period, on a salary scale of £30,012-£40,335 with initial salary depending on experience. More information can be found on the CS's web site.

This is a re-advertisement; previous applicants need not apply.

Computer Officer, Institute of Astronomy

The Institute of Astronomy has an opportunity for a highly motivated, problem-solving, innovative individual to join the IT Support team. The Institute comprises approximately 170 staff and PhD students and the role will be as one of a team of four computer officers with principal responsibility for supporting Linux.

We have a large science cluster for astronomical data processing and modelling. The majority of the cluster is currently Sun servers and desktops with an increasing number of Linux PCs and Sun Opteron based desktops and servers. The computer management team also support a large number of other peripherals including multi-terabyte RAID arrays for large data storage.

The role holder will be educated to degree level (or equivalent) in a physical science subject and is required to be a multi-skilled communicator, so as to provide in-depth cover on a wide variety of IT issues, and interface in a helpful and friendly manner with the Institute staff via a Helpdesk.

The appointment is at grade 7, with a salary range of £25,134 - £32,796. No limit of tenure applies. Applications, including a full CV, completed PD18 form, and the contact details of three referees who may be approached prior to interview should be sent to Joy McSharry, Institute of Astronomy, Madingley Road, CB3 0HA (jpm@ast.cam.ac.uk) by Monday 17 September 2007. Informal enquiries may be made to Dr Andrew Batey, 01223 766662, email adb@ast.cam.ac.uk. Interviews are planned for early October.
fanf: (silly)
; <<>> DiG 9.2.4 <<>> soa figure53.com. @ns1.dreamhost.com.
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26927
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;figure53.com.                  IN      SOA

;; ANSWER SECTION:
figure53.com.           14400   IN      SOA     2007081700. 21487. 1800 1814400 14400 604800 3600

;; Query time: 153 msec
;; SERVER: 66.33.206.206#53(ns1.dreamhost.com.)
;; WHEN: Thu Aug 23 22:42:10 2007
;; MSG SIZE  rcvd: 81
fanf: (Default)

August, dark and dank and wet, brings more rain than any yet!

Our flat roof is having water permeability problems again. This time it has leaked onto the fire suppression system control panel and broken it, meaning that in the event of a fire someone will have to go downstairs to the room full of gas cylinders to welease Woger the suppwession gas. Should be fixed tomorrow...

Also, the delightful old computer lab tower (metal stairs and ugly wired glass cladding) is leaking copiously, and giving anyone who goes to the Titan rooms a light shower.

fanf: (Default)

The University Computing Service is forming a small team of software developers to work on a variety of projects. We are recruiting a database specialist to work in this group. The successful applicant will design database schemas, write stored procedures for the application developers and configure and maintain database systems where needed.

Most but not all development will be done on Linux platforms with databases such as MySQL, Oracle and PostgreSQL. The majority of projects are anticipated to be web applications, typically using Java.

Much of the development will use open source software and improvements and modifications to the software will be fed back to the international development teams responsible for these pieces of software.

Candidates must have:

  • A degree in a numerate discipline, or equivalent experience.
  • Substantial experience of relational database design.
  • Substantial knowledge of SQL and writing stored procedures.
  • Good experience in software development.
  • Familiarity with administering a relational database.
  • Good English skills, both verbal and written.

This is a permanent appointment at Grade 9, subject to a nine-month satisfactory probation period, on a salary scale of £33,779-£42,791 with initial salary depending on experience. More information can be found on the CS's web site.

PS. fanf notes that the dev group's offices are on the 5th floor with a nice view of King's College chapel.

fanf: (Default)
To program is to understand: The development of an information system is not just a matter of writing a program that does the job. It is of utmost importance that development of this program has revealed an in-depth understanding of the application domain; otherwise, the information system will probably not fit into the organization. During the development of such systems it is important that descriptions of the application domain are communicated between system specialists and the organization.

This quote is from the book Object-oriented programming in the BETA programming language, and is also quoted in the HOPL III paper The when, why and why not of the BETA programming language which I have recently enjoyed reading.

fanf: (Default)
Erlang is an excellent language for implementing distributed systems. One of the reasons for this is the semantics of its message-passing primitives. (Other reasons include the way it isolates processes and its distributed failure handling.)

Erlang's send operation is non-blocking, and sends a message to a particular process. Its receive operation can block (with an optional timeout) and can selectively pick messages from the process's mailbox out-of-order, which is useful for avoiding state explosion or tedious book-keeping.

Some languages, including occam and CML, have a send which can block until another process is ready to receive. This kind of rendezvous doesn't make sense in a distributed setting when there is significant latency between the two processes: does the synchronization happen at the sender or the receiver? With asynchronous message passing you have to be explicit about it, typically by making the sender wait for an ack so synchronization happens at the receiver.

Instead of messages being sent to processes, many concurrent languages have a notion of a channel. The problem with channels in a distributed setting is dealing with message routing when channel endpoints can be passed from machine to machine. It's unfeasibly tricky if (as with CML and Concurrent Haskell) multiple processes (potentially on multiple machines) can be waiting to receive from a channel.

It's interesting to observe that, as well as scaling up gracefully, Erlang-style message passing also scales down to very simple forms of concurrency, i.e. coroutines, co-operative multi-tasking, or "fibres". Communication between coroutines is usually synchronous (since you pass the flow of control as well as some values) and addressed by process (not by channel).

Although coroutines can be dead simple, there are some variations, chiefly between asymmetric/symmetric (distinguishing resume and yield or not), and kinds of hobbled coroutines that typically can't yield from a nested function call. The paper about coroutines in Lua has a good survey, plus an example of how you can implement symmetric coroutines on top of asymmetric coroutines by adding a scheduler. (The dual implementation is similar.) You also need a scheduler if you are going to add features like message passing. Stackless Python is a good example of an elaborate coroutine implementation.

Erlang-style message passing between coroutines is dead simple. The send routine just appends the message to the end of a queue, and receive just pulls a message off the start of the queue and dispatches it to the destination process. For selective receive, each process can have a mailbox of pending messages which it checks for a match before dispatching on the main queue; when a process is resumed with a message that doesn't match its current receive statement it adds the message to its mailbox and dispatches the next message from the main queue.
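
Here is a toy Lua version of that design (Lua for its native coroutines; the names are mine, and it omits selective receive and per-process mailboxes):

    local queue = {}                -- pending {pid, msg} pairs
    local procs = {}                -- pid -> coroutine

    local function spawn(f)         -- start a process, run it to its first receive
        procs[#procs + 1] = coroutine.create(f)
        coroutine.resume(procs[#procs])
        return #procs
    end

    local function send(pid, msg)   -- non-blocking: just append to the queue
        queue[#queue + 1] = { pid = pid, msg = msg }
    end

    local function receive()        -- yield until the dispatcher resumes us
        return coroutine.yield()
    end

    local function run()            -- dispatch messages until none are left
        while #queue > 0 do
            local m = table.remove(queue, 1)
            coroutine.resume(procs[m.pid], m.msg)
        end
    end

    local pid = spawn(function()
        while true do print("got", receive()) end
    end)
    send(pid, "hello"); send(pid, "world")
    run()                           --> got hello / got world

Selective receive would add a per-process mailbox that is checked for a match before falling back to the main queue, exactly as described above.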

Thinking about this reminds me a lot of my larval stage hacking around with the Acorn Archimedes RISC OS WIMP, especially the UserMessage feature.
fanf: (silly)
Would you all kindly stop spamming me and every other vaguely technical person with job advertisements. If you read my web site instead of just glancing at it you will know that I am not looking for a job. If you checked your own pimp database perhaps you would find out that I have told you to go away five times already. Perhaps you could also fix your stupid auto-responder so that it doesn't emit spam blowback.

No love,
Tony.
fanf: (Default)

OpenSSH has a neat feature called ControlMaster which allows multiple ssh clients to share the same connection to a target host. This saves time on connection startup by eliminating all the cryptography and authentication for the second and subsequent clients. You can use the feature by explicitly telling ssh when to be a control master (supply -M and -S <socketpath> arguments) and when to be a control client (just supply a -S <socketpath> argument). However it's much more convenient to tell it to automatically be a master if there isn't already one, or a client if there is, by putting ControlMaster=auto in your .ssh/config file.
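
For example, a typical .ssh/config fragment looks like this (the socket path is just an illustration; the % tokens keep sockets distinct per remote user, host, and port):

    Host *
        ControlMaster auto
        ControlPath ~/.ssh/control-%r@%h:%p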

However there is a race in the setup of the communications socket in auto mode, as illustrated by the following command line:

ssh -oControlMaster=auto -oControlPath=sock localhost 'sleep 1; echo 1' &
ssh -oControlMaster=auto -oControlPath=sock localhost 'sleep 2; echo 2' &

Both of the commands will try to start up as a control client, find that sock does not exist, and switch into control master mode. One will succeed in creating the control master socket and the other will fail and bomb.

I've written a patch which eliminates this race by trying to create a control master socket first, and falling back to control client mode if master mode fails. See the attachment to the message I posted to the openssh-dev list.

fanf: (Default)
Make sure you have the line mdns off in /etc/host.conf on your incoming SMTP servers.

One of my colleagues in our network engineering team discovered today that ppswitch was spewing multicast packets, much to our surprise. It turns out that recent versions of glibc have quietly added support for multicast DNS to the resolver. Multicast DNS is part of Apple's zeroconf networking system (aka Bonjour, previously known as Rendezvous), and it takes over host names ending in .local. See it in action by typing strace ping foo.local and observe it sending a DNS query to the class D multicast address 224.0.0.251.

Since MXs have to deal with untold quantities of crap (at the moment about 96% of the email we're offered - 6 million messages per day - is junk) and since one of the key crap detection tools is the DNS, ppswitch ends up doing a lot of crap DNS lookups. A significant number of these (10,000 per day) are names ending in .local which thereby trigger mdns lookups. However these names do not come from machines named via zeroconf: they are mostly Small Business Server installations which have followed Microsoft's recommendations for choosing a domain name.

It is a great source of joy and wonder that Apple and Microsoft both use .local in conflicting ways. This is truly the Zen of standards: contemplate it deeply and you may achieve enlightenment. (if you don't go mad)
fanf: (Default)
We're upgrading Hermes from a heavily modified Cyrus-2.1 IMAP message store to a somewhat less modified Cyrus-2.3 setup. We're at the early testing stage, and I'm one of the guinea pigs. Since I've moved over almost everything has worked fine, except that my cron email was no longer being filtered into my "root" mailbox. Cyrus-2.3 has a new bytecode interpreter for Sieve scripts (so that a script is parsed only when it is uploaded, instead of on every message delivery), and I suspected a bug in the new code.

The cron mail filtering clause in my sieve script uses an address :regex test. The bad behaviour indicated that the regex was not matching - the sieve script was not bombing out. (When a sieve script fails (e.g. because a mailbox is too full) messages are just delivered to the user's inbox, but in this case some of my cron mail was being delivered to my mail-support mailbox because later tests were catching it.) Cyrus's changelog didn't mention any regex-specific changes, and sieve regexes are just POSIX EREs, so I didn't think a subtle regex syntax change was causing the match to fail.
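
For illustration, the clause was something like the following - a reconstruction, not my exact script, with a made-up regex and the anchors that matter to this story:

    require ["regex", "fileinto"];

    # file cron mail from root into a separate mailbox
    if address :regex "from" "^root@([a-z0-9-]+\\.)*cam\\.ac\\.uk$" {
        fileinto "root";
    }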

The Cyrus Sieve code comes with a test wrapper which is not compiled by default, but a quick cd cyrus-imapd-2.3/sieve && make test built it without a problem. It became much more useful when I toggled the DUMPCODE and VERBOSE flags in bytecode.h and when I removed the UUCP From_ line that was causing it to fail to parse the header of my test message. After adding lots of debug printf()s I found that my sieve script was being handled as I expected except that it was trying to match my regex against the whole contents of the message's From: line, whereas it should have been matching it against just the address. My regex included pedantic ^$ anchors which detected the unexpected presence of a display name.

The bug is fixed by a one-line patch that passes the parsed-out address part to the regex comparator instead of the whole header, as is done for other comparators. It only affects the address :regex test - other tests with the :regex comparator are OK, and so are other comparators with the address test.
--- sieve/bc_eval.c     13 Feb 2007 15:06:54 -0000      1.1.1.1
+++ sieve/bc_eval.c     30 Jul 2007 16:03:30 -0000
@@ -591,7 +591,7 @@
                                    goto alldone;
                                }

-                               res |= comp(val[y], strlen(val[y]),
+                               res |= comp(addr, strlen(addr),
                                            (const char *)reg, comprock);
                                free(reg);
                            } else {

(This is not my first Cyrus Sieve bug - I also found a case where it was ignoring backslash escapes before * glob characters. The patch is here.)
fanf: (Default)
This is one of the worst examples of willful incompetence I have seen recently, so I thought it deserved a bit of point and laugh.

In the last couple of weeks we received two complaints that email from ABN AMRO (a large Dutch bank) to Cambridge was unreliable. According to the bounce messages that were forwarded to us, their Postfix outbound servers were complaining "conversation with mx.cam.ac.uk timed out while sending MAIL FROM". Strange: we don't do anything particularly time consuming at that point in the SMTP conversation. Perhaps it's an MTU problem?

I try turning on some extra logging, and Exim says "SMTP connection from triton10.abnamro.nl lost while reading message data (header)". This is inconsistent with an MTU problem, since the envelope commands (MAIL FROM, RCPT TO, DATA) are larger than the replies, and Exim has received the commands and sent the replies OK. It's also inconsistent with Postfix's error message, since Postfix obviously sent the MAIL FROM without a problem. It turns out there's a minor bug in Postfix that causes it to use an incorrect error message when there's a timeout waiting for a reply from the server.

OK, so ABN AMRO's Postfix is timing out while waiting for our envelope replies, but our replies are sent reasonably promptly. I resort to running a very selective tcpdump on mx.cam.ac.uk to see if that provides a clue. There is indeed no sign of an MTU problem: what is actually happening is their end is closing the connection only 15 seconds after it has sent the envelope commands. Exim doesn't check for a closed connection until it wants to read more data, which explains its error message.

So it looks like their end has an absurdly small 15-second timeout, which triggers if we take too long to emit the envelope replies - which can happen if recipient address verification takes a while. The standard requires at least a five-minute timeout, and we're careful to stay within that limit. They are just asking for trouble if they reduce their timeout to such a short period, and they would have to deliberately break Postfix, which ships with correct defaults.

So I tried getting in touch with their postmasters. I first tried postmaster@nl.abnamro.com since one of the problem reports came from an @nl.abnamro.com address. I received two bounces from their Lotus Notes system. I then tried postmaster@abnamro.com, and after a day without a reply I tried postmaster@abnamro.nl. I still haven't received a reply.

I also asked the people who reported the problem to chase it up with their IT staff. Eventually I got the reply that they are aware of the problem but there is no "business justification" to fix their broken systems. I bet ABN AMRO's management would do something if their post room was chucking letters in the bin and ignoring support queries, so why do they tolerate such crapness for email?
fanf: (Default)

Here are some of my favourite pointless suggestions for syntactic changes to programming languages.

Bitwise not and exclusive or

By analogy with negation and subtraction, I think the operators for not and xor should be spelled the same. I.e. instead of a ^ b write a ~ b.

Indirection

Prefix * in C is disastrous for both expression and declaration syntax. Pascal's postfix ^ for indirection is much more sensible. Fortunately the previous reform frees up ^ in C for this purpose. We can then abolish -> since it becomes superfluous.

Instead of

    void (*signal(int sig, void (*func)(int)))(int);
We get
    void signal(int sig, void func^(int))^(int);

Associativity of relational operators

It doesn't make sense for relational operators to be associative. To specify that in a > b > c the first comparison gives a boolean result that is compared with c is utter nonsense. It's much more sensible to say that relational operators chain as in mathematical notation, so that expr relop expr relop expr ... expr relop expr is equivalent to expr relop expr and expr relop expr and ... and expr relop expr except that each expr is evaluated at most once. BCPL has this syntax except for the latter guarantee, so you can't say '0'<=rdch()<='9' which is a shame.

Precedence of logical not

Logical not (or !) should have precedence between logical and (or &&) and the relational operators, so that there's much less need for an unless keyword.

Counting with for loops

The Modula (etc.) style for var = init to limit by step do ... is too wordy. The C style for (var = init; var < limit; var += step) ... is too generic and too repetitive. Lua's for var = init, limit, step do ... is way too cryptic. I quite like an idea from one of Simon Tatham's school friends, which fits in nicely with chained relational operators and mathematical notation. It comes in a few variants, for counting up or down, with exclusive or inclusive limits, with unity or non-unity step:

    for init <= var < limit do...
    for init >= var > limit do...
    for init <= var <= limit do...
    for init >= var >= limit do...
    for init <= var < limit by step do...
    for init >= var > limit by step do...
    for init <= var <= limit by step do...
    for init >= var >= limit by step do...

Assignment and equality

I'm a firm believer in shunning bare = as an operator, and using := for assignment and == for testing equality - at least in languages where assignment is an expression. There's much less opportunity for confusion in languages where assignment is a statement and therefore if var = expr then ... is a syntax error. Since I have a head full of Lua at the moment I will note that it has one place where assignment and equality can clash, in table constructor expressions, where the former is key/value construction and the latter is list-style construction with a boolean value.
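
That clash, concretely (a hypothetical two-table example):

    t1 = { x = 1 }      -- key/value construction: t1.x is 1
    t2 = { x == 1 }     -- list-style construction: t2[1] is the boolean x == 1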

fanf: (Default)

I've been playing around with Lua recently. It's a really sweet little language. A particular highlight is its "tables", which are generalized associative arrays that can have keys of any type. (Most languages only allow strings and/or numbers.) It also has pretty nice uniform handling of multiple function arguments, multiple return values, and multiple assignment.

A frequent topic of discussion on the Lua list is the fact that Lua does not have a switch/case statement. Most of the people who propose how one should be added tend to copy other languages which can only switch on one value at a time. I think it would be more in keeping with the rest of Lua to have a multi-valued case statement. I also quite like the idea of making it capable of ML-style pattern matching and variable binding.
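
For comparison, the usual workaround today is a table of functions, which only handles dispatch on a single value by direct equality - a minimal sketch, with names of my own invention:

    local handlers = {
        [3] = function() print("three") end,
        [4] = function() print("four") end,
    }
    local function dispatch(x)
        local h = handlers[x]
        if h then h() else print("something else") end
    end
    dispatch(3)   --> three
    dispatch(5)   --> something else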

The syntax I have in mind is as follows. I'm using ? to mean 0 or 1 of the following, * to mean 0 or more, and + to mean 1 or more.

    case explist1
+(+(match patlist
      ?(if exp) )
    then
        block)
  ?(else
        block)
    end
As in C, multiple matches for a given block are expressed using multiple match clauses, instead of the alternatives being listed in a single clause as in Modula. Each match can be guarded by a conditional expression; when it evaluates to false the match fails and later ones are tried. The statements following then are evaluated if one of the preceding matches succeeds. You can put a "default" or "otherwise" case after an else. (I quite like the re-use of keywords, though perhaps it is too cute.) All variables bound by the match clauses are visible in the following block; the variables in a pattern list must all be different; if the clause binding a variable isn't the one that matched then the variable's value is nil.

Lua also needs more expressive exception handling - I'm not too fond of its "protected call" mechanism. It would be more palatable if it were dressed up in something like:

    try
        block
+(+(catch patlist
      ?(if exp) )
    then
        block )
  ?(finally
        block )
    end

A useful bit of syntactic sugar is to be able to declare functions in a pattern matching style instead of an argument-passing style, so:

    function (...)
        case ...
        match a, b, c
            blah blah
        end
    end
could be more concisely written:
    function
    match a, b, c
        blah blah
    end

I'm still pondering the syntax for pattern lists. A basic principle is that all direct comparisons should be against constants, so that the match can be compiled into a table lookup against a constant table. This means only numbers and strings - other objects in Lua either do not have compile-time expressions (C userdata) or have expressions that do not denote constant values but instead describe how to create new objects at run-time (function closures and tables). I also want to allow indirect comparisons against table elements, so that it's possible to match against structured data. Finally there needs to be a way of binding a variable to the value being matched against.

I do not intend to support exact matches against tables: only the elements that you specify in the match will be checked and if the table has more elements they will be ignored. This is because the extra check to make the match exact may not be entirely trivial to implement, and because it makes code brittle in the presence of unexpected values squirreled away in a table - it's quite common in prototype-based object systems to add methods to an object without co-operation from the code that created it.

So the basic syntax for a pattern to match against simple constants or to bind a variable is:

    String | Number | Name
Matching against the first few integer indexed elements of a table in list style is also simple:
    "{" pattern *( "," pattern ) "}"

It's also useful to both match against a table and bind a variable, so that you can refer to the table in the later statements. (It isn't possible to reconstruct the table because it may have extra elements, and even if it were possible it would be inefficient and the result would have a different identity.) It's also useful to both match a constant and bind a variable, if you have multiple match clauses: case f() match 3 match 4 then -- did f return three or four? I'm still unsure about the syntax for matching and binding. Some possibilities are:

    pattern and pattern
    pattern pattern
    Name "=" pattern
    Name "==" pattern

The idea of "and" patterns is that you are matching both the left-hand side and the right-hand side. It suggests that there should also be "or" patterns, as in case f() match 3 or 4 then ... In MCPL you get the same effect as and by just writing the patterns side-by-side, and (like sh) | is used instead of or.

The idea of using "=" is to make binding in a pattern look like assignment, whereas "==" is supposed to be like an assertion. However these get problematic when we try to fit key/value style table matching into the syntax. Table constructors in Lua look like:

    tableconstructor = "{" ?fieldlist "}"
    fieldlist = field *(fieldsep field) ?fieldsep
    fieldsep = "," | ";"
    field = "[" exp "]" "=" exp
        | Name "=" exp
        | exp
(The first form of field is by analogy with the table[key] form of index expression. The second is like table.name indexing, which is syntactic sugar for table["name"]. The last is for list-style construction, where the value is assigned to the next integer index starting from 1.)

If we follow constructor syntax closely, we get the serious difficulty that a pattern like { foo = bar } isn't clear which name is the table index and which is the variable being bound. This probably indicates that = should be avoided.

Perhaps the solution is to use an arrow to indicate which way values flow, as in:

    tablepattern = "{" ?fieldpatterns "}"
    fieldpatterns = fieldpat *(fieldsep fieldpat) ?fieldsep
    fieldpat = "[" (String | Number) "]" "->" pattern
        | Name "->" pattern
        | pattern
    pattern = String | Number | Name
        | pattern and pattern
        | pattern or pattern

This seems reasonably OK, I think. It also suggests adding a free-standing match operator -> that produces a boolean result (and probably should not bind variables).
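
For concreteness, a case statement in this hypothetical syntax (none of it is implemented, and all the keywords are still negotiable) might look like:

    case f()
    match { x -> a, y -> b } and t then
        -- a and b are bound to t.x and t.y; t is the whole table
        return a + b
    match 3 or 4 then
        return "three or four"
    match n then
        return n
    end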

It makes me want to change the assignment operator to <- for symmetry :-) But fiddling with Lua's existing features is the topic for another post.

fanf: (Default)
PEGs are an interesting new formalism for grammars. Unlike most older formalisms, which are based on Chomsky's generative grammars, their starting point is recognizing strings that are in a language, not generating them. As such they are a closer match to what we usually want a grammar for. The practical effect of this is that they naturally avoid common ambiguities without external rules, such as C's if/else ambiguity or the various rules about greediness imposed on regexes (e.g. perl's matching rules versus POSIX's longest-leftmost rule, discussed in Friedl's book). Even though PEGs can recognize some non-context-free languages (e.g. a^n b^n c^n) they can be matched in linear time using a packrat parser (which can be implemented very beautifully in Haskell).

Bryan Ford's 2004 POPL paper establishes the formal foundations of PEGs and defines a concrete syntax for them, fairly similar to ABNF. The key differences are: the choice operator is ordered (prefers to match its left-hand sub-expression); repetition operators are maximally greedy and don't backtrack (so the second a in a*a can never match); and it includes positive and negative lookahead assertions of the form &a and !a (like (?=a) and (?!a) in perl).

It occurs to me that there is a useful analogy hidden in here, that would be made more obvious with a little change in syntax. Instead of a / b write a || b, and instead of &a b write a && b. With || and && I am referring to C's short-cutting "orelse" and "andalso" boolean operators - or rather the more liberal versions that can return more than just a boolean, since a PEG returns the amount of input matched when it succeeds. This immediately suggests some new identities on grammars based on De Morgan's laws, e.g. !a || !b === !(a && b). Note that !!a =/= a because the former never consumes any input, so not all of De Morgan's laws work with PEGs. This also suggests how to choose the operators to overload when writing a PEG parser combinator library for C++ (which has a much wider range of possibilities than Lua).
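
To illustrate why the double negation law fails, compare these two expressions (sequencing is juxtaposition, as usual in PEGs):

    !!"a" "a"
    "a" "a"

The first matches the input "a", because the double lookahead consumes nothing before the final "a" is consumed; the second requires the input "aa".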
fanf: (Default)

One of the features I like about Firefox is that when I run firefox (from the command line or from my window manager) it starts a new instance if one isn't running, and opens a new window if one is. I have a lot of virtual desktops, and it is much more convenient to just run firefox than to find a desktop with an existing firefox window, create a new window, then move it to the target desktop. Sadly emacs is not so accommodating.

Emacs has a feature which allows you to tell a running emacs to start editing a file. However it does not open a new frame to display the file, so you have to go searching through your virtual desktops to find where it ended up.

I run emacs version 21, but version 22's emacsclient has a new -e (eval) option which allows you to tell emacs to run arbitrary lisp. This can solve the niggles I have described: e.g. to open a new emacs frame on the current desktop I could just run emacsclient -e '(make-frame)'. It would be nice to have this feature in version 21, so I have come up with an evil hack to do the job.

My original idea was that I could set up a file with a local variables section, including an "eval" specifier that tells emacs to open a new frame. Then opening the file would create a new frame. However emacs is sensibly paranoid and won't run lisp in arbitrary files without manual confirmation.

Plan B was to add a function to the find file hook, so that when I opened a specific file (that need not exist) the necessary magic would occur. Then running emacsclient --no-wait ~/.emacs-newframe would create a new frame. After some fiddling I made this work. The code in my .emacs is as follows. As well as creating a new frame, it kills the buffer containing the dummy file and switches the new frame to the canonical dummy buffer, to keep things tidy. It uses the server-visit-hook which is a bit more specific for my purpose than the file-find-hook.

(defun fanf-newframe ()
  (cond ((string-equal buffer-file-name "/home/fanf2/.emacs-newframe")
	 (kill-buffer nil)
	 (select-frame (make-frame))
	 (switch-to-buffer "*scratch*"))))

(add-hook 'server-visit-hook 'fanf-newframe)

A little wrapper script can start emacs or run emacsclient as necessary, to get behaviour like firefox. Instead of running bare emacsclient, it can now also create a new frame before asking emacs to visit a file, which is a bit nicer. However doing so may be racy because I can't make emacsclient ~/.emacs-newframe (without --no-wait) work sensibly: the hook runs before the buffer is registered as being visited by emacsclient, so the hook can't tell the server that the buffer is finished with after it is sure the frame exists. However emacs is not internally multi-threaded and the server handles requests in-order, so it's probably OK.
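
The wrapper might look something like this sketch (untested, with hypothetical paths; the sleep is a crude dodge around the race just described):

#!/bin/sh
# ask a running emacs for a new frame, then visit the files there;
# if no emacs is running, start a fresh one
if emacsclient --no-wait ~/.emacs-newframe 2>/dev/null
then
	sleep 1  # give server-visit-hook time to create the frame
	exec emacsclient --no-wait "$@"
else
	exec emacs "$@"
fi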

fanf: (Default)
You can classify formalisms for message-passing concurrency into channel-oriented (like the π calculus) or destination-oriented (like the actor model).

The E programming language is based on a kind of remote procedure call, and so is superficially destination-based: the communication described by the RPC is targeted at a particular destination object. However the process performing the RPC does not block waiting for the result, since this can take a very long time in widely distributed systems. Instead it immediately gets a promise for the eventual result; it can then perform RPCs on the promise which will be handled by the eventual result as soon as it has been determined. This allows you to pipeline RPCs so that networking delays overlap instead of accumulating.

E also allows you to create promises without making an RPC, and these promises are directly equivalent to a one-shot channel. You can pass the sending and receiving ends around in any way you like, and the receivers can attach actions to the promise to be performed when it has been fulfilled. On this foundation it's reasonably straightforward to create persistent channels: whenever a channel is used, the sender has to create a replacement promise which is passed with its main message; after handling the message, the receivers listen for the next message on the new promise instead of the old one. If there are multiple senders then they must have copies of both ends of the promise, so that they can spot when it has been used and update their reference to the replacement.
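
Here is a toy sketch of that construction in Python rather than E (illustrative only: these promises block where E's are asynchronous, and it ignores the multiple-sender bookkeeping):

    import threading

    class Promise:
        """A one-shot channel: resolved at most once, then readable."""
        def __init__(self):
            self._ready = threading.Event()
            self._value = None
        def resolve(self, value):
            assert not self._ready.is_set(), "promises are one-shot"
            self._value = value
            self._ready.set()
        def result(self):
            self._ready.wait()
            return self._value

    def send(channel, payload):
        # each message carries the replacement promise for the next message,
        # turning a chain of one-shot promises into a persistent channel
        replacement = Promise()
        channel.resolve((payload, replacement))
        return replacement      # the sender's new sending end

    def receive(channel):
        payload, replacement = channel.result()
        return payload, replacement   # listen on the replacement next time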

However not all promises are first class. When you perform an E-style RPC, you get an explicit handle on the receiving end of a promise, but not on the sending end - that is just implicit in the return value of the remote procedure. If this procedure returns a promise (the result of another RPC) then the two promises effectively become unified: fulfilling the second promise fulfills the first. There's no way of having more than one sender on this chain of promises because you can't branch it, unless you add first-class promises.

This all has some implications for distributed garbage collection: as well as direct references to remote objects, E can refer to them via promises. With first-class promises the distributed topology can be arbitrarily complicated via both kinds of references. I do not yet know whether restricting promises to RPC only is worth it for simplifying GC and/or too harmful to programming flexibility.
fanf: (Default)

Between returning from San Francisco and getting my job at the University, I did some FreeBSD-related things. One of its features which captured my interest was its kernel object system, kobj. This was designed to make the driver ABI more dynamic so that modules compiled against older source could be loaded into newer kernels and still work. The problem is that if a driver call is changed, you don't want to have to add compatibility tests to every call point in the kernel. FreeBSD uses kobj to solve this problem by making drivers into classes (such that each object of the class represents a device controlled by the driver) so that driver calls become method calls. What caught my interest is that the method calls are much more dynamic than I was used to from languages like C++.

C++ does method lookup using an array of method pointers called a vtbl. The static type of the object determines the layout of the vtbl, i.e. which method pointer is at which index in the table. (The dynamic type of the object may be a derived type of its static type, in which case the vtbl contains more entries than are determined by the static type.) A method call object.method(args) compiles to object.vtbl[method_index](args). This is no good for our problem because any change to a class's methods changes its ABI and means that old code will no longer work.

In kobj, the method lookup table is a cache, so that method lookups are fast in the normal case, but fall back to a more complicated dynamic lookup if there is a cache miss. This means that if a new method is called against an old object then the lookup code can return a pointer to a compatibility shim method instead of one of the methods defined by the object's class. I'm going to show you my version of the method call sequence, but I need to explain a few basics first. In kobj (and my object system cobj) a class is just a collection of method implementations: there is no type system determining the set of methods implemented by a class. The most obvious consequence is that method declarations are divorced from classes. They are effectively just names, like method selectors in Smalltalk. A method has several parts: at compile time it has an inline function that is used to call the method, and a type so that implementations and calls can be checked; at run time it has an extern object that serves to represent the method selector.

	/* In a header file somewhere... */

	/* The type signature of my_method() */
	typedef int my_method_t(cobj_t *o, int a, void *p);

	/* An extern object representing the method selector.
	 * It is not declared as a pointer because we want to
	 * be able to get its address efficiently: addresses
	 * are resolved at link time, saving an indirection
	 * at run time. The object itself is just a place-
	 * holder, but may be useful for introspection.
	 */
	extern struct cobj_meth_t my_method_sel;

	/* Method dispatch is implemented as an inline function.
	 * The actual code is the same for all methods, except
	 * for variations in the types, and would normally be
	 * created with a macro.
	 */
	static inline int my_method(cobj_t *o, int a, void *p) {
		cobj_mi_t *mi = &o->cobj_mi[COBJ_HASH(&my_method_sel)];
		if(mi->m != &my_method_sel)
			cobj_lookup(o, &my_method_sel); /* fills in *mi */
		return ((my_method_t *)mi->i)(o, a, p);
	}

The lookup cache is an array of {method,implementation} pairs, indexed by a hash of the method selector. If the selector doesn't match then we call the dynamic lookup function to find the implementation and fill in the table entry. Then we can just do an indirect call of the implementation. The hash function is defined to be as lightweight as possible:

	/* In the main cobj header file... */

	/* opaque method selector type */
	typedef struct cobj_meth_t cobj_meth_t;

	/* generic function pointer type for method implementations */
	typedef void cobj_impl_t(void);

	/* {method,implementation} pairs */
	typedef struct cobj_mi_t {
		cobj_meth_t *m;
		cobj_impl_t *i;
	} cobj_mi_t;

	/* standard prefix of all objects */
	#define COBJ_PREFIX cobj_mi_t *cobj_mi

	typedef struct cobj_t {
		COBJ_PREFIX;
	} cobj_t;

	#define COBJ_MI_SIZE 32
	#define COBJ_HASH(m) (((size_t)(m) / sizeof(cobj_mi_t)) % COBJ_MI_SIZE)

As well as the explicit operations for computing the hash, the compiler needs to multiply it by sizeof(cobj_mi_t) when indexing the cache. The resulting combination will just compile to (m & 0xF8) on a machine with 32-bit pointers, so that sizeof(cobj_mi_t) is 8 - i.e. we use the most random bits from the method object's address as the index.

In kobj, the lookup function was fixed. Classes just consisted of a lookup cache shared by all instances, and a simple table of {method,implementation} pairs that the dynamic lookup function could search. Objects had a similarly simple structure: they were just structs whose first member pointed to the object's class's method cache. There was no support for inheritance; however some parts of the kernel had hacked up their own implementations using the C++ layout with the base class's members as the prefix of the object with the derived class's members following. The problem with this is it introduces another opportunity for ABI incompatibility: if the base class changes size then all derived classes need to be recompiled.

I was interested in extending the object system to allow inheritance of various kinds whilst avoiding as much ABI coupling as possible. I had read the Art of the Metaobject Protocol so I was also interested in playing around with object systems where you could replace key parts like object layout, inheritance structure, etc.

In a traditional OO system, each class is also an object. The class-dependent operations are things like dynamic method lookup and object creation. If a class is an object then these operations should be methods on the class object. This implies that a class must be an instance of a class, which in Smalltalk terminology is called a metaclass. I envisioned having different metaclasses to implement simple method lookup, single inheritance, multiple inheritance, mixin inheritance, etc. Thus the dynamic lookup function just juggles things so that it can perform the appropriate method invocation on the original object's class and save the result in the cache.

	/* In the main cobj header file... */

	/* common prefix of class objects */
	#define COBJ_CLASS_PREFIX COBJ_PREFIX; cobj_mi_t cache[COBJ_MI_SIZE]

	typedef struct cobj_class_t {
		COBJ_CLASS_PREFIX;
	} cobj_class_t;

	static inline cobj_t *cobj_class(cobj_t *obj) {
		void *cache = obj->cobj_mi;
		void *class = (char *)cache - offsetof(cobj_class_t, cache);
		return(class);
	}

	/* In the main cobj code file... */

	void cobj_lookup(cobj_t *obj, cobj_meth_t *meth) {
		cobj_mi_t *mi = &obj->cobj_mi[COBJ_HASH(meth)];
		mi->i = cobj_do_lookup(cobj_class(obj), meth, obj);
		mi->m = meth;
	}

This function is designed to make the method dispatch fast path as small as possible, particularly by minimizing the number of arguments. It is also the vehicle for recursion up through the metaclasses. Since a class is an object with an implementation of the cobj_do_lookup() method, and a metaclass is the class of a class, metaclasses are also objects with implementations of the cobj_do_lookup() method. Obviously we need a way to bound the recursion.
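
To make the recursion concrete, here is a sketch of how cobj_do_lookup() itself can be dispatched as a method on the class object, using the same inline convention as my_method() above (the names follow my code above, but treat the details as illustrative):

	/* The lookup method's type and selector. */
	typedef cobj_impl_t *cobj_do_lookup_t(cobj_t *class, cobj_meth_t *meth, cobj_t *obj);
	extern cobj_meth_t cobj_do_lookup_sel;

	static inline cobj_impl_t *
	cobj_do_lookup(cobj_t *class, cobj_meth_t *meth, cobj_t *obj) {
		/* dispatch on the class, so the implementation is found
		 * via the cache belonging to the class's metaclass */
		cobj_mi_t *mi = &class->cobj_mi[COBJ_HASH(&cobj_do_lookup_sel)];
		if(mi->m != &cobj_do_lookup_sel)
			cobj_lookup(class, &cobj_do_lookup_sel);
		return ((cobj_do_lookup_t *)mi->i)(class, meth, obj);
	}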

Eventually we must reach the class that is its own metaclass. The safe way is to detect this case in cobj_lookup() and call a hardcoded cobj_do_lookup_base() implementation. This tactic allows the ur-class to be a fairly normal class that supports other methods (e.g. for introspection or construction). A much more amusing way is to arrange that the ur-class only has one method, so we can pre-populate its cache. (We can't do this if there's more than one method because we don't know what addresses and therefore indexes they will have - and they might clash). Then the recursion will always terminate with a cache hit. You then need a slightly more elaborate bootstrap strategy. The ur-class is the metaclass of classes with one method, of which there will be a zeroth instance with the pre-populated cache. There will be one other instance which knows how to do method lookup for simple classes with more than one method but no inheritance etc. This will in turn have an instance which is the same as the safe (boring) ur-class.

	/* forward declarations for the bootstrap objects below */
	extern cobj_meth_t cobj_do_lookup_sel;
	cobj_impl_t *cobj_do_lookup_ur(cobj_t *class, cobj_meth_t *meth, cobj_t *obj);
	cobj_impl_t *cobj_do_lookup_simple(cobj_t *class, cobj_meth_t *meth, cobj_t *obj);

	struct cobj_class_ur {
		COBJ_CLASS_PREFIX;
		cobj_impl_t *do_lookup;
	} cobj_zero = {
		/* cache pointer: cobj_zero is its own class */
		cobj_zero.cache,
		/* pre-populated cache */
		{ { &cobj_do_lookup_sel, (cobj_impl_t *)cobj_do_lookup_ur },
		  /* COBJ_MI_SIZE times... */
		  { &cobj_do_lookup_sel, (cobj_impl_t *)cobj_do_lookup_ur } },
		/* the only method we support */
		(cobj_impl_t *)cobj_do_lookup_ur
	}, cobj_one = {
		/* cache pointer: cobj_one is an instance of cobj_zero */
		cobj_zero.cache,
		/* no need to pre-populate this cache */
		{ { NULL, NULL } },
		/* the only method we support */
		(cobj_impl_t *)cobj_do_lookup_simple
	};

	cobj_impl_t *cobj_do_lookup_ur(cobj_t *class, cobj_meth_t *meth, cobj_t *obj) {
		struct cobj_class_ur *ur = (struct cobj_class_ur *)class;
		assert(meth == &cobj_do_lookup_sel);
		return(ur->do_lookup);
	}

Apart from the goal of supporting different kinds of inheritance, this metaclass approach is very like Smalltalk (including the recursion back to a class that is its own metaclass), and Objective C is very like Smalltalk. However they are not solutions to the ABI skew problem (or the API skew problem!) that I started with, because in Smalltalk and Objective C the members of base classes and the size of base objects are exposed to derived classes, so you can't extend a base class without losing compatibility.

This was all quite fun to play with, but unfortunately I got hung up on writing a preprocessor to make declaring classes and methods less painful, so it never got very far before I got a job.

fanf: (Default)

We use the JANET subscription to the MAPS RBL+. It is a frequent source of false positives, which cause a great deal of irritation.

  • Their RSS (relay spam stopper, i.e. list of open relays) does not have any re-testing nor any expiration of old entries, so is full of shockingly stale entries.
  • Their DUL (dial-up user list, i.e. list of home computer IP addresses) has many errors. They do not track re-assignments of address space, which caused them to list gmail's servers earlier this year.
  • Spamhaus's ZEN list is just as effective but much more trouble-free. Why is JANET paying for the RBL+ and not ZEN?
  • They do not accept corrections from their paying customers. WTF!
Sadly I can't easily ditch the RBL+ since using both it and ZEN adds 20-25% to our block rate compared to either alone, so performance would suffer horribly. Also, I don't want to get into the game of maintaining my own blacklist and/or whitelist - a time sink I do not need.

fanf: (Default)
Dear lazyweb,

Please could I have an implementation of TeX written in Javascript which renders beautifully typeset mathematics to a <canvas>.
fanf: (Default)

There's currently a discussion on the IETF SMTP mailing list about the exact syntax of the domain part of an email address, since the cleaned-up ABNF in RFC 2821 unintentionally forbade domains with a single label. The argument is how to fix this error, whether the discrepancy between DNS syntax (optional trailing dot for FQDNs) and SMTP syntax (no trailing dot ever) matters, and whether email to a bare top-level domain makes any sense.

Arnt Gulbrandsen noticed that a surprisingly large number of TLDs have MX records, so I thought it might be interesting to list all 26 of them...

$ dig axfr . @f.root-servers.net. |
  sed 's/^V^I/ /g;/\..* NS .*/!d;s///' |
  uniq | tr A-Z a-z |
  while read d; do
    dig mx $d. |
    sed 's/^V^I/ /g;/\..* MX /!d;s//. MX /' |
    grep . && whois -h whois.iana.org $d |
    sed '/Country:/!d;s///' | uniq;
  done |
  manual-fixup

ac. A                              ; not listening
     Ascension Island
ai. MX 10 mail.offshore.ai.        ; seems to work <vince@ai> (sendmail/ubuntu "No UCE/UBE")
     Anguilla
as. MX 10 dca.relay.gdns.net.      ; accepts anything (sendmail)
as. MX 20 jfk.relay.gdns.net.      ; accepts anything (sendmail)
as. MX 30 lhr.relay.gdns.net.      ; accepts anything (sendmail)
     American Samoa
bi. A                              ; not listening
     Burundi
bj. MX 10 bow.intnet.bj.           ; accepts anything (postfix)
bj. MX 20 bow.rain.fr.             ; not listening
     Benin
cf. MX 10 bow.intnet.cf.           ; relaying denied (sendmail)
     Central African Republic
cm. A                              ; not listening
     Cameroon
cx. MX 5 mail.nic.cx.              ; relaying denied (exim/@mail)
     Christmas Island
dj. MX 5 smtp.intnet.dj.           ; relaying denied (trend anti-virus)
dj. MX 10 bow.intnet.dj.           ; not listening
dj. MX 20 bow.rain.fr.             ; not listening
     Djibouti
dk. A                              ; rejects everything (postfix)
     Denmark
dm. MX 10 mail.nic.dm.             ; relaying denied (imail)
     Dominica
gp. MX 5 ns1.nic.gp.               ; relaying denied (sendmail "I enjoy loveletters")
gp. MX 10 ns34259.ovh.net.         ; relaying denied (qmail)
gp. MX 20 manta.outremer.com.      ; not listening
     Guadeloupe
gt. MX 10 mail.gt.                 ; not listening
     Guatemala
hr. MX 10 alpha.carnet.hr.         ; accepts anything (exim/debian)
     Croatia/Hrvatska
im. MX 10 mail.advsys.co.uk.       ; relaying denied (msexchange)
im. A                              ; ignored
     Isle of Man
io. MX 10 mailer2.io.              ; seems to work <paul@io> (sendmail)
io. A                              ; ignored
     British Indian Ocean Territory
kh. MX 10 ns1.dns.net.kh.          ; not listening
     Cambodia
km. MX 100 mail1.comorestelecom.km.; relaying denied (sendmail)
km. MX 110 bow.snpt.km.            ; no such host
     Comoros                       ; SOA contains <root@km>
mh. MX 10 imap.pwke.twtelecom.net. ; accepts anything (?)
mh. MX 20 mx1.mail.twtelecom.net.  ; relay denied (postfix)
mh. MX 30 mx2.mail.twtelecom.net.  ; relay denied (postfix)
     Marshall Islands
mq. MX 10 mx.mediaserv.net.        ; domain not fully-qualified (postfix/debian)
     Martinique
ne. MX 10 bow.intnet.ne.           ; relaying denied ("*****")
ne. MX 20 bow.rain.fr.             ; not listening
     Niger
ni. MX 0 ns2.ni.                   ; relaying denied (sendmail)
ni. MX 10 ns.ni.                   ; not listening
     Nicaragua
pa. MX 5 ns.pa.                    ; relaying denied
     Panama
ph. A                              ; not listening
     Philippines
pn. A                              ; rejects everything (exim)
     Pitcairn
pw. A                              ; relaying denied (sendmail)
     Palau
sh. A                              ; not listening
     St Helena
td. MX 10 mail.intnet.td.          ; relaying denied (sendmail)
td. MX 20 bow.rain.fr.             ; not listening
     Chad
tk. MX 100 ATAFU.TALOHA.tk.        ; seems to work <joost@tk> (postfix)
     Tokelau
tl. MX 5 mail.nic.tl.              ; relaying denied (exim/@mail)
     Timor-Leste
tm. A                              ; not listening
     Turkmenistan
tt. MX 0 66-27-54-138.san.rr.com.  ; seems to work <admin@tt> (sendmail)
tt. MX 10 66-27-54-142.san.rr.com. ; relaying denied (postfix/ubuntu)
     Trinidad and Tobago
ua. MX 5 km.nic.net.ua.            ; seems to work <hostmaster@ua> (sendmail)
ua. MX 20 mx1.he.kolo.net.         ; seems to work (sendmail)
     Ukraine
uz. A                              ; not listening
     Uzbekistan
va. MX 10 lists.vatican.va.        ; accepts anything (postfix)
va. MX 20 paul.vatican.va.         ; not listening
va. MX 50 proxy2.urbe.it.          ; relaying denied (qmail)
va. MX 90 john.vatican.va.         ; not listening
     Vatican
ws. MX 10 mail.worldsite.ws.       ; relaying denied ("*****")
ws. A                              ; ignored
     Samoa
fanf: (Default)

Previously: part 1 part 2 part 3 part 4 part 5a part 5b part 5c part 6

Message content scanning is vital for blocking viruses and spam that aren't blocked by DNSBLs. Unfortunately the interface between MTAs and scanners varies wildly depending on the MTA and the scanner - there are no good standards in this area. However I'm going to ignore that cesspit for now, and instead concentrate on when the scanning is performed relative to other message processing. As you might expect, most of the time it is done wrong. Fortunately for me the Postfix documentation has an excellent catalogue of wrong ways that I can refer to in this article.

An old approach is for the MTA to deliver the message to the scanner which then re-injects it into the MTA. The MTA needs to distinguish messages from outside and messages from the scanner so that the former are scanned and the latter are delivered normally. The Postfix documentation describes doing the delivery and re-injection in the "simple" way via a pipe and the sendmail command, or in the "advanced" way via SMTP. The usual way to do this with Exim is to tell Exim to deliver to itself using BSMTP over a pipe, using the transport filter feature to invoke the scanner. This setup has a couple of disadvantages that are worth noting. It (at a minimum) doubles your load because each message is received and delivered twice. It also makes the logs confusing to read, since the message has a different queue ID before and after scanning and therefore different IDs when it is originally received and finally delivered.
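
For reference, the Exim flavour of this is configured something like the following transport fragment (a sketch from memory: you also need a router that sends not-yet-scanned messages to it, and some way of marking re-injected messages as already scanned):

    scan:
      driver = pipe
      command = /usr/sbin/exim -bS
      use_bsmtp
      transport_filter = /usr/local/bin/scanner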

Another arrangement is MailScanner's bump-in-the-queue setup. The MTA is configured to leave messages in the queue after receiving them, instead of delivering them immediately as it usually would. MailScanner picks them up fairly promptly - it scans the queue every second or two - and after scanning them, drops them in a second queue then tells the MTA to deliver the messages from this second queue. MailScanner has the advantage that it can work in batch mode, so when load is high (several messages arrive between incoming queue scans) the scanner startup cost is spread more thinly. This is useful for old-fashioned scanners that can't be daemonized. Apart from the scanning itself, its only overhead is moving the messages between queues. MailScanner also preserves queue IDs, keeping logs simple. A key disadvantage is that MailScanner needs intimate knowledge of the MTA's queue format, which is usually considered private to the MTA. Sendmail and Exim do at least document their queue formats, though MailScanner is still vulnerable to format changes (e.g. Exim's recent extension to ACL variable syntax). Postfix is much more shy of its private parts, so there's a long-standing argument between people who want to use MailScanner and Wietse Venema who insists that it is completely wrong to fiddle with the queue in this way.

So far I have completely ignored the most important problem that both these designs have. It is too late to identify junk email after you have accepted responsibility for delivering it. You can't bounce the junk, because it will have a bogus return path so the bounce will go to the wrong place. You can't discard it because of the risk of throwing away legitimate email. You can't quarantine it or file it in a junk mailbox, because people will not check the quarantine and the ultimate effect will be the same as discarding. (Perhaps I exaggerate a bit: If the recipient doesn't get an expected message promptly, or if the sender contacts them out of band because they didn't get a reply, the recipient can at least look in the quarantine for it. However you can only expect people to check their quarantines for unexpected misclassified email if the volume of junk in the quarantine is relatively small. Which means the quarantine should be reserved for the most difficult-to-classify messages.)

You must design the MTA to scan email during the SMTP conversation, before it accepts responsibility for the message. It can then reject messages that smell bad. Software that sends junk will just drop a rejected message, whereas legitimate software will generate a bounce to inform the sender of the problem. You minimise the problem of spam backscatter and legitimate senders still get prompt notification of false positives. However you become much more vulnerable to overload: If you scan messages after accepting them, you can deal with an overload situation by letting a backlog build up, to be dealt with when the load goes down again. You do not have this latitude with SMTP-time scanning.

The Postfix before-queue content filter setup uses the Postfix smtpd on the front end to do non-content anti-spam checks (e.g. DNS blacklists and address verification), and then passes the messages through the scanner using SMTP (in a similar manner to Postfix's "advanced" after-queue filters) then in turn to another instance of the smtpd which inserts the message into the queue. There is minimal buffering before the scanner, so the whole message must be scanned in memory as it comes in, which means the scanner's concurrency is the same as the number of incoming connections. This is a waste: messages come in over the network slowly; if you buffer them so that you can pass them to the scanner at full speed, you can handle the same volume of email with lower scanner concurrency, saving memory resources or increasing the number of connections you can handle at once. However you don't want to buffer large messages in memory because that brings back the problem in another form. You also don't want to buffer them on disk, since that would add overhead to the slowest part of the system - unless you use the queue file as the buffer. This implies that Postfix's before-queue filtering is too early since the writing to disk happens after the message has gone through the scanner.

Sendmail's milter API couples scanners to the MTA in about the same place as Postfix's before-queue content filter, so it has the same performance problems. (Actually, in some cases it is worse: If you have a filter that wants to modify the message body, then with Postfix it can in principle do so in streaming mode with minimal in-memory buffering, whereas with Sendmail the milter API forces it to buffer the entire message before it can start emitting the modified version.) What's more interesting is their contrasting approach to protocol design. Postfix goes for a simple open standard on-the-wire protocol as the interface to its scanners. However it misses its target: It speaks a simplified version of SMTP to the scanner, with a non-standard protocol extension to pass information about the client through to Postfix's back end. The simplification means that Postfix cannot offer SMTP extensions such as BINARYMIME if the scanner does not do so too, which is a bit crippling. Sendmail goes for an open API, and expects scanners to link to a library that provides this API. The connection to the MTA is a private undocumented protocol internal to Sendmail, and subject to change between versions. This decouples scanners from the details of SMTP, but instead couples them to Sendmail. This is terrible for interoperability - and in practice it's futile to fight against interoperability by making the protocol private, because people will create independent implementations of it anyway: 1 2 3. So I don't like the Postfix or the Sendmail approaches, both because of their performance characteristics and because of their bad interfaces.

Exim is agnostic about its interface to scanners: it has sections of code that talk directly to each of the popular scanners, e.g. SpamAssassin, ClamAV, etc. This is rather inefficient in terms of development resources (though the protocols tend to be simple), and is succumbing to exactly the Babel that Postfix and Sendmail were trying to avoid. Exim's approach has the potential to be better from the performance point of view: It writes the message to disk before passing it to the scanner at full speed, so in principle the same file could act as the buffer for the scanner and the queue file for later delivery. This would mean there are no overheads for buffering messages that are accepted; if the message is rejected then it will only hit the disk if the machine is under memory pressure. Sadly the current implementation formats the message to a second file on disk before passing it to the scanner(s), instead of formatting it in the process of piping it to the scanner. The other weakness is that although there is a limit on the number of concurrent SMTP connections, you can't set a lower limit on the number of messages being scanned at once. You must instead rely on the scanners themselves to implement concurrency limits, and avoid undaemonized scanners that don't have such limits. This is probably adequate for many setups, but it means the MTA can't make use of its greater knowledge to do things like prioritize internal traffic over external traffic in the event of overload.

So, having criticised everything in sight, what properties do we want from the MTA's interface to scanners? In general, we would like the logistics of passing the message to the scanner to add no significant overhead - i.e. the cost should be the same as receiving the message and scanning the message considered separately, with nothing added to plug these processes together. Furthermore we'd like to save scanners from having to duplicate functionality that already exists in the MTA. Specifically:

  • Buffer the message in its queue file before scanning, so that the scanner does not take longer than necessary because it is limited by the client's sending speed.
  • Insulate the scanner from the details of SMTP extensions and wire formats, without compromising the MTA's support for same. This implies that any reformatting (e.g. downgrade binary attachments to base64) needed by the scanner should not pessimize onward delivery.
  • Put sensible limits on the concurrency demanded of the scanner to maximise its throughput. Use short-term queueing and scheduling (a few seconds) to handle spikes in load.
  • Cache scanner results.
  • Put a security boundary between the MTA and the scanner.

Notice that these have a certain commonality with callout address verification, which also needs a results cache, concurrency limits, and a queue / scheduler. This gives me the idea for what I call "data callouts" for content scanning, based on a loose analogy between verifying that the message's addresses are OK and verifying that the message's contents are OK. Also notice that message reformatting and security boundaries are requirements for local delivery. So a "data callout" is essentially a special kind of local delivery that the MTA performs before sending its reply to the last of the message data; it's a special kind of delivery because it is only done to check for success or failure - unlike normal deliveries the message isn't stored in a mailbox. This design makes good use of existing infrastructure: The MTA can use its global scheduler to manage the load on the scanner. There is already lots of variability in local delivery, so the variability in content scanner protocols fits in nicely.
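
A rough sketch of the resulting flow at the end of DATA, in Python-flavoured pseudocode (all the names here are invented):

	def end_of_data(msg):
		# buffer the message in its queue file first, so the scanner
		# can be fed at disk speed rather than at the client's pace
		qf = write_queue_file(msg)
		result = callout_cache.lookup(msg.content_hash)
		if result is None:
			# the data callout: a delivery that only reports success
			# or failure, run under the MTA's global scheduler so the
			# concurrency demanded of the scanner stays bounded
			result = scheduler.deliver(scan_transport, qf)
			callout_cache.store(msg.content_hash, result)
		if result.ok:
			return "250 message accepted"
		else:
			discard_queue_file(qf)
			return "550 content rejected"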

The data callout is actually a special case of "early delivery", i.e. delivering a message before telling the client that it has been accepted. This feature gives you a massive performance boost, since you can relay a message without touching disk at all (except to log!). If you are going to attempt this stunt then you need a coherent way to deal with problems caused by the early delivery taking too long. Probably the best plan is to ensure that a very slow diskless early delivery can be converted to a normal on-disk delivery, so that a response can be given to the client before it times out, and so that the effort spent on delivery so far is not wasted. This is similar to allowing lengthy callout address verifications to continue even after the client that triggered them has gone, so that the callout cache will be populated with a result that can be returned quickly when the client retries. (I'm not sure if it's worth doing the same thing with data callouts, or if a slow scanner more likely indicates some nasty problem that the MTA should back away from.)

The Postfix and Sendmail filter interfaces have a feature that is missing from Exim's scanner interface and my data callout idea. The filters can modify the message, whereas the scanners can only return a short result (such as a score). Message mangling is not something I particularly approve of, but it is a popular requirement. Fortunately my idea can support it, by going back to the old approach of delivering the message to the scanner which then re-injects it. Early delivery removes most of the disadvantages from this technique: it happens before we accept the message, and it doesn't add to disk load. It adds a new advantage of being able to fall back gracefully from scan-then-accept to accept-then-scan in the event of overload, if that's what you want. It still has the disadvantages of log obfuscation and depending on the scanner to support advanced SMTP features (though perhaps these can be avoided with a better filter protocol).

I hope that this convinces you that - as I said in my last essay - lots of cool things become possible if you get callouts right. This essay also serves as a response to iwj10, who complained that my log-structured queue idea was a pointless optimisation because early delivery is much more effective. He wasn't convinced when I said that early delivery was a separate problem. Even when you have early delivery - so that the queue only contains very slow or undeliverable messages - the log-structured queue reduces the effort required to work out which messages to retry next because the necessary information is stored in the right order.

fanf: (Default)
http://www.cam.ac.uk/cs/jobs/

Development Group Manager

The University Computing Service is forming a small team of software developers to work on a variety of projects, both service-based and internal. We are recruiting an experienced developer to lead this group, which will consist of a mixture of permanent staff and short-term contractors.

Most but not all development will be on Unix based platforms, but applicants should have experience in more than one development environment and platform. The majority of projects are anticipated to be database-driven web applications, and substantial experience with the relevant software development tools in this area is essential.

The successful applicant will have line-management and administrative responsibility for the team, including specifying the details of the contracted work, monitoring progress and quality control.

Much of the development work will use open source software, and the manager will also be responsible for ensuring any developments are fed back to the international development community.

This is a permanent appointment at Grade 9, subject to a nine months satisfactory probation period, on a salary scale of £32,795-£41,544 with initial salary depending on experience. The University offers many benefits to its staff including a final-salary pension scheme (USS), a variety of facilities and services, and opportunities for training and career development. The closing date for applications is Friday 13 April 2007. It is intended that interviews will be held during the week beginning Monday 23 April 2007.
fanf: (Default)
This afternoon I have been clearing up the remnants of the infrastructure that supported the withdrawal of insecure access to Hermes. For a final tombstone, I've written the following to replace the text on the Computing Service's summary page about the campaign...

Between March 2005 and March 2007, all insecure modes of access to Hermes were withdrawn. The following is a brief history of the process.

Following a decision made by the Computing Service, an initial announcement was sent out in early March 2005. The original timetable was to disable insecure access via protocols that do not require user configuration in July 2005, and to disable insecure access via other protocols between July 2005 and July 2006.

In early May 2005, warning banners were added to the insecure telnet, rlogin, FTP, and webmail login screens. At the start of July 2005, telnet, rlogin and FTP access were disabled, leaving the secure alternatives ssh, scp, and sftp as the only way of accessing the Hermes menu system. At the same time the webmail service was changed to automatically redirect insecure http logins to https.

Over the summer of 2005 we developed documentation to help people reconfigure their MUA - http://www.cam.ac.uk/cs/email/muasettings.html - and more detailed guidelines for sending email from the University network - http://www.cam.ac.uk/cs/email/sending.html

Between October and December 2005, the Computing Service developed software to support the withdrawal of insecure access via IMAP, POP, and SMTP. This included a system for emailing users to tell them that they needed to reconfigure their MUA settings, and a web site that provided information to IT support staff about the configuration status of their users. The software was piloted within the Computing Service before being announced to the University's IT support staff: http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/talks/2006-01-techlinks/

Over 10,000 users had to change their MUA settings. The software spread out the process of notifying users to minimise the resulting load on IT support staff. We used a ratchet system to disable insecure access for users who had corrected their settings, without affecting other users who were not yet ready. The notification process for each user became gradually more insistent and eventually irritating over a period of about two months, in order to encourage users to change their settings before their insecure access was disabled. This minimised the problem of users coming to work in the morning and ambushing their computer officer because their email did not work; instead they could be dealt with later in the day.

This process was mostly complete by Christmas 2006 - five months later than planned, because of the delayed start, but not because it took longer than intended. During that time, the Computing Service help desk dealt with at least 15% more calls than in the preceding year.

There remained about 250 users sending email through smtp.hermes using email addresses that could not be automatically linked to a Hermes account. These users were notified and reconfigured between January and March 2007. The whole process was complete just over two years after it started.

The Computing Service is grateful to IT staff in the departments and colleges who did a lot of work to help with this campaign, and without whom it would not have been possible.
fanf: (Default)
The United States Securities and Exchange Commission is taking action against the stock pump-and-dump fraud spammers.

http://news.google.com/news?q=sec+spam

I wonder whether the effect of pump-and-dump on the companies' stock price and volatility is worse (or not) than the SEC suspending trading. I wonder if pump-and-dump spam could be used to make money by blackmailing the targeted companies. I wonder if the companies are legitimate victims, or are they colluding with the spammers?
fanf: (Default)

(This isn't really part of my "how not to design an MTA" series since I don't have much to say about how other MTAs do it. That's mainly because most of them don't have schedulers to speak of. Postfix has one, but it isn't able to pipeline messages down a connection, and it doesn't have a centralized router. Its current queue manager is described in the attachment linked from a message to postfix-devel from Patrik Rak.)

There are two important parts of an MTA that need a concurrent scheduler: routing (where you want concurrent DNS/LDAP/etc. lookups) and delivery (where you want concurrent connections to remote hosts). (Aside: incoming messages are also concurrent but in this case the demand arises externally so can't be scheduled).

Scheduling is tricky because of the mix of work that we have to handle. Messages can have multiple recipients: the vast majority (around 90%) have only a single recipient, but those with more than one can have very large envelopes if they combine mailing list and alias expansion. Remote systems can respond with varying speed: as well as DNS and SMTP being slow because of network latency or brokenness, SMTP servers sometimes introduce delays deliberately to cause problems for clients they think are bad in some way.

(Note: a big message has lots of data whereas a big envelope belongs to a message with lots of recipients.)

In the following I'll talk about jobs and workers to abstract from the differences between routing and delivery. In routing a job is just a recipient that needs to be routed, and the workers are routing processes; the workers are all equivalent and we have not yet found any differences between the jobs. In delivery, a job is a subset of the recipients for a message which can all be delivered to the same place, and a worker is a connection to a remote SMTP server or to a local delivery agent; there is more complexity because it's more efficient to re-use a connection for jobs routed to its destination than to handle each delivery independently.

First I'll examine the basics that are shared between routing and delivery. It's worth looking at a couple of simple schedulers that don't work.

1. Do not allocate jobs to workers on a first-come-first-served basis. A big envelope can use up all your workers, holding back other messages. Once you have dealt with all its fast jobs you will be dealing with just slow jobs, so your throughput will be poor.

2. Do not allocate an equal number of workers to each message. When you are heavily loaded each message will only have one worker each, which means that large envelopes will be processed serially - but we want to make use of concurrency to handle large envelopes effectively.

A possible compromise is to use a FCFS scheduler up to some maximum concurrency per message. However a maximum that makes sense on a heavily-loaded system means that big envelopes won't be able to make use of the spare resources on a lightly-loaded system. So we need a way to scale up the maximum if there are free workers.

	def allowed_new_worker(job):
		c = job.message.concurrency
		if c < max_message_concurrency:
			return True
		# scale the per-message limit up when the system has free workers
		return c < scale_max_concurrency * free_workers

A scale factor of 10 means that a message can use about 91% of an otherwise-idle system, or two messages can use 47% each, etc. If a second big envelope comes along when the first one already has 91% of the system, then message 2 will slurp up the remaining 9% (assuming that is less than the max_message_concurrency) and as jobs in message 1 complete message 2 will take over their workers until the new equilibrium is reached.

	def next_router_job(jobs):
		if allowed_new_worker(jobs.first):
			return jobs.first
		return next_router_job(jobs.rest)

The allowed_new_worker() function is still useful in the delivery scheduler, but we want to spread the cost of connection set-up by delivering multiple messages down each connection. If we are going to make good use of pipelining we should allocate new deliveries to connections before the previous deliveries are complete, i.e. there can be a queue of multiple jobs per worker. Furthermore, if a worker is overloaded (its queue is backlogging) we need to open another connection to a suitable destination; we should do so in a way that spreads the load across the possible destinations. We also need to detect when the load has dropped and we can close a connection.

This will depend on a database of destinations, destdb. For broken destinations it records when we should next bother to try connecting to them. It allows us to look up efficiently any connection(s) that we have currently open to a destination, whether they are active or idle, and how old they are. When we close the last connection to a destination, we keep a record of when it was made to help the decision of where to open new connections. The destdb has a garbage collector to remove stale retry and closed-connection records.

When deciding which worker to use for a delivery job, we first classify its possible destinations (initially just the primary destinations) based on destdb into active, idle, closed, unused (absent from destdb), and retry (in which case we omit destinations that have not passed their retry time).

First sort the active list by connection time and traverse it from most recent to least recent. The first one we find whose job queue is shorter than the backlog threshold is the worker for this job. We prefer recent connections so that when our load goes down older connections will become idle and eventually be closed.

If this fails then check if this job is allowed to use a new worker. If it is not, we still allocate it to an active connection so that big envelopes can get a free ride off smaller ones. Note that this means their concurrency can be higher than the configured maximum even under load. We choose the first connection in the active list that has fewer jobs than the newest active connection (to spread the load evenly). If they all have the same number of jobs, use the first one. If there are no active connections then this job must wait to be rescheduled when a worker becomes free. (We can fall back to this procedure a couple of other ways below.)

The alternative case is that we have a job which can and should use a new worker. If there are destinations on its idle list the job should use the most recent one (which presumably was active until not long ago). Otherwise we are going to open a new connection. If the system has no unused workers (or idle ones that we can close) to handle the new connection, then add the job to an active connection as above.

If there are destinations on its unused list the job should pick one at random and use that. This will become a new connection which will attract new jobs, thereby spreading our load onto a part of the destination cluster we haven't used yet. Otherwise, we have previously used all the possible destinations. If there are destinations on the closed list, pick the oldest. This means we will spread our load around the destination cluster round-robin style, though in a random order because of the way we pick previously-unused destinations. Finally, if there are destinations on the retry list, try one of them at random. Since this is the last option, we will normally not retry a destination until its record is garbage-collected from destdb (causing it to appear on the unused list), unless we are under load.

At this point we have failed to find a candidate destination. There are two possibilities: all of the working destinations have active connections, or all of them failed the retry time check. In the first case, we open another connection to a destination picked at random from those with the fewest concurrent connections. If they have all reached their per-destination concurrency limit, add the job to an active connection as above.

In the second case, none of the job's primary destinations are currently working. So we repeat the whole process from the classification stage using the secondary destination list. If necessary we can fall back to the tertiary etc. destinations. If we run out of fallbacks then the message must be deferred for later retry.

That wraps up the delivery job scheduling algorithm.
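
Condensed into Python-flavoured pseudocode, the whole selection procedure is roughly as follows (the helper functions and the WAIT/DEFER sentinels are invented and left undefined):

	def choose_worker(job, destination_lists):
		for dests in destination_lists:   # primary, secondary, tertiary...
			c = classify(dests, destdb)   # active/idle/closed/unused/retry
			# prefer the most recent active connection with room to spare
			for conn in newest_first(c.active):
				if len(conn.jobs) < backlog_threshold:
					return conn
			if not allowed_new_worker(job):
				# ride along on an active connection, else wait
				return underloaded(c.active) or WAIT
			if c.idle:
				return newest(c.idle)     # was active until recently
			if not workers_available():   # no unused or closeable workers
				return underloaded(c.active) or WAIT
			if c.unused:
				return connect(random.choice(c.unused))
			if c.closed:
				return connect(oldest(c.closed))  # round-robin-ish reuse
			if c.retry:
				return connect(random.choice(c.retry))
			if c.active:
				# every working destination already has a connection
				d = random.choice(fewest_connections(dests))
				if d.connections < per_destination_concurrency:
					return connect(d)
				return underloaded(c.active) or WAIT
			# otherwise every destination failed its retry-time check:
			# fall through to the next destination list
		return DEFER   # out of fallbacks; defer the message for later retry

	def underloaded(active):
		# first active connection with fewer jobs than the newest one,
		# or failing that the newest one, or None if there are no
		# active connections at all
		conns = newest_first(active)
		for conn in conns:
			if len(conn.jobs) < len(conns[0].jobs):
				return conn
		return conns[0] if conns else None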

I should note that there are a number of mechanisms that cause connections to be closed, so that the MTA can recover unused resources. Firstly, idle connections can be closed if the worker is required for a new connection. Secondly, we should have an ageing mechanism that closes connections after some period of time; it may be worth using different times for active and idle connections. (We want to close active connections after a time to spread load around even when it is low.) Thirdly, remote MTAs can close connections whenever they wish. If the connection has a queue of pending jobs when it is closed then these must be re-scheduled.

If the MTA is part of a cluster, the destdb should ideally be shared. This allows hosts to pool their knowledge of which hosts are down and when to retry, and it means that the destination load balancing will work across the cluster. (It would not be too difficult to implement this feature for Exim's hints databases.)

There is a little scope for simplification, because it probably isn't worth keeping track of closed connections. (I described them because I wanted to note my thoughts fairly completely.) It's probably OK to remove the special case handling for retries too, and instead rely on the destdb garbage collector.

Note that it probably isn't worth worrying about a large envelope using all your resources, because SMTP only guarantees that envelopes of up to 100 recipients will work. To encounter problems you would probably have to configure your mailing list manager to use large envelopes. (I'm assuming that you are using modern concurrency techniques, i.e. hundreds of connections per process, not one per process.) On the other hand it probably is worth taking care to limit the load imposed from outside (attackers) by smtp-time verification. You can do so within the above framework by assigning all verifications to a pseudo-message, which would probably have specially tuned concurrency settings.

fanf: (Default)
On Wednesday afternoon I gave a talk to some of the University's "techlinks" (an IT support staff group) which was another of my annual overviews of our email systems. The slides are online at http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/talks/2007-02-techlinks/
fanf: (Default)
Following a discussion about source code revision control systems, something fairly similar to the following conversation occurred...

iwj describes his CVS hack which allows you to move repositories around, and which will chase chains of moves.

fanf is reminded of a script that he wrote to trace routes through a network using `route get` to find the next hop then ssh to connect to the next hop then a Quine to transmit itself to the next hop where it then repeats the process. (You need a network based on BSD routers and you need root on all the machines. It is useful for networks with asymmetrical routing.)

mdw, who also likes Quines, says that he wrote a program that can turn any C program into a Quine.

iwj says that gcc should obviously have a --quine option to output its own source code for easier GPL conformance.

mdw says that Linux should similarly have /proc/source/.

sgt points out that /proc/include would actually be useful for compiling things, prompting brief amazement that Quines can have so many practical purposes.

iwj (slightly missing the point of Quines) suggests that this would cause modprobe to pull in the source code dynamically and eat your RAM.

fanf is reminded of an earlier discussion about LD_PRELOAD hacks (in which we agreed that a good way of discouraging programmers who are too enthusiastic about LD_PRELOAD is to explain to them how they can do a much better job with ptrace) and suggests that, as well as source code, Linux could export object code via .so files under /proc/.

everyone discusses how this would solve the problem of kernel vs. libc version skew, since libc could just dynamically link the kernel-dependent code from procfs. We debate whether Ulrich Drepper would have babies or kittens and whether this would lead to Core Wars being played in every process.
fanf: (Default)

Back in the autumn I wrote three posts (1, 2, 3) about MTA queue logistics, which together comprised part 5 of my MTA design series. Part 4 was about message file format, part 3 was about local delivery, and part 2 was about security partitions. Since it's nearly a year since I wrote part 1 (on local message submission) I should probably write some more...

One of the things I most like about Exim's architecture is the split between the two major sections of its configuration file. The "acls" section contains the "access control lists" which control which messages Exim will accept. ("Access control logic" would be a more accurate name since they are not simple lists.) The "routers" section controls how Exim decides where to deliver messages. This is a fairly clean front-end / back-end split, though it is somewhat obscured by historical baggage, such as the separation between routers and transports, the client/server confusion of authenticators, and the remaining ACL-related stuff in the global configuration section. What is neat, though, is the loose coupling between the front and back ends.

One of the key requirements of an SMTP server (especially an MX) is to verify addresses before it accepts messages, so that it does not take responsibility for undeliverable messages. Postfix normally verifies addresses using its local_recipient_maps setting, which duplicates the routing logic implemented by other parts of the MTA. This implies that when you change the configuration of Postfix's back-end you must make a corresponding change to the front-end. This sucks.

Exim avoids this duplication by using the routers directly to do verification. The ACL just says verify = recipient or verify = sender, and Exim attempts to route the relevant address. If routing succeeds the address is valid, and if it fails the ACL returns the error from the routers directly to the client.
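
A minimal sketch of the shape (not our real configuration, and the router is just the stock accept-local-users example):

    # main section: use this ACL at RCPT time
    acl_smtp_rcpt = acl_check_rcpt

    begin acl

    acl_check_rcpt:
      # run the routers on each recipient; if routing fails,
      # the routers' error is returned to the SMTP client
      require verify = recipient
      accept

    begin routers

    localuser:
      driver = accept
      check_local_user
      transport = local_delivery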

Under the covers, Sendmail works in a similar way to Exim. The rulesets 0-5 correspond to Exim's routers, and the check_* rulesets correspond to Exim's ACLs. Sendmail's rulesets can invoke each other, and the check_* rulesets invoke the numbered rulesets to do address verification.

One of the less nice things about Exim is the ad-hoc configuration language - in fact, Exim has about 7 little languages squeezed inside it, by my count. Sendmail has a much more unified configuration syntax, but it is much, much more obscure. So in fact most people configure it using m4 macros and don't get to appreciate its lurking crufty elegance.

There is an architectural bug in Postfix that prevents it from using this loose coupling design effectively. All email traffic between its front end and back end is via files on disk. This is catastrophic for performance, since using this mechanism for verification would at least triple the disk load required to receive a message: one disk op to receive the message, one for sender verification, and one for recipient verification. That is ignoring the 10%+ of messages that have multiple recipients, and the 25%-33% of messages that have invalid recipients (even after blacklist checks). Even so, you can configure Postfix to work in this way in order to do callout address verification.

At this point I need to take a brief diversion to explore the varied depths of email address verification. In particular, how much effort can or should you put in, which is to say, how close do you get to delivering a message before stopping? (You can't go all the way because you don't have a message to deliver when verifying!) If the address is an alias, do you go on to verify the address it redirects to? What if there is more than one address? If it is a local user, do you check the quota? The answers can depend on both the implementation of your MTA and on local policy decisions.

Callout verification is specific to addresses that the MTA will deliver to over SMTP or LMTP. Traditionally MTAs verify remote addresses by just checking that the domain's DNS is sane, but you can go quite a lot further before you are actually delivering a message. You can ensure that you can connect to the remote SMTP server, and you can start an SMTP transaction by sending the MAIL and RCPT commands of a message envelope. You can abort by resetting the transaction before sending any message data. The result of the verification is the destination's response to the RCPT command, if everything worked.
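
On the wire, a recipient callout looks something like this (an illustrative session, not a verbatim capture; an empty envelope sender is the usual choice for verification probes):

    C: <connect to the destination's SMTP server>
    S: 220 mail.example.org ESMTP
    C: HELO mx.example.com
    S: 250 mail.example.org
    C: MAIL FROM:<>
    S: 250 OK
    C: RCPT TO:<someone@example.org>
    S: 250 Accepted             <- this reply is the verification result
    C: RSET
    S: 250 OK
    C: QUIT
    S: 221 closing connection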

In practice, callout verification is not very useful for addresses at domains not under your control, which is usually the case for sender addresses. This is partly because a lot of email is sent from broken domains that are not running an MTA - most commonly, email from web servers. A lot of it is junk from compromised servers, but a lot of it is also very desirable email related to some transaction performed by the user on the web site. A more worrying reason is that if sender callout verification is widely deployed, then a joe-job turns into an anonymized distributed denial of service attack: criminals can attempt to send lots of email "from" their victim and thereby cause lots of otherwise legitimate MTAs to bombard the victim with verification requests.

However, recipient callout verification can be very useful. If you are running an MX for domains that are under separate management it can be difficult to get their lists of valid recipients. They will have various different user administration systems, some more ad-hoc than others, and even if they are accessible in a reasonably standard way (e.g. LDAP queries to a Microsoft Active Directory) you still have to establish a second out-of-band trust relationship between your MX and the destination system. In comparison, callout verification works in-band, using normal SMTP behaviour with no special work required at either end. Much simpler!

Unfortunately, callout implementations are usually sub-standard. I've already described Postfix's performance problems. Exim's is crippled because it is too closely coupled to the ACLs: it's implemented as a subroutine call that has at most one SMTP time-out period to perform an operation that can take multiple time-out periods. Postfix doesn't make this mistake because its implementation follows the loosely-coupled design that Exim only hints at. This also means that Postfix's callouts can make use of its global scheduling, concurrency control, and connection caching. Exim, being too decentralized, doesn't have these features at all.

Clearly there is room for improvement. In my next article I am going to argue that if you can get callout verification right then it makes a lot of other really cool stunts easy.

fanf: (Default)
Following on from http://fanf.livejournal.com/67571.html

People who follow [livejournal.com profile] dotaturls will have noticed that I have published a couple of revisions of RFC 1845bis as official Internet Drafts. My intention is to get it published as an RFC through the "individual submission" route. This means it goes through the IETF review & consensus process, but it is not the product of an IETF working group. RFC 1845 was published as "experimental" but I'd like 1845bis to be on the Standards Track.

(Documents can also become RFCs through the "independent submission" route, which mostly involves just the RFC editor. This is used by 1st April RFCs, for example, and these documents are not eligible for the Standards Track. Not all RFCs are standards.)
fanf: (Default)

I recently read the official transcript of the international conference held at Washington DC in October 1884 for the purpose of fixing a prime meridian and a universal day. This conference was called by the US government to (in effect) ratify at the political level a decision that had been made at the technical level by the International Geodetic Association in Rome in 1883.

I expected it to be deathly boring, but it turned out to be entertaining at a number of levels.

The result was that the Greenwich meridian would be considered the zero of longitude, and that it would be counted from -180° in the West to +180° in the East. Universal time would be counted from 0h to 24h such that noon at Greenwich would be 12h UT and the date line would be at 180°. All of these details could have been different.

There is a lot of procedural guff in the transcript, and it was starting to look like a meeting of the People's Front of Judea (but less funny) when it turned into something like a skit from 1066 And All That. The three principal protagonists are the USA, Britain, and France, all of whom act like stereotypes of themselves.

  • USA: brash, impatient, sure that they already have the correct answer and that this international diplomacy is just an irritatingly slow way of rubber-stamping a decision that has already been made, but nevertheless keen on openness and democracy in the most practical sense.
  • Britain: Top Nation and knows it, therefore can be smugly quiet with the knowledge that the result will be in their favour, with the occasional insufferably magnanimous and sporting comment that they "had no desire to advocate any one of the places enumerated" etc.
  • France: Utterly affronted by the Americans and pissed off that they aren't Top Nation any more, arguing that if they can't have the meridian then no-one can, and that the result should hold to the philosophical ideals of the metric system, the French Revolutionary Calendar, and decimal clocks.

In the end, the main result was determined to a large extent by the fact that about 70% of the world's shipping used charts based on the Greenwich meridian, and this had also been the decision in Rome. However, they did not agree to all of the previous year's choices; in particular, because the people in Rome were astronomers and navigators, they preferred UT days to be counted noon-to-noon so that a night's observing could all be logged with the same date, whereas the people in Washington preferred to count them midnight-to-midnight following the usual practice amongst normal people. Also, in Rome they preferred to count longitude from 0° to 360°, but in Washington they decided on -180° to +180° because this was closer to the markings on existing charts. (The west/negative to east/positive choice was kept.)

On the way some interesting arguments were made, often including fascinating facts. A significant one was that the prime meridian should be based on a geographical feature, such as the centre of the Atlantic (the Azores) or the Pacific (the Bering Strait). It's lucky that both Greenwich and Paris are close enough to being 180° from the Bering Strait for practical purposes - in particular that the International Date Line could (later) be mostly drawn along 180° without undue pain. By the 1880s, sailors already had an informal date line through the Pacific; I guess that this is because of the size of the ocean and because America was colonized from Europe - would the date line go down the Atlantic if it had been colonized from the Far East? One of the wackier points was that the Gregorian calendar was a Roman calendar and that Rome was the centre of the Christian world, so it should obviously be the centre of universal time. A more sensible suggestion (but still too radical to be accepted) was that the world should be divided into hourly timezones, very similar to the ones we have now and to the ones that had already been established on the North American railways, and based on GMT.

The history subsequent to the Washington conference is somewhat ironic. One requirement at the time was that the prime meridian passed through the transit instrument of a good observatory which could thereby keep the time on which longitude depended (or that the meridian was some fixed offset from such an observatory, though they decided against that option to save on maritime charts). By the start of WWI, people were developing long-distance precision time transfer using radio telegraphy, so the dependence on a single observatory and master clock was no longer necessary or desirable. The French established the Bureau International de l'Heure (BIH) which defined and maintained time and longitude based on observations from multiple observatories, and thereby took back what they had lost three decades previously!

This should have made no difference, but owing to longstanding inaccuracies in the official longitudes of the various observatories, the zero longitude of the BIH's geodetic model no longer matched the Airy transit circle at Greenwich. Further errors (of about 10m) were introduced when the Royal Observatory moved from Greenwich to Herstmonceux after WWII. The biggest change, however, happened when the standard geodetic model was fixed to be geocentric, i.e. to have its origin at the Earth's centre of mass. Although the new model matched the old fairly well at the equator, it ended up being off by about 100m at Greenwich, and this discrepancy is still preserved by WGS84, the world geodetic system used by GPS. If you go to Greenwich you'll find that not only does your GPS receiver disagree with the markings on the ground, but so do your Ordnance Survey maps - because they use a datum established before the construction of the Airy transit circle.

Nowadays the Greenwich Observatory is just a museum and irrelevant to time and longitude, despite what HMG thinks. For example, BBC Radio's six-pip time signal is now derived from GPS and synchronized to UTC (despite the fact that legal UK time is still GMT, i.e. UT1) - there's even a seventh pip to mark a leap second correctly! More absurdly, the government made a fuss about Greenwich time in the run-up to the millennium under the brand name "Greenwich Electronic Time", including promotion of NTP and installation of atomic clocks at the LINX (which is based at Telehouse and is coincidentally almost exactly on the meridian).

fanf: (Default)
I've just rolled out some changes to our POP and IMAP servers which have become possible because all our users are now logging in over TLS. For the last year, while we were working towards getting all users' settings secured, there was a small weakness in the way we enforced secure logins: the server had to receive the username before it could decide whether or not to allow an insecure login. This meant that when a supposedly secure user accidentally used insecure settings (e.g. because they were reconfiguring a new MUA) they were likely to expose their password, because the login conversation went too far before it was aborted.

The changes are just a few pedantic tweaks to the way our servers speak IMAP and POP.

In POP, the client logs in by issuing a USER command, which states the username, followed by a PASS command with the password. The server now rejects insecure login attempts after the USER command, before the password is transmitted. There is a corresponding change to the server's capability list, which omits the USER command until TLS has been established.
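
On the wire an accidental insecure login attempt now fails before the password is sent - something like this (an illustrative session, not a verbatim capture):

    C: CAPA
    S: +OK capability list follows
    S: STLS
    S: .                      <- USER is not advertised before TLS
    C: USER fanf2
    S: -ERR plaintext login is not allowed until TLS is established
    C: STLS
    S: +OK begin TLS negotiation
       <TLS handshake>
    C: USER fanf2
    S: +OK
    C: PASS xyzzy
    S: +OK logged in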

IMAP is less helpful to us because its LOGIN command transmits the username and password all at once. (However this saves a round trip so is faster.) We can still put LOGINDISABLED in the server's capability list, which will help clients that take notice of it. The other IMAP improvement is to include the capability list in the server banner, which saves another round trip - a noticeable improvement for GPRS users.
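
A client that honours LOGINDISABLED won't try; one that ignores it has already exposed the password by the time we reject it, which is exactly why the capability matters. The greeting and a rejected login look something like this (again illustrative):

    S: * OK [CAPABILITY IMAP4rev1 STARTTLS LOGINDISABLED] server ready
    C: a1 LOGIN fanf2 xyzzy
    S: a1 NO LOGIN is disabled until TLS is established
    C: a2 STARTTLS
    S: a2 OK begin TLS negotiation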

This should be invisible to users, but makes the service safer especially for wireless access.
fanf: (Default)

A "proleptic" time scale is one whose definition has been extended so that it can be used outside its normal range of validity. For example, it's common for computer software to implement only the current leap year rules regardless of the year in question, which means they implement the proleptic Gregorian calendar. This is a straight-forward simplification that's useful if you aren't too worried about handling historical dates anachronistically.

Most current computer time scales follow a form of Universal Time which has a fixed 24 * 60 * 60 = 86400 seconds per day, synchronized to UTC but with some kind of fudge to handle leap seconds. It would be much more satisfactory for the computer's clock to count seconds of atomic time all the time, and use a leap seconds table to implement UTC properly.

However, the current rules for translating between atomic time and universal time only date back to 1972, which is fairly hopeless if you want your time scale to be usable for historical purposes. Of course, atomic time only dates back to 1958 which is not much better. We need to come up with some plausible rules for proleptic TAI and proleptic UTC so that we can handle historical times within the same framework that we use for times nearer the present.

Since there was no TAI before 1958, and TAI is defined to be synchronized with UT at the start of 1958, the obvious way of dealing with earlier times is to assume that TAI = UTC = UT before that point.

Between 1958 and 1972 there was a time scale called "UTC" but it is nothing like the current version. This old UTC (oUTC) had variable-length seconds and occasionally made 0.1s steps. The history of its variations is kept by the US Naval Observatory. The time scale described in that table does mesh reasonably smoothly with TAI-UT=0 at the start of 1958 (even though 1961 is the start of the oUTC time scale) and TAI-UTC=10 at the start of 1972. The problem with using this table directly is that it means we have to introduce complications into the code which are not necessary for most purposes. It also implies a degree of accuracy that can't really be justified for events that old.

(Digression: I'm deliberately not distinguishing between UT0 / UT1 / UT2 before 1958 because practical timescales back then were not well synchronized owing to disagreements about the locations of observatories. An alternative that would appeal to pedantic astronomers would be to align our base time scale to ephemeris time and terrestrial time which is TAI+32.184s. However that needlessly complicates things since we would have to work out how to extend back into history the non-uniform translation to UT, and it requires all the complexity of oUTC.)

It seems to me that a reasonable solution is to use the USNO's table to extend the leap second system back to 1958. This means that we will not need to deal with non-integer offsets between time scales nor with variable-rate seconds. My method is to use the equations in the table to work out the TAI-oUTC values for 00:00 on 1st January and 1st July each year (the points immediately after each possible leap second insertion). Assuming that these values are a good approximation to TAI-UT1, I add leap seconds to create a proleptic UTC (pUTC) which keeps |pUTC - oUTC| less than 0.9s. This is the same as the modern UTC-UT1 (DUT1) rule.

date		mjd	tai - utc	leap
1958-01-01	36204	0.0024020	0
1958-07-01	36385	0.2369780	0
1959-01-01	36569	0.4754420	0
1959-07-01	36750	0.7100180	1
1960-01-01	36934	0.9484820	1
1960-07-01	37116	1.1843540	1
1961-01-01	37300	1.4228180	1
1961-07-01	37481	1.6573940	2
1962-01-01	37665	1.8458580	2
1962-07-01	37846	2.0491572	2
1963-01-01	38030	2.2558260	2
1963-07-01	38211	2.4591252	3
1964-01-01	38395	2.7657940	3
1964-07-01	38577	3.1016660	3
1965-01-01	38761	3.5401300	4
1965-07-01	38942	3.9747060	4
1966-01-01	39126	4.3131700	4
1966-07-01	39307	4.7823220	5
1967-01-01	39491	5.2592500	5
1967-07-01	39672	5.7284020	6
1968-01-01	39856	6.2053300	6
1968-07-01	40038	6.5770740	7
1969-01-01	40222	7.0540020	7
1969-07-01	40403	7.5231540	8
1970-01-01	40587	8.0000820	8
1970-07-01	40768	8.4692340	9
1971-01-01	40952	8.9461620	9
1971-07-01	41133	9.4153140	10
1972-01-01	41317	9.8922420	10

Thus my leap second table looks like this:

   date      time   TAI-pUTC
1959-06-30 23:59:60     1
1961-06-30 23:59:60     2
1963-06-30 23:59:60     3
1964-12-31 23:59:60     4
1966-06-30 23:59:60     5
1967-06-30 23:59:60     6
1968-06-30 23:59:60     7
1969-06-30 23:59:60     8
1970-06-30 23:59:60     9
1971-06-30 23:59:60    10
1972-06-30 23:59:60    11
1972-12-31 23:59:60    12
1973-12-31 23:59:60    13
1974-12-31 23:59:60    14
1975-12-31 23:59:60    15
1976-12-31 23:59:60    16
1977-12-31 23:59:60    17
1978-12-31 23:59:60    18
1979-12-31 23:59:60    19
1981-06-30 23:59:60    20
1982-06-30 23:59:60    21
1983-06-30 23:59:60    22
1985-06-30 23:59:60    23
1987-12-31 23:59:60    24
1989-12-31 23:59:60    25
1990-12-31 23:59:60    26
1992-06-30 23:59:60    27
1993-06-30 23:59:60    28
1994-06-30 23:59:60    29
1995-12-31 23:59:60    30
1997-06-30 23:59:60    31
1998-12-31 23:59:60    32
2005-12-31 23:59:60    33
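
Code that uses the table only needs the offset in effect at a given moment. A sketch in C, working in MJDs like the first table (most of the entries are elided here - they follow from the table above):

    #include <stddef.h>

    struct leap { int mjd; int tai_minus_putc; };

    /* offset in force from 00:00 pUTC on the given MJD onwards */
    static const struct leap leaps[] = {
        { 36750, 1 },   /* after the 1959-06-30 23:59:60 leap */
        { 37481, 2 },   /* after 1961-06-30 23:59:60 */
        /* ... the rest of the table goes here ... */
        { 53736, 33 },  /* after 2005-12-31 23:59:60 */
    };

    int tai_minus_putc(int mjd)
    {
        int offset = 0; /* TAI = pUTC = UT before 1958 */
        for (size_t i = 0; i < sizeof(leaps) / sizeof(leaps[0]); i++)
            if (leaps[i].mjd <= mjd)
                offset = leaps[i].tai_minus_putc;
        return offset;
    }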
fanf: (Default)

The essentials

A clock consists of:

  • an oscillator that ticks at a frequency which (ideally) is accurately known and stable over long periods of time;
  • and a counter which keeps track of the ticks.

Traditional clocks have a counter which directly represents civil time (12 hours per half day or 24 hours per day, 60 minutes per hour, etc.) but in computing it's more usual to have a binary counter with a separate mechanism to translate its value into civil time.

The visible part of the counter does not necessarily count ticks directly: its low-order digits may be invisible. For example, wrist watches often have a quartz oscillator that ticks at 32,768 Hz but a fastest hand that moves once per second. On the other hand, in computing the counter may have a higher resolution than the tick granularity. For example, on Unix gettimeofday() returns the value of a counter with a resolution of 1μs but a granularity of typically 100 Hz or 1000 Hz. A counter's interval is the amount it is incremented by for each tick, so 10,000 or 1000 for the gettimeofday() example.
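
You can see the difference empirically - a sketch in C which spins until the counter changes (on systems that interpolate between ticks this reports the resolution rather than the tick interval):

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval a, b;
        gettimeofday(&a, NULL);
        do gettimeofday(&b, NULL);
        while (b.tv_sec == a.tv_sec && b.tv_usec == a.tv_usec);
        printf("counter advanced by %ld us\n",
            (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec));
        return 0;
    }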

Types of clock

A real-time clock or RTC can be used to keep civil time. As well as an oscillator and a counter, it has an epoch, which is the point in civil time at which the counter's value was zero, and a mapping between other values of the counter and civil time. This mapping is generally complicated and dependent on what kind of civil time you want.

In my terminology, a clock that is not a real-time clock is a timer.

An interval timer counts time like a real-time clock but does not have a mapping to civil time.

A countdown timer has a negative interval and typically triggers some action when its counter reaches zero.

An activity timer does not tick continuously but only when some activity is being performed, for example when a process is running on the CPU. It may even tick faster than its nominal frequency, for example if a process has threads running on multiple CPUs.

Correcting clocks

The terminology I use when correcting a clock varies depending on which of the clock's parameters I am changing.

You can reset the counter to step the clock to the correct time. Unexpected resets can confuse software that makes unwarranted assumptions about a clock.

You can adjust the rate of a clock. This usually means that the counter interval is changed, not the frequency of the hardware oscillator! This might be done to slew the clock to the correct time as well as to correct a systematic error in the clock's nominal frequency.
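
The classic Unix interface for slewing is adjtime() - a sketch (it needs appropriate privileges to succeed):

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        /* apply a +50ms correction gradually, by varying the effective
           counter interval, instead of stepping the clock */
        struct timeval delta = { 0, 50000 }, pending;
        if (adjtime(&delta, &pending) < 0) {
            perror("adjtime");
            return 1;
        }
        printf("correction previously pending: %ld.%06ld s\n",
            (long)pending.tv_sec, (long)pending.tv_usec);
        return 0;
    }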

You can change the epoch of a clock. In practice most clocks have fixed epochs and are corrected by resetting their counters.

Clocks may be corrected automatically. A good way of doing this on a networked machine is with NTP which avoids resetting the clock if it can, but it has been common for crappier systems to use less clever protocols.

A clock that is stabilized is automatically adjusted to conform to a frequency standard. A clock that is synchronized is automatically corrected to conform to civil time. It's possible to have one without the other in either direction.

Mapping to civil time

A monotonic clock cannot be reset, so you can only make gross corrections by changing its epoch. It's common for computer systems to provide monotonic clocks without any mapping to civil time, in which case they are there for use as interval timers.

A clock that keeps atomic time counts SI seconds and has a trivial mapping to TAI. (Monotonic clocks keep atomic time.) The system must have an up-to-date table of leap seconds in order to map atomic-time clocks to civil time.

Clocks that don't keep atomic time keep some kind of universal time. The POSIX and NTP clocks are examples. They have to have some kind of fudge to deal with UTC leap seconds. They may reset the clock or vary its rate on or near a leap second.

Some very limited computer systems do not have any sophistication in the mapping between their clock and civil time, so the clock just represents local time fairly directly. This implies that the clock is reset when Summer Time transitions occur.

Complications

It is common for clocks to be made more complicated so that they include some of the semantics of civil time. They may represent the counter with a non-uniform base similar to the one we use for writing times in natural language, or include mechanisms for handling leap seconds. (A linear counter has a simple integer or fixed-point representation.)

For example, the clock accessed via ntp_gettime() has an additional field that can warn about an impending leap second or that one is currently in progress.

For a hypothetical example, counters like the one returned by gettimeofday() have separate parts that count seconds and fractions of seconds. Markus Kuhn suggests that you could represent leap seconds as double-length seconds where the sub-second part of the counter increments past the point where it would normally carry into the seconds part.

These complications are motivated by keeping the mapping between counter values and civil time simple. Since time is promulgated in the form of UTC rather than TAI, and common external time representations (POSIX, NTP, etc) are based on UT, this makes some sense. However it means that the clock is not suitable for use as the basis of an interval or countdown timer.

Simplifying the kernel

Ideally the kernel could just provide simple time counters and leave all civil time handling to userland. Unfortunately that isn't quite possible, because of API compatibility constraints (POSIX time) and because file systems have to store time stamps as representations of civil time. However, the kernel does not have to deal with times other than the present, nor with anything more complicated than simple interval computations. So it could just represent other clocks as epoch adjustments to a master monotonic clock.

Summary of clock properties

  • counter value
  • tick interval
  • counter resolution
  • epoch (only for RTCs)

Flags

  • stabilized
  • synchronized (only for RTCs)

Types of clock

  • countdown timer
  • activity timer (possibly several variants)
  • interval timer
  • monotonic timer
  • monotonic RTC
  • atomic
  • universal w/ rate change around leap seconds
  • universal w/ resets at leap seconds
  • universal w/ variable-length seconds (not with linear counters)
  • local time
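
Or as a sketch in C (the field names are mine):

    #include <stdbool.h>
    #include <stdint.h>

    /* the clock properties summarized above */
    struct clock {
        uint64_t counter;      /* current counter value */
        int64_t  interval;     /* change per tick; negative for countdown timers */
        uint64_t resolution;   /* counter units per second */
        int64_t  epoch;        /* civil time at counter == 0 (RTCs only) */
        bool     stabilized;   /* rate disciplined to a frequency standard */
        bool     synchronized; /* corrected to civil time (RTCs only) */
    };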
fanf: (Default)

Next Wednesday (29th November) is going to be busy.

In the afternoon there will be a Techlinks meeting to introduce new IT staff to the Computing Service. I'll be standing up briefly to say HELO on behalf of Mail Support, and so I have produced a "crib sheet" which lists most of what we do so that I won't have to. I only have a few minutes so there won't be enough time to say anything substantial, and anyway it can wait until my full-length techlinks next term.

In the evening I'll be helping Bob to give a tour of our computer room. This is principally for students, but any interested member of the University is welcome. Email me for more info.

fanf: (Default)

A lot of our users have been asking us about the current spam problem, so I sent the following to our computing support staff mailing list and the ucam.comp.mail newsgroup. I thought it would be worth posting here too.

The volume of spam we are seeing has more than doubled since the summer, from about 15 messages blocked per second to over 35, and the amount of spam that gets past the blocks has increased accordingly. This is really unprecedented: in the preceding two years, the volume of blocked spam increased gradually from about 10 to about 15 per second. For comparison, we're handling about 7 messages per second, which includes internal email (3 per second) as well as non-blocked spam and legitimate email from outside the University.

That is, at least 90% of the 3.5 million messages offered to us each day from outside the University are spam.

It is a coincidence that this increase kicked off at about the start of term: this is a global problem that has been widely noted in the IT press. Unfortunately the current flavours of spam are difficult for our second-level filters (SpamAssassin) to handle because they don't have many recognizable features, such as URLs for criminal web sites, etc. We are updating SpamAssassin when new releases come out, which is roughly monthly at the moment.

fanf: (Default)

I've recently been pondering models of concurrent programming. One of the simplest, most beautiful and most powerful is the π calculus - it's effectively the λ calculus of concurrent programming.

The π calculus is oriented around channels, which can be read from, written to, and passed along channels: that is, they can be passed between processes (though processes are implicit). Reading and writing are synchronous: if a process wants to read or write a channel, it blocks until there is another process wanting to do the opposite, and after the communication both processes can proceed with their sequels. The alternative to this is for writes to be asynchronous: they do not block.

The problem with synchronous writes is that they do not translate well into a distributed setting. An asynchronous write translates directly into a network transmission, but a synchronous write requires a round trip to unblock the writer. As a result it is less efficient and more vulnerable to failure. Even in a simple local setting, asynchronous sends may be preferable: in fact, Pict, a programming language closely based on the π calculus, is restricted to asynchronous writes. On the other hand, with asynchronous writes you may need a queue on the reader's end, and some way to avoid running out of memory because of overflowing queues.

Another design alternative is to focus on processes instead of channels. Messages are sent to processes, and PIDs can be transmitted between processes. In this model the read ends of channels are effectively fixed. This is the actor model as used by Erlang.

The problem with the π calculus's channels, or a distributed-object system like Obliq or E (see §17.4 on p. 126), is that you need distributed garbage collection. If I spawn a process to listen on a channel locally, and send the channel to a remote system, I must keep the process around as long as that remote system may want to write on the channel. Distributed garbage collection is far too difficult. On the other hand, in the actor model a process is the master of its own lifetime, because it lasts until its thread of execution completes. Thus you avoid relying on tricky algorithms, at the cost of needing some way of dealing with failed message sends. But in a distributed system you have to deal with failure anyway, and the same mechanism can deal with the queue overflow problem.

So I conclude that the π calculus is wrong for distributed programming and that Erlang's actor model is right. Those Erlang designers are clever chaps.

fanf: (Default)

How stupid is this? Repeated irrelevant advertising as well as a stupid disclaimer. YOUR TAXES AT WORK!

(This came from a user who was having problems receiving a message from a government researcher, because the 4MB message was too big for the poor GSI email system. I also loved the X.400 failure report, which I shall not reproduce here for privacy reasons.)

PLEASE NOTE: THE ABOVE MESSAGE WAS RECEIVED FROM THE INTERNET.
On entering the GSI, this email was scanned for viruses by the
Government Secure Intranet (GSi) virus scanning service supplied
exclusively by Cable & Wireless in partnership with MessageLabs. In
case of problems, please call your organisational IT Helpdesk. The
MessageLabs Anti Virus Service is the first managed service to achieve
the CSIA Claims Tested Mark (CCTM Certificate Number 2006/04/0007),
the UK Government quality mark initiative for information security
products and services. For more information about this please visit
www.cctmark.gov.uk

**********************************************************************

This email and any files transmitted with it are private and intended
solely for the use of the individual or entity to whom they are
addressed. If you have received this email in error please return it
to the address it came from telling them it is not for you and then
delete it from your system.

This email message has been swept for computer viruses.

**********************************************************************

The original of this email was scanned for viruses by Government
Secure Intranet (GSi) virus scanning service supplied exclusively by
Cable & Wireless in partnership with MessageLabs. On leaving the GSI
this email was certified virus free. The MessageLabs Anti Virus
Service is the first managed service to achieve the CSIA Claims Tested
Mark (CCTM Certificate Number 2006/04/0007), the UK Government quality
mark initiative for information security products and services. For
more information about this please visit www.cctmark.gov.uk
fanf: (Default)

Iterators are used in some programming languages to generalise the for loop. They were introduced by CLU in the 1970s and have been adopted by Python and Lua, amongst others. They are traditionally a special kind of function that has yield statements that return successive values in the iteration. This function is called at the start of the loop, and when it yields, the loop body is run. Each time round the loop the function is resumed until it yields again.

The iterator's activation frame is typically on top of the stack. This seems a bit backwards to me, because it leads to contortions to maintain the iterator's state between yields, e.g. using an auxiliary object or first-class closures. (See the first example here.)

I think it's simpler to desugar a for loop like this:

    for varlist in iterator(arglist) do
        statements
    loop

into

    iterator(
        (varlist): do
            statements
        end,
        arglist)

That is, you turn the loop body into an anonymous function which is passed to the iterator function. Instead of yielding, the iterator just calls the body function. The iterator can then be written as a normal loop without contortions to keep state between yields. (Pretty much like this, in fact.)

What makes this slightly interesting is how to desugar break, continue, and return statements inside the loop. A continue becomes just a return from the anonymous function, but the others have to escape from the iterator function as well as the anonymous function. This could be done with exceptions or continuation passing.
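
In a language without closures or exceptions the same desugaring still works, with break signalled by the body's return value - a sketch in C (the names are mine):

    #include <stdio.h>

    enum loop { CONTINUE, BREAK };

    /* the iterator: an ordinary loop that calls the body
       instead of yielding to it */
    static void upto(int n, enum loop (*body)(int, void *), void *data)
    {
        for (int i = 0; i < n; i++)
            if (body(i, data) == BREAK)
                return;
    }

    /* a desugared loop body: sum integers, break past a limit */
    static enum loop add(int i, void *data)
    {
        int *sum = data;
        *sum += i;
        return *sum > 10 ? BREAK : CONTINUE;
    }

    int main(void)
    {
        int sum = 0;
        upto(100, add, &sum);
        printf("%d\n", sum); /* 15 = 0+1+2+3+4+5 */
        return 0;
    }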

fanf: (Default)
http://fanf.livejournal.com/53255.html

Back in February and March I drafted some SMTP extensions that were intended to improve its performance and robustness. I've been revisiting this work recently because people from the IETF Lemonade working group have expressed interest in them.

Most of the work has been rewriting my REPLAY draft to make it closer to the checkpoint/restart extension specified in RFC 1845. I stripped out the excessive rationale sections, described some of the protocol elements more carefully, and added an RFC 1845 backwards compatibility mode.

http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-lemonade-rfc1845bis.html

However it looks like the Lemonade WG are more interested in QUICKSTART than in checkpoint/restart, so I've also made a few editorial updates to QUICKSTART, mostly updating references.

http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-smtp-quickstart.html
fanf: (Default)

At the end of http://fanf.livejournal.com/65911.html I mentioned that it might be beneficial to have multiple queues, in order to reduce the density of garbage in the older parts of the queue. There are at least a couple of other reasons why one might want multiple queues.

Even faster

In the absence of any other bottlenecks, the MTA is going to be limited by the rate that it can fsync the main queue. You can raise this limit if you have two parallel queues on different disks, and spread the load of incoming messages between them. I got this idea from the Sendmail X design document which describes a similar (but not quite so neat) queue structure to the one I have been describing.

TURN

SMTP's TURN feature allows a client to ask the server to deliver any email queued on the server for a particular domain. This might be used by business dial-up customers who call in to their ISP every so often to collect email.

There are two variants of TURN in the current specifications, ETRN and ATRN, because the original form was insecure. With ETRN the server delivers the queued messages as if from a normal queue run; the only security concern is that the server must have some throttling to prevent clients from starting unlimited numbers of queue runners. With ATRN the existing connection is used to deliver the messages from the server to the client - the two ends switch roles after the ATRN command. The client must be authenticated so that the server knows it is permitted to receive the email it is asking for. ATRN is used on the "on-demand mail relay" port 366 instead of the usual SMTP port.

The basic implementation of ETRN (common to sendmail, Exim, and older Postfix) is for the server to fire off sendmail -qR, which scans the entire queue for messages with recipient addresses containing the domain given by the client. This is horribly inefficient if you have lots of messages on the queue, and the more clients you have using ETRN the more your queues get clogged with undeliverable messages.
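
Exim's equivalent looks something like this (a sketch; the host list name is my invention). The -R option is what does the expensive whole-queue scan:

    # main section
    hostlist etrn_clients = 192.0.2.0/24
    acl_smtp_etrn = acl_check_etrn
    # $domain expands to the argument of the ETRN command
    smtp_etrn_command = /usr/sbin/exim -R $domain

    begin acl

    acl_check_etrn:
      accept hosts = +etrn_clients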

The solution to this problem is to get messages for your ETRN domains off the queue; with Exim this is typically done by delivering them to a batch-SMTP file per domain, which can then be re-injected for delivery fairly efficiently when the client says ETRN. This kind of setup is a must for ATRN: whereas ETRN uses normal SMTP routing and delivery (which works if the clients have static IP addresses), ATRN does not, so there is generally no way to deliver the messages except by ATRN. ETRN is an optimisation to allow clients to tell the server not to wait for a retry timeout, whereas ATRN is purely on-demand so it is actually wrong to leave the messages on the retry queue.

With my log-structured queues we can use this idea but do it more efficiently. When a message is addressed to an ATRN domain, or cannot be delivered to an ETRN domain, its envelope is written to that domain's queue instead of appended to the main queue. One thing that makes this slightly more interesting is that the envelope may have to be split if it has recipients at multiple domains. This introduces the requirement for some kind of reference counting of spool files. The per-domain queue files are not routinely scanned, and when a client requests a delivery the server can simply and efficiently work through its queue.

Postfix leaves ETRN messages on its "deferred" queue, but optimises ETRN by keeping per-domain indexes of messages. This has the advantage of avoiding the need to split messages per domain, and means that ETRN domains are still retried even if the client doesn't ask. However it means that the deferred queue can get large and normal queue runs can get expensive. We should also periodically retry ETRN domains, but this won't happen unless they have messages on the main queue. To deal with that we should periodically probe these domains, which does not need any disk activity if the domain remains unreachable.
