fanf: (dotat)

Nearly two years ago I wrote about my project to convert the 25-year history of our DNS infrastructure from SCCS to git - "uplift from SCCS to git"

In May I got email from Akrem ANAYA asking my SCCS to git scripts. I was pleased that they might be of use to someone else!

And now I have started work on moving our Managed Zone Service to a new server. The MZS is our vanity domain system, for research groups who want domain names not under

The MZS also uses SCCS for revision control, so I have resurrected my uplift scripts for another outing. This time the job only took a couple of days, instead of a couple of months :-)

The most amusing thing I found was the cron job which stores a daily dump of the MySQL database in SCCS...

fanf: (dotat)

I have found out the likely reason that tactile paving markers on dual-use paths are the wrong way round!

A page on the UX StackExchange discusses tactile paving on dual-use paths and the best explanation is that it can be uncomfortable to walk on longitudinal ridges, whereas transverse ones are OK.

(Apparently the usual terms are "tram" vs "ladder".)

I occasionally get a painful twinge in my ankle when I tread on the edge of a badly laid paving stone, so I can sympathise. But surely tactile paving is supposed to be felt by the feet, so it should not hurt people who tread on it even inadvertently!

fanf: (dotat)

In Britain there is a standard for tactile paving at the start and end of shared-use foot / cycling paths. It uses a short section of ridged paving slabs which can be laid with the ridges either along the direction of the path or across the direction of the path, to indicate which side is reserved for which mode of transport.

Transverse ridges

If you have small hard wheels, like many pushchairs or the front castors on many wheelchairs, transverse ridges are very bumpy and uncomfortable.

If you have large pneumatic wheels, like a bike, the wheel can ride over the ridges so it doesn't feel the bumps.

Transverse ridges are better for bikes

Longitudinal ridges

If you have two wheels and the ground is a bit slippery, longitudinal ridges can have a tramline effect which disrupts you steering and therefore balance, so they are less safe.

If you have four wheels, the tramline effect can't disrupt your balance and can be nice and smooth.

Longitudinal ridges are better for pushchairs

The standard

So obviously the standard is transverse ridges for the footway, and longitudinal ridges for the cycleway.

(I have a followup item with a plausible explanation!)

fanf: (dotat)

One of the new features in C99 was tgmath.h, the type-generic mathematics library. (See section 7.22 of n1256.) This added ad-hoc polymorphism for a subset of the library, but C99 offered no way for programmers to write their own type-generic macros in standard C.

In C11 the language acquired the _Generic() generic selection operator. (See section of n1570.) This is effectively a type-directed switch expression. It can be used to implement APIs like tgmath.h using standard C.

(The weird spelling with an underscore and capital letter is because that part of the namespace is reserved for future language extensions.)

When const is ugly

It can often be tricky to write const-correct code in C, and retrofitting constness to an existing API is much worse.

There are some fearsome examples in the standard library. For instance, strchr() is declared:

    char *strchr(const char *s, int c);

(See section of n1570.)

That is, it takes a const string as an argument, indicating that it doesn't modify the string, but it returns a non-const pointer into the same string. This hidden de-constifying cast allows strchr() to be used with non-const strings as smoothly as with const strings, since in either case the implicit type conversions are allowed. But it is an ugly loop-hole in the type system.

Parametric constness

It would be much better if we could write something like,

    const<A> char *strchr(const<A> char *s, int c);

where const<A> indicates variable constness. Because the same variable appears in the argument and return types, those strings are either both const or both mutable.

When checking the function definition, the compiler would have to treat parametric const<A> as equivalent to a normal const qualifier. When checking a call, the compiler allows the argument and return types to be const or non-const, provided they match where the parametric consts indicate they should.

But we can't do that in standard C.

Or can we?

When I mentioned this idea on Twitter a few days ago, Joe Groff said "_Generic to the rescue", so I had to see if I could make it work.

Example: strchr()

Before wrapping a standard function with a macro, we have to remove any existing wrapper. (Standard library functions can be wrapped by default!)

    #ifdef strchr
    #undef strchr

Then we can create a replacement macro which implements parametric constness using _Generic().

    #define strchr(s,c) _Generic((s),                    \
        const char * : (const char *)(strchr)((s), (c)), \
        char *       :               (strchr)((s), (c)))

The first line says, look at the type of the argument s.

The second line says, if it is a const char *, call the function strchr and use a cast to restore the missing constness.

The third line says, if it is a plain char *, call the function strchr leaving its return type unchanged from char *.

The (strchr)() form of call is to avoid warnings about attempts to invoke a macro recursively.

    void example(void) {
        const char *msg = "hello, world\n";
        char buf[20];
        strcpy(buf, msg);

        strchr(buf, ' ')[0] = '\0';
        strchr(msg, ' ')[0] = '\0';

In this example, the first call to strchr is always OK.

The second call typically fails at runtime with the standard strchr, but with parametric constness you get a compile time error saying that you can't modify a const string.

Without parametric constness the third call gives you a type conversion warning, but it still compiles! With parametric constness you get an error that there is no matching type in the _Generic() macro.


That is actually pretty straightforward, which is nice.

As well as parametric constness for functions, in the past I have also wondered about parametric constness for types, especially structures. It would be nice to be able to use the same code for read-only static data as well as mutable dynamic data, and have the compiler enforce the distinction. But _Generic() isn't powerful enough, and in any case I am not sure how such a feature should work!

fanf: (dotat)

There was an interesting and instructive cockup in the BIND 9.10.4 release. The tl;dr is that they tried to improve performance by changing the layout of a data structure to use less memory. As a side effect, two separate sets of flags ended up packed into the same word. Unfortunately, these different sets of flags were protected by different locks. This overlap meant that concurrent access to one set of flags could interfere with access to the other set of flags. This led to a violation of BIND's self-consistency checks, causing a crash.

The fix uses an obscure feature of C structure bitfield syntax. If you declare a bitfield without a name and with a zero width (in this case, unsigned int :0) any following bitfields must be placed in the next "storage unit". This is described in the C standard section

You can see this in action in the BIND DNS red-black tree declaration.

A :0 solves the problem, right?

So a clever idea might be to use this :0 bitfield separator so that the different sets of bitfields will be allocated in different words, and the different words will be protected by different locks.

What are the assumptions behind this fix?

If you only use a :0 separator, you are assuming that a "word" (in a vague hand-wavey sense) corresponds to both a "storage unit", which the compiler uses for struct layouts, and also to the size of memory access the CPU uses for bitfields, which matters when we are using locks to keep some accesses separate from each other.

What is a storage unit?

The C standard specifies the representation of bitfields in section

Values stored in bit-fields consist of m bits, where m is the size specified for the bit-field. The object representation is the set of m bits the bit-field comprises in the addressable storage unit holding it.

Annex J, which covers portability issues, says:

The following are unspecified: [... yada yada ...]

The alignment of the addressable storage unit allocated to hold a bit-field

What is a storage unit really?

A "storage unit" is defined by the platform's ABI, which for BIND usually means the System V Application Binary Interface. The amd64 version covers bit fields in section 3.1.2. It says,

  • bit-fields must be contained in a storage unit appropriate for its declared type

  • bit-fields may share a storage unit with other struct / union members

In this particular case, the storage unit is 4 bytes for a 32 bit unsigned int. Note that this is less than the architecture's nominal 64 bit width.

How does a storage unit correspond to memory access widths?

Who knows?


It is unspecified.


The broken code declared a bunch of adjacent bit fields and forgot that they were accessed under different locks.

Now, you might hope that you could just add a :0 to split these bitfields into separate storage units, and the split would be enough to also separate the CPU's memory accesses to the different storage units.

But you would be assuming that the code that accesses the bitfields will be compiled to use unsigned int sized memory accesses to read and write the unsigned int sized storage units.

Um, do we have any guarantees about access size?

Yes! There is sig_atomic_t which has been around since C89, and a load of very recent atomic stuff. But none of it is used by this part of BIND's code.

(Also, worryingly, the AMD64 SVABI does not mention "atomic".)

So what is this "Deathstation 9000" thing then?

The DS9K is the canonical epithet for the most awkward possible implementation of C, which follows the standard to the letter, but also violates common assumptions about how C works in practice. It is invoked as a horrible nightmare, but in recent years it has become a disastrous reality as compilers have started exploiting undefined behaviour to support advanced optimizations.

(Sadly the Armed Response Technologies website has died, and Wikipedia's DS9K page has been deleted. About the only place you can find information about it is in old comp.lang.c posts...)

And why is the DS9K relevant?

Well, in this case we don't need to go deep into DS9K territory. There are vaguely reasonable ABIs for which a small :0 fix would not actually work.

For instance, there might be a CPU which can only do 64 bit memory accesses, but which has a 32 bit int storage unit. This type representation would probably mean the C compiler has really bad performance, so it is fairly unlikely. But it is allowed, and there are (or were) CPUs which can't do sub-word memory accesses, and which have very little RAM so they want to pack data tightly.

On a CPU like this, the storage unit doesn't match the memory access, so C's :0 syntax for skipping to the next storage unit will fail to achieve the desired effect of isolating memory accesses that have to be performed under different concurrent access locks.

DS9K defence technology

So the BIND 9.10.4 fix does two things:

The most important change is to move one of the sets of bitfields from the start of the struct to the end of it. This means there are several pointers in between the fields protected by one lock and the fields protected by the other lock, so even a DS9K can't reasonably muddle them up.

Secondly, they used magical :0 syntax and extensive comments to (hopefully) prevent the erroneous compaction from happening again. Even if the bitfields get moved back to the start of the struct (so the pointers no longer provide insulation) the :0 might help to prevent the bug from causing crashes.

(edited to add)

When I wrapped this article up originally, I forgot to return to sig_atomic_t. If, as a third fix, these bitfields were also changed from unsigned int to sig_atomic_t, it would further help to avoid problems on systems where the natural atomic access width is wider than an int, like the lesser cousin of the DS9K I outlined above. However, sig_atomic_t can be signed or unsigned so it adds a new portability wrinkle.


The clever :0 is not enough, and the BIND developers were right not to rely on it.

fanf: (dotat)
  1. If you have not yet seen the Wintergatan Marble Machine video, you should. It is totally amazing: a musical instrument and a Heath Robinson machine and a marble run and a great tune. It has been viewed nearly 19 million times since it was published on the 29th February.

    Wintergatan have a lot of videos on YouTube most of which tell the story of the construction of the Marble Machine. I loved watching them all: an impressive combination of experimentation and skill.

  2. The Starmachine 2000 video is also great. It interleaves a stop-motion animation (featuring Lego) filmed using a slide machine (which apparently provides the percussion riff for the tune) with the band showing some of their multi-instrument virtuosity.

    The second half of the video tells the story of another home-made musical instrument: a music box driven by punched tape, which you can see playing the melody at the start of the video. Starmachine 2000 was published in January 2013, so it's a tiny hint of their later greatness.

    I think the synth played on the shoulder is also home-made.

  3. Sommarfågel is the band's introductory video, also featuring stop-motion animation and the punched-tape music box.

    Its intro ends with a change of time signature from 4/4 to 6/8 time!

    The second half of the video is the making of the first half - the set, the costumes, the animation. The soundtrack sounds like it was played on the music box.

  4. They have an album which you can listen to on YouTube or buy on iTunes.

    By my reckoning, 4 of the tracks are in 3/4 time, 3 in 4/4 time, and 2 in 6/8 time. I seem to like music in unusual time signatures.

  5. One of the tracks is called "Biking is better".

  6. No lyrics.

fanf: (dotat)

This took me hours to debug yesterday so I thought an article would be a good idea.

The vexation

My phone was sending email notifications for events in our shared calendar.

Some of my Macs were trying and failing to send similar email notifications. (Mac OS Mail knows about some of my accounts but it doesn't know the passwords because I don't entirely trust the keychain.)

The confusion

There is no user interface for email notifications in Apple's calendar apps.

  • They do not allow you to create events with email notifications.
  • They do not show the fact that an event (created by another client) has an email notification configured.
  • They do not have a preferences setting to control email notifications.

The escalation

It seems that if you have a shared calendar containing an event with an email notification, each of your Apple devices which have this calendar will separately try to send their own duplicate copy of the notification email.

(I dread to think what would happen in an office environment!)

The explanation

I couldn't get a CalDAV client to talk to Fastmail's CalDAV service in a useful way. I managed to work out what was going on by exporting a .ics file and looking at the contents.

The VEVENT clause saying ACTION:EMAIL was immediately obvious towards the end of the file.

The solution

Sadly I don't know of a way to stop them from doing this other than to avoid using email notification on calendar events.

fanf: (dotat)

I mentioned obliquely on Twitter that I am no longer responsible for email at Cambridge. This surprised a number of people so I realised I should do a proper update for my non-Cambridge friends.

Since Chris Thompson retired at the end of September 2014, I have been the University's hostmaster. I had been part of the hostmaster team for a few years before then, but now I am the person chiefly responsible. I am assisted by my colleagues in the Network Systems team.

Because of my increased DNS responsibilities, I was spending a lot less time on email. This was fine from my point of view - I had been doing it for about 13 years, which frankly is more than enough.

Also, during the second half of 2015, our department reorganization at long last managed to reach down from rarefied levels of management and started to affect the technical staff.

The result is that at the start of 2016 I moved into the Network Systems team. We are mainly responsible for the backbone and data centre network switches and routers, plus a few managed institution networks. (Most departments and colleges manage their own switches.) There are various support functions like DHCP, RADIUS, and now DNS, and web interfaces to some of these functions.

My long-time fellow postmaster David Carter continues to be responsible for Hermes and has moved into a new team that includes the Exchange admins. There will be more Exchange in our future; it remains to be decided what will happen to Hermes.

So my personal mail is no longer a dodgy test domain and reconfiguration canary on Hermes. Instead I am using Fastmail which has been hosting our family domain for a few years.

fanf: (dotat)

The most common question I get asked about my link log (LJ version) is

How do you read all that stuff?!

The answer is,

I don't!

My link log is basically my browser bookmarks, except public. (I don't have any private bookmarks to speak of.) So, for it to be useful, a link description should be reasonably explanatory and have enough keywords that I can find it years later.

The links include things I have read and might want to re-read, things I have not read but I think I should keep for future reference, things I want to read soon, things that look interesting which I might read, and things that are aspirational which I feel I ought to read but probably never will.

(Edited to add) I should also say that I might find an article to be wrong or problematic, but it still might be interesting enough to be worth logging - especially if I might want to refer back to this wrong or problematic opinion for future discussion or criticism. (end of addition)

But because it is public, I also use it to tell people about links that are cool, though maybe more ephemeral. This distorts the primary purpose a lot.

It's my own miscellany or scrapbook, for me and for sharing.


A frustrating problem is that I sometimes see things which, at the time, seem to be too trivial to log, but which later come up in conversation and I can't cite my source because I didn't log it or remember the details.

Similarly I sometimes fail to save links to unusually good blog articles or mailing list messages, and I don't routinely keep my own copies of either.

Frequently, my descriptions lack enough of the right synonym keywords for me to find them easily. (I'm skeptical that I would be able to think ahead well enough to add them if I tried.)

I make no effort to keep track of where I get links from. This is very lazy, and very anti-social. I am sorry about this, but not sorry enough to fix it, which makes me sorry about being too lazy to fix it. Sorry.


What I choose to log changes over time. How I phrase the descriptions also changes. (I frequently change the original title to add keywords or to better summarize the point. My usual description template is "Title or name: spoiler containing keywords.")

Increasingly in recent years I have tried to avoid the political outrage of the day. I prefer to focus on positive or actionable things.

A lot (I think?) of the negative "OMG isn't this terrible" links on my log recently are related to computer security or operations. They fall into the "actionable" category by being cautionary tales: can I learn from their mistakes? Please don't let me repeat them?


Mailing lists. I don't have a public list of the ones I subscribe to. The tech ones are mail / DNS / network / time related - IETF, operations, software I use, projects I contribute to, etc.

RSS/Atom feeds. I also use LJ as my feed reader (retro!) and sadly they don't turn my feed list into a proper blog roll, but you might find something marginally useful at

Hacker News. I use Colin Percival's Hacker News Daily to avoid the worst of it. It is more miss than hit - a large proportion of the 10 "best" daily links are about Silly Valley trivia or sophomoric politics. But when HN has a link with discussion about something technical it can provide several good related links and occasionally some well-informed comments.

Mattias Geniar's cron.weekly tech newsletter is great. Andrew Ducker's links are great.

Twitter. I like its openness, the way it is OK to follow someone who might be interesting. On Facebook that would be creepy, on Linkedin that would be spammy.

fanf: (dotat)

Blogging after the pub is a mixed blessing. A time when I want to write about things! But! A time when I should be sleeping.

Last week I wrote nearly 2000 boring words about keyboards and stayed up till nearly 4am. A stupid amount of effort for something no-one should ever have to care about.

Feh! what about something to expand the mind rather than dull it?


Years and years ago I read practically everything on - the website for the E programming language - which is a great exposition of how capabilities are the way to handle access control.

I was reminded of these awesome ideas this evening when talking to Colin Watson about how he is putting Macaroons into practice.

One of the great E papers is Capability Myths Demolished. It's extremely readable, has brilliant diagrams, and really, it's the whole point of this blog post.

Go and read it:

Ambient authority

One of the key terms it introduces is "ambient authority". This is what you have when you are able to find a thing and do something to it, without being given a capability to do so.

In POSIX terms, file descriptors are capabilities, and file names are ambient authority. A capability-style POSIX would, amongst other things, eliminate the current root directory and the current working directory, and limit applications to openat() with relative paths.

Practical applications of capabilities to existing systems usually require intensive "taming" of the APIs to eliminate ambient authority - see for example Capsicum, and CaPerl and CaJa.

From the programming point of view (rather than the larger systems point of view) eliminating ambient authority is basically eliminating global variables.

From the hipster web devops point of view, the "12 factor application" idea of storing configuration in the environment is basically a technique for reducing ambient authority.

Capability attenuation

Another useful idea in the Capability Myths Demolished paper is capability attenuation: if you don't entirely trust someone, don't give them the true capabilty, give them a capability on a proxy. The proxy can then restrict or revoke access to the true object, according to whatever rules you want.

(userv proxies POSIX file descriptors rather than passing them as-is, to avoid transmitting unwanted capabilities - it always gives you a pipe, not a socket, nor a device, etc.)

And stuff

It is too late and my mind is too dull now to do justice to these ideas, so go and read their paper instead of my witterings.

fanf: (dotat)

Why you can't use xmodmap to change how Synergy handles modifier keys

As previously mentioned I am doing a major overhaul of my workstation setup, the biggest since 1999. This has turned out to be a huge can of worms.

I will try to explain the tentacular complications caused by some software called Synergy...

I use an Apple trackpad and keyboard, and 1Password

I made a less-big change several years ago - big in terms of the physical devices I use, but not the software configuration: I fell in love with Apple, especially the Magic Trackpad. On the downside, getting an Apple trackpad to talk directly to a FreeBSD or Debian box is (sadly) a laughable proposition, and because of that my setup at work is a bit weird.

(Digression 1: I also like Apple keyboards. I previously used a Happy Hacking Keyboard (lite 2), which I liked a lot, but I like the low profile and short travel of the Apple keyboards more. Both my HHKB and Apple keyboards use rubber membranes, so despite appearances I have not in fact given up on proper keyswitches - I just never used them much.)

(Digression 2: More recently I realised that I could not continue without a password manager, so I started using 1Password, which for my purposes basically requires having a Mac. If you aren't using a password manager, get one ASAP. It will massively reduce your password security worries.)

At work I have an old-ish cast-off Mac, which is responsible for input devices, 1Password, web browser shit, and disractions like unread mail notifications, iCalendar, IRC, Twitter. The idea (futile and unsuccessful though it frequently is) was that I could place the Mac to one side and do Real Work on the (other) Unix workstation.

Synergy is magic

The key software that ties my workstations together is Synergy.

Synergy allows you to have one computer which owns your input devices (in my case, my Mac) which can seamlessly control your other computers. Scoot your mouse cursor from one screen to another and the keyboard input focus follows seamlessly. It looks like you have multiple monitors plugged into one computer, with VMs running different operating systems displayed on different monitors. But they aren't VMs, they are different bare metal computers, and Synergy is forwarding the input events from one to the others.

Synergy is great in many ways. It is also a bit too magical.

How I want my keyboard to work

My main criterion is that the Meta key should be easy to press.

For reference look at this picture of an Apple keyboard on Wikimedia Commons.

(Digression 3: Actually, Control is more important than Meta, and Control absolutely has to be the key to the left of A. I also sometimes use the key labelled Control as part of special key chord operations, for which I move my fingers down from the home row. In X11 terminology I need ctrl:nocaps not ctrl:swapcaps. If I need caps it's easier for me to type with my little finger holding Shift than to waste a key on Caps Lock confusion . Making Caps Lock into another Control is so common that it's a simple configuration feature in Mac OS.)

I use Emacs, which is why the Meta key is also important. Meta is a somewhat abstract key modifier which might be produced by various different keys (Alt, Windows, Option, Command, Escape, Ctrl+[, ...) depending on the user's preferences and the vagaries of the software they use.

I press most modifier keys (Ctrl, Shift, Fn) with my left little finger; the exception is Meta, for which I use my thumb. For comfort I do not want to have to curl my thumb under my palm very much. So Meta has to come from the Command key.

For the same reasons, on a PC keyboard Meta has to come from the Alt key. Note that on a Mac keyboard, the Option key is also labelled Alt.

So if you have a keyboard mapping designed for a PC keyboard, you have Meta <- Alt <- non-curly thumb, but when applied to a Mac keyboard you get Meta <- Alt <- Option <- too-curly thumb.

This is an awkward disagreement which I have to work around.

X11 keyboard modifiers

OK, sorry, we aren't quite done with the tedious background material yet.

The X11 keyboard model (at least the simpler pre-XKB model) basically has four layers:

  • keycodes: numbers that represent physical keys, e.g. 64

  • keysyms: symbols that represent key labels, e.g. Alt_L

  • modifiers: a few kesyms can be configured as one of 8 modifiers (shift, ctrl, etc.)

  • keymap: how keysyms plus modifiers translate to characters

I can reset all these tables so my keyboard has a reasonably sane layout with:

    $ setxkbmap -layout us -option ctrl:nocaps

After I do that I get a fairly enormous variety of modifier-ish keysyms:

    $ xmodmap -pke | egrep 'Shift|Control|Alt|Meta|Super|Hyper|switch'
    keycode  37 = Control_L NoSymbol Control_L
    keycode  50 = Shift_L NoSymbol Shift_L
    keycode  62 = Shift_R NoSymbol Shift_R
    keycode  64 = Alt_L Meta_L Alt_L Meta_L
    keycode  66 = Control_L Control_L Control_L Control_L
    keycode  92 = ISO_Level3_Shift NoSymbol ISO_Level3_Shift
    keycode 105 = Control_R NoSymbol Control_R
    keycode 108 = Alt_R Meta_R Alt_R Meta_R
    keycode 133 = Super_L NoSymbol Super_L
    keycode 134 = Super_R NoSymbol Super_R
    keycode 203 = Mode_switch NoSymbol Mode_switch
    keycode 204 = NoSymbol Alt_L NoSymbol Alt_L
    keycode 205 = NoSymbol Meta_L NoSymbol Meta_L
    keycode 206 = NoSymbol Super_L NoSymbol Super_L
    keycode 207 = NoSymbol Hyper_L NoSymbol Hyper_L

These map to modifiers as follows. (The higher modifers have unhelpfully vague names.)

    $ xmodmap -pm
    xmodmap:  up to 4 keys per modifier, (keycodes in parentheses):

    shift       Shift_L (0x32),  Shift_R (0x3e)
    control     Control_L (0x25),  Control_L (0x42),  Control_R (0x69)
    mod1        Alt_L (0x40),  Alt_R (0x6c),  Meta_L (0xcd)
    mod2        Num_Lock (0x4d)
    mod4        Super_L (0x85),  Super_R (0x86),  Super_L (0xce),  Hyper_L (0xcf)
    mod5        ISO_Level3_Shift (0x5c),  Mode_switch (0xcb)

How Mac OS -> Synergy -> X11 works by default

If I don't change things, I get

    Command -> 133 -> Super_L -> Mod4 -> good for controlling window manager

    Option -> 64 -> Alt_L -> Mod1 -> meta in emacs

which is not unexpected, given the differences between PC and Mac layouts, but I want to swap the effects of Command and Option

Note that I get (roughly) the same effect when I plug the Apple keyboard directly into the PC and when I use it via Synergy. I want the swap to work in both cases.

xmodmap to the rescue! not!

Eliding some lengthy and frustrating debugging, the important insight came when I found I could reset the keymap using setxkbmap (as I mentioned above) and test the effect of xmodmap on top of that in a reproducible way. (xmodmap doesn't have a reset-to-default option.)

What I want should be relatively simple:

    Command -> Alt -> Mod1 -> Meta in emacs

    Option -> Super -> Mod4 -> good for controlling window manager

So in xmodmap I should be able to just swap the mappings of keycodes 64 and 133.

The following xmodmap script is a very thorough version of this idea. First it completely strips the higher-order modifier keys, then it rebuilds just the config I want.

    clear Mod1
    clear Mod2
    clear Mod3
    clear Mod4
    clear Mod5

    keycode  64 = NoSymbol
    keycode  92 = NoSymbol
    keycode 108 = NoSymbol
    keycode 133 = NoSymbol
    keycode 134 = NoSymbol
    keycode 203 = NoSymbol
    keycode 204 = NoSymbol
    keycode 205 = NoSymbol
    keycode 206 = NoSymbol
    keycode 207 = NoSymbol

    ! command -> alt
    keycode 133 = Alt_L
    keycode 134 = Alt_R
    add Mod1 = Alt_L Alt_R

    ! option -> super
    keycode 64 = Super_L
    keycode 108 = Super_R
    add Mod4 = Super_L Super_R

WAT?! That does not work

The other ingredient of the debugging was to look carefully at the output of xev and Synergy's debug logs.

When I fed that script into xmodmap, I saw,

    Command -> 64 -> Super_L -> Mod4 -> good for controlling window manager

    Option -> 133 -> Alt_L -> Mod1 -> Meta in emacs

So Command was STILL being Super, and Option was STILL being Alt.

I had not swapped the effect of the keys! But I had swapped their key codes!

An explanation for Synergy's modifier key handling

Synergy's debug logs revealed that, given a keypress on the Mac, Synergy thought it should have a particular effect; it looked at the X11 key maps to work out what key codes it should generate to produce that effect.

So, when I pressed Command, Synergy thought, OK, I need to make a Mod4 on X11, so I have to artificially press a keycode 113 (or 64) to have this effect.

This also explains some other weird effects.

  • Synergy produces one keycode for both left and right Command, and one for both left and right Option. Its logic squashes a keypress to a desired modifier, which it then has to map back to a keysym - and it loses the left/right distinction in the process.

  • If I use my scorched-earth technique but tell xmodmap to map keysyms to modifiers that Synergy isn't expecting, the Command or Option keys will mysteriously have no effect at all. Synergy's log complains that it can't find a mapping for them.

  • Synergy has its own modifier key mapping feature. If you tell it to map a key to Meta, and the X11 target has a default keymap, it will try to create Meta from a crazy combination of Shift+Alt. The reason why is clear if you work backwards from these lines of xmodmap output:

    keycode  64 = Alt_L Meta_L Alt_L Meta_L
    mod1        Alt_L (0x40),  Alt_R (0x6c),  Meta_L (0xcd)

My erroneous mental model

This took a long time to debug because I thought Synergy was mapping keypresses on the Mac to keycodes or keysyms on X11. If that had been the case then xmodmap would have swapped the keys as I expected. I would also have seen different keysyms in xev for left and right. And I would not have seen the mysterious disappearing keypresses nor the crazy multiple keypresses.

It took me a lot of fumbling to find some reproducible behaviour from which I could work out a plausible explanation :-(

The right way to swap modifier keys with Synergy

The right way to swap modifier keys with Synergy is to tell the Synergy server (which runs on the computer to which the keyboard is attached) how they keyboard modifiers should map to modifiers on each client computer. For example, I run synergys on a Mac called white with a synergyc on a PC called grey:

    section: screens
            super = alt
            alt = super
            # no options

You can do more complicated mapping, but a simple swap avoids craziness.

I also swap the keys with xmodmap. This has no effect for Synergy, because it spots the swap and unswaps them, but it means that if I plug a keyboard directly into the PC, it has the same layout as it does when used via Synergy.


I'm feeling a bit more sad about Synergy than I was before. It isn't just because of this crazy behaviour.

The developers have been turning Synergy into a business. Great! Good for them! Their business model is paid support for open source software, which is fine and good.

However, they have been hiding away their documentation, so I can't find anything in the current packages or on the developer website which explains this key mapping behaviour. Code without documentation doesn't make me want to give them any money.

fanf: (dotat)

Just a quick note to say I have fixed the [ profile] dilbert_zoom and [ profile] dilbert_24 feeds (after a very long hiatus) so if you follow them there is likely to be a sudden spooge of cartoons.

fanf: (dotat)

(In particular, i3 running on a remote machine!)

I have been rebuilding my workstation, and taking the opportunity to do some major housecleaning (including moving some old home directory stuff from CVS to git!)

Since 1999 I have used fvwm as my X11 window manager. I have a weird configuration which makes it work a lot like a tiling window manager - I have no title bars, very thin window borders, keypresses to move windows to predefined locations. The main annoyance is that this configuration is not at all resolution-independent and a massive faff to redo for different screen layouts - updating the predefined locations and the corresponding keypresses.

I have heard good things about i3 so I thought I would try it. Out of the box it works a lot like my fvwm configuration, so I decided to make the switch. It was probably about the same faff to configure i3 the way I wanted (40 workspaces) but future updates should be much easier!

XQuartz and quartz-wm

I did some of this configuration at home after work, using XQuartz on my MacBook as the X11 server. XQuartz comes with its own Mac-like window manager called quartz-wm.

You can't just switch window managers by starting another one - X only allows one window manager at a time, and other window managers will refuse to start if one is already running. So you have to configure your X session if you want to use a different window manager.

X session startup

Traditionally, X stashes a lot of configuration stuff in /usr/X11R6/lib/X11. I use xdm which has a subdirectory of ugly session management scripts; there is also an xinit subdirectory for simpler setups. Debian sensibly moves a lot of this gubbins into /etc/X11; XQuartz puts them in /opt/X11/lib/X11.

As well as their locations, the contents of the session scripts vary from one version of X11 to another. So if you want to configure your session, be prepared to read some shell scripts. Debian has sensibly unified them into a shared Xsession script which even has a man page!

XQuartz startup

XQuartz does not use a display manager; it uses startx, so the relevant script is /opt/X11/lib/X11/xinit/xinitrc. This has a nice run-parts style directory, inside which is the script we care about, /opt/X11/lib/X11/xinit/xinitrc.d/ This in turn invokes scripts in a per-user run-parts directory, ~/.xinitrc.d.

Choose your own window manager

So, what you do is,

    $ mkdir .xinitrc.d
    $ cat >.xinitrc.d/
    exec twm
    $ chmod +x .xinitrc.d/
    $ open /Applications/Utilities/

(The .sh and the chmod are necessary.)

This should cause an xterm to appear with twm decoration instead of quartz-wm decoration.

My supid trick

Of course, X is a network protocol, so (like any other X application) you don't have to run the window manager on the same machine as the X server. My was roughly,

    exec ssh -AY workstation i3

And with XQuartz configured to run fullscreen this was good enough to have a serious hack at .i3/config :-)

fanf: (dotat)

Today I have been amused by a conversation on Twitter about car names.

It started with a map of common occupational surnames in Europe which prompted me to say that I thought Smith's Cars doesn't sound quite as romantic as Ferrari. Adam and Rich pointed out that "Farrier's Cars" might be a slightly better translation since it keeps the meaning and still sounds almost as good. And it made me think that perhaps the horse is prancing because it has new shoes.

In response I got a lot of agricultural translations from @bmcnett. Later there followed several more or less apocryphal stories about car names: why the Toyota MR2 sold like shit in France, why the Rolls Royce Silver Shadow was nearly silver-plated crap in Germany, or (one, two, three) why the Mitsubishi Pajero was only bought by wankers in Spain, and finally why the Chevy Nova did, in fact, go in Latin America.

(And, tangentially relevant, Mike Pitt also mentioned that Joe Green's name is also more romantic in his native Italian.)

But I was puzzled for a while when the uniquely named mathew mentioned the Lamborghini Countach, before all the other silliness, when I had only remarked about unromantic translations. I wasn't able to find an etymology of Lamborghini, but their logo is a charging bull, and many of their cars are named after famous fighting bulls. However, the Countach is not; the story goes that when Nuccio Bertoni first saw a prototype he swore in amazement - "countach" or "cuntacc" is apparently the kind of general-purpose profanity a 1970s Piedmontese man might utter appreciatively upon seeing a beautiful woman or a beautiful car.

So, maybe, at a stretch, it would not be completely wrong to translate "Lamborghini Countach" to "bull shit!"

fanf: (dotat)

I have a question about a subtle distinction in meaning between words. I want to know if this distinction is helpful, or if it is just academic obfuscation. But first let me set the scene...

Last week I attempted to explain DNSSEC to a few undergraduates. One of the things I struggled with was justifying its approach to privacy.

(HINT: it doesn't have one!)

Actually, DNSSEC has a bit more reasoning for its lack of privacy than that, so here's a digression:

The bullshit reason why DNSSEC doesn't give you privacy

Typically when you query the DNS you are asking for the address of a server; the next thing you do is connect to that server. The result of the DNS query is necessarily revealed, in cleartext, in the headers of the TCP connection attempt.

So anyone who can see your TCP connection has a good chance of inferring your DNS queries. (Even if your TCP connection went to Cloudflare or some other IP address hosting millions of websites, a spy can still guess based on the size of the web page and other clues.)

Why this reasoning is bullshit

DNSSEC turns the DNS into a general-purpose public key infrastructure, which means that it makes sense to use the DNS for a lot more interesting things than just finding IP addresses.

For example, people might use DNSSEC to publish their PGP or S/MIME public keys. So your DNS query might be a prelude to sending encrypted email rather than just connecting to a server.

In this case the result of the query is not revealed in the TCP connection traffic - you are always talking to your mail server! The DNS query for the PGP key reveals who you are sending mail to - information that would be much more obscure if the DNS exchange were encrypted!

That subtle distinction

What we have here is a distinction between

who you are talking to


what you are talking about

In the modern vernacular of political excuses for the police state, who you talk to is "metadata" and this is "valuable" information which "should" be collected. (For example, America uses metadata to decide who to kill.) Nobody wants to know what you are talking about, unless getting access to your data gives them a precedent in favour of more spying powers.

Hold on a sec! Let's drop the politics and get back to nerdery.

Actually it's more subtle than that

Frequently, "who you are talking to" might be obscure at a lower layer, but revealed at a higer layer.

For example, when you query for a correspondent's public key, that query might be encrypted from the point of view of a spy sniffing your network connection, so the spy can't tell whose key you asked for. But if the spy has cut a deal with the people who run the keyserver, they can find out which key you asked for and therefore who you are talking to.

Why DNS privacy is hard

There's a fundamental tradeoff here.

You can have privacy against an attacker who can observe your network traffic, by always talking to a few trusted servers who proxy your actions to the wider Internet. You get extra privacy bonus points if a diverse set of other people use the same trusted servers, and provide covering fire for your traffic.

Or you can have privacy against a government-regulated ISP who provides (and monitors) your default proxy servers, by talking directly to resources on the Internet and bypassing the monitored proxies. But that reveals a lot of information to network-level traffic analsis.

The question I was building up to

Having written all that preamble, I'm even less confident that this is a sensible question. But anyway,

What I want to get at is the distinction between metadata and content. Are there succinct words that capture the difference?

Do you think the following works? Does this make sense?

The weaker word is


If something is confidential, an adversary is likely to know who you are talking to, but they might not know what you talked about.

The stronger word is


If something is private, an adversary should not even know who you are dealing with.

For example, a salary is often "confidential": an adversary (a rival colleague or a prospective employer) will know who is paying it but not how big it is.

By contrast, an adulterous affair is "private": the adversary (your spouse) shouldn't even know you are spending a lot of time with someone else.

What I am trying to get at with this choice of words is the idea that even if your communication is "confidential" (i.e. encrypted so spies can't see the contents), it probably isn't "private" (i.e. it doesn't hide who you are talking to or suppress scurrilous implications).

SSL gives you confidentiality (it hides your credit card number) whereas TOR gives you privacy (it hides who you are buying drugs from).

fanf: (dotat)

A month ago I wrote about a denial of service attack that caused a TCP flood on one of our externally-facing authoritative DNS servers. In that case the mitigation was to discourage legitimate clients from using TCP; however those servers are sadly still vulnerable to TCP flooding attacks, and because they are authoritative servers they have to be open to the Internet. But we get some mitigation from having off-site anycast servers so it isn't completely hopeless.

This post is about our inward-facing recursive DNS servers.

Two weeks ago, Google and Red Hat announced CVE-2015-7547, a remote code execution vulnerability in glibc's getaddrinfo() function. There are no good mitigations for this vulnerability so we patched everything promptly (and I hope you did too).

There was some speculation about whether it is possible to exploit CVE-2015-7547 through a recursive DNS server. No-one could demonstrate an attack but the general opinion was that it is likely to be possible. Dan Kaminsky described the vulnerability as "a skeleton key of unkown strength", citing the general rule that "attacks only get better".

Yesterday, Jaime Cochran and Marek Vavruša of Cloudflare described a successful cache traversal exploit of CVE-2015-7547. Their attack relies on the behaviour of djbdns when it is flooded with TCP connections - it drops older connections.

BIND's behaviour is different; it retains existing connections and refuses new connections. This makes it trivially easy to DoS BIND's TCP listener, and given the wider discussion about proper support for DNS over TCP including persistent connections djbdns's behaviour appears to be superior.

So it's unfortunate that djbdns has better connection handling which makes it vulnerable to this attack, whereas BIND is protected by being worse!

Fundamentally, we need DNS server software to catch up with the best web servers in the quality of their TCP connection handling.

But my immediate reaction was to realise that my servers would have a problem if the Cloudflare attack (or something like it) became at all common. Our recursive DNS servers were open to TCP connections from the wider Internet, and we were relying on BIND to reject queries from foreign clients. This clearly was not sufficiently robust. So now, where my firewall rules used to have a comment

    # XXX or should this be CUDN only?

there are now some actual filtering rules to block incoming TCP connections from outside the University. (I am still sending DNS REFUSED replies over UDP since that is marginally more friendly than an ICMP rejection and not much more expensive.)

Although this change was inspired by CVE-2015-7547, it is really about general robustness; it probably doesn't have much effect on our ability to evade more sophisticated exploits.

fanf: (dotat)
I have embiggened the table in my previous entry to add the new Raspberry Pi 3 to the comparison!
fanf: (dotat)

This afternoon I was talking to an undergrad about email in Cambridge, and showing some old pictures of Hermes to give some idea of what things were like many years ago. This made me wonder, how does a Raspberry Pi 2 (in a Pibow case) compare to a Sun E450 like the ones we decommissioned in 2004?

ETA (2016-02-29): I have embiggened this table to include details of the Raspberry Pi 3. It's worth noting when comparing the IO ports that both models of Raspberry Pi hang the ethernet and four USB ports off one USB2 480 Mbit/s bus. In the Raspberry Pi 3 the WiFi shares the SDIO bus with the SD card slot, and the Bluetooth hangs off a UART. The E450 has a lot more IO bandwidth.

Raspberry Pi 2 Raspberry Pi 3 Sun Enterprise 450
(in PiBow case)
2015 - 2016 - 1997-2002
£44 $14 000
dimensions 99 x 66 x 30 mm 696 x 448 x 581 mm
volume 196 cm3 181 000 cm3
weight 137 g 94 000 g
cores 4 x ARM Cortex-A7 4 x ARM Cortex-A53 4 x UltraSPARC II
word size 32 bit 64 bit 64 bit
issue width 2-way in-order 4-way in-order
clock 900 MHz 1200 MHz 400 MHz
GPU Videocore IV
250 MHz 300 MHz
L1 i-cache 32 KB 16 KB
L1 d-cache 32 KB 16 KB
L2 cache 512 KB 4 MB
bandwidth 900 MB/s 1778 MB/s
Network 1 x 100 Mb/s ethernet 1 x 100 Mb/s ethernet 1 x 100 Mb/s ethernet
1 x 72 Mb/s 802.11n WiFi
1 x 24 Mb/s Bluetooth 4.1
Disk bus speed 25 MB/s 40 MB/s
Disk interface 1 x MicroSD 5 x UltraSCSI-3
1 x UltraSCSI-2
1 x floppy
1 x CD-ROM

1 x UART
4 x USB
8 x GPIO
2 x I²C
2 x SPI

2 x UART
1 x mini DIN-8
1 x centronics
3 x 64 bit 66 MHz PCI
4 x 64 bit 33 MHz PCI
3 x 32 bit 33 MHz PCI

fanf: (dotat)

tl;dr - you can use my program adns-masterfile to pre-heat a DNS cache as part of a planned failover process.


One of my favourite DNS tools is GNU adns, an easy-to-use asynchronous stub resolver library.

I started using adns in 1999 when it was still very new. At that time I was working on web servers at Demon Internet. We had a daily log processing job which translated IP addresses in web access logs to host names, and this was getting close to taking more than a day to process a day of logs. I fixed this problem very nicely with adns, and my code is included as an example adns client program.

Usually when I want to do some bulk DNS queries, I end up doing a clone-and-hack of adnslogres. Although adns comes with a very capable general-purpose program called adnshost, it doesn't have any backpressure to control the number of outstanding queries. So if you try to use it for bulk queries, you will overwhelm the server with traffic and most of the queries will time out and fail. So, although my normal approach would be to massage the data with an ad-hoc Perl script so I can pipe it into a general-purpose program, with adns I usually end up writing special-purpose C programs which have their own concurrency control.

cache preheating

In our DNS server setup we use keepalived to do automatic failover of the recursive server IP addresses. I wrote about our keepalived configuration last year, explaining how we can control which servers are live, which are standby, and which are not part of the failover pool.

The basic process for config changes, patches, upgrades, etc., is to take a stanby machine out of the pool, apply the changes, test them, then shuffle the pool so the next machine can be patched. Patching all of the servers requires one blip of a few seconds for each live service address.

However, after moving a service address, the new server has a cold cache, so (from the point of view of the user) a planned failover looks like a sudden drop in DNS server performance.

It would be nice if we could pre-heat a server's cache before making it live, to minimize the impact of planned maintenance.

rndc dumpdb

BIND can be told to dump its cache, which it does in almost-standard textual DNS master file format. This is ideal source data for cache preheating: we could get one of the live servers to dump its cache, then use the dump to make lots of DNS queries against a standby server to warm it up.

The named_dump.db file uses a number of slightly awkward master file features: it omits owner names and other repeated fields when it can, and it uses () to split records across multiple lines. So it needs a bit more intelligence to parse than just extracting a field from a web log line!


I hadn't used lex for years and years before last month. I had almost forgotten how handy it is, although it does have some very weird features. (Its treatment of leading whitespace is almost as special as tab characters in make files.)

But, if you have a problem whose solution involves regular expressions and a C library, it's hard to do better than lex. It makes C feel almost like a scripting language!

(According to the Jargon File, Perl was described as a "Swiss-Army chainsaw" because it is so much more powerful than Lex, which had been described as the Swiss-Army knife of Unix. I can see why!)


So last month, I used adns and lex to write a program called adns-masterfile which parses a DNS master file (in BIND named_dump.db flavour) and makes queries for all the records it can. It's a reasonably handy way for pre-heating a cache as part of a planned server swap.

I used it yesterday when I was patching and rebooting everything to deal with the glibc CVE-2015-7547 vulnerability.


Our servers produce a named_dump.db file of about 5.3 million lines (175 MB). From this adns-masterfile makes about 2.6 million queries, which takes about 10 minutes with a concurrency limit of 5000.

fanf: (dotat)

Last weekend one of our authoritative name servers ( suffered a series of DoS attacks which made it rather unhappy. Over the last week I have developed a patch for BIND to make it handle these attacks better.

The attack traffic

On authdns1 we provide off-site secondary name service to a number of other universities and academic institutions; the attack targeted

For years we have had a number of defence mechanisms on our DNS servers. The main one is response rate limiting, which is designed to reduce the damage done by DNS reflection / amplification attacks.

However, our recent attacks were different. Like most reflection / amplification attacks, we were getting a lot of QTYPE=ANY queries, but unlike reflection / amplification attacks these were not spoofed, but rather were coming to us from a lot of recursive DNS servers. (A large part of the volume came from Google Public DNS; I suspect that is just because of their size and popularity.)

My guess is that it was a reflection / amplification attack, but we were not being used as the amplifier; instead, a lot of open resolvers were being used to amplify, and they in turn were making queries upstream to us. (Consumer routers are often open resolvers, but usually forward to their ISP's resolvers or to public resolvers such as Google's, and those query us in turn.)

What made it worse

Because from our point of view the queries were coming from real resolvers, RRL was completely ineffective. But some other configuration settings made the attacks cause more damage than they might otherwise have done.

I have configured our authoritative servers to avoid sending large UDP packets which get fragmented at the IP layer. IP fragments often get dropped and this can cause problems with DNS resolution. So I have set

    max-udp-size 1420;
    minimal-responses yes;

The first setting limits the size of outgoing UDP responses to an MTU which is very likely to work. (The ethernet MTU minus some slop for tunnels.) The second setting reduces the amount of information that the server tries to put in the packet, so that it is less likely to be truncated because of the small UDP size limit, so that clients do not have to retry over TCP.

This works OK for normal queries; for instance a IN MX query gets a svelte 216 byte response from our authoritative servers but a chubby 2047 byte response from our recursive servers which do not have these settings.

But ANY queries blow straight past the UDP size limit: the attack queries for IN ANY got obese 3930 byte responses.

The effect was that the recursive clients retried their queries over TCP, and consumed the server's entire TCP connection quota. (Sadly BIND's TCP handling is not up to the standard of good web servers, so it's quite easy to nadger it in this way.)


We might have coped a lot better if we could have served all the attack traffic over UDP. Fortunately there was some pertinent discussion in the IETF DNSOP working group in March last year which resulted in draft-ietf-dnsop-refuse-any, "providing minimal-sized responses to DNS queries with QTYPE=ANY".

This document was instigated by Cloudflare, who have a DNS server architecture which makes it unusually difficult to produce traditional comprehensive responses to ANY queries. Their approach is instead to send just one synthetic record in response, like  HINFO  ( "Please stop asking for ANY"
                              "See draft-jabley-dnsop-refuse-any" )

In the discussion, Evan Hunt (one of the BIND developers) suggested an alternative approach suitable for traditional name servers. They can reply to an ANY query by picking one arbitrary RRset to put in the answer, instead of all of the RRsets they have to hand.

The draft says you can use either of these approaches. They both allow an authoritative server to make the recursive server go away happy that it got an answer, and without breaking odd applications like qmail that foolishly rely on ANY queries.

I did a few small experiments at the time to demonstrate that it really would work OK in the real world (unlike some of the earlier proposals) and they are both pretty neat solutions (unlike some of the earlier proposals).

Attack mitigation

So draft-ietf-dnsop-refuse-any is an excellent way to reduce the damage caused by the attacks, since it allows us to return small UDP responses which reduce the downstream amplification and avoid pushing the intermediate recursive servers on to TCP. But BIND did not have this feature.

I did a very quick hack on Tuesday to strip down ANY responses, and I deployed it to our authoritative DNS servers on Wednesday morning for swift mitigation. But it was immediately clear that I had put my patch in completely the wrong part of BIND, so it would need substantial re-working before it could be more widely useful.

I managed to get back to the patch on Thursday. The right place to put the logic was in the fearsome query_find() which is the top-level query handling function and nearly 2400 lines long! I finished the first draft of the revised patch that afternoon (using none of the code I wrote on Tuesday), and I spent Friday afternoon debugging and improving it.

The result is this patch which adds a minimal-qtype-any option. I'm currently running it on my toy nameserver, and I plan to deploy it to our production servers next week to replace the rough hack.

I have submitted the patch to the ISC; hopefully something like it will be included in a future version of BIND. And it prompted a couple of questions about draft-ietf-dnsop-refuse-any that I posted to the DNSOP working group mailing list.

fanf: (dotat)

I have been fiddling around with FreeBSD's whois client. Since I have become responsible for Cambridge's Internet registrations, it's helpful to have a whois client which isn't annoying.

Sadly, whois is an unspeakably crappy protocol. In fact it's barely even a protocol, more like a set of vague suggestions. Ugh.

The first problem...

... is to work out which server to send your whois query to. There are a number of techniques, most of which are necessary and none of which are sufficient.

  1. Rely on a knowledgable user to specify the server.

    Happily we can do better than just this, but the feature has to be available for special queries.

  2. Have a built-in curated mapping from query patterns to servers.

    This is the approach used by Debian's whois client. Sadly in the era of vast numbers of new gTLDs, this requires software updates a couple of times a week.

  3. Send the query to which maps TLDs to whois servers using CNAMEs in the DNS.

    This is a brilliant service, particularly good for the wild and wacky two-letter country-class TLDs. Unfortunately it has also failed to keep up with the new gTLDs, even though it only needs a small amount of extra automation to do so.

  4. Try whois.nic.TLD which is the standard required for new gTLDs.

    In practice a combination of (2) and (3) is extremely effective for domain name whois lookups.

  5. Follow referrals from a server with broad but shallow data to one with narrower and deeper data.

    Referrals are necessary for domain queries in "thin" registries, in which the TLD's registry does not contain all the details about registrants (domain owners), but instead refers queries to the registrar (i.e. reseller).

    They are also necessary for IP address lookups, for which ARIN's database contains registrations in North America, plus referrals to the other regional Internet registries for IP address registrations in other parts of the world.

Back in May I added (3) to FreeBSD's whois to fix its support for new gTLDs, and I added a bit more (1).

One motivation for the latter was for looking up domains: (4) doesn't work because Nominet's .uk whois server doesn't provide referrals to JANET's whois server; and (2) is a bit awkward, because although there is an entry for you have to have some idea of when it makes sense to try DNS queries for 2LDs. ( would be easier to use if it had a wildcard for each entry.)

The other motivation for extending the curated server list was to teach it about more NIC handle formats, such as -RIPE and -NICAT handles; and the same mechanism is useful for special-case domains.

Last week I added support for AS numbers, moving them from (0) to (1). After doing that I continued to fiddle around, and soon realised that it is possible to dispense with (3) and (2) and a large chunk of (1), by relying more on (4). The IANA whois server knows about most things you might look up with whois - domain names, IP addresses, AS numbers - and can refer you to the right server.

This allowed me to throw away a lot of query syntax analysis and trial-and-error DNS lookups. Very satisfying.

(I'm not sure if this excellently comprehensive data is a new feature of IANA's whois server, or if I just failed to notice it before...)

The second problem...

... is that the output from whois servers is only vaguely machine-readable.

For example, FreeBSD's whois now knows about 4 different referral formats, two of which occur with varying spacing and casing from different servers. (I've removed support for one amazingly ugly and happily obsolete referral format.)

My code just looks for a match for any referral format without trying to be knowledgable about which servers use which syntax.

The output from whois is basically a set of key: value pairs, but often these will belong to multiple separate objects (such as a domain name or a person or a net block); servers differ about whether blank lines separate objects or are just for pretty-printing a single object. I'm not sure if there's anything that can be done about this without huge amounts of tedious work.

And servers often emit a lot of rubric such as terms and conditions or hints and tips, which might or might not have comment markers. FreeBSD's whois has a small amount of rudimentary rubric-trimming code which works in a lot of the most annoying cases.

The third problem...

... is that the syntax of whois queries is enormously variable. What is worse, some servers require some non-standard complication to get useful output.

If you query Verisign for the server does fuzzy matching and returns a list of dozens of spammy name server names. To get a useful answer you need to ask for domain

ARIN also returns an unhelpfully terse list if a query matches multiple objects, e.g. a net block and its first subnet. To make it return full details for all matches (like RIPE's whois server) you need to prefix the query with a +.

And for .dk the verbosity option is --show-handles.

The best one is DENIC, which requires a different query syntax depending on whether the domain name is a non-ASCII internationalized domain name, or a plain ASCII domain (which might be a punycode-encoded internationalized domain name). Good grief, can't it just give a useful answer without hand-holding?


That's quite a lot of bullshit for a small program to cope with, and it's really only scratching the surface. Debian's whois implementation has attacked this mess with a lot more sustained diligence, but I still prefer FreeBSD's because of its better support for new gTLDs.

fanf: (dotat)

After being ejected from the pub this evening we had a brief discussion about the correspondance between winter clothing and the ISO Open Systems Interconnection 7 layer model. This is entirely normal.

My coat is OK except it is a bit too large, so even if I bung up the neck with a woolly scarf, it isn't snug enough around the hips to keep the chilly wind out. So I had multiple layers on underneath. My Scottish friend was wearing a t-shirt and a leather jacket – and actually did up the jacket; I am not sure whether that was because of the cold or just for politeness.

So the ISO OSI 7 layer model is:

  1. physical
  2. link
  3. network
  4. transport
  5. presentation and session or
  6. maybe session and presentation
  7. application

The Internet model is:

  1. wires or glass or microwaves or avian carriers
  2. ethernet or wifi or mpls or vpn or mille feuille or
    how the fuck do we make IP work over this shit
  3. IP
  4. TCP and stuff
  5. dunno
  6. what
  7. useful things and pictures of cats

And, no, I wasn't actually wearing OSI depth of clothing; it was more like:

  1. skin
  2. t-shirt
  3. rugby shirt
  4. fleece
    You know, OK for outside "transport" most of the year.
  5. ...
  6. ...    Coat. It has layers but I don't care.
  7. ...
fanf: (dotat)

A couple of months ago we had a brief discussion on an ops list at work about security of personal ssh private keys.

I've been using ssh-agent since I was working at Demon in the late 1990s. I prefer to lock my screen rather than logging out, to maintain context from day to day - mainly my emacs buffers and window layout. So to be secure I need to delete the keys from the ssh-agent when the screen is locked.

Since 1999 I have been using Xautolock and Xlock with a little script which automatically runs ssh-add -D to delete my keys from the ssh-agent shortly after the screen is locked. This setup works well and hasn't needed any changes since 2002.

But when I'm at home I'm using a Mac, and I have not had similar automation. Having to type ssh-add -D manually is very tiresome and because I am lazy I don't do it as often as I should.

At the beginning of December I found out about Hammerspoon, which is a Mac OS X application that provides lots of hooks into the operating system that you can script with Lua. The Hammerspoon getting started guide provides a good intro to the kinds of things you can do with it. (Hammerspoon is a fork of an earlier program called Mjolnir, ho ho.)

It wasn't until earlier this week that I had a brainwave when I realised that I might be able to use Hammerspoon to automatically run ssh-add -D at appropriate times for me. The key is the hs.caffeinate.watcher module, with which you can get a Lua function to run when sleep or wake events occur.

This is almost what I want: I configure my Macs to send the displays to sleep on a short timer and lock them soon after, and I have a hot corner to force them to sleep immediately. Very similar to my X11 screen lock setup if I use Hammerspoon to invoke an ssh-add -D script.

But there are other events that should also cause ssh keys to be deleted from the agent: switching users, switching to the login screen, and (for wasteful people) screen-saver activation. So I had a go at hacking Hammerspoon to add the functionality I wanted, and ended up with a pull request that adds several extra event types to hs.caffeinate.watcher.

At the end of this article is an example Hammerspoon script which uses these new event types. It should work with unpatched Hammerspoon as well, though only the sleep events will have any effect.

There are a few caveats.

I wanted to hook an event corresponding to when the screen is locked after the screensaver starts or the screen sleeps, with the delay that the user specified in the System Preferences. The NSDistributedNotificationCenter event "" sounds like it ought to be what I want, but it actually gets triggered as soon as the screensaver starts or the screen sleeps, without any delay. So a more polished hook script will have to re-implement that feature in a similar way to my X11 screen lock script.

I wondered if locking the screen corresponds to locking the keychain under the hood. I experimented with SecKeychainAddCallback but it was not triggered when the screen locks :-( So I think it might make sense to invoke security lock-keychain as well as ssh-add -D.

And I don't yet have a sensible user interface for re-adding the keys when the screen is unlocked. I can get to run ssh -t my-workstation ssh-add but that is really quite ugly :-)

Anyway, here is a script which you can invoke from ~/.hammerspoon/init.lua using dofile(). You probably also want to run hs.logger.defaultLogLevel = 'info'.

    -- hammerspoon script
    -- run key security scripts when the screen is locked and unlocked

    -- NOTE this requires a patched Hammerspoon for the
    -- session, screen saver, and screen lock hooks.

    local pow = hs.caffeinate.watcher

    local log ="ssh-lock")

    local function ok2str(ok)
        if ok then return "ok" else return "fail" end

    local function on_pow(event)
        local name = "?"
        for key,val in pairs(pow) do
            if event == val then name = key end
        log.f("caffeinate event %d => %s", event, name)
        if event == pow.screensDidWake
        or event == pow.sessionDidBecomeActive
        or event == pow.screensaverDidStop
            local ok, st, n = os.execute("${HOME}/bin/unlock_keys")
            log.f("unlock_keys => %s %s %d", ok2str(ok), st, n)
        if event == pow.screensDidSleep
        or event == pow.systemWillSleep
        or event == pow.systemWillPowerOff
        or event == pow.sessionDidResignActive
        or event == pow.screensDidLock
            local ok, st, n = os.execute("${HOME}/bin/lock_keys")
            log.f("lock_keys => %s %s %d", ok2str(ok), st, n)

fanf: (dotat)

On a flight to San Francisco in 2000 or 2001 I had a chat with a British woman sitting next to me. She was a similar age to me, 20-something, also aiming to try her hand at the Silly Valley thing, but, you know, MONTHS behind me. She asked me if I knew of a Days Inn or something like that. There was (is) one in the Tenderloin district literally round the corner from my flat, but that part of the city was such a shithole I was embarrassed to say I lived there. And then (I kid you not) I managed to leave my box of tea behind on the plane, and I lost touch with my seat-mate trying to retrieve it. What a chump.

Another time, I was on a taxi from Cambridge to Heathrow (ridiculous wasteful expense) when I realised my passport had fallen out of my back pocket while I was sitting on my heels during a party. I missed my flight, had to go back to Cambridge to recover my passport, and my employer's travel agent put me on another flight the next day. I felt like a fool; I'm amazed my employer handled that so reasonably (not to mention the other ways I took advantage of them).

My sojourn in San Francisco was not a success. I was amazingly lucky to catch the tail end of the dot-com boom in 2000 but I burned out badly less than a year later. I was stupid in so many ways.

I failed because I overestimated my own capabilities, and I underestimated the importance of my friends. It's enormously difficult to establish a social network in a new place from scratch. I was lucky working for Demon Internet in London (1997-2000) and for Covalent in San Francisco (2000-2001) that in both cases my colleagues were a social bunch. But I was often back to Cambridge for parties with my mates, and there's a big difference between 60 miles and 6000 miles.

And, honestly, I was too arrogant to ask my colleagues for feedback and support. (I'm still crap at that.)


I recovered from the breakdown. Though it took a long time, I moved back closer to my friends, spent my savings writing code for fun, and in the end got a job which has kept body and soul together (and better) for 13 years.

My failure was painful and difficult, but I learned valuable lessons about myself, and it WAS NOT (in the end) a disaster.

This year has brought that time back to me in interesting ways.

A friend of ours went back to work in South America, in a place she knew and loved, in a job that was made for her. But the place had changed - the old friends were no longer there - and the job wasn't as happy as expected. She was back here much sooner than planned. But our mutual friends told her about my crash and burn and recovery, and this helped her recover.

Another friend did the dot-com thing with a much greater success than me: it took him a lot more than a year to burn out. His was a more controlled flight into terrain than mine, but similarly abrupt. However he already knew about my past, and he says he also took strength from my story.

It is enormously touching to know that my friends have seen my failure, seen that it wasn't a catastrophe, and that helped them to get back on their feet.

So, happy new year, and know that if things don't work out as you hoped, if you fucked it up, it isn't the end of the world. Keep talking to the people you love and keep doing your thing.

fanf: (dotat)
I have just pushed a new release of unifdef.

This version has improvents to the expression evaluator and to in-place source file modifications.

I have also made some attempts at improving win32 support, but my ability to test that is severely limited.
fanf: (dotat)

With reference to the C standard, describe the syntax of controlling expressions in C preprocessor conditional directives.

I've started work on a "C preprocessor partial evaluator". My aim is to re-do unifdef with better infrastructure than 1980s line-at-a-time stdio, so that it's easier to implement a more complete C preprocessor. The downside, of course, is that the infrastructure (Lua and Lpeg) is more than ten times bigger than the whole of unifdef.

The main feature I want is macro expansion; the second feature I want is #if expression simplification. The latter leads to the question above: exactly what is allowed in a C preprocessor conditional directive controlling expression? This turns out to be more tricky than I expected.

What actually triggered the question was that I "know" that sizeof doesn't work in preprocessor expressions because "obviously" the preprocessor doesn't know about details of the target architecture, but I couldn't find where in the standard it says so.

My reference is ISO JTC1 SC22 WG14 document n1570 which is very close to the final committee draft of ISO 9899:2011, the C11 standard.

Preprocessor expressions are specified in section 6.10.1 "Conditional inclusion". Paragraph 1 says:

The expression that controls conditional inclusion shall be an integer constant expression except that: identifiers (including those lexically identical to keywords) are interpreted as described below;166) and it may contain [defined] unary operator expressions [...]

166) Because the controlling constant expression is evaluated during translation phase 4, all identifiers either are or are not macro names — there simply are no keywords, enumeration constants, etc.

The crucial part that I missed is the parenthetical "including [identifiers] lexically identical to keywords" - this applies to the sizeof keyword, as footnote 166 obliquely explains.

... A brief digression on "translation phases". These are specified in section, which lists 8 phases. Now, if you have done an undergraduate course in compilers or read the dragon book, you might expect this list to include things like lexing, parsing, symbol tables, something about translation to and optimization of object code, and something about linking separately compiled units. And it does, sort of. But whereas compilers are heavily weighted towards the middle and back end, C standard translation phases focus on lexical and preprocessor matters, to a ridiculous extent. I find this imbalance quite funny (in a rather dry way) - after such a lengthy and detailed build-up, the last two items in the list are almost, "and then a miracle occurs", especially the last sentence in phase 7 which is a bit LOL WTF.

6. Adjacent string literal tokens are concatenated.

7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

So, anyway, what footnote 166 is saying is that in the preprocessor there is no such thing as a keyword - the preprocessor has a sketchy lexer that produces "preprocessing-token"s (as specified in section 6.4) which are a simplified subset of the compiler's "token"s which mostly don't turn up until translation phase 7.

Paragraph 1 said identifiers are interpreted as described below, which refers to this sentence in section 6.10.1 paragraph 4:

After all replacements due to macro expansion and the defined unary operator have been performed, all remaining identifiers (including those lexically identical to keywords) are replaced with the pp-number 0, and then each preprocessing token is converted into a token. The resulting tokens compose the controlling constant expression which is evaluated according to the rules of 6.6.

This means that if you try to use a keyword (such as sizeof) in a preprocessor expression, it gets replaced by zero and (usually) turns into a syntax error. And this is why compilers produce less-than-straightforward error messages like error: missing binary operator before token "(" if you try to use sizeof.

Smash-keyword-to-zero has another big implication which is a bit more subtle. Section 6.6 specifies constant expressions, and paragraphs 3 and 6 are particularly relevant to the preprocessor.

3 Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.

It is normal for real preprocessor expression parsers to implement a simplified subset of the C expression syntax which simply lacks support for these forbidden operators. So, if you put sizeof(int) in a preprocessor expression, that gets turned into 0(0) before it is evaluated, and you get an error about a missing binary operator. If you write something similar where the compiler expects an integer constant expression, you will get errors complaining that integers are not functions or that function calls are not allowed in integer constant expressions.

6 An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, _Alignof expressions, and floating constants that are the immediate operands of casts.

Re-read this sentence from the point of view of the preprocessor, after identifiers and keywords have been smashed to zero. There aren't any enumeration constants, because they are identifiers zero. Similarly there aren't any sizeof or _Alignof expressions. And there can't be any casts because you can't write a type without at least one identifier. (One situation where smash-keyword-to-zero does not cause a syntax error is an expression like (unsigned)-1 and I bet that turns up in real-world preprocessor expressions.) And since there can't be any casts, there can't be any floating constants.

And therefore the preprocessor does not need any floating point support at all.

I am slightly surprised that such a fundamental simplification requires such a long chain of reasoning to obtain it from the standard. Perhaps (like my original question about sizeof) I have overlooked the relevant text.

Finally, my thanks to Mark Wooding and Brian Mastenbrook for pointing me at the crucial words in the standard.

fanf: (dotat)

Since September 1995 we have had a TXT record on saying "The University of Cambridge, England". Yesterday I replaced it with a LOC record. The TXT space is getting crowded with things like Google and Microsoft domain authentication tokens and SPF records.

When I rebuilt our DNS servers with automatic failover earlier this year, I used the TXT record for a very basic server health check. I wanted some other ultra-stable record for this purpose which is why I added the LOC record.

What other LOC records are there under, I wonder?

Anyone used to be able to AXFR the zone but that hasn't been possible for a few years now. But there is a public FOI response containing a list of domains from 2011 which is good enough.

A bit of hackery with adns and 20 seconds later I have:        LOC 56 46  4.000 N 2 57  5.000 W 50.00m   1m 10000m 10m            LOC 52 12 19.000 N 0  7  5.000 E 18.00m 10000m 100m 100m     LOC 51 22 17.596 N 0  9 58.698 W 33.00m  10m   100m 10m LOC 51 31 15.000 N 0  7 19.000 W  0.00m   1m 10000m 10m       LOC 53 14 58.200 N 2 32 15.190 E 47.00m   1m 10000m 10m            LOC 51  4 17.000 N 1 19 19.000 W 70.00m 200m   100m 10m            LOC 51 26 25.800 N 0 56 46.700 W 87.00m   1m 10000m 10m        LOC 51 26 25.800 N 0 56 46.700 W 87.00m   1m 10000m 10m           LOC 51 31 16.000 N 0  7 40.000 W 93.00m   1m 10000m 10m      LOC 51  4 17.000 N 1 19 19.000 W 70.00m 200m   100m 10m    LOC 53 46 28.000 N 0 16 42.000 W  0.00m   1m 10000m 10m

A LOC record records latitude, longitude, altitude, diameter, horizontal precision, and vertical precision.

The LOC record is supposed to indicate the location of the church of St Mary the Great, which has been the nominal centre of Cambridge since a system of milestones was set up in 1725. The precincts of the University are officially the area within three miles of GSM which corresponds to a diameter a bit less than 10km. The 100m vertical precision is enough to accommodate the 80m chimney at Addenbrookes.

There are a couple of anomalies in the other LOC records. indicates a random spot in the highlands, not the centre of Dundee as it should. should be W not E.

A lot of the records use the default size of 1m diameter and precision of 10km horizontal and 10m vertical.

You can copy and paste the lat/long into Google Maps to see where they land :-)

fanf: (dotat)
Ag Silver	Antigua and Barbuda
Al Aluminum	Albania
Am Americium	Armenia
Ar Argon	Argentina
As Arsenic	American Samoa
At Astatine	Austria
Au Gold		Australia
Ba Barium	Bosnia and Herzegovina
Be Beryllium	Belgium
Bh Bohrium	Bahrain
Bi Bismuth	Burundi
Br Bromine	Brazil
Ca Calcium	Canada
Cd Cadmium	Democratic Republic of the Congo
Cf Californium	Central African Republic
Cl Chlorine	Chile
Cm Curium	Cameroon
Cn Copernicium	China
Co Cobalt	Colombia
Cr Chromium	Costa Rica
Cs Cesium	Serbia and Montenegro
Cu Copper	Cuba
Er Erbium	Eritrea
Es Einsteinium	Spain
Eu Europium	Europe
Fm Fermium	Federated States of Micronesia
Fr Francium	France
Ga Gallium	Gabon
Gd Gadolinium	Grenada
Ge Germanium	Georgia
In Indium	India
Ir Iridium	Iran
Kr Krypton	South Korea
La Lanthanum	Laos
Li Lithium	Liechtenstein
Lr Lawrencium	Liberia
Lu Lutetium	Luxembourg
Lv Livermorium	Latvia
Md Mendelevium	Moldova
Mg Magnesium	Madagascar
Mn Manganese	Mongolia
Mo Molybdenum	Macau
Mt Meitnerium	Malta
Na Sodium	Namibia
Ne Neon		Niger
Ni Nickel	Nicaragua
No Nobelium	Norway
Np Neptunium	Nepal
Os Osmium	Oman
Pa Protactinium	Panama
Pm Promethium	Saint Pierre and Miquelon
Pr Praseodymium	Puerto Rico
Pt Platinum	Portugal
Re Rhenium	Reunion
Ru Ruthenium	Russia
Sb Antimony	Solomon Islands
Sc Scandium	Seychelles
Se Selenium	Sweden
Sg Seaborgium	Singapore
Si Silicon	Slovenia
Sm Samarium	San Marino
Sn Tin		Senegal
Sr Strontium	Suriname
Tc Technetium	Turks and Caicos Islands
Th Thorium	Thailand
Tm Thulium	Turkmenistan
Elemental domain names that exist:
Oh, let's do US states as well:
Al Aluminum     Alabama
Ar Argon        Arkansas
Ca Calcium      California
Co Cobalt       Colorado
Fl Flerovium    Florida
Ga Gallium      Georgia
In Indium       Indiana
La Lanthanum    Louisiana
Md Mendelevium  Maryland
Mn Manganese    Minnesota
Mo Molybdenum   Missouri
Mt Meitnerium   Montana
Nd Neodymium    North Dakota
Ne Neon         Nebraska
Pa Protactinium Pennsylvania
Sc Scandium     South Carolina
fanf: (dotat)

This week we will be delegating (the Isaac Newton Institute's domain) to the Faculty of Mathematics, who have been running their own DNS since the very earliest days of Internet connectivity in Cambridge.

Unlike most new delegations, the domain already exists and has a lot of records, so we have to keep them working during the process. And for added fun, is signed with DNSSEC, so we can't play fast and loose.

In the absence of DNSSEC, it is mostly OK to set up the new zone, get all the relevant name servers secondarying it, and then introduce the zone cut. During the rollout, some servers will be serving the domain from the old records in the parent zone, and other servers will serve the domain from the new child zone, which occludes the old records in its parent.

But this won't work with DNSSEC because validators are aware of zone cuts, and they check that delegations across cuts are consistent with the answers they have received. So with DNSSEC, the process you have to follow is fairly tightly constrained to be basically the opposite of the above.

The first step is to set up the new zone on name servers that are completely disjoint from those of the parent zone. This ensures that a resolver cannot prematurely get any answers from the new zone - they have to follow a delegation from the parent to find the name servers for the new zone. In the case of, we are lucky that the Maths name servers satisfy this requirement.

The second step is to introduce the delegation into the parent zone. Ideally this should propagate to all the authoritative servers promptly, using NOTIFY and IXFR.

(I am a bit concerned about DNSSEC software which does validation as a separate process after normal iterative resolution, which is most of it. While the delegation is propagating it is possible to find the delegation when resolving, but get a missing delegation when validating. If the validator is persistent at re-querying for the delegation chain it should be able to recover from this; but quick propagation minimizes the problem.)

After the delegation is present on all the authoritative servers, and old data has timed out of caches, the new child zone can (if necessary) be added to the parent zone's name servers. In our case the central name servers and off-site secondaries also serve the Maths zones, so this step normalizes the setup for

fanf: (dotat)

benchmarking wider-fanout versions of qp tries

QP tries are faster than crit-bit tries because they extract up to four bits of information out of the key per branch, whereas crit-bit tries only get one bit per branch. If branch nodes have more children, then the trie should be shallower on average, which should mean lookups are faster.

So if we extract even more bits of information from the key per branch, it should be even better, right? But, there are downsides to wider fanout.

I originally chose to index by nibbles for two reasons: they are easy to pull out of the key, and the bitmap easily fits in a word with room for a key index.

If we index keys by 5-bit chunks we pay a penalty for more complicated indexing.

If we index keys by 6-bit chunks, trie nodes have to be bigger to fit a bigger bitmap, so we pay a memory overhead penalty.

How do these costs compare to the benefits of wider branches?

I have implemented 5-bit and 6-bit versions of qp tries and benchmarked them. For those interested in the extremely nerdy details, the full version of "never mind the quadbits" is on the qp trie home page.

The tl;dr is, 5-bit qp tries are fairly unequivocally better than 4-bit qp tries. 6-bit qp tries are less clear - they are not faster than 5-bit on my laptop and desktop (both Core 2 Duo), but they are faster on a Xeon server that I tried despite using more memory.

fanf: (dotat)

The inner loop in qp trie lookups is roughly

    while(t->isbranch) {
        b = 1 << key[t->index]; // simplified
        if((t->bitmap & b) == 0) return(NULL);
        t = t->twigs + popcount(t->bitmap & b-1);

The efficiency of this loop depends on how quickly we can get from one indirection down the trie to the next. There is quite a lot of work in the loop, enough to slow it down significantly compared to the crit-bit search loop. Although qp tries are half the depth of crit-bit tries on average, they don't run twice as fast. The prefetch compensates in a big way: without it, qp tries are about 10% faster; with it they are about 30% faster.

I adjusted the code above to emphasize that in one iteration of the loop it accesses two locations: the key, which it is traversing linearly with small skips, so access is fast; and the tree node t, whose location jumps around all over the place, so access is slow. The body of the loop calculates the next location of t, but we know at the start that it is going to be some smallish offset from t->twigs, so the prefetch is very effective at overlapping calculation and memory latency.

It was entirely accidental that prefetching works well for qp tries. I was trying not to waste space, so the thought process was roughly, a leaf has to be two words:

    struct Tleaf { const char *key; void *value; };

Leaves should be embedded in the twig array, to avoid a wasteful indirection, so branches have to be the same size as leaves.

    union Tnode { struct Tleaf leaf; struct Tbranch branch; };

A branch has to have a pointer to its twigs, so there is space in the other word for the metadata: bitmap, index, flags. (The limited space in one word is partly why qp tries test a nibble at a time.) Putting the metadata about the twigs next to the pointer to the twigs is the key thing that makes prefetching work.

One of the inspirations of qp tries was Phil Bagwell's hash array mapped tries. HAMTs use the same popcount trick, but instead of using the PATRICIA method of skipping redundant branch nodes, they hash the key and use the hash as the trie index. The hashes should very rarely collide, so redundant branches should also be rare. Like qp tries, HAMTs put the twig metadata (just a bitmap in their case) next to the twig pointer, so they are friendly to prefetching.

So, if you are designing a tree structure, put the metdadata for choosing which child is next adjacent to the node's pointer in its parent, not inside the node itself. That allows you to overlap the computation of choosing which child is next with the memory latency for fetching the child pointers.

fanf: (dotat)

Crit-bit tries have fixed-size branch nodes and a constant overhead per leaf, which means they can be used as an embedded lookup structure. Embedded lookup structures do not need any extra memory allocation; it is enough to allocate the objects that are to be indexed by the lookup structure.

An embedded lookup structure is a data structure in which the internal pointers used to search for an object (such as branch nodes) are embedded within the objects you are searching through. Each object can be a member of at most one of any particular kind of lookup structure, though an object can simultaneously be a member of several different kinds of lookup structure.

The BSD <sys/queue.h> macros are embedded linked lists. They are used frequently in the kernel, for instance in the network stack to chain mbuf packet buffers together. Each mbuf can be a member of a list and a tailq. There is also a <sys/tree.h> which is used by OpenSSH's privilege separation memory manager. Embedded red-black trees also appear in jemalloc.

embedded crit-bit branch node structure

DJB's crit-bit branch nodes require three words: bit index, left child, and right child; embedded crit-bit branches are the same with an additional parent pointer.

    struct branch {
        uint index;
        void *twig[2];
        void **parent;

The "twig" child pointers are tagged to indicate whether they point to a branch node or a leaf. The parent pointer normally points to the relevant child pointer inside the parent node; it can also point at the trie's root pointer, which means there has to be exactly one root pointer in a fixed place.

(An aside about how I have been counting overhead: DJB does not include the leaf string pointer as part of the overhead of his crit-bit tries, and I have followed his lead by not counting the leaf key and value pointers in my crit-bit and qp tries. So by this logic, although an embedded branch adds four words to an object, it only counts as three words of overhead. Perhaps it would be more honest to count the total size of the data structure.)

using embedded crit-bit tries

For most purposes, embedded crit-bit tries work the same as external crit-bit tries.

When searching for an object, there is a final check that the search key matches the leaf. This check needs to know where to find the search key inside the leaf object - it should not assume the key is at the start.

When inserting a new object, you need to add a branch node to the trie. For external crit-bit tries this new branch is allocated; for embedded crit-bit tries you use the branch embedded in the new leaf object.

deleting objects from embedded crit-bit tries

This is where the fun happens. There are four objects of interest:

  • The doomed leaf object to be deleted;

  • The victim branch object which needs to remain in the trie, although it is embedded in the doomed leaf object;

  • The parent branch object pointing at the leaf, which will be unlinked from the trie;

  • The bystander leaf object in which the parent branch is embedded, which remains in the trie.

The plan is that after unlinking the parent branch from the trie, you rescue the victim branch from the doomed leaf object by moving it into the place vacated by the parent branch. You use the parent pointer in the victim branch to update the twig (or root) pointer to follow the move.

Note that you need to beware of the case where the parent branch happens to be embedded in the doomed leaf object.

exercise for the reader

Are the parent pointers necessary?

Is the movement of branches constrained enough that we will always encounter a leaf's embedded branch in the course of searching for that leaf? If so, we can eliminate the parent pointers and save a word of overhead.


I have not implemented this idea, but following Simon Tatham's encouragement I have written this description in the hope that it inspires someone else.

fanf: (dotat)

tl;dr: I have developed a data structure called a "qp trie", based on the crit-bit trie. Some simple benchmarks say qp tries have about 1/3 less memory overhead and are about 10% 30% faster than crit-bit tries.

"qp trie" is short for "quadbit popcount patricia trie". (Nothing to do with cutie cupid dolls or Japanese mayonnaise!)

Get the code from


Crit-bit tries are an elegant space-optimised variant of PATRICIA tries. Dan Bernstein has a well-known description of crit-bit tries, and Adam Langley has nicely annotated DJB's crit-bit implementation.

What struck me was crit-bit tries require quite a lot of indirections to perform a lookup. I wondered if it would be possible to test multiple bits at a branch point to reduce the depth of the trie, and make the size of the branch adapt to the trie's density to keep memory usage low. My initial attempt (over two years ago) was vaguely promising but too complicated, and I gave up on it.

A few weeks ago I read about Phil Bagwell's hash array mapped trie (HAMT) which he described in two papers, "fast and space efficient trie searches", and "ideal hash trees". The part that struck me was the popcount trick he uses to eliminate unused pointers in branch nodes. (This is also described in "Hacker's Delight" by Hank Warren, in the "applications" subsection of chapter 5-1 "Counting 1 bits", which evidently did not strike me in the same way when I read it!)

You can use popcount() to implement a sparse array of length N containing M < N members using bitmap of length N and a packed vector of M elements. A member i is present in the array if bit i is set, so M == popcount(bitmap). The index of member i in the packed vector is the popcount of the bits preceding i.

    mask = 1 << i;
    if(bitmap & mask)
        member = vector[popcount(bitmap & mask-1)]

qp tries

If we are increasing the fanout of crit-bit tries, how much should we increase it by, that is, how many bits should we test at once? In a HAMT the bitmap is a word, 32 or 64 bits, using 5 or 6 bits from the key at a time. But it's a bit fiddly to extract bit-fields from a string when they span bytes.

So I decided to use a quadbit at a time (i.e. a nibble or half-byte) which implies a 16 bit popcount bitmap. We can use the other 48 bits of a 64 bit word to identify the index of the nibble that this branch is testing. A branch needs a second word to contain the pointer to the packed array of "twigs" (my silly term for sub-tries).

It is convenient for a branch to be two words, because that is the same as the space required for the key+value pair that you want to store at each leaf. So each slot in the array of twigs can contain either another branch or a leaf, and we can use a flag bit in the bottom of a pointer to tell them apart.

Here's the qp trie containing the keys "foo", "bar", "baz". (Note there is only one possible trie for a given set of keys.)

[ 0044 | 1 | twigs ] -> [ 0404 | 5 | twigs ] -> [ value | "bar" ]
                        [    value | "foo" ]    [ value | "baz" ]

The root node is a branch. It is testing nibble 1 (the least significant half of byte 0), and it has twigs for nibbles containing 2 ('b' == 0x62) or 6 ('f' == 0x66). (Note 1 << 2 == 0x0004 and 1 << 6 == 0x0040.)

The first twig is also a branch, testing nibble 5 (the least significant half of byte 2), and it has twigs for nibbles containing 2 ('r' == 0x72) or 10 ('z' == 0x7a). Its twigs are both leaves, for "bar" and "baz". (Pointers to the string keys are stored in the leaves - we don't copy the keys inline.)

The other twig of the root branch is the leaf for "foo".

If we add a key "qux" the trie will grow another twig on the root branch.

[ 0244 | 1 | twigs ] -> [ 0404 | 5 | twigs ] -> [ value | "bar" ]
                        [    value | "foo" ]    [ value | "baz" ]
                        [    value | "qux" ]

This layout is very compact. In the worst case, where each branch has only two twigs, a qp trie has the same overhead as a crit-bit trie, two words (16 bytes) per leaf. In the best case, where each branch is full with 16 twigs, the overhead is one byte per leaf.

When storing 236,000 keys from /usr/share/dict/words the overhead is 1.44 words per leaf, and when storing a vocabulary of 54,000 keys extracted from the BIND9 source, the overhead is 1.12 words per leaf.

For comparison, if you have a parsimonious hash table which stores just a hash code, key, and value pointer in each slot, and which has 90% occupancy, its overhead is 1.33 words per item.

In the best case, a qp trie can be a quarter of the depth of a crit-bit trie. In practice it is about half the depth. For our example data sets, the average depth of a crit-bit trie is 26.5 branches, and a qp trie is 12.5 for dict/words or 11.1 for the BIND9 words.

My benchmarks show qp tries are about 10% faster than crit-bit tries. However I do not have a machine with both a popcount instruction and a compiler that supports it; also, LLVM fails to optimise popcount for a 16 bit word size, and GCC compiles it as a subroutine call. So there's scope for improvement.

crit-bit tries revisited

DJB's published crit-bit trie code only stores a set of keys, without any associated values. It's possible to add support for associated values without increasing the overhead.

In DJB's code, branch nodes have three words: a bit index, and two pointers to child nodes. Each child pointer has a flag in its least significant bit indicating whether it points to another branch, or points to a key string.

[ branch ] -> [ 3      ]
              [ branch ] -> [ 5      ]
              [ "qux"  ]    [ branch ] -> [ 20    ]
                            [ "foo"  ]    [ "bar" ]
                                          [ "baz" ]

It is hard to add associated values to this structure without increasing its overhead. If you simply replace each string pointer with a pointer to a key+value pair, the overhead is 50% greater: three words per entry in addition to the key+value pointers.

When I wanted to benchmark my qp trie implementation against crit-bit tries, I trimmed the qp trie code to make a crit-bit trie implementation. So my crit-bit implementation stores keys with associated values, but still has an overhead of only two words per item.

[ 3 twigs ] -> [ 5   twigs ] -> [ 20  twigs ] -> [ val "bar" ]
               [ val "qux" ]    [ val "foo" ]    [ val "baz" ]

Instead of viewing this as a trimmed-down qp trie, you can look at it as evolving from DJB's crit-bit tries. First, add two words to each node for the value pointers, which I have drawn by making the nodes wider:

[ branch ] ->      [ 3    ]
              [ x  branch ] ->      [ 5    ]
              [ val "qux" ]    [ x  branch ] ->       [ 20  ]
                               [ val "foo" ]    [ val "bar" ]
                                                [ val "baz" ]

The value pointers are empty (marked x) in branch nodes, which provides space to move the bit indexes up a level. One bit index from each child occupies each empty word. Moving the bit indexes takes away a word from every node, except for the root which becomes a word bigger.


This code was pretty fun to write, and I'm reasonably pleased with the results. The debugging was easier than I feared: most of my mistakes were simple (e.g. using the wrong variable, failing to handle a trivial case, muddling up getline()s two length results) and clang -fsanitize=address was a mighty debugging tool.

My only big logic error was in Tnext(); I thought it was easy to find the key lexicographically following some arbitrary string not in the trie, but it is not. (This isn't a binary search tree!) You can easily find the keys with a given prefix, if you know in advance the length of the prefix. But, with my broken code, if you searched for an arbitrary string you could end up in a subtrie which was not the subtrie with the longest matching prefix. So now, if you want to delete a key while iterating, you have to find the next key before deleting the previous one.


I have this nice code, but I have no idea what practical use I might put it to!


I have done some simple tuning of the inner loops and qp tries are now about 30% faster than crit-bit tries.

fanf: (dotat)

RFC 2317 describes classless delegation. The reverse DNS name space allows for delegation at octet boundaries, /8 or /16 or /24, so DNS delegation dots match the dots in IP addresses. This worked OK for the old classful IP address architecture, but classless routing allows networks with longer prefixes, e.g. /26, which makes it tricky to delegate the reverse DNS to the user of the IP address space.

The RFC 2317 solution is a clever use of standard DNS features. If you have been delegated, in the parent reverse DNS zone your ISP sets up CNAME records instead of PTR records for all 64 addresses in the subnet. These CNAME records point into some other zone controlled by you where the actual PTR records can be found.

RFC 2317 suggests subdomain names like 128/ but the / in the label is a bit alarming to people and tools that expect domain names to follow strict letter-digit-hyphen syntax. You can just as well use a subdomain name like or even put the PTR records in a completely different branch of the DNS.

For shorter prefixes it is still normal to delegate reverse DNS at multiple-of-8 boundaries. This can require an awkwardly large number of zones, especially if your prefix is a multiple-of-8 plus one. For instance, our Computer Laboratory delegated half of their /16 to be used by the rest of the University, and we are preparing to use half of to provide address space for our wireless service. Both of these address space allocations would normally require 128 zones to be delegated.

Fortunately there is a way to reduce this to one zone, analogous to the RFC 2317 trick, but using a more modern DNS feature, the DNAME record (RFC 2672 RFC 6672). So RFC 2317 says, for long prefixes you replace a lot of PTR records with CNAME records pointing into another zone. Correspondingly, for short prefixes you replace a lot of delegations (NS and DS records) with DNAME records pointing into another zone.

We are using this technique in so that we only need to have one reverse DNS zone instead of 128 zones. The DNAME records point from (for instance) to Yes, we are using part of the "forward" DNS namespace to hold "reverse" DNS records! The apex of this zone is so we can in principle consolidate any other reverse DNS address space into this same zone.

This works really nicely - DNAME support is sufficiently widespread that it mostly just works, with a few caveats mainly affecting outgoing mail servers.

For we are planning to do use the DNAME trick again. However, because it is private address space we don't want to consolidate it into the public zone. The options are to use (which would allow us to consolidate if we choose) or which would be more similar to usual RFC 2317 style.

Example zone file for short-prefix classless delegation:

    $TTL 1h

    @         SOA (
                    1 30m 15m 1w 1h )


    0-127     NS

    128-255   NS

    $GENERATE 0-127   $ DNAME $.0-127
    $GENERATE 128-255 $ DNAME $.128-255

With classless delegations like that your PTR records have names like

(I should probably write this up as an Internet-Draft but a quick blog post will do for now.)

fanf: (passport)
I visited [ profile] rmc28 this morning. She's having a hard time this week: a secondary infection has given her a high fever. (Something like this is apparently typical at this point in the treatment.) The antibiotics are playing havoc with her guts and her hair is starting to come out. She's very tired and dopey, finding it difficult to think or read or do very much at all. Rough.

Visitors are still welcome, though you'll probably find she's happy to listen but not very talkative.
fanf: (passport)

At some point, 10 or 15 years ago, I got into the habit of saying goodbye to people by saying "stay well!"

I like it because it is cheerful, it closes a conversation without introducing a new topic, and it feels meaningful without being stilted (like "farewell") or rote (like "goodbye").


"Stay well" works nicely in a group of healthy people, but it is problematic with people who are ill.

Years ago, before "stay well!" was completely a habit, a colleague got prostate cancer. The treatment was long and brutal. I had to be careful when saying goodbye, but I didn't break the habit.

It is perhaps even worse with people who are chronically ill, because "stay well" (especially when I say it) has a casually privileged assumption that I am saying it to people who are already cheerfully well.

In the last week this phrase has got a new force for me. I really do mean "stay well" more than ever, but I wish I could express it without implying that you are already well or that it is your duty to be well.

fanf: (passport)

We have not been able to visit Rachel this weekend owing to an outbreak of vomiting virus. Nico had it on Friday evening, I had it late last night and Charles started a few hours after me.

Seems to have a 50h ish incubation time, vomiting for not very long, fever. Nico seems to have constipation now, possibly due to not drinking enough?

It's quite infectious and unpleasant so we are staying in. We have had lovely offers of help from lots of people but in this state I don't feel like we can organise much for a couple of days.

Since we can't visit Rachel we tried Skype briefly yesterday evening, though it was pretty crappy as usual for Skype.

Rachel was putting on a brave face on Friday and asked me to post these pictures:

fanf: (dotat)

Rachel has been in hospital since Monday, and on Thursday they told her she has Leukaemia, fortunately a form that is usually curable.

She started chemotherapy on Thursday evening, and it is very rough, so she is not able to have visitors right now. We'll let you know when that changes, but we expect that the side-effects will get worse for a couple of weeks.

To keep her spirits up, she would greatly appreciate small photos of cute children/animals or beautiful landscapes. Send them by email to or on Twitter to @rmc28, or you can send small poscards to Ward D6 at Addenbrookes.

Flowers are not allowed on the ward, and no food gifts please because nausea is a big problem. If you want to send a gift, something small and pretty and/or interestingly tactile is suitable.

Rachel is benefiting from services that rely on donations, so you might also be about to help by giving blood - for instance you can donate at Addenbrookes. Or you might donate to Macmillan Cancer Support.

And, if you have any niggling health problems, or something weird is happening or getting worse, do get it checked out. Rachel's leukaemia came on over a period of about three weeks and would have been fatal in a couple of months if untreated.

fanf: (dotat)

I feel like

Most of the items below involve chunks of software development that are or will be open source. Too much of this, perhaps, is me saying "that's not right" and going in and fixing things or re-doing it properly...

Exchange 2013 recipient verification and LDAP, autoreplies to bounce messages, Exim upgrade, and logging

We have a couple of colleges deploying Exchange 2013, which means I need to implement a new way to do recipient verification for their domains. Until now, we have relied on internal mail servers rejecting invalid recipients when they get a RCPT TO command from our border mail relay. Our border relay rejects invalid recipients using Exim's verify=recipient/callout feature, so when we get a RCPT TO we try it on the target internal mail server and immediately return any error to the sender.

However Exchange 2013 does not reject invalid recipients until it has received the whole message; it rejects the entire message if any single recipient is invalid. This means if you have several users on a mailing list and one of them leaves, all of your users will stop receiving messages from the list and will eventually be unsubscribed.

The way to work around this problem is (instead of SMTP callout verification) to do LDAP queries against the Exchange server's Active Directory. To support this we need to update our Exim build to include LDAP support so we can add the necessary configuration.

Another Exchange-related annoyance is that we (postmaster) get a LOT of auto-replies to bounce messages, because Exchange does not implement RCF 3834 properly and instead has a non-standard Microsoft proprietary header for suppressing automatic replies. I have a patch to add Microsoft X-Auto-Response-Suppress headers to Exim's bounce messages which I hope will reduce this annoyance.

When preparing an Exim upgrade that included these changes, I found that exim-4.86 included a logging change that makes the +incoming_interface option also log the outgoing interface. I thought I might be able to make this change more consistent with the existing logging options, but sadly I missed the release deadline. In the course of fixing this I found Exim was running out of space for logging options so I have done a massive refactoring to make them easily expandible.

gitweb improvements

For about a year we have been carrying some patches to gitweb which improve its handling of breadcrumbs, subdirectory restrictions, and categories. These patches are in use on our federated gitolite service, git.csx. They have not been incorporated upstream owing to lack of code review. But at last I have some useful feedback so with a bit more work I should be able to get the patches committed so I can run stock gitweb again.

Delegation updates, and superglue

We now subscribe to the ISC SNS service. As well as running, the ISC run a global anycast secondary name service used by a number of ccTLDs and other deserving organizations. Fully deploying the delegation change for our 125 public zones has been delayed an embarrassingly long time.

When faced with the tedious job of updating over 100 delegations, I think to myself, I know, I'll automate it! We have domain registrations with JANET (for and some reverse zones), Nominet (other .uk), Gandi (non-uk), and RIPE (other reverse zones), and they all have radically different APIs: EPP (Nominet), XML-RPC (Gandi), REST (RIPE), and, um, CasperJS (JANET).

But the automation is not just for this job: I want to be able to automate DNSSEC KSK rollovers. My plan is to have some code that takes a zone's apex records and uses the relevant registrar API to ensure the delegation matches. In practice KSK rollovers may use a different DNSKEY RRset than the zone's published one, but the principle is to make the registrar interface uniformly dead simple by encapsulating the APIs and non-APIs.

I have some software called "superglue" which nearly does what I want, but it isn't quite finished and at least needs its user interface and internal interfaces made consistent and coherent before I feel happy suggesting that others might want to use it.

But I probably have enough working to actually make the delegation changes so I seriously need to go ahead and do that and tell the hostmasters of our delegated subdomains that they can use ISC SNS too.

Configuration management of secrets

Another DNSSEC-related difficulty is private key management - and not just DNSSEC, but also ssh (host keys), API credentials (see above), and other secrets.

What I want is something for storing encrypted secrets in git. I'm not entirely happy with existing solutions. Often they try to conceal whether my secrets are in the clear or not, whereas I want it to be blatantly obvious whether I am in the safe or risky state. Often they use home-brew crypto whereas I would be much happier with something widely-reviewed like gpg.

My current working solution is a git repo containing a half-arsed bit of shell and a makefile that manage a gpg-encrypted tarball containing a git repo full of secrets. As a background project I have about 1/3rd of a more refined "git privacy guard" based on the same basic principle, but I have not found time to work on it seriously since March. Which is slightly awkward because when finished it should make some of my other projects significantly easier.

DNS RPZ, and metazones

My newest project is to deploy a "DNS firewall", that is, blocking access to malicious domains. The aim is to provide some extra coverage for nasties that get through our spam filters or AV software. It will use BIND's DNS response policy zones feature, with a locally-maintained blacklist and whitelist, plus subscriptions to commercial blacklists.

The blocks will only apply to people who use our default recursive servers. We will also provide alternative unfiltered servers for those who need them. Both filtered and unfiltered servers will be provided by the same instances of BIND, using "views" to select the policy.

This requires a relatively simple change to our BIND dynamic configuration update machinery, which ensures that all our DNS servers have copies of the right zones. At the moment we're using ssh to push updates, but I plan to eliminate this trust relationship leaving only DNS TSIG (which is used for zone transfers). The new setup will use nsnotifyd's simplified metazone support.

I am amused that nsnotifyd started off as a quick hack for a bit of fun but rapidly turned out to have many uses, and several users other than me!

Other activities

I frequently send little patches to the BIND developers. My most important patch (as in, running in production) which has not yet been committed upstream is automatic size limits for zone journal files.

Mail and DNS user support. Say no more.

IETF activities. I am listed as an author of the DANE SRV draft which is now in the RFC editor queue. (Though I have not had the tuits to work on DANE stuff for a long time now.)

Pending projects

Things that need doing but haven't reached the top of the list yet include:

  • DHCP server refresh: upgrade OS to the same version as the DNS servers and combine the DHCP Ansible setup into the main DNS Ansible setup.
  • Federated DHCP log access: so other University institutions that are using our DHCP service have some insight into what is happening on their networks.
  • Ansible management of the ipreg stuff on Jackdaw.
  • DNS records for mail authentication: SPF/DKIM/DANE. (We have getting on for 400 mail domains with numerous special cases so this is not entirely trivial.)
  • Divorce ipreg database from Jackdaw.
  • Overhaul ipreg user interface or replace it entirely. (Multi-year project.)

fanf: (dotat)

I have released nsdiff-1.70 which is now available from the nsdiff home page. nsdiff creates an nsupdate script from the differences between two versions of a zone. We use it at work for pushing changes from our IP Register database into our DNSSEC signing server.

This release incorporates a couple of suggestions from Jordan Rieger of The first relaxes domain name syntax in places where domain names do not have to be host names, e.g. in the SOA RNAME field, which is the email address of the people responsible for the zone. The second allows you to optionally choose case-insensitive comparison of records.

The other new feature is an nsvi command which makes it nice and easy to edit a dynamic zone. Why didn't I write this years ago? It was inspired by a suggestion from @jpmens and @Habbie on Twitter and fuelled by a few pints of Citra.

fanf: (dotat)

nsnotifyd is my tiny DNS server that only handles DNS NOTIFY messages by running a command. (See my announcement of nsnotifyd and the nsnotifyd home page.)

At Cambridge we have a lot of stealth secondary name servers. We encourage admins who run resolvers to configure them in this way in order to resolve names in our private zones; it also reduces load on our central resolvers which used to be important. This is documented in our sample configuration for stealth nameservers on the CUDN.

The problem with this is that a stealth secondary can be slow to update its copy of a zone. It doesn't receive NOTIFY messages (because it is stealth) so it has to rely on the zone's SOA refresh and retry timing parameters. I have mitigated this somewhat by reducing our refresh timer from 4 hours to 30 minutes, but it might be nice to do better.

A similar problem came up in another scenario recently. I had a brief exchange with someone at JANET about DNS block lists and response policy zones in particular. RPZ block lists are distributed by standard zone transfers. If the RPZ users are stealth secondaries then they are not going to get updates in a very timely manner. (They might not be entirely stealth: RPZ vendors maintain ACLs listing their customers which they might also use for sending notifies.) JANET were concerned that if they provided an RPZ mirror it might exacerbate the staleness problem.

So I thought it might be reasonable to:

  • Analyze a BIND log to extract lists of zone transfer clients, which are presumably mostly stealth secondaries. (A little script called nsnotify-liststealth)
  • Write a tool called nsnotify-fanout to send notify messages to a list of targets.
  • And hook them up to nsnotifyd with a script called nsnotify2stealth.

The result is that you can just configure your authoritative name server to send NOTIFYs to nsnotifyd, and it will automatically NOTIFY all of your stealth secondaries as soon as the zone changes.

This seems to work pretty well, but there is a caveat!

You will now get a massive thundering herd of zone transfers as soon as a zone changes. Previously your stealth secondaries would have tended to spread their load over the SOA refresh period. Not any more!

The ISC has a helpful page on tuning BIND for high zone transfer volume which you should read if you want to use nsnotify2stealth.

fanf: (dotat)
About ten days ago, I was wondering how I could automatically and promptly record changes to a zone in git. The situation I had in mind was a special zone which was modified by nsupdate but hosted on a server whose DR plan is "rebuild from Git using Ansible". So I thought it would be a good idea to record updates in git so they would not be lost.

My initial idea was to use BIND's update-policy external feature, which allows you to hook into its dynamic update handling. However that would have problems with race conditions, since the update handler doesn't know how much time to allow for BIND to record the update.

So I thought it might be better to write a DNS NOTIFY handler, which gets told about updates just after they have been recorded. And I thought that wiring a DNS server daemon would not be much harder than writing an update-policy external daemon.

@jpmens responded with enthusiasm, and a some hours later I had something vaguely working.

What I have written is a special-purpose DNS server called nsnotifyd. It is actually a lot more general than just recording a zone's history in git. You can run any command as soon as a zone is changed - the script for recording changes in git is just one example.

Another example is running nsdiff to propagate changes from a hidden master to a DNSSEC signer. You can do the same job with BIND's inline-signing mode, but maybe you need to insert some evil zone mangling into the process, say...

Basically, anywhere you currently have a cron job which is monitoring updates to DNS zones, you might want to run it under nsnotifyd instead, so your script runs as soon as the zone changes instead of running at fixed intervals.

Since a few people expressed an interest in this program, I have written documentation and packaged it up, so you can download it from the nsnotifyd home page, or from one of several git repository servers.

(Epilogue: I realised halfway through this little project that I had a better way of managing my special zone than updating it directly with nsupdate. Oh well, I can still use nsnotifyd to drive @diffroot!)
fanf: (dotat)

Here are a couple of fun ways to reduce repetitive code in C. To illustrate them, I'll implement FizzBuzz with the constraint that I must mention Fizz and Buzz only once in each implementation.

The generic context is that I want to declare some functions, and each function has an object containing some metadata. In this case the function is a predicate like "divisible by three" and the metadata is "Fizz". In a more realistic situation the function might be a driver entry point and the metadata might be a description of the hardware it supports.

Both implementations of FizzBuzz fit into the following generic FizzBuzz framework, which knows the general form of the game but not the specific rules about when to utter silly words instead of numbers.

    #include <stdbool.h>
    #include <stdio.h>
    #include <err.h>
    // predicate metadata
    typedef struct pred {
        bool (*pred)(int);
        const char *name;
    } pred;
    // some other file declares a table of predicates
    #include PTBL
    static bool putsafe(const char *s) {
        return s != NULL && fputs(s, stdout) != EOF;
    int main(void) {
        for (int i = 0; true; i++) {
            bool done = false;
            // if a predicate is true, print its name
            for (pred *p = ptbl; p < ptbl_end; p++)
                done |= putsafe(p->pred(i) ? p->name : NULL);
            // if no predicate was true, print the number
            if (printf(done ? "\n" : "%d\n", i) < 0)
                err(1, "printf");

To compile this code you need to define the PTBL macro to the name of a file that implements a FizzBuzz predicate table.

Higher-order cpp macros

A higher-order macro is a macro which takes a macro as an argument. This can be useful if you want the macro to do different things each time it is invoked.

For FizzBuzz we use a higher-order macro to tersely define all the predicates in one place. What this macro actually does depends on the macro name p that we pass to it.

    #define predicates(p) \
        p(Fizz, i % 3 == 0) \
        p(Buzz, i % 5 == 0)

We can then define a function-defining macro, and pass it to our higher-order macro to define all the predicate functions.

    #define pred_fun(name, test) \
        static bool name(int i) { return test; }

And we can define a macro to declare a metadata table entry (using the C preprocessor stringification operator), and pass it to our higher-order macro to fill in the whole metadata table.

    #define pred_ent(name, test) { name, #name },
    pred ptbl[] = {

For the purposes of the main program we also need to declare the end of the table.

    #define ptbl_end (ptbl + sizeof(ptbl)/sizeof(*ptbl))

Higher-order macros can get unweildy, especially if each item in the list is large. An alternative is to use a higher-order include file, where instead of passing a macro to another macro, you #define a macro with a particular name, #include a file of macro invocations, then #undef the special macro. This saves you from having to end dozens of lines with backslash continuations.

ELF linker sets

The linker takes collections of definitions, separates them into different sections (e.g. code and data), and concatenates each section into a contiguous block of memory. The effect is that although you can interleave code and data in your C source file, the linker disentangles the little pieces into one code section and one data section.

You can also define your own sections, if you like, by using gcc declaration attributes, so the linker will gather the declarations together in your binary regardless of how spread out they were in the source. The FreeBSD kernel calls these "linker sets" and uses them extensively to construct lists of drivers, initialization actions, etc.

This allows us to declare the metadata for a FizzBuzz predicate alongside its function definition, and use the linker to gather all the metadata into the array expected by the main program. The key part of the macro below is the __attribute__((section("pred"))).

    #define predicate(name, test) \
        static bool name(int i) { return test; } \
        pred pred_##name __attribute__((section("pred"))) \
            = { name, #name }

With that convenience macro we can define our predicates in whatever order or in whatever files we want.

    predicate(Fizz, i % 3 == 0);
    predicate(Buzz, i % 5 == 0);

To access the metadata, the linker helpfully defines some symbols identifying the start and end of the section, which we can pass to our main program.

    extern pred __start_pred[], __stop_pred[];
    #define ptbl    __start_pred
    #define ptbl_end __stop_pred

Source code

git clone

fanf: (dotat)

I haven't got round to setting up proper performance monitoring for our DNS servers yet, so I have been making fairly ad-hoc queries against the BIND statistics channel and pulling out numbers with jq.

Last week I changed our new DNS setup for more frequent DNS updates. As part of this change I reduced the TTL on all our records from one day to one hour. The obvious question was, how would this affect the query rate on our servers?

So I wrote a simple monitoring script. The first version did,

    while sleep 1

But the fetch-and-print-stats part took a significant fraction of a second, so the queries-per-second numbers were rather bogus.

A better way to do this is to run `sleep` in the background, while you fetch-and-print-stats in the foreground. Then you can wait for the sleep to finish and loop back to the start. The loop should take almost exactly a second to run (provided fetch-and-print-stats takes less than a second). This is pretty similar to an alarm()/wait() sequence in C. (Actually no, that's bollocks.)

My dnsqps script also abuses `eval` a lot to get a shonky Bourne shell version of associative arrays for the per-server counters. Yummy.

So now I was able to get queries-per-second numbers from my servers, what was the effect of dropping the TTLs? Well, as far as I can tell from eyeballing, nothing. Zilch. No visible change in query rate. I expected at least some kind of clear increase, but no.

The current version of my dnsqps script is:

    while :
        sleep 1 & # set an alarm
        for s in "$@"
	    total=$(curl --silent http://$s:853/json/v1/server |
                    jq -r '.opcodes.QUERY')
            eval inc='$((' $total - tot$s '))'
            eval tot$s=$total
            printf ' %5d %s' $inc $s
        printf '\n'
        wait # for the alarm

fanf: (dotat)

Last week I rolled out my new DNS servers. It was reasonably successful - a few snags but no showstoppers.

Authoritative DNS rollout playbook

I have already written about scripting the recursive DNS rollout. I also used Ansible for the authoritative DNS rollout. I set up the authdns VMs with different IP addresses and hostnames (which I will continue to use for staging/testing purposes); the rollout process was:

  • Stop the Solaris Zone on the old servers using my zoneadm Ansible module;
  • Log into the staging server and add the live IP addresses;
  • Log into the live server and delete the staging IP addresses;
  • Update the hostname.

There are a couple of tricks with this process.

You need to send a gratuitous ARP to get the switches to update their forwarding tables quickly when you move an IP address. Solaris does this automatically but Linux does not, so I used an explicit arping -U command. On Debian/Ubuntu you need the iputils-arping package to get a version of arping which can send gratuitous ARPs (The arping package is not the one you want. Thanks to Peter Maydell for helping me find the right one!)

If you remove a "primary" IPv4 address from an interface on Linux, it also deletes all the other IPv4 addresses on the same subnet. This is not helpful when you are renumbering a machine. To avoid this problem you need to set sysctl net.ipv4.conf.eth0.promote_secondaries=1.

Pre-rollout configuration checking

The BIND configuration on my new DNS servers is rather different to the old ones, so I needed to be careful that I had not made any mistakes in my rewrite. Apart from re-reading configurations several times, I used a couple of tools to help me check.


I used bzl, the BIND zone list tool by JP Mens to get the list of configured zones from each of my servers. This helped to verify that all the differences were intentional.

The new authdns servers both host the same set of zones, which is the union of the zones hosted on the old authdns servers. The new servers have identical configs; the old ones did not.

The new recdns servers differ from the old ones mainly because I have been a bit paranoid about avoiding queries for martian IP address space, so I have lots of empty reverse zones.


I used my tool nsdiff to verify that the new DNS build scripts produce the same zone files as the old ones. (Except for th HINFO records which the new scripts omit.)

(This is not quite an independent check, because nsdiff is part of the new DNS build scripts.)


On Monday I sent out the DNS server upgrade announcement, with some wording improvements suggested by my colleagues Bob Dowling and Helen Sargan.

It was rather arrogant of me to give the expected outage times without any allowance for failure. In the end I managed to hit 50% of the targets.

The order of rollout had to be recursive servers first, since I did not want to swap the old authoritative servers out from under the old recursive servers. The new recursive servers get their zones from the new hidden master, whereas the old recursive servers get them from the authoritative servers.

The last server to be switched was authdns0, because that was the old master server, and I didn't want to take it down without being fairly sure I would not have to roll back.

ARP again

The difference in running time between my recdns and authdns scripts bothered me, so I investigated and discovered that IPv4 was partially broken. Rob Bricheno helped by getting the router's view of what was going on. One of my new Linux boxes was ARPing for a testdns IP address, even after I had deconfigured it!

I fixed it by rebooting, after which it continued to behave correctly through a few rollout / backout test runs. My guess is that the problem was caused when I was getting gratuitous ARPs working - maybe I erroneously added a static ARP entry.

After that all switchovers took about 5 - 15 seconds. Nice.

Status checks

I wrote a couple of scripts for checking rollout status and progress. wheredns tells me where each of our service addresses is running (old or new); pingdns repeatedly polls a server. I used pingdns to monitor when service was lost and when it returned during the rollout process.

Step 1: recdns1

On Tuesday shortly after 18:00, I switched over recdns1. This is our busier recursive server, running at about 1500 - 2000 queries per second during the day.

This rollout went without a hitch, yay!

Afterwards I needed to reduce the logging because it was rather too noisy. The logging on the old servers was rather too minimal for my tastes, but I turned up the verbosity a bit too far in my new configuration.

Step 2a: recdns0

On Wednesday morning shortly after 08:00, I switched over recdns0. It is a bit less busy, running about 1000 - 1500 qps.

This did not go so well. For some reason Ansible appeared to hang when connecting to the new recdns cluster to push the updated keepalived configuration.

Unfortunately my back-out scripts were not designed to cope with a partial rollout, so I had to restart the old Solaris Zone manually, and recdns0 was unavailable for a minute or two.

Mysteriously, Ansible connected quickly outside the context of my rollout scripts, so I tried the rollout again and it failed in the same way.

As a last try, I ran the rollout steps manually, which worked OK although I don't type as fast as Ansible runs a playbook.

So in all there was about 5 minutes downtime.

I'm not sure what went wrong; perhaps I just needed to be a bit more patient...

Step 2b: authdns1

After doing recdns0 I switched over authdns1. This was a bit less stressy since it isn't directly user-facing. However it was also a bit messy.

The problem this time was me forgetting to uncomment authdns1 from the Ansible inventory (its list of hosts). Actually, I should not have needed to uncomment it manually - I should have scripted it. The silly thing is that I had the testdns servers in the inventory for testing the authdns rollout scripts; the testdns servers were causing me some benign irritation (connection failures) when running ansible in the previous week or so. I should not have ignored this irritation and (like I did with the recdns rollout script) automated it away.

Anyway, after a partial rollout and manual rollback, it took me a few ansible-playbook --check runs to work out why Ansible was saying "host not found". The problem was due to the Jinja expansion in the following remote command, where the "to" variable was set to "" which was not in the inventory.

    ip addr add {{hostvars[to].ipv6}}/64 dev eth0

You can reproduce this with a command like,

    ansible -m debug -a 'msg={{hostvars["funted"]}}' all

After fixing that, by uncommenting the right line in the inventory, the rollout worked OK.

The other post-rollout fix was to ensure all the secondary zones had transferred OK. I had not managed to get all of our masters to add my staging servers to their ACLs, but this was not to hard to sort out using the BIND 9.10 JSON statistics server and the lovely jq command:

    curl |
    jq -r '.views[].zones[] | select(.serial == 4294967295) | .name' |
    xargs -n1 rndc -s refresh

After that, I needed to reduce the logging again, because the authdns servers get a whole different kind of noise in the logs!

Lurking bug: rp_filter

One mistake sneaked out of the woodwork on Wednesday, with fortunately small impact.

My colleague Rob Bricheno reported that client machines on (the same subnet as recdns1) were not able to talk to recdns0, I could see the queries arriving with tcpdump, but they were being dropped somewhere in the kernel.

Malcolm Scott helpfully suggested that this was due to Linux reverse path filtering on the new recdns servers, which are multihomed on both subnets. Peter Benie advised me of the correct setting,

    sysctl net.ipv4.conf.em1.rp_filter=2

Step 3: authdns0

On Thursday evening shortly after 18:00, I did the final switch-over of authdns0, the old master.

This went fine, yay! (Actually, more like 40s than the expected 15s, but I was patient, and it was OK.)

There was a minor problem that I forgot to turn off the old DNS update cron job, so it bitched at us a few times overnight when it failed to send updates to its master server. Poor lonely cron job.

One more thing

Over the weekend my email servers complained that some of their zones had not been refreshed recently. This was because four of our RFC 1918 private reverse DNS zones had not been updated since before the switch-over.

There is a slight difference in the cron job timings on the old and new setups: previously updates happened at 59 minutes past the hour, now they happen at 53 minutes past (same as the DNS port number, for fun and mnemonics). Both setups use Unix time serial numbers, so they were roughly in sync, but due to the cron schedul the old servers had a serial number about 300 higher.

BIND on my mail servers was refusing to refresh the zone because it had copies of the zones from the old servers with a higher serial number than the new servers.

I did a sneaky nsupdate add and delete on the relevant zones to update their serial numbers and everything is happy again.

To conclude

They say a clever person can get themselves out of situations a wise person would not have got into in the first place. I think the main wisdom to take away from this is not to ignore minor niggles, and to write rollout/rollback scripts that can work forwards or backwards after being interrupted at any point. I won against the niggles on the ARP problem, but lost against them on the authdns inventory SNAFU.

But in the end it pretty much worked, with only a few minutes downtime and only one person affected by a bug. So on the whole I feel a bit like Mat Ricardo.

fanf: (dotat)

The last couple of weeks have been a bit slow, being busy with email and DNS support, an unwell child, and surprise 0day. But on Wednesday I managed to clear the decks so that on Thursday I could get down to some serious rollout planning.

My aim is to do a forklift upgrade of our DNS servers - a tier 1 service - with negligible downtime, and with a backout plan in case of fuckups.

Solaris Zones

Our old existing DNS service is based on Solaris Zones. The nice thing about this is that I can quickly and safely halt a zone - which stops the software and unconfigures the network interface - and if the replacement does not work I can restart the zone - which brings up the interfaces and the software.

Even better, the old servers have a couple of test zones which I can bounce up and down without a care. These give me enormous freedom to test my migration scripts without worrying about breaking things and with a high degree of confidence that my tests are very similar to the real thing.

Testability gives you confidence, and confidence gives you productivity.

Before I started setting up our new recursive DNS servers, I ran zoneadm -z testdns* halt on the old servers so that I could use the testdns addresses for developing and testing our keepalived setup. So I had the testdns zones in reserve for developing and testing the rollout/backout scripts.

Rollout plans

The authoritative and recursive parts of the new setup are quite different, so they require different rollout plans.

On the authoritative side we will have a virtual machine for each service address. I have not designed the new authoritative servers for any server-level or network-level high availability, since the DNS protocol should be able to cope well enough. This is similar in principle to our existing Solaris Zones setup. The vague rollout plan is to set up new authdns servers on standby addresses, then renumber them to take over from the old servers. This article is not about the authdns rollout plan.

On the recursive side, there are four physical servers any of which can host any of the recdns or testdns addresses, managed by keepalived. The vague rollout plan is to disable a zone on the old servers then enable its service address on the keepalived cluster.

Ansible - configuration vs orchestration

So far I have been using Ansible in a simple way as a configuration management system, treating it as a fairly declarative language for stating what the configuration of my servers should be, and then being able to run the playbooks to find out and/or fix where reality differs from intention.

But Ansible can also do orchestration: scripting a co-ordinated sequence of actions across disparate sets of servers. Just what I need for my rollout plans!

When to write an Ansible module

The first thing I needed was a good way to drive zoneadm from Ansible. I have found that using Ansible as a glorified shell script driver is pretty unsatisfactory, because its shell and command modules are too general to provide proper support for its idempotence and check-mode features. Rather than messing around with shell commands, it is much more satisfactory (in terms of reward/effort) to write a custom module.

My zoneadm module does the bare minimum: it runs zoneadm list -pi to get the current state of the machine's zones, checks if the target state matches the current state, and if not it runs zoneadm boot or zoneadm halt as required. It can only handle zone states that are "installed" or "running". 60 lines of uncomplicated Python, nice.

Start stupid and expect to fail

After I had a good way to wrangle zoned it was time to do a quick hack to see if a trial rollout would work. I wrote the following playbook which does three things: move the testdns1 zone from running to installed, change the Ansible configuration to enable testdns1 on the keepalived cluster, then push the new keepalived configuration to the cluster.

- hosts:
    - zoneadm: name=testdns1 state=installed
- hosts: localhost
    - command: bin/vrrp_toggle rollout testdns1
- hosts: rec
    - keepalived

This is quick and dirty, hardcoded all the way, except for the vrrp_toggle command which is the main reality check.

The vrrp_toggle script just changes the value of an Ansible variable called vrrp_enable which lists which VRRP instances should be included in the keepalived configuration. The keepalived configuration is generated from a Jinja2 template, and each vrrp_instance (testdns1 etc.) is emitted if the instance name is not commented out of the vrrp_enable list.


Ansible does not re-read variables if you change them in the middle of a playbook like this. Good. That is the right thing to do.

The other way in which this playbook is stupid is there are actually 8 of them: 2 recdns plus 2 testdns, rollout and backout. Writing them individually is begging for typos; repeated code that is similar but systematically different is one of the most common ways to introduce bugs.

Learn from failure

So the right thing to do is tweak the variable then run the playbook. And note the vrrp_toggle command arguments describe almost everything you need to know to generate the playbook! (The only thing missing is the mapping from instance name (like testdns1) to parent host (like helen2).

So I changed the vrrp_toggle script into a rec-rollout / rec-backout script, which tweaks the vrrp_enable variable and generates the appropriate playbook. The playbook consists of just two tasks, whose order depends on whether we are doing rollout or backout, and which have a few straightforward place-holder substitutions.

The nice thing about this kind of templating is that if you screw it up (like I did at first), usually a large proportion of the cases fail, probably including your test cases; whereas with clone-and-hack there will be a nasty surprise in a case you didn't test.

Consistent and quick rollouts

In the playbook I quoted above I am using my keepalived role, so I can be absolutely sure that my rollout/backout plan remains consistent with my configuration management setup. Nice!

However the keepalived role does several configuration tasks, most of which are not necessary in this situation. In fact all I need to do is copy across the templated configuration file and tell keepalived to reload it if the file has changed.

Ansible tags are for just this kind of optimization. I added a line to my keepalived.conf task:

    tags: quick

Only one task needed tagging because the keepalived.conf task has a handler to tell keepalived to reload its configuration when that changes, which is the other important action. So now I can run my rollout/backout playbooks with a --tags quick argument, so only the quick tasks (and if necessary their handlers) are run.


Once I had got all that working, I was able to easily flip testdns0 and testdns1 back and forth between the old and new setups. Each switchover takes about ten seconds, which is not bad - it is less than a typical DNS lookup timeout.

There are a couple more improvements to make before I do the rollout for real. I should improve the molly guard to make better use of ansible-playbook --check. And I should pre-populate the new servers' caches with the Alexa Top 1,000,000 list to reduce post-rollout latency. (If you have a similar UK-centric popular domains list, please tell me so I can feed that to the servers as well!)

fanf: (dotat)

I have released version 1.55 of nsdiff, which creates an nsupdate script from differences between DNS zone files.

There are not many changes to nsdiff itself: the only notable change is support for non-standard port numbers.

The important new thing is nspatch, which is an error-checking wrapper around `nsdiff | nsupdate`. To be friendly when running from cron, nspatch only produces output when it fails. It can also retry an update if it happens to lose a race against concurrent updates e.g. due to DNSSEC signing activity.

You can read the documentation and download the source from the nsdiff home page.

My Mac insists that I should call it nstiff...

fanf: (dotat)

On Friday evening I reached a BIG milestone in my project to replace Cambridge University's DNS servers. I finished porting and rewriting the dynamic name server configuration and zone data update scripts, and I was - at last! - able to get the new servers up to pretty much full functionality, pulling lists of zones and their contents from the IP Register database and the managed zone service, and with DNSSEC signing on the new hidden master.

There is still some final cleanup and robustifying to do, and checks to make sure I haven't missed anything. And I have to work out the exact process I will follow to put the new system into live service with minimum risk and disruption. But the end is tantalizingly within reach!

In the last couple of weeks I have also got several small patches into BIND.

  • Jan 7: documentation for named -L

    This was a follow-up to a patch I submitted in April last year. The named -L option specifies a log file to use at startup for recording the BIND version banners and other startup information. Previously this information would always go to syslog regardless of your logging configuration.

    This feature will be in BIND 9.11.

  • Jan 8: typo in comment

    Trivial :-)

  • Jan 12: teach nsdiff to AXFR from non-standard ports

    Not a BIND patch, but one of my own companion utilities. Our managed zone service runs a name server on a non-standard port, and our new setup will use nsdiff | nsupdate to implement bump-in-the-wire signing for the MZS.

  • Jan 13: document default DNSKEY TTL

    Took me a while to work out where that value came from. Submitted on Jan 4. Included in 9.10 ARM.

  • Jan 13: automatically tune max-journal-size

    Our old DNS build scripts have a couple of mechanisms for tuning BIND's max-journal-size setting. By default a zone's incremental update journal will grow without bound, which is not helpful. Having to set the parameter by hand is annoying, especially since it should be simple to automatically tune the limit based on the size of the zone.

    Rather than re-implementing some annoying plumbing for yet another setting, I thought I would try to automate it away. I have submitted this patch as RT#38324. In response I was told there is also RT#36279 which sounds like a request for this feature, and RT#25274 which sounds like another implementation of my patch. Based on the ticket number it dates from 2011.

    I hope this gets into 9.11, or something like it. I suppose that rather than maintaining this patch I could do something equivalent in my build scripts...

  • Jan 14: doc: ignore and clean up isc-notes-html.xsl

    I found some cruft in a supposedly-clean source tree.

    This one actually got committed under my name, which I think is a first for me and BIND :-) (RT#38330)

  • Jan 14: close new zone file before renaming, for win32 compatibility
  • Jan 14: use a safe temporary new zone file name

    These two arose from a problem report on the bind-users list. The conversation moved to private mail which I find a bit annoying - I tend to think it is more helpful for other users if problems are fixed in public.

    But it turned out that BIND's error logging in this area is basically negligible, even when you turn on debug logging :-( But the Windows Process Explorer is able to monitor filesystem events, and it reported a 'SHARING VIOLATION' and 'NAME NOT FOUND'. This gave me the clue that it was a POSIX vs Windows portability bug.

    So in the end this problem was more interesting than I expected.

  • Jan 16: critical: ratelimiter.c:151: REQUIRE(ev->ev_sender == ((void *)0)) failed

    My build scripts are designed so that Ansible sets up the name servers with a static configuration which contains everything except for the zone {} clauses. The zone configuration is provisioned by the dynamic reconfiguration scripts. Ansible runs are triggered manually; dynamic reconfiguration runs from cron.

    I discovered a number of problems with bootstrapping from a bare server with no zones to a fully-populated server with all the zones and their contents on the new hidden master.

    The process is basically,

    • if there are any missing master files, initialise them as minimal zone files
    • write zone configuration file and run rndc reconfig
    • run nsdiff | nsupdate for every zone to fill them with the correct contents

    When bootstrapping, the master server would load 123 new zones, then shortly after the nsdiff | nsupdate process started, named crashed with the assertion failure quoted above.

    Mark Andrews replied overnight with the linked patch (he lives in Australia) which fixed the problem. Yay!

    The other bootstrapping problem was to do with BIND's zone integrity checks. nsdiff is not very clever about the order in which it emits changes; in particular it does not ensure that hostnames exist before any NS or MX or SRV records are created to point to them. You can turn off most of the integrity checks, but not the NS record checks.

    This causes trouble for us when bootstrapping the zone, which is the only zone we have with in-zone NS records. It also has lots of delegations which can also trip the checks.

    My solution is to create a special bootstrap version of the zone, which contains the apex and delegation records (which are built from configuration stored in git) but not the bulk of the zone contents from the IP Register database. The zone can then be succesfully loaded in two stages, first `nsdiff DB.bootstrap | nsupdate -l` then `nsdiff zones/ | nsupdate -l`.

    Bootstrapping isn't something I expect to do very often, but I want to be sure it is easy to rebuild all the servers from scratch, including the hidden master, in case of major OS upgrades, VM vs hardware changes, disasters, etc.

    No more special snowflake servers!

fanf: (dotat)

I have got keepalived working on my recursive DNS servers, handling failover for and I am quite pleased with the way it works.

It was difficult to get started because keepalived's documentation is TERRIBLE. More effort has been spent explaining how it is put together than explaining how to get it to work. The keepalived.conf man page is a barely-commented example configuration file which does not describe all the options. Some of the options are only mentioned in the examples in /usr/share/doc/keepalived/samples. Bah!

Edit: See the comments to find the real documentation!

The vital clue came from Graeme Fowler who told me about keepalived's vrrp_script feature which is "documented" in keepalived.conf.vrrp.localcheck which I never would have found without Graeme's help.


Keepalived is designed to run on a pair of load-balancing routers in front of a cluster of servers. It has two main parts. Its Linux Virtual Server daemon runs health checks on the back-end servers and configures the kernel's load balancing router as appropriate. The LVS stuff handles failover of the back-end servers. The other part of keepalived is its VRRP daemon which handles failover of the load-balancing routers themselves.

My DNS servers do not need the LVS load-balancing stuff, but they do need some kind of health check for named. I am running keepalived in VRRP-only mode and using its vrrp_script feature for health checks.

There is an SMTP client in keepalived which can notify you of state changes. It is too noisy for me, because I get messages from every server when anything changes. You can also tell keepalived to run scripts on state changes, so I am using that for notifications.

VRRP configuration

All my servers are configured as VRRP BACKUPs, and there is no MASTER. According to the VRRP RFC, the master is supposed to be the machine which owns the IP addresses. In my setup, no particular machine owns the service addresses.

I am using authentication mainly for additional protection against screwups (e.g. VRID collisions). VRRP password authentication doesn't provide any security: any attacker has to be on the local link so they can just sniff the password off the wire.

I am slightly surprised that it works when I set both IPv4 and IPv6 addresses on the same VRRP instance. The VRRP spec says you have to have separate vrouters for IPv4 and IPv6. Perhaps it works because keepalived doesn't implement real VRRP by default: it does not use a virtual MAC address but instead it just moves the virtual IP addresses and sends gratuitous ARPs to update the switches' forwarding tables. Keepalived has a use_vmac option but it seems rather fiddly to get working, so I am sticking with the default.

vrrp_instance testdns0 {
        virtual_router_id 210
        interface em1
        state BACKUP
        priority 50
        notify /etc/keepalived/notify
        authentication {
                auth_type PASS
                auth_pass XXXXXXXX
        virtual_ipaddress {
        track_script {

State change notifications

My notification script sends email when a server enters the MASTER state and takes over the IP addresses. It also sends email if the server dropped into the BACKUP state because named crashed.

    # this is /etc/keepalived/notify
    case $state in
        # do not notify if this server is working
        if /etc/keepalived/named_ok
        then exit 0
        else state=DEAD
    exim -t <<EOF
    Subject: $instance $state on $(hostname)

DNS server health checks and dynamic VRRP priorities

In the vrrp_instance snippet above, you can see that it specifies four vrrp_scripts to track. There is one vrrp_script for each possible priority, so that the four servers can have four different priorities for each vrrp_instance.

Each vrrp_script is specified using the Jinja macro below. (Four different vrrp_scripts for each of four different vrrp_instances is a lot of repetition!) The type argument is "recdns" or "testdns", the num is 0 or 1, and the prio is a number from 1 to 4.

Each script is run every "interval" seconds, and is allowed to run for up to "timeout" seconds. (My checking script should take at most 1 second.)

A positive "weight" setting is added to the vrrp_instance's priority to increse it when the script succeeds. (If the weight is negative it is added to the priority to decrease it when the script fails.)

    {%- macro named_check(type,num,prio) -%}
    vrrp_script named_check_{{type}}{{num}}_{{prio}} {
        script "/etc/keepalived/named_check {{type}} {{num}} {{prio}}"
        interval 1
        timeout 2
        weight {{ prio * 50 }}
    {%- endmacro -%}

When keepalived runs the four tracking scripts for a vrrp_instance on one of my servers, at most one of the scripts will succeed. The priority is therefore adjusted to 250 for the server that should be live, 200 for its main backup, 150 and 100 on the other servers, and 50 on any server which is broken or out of service.

The checking script finds the position of the host on which it is running in a configuration file which lists the servers in priority order. A server can be commented out to remove it from service. The priority order for testdns1 is the opposite of the order for testdns0. So the following contents of /etc/keepalived/priority.testdns specifies that testdns1 is running on recdns-cnh, testdns0 is on recdns-wcdc, recdns-rnb is disabled, and recdns-sby is a backup.


I can update this prioriy configuration file to change which machines are in service, without having to restart or reconfigure keepalived.

The health check script is:


    set -e

    type=$1 num=$2 check=$3

    # Look for the position of our hostname in the priority listing

    name=$(hostname --short)

    # -F = fixed string not regex
    # -x = match whole line
    # -n = print line number

    # A commented-out line will not match, so grep will fail
    # and set -e will make the whole script fail.

    grepout=$(grep -Fxn $name /etc/keepalived/priority.$type)

    # Strip off everything but the line number. Do this separately
    # so that grep's exit status is not lost in the pipeline.

    prio=$(echo $grepout | sed 's/:.*//')

    # for num=0 later is higher priority
    # for num=1 later is lower priority

    if [ $num = 1 ]
        prio=$((5 - $prio))

    # If our priority matches what keepalived is asking about, then our
    # exit status depends on whether named is running, otherwise tell
    # keepalived we are not running at the priority it is checking.

    [ $check = $prio ] && /etc/keepalived/named_ok

The named_ok script just uses dig to verify that the server seems to be working OK. I originally queried for version.bind, but there are very strict rate limits on the server info view so it did not work very well! So now the script checks that this command produces the expected output:

dig @localhost +time=1 +tries=1 +short in txt

fanf: (dotat)

The SCCS-to-git project that I wrote about previously was the prelude to setting up new DNS servers with an entirely overhauled infrastructure.

The current setup which I am replacing uses Solaris Zones (like FreeBSD Jails or Linux Containers) to host the various name server instances on three physical boxes. The new setup will use Ubuntu virtual machines on our shared VM service (should I call it a "private cloud"?) for the authoritative servers. I am making a couple of changes to the authoritative setup: changing to a hidden master, and eliminating differences in which zones are served by each server.

I have obtained dedicated hardware for the recursive servers. Our main concern is that they should be able to boot and work with no dependencies on other services beyond power and networking, because basically all the other services rely on the recursive DNS servers. The machines are Dell R320s, each with one Xeon E5-2420 (6 hyperthreaded cores, 2.2GHz), 32 GB RAM, and a Dell-branded Intel 160GB SSD.

Failover for recursive DNS servers

The most important change to the recursive DNS service will be automatic failover. Whenever I need to loosen my bowels I just contemplate dealing with a failure of one of the current elderly machines, which involves a lengthy and delicate manual playbook described on our wiki...

Often when I mention DNS and failover, the immediate response is "Anycast?". We will not be doing anycast on the new servers, though that may change in the future. My current plan is to do failover with VRRP using keepalived. (Several people have told me they are successfully using keepalived, though its documentation is shockingly bad. I would like to know of any better alternatives.) There are a number of reasons for using VRRP rather than anycast:

  • The recursive DNS server addresses are (aka recdns0) and (aka recdns1). (They have IPv6 addresses too.) They are on different subnets which are actually VLANs on the same physical network. It is not feasible to change these addresses.
  • The 8 and 12 subnets are our general server subnets, used for a large proportion of our services, most of which use the recdns servers. So anycasting recdns[01] requires punching holes in the server network routing.
  • The server network routers do not provide proxy ARP and my colleagues in network systems do not want to change this. But our Cisco routers can't punch a /32 anycast hole in the server subnets without proxy ARP. So if we did do anycast we would also have to do VRRP to support failover for recdns clients on the server subnets.
  • The server network spans four sites, connected via our own city-wide fibre network. The sites are linked at layer 2: the same Ethernet VLANs are present at all four sites. So VRRP failover gives us pretty good resilience in the face of server, rack, or site failures.

VRRP will be a massive improvement over our current setup, and it should provide us a lot of the robustness that other places would normally need anycast for, but with significantly less complexity. And less complexity means less time before I can take the old machines out of service.

After the new setup is in place, it might make sense for us to revisit anycast. For instance, we could put recursive servers at other points of presence where our server network does not reach (e.g. the Addenbrooke's medical research site). But in practice there are not many situations when our server network is unreachable but the rest of the University data network is functioning, so it might not be worth it.

Configuration management

The old machines are special snowflake servers. The new setup is being managed by Ansible.

I first used Ansible in 2013 to set up the DHCP servers that were a crucial part of the network renumbering we did when moving our main office from the city centre to the West Cambridge site. I liked how easy it was to get started with Ansible. The way its --check mode prints a diff of remote config file changes is a killer feature for me. And it uses ssh rather than rolling its own crypto and host authentication like some other config management software.

I spent a lot of December working through the configuration of the new servers, starting with the hidden master and an authoritative server (a staging server which is a clone of the future live servers). It felt like quite a lot of elapsed time without much visible progress, though I was steadily knocking items off the list of things to get working.

The best bit was the last day before the xmas break. The new recdns hardware arrived on Monday 22nd, so I spent Tuesday racking them up and getting them running.

My Ansible setup already included most of the special cases required for the recdns servers, so I just uncommented their hostnames in the inventory file and told Ansible to run the playbook. It pretty much Just Worked, which was extremely pleasing :-) All that steady work paid off big time.

Multi-VLAN network setup

The main part of the recdns config which did not work was the network interface configuration, which was OK because I didn't expect it to work without fiddling.

The recdns servers are plugged into switch ports which present subnet 8 untagged (mainly to support initial bootstrap without requiring special setup of the machine's BIOS), and subnet 12 with VLAN tags (VLAN number 812). Each server has its own IPv4 and IPv6 addresses on subnet 8 and subnet 12.

The service addresses recdns0 (subnet 8) and recdns1 (subnet 12) will be additional (virtual) addresses which can be brought up on any of the four servers. They will usually be configured something like:

  • recdns-wcdc: VRRP master for recdns0
  • recdns-rnb: VRRP backup for recdns0
  • recdns-sby: VRRP backup for recdns1
  • recdns-cnh: VRRP master for recdns1

And in case of multi-site failures, the recdns1 servers will act as additional backups for the recdns0 servers and vice versa.

There were two problems with my initial untested configuration.

The known problem was that I was likely to need policy routing, to ensure that packets with a subnet 12 source address were sent out with VLAN 812 tags. This turned out to be true for IPv4, whereas IPv6 does the Right Thing by default.

The unknown problem was that the VLAN 812 interface came up only half-configured: it was using SLAAC for IPv6 instead of the static address that I specified. This took a while to debug. The clue to the solution came from running ifup with the -v flag to get it to print out what it was doing:

# ip link delete em1.812
# ifup -v em1.812

This showed that interface configuration was failing when it tried to set up the default route on that interface. Because there can be only one default route, and there was already one on the main subnet 8 interface. D'oh!

Having got ifup to run to completion I was able to verify that the subnet 12 routing worked for IPv6 but not for IPv4, pretty much as expected. With advice from my colleagues David McBride and Anton Altaparmakov I added the necessary runes to the configuration.

My final /etc/network/interfaces files on the recdns servers are generated from the following Jinja template:

# This file describes the network interfaces available on the system
# and how to activate them. For more information, see interfaces(5).

# NOTE: There must be only one "gateway" line because there can be
# only one default route. Interface configuration will fail part-way
# through when you bring up a second interface with a gateway
# specification.

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface, on subnet 8
auto em1

iface em1 inet static
      address 131.111.8.{{ ifnum }}
      netmask 23

iface em1 inet6 static
      address 2001:630:212:8::d:{{ ifnum }}
      netmask 64

# VLAN tagged interface on subnet 12
auto em1.812

iface em1.812 inet static
      address 131.111.12.{{ ifnum }}
      netmask 24

      # send packets with subnet 12 source address
      # through routing table 12 to subnet 12 router

      up   ip -4 rule  add from table 12
      down ip -4 rule  del from table 12
      up   ip -4 route add default table 12 via
      down ip -4 route del default table 12 via

iface em1.812 inet6 static
      address 2001:630:212:12::d:{{ ifnum }}
      netmask 64

      # auto-configured routing works OK for IPv6

# eof

April 2019

123 4567


RSS Atom

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags

No cut tags
Page generated 2019-04-20 03:10
Powered by Dreamwidth Studios