Here are some ideas for how a special-purpose allocator might improve
a qp-trie implementation:
- lower memory usage
- faster allocation
- neater RCU support
- possibly less load on the TLB
- tunable fragmentation overhead
The downsides are:
- complexity - it's a custom allocator and garbage collector!
- it would only support transactional updates
Let's dig in...
COW and RCU
A few years ago I added support for transactional updates to the
qp-trie used by Knot DNS. Knot handled DNS updates by making a
complete copy of the zone, so that the old copy could continue to
serve queries while the new copy was being modified. The new code made
it possible to copy only the parts of the zone that were affected by
the update, reducing the overhead of handling small updates.
My COW (copy-on-write) code was designed to work with the RCU
(read-copy-update) concurrency framework. RCU was developed for
concurrent data structures in the Linux kernel; there is also a
userland RCU library. RCU is a combination of:
- COW data structures, so updates don't interfere with readers
- lightweight concurrency barriers, so readers do not need to take a lock
- deferred cleanup, so writers know when all readers have moved to the
new copy of a data structure and the old copy can be cleaned up
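To make that concrete, here is a minimal sketch of the reader and
writer sides of the pattern, using the userland RCU library (liburcu);
the zone type and its helper functions are hypothetical placeholders.

    #include <urcu.h>

    struct zone;                        /* some COW data structure */
    struct query;
    static struct zone *current_zone;   /* shared root pointer */

    /* hypothetical helpers standing in for the real data structure */
    void zone_answer(struct zone *z, const struct query *q);
    void zone_free(struct zone *z);

    /* reader: a lightweight critical section, no lock (each reader
     * thread also needs rcu_register_thread(), omitted here) */
    void answer_query(const struct query *q) {
        rcu_read_lock();
        zone_answer(rcu_dereference(current_zone), q);
        rcu_read_unlock();      /* the copy we used may be freed after this */
    }

    /* writer: publish the modified copy, then reclaim the old one
     * once every reader has moved on */
    void apply_update(struct zone *new_zone) {
        struct zone *old_zone = current_zone;
        rcu_assign_pointer(current_zone, new_zone);
        synchronize_rcu();      /* wait out readers of the old copy */
        zone_free(old_zone);
    }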
I used one-bit reference counts to mark the boundary between the parts
of the tree (mostly near the root) that had been copied, and the
shared parts (towards the leaves). So it wasn't a pure COW, because
the refcount manipulation required writes to the shared parts of the
tree.
memory layout
A common design for malloc() implementations (for example phkmalloc
and jemalloc) is to keep allocations of different sizes separate. Each
size class has its own free list, and each page can only satisfy
allocations from a single size class. This can reduce the amount of
searching around for free space inside malloc() and reduce the amount
of fragmentation.
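As a very simplified illustration of the idea (this is a toy, not how
any real malloc() is written):

    #include <stddef.h>

    #define NCLASSES 8

    /* every object in a page belongs to the page's size class */
    struct page {
        unsigned size_class;
        void *free_list;            /* free objects inside this page */
    };

    /* one list of partially-full pages per size class */
    static struct page *partial[NCLASSES];

    /* round a request up to its size class, e.g. 24 bytes -> 32 */
    static unsigned size_class(size_t size) {
        unsigned class = 0;
        for (size_t rounded = 16; rounded < size; rounded *= 2)
            class++;
        return class;
    }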
But in a qp-trie, nodes are often different sizes, so each step when
traversing the tree will usually require a leap to a different page,
which can increase pressure on the CPU's translation lookaside buffer
(TLB).
Could we, perhaps, make a qp-trie more friendly to the TLB, and maybe
also the prefetcher, by being more clever about how nodes are
allocated, and how they are arranged next to each other in memory?
A custom allocator seems like a lot of work for a (probably) small
performance improvement, so I have not (until recently) pursued the
idea.
refcounts vs tracing
Reference counts are often regarded as a poor substitute for "proper"
tracing garbage collection. A tracing copying collector can give you:
- cheaper allocations: just bump a pointer
- amortized free: release whole pages rather than individual nodes
- better locality and less fragmentation
- no extra write traffic to update reference counts
To get most of these advantages, the garbage collector must be able to
move objects around. What you gain in more efficient alloc and free,
you pay for by copying.
However, if all updates to our data structure are RCU transactions
that necessarily involve making copies, then tracing garbage
collection seems like less of a stretch.
rough design
Our qp-trie allocator has a bag of pages; they do not need to match
the hardware page size, but that kind of size should be about right.
For each page, we keep track of how much free space it has (so that we
can decide when it is worth evacuating and freeing it), and a note of
the RCU epoch after which it can be freed.
There's a global array of pages, containing the address of each
page. Actually, when the page table is resized, we will need to do an
RCU delayed cleanup, so there can also be a secondary array which is
waiting to be freed.
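In C, the bookkeeping might look something like this; the names and
field sizes are guesses rather than a finished design.

    #include <stdint.h>

    #define PAGE_SIZE 4096      /* need not be the hardware page size */

    struct page_meta {
        uint32_t free_bytes;    /* evacuate the page when this grows too big */
        uint32_t free_epoch;    /* RCU epoch after which to free it, 0 = in use */
    };

    struct page_table {
        uint32_t num_pages;
        void **page;                /* address of each page */
        struct page_meta *meta;     /* per-page free space and epoch */
        struct page_table *old;     /* replaced table awaiting RCU cleanup */
    };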
starting a transaction
When an update transaction is started, we obtain a fresh page where we
will put new nodes and modified copies. We use a cheap bump allocator
that just obtains another page when it runs out of space. Unlike in
many garbage-collected languages, we still manually free() nodes, to
keep count of the free space in each page.
There can only be one write transaction at a time, so the writer can
update the page metadata without interlocks.
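Here's a minimal sketch of the allocation path, assuming the
page_table and page_meta structures sketched above; get_fresh_page()
is a hypothetical helper, and because there is only one writer none of
this needs interlocks.

    #include <stddef.h>

    struct allocator {
        uint32_t page;          /* the page we are currently filling */
        uint32_t used;          /* bump pointer within that page */
    };

    uint32_t get_fresh_page(struct page_table *t);      /* hypothetical */

    static void *alloc_node(struct allocator *a, struct page_table *t,
                            size_t size) {
        if (a->used + size > PAGE_SIZE) {
            a->page = get_fresh_page(t);
            a->used = 0;
        }
        void *node = (char *)t->page[a->page] + a->used;
        a->used += size;
        return node;
    }

    /* free() only adjusts the page's free space counter, so we know
     * when a page is worth evacuating */
    static void free_node(struct page_table *t, uint32_t page, size_t size) {
        t->meta[page].free_bytes += size;
    }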
finishing a transaction
After the updates have been applied to make a new version of the tree,
we can do a bit of extra maintenance work before switching our readers
over to the new tree. I'll discuss these in more detail below:
- layout optimization: it might be worth doing some extra copying to
make the tree nicer for the prefetcher;
- garbage collection: identify which pages have too much free space,
and evacuate and compact their contents so they can be freed;
- cache eviction: if our tree is used for a cache rather than for
authoritative data, the GC phase can also discard entries that are
past their TTL.
Finally, the switch-over process:
- swap the tree's root pointers atomically
- wait for an RCU epoch so all readers are using the new tree
- free everything on the delayed cleanup list
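In terms of the liburcu calls from earlier, the switch-over might look
like this; the tree struct and cleanup function are hypothetical.

    struct node;
    struct tree {
        struct node *root;
        /* ... page table, delayed cleanup list ... */
    };

    void run_delayed_cleanup(struct tree *t);   /* hypothetical */

    void commit_transaction(struct tree *t, struct node *new_root) {
        /* swap the tree's root pointer atomically */
        rcu_assign_pointer(t->root, new_root);

        /* wait for an RCU epoch so that every reader sees the new tree */
        synchronize_rcu();

        /* nothing can reach the old nodes now: free everything on the
         * delayed cleanup list (evacuated pages, a replaced page table,
         * deleted application elements) */
        run_delayed_cleanup(t);
    }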
layout optimization
This is entirely optional: I don't know if it will have any useful
effect. The idea is to copy tree nodes into a layout that's friendly
to the CPU's prefetcher and maybe also its TLB. My best guess for how
to achieve this is, starting from the root of the tree, to copy nodes
in breadth-first order, until some heuristic limits are reached.
One of the tradeoffs is between better layout and extra memory usage
(for more copies). A minimal option might be to only copy the few
uppermost levels of the tree until they fill one page. Layout
optimization across multiple pages is more complicated.
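Something along these lines, perhaps, re-using the bump allocator
sketched earlier; the node accessors and reparent() are hypothetical
stand-ins for the real qp-trie internals.

    #include <string.h>

    #define QUEUE_MAX 1024

    size_t node_size(const struct node *n);     /* hypothetical accessors */
    uint32_t node_page(const struct node *n);
    unsigned node_children(const struct node *n);
    struct node *node_child(const struct node *n, unsigned i);
    void reparent(struct node *from, struct node *to);

    /* copy the top of the tree in breadth-first order into the page
     * currently being filled, stopping when the page is full */
    static void optimize_layout(struct allocator *a, struct page_table *t,
                                struct node *root) {
        struct node *queue[QUEUE_MAX];
        size_t head = 0, tail = 0;

        queue[tail++] = root;
        while (head < tail) {
            struct node *n = queue[head++];
            size_t size = node_size(n);
            if (a->used + size > PAGE_SIZE)
                break;                      /* heuristic limit: one page */
            struct node *copy = alloc_node(a, t, size);
            memcpy(copy, n, size);
            reparent(n, copy);              /* point the parent at the copy */
            free_node(t, node_page(n), size);
            /* enqueue children so that siblings end up next to each other */
            for (unsigned i = 0; i < node_children(copy); i++)
                if (tail < QUEUE_MAX)
                    queue[tail++] = node_child(copy, i);
        }
    }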
garbage collection
Here is a sketch of an algorithm for a full collection; I have not
worked out how to do a useful collection that touches less data.
We recursively traverse the whole tree. The argument for the recursive
function is a branch twig, i.e. a pointer to an interior node with its
metadata (bitmap etc.), and the return value is either the same as the
argument, or an altered version pointing to the node's new location.
The function makes a temporary copy of its node on the stack, then
iterates over the twigs contained in the node. Leaf twigs are copied
as is; it calls itself recursively for each branch twig.
If any of the branch twigs were changed by the recursive calls, or if
the old copy of this node was in a sufficiently-empty page, the old
copy is freed (which only alters its page's free space counter), the
new version of the node is copied to the allocation pointer, and this
recursive invocation returns the node's new location. Otherwise it
returns a pointer to the old location (and the copy on the stack is
discarded).
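Here is roughly what that looks like; the twig and node helpers are
hypothetical, and the stack copy assumes a maximum-size node.

    #include <stdbool.h>

    /* returns either the argument unchanged, or a twig pointing at the
     * node's new location */
    static struct twig gc_walk(struct gc *gc, struct twig ref) {
        /* temporary copy of the interior node on the stack */
        struct node copy = *twig_node(gc, ref);
        bool changed = false;

        for (unsigned i = 0; i < node_width(&copy); i++) {
            struct twig child = node_twig(&copy, i);
            if (twig_is_leaf(child))
                continue;                   /* leaf twigs are copied as-is */
            struct twig moved = gc_walk(gc, child);
            if (!twig_equal(moved, child)) {
                node_set_twig(&copy, i, moved);
                changed = true;
            }
        }

        if (!changed && !page_sparse(gc, twig_page(ref)))
            return ref;                     /* keep the old copy where it is */

        /* free the old copy (this only credits its page's free space
         * counter) and write the new version at the allocation pointer */
        gc_free(gc, ref);
        struct node *moved_to = gc_alloc(gc, node_size(&copy));
        *moved_to = copy;
        return twig_retarget(ref, moved_to);    /* same metadata, new address */
    }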
We can tune our fragmentation overhead by adjusting the threshold for
sufficiently-empty pages. Note that garbage collection must also
include recent allocations during the update transaction: a
transaction containing multiple updates is likely to generate garbage
because many qp-trie updates change the size of a node, even if we
update in place when we can. So the pages used for new allocations
should be treated as sufficiently-empty so that their contents are
compacted before they enter heavy read-only use.
cache eviction
So far my qp-trie code has worked well for authoritative data, but I
have not tried to make it work for a DNS cache. A cache needs to do a
couple of extra things:
- evict entries that have passed their time-to-live;
- evict older entries to keep within a size limit.
Both of these can be done as part of the garbage collection tree walk.
In BIND, activities like this are performed incrementally by
co-operatively scheduled tasks, rather than dedicated threads, which
makes them a bit more intricate to code.
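For example, the GC walk could decide whether to keep each leaf with a
check along these lines; the cache_entry type is made up for the sketch.

    #include <stdbool.h>
    #include <time.h>

    struct cache_entry {
        time_t expiry;          /* absolute expiry time computed from the TTL */
        /* ... owner name, rdata, and so on ... */
    };

    /* called by the GC walk before copying a leaf; evicted entries are
     * simply not copied, and are freed after the next RCU epoch */
    static bool should_evict(const struct cache_entry *e, time_t now,
                             bool over_size_limit) {
        if (e->expiry <= now)
            return true;        /* past its time-to-live */
        if (over_size_limit)
            return true;        /* crude size pressure; a real policy would
                                 * prefer to drop the least useful entries */
        return false;
    }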
small pointers
The page table allows us to use much smaller node pointers.
Instead of using a native 64-bit pointer, we can refer to a node by
the index of its page in the page table and the position of the node
in its page, which together can easily fit in 32 bits. This requires a
double indirection to step from one node to the next, but the page
table should be in cache, and qp-trie traversal is friendly to
prefetching, so we can provide hints if the processor can't prefetch
automatically.
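Concretely, a small pointer might be split into a page number and an
offset within the page, something like this (the field widths are
illustrative):

    #include <stdint.h>

    #define PAGE_BITS 12                /* 4 KiB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    typedef uint32_t nodeptr;           /* page number : offset within page */

    static inline nodeptr nodeptr_make(uint32_t page, uint32_t offset) {
        return (page << PAGE_BITS) | offset;
    }

    /* double indirection: look up the page's address, then the node */
    static inline void *nodeptr_deref(void *const *page_table, nodeptr p) {
        return (char *)page_table[p >> PAGE_BITS] + (p & (PAGE_SIZE - 1));
    }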
There are a couple of ways to make use of this saving.
We can reduce the size of each twig from 16 bytes to 12 bytes, making
the whole tree 25% smaller. This adds some constraints on leaves:
either the key and value pointers must fit in 48 bits each (which
requires unwarranted chumminess with the implementation); or we can
get hold of the key via the value (and waste 32 bits in the leaf).
Or if this is a one-pass DNS-trie we can use the extra
space for path compression, and avoid making assumptions about
pointers.
metadata placement
For each page we need to keep track of how much free space it
contains, so that we know when it should be evacuated; and something
to tell us if the page should be freed after the next RCU epoch.
It's fairly straightforward to put this metadata at the start of each
page. At the cost of a little wasted space we can make sure this
writable data doesn't share a cache line with read-only nodes.
If we are using small pointers, another option is to put per-page
metadata in the page table, or perhaps in another array parallel to
the page table to keep read-only and writable data separate.
transactions and caches
I normally think of a cache as having a lot of small point updates,
which is unlikely to be efficient with this transaction-oriented
design. But perhaps it makes sense if we split the cache into two
parts.
The main cache is read-only; we use transactional updates for eviction
based on TTL and cache size, and to bring in new records from the
working cache. It uses ordered lookups to support RFC 8198 NXDOMAIN
synthesis.
The working cache is used by the resolver to keep track of queries in
progress. It can be based on fine-grained updates and locking, rather
than being designed for a read-mostly workload. It might not need
ordered lookups at all.
Queries that miss the main cache get handed over to the resolver,
which might be able to answer them straight from the working cache, or
add the query to a list of queries waiting for the same answer, or,
when there is no match, create a new entry in the working cache that
belongs to a new resolver task.
application data
Most of what I have written above is about working with the interior
branch nodes of the tree. What about the application data hanging off
the leaf nodes?
During a transaction, any elements that we want to delete need to be
added to a free list, so that they can be cleaned up after the next
RCU epoch. When we need to modify an element, we must do so COW-style.
It's reasonable for the tree implementation to keep a free list of
application elements, so any delete or set operations will
automatically add the old element pointer to the list for later
cleanup. On the other hand, it's probably easier for the application to
COW its own data.
The only callback we will need is to free application elements during
the delayed cleanup after the RCU epoch has passed. (This is simpler
than Knot DNS, which also has callbacks for refcount manipulation.)
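So the interface between the tree and the application could be as
small as this sketch, with made-up names and no error handling:

    #include <stdlib.h>

    /* old application elements collected during a transaction */
    struct cleanup_list {
        void **element;
        size_t len, cap;
    };

    /* the only callback the application has to provide */
    typedef void (*element_free_fn)(void *element);

    /* delete and set operations call this to defer freeing an element */
    void defer_element_free(struct cleanup_list *cl, void *old_element) {
        if (cl->len == cl->cap) {
            cl->cap = cl->cap ? cl->cap * 2 : 64;
            cl->element = realloc(cl->element, cl->cap * sizeof(*cl->element));
        }
        cl->element[cl->len++] = old_element;
    }

    /* called after the RCU epoch has passed, when no reader can still
     * see the old elements */
    void run_element_cleanup(struct cleanup_list *cl,
                             element_free_fn free_element) {
        for (size_t i = 0; i < cl->len; i++)
            free_element(cl->element[i]);
        cl->len = 0;
    }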
conclusion
For a long time I was doubtful that a custom allocator for a qp-trie
would be worth the effort. But now I think it is likely to be worth it:
- The refcounting in Knot is confusing; GC seems to be a nicer way
to support RCU.
- Small pointers can save a significant amount of space, and are
more accommodating for a one-pass radix tree version.
- It will become feasible to see if layout optimization can make
queries faster.
It remains to be seen if I can find the time to turn these ideas into code!