fanf: (Default)

I have had this article brewing for some time now but it has never really had a point. I think the drafts in my head were trying to be too measured, even-handed, and educational, whereas this probably works better if I'm just an old man shaking my fist at a cloud.

What the fuck is a sysadmin anyway

Go and read Rachel Kroll on being a sysadmin or what.

Similar to Rachel, when I was starting my career I looked up to people who ran large systems and called themselves sysadmins: at that time the bleeding edge of scalability was the middle of the sigmoid adoption curve in universities, and in early ISPs, so "large" was 10k - 100k users. And these sysadmins were comfortable with custom kernels and patched daemons.

The first big open source project I got involved with was the Apache httpd, which was started by webmasters who had to fix their web servers, and who helped each other to solve their problems. Hacking C to build the world-wide web.

About ten years later, along came DevOps and SRE, and I thought, yeah, I code and do ops, so what? I like the professionalism both of them have promoted, but they tend to be about how to run LOTS of bespoke code.

Are you local?

Go and read David MacIver on "situated code".

A lot of the code I have inherited and perpetrated has been glue code that's inherently "situated" - tied to a particular place or context. ETL scripts, account provisioning, for example.

There are actually two dimensions here: situatedness (local vs. global), and config vs. hardcoded. The line between code and config is blurry: if you can't configure a feature, do you write a wrapper script, or do you hack the code to add it?

Local patches are just advanced build-time configuration.

Docker images are radically less-situated configuration.

Custom code is bureaucratic overhead

Code is not as bad as personal data - that's toxic waste. Code is more like a costly byproduct of providing a service. Write code to reduce the drudgery of operating the service; then operations becomes maintaining the bespoke code.

Maintaining the code becomes bureaucratic overhead. A bureaucracy exists to sustain itself.

How to reduce the overhead? DELETE THE CODE. How do you do that? Simplify the code. Share the code. Offload the code.

There is no open source business model

Unless you are Red Hat IBM.

There has been a lot of argument in recent months about open source companies finding it hard to make money when all the money is going to AWS.

Go and read this list of fail.

The code I write is a by-product of providing a service. This is how Apache httpd and Exim came to be. The point was not to make money from the software, the point was to make some non-software thing better. And sharing improvements to code that solves common problems is the point of open source software.

Doing open source wrong

Don't solve a problem with open source software using code that you can't share.

Amazon has an exceptionally strict policy of taking open source code and never sharing any of the improvements they make. Their monopolizing success is the main cause of the recent crisis amongst open source software businesses. It isn't open source's fault, it's because Amazon are rapacious fuckers, and monopolies have somehow become OK.

Doing open source right

Everything I have learned about software quality in practice I have learned from open source.

From the ops perspective, before you can even start to consider the usual measures of quality (documentation, testing, reliability ...) open source forces you to eliminate situatedness. Make the code useful to people other than yourselves. Then you can share it, and if you are lucky, offload it.

If some problem is difficult to solve with your chosen package, you can often solve it with a wrapper script or dockerfile. You can share your solution in a blog post or on GitHub. That's all good.

Even better if you can improve the underlying software to make the problem easier to solve, so the blog posts and wrapper scripts can be deleted. It's a lot more work, but it's a lot more rewarding.

fanf: (Default)

Now that I have a blogging platform at work, I'm going to use that as the primary place for my work-related articles. I've just added some notes on my server upgrade project which got awkwardly interrupted by the holidays: https://www.dns.cam.ac.uk/news/2019-01-02-upgrade-notes.html

(Edited to add) and I have got the Atom feed syndicated by Dreamwidth at [syndicated profile] dns_cam_feed

fanf: (Default)

This was a good way to eat up the left-over veg from xmas day.

Yesterday we had chicken with roast potatoes, leek sauce, peas, and pigs in blankets - not a complicated meal because there was just the four of us.

I cooked a 1kg bag of roasties, which we ate a bit more than half of, and I made leek sauce with two leeks of which about half was left over, plus about 100g of left-over peas. Obviously something like bubble and squeak or veggie potato fritter things is the way to use them up.

Here are some outline recipes. No quantities, I'm afraid, because I am not a precision cook.

Leek sauce

I dunno how well-known this is, but it's popular in my family.

Slice a couple of leeks into roundels, and gently fry them in LOTS of butter until they are soft.

Then make a white sauce incorporating the leeks. So add some flour, and stir it all together, making a leeky roux.

Then add milk a glug at a time, stirring to incorporate in between (it's much more forgiving than a plain white sauce) until it is runny. Season with nutmeg and black pepper.

Keep stirring while it cooks and the sauce thickens.

Ham

I like to cook a ham like this on Boxing Day.

Get a nice big gammon, which will fit in one of your pots while covered in liquid. It's best to soak it in cold water in the fridge overnight, to make it less salty.

Drain it and rinse it, then put it on the hob to simmer in cider (diluted to taste / budget with apple juice and / or water). I added some peppercorns and bay leaves, but there are lots of spice options.

For the glaze I mix the most fierce mustard I have with honey, and (after peeling off the skin) paint it all over the gammon. Then roast in an oven long enough to caramelize the glaze (15 minutes ish).

Let the ham stand - it doesn't need to be hot when you serve it.

Cider gravy

In the past I have usually neglected to soak the ham before cooking, so the cider ended up impossibly salty afterwards. This time I tried reducing the cooking liquid to see if I could make a cider gravy, but it was still too salty.

Apparently I could try making a potato soup with it because that neutralizes the salt. I think I will try this because we also have loads of (unsalted) chicken stock to be used.

Bubble and squeak fritters

This might have worked better if I had got the leftovers out of the fridge some time before preparation, so that they could warm up and soften!

I mashed the potatoes, then mixed them together with the leek sauce (which I zapped in the microwave to make it less solid!), and finally added the peas (to avoid smashing them up too much). I also added an egg, but actually the mixture had enough liquid already and the egg made it a bit too soft.

After some experimentation I found that the best way to cook it was to put a dollop of mixture directly in the frying pan with a wooden spoon, so (after flattening) the fritters were about 2cm thick and a bit smaller than palm sized. I could comfortably cook a few at a time.

I fried them in oil in a moderately hot pan, so they were hot all the way through and browned nicely on the outside.

Verdict

I had just enough fritters to feed three adults and a child, and most of the ham has gone! Enough left for a few sandwiches, I think.

I will try to aim for left-over leek sauce and potatoes more often :-)

fanf: (Default)

Cambridge University's official web templates come in a variety of colour schemes that are generally quite garish - see the links in the poll below for examples. I've nearly got my new web site ready to go (no link, because spoilers) and I have chosen two of the less popular colour schemes: one that I think is the most alarming (and I expect there would be little disagreement about that choice) and one that I think is the most queer.

Do you think what I think?

Poll #20722 Queerest colours
This poll is anonymous.
Open to: Registered Users, detailed results viewable to: All, participants: 21

Which colour scheme is the most queer?

Blue: 0 (0.0%)
Turquoise: 1 (4.8%)
Purple: 16 (76.2%)
Green: 0 (0.0%)
Orange: 1 (4.8%)
Red: 1 (4.8%)
Grey: 0 (0.0%)
Tickybox!: 6 (28.6%)

(This is a bit off-brand for my usual Dreamwidth posts - my fun stuff usually happens on Twitter. But Twitter's polls are too small!)

fanf: (Default)

(Fri Sat Sun Mon Tue Wed Thu)

Very belated, but there was not a great deal to report from the last day of the RIPE meeting and I have spent the last few days doing Other Things.

One of the more useful acronyms I learned was the secretly humorous "DONUTS": DNS Over Normal Unencrypted TCP Sessions.

Following Thursday's presentation by Jen Linkova on IETF IPv6 activity, Sander Steffann did a lightning talk about the IPv6-only RA flag. There was quite a lot of discussion, and it was generally agreed to be a terrible idea: instead, operators who want to suppress IPv4 should use packet filters on switches rather than adding denial-of-service features to end hosts.

Amanda Gowland gave a well-received talk on the women's lunch and diversity efforts in general. There was lots of friendly amusement about there being no acronyms (except for "RIPE").

Razvan Oprea gave the RIPE NCC tech report on the meeting's infrastructure:

  • 10Gbit connection to the meeting - "you haven't used much of it to be honest, so you need to try harder" - peaking at 300Mbit/s

  • 800 simultaneous devices on the wireless net

  • They need more feedback on how well the NAT64 network works

  • There were a few devices using DNS-over-TLS on the Knot resolvers

One of the unusual and popular features of RIPE meetings is the real-time captioning produced by a small team of stenographers. In addition to their normal dictionary of 65,000 common English words, they have a custom dictionary of 36,000 specialized technical terms and acronyms. Towards the end of the week they relaxed a bit and in the more informal parts of the meeting (especially when they were being praised) they talked back via the steno transcript display :-) (Tho those parts aren't included in the steno copy on the web).

That's about it for the daily-ish notes. Now to distill them into an overall summary of the week for my colleagues...

fanf: (Default)

(Fri Sat Sun Mon Tue Wed)

I'm posting these notes earlier than usual because it's the RIPE dinner later. As usual there are links to the presentation materials from the RIPE77 meeting plan.

One hallway conversation worth noting: I spoke to Colin Petrie of RIPE NCC who mentioned that they are rebooting the Wireless APs every day because they will not switch back to a DFS channel after switching away to avoid radar interference, so they gradually lose available bandwidth.

DNS WG round 2

Anand Buddhdev - RIPE NCC update

  • k-root: 80,000 qps, 75% junk, 250 Mbit/s on average, new 100Gbit/s node

  • RIPE has a new DNSSEC signer. Anand gave a detailed examination of the relative quality of the available solutions, and explained why they chose Knot DNS. Their migration is currently in progress using a key rollover.

  • Anand also spoke supportively about CDS/CDNSKEY automation

Ondřej Caletka - DS updates in the RIPE DB

  • Some statistics from the RIPE database to help inform decisions about CDS automation.

Benno Overeinder - IETF DNSOP update

  • Overview of work in progress, including ANAME. I spoke at the mic to explain that there is a "camel-sensitive" revamped draft that has not yet been submitted

  • Matthijs Mekking has started a prototype provisioning-side implementation of ANAME https://github.com/matje/anamify

Sara Dickinson - performance of DNS over TCP

  • With multithreading, TCP performance is 67% of UDP performance for Unbound, and only 25% for BIND

  • Current DNS load generation tools are not well suited to TCP, and web load generation tools also need a lot of adaptation (e.g. lack of pipelining)

  • There's a lack of good models for client behaviour, which is much more pertinent for TCP than UDP. Sara called for data collection and sharing to help this project.

Petr Špaček - DNSSEC and geoIP in Knot DNS

  • Details of how this new feature works, with performance numbers. Petr emphasized how this kind of thing is outside the scope of current DNS standards. It's kind of relevant to ANAME because many existing ANAME-like features are coupled to geoIP features. I've been saying to several people this week that the key challenge in the ANAME spec is to have a clearly described and interoperable core, which also allows tricks like these.

Ondřej Surý - ISC BIND feature telemetry

  • Ondřej asked for the general opinion on adding a phone-home feature to BIND, which would allow ISC to find out which features people are not using and which could therefore be removed.

  • NLnet Labs and CZ.NIC said they were also interested in this idea; PowerDNS is already doing this and their users like the warnings about security updates being available.

Open Source

Sasha Romijn on IRRd v4

  • Nice to hear a success story about storing JSON in PostgreSQL

  • RPSL has horrid 822 line continuations and interleaved comments, oh dear!

Mircea Ulinic (Cloudflare) Salt + Napalm for network automation

  • Some discussion about why they chose Salt: others "not event-driven nor data-driven"

Andy Wingo - a longer talk about Snabb - choice quotes:

  • "rewritable software"

  • "network functions in the smallest amount of code possible"

Peter Hessler on OpenBSD and OpenBGPD - a couple of notable OpenBSD points

  • they now have zero ROP gadgets in libc on arm64

  • they support arbitrary prefix length for SLAAC

Martin Hoffman - "Oxidising RPKI" - NLnet Labs Routinator 3000 written in Rust:

  • write in C? "why not take advantage of the last 40 years of progress in programming languages?"

IPv6

Jen Linkova on current IETF IPv6 activity:

  • IPv6 only RA flag

  • NAT64 prefix in RA

  • path MTU discovery "a new hope?" - optional packet truncation and/or MTU annotations in packet header

  • Indefensible Neighbour Discovery - Jen recommends this summary of mitigations for layer 2 resource exhaustion

Oliver Gasser on how to discover IPv6 addresses:

  • You can't brute-force scan IPv6 like you can IPv4 :-)

  • Use a "hitlist" of known IPv6 addresses instead, obtained from DNS, address assignment policies, crowdsourcing, inferring nearby addresses, ...

  • It's possible to cover 50% of prefixes using their methods

  • Cool use of entropy clustering to discover IPv6 address assignment schemes.

Jens Link talked about IPv6 excuses, and Benedikt Stockebrand talked about how to screw up an IPv6 addressing plan. Both quite amusing and polemical :-)

fanf: (Default)

I was out late last night so I'm writing yesterday's notes this morning.

Yesterday I attended the DNS and MAT meetings, and did some work outside the meetings.

CDS

Ondřej Caletka presented his work on keeping DNS zone files in git.

  • Lots of my favourite tools :-) Beamer, Gitolite, named-compilezone

  • How to discover someone has already written a program you are working on: search for a name for your project :-)

BCP 20 classless in-addr.arpa delegation led to problems for Ondřej: RFC2317 suggests putting slashes in zone names, which causes problems for tools that want to use zone names for file names. In my expired RFC2317bis draft I wanted to change the recommendation to use dash ranges instead, which better matches BIND's $GENERATE directive.

At the end of his talk, Ondřej mentioned his work on automatically updating the RIPE database using CDS records. As planned, I commented in support, and afterwards I sent a message to the dns-wg mailing list about CDS to get the formal process moving.

DNS tooling

I spoke to Florian Streibelt who did the talk on BGP community leaks on Tuesday. I mentioned my DNS-over-TLS measurements; he suggested looking for an uptick after christmas, and that we might be able to observe some interesting correlations with MAC address data, e.g. identifying manufacturer and age using the first 4 octets of the MAC address. It's probably possible to get some interesting results without being intrusive.

I spent some time with Jerry Lundstrom and Petr Špaček to have a go at getting respdiff working, with a view to automated smoke testing during upgrades, but I ran out of battery :-) Jerry and Petr talked about improving its performance: the current code relies on multiple python processes for concurrency.

I talked to them about whether to replace the doh101 DNS message parser (because deleting code is good): dnsjit message parsing code is C so it will require dynamic linking into nginx, so it might not actually simplify things enough to be worth it.

DNS miscellanea

Ed Lewis (ICANN) on the DNSSEC root key rollover

Petr Špaček (CZ.NIC) on the EDNS flag day, again

  • "20 years is enough time for an upgrade"

Ermias Malelgne - performance of flows in cellular networks

  • DNS: 2% of lookups fail, 15% experience loss - appalling!

Tim Wattenberg - global DNS propagation times

Other talks

Maxime Mouchet - learning network states from RTT

  • traceroute doesn't explain some of the changes in delay

  • nice and clever analysis

Trinh Viet Doan - tracing the path to YouTube: how do v4 and v6 differ?

  • many differences seem to be due to failure to dual-stack CDN caches in ISP networks

Kevin Vermeulen - multilevel MDA-lite Paris traceroute

  • MDA = multipath detection algorithm

  • I need to read up on what Paris traceroute is ...

  • some informative notes on difficulties of measuring using RIPE Atlas due to NATs messing with the probe packets

fanf: (Default)

The excitement definitely caught up with me today, and it was a bit of a struggle to stay awake. On Monday I repeated the planning error I made at IETF101 and missed a lie-in which didn't help! D'oh! So I'm having a quiet evening instead of going to the RIPE official nightclub party.

Less DNS stuff on the timetable today, but it has still been keeping me busy:

CDS

During the DNS-OARC meeting I spoke to Ondřej Caletka of CESNET (the Czech national academic network) about his work on automatically updating DS records for reverse DNS delegations in the RIPE database. He had some really useful comments about the practicalities of handling CDS records and how dnssec-cds does or does not fit into a bigger script, which is kind of important because I intended dnssec-cds to encapsulate the special CDS validation logic in a reusable, scriptable way.

Today Anand Buddhdev of RIPE NCC caught me between coffees to give me some sage advice on how to help get the CDS automation to happen on the parent side of the delegation, at least for the reverse DNS zones for which RIPE is the parent.

The RIPE vs RIPE NCC split is important for things like this: As I understand it, RIPE is the association of European ISPs, and it's a consensus-driven organization that develops policies and recommendations; RIPE NCC is the secretariat or bureaucracy that implements RIPE's policies. So Anand (as a RIPE NCC employee) needs to be told by RIPE to implement CDS checking: he can't do it without prior agreement from his users.

So I gather there is going to be some opportunity to get this onto the agenda at the DNS working group meetings tomorrow and on Thursday.

ANAME

As planned, I went through Matthijs's comments today, and grabbed some time to discuss where clarification is needed. There are several points in the draft which are really matters of taste, so it'll be helpful if I note them in the draft as open to suggestions. But there are other aspects that are core to the design, so it's really important (as Evan told me) to make it easy for readers to understand them.

Jon Postel

Today was the 20th anniversary of Jon Postel's death.

Daniel Karrenberg spoke about why it is important to remember Jon, with a few examples of his approach to Internet governance.

RFC 2468 - "I remember IANA"

Women in Tech

I skipped the Women in Tech lunch, even though Denesh suggested I could go - I didn't want to add unnecessary cis-male to a women's space. But I gather there were some good discussions about overthrowing the patriarchy, so I regret missing an opportunity to learn by listening to the arguments.

VXLAN / EVPN / Geneve

Several talks today about some related networking protocols that I am not at all familiar with.

The first talk by Henrik Kramshoej on VXLAN injection attacks looks like it is something my colleagues need to be aware of (if they are not already!)

The last talk was by Ignas Bagdonas on Geneve, which is a possible replacement for VXLAN. The most informative question was "why not MPLS?" and the answer seemed to be that Geneve (like VXLAN) is supposed to be easier since it includes more of the control plane as part of the package.

Flemming Heino from LINX talked about "deploying a disaggregated network model using EVPN technology". This was interesting because of the discussion of the differences between data centre networks and exchange point networks. I think the EVPN part was to do with some of the exchange point features, which I didn't really understand. The physical side of their design is striking, though: 1U switches, small number of SKUs, using a leaf + spine design, with a bit of careful traffic modelling, instead of a big chassis with a fancy backplane.

Other talks

At least two used LaTeX Beamer :-)

Lorenzo Cogotti on the high performance isolario.it BGP scanner

  • "dive right into C which is not pleasant but necessary"

  • keen on C99 VLAs!

  • higher level wrappers allow users to avoid C

Florian Streibelt - BGP community attacks

  • 14% of transit providers propagate BGP communities which is enough to propagate widely because the network is densely connected

  • high potential for attack!

  • leaking community 666 remotely-triggered black hole; failing to filter 666 announcements

  • he provided lots of very good motivation for his safety recommendations

Constanze Dietrich - human factors of security misconfigurations

  • really nice summary of her very informative research

Niels ten Oever - Innovation and Human Rights in the Internet Architecture

  • super interesting social science analysis of the IETF

  • much more content in the talk than the slides, so it's probably worth looking at the video (high bandwidth talking!)

Tom Strickx - Cloudflare - fixing some anycast technical debt

  • nice description of a project to overhaul their BGP configuration

Andy Wingo - 8 Ways Network Engineers use Snabb

  • nice overview of the Lua wire-speed software network toolkit project started by Luke Gorrie

  • I had a pleasant chat with Andy on the sunny canalside

fanf: (Default)

Same city, same hotel, same lunch menu, but we have switched from DNS-OARC (Fri Sat Sun) to RIPE, which entails a huge expansion in the breadth of topics and number of people. The DNS-OARC meeting was the biggest ever with 197 attendees; the RIPE meeting has 881 registrations and at the time of the opening plenary there were 514 present. And 286 first-timers, including me!

I have some idea of the technical side of RIPE meetings because I have looked at slides and other material during previous meetings - lots of great stuff! But being here in person it is striking how much of an emphasis there is on social networking as well as IP networking: getting to know other people doing similar things in other companies in other countries seems to be a really important part of the meeting.

I have met several people today who I only know from mailing lists and Twitter, and they keep saying super nice things :-)

I'm not going to deep link to each presentation below - look in the RIPE77 meeting programme for the presentation materials.

DoT / DoH / DoQ

The DNS continues to be a major topic :-)

Sara Dickinson did her "DNS, Jim, but not as we know it" talk to great acclaim, and widespread consternation.

Ólafur Guðmundsson did a different talk, called "DNS over anything but UDP". His main point is that DNS implementations have appallingly bad transport protocol engineering, compared to TCP or QUIC. This affects things like recovery from packet loss, path MTU discovery, backpressure, and many other things. He argues that the DNS should make use of all the stateful protocol performance and scalability engineering that has been driven by the Web.

Some more or less paraphrased quotes:

  • "I used to be a UDP bigot - DNS would only ever be over UDP - I was wrong"

  • "DoH is for reformed script kiddies who have become application developers"

  • "authenticated connections are the only defence against route hijacks"

  • "is the community ready if we start moving 50% - 60% of DNS traffic over to TCP?"

I've submitted my lightning talk again, though judging from this afternoon's talks it is perhaps a bit too brief for RIPE's 10-minute lightning talk slot.

ANAME

Matthijs Mekking read through the ANAME draft and came back with lots of really helpful feedback, with plenty of good questions about things that are unclear or missing.

It might be worth finding some time tomorrow to hammer in some revisions...

Non-DNS things

First presentation was by Thomas Weible from Flexoptix on 400Gb/s fibre.

  • lovely explanation of how eye diagrams show signal clarity! I did not previously understand them and it was delightful to learn!

  • lots of details about transceiver form factors

  • initial emphasis seems to be based on shorter distance limits, because that is cheaper

Steinhor Bjarnason from Arbor talked about defending against DDoS attacks.

  • scary "carpet bombing", spreading DDoS traffic across many targets so bandwidth is low enough not to trigger alarms but high enough to cause problems and really hard to mitigate

  • networks should rate-limit IP fragments, except for addresses running DNS resolvers [because DNS-over-UDP is terrible]

  • recommended port-based rate-limiting config from Job Snijders and Jared Mauch

Hisham Ibrahim of RIPE NCC on IPv6 for mobile networks

  • it seems there is a lot of confusion and lack of confidence about how to do IPv6 on mobile networks in Europe

  • we are well behind the USA and India

  • how to provide "best current operational practice" advice?

  • what to do about vendors that lie about IPv6 support (no naming and shaming happened but it sounds like many of the people involved know who the miscreants are)

fanf: (Default)

Up at the crack of dawn for the second half of the DNS-OARC workshop. (See the timetable for links to slides etc.) The coffee I bought yesterday morning made a few satisfactory cups to help me get started.

Before leaving the restaurant this evening I mentioned writing my notes to Dave Knight, who said his approach is to incrementally add to an email as the week goes on. I kind of like my daily reviews for remembering interesting side conversations, which are a major part of the value of attending these events in person.

DoT / DoH

Sara Dickinson of Sinodun did a really good talk on the consequences of DNS encryption, with a very insightful analysis of the implications for how this might change the architectural relationships between the web and the DNS.

DNS operators should read RFC 8404 on "Effects of Pervasive Encryption on Operators". (I have not read it yet.)

Sara encouraged operators to implement DoT and DoH on their resolvers.

My lightning talk on DoT and DoH at Cambridge was basically a few (very small) numbers to give operators an idea of what they can expect if they actually do this. I'm going to submit the same talk for the RIPE lightning talks session later this week.

I had some good conversations with Baptiste Jonglez (who is doing a PhD at Univ. Grenoble Alpes) and with Sara about DoT performance measurements. At the moment BIND doesn't collect statistics that allow me to know interesting things about DoT usage like DoT query rate and timing of queries within a connection. (The latter is useful for setting connection idle timeouts.) Something to add to the todo list...

CNAME at apex

Ondřej Surý of ISC.org talked about some experiments to find out how much actually breaks in practice if you put a CNAME and other data at a zone apex. Many resolvers break, but surprisingly many resolvers kind of work.

Interestingly, CNAME+DNAME at the same name is pretty close to working. This has been discussed in the past as "BNAME" (B for both) with the idea of using it for completely aliasing a DNS subtree to cope with internationalized domain names that are semantically equivalent but have different Unicode encodings (e.g. ss / ß). However the records have to be put in the parent zone, which is problematic if the parent is a TLD.

The questions afterwards predictably veered towards ANAME and I spoke up to encourage the audience to take a look at my revamped ANAME draft when it is submitted. (I hope to do a submission early this week to give it a wider audience for comments before a revised submission near the deadline next Monday.)

Tale Lawrence mentioned the various proposals for multiple queries in a single DNS request as another angle for improving performance. (A super simplified version of this is actually a stealth feature of the ANAME draft, but don't tell anyone.)

I spoke to a few people about ANAME today and there's more enthusiasm than I feared, though it tends to be pretty guarded. So I think the draft's success really depends on getting the semantics right.

C-DNS / dnstap

Early in the morning, Jim Hague, also of Sinodun, talked about C-DNS, which is a compressed DNS packet capture format used for DITL ("day in the life" or "dittle") data collection from ICANN L-root servers. (There was a special DITL collection for a couple of days around the DNSSEC key rollover this weekend.)

C-DNS is based on CBOR which is a pretty nice IETF standard binary serialization format with a very JSON-like flavour.

Jim was talking partly about recent work on importing C-DNS data into the ClickHouse column-oriented SQLish time-series database.

I'm vaguely interested in this area because various people have made casual requests for DNS telemetry from my servers. (None of them have followed through yet, so I don't do any query data collection at the moment.) I kind of hoped that dnstap would be a thing, but the casual requests for telemetry have been more interested in pcaps. Someone (I failed to make a note of who, drat) mentioned that there is a dnstap fanout/filter tool, which was on my todo list in case we ever needed to provide multiple feeds containing different data.

I spoke to Robert Edmonds (the dnstap developer, who is now at Fastly) who thinks in retrospect that protobufs was an unfortunate choice. I wonder if it would be a good idea to re-do dnstap using uncompressed C-DNS for framing, but I didn't manage to talk to Jim about this before he had to leave.

DNS Flag day

A couple of talks on what will happen next year after the open source DNS resolvers remove their workaround code for broken authoritative servers. Lots of people collaborating on this including Sebastián Castro (.nz), Hugo Salgado (.cl), Petr Špaček (.cz).

Their analysis is rapidly becoming more informative and actionable, which is great. They have a fairly short list of mass hosting providers that will be responsible for the vast majority of the potential breakage, if they aren't fixed in time.

Smaller notes

Giovane Moura (SIDN) - DNS Defenses During DDoS

  • also to appear at SIGCOMM

  • headline number on effectiveness of DNS caches: 70% hit rate

  • query amplification during an outage can be 8x - unbound has mitigation for this which I should have a look at.

Duane Wessels (Verisign) - zone digests

  • really good slide on channel vs data security

  • he surprised me by saying there is no validation for zone transfer SOA queries - I feel I need to look at the code but I can imagine why it works that way...

  • zone digests potentially great for safer stealth secondaries which we have a lot of in Cambridge

  • Petr Spacek complained about the implementation complexity ... I wonder if there's a cunning qp hack to make it easier :-)

Peter van Dijk (PowerDNS) - NSEC aggressive use and TTLs

  • there are now three instead of two TTLs that affect negative cacheing: SOA TTL, SOA MINIMUM, plus now NSEC TTL.

  • new operational advice: be careful to make NSEC TTL and SOA negative TTLs match!

fanf: (Default)

During IETF 101 I wrote up the day's activity on the way home on the train, which worked really well. My vague plan of doing the same in Amsterdam might be less successful because it's less easy to hammer out notes while walking across the city.

Some semi-chronological notes ... I'm going to use the workshop programme as a starting point, but I'm not going to write particularly about the contents of the talks (see the workshop web site for that) but more about what I found relevant to me and the follow-up hallway discussions.

Morning

The first half of the morning was CENTR members only, so I went to get a few bits from the local supermarket via the "I Amsterdam" sign before heading to the conference hotel.

The second half of the morning was DNS-OARC business. It's a small organization that provides technical infrastructure, so most of the presentation time was about technical matters. The last session before lunch was Matt Pounsett talking about sysadmin matters. He's impressively enthusiastic about cleaning up a mountain of technical debt. We had a chat at the evening social and I tried to convince him that Debian without systemd is much less painful than the Devuan he has been struggling with.

Jerry Lundstrom talked about his software development projects. A couple of these might be more useful to me than I was previously aware:

  • drool - DNS replay tool. I have a crappy dnsmirror script which I wrote after my previous test procedure failed to expose CVE-2018-5737 because I wasn't repeating queries.

    drool respdiff sounds like it might be a lot of what I need to automate my testing procedures between deploying to staging and production.

  • dnsjit - suite of DNS-related tools in Lua. If this has the right facilities then I might be able to delete the horrid DNS packet parsing code from doh101.

Cloudflare DoH and DoT

Ólafur Guðmundsson said a few noteworthy things:

  • Cloudflare's DoT implementation is built-in to Knot Resolver. They are heavy users of my qp trie data structure, and at the evening social Petr Špaček told me they are planning to merge the knot-resolver-specific fixes with my beer festival COW work.

  • Their DoH implementation uses Lua scripting in NGINX which sounds eerily familiar :-) (Oversimplifying to the point of being wrong, Cloudflare's web front end is basically OpenResty.)

  • He mentioned a problem with hitting quotas on the number of http2 channels opened by Firefox clients, which I need to double check.

  • Cloudflare are actively working on DoT for recursive to authoritative servers. Sadly, although the IETF DNS Privacy working group has been discussing this recently, there hasn't been much comment from people who are doing practical work. Cloudflare likes the advantages of holding persistent connections from their big resolvers to popular authoritative servers, which is basically optimizing for a more centralized web. It's the modern version of leased lines between mainframes.

I submitted my DoH / DoT lightning talk (2 slides!) to the program committee since it includes stuff which Ólafur didn't mention that is relevant to other recursive server operators.

ANAME

I merged my revamped draft into Evan Hunt's aname repository and collaboration is happening. I've talked about it with at least one big DNS company who have old proprietary ANAME-like facilities, and they are keen to use standardization as a push towards removing unused features and cleaning up other technical debt. I've put some "mighty weasel words" in the draft, i.e. stealing the "as if" idea from C, with the idea that it gives implementers enough freedom to make meaningful zone file portability possible, provided the zone is only relying on some common subset of features.

Other small notes

Matt Larson (ICANN) - rollover was a "pleasant non-event"

  • several ICANN staff are at DNS-OARC, so they travelled to NL before the rollover, and used a conference room at NLnet Labs as an ad-hoc mission control room for the rollover

Matt Weinberg (Verisign) - bad trust anchor telemetry signal investigation - culprit was Monero! I wonder if this was what caused Cambridge's similar weirdness back in June.

Duane Wessels (Verisign) talked about abolishing cross-TLD glue. Gradually the registry data model becomes less insane! [fanf's rant censored]

Jaromír Talíř (CZ.NIC) on TLD algorithm rollovers. Don't fear ECDSA P256! I definitely want to do this for Cambridge's zones.

Bert Hubert (PowerDNS) talked about the DNS Camel and his Hello DNS effort to make the RFCs easier to understand. He's very pleased with the name compression logic he worked out for tdns, a teaching DNS server. I want to have a proper look at it...

fanf: (Default)

Today I travelled to Amsterdam for a bunch of conferences. This weekend there is a joint DNS-OARC and CENTR workshop, and Monday - Friday there is a RIPE meeting.

The DNS Operations Analysis and Research Centre holds peripatetic workshops a few times a year, usually just before an ICANN or RIR meeting. They are always super interesting and relevant to my work, but usually a very long way away, so I make do with looking at the slides and other meeting materials from afar.

CENTR is the association of European country-code TLD registries. Way above my level :-)

RIPE is the regional Internet registry for Europe and the Middle East. Earlier this year, the University of Cambridge became a local Internet registry (i.e. an organization responsible for sub-allocating IP address space) and new members get a couple of free tickets to a RIPE meeting. RIPE meetings also usually have a lot of interesting presentations, and there's a DNS track which is super relevant to me.

I haven't been to any of these meetings before, so it's a bit of an adventure, though I know quite a lot of the people who will be here from other meetings! This week I've been doing some bits and bobs that I hope to talk about with other DNS people while I am here.

doh101

Last month I deployed DNS-over-TLS and DNS-over-HTTPS on the University's central DNS resolvers. This turned out to be a bit more interesting than expected, mainly because a number of Android users started automatically using it straight away. Ólafur Guðmundsson from Cloudflare is talking about DoH and DoT tomorrow, and I'm planning to do a lightning talk on Sunday about my experience. So on Wednesday I gathered some up-to-date stats, including the undergraduates who were not yet around last month.

(My DoT stats are a bit feeble at the moment because I need full query logs to get a proper idea of what is going on, but they are usually turned off.)

Rollover

Yesterday evening was the belated DNSSEC root key rollover. There are some interesting graphs produced by [SIDN Labs](https://www.sidnlabs.nl/) and NLnet Labs on the NLnet Labs home page. These stats are gathered using RIPE Atlas which is a distributed Internet measurement platform.

I found the rollover very distracting, although it was mostly quite boring, which is exactly what it should be!

ANAME

The IETF dnsop working group is collectively unhappy with the recently expired ANAME draft - including at least some of the draft authors. Since this is something dear to my heart (because web site aliases are one of the more troublesome parts of my job, and I want better features in the DNS to help make the trouble go away) I spent most of this week turning my simplified ANAME proposal into a proper draft.

I'm hoping to discuss it with a few people (including some of the existing draft authors) with the aim of submitting the revised draft before the deadline on Monday 22nd for next month's IETF meeting.

fanf: (Default)

This afternoon I reckon I was six deep in a stack of yaks that I needed to shave to finish this job, and four of them turned up today. I feel like everything I try to do reveals some undiscovered problem that needs fixing...

  • When the network is a bit broken, my DNS servers soon stop being able to provide answers, because the most popular sites insist on tiny TTLs so they can move fast and break things.

    As a result the DNS gets the blame for network problems, and helpdesk issues get misdirected, and confusion reigns.

  • Serve-Stale to the rescue! It was implemented towards the end of last year in BIND and is a feature of the 9.12 releases.

    • Let's deploy it! First attempt in March with 9.12.1.

    • CVE-2018-5737 appears!

      Roll back!

    • The logging is too noisy for production so we need to wait for 9.12.2 which includes a separate logging category for serve-stale.

    • Time passes...

    • Deploy 9.12.2 earlier this week, more carefully.

    • Let's make sure everything is sorted before we turn on serve-stale again! (Now we get to today.)

      • The logging settings need revising: serve-stale is enough of a shove to make it worth reviewing other noisy log categories.

      • Can we leave most of them off most of the time, and use the default-debug category to let us turn them on when necessary?

      • This means the debug 1 level needs to be not completely appalling. Let's try it!

        • Hmm, this RPZ debug log looks a bit broken. Let's fix it!

        • Two little patches, one cosmetic, one a possible minor bug fix.

          • Need to rebase my hack branch onto master to test the patches.

          • Fix dratted merge conflicts.

        • Build patched server!

          • Build fails :-( why?

          • No enlightenment from commit logs.

          • Sigh, let's git bisect the build system to work out which commit broke things...

            • While the workstation churns away repeatedly building BIND, let's get coffee!
          • Success! The culprit is found!

          • Submit bug report

          • Work around bug, and get a successful build!

        • Test patched server!

          • The little patches seem OK, but while repeatedly restarting the server, a more worrying bug turns up!

            Sometimes when the server starts, my monitoring queries get stuck with SERVFAIL responses when they should succeed! Why?

          • Really don't want this to be anything that might affect production, so it needs investigation.

          • Turn off noisy background activity, and reproduce the problem with a simpler query stream. It's still hard to characterize the bug.

            • I'll need to test this in a less weird and more easily reconfigured server than my toy server. Let's spin up a VM.

              • Damnit, my virtualbox setup was broken by the jessie -> stretch upgrade!

              • Work out that this is because virtualbox is no longer included in stretch and the remnants from jessie are not compatible with the stretch kernel.

              • Reinstall virtualbox direct from Oracle. It now works again.

            • Install BIND on the new VM with a simplified version of my toy config. Reproduce the bug.

          • Is it related to serve-stale? no. QNAME minimization? no. RPZ? no.

          • After much headscratching and experimentation, enlightenment slowly, painfully dawns.

          • Submit bug report

            Actually, the writing of the bug report, and especially the testing of the unfounded assertions and guesses as I wrote it, was a key part of pinning down this weirdness.

            I think this is one of the most obscure DNS interoperability problems I have investigated!

OK, that's it for now. I still have two patches to submit, and a revised logging configuration to finalize, so I can put serve-stale into production, so I can make it easier in some situations for my colleagues to tell the difference between a network problem and a DNS problem.

fanf: (Default)

This is a follow-up to my unfinished series of posts last month.

(Monday's notes) (Tuesday's notes) (Wednesday's notes) (Thursday's notes)

On the Friday of the beer festival I found myself rather worn out. I managed to write a missing function (delete an element copy-on-write style) but that was about it.

When I got back to work after the bank holiday there was a bunch of stuff demanding more urgent attention so I wasn't able to find time to finish the qp trie hacking until this week.

Testing

The nice thing about testing data structures is you can get a very long way with randomized testing and a good set of invariant checks.

When there's a bug I tend to rely on voluminous dumps of how the trie structure changes as it mutates, with pointer values so I can track allocation and reuse. I stare at them wondering how the heck that pointer got where it shouldn't be, until enlightenment dawns.

Bugs

There were a number of notable bugs:

  • Another variable rename error from the big refactor. I think that was the last refactoring bug. I got away with that pretty well, I think :-)

  • Memory leaks in the commit/rollback functions. Remember to free the top-level structure, dumbass!

  • COW pushdown works on a "node stack" structure, which is a list of pointers to the trie nodes on the spine from the root to the leaf of interest. Pushdown involves making a copy of each branch node, so that the copies can be exclusively owned by the new trie where they are safe to mutate. The bug was that the pushdown function didn't update the child pointer in the node stack to point to the new copy instead of the old one. A relatively small oversight which caused interesting corruption and much staring at runic dump output.

  • During my Beer Festival week thinking, I completely forgot to consider the string keys. The Knot qp trie code makes copies of the keys for its own use, so it needs to keep track of their sharing state during a COW transaction so that they can be freed at the right time. This was quite a big thing to forget! Fortunately, since the keys are owned by the qp trie code, I could change their shape to add a COW flag and fix the use-after-free and memory leak bugs...

Submission

Having dealt with those, I have at last submitted my patches to CZ.NIC! There is still a lot of work to do, changing the DNS-specific parts of Knot so that UPDATE/IXFR transactions use the COW API instead of a copy of the entire zone.

One thing I'm not entirely sure about is whether I have been working with a valid memory model; in particular I have assumed that it's OK to flip COW flags in shared words without any interlocks. The COW flags are only written in a single threaded way by the mutation thread; the concurrent read-only threads pay no attention to the COW flags, though they read the words that hold the flags.

If this is wrong, I might need to sprinkle some RCU calls through the qp trie implementation to ensure it works correctly...
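For what it's worth, here is a hypothetical sketch (not the Knot code; all the names and the layout are made up) of how the flag flips could be made explicit under the C11 memory model if plain stores turn out not to be good enough: declare the word that carries the COW flag as atomic, flip the flag with a relaxed read-modify-write in the single mutation thread, and have the read-only threads load the word with relaxed loads.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical sketch, not the Knot code. With the word holding the
     * COW flag declared _Atomic, the single writer can flip the flag
     * with a relaxed read-modify-write, and the concurrent read-only
     * threads can load the whole word with relaxed loads, so there is
     * no data race in the C11 sense even though no ordering is imposed. */

    #define COW_FLAG UINT64_C(1)

    typedef struct branch {
        _Atomic uint64_t index;  /* bitmap / offset / COW flag packed together */
        void *twigs;
    } branch;

    /* mutation thread only */
    void cow_mark(branch *b) {
        atomic_fetch_or_explicit(&b->index, COW_FLAG, memory_order_relaxed);
    }

    /* read-only threads ignore the flag but read the rest of the word */
    uint64_t index_word(const branch *b) {
        return atomic_load_explicit(&b->index, memory_order_relaxed);
    }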

fanf: (Default)

(Monday's notes) (Tuesday's notes) (Wednesday's notes) (Epilogue)

Today's hacking was mixed, a bit like the weather! At lunch time I hung out at the beer festival with the Collabora crowd in the sun, and one of my arms got slightly burned. This evening I went to the pub as usual (since the beer festival gets impossibly rammed on Thursday and Friday evenings) and I'm now somewhat moist from the rain.

Non-conformant C compilers

My refactoring seems to have been successful! I only needed to fix a few silly mistakes to get the tests to pass, so I'm quite pleased.

But the last silly mistake was very annoying.

As part of eliminating the union, I replaced expressions like t->branch.twigs with an accessor function twigs(t). However twigs was previously used as a variable name in a few places, and I missed out one of the renames.

Re-using a name like this during a refactoring is asking for a cockup, but I thought the compiler would have my back because they had such different types.

So last night's messy crash bug was caused by a line vaguely like this:

    memmove(nt, twigs + s, size);

When twigs is a function pointer, this is clearly nonsense. And in fact the C standard requires an error message for arithmetic on a function pointer. (See the constraints in section 6.5.6 of C99:TC3.) But my code compiled cleanly with -Wall -Wextra.

Annoyingly, gcc's developers decided that pointer arithmetic is such a good idea that it ignores this requirement in the standard unless you tell it to be -pedantic or you enable -Wpointer-arith. And in the last year or so I have lazily stopped using $FANFCFLAGS since I foolishly thought -Wall -Wextra covered all the important stuff.
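Here's a minimal illustration of the trap (nothing to do with the real qp trie code; the names are just echoes of mine): gcc -Wall -Wextra compiles it silently, because arithmetic on a function pointer is a GNU extension that treats the function as having size 1, and only -Wpointer-arith or -pedantic produces the diagnostic the standard requires.

    /* sketch.c - illustrative only, not the real code */
    static int twigs(void) { return 0; }   /* accessor that shadowed an old variable name */

    long offset_bug(long s) {
        /* Arithmetic on a function pointer: a constraint violation in
         * ISO C (C99 6.5.6), but gcc quietly accepts it as an extension
         * under -Wall -Wextra; -Wpointer-arith or -pedantic flags it. */
        return (twigs + s) - twigs;
    }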

Well, lesson learned. I should be -pedantic and proud of it.

COW

This afternoon I turned my prose description of how copy-on-write should work into code. It was remarkably straight-forward! The preparatory thinking and refactoring paid off nicely.

However, I forgot to implement the delete function, oops!

TODO

  • Rewrite the hacking branch commit history into something that makes sense. At the very least, I need a proper explanation of what happened during the refactoring.

  • Tests!

There's still a lot of work needed to do copy-on-write in the DNS parts of Knot, but I am feeling more confident that this week I have laid down some plausible foundations.

fanf: (Default)

(Monday's notes) (Tuesday's notes) (Thursday's notes) (Epilogue)

Today was productive, and I feel I'm over the hump of the project - though I fear I won't get to a good stopping point this week.

Refactoring

As I hoped, I managed to finish the refactoring. Or, to be precise, I got it to the point of compiling cleanly and crashing messily.

My refactoring approach this week has been to hack in haste and debug at leisure. Hopefully not too much leisure :-)

A lot of the stripping out of unions and bitfields was fairly mechanical, but I also took the opportunity to simplify some of the internal interfaces. I also changed some of the other data representations. I hope this doesn't turn out to be foolishly lacking in refactoring discipline!

Nibbles

The qp trie code selects a child of a branch based on a nibble somewhere in the key string. A good representation of "somewhere" is pretty important.

My original qp trie code represented indexes into keys as a pair of a byte index and some flags that selected the nibble in that byte. This turned out to be pretty good when I did the tricky expansion from 4 bit to 5 bit nibbles. However, knowledge of this detail was smeared all through the code.

In this week's refactoring I've tried unifying the byte+flags into a single nibble index. Overall I think this has turned out to be simpler, and my vague handwavy feeling is that the code should compile down to about the same instructions. (If you set NDEBUG to switch all the new asserts off, that is!)
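As a rough illustration of the shape of the change (not the actual code, and simplified to 4-bit nibbles; the real 5-bit nibbles make the arithmetic hairier), the old representation paired a byte index with a nibble-selector flag, while the new one is a single index from which both are derived on demand:

    #include <stddef.h>

    /* Old style: byte index plus a flag choosing half of the byte. */
    typedef struct {
        size_t byte;
        unsigned high;              /* 1 = high nibble, 0 = low nibble */
    } old_pos;

    /* New style: one nibble index; nibble i lives in byte i / 2. */
    typedef size_t nibble_index;

    unsigned nibble_at(const unsigned char *key, nibble_index i) {
        unsigned char b = key[i / 2];
        return (i % 2 == 0) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0f);
    }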

COW pushdown

I'm fairly confident now that I have a good idea of how copy-on-write will work. This afternoon I wrote a 700 word summary of the COW states and invariants - the task now (well, after the debugging) is to turn the prose into code.

fanf: (Default)

(Monday's notes) (Wednesday's notes) (Thursday's notes) (Epilogue)

Today was a bit weird since I managed to perform some unusual productivity judo on myself.

I started the day feeling very unsure about what direction to take, and worried that the project would be a bust before it had even got properly started.

After I had some coffee and killed a bit of email, I found myself looking through Knot's DNS update code, worrying about how to COWify it. Usually if I am feeling unsure about a project I will put it on the back burner to think about while I do something else. That isn't going to work when I have given myself only one week to see what I can achieve.

Eventually I realised that my only chance of success would be to strictly limit the scope of the project to the qp trie code itself, aiming to make it possible to COWify the DNS code but not touching the DNS code until the trie is able to COW.

(COW = copy-on-write)

I still didn't know how COW would work: the new invariants, the fence posts, the edge cases. But it seemed clear after last night's thoughts that I would need to add some kind of COW flag to the leaf nodes.

Digression: misuse of C

I made two horrible mistakes in my original qp trie code.

First, I used a union to describe how branch and leaf nodes both occupy two words, but with different layouts. It is really hard to use a union in C and not lose a fight with the compiler over the strict aliasing rules.

Second, I used bitfields to describe how one of the branch words is split up into subfields. Bitfields have always been a portability nightmare. I was using them to describe the detailed specifics of a memory layout, which does not work for portable code.

I was aware at the time that these were really bad ideas, but I wanted a reasonably explicit and lightweight notation for accessing the parts of the data structure while I was getting it working. And I never got round to correcting this short-term hack.
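To make the complaint concrete, here is a caricature of the two styles (illustrative only, neither my code nor Knot's): the short-term hack described a node as a union of two layouts with bitfields, while the replacement treats every node as two plain words and does the masking, shifting, and casting in explicit accessor functions.

    #include <stdint.h>

    /* The short-term hack: a union of layouts plus bitfields, which
     * picks fights over strict aliasing and implementation-defined
     * bitfield layout. */
    typedef union hack_node {
        struct {
            uint64_t isbranch : 1, bitmap : 17, offset : 46;
            union hack_node *twigs;
        } branch;
        struct {
            const char *key;
            void *value;
        } leaf;
    } hack_node;

    /* The replacement: every node is two bare words, and accessor
     * functions make the layout explicit. */
    typedef struct node {
        uint64_t word[2];
    } node;

    #define ISBRANCH UINT64_C(1)

    int isbranch(const node *n) {
        return (n->word[0] & ISBRANCH) != 0;
    }

    node *twigs(const node *n) {
        return (node *)(uintptr_t)n->word[1];
    }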

Refactoring

So today I spent a few hours starting to replace this dubious C with something more explicit and less likely to confuse the compiler.

As a side-effect, it will make it possible to stash an extra COW flag into leaf nodes.

Where to bung a COW?

When I started refactoring I didn't know how I would use the COW flag.

It takes me about 25 minutes to walk between home and the beer festival, and today that was really useful thinking time.

My thoughts oscillated between different possible semantics for the COW flag (most of them clearly broken) and I wasn't sure I could make it work. (At worst I might finish the week with a bit of code cleanup...)

This evening I think I came up with something that will work, and that will justify the refactoring. The problem I have been struggling with is that the existing qp trie structure puts the metadata about a structure next to the pointer to the structure, but the COW flag is all about whether a structure is shared or not, so it really demands to be put in the structure itself.

When you have trie A and its COW-clone trie B, if you put the COW flags in the pointers you end up with different copies of the pointers and flags in A and B. Then you modify B some more, which means you need to update a flag, but the flag you need to update is in A and you don't have a quick way to locate it. Gah!

Tomorrow

My aim is to hammer through the refactoring, and think about the details of how to use this COW flag, and what the API might look like for COWing the application data structure - mainly the DNS stuff in the case of Knot, but the hooks have to be general-purpose. (Knot uses qp tries for a lot more than just DNS zones.)

fanf: (Default)

(Tuesday's notes) (Wednesday's notes) (Thursday's notes) (Epilogue)

This week I have taken time off work to enjoy the beer festival, like I did last year, and again I am planning to do some recreational hacking on my qp tries.

Last year I fiddled with some speculative ideas that turned out to be more complicated and less clearly beneficial than I hoped. I didn't manage to get it working within one week, and since it wasn't obviously going to win I have not spent any more time on it. (I'm not a scientist!)

During this year's beer festival I'm seeing if I can help the Knot DNS developers improve Knot's qp trie.

Monday morning

My start was delayed a bit because I needed to deploy the [BIND security release](https://lists.isc.org/pipermail/bind-announce/2018-May/thread.html) that fixed a crash bug I reported in March.

Plans

My general plan is to work on reducing Knot's memory use for UPDATE and IXFR, both of which involve incremental changes to a DNS zone. At the moment, Knot makes a clone of the zone, modifies the clone, then discards the old version. The clone doubles the zone's memory usage, which can be painful if it is big.

My aim is to add copy-on-write (COW) updates, so that the memory usage is proportional to the size of the update rather than the size of the zone. Operators will still have to size a server to allow for double memory during whole-zone updates; the aim is to make frequent small updates cheaper.

Pre-flight checks

On Friday last week I discussed my ideas with the Knot developers to confirm that my plan is definitely something they are also interested in, and Vladimír Čunát (who adapted qp tries into Knot) had some very useful suggestions.

My first step today was to build Knot and make sure I could run its test suite. This was pleasingly easy :-)

Reading and thinking

Most of the rest of the afternoon I spent reading the code, understanding the differences between Knot's qp code and mine, and thinking about how to add COW. It's difficult to measure progress at this stage, since it's mostly throwing away half-formed ideas that turn out to be wrong when I understand the problem better.

The importance of context

The most obvious difference between Knot's qp code and mine is the API. My API was bodged without any particular context to shape it, other than to provide a proof of concept for the data structure and something that a test suite could use.

This COW work requires extending Knot's trie API, and Knot has an existing infrastructure for zone changes that will use this new API.

This context is really helpful for clarifying the design.

A half-formed bad idea

In the run-up to this week I had been aiming for a reasonably self-contained project that I could make decent progress on in a few beery days. But it looks like it will be more complicated than that!

Today, when I was thinking about the edge cases and fenceposts and invariants for COW qp tries, I started off thinking of it as a purely internal problem - internal to the data structure.

In a qp trie, branches are complicated (and have space for extra COW bits), whereas leaf nodes are dead simple. The overall layout of the trie, and its memory use, are mostly constrained by the layout of a simple leaf node: a pair of words containing a key pointer and a value pointer. Making leaves bigger makes the whole trie bigger.
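
As a concrete picture - a sketch, not the real Knot definitions - a minimal leaf is just:

    struct leaf {
        const char *key;   /* pointer to the key              */
        void       *val;   /* the application's value pointer */
    };

Any per-leaf COW metadata would have to squeeze into those two words, or grow every leaf in the trie.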

When I was thinking of COW as an internal problem I was scavenging space from branch nodes only. But, for COW to work, we also have to COW the key / value data structures hanging off each leaf, and the application using the data structure has to co-operate with the COW so that it knows about key / value pairs that got altered or deleted.

So the COW metadata can't be purely internal, it has to extend into the leaf nodes and into the application-specific data.

How this relates to Knot

In Knot, the trie is a map from DNS names to the list of RRsets owned by that name. It's possible to push COW a few levels beyond the trie into these RRsets, but that's something to leave for later.

What's unavoidable is keeping track of which leaf nodes were modified or deleted during a COW update - we have to know when to copy or free these RRsets.

Keeping that information inside the trie almost certainly requires making leaf nodes bigger, which defeats the goal of reducing memory use.

Tomorrow

So I have gone back to an earlier half-formed idea, that I should use an auxiliary list of modified or deleted nodes - interior branches or application leaves - that only exists during updates, and so does not bloat the trie in normal read-only use.

Maybe this will allow me to draw a line to keep the scope of this project inside the qp trie data structure (and maybe inside a week), without getting lost in the weeds of a large application.
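
The shape I have in mind is roughly the following (names invented); the list only exists while an update is in progress and is thrown away when the update commits or aborts:

    struct cow_change {
        void              *node;    /* touched branch or application leaf */
        int                deleted; /* 0 = copied, 1 = deleted            */
        struct cow_change *next;
    };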

fanf: (Default)

Yay, I've had interesting comments on my previous article about using curl to test web servers before going live!

I think there are a couple of things that are worth unpacking: what kind of problems am I trying to spot? and why curl with a wrapper rather than some other tool?

What webmasters get wrong before deployment

Here are some of the problems that I have found with my curl checks:

  • Is there a machine on the target IP address that is running a web server listening on port 80 and/or port 443?

  • Does it give a sensible response on port 80 - either a web page or a redirect to port 443? A common mistake is for the virtual host configuration to be set up for the development hostname but not the production hostname.

  • Does it return a verifiable certificate on port 443? With the intermediate certificate chain?

  • Is the TLS setup on the new server consistent with the old one? Like, will old permanent redirects still work? If the old server has strict transport security, does the new one too? We have not had many security downgrades, but it's a looming footgun target.

This is all really basic, but these problems happen often enough that when I am making the DNS change, I check the web server so I don't have to deal with follow-up panic rollbacks.

Usually if it passes the smoke test, the content is good enough, e.g. when I get HTML I look for a title or h1 that makes sense. Anything content-related is clearly a web problem not a DNS problem even to non-technical people.

What other tools might I have chosen?

Maciej Sołtysiak said, "I usually just speak^H^H^H^H^Htype HTTP over telnet or openssl s_client for tls'd services." I'm all in favour of protocols that can be typed at a server by hand :-) but in practice typing the protocol soon becomes quite annoying.

HTTP is vexing because the URL gets split between the Host: header and the request path, so you can't trivially copy and paste the user-visible target into the protocol (in the way you can for SMTP, say). [And by "trivially" I mean it's usual for terminal/email/chat apps to make it extra easy to copy entire URLs as a unit, and comparatively hard to copy just the hostname.]

And when I'm testing a site, especially if it's a bit broken and I need to explain what is wrong and how to fix it, I'm often repeating variations of an HTTP(S) request. The combination of command line history and curl's flexibility makes it super easy to switch between GET and HEAD (-I) or ignore or follow redirects (-L), and so on.
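
For example (with a hypothetical URL), the variations tend to look like:

    curl -sI  https://www.example.org/            # just the headers
    curl -sIL https://www.example.org/            # ...following redirects
    curl -s   https://www.example.org/ | grep -i '<title>'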

OK, so I don't want to type in HTTP by hand - but often I don't even need HTTP to find that a site is broken. Checking TLS is also a lot more faff without curl.

For example, using my script,

    curlto chiark.greenend.org.uk https://dotat.at

How do I do that with openssl?

    openssl s_client -verify_return_error \
        -servername dotat.at \
        -connect chiark.greenend.org.uk:443

OK, that's pretty tedious to type, and it also has the chopped-up URL problem.

And, while curl checks subject names in certificates, openssl s_client only checks the certificate chain. It does print the certificate's DN, so you can check that part, but it doesn't print the subjectAltName fields which are crucial for proper browser-style verification.

So if you're manually doing it properly, you need to copy the certificate printed by s_client, then paste it into openssl x509 -text and have a good eyeball at the subjectAltName fields.
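
You can skip the copy-and-paste with a pipeline, roughly:

    openssl s_client -servername dotat.at \
            -connect chiark.greenend.org.uk:443 </dev/null 2>/dev/null |
        openssl x509 -noout -text |
        grep -A1 'Subject Alternative Name'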

I have done all these things in the past, but really, curl is awesome and it makes this kind of smoke test much easier.

fanf: (Default)

A large amount of my support work is helping people set up web sites. It's time-consuming because we often have to co-ordinate between three or more groups: typically University IT (me and colleagues), the non-technical owner of the web site, and some commercial web consultancy. And there are often problems, so the co-ordination overhead makes them even slower to fix.

When moving an existing web site, I check that the new web server will work before I update the DNS - it's embarrassing if they have an outage because of an easy-to-avoid cockup, and it's good if we can avoid a panic.

I use a little wrapper around curl --resolve for testing. This makes curl ignore the DNS and talk to the web server I tell it to, but it still uses the new host name when sending the Host: header and TLS SNI and doing certificate verification.

You use the script like:

    curlto <target server> [curl options] <url>

e.g.

    curlto ucam-ac-uk.csi.cam.ac.uk -LI http://some.random.name

This needs a bit of scripting because the curl --resolve option is a faff: you need to explicitly map the URL hostname to all the target IP addresses, and you need to repeat the mapping for both http and https.

Here's the script:

    #!/usr/bin/perl

    use warnings;
    use strict;

    use Net::DNS;

    my $dns = new Net::DNS::Resolver;

    # look up all the IPv4 and IPv6 addresses of a host name
    sub addrs {
        my $dn = shift;
        my @a;
        for my $t (qw(A AAAA)) {
            my $r = $dns->query($dn, $t) or next;
            push @a, map $_->address, grep { $_->type eq $t } $r->answer;
        }
        die "curlto: could not resolve $dn\n" unless @a;
        return @a;
    }

    unless (@ARGV > 1) {
        die "usage: curlto <target server> [curl options] <url>\n";
    }

    my $url = $ARGV[-1];
    $url =~ m{^(https?://)?([a-z0-9.-]+)}
        or die "curlto: could not parse hostname in '$url'\n";
    my $name = $2;

    # the first argument is the target server; resolve it unless it
    # is already an IP address literal
    my @addr = shift;
    @addr = addrs @addr unless $addr[0] =~ m{^([0-9.]+|[0-9a-f:]+)$};
    # map the URL hostname to the target addresses for http and https
    for my $addr (@addr) {
        unshift @ARGV, '--resolv', "$name:80:$addr";
        unshift @ARGV, '--resolv', "$name:443:$addr";
    }

    print "curl @ARGV\n";
    exec 'curl', @ARGV;
fanf: (Default)

I have a short wishlist of dnstap-related tools. I haven't managed to find out if anything like this already exists - if it does exist I'll be grateful for any pointers!

fanout

We have a couple of kinds of people who have expressed interest in getting dnstap feeds from our campus resolvers (though this is not yet happening).

  • There are people on site doing information security threat intelligence research, who would like a full feed of client queries and responses.

  • And there are third parties who would like a passive DNS feed of outgoing resolver queries, and who aren't allowed a full-fat feed for privacy reasons.

The dnstap implementation in BIND only supports one output stream, so if we are going to satisfy these consumers, we would need to split the dnstap feed downstream of BIND before feeding the distributaries onwards.

replay

More recently it occurred to me that it might be useful to generate queries from a dnstap feed. I have a couple of scenarios:

  • Replay client queries against a test server, to verify that it behaves OK with real-ish traffic. I have a tool for replaying cache dump files, but replaying a cache dump is nothing like real user traffic since it doesn't include repeated queries and the queries occur in a weirdly lexicographical order.

  • Replay outgoing resolver queries from a live server against a standby server. These queries are effectively the cache misses, so they are less costly to replicate than all the client traffic. This keeps the standby cache hot whereas at the moment my standby servers have cold caches.

    It might also be worth duplicating this traffic from one live server to the other one, in the hope that this increases the cache hit rate, since the more users a cache has the higher its hit rate. (Some experimentation needed!)

I'm not really interested in the responses to these queries so it's OK if the replay just drops the answers. (Though when replaying a full client query feed it might be useful to compare the replay responses to the recorded feed of client responses.)

todo?

If nothing like this already exists, I might write it myself.

I have not used protobufs before so I'm keen to hear advice from those who have already got their hands dirty / fingers burned.

I'm tempted to weld libfstrm to Lua, so you can configure filtering, replication, and output with a bit of Lua. The number of Lua protobuf implementations is a bit of a worry - if anyone has a recommendation I'd like to short-cut the experimental stage. (I should ask this on the Lua list I guess!)

Alternatively it might be easier to hack around with the golang-dnstap code, tho then I would have to think harder about how to configure it...

fanf: (Default)

At Cambridge we have encouraged sysadmins to set up stealth secondary DNS servers for something like 25 years, maybe more. This has a couple of advantages:

  • It distributes the DNS resolution load, so looking up names of on-site services is consistently fast.

  • It has better failure isolation, so the local DNS still works even if the University or the department have connectivity problems.

It has some disadvantages too:

  • It is complicated to configure, whereas a forwarding cache has nearly the same advantages and is a lot simpler to configure.

  • Stealth secondaries don't have any good way to authenticate zone transfers - TSIG only provides mutual authentication by prior arrangement, and part of being stealthy is there's none of that.

DNSSEC and stealth servers

Disappointingly, DNSSEC does not help with this stealth secondary setup, and in some ways hurts:

  • Zone transfers do not validate DNSSEC signatures, so it doesn't provide a replacement for TSIG. You can sort-of implement a lash-up (RFC 7706 has examples for the root zone) but if the transfer gets corrupted your stealth secondary goes bogus without any attempt at automatic recovery.

  • Validation requires chasing a chain of trust from the root, which requires external connectivity, even when you have a local copy of the data you are validating. So you lose much of the robustness.

  • You could in theory mitigate this by distributing trust anchors, but that's a much bigger configuration maintenance burden.

Work in progress

We have been living with this unsatisfactory situation for nearly 10 years, but things are at last starting to look promising. Here are a few technologies in the works that might address these problems.

DNS-over-TLS for zone transfers

To provide on-the-wire security for zone transfers, we need a one-sided alternative to TSIG that authenticates the server while allowing the client to remain anonymous. In theory SIG(0) could do that, but it has never been widely implemented.

Instead, we have DNS-over-TLS which can do the job admirably. The server side can be implemented now with a simple configuration for a proxy like NGINX; the client side needs a little bit more work.
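
To give an idea of how little the server side needs, an NGINX stream block in front of the DNS server's TCP port is roughly this (an untested sketch; the address and certificate paths are placeholders):

    stream {
        server {
            listen [::]:853 ssl;
            proxy_pass 127.0.0.1:53;

            ssl_certificate      cert.pem;
            ssl_certificate_key  cert.key;
        }
    }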

Built-in support for RFC 7706

Authenticating the server isn't quite enough, since it doesn't provide end-to-end validation of the contents of the zone. It looks like there is interest in adding native support for DNSSEC authenticated zone transfers to the open source DNS servers, so they can support RFC 7706 without the lash-ups and bogosity pitfalls.

I would like to see this support in a generalized form, so it can be used for any zones, not just the root.

Catalog zones

To simplify the setup of stealth secondaries, I provide a Cambridge catalog zone. This makes the setup much easier, almost comparable to a forwarding configuration. If only we could do this for trust anchors as well...
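
For a stealth secondary running BIND, the client side amounts to roughly this much named.conf (the zone name and address here are placeholders; check the ARM for the exact syntax):

    options {
        catalog-zones {
            zone "catalog.example.net" default-masters { 192.0.2.53; };
        };
    };

    zone "catalog.example.net" {
        type slave;
        masters { 192.0.2.53; };
    };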

DLV delenda est

Before the root zone was signed, isc.org created a mechanism called "DNSSEC lookaside validation", which allowed "islands of trust" to publish their trust anchors in a special dlv.isc.org zone, in a way that made it easy for third parties to use them.

Now that the root is signed and support for DNSSEC is widespread, DLV has been decommissioned. But if we tweak it a bit, maybe it will gain a new lease of life...?

DLV TNG

DLV acted as a fallback, to be used when the normal chain of trust from the root was incomplete. I would like to be able to set up my own local DLV, to be used as a replacement for the normal chain of trust, not a fallback. The advantages would be:

  • When we have connectivity problems, DNSSEC validators can still work for local names because they will not need to chase a validation chain off site.

  • I can distribute just one trust anchor, covering all our zones, including disconnected ones such as reverse DNS for RFC 1918 addresses and IPv6 unique local address prefixes.

  • We get tinfoil-hat safety: localized DNSSEC validation for on-site services can't be compromised by attacks from those in control of keys nearer the root.

  • Even better if my DLV could be used as a stealth secondary zone obtained via our catalog zone.

DLV on the edge

That sounds nice for recursive DNS servers, but for DNSSEC to be really successful we need validation on end-user devices. And that undermines the robustifications I just listed.

But if your validating stub resolver supports localized DLV, and it has been configured by a group policy or similar configuration management system (like those corporate TLS trust anchors some enterprises have) then you have won those advantages back.

Summary

I want:

  • Support for DNS zone transfers over TLS

  • Validation of zone contents after transfer, and automatic retransfer to recover from corrupted zones

  • A localized DLV to act as an enterprise trust anchor distribution mechanism

I mentioned this last feature to Evan Hunt at the IETF 101 London meeting. I feared he would think it is too horrible to contemplate, but in fact he thought the use case is quite reasonable. So I have written this down so I can give these ideas a wider airing.

fanf: (Default)

ICANN are currently requesting public comments on restarting the root DNSSEC KSK rollover. I thought I didn't have anything particularly useful to say, but when I was asked to contribute a comment I found that I had more opinions than I realised! So I sent in some thoughts which you can see on the KSK public comments list and which are duplicated below.

Please go ahead and roll the root KSK as planned on the 11th October 2018.

The ongoing work on trust anchor telemetry and KSK sentinel might be useful for making an informed decision, but there is the risk of getting into the trap that there is never enough data. It would be bad to delay again and again for just one more experiment. KSK sentinel might provide better data than trust anchor telemetry, but I fear it is too late for this rollover, and it may never be deployed by the problem sites that cause concern. So maybe the quest for data is not absolutely crucial.

I increasingly think RFC 5011 is insufficient and not actually very helpful. It was implemented rather late and it seems many deployments don’t use it at all. It only solves the easy problem of online rollovers: it doesn’t help with bootstrapping or recovery.

RFC 7958 moves the problem around rather than solving it. It suggests that we can treat personal PGP keys or self-signed X.509 keys with unclear provenance as more trusted than the root key, with all its elaborate documentation, protections, and ceremonial. It adds more points of failure, whereas a proper solution should disperse trust.

I think the best short term option is to put more emphasis on using software updates for distributing the root trust anchors. The software is already trusted, so using it for key distribution doesn't introduce a new point of failure. Most vendors have a plausible security update process. Software updates can solve the same problems as RFC 5011, in a straight-forward and familiar way. Across the ecosystem as a whole it disperses trust amongst the vendors.

Longer term I would like a mechanism that addresses bootstrap and recovery (because you can’t get your software update without DNS) but that is not doable before the rollover later this year.

fanf: (Default)

Here's a somewhat obscure network debugging tale...

Context: recursive DNS server networking

Our central server network spans four sites across Cambridge, so it has a decent amount of resilience against power and cooling failures, and although it is a single layer two network, it is using some pretty fancy Cisco Nexus switches to provide plenty of redundant connectivity.

We have four recursive DNS servers, one at each site, usually two live and two hot spare. They are bare metal machines, which are intended to be able to boot up and provide service even if everything else is broken, provided they have power and cooling and network in at least one site.

The server network has several VLANs, and our resolver service addresses are on two of them: 131.111.8.42 is on VLAN 808, and 131.111.12.20 is on VLAN 812. So that any of the servers can provide service on either address, their switch ports are configured to deliver VLAN 808 untagged (so the servers can be provisioned using PXE booting without any special config) and VLAN 812 tagged.

Context: complying with reverse path filtering

There is strict reverse path filtering on the server network routers, so I have to make sure my resolvers use the correct VLAN depending on the source address. The trick is to use policy routing to match source addresses, since the normal routing table only looks at destination addresses.

The servers run Ubuntu, so this is configured in /etc/network/interfaces by adding a couple of up and down commands. Here's an example; there are four similar blocks in the config, for VLAN 808 and VLAN 812, and for IPv4 and IPv6.

    iface em1.812 inet static
        address 131.111.12.{{ ifnum }}
        netmask 24

        up   ip -4 rule  add from 131.111.12.0/24 table 12
        down ip -4 rule  del from 131.111.12.0/24 table 12
        up   ip -4 route add default table 12 via 131.111.12.62
        down ip -4 route del default table 12 via 131.111.12.62

The bug: missing IPv6 policy routing

On Sunday we had some scheduled power work in one of our machine rooms. On Monday I found that the server in that room was not answering correctly over IPv6.

The machine had mostly booted OK, but it had partially failed to configure its network interfaces: everything was there except for the IPv6 policy routing, which meant that answers over IPv6 were being sent out of the wrong interfaces and dropped by the routers.

The logs were not completely clear, but it looked like the server had booted faster than the switch that it was connected to, so it had tried to configure its network interfaces when there was no network.

Two possible fixes

One approach might have been to add a script that waits for the network to come up in /etc/network/if-pre-up.d. But this is likely to be unreliable in bad situations where it is extra important that the server boots predictably.

The other approach, suggested by David McBride, was to try disabling IPv6 duplicate address detection. He found the dad-attempts option in the interfaces(5) man page, which looked very promising.
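
The change is one extra line in each inet6 stanza, something like this (the address here is a placeholder):

    iface em1.812 inet6 static
        address 2001:db8:812::20/64
        dad-attempts 0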

Edited to add: Chris Share pointed out that there is a third option: DAD can be disabled using sysctl net.ipv6.conf.default.accept_dad=0 which is probably simpler than individually nobbling each network interface.

Debugging

I went downstairs to the machine room in our office building to try booting a server with the ethernet cable unplugged. This nicely reproduced the problem.

I then tried adding the dad-attempts option, and booting again. The server booted successfully!

No need for a horrible pre-up script, yay!

Moans

The ifupdown man pages are not very good at explaining how the program works: they don't explain the /etc/network/if-*.d hook scripts, nor how the dad-attempts option works.

I dug around in its source code, and I found that ifupdown's DAD logic is implemented by the script /lib/ifupdown/settle-dad.sh, which polls the output of ip -6 address list. If it times out while the address is still marked "tentative" (because the network is down) the script declares failure, and ifupdown breaks.

The other key part is the nodad option to ip -6 addr add, which is undocumented.

This made it somewhat harder to find the fix and understand it. Bah.

Risks

I've now disabled duplicate address detection on my DNS servers, though I might have gone a bit far by disabling it on my VMs as well as the recursive servers. The point of DAD is to avoid accidentally breaking the network, so it's a bit arrogant to turn it off. On the other hand, if I have misconfigured duplicate IPv6 addresses, I have almost certainly done the same for IPv4, so I have still accidentally broken the network...

fanf: (Default)

I've been quite lucky with the timetable, so like Wednesday I could have a relaxed morning on Thursday. Friday is a half day (so that attendees can head home before the weekend) but I popped into London for the last session and to say goodbye to anyone who was still around.

Hallway track

Before lunch on Thursday I had a chat with various people including Tim Griffin (who I have not previously met despite working in the building next door to him!) and Geoff Huston. Geoff thanked me for my suggestion about avoiding a potential interop gotcha in his kskroll-sentinel draft that I covered on Tuesday, and I thanked him for his measurement work on fragmentation. He told me not to forget Davey Song's clever "additional truncated response" idea, so I posted a followup to yesterday's notes on fragmentation to the dnsop list.

Root DNSSEC key rollover

Over lunch there was a talk by David Conrad about replacing the root DNSSEC key. I have been paying attention to this process so there were no big surprises. It's difficult to get good data on how DNSSEC is configured or misconfigured, hence the kskroll-sentinel draft, and it's difficult to get feedback from operators about their approaches to the rollover. An awkward situation, but hopefully the rollover won't have to be postponed again.

doh

After lunch was the DNS-over-HTTPS working group meeting.

This started with some feedback from the hackathon, and then a discussion of the current state of the draft spec. It is close to being ready, so the authors hope to push it to last call within a few weeks. (The DoH WG has been remarkably speedy - it helps to have a simple protocol!)

After that, there was some discussion about what comes next. The WG chairs plan to close the working group after the spec is published, unless there is consensus to pursue some follow-up work. There was also a presentation from dkg about using HTTP/2 push to send unsolicited DoH responses: in what situations can browsers use these responses safely? are they useful for avoiding DNS lookup latency?

I still don't know if DoH is a massive distraction from the bad idea fairy. It feels to me like it might be one of those friction-reducing technologies that changes the balance of trade-offs in ways that have unexpected consequences.

ANAME

In the next session I missed the jmap meeting and instead spent some time in the code lounge with Evan Hunt (ISC BIND), Peter van Dijk (PowerDNS), and Matthijs Mekking (Dyn), hammering out some details of ANAME (at least for authoritative servers).

PowerDNS and Dyn have existing (non-standard, differing) implementations of this functionality, so we were partly trying to work out how a standardized version could cover existing use cases. One thing that slightly surprised me was that PowerDNS does ALIAS expansion during an outgoing zone transfer - I had not previously considered that mechanism, but PowerDNS is designed around dynamic zone contents, so I guess their zone transfer code has to do quite a lot more work than BIND's.

We ended up with a few almost-orthogonal considerations: Is the server a primary or a secondary for the zone? Is the zone signed or not? Does the server have the private keys for the zone? Does the server actively expand ANAME when answering queries, or passively serve pre-expanded addresses from the zone? Does the server expand ANAME on outgoing zone transfers, or transfer the zone verbatim?

There are a few combinations that don't make sense, and a few that end up being equivalent, but it's quite a large and confusing space to navigate.

I think we managed to resolve several questions (as it were) and had a useful meeting of minds, so I'm looking forward to more progress with this draft.

dnsop II

The evening session was the second dnsop meeting, which was for triage of new drafts.

Shumon Huque has a nice operations draft explaining how to manage DNSSEC keys for zones served by multiple DNS providers which I reviewed on the mailing list.

Ray Bellis presented catalog zones, which I quite like: it isn't quite the right shape for simplifying the tricky parts of my configuration, but it does simplify our stealth secondary config a lot. However, a lot of others in the room do not like abusing the DNS for server configuration.

Matthijs Mekking presented his idea for less verbose zone transfers. This is something we discussed in the mfld track at the previous London IETF and although it is quite a fun idea, Matthijs now thinks that if we are going to revise the zone transfer protocol, it would probably be better to move it out of band so that there's the flexibility to do even more clever things without overloading Bert's camel.

We ran out of time before we got to Petr Špaček's camel-diet draft. This is related to the agreement between the big 4 open source DNS servers that next year they will stop working around broken EDNS implementations.

lamps

For the final session I went to the working group on limited additional mechanisms for PKIX and SMIME. Paul Hoffman clued me in that there would be some discussion of CAA (X.509 certificate authority authorization) DNS records. There's a revision of the spec in the works, which includes more operational advice that I should review wrt the problems we had back in September preventing some certificates from being issued.

mfld

On Thursday evening I went to a Thai restaurant with some friendly Dutch DNS folks. I foolishly chose the "most adventurous" menu item, which was nice and stinky although a bit too spicy. I think I still smell of fish sauce...

The end of my IETF was lunch with John Levine chatting about ISOC and our shared tribulations of the small-scale DNS operator.

And now I am on my way home at the end of a long busy week, hopefully in time to pick up my new specs from the optician.

fanf: (Default)

After Tuesday's revelation about the possibility of lie-ins I took it easy yesterday morning, and got to Paddington around lunch time for the afternoon sessions.

dprive

The first afternoon session was the DNS privacy working group. I have not been paying as much attention to this work as I should have, despite being keen on deploying it in production.

So far, this WG has specified DNS-over-TLS, DNS-over-DTLS, and EDNS padding (which aims to make traffic analysis harder by quantizing the lengths of DNS messages), and more details about server authentication.

They are working on recommendations for DNS privacy service operators which looks like it will have some pertinent advice - it's on my list of documents to review, and I got the impression from the summary presentation that it's likely to have some helpful ideas I can use when reviewing my services' privacy policies for GDPR.

Roland van Rijswijk-Deij presented a neat application of Bloom filters for privacy-preserving collection of DNS queries. The idea is that if you have a set of known bad queries (e.g. botnet C&C, compromised web sites) you can check the Bloom filter to retrospectively find out if anyone made a bad query, and whether you need to follow up with a more detailed investigation.

Finally, there was some discussion about a second phase of work for the group. Stéphane Bortzmeyer has a draft about DNS privacy for the resolver-to-authoritative path. This is in need of more discussion and feedback.

acme

For the second session I went to the acme WG meeting. (ACME is the protocol used by Let's Encrypt).

There was some discussion about authentication mechanisms for IP address certificates (likely to be of interest for dprive DNS servers). The draft suggests using the reverse DNS, and there were a lot of comments in the meeting that this is probably not secure enough: there isn't necessarily a good coupling between authentication of IP address ownership and authentication of reverse DNS ownership. I pointed out that in many enterprises, DNS and routing are handled by different teams; I forgot to mention that in the RIPE database (for example) they are also represented by separate objects that can have separate access control configurations. So this draft needs a rethink.

Another topic of interest was how to fix the broken TLS-SNI-01 challenge. As I understand it the draft replacement uses TLS ALPN (application layer protocol negotiation, which allows a TLS client to say it wants to speak something other than HTTP). This is fiddly, but the idea for these challenges is to have close integration with an HTTPS web server, to minimize support glue scripts.

dnsop

Outside working group meetings I discussed the IP fragmentation considered fragile draft with a few people, and my idea for a DNS-specific followup. I decided I should go ahead and try to get the ball rolling, so I posted some notes on reducing fragmented DNS-over-UDP to the dnsop WG list.

Plenary

The final session of the day was the Plenary meeting. This includes a certain amount of meta-discussion about how the meetings are run - announcements of future locations, budgeting, changes to memberships of senior committees, etc. This time there are about 1200 on-site attendees, 400 remote, and the money to pay for the meeting is about $804,000 in attendance fees, plus $521,000 in sponsorship.

There were a few presentations on expanding access to the Internet to sub-Saharan Africa and to areas with low population density. It seems there is currently a boom in satellite communications, and the satellite engineers are doing lots of cool things with multi-path communications to avoid rain fade, and maybe in the not too distant future, direct sat-to-sat relaying over space lasers. Awesome.

A lot of the plenary is for open mic sessions, where anyone can quiz the senior committees (the Internet Architecture Board, IAB; the IETF administrative oversight committee, IAOC; and the Internet Engineering Steering Group, IESG, which is the committee of IETF area directors). It struck me that the composition of these committees is about 1/3 women, which is considerably better than the IETF at large - the bulk of the attendees are middle-aged white American and European men.

mfld

I had an anti-social lunch, but after I bailed out of the plenary before the IAOC open mic, I found a pub with a few folks from Sinodun and nic.at. We had a pleasant chat, although I managed to knock my beer over, so I went home unpleasantly moist and smelly. D'oh!

fanf: (Default)

In yesterday's notes about my IETF 101 activities on Monday I forgot to mention the Hackathon Happy Hour. A few of the teams were demoing their projects, so I had a chat with the BBC R&D folks. They had been at the table next to us over the weekend, working on IP multicast for TV. I found out at the happy hour that they were multicasting unidirectional QUIC (Google's new encrypted transport layer, which is currently going through IETF standardization). Their player app normally uses unicast HTTP; by using QUIC they can re-use HTTP semantics for multicasting as well. Super cool.

Tuesday morning

This week I am commuting from Cambridge to Paddington. So far this has been working OK - I'm not suffering too much from burning the candle at both ends, though I'm not seeing very much of the family!

The first WG meeting session starts at 09:30, and I can get there in time if I get up around 07:00, and get out of the house without faffing. Once I get on the train I can plan my day, read blogs and mail, etc.

I realised yesterday that this routine is slightly suboptimal if there aren't any WG meetings that I want to attend in the morning - if I don't realise this until after I am up and out of the house, I miss the opportunity for a lie in!

Oh well, I spent the morning catching up on things in the code lounge.

Connectivity problems in Cambridge

There were some complaints on the ucam-itsupport list about a connectivity problem, so (yet again) I had to explain why it wasn't my fault....

Remember that when there is an upstream connectivity problem, it tends to be most visible to end users as a DNS problem: when the uplink goes away, the DNS servers can't get answers for users, so the users never even get to the point of trying to talk off-site, so they don't discover that the uplink has gone - they just see the DNS error.

The TTL for en.wikipedia.org is 5 minutes, so if the uplink problem lasts longer than that, you will get a DNS error for Wikipedia.

There's a new feature in the recently released BIND 9.12 called "serve-stale", which changes the cache time-to-live logic. When the DNS server tries to refresh an item in its cache, and discovers that it can no longer reach the authoritative DNS servers, it will continue to return the stale answer to users.

We have upgraded to BIND 9.12.0, but I have not yet enabled the serve-stale feature. I wanted to be sure that the servers continued to work OK with the existing configuration (in case I needed to roll back), and then industrial action intervened before I could make the serve-stale change.

I have a 9.12.1 upgrade in the works (to fix an interoperability regression to accommodate bad DNS zones that have a forbidden CNAME at the apex) after which I will enable serve-stale.

Tangentially, there's another 9.12 feature which I am looking forward to enabling: BIND can now use DNSSEC NSEC proof-of-nonexistence records to synthesize negative answers without having to re-ask the authoritative servers. This is particularly good at improving the performance of handling junk queries for invalid TLDs, and it will allow me to delete a lot of configuration verbiage that I added to suppress other junk queries.
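
For reference, both features boil down to a couple of options in named.conf, roughly like this (check the 9.12 ARM for the exact option names and defaults before copying):

    options {
        // serve stale cached answers when the authoritative
        // servers cannot be reached
        stale-answer-enable yes;

        // synthesize negative answers from cached NSEC records
        synth-from-dnssec yes;
    };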

netconf

In the first session after lunch, I thought the most useful WG meeting would be netconf, following the conversations I had about it on Monday. This kind of choice is a bit risky, because if you don't know much about a protocol, you lack the context to make much sense of the detailed business of a WG.

There was some discussion about the YANG keystore model (used for configuring things like ssh keys, I gather), YANG push (I guess for pub-sub style data collection), and a binary representation for netconf (which is natively XML).

dnsop

The second afternoon session was one of the main reasons I am attending IETF 101.

There was a comment from Warren Kumari (the IESG area director responsible for the dnsop WG) that there are quite a lot of DNS drafts in flight at the moment. From my point of view, there are some that I am really keen on since they are directly helpful for the services I run; there are some which are interesting but not directly relevant to me; and there are some that I think are probably bad ideas. (Of course, other people think that my favourite drafts are bad ideas - DNS people generally get on well with each other but we don't always agree!)

dnsop - refuse-any

Joe Abley has resurrected the long-stalled draft-ietf-dnsop-refuse-any so that it can be pushed through the various "last call" stages towards publication as an RFC.

I'm pleased to see that it is making progress again. Two years ago I implemented this draft for BIND to improve our robustness against certain kinds of flooding attacks.

dnsop - aname

Evan Hunt reported on the current state of draft-ietf-dnsop-aname. This spec will be a standardized version of the CNAME-at-apex workarounds that various DNS vendors have implemented in various ways. I'm looking forward to getting my hands on this since the restrictions on CNAMEs are a longstanding pain point for us.

I reviewed the draft in detail earlier this year (part 1, part 2) and Evan told me that my comments were very helpful, especially for clarifying how recursive servers could handle ANAME records.

The discussion in the meeting was about how to refactor the draft to clarify it.

dnsop - DNS capture format

This spec is about a more compact way of recording DNS traffic than the traditional pcap files. It is already being used for recording telemetry data on some root name servers. Sounds quite cool.

dnsop - root key rollover

There are a couple of drafts in this area: security considerations for RFC 5011, how to follow a key rollover safely; and kskroll-sentinel which is another way for tracking whether validators are following RFC 5011 successfully.

There was also a discussion about how a DNSSEC validator can bootstrap its trust anchors. I have some ideas about this - a few years ago I wrote a draft about trust anchor witnesses (I should probably write down my ideas about how to simplify it). I need to read Ben Laurie's old draft on DNSSEC key distribution and how getdns does zero-configuration DNSSEC.

dnsop - terminology

Work on the revised DNS terminology explainer continues, and seems to be approaching readiness for the last call process.

dnsop - session signalling

The DNS session signalling draft describes a way to make DNS-over-TCP (and other persistent transports) timeouts and connection shutdown more explicit. I don't really see the point of it - it isn't clear to me that explicit negotiation will provide much benefit. The alternative is for the server to close idle TCP connections whenever it wants, and for clients to handle lost connections gracefully.

dnsop - the DNS camel

After the discussion of drafts in progress, there was a presentation by Bert Hubert (author of PowerDNS) about the increasing complexity of DNS.

It was a very witty and well-informed rant, and it sparked some good discussion about what we can do to tackle the problem. One suggestion (from Job Snijders wrt the routing area) was to have strict rules that drafts cannot progress to RFC without multiple independent interoperable implementations. Another suggestion was to see if there were old RFCs that could be deprecated.

Would it be worth writing a consolidated DNS spec? Probably far too much work for unclear benefit. Would it be worth writing a roadmap RFC, that tells readers how much attention to give to old documents?

mfld

The non-wg track included a pleasant lunch with folks from NLnet Labs and others, and in the evening (instead of going to the official social meet in the Science Museum) several of us went to a local pub. I can't remember many of the fascinating topics we discussed :-) But (work-related) there was some agreement about session signalling being of dubious benefit.

fanf: (Default)

After the IETF 101 hackathon (day 1, day 2) we got into the usual IETF rhythm of working group meetings on Monday.

Morning

The first part of the day I mostly spent working through my email backlog and various things that did not get done during the UCU USS strike.

I was sitting near the BIND9 team who were discussing how their workflow is changing with the move to GitLab.

intarea

The first WG meeting I attended was the Internet Area working group. The IETF is divided into areas (e.g. Internet, Transport, Security, Applications) which are further subdivided into specific protocol-related working groups. The area working groups discuss topics that cross over multiple working groups or do not fit into an existing working group.

The most relevant item on the agenda was a document titled "IP fragmentation considered fragile". This is a particular pain point for the DNS, especially with large EDNS buffer sizes and large DNSSEC records, and the draft says that DNS needs work.

Although DNS people are aware of this problem, I don't know of any work in the IETF dnsop WG related to avoiding fragmentation. Maybe I should start a draft...

dhc

The next WG meeting I attended was for DHCP. One of the ongoing topics here is a YANG model for managing DHCP servers with netconf.

Over coffee before the dhc meeting I had a chat with Normen Kowalewski from Deutsche Telekom about his DHCP deployment using ISC Kea with a distributed Cassandra database for lease storage - super cool. He's also very keen on YANG and netconf, and convinced me that I should learn more about it.

I made a small suggestion on this topic, that the DHCP YANG model could maybe use JDBC URLs for configuring lease storage.

mfld

IETF attendees joke that the meeting should be called MFLD, short for "many fine lunches and dinners". Yesterday the friendly folks at isc.org kindly invited me to join them for both lunch and dinner. It's nice getting to know people in person, having worked with them over email on open source software.

fanf: (Default)

Yesterday, on IETF 101 hackathon day 1, I made a proof of concept DNS-over-HTTPS server. Today I worked on separating it from my prototyping repository, documenting it, and knocking out some interoperability bugs.

You can get doh101 from https://github.com/fanf2/doh101 and https://dotat.at/cgi/git/doh101.git, and you can send me feedback via GitHub or email to dot@dotat.at.

doh101 vs doh-proxy

Yesterday’s problem with the doh-proxy client turned out to be very simple: my server only did HTTP/1.1 whereas doh-proxy only does HTTP/2. The simple fix was to enable HTTP/2: I added http2 to the listen ssl line in my nginx.conf.

doh101 vs Firefox

Daniel Stenberg of cURL fame suggested I should try out doh101 with the DoH support in Firefox Nightly. It mysteriously did not work, for reasons that were not immediately obvious.

I could see Firefox making its initial probe query to check that my server worked, after which Firefox clearly decided that my server was broken. After some experimentation with Firefox debugging telemetry, and cURL tracing mode, and fiddling with my code to make sure it was doing the right thing with Content-Length etc. I noticed that I was sending the response with ngx.say() instead of ngx.print(): say appends a newline, so I had a byte of stray garbage after my DNS packet.

Once I fixed that, Firefox was happy! It’s useful to have such a pedantic client to test against :-)

doh101 vs HTTP

It became clear yesterday that the current DoH draft is a bit unclear about the dividing line between the DNS part and the HTTP part. I wasn't the only person that noticed this lacuna: on the way into London this morning I wrote up some notes on error handling in DNS over HTTPS, and by the time I was ready to send my notes to the list I found that Ted Hardie and Patrick McManus had already started discussing the topic. I think my notes had some usefully concrete suggestions.

Still to do

The second item on yesterday's TODO list was to improve the connection handling on the back end of my DoH proxy. I did not make any progress on that today; at the moment I don't know if it is worth spending more time on this code, or whether it would be better to drop to C and help to make an even more light-weight NGINX DoH module.

fanf: (Default)

The 101st IETF meeting is in London this coming week, and it starts with the IETF 101 Hackathon.

I thought I could do some useful work on DNS privacy. There is lots of work going on in this area, part of which is adding lots of new transport options for DNS - as well as the traditional DNS-over-UDP and DNS-over-TCP, there is now DNS-over-TLS and (soon) DNS-over-HTTPS, and maybe DNS-over-DTLS and DNS-over-QUIC.

My idea was to set up a proxy that could provide DNS-over-TLS and DNS-over-HTTPS in front of a trad DNS server (for my purposes, specifically BIND, but the proxy does not need to care).

Choice of proxy

There are a number of proxies that make sense as a base for this work: I need TLS support and HTTP support and a reasonably light-weight implementation - HAProxy is one, but I chose to use NGINX. There's a variant of NGINX called OpenResty which includes LuaJIT and a bunch of other libraries and plugins.

Lua is a really nice language for scripting an event-driven server like NGINX, since Lua has coroutines. This means you can write straight-line Lua, and whenever it calls into a potentially-blocking OpenResty API, it actually suspends the coroutine and drops into the NGINX event loop.

Existing infrastructure

I have a little virtual machine cluster on my workstation for development and prototyping - I use it for my work porting the IP Register database to PostgreSQL. The relatively unusual part of the setup is that I have a special DNS zone (signed with DNSSEC) for these VMs, and the VMs and their provisioning scripts have DNS UPDATE privileges on this DNS zone. When a VM boots and gets IP addresses from DHCP and SLAAC, it UPDATEs its entry in my dev zone. When it generates its ssh host keys, it puts the corresponding SSHFP records in the zone.

Plan of attack

My prototypes are configured using Ansible, so that there's relatively little required to bring them up to production quality. So the basic plan was to write a playbook to install OpenResty, configure NGINX, and add a bit of Lua to massage DNS-over-HTTPS requests into DNS-over-TLS.

I have a bit of previous experience with Lua, but I have never used NGINX before, so there will be a certain amount of learning...

Installation

There are pre-built OpenResty packages which are quite convenient - very similar to the existing setup I have for PostgreSQL. Happily the OpenResty packages include a SysVinit rc script so they are compatible with my non-systemd VMs.

TLS certificates

To get a certificate for my service I needed to use Let's Encrypt, which I have not done before (since we have an easy enough TLS certificate service at work). I chose to use the dehydrated ACME client since I have friends who report that it is very satisfactory. Since I'm a DNS geek, I thought it would be fun to use the ACME DNS-01 challenge. (I realised later that this was a lucky choice, since my VMs have RFC 1918 IPv4 addresses.)

One simple script (basically copied off the dehydrated wiki) and two lines of configuration, and I could run dehydrated -c and get a TLS certificate within a few seconds. Absolutely brilliant. My first experience of really using Let's Encrypt was really pleasing.
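
My hook script was essentially the wiki example; the general shape of a DNS-01 hook that pokes the challenge into the DNS with nsupdate is roughly this (an untested sketch - the key file path is made up):

    #!/bin/sh
    # dehydrated runs: hook.sh deploy_challenge <domain> <token file> <token value>
    case "$1" in
    deploy_challenge)
        printf 'update add _acme-challenge.%s. 300 TXT "%s"\nsend\n' "$2" "$4" |
            nsupdate -k /etc/dehydrated/tsig.key
        ;;
    clean_challenge)
        printf 'update delete _acme-challenge.%s. TXT\nsend\n' "$2" |
            nsupdate -k /etc/dehydrated/tsig.key
        ;;
    esac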

DNS over TLS

This is fairly easy to set up with a bit of Googling StackOverflow to work out how to configure NGINX. The trick is to use a stream {} section instead of the usual http {} section.

    stream {
        upstream dns {
            server 131.111.57.57:53;
        }

        server {
            listen [::]:853 ssl;
            proxy_pass dns;

            ssl_certificate      cert.pem;
            ssl_certificate_key  cert.key;
        }
    }

Getting to grips with OpenResty

I spent a while reading around the OpenResty web site looking for documentation, but I wasn't having much luck beyond the "hello world" example. What I needed was some example code that showed how to pick apart an HTTP request and put together a response.

I found OpenResty's Lua DNS library which I thought might be useful for cribbing from, but it did not help at all with HTTP.

Introspecting Lua

One of the coolest talks at the LuaConf I went to several years ago was about how Olivetti use (used?) Lua to write test code for their scanner/printer devices. The bulk of the firmware was written in C++, but it included a Lua interpreter which was able to dig through the RTTI, allowing a test engineer to tab-complete through the device's internal data structures and APIs.

So, being unable to find anything else, I thought I could get some idea of what the OpenResty API looked like by doing a bit of reflection.

I found a nice script called inspect.lua which pretty-prints a Lua data structure. You can dump all the global variables recursively with inspect(_G), which gave me a nice listing of the Lua standard library and some OpenResty bits.

I added the following variant of "hello world" to nginx.conf so I could conveniently curl a list of the OpenResty ngx module.

    location /ngx {
        default_type text/plain;
        content_by_lua_block {
            ngx.say(require("inspect")(ngx))
        }
    }

Writing an HTTP request handler

There were some fairly promising functions called things like ngx.req.get_method() and ngx.req.get_headers() which turned out to do reasonably obvious things. I also dug through the OpenResty Lua module sources that were installed on my dev VM to get a better idea of how they worked.

This was just about enough that I could write a handler to implement most of the DNS-over-HTTPS requirements.

The main stumbling point came when I needed to do base64url decoding of DNS query packets embedded in HTTP GET requests.

Whither base64url

After a bit of grepping it became evident that OpenResty has support for normal base64 - I found its resty.core.base64 module, but disappointingly it does not include base64url support. However, running strings on nginx revealed that nginx does have base64url support, though OpenResty does not expose it to Lua.

The base64.lua module uses the LuaJIT FFI to call some OpenResty C wrapper functions that convert vanilla C types to NGINX's internal types, then call NGINX's base64 functions.

The LuaJIT FFI is a glorious thing: as well as calling C functions, you can directly access C structures from Lua. So OpenResty's C wrapper layer is not in fact needed.

So, after a bit of clone-and-hack (and an embarrassing diversion spending ages working out that I had mistyped a variable name) I was able to write a 50 line base64url.lua module which called ngx_decode_base64url() directly.

DNS time

At this point my HTTP handler was able to get a DNS wire format query from an HTTP GET or POST request, so I needed to forward it to a DNS server, and return the response to the client.

During my base64url adventures I had worked out that the documentation I had been looking for belongs to the OpenResty Lua Nginx Module, so it was straightforward to write the back half of the proxy.

This back end is the bare minimum: just 20 lines of code including error checking and tracing. The whole DNS-over-HTTPS handler is less than 100 lines of Lua.

This is where it becomes obvious that OpenResty shines, because all the front-end POST reading and back-end socket calls are potentially blocking, but I can write straight-line sequential code without any inversion of control.

DoH clients

At this point I had a server, but no client for testing it!

I was sitting next to the author of doh-proxy who told me that he has one. I installed it, but found that it would hang after sending a request. (I did not debug this problem.)

So I decided to go low-tech.

The doh-client printed its base64url-encoded query, so I could curl it myself. I just needed something that could pretty-print the response.

First attempt was with drill which can dump DNS packets and print dumped packets. However its dump format is an ASCII hex dump, not binary, so I found myself writing a complicated curl | perl | drill pipeline and it was getting silly.

So fairly rapidly I moved on to the second attempt in pure perl, using Net::DNS for query packet construction and response packet pretty-printing, MIME::Base64 for that troublesome base64url encoding, and LWP::UserAgent for performing the HTTPS request. 20 lines of code, and I have a client that works with my server!

Tomorrow

There are a couple of obvious next steps.

Firstly, I should extract this work from my prototyping setup, so that it can be published in a form that's plausibly useful for other people. Edited to add: Done!

Secondly, my trivial DoH back end needs some work. It uses one brief DNS-over-TCP connection per HTTP request, which is rather wasteful. It would be a lot cooler to keep one or more persistent TCP connections open to the DNS server, and multiplex DNS-over-HTTP requests onto these DNS-over-TCP connections. It looks like NGINX and OpenResty have lots of support for connection pooling, so I should work out how I can make good use of it.

Today has been pleasingly successful, so I hope tomorrow will have more of the same!

fanf: (Default)

Last week, the EFF wrote about how to safely allow web servers to update ACME DNS challenges. Whereas non-wildcard Let's Encrypt certificates can be authorized by the web server itself, wildcard certs require the ACME client to put the challenge token in the DNS.

The EFF article outlined a few generic DNS workarounds (which I won't describe here), and concluded by suggesting delegating the _acme-challenge subdomain to a special ACME-DNS server.

But, if your domain is hosted with BIND, it's much easier.

First, you need to generate a TSIG key (a shared secret) which will be used by your ACME client to update the DNS. The tsig-keygen command takes the name of the key as its argument; give the key the same name as the domain it will be able to update. I write TSIG keys to files named tsig.<keyname> so I know what they are.

    $ tsig-keygen _acme-challenge.dotat.at \
        >tsig._acme-challenge.dotat.at

This file needs to be copied to the ACME client - I won't go into the details of how to get that part working.

The key needs to be included in the primary BIND server config:

    include "tsig._acme-challenge.dotat.at";

You also need to modify your zone's dynamic update configuration. My zones typically have:

    update-policy local;

The new configuration needs both the expanded form of local plus the _acme-challenge permissions, like this:

    update-policy {
        grant local-ddns zonesub any;
        grant _acme-challenge.dotat.at self _acme-challenge.dotat.at TXT;
    };

You can test that the key has restricted permissions using nsupdate. The following transcript shows that this ACME TSIG key can only add and delete TXT records at the _acme-challenge subdomain - it isn't able to update TXT records at other names, and isn't able to update non-TXT records at the _acme-challenge subdomain.

    $ nsupdate -k tsig._acme-challenge.dotat.at
    > add thing.dotat.at 3600 txt thing
    > send
    update failed: REFUSED
    > add _acme-challenge.dotat.at 3600 a 127.0.0.1
    > send
    update failed: REFUSED
    > add _acme-challenge.dotat.at 3600 txt thing
    > send
    > del _acme-challenge.dotat.at 3600 txt thing
    > send

That's it!

fanf: (Default)

For various bad reasons I haven't blogged here since I said I was moving to Dreamwidth last year. Part of it is that I haven't found a satisfactory page layout / style sheet (sigh).

I have ended up writing blog-type things in other contexts, mostly work-related. In the last couple of months I have been posting them on Jackdaw: one page of plans for the future and another with rants and raves about work in progress.

This work will include a lot more web development than I have done in the past. I expect that one side-effect will be to turn those not-quite-blog pages into a more complete blog setup with RSS/Atom feeds etc. which I will probably reuse for personal blogging on my own web site. (Something I have vaguely wanted to do for ages but have not had the tuits.)

Anyway, in the mean time I have been posting stuff on Twitter, mostly interesting links, silly retweets (and sometimes political ones, tho I try not to overdo politics), and other randomness. I posted a brief status update the week before last, and I'm still getting occasional notifications about a nerdy observation on CamelCase which became surprisingly popular - my usual standard for a popular tweet is orders of magnitude smaller :-)

fanf: (dotat)
I am in the process of moving this blog to https://fanf.dreamwidth.org/ - follow me there!
fanf: (dotat)

Ansible is the configuration management tool we use at work. It has built-in support for encrypted secrets, called ansible-vault, so you can safely store secrets in version control.

I thought I should review the ansible-vault code.

Summary

It's a bit shoddy but probably OK, provided you have a really strong vault password.

HAZMAT

The code starts off with a bad sign:

    from cryptography.hazmat.primitives.hashes import SHA256 as c_SHA256
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
    from cryptography.hazmat.backends import default_backend

I like the way the Python cryptography library calls this stuff HAZMAT, but I don't like the fact that Ansible is getting its hands dirty with HAZMAT. It's likely to lead to embarrassing cockups, and in fact Ansible has had an embarrassing cockup - there are two vault ciphers, "AES" (the cockup, now disabled for encryption, though it can still decrypt old vaults for compatibility) and "AES256" (the fixed replacement).

As a consequence of basing ansible-vault on relatively low-level primitives, it has its own Python implementations of constant-time comparison and PKCS#7 padding. Ugh.
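
For reference, this is roughly what those two primitives amount to - a hand-rolled Python sketch to illustrate the point, not Ansible's actual code (and note that Python's standard library already provides hmac.compare_digest for the constant-time comparison):

    import hmac

    def pkcs7_pad(data, blocksize=16):
        # append N bytes each with value N; the padding is never empty
        n = blocksize - len(data) % blocksize
        return data + bytes([n]) * n

    def pkcs7_unpad(data):
        n = data[-1]
        if not 1 <= n <= len(data) or data[-n:] != bytes([n]) * n:
            raise ValueError("bad PKCS#7 padding")
        return data[:-n]

    # constant-time comparison comes for free in the standard library
    assert hmac.compare_digest(b"same", b"same")
    assert pkcs7_unpad(pkcs7_pad(b"secret")) == b"secret"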

Good

Proper random numbers:

    b_salt = os.urandom(32)

Poor

Iteration count:

    b_derivedkey = PBKDF2(b_password, b_salt,
                          dkLen=(2 * keylength) + ivlength,
                          count=10000, prf=pbkdf2_prf)

PBKDF2 HMAC SHA256 takes about 24ms for 10k iterations on my machine, which is not bad but also not great - e.g. 1Password uses 100k iterations of the same algorithm, and gpg tunes its non-PBKDF2 password hash to take (by default) at least 100ms.

The deeper problem here is that Ansible has hard-coded the PBKDF2 iteration count, so it can't be changed without breaking compatibility. In gpg an encrypted blob includes the variable iteration count as a parameter.
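
It's easy to get a feel for the cost trade-off using the standard library's PBKDF2 (a quick benchmark sketch, not the Ansible code; the 80-byte output length is my assumption of what the dkLen expression works out to with 32-byte keys and a 16-byte IV):

    import hashlib
    import os
    import time

    def time_pbkdf2(iterations, dklen=80):
        password, salt = b"vault password goes here", os.urandom(32)
        t0 = time.perf_counter()
        hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=dklen)
        return time.perf_counter() - t0

    for count in (10_000, 100_000, 1_000_000):
        print(f"{count:>9} iterations: {time_pbkdf2(count) * 1000:.1f} ms")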

Ugly

ASCII armoring:

    b_vaulttext = b'\n'.join([hexlify(b_salt),
                              to_bytes(hmac.hexdigest()),
                              hexlify(b_ciphertext)])
    b_vaulttext = hexlify(b_vaulttext)

The ASCII-armoring of the ciphertext is as dumb as a brick, with hex-encoding inside hex-encoding.
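
To put a number on the bloat (a quick illustration, not Ansible's code): the inner hexlify doubles the ciphertext, and the outer one doubles it again, so the armored vaulttext ends up a bit over four times the size of the raw ciphertext.

    from binascii import hexlify

    ciphertext = b"\x00" * 1000          # pretend 1000 bytes of AES output
    salt_hex = b"s" * 64                 # stand-ins for the hex salt ...
    hmac_hex = b"h" * 64                 # ... and the hex HMAC digest
    inner = hexlify(ciphertext)          # 2000 bytes
    outer = hexlify(b"\n".join([salt_hex, hmac_hex, inner]))
    print(len(ciphertext), len(inner), len(outer))   # 1000 2000 4260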

File handling

I also (more briefly) looked through ansible-vault's higher-level code for managing vault files.

It is based on handing decrypted YAML files to $EDITOR, so it's a bit awkward if you don't want to wrap secrets in YAML or if you don't want to manipulate them in your editor.

It uses mkstemp(), so the decrypted file can be placed on a ram disk, though you might have to set TMPDIR to make sure.

It shred(1)s the file after finishing with it.

fanf: (dotat)

Because I just needed to remind myself of this, here are a couple of links with suggestions for less accidentally-racist computing terminology:

Firstly, Kerri Miller suggests blocklist / safelist for naming deny / allow data sources.

Less briefly, Bryan Liles got lots of suggestions for better distributed systems terminology. If you use a wider vocabulary to describe your systems, you can more quickly give your reader a more precise idea of how your systems work, and the relative roles of their parts.

fanf: (dotat)

Today I rolled out a significant improvement to the automatic recovery system on Cambridge University's recursive DNS servers. This change was because of three bugs.

BIND RPZ catatonia

The first bug is that sometimes BIND will lock up for a few seconds doing RPZ maintenance work. This can happen with very large and frequently updated response policy zones such as the Spamhaus Domain Block List.

When this happens on my servers, keepalived starts a failover process after a couple of seconds - it is deliberately configured to respond quickly. However, BIND soon recovers, so a few seconds later keepalived fails back.

BIND lost listening socket

This brief keepalived flap has an unfortunate effect on BIND. It sees the service addresses disappear, so it closes its listening sockets, then the service addresses reappear, so it tries to reopen its listening sockets.

Now, because the server is fairly busy, it doesn't have time to clean up all the state from the old listening socket before BIND tries to open the new one, so BIND gets an "address already in use" error.

Sadly, BIND gives up at this point - it does not keep trying periodically to reopen the socket, as you might hope.

Holy health check script, Batman!

At this point BIND is still listening on most of the interface addresses, except for a TCP socket on the public service IP address. Ideally this should have been spotted by my health check script, which should have told keepalived to fail over again.

But there's a gaping hole in the health checker's coverage: it only tests the loopback interfaces!

In a fix

Ideally all three of these bugs should be fixed. I'm not expert enough to fix the BIND bugs myself, since they are in some of the gnarliest bits of the code, so I'll leave them to the good folks at ISC.org. Even if they are fixed, I still need to fix my health check script so that it actually checks the user-facing service addresses, and there's no-one else I can leave that to.

Previously...

I wrote about my setup for recursive DNS server failover with keepalived when I set it up a couple of years ago. My recent work leaves the keepalived configuration basically unchanged, and concentrates on the health check script.

For the purpose of this article, the key feature of my keepalived configuration is that it runs the health checker script many times per second, in order to fake up dynamically reconfigurable server priorities. The old script did DNS queries inline, which was OK when it was only checking loopback addresses, but the new script needs to make typically 16 queries which is getting a bit much.

Daemonic decoupling

The new health checker is split in two.

The script called by keepalived now just examines the contents of a status file, so it runs predictably fast regardless of the speed of DNS responses.

There is a separate daemon which performs the actual health checks, and writes the results to the status file.

The speed thing is nice, but what is really important is that the daemon is naturally stateful in a way the old health checker could not be. When I started I knew statefulness was necessary because I clearly needed some kind of hysteresis or flap damping or hold-down or something.

This is much more complex

https://www.youtube.com/watch?v=DNb4VKln1uw

There is this theory of the Möbius: a twist in the fabric of space where time becomes a loop

  • BIND observes the list of network interfaces, and opens and closes listening sockets as addresses come and go.

  • The health check daemon verifies that BIND is responding properly on all the network interface addresses.

  • keepalived polls the health checker and brings interfaces up and down depending on the results.

Without care it is inevitable that unexpected interactions between these components will destroy the Enterprise!

Winning the race

The health checker gets into races with the other daemons when interfaces are deleted or added.

The deletion case is simpler. The health checker gets the list of addresses, then checks them all in turn. If keepalived deletes an address during this process then the checker can detect a failure - but actually, it's OK if we don't get a response from a missing address! Fortunately there is a distinctive error message in this case which the health checker can treat as an alternative successful response.

New interfaces are more tricky, because the health checker needs to give BIND a little time to open its sockets. It would be really bad if the server appeared healthy, keepalived brought up the addresses, and the health checker then tested them before BIND was ready, causing an immediate failure - a huge flap.

Back off

The main technique that the new health checker uses to suppress flapping is exponential backoff.

Normally, when everything is working, the health checker queries every network interface address, writes an OK to the status file, then sleeps for 1 second before looping.

When a query fails, it immediately writes BAD to the status file, and sleeps for a while before looping. The sleep time increases exponentially as more failures occur, so repeated failures cause longer and longer intervals before the server tries to recover.

Exponential backoff handles my original problem somewhat indirectly: if there's a flap that causes BIND to lose a listening socket, there will then be a (hopefully short) series of slower and slower flaps until eventually a flap is slow enough that BIND is able to re-open the socket and the server recovers. I will probably have to tune the backoff parameters to minimize the disruption in this kind of event.
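
The core of the daemon's main loop looks roughly like this (a simplified Python sketch, not the real script - check_all(), the status file path, and the tuning constants are all placeholders):

    import time

    STATUS_FILE = "/run/dns-health/status"              # placeholder path
    OK_SLEEP, MIN_SLEEP, MAX_SLEEP = 1.0, 2.0, 300.0    # made-up tuning

    def write_status(text):
        with open(STATUS_FILE, "w") as f:
            f.write(text + "\n")

    def health_loop(check_all):
        # check_all() queries every interface address; True means all healthy
        fail_sleep = MIN_SLEEP
        while True:
            if check_all():
                write_status("OK")
                # halve rather than zero the backoff, so a continuing
                # flap keeps slowing down (see "Hold down" below)
                fail_sleep = max(MIN_SLEEP, fail_sleep / 2)
                time.sleep(OK_SLEEP)
            else:
                write_status("BAD")
                time.sleep(fail_sleep)
                fail_sleep = min(MAX_SLEEP, fail_sleep * 2)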

Hold down

Another way to suppress flapping is to avoid false recoveries.

When all the test queries succeed, the new health checker decreases the failure sleep time, rather than zeroing it, so if more failures occur the exponential backoff can continue. It still reports the success immediately to keepalived, because I want true recoveries to be fast, for instance if the server accidentally crashes and is restarted.

The hold-down mechanism is linked to the way the health checker keeps track of network interface addresses.

After an interface goes away the checker does not decrease the sleep time for several seconds even if the queries are now working OK. This hold-down is supposed to cover a flap where the interface immediately returns, in which case we want exponential backoff to continue.

Similarly, to avoid those tricky races, we also record the time when each interface is brought up, so we can ignore failures that occur in the first few seconds.

Result

It took quite a lot of headscratching and trial and error, but in the end I think I came up with something reasonably simple. Rather than targeting it specifically at failures I have observed in production, I have tried to use general purpose robustness techniques, and I hope this means it will behave OK if some new weird problem crops up.

Actually, I hope NO new weird problems crop up!

PS. the ST:TNG quote above is because I have recently been listening to my old Orbital albums again - https://www.youtube.com/watch?v=RlB-PN3M1vQ

fanf: (dotat)

Following on from my recent item about the leap seconds list I have come up with a better binary encoding which is half the size of my previous attempt. (Compressing it with deflate no longer helps.)

Here's the new binary version of the leap second list, 15 bytes displayed as hexadecimal in network byte order. I have not updated it to reflect the latest Bulletin C which was promulgated this week, because the official leap second lists have not yet been updated.

    001111111211343 12112229D5652F4

This is a string of 4-bit nybbles, upper nybble before lower nybble of each byte.

If the value of this nybble V is less than 0x8, it is treated as a pair of nybbles 0x9V. This abbreviates the common case.

Otherwise, we consider this nybble Q and the following nybble V as a pair 0xQV.

Q contains four flags,

    +-----+-----+-----+-----+
    |  W  |  M  |  N  |  P  |
    +-----+-----+-----+-----+

W is for width:

  • W == 1 indicates a QV pair in the string.
  • W == 0 indicates this is a bare V nybble.

M is the month multiplier:

  • M == 1 indicates the nybble V is multiplied by 1
  • M == 0 indicates the nybble V is multiplied by 6

NP together are NTP-compatible leap indicator bits:

  • NP == 00 indicates no leap second
  • NP == 01 == 1 indicates a positive leap second
  • NP == 10 == 2 indicates a negative leap second
  • NP == 11 == 3 indicates an unknown leap second

The latter is equivalent to the ? terminating the text version.

The event described by NP occurs a number of months after the previous event, given by

    (M ? 1 : 6) * (V + 1).

That is, a 6 month gap can be encoded as M=1, V=5 or as M=0, V=0.

The "no leap second" option comes into play when the gap between leap seconds is too large to fit in 4 bits. In this situation you encode a number of "no leap second" gaps until the remaining gap fits.

The recommended way to break up long gaps is as follows. Gaps up to 16 months can be encoded in one QV pair. Gaps that are a multiple of 6 months long should be encoded as a number of 16*6 month gaps, followed by the remainder. Other gaps should be rounded down to a whole number of years, encoded as a multiple-of-six-months gap, followed by a gap for the remaining few months.

To align the list to a whole number of bytes, add a redundant 9 nybble to turn a bare V nybble into a QV pair.

In the current leap second list, every gap is encoded as a single V nybble, except for the 84 month gap which is encoded as QV = 0x9D, and the last 5 months encoded as QV = 0xF4.
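
To make the rules concrete, here is a small Python decoder for this format (an illustrative sketch, not the code in my leapseconds repository):

    def decode_leap_nybbles(hexstring):
        nybbles = [int(c, 16) for c in hexstring.replace(" ", "")]
        indicator = ["none", "+", "-", "?"]     # the NP leap indicator bits
        # a "none" event just extends the gap to the next real event
        events = []    # (months since previous event, indicator) pairs
        i = 0
        while i < len(nybbles):
            q = nybbles[i]
            if q < 0x8:                 # bare V nybble, treated as the pair 0x9V
                q, v, i = 0x9, q, i + 1
            else:                       # QV pair
                v, i = nybbles[i + 1], i + 2
            months = (1 if q & 0x4 else 6) * (v + 1)    # the M flag
            events.append((months, indicator[q & 0x3]))
        return events

    print(decode_leap_nybbles("001111111211343 12112229D5652F4"))
    # [(6, '+'), (6, '+'), (12, '+'), ..., (84, '+'), ..., (5, '?')]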

Once the leap second lists have been updated, the latest Bulletin C will change the final nybble from 4 to A.

The code now includes decoding as well as encoding functions, and a little fuzz tester to ensure they are consistent with each other. http://dotat.at/cgi/git/leapseconds.git.

fanf: (dotat)

Here's an amusing trick that I was just discussing with Mark Wooding on IRC.

Did you know you can define functions with optional and/or named arguments in C99? It's not even completely horrible!

The main limitation is there must be at least one non-optional argument, and you need to compile with -std=c99 -Wno-override-init.

(I originally wrote that this code needs C11, but Miod Vallat pointed out it works fine in C99)

The pattern works like this, using a function called repeat() as an example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Define the argument list as a structure. The dummy argument
    // at the start allows you to call the function with either
    // positional or named arguments.

    struct repeat_args {
        void *dummy;
        char *str;
        int n;
        char *sep;
    };

    // Define a wrapper macro that sets the default values and
    // hides away the rubric.

    #define repeat(...) \
            repeat((struct repeat_args){ \
                    .n = 1, .sep = " ", \
                    .dummy = NULL, __VA_ARGS__ })

    // Finally, define the function,
    // but remember to suppress macro expansion!

    char *(repeat)(struct repeat_args a) {
        if(a.n < 1)
            return(NULL);
        char *r = malloc((a.n - 1) * strlen(a.sep) +
                         a.n * strlen(a.str) + 1);
        if(r == NULL)
            return(NULL);
        strcpy(r, a.str);
        while(a.n-- > 1) { // accidentally quadratic
            strcat(r, a.sep);
            strcat(r, a.str);
        }
        return(r);
    }

    int main(void) {

        // Invoke it like this
        printf("%s\n", repeat(.str = "ho", .n = 3));

        // Or equivalently
        printf("%s\n", repeat("ho", 3, " "));

    }
fanf: (dotat)

Firstly, I have to say that it's totally awesome that I am writing this at all, and it's entirely due to the cool stuff done by people other than me. Yes! News about other people doing cool stuff with my half-baked ideas, how cool is that?

CZ.NIC Knot DNS

OK, DNS is approximately the ideal application for tries. It needs a data structure with key/value lookup and lexically ordered traversal.

When qp tries were new, I got some very positive feedback from Marek Vavrusa who I think was at CZ.NIC at the time. As well as being the Czech DNS registry, they also develop their own very competitive DNS server software. Clearly the potential for a win there, but I didn't have time to push a side project to production quality, nor any expectation that anyone else would do the work.

But, in November I got email from Vladimír Čunát telling me he had reimplemented qp tries to fix the portability bugs and missing features (such as prefix searches) in my qp trie code, and added it to Knot DNS. Knot was previously using a HAT trie.

Vladimír said qp tries could reduce total server RSS by more than 50% in a mass hosting test case. The disadvantage is that they are slightly slower than HAT tries, e.g. for the .com zone they do about twice as many memory indirections per lookup due to checking a nybble per node rather than a byte per node.

On balance, qp tries were a pretty good improvement. Thanks, Vladimír, for making such effective use of my ideas!

(I've written some notes on more memory-efficient DNS name lookups in qp tries in case anyone wants to help close the speed gap...)

Rust

Shortly before Christmas I spotted that Frank Denis has a qp trie implementation in Rust!

Sadly I'm still only appreciating Rust from a distance, but when I find some time to try it out properly, this will be top of my list of things to hack around with!

I think qp tries are an interesting test case for Rust, because at the core of the data structure is a tightly packed two word union with type tags tucked into the low order bits of a pointer. It is dirty low-level C, but in principle it ought to work nicely as a Rust enum, provided Rust can be persuaded to make the same layout optimizations. In my head a qp trie is a parametric recursive algebraic data type, and I wish there were a programming language with which I could express that clearly.

So, thanks, Frank, for giving me an extra incentive to try out Rust! Also, Frank's Twitter feed is ace, you should totally follow him.

Time vs space

Today I had a conversation on Twitter with @tef who has some really interesting ideas about possible improvements to qp tries.

One of the weaknesses of qp tries, at least in my proof-of-concept implementation, is that the allocator is called for every insert or delete. C's allocator is relatively heavyweight (compared to languages with tightly-coupled GCs) so it's not great to call it so frequently.

(Bagwell's HAMT paper was a major inspiration for qp tries, and he goes into some detail describing his custom allocator. It makes me feel like I'm slacking!)

There's an important trade-off between small memory size and keeping some spare space to avoid realloc() calls. I have erred on the side of optimizing for simple allocator calls and small data structure size at the cost of greater allocator stress.

@tef suggested adding extra space to each node for use as a write buffer, in a similar way to "fractal tree" indexes. As well as avoiding calls to realloc(), a write buffer could avoid malloc() calls for inserting new nodes. I was totally nerd sniped by his cool ideas!

After some intensive thinking I worked out a sketch of how write buffers might amortize allocation in qp tries. I don't think it quite matches what tef had in mind, but it's definitely intriguing. It's very tempting to steal some time to turn the sketch into code, but I fear I need to focus more on things that are directly helpful to my colleagues...

Anyway, thanks, tef, for the inspiring conversation! It also, tangentially, led me to write this item for my blog.

fanf: (dotat)

The list of leap seconds is published in a number of places. Most authoritative is the IERS list published from the Paris Observatory and most useful is the version published by NIST. However neither of them are geared up for distributing leap seconds to (say) every NTP server.

For a couple of years, I have published the leap second list in the DNS as a set of fairly human-readable AAAA records. There's an example in my message to the LEAPSECS list though I have changed the format to avoid making any of them look like real IPv6 addresses.

Poul-Henning Kamp also publishes leap second information in the DNS though he only publishes an encoding of the most recent Bulletin C rather than the whole history.

I have recently added a set of PHK-style A records at leapsecond.dotat.at listing every leap second, plus an "illegal" record representing the point after which the difference between TAI and UTC is unknown. I have also added a next.leapsecond.dotat.at which should be the same as PHK's leapsecond.utcd.org.

There are also some HINFO records that briefly explain the other records at leapsecond.dotat.at.

One advantage of using the DNS is that the response can be signed and validated using DNSSEC. One disadvantage is that big lists of addresses rapidly get quite unwieldy. There can also be problems with DNS messages larger than 512 bytes. For leapsecond.dotat.at the answers can get a bit chunky:

             plain   DNSSEC
    A          485      692
    AAAA       821     1028

Terse

I've thought up a simple and brief way to list the leap seconds, which looks like this:

    6+6+12+12+12+12+12+12+12+18+12+12+24+30+24+12+18+12+12+18+18+18+84+36+42+36+18+5?

ABNF syntax:

    leaps  =  *leap end
    leap   =  gap delta
    end    =  gap "?"
    delta  =  "-" / "+"
    gap    =  1*DIGIT

Each leap second is represented as a decimal number, which counts the months since the previous leap second, followed by a "+" or a "-" to indicate a positive or negative leap second. The sequence starts at the beginning of 1972, when TAI-UTC was 10s.

So the first leap is "6+", meaning that 6 months after the start of 1972, a leap second increases TAI-UTC to 11s. The "84+" leap represents the gap between the leap seconds at the end of 1998 and end of 2005 (7 * 12 months).

The list is terminated by a "?" to indicate that the IERS have not announced what TAI-UTC will be after that time.
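
Parsing the terse form is nearly a one-liner; here is a Python sketch for illustration (not the code in my repository):

    import re

    def parse_terse(leaps):
        # returns (months since previous event, delta) pairs,
        # where delta is "+", "-" or the terminating "?"
        return [(int(months), delta)
                for months, delta in re.findall(r"(\d+)([-+?])", leaps)]

    print(parse_terse("6+6+12+12+18+5?")[:3])   # [(6, '+'), (6, '+'), (12, '+')]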

Rationale

ITU recommendation TF.460-6 specifies in paragraph 2.1 that leap seconds can happen at the end of any month, though in practice the preference for the end of December or June has never been overridden. Negative leap seconds are also so far only a theoretical possibility.

So, counting months between leaps is the largest radix permitted, giving the smallest (shortest) numbers.

I decided to use decimal and mnemonic separators to keep it simple.

Binary

If you want something super compact, then bit-banging is the way to do it. Here's a binary version of the leap second list, 29 bytes displayed as hexadecimal in network byte order.

46464c4c 4c4c4c4c 4c524c4c 585e584c 524c4c52 52523c58 646a6452 85

Each byte has a 2-bit two's complement signed delta in the most significant bits, and a 6-bit unsigned month count in the least significant bits.

The meaning of the delta field is as follows (including the hex, binary, and decimal values):

  • 0x40 = 01 = +1 = positive leap second
  • 0x00 = 00 = 0 = no leap second
  • 0xC0 = 11 = -1 = negative leap second
  • 0x80 = 10 = -2 = end of list, like "?"

So, for example, a 6 month gap followed by a positive leap second is 0x40 + 6 == 0x46. A 12 month gap is 0x40 + 12 == 0x4c.

The "no leap second" option comes into play when the gap between leap seconds is too large to fit in 6 bits, i.e. more than 63 months, e.g. the 84 month gap between the ends of 1998 and 2005. In this situation you encode a number of "no leap second" gaps until the remaining gap is less than 63.

I have encoded the 84 month gap as 0x3c58, i.e. 0x00 + 0x3c is 60 months followed by no leap second, then 0x40 + 0x18 is 24 months followed by a positive leap second.
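
Here is a sketch of an encoder for this format in Python (for illustration only, not the code in my git repository), which reproduces the 29 bytes above from the terse list of gaps:

    def encode_leap_bytes(events):
        # events are (months since previous event, delta) pairs, with delta
        # one of "+", "-" or "?", as in the terse format described above
        delta_bits = {"+": 0x40, "-": 0xC0, "?": 0x80}
        out = bytearray()
        for months, delta in events:
            while months > 63:            # gap too big for 6 bits:
                out.append(0x00 | 60)     # "no leap second" after 60 months
                months -= 60
            out.append(delta_bits[delta] | months)
        return out.hex()

    leaps = ([(6, "+")] * 2 + [(12, "+")] * 7 + [(18, "+")] + [(12, "+")] * 2 +
             [(24, "+"), (30, "+"), (24, "+"), (12, "+"), (18, "+")] +
             [(12, "+")] * 2 + [(18, "+")] * 3 +
             [(84, "+"), (36, "+"), (42, "+"), (36, "+"), (18, "+"), (5, "?")])
    print(encode_leap_bytes(leaps))   # the 29 bytes shown above, unspaced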

Compression

There's still quite a lot of redundancy in the binary encoding. It can be reduced to 24 bytes using RFC 1951 DEFLATE compression.

Publication

There is now a TXT record at leapsecond.dotat.at containing the human-readable terse form of the leap seconds list. This gives you a 131 byte plain DNS response, or a 338 byte DNSSEC signed response.

I've published the deflated binary version using a private-use TYPE65432 record which saves 58 bytes.

There is code to download and check the consistency of the leapseconds files from the IERS, NIST, and USNO, generate the DNS records, and update the DNS if necessary, at http://dotat.at/cgi/git/leapseconds.git.

fanf: (dotat)

I have a bit of a bee in my bonnet about using domain names consistently as part of an organization's branding and communications. I don't much like the proliferation of special-purpose or short-term vanity domains.

They are particularly vexing when I am doing something security-sensitive. For example, domain name transfers. I'd like to be sure that someone is not trying to race with my transfer and steal the domain name, say.

Let's have a look at a practical example: transferring a domain from Gandi to Mythic Beasts.

(I like Gandi, but getting the University to pay their domain fees is a massive chore. So I'm moving to Mythic Beasts, who are local, friendly, accommodating, and able to invoice us.)

Edited to add: The following is more ranty and critical than is entirely fair. I should make it clear that both Mythic Beasts and Gandi are right at the top of my list of companies that it is good to work with.

This just happens to be an example where I get to see both ends of the transfer. In most cases I am transferring to or from someone else, so I don't get to see the whole process, and the technicalities are trivial compared to the human co-ordination!

First communication

    Return-Path: <opensrs-bounce@registrarmail.net>
    Message-Id: <DIGITS.DATE-osrs-transfers-DIGITS@cron01.osrs.prod.tucows.net>
    From: "Transfer" <do_not_reply@ns-not-in-service.com>
    Subject: Transfer Request for EXAMPLE.ORG

    https://approve.domainadmin.com/transfer/?domain=EXAMPLE.ORG

A classic! Four different domain names, none of which identify either of our suppliers! But I know Mythic Beasts are an OpenSRS reseller, and OpenSRS is a Tucows service.

Let's see what whois has to say about the others...

    Domain Name: REGISTRARMAIL.NET
    Registrant Name: Domain Admin
    Registrant Organization: Yummynames.com
    Registrant Street: 96 Mowat Avenue
    Registrant City: Toronto
    Registrant Email: whois@yummynames.com

"Yummynames". Oh kaaaay.

    Domain Name: YUMMYNAMES.COM
    Registrant Name: Domain Admin
    Registrant Organization: Tucows.com Co.
    Registrant Street: 96 Mowat Ave.
    Registrant City: Toronto
    Registrant Email: tucowspark@tucows.com

Well I suppose that's OK, but it's a bit of a rabbit hole.

Also,

    $ dig +short mx registrarmail.net
    10 mx.registrarmail.net.cust.a.hostedemail.com.

Even more generic than Fastmail's messagingengine.com infrastructure domain :-)

    Domain Name: HOSTEDEMAIL.COM
    Registrant Name: Domain Admin
    Registrant Organization: Tucows Inc
    Registrant Street: 96 Mowat Ave.
    Registrant City: Toronto
    Registrant Email: domain_management@tucows.com

The domain in the From: address, ns-not-in-service.com is an odd one. I have seen it in whois records before, in an obscure context. When a domain needs to be cancelled, there can sometimes be glue records inside the domain which also need to be cancelled. But they can't be cancelled if other domains depend on those glue records. So, the registrar renames the glue records into a place-holder domain, allowing the original domain to be cancelled.

So it's weird to see one of these cancellation workaround placeholder domains used for customer communications.

    Domain Name: NS-NOT-IN-SERVICE.COM
    Registrant Name: Tucows Inc.
    Registrant Organization: Tucows Inc.
    Registrant Street: 96 Mowat Ave
    Registrant City: Toronto
    Registrant Email: corpnames@tucows.com

Tucows could do better at keeping their whois records consistent!

Finally,

    Domain Name: DOMAINADMIN.COM
    Registrant Name: Tucows.com Co. Tucows.com Co.
    Registrant Organization: Tucows.com Co.
    Registrant Street: 96 Mowat Ave
    Registrant City: Toronto
    Registrant Email: corpnames@tucows.com

So good they named it twice!

Second communication

    Return-Path: <bounce+VERP@bounce.gandi.net>
    Message-ID: <DATE.DIGITS@brgbnd28.bi1.0x35.net>
    From: "<noreply"@domainnameverification.net
    Subject: [GANDI] IMPORTANT: Outbound transfer of EXAMPLE.ORG to another provider

    http://domainnameverification.net/transferout_foa/?fqdn=EXAMPLE.ORG

The syntactic anomaly in the From: line is a nice touch.

Both 0x35.net and domainnameverification.net belong to Gandi.

    Registrant Name: NOC GANDI
    Registrant Organization: GANDI SAS
    Registrant Street: 63-65 Boulevard MASSENA
    Registrant City: Paris
    Registrant Email: noc@gandi.net

Impressively consistent whois :-)

Third communication

    Return-Path: <opensrs-bounce@registrarmail.net>
    Message-Id: <DIGITS.DATE-osrs-transfers-DIGITS@cron01.osrs.prod.tucows.net>
    From: "Transfers" <dns@mythic-beasts.com>
    Subject: Domain EXAMPLE.ORG successfully transferred

OK, so this message has the reseller's branding, but the first one didn't?!

The web sites

To confirm a transfer, you have to paste an EPP authorization code into the old and new registrars' confirmation web sites.

The first site https://approve.domainadmin.com/transfer/ has very bare-bones OpenSRS branding. It's a bit of a pity they don't allow resellers to add their own branding.

The second site http://domainnameverification.net/transferout_foa/ is unbranded; it isn't clear to me why it isn't part of Gandi's normal web site and user interface. Also, it is plain HTTP without TLS!

Conclusion

What I would like from this kind of process is an impression that it is reassuringly simple - not involving loads of unexpected organizations and web sites, difficult to screw up by being inattentive. The actual experience is shambolic.

And remember that basically all Internet security rests on domain name ownership, and this is part of the process of maintaining that ownership.

Here endeth the rant.

fanf: (dotat)

So I have a toy DNS server which runs bleeding edge BIND 9 with a bunch of patches which I have submitted upstream, or which are work in progress, or just stupid. It gets upgraded a lot.

My build and install script puts the version number, git revision, and an install counter into the name of the install directory. So, when I deploy a custom hack which segfaults, I can just flip a symlink and revert to the previous less incompetently modified version.

More recently I have added an auto-upgrade feature to my BIND rc script. This was really simple, stupid: it just used the last install directory from the ls lexical sort. (You can tell it is newish because it could not possibly have coped with the BIND 9.9 -> 9.10 transition.)

It broke today.

BIND 9.11 has been released! Yay! I have lots of patches in it!

Unfortunately 9.11.0 sorts lexically before 9.11.0rc3. So my dumb auto update script refused to update. This can only happen in the transition period between a major release and the version bump on the master branch for the next pre-alpha version. This is a rare edge case for code only I will ever use.

But, I fixed it!

And I learned several cool things in the process!

I started off by wondering if I could plug dpkg --compare-versions into a sorting algorithm. But then I found that GNU sort has a -V version comparison mode. Yay! ls | sort -V!

But what is its version comparison algorithm? Is it as good as dpkg's? The coreutils documentation refers to the gnulib filevercmp() function. This appears not to exist, but gnulib does have a strverscmp() function cloned from glibc.

So I looked at the glibc man page for strverscmp(). It is a much simpler algorithm than dpkg's, but adequate for my purposes. And, I see, it is built in to GNU ls! (I should have spotted this when reading the coreutils docs, because they use ls -v as an example!)

Problem solved!

I added an option so the ls in my ls | tail -1 pipeline uses ls -v, and now my script upgrades from 9.11.0rc3 to 9.11.0 without manual jiggery pokery!

And I learned about some GNU carnival organ bells and whistles!

But ... have I seen something like this before?

If you look at the isc.org BIND distribution server, its listing is sorted by version number, not lexically, i.e. 10 sorts after 9 not after 1.

They are using the Apache mod_autoindex IndexOptions VersionSort directive, which works the same as GNU version number sorting.

Nice!

It's pleasing to see widespread support for version number comparisons, even if they aren't properly thought-through elaborate battle hardened dpkg version number comparisons.

fanf: (dotat)

We have a periodic table shower curtain. It mostly uses Arial for its lettering, as you can see from the "R" and "C" in the heading below, though some of the lettering is Helvetica, like the "t" and "r" in the smaller caption.



The lettering is very inconsistent. For instance, Chromium is set in a lighter weight than other elements - compare it with Copper


Palladium is particularly special



Platinum is Helvetica but Meitnerium is Arial - note the tops of the "t"s.


Roentgenium is all Arial; Rhenium has a Helvetica "R" but an Arial "e"!


It is a very distracting shower curtain.

fanf: (dotat)

Recently there was a thread on bind-users about "minimal responses and speeding up queries". The discussion was not very well informed about the difference between the theory and practice of additional data in DNS responses, so I wrote the following.

Background: DNS replies can contain "additional" data which (according to RFC 1034) "may be helpful in using the RRs in the other sections." Typically this means addresses of servers identified in NS, MX, or SRV records. In BIND you can turn off most additional data by setting the minimal-responses option.

Reindl Harald <h.reindl at thelounge.net> wrote:

additional responses are part of the inital question and may save asking for that information - in case the additional info is not needed by the client it saves traffic

Matus UHLAR - fantomas <uhlar at fantomas.sk>

If you turn mimimal-responses on, the required data may not be in the answer. That will result into another query send, which means number of queries increases.

There are a few situations in which additional data is useful in theory, but it's surprisingly poorly used in practice.

End-user clients are generally looking up address records, and the additional and authority records aren't of any use to them.

For MX and SRV records, additional data can reduce the need for extra A and AAAA records - but only if both A and AAAA are present in the response. If either RRset is missing the client still has to make another query to find out if it doesn't exist or wouldn't fit. Some code I am familiar with (Exim) ignores additional sections in MX responses and always does separate A and AAAA lookups, because it's simpler.

The other important case is for queries from recursive servers to authoritative servers, where you might hope that the recursive server would cache the additional data to avoid queries to the authoritative servers.

However, in practice BIND is not very good at this. For example, let's query for an MX record, then the address of one of the MX target hosts. We expect to get the address in the response to the first query, so the second query doesn't need another round trip to the authority.

Here's some log, heavily pruned for relevance.

2016-09-23.10:55:13.316 queries: info:
        view rec: query: isc.org IN MX +E(0)K (::1)
2016-09-23.10:55:13.318 resolver: debug 11:
        sending packet to 2001:500:60::30#53
;; QUESTION SECTION:
;isc.org.                       IN      MX
2016-09-23.10:55:13.330 resolver: debug 10:
        received packet from 2001:500:60::30#53
;; ANSWER SECTION:
;isc.org.               7200    IN      MX      10 mx.pao1.isc.org.
;isc.org.               7200    IN      MX      20 mx.ams1.isc.org.
;; ADDITIONAL SECTION:
;mx.pao1.isc.org.       3600    IN      A       149.20.64.53
;mx.pao1.isc.org.       3600    IN      AAAA    2001:4f8:0:2::2b
2016-09-23.10:56:13.150 queries: info:
        view rec: query: mx.pao1.isc.org IN A +E(0)K (::1)
2016-09-23.10:56:13.151 resolver: debug 11:
        sending packet to 2001:500:60::30#53
;; QUESTION SECTION:
;mx.pao1.isc.org.               IN      A

Hmf, well that's disappointing.

Now, there's a rule in RFC 2181 about ranking the trustworthiness of data:

5.4.1. Ranking data

[ snip ]
Unauthenticated RRs received and cached from the least trustworthy of those groupings, that is data from the additional data section, and data from the authority section of a non-authoritative answer, should not be cached in such a way that they would ever be returned as answers to a received query. They may be returned as additional information where appropriate. Ignoring this would allow the trustworthiness of relatively untrustworthy data to be increased without cause or excuse.

Since my recursive server is validating, and isc.org is signed, it should be able to authenticate the MX target address from the MX response, and promote its trustworthiness, instead of making another query. But BIND doesn't do that.

There are other situations where BIND fails to make good use of all the records in a response, e.g. when you get a referral for a signed zone, the response includes the DS records as well as the NS records. But BIND doesn't cache the DS records properly, so when it comes to validate the answer, it re-fetches them.

In a follow-up message, Mark Andrews says these problems are on his to-do list, so I'm looking forward to DNSSEC helping to make DNS faster :-)

fanf: (dotat)

The jq tutorial demonstrates simple filtering and rearranging of JSON data, but lacks any examples of how you might combine jq's more interesting features. So I thought it would be worth writing this up.

I want a list of which zones are in which views in my DNS server. I can get this information from BIND's statistics channel, with a bit of processing.

It's quite easy to get a list of zones, but the zone objects returned from the statistics channel do not include the zone's view.

	$ curl -Ssf http://[::1]:8053/json |
	  jq '.views[].zones[].name' |
	  head -2
	"authors.bind"
	"hostname.bind"

The view names are keys of an object further up the hierarchy.

	$ curl -Ssf http://[::1]:8053/json |
	  jq '.views | keys'
	[
	  "_bind",
	  "auth",
	  "rec"
	]

I need to get hold of the view names so I can use them when processing the zone objects.

The first trick is to_entries which turns the "views" object into an array of name/contents pairs.

	$ curl -Ssf http://[::1]:8053/json |
	  jq -C '.views ' | head -6
	{
	  "_bind": {
	    "zones": [
	      {
	        "name": "authors.bind",
	        "class": "CH",
	$ curl -Ssf http://[::1]:8053/json |
	  jq -C '.views | to_entries ' | head -8
	[
	  {
	    "key": "_bind",
	    "value": {
	      "zones": [
	        {
	          "name": "authors.bind",
	          "class": "CH",

The second trick is to save the view name in a variable before descending into the zone objects.

	$ curl -Ssf http://[::1]:8053/json |
	  jq -C '.views | to_entries |
		.[] | .key as $view |
		.value' | head -5
	{
	  "zones": [
	    {
	      "name": "authors.bind",
	      "class": "CH",

I can then use string interpolation to print the information in the format I want. (And use array indexes to prune the output so it isn't ridiculously long!)

	$ curl -Ssf http://[::1]:8053/json |
	  jq -r '.views | to_entries |
		.[] | .key as $view |
		.value.zones[0,1] |
		"\(.name) \(.class) \($view)"'
	authors.bind CH _bind
	hostname.bind CH _bind
	dotat.at IN auth
	fanf2.ucam.org IN auth
	EMPTY.AS112.ARPA IN rec
	0.IN-ADDR.ARPA IN rec

And that's it!

fanf: (silly)
Oh good grief, reading this interview with Nick Clegg.

http://www.theguardian.com/politics/2016/sep/03/nick-clegg-did-not-cater-tories-brazen-ruthlessness

I am cross about the coalition. A lot of my friends are MUCH MORE cross than me, and will never vote Lib Dem again. And this interview illustrates why.

The discussion towards the end about the university tuition fee debacle really crystallises it. The framing is about the policy, in isolation. Even, even! that the Lib Dems might have got away with it by cutting university budgets, and making the universities scream for a fee increase!

(Note that this is exactly the tactic the Tories are using to privatise the NHS.)

The point is not the betrayal over this particular policy.

The point is the political symbolism.

Earlier in the interview, Clegg is very pleased about the stunning success of the coalition agreement. It was indeed wonderful, from the policy point of view. But only wonks care about policy.

From the non-wonk point of view of many Lib Dem voters, their upstart radical lefty party suddenly switched to become part of the machine.

Many people who voted for the Lib Dems in 2010 did so because they were not New Labour and - even more - not the Tories. The coalition was a huge slap in the face.

Tuition fees were just the most blatant simple example of this fact.

Free university education is a symbol of social mobility, that you can improve yourself by work regardless of parenthood. Educating Rita, a bellwether of the social safety net.

And I am hugely disappointed that Clegg still seems to think that getting lost in the weeds of policy is more important than understanding the views of his party's supporters.

He says near the start of the interview that getting things done was his main motivation. And in May 2010 that sounded pretty awesome.

But the consequence was the destruction of much of his party's support, and the loss of any media acknowledgment that the Lib Dems have a distinct political position. Both were tenuous in 2010 and both are now laughable.

The interviewer asks him about tuition fees, and still, he fails to talk about the wider implications.
fanf: (dotat)

Towards the end of this story, a dear friend of ours pointed out that when I tweeted about having plenty of "doctors" I actually meant "medics".

Usually this kind of pedantry is not considered to be very polite, but I live in Cambridge; pedantry is our thing, and of course she was completely correct that amongst our friends, "Dr" usually means PhD, and even veterinarians outnumber medics.

And, of course, any story where it is important to point out that the doctor is actually someone who works in a hospital, is not an entirely happy story.

At least it did not involve eye surgeons!

Happy birthday

My brother-in-law's birthday is near the end of August, so last weekend we went to Sheffield to celebrate his 30th. Rachel wrote about our logistical cockups, but by the time of the party we had it back under control.

My sister's house is between our AirB&B and the shops, so we walked round to the party, then I went on to Sainsburys to get some food and drinks.

A trip to the supermarket SHOULD be boring.

Prosectomy!

While I was looking for some wet wipes, one of the shop staff nearby was shelving booze. He was trying to carry too many bottles of prosecco in his hands at once.

He dropped one.

It hit the floor about a metre from me, and smashed - exploded - something hit me in the face!

"Are you ok?" he said, though he probably swore first, and I was still working out what just happened.

"Er, yes?"

"You're bleeding!"

"What?" It didn't feel painful. I touched my face.

Blood on my hand.

Dripping off my nose. A few drops per second.

He started guiding me to the first aid kit. "Leave your shopping there."

We had to go past some other staff stacking shelves. "I just glassed a customer!" he said, probably with some other explanation I didn't hear so clearly.

I was more surprised than anything else!

What's the damage?

In the back of the shop I was given some tissue to hold against the wound. I could see in the mirror in the staff loo it was just a cut on the bridge of my nose.

I wear specs.

It could have been a LOT worse.

I rinsed off the blood and Signor Caduto Prosecco found some sterile wipes and a chair for me to sit on.

Did I need to go to hospital, I wondered? I wasn't in pain, the bleeding was mostly under control. Will I need stitches? Good grief, what a faff.

I phoned my sister, the junior doctor.

"Are you OK?" she asked. (Was she amazingly perceptive or just expecting a minor shopping quandary?)

"Er, no." (Usual bromides not helpful right now!) I summarized what had happened.

"I'll come and get you."

Suddenly!

I can't remember the exact order of events, but before I was properly under control (maybe before I made the phone call?) the staff call bell rang several times in quick succession and they all ran for the front of house.

It was a shoplifter alert!

I gathered from the staff discussions that this guy had failed to steal a chicken and had escaped in a car. Another customer was scared by the would-be-thief following her around the shop, and took refuge in the back of the shop.

Departure

In due course my sister arrived, and we went with Signor Rompere Bottiglie past large areas of freshly-cleaned floor to get my shopping. He gave us a £10 discount (I wasn't in the mood to fight for more) and I pointedly told him not to try carrying so many bottles in the future.

At the party there were at least half a dozen medics :-)

By this point my nose had stopped bleeding so I was able to inspect the damage and tweet about what happened.

My sister asked around to find someone who had plenty of A&E experience and a first aid kit. One of her friends cleaned the wound and applied steri-strips to minimize the scarring.

It's now healing nicely, though still rather sore if I touch it without care.

I just hope I don't have another trip to the supermarket which is quite so eventful...

fanf: (dotat)

Thinking of "all the best names are already taken", I wondered if the early adopters grabbed all the single character Twitter usernames. It turns out not, and there's a surprisingly large spread - one single-character username was grabbed by a user created only last year. (Missing from the list below are i 1 2 7 8 9 which give error responses to my API requests.)

I guess there has been some churn even amongst these lucky people...

         2039 : 2006-Jul-17 : w : Walter
         5511 : 2006-Sep-08 : f : Fred Oliveira
         5583 : 2006-Sep-08 : e : S VW
        11046 : 2006-Oct-30 : p : paolo i.
        11222 : 2006-Nov-01 : k : Kevin Cheng
        11628 : 2006-Nov-07 : t : ⚡️
       146733 : 2006-Dec-22 : R : Rex Hammock
       628833 : 2007-Jan-12 : y : reY
       632173 : 2007-Jan-14 : c : Coley Pauline Forest
       662693 : 2007-Jan-19 : 6 : Adrián Lamo
       863391 : 2007-Mar-10 : N : Naoki  Hiroshima
       940631 : 2007-Mar-11 : a : Andrei Zmievski
      1014821 : 2007-Mar-12 : fanf : Tony Finch
      1318181 : 2007-Mar-16 : _ : Dave Rutledge
      2404341 : 2007-Mar-27 : z : Zach Brock
      2890591 : 2007-Mar-29 : x : gene x
      7998822 : 2007-Aug-06 : m : Mark Douglass
      9697732 : 2007-Oct-25 : j : Juliette Melton
     11266532 : 2007-Dec-17 : b : jake h
     11924252 : 2008-Jan-07 : h : Helgi Þorbjörnsson
     14733555 : 2008-May-11 : v : V
     17853751 : 2008-Dec-04 : g : Greg Leding
     22005817 : 2009-Feb-26 : u : u
     50465434 : 2009-Jun-24 : L : L. That is all.
    132655296 : 2010-Apr-13 : Q : Ariel Raunstien
    287253599 : 2011-Apr-24 : 3 : Blair
    347002675 : 2011-Aug-02 : s : Science!
    400126373 : 2011-Oct-28 : 0 : j0
    898384208 : 2012-Oct-22 : 4 : 4oto
    966635792 : 2012-Nov-23 : 5 : n
   1141414200 : 2013-Feb-02 : O : O
   3246146162 : 2015-Jun-15 : d : D
(Edited to add @_)

fanf: (dotat)

On Tuesday (2016-07-26) I gave a talk at work about domain registry APIs. It was mainly humorous complaining about needlessly complicated and broken stuff, but I slipped in a few cool ideas and recommendations for software I found useful.

I have published the slides and notes (both PDF).

The talk was based on what I learned last year when writing some software that I called superglue, but the talk wasn't about the software. That's what this article is about.

What is superglue?

Firstly, it isn't finished. To be honest, it's barely started! Which is why I haven't written about it before.

The goal is automated management of DNS delegations in parent zones - hence "super" (Latin for over/above/beyond, like a parent domain) "glue" (address records sometimes required as part of a delegation).

The basic idea is that superglue should work roughly like nsdiff | nsupdate.

Recall, nsdiff takes a DNS master file, and it produces an nsupdate script which makes the live version of the DNS zone match what the master file says it should be.

Similarly, superglue takes a collection of DNS records that describe a delegation, and updates the parent zone to match what the delegation records say it should be.

A uniform interface to several domain registries

Actually superglue is a suite of programs.

When I wrote it I needed to be able to update parent delegations managed by RIPE, JISC, Nominet, and Gandi; this year I have started moving our domains to Mythic Beasts. They all have different APIs, and only Nominet supports the open standard EPP.

So superglue has a framework which helps each program conform to a consistent command line interface.

Eventually there should be a superglue wrapper which knows which provider-specific script to invoke for each domain.

Refining the user interface

DNS delegation is tricky even if you only have the DNS to deal with. However, superglue also has to take into account the different rules that various domain registries require.

For example, in my initial design, the input to superglue was going to simply be the delegations records that should go in the parent zone, verbatim.

But some domain registries do not accept DS records; instead you have to give them your DNSKEY records, and the registry will generate their preferred form of DS records in the delegation. So superglue needs to take DNSKEY records in its input, and do its own DS generation for registries that require DS rather than DNSKEY.
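
As an illustration of that DS generation step, here is a rough sketch using dnspython 2.x (my choice for the example - superglue itself is not written this way):

    import dns.dnssec
    import dns.resolver

    def ds_records(zone, digest="SHA256"):
        # fetch the zone's DNSKEY RRset and derive DS records from the
        # key-signing keys (flags 257), the way a registry would
        dnskeys = dns.resolver.resolve(zone, "DNSKEY")
        return [dns.dnssec.make_ds(zone, key, digest)
                for key in dnskeys if key.flags == 257]

    for ds in ds_records("dotat.at"):
        print(ds)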

Sticky problems

There is also a problem with glue records: my idea was that superglue would only need address records for nameservers that are within the domain whose delegation is being updated. This is the minimal set of glue records that the DNS requires, and in many cases it is easy to pull them out of the master file for the delegated zone.

However, sometimes a delegation includes name servers in a sibling domain. For example, cam.ac.uk and ic.ac.uk are siblings.

    cam.ac.uk.  NS  authdns0.csx.cam.ac.uk.
    cam.ac.uk.  NS  ns2.ic.ac.uk.
    cam.ac.uk.  NS  sns-pb.isc.org.
    ; plus a few more

In our delegation, glue is required for nameservers in cam.ac.uk, like authdns0; glue is forbidden for nameservers outside ac.uk, like sns-pb.isc.org. Those two rules are fixed by the DNS.

But for nameservers in a sibling domain, like ns2.ic.ac.uk, the DNS says glue may be present or omitted. Some registries say this optional sibling glue must be present; some registries say it must be absent.

In registries which require optional sibling glue, there is a quagmire of problems. In many cases the glue is already present, because it is part of the sibling delegation and, therefore, required - this is the case for ns2.ic.ac.uk. But when the glue is truly optional it becomes unclear who is responsible for maintaining it: the parent domain of the nameserver, or the domain that needs it for their delegation?

I basically decided to ignore that problem. I think you can work around it by doing a one-off manual set up of the optional glue, after which superglue will work. So it's similar to the other delegation management tasks that are out of scope for superglue, like registering or renewing domains.

Captain Adorable

The motivation for superglue was, in the short term, to do a bulk update of our 100+ delegations to add sns-pb.isc.org, and in the longer term, to support automatic DNSSEC key rollovers.

I wrote barely enough code to help with the short term goal, so what I have is undocumented, incomplete, and inconsistent. (Especially wrt DNSSEC support.)

And since then I haven't had time or motivation to work on it, so it remains a complete shambles.

fanf: (dotat)

Nearly two years ago I wrote about my project to convert the 25-year history of our DNS infrastructure from SCCS to git - "uplift from SCCS to git"

In May I got email from Akrem ANAYA asking for my SCCS to git scripts. I was pleased that they might be of use to someone else!

And now I have started work on moving our Managed Zone Service to a new server. The MZS is our vanity domain system, for research groups who want domain names not under cam.ac.uk.

The MZS also uses SCCS for revision control, so I have resurrected my uplift scripts for another outing. This time the job only took a couple of days, instead of a couple of months :-)

The most amusing thing I found was the cron job which stores a daily dump of the MySQL database in SCCS...
