[personal profile] fanf

This afternoon I reckon I was six deep in a stack of yaks that I needed to shave to finish this job, and four of them turned up today. I feel like everything I try to do reveals some undiscovered problem that needs fixing...

  • When the network is a bit broken, my DNS servers soon stop being able to provide answers, because the most popular sites insist on tiny TTLs so they can move fast and break things.

    As a result the DNS gets the blame for network problems, and helpdesk issues get misdirected, and confusion reigns.

  • Serve-Stale to the rescue! It was implemented towards the end of last year in BIND and is a feature of the 9.12 releases.

    • Let's deploy it! First attempt in March with 9.12.1.

    • CVE-2018-5737 appears!

      Roll back!

    • The logging is too noisy for production so we need to wait for 9.12.2 which includes a separate logging category for serve-stale.

    • Time passes...

    • Deploy 9.12.2 earlier this week, more carefully.

    • Let's make sure everything is sorted before we turn on serve-stale again! (Now we get to today.)

      • The logging settings need revising: serve-stale is enough of a shove to make it worth reviewing other noisy log categories.

      • Can we leave most of them off most of the time, and use the default_debug channel to let us turn them on when necessary?

      • This means the debug 1 level needs to be not completely appalling. Let's try it!

        • Hmm, this RPZ debug log looks a bit broken. Let's fix it!

        • Two little patches, one cosmetic, one a possible minor bug fix.

          • Need to rebase my hack branch onto master to test the patches.

          • Fix dratted merge conflicts.

        • Build patched server!

          • Build fails :-( why?

          • No enlightenment from commit logs.

          • Sigh, let's git bisect the build system to work out which commit broke things...

            • While the workstation churns away repeatedly building BIND, let's get coffee!

          • Success! The culprit is found!

          • Submit bug report

          • Work around bug, and get a successful build!

        • Test patched server!

          • The little patches seem OK, but while repeatedly restarting the server, a more worrying bug turns up!

            Sometimes when the server starts, my monitoring queries get stuck with SERVFAIL responses when they should succeed! Why?

          • Really don't want this to be anything that might affect production, so it needs investigation.

          • Turn off noisy background activity, and reproduce the problem with a simpler query stream. It's still hard to characterize the bug.

            • I'll need to test this in a less weird and more easily reconfigured server than my toy server. Let's spin up a VM.

              • Dammit, my VirtualBox setup was broken by the jessie -> stretch upgrade!

              • Work out that this is because VirtualBox is no longer included in stretch and the remnants from jessie are not compatible with the stretch kernel.

              • Reinstall VirtualBox direct from Oracle. It now works again.

            • Install BIND on the new VM with a simplified version of my toy config. Reproduce the bug.

          • Is it related to serve-stale? No. QNAME minimization? No. RPZ? No.

          • After much headscratching and experimentation, enlightenment slowly, painfully dawns.

          • Submit bug report

            Actually, the writing of the bug report, and especially the testing of the unfounded assertions and guesses as I wrote it, was a key part of pinning down this weirdness.

            I think this is one of the most obscure DNS interoperability problems I have investigated!

OK, that's it for now. I still have two patches to submit, and a revised logging configuration to finalize, so I can put serve-stale into production, so I can make it easier in some situations for my colleagues to tell the difference between a network problem and a DNS problem.
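
For anyone who hasn't met serve-stale, here is a rough Python sketch of the idea: keep answers around after their TTL runs out, and hand back the expired answer if the upstream lookup fails. This is only an illustration, not how BIND implements it. The names `StaleCache`, `resolve`, `max_stale` and the status strings are all made up for the example, and a real resolver also has to decide how long to wait for upstream before giving up and falling back to stale data.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    answer: str
    expires: float  # clock value at which the TTL runs out


class StaleCache:
    """Toy cache that serves expired answers when upstream is unreachable.

    Hypothetical illustration only; not BIND's data structures or API.
    """

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],  # returns (answer, ttl)
        max_stale: float = 3600.0,  # how long past expiry an answer may be reused
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.resolve = resolve
        self.max_stale = max_stale
        self.clock = clock
        self.entries: Dict[str, Entry] = {}

    def lookup(self, name: str) -> Tuple[Optional[str], str]:
        now = self.clock()
        entry = self.entries.get(name)

        # Normal case: the cached answer is still within its TTL.
        if entry is not None and now < entry.expires:
            return entry.answer, "hit"

        try:
            answer, ttl = self.resolve(name)
        except OSError:
            # Upstream failed: fall back to the expired answer if it is
            # not too old, otherwise report failure.
            if entry is not None and now < entry.expires + self.max_stale:
                return entry.answer, "stale"
            return None, "servfail"

        self.entries[name] = Entry(answer, now + ttl)
        return answer, "resolved"
```

With tiny TTLs and no stale fallback, every lookup after the first few seconds of an outage ends up in the "servfail" branch, which is exactly when the DNS starts getting the blame.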

Date: 2018-08-04 17:14 (UTC)
From: [personal profile] andrewducker
Oh God. This is horribly familiar.
