Here are some ideas for how a special-purpose allocator might improve
a qp-trie implementation:
- lower memory usage
- faster allocation
- neater RCU support
- possibly less load on the TLB
- tunable fragmentation overhead
The downsides are:
- complexity - it's a custom allocator and garbage collector!
- it would only support transactional updates
Let's dig in...
COW and RCU
A few years ago I added support for transactional updates to the
qp-trie used by Knot DNS. Knot handled DNS updates by making a
complete copy of the zone, so that the old copy could continue to
serve queries while the new copy was being modified. The new code made
it possible to copy only the parts of the zone that were affected by
the update, reducing the overhead of handling small updates.
My COW (copy-on-write) code was designed to work with the RCU
(read-copy-update) concurrency framework. RCU was developed for
concurrent data structures in the Linux kernel; there is also a
userland RCU library. RCU is a combination of:
- COW data structures, so updates don't interfere with readers
- lightweight concurrency barriers, so readers do not need to take a lock
- deferred cleanup, so writers know when all readers have moved to the
new copy of a data structure and the old copy can be cleaned up
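To make that concrete, here is a minimal sketch of the reader and
writer sides of the pattern, using the userland RCU library (liburcu);
the zone type and its helper functions are hypothetical placeholders.

    #include <urcu.h>

    struct zone;                        /* some COW data structure */
    struct query;
    static struct zone *current_zone;   /* shared root pointer */

    /* hypothetical helpers standing in for the real data structure */
    void zone_answer(struct zone *z, const struct query *q);
    void zone_free(struct zone *z);

    /* reader: a lightweight critical section, no lock (each reader
     * thread also needs rcu_register_thread(), omitted here) */
    void answer_query(const struct query *q) {
        rcu_read_lock();
        zone_answer(rcu_dereference(current_zone), q);
        rcu_read_unlock();      /* the copy we used may be freed after this */
    }

    /* writer: publish the modified copy, then reclaim the old one
     * once every reader has moved on */
    void apply_update(struct zone *new_zone) {
        struct zone *old_zone = current_zone;
        rcu_assign_pointer(current_zone, new_zone);
        synchronize_rcu();      /* wait out readers of the old copy */
        zone_free(old_zone);
    }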
I used one-bit reference counts to mark the boundary between the parts
of the tree (mostly near the root) that had been copied, and the
shared parts (towards the leaves). So it wasn't a pure COW, because
the refcount manipulation required writes to the shared parts of the
tree.
memory layout
A common design for malloc() implementations (for example phkmalloc
and jemalloc) is to keep allocations of different sizes separate. Each
size class has its own free list, and each page can only satisfy
allocations from a single size class. This can reduce the amount of
searching around for free space inside malloc() and reduce the amount
of fragmentation.
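As a very simplified illustration of the idea (this is a toy, not how
any real malloc() is written):

    #include <stddef.h>

    #define NCLASSES 8

    /* every object in a page belongs to the page's size class */
    struct page {
        unsigned size_class;
        void *free_list;            /* free objects inside this page */
    };

    /* one list of partially-full pages per size class */
    static struct page *partial[NCLASSES];

    /* round a request up to its size class, e.g. 24 bytes -> 32 */
    static unsigned size_class(size_t size) {
        unsigned class = 0;
        for (size_t rounded = 16; rounded < size; rounded *= 2)
            class++;
        return class;
    }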
But in a qp-trie, nodes are often different sizes, so each step when
traversing the tree will usually require a leap to a different page,
which can increase pressure on the CPU's translation lookaside buffer
(TLB).
Could we, perhaps, make a qp-trie more friendly to the TLB, and maybe
also the prefetcher, by being more clever about how nodes are
allocated, and how they are arranged next to each other in memory?
A custom allocator seems like a lot of work for a (probably) small
performance improvement, so I have not (until recently) pursued the
idea.
refcounts vs tracing
Reference counts are often regarded as a poor substitute for "proper"
tracing garbage collection. A tracing copying collector can give you:
- cheaper allocations: just bump a pointer
- amortized free: release whole pages rather than individual nodes
- better locality and less fragmentation
- no extra write traffic to update reference counts
To get most of these advantages, the garbage collector must be able to
move objects around. What you gain in more efficient alloc and free,
you pay for by copying.
However, if all updates to our data structure are RCU transactions
that necessarily involve making copies, then tracing garbage
collection seems like less of a stretch.
rough design
Our qp-trie allocator has a bag of pages; they do not need to match
the hardware page size, but that kind of size should be about right.
For each page, we keep track of how much free space it has (so that we
can decide when it is worth evacuating and freeing it), and a note of
the RCU epoch after which it can be freed.
There's a global array of pages, containing the address of each
page. Actually, when the page table is resized, we will need to do an
RCU delayed cleanup, so there can also be a secondary array which is
waiting to be freed.
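In C, the bookkeeping might look something like this; the names and
field sizes are guesses rather than a finished design.

    #include <stdint.h>

    #define PAGE_SIZE 4096      /* need not be the hardware page size */

    struct page_meta {
        uint32_t free_bytes;    /* evacuate the page when this grows too big */
        uint32_t free_epoch;    /* RCU epoch after which to free it, 0 = in use */
    };

    struct page_table {
        uint32_t num_pages;
        void **page;                /* address of each page */
        struct page_meta *meta;     /* per-page free space and epoch */
        struct page_table *old;     /* replaced table awaiting RCU cleanup */
    };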
starting a transaction
When an update transaction is started, we obtain a fresh page where we
will put new nodes and modified copies. We use a cheap bump allocator
that just obtains another page when it runs out of space. Unlike in
many garbage-collected languages, we still manually free() nodes, to
keep count of the free space in each page.
There can only be one write transaction at a time, so the writer can
update the page metadata without interlocks.
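Here's a minimal sketch of the allocation path, assuming the
page_table and page_meta structures sketched above; get_fresh_page()
is a hypothetical helper, and because there is only one writer none of
this needs interlocks.

    #include <stddef.h>

    struct allocator {
        uint32_t page;          /* the page we are currently filling */
        uint32_t used;          /* bump pointer within that page */
    };

    uint32_t get_fresh_page(struct page_table *t);      /* hypothetical */

    static void *alloc_node(struct allocator *a, struct page_table *t,
                            size_t size) {
        if (a->used + size > PAGE_SIZE) {
            a->page = get_fresh_page(t);
            a->used = 0;
        }
        void *node = (char *)t->page[a->page] + a->used;
        a->used += size;
        return node;
    }

    /* free() only adjusts the page's free space counter, so we know
     * when a page is worth evacuating */
    static void free_node(struct page_table *t, uint32_t page, size_t size) {
        t->meta[page].free_bytes += size;
    }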
finishing a transaction
After the updates have been applied to make a new version of the tree,
we can do a bit of extra maintenance work before switching our readers
over to the new tree. I'll discuss these in more detail below:
- layout optimization: it might be worth doing some extra copying to
make the tree nicer for the prefetcher;
- garbage collection: identify which pages have too much free space,
and evacuate and compact their contents so they can be freed;
- cache eviction: if our tree is used for a cache rather than for
authoritative data, the GC phase can also discard entries that are
past their TTL.
Finally, the switch-over process:
- swap the tree's root pointers atomically
- wait for an RCU epoch so all readers are using the new tree
- free everything on the delayed cleanup list
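In terms of the liburcu calls from earlier, the switch-over might look
like this; the tree struct and cleanup function are hypothetical.

    struct node;
    struct tree {
        struct node *root;
        /* ... page table, delayed cleanup list ... */
    };

    void run_delayed_cleanup(struct tree *t);   /* hypothetical */

    void commit_transaction(struct tree *t, struct node *new_root) {
        /* swap the tree's root pointer atomically */
        rcu_assign_pointer(t->root, new_root);

        /* wait for an RCU epoch so that every reader sees the new tree */
        synchronize_rcu();

        /* nothing can reach the old nodes now: free everything on the
         * delayed cleanup list (evacuated pages, a replaced page table,
         * deleted application elements) */
        run_delayed_cleanup(t);
    }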
layout optimization
This is entirely optional: I don't know if it will have any useful
effect. The idea is to copy tree nodes into a layout that's friendly
to the CPU's prefetcher and maybe also its TLB. My best guess for how
to achieve this is, starting from the root of the tree, to copy nodes
in breadth-first order, until some heuristic limits are reached.
One of the tradeoffs is between better layout and extra memory usage
(for more copies). A minimal option might be to only copy the few
uppermost levels of the tree until they fill one page. Layout
optimization across multiple pages is more complicated.
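Something along these lines, perhaps, re-using the bump allocator
sketched earlier; the node accessors and reparent() are hypothetical
stand-ins for the real qp-trie internals.

    #include <string.h>

    #define QUEUE_MAX 1024

    size_t node_size(const struct node *n);     /* hypothetical accessors */
    uint32_t node_page(const struct node *n);
    unsigned node_children(const struct node *n);
    struct node *node_child(const struct node *n, unsigned i);
    void reparent(struct node *from, struct node *to);

    /* copy the top of the tree in breadth-first order into the page
     * currently being filled, stopping when the page is full */
    static void optimize_layout(struct allocator *a, struct page_table *t,
                                struct node *root) {
        struct node *queue[QUEUE_MAX];
        size_t head = 0, tail = 0;

        queue[tail++] = root;
        while (head < tail) {
            struct node *n = queue[head++];
            size_t size = node_size(n);
            if (a->used + size > PAGE_SIZE)
                break;                      /* heuristic limit: one page */
            struct node *copy = alloc_node(a, t, size);
            memcpy(copy, n, size);
            reparent(n, copy);              /* point the parent at the copy */
            free_node(t, node_page(n), size);
            /* enqueue children so that siblings end up next to each other */
            for (unsigned i = 0; i < node_children(copy); i++)
                if (tail < QUEUE_MAX)
                    queue[tail++] = node_child(copy, i);
        }
    }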
garbage collection
Here is a sketch of an algorithm for a full collection; I have not
worked out how to do a useful collection that touches less data.
We recursively traverse the whole tree. The argument for the recursive
function is a branch twig, i.e. a pointer to an interior node with its
metadata (bitmap etc.), and the return value is either the same as the
argument, or an altered version pointing to the node's new location.
The function makes a temporary copy of its node on the stack, then
iterates over the twigs contained in the node. Leaf twigs are copied
as is; it calls itself recursively for each branch twig.
If any of the branch twigs were changed by the recursive calls, or if
the old copy of this node was in a sufficiently-empty page, the old
copy is freed (which only alters its page's free space counter), the
new version of the node is copied to the allocation pointer, and this
recursive invocation returns the node's new location. Otherwise it
returns a pointer to the old location (and the copy on the stack is
discarded).
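Here is roughly what that looks like; the twig and node helpers are
hypothetical, and the stack copy assumes a maximum-size node.

    #include <stdbool.h>

    /* returns either the argument unchanged, or a twig pointing at the
     * node's new location */
    static struct twig gc_walk(struct gc *gc, struct twig ref) {
        /* temporary copy of the interior node on the stack */
        struct node copy = *twig_node(gc, ref);
        bool changed = false;

        for (unsigned i = 0; i < node_width(&copy); i++) {
            struct twig child = node_twig(&copy, i);
            if (twig_is_leaf(child))
                continue;                   /* leaf twigs are copied as-is */
            struct twig moved = gc_walk(gc, child);
            if (!twig_equal(moved, child)) {
                node_set_twig(&copy, i, moved);
                changed = true;
            }
        }

        if (!changed && !page_sparse(gc, twig_page(ref)))
            return ref;                     /* keep the old copy where it is */

        /* free the old copy (this only credits its page's free space
         * counter) and write the new version at the allocation pointer */
        gc_free(gc, ref);
        struct node *moved_to = gc_alloc(gc, node_size(&copy));
        *moved_to = copy;
        return twig_retarget(ref, moved_to);    /* same metadata, new address */
    }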
We can tune our fragmentation overhead by adjusting the threshold for
sufficiently-empty pages. Note that garbage collection must also
include recent allocations during the update transaction: a
transaction containing multiple updates is likely to generate garbage
because many qp-trie updates change the size of a node, even if we
update in place when we can. So the pages used for new allocations
should be treated as sufficiently-empty so that their contents are
compacted before they enter heavy read-only use.
cache eviction
So far my qp-trie code has worked well for authoritative data, but I
have not tried to make it work for a DNS cache. A cache needs to do a
couple of extra things:
- evict entries that have passed their time-to-live;
- evict older entries to keep within a size limit.
Both of these can be done as part of the garbage collection tree walk.
In BIND, activities like this are performed incrementally by
co-operatively scheduled tasks, rather than dedicated threads, which
makes them a bit more intricate to code.
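For example, the GC walk could decide whether to keep each leaf with a
check along these lines; the cache_entry type is made up for the sketch.

    #include <stdbool.h>
    #include <time.h>

    struct cache_entry {
        time_t expiry;          /* absolute expiry time computed from the TTL */
        /* ... owner name, rdata, and so on ... */
    };

    /* called by the GC walk before copying a leaf; evicted entries are
     * simply not copied, and are freed after the next RCU epoch */
    static bool should_evict(const struct cache_entry *e, time_t now,
                             bool over_size_limit) {
        if (e->expiry <= now)
            return true;        /* past its time-to-live */
        if (over_size_limit)
            return true;        /* crude size pressure; a real policy would
                                 * prefer to drop the least useful entries */
        return false;
    }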
small pointers
The page table allows us to use much smaller node pointers.
Instead of using a native 64-bit pointer, we can refer to a node by
the index of its page in the page table and the position of the node
in its page, which together can easily fit in 32 bits. This requires a
double indirection to step from one node to the next, but the page
table should be in cache, and qp-trie traversal is friendly to
prefetching, so we can provide hints if the processor can't prefetch
automatically.
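Concretely, a small pointer might be split into a page number and an
offset within the page, something like this (the field widths are
illustrative):

    #include <stdint.h>

    #define PAGE_BITS 12                /* 4 KiB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    typedef uint32_t nodeptr;           /* page number : offset within page */

    static inline nodeptr nodeptr_make(uint32_t page, uint32_t offset) {
        return (page << PAGE_BITS) | offset;
    }

    /* double indirection: look up the page's address, then the node */
    static inline void *nodeptr_deref(void *const *page_table, nodeptr p) {
        return (char *)page_table[p >> PAGE_BITS] + (p & (PAGE_SIZE - 1));
    }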
There are a couple of ways to make use of this saving.
We can reduce the size of each twig from 16 bytes to 12 bytes, making
the whole tree 25% smaller. This adds some constraints on leaves:
either the key and value pointers must fit in 48 bits each (which
requires unwarranted chumminess with the implementation); or we can
get hold of the key via the value (and waste 32 bits in the leaf).
Or if this is a one-pass DNS-trie we can use the extra
space for path compression, and avoid making assumptions about
pointers.
metadata placement
For each page we need to keep track of how much free space it
contains, so that we know when it should be evacuated; and something
to tell us if the page should be freed after the next RCU epoch.
It's fairly straightforward to put this metadata at the start of each
page. At the cost of a little wasted space we can make sure this
writable data doesn't share a cache line with read-only nodes.
If we are using small pointers, another option is to put per-page
metadata in the page table, or perhaps in another array parallel to
the page table to keep read-only and writable data separate.
transactions and caches
I normally think of a cache as having a lot of small point updates,
which is unlikely to be efficient with this transaction-oriented
design. But perhaps it makes sense if we split the cache into two
parts.
The main cache is read-only; we use transactional updates for eviction
based on TTL and cache size, and to bring in new records from the
working cache. It uses ordered lookups to support RFC 8198 NXDOMAIN
synthesis.
The working cache is used by the resolver to keep track of queries in
progress. It can be based on fine-grained updates and locking, rather
than being designed for a read-mostly workload. It might not need
ordered lookups at all.
Queries that miss the main cache get handed over to the resolver,
which might be able to answer them straight from the working cache, or
add the query to a list of queries waiting for the same answer, or,
when there is no match, create a new entry in the working cache that
belongs to a new resolver task.
application data
Most of what I have written above is about working with the interior
branch nodes of the tree. What about the application data hanging off
the leaf nodes?
During a transaction, any elements that we want to delete need to be
added to a free list, so that they can be cleaned up after the next
RCU epoch. When we need to modify an element, we must do so COW-style.
It's reasonable for the tree implementation to keep a free list of
application elements, so any delete or set operations will
automatically add the old element pointer to the list for later
cleanup. On the other hand, it's probably easier for the application to
COW its own data.
The only callback we will need is to free application elements during
the delayed cleanup after the RCU epoch has passed. (This is simpler
than Knot DNS, which also has callbacks for refcount manipulation.)
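So the interface between the tree and the application could be as
small as this sketch, with made-up names and no error handling:

    #include <stdlib.h>

    /* old application elements collected during a transaction */
    struct cleanup_list {
        void **element;
        size_t len, cap;
    };

    /* the only callback the application has to provide */
    typedef void (*element_free_fn)(void *element);

    /* delete and set operations call this to defer freeing an element */
    void defer_element_free(struct cleanup_list *cl, void *old_element) {
        if (cl->len == cl->cap) {
            cl->cap = cl->cap ? cl->cap * 2 : 64;
            cl->element = realloc(cl->element, cl->cap * sizeof(*cl->element));
        }
        cl->element[cl->len++] = old_element;
    }

    /* called after the RCU epoch has passed, when no reader can still
     * see the old elements */
    void run_element_cleanup(struct cleanup_list *cl,
                             element_free_fn free_element) {
        for (size_t i = 0; i < cl->len; i++)
            free_element(cl->element[i]);
        cl->len = 0;
    }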
conclusion
For a long time I was doubtful that a custom allocator for a qp-trie
would be worth the effort. But now I think it is likely to be worth it:
- The refcounting in Knot is confusing; GC seems to be a nicer way
to support RCU.
- Small pointers can save a significant amount of space, and are
more accommodating for a one-pass radix tree version.
- It will become feasible to see if layout optimization can make
queries faster.
It remains to be seen if I can find the time to turn these ideas into code!