fanf: (dotat)
[personal profile] fanf

Crit-bit tries have fixed-size branch nodes and a constant overhead per leaf, which means they can be used as an embedded lookup structure. Embedded lookup structures do not need any extra memory allocation; it is enough to allocate the objects that are to be indexed by the lookup structure.

An embedded lookup structure is a data structure in which the internal pointers used to search for an object (such as branch nodes) are embedded within the objects you are searching through. Each object can be a member of at most one of any particular kind of lookup structure, though an object can simultaneously be a member of several different kinds of lookup structure.

The BSD <sys/queue.h> macros are embedded linked lists. They are used frequently in the kernel, for instance in the network stack to chain mbuf packet buffers together. Each mbuf can be a member of a list and a tailq. There is also a <sys/tree.h> which is used by OpenSSH's privilege separation memory manager. Embedded red-black trees also appear in jemalloc.

embedded crit-bit branch node structure

DJB's crit-bit branch nodes require three words: bit index, left child, and right child; embedded crit-bit branches are the same with an additional parent pointer.

    struct branch {
        uint index;
        void *twig[2];
        void **parent;
    };

The "twig" child pointers are tagged to indicate whether they point to a branch node or a leaf. The parent pointer normally points to the relevant child pointer inside the parent node; it can also point at the trie's root pointer, which means there has to be exactly one root pointer in a fixed place.

(An aside about how I have been counting overhead: DJB does not include the leaf string pointer as part of the overhead of his crit-bit tries, and I have followed his lead by not counting the leaf key and value pointers in my crit-bit and qp tries. So by this logic, although an embedded branch adds four words to an object, it only counts as three words of overhead. Perhaps it would be more honest to count the total size of the data structure.)

using embedded crit-bit tries

For most purposes, embedded crit-bit tries work the same as external crit-bit tries.

When searching for an object, there is a final check that the search key matches the leaf. This check needs to know where to find the search key inside the leaf object - it should not assume the key is at the start.

When inserting a new object, you need to add a branch node to the trie. For external crit-bit tries this new branch is allocated; for embedded crit-bit tries you use the branch embedded in the new leaf object.

deleting objects from embedded crit-bit tries

This is where the fun happens. There are four objects of interest:

  • The doomed leaf object to be deleted;

  • The victim branch object which needs to remain in the trie, although it is embedded in the doomed leaf object;

  • The parent branch object pointing at the leaf, which will be unlinked from the trie;

  • The bystander leaf object in which the parent branch is embedded, which remains in the trie.

The plan is that after unlinking the parent branch from the trie, you rescue the victim branch from the doomed leaf object by moving it into the place vacated by the parent branch. You use the parent pointer in the victim branch to update the twig (or root) pointer to follow the move.

Note that you need to beware of the case where the parent branch happens to be embedded in the doomed leaf object.

exercise for the reader

Are the parent pointers necessary?

Is the movement of branches constrained enough that we will always encounter a leaf's embedded branch in the course of searching for that leaf? If so, we can eliminate the parent pointers and save a word of overhead.

conclusion

I have not implemented this idea, but following Simon Tatham's encouragement I have written this description in the hope that it inspires someone else.

Date: 2015-10-07 12:15 (UTC)
simont: A picture of me in 2016 (Default)
From: [personal profile] simont
I hadn't actually realised my comments in that other post would have the effect of encouraging you to post further things like this, but I think it's a most excellent effect if they did :-)

I actually had a subset of these thoughts myself when I read djb's crit-bit trie description the other day – I spotted the possibility of embedding the trie in the stored objects, decided you could probably find something to do about rescuing what you've called the 'victim branch object' during deletion, and then didn't figure out the rest. So I'm glad someone else did!

The use case for embedded lookup structures that always comes to my mind first is memory allocators, in which you need to tie the free blocks together into a data structure using only the memory of the free blocks themselves. So for me this immediately raises the question of whether there would be any value in using a crit-bit trie for memory allocation – e.g. would it permit fast implementation of any unusual and beneficial allocation policy, by choosing keys in some particular way?

Date: 2015-10-07 16:01 (UTC)
simont: A picture of me in 2016 (Default)
From: [personal profile] simont
I suppose that for all the streamlined design of a crit-bit trie, it is still basically just storing a list of items sorted by a single total ordering criterion, so we shouldn't expect it to enable any previously infeasible form of allocation policy.

(Unlike, say, McCreight's implementation of log-time first-fit, which uses an annotated AVL or red-black tree to arrange a log-time query taking the address ordering of free blocks and their sizes into account simultaneously.)

Date: 2015-10-08 09:22 (UTC)
From: (Anonymous)
FWIW, while googling for the McCreight paper I chanced upon http://maths-people.anu.edu.au/~brent/pd/rpb089.pdf from 1989 "Efficient Implementation of the First-Fit Strategy for Dynamic Storage Allocation" by R. P. Brent. It might be of use, or at least extra pointers.

Date: 2015-10-07 17:17 (UTC)
simont: A picture of me in 2016 (Default)
From: [personal profile] simont
Is the movement of branches constrained enough that we will always encounter a leaf's embedded branch in the course of searching for that leaf?

I think so, yes, and here is (hopefully) a proof.

We aim to show that the trie always satisfies the property that every leaf's embedded branch object (if it exists at all – since we need one fewer branch than we have leaves, there's always one leaf whose embedded branch object is totally unused) is an ancestor of that leaf.

Proof is by induction, of course. The base case is that a trie with zero or one leaf obviously obeys the invariant. Now we have to show that insertion and deletion each preserve it.

Insertion. We need one new branch object which will point at the new leaf. Obviously we'll make this the branch object embedded in the new leaf itself. (In principle we could instead find and use the free one somewhere else in the tree, as mentioned above. But that would be deliberately perverse, so let's not.)

So the new leaf's embedded branch object is indeed an ancestor of the new leaf – specifically, the very closest ancestor.

And the new branch object gets inserted in the middle of some existing link of the trie, from one branch object (or the root) to another branch object (or a leaf). So the paths from the root to all pre-existing leaves are either unchanged, or they get this new branch object added in the middle of them. But insertion never removes a branch object from any leaf's ancestry, so it cannot break the invariant.

Deletion. In your terminology: the parent branch object is the closest ancestor of the doomed leaf object. The victim branch object is currently embedded in the doomed leaf object, and therefore, by the inductive hypothesis, it's currently an ancestor of doomed leaf object. Hence, the victim branch object is also ancestor of the parent branch object. (Unless they're the same object, as you point out, but in that case we have nothing to prove anyway.)

Also by the inductive hypothesis, the parent branch object is an ancestor of the bystander leaf object. (I.e. the bystander leaf is some leaf that you can reach by following whichever of the parent branch object's child pointers doesn't lead to the doomed leaf).

But if the victim branch object is an ancestor of the parent branch object, and the parent branch object in turn is an ancestor of the bystander leaf object, then it follows that the victim branch object must be an ancestor of the bystander leaf object. So embedding the victim branch object in the bystander leaf object surely cannot break the invariant!

Caveat. “I have only proved it correct, not tried it.” I won't be completely confident of this argument until I've seen it go through a long random test run :-)

Date: 2015-10-07 14:51 (UTC)
simont: A picture of me in 2016 (Default)
From: [personal profile] simont
Hence, the victim branch object is also ancestor of the parent branch object. (Unless they're the same object, as you point out, but in that case we have nothing to prove anyway.)

Ooh – actually there's a second easy special case, which you didn't warn implementors to watch out for :-) Another possibility is that the doomed leaf object might be the one whose embedded branch object isn't used at all, in which case there's no victim branch object in the first place and it would be a mistake to follow any pointers in an effort to salvage it!

July 2025

S M T W T F S
  1 2345
6789101112
13141516171819
20212223242526
2728293031  

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated 2025-07-28 12:40
Powered by Dreamwidth Studios