Thanks to everyone for the informative comments on my previous post.
I found a bug in my simulation code: many of the tests were being performed on arrays that were half-full of zeroes (oops). This distorted the stats a little, and more so for larger N, because I tested fewer distinct arrays at large N. The bug caused the apparent increase in s.d. for large N: the corrected code has the same s.d. for all N. It also inflated the maximum number of iterations required to find an entry. After the fix, the maximum number of iterations drops to log(log(N))+12 for the pure secant method, and (log(N)+14)/2 for Junio's mixed secant/bisection method. Altogether much better behaved.
no subject
Date: 2009-07-30 19:17 (UTC)

Does your simulation bear this out? (The relative costs of division and memory lookup have changed since Knuth was writing §6.2.1, so it might not be true any more.)
no subject
Date: 2009-07-30 23:19 (UTC)

no subject
Date: 2009-07-31 11:30 (UTC)

In git, there's a 256-entry meta-index which reduces the number of probes in a bisection search by 8, but is probably less useful than a single iteration of the secant method when N > 2^16. Each index entry is 24 bytes, so 168 of them fit in a page. This implies that git won't see a clear benefit from the secant method until you get packs of 350 million objects, which is not really feasible when pack files are limited to 4GB :-) But this does explain why it isn't a win when large repositories currently have one or two million objects.
no subject
Date: 2009-07-30 21:34 (UTC)

no subject
Date: 2009-07-30 22:54 (UTC)