Binary Search is Very Fast
I've been practicing using LLMs (mostly OpenAI o3-mini-high) for coding. It's good at quick prototypes, and I've recently been thinking about fast set membership operations (specifically related to CSS matching). So I've had it implement a number of different binary search algorithms.
The upshot is that modern CPUs are amazingly fast. Here's the fastest version I ended up with:
    int sSearch(int arr[], int size, int target) {
        __builtin_assume(size > 1);
        int *begin = arr;
        // two_k = largest power of two <= size.
        int two_k = 1 << (sizeof(int) * 8 - 1 - __builtin_clz(size));
        // Overlap step: if the middle element is below the target, restart
        // the search near the back of the array (the two windows overlap).
        if (begin[size / 2] < target) begin += size - (two_k - 1);
        // Bisect by power-of-two strides.
        for (int bit = two_k >> 1; bit != 0; bit >>= 1) {
            if (begin[bit - 1] < target) begin += bit;
        }
        // Final candidate: return its index if it matches, else -1.
        return *begin == target ? begin - arr : -1;
    }
This implementation is based on Orson Peters's bitwise binary search, specifically the "overlap" variant, which seems to have been originally invented by L. E. Shar in 1971. It makes a few mild assumptions (the array size is at least 2 and at most 2^31); I'm most interested in arrays of size about 15–4095.
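Just to show how it's meant to be called, here's a tiny harness of my own (not part of the benchmark):

    #include <stdio.h>

    int main(void) {
        int table[] = {2, 3, 5, 7, 11, 13, 17};  // sorted, size >= 2 as assumed
        printf("%d\n", sSearch(table, 7, 11));   // prints 4 (index of 11)
        printf("%d\n", sSearch(table, 7, 6));    // prints -1 (not present)
        return 0;
    }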
It is fast. On my Apple M1, 255 elements take 5.9ns (19 cycles) and 4095 elements take 8.4ns (27 cycles). That's crazy fast, and that's including a tiny loop overhead. Naturally this is with everything in cache and the array size fixed to keep the branch predictor happy. John tried his M4 Mac and got very similar cycle counts, though of course that chip is clocked much higher.
Generalizing a bit, there seem to be two regimes: the bisection loop seems to take about 0.4ns (1.5 cycles) per step if the data is all in one page and 0.8ns (2.5 cycles) if it's split across two pages. The prologue/epilogue adds about 4 cycles at the smaller sizes; at the larger sizes the total is actually about 5 cycles less than the bisection loop cost alone would suggest. Something something superscalarity.
Variations
I tested a huge number of different variations (thank you ChatGPT) to get to this one, all carefully benchmarked. Here's the list of variations, each with the estimated cycle count for searching a 255-element array, organized by basic algorithm:
Linear search: the basic for loop takes 265 cycles; ChatGPT's NEON version takes 105 cycles (2.5x faster!), and a better NEON version takes 83 cycles (3.2x faster). Linear search is obviously not competitive with binary search at these sizes, but I wanted to know how much faster NEON would make things, because I was considering hybrid searches.
Binary search: a basic, correct implementation takes 54 cycles. Testing for exact equality only at the end of the loop takes it down to 36 cycles, and using the overflow-prone midpoint computation takes it down further to 24 cycles (both sketched after this list). Unrolling 4x takes it to 22 cycles, and specializing the size to exactly 255 elements takes it to 19 cycles, but that's kind of fake since in my use case I won't know the size exactly.
Hybrid search: if we take the best (4x unrolled) binary search but stop with 4 elements to go, then finish with a basic linear search over those last 4 elements, that's 21 cycles, so it saves a cycle (also sketched below). Unrolling that loop doesn't help. Using NEON for the linear search also doesn't help.
Quaternary search: a basic, correct implementation is 32 cycles, and nothing improves on that much. Note that quaternary search requires a "hybrid" tail loop, since it can't go all the way down to a single element. Quaternary search has more memory accesses but fewer dependencies between them; I thought it would help, but apparently the M1 doesn't care.
Octal search: a basic version is 31 cycles, and an unrolled version is 29 cycles. Binary is best!
Shar search: this "bitwise" binary search is still basically binary search, but with a bunch of tricks to remove individual operations. It's super short and also very fast: 19 cycles. Given that it has to do 8 bisections, this is just crazy fast.
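To make a couple of the entries above concrete, here's roughly what the "equality only at the end" binary search and the hybrid variant look like. These are my reconstructions of the ideas (without the 4x unrolling), not the exact code behind the cycle counts:

    // Sketch: binary search that defers the equality test until the loop is
    // done, with the simple (lo + hi) / 2 midpoint (overflow-prone for huge
    // arrays, harmless at these sizes).
    int bSearch(const int arr[], int size, int target) {
        int lo = 0, hi = size - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (arr[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        return arr[lo] == target ? lo : -1;
    }

The hybrid variant just stops the bisection early and finishes off the last few candidates with a plain linear scan:

    // Sketch: bisect until at most 4 candidates remain, then scan them.
    int hSearch(const int arr[], int size, int target) {
        int lo = 0, hi = size - 1;
        while (hi - lo > 3) {
            int mid = (lo + hi) / 2;
            if (arr[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        for (int i = lo; i <= hi; i++)
            if (arr[i] == target)
                return i;
        return -1;
    }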
I also tested every variation on an Intel Gen 8 (Coffee Lake), though note that 1) this used GCC, not Clang; 2) I didn't micro-optimize; 3) I didn't test AVX variations. On this chip the fastest variation was the basic hybrid search, at 38 cycles. Even correcting for the 15% higher clock rate, the Intel chip is about 1.5× slower than the M1.
If I had to guess why the M1 does so much better, it really seems like dependencies between memory accesses and compute just don't bother it at all. Don't know why, but it's very impressive! For some reason the Shar search is quite bad on this Intel chip, at 52 cycles. I put the assembly into uiCA and it gave 2.5 cycles per main iteration, which doesn't seem to be what I observe. Dunno!
Conclusion
In short, binary search of moderate-size sets is very fast, especially on the Apple M1. That said, it's still quite a bit slower than a Bloom filter lookup, which just takes one memory latency, maybe 4-6 cycles. That's a bit of a bummer for what I was planning (dropping the Bloom filter entirely) but good to know.
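For reference, the kind of Bloom-filter probe I mean is roughly one load plus a couple of ALU ops; something like this single-word check (a sketch with one hash and made-up names, not my actual filter):

    #include <stdint.h>

    // Sketch: probe one 64-bit word of a filter and test a single bit.
    // 'bits', 'nwords', and the hash 'h' are stand-ins for illustration.
    static inline int bloom_maybe_contains(const uint64_t *bits,
                                           uint64_t nwords, uint64_t h) {
        uint64_t word = bits[(h >> 6) % nwords];  // the one memory access
        return (word >> (h & 63)) & 1;            // the bit test is just ALU work
    }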
I considered testing Eytzinger layout, but I wasn't sure it would be viable for my use case. It's a bit hard to believe it could improve much on these already extremely fast results, but perhaps you could do a bit better, especially on the Intel chip where SIMD instructions are better and better-optimized.
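For completeness, the Eytzinger lookup I have in mind is the standard formulation, roughly the sketch below; I haven't benchmarked it, so the details come from the literature rather than my own measurements:

    // Sketch: search over an Eytzinger (BFS-order) layout. a[1..n] holds the
    // sorted values reordered so that node k's children sit at 2k and 2k+1.
    // Returns the Eytzinger index of the target, or -1.
    int eSearch(const int a[], int n, int target) {
        int k = 1;
        while (k <= n)
            k = 2 * k + (a[k] < target);  // go right if a[k] < target, else left
        k >>= __builtin_ffs(~k);          // cancel the trailing right turns
        return (k != 0 && a[k] == target) ? k : -1;
    }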