An update on Herbie, Summer 2022
Hi all! While I'm waiting for tests to run, let me give a brief update on what I've been doing on Herbie for the last month. To be clear, lots of other folks, mainly Brett and Edward, have also been doing great Herbie work, but this post will focus on me.
Improving Rival
Rival is Herbie's interval-arithmetic library. It is, I think, a pretty high-quality implementation:
- It supports arbitrary precision, which Herbie uses for computable reals.
- It has had some attention put toward performance; for example, multiplication is quite a bit faster than in a "textbook" implementation (see the sketch after this list).
- It supports many, many functions, not just the easy ones like `sqrt` or `log`; Rival has implementations of `fmod`, for example, and of `pow` for negative inputs.
- It has clear correctness criteria, and is tested extensively against them.
- It has rigorous treatment of invalid inputs, like `sqrt([-1, 1])`, via what we call error intervals.
- It also supports arbitrary boolean operations, like comparing intervals, and produces boolean intervals as a result.
- It also supports "movability flags", which can detect lots of cases where higher-precision re-evaluation is fruitless.
Herbie's been using Rival for many releases now, and it ensures a level of soundness that I'm very proud of.
So over the last few weeks I've done a bunch of minor maintenance and cleanup in Rival. First, I wrote some documentation, in case there are any other potential users of the library. That also led me to think a little bit about the API we want to present, which is now cleaner. And finally, I added support for the final two functions, `lgamma` and `tgamma`, which had been missing in Rival. Amusingly, I talked about an interval version of these two functions in my last update post in February 2020.
Let me say just a bit more about that last point. The two Gamma functions are the most mathematically complex functions supported by Rival, and they took quite a bit of work. The reason is this: computing the interval version of a function means solving an optimization problem, finding the function's minimum and maximum over a range. For all of the other functions supported by Rival, those minima and maxima are directly computable. However, there's no analytical closed form (as far as I know) for either the x or y positions of the Gamma function's minima and maxima. Finding them therefore requires a search, specifically a ternary search (which works because Gamma is unimodal between its poles), plus some correction for rounding error.
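To give a flavor of the idea, here's a minimal sketch of a ternary search for the minimum of a unimodal function, on plain floats (my own illustration; Rival's real version runs on bigfloats and additionally has to account for the rounding error of every evaluation):

```racket
#lang racket

;; Ternary search for the minimum of a unimodal function f on [lo, hi].
;; Each step evaluates f at two interior points and discards the third
;; of the range that cannot contain the minimum.
(define (ternary-search-min f lo hi [tolerance 1e-12])
  (if (< (- hi lo) tolerance)
      (/ (+ lo hi) 2)
      (let ([m1 (+ lo (/ (- hi lo) 3))]
            [m2 (- hi (/ (- hi lo) 3))])
        (if (< (f m1) (f m2))
            (ternary-search-min f lo m2 tolerance)  ; minimum left of m2
            (ternary-search-min f m1 hi tolerance))))) ; minimum right of m1

;; Example: the minimum of x^2 - 2x on [0, 3] is at x = 1.
(ternary-search-min (lambda (x) (- (* x x) (* 2 x))) 0.0 3.0)
```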
This all makes the interval gamma functions dreadfully slow: about 200ms on my computer for a single evaluation at 80 bits of precision, and much slower at higher precision. (I tried running it at 8000 bits of precision and it never completed.) But that's OK, because frankly I don't expect them to see much use. However, having these implementations around (even if they take a long time or time out on unusual inputs) means we can now enforce the invariant that every function Herbie supports has an interval version, which in turn allows us to deprecate a bunch of fallback algorithms that were basically never used.
Amusingly, working on this also led me to discover a bug I introduced into `math/bigfloat`, which I've now fixed.
Speeding up regimes and binary search
A few years ago, I implemented an interactive profile explorer tool
for Herbie. You can see it live on the Herbie nightly reports by
scrolling to the bottom of the page. Basically, it takes the standard `gprof` output and hyperlinks it, so that instead of scrolling up and down looking for a function by number, you can click links to browse the profile. That's made it a lot easier to find inefficiencies in Herbie.
One such inefficiency I found recently was in Herbie's regimes
pass.
Regimes basically takes several different ways ("alts") to evaluate
the same mathematical expression, and tries to learn a simple rule for
which way is most accurate on which input. This involves a lot of
repetitive trial-and-error, but we haven't put a lot of effort into
speeding this up because it only makes up about 20% of Herbie runtime.
But when I was browsing around a Herbie profile, I noticed something odd. Of that 20%, about half was spent evaluating those mathematical expressions over and over again. This is strange, because in theory you should only need to evaluate each expression once, which shouldn't take long. When I checked the code, I saw the reason: Herbie was evaluating the expressions once during every search step, not once total. This was a super-easy fix (just hoist the evaluation out of the search loop) and led to a big speed-up in regimes. Along the way I also upgraded regimes to use Herbie's batched evaluation function, which should produce another small speed-up.
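Schematically, the fix looks like this (a sketch with made-up stand-ins, not Herbie's actual code; `alts` and `evaluate` are hypothetical names):

```racket
#lang racket

;; Made-up stand-ins: `alts` are candidate expressions, and `evaluate`
;; is the expensive per-alt computation over the sampled points.
(define alts '(a1 a2 a3))
(define (evaluate alt) (list alt 'errors-at-each-point))

;; Before: every search step re-evaluated every alt.
(define (search-slow steps)
  (for/fold ([state '()]) ([_ (in-range steps)])
    (define errs (map evaluate alts)) ; recomputed on every step!
    (cons errs state)))

;; After: hoist the evaluation out of the loop and reuse the results.
(define (search-fast steps)
  (define errs (map evaluate alts))   ; computed once, up front
  (for/fold ([state '()]) ([_ (in-range steps)])
    (cons errs state)))
```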
A related component of Herbie is `bsearch`, which runs right after `regimes` and cleans up its output a bit. I simplified it, but I'm going to need to explain a bit about `bsearch` before the simplification makes sense.
`regimes` outputs a couple of things: an expression (`e`), a set of split points (`v1`, `v2`, …), and a list of alts (`a1`, `a2`, …) to use between each split point. Basically, this data corresponds to the following program:
(if (< e v1) a1 (if (< e v2) a2 ...))
Moreover, each of the split points actually corresponds to an interval `[v1a, v1b]`, where regimes has no way of picking any particular point in that interval.
The job of `bsearch` is to refine that interval using binary search. To do so, it considers the interval `[v1a, v1b]` and the two alts `a1` and `a2` used on either side of that interval. It then samples a bunch of inputs where `e` is equal to the midpoint `(v1a + v1b) / 2`, evaluates `a1` and `a2` on them, determines which is more accurate, and uses that to refine the interval further.
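In sketch form, the refinement loop looks something like this (hypothetical names, not Herbie's code; `a1-better-at?` stands in for the sample-and-compare step just described):

```racket
#lang racket

;; Refine a split point by binary search. `a1-better-at?` stands in
;; for the real test: sample inputs where e equals the midpoint,
;; evaluate both alts on them, and compare their accuracy.
(define (refine-split a1-better-at? lo hi [iters 20])
  (if (zero? iters)
      (/ (+ lo hi) 2)
      (let ([mid (/ (+ lo hi) 2)])
        (if (a1-better-at? mid)
            ;; a1 is still more accurate at mid: the split lies above.
            (refine-split a1-better-at? mid hi (sub1 iters))
            (refine-split a1-better-at? lo mid (sub1 iters))))))

;; Toy example: suppose a1 is more accurate below 1.7 and a2 above it.
(refine-split (lambda (x) (< x 1.7)) 0.0 10.0) ; converges toward 1.7
```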
The tough part is sampling inputs where `e` is equal to a particular value. How? Well, here's where a trick comes in. Suppose the programs `a1` and `a2` both have a subexpression equal to `e`. And furthermore suppose that whatever variables are used in that subexpression, they're not used anywhere else in `a1` or `a2`. Then we can treat `e` not as an expression but as a symbolic variable, and sampling points where `e` is any particular value becomes easy. This trick lets us run `bsearch` on many (though not all) results of `regimes`.
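For example (a toy illustration of my own, not Herbie's code), suppose `a1` is `(sqrt (+ (* x x) 1))`, `a2` is `(/ 1 (+ (* x x) 1))`, and `e` is `(+ (* x x) 1)`. Since `x` appears nowhere outside `e`, substituting a fresh variable `t` for `e` in both alts is sound:

```racket
#lang racket

;; Replace every occurrence of the subexpression `e` in `expr` with
;; the symbolic variable `t`. A toy version of the trick above.
(define (substitute expr e t)
  (cond
    [(equal? expr e) t]
    [(list? expr) (map (lambda (sub) (substitute sub e t)) expr)]
    [else expr]))

(define e '(+ (* x x) 1))
(substitute '(sqrt (+ (* x x) 1)) e 't) ; => '(sqrt t)
(substitute '(/ 1 (+ (* x x) 1)) e 't)  ; => '(/ 1 t)
```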
Anyway, the simplification I made to `bsearch` is this: up until now, `bsearch` sampled new points with `e` equal to a particular value. This makes sense, but unfortunately it requires us to expose the sample-new-points function to the core of Herbie, and we've been trying for a while now to separate the two. If Herbie were separate from the sampling function, we would be able to run Herbie on points gathered by, say, dynamic analysis of some binary, or points generated by some sampling program other than Herbie. This resampling step in `bsearch` was the final thing standing in the way of complete separation.
So here's my simple fix that eliminates resampling. Instead of sampling new points, where the symbolic variable `e` is set to some specific value, just take the old points and add the variable `e` to them. This leads to basically the same results (hypothetically, there could be some over-fitting, but I think the chances and impact of that are small) and avoids the need to invoke the sampler!
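Concretely, with made-up data (this isn't Herbie's actual point representation): if each sampled point binds `x`, we can compute `e` on it and extend the binding, rather than asking the sampler for fresh points where `e` hits a target value:

```racket
#lang racket

;; Toy sketch: extend existing sample points with a value for `e`,
;; instead of sampling new points where `e` equals a chosen value.
(define points '(((x . 0.5)) ((x . 2.0)) ((x . -1.0))))

;; Here e = x^2 + 1, computed from each existing point.
(define (eval-e pt)
  (define x (cdr (assq 'x pt)))
  (+ (* x x) 1))

(define points-with-e
  (for/list ([pt points])
    (cons (cons 'e (eval-e pt)) pt)))

points-with-e
;; => '(((e . 1.25) (x . 0.5)) ((e . 5.0) (x . 2.0)) ((e . 2.0) (x . -1.0)))
```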
Speeding up series expansions
For a long time, Herbie's series expansion algorithm has been a source of timeouts. Not only is this annoying (to users and to us developers), but it also keeps us from calling the series expander as often as we'd like. Now, the series expander is pretty complicated, complicated enough that I've already written two posts about how it works, plus another post promising an e-graph version (it failed). So it might look like this is just a problem we're stuck with.
However, my suspicions were raised when Brett discovered that a simple cache in the series simplifier sped it up by a factor of two. Perhaps it wasn't series expansion but the simplifier that was the problem? Indeed, the simplifier took up something like 95% of series expansion runtime, and, importantly, the simplifier is (frankly) crap. I wrote the simplifier way back in September 2014 or so, and at the time it was intended to be Herbie's main simplifier (this was before we discovered e-graphs). Therefore, it tries really hard to do a good job simplifying.
For series expansion, this is a waste. The only reason we have a special simplifier is because the series expander produces exponentially large terms if left to its own devices, and we can't feed those terms into our nice e-graph based simplifier for technical reasons. So the series simplifier should run fast and do very little; instead it seems to run forever.
Now, I had long since forgotten how the series expansion simplifier works, but reading over the (mercifully short) code I noticed that it called `cartesian-product`, a sure sign of something bad happening. It turns out that the simplifier would attempt to distribute terms in a multiplication, and to do so it would expand any product of sums into a sum of products. This is definitely super slow, and there's not really any reason to do it: even if this simplification were profitable, it would almost certainly be handled later by our full e-graph simplifier.
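To see why this blows up: a product of n sums with k terms each expands into k^n products. Here's a toy version of the expansion (my own illustration, using Racket's `cartesian-product`, not the simplifier's actual code):

```racket
#lang racket

;; Expand a product of sums into a sum of products: each factor is a
;; list of terms, and every way of picking one term per factor becomes
;; a product. Three factors of two terms already yield 2^3 = 8 terms.
(define (distribute factors)
  (cons '+ (for/list ([choice (apply cartesian-product factors)])
             (cons '* choice))))

;; (a + b) * (c + d) * (e + f)
(distribute '((a b) (c d) (e f)))
;; => '(+ (* a c e) (* a c f) (* a d e) ...)  ; 8 terms, growing as k^n
```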
Turning this distribution handling off was very easy (just deleting a few lines of code), and it immediately reduced the time taken by series simplification, and by series expansion as a whole, to almost nothing, with no timeouts at all. And Herbie's results actually improved, presumably because we were now generating shorter expressions that later Herbie phases could handle better. This was an easy win that totally eliminates an annoying cause of timeouts.
Tuning the number of iterations
One of Herbie's most important parameters is how many total iterations Herbie performs, controlled by the `--num-iters` flag. The default is four iterations; I remember that way back when it used to be three, but we bumped it up for reasons I don't remember. In the intervening time, though, Herbie has gotten dramatically faster, and in particular each iteration has gotten way, way faster (sampling, which happens before any of the iterations, has not sped up as dramatically). So it makes sense to ask whether this default is still optimal.
To test this, I ran Herbie on its full test suite at a varying number of iterations and looked at its runtime and its accuracy. I expected accuracy to improve but runtime to worsen as the number of iterations crept up. Here are the results:
| N | Accuracy | Time | Fail | Done |
|---|---|---|---|---|
| *0* | *2079* | *0.7hr* | *6* | *0* |
| *1* | *4905* | *0.8hr* | *6* | *72* |
| 2 | 5113 | 1.1hr | 8 | 131 |
| 3 | 5177 | 1.2hr | 8 | 181 |
| **4** | **5239** | **1.3hr** | **8** | **218** |
| 5 | 5247 | 1.4hr | 8 | 252 |
| 6 | 5257 | 1.5hr | 8 | 269 |
| 10 | 5262 | 1.4hr | 8 | 364 |
| *50* | *5287* | *2.2hr* | *12* | *493* |
In this table, the first column shows the number of iterations, with the default (4) in bold. The second column shows Herbie's "bits of error" metric, where higher means more accurate. Accuracy indeed increases steadily as the number of iterations goes up, but hits a clear knee at the default of four iterations. The first two rows and the last row are italicized because with 0 or 1 iterations two fewer benchmarks timed out, while with 50 four more timed out; timed-out benchmarks don't contribute to the total. (Timeouts and other failures are shown in the fourth column.)
The third column shows total runtime; the italicized rows were run under less load, so their times are artificially lower. Runtime increases pretty steadily throughout, with each additional iteration adding a few minutes. The last column shows how many benchmarks saturate and thus are not affected by additional iterations; there are 528 benchmarks in total, so even at 50 iterations most benchmarks are saturated but plenty still aren't. It looks like a truly absurd iteration limit would be necessary to get everything to saturate.
All of the above suggests that our default of 4 is actually pretty much optimal, getting all of the large accuracy improvements without incurring the most extreme costs in running time. So I'm planning to leave the default in place. Of course, future changes to Herbie might recommend reducing the number—if each iteration gets better—or increasing it—if we find new things for additional iterations to do.