An update on Herbie, Summer 2022
Hi all! While I'm waiting for tests to run, let me give a brief update on what I've been doing on Herbie for the last month. To be clear, lots of other folks, mainly Brett and Edward, have also been doing great Herbie work, but this post will focus on me.
Improving Rival
Rival is Herbie's interval-arithmetic library. It is, I think, a pretty high-quality implementation:
- It supports arbitrary precision, which Herbie uses for computable reals.
- It has had some attention put toward performance; for example, multiplication is quite a bit faster than in a "textbook" implementation (see the sketch after this list).
- It supports many, many functions, not just the easy ones like `sqrt` or `log`; Rival has implementations of `fmod`, for example, and of `pow` for negative inputs.
- It has clear correctness criteria, and is tested extensively against them.
- It has rigorous treatment of invalid inputs, like `sqrt([-1, 1])`, via what we call error intervals.
- It also supports arbitrary boolean operations, like comparing intervals, and produces boolean intervals as a result.
- It also supports "movability flags", which can detect lots of cases where higher-precision re-evaluation is fruitless.
Herbie's been using Rival for many releases now, and it ensures a level of soundness that I'm very proud of.
So over the last few weeks I've done a bunch of minor maintenance and cleanup in Rival. First, I wrote some documentation, in case there are any other potential users of the library. That also led me to think a little bit about the API we want to present, which is now cleaner. And finally, I added support for the final two functions, `lgamma` and `tgamma`, which had been missing in Rival. Amusingly, I talked about an interval version of these two functions in my last update post in February 2020.
Let me say just a bit more about that last point. The two Gamma functions are the most mathematically complex functions supported by Rival, and they took quite a bit of work. The reason is this: computing the interval version of a function means solving an optimization problem, finding the function's minimum and maximum over a range. For all of the other functions supported by Rival, those minima and maxima are directly computable. However, there's no analytical closed form (as far as I know) for either the x or y positions of the Gamma function's minima and maxima. Finding them therefore requires a search, specifically a ternary search (which works because Gamma is unimodal between its poles), plus some correction for rounding error.
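To give a flavor of the idea, here's a minimal sketch of a ternary search for the minimum of a unimodal function, on plain floats (my own illustration; Rival's real version runs on bigfloats and additionally has to account for the rounding error of every evaluation):

```racket
#lang racket

;; Ternary search for the minimum of a unimodal function f on [lo, hi].
;; Each step evaluates f at two interior points and discards the third
;; of the range that cannot contain the minimum.
(define (ternary-search-min f lo hi [tolerance 1e-12])
  (if (< (- hi lo) tolerance)
      (/ (+ lo hi) 2)
      (let ([m1 (+ lo (/ (- hi lo) 3))]
            [m2 (- hi (/ (- hi lo) 3))])
        (if (< (f m1) (f m2))
            (ternary-search-min f lo m2 tolerance)  ; minimum left of m2
            (ternary-search-min f m1 hi tolerance))))) ; minimum right of m1

;; Example: the minimum of x^2 - 2x on [0, 3] is at x = 1.
(ternary-search-min (lambda (x) (- (* x x) (* 2 x))) 0.0 3.0)
```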
This all makes the interval gamma functions dreadfully slow: about 200ms on my computer for a single evaluation at 80 bits of precision, and much slower at higher precision. (I tried running it at 8000 bits of precision and it never completed.) But that's OK, because frankly I don't expect them to see much use. However, having these implementations around (even if they take a long time or time out on unusual inputs) means we can now enforce the invariant that every function Herbie supports has an interval version, which in turn allows us to deprecate a bunch of fallback algorithms that were basically never used.
Amusingly, working on this also led me to discover a bug I introduced into `math/bigfloat`, which I've now fixed.
Speeding up regimes and binary search
A few years ago, I implemented an interactive profile explorer tool
for Herbie. You can see it live on the Herbie nightly reports by
scrolling to the bottom of the page. Basically, it takes the standard `gprof` output and hyperlinks it, so that instead of scrolling up and down looking for a function by number, you can click links to browse the profile. That's made it a lot easier to find inefficiencies in Herbie.
One such inefficiency I found recently was in Herbie's regimes
pass.
Regimes basically takes several different ways ("alts") to evaluate
the same mathematical expression, and tries to learn a simple rule for
which way is most accurate on which input. This involves a lot of
repetitive trial-and-error, but we haven't put a lot of effort into
speeding this up because it only makes up about 20% of Herbie runtime.
But when I was browsing around a Herbie profile, I noticed something odd. Of that 20%, about half was spent evaluating those mathematical expressions over and over again. This is strange, because in theory you should only need to evaluate each expression once, which shouldn't take long. When I checked the code, I saw the reason: Herbie was evaluating the expressions once during every search step, not once total. This was a super-easy fix (just hoist the evaluation out of the search loop) and led to a big speed-up in regimes. Along the way I also upgraded regimes to use Herbie's batched evaluation function, which should produce another small speed-up.
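Schematically, the fix looks like this (a sketch with made-up stand-ins, not Herbie's actual code; `alts` and `evaluate` are hypothetical names):

```racket
#lang racket

;; Made-up stand-ins: `alts` are candidate expressions, and `evaluate`
;; is the expensive per-alt computation over the sampled points.
(define alts '(a1 a2 a3))
(define (evaluate alt) (list alt 'errors-at-each-point))

;; Before: every search step re-evaluated every alt.
(define (search-slow steps)
  (for/fold ([state '()]) ([_ (in-range steps)])
    (define errs (map evaluate alts)) ; recomputed on every step!
    (cons errs state)))

;; After: hoist the evaluation out of the loop and reuse the results.
(define (search-fast steps)
  (define errs (map evaluate alts))   ; computed once, up front
  (for/fold ([state '()]) ([_ (in-range steps)])
    (cons errs state)))
```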
A related component of Herbie is `bsearch`, which runs right after `regimes` and cleans up its output a bit. I simplified it, but I'm going to need to explain a bit about `bsearch` before the simplification makes sense.
`regimes` outputs a couple of things: an expression (`e`), a set of split points (`v1`, `v2`, …), and a list of alts (`a1`, `a2`, …) to use between each split point. Basically, this data corresponds to the following program:
(if (< e v1) a1 (if (< e v2) a2 ...))
Moreover, each of the split points actually corresponds to an interval `[v1a, v1b]`, where regimes has no way of picking any particular point in that interval.
The job of `bsearch` is to refine that interval using binary search. To do so, it considers the interval `[v1a, v1b]` and the two alts `a1` and `a2` used on either side of that interval. It then samples a bunch of inputs where `e` is equal to the midpoint `(v1a + v1b) / 2`, evaluates `a1` and `a2` on them, determines which is more accurate, and uses that to refine the interval further.
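In sketch form, the refinement loop looks something like this (hypothetical names, not Herbie's code; `a1-better-at?` stands in for the sample-and-compare step just described):

```racket
#lang racket

;; Refine a split point by binary search. `a1-better-at?` stands in
;; for the real test: sample inputs where e equals the midpoint,
;; evaluate both alts on them, and compare their accuracy.
(define (refine-split a1-better-at? lo hi [iters 20])
  (if (zero? iters)
      (/ (+ lo hi) 2)
      (let ([mid (/ (+ lo hi) 2)])
        (if (a1-better-at? mid)
            ;; a1 is still more accurate at mid: the split lies above.
            (refine-split a1-better-at? mid hi (sub1 iters))
            (refine-split a1-better-at? lo mid (sub1 iters))))))

;; Toy example: suppose a1 is more accurate below 1.7 and a2 above it.
(refine-split (lambda (x) (< x 1.7)) 0.0 10.0) ; converges toward 1.7
```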
The tough part is sampling inputs where `e` is equal to a particular value. How? Well, here's where a trick comes in. Suppose the programs `a1` and `a2` both have a subexpression equal to `e`. And furthermore suppose that whatever variables are used in that subexpression, they're not used anywhere else in `a1` or `a2`. Then we can treat `e` not as an expression but as a symbolic variable, and sampling points where `e` is any particular value becomes easy. This trick lets us run `bsearch` on many (though not all) results of `regimes`.
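For example (a toy illustration of my own, not Herbie's code), suppose `a1` is `(sqrt (+ (* x x) 1))`, `a2` is `(/ 1 (+ (* x x) 1))`, and `e` is `(+ (* x x) 1)`. Since `x` appears nowhere outside `e`, substituting a fresh variable `t` for `e` in both alts is sound:

```racket
#lang racket

;; Replace every occurrence of the subexpression `e` in `expr` with
;; the symbolic variable `t`. A toy version of the trick above.
(define (substitute expr e t)
  (cond
    [(equal? expr e) t]
    [(list? expr) (map (lambda (sub) (substitute sub e t)) expr)]
    [else expr]))

(define e '(+ (* x x) 1))
(substitute '(sqrt (+ (* x x) 1)) e 't) ; => '(sqrt t)
(substitute '(/ 1 (+ (* x x) 1)) e 't)  ; => '(/ 1 t)
```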
Anyway, the simplification I made to `bsearch` is this: up until now, `bsearch` sampled new points with `e` equal to a particular value. This makes sense, but unfortunately it requires us to expose the sample-new-points function to the core of Herbie, and we've been trying for a while now to separate the two. If Herbie were separate from the sampling function, we would be able to run Herbie on points gathered by, say, dynamic analysis of some binary, or points generated by some sampling program other than Herbie. This resampling step in `bsearch` was the final thing standing in the way of complete separation.
So here's my simple fix that eliminates resampling. Instead of sampling new points, where the symbolic variable `e` is set to some specific value, just take the old points and add the variable `e` to them. This leads to basically the same results (hypothetically, there could be some over-fitting, but I think the chances and impact of that are small) and avoids the need to invoke the sampler!
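Concretely, with made-up data (this isn't Herbie's actual point representation): if each sampled point binds `x`, we can compute `e` on it and extend the binding, rather than asking the sampler for fresh points where `e` hits a target value:

```racket
#lang racket

;; Toy sketch: extend existing sample points with a value for `e`,
;; instead of sampling new points where `e` equals a chosen value.
(define points '(((x . 0.5)) ((x . 2.0)) ((x . -1.0))))

;; Here e = x^2 + 1, computed from each existing point.
(define (eval-e pt)
  (define x (cdr (assq 'x pt)))
  (+ (* x x) 1))

(define points-with-e
  (for/list ([pt points])
    (cons (cons 'e (eval-e pt)) pt)))

points-with-e
;; => '(((e . 1.25) (x . 0.5)) ((e . 5.0) (x . 2.0)) ((e . 2.0) (x . -1.0)))
```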
Speeding up series expansions
For a long time, Herbie's series expansion algorithm has been a source of timeouts. Not only is this annoying (to users and to us developers), but it also keeps us from calling the series expander as often as we'd like. Now, the series expander is pretty complicated, complicated enough that I've already written two posts about how it works, plus another post promising an e-graph version (it failed). So it might look like this is just a problem we're stuck with.
However, my suspicions were raised when Brett discovered that a simple cache in the series simplifier sped it up by a factor of two. Perhaps it wasn't series expansion but the simplifier that was the problem? Indeed, the simplifier took up something like 95% of series expansion runtime, and, importantly, the simplifier is (frankly) crap. I wrote the simplifier way back in September 2014 or so, and at the time it was intended to be Herbie's main simplifier (this was before we discovered e-graphs). Therefore, it tries really hard to do a good job simplifying.
For series expansion, this is a waste. The only reason we have a special simplifier is because the series expander produces exponentially large terms if left to its own devices, and we can't feed those terms into our nice e-graph based simplifier for technical reasons. So the series simplifier should run fast and do very little; instead it seems to run forever.
Now, I had long since forgotten how the series expansion simplifier works, but reading over the (mercifully short) code I noticed that it called `cartesian-product`, a sure sign of something bad happening. It turns out that the simplifier would attempt to distribute terms in a multiplication, and to do so it would expand any product of sums into a sum of products. This is definitely super slow, and there's not really any reason to do it: even if this simplification were profitable, it would almost certainly be handled later by our full e-graph simplifier.
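To see why this blows up: a product of n sums with k terms each expands into k^n products. Here's a toy version of the expansion (my own illustration, using Racket's `cartesian-product`, not the simplifier's actual code):

```racket
#lang racket

;; Expand a product of sums into a sum of products: each factor is a
;; list of terms, and every way of picking one term per factor becomes
;; a product. Three factors of two terms already yield 2^3 = 8 terms.
(define (distribute factors)
  (cons '+ (for/list ([choice (apply cartesian-product factors)])
             (cons '* choice))))

;; (a + b) * (c + d) * (e + f)
(distribute '((a b) (c d) (e f)))
;; => '(+ (* a c e) (* a c f) (* a d e) ...)  ; 8 terms, growing as k^n
```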
Turning this distribution handling off was very easy (just deleting a few lines of code), and it immediately reduced the time taken by series simplification, and by series expansion as a whole, to almost nothing, with no timeouts at all. And Herbie's results actually improved, presumably because we were now generating shorter expressions that later Herbie phases could handle better. This was an easy win that totally eliminates an annoying cause of timeouts.
Tuning the number of iterations
One of Herbie's most important parameters is how many total iterations Herbie performs, controlled by the `--num-iters` flag. The default is four iterations; I remember that way back when it used to be three, but we bumped it up for reasons I don't remember. In the intervening time, though, Herbie has gotten dramatically faster, and in particular each iteration has gotten way, way faster (sampling, which happens before any of the iterations, has not sped up as dramatically). So it makes sense to ask whether this default is still optimal.
To test this, I ran Herbie on its full test suite at a varying number of iterations and looked at its runtime and its accuracy. I expected accuracy to improve but runtime to worsen as the number of iterations crept up. Here are the results:
| N | Accuracy | Time | Fail | Done |
|---|---|---|---|---|
| *0* | *2079* | *0.7hr* | *6* | *0* |
| *1* | *4905* | *0.8hr* | *6* | *72* |
| 2 | 5113 | 1.1hr | 8 | 131 |
| 3 | 5177 | 1.2hr | 8 | 181 |
| **4** | **5239** | **1.3hr** | **8** | **218** |
| 5 | 5247 | 1.4hr | 8 | 252 |
| 6 | 5257 | 1.5hr | 8 | 269 |
| 10 | 5262 | 1.4hr | 8 | 364 |
| *50* | *5287* | *2.2hr* | *12* | *493* |
In this table, the first column shows the number of iterations, with the default (4) in bold. The second column shows Herbie's "bits of error" metric, where higher means more accurate. Accuracy indeed increases steadily as the number of iterations goes up, but hits a clear knee at the default of four iterations. The first two rows and the last row are italicized because with 0 or 1 iterations two fewer benchmarks timed out, while with 50 four more timed out; timed-out benchmarks don't contribute to the total. (Timeouts and other failures are shown in the fourth column.)
The third column shows total runtime; the italicized rows were run under less load, so their times are artificially lower. Runtime increases pretty steadily throughout, with each additional iteration adding a few minutes. The last column shows how many benchmarks saturate and thus are not affected by additional iterations; there are 528 benchmarks in total, so even at 50 iterations most benchmarks are saturated but plenty still aren't. It looks like a truly absurd iteration limit would be necessary to get everything to saturate.
All of the above suggests that our default of 4 is actually pretty much optimal, getting all of the large accuracy improvements without incurring the most extreme costs in running time. So I'm planning to leave the default in place. Of course, future changes to Herbie might recommend reducing the number—if each iteration gets better—or increasing it—if we find new things for additional iterations to do.