A Feb 2020 look at Herbie

By Pavel Panchekha

Share under CC-BY-SA.

As part of my usual quality assurance routine, I did a crawl through the latest Herbie reports, looking for timeouts, long runtimes, and sampling errors. In summary, Herbie is now really fast! I found only a few benchmarks that run for more than 20s, with even quite large benchmarks running fast.

What makes Herbie slow?

Taking a look at a few timelines, it seems like Herbie now spends most of its time, especially for the slower benchmarks, in series expansion and pruning.

That said, it seems that when pruning is slow, it is just because recursive rewriting generates a lot of candidates, up to 18k in some cases. Even then, pruning runs in 36s, which is respectable enough. A big cost here is simply evaluating 18k programs on a few hundred inputs each: that comes to a few million function evaluations, which account for almost 100% of the pruning runtime. A code reorganization could reduce this number by sharing results in a few places instead of recomputing them, which might cut the cost of pruning by a small integer multiple. Speeding up pruning could allow us to generate even more candidates. Alternatively, we could look into why the rewriter is producing thousands of candidates in the first place. The rewriter is supposed to be dumb, but maybe it could avoid doing obviously stupid things.
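To make the sharing idea concrete, here is a minimal Python sketch of caching candidate evaluations over a fixed sample of points, so each (candidate, point) pair is computed at most once. This is only an illustration of the kind of reorganization I mean, not Herbie's actual (Racket) code; make_cached_evaluator and its arguments are hypothetical.

    def make_cached_evaluator(evaluate, points):
        """Return a function mapping a candidate program to its errors on every
        sample point, caching results so pruning never recomputes a pair."""
        cache = {}

        def errors(candidate):
            # Assumes candidates are hashable, e.g. expressions stored as strings.
            if candidate not in cache:
                cache[candidate] = [evaluate(candidate, pt) for pt in points]
            return cache[candidate]

        return errors

    # Hypothetical usage: ~18k candidates over a few hundred points, each pair
    # evaluated once even if pruning consults the same candidate repeatedly.
    errors = make_cached_evaluator(lambda expr, x: abs(eval(expr, {"x": x})), [0.1, 0.5, 2.0])
    errors("x * x - 1")   # computed
    errors("x * x - 1")   # served from the cache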

Series expansion, on the other hand, seems to be slow in a lot of "normal" cases, especially in a bad pattern where an expression that series expansion handles poorly appears in a few different variants. In one example, series expansion takes 5s for several different variants of a fixed expression. That's a lot of lost time. Luckily, we've been discussing overhauling series expansion for a while now, making use of e-graphs to simplify the code and speed up the computation.
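To illustrate why variants are so wasteful, here is a toy Python sketch in which expansion is memoized per subexpression, so variants that share an expensive subterm only pay for it once. This is just a sketch of the sharing idea under my own toy expression representation (nested tuples); the actual plan, as noted above, is an e-graph-based rewrite.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def series_expand(expr):
        """Toy recursive 'expansion', memoized per subexpression."""
        if not isinstance(expr, tuple):     # a variable or constant
            return expr
        op, *args = expr
        # ...the expensive per-operator expansion work would happen here...
        return (op, *(series_expand(a) for a in args))

    shared = ("/", ("sin", "x"), ("-", ("exp", "x"), 1.0))   # costly common subterm
    variant1 = ("+", shared, "x")
    variant2 = ("*", 2.0, shared)
    series_expand(variant1)
    series_expand(variant2)   # reuses the cached expansion of `shared`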

What causes timeouts?

Another issue is the rare dramatic worst-case runtime, causing some algorithm to take minutes. I only found two in the standard Herbie benchmarks, and about a dozen more in the online demo.1 [1 What a treasure a trove of user-submitted inputs is.]

  • In some cases, rewriting is at fault. This seems to cause the majority of timeouts on the web demo, so is especially worth looking into.
  • Other cases are due to series expansion. This is probably related to the issue noted above with variants of a single program, and the e-graphs fix may apply equally well.
  • On the demo, there are a few cases of sampling taking too long. Most of those are a series of related inputs that use lgamma and therefore fall back to the old sampling code. Adding lgamma support to the sampler is tough (the mathematics are complex!) but obviously not a high priority, since the new sampler handles the vast majority of inputs.

Despite timeouts being rare, they still account for over a third of nightly report runtime, since the timeout is 10 minutes and few other things take over a minute. I'm thinking of decreasing that timeout to 2.5 minutes, the same as the timeout on the web demo. From an end-user perspective, I think it's a good idea to hold our nightly reports to the same constraints as demo users.

In any case, looking into weird cases of an algorithm going berserk has yielded big dividends in the past. It may be an infinite loop, it may be accidentally quadratic behavior, or it may be something else, but tracking it down is often a boon even outside of edge cases.

What causes accuracy to worsen?

In some cases, Herbie's output is actually less accurate than its input, which shouldn't happen, since Herbie is fairly religious about running each candidate to determine if it is as accurate as claimed.

The cause of such cases is, however, fairly easy to diagnose: regime inference. This is the part of Herbie that takes multiple candidate programs and attempts to create a hybrid, using if statements to choose the best candidate for each point. On average, regime inference works quite well (roughly 80% efficiency2 [2 The metric here looks at the accuracy of the resulting program, compared with a perfect, oracular if statement that always chooses the best candidate for any point. 0% efficiency is to take the single best candidate and not create any if statements at all.]), but in these edge cases it has negative efficiency (up to -30% in the case linked above).
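For concreteness, here is one way to write the efficiency metric from footnote 2 as a small Python helper. The exact normalization is my own reading of that footnote, and the function name and arguments are hypothetical.

    def regime_efficiency(hybrid_error, best_single_error, oracle_error):
        """Sketch of the efficiency metric in footnote 2, assuming error is
        measured so that lower is better.
          100% -> the hybrid matches an oracle picking the best candidate per point
            0% -> no better than the single best candidate with no if statements
           <0% -> worse than just taking the single best candidate"""
        return (best_single_error - hybrid_error) / (best_single_error - oracle_error)

    regime_efficiency(hybrid_error=12.0, best_single_error=20.0, oracle_error=10.0)  # 0.8, i.e. 80%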

This doesn't feel that urgent. Herbie informs the user when its output is less accurate, and we know regimes is imperfect. After all, it is trying to solve a classification task, but for readability reasons it is restricted to a very small set of valid classifiers. I am fairly confident in the regimes code, and suspect that the issue is just sampling unrepresentative points. That's been the case the last few times I've investigated something similar.

Conclusion

Herbie continues to get better and better, especially in becoming more robust to novel inputs and keeping runtime under control. Over the last two years, Herbie has become dramatically faster, so much so that our old timeouts have become obsolete. Error cases in Herbie have become quite rare, as you would expect from a mature project.

Nonetheless, Herbie will continue to improve. I recently described our progress in 2019, and a lot of that work continues. I've been working on refactoring Herbie's web interface to make it more accessible and understandable, as well as to enable new functionality. Oliver continues to improve the core algorithms to speed them up and make them more reliable. And some of the research I'm doing now with novel number formats and library implementations will stress and improve Herbie as well.

Many of the issues identified above can be improved quite easily.

  • Pruning could evaluate the program fewer times, which might make Herbie another 10% faster with a very small change.
  • For series expansion we've got a new e-graph based algorithm that should produce dramatic speed-ups.
  • Regimes is an area where we can refocus our research: 25% of the accuracy Herbie does not recover is due to regime inference, and fundamentally new algorithms, perhaps ML-inspired, could lead to big improvements here.
  • Finally, the rewrite step likely has some kind of simple bug that causes rare large runtimes, which we will track down and fix.

All told, Herbie's future looks ever-brighter!
