Herbie Without Iterations
This proposal suggests a radically different way Herbie could work, making use of high-quality oracle-free methods of evaluating error. This new method would make Herbie simpler, faster, and possibly better, while (I hope) maintaining all the things we currently like about Herbie. This new Herbie would use egg/egglog much more heavily than today, and the greedy improvement loop would be gone.
Background
Herbie currently works by:
- Take a starting expression or expressions
- Grow an egraph containing many similar expressions
- Extract a bunch of them, hoping some are more accurate than the input
- Evaluate each of them for accuracy, keeping the good ones
- Put those back into the egraph and repeat (see the sketch below)
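Here is a schematic of that loop, just to make the shape concrete. The function and argument names are placeholders rather than Herbie's actual internals; the Herbie-specific pieces are passed in as callables so the skeleton stands on its own.

```python
# Schematic of Herbie's current improvement loop. The names are placeholders, not
# Herbie's real internals; the point is the shape: egraph, extract, evaluate, repeat.
def improve(expr, grow_egraph, extract_candidates, measured_error, keep_best,
            iterations=4):
    best = [expr]
    for _ in range(iterations):
        egraph = grow_egraph(best)               # apply rewrite rules with egg
        candidates = extract_candidates(egraph)  # pull out many similar expressions
        # The only accuracy signal comes from outside the egraph: run each candidate
        # on the sampled points and compare against the Rival ground truth.
        scored = [(measured_error(c), c) for c in candidates]
        best = keep_best(scored)                 # greedy alt-table pruning
    return best
```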
This works, but why do we have to do this dance of taking things out of the egraph and putting them back in? Why can't we just grow one giant egraph, pull out the best term, and call it good?
Well, the answer is simple: the egraph has no idea which terms are accurate. Only when we take a term out of the egraph, and run it, and compare it to a ground-truth accurate evaluation (computed with Rival), do we have any idea which terms are accurate.
This is the core reason for the loop inside Herbie: egraph to generate terms, evaluation to select the good ones, repeat.
Preliminary Work
Now, one sorta obvious idea we had before is: why can't we just evaluate the terms in the egraph, and use that? This fails for a simple reason: Herbie doesn’t have ground truths for the (often new) subexpressions in the egraph, so while we can compute what each subexpression evaluates to, we can't compare that to a ground truth to figure out how accurate it is. Basically: we lack an oracle for subexpressions; we only have a cheap oracle for the expression as a whole.
Oliver did try constructing such an oracle, using Rival, inside the egraph. That even involved porting (some of) Rival to Rust, which was pretty ballsy, and it did work. But the flaw was that it was very slow. Oliver managed to get a run that was something like 5x slower than normal Herbie while using, I think, only 4 sampled points in total. He had some cleverness about choosing those points well, and there was still a loop over the other points, but it was clearly too heavyweight to be workable.
The Proposal
But nowadays there are good oracle-free ways to evaluate accuracy: condition numbers. Bhargav and Artem, specifically, have done several projects using condition numbers to predict floating-point error, and it has worked well each time. A condition number, basically, is a number we can compute, using ordinary floating-point arithmetic, for every floating-point operation; it measures how much that operation amplifies error in its inputs, and if the condition number is small the floating-point error will also be small. This is an oracle-free way to evaluate accuracy, where by “oracle-free” I mean that we don’t need a high-precision ground-truth evaluation.
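To make that concrete, here is a toy sketch of error propagation with per-operation condition numbers. The formulas are textbook relative-condition-number facts; the (value, error) pairing and the specific operators are just for illustration, not Herbie's or Bhargav's actual analysis.

```python
# Toy sketch: propagate a relative-error estimate through an expression using
# per-operation condition numbers, computed with ordinary floating-point only.
import math

U = 2.0 ** -53  # rounding error introduced by one double-precision operation

def const(v):
    return (v, 0.0)            # (value, estimated relative error); inputs are exact

def add(x, y):
    (xv, xe), (yv, ye) = x, y
    out = xv + yv
    # Condition numbers |x/(x+y)| and |y/(x+y)|: huge when x and y nearly cancel.
    kx = abs(xv / out) if out else math.inf
    ky = abs(yv / out) if out else math.inf
    return (out, kx * xe + ky * ye + U)

def sub(x, y):
    return add(x, (-y[0], y[1]))

def sqrt(x):
    # Condition number of sqrt is 1/2, so it dampens incoming error.
    return (math.sqrt(x[0]), 0.5 * x[1] + U)

def div(x, y):
    # Division has condition number 1 in each argument (for relative error).
    return (x[0] / y[0], x[1] + y[1] + U)

x = const(1e12)
naive = sub(sqrt(add(x, const(1.0))), sqrt(x))                    # sqrt(x+1) - sqrt(x)
better = div(const(1.0), add(sqrt(add(x, const(1.0))), sqrt(x)))  # 1/(sqrt(x+1)+sqrt(x))
print(naive[1], better[1])   # the cancelling form gets a far larger error estimate
```

Inside an egraph, the same propagation could run bottom-up over e-nodes instead of over a fixed expression tree.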
So the idea is that we can, in the egraph, estimate the accuracy of every subexpression (e-node) using condition numbers, and then extract a single known-accurate expression from the egraph. We wouldn’t need the whole egraph-extract-evaluate loop (we’d build a single big egraph once and select the best expression). We wouldn’t need ground-truth evaluation at all (or maybe we’d keep it for the final re-evaluation, but we wouldn’t need ground truths on the training points).
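As a strawman, accuracy-aware extraction could look like ordinary cost-based extraction with estimated error in place of cost. The e-graph encoding and the numbers in `node_error` below are made up for illustration (this is not egg's API, and the "-" amplification factor stands in for an actual per-point condition number).

```python
# Hypothetical accuracy-aware extraction: for each e-class, pick the e-node with
# the lowest estimated error. The e-graph representation is invented for this sketch.
import math

# e-class id -> list of e-nodes; an e-node is (operator, tuple of child e-class ids)
egraph = {
    0: [("var-x", ())],
    1: [("const-1", ())],
    2: [("+", (0, 1))],
    3: [("sqrt", (2,))],
    4: [("sqrt", (0,))],
    5: [("+", (3, 4))],
    # Two equivalent ways to write sqrt(x+1) - sqrt(x):
    6: [("-", (3, 4)), ("/", (1, 5))],
}

def node_error(op, child_errors):
    """Crude per-operator error estimate; a real version would use the condition
    numbers at (many) sampled points rather than these made-up amplifications."""
    U = 2.0 ** -53
    amplification = {"-": 1e12, "+": 1.0, "/": 1.0, "sqrt": 0.5}.get(op, 0.0)
    return amplification * sum(child_errors, 0.0) + (U if child_errors else 0.0)

def extract(egraph):
    best = {cid: (math.inf, None) for cid in egraph}   # e-class -> (error, e-node)
    changed = True
    while changed:       # fixpoint pass, like cost extraction; fine for this acyclic toy
        changed = False
        for cid, nodes in egraph.items():
            for op, children in nodes:
                errs = [best[c][0] for c in children]
                if any(math.isinf(e) for e in errs):
                    continue
                err = node_error(op, errs)
                if err < best[cid][0]:
                    best[cid] = (err, (op, children))
                    changed = True
    return best

print(extract(egraph)[6])   # picks the "/" form, i.e. the non-cancelling rewrite
```

In reality the error estimate would presumably be per sampled point (or per region), with the amplification factors being the condition numbers from the previous sketch evaluated at those points.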
The entire Herbie process would be much simpler: start with the user input; build a huge egraph; extract the most accurate term. No phases, no alt table, no heuristics (besides the condition number). No phase ordering, no greediness. (We’d probably still sample points for testing, and we’d probably still have preprocessing and regimes.) It would also, I think, one day be more amenable to not using sampling at all, and to being somehow more deductive.
Question 1: Will it work?
It might work and it might not work. A lot hinges on whether we can actually determine which expressions are most accurate using condition numbers.
Bhargav in particular has done some deep-dives into condition numbers for predicting error, and various tweaks to ordinary condition numbers are needed to really get good results—things like tracking introduced error accurately, handling out-of-range values, and ignoring certain kinds of error. After making those changes we get pretty good predictive accuracy for error, something like 80% precision and 97% recall. I suspect recall will matter more for this task (we’re happy with any low-error expression).
If condition numbers alone won't work, Bhargav’s DSL number system might, especially if we added some handling for things like `sin(x + eps)` that Bhargav and I already have ideas for. We could port it to Rust (most of DSL’s implementation is already an FFI into a C library) and thus use it inside egg or egglog.
And keep in mind that Herbie’s existing heuristics have their own costs in terms of accuracy, mostly by being too greedy, so as long as we get the condition numbers working well enough, we might end up more accurate than before.
Question 2: Will we keep all of Herbie’s features?
Platforms are probably easy to keep; condition numbers already split “introduced error” (operator- and precision-dependent) from “amplified error” (which depends only on the real-number semantics), and it’s the introduced error that platforms affect. You can compute the introduced error by just evaluating the expression. Platforms would also affect the rules in the e-graph, as now.
Pareto is probably doable. I think we’d need to compute the cost-accuracy curve for each e-class in the e-graph during extraction, per point, and then I think they’ll compose nicely to get us a global Pareto curve. It’s possible that these will explode in size (Pareto curves can grow exponentially), but we don’t see this in Herbie today, and we could probably merge nearby Pareto points or something to keep the curves small.
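Here is a small sketch of what composing those curves might look like: cross-product the children's (cost, error) frontiers, apply the operator's cost and error amplification, and re-take the Pareto front. This is my guess at the shape of the computation, not an existing Herbie component; a real version would also merge nearby points to keep the curves small.

```python
# Sketch of composing per-e-class cost/accuracy Pareto curves during extraction.
# A guess at the shape of the computation, not existing Herbie code.
from itertools import product

def pareto_front(points):
    """Keep only (cost, error) points that no other point dominates."""
    front = []
    for cost, err in sorted(points):
        if not front or err < front[-1][1]:
            front.append((cost, err))
    return front

def combine(op_cost, op_error, child_curves):
    """Cross-product the children's frontiers, then re-take the Pareto front.
    op_error stands in for 'apply this operator's condition number'."""
    out = []
    for combo in product(*child_curves):
        cost = op_cost + sum(c for c, _ in combo)
        err = op_error([e for _, e in combo])
        out.append((cost, err))
    return pareto_front(out)

# Example: an operator with two children, each offering a cheap-but-inaccurate
# variant and an expensive-but-accurate one.
curve = combine(1.0,
                lambda errs: sum(errs) + 1e-16,
                [[(1.0, 1e-3), (5.0, 1e-16)],
                 [(1.0, 1e-2), (4.0, 1e-16)]])
print(curve)   # three points survive on the combined frontier
```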
Regimes might be doable? If you build a cost-accuracy curve per point per e-class, you’d sometimes find e-classes where different points use different best e-nodes. I’m not 100% sure how to do regimes on an e-graph, but it feels like the kind of combinatorial optimization problem that should be solvable.
Preprocessing is doable, since it happens before everything else and already relies on the egraph. It just manipulates the sampled points, not their ground-truth outputs.
So in short it seems like we might be able to keep all of the Herbie features we know and love. And the advantage, of course, would be a radically simpler design, based more on powerful primitives like egraphs, and possibly much faster as well.
Question 3: What could go wrong?
The simplest thing that could go wrong is that condition numbers might be "gameable", in the sense that the egraph consistently finds a "hack" that has low condition numbers but high actual error. If such hacks are easier to find than "actual" solutions, this approach might end up being useless. I don't have a fix for this, beyond the fact that our existing experience with condition numbers has been quite good.
It's also not clear to me how to integrate series expansion into this framework. Series expansion doesn't live that comfortably inside egraphs, and it is very important to the results Herbie gets today. One thought I've had is that maybe series expansion can be done at the end, instead of during the main improvement loop. The idea is that series expansion is only needed when rewrites fail, in which case it's independent of the egraph and isn't affected much by this design idea.
Finally, our existing rewrite system isn't sound, which right now really limits how much egraph we can actually use. A design that relies far more on the egraph will have to address this somehow, either by tracking much more information to make the rewrites sound (what Oliver did), by removing the unsound rewrites (which seems unlikely to work, since they are important rewrites), or by something else (maybe weird scheduling hacks?). Luckily, we can experiment with avoiding unsound rewrites without doing the rest of this proposal, so that's probably the best path forward.
Conclusion
This is a pretty radical idea, but if it worked it would be a real breakthrough, certainly a real “Herbie 2.0” paper focused on the core of Herbie.
Appendix: How did I come up with this?
Bhargav asked. Of course I’ve been talking about “deductive Herbie” for a long time, and explanations have been part of that. This has always been a vague idea: have Herbie enumerate specific floating-point problems and try to fix them specifically instead of using sampling and empirical error measures.
More recently, I was working on the `no-localize` branch and noticed that the greediness of extraction and alt tables was causing a lot of problems. Specifically, I was trying to figure out why `no-localize` mode was slightly less accurate than vanilla Herbie, and one of my conclusions was that a lot of Herbie’s misses were due to its greediness (which `no-localize` makes worse, since it produces more expressions and thus has more chances to be overly greedy). This made me interested in avoiding greediness.
Also, as part of `no-localize`, I was thinking about cost-aware localization. This is a bit of a mystery: when we added cost-aware localization it really helped our results, but later Brett wrote the cost-aware extraction engine, and at that point cost-aware localization stopped helping. (And results got better.) So I had the thought: cost-aware localization and cost-aware extraction do the same thing, except cost-aware localization uses the greedy alt-table iteration approach, while cost-aware extraction uses egraphs directly and gets a better result. What if we could do this for accuracy? How can we do accuracy-aware optimization in the egraph?