Does O3 beat a specialized numeric compiler?
Alright, I'm back at it: comparing Herbie against LLMs to see who is best at numerical rewriting. If you're new here, Herbie is a research compiler I've been working on for about a decade that helps compile mathematical expressions to fast and accurate floating-point code. More recently, I've been comparing new OpenAI models against Herbie to see who is the top numerical analyst. So far, Herbie's been winning all of the match-ups… but with O3-mini's release can the LLMs finally dethrone the champion?
The reason I keep doing this, by the way, is because I see a lot of hype about how well AIs do at mathematical tasks, and I'm quite bullish on AIs generally and want to make more use of them. But this task is out-of-distribution for them while still mostly relying on mathematical ability, so it's an interesting test of how real and general the recent advances are.
This time I'm testing OpenAI's `o1` model (not `o1-preview` like the last post) and its new `o3-mini-high` model, which is out today and is supposedly best at math. I'm using the same benchmarks as last time, and I'm using Herbie to evaluate accuracy. As usual, all outputs are available in FPCore format.
Benchmark | Input | Herbie | GPT-3.5 | o1-preview | o1 | o3-mini-high |
---|---|---|---|---|---|---|
2sqrt | 52% | 99% | 100% | 100% | 100% | 100% |
asinh | 17% | 99% | 52% | 76% | 30% | 100% |
cos2 | 50% | 99% | 39% | 100% | 74% | 100% |
expq2 | 36% | 99% | 36% | 100% | 38% | 100% |
csqrt | 41% | 89% | 7% | 19% | 35% | 55% |
jcobi | 49% | 96% | 18% | 8% | 34% | 48% |
eccentricity | 50% | 99% | 100% | 100% | 100% | 100% |
mixing | 44% | 96% | 3% | 3% | 75% | 28% |
maksimov2 | 76% | 96% | 75% | 77% | 66% | 77% |
acos1m | 7% | 10% | 100% | 100% | 100% | 100% |
rump | 71% | 93% | 100% | 77% | 76% | 86% |
In short, Herbie is still king but `o3-mini-high` continues improving.
Methodology Notes
I rounded input and Herbie accuracies down and LLM accuracies up, which led to a weird result for `2sqrt`: all of the tools gave the same answer, but the table shows Herbie at 99% and the LLMs at 100%.
One change is that I'm now providing FPCore input directly to the LLMs and expecting FPCore back. This mostly worked: the LLMs sometimes added extra parentheses, which I had to remove, and they used functions `sqr`, `max`, and `abs`, which I had to implement for them. The `o3-mini-high` model also struggled with putting parentheses in the right place for `let` statements, which required some manual fixing up. But it wasn't too bad, and I think this is basically a viable way to use the LLMs.
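If you wanted to automate that clean-up instead of doing it by hand, a small desugaring pass over the s-expressions would be enough. Here's a minimal sketch of what I mean (illustrative only, not the code I actually used, and it assumes `max` and `abs` were meant as FPCore's `fmax` and `fabs`):

```python
# Rewrite the non-standard operators the LLMs liked to emit (sqr, max, abs)
# into plain FPCore before evaluating the expression. Illustrative sketch only.

def tokenize(text):
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Turn a token list into nested Python lists (one s-expression)."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok
    expr = []
    while tokens[0] != ")":
        expr.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return expr

def desugar(expr):
    """Replace sqr/max/abs with operators FPCore tools understand."""
    if not isinstance(expr, list):
        return expr
    head, *args = [desugar(e) for e in expr]
    if head == "sqr" and len(args) == 1:
        return ["*", args[0], args[0]]  # safe: FPCore expressions are pure
    if head == "max":
        return ["fmax", *args]
    if head == "abs":
        return ["fabs", *args]
    return [head, *args]

def unparse(expr):
    if isinstance(expr, list):
        return "(" + " ".join(unparse(e) for e in expr) + ")"
    return expr

print(unparse(desugar(parse(tokenize("(+ (sqr x) (max (abs y) 1))")))))
# => (+ (* x x) (fmax (fabs y) 1))
```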
I also adjusted my prompt this time around, telling the LLMs to focus on rounding error, overflow, and underflow, clarifying that the input range was all double-precision values, asking them to focus on error over the whole range, and telling them to ignore special values like 0, infinity, and NaN. This helped, I think, compared to last time.
The `o1` model mostly took 10-30 seconds to solve problems, with the longest benchmark taking 40 seconds. The `o3-mini-high` model took about twice as long, with the longest benchmark taking 75 seconds. Herbie's longest runtime was 4 seconds.
Let me also note that some of these benchmarks (`2sqrt` and `csqrt` especially) are extremely well-known expressions that an LLM might get by recall instead of reasoning. The `acos1m` and `asinh` benchmarks might also involve recalling uncommon identities; that's why I think all LLMs beat Herbie on `acos1m`. The `jcobi`, `mixing`, and `maksimov2` expressions are "real-world" expressions where I don't expect recall to be useful.
Results
ChatGPT 3.5, and even `o1-preview`, would sometimes make programs worse than they started, basically by making math errors and simply changing the real-valued program. That rarely happens with `o1` and basically not at all with `o3`, so that's a big improvement. Also, `o3-mini-high` basically crushes the "easy" problems like `expq2`, at times getting better results than Herbie. For example, `expq2` is the program:
\[ \frac{e^x}{e^x - 1} \]
Herbie has the uninspired solution
\[ \frac{e^x}{\mathsf{expm1}(x)} \]
It's fine, but `o3-mini-high` has the much better
\[ \frac{-1}{\mathsf{expm1}(-x)} \]
It's faster (because you don't need the `exp` call) and also more accurate, for some reason I can't quite fathom (something about overflow?), but whatever the reason it's clearly the best solution. So well done, `o3-mini-high`![1]
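If overflow is indeed the culprit, it's easy to see where: once x passes roughly 709.8, both exp(x) and expm1(x) round to infinity in double precision, so the original expression and Herbie's version both end up as inf/inf, i.e. NaN, while expm1(-x) just settles at -1. A quick check of that hypothesis (using numpy, with a test point I picked myself, not one from the benchmark suite):

```python
import numpy as np

x = np.float64(710.0)  # just past the double-precision overflow threshold for exp

with np.errstate(over="ignore", invalid="ignore"):
    original = np.exp(x) / (np.exp(x) - 1)  # inf / inf -> nan
    herbie   = np.exp(x) / np.expm1(x)      # inf / inf -> nan
    o3       = -1.0 / np.expm1(-x)          # -1  / -1  -> 1.0

print(original, herbie, o3)  # nan nan 1.0
# The true value at x = 710 is 1/(1 - exp(-710)), which rounds to 1.0 in
# double precision, so only the o3-mini-high rewrite survives out here.
```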
Some stuff could also maybe be improved with minor work. For example, in `csqrt`, Herbie's weakest example, `o3-mini-high` got basically the right answer but fumbled by forgetting to use `hypot`; if it hadn't forgotten, it actually would have beaten Herbie and achieved a basically perfect result. But it's possible that prompting it with a list of supported functions would help.
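To see why `hypot` matters here: a complex square root needs the magnitude sqrt(re*re + im*im), and the squaring overflows (or underflows) long before the true magnitude is out of range, while `hypot` computes the same quantity with internal rescaling. A tiny illustration (my own inputs, not taken from the `csqrt` benchmark):

```python
import math

re, im = 3e200, 4e200  # true magnitude is about 5e200, well within double range

naive = math.sqrt(re*re + im*im)  # re*re overflows to inf, so the result is inf
safe  = math.hypot(re, im)        # rescales internally, giving roughly 5e200

print(naive, safe)  # inf 5e+200 (approximately)
```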
On the other hand, `o3-mini-high` really struggled on the three most "real-world" examples, `mixing`, `maksimov2`, and `jcobi`, all of which are drawn from real code bases or papers, not textbook examples. I think this is a case where feedback from actually evaluating candidate rewrites and seeing how they do would be very valuable, and the LLMs don't have that.
In short, I think `o3-mini-high` is close to being competitive with Herbie, at least on easier textbook examples, and it's possible that some kind of fine-tuned or reinforcement-learning setup would actually beat Herbie, at least when it comes to maximum accuracy. Of course, Herbie is still faster to run, and it does produce a Pareto curve of outputs, and so on, but LLMs are close to competitive on this task.
Footnotes:

[1] I'll add that in my earlier blog post there was something similar with `eccentricity`, but Herbie now achieves, I think, a similarly good result to `o3-mini-high`.