OpenAI o1 vs. Herbie
A year and a half ago, I wrote a blog post comparing Herbie to the first ChatGPT (which we now call, I think, GPT 3.5). Herbie is a tool my students and I develop to do exactly this kind of floating-point repair, and I wanted to know if AI tools had obsoleted it, so I chose 11 floating-point repair benchmarks and fed all of them to both Herbie and ChatGPT. The conclusion was that Herbie was still much better, winning 6 of 11 and tying two others. Moreover, in the cases where Herbie lost, ChatGPT's response was usually not actually much better, or was better only stylistically.
Well, OpenAI just released their o1-preview model, better known as Strawberry, so let's see if this new model can unseat the reigning champ. I used the same prompt and the same set of benchmarks, and without further ado here are the results:
| Benchmark | Input % | Herbie % | Herbie (s) | 3.5 % | o1 % | o1 (s) | Best |
|---|---|---|---|---|---|---|---|
| 2sqrt | 52.7% | 99.7% | 3.5 | 99.7% | 99.7% | 3 | All tie |
| asinh | 18.1% | 99.4% | 3.0 | 26.4% | 75.7% | 84 | Herbie |
| cos2 | 50.3% | 99.6% | 4.5 | 37.9% | 99.7% | 9 | o1 |
| expq2 | 36.4% | 100.0% | 3.1 | 37.1% | 100.0% | 66 | Tie |
| csqrt | 40.4% | 90.2% | 4.3 | 7.3% | 18.6% | 32 | Herbie |
| jcobi | 48.5% | 96.4% | 4.9 | 17.8% | 8.1% | 77 | Herbie |
| eccentricity | 49.7% | 100.0% | 1.8 | 100.0% | 100.0% | 34 | All tie |
| mixing | 44.3% | 97.2% | 7.2 | 3.3% | 3.5% | 16 | Herbie |
| maksimov2 | 76.9% | 96.8% | 4.9 | 76.9% | 76.9% | 73 | D/Q |
| acos1m | 7.0% | 10.5% | 5.5 | 100.0% | 100.0% | 24 | AIs win |
| rump | 70.5% | 95.8% | 4.2 | 74.5% | 76.5% | 83 | Herbie |
Note that the numbers above differ from the prior blog post because we've since released two new Herbie versions. The "%" columns give Herbie's accuracy estimate (higher is better) and the "(s)" columns give runtime in seconds (lower is better). In a few cases, Strawberry gave several options, and I chose the best one. The "3.5 %" column uses the ChatGPT 3.5 responses from a year and a half ago, though I've re-estimated the accuracy just to put things on an even footing. There's no runtime for 3.5, but c'mon, you know what using ChatGPT is like: it's pretty fast.
There's more I can say on methodology but this isn't a published paper so I won't bother. Instead, here's the chat transcript if you'd like to see the results yourself:
https://chatgpt.com/share/66e4bcbb-a878-8011-a29c-5dd0e07649b3
The Herbie results use checkout 697848d3d on seed 929805499.
What Strawberry got wrong
Strawberry is clearly smarter than ChatGPT 3.5, and a closer match for Herbie, but Herbie still wins by quite a lot, with six wins, two losses, and three ties. On a more human level, though, o1's explanations usually made a lot of sense and I felt like I learned something reading them, which wasn't the case with ChatGPT 3.5.
In the `asinh` example, Strawberry correctly detected the overflow and the cancellation, but didn't detect the cancellation at negative `x`.
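To see what's at stake, here's a minimal sketch of the failure mode, assuming the benchmark is the textbook formula `log(x + sqrt(x*x + 1))` (my reconstruction, not the exact benchmark input):

```c
#include <stdio.h>
#include <math.h>

/* Textbook formula for asinh: log(x + sqrt(x*x + 1)).
   For large negative x, sqrt(x*x + 1) is nearly -x, so the addition
   cancels catastrophically; by x = -1e8 the sum rounds all the way
   to 0 and log returns -inf. x*x also overflows for |x| above
   roughly 1.3e154. */
double asinh_naive(double x) {
    return log(x + sqrt(x * x + 1.0));
}

int main(void) {
    double x = -1e7;
    printf("naive:   %.17g\n", asinh_naive(x)); /* wrong by the 4th digit */
    printf("library: %.17g\n", asinh(x));       /* accurate reference */
    return 0;
}
```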
In `csqrt` it incorrectly simplified `0.5 * sqrt(2x)` to `sqrt(x)`. It also didn't propose separate cases for positive and negative `re`, though it did get the positive case right.
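The mistake is easy to check: 0.5 · sqrt(2x) = sqrt(2x/4) = sqrt(x/2), not sqrt(x). A quick numeric sanity check (mine, not from either tool):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 2.0;
    printf("0.5*sqrt(2x) = %g\n", 0.5 * sqrt(2.0 * x)); /* 1 */
    printf("sqrt(x)      = %g\n", sqrt(x));             /* 1.41421..., not equal */
    printf("sqrt(x/2)    = %g\n", sqrt(x / 2.0));       /* 1, the correct simplification */
    return 0;
}
```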
In `jcobi` it rewrote `b^2 - a^2` as `(b - a)^2 / 2`, which just isn't right.
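A counterexample makes it obvious; for comparison I've included the standard cancellation-avoiding difference-of-squares factoring `(b - a) * (b + a)` (my suggestion here, not necessarily what Herbie produced):

```c
#include <stdio.h>

int main(void) {
    double a = 1.0, b = 3.0;
    printf("b^2 - a^2       = %g\n", b * b - a * a);           /* 8 */
    printf("(b - a)^2 / 2   = %g\n", (b - a) * (b - a) / 2.0); /* 2: not the same thing */
    printf("(b - a)*(b + a) = %g\n", (b - a) * (b + a));       /* 8: the safe factoring */
    return 0;
}
```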
In `eccentricity` it added a test where you really don't need one.
In `maksimov2`, Herbie just deleted some terms, which for some reason reduced error. I'm going to call this a tie even though Herbie maxed out its own error metric, because I think something odd happened.
In `acos1m`, both ChatGPT and Strawberry knew a trig identity that Herbie didn't, so they aced the example.
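I'm inferring the input from the benchmark name, but assuming it's `acos(1 - x)`, the identity in play is presumably something like the half-angle rewrite acos(1 − x) = 2 · asin(sqrt(x / 2)), which sidesteps the ill-conditioning near x = 0:

```c
#include <stdio.h>
#include <math.h>

/* Assuming the benchmark is acos(1 - x), which the name suggests.
   Near x = 0 the naive form is ill-conditioned, while the half-angle
   identity acos(1 - x) = 2*asin(sqrt(x/2)) stays fully accurate. */
int main(void) {
    double x = 1e-12;
    printf("naive:    %.17g\n", acos(1.0 - x));
    printf("identity: %.17g\n", 2.0 * asin(sqrt(x / 2.0)));
    return 0;
}
```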
In `rump`, Herbie performed a pretty clever rewrite into FMAs that basically led to no error. Meanwhile, Strawberry attempted to rescale all the inputs to avoid overflow, which does help but doesn't totally avoid the issue; moreover, it missed some rearrangements that help.
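For flavor, here's the kind of FMA trick that wins in cases like this: Kahan's algorithm for a·b − c·d, which uses `fma` to recover the exact rounding error of one product. This is a standard technique shown as a sketch, not Herbie's actual output on `rump`:

```c
#include <stdio.h>
#include <math.h>

/* Kahan's algorithm for a*b - c*d. fma(-c, d, cd) computes the exact
   rounding error of the product c*d, so the final sum carries only
   about one rounding error total instead of cancelling. */
double diff_of_products(double a, double b, double c, double d) {
    double cd  = c * d;
    double err = fma(-c, d, cd);   /* exact error: cd - c*d */
    double dop = fma(a, b, -cd);   /* a*b - cd, one rounding */
    return dop + err;
}

int main(void) {
    /* Naive evaluation cancels to exactly 0 here; the FMA version
       recovers the true answer, -2^-54. */
    double a = 1.0 + 0x1p-26, b = 1.0;
    double c = 1.0 + 0x1p-27, d = 1.0 + 0x1p-27;
    printf("naive: %g\n", a * b - c * d);
    printf("fma:   %g\n", diff_of_products(a, b, c, d));
    return 0;
}
```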
My summary
A few other things I noticed as I was doing this:
- Strawberry is kind of error-prone and sometimes goes down. I assume this is temporary and will go away in a week or so. It also sometimes gets stuck. I don't know if that's an operations problem or something core to the model.
- In the `asinh` example, o1 recognized that the input is the formula for `asinh` and suggested just calling that. I didn't count it, but if I had, o1 would win on that expression.
- It generates really funny "chains of thought" which are often kinda nonsense. But taken seriously, it spends a lot of time thinking about trigonometry, even in cases where that has nothing to do with anything.
- The final explanations are pretty thorough and, when it's right, extremely impressive, with useful comments like where extra precision would help and so on. These extra comments aren't graded above but might be useful to a user. On the other hand, when it's wrong, the output is equally thorough and impressive (but wrong), which maybe isn't a big win.
- Strawberry is stronger at algebra, a really big change from 3.5, which was kind of garbage at it. But it still makes mistakes, and those mistakes really throw it off. Herbie isn't, like, amazing at algebra, but it's better than this.
- It's quite a bit slower than Herbie, though not so slow as to be unusable. It would be interesting to try `o1-mini`, which is presumably faster and worse, more like 3.5.
- In most of the cases that Herbie won, a dedicated user could still probably get some value from o1's response, or maybe fix it, like in `csqrt` or `jcobi`.
In short: Strawberry is a closer competitor to Herbie and would probably be good at generating explanations of what Herbie did and why, though automation, speed, and cost might be concerns. If you imagine Strawberry getting better over the next few years, it's possible that it will fix its algebraic challenges, at which point you've gotta assume that Herbie won't keep up.
Well done, OpenAI, hope to try this again in a few years and finally have Herbie lose!