OpenAI o1 vs. Herbie
A year and a half ago, I wrote a blog post comparing Herbie to the first ChatGPT (which we now call, I think, GPT 3.5). Herbie is a tool my students and I develop to do exactly this kind of floating-point repair, and I wanted to know if AI tools had obsoleted it, so I chose 11 floating-point repair benchmarks and fed all of them to both Herbie and ChatGPT. The conclusion was that Herbie was still much better, winning 6 of 11 and tying two others. Moreover, in the cases where Herbie lost, ChatGPT's response was usually not actually much better, or was better only stylistically.
Well, OpenAI just released their o1-preview model, better known as Strawberry, so let's see if this new model can unseat the reigning champ. I used the same prompt and the same set of benchmarks, and without further ado here are the results:
| Benchmark | Input % | Herbie % | Herbie (s) | 3.5 % | o1 % | o1 (s) | Best |
|---|---|---|---|---|---|---|---|
| 2sqrt | 52.7% | 99.7% | 3.5 | 99.7% | 99.7% | 3 | All tie |
| asinh | 18.1% | 99.4% | 3.0 | 26.4% | 75.7% | 84 | Herbie |
| cos2 | 50.3% | 99.6% | 4.5 | 37.9% | 99.7% | 9 | o1 |
| expq2 | 36.4% | 100.0% | 3.1 | 37.1% | 100.0% | 66 | Tie |
| csqrt | 40.4% | 90.2% | 4.3 | 7.3% | 18.6% | 32 | Herbie |
| jcobi | 48.5% | 96.4% | 4.9 | 17.8% | 8.1% | 77 | Herbie |
| eccentricity | 49.7% | 100.0% | 1.8 | 100.0% | 100.0% | 34 | All tie |
| mixing | 44.3% | 97.2% | 7.2 | 3.3% | 3.5% | 16 | Herbie |
| maksimov2 | 76.9% | 96.8% | 4.9 | 76.9% | 76.9% | 73 | D/Q |
| acos1m | 7.0% | 10.5% | 5.5 | 100.0% | 100.0% | 24 | AIs win |
| rump | 70.5% | 95.8% | 4.2 | 74.5% | 76.5% | 83 | Herbie |
Note that the numbers above differ from the prior blog post because we've since released two new Herbie versions. The "%" columns give Herbie's accuracy estimate (higher is better) and the "(s)" columns give runtime in seconds (lower is better). In a few cases, Strawberry gave several options, and I chose the best one. The "3.5 %" column uses the ChatGPT 3.5 responses from a year and a half ago, though I've re-estimated the accuracy just to put things on an even footing. There's no runtime for 3.5, but c'mon, you know what using ChatGPT is like: it's pretty fast.
There's more I can say on methodology but this isn't a published paper so I won't bother. Instead, here's the chat transcript if you'd like to see the results yourself:
https://chatgpt.com/share/66e4bcbb-a878-8011-a29c-5dd0e07649b3
The Herbie results use checkout 697848d3d on seed 929805499.
What Strawberry got wrong
Strawberry is clearly smarter than ChatGPT 3.5, and a closer match for Herbie, but Herbie still wins by quite a lot, with six wins, two losses, and three ties. On a more human level, though, o1's explanations usually made a lot of sense and I felt like I learned something reading them, which wasn't the case with ChatGPT 3.5.
In the `asinh` example, Strawberry correctly detected the overflow and the cancellation, but didn't detect the cancellation at negative `x`.
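To see what's at stake, here's a minimal sketch of the failure mode, assuming the benchmark is the textbook formula `log(x + sqrt(x*x + 1))` (my reconstruction, not the exact benchmark input):

```c
#include <stdio.h>
#include <math.h>

/* Textbook formula for asinh: log(x + sqrt(x*x + 1)).
   For large negative x, sqrt(x*x + 1) is nearly -x, so the addition
   cancels catastrophically; by x = -1e8 the sum rounds all the way
   to 0 and log returns -inf. x*x also overflows for |x| above
   roughly 1.3e154. */
double asinh_naive(double x) {
    return log(x + sqrt(x * x + 1.0));
}

int main(void) {
    double x = -1e7;
    printf("naive:   %.17g\n", asinh_naive(x)); /* wrong by the 4th digit */
    printf("library: %.17g\n", asinh(x));       /* accurate reference */
    return 0;
}
```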
In `csqrt` it incorrectly simplified `0.5 * sqrt(2x)` to `sqrt(x)`. It also didn't propose separate cases for positive and negative `re`, though it did get the positive case right.
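The mistake is easy to check: 0.5 · sqrt(2x) = sqrt(2x/4) = sqrt(x/2), not sqrt(x). A quick numeric sanity check (mine, not from either tool):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 2.0;
    printf("0.5*sqrt(2x) = %g\n", 0.5 * sqrt(2.0 * x)); /* 1 */
    printf("sqrt(x)      = %g\n", sqrt(x));             /* 1.41421..., not equal */
    printf("sqrt(x/2)    = %g\n", sqrt(x / 2.0));       /* 1, the correct simplification */
    return 0;
}
```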
In `jcobi` it rewrote `b^2 - a^2` as `(b - a)^2 / 2`, which just isn't right.
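A counterexample makes it obvious; for comparison I've included the standard cancellation-avoiding difference-of-squares factoring `(b - a) * (b + a)` (my suggestion here, not necessarily what Herbie produced):

```c
#include <stdio.h>

int main(void) {
    double a = 1.0, b = 3.0;
    printf("b^2 - a^2       = %g\n", b * b - a * a);           /* 8 */
    printf("(b - a)^2 / 2   = %g\n", (b - a) * (b - a) / 2.0); /* 2: not the same thing */
    printf("(b - a)*(b + a) = %g\n", (b - a) * (b + a));       /* 8: the safe factoring */
    return 0;
}
```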
In `eccentricity` it added a test where you really don't need one.
In `maksimov2`, Herbie just deleted some terms, which for some reason reduced error. I'm going to call this a tie even though Herbie maxed out its own error metric, because I think something odd happened.
In `acos1m`, both ChatGPT and Strawberry knew a trig identity that Herbie didn't, so they aced the example.
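I'm inferring the input from the benchmark name, but assuming it's `acos(1 - x)`, the identity in play is presumably something like the half-angle rewrite acos(1 − x) = 2 · asin(sqrt(x / 2)), which sidesteps the ill-conditioning near x = 0:

```c
#include <stdio.h>
#include <math.h>

/* Assuming the benchmark is acos(1 - x), which the name suggests.
   Near x = 0 the naive form is ill-conditioned, while the half-angle
   identity acos(1 - x) = 2*asin(sqrt(x/2)) stays fully accurate. */
int main(void) {
    double x = 1e-12;
    printf("naive:    %.17g\n", acos(1.0 - x));
    printf("identity: %.17g\n", 2.0 * asin(sqrt(x / 2.0)));
    return 0;
}
```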
In `rump`, Herbie performed a pretty clever rewrite into FMAs that basically led to no error. Meanwhile, Strawberry attempted to rescale all the inputs to avoid overflow, which does help but doesn't totally avoid the issue; moreover, it missed some rearrangements that help.
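For flavor, here's the kind of FMA trick that wins in cases like this: Kahan's algorithm for a·b − c·d, which uses `fma` to recover the exact rounding error of one product. This is a standard technique shown as a sketch, not Herbie's actual output on `rump`:

```c
#include <stdio.h>
#include <math.h>

/* Kahan's algorithm for a*b - c*d. fma(-c, d, cd) computes the exact
   rounding error of the product c*d, so the final sum carries only
   about one rounding error total instead of cancelling. */
double diff_of_products(double a, double b, double c, double d) {
    double cd  = c * d;
    double err = fma(-c, d, cd);   /* exact error: cd - c*d */
    double dop = fma(a, b, -cd);   /* a*b - cd, one rounding */
    return dop + err;
}

int main(void) {
    /* Naive evaluation cancels to exactly 0 here; the FMA version
       recovers the true answer, -2^-54. */
    double a = 1.0 + 0x1p-26, b = 1.0;
    double c = 1.0 + 0x1p-27, d = 1.0 + 0x1p-27;
    printf("naive: %g\n", a * b - c * d);
    printf("fma:   %g\n", diff_of_products(a, b, c, d));
    return 0;
}
```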
My summary
A few other things I noticed as I was doing this:
- Strawberry is kind of error-prone and sometimes goes down. I assume this is temporary and will go away in a week or so. It also sometimes gets stuck. I don't know if that's an operations problem or something core to the model.
- In the `asinh` example, o1 recognized that the input is the formula for `asinh` and suggested just calling that. I didn't count it, but if I had, o1 would win on that expression.
- It generates really funny "chains of thought" which are often kinda nonsense. But taken seriously, it spends a lot of time thinking about trigonometry, even in cases where that has nothing to do with anything.
- The final explanations are pretty thorough and, when it's right, extremely impressive, with useful comments like where extra precision would help and so on. These extra comments aren't graded above but might be useful to a user. On the other hand, when it's wrong, the output is equally thorough and impressive (but wrong), which maybe isn't a big win.
- Strawberry is stronger at algebra, a really big change from 3.5, which was kind of garbage at it. But it still makes mistakes, and those mistakes really throw it off. Herbie isn't, like, amazing at algebra, but it's better than this.
- It's quite a bit slower than Herbie, though not so slow as to be unusable. It would be interesting to try `o1-mini`, which is presumably faster and worse, more like 3.5.
- In most of the cases that Herbie won, a dedicated user could still probably get some value from o1's response, or maybe fix it, like in `csqrt` or `jcobi`.
In short: Strawberry is a closer competitor to Herbie and would probably be good at generating explanations of what Herbie did and why, though automation, speed, and cost might be concerns. If you imagine Strawberry getting better over the next few years, it's possible that it will fix its algebraic challenges, at which point you've gotta assume that Herbie won't keep up.
Well done, OpenAI, hope to try this again in a few years and finally have Herbie lose!