ChatGPT vs Herbie

In research it's common to compare your solution to some simple approach as a baseline, to prove that whatever complex algorithm you've developed is actually worth using. Sometimes, coming up with this baseline is hard. But now we have ChatGPT, so let's plug a hole in my 2015 research paper on floating-point error improvement by comparing my tool, Herbie, with ChatGPT.

Methodology

I've chosen a representative sample of 12 benchmarks from the Herbie benchmark suite. I've biased it a bit toward classic numerical problems of the sort I expected ChatGPT to do well on, but there's a mix of problems from applications as well. I've heard that it's helpful to prompt the model to play the role of an expert, so I used the following prompt before giving it any problems:

Hello! For the next series of questions, I want you to play the role of expert numerical analyst and Turing Award winner William Kahan. I will provide you with a series of questions about how to accurately evaluate various floating-point expressions, and you should identify rearrangements of the mathematical expression that are more accurate. Each of the expressions use 64-bit IEEE floating point (which you, as William Kahan, designed) and no higher precisions is available.

ChatGPT cheerily agreed to the challenge:

Hello, I will do my best to play the role of expert numerical analyst and Turing Award winner William Kahan. Fire away with your questions about accurately evaluating various floating-point expressions!

I also tried to give ChatGPT encouragement after each response, since I've also heard that helps. I'm not a heavy ChatGPT user, so perhaps someone with more experience can retry this and get better results. To that end, you can find my benchmarks and ChatGPT's results in this FPCore. (I used Herbie's :herbie-target field to store ChatGPT's responses.) I used Herbie checkout 0809d07c with seed 1734654412. Herbie ran for about 2 minutes total on these benchmarks, which is about how long ChatGPT ran, so that aspect of the tools is comparable.

I then used Herbie's C exporter to generate the queries to ChatGPT, giving it one at a time, and then converted its results back into FPCore formulas so I could use Herbie to evaluate the quality of ChatGPT's results. I compared it to a git checkout of Herbie using its standard error metric (percentage error) with standard parameters. I also did a second Herbie run where I turned off Herbie's support for special numeric functions like hypot and expm1, which ChatGPT never used; the effect on the results was minuscule.

To some extent, my success metric is biased toward Herbie, since it's the metric Herbie uses. But first of all, this is a weekend project while the baby sleeps. And second of all, I do think this is a good metric for non-specialist use of floating-point error improvement tools!
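
For readers who haven't met this kind of metric, here is a minimal Python sketch of one way to score a computed double against a reference value: count how many doubles lie between the two and take a base-2 logarithm to get "bits of error". Reading the percentages in this post as bits of error out of 64 is an assumption, and the function names are hypothetical; this is not Herbie's actual code.

```python
import math
import struct

def float_to_ordinal(x: float) -> int:
    """Map a double to an integer so that adjacent doubles map to adjacent integers."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Doubles with the sign bit set unpack as negative integers;
    # mirror them below zero so the ordering matches the real line.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(a: float, b: float) -> int:
    """Number of doubles between a and b (NaN is not handled in this sketch)."""
    return abs(float_to_ordinal(a) - float_to_ordinal(b))

def bits_of_error(computed: float, reference: float) -> float:
    """0 when the values match, roughly 64 when they are as far apart as possible."""
    return math.log2(1 + ulp_distance(computed, reference))

def percent_error(computed: float, reference: float) -> float:
    # Assumption: error percentage = bits of error / 64.
    return 100.0 * bits_of_error(computed, reference) / 64.0

print(percent_error(1.0, 1.0 + 2.0 ** -52))  # the two nearest doubles at 1.0
```

A full evaluation would average this score over many sampled inputs, with the reference computed in higher precision.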

Results

Here's a quick table of results, comparing the initial input, Herbie, and ChatGPT. Note that lower scores are better (they mean lower error).

Benchmark	Input	Herbie	ChatGPT
rump	81%	0%	0%
mixing	56%	1.6%	97%
maksimov2	25%	2%	25%
jcobi	52%	3%	82%
expq2	97%	0%	97%
expq2*	64%	1%	64%
eccentricity	49%	0%	0%
csqrt	60%	9%	93%
cos2	49%	0%	61%
asinh	71%	0%	48%
acos1m	93%	89%	0%
2sqrt	47%	0%	0%

Uhh, ok, that's actually a lot worse than I was expecting: not only did Herbie beat ChatGPT, but ChatGPT was frequently worse than the original expression. Note that in expq2*, I gave ChatGPT the same problem as in expq2, but prompted it to focus on inputs of interest. That didn't help much.

Manual review

But perhaps the metric chosen is a bit biased. ChatGPT is actually very polite and explains its reasoning (including showing most steps of the derivation), so I read through all of its explanations and tried to grade its results:

Benchmark	Herbie	ChatGPT
rump	A	A
mixing	B	F
maksimov2	F	D
jcobi	B	F
expq2	B	D
eccentricity	A	A+
csqrt	C	D
cos2	A	C
asinh	A-	B
acos1m	F	A
2sqrt	A	A

I tried to be as nice as possible to ChatGPT, including trying to give it partial credit for good ideas with bad execution. I biased toward harshness for Herbie—to be fair, this wasn't hard, because ChatGPT gives very good explanations of its thinking, while Herbie's explanations are not very good. In particular, Herbie gives you the option of either way-too-detailed or way-too-high-level explanations, while ChatGPT's are always at an intuitive-for-me level.

Nonetheless, Herbie performs like a B student, with some unfortunate failures, while ChatGPT performs like a student I'm trying to give a pity pass to, performing well only on the simple examples and the acos1m case. Here's a detailed description of each tool's performance on each benchmark:

`rump`

Both tools got the perfectly correct answer.

`mixing`

Herbie applied some unsound simplifications that were effective at handling the case of g very large. That said, there were probably some sound simplifications it could have applied, and some of Herbie's unsound simplifications didn't help at all. B

ChatGPT applied a difference of cubes formula, which doesn't help at all. It explains that this should be useful when a is large and h is close to zero, but this is a bit bizarre, since a is just a multiplicative factor on the whole expression, while error is low when h is close to 0. F

`maksimov2`

Herbie just totally drops one of the terms in the expression by setting K to zero, with no other changes. I do not understand why this produces a benefit, actually. (It's because cos of large values is very sensitive to rounding error—but it's hard to say deleting half the input is a benefit.) F

ChatGPT factors out the n - m term. It says this avoids issues when K, m, and n are all large numbers. It doesn't, but at least there really is some error when K is large. D

`jcobi`

This expression is quite complex. Herbie uses fma to remove some rounding error on most of the range, and applies a Taylor series to the case of very large negative outputs. B

ChatGPT makes an algebra mistake and produces total nonsense. It also claims that this will help when alpha and beta are close to 2i; this isn't quite right, though there is potential rounding error when alpha plus beta is close to -2i. F

`expq2`

Herbie uses expm1 or, when that's banned, applies a Taylor series, which is accurate for inputs near 0 but less accurate for inputs far from 0. B

ChatGPT does some algebraic simplification, but nothing that helps with accuracy (it claims it should help for large x, and I guess it does avoid overflow). When prompted to focus on small x, it suggests a Taylor series, but it takes the Taylor series incorrectly. If it had done the Taylor series correctly, the answer would have been alright. D
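
To make the expm1-versus-Taylor trade-off concrete, here is a small Python sketch on the building block exp(x) - 1. The post doesn't show the expq2 expression itself, so this subexpression and the function names are assumptions chosen for illustration.

```python
import math

def expm1_naive(x: float) -> float:
    # exp(x) is rounded to a double near 1.0 first, so the
    # subtraction cancels most of the significant digits for small x.
    return math.exp(x) - 1.0

def expm1_taylor(x: float) -> float:
    # Three-term Taylor series around 0: fine near 0, poor far from 0.
    return x + x * x / 2.0 + x * x * x / 6.0

def rel_err(value: float, reference: float) -> float:
    return abs(value - reference) / abs(reference)

for x in (1e-10, 5.0):
    ref = math.expm1(x)  # library routine designed for this case
    print(x, rel_err(expm1_naive(x), ref), rel_err(expm1_taylor(x), ref))
```

Near zero the naive form is the inaccurate one. Far from zero it is the truncated series that falls apart, which is why a series-based answer needs a branch on the input range.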

`eccentricity`

Herbie and ChatGPT use the same algebraic rearrangement, which is the right one. That said, Herbie's rearrangement is presented in a slightly uglier way, so let's say A for Herbie and A+ for ChatGPT.

`csqrt`

Herbie uses hypot for the main branch and a Taylor expansion when there is cancellation. It misses the important difference-of-squares trick, but at least it has some solution, hacky though it is. C

ChatGPT gets its algebra wrong, and therefore misses most of the solution, but at a high level it does attempt to apply difference of squares, which keeps me from giving it a failing grade. D
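
For reference, the difference-of-squares trick can be sketched in Python as follows. The post doesn't show the csqrt expression, so the cancellation-prone pattern sqrt(re^2 + im^2) - re and the function names below are assumptions used only to illustrate the idea.

```python
import math

def modulus_minus_re_naive(re: float, im: float) -> float:
    # Cancels badly when re is positive and much larger than |im|.
    return math.sqrt(re * re + im * im) - re

def modulus_minus_re_stable(re: float, im: float) -> float:
    if re > 0.0:
        # Difference of squares: (h - re) * (h + re) == im**2,
        # so h - re == im**2 / (h + re), which has no subtraction.
        h = math.hypot(re, im)
        return im * (im / (h + re))
    # For re <= 0 the subtraction adds magnitudes, so it is already safe.
    return math.hypot(re, im) - re

print(modulus_minus_re_naive(1e8, 1.0))   # all digits lost
print(modulus_minus_re_stable(1e8, 1.0))  # close to 1 / (2 * 1e8)
```

Using hypot also avoids overflow in re * re and im * im, which is a separate issue from the cancellation.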

`cos2`

Herbie uses a pretty clever trigonometric identity and then messes with the formula a bit to further eliminate the potential for overflow. The result is extremely professionally done, though the final formula isn't as pretty as I'd like. A

ChatGPT applies two clever trigonometric identities, but does so incorrectly. If it were applied correctly, it'd be a good solution, and its explanation of the problem is good as well, so I'll give it a C.
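
As an illustration of the kind of identity involved, here is a Python sketch using the half-angle formula 1 - cos(x) = 2 sin^2(x/2). It assumes the benchmark has the shape (1 - cos x) / x^2, which the post doesn't state, and it is not necessarily the identity either tool used.

```python
import math

def cos2_naive(x: float) -> float:
    # For small x, cos(x) rounds to (nearly) 1.0 and the subtraction
    # loses every significant digit.
    return (1.0 - math.cos(x)) / (x * x)

def cos2_half_angle(x: float) -> float:
    # 1 - cos(x) == 2 * sin(x/2)**2, so no subtraction is needed.
    # Dividing before squaring avoids forming x * x at all.
    # Like the naive form, this is undefined at x == 0.
    ratio = math.sin(x / 2.0) / x
    return 2.0 * ratio * ratio

print(cos2_naive(1e-8))       # far from the true value of about 0.5
print(cos2_half_angle(1e-8))  # about 0.5
```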

`asinh`

Herbie proposes using hypot and a Taylor expansion to resolve the problem. I give it an A-, because there's a better version that doesn't rely on Taylor series.

ChatGPT correctly executes a pretty complicated rewrite that factors out a log(x) term. This rewrite appropriately fixes a problem caused by overflow. I was pretty impressed by this! That said, its explanation claims it fixes a cancellation issue (not true), and in fact it doesn't fix the actual cancellation issue at play. Let's say B.
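
The overflow and cancellation issues are easy to confuse, so here is a Python sketch that separates them. It assumes the textbook formula asinh(x) = log(x + sqrt(x^2 + 1)), which the post doesn't spell out, and it reproduces neither tool's output.

```python
import math

def asinh_naive(x: float) -> float:
    # Overflow: x * x becomes inf for huge |x|.
    # Cancellation: x + sqrt(...) collapses toward 0 for large negative x.
    return math.log(x + math.sqrt(x * x + 1.0))

def asinh_sketch(x: float) -> float:
    a = abs(x)
    # hypot(1, a) computes sqrt(1 + a*a) without forming a*a,
    # which handles the overflow problem.
    # Working on |x| and restoring the sign (asinh is odd) keeps the
    # log argument >= 1, which handles the cancellation problem.
    return math.copysign(math.log(a + math.hypot(1.0, a)), x)

print(asinh_naive(1e200), asinh_sketch(1e200), math.asinh(1e200))
print(asinh_sketch(-1e10), math.asinh(-1e10))
```

This sketch still loses relative accuracy for very small x, where the log argument is close to 1; that is the range a Taylor expansion or a log1p-based formula would cover.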

`acos1m`

Herbie basically has no clue here, and ultimately produces an unwieldy ball of nonsense that doesn't improve the error much. Note that I picked this one specifically because it's something Herbie is bad at! F

ChatGPT correctly finds a very complex rewriting that also has basically no error. It's very impressive; I'd love it if Herbie could do things like this. A
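
To show why this benchmark is hard for a term-rewriting approach, here is a Python sketch of one accurate form. It assumes, from the benchmark's name only, that the expression is acos(1 - x), and it uses the half-angle identity acos(1 - x) = 2 asin(sqrt(x / 2)). This is not the rewrite ChatGPT produced, which this post doesn't show.

```python
import math

def acos1m_naive(x: float) -> float:
    # For tiny x, 1 - x rounds to exactly 1.0 and the result is 0.
    return math.acos(1.0 - x)

def acos1m_half_angle(x: float) -> float:
    # cos(2t) == 1 - 2*sin(t)**2, so with sin(t) == sqrt(x/2)
    # we get acos(1 - x) == 2*t. Valid for 0 <= x <= 2.
    return 2.0 * math.asin(math.sqrt(x / 2.0))

print(acos1m_naive(1e-20))       # 0.0: the input was rounded away
print(acos1m_half_angle(1e-20))  # close to sqrt(2e-20)
```

The accurate form never computes 1 - x, so the information in a tiny x survives until the square root.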

`2sqrt`

Herbie and ChatGPT basically do the same thing, with the same results. Well done both. A
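
For readers who haven't seen this classic, here is a Python sketch of the standard fix. It assumes the benchmark is sqrt(x + 1) - sqrt(x), which the post doesn't state, and the function names are hypothetical.

```python
import math

def two_sqrt_naive(x: float) -> float:
    # For large x the two square roots agree in almost every digit,
    # so the subtraction cancels.
    return math.sqrt(x + 1.0) - math.sqrt(x)

def two_sqrt_conjugate(x: float) -> float:
    # Multiply and divide by the conjugate:
    # (sqrt(x+1) - sqrt(x)) * (sqrt(x+1) + sqrt(x)) == 1
    return 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))

print(two_sqrt_naive(1e16))      # 0.0, since 1e16 + 1 rounds to 1e16
print(two_sqrt_conjugate(1e16))  # about 5e-9
```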

Conclusions

I don't work in ML and won't venture a guess as to what ChatGPT is doing internally. That said, it sometimes performs algebra respectably well, and handles smaller and more "classic" benchmarks at a passable level. On larger benchmarks, however, it frequently forgets part of the input, messes up the algebra or otherwise totally butchers the answer. I wouldn't use it—sanity-checking its algebra is a lot of work, but even if you fixed that up, the high-level ideas typically aren't that good either.

That said, it's cool that it beats Herbie in a few cases. It produces more human-readable results in eccentricity and it can solve acos1m, whereas Herbie can't. That's amazing! And it's possible that if it were better at algebra (maybe "plugins" will help?) it could at least be better than nothing, and there might be value in sometimes asking it instead of Herbie.

By Pavel Panchekha

02 April 2023