Pavel Panchekha


Share under CC-BY-SA.

Floating Point Error in the Small

A conversation with Jeffrey Sarnoff on the Herbie mailing list raised a small but very interesting example of how floating-point error can occur. The example is extreme, but I thought it would be worth working through.

The program is this: sqrt(x^2).

Non-compositional error

Run it through Herbie and Herbie will tell you that this expression is fairly inaccurate: approximately 29.1 bits of error, with the problems occurring for very large and very small inputs.1 [1 Herbie suggests the more accurate fabs(x) instead.] Where does the error come from, and how does Herbie get 29.1 bits out of it?

This result is surprising, because both square root and squaring are accurate operations. Or, to be more precise, they are as-exact-as-possible operations: Some floating-point numbers (like 3) do not have floating-point square roots, but the square root operation always returns the floating-point number closest to the correct square root. Likewise for squaring. How does composing two nearly-exact operations produce so inexact an operation?
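This "as exact as possible" claim can be checked for any particular input using exact rational arithmetic: sqrt(3.0) is correctly rounded exactly when the true sqrt(3) lies between the midpoints separating it from its neighboring doubles, which we can test by squaring those midpoints exactly. A quick sketch in Python (math.nextafter needs Python 3.9+):

```python
import math
from fractions import Fraction

s = math.sqrt(3.0)              # the double that sqrt returns
lo = math.nextafter(s, 0.0)     # the next double below s
hi = math.nextafter(s, math.inf)  # the next double above s

# s is the closest double to sqrt(3) iff sqrt(3) lies between the midpoints
# of s and its neighbors -- i.e., those midpoints' exact squares straddle 3.
assert ((Fraction(lo) + Fraction(s)) / 2) ** 2 < 3
assert ((Fraction(s) + Fraction(hi)) / 2) ** 2 > 3
print("sqrt(3.0) is correctly rounded")
```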

The clue comes in the inputs that are erroneous: very large values or values very close to zero. For example, consider 1e200. Its square should be 1e400, which double-precision floating point cannot represent; so the square is actually computed as +inf. But the square root of infinity is infinity, so the final output is +inf as well, pretty far from the correct 1e200. In some sense, this is a kind of overflow.
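You can watch this happen in any language; here it is in Python (note that Python's ** operator raises OverflowError on float overflow, so I square by multiplying):

```python
import math

x = 1e200
sq = x * x             # the true square, 1e400, is not representable: rounds to inf
print(sq)              # inf
print(math.sqrt(sq))   # inf -- nowhere near the correct answer, 1e200
```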

This can be surprising! For example, x / sqrt(x^2) should always be 1, -1, or NaN, but yields 0 for the largest values in floating point. And though sqrt(x^2) is too silly a program to ever write (I hope!), the same problem occurs in common expressions like sqrt(x^2 + y^2) or in equations like the quadratic formula.
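The x / sqrt(x^2) surprise is easy to reproduce, and it's also why library routines like hypot exist: they compute sqrt(x^2 + y^2) while avoiding the intermediate overflow. A quick Python check:

```python
import math

x = 1.7e308                    # near the largest double
print(x / math.sqrt(x * x))    # 0.0, not 1.0: x*x overflowed to inf
print(x / math.hypot(x, 0.0))  # 1.0: hypot avoids the intermediate overflow
```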

Measuring the error

Herbie measures the error of an expression with a statistical measure of average forward bits of error. This means that it measures the bits of error between the computed and correct answers for a randomly chosen input value. Let's see if we can't approximate the 29.1 bits that Herbie computes using just the definition of its error measure.

We know that in this example the problem occurs mostly for very large and very small inputs: the program computes 0 instead of small numbers and inf instead of large ones. Those errors occur only for inputs where x^2 rounds to 0 or to infinity, which happens for roughly half of all floating-point numbers.
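You can check that "roughly half" claim empirically by sampling doubles uniformly over bit patterns (a quick Python sketch, not how Herbie actually samples):

```python
import math
import random
import struct

random.seed(0)  # make the sampling reproducible

def random_double():
    """A double drawn uniformly over all 2^64 bit patterns, skipping NaN and inf."""
    while True:
        (x,) = struct.unpack('<d', random.getrandbits(64).to_bytes(8, 'little'))
        if math.isfinite(x):
            return x

n = 100_000
bad = 0
for _ in range(n):
    x = random_double()
    sq = x * x
    # count inputs whose square underflows to 0 or overflows to inf
    if (sq == 0.0 and x != 0.0) or math.isinf(sq):
        bad += 1

print(bad / n)  # close to 0.5
```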

But how many bits of error is it to produce 0 or inf instead of x? Well, let's look at the 0 case. To compute the bits of error between x and the correct value, we first figure out how many floating-point numbers are between the two, and then we take the base-2 logarithm. There are 2^64 floating-point numbers total, of which 2^63 are between -1 and 1 (roughly), and of which 2^62 are small enough that their square is 0 (again, roughly). Half of those are negative, half are positive. So the biggest floating-point number that squares to 0 is 2^61 floating-point numbers away from 0 and so has 61 bits of error. But that's the worst case. Half of the numbers (so 2^61) will have between 60 and 61 bits of error, another 2^60 will have between 59 and 60, and so on.2 [2 In fact, Herbie shows a helpful diagram in the page linked above that can give you a sense of what the error is like. Note that the diagram shows error peaking at about 61 bits for roughly 1e-154.]
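This distance-in-floats measure isn't exotic: reinterpreting a double's bit pattern as an integer makes adjacent floats differ by exactly 1. Here's a sketch in Python (the names and details are mine, not Herbie's):

```python
import math
import struct

def ordinal(x):
    """Map a double to an integer so that adjacent floats differ by 1."""
    (u,) = struct.unpack('<Q', struct.pack('<d', x))
    # Negative doubles have the sign bit set; flip them below zero so the
    # mapping is monotone across the whole number line.
    return u if u < 2**63 else 2**63 - u

def bits_of_error(computed, correct):
    """Base-2 log of how many floats apart the two answers are."""
    return math.log2(abs(ordinal(computed) - ordinal(correct)) + 1)

print(bits_of_error(1.0, 1.0))     # 0.0: an exact answer has no error
print(bits_of_error(0.0, 1e-300))  # about 56.7 bits
```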

If you do the math, you will end up with somewhere between 59 and 60 bits of error. But that's for just 2^62 of the 2^63 numbers between -1 and 1, so averaged over this whole group it's somewhere between 29.5 and 30 bits. A similar calculation applies to the numbers less than -1 or greater than 1, so the overall average comes out to something like 29.5 bits. That's pretty close to Herbie's estimate; the difference probably comes from excluding NaNs and other unusual numbers, and from the sampling error Herbie incurs.
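That arithmetic can be done in a few lines (my own back-of-the-envelope, not Herbie's code): weight each band of inputs by its size, and use the midpoint of each band's error range.

```python
# 2^61 inputs have roughly 60-61 bits of error, 2^60 have 59-60, and so on;
# take the midpoint of each range, weighted by how many inputs fall in it.
total = weight = 0
for j in range(61):
    count = 2 ** (61 - j)        # inputs in this band
    total += count * (60.5 - j)  # midpoint of the band's error range
    weight += count

print(total / weight)       # about 59.5 bits, over the inputs that square to 0
print(total / weight / 2)   # about 29.75, averaged over everything in (-1, 1)
```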


Even composing accurate library functions, on valid floating-point inputs, can lead to inaccurate results. While Herbie's error measure may not be the one you expected, it is easy to work with mathematically. And so Herbie can, just like in this situation, help with your floating-point surprises.

Edit: Thank you to James Wilcox for pointing out improvements to this post.


