Pavel Panchekha


Share under CC-BY-SA.

Floating Point Guarantees

Floating point is an inexact representation of the real numbers. Ask any programmer, and they will say, “floating point gives inaccurate results”. But I think digging a bit deeper is valuable. Floating point is completely deterministic, and the errors introduced are neither random nor malicious. Rounding error is the result of hard mathematical realities and just a few simple rules.

Floating point computation, as implemented in most computers today, is defined by several standards; the upshot is that most floating point hardware acts identically.

The same is not true of floating point software, in particular programming languages, which usually prevent the user from accessing a variety of floating point features, like floating point flags, rounding modes, and traps. However, the core features of the standards, like how various operators behave, is identically implemented in most languages. I’ll be using Python 3.4 for my examples, and whenever I print floating-point results, I’ll be using the following function, which displays 25 decimal digits:

def pfp(x): print("{:+0.25f}".format(x))

Reading in floating point values

The most basic guarantee floating point provides is that any decimal constant written in source code will represent the closest floating point number. For example:

>>> pfp(1.0)
>>> pfp(1.5)
>>> pfp(0.1)

Note that 1.0 and 1.5 have “exact” floating point representations, but 0.1 does not. This is simply because there is no floating point number with the value 0.1, and there are floating point numbers with values 1.0 and 1.5. This isn’t too surprising: there are infinitely many numbers, even numbers between 0 and 1, and only finitely many floating point numbers, so some have to be missing. As a rule of thumb, diadic fractions—those with a power of two on the bottom—whose numerators and denominators are not very large have exact values.

One fun way to test which floating point numbers are representable is with a loop like so1 [1 Note that if floats were real numbers, this would be an infinite loop. It only halts because of the limited precision with which floating point can represent real values.]:

def fp_prev(x):
    y = x / 2
    while (x + y) / 2 != x and (x + y) / 2 != y:
        y = (x + y) / 2
    return y

def fp_next(x):
    y = x * 2
    while (x + y) / 2 != x and (x + y) / 2 != y:
        y = (x + y) / 2
    return y

These find the floating point value nearest to but different from a given floating point argument. For a floating point number x that isn’t terribly big or terribly small, x / 2 and x * 2 are guaranteed to be different from x, so the loop will iterate until it converges to a number very close to, but different from, x.

>>> pfp(fp_prev(1.0))
>>> pfp(fp_next(1.0))
>>> pfp(fp_prev(0.1))
>>> pfp(0.1)
>>> pfp(fp_next(0.1))

As you can see, despite 0.1 not being exactly represented by any floating point value, the floating point value that 0.1 is read in as is still closer to 0.1 than the two other nearby values.

Basic floating point operations

Once we have floating point values, we will want to compute with them. The most basic operations we might want to use are addition, subtraction, multiplication, and division. To understand how these operations behave, it’s useful to have a way to compute intermediate values with higher precision. I’ll be using the Python decimal module, which defines variable-precision software decimal floating point:

from decimal import *
d = Decimal
define psd(x): pfp(float(x))

We can now explore how the basic arithmetic operations work. For some values, addition seems to work exactly:

>>> pfp(0.1)
>>> pfp(0.2)
>>> pfp(0.1 + 0.1)
>>> psd(d("0.1") + d("0.1"))

Here both the floating-point and the extended-precision computations of 0.1 + 0.1 give the same result, which is also the floating point value closest to 0.2. However, this is an accidental success. You see, we did not actually add 0.1 to 0.1; instead, we added a value slightly exceeding 0.1 to a value slightly exceeding 0.1, and got a value bigger than 0.2 by twice the excess in the inputs. It is a complete accident that this result is still the floating point number closest to 0.2; the larger excess in the output could have made it too far from 0.2.

This is exactly what happens if we try to add on another 0.1:

>>> pfp(0.1 + 0.2)
>>> pfp(0.3)

We can see that the resulting value is 0.3 with a yet bigger excess, and in this case, the excess is so large that another floating-point value is closer to 0.3. In fact, the closest floating point number to 0.3 is just smaller than the result we computed from 0.1 + 0.2.

>>> pfp(fp_prev(0.1 + 0.2))

There is a third subtlety here:

>>> pfp(0.1)
>>> pfp(0.2)
>>> pfp(0.1 + 0.2)

Do the addition manually, and you’ll see that the value for 0.1 + 0.2 is not exactly the sum of the values for 0.1 and 0.2, which would be:


The problem, of course, is that “0.3000000000000000166533453” is not exactly representable in floating point, so it has to be rounded. Which way is the value rounded? This is defined by a rounding mode, which does not necessarily produce the closest floating point number:

>>> pfp(+0.3000000000000000166533453)
>>> pfp(0.1 + 0.2)

As you can imagine, if you tried to compute 0.1 + 0.1 + 0.1 + …, you would eventually (if slowly) end up with values that differed from the exact value rather significantly.

So you can see that there are a variety of subtleties even to the simple act of floating point addition. You might now be worried that floating point is a total rat-hole and completely unpredictable. But in fact, there is a simple rule that always describes the result of a floating point addition or subtraction: the result of a basic floating point operation is always equal to the exact result of that operation, rounded per the rounding mode.

Formalizing basic floating point operations

To better understand how floating point operations behave, let’s formalize their behavior. To do this, we want make a few important distinctions: between real values and floating point values, and between real and floating point operations.

Let’s write

  • All constants will represent real values
  • The operator fp will construct floating point constants by finding the closest floating point value to a real value. I’ll write fp(0.1) for the floating point value that an 0.1 in your source code would reference.
  • The operator will convert floating point values to real values; I’ll never display floating point values, but instead their real values.
  • Operators like +, -, ·, and / will represent the real operations.
  • Circled operators like , , , and will represent floating point operations.
  • The operator rd will round real values to floating point values; this need not be the same operator fp (though in many cases it is).

In this new notation, the rule for basic operators is:

x ⊕ y = rd(ℜ(x) + ℜ(y))

For example, suppose we wanted to understand what 0.1 + 0.2 produces. The source code 0.1 + 0.2 really describes the operation fp(0.1) ⊕ fp(0.2), so per the rule its result is:

rd(ℜ(fp(0.1)) + ℜ(fp(0.2)))

The operators and fp are not inverses; for example

ℜ(fp(0.1)) = 0.100000000000000005551115…

as we saw above. This new notation makes it clear that we were in fact never adding 0.1 and 0.2, but instead the nearby values ℜ(fp(0.1)) and ℜ(fp(0.2)); so there’s no reason whatsoever to expect 0.3, or even ℜ(fp(0.3)) as a result, since there’s certainly no rule ℜ(fp(x)) + ℜ(fp(y)) = ℜ(fp(x + y)).

Similar rules to this one for plus apply to subtraction, multiplication, and division, as well as to operations like square root and exponentiation. But consult your mathematics library. For example, the GNU libc library describes its accuracy in its manual, where you can see the number of floating point values between the exact values and the computed values for various provided functions.

Putting the floating point guarantees in perspective

So floating-point operations are deterministic and operate on simple principles. Don’t look at floating point as a black box; neither just wave away its unintuitive nature as “inaccuracy”. Floating point computation follows a few simple rules, which can be internalized and understood.

That isn’t to say the floating point computation is easy to get right. Problems like cancellation and rounding can combine in surprising ways to completely destroy the accuracy of your results. That is the object of numerical analysis, which is a field of study in its own right. But the whole existence of numerical analysis is predicated upon floating point acting in predictable and understandable ways. Floating point is not black magic, but if you master it you still become a wizard.

Edit: Thank you to Zach Tatlock for many tweaks and bug fixes.



Note that if floats were real numbers, this would be an infinite loop. It only halts because of the limited precision with which floating point can represent real values.