# Arbitrary Precision, not Arbitrary Accuracy

When I pitch Herbie, our tool for automatically fixing rounding errors in floating point code, I often get asked whether rounding error can be avoided by computing results in arbitrary precision. It can—but this isn’t as easy as it sounds. Arbitrary precision libraries require choosing an acceptable level of precision, and this can be hard.

## Rounding error

Floating point numbers only store so many bits, so they can only represent real numbers approximately, by rounding them. This doesn’t sound so bad; who needs more than 19 decimal digits, anyway? But rounding causes bigger problems than that, because when rounding values are added, subtracted, or otherwise combined, the rounding error can cause much more inaccuracy than just the 19th decimal place. For example, the complex square root code I looked at in math.js could, for certain inputs, return 0 instead of relatively small but still representable values. If you’re dividing by that number, or taking its logarithm, this inaccuracy can cause bit problems.

The main reason for these problems is that intermediate computations may need many more than 19 decimal digits to compute the answer accurately. As a toy example, consider computing `(2¹⁰⁰ + 2⁻¹⁰⁰) - 2¹⁰⁰`

. Even though every value, and the correct answer, are all representable exactly as floating point values, no matter how many bits are used (as long as the exponent field is 8 bits large), the intermediate value `2¹⁰⁰ + 2⁻¹⁰⁰`

requires at least 200 bits of mantissa to represent exactly. Any fewer bits, and the expression will compute to 0, instead of `2⁻¹⁰⁰`

. As a less contrived example, if you want to compute the quadratic formula and get double-precision accurate results, and you intend to have inputs where `b`

is large and `a`

and `c`

can be small, you might need as many as 500 bits for the intermediate values.

## Increasing the precision

The number of bits used to store a floating point value is called the *precision* of that representation. Most floating point hardware gives access to 32-bit and 64-bit floating point values—single and double precision floating point^{1}. But if you need more bits, like in the example above, you might use a library which provides arbitrary-precision floating point. MPFR is a good choice, especially with built-in bindings in languages like Racket.

With MPFR, you select the level of precision you want for each variable, and MPFR can evaluate mathematical functions (including things like Bessel functions) as accurately as that level of precision allows. But just because primitive functions are computed accurately doesn’t mean that the results are accurate. Just like with floating point (which computes many basic functions accurately, as I’ve described before), rounding error occurs.

This means that you the programmer must *choose* how much precision you want to use for intermediate values. Making this choice incorrectly means that the inaccuracy you wanted to avoid is still present. But as you demand more precision, MPFR grows slower and slower^{2}, starting at two orders of magnitude slower even for a hundred or so bits. Wild over-estimates (asking for sixty thousand bits) are costly. But there’s no easy way to check that you’re using sufficiently many bits internally; different input values may need more or fewer bits, so simple testing is not a real answer. And you might need more bits than you expect. I’ve seen some examples that require thousands of bits internally to get double-precision outputs on some inputs.

Solving for the required precision analytically is still an area of research. MPFR is based on advances in this field; see the MPFR paper. And the Rosa compiler by E. Darulova and V. Kuncak(see their POPL’14 paper) is a recent advance: it uses static analysis and an SMT solver to automatically pick the smallest level of precision that provides the necessary error bounds. But at the moment, Rosa is very conservative in its analysis when any non-linear operators are used, so it might end up using far, far too many bits. I’m sure there will be more work in this area, but for now, figuring out what level of *precision* you need to get the *accuracy* you want is challenging. It makes an arbitrary precision library a tool to use as a last, not a first resort.

## Other representations

Arbitrary-precision floating point leaves the hard problem of determining how much precision you need. What about other representations that don’t require answering such questions? A promising alternative is continued fraction arithmetic. A continued fraction is a sequence \([a_0 ; a_1, a_2, \dotsc]\) of natural numbers, which represent the value \[ a_0 + \cfrac1{a_1 + \cfrac1{a_2 + \ddots}}. \] It’s not immediately obvious, but it’s possible to add, multiply, or take square roots of continued fractions (see these slides or Khinchin’s book *Continued Fractions*). Moreover, these operations are streaming: they automatically demand more precision in the inputs when you ask for more accuracy in the output. However, continued fractions are slower yet than arbitrary-precision floats, and I don’t think it’s known how to take exponents, logarithms, and trigonometric functions on continued fractions.

## What to do

So it looks like we’re stuck with floating-point for now. Even if you’re OK with slower software implementations, the challenges of picking the right precision and carefully tracking the rounding error make that option more work than it seems. Rearranging expressions is often the only way to handle rounding error, which is precisely why Herbie tackles that problem. If you can’t find a rearrangement, give Rosa a try; if you’re going to use an arbitrary-precision library, do it very carefully.

*Edit*: Thank you to James Wilcox for proofreading this post.

^{1}

x87, the old math co-processor in Intel chips, also offers extended double precision, which is 80 bits wide. On the other hand, SSE, which is the newer floating point instruction set, does not offer extended double precision.

^{2}

This slowdown is non-linear for most operations.