Not all Arithmetic Operations are Created Equal

By Pavel Panchekha

Share under CC-BY-SA.

Floating-point code uses a variety of operations on floating-point numbers, from basic arithmetic to elementary functions. When I got started working on floating-point error, I thought that was the whole ontology: arithmetic vs. elementary. However, I now realize that it's more complex than that: it's useful to distinguish between a few different levels of floating-point operations.

Bitwise operations (Group 0)

These operations include comparisons, min/max, absolute value, copysign, and some more obscure options like fdim or ldexp. These operations are exact: they do not round. One still has to reason about exceptional values (what is max(NaN, 0)?) when using them, but for many purposes it is safe to equate the floating-point and real versions.
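
As an illustration, here's a minimal sketch in C of why these never need to round: for IEEE 754 doubles, absolute value and copysign come down to manipulating the sign bit. (The names my_fabs and my_copysign are just for this post.)

    #include <stdint.h>
    #include <string.h>

    /* Sketch: fabs and copysign on IEEE 754 doubles only touch the sign bit,
       so there is nothing to round. */
    static double my_fabs(double x) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        bits &= ~(UINT64_C(1) << 63);              /* clear the sign bit */
        memcpy(&x, &bits, sizeof x);
        return x;
    }

    static double my_copysign(double x, double y) {
        uint64_t xbits, ybits;
        memcpy(&xbits, &x, sizeof xbits);
        memcpy(&ybits, &y, sizeof ybits);
        xbits = (xbits & ~(UINT64_C(1) << 63))     /* magnitude of x */
              | (ybits &  (UINT64_C(1) << 63));    /* sign of y      */
        memcpy(&x, &xbits, sizeof x);
        return x;
    }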

Moreover, because these operations are exact and are in some way related to conditionals, you can typically erase them during analysis by performing case splitting. In other words, these operations barely exist.

In software, these operations are very fast: an inverse throughput of 1 and a latency of 1, or something along those lines. In hardware they might be more or less free.

Polynomial operations (Group 1)

Addition and multiplication are special in that, while they are not exact, they have exact residuals. This means that error-free transformations exist and also that many special cases have no error at all, like short-constant multiplication or Sterbenz subtraction. It also means that you can compute the exact result and represent it on the computer if you'd like (you just need more bits).
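
For example, here's a sketch of the classic error-free transformations, Knuth's TwoSum and an FMA-based TwoProd, which recover that exact residual as a second double. (The function names two_sum and two_prod are mine; assume round-to-nearest doubles and a hardware fma.)

    #include <math.h>

    /* TwoSum (Knuth): s + err equals a + b exactly, barring overflow. */
    static void two_sum(double a, double b, double *s, double *err) {
        *s = a + b;
        double b_virtual = *s - a;
        double a_virtual = *s - b_virtual;
        *err = (a - a_virtual) + (b - b_virtual);
    }

    /* TwoProd via fused multiply-add: p + err equals a * b exactly,
       barring overflow and underflow. */
    static void two_prod(double a, double b, double *p, double *err) {
        *p = a * b;
        *err = fma(a, b, -*p);
    }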

This also makes these operations amenable to hardware implementation. Typically addition and multiplication are very fast: maybe an inverse throughput of one half or one and a latency of 3 or 4 cycles.

Algebraic operations (Group 2)

Division and square root are fundamentally harder than addition and multiplication. For one, there is no exact answer: while the product or sum of two floating-point numbers is typically exactly representable at some larger precision, their quotient typically cannot be represented exactly in any finite floating-point precision. The actual algorithms used are also often iterative, maybe based on Newton's method (there are many options). Curiously, division and square root are actually quite similar in terms of how to implement them.
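
To make the iterative flavor concrete, here's a sketch (not a production divider) of Newton-Raphson refinement for the reciprocal 1/d. Each step roughly doubles the number of correct bits, so a handful of iterations from a crude seed get close to full precision; real hardware uses table-driven seeds and a careful final rounding step, neither of which is shown here.

    #include <math.h>

    /* Sketch: Newton-Raphson for y = 1/d (d positive and well-scaled).
       The iteration y <- y * (2 - d*y) converges quadratically, so five
       steps from a crude linear seed get close to double precision
       (though not correctly rounded). */
    static double recip_newton(double d) {
        int e;
        double m = frexp(d, &e);             /* d = m * 2^e, m in [0.5, 1) */
        double y = ldexp(3.0 - 2.0 * m, -e); /* seed: 1/m is roughly 3 - 2m */
        for (int i = 0; i < 5; i++)
            y = y * (2.0 - d * y);           /* Newton step for f(y) = 1/y - d */
        return y;
    }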

These operations are common enough, with enough tricks available, that it makes sense to implement them in hardware, yet at the same time they are quite slow. You can think of these as the "limit" of what we would want to make hardware do. A typical speed is a throughput of one every four cycles and a latency of something like 13 cycles.

However, importantly, while the algorithms we have are iterative, they're not approximate in the way that elementary function implementations are; we're not fitting a polynomial to anything at any point. I think this has something to do with these functions being the solutions to polynomial equations. In principle, you could also have such iterative solutions to a function like cbrt, except these functions aren't common enough to warrant it (and convergence rates are not attractive anyway).

Simple elementary functions (Group 3)

The simple logarithmic, exponential, trigonometric, and hyperbolic elementary functions are easy to implement because they have a wealth of identities to exploit. These identities make it possible to range reduce more or less arbitrarily, so you can always get to a small range on which you can construct a really good polynomial or rational approximation.

Note that these operations fundamentally require (in fast implementations) some kind of curve fitting or precomputed values, and the algorithms are also correspondingly more complex, since they're going to involve range reduction and reconstruction plus either polynomial evaluation (long dependency chains, bad) or table lookup (cache effects, bad).
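
Here's a sketch of that overall shape for exp. The coefficients are just a truncated Taylor series, so this is nowhere near libm quality; it only shows the range-reduce / approximate / reconstruct structure.

    #include <math.h>

    /* Sketch: e^x = 2^k * e^r with r = x - k*ln(2), so |r| <= ln(2)/2.
       A short polynomial handles e^r on that small range, and ldexp does
       the reconstruction.  Accuracy and edge cases (overflow, NaN,
       huge x) are ignored. */
    static double sketch_exp(double x) {
        const double ln2 = 0.6931471805599453;
        double k = nearbyint(x / ln2);       /* range reduction */
        double r = x - k * ln2;
        /* degree-5 Taylor polynomial for e^r, evaluated by Horner's rule */
        double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6 + r * (1.0 / 24 + r / 120))));
        return ldexp(p, (int)k);             /* reconstruction: scale by 2^k */
    }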

For these functions, "worst cases" for rounding are known, and we have moderately effective techniques for finding them.

Hard elementary functions (Group 4)

Functions like pow do not have good worst-case-finding techniques and are just hard to implement. Plus, because pow takes two arguments, polynomial approximations and the like are much harder to construct. pow in particular also has sharp corners, with a lot of opportunity for rounding error amplification. It's plausible to consider it just the most common member of the next group.
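
The textbook decomposition pow(x, y) = exp(y * log(x)) shows where the amplification comes from: an error of ε in log(x) turns into an error of y·ε going into exp, so the relative error of the result grows roughly in proportion to y, which is why careful pow implementations typically carry the logarithm in extra precision. A deliberately naive sketch (the name naive_pow is mine):

    #include <math.h>

    /* Deliberately naive pow: fine for modest y, but the rounding error of
       log(x) is multiplied by y before exponentiation, so the relative
       error of the result grows roughly in proportion to y. */
    static double naive_pow(double x, double y) {
        return exp(y * log(x));
    }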

The wooly world of weird functions (Group 5)

Then there's weird stuff, some of which is somewhat widely implemented (like lgamma, erf, and the Bessel functions) and some of which really isn't (weird statistical stuff, the inverse error function, the polygamma functions, Lambert's W, etc.). You find implementations of these in something like Abramowitz and Stegun, Numerical Recipes, or the GSL library, and in general you just can't expect these implementations to have the same level of quality as libm; they often have no known error bounds, weird error handling, or just totally wrong behavior for large or weird inputs. Part of the issue is that these functions are extremely domain-specific, so maybe the good implementations are locked away somewhere I don't know about.

I think good implementations of these functions are difficult or impossible to write without real mathematical sophistication. It's also really hard to get a list of which of these are even used, and where; it's often in fairly specialized parts of physics.

Conclusion

One thing I've been thinking about is novel hardware instructions to support numerical kernels. Each group is different:

  • In Group 0, the operations are already so cheap that the only way to make them cheaper is to make them free! You can do that by fusing them into a Group 1 instruction. For example, AVX has FMA variants for all possible negations. But I wonder whether there's anywhere it'd be useful to fuse in a max operation.
  • For Group 1 there are lots more Group 1 instructions that could be useful: three-way addition, add/mult-with-residual, dot product, quadratics and cubics, and so on.
  • For Group 2 maybe there are more operations to implement? Hard to say. There's also utility in fusing in Group 1 operations, as in hypot, for example. Alternatively, since division and sqrt can be implemented with Newton-Raphson iteration, is there any plausible instruction that speeds that up? Then the user could maybe write their own variants.
  • For Group 3 we need to break the operations down into pieces that we can speed up. Some plausible options are double-double arithmetic, polynomial evaluation, and table lookup (there's a sketch of the polynomial-evaluation piece after this list).
  • For Groups 4 and 5 I don't think hardware is the biggest blocker. There are either numerical (Group 4) or mathematical (Group 5) foundations that need to come first. Usually we're not even worried about performance, but instead correctness.
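
As a concrete example of the polynomial-evaluation piece mentioned in the Group 3 item, here's a sketch of Estrin's scheme for a degree-7 polynomial (the coefficient array c and the name estrin7 are hypothetical). Compared to Horner's rule, which is one serial chain of seven fused multiply-adds, it exposes independent FMAs that can run in parallel; a dedicated polynomial-evaluation instruction would presumably bake in something like this.

    #include <math.h>

    /* Sketch: Estrin's scheme for c[0] + c[1]*x + ... + c[7]*x^7.
       The four inner FMAs are independent of each other, so out-of-order
       hardware can overlap them, shortening the dependency chain compared
       to Horner's rule. */
    static double estrin7(const double c[8], double x) {
        double x2 = x * x, x4 = x2 * x2;
        double p01 = fma(c[1], x, c[0]);
        double p23 = fma(c[3], x, c[2]);
        double p45 = fma(c[5], x, c[4]);
        double p67 = fma(c[7], x, c[6]);
        double p03 = fma(p23, x2, p01);   /* c0 + c1 x + c2 x^2 + c3 x^3 */
        double p47 = fma(p67, x2, p45);   /* c4 + c5 x + c6 x^2 + c7 x^3 */
        return fma(p47, x4, p03);
    }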