By Pavel Panchekha. Shared under CC-BY-SA.

FastTwoSum is Faster Than TwoSum

We all know that floating-point arithmetic is imprecise. When you do s = x + y, the value of s might not exactly be the sum of the values of x and y. Curiously (and importantly, for some applications), there's a way to quantify the imprecision: there's a family of algorithms1 [1 Called "error-free transformations", because they transform x and y into s and e without any error in the sum.] that, given x and y, compute not just s but also e, where e is the rounding error.2 [2 Under mild assumptions about overflow, and with round-to-nearest-even, this rounding error is a floating-point number.]

TwoSum and FastTwoSum

The best-known such algorithms are TwoSum and FastTwoSum. Let's start with TwoSum. It works like this:

def twosum(x, y):
    s = x + y
    xx = s - y
    yy = s - xx
    ex = x - xx
    ey = y - yy
    e = ex + ey
    return s, e

Let me explain, step by step:3 [3 This explanation is courtesy of D.K. Zhang.]

  • s is the normal sum of x and y, nothing fancy
  • xx and yy are the "effective" versions of x and y: they are x and y, rounded so that s = xx + yy holds exactly. Note that the computation is not symmetric: for xx we compute s - y, but for yy we subtract xx, not x. That's what guarantees s = xx + yy exactly. Of course, since addition is commutative, you could do it the other way around: compute yy as s - x and then xx as s - yy.
  • ex and ey are, then, the parts of x and y that got rounded off to get xx and yy.
  • Finally, we add those errors to get the full rounded-off amount.
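
To make this concrete, here's the same sequence of operations written out in C for 0.1 + 0.2 (a quick sanity check; the printed values assume the default round-to-nearest mode):

#include <stdio.h>

/* TwoSum on 0.1 + 0.2, written out step by step. Under round-to-nearest
 * this should print
 *   s = 0.30000000000000004
 *   e = -2.7755575615628914e-17
 * and s + e equals the sum of the two inputs exactly: e is precisely
 * what the addition rounded away. */
int main(void) {
    double x = 0.1, y = 0.2;
    double s = x + y;
    double xx = s - y, yy = s - xx;
    double e = (x - xx) + (y - yy);
    printf("s = %.17g\ne = %.17g\n", s, e);
    return 0;
}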

Now here's FastTwoSum. It's much shorter, but it has an extra assumption:

def fasttwosum(x, y):
    assert abs(x) >= abs(y)
    s = x + y
    yy = s - x
    e = y - yy
    return s, e

FastTwoSum is faster than TwoSum (it has three floating-point operations, not six) but you can only use it if you know which argument is larger in magnitude.
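
To see why that assumption matters, here's what the three FastTwoSum steps give you if you pass the small input in the "big" slot (a quick C check; the specific inputs are just for illustration):

#include <stdio.h>

/* FastTwoSum's three operations, applied in the wrong order: the tiny
 * input is treated as the big one. Under round-to-nearest this should
 * print e = 1.1102230246251565e-16 (that's 2^-53), but the true rounding
 * error is the small input itself, roughly 1e-16. So s + e no longer
 * equals the exact sum; TwoSum handles these same inputs correctly. */
int main(void) {
    double big = 1.0, small = 1e-16;
    double s  = small + big;   /* rounds to 1.0 */
    double yy = s - small;     /* pretends `small` is the larger input */
    double e  = big - yy;
    printf("s = %.17g\ne = %.17g\n", s, e);
    return 0;
}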

Analyzing Performance

Let's analyze the runtime of these algorithms more carefully, assuming we're implementing these in C or something similarly low-level. TwoSum uses six floating-point operations, but two of them are independent (the computations of ex and ey). The others all form one long dependency chain, so the latency is five floating-point additions. On most modern chips, a floating-point addition has latency 3 or 4, so that's 15–20 cycles total. FastTwoSum, by contrast, only has three operations, which is 9–12 cycles.

You can compute throughput as well. TwoSum issues six add/sub operations, and most modern chips have at least two floating-point add/sub units, maybe more, so the inverse throughput should be something like 3 cycles. FastTwoSum is half that, so maybe 1.5 cycles, but in practice I don't think you can really beat two cycles.

This makes it look like FastTwoSum is faster, but of course it really isn't, because you can only use it if the inputs are sorted by absolute value. But that makes me think: how long would it take to sort the inputs?

The issue is that, if the values of x and y are unpredictable, comparing x and y and branching on them will add a branch misprediction penalty, which is something like 10–20 cycles. If you get that penalty half the time, that takes you from 9 cycles to 14–19 cycles, which is now worse than TwoSum. And there are other costs to branch misprediction, so TwoSum looks better.

But this is, as the kids say, a skill issue. Modern CPUs provide a variety of branchless conditional operations, which all basically work by first computing both sides of the conditional and then selecting which to keep using a boolean. This doesn't use the branch predictor and is usually very low-latency; if computing both sides of the conditional is cheap enough, it can be a big improvement over regular branches.

So this suggests an idea: sort the two inputs to FastTwoSum using a branchless operation like this:

def abs_sort(x, y):
    a = abs(x)
    b = abs(y)
    mask = a > b
    xx = blend(mask, x, y)
    yy = blend(mask, y, x)
    return xx, yy

Here blend is one of many names for branchless conditionals; select and cmov are others. In this pseudocode, blend(mask, a, b) returns a where the mask is true and b otherwise.

What's the latency of this sort routine? Well, as always, it depends, but floating-point absolute value is basically an and, so it should take one cycle; on M1 it appears to take two, not sure why. Both absolute values can happen in parallel. Floating-point comparison is basically the same as addition, so that's 3 or 4 cycles. The blend should be one cycle, though on M1 it's two. So the sort is 5–8 cycles in total, and it can happen in parallel with the computation of s (addition is commutative, so s doesn't care about the order). In the best case, then, sort plus FastTwoSum should come to something like 11 cycles, handily beating TwoSum.
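
Concretely, the sort-plus-FastTwoSum idea looks something like this in C (a sketch; the function name is mine, and whether the ternaries actually become branchless selects is up to the compiler, which is where things get interesting below):

#include <math.h>

/* Sort the inputs by magnitude with plain comparisons and ternaries,
 * then run FastTwoSum. The sum is computed up front, in parallel with
 * the sort, since addition doesn't care about the order. The hope is
 * that the compiler turns the ternaries into conditional selects
 * (fcsel on ARM, blendv on x86) rather than branches. */
static inline void fasttwosum_sorted(double x, double y, double *s, double *e) {
    double sum = x + y;
    int x_is_big = fabs(x) > fabs(y);
    double big   = x_is_big ? x : y;
    double small = x_is_big ? y : x;
    double yy = sum - big;
    *s = sum;
    *e = small - yy;
}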

Results, Apple M1

I implemented this on my M1 Mac: classic TwoSum and FastTwoSum; FastTwoSum preceded by a sort written with an ordinary comparison ("Basic sort" below); and FastTwoSum preceded by a SIMD blend ("SIMD sort"). I got these results:

Algorithm    Latency      Throughput
TwoSum       15.0 cycles  3.2 cycles
FastTwoSum    9.0 cycles  3.2 cycles
Basic sort   12.3 cycles  3.2 cycles
SIMD sort    12.5 cycles  2.5 cycles

As expected, TwoSum is 15 cycles and FastTwoSum is 9. The two versions with the sort are in between, about 12 cycles, which makes sense since the M1's fabs and fcsel are two cycles, not one (and fcmp is two, not three). The latency math then predicts 12 cycles.

There turns out to be very little difference between the basic sort and the SIMD one. Why? Well, take a look at the assembly for the basic sort:

fadd    d2, d0, d1
str     d2, [x0]
fabs    d3, d1
fabs    d4, d0
fcmp    d3, d4
fcsel   d3, d0, d1, gt
fcsel   d0, d1, d0, gt
fsub    d0, d2, d0
fsub    d0, d3, d0
str     d0, [x1]

Ah: even for the basic sort, Clang produced branchless conditional move operations (the fcsels), which on the Apple M1 are very cheap. The SIMD sort then doesn't really win by much. (The slightly better throughput might be due to splitting the workload between general-purpose and SIMD registers, though on M1 those go to the same ports, so I'm a little suspicious of this explanation.)

Results, Intel Coffee Lake

I reran the same benchmarks on an 8th-generation Intel (Coffee Lake) machine I have access to; this one has AVX2 but not AVX-512, which is a real bummer because AVX-512 has min and max routines that might have been useful. Here are my results with GCC:

Algorithm    Latency      Throughput
TwoSum       16.0 cycles  4.2 cycles
FastTwoSum    9.7 cycles  4.3 cycles
Basic sort    9.8 cycles  4.1 cycles
SIMD sort    16.8 cycles  4.3 cycles

Here, oddly, the "basic sort" version is extremely good—why? Well, here's the assembly:

vmovq   xmm2, QWORD PTR .LC0[rip]
vaddsd  xmm3, xmm0, xmm1
vmovsd  xmm4, xmm1, xmm1
vandpd  xmm5, xmm1, xmm2
vandpd  xmm2, xmm0, xmm2
vcmpltsd        xmm2, xmm2, xmm5
vmovsd  QWORD PTR [rdi], xmm3
vblendvpd       xmm1, xmm0, xmm1, xmm2
vblendvpd       xmm0, xmm4, xmm0, xmm2
vsubsd  xmm3, xmm3, xmm1
vsubsd  xmm0, xmm0, xmm3
vmovsd  QWORD PTR [rsi], xmm0

Here the constant at .LC0 is a mask that clears the sign bit, to take absolute values. The code, then, is extremely straightforward: the first vaddsd and the vmovsd to memory do the initial summation to compute s; the vmovq / vandpd steps take absolute values; the vcmpltsd and vblendvpd steps do the sort; and the last three instructions are FastTwoSum.

In other words, GCC produced great SIMD code; good work!

But what went wrong with my hand-written SIMD code? I won't bore you with the details, but GCC handled my intrinsics badly; I think the use of intrinsics got in the way of its own register allocation and instruction scheduling. For example, on its own GCC did a good job spacing out the vmovq and the vmovsd to memory, presumably to help out-of-order execution. With intrinsics it instead had to do a lot of register moves and clears, and all the memory operations were bunched together.
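
For reference, the hand-written version was roughly along these lines (a simplified sketch using SSE4.1 intrinsics, not the exact code I benchmarked; it needs -msse4.1 or -march=native):

#include <immintrin.h>

/* Branchless sort by magnitude in xmm registers: clear the sign bits,
 * compare, and blend, then do the FastTwoSum subtractions. */
static inline void fasttwosum_simd(double x, double y, double *s, double *e) {
    __m128d vx = _mm_set_sd(x);
    __m128d vy = _mm_set_sd(y);
    __m128d signbit = _mm_set_sd(-0.0);
    __m128d ax = _mm_andnot_pd(signbit, vx);       /* |x| */
    __m128d ay = _mm_andnot_pd(signbit, vy);       /* |y| */
    __m128d mask  = _mm_cmplt_sd(ax, ay);          /* |x| < |y| ? */
    __m128d big   = _mm_blendv_pd(vx, vy, mask);   /* take y if |x| < |y| */
    __m128d small = _mm_blendv_pd(vy, vx, mask);
    __m128d sum = _mm_add_sd(big, small);
    __m128d yy  = _mm_sub_sd(sum, big);
    __m128d err = _mm_sub_sd(small, yy);
    *s = _mm_cvtsd_f64(sum);
    *e = _mm_cvtsd_f64(err);
}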

Moving on, here are the results with Clang:

Algorithm     Latency      Throughput
TwoSum        16.0 cycles   4.7 cycles
FastTwoSum     9.7 cycles   4.8 cycles
Basic sort     9.8 cycles  20.2 cycles
SIMD sort     10.8 cycles  17.2 cycles
Integer sort   9.7 cycles   5.0 cycles

What the hell? Why is throughput so bad for the basic sort version?4 [4 And I just want to be clear here that I'm using -O3 and -mtune=native and so on.] Looking at the assembly gives a simple answer: Clang generated a branch. That also explains the extremely good latency numbers: I am recording the best-case latency and the average-case throughput, so the best-case latency is when the branch is extremely well predicted, in which case the sort is free. We really, really need a __builtin_cmov.

But for the SIMD sort that can't be an issue, right? Because I wrote SIMD intrinsics there and they don't have a branch? Turns out … Clang inserted a branch to handle the NaN case. And then that branch is somehow poorly predicted. This is pretty disappointing, yes. Bad Clang.

Finally, I wrote an "Integer sort" version, which uses bit-casts to sort the inputs as integers. This version convinces both Clang and GCC to generate a conditional move instruction, achieving near-perfect latency and throughput on both compilers, at the low cost of a pretty ugly implementation. However, this integer sort version is very slow on ARM, because ARM keeps floating-point and integer values in separate register files, and moving between them is expensive.
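
The trick looks roughly like this (a sketch; the function name is mine, and it punts on NaN handling):

#include <stdint.h>
#include <string.h>

/* Compare the absolute values as unsigned integers: once the sign bit is
 * cleared, the IEEE 754 encoding of a double is monotonic, so integer
 * comparison orders doubles by magnitude. Selecting the bit patterns with
 * ternaries is what coaxes both GCC and Clang into integer cmovs. */
static inline void abs_sort_bits(double x, double y, double *big, double *small) {
    uint64_t xb, yb;
    memcpy(&xb, &x, sizeof xb);
    memcpy(&yb, &y, sizeof yb);
    uint64_t xa = xb & ~(1ull << 63);   /* bits of |x| */
    uint64_t ya = yb & ~(1ull << 63);   /* bits of |y| */
    uint64_t bigb   = xa >= ya ? xb : yb;
    uint64_t smallb = xa >= ya ? yb : xb;
    memcpy(big,   &bigb,   sizeof *big);
    memcpy(small, &smallb, sizeof *small);
}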

Conclusion

Here's my overall conclusion. For each CPU and compiler, there's a version of sort + FastTwoSum which is faster than TwoSum. However, there isn't a single version that beats TwoSum on every CPU and compiler. The fault, frankly, is with the compilers: it is oddly difficult to convince Clang specifically to avoid branches on x86.

On ARM, the best FastTwoSum variant is 20% faster than TwoSum: 4× the latency of a floating-point addition instead of TwoSum's 5×. On Intel Coffee Lake, the best FastTwoSum variant is 40% faster, at 3× the cost of an addition, basically no overhead versus FastTwoSum with no sort.

SIMD wasn't worth it on any platform.

Anyway, here's my bottom line: FastTwoSum is faster than TwoSum, even though it has to sort its inputs. If you're using TwoSum, consider switching. You can expect speedups of 20% or more; just mind the compiler and make sure it's generating true branchless code.

Finally, plaintive requests: Clang and GCC, please add __cmov to x86intrin.h. And hardware vendors, please provide a hardware TwoSum (or an equivalent, like a three-way add). It should be easy and cheap, and could take the cost of double-double addition down to 2× addition. Pretty please? Happy to chat.
