FastTwoSum is Faster Than TwoSum
We all know that floating-point arithmetic is imprecise. When you do `s = x + y`, the value of `s` might not exactly be the sum of the values of `x` and `y`. Curiously (and importantly, for some applications), there's a way to quantify the imprecision: there's a family of algorithms[^1] that, given `x` and `y`, compute not just `s` but also `e`, where `e` is the rounding error.[^2]
TwoSum and FastTwoSum
The best-known such algorithms are `TwoSum` and `FastTwoSum`. Let's start with `TwoSum`. It works like this:
```python
def twosum(x, y):
    s = x + y
    xx = s - y
    yy = s - xx
    ex = x - xx
    ey = y - yy
    e = ex + ey
    return s, e
```
Let me explain, step by step:[^3]
- `s` is the normal sum of `x` and `y`, nothing fancy.
- `xx` and `yy` are the "effective" versions of `x` and `y`: they are `x` and `y`, but rounded so that `s = xx + yy` exactly. Note that the computation is not symmetric; for `xx` we use `s - y`, but then for `yy` we subtract `xx`, not `x`. That's to ensure `s = xx + yy` exactly. Of course, since addition is commutative, you could do the reverse, computing `yy` from `x` and then `xx` from `yy`.
- `ex` and `ey` are, then, the parts of `x` and `y` that got rounded off to get `xx` and `yy`.
- Finally, we add those errors to get the full rounded-off amount.
Now here's `FastTwoSum`. It's much shorter, but it has an extra assumption:
```python
def fasttwosum(x, y):
    assert abs(x) > abs(y)
    s = x + y
    yy = s - x
    e = y - yy
    return s, e
```
`FastTwoSum` is faster than `TwoSum` (it has three floating-point operations, not six) but you can only use it if you know which argument is bigger.
Analyzing Performance
Let's analyze the runtime of these algorithms more carefully, assuming we're implementing these in C or something similarly low-level. `TwoSum` uses six floating-point operations, but two of them are independent (the computations of `ex` and `ey`). The others all form one long dependency chain, so the latency is five floating-point additions. On most modern chips, a floating-point addition has latency 3 or 4, so that's 15–20 cycles total. `FastTwoSum`, by contrast, only has three operations, which is 9–12 cycles.
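For concreteness, here's roughly how the two routines might look in C. This is a minimal sketch; the names and signatures are just for illustration, not my exact benchmark code:

```c
// Sketch of the two algorithms in plain C.
static void twosum(double x, double y, double *s, double *e) {
    double sum = x + y;
    double xx = sum - y;
    double yy = sum - xx;
    double ex = x - xx;   // ex and ey don't depend on each other,
    double ey = y - yy;   // so they can execute in parallel
    *s = sum;
    *e = ex + ey;
}

static void fasttwosum(double x, double y, double *s, double *e) {
    // Precondition: x is the larger input in magnitude.
    double sum = x + y;
    double yy = sum - x;
    *s = sum;
    *e = y - yy;
}
```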
You can compute throughput as well. `TwoSum` issues six add/subtract operations, and most modern chips have at least two floating-point add/sub units, maybe more, so the inverse throughput should be something like 3 cycles. `FastTwoSum` is half that, so maybe 1.5 cycles, but in practice I don't think you can really beat two cycles.
This makes it look like `FastTwoSum` is faster, but of course it really isn't, because you can only use it if the inputs are sorted by absolute value. But that makes me think: how long would it take to sort the inputs?
The issue is that, if the values of `x` and `y` are unpredictable, comparing `x` and `y` and branching on them will add a branch misprediction penalty, which is something like 10–20 cycles. If you get that penalty half the time, that takes you from 9 cycles to 14–19 cycles, which is now worse than `TwoSum`. And there are other costs to branch misprediction, so `TwoSum` looks better.
But this is, as the kids say, a skill issue. Modern CPUs provide a variety of branchless conditional operations, which all basically work by first computing both sides of the conditional and then selecting which to keep using a boolean. This doesn't use the branch predictor and is usually very low-latency; if computing both sides of the conditional is cheap enough, it can be a big improvement over regular branches.
So this suggests an idea: sort the two inputs to `FastTwoSum` using a branchless operation like this:
```python
def abs_sort(x, y):
    a = abs(x)
    b = abs(y)
    mask = a > b
    xx = blend(mask, x, y)
    yy = blend(mask, y, x)
    return xx, yy
```
Here `blend` is one of many names for branchless conditionals; `select` and `cmov` are others.
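On x86, for example, one way to write this sort with SSE intrinsics might look like the sketch below. This is an illustrative reconstruction, not necessarily my exact benchmarked code; `_mm_blendv_pd` requires SSE4.1:

```c
#include <immintrin.h>

// Branchless sort of two doubles by absolute value, using a blend.
static void abs_sort(double x, double y, double *big, double *small) {
    __m128d vx = _mm_set_sd(x);
    __m128d vy = _mm_set_sd(y);
    __m128d sign = _mm_set_sd(-0.0);        // just the sign bit
    __m128d ax = _mm_andnot_pd(sign, vx);   // |x|
    __m128d ay = _mm_andnot_pd(sign, vy);   // |y|
    __m128d mask = _mm_cmplt_sd(ax, ay);    // all-ones if |x| < |y|
    // blendv picks its second operand where the mask's sign bit is set
    *big = _mm_cvtsd_f64(_mm_blendv_pd(vx, vy, mask));
    *small = _mm_cvtsd_f64(_mm_blendv_pd(vy, vx, mask));
}
```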
What's the latency of this sort routine? Well, as always, it depends, but floating-point absolute value is basically an `and`, so it should take one cycle; on M1 it appears to take two, not sure why. Both absolute values can happen in parallel. Floating-point comparison is basically the same as addition, so that's 3 or 4 cycles. The blend should be one cycle, though on M1 it's two. So that's a total of 5–8 cycles, and it can happen in parallel with the computation of `s`, so in the best case we should get something like an 11-cycle sort plus `FastTwoSum`, handily beating `TwoSum`.
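For reference, the "basic sort" variant measured below is essentially the following plain-C sketch; whether the ternaries become conditional selects or branches is up to the compiler, which is exactly the question at hand:

```c
#include <math.h>

// FastTwoSum with a sort in front of it, written so that the compiler
// *can* use conditional selects for the sort. Sketch only.
static void fasttwosum_sorted(double x, double y, double *s, double *e) {
    double bigger = fabs(x) > fabs(y) ? x : y;
    double smaller = fabs(x) > fabs(y) ? y : x;
    double sum = bigger + smaller;
    double yy = sum - bigger;
    *s = sum;
    *e = smaller - yy;
}
```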
Results, Apple M1
I implemented this on my M1 Mac: classic `TwoSum` and `FastTwoSum`; `FastTwoSum` with a simple branch; and `FastTwoSum` with a SIMD blend. I got these results:
| Algorithm | Latency | Throughput |
|---|---|---|
| TwoSum | 15.0 cycles | 3.2 cycles |
| FastTwoSum | 9.0 cycles | 3.2 cycles |
| Basic sort | 12.3 cycles | 3.2 cycles |
| SIMD sort | 12.5 cycles | 2.5 cycles |
As expected, `TwoSum` is 15 cycles and `FastTwoSum` is 9. The two versions with the sort are in between, about 12 cycles, which makes sense since the M1's `fabs` and `fcsel` are two cycles, not one (and `fcmp` is two, not three). The latency math then predicts 12 cycles.
There turns out to be very little difference between the basic sort and the SIMD one. Why? Well, take a look at the assembly for the basic sort:
```asm
fadd   d2, d0, d1
str    d2, [x0]
fabs   d3, d1
fabs   d4, d0
fcmp   d3, d4
fcsel  d3, d0, d1, gt
fcsel  d0, d1, d0, gt
fsub   d0, d2, d0
fsub   d0, d3, d0
str    d0, [x1]
```
Ah: for the basic sort, Clang produced a branchless conditional move (the `fcsel`), which on the Apple M1 is very cheap. The SIMD sort then doesn't really win by much. (The better throughput might be due to splitting the workload between general-purpose and SIMD registers, though on M1 those go to the same ports so I'm a little suspicious of this explanation.)
Results, Intel Coffee Lake
I reran the same benchmarks on an Intel Gen 8 / Coffee Lake computer I have access to; this one has AVX2 but not AVX-512, which is a real bummer because AVX-512 has `min` and `max` routines that might have been useful. Here are my results with GCC:
| Algorithm | Latency | Throughput |
|---|---|---|
| TwoSum | 16.0 cycles | 4.2 cycles |
| FastTwoSum | 9.7 cycles | 4.3 cycles |
| Basic sort | 9.8 cycles | 4.1 cycles |
| SIMD sort | 16.8 cycles | 4.3 cycles |
Here, oddly, the "basic sort" version is extremely good. Why? Well, here's the assembly:
```asm
vmovq      xmm2, QWORD PTR .LC0[rip]
vaddsd     xmm3, xmm0, xmm1
vmovsd     xmm4, xmm1, xmm1
vandpd     xmm5, xmm1, xmm2
vandpd     xmm2, xmm0, xmm2
vcmpltsd   xmm2, xmm2, xmm5
vmovsd     QWORD PTR [rdi], xmm3
vblendvpd  xmm1, xmm0, xmm1, xmm2
vblendvpd  xmm0, xmm4, xmm0, xmm2
vsubsd     xmm3, xmm3, xmm1
vsubsd     xmm0, xmm0, xmm3
vmovsd     QWORD PTR [rsi], xmm0
```
Here `.LC0` is a mask that clears the high bit, used to take absolute values. The code, then, is extremely straightforward: the first `vaddsd` and the `vmovsd` to memory do the initial summation to compute `s`; then the `vmovq` / `vandpd` steps take absolute values; then the `vcmpltsd` and `vblendvpd` steps do the sort; and then the last three instructions are `FastTwoSum`.
In other words, GCC produced great SIMD code; good work!
But what went wrong with my hand-written SIMD code? I won't bore you with the details, but GCC handled my intrinsics badly; I think the use of intrinsics hindered its own register allocation and scheduling. For example, on its own GCC did a good job spacing out the `vmovq` and the `vmovsd` to memory, presumably to help out-of-order execution. With intrinsics it instead had to do a lot of register moves and clears, and all the memory operations were bunched together.
Moving on, here are the results with Clang:
| Algorithm | Latency | Throughput |
|---|---|---|
| TwoSum | 16.0 cycles | 4.7 cycles |
| FastTwoSum | 9.7 cycles | 4.8 cycles |
| Basic sort | 9.8 cycles | 20.2 cycles |
| SIMD sort | 10.8 cycles | 17.2 cycles |
| Integer sort | 9.7 cycles | 5.0 cycles |
What the hell? Why is throughput so bad for the basic sort version?[^4] Looking at the assembly gives a simple answer: Clang generated a branch. That also explains the extremely good latency numbers: I am recording the best-case latency and the average-case throughput, so the best-case latency is when the branch is extremely well predicted, in which case the sort is free. We really, really need a `__builtin_cmov`.
But for the SIMD sort that can't be an issue, right? Because I wrote SIMD intrinsics there and they don't have a branch? Turns out … Clang inserted a branch to handle the NaN case. And then that branch is somehow poorly predicted. This is pretty disappointing, yes. Bad Clang.
Finally, I wrote an "Integer sort" version, which uses bit-casts to sort the inputs as integers. This version convinces both Clang and GCC to generate a conditional move instruction, achieving near-perfect latency and throughput on both compilers, at the low cost of a pretty ugly implementation. However, this integer sort version is very slow on ARM, because ARM segregates float and integer registers. The idea is roughly the sketch below.
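A minimal sketch of that bit-cast trick (names are illustrative, and details like NaN handling are glossed over): for non-NaN doubles, clearing the sign bit and comparing the remaining bits as unsigned integers orders the values by magnitude.

```c
#include <stdint.h>
#include <string.h>

// Sort two doubles by magnitude via their bit patterns. Sketch only.
static void abs_sort_int(double x, double y, double *big, double *small) {
    uint64_t ix, iy;
    memcpy(&ix, &x, sizeof ix);   // bit-cast double -> uint64_t
    memcpy(&iy, &y, sizeof iy);
    uint64_t ax = ix & ~(1ULL << 63);   // clear the sign bit
    uint64_t ay = iy & ~(1ULL << 63);
    // For non-NaN inputs this integer comparison matches |x| > |y|,
    // and compilers are happy to turn the selects below into cmov.
    int x_bigger = ax > ay;
    *big = x_bigger ? x : y;
    *small = x_bigger ? y : x;
}
```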
Conclusion
Here's my overall conclusion. For each CPU and compiler, there's a version of sort + `FastTwoSum` which is faster than `TwoSum`. However, there isn't a single version that beats `TwoSum` on every CPU and compiler. The fault, frankly, is with the compilers: it is oddly difficult to convince Clang specifically to avoid branches on x86.
On ARM, the best `FastTwoSum` variant is 20% faster than `TwoSum`: 4× the cost of a floating-point addition instead of `TwoSum`'s 5×. On Intel Coffee Lake, the best `FastTwoSum` variant is 40% faster, at 3× the cost of addition, basically no overhead versus `FastTwoSum` with no sort.
SIMD wasn't worth it on any platform.
Anyway, here's my bottom line: `FastTwoSum` is faster than `TwoSum`, even though it has to sort its inputs. If you're using `TwoSum`, consider switching. You can expect speedups of 20% or more; just mind the compiler and make sure it's generating true branchless code.
Finally, plaintive requests: Clang and GCC, please add `__cmov` to `x86intrin.h`. And hardware vendors, please provide a hardware `TwoSum` (or an equivalent, like a three-way add). It should be easy and cheap, and could take the cost of double-double addition down to 2× addition. Pretty please? Happy to chat.
Footnotes:

[^1]: Called "error-free transformations", because they transform `x` and `y` into `s` and `e` without any error in the sum.

[^2]: Under mild assumptions about overflow, and with round-to-nearest-even, this rounding error is a floating-point number.

[^3]: This explanation is courtesy of D.K. Zhang.

[^4]: And I just want to be clear here that I'm using `-O3` and `-mtune=native` and so on.