The Fastest TwoSum on an Apple M1
In my last post, I pointed out that FastTwoSum plus a sort is actually faster than normal TwoSum on modern hardware, and suggested that TwoSum is therefore obsolete. This post expands the analysis and tests further optimizations on the Apple M1 chip.
General approach
The general FastTwoSum algorithm is the following:

```
s = a + b
bb = s - a
e = b - bb
```

This assumes that |a| > |b|, or perhaps a slightly weaker condition.
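For concreteness, here's that three-operation algorithm as a small C function — a standard sketch (the function name is mine):

```c
// FastTwoSum: compute s = fl(a + b) and the exact rounding error e,
// assuming |a| >= |b|. Three serial floating-point operations.
static void fast_two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double bb = *s - a;  // the part of b that made it into s
    *e = b - bb;         // the part of b that rounding discarded
}
```

Note the three serial operations computing `s`, `bb`, and `e`: that's the dependency chain whose latency we're about to count.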
To satisfy this precondition, we need to sort a and b by absolute value, kind of like this:

```
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
a, b = select(cmp, a, b), select(cmp, b, a)
```
I've used multiple-assignment blocks here to clarify dependency chains: FastTwoSum's chain is three floating-point addition/subtraction operations, while the sort's is an abs, a compare, and a select.
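In C, the sort step can be written branch-free with ternaries, which Clang typically lowers to fcmp plus two fcsel instructions — that lowering is an assumption about the compiler, not a guarantee:

```c
#include <math.h>

// Order (a, b) so the argument with larger magnitude comes first.
// Written branch-free in the hope that the compiler emits
// fcmp + fcsel rather than a branch.
static void sort_by_magnitude(double a, double b, double *hi, double *lo) {
    int cmp = fabs(a) > fabs(b);
    *hi = cmp ? a : b;
    *lo = cmp ? b : a;
}
```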
Computing latency
Now let's look at how to optimally execute these. The relevant latencies and units for the various very well-named instructions are:[1]
Instruction | Latency | Units |
fadd/fsub | 3 | 11-14 |
fabs | 2 | 11-14 |
fcmp | 2 | 11 |
fcsel | 2 | 13-14 |
So first of all, the FastTwoSum chain of three additions/subtractions is going to take up 9 cycles of latency, no matter what. In the sort chain, we have six total cycles of latency; both the fabs and the fcsel can be dual-dispatched. So here's one execution option, which takes 15 cycles:
Cycle | FTS   | Sort  |
0     |       | fabs  |
1     |       | fabs  |
2     |       | fcmp  |
3     |       | fcmp  |
4     | fcsel | fcsel |
5     | fcsel | fcsel |
6     | fadd  |       |
7     | fadd  |       |
8     | fadd  |       |
9     | fsub  |       |
10    | fsub  |       |
11    | fsub  |       |
12    | fsub  |       |
13    | fsub  |       |
14    | fsub  |       |
This code roughly corresponds to this high-level code:
```
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
a, b = select(cmp, a, b), select(cmp, b, a)
s = a + b
bb = s - a
e = b - bb
```
Note that I wrote fcsel in both columns, since that's where the results of sorting are moved into variables used by the fast-two-sum approach. I call this the "@0" version because the sort occurs before the 0th add/sub operation.
Of course, this is wasteful: the addition a + b doesn't depend on the order of a and b, meaning that it can be executed in parallel with the sort chain, resulting in this code:

```
s = a + b
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
a, b = select(cmp, a, b), select(cmp, b, a)
bb = s - a
e = b - bb
```
with this runtime:
Cycle | FTS   | Sort  |
0     | fadd  | fabs  |
1     | fadd  | fabs  |
2     | fadd  | fcmp  |
3     |       | fcmp  |
4     | fcsel | fcsel |
5     | fcsel | fcsel |
6     | fsub  |       |
7     | fsub  |       |
8     | fsub  |       |
9     | fsub  |       |
10    | fsub  |       |
11    | fsub  |       |
This "@1" version is 12 cycles of latency, better than the 15 cycles for the "@0" version, and is what I called the "basic sort" version in the previous post, where I in fact measured it to have 12 cycles of latency.
Before I move on, let me note a point of confusion. I implemented both the "@0" and "@1" versions in C and then carefully audited Clang's resulting assembly. The @1 version's assembly faithfully represents the C code, and the measured 12.3 cycles of latency is very close to theory. But Clang does something very clever with the @0 version and rewrites it into the @1 version, presumably through some kind of clever CFG hoisting or rewriting, so the @0 version also has a measured latency of 12 cycles, not the 15 I would expect from the latency tables. Well done, Clang!
Optimizing latency
Anyway, is 12 cycles ideal? No! There's a whole cycle where the FastTwoSum chain is just sitting there doing nothing! So 11 cycles should be possible. We'd need to give the FastTwoSum chain something to do before the sort is done. For example, we could move the first fsub instruction, meaning s - a, before the sort. Of course, the challenge is that we need the sort to know what a is.
Or do we? Here's my idea: what if we speculatively compute both s - a and s - b? Then during the sort step we just pick the correct subtraction to continue working with. Here's what that would look like in the high-level code:
```
s = a + b
aa, bb = s - b, s - a
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
x, xx = select(cmp, b, a), select(cmp, bb, aa)
e = x - xx
```
This "@2" version should run like this:
Cycle | FTS   | Sort  |
0     | fadd  | fabs  |
1     | fadd  | fabs  |
2     | fadd  | fcmp  |
3     | fsub  | fcmp  |
4     | fsub  |       |
5     | fsub  |       |
6     | fcsel | fcsel |
7     | fcsel | fcsel |
8     | fsub  |       |
9     | fsub  |       |
10    | fsub  |       |
I coded this up in C and, unfortunately, it didn't work at all: I still got 12 cycles of latency. Looking at the assembly, though, revealed that Clang had helpfully optimized this version into the "@1" version above, meaning that it just undid our optimization. I tested a few different variations and eventually found one that worked, where by "worked" I mean it compiled as written and achieved a latency of 11 cycles.
So I also tested this even-more-wasteful "@3" version:
```
s = a + b
aa, bb = s - b, s - a
e1, e2 = a - aa, b - bb
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
e = select(cmp, e2, e1)
```
This "@3" version should run like this:
Cycle | FTS   | Sort  |
0     | fadd  | fabs  |
1     | fadd  | fabs  |
2     | fadd  | fcmp  |
3     | fsub  | fcmp  |
4     | fsub  |       |
5     | fsub  |       |
6     | fsub  |       |
7     | fsub  |       |
8     | fsub  |       |
9     | fcsel | fcsel |
10    | fcsel | fcsel |
Clang doesn't butcher this one, and it indeed runs in 11 cycles of latency.[2]
Computing throughput
But wait, let's also think about throughput. The @0 and @1 versions need three add/sub calls, two abs calls, one cmp, and two fcsel calls, for a total of 8 instructions executed. Since these are spread over four ports, the ideal (inverse) throughput is 2 cycles. Then the @2 version adds an extra fsub call, and @3 adds another fsub but removes an fcsel, so their ideal throughput is 2.25 cycles. In fact, I measure 2.3 cycles for the @0 and @1 versions and 2.5 cycles for the @2 and @3 versions.
Normally I'd delve deeper into throughput using uICA. But, annoyingly, there's no uICA/IACA for the Apple M1! So I had to take matters into my own hands and write a tiny micro-architectural simulator (for just these instructions on these ports), and I got the following (inverse) throughputs:[3]
Version | Good port order | Bad port order | Latency |
@0 | 2.0 | 3.8 | 15.0 |
@1 | 2.1 | 3.0 | 12.0 |
@2 | 2.3 | 2.5 | 11.0 |
@3 | 2.3 | 2.8 | 11.0 |
FastTwoSum | 0.8 | 0.8 | 9.0 |
TwoSum | 1.5 | 1.5 | 15.0 |
Here "Good port order" prefers dispatching instructions to unit 12, while "Bad port order" prefers dispatching instructions to unit 11; it's "bad" to do that because then unit 11 is more likely to be busy when the fcmp instruction comes around. The latency is computed by having exactly one instance executing at a time, while the throughputs are computed by allowing up to 20 instances executing at a time (of course, it never gets that high).
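To give a flavor of what such a simulator involves, here's a greatly simplified sketch (my actual simulator differs; the instruction encoding here is made up): each instruction carries a latency, a bitmask of allowed units, and its dependencies, and each cycle we greedily start every instruction whose dependencies have completed on whatever unit is free.

```c
#define MAXI 16

typedef struct {
    int latency;     // cycles until the result is ready
    unsigned units;  // bitmask of allowed units (bit 0 = unit 11, ...)
    int ndeps;
    int deps[2];     // indices of instructions this one waits on
} Insn;

// Greedy scheduler: returns the cycle at which the last result is
// ready. Units are fully pipelined: each can start one op per cycle.
static int simulate(const Insn *prog, int n) {
    int finish[MAXI], scheduled[MAXI] = {0};
    int done = 0, cycle = 0;
    while (done < n) {
        unsigned busy = 0;  // units already issued to this cycle
        for (int i = 0; i < n; i++) {
            if (scheduled[i]) continue;
            int ready = 1;
            for (int d = 0; d < prog[i].ndeps; d++)
                if (!scheduled[prog[i].deps[d]] ||
                    finish[prog[i].deps[d]] > cycle)
                    ready = 0;
            if (!ready) continue;
            unsigned avail = prog[i].units & ~busy;
            if (!avail) continue;       // all allowed units taken
            busy |= avail & -avail;     // claim lowest free allowed unit
            finish[i] = cycle + prog[i].latency;
            scheduled[i] = 1;
            done++;
        }
        cycle++;
    }
    int last = 0;
    for (int i = 0; i < n; i++)
        if (finish[i] > last) last = finish[i];
    return last;
}
```

Fed the bare FastTwoSum chain this reproduces the 9-cycle latency, and fed the @0 dependency graph it reproduces the 15 cycles from the table above.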
Here the computed throughputs are a bit low, which I assume comes down to frontend jitter, loop overhead, and so on. But we really can verify a few things. First, scheduling execution units doesn't seem to be a problem; we reach almost the peak possible throughput of 2.25 (or 2.0 for @0/@1). Thus there's an almost-mechanical increase for the @2 and @3 versions due to one extra instruction. Second, the machine is really executing this code near-optimally, so things like port order don't end up mattering, though I suppose there's a weak preference for @2 over @3 since it seems more resilient to bad scheduling and in practice seems to have minutely lower latency as well. Lastly, TwoSum specifically seems poorly executed, throughput-wise, since I measure an (inverse) throughput of 2.2 cycles instead of 1.5. Not sure what that's about.
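For comparison, the classic branch-free TwoSum being measured here is the standard six-operation algorithm, which needs no precondition on the magnitudes of a and b; a textbook C version:

```c
// Classic TwoSum: 6 add/sub operations, no precondition on the
// relative magnitudes of a and b.
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double bb = *s - a;
    double aa = *s - bb;
    double db = b - bb;
    double da = a - aa;
    *e = da + db;
}
```

Its critical path is five serial operations (s, bb, aa, da, e) at 3 cycles each, which matches the 15-cycle latency in the table.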
By the way, I also tested the E-cores, and got these numbers:
Version | Throughput | Latency |
@0 | 4.0 | 15.0 |
@1 | 4.0 | 12.0 |
@2 | 4.5 | 11.0 |
@3 | 4.5 | 11.0 |
FastTwoSum | 1.5 | 9.0 |
TwoSum | 3.0 | 15.0 |
It's basically the same deal as the P-cores, except that the E-core only has two execution units for FP stuff, so the throughput is halved. But we still see the same general principles, like scheduling not being an issue and the machine executing near-optimally. Clearly my tests were running on the P-cores, so let's ignore the E-cores for the rest of this.
Conclusion
Overall, I think latency is more important than throughput for this routine; maximizing throughput is really hard. I suppose there might be some very clever way to schedule the classic TwoSum to achieve the hypothetical 1.5-cycle throughput, but in reality its throughput is 2.2 cycles, meaning that even the worst "@3" version had throughput not much higher.
Given that objective, the best double-double summation formula for the Apple M1 is the "@2" version:
```c
inline void at2_sort(double a, double b, double* s, double* err) {
    *s = a + b;
    double aa = (*s - b);
    double bb = (*s - a);
    *err = (fabs(b) > fabs(a)) ? a - aa : b - bb;
}
```
The "@3" version is basically equally good, and also seems a bit more resilient to the compilation process, so you might try it, too:
```c
inline void at3_sort(double a, double b, double* s, double* err) {
    *s = a + b;
    double ifa = b - (*s - a);
    double ifb = a - (*s - b);
    *err = (fabs(a) > fabs(b)) ? ifa : ifb;
}
```
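As a quick sanity check with made-up inputs, both routines have the defining property that err captures exactly the bits rounding discarded from a + b (I've added static here so the sketch links as a standalone file):

```c
#include <math.h>

static inline void at2_sort(double a, double b, double *s, double *err) {
    *s = a + b;
    double aa = (*s - b);
    double bb = (*s - a);
    *err = (fabs(b) > fabs(a)) ? a - aa : b - bb;
}
```

For example, 1e16 + 1.0 rounds to 1e16 in double precision, so s comes back as 1e16 and err recovers the lost 1.0 — in either argument order.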
I've updated the C code and uploaded the M1 simulator so you can play with these (and other) variations on your own.
I am now entirely out of ideas for further speeding up TwoSum; I spent quite a while trying to figure out a way to shrink the fcsel to a single cycle, but on AArch64 it doesn't seem possible.
Next time, if there is a next time, we'll look at x86 where (spoiler) compilers seem to make frequent, catastrophic compilation errors so the story is much more complicated.
Footnotes:
These are drawn from Dougall Johnson's tables for the P-cores.
More specifically, the @0 and @1 versions run in 12.3 cycles of latency, while the @2 and @3 versions take 11.3 cycles. The extra 0.3 cycles presumably include memory read/write, loop overhead, and so on; I'm ignoring them.
Am I annoyed? Yes I am.