The Fastest TwoSum on an Apple M1
In my last post, I pointed out that FastTwoSum plus a sort is actually faster than normal TwoSum on modern hardware, and suggested that TwoSum is therefore obsolete. This post expands the analysis and tests further optimizations on the Apple M1 chip.
General approach
The general FastTwoSum algorithm is the following:

```
s = a + b
bb = s - a
e = b - bb
```

This assumes that |a| > |b|, or perhaps a slightly weaker condition.
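For concreteness, here's that three-operation algorithm as a small C function — a standard sketch (the function name is mine):

```c
// FastTwoSum: compute s = fl(a + b) and the exact rounding error e,
// assuming |a| >= |b|. Three serial floating-point operations.
static void fast_two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double bb = *s - a;  // the part of b that made it into s
    *e = b - bb;         // the part of b that rounding discarded
}
```

Note the three serial operations computing `s`, `bb`, and `e`: that's the dependency chain whose latency we're about to count.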
To satisfy this precondition, we need to sort a and b by absolute value, kind of like this:

```
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
a, b = select(cmp, a, b), select(cmp, b, a)
```
I've used multiple-assignment blocks here to clarify dependency chains: FastTwoSum's chain is three floating-point addition/subtraction operations, while the sort's is an abs, a compare, and a select.
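In C, the sort step can be written branch-free with ternaries, which Clang typically lowers to fcmp plus two fcsel instructions — that lowering is an assumption about the compiler, not a guarantee:

```c
#include <math.h>

// Order (a, b) so the argument with larger magnitude comes first.
// Written branch-free in the hope that the compiler emits
// fcmp + fcsel rather than a branch.
static void sort_by_magnitude(double a, double b, double *hi, double *lo) {
    int cmp = fabs(a) > fabs(b);
    *hi = cmp ? a : b;
    *lo = cmp ? b : a;
}
```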
Computing latency
Now let's look at how to optimally execute these. The relevant latencies and units for the various very well-named instructions are:[1]
Instruction | Latency | Units |
fadd/fsub | 3 | 11-14 |
fabs | 2 | 11-14 |
fcmp | 2 | 11 |
fcsel | 2 | 13-14 |
So first of all, the FastTwoSum chain of three additions/subtractions is going to take up 9 cycles of latency, no matter what. In the sort chain, we have six total cycles of latency; both the fabs and the fcsel can be dual-dispatched. So here's one execution option, which takes 15 cycles:
Cycle | FTS   | Sort  |
0     |       | fabs  |
1     |       | fabs  |
2     |       | fcmp  |
3     |       | fcmp  |
4     | fcsel | fcsel |
5     | fcsel | fcsel |
6     | fadd  |       |
7     | fadd  |       |
8     | fadd  |       |
9     | fsub  |       |
10    | fsub  |       |
11    | fsub  |       |
12    | fsub  |       |
13    | fsub  |       |
14    | fsub  |       |
This code roughly corresponds to this high-level code:
```
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
a, b = select(cmp, a, b), select(cmp, b, a)
s = a + b
bb = s - a
e = b - bb
```
Note that I wrote fcsel in both columns, since that's where the results of sorting are moved into variables used by the fast-two-sum approach. I call this the "@0" version because the sort occurs before the 0th add/sub operation.
Of course, this is wasteful: the addition a + b doesn't depend on the order of a and b, meaning that it can be executed in parallel with the sort chain, resulting in this code:

```
s = a + b
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
a, b = select(cmp, a, b), select(cmp, b, a)
bb = s - a
e = b - bb
```
with this runtime:
Cycle | FTS   | Sort  |
0     | fadd  | fabs  |
1     | fadd  | fabs  |
2     | fadd  | fcmp  |
3     |       | fcmp  |
4     | fcsel | fcsel |
5     | fcsel | fcsel |
6     | fsub  |       |
7     | fsub  |       |
8     | fsub  |       |
9     | fsub  |       |
10    | fsub  |       |
11    | fsub  |       |
This "@1" version is 12 cycles of latency, better than the 15 cycles for the "@0" version, and is what I called the "basic sort" version in the previous post, where I in fact measured it to have 12 cycles of latency.
Before I move on, let me note a point of confusion. I implemented both the "@0" and "@1" versions in C and then carefully audited Clang's resulting assembly. The @1 version's assembly faithfully represents the C code, and the measured 12.3 cycles of latency is very close to theory. But Clang does something very clever with the @0 version and rewrites it into the @1 version, presumably through some kind of clever CFG hoisting or rewriting, so the @0 version also has a measured latency of 12 cycles, not the 15 I would expect from the latency tables. Well done, Clang!
Optimizing latency
Anyway, is 12 cycles ideal? No! There's a whole cycle where the FastTwoSum chain is just sitting there doing nothing! So 11 cycles should be possible. We'd need to give the FastTwoSum chain something to do before the sort is done. For example, we could move the first fsub instruction, meaning s - a, before the sort. Of course, the challenge is that we need the sort to know what a is.
Or do we? Here's my idea: what if we speculatively compute both s - a and s - b? Then during the sort step we just pick the correct subtraction to continue working with. Here's what that would look like in the high-level code:
```
s = a + b
aa, bb = s - b, s - a
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
x, xx = select(cmp, b, a), select(cmp, bb, aa)
e = x - xx
```
This "@2" version should run like this:
Cycle | FTS   | Sort  |
0     | fadd  | fabs  |
1     | fadd  | fabs  |
2     | fadd  | fcmp  |
3     | fsub  | fcmp  |
4     | fsub  |       |
5     | fsub  |       |
6     | fcsel | fcsel |
7     | fcsel | fcsel |
8     | fsub  |       |
9     | fsub  |       |
10    | fsub  |       |
I coded this up in C and, unfortunately, it didn't work at all: I still got 12 cycles of latency. Looking at the assembly, though, revealed that Clang had helpfully optimized this version into the "@1" version above, meaning that it just undid our optimization. I tested a few different variations and eventually found one that worked, where by "worked" I mean it compiled as written and achieved a latency of 11 cycles.
So I also tested this even-more-wasteful "@3" version:
```
s = a + b
aa, bb = s - b, s - a
e1, e2 = a - aa, b - bb
ap, bp = fabs(a), fabs(b)
cmp = ap > bp
e = select(cmp, e2, e1)
```
This "@3" version should run like this:
Cycle | FTS   | Sort  |
0     | fadd  | fabs  |
1     | fadd  | fabs  |
2     | fadd  | fcmp  |
3     | fsub  | fcmp  |
4     | fsub  |       |
5     | fsub  |       |
6     | fsub  |       |
7     | fsub  |       |
8     | fsub  |       |
9     | fcsel | fcsel |
10    | fcsel | fcsel |
Clang doesn't butcher this one, and it indeed runs in 11 cycles of latency.[2]
Computing throughput
But wait, let's also think about throughput. The @0 and @1 versions need three add/sub calls, two abs calls, one cmp, and two fcsel calls, for a total of 8 instructions executed. Since these are spread over four ports, the ideal (inverse) throughput is 2 cycles. Then the @2 version adds an extra fsub call, and @3 adds another fsub but removes an fcsel, so their ideal throughput is 2.25 cycles. In fact, I measure 2.3 cycles for the @0 and @1 versions and 2.5 cycles for the @2 and @3 versions.
Normally I'd delve deeper into throughput using uICA. But, annoyingly, there's no uICA/IACA for the Apple M1! So I had to take matters into my own hands and write a tiny micro-architectural simulator (for just these instructions on these ports), and I got the following (inverse) throughputs:[3]
Version | Good port order | Bad port order | Latency |
@0 | 2.0 | 3.8 | 15.0 |
@1 | 2.1 | 3.0 | 12.0 |
@2 | 2.3 | 2.5 | 11.0 |
@3 | 2.3 | 2.8 | 11.0 |
FastTwoSum | 0.8 | 0.8 | 9.0 |
TwoSum | 1.5 | 1.5 | 15.0 |
Here "Good port order" prefers dispatching instructions to unit 12, while "Bad port order" prefers dispatching instructions to unit 11; it's "bad" to do that because then unit 11 is more likely to be busy when the fcmp instruction comes around. The latency is computed by having exactly one instance executing at a time, while the throughputs are computed by allowing up to 20 instances executing at a time (of course, it never gets that high).
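To give a flavor of what such a simulator involves, here's a greatly simplified sketch (my actual simulator differs; the instruction encoding here is made up): each instruction carries a latency, a bitmask of allowed units, and its dependencies, and each cycle we greedily start every instruction whose dependencies have completed on whatever unit is free.

```c
#define MAXI 16

typedef struct {
    int latency;     // cycles until the result is ready
    unsigned units;  // bitmask of allowed units (bit 0 = unit 11, ...)
    int ndeps;
    int deps[2];     // indices of instructions this one waits on
} Insn;

// Greedy scheduler: returns the cycle at which the last result is
// ready. Units are fully pipelined: each can start one op per cycle.
static int simulate(const Insn *prog, int n) {
    int finish[MAXI], scheduled[MAXI] = {0};
    int done = 0, cycle = 0;
    while (done < n) {
        unsigned busy = 0;  // units already issued to this cycle
        for (int i = 0; i < n; i++) {
            if (scheduled[i]) continue;
            int ready = 1;
            for (int d = 0; d < prog[i].ndeps; d++)
                if (!scheduled[prog[i].deps[d]] ||
                    finish[prog[i].deps[d]] > cycle)
                    ready = 0;
            if (!ready) continue;
            unsigned avail = prog[i].units & ~busy;
            if (!avail) continue;       // all allowed units taken
            busy |= avail & -avail;     // claim lowest free allowed unit
            finish[i] = cycle + prog[i].latency;
            scheduled[i] = 1;
            done++;
        }
        cycle++;
    }
    int last = 0;
    for (int i = 0; i < n; i++)
        if (finish[i] > last) last = finish[i];
    return last;
}
```

Fed the bare FastTwoSum chain this reproduces the 9-cycle latency, and fed the @0 dependency graph it reproduces the 15 cycles from the table above.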
Here the computed throughputs are a bit low, which I assume comes down to frontend jitter, loop overhead, and so on. But we really can verify a few things. First, scheduling execution units doesn't seem to be a problem; we reach almost the peak possible throughput of 2.25 (or 2.0 for @0/@1). Thus there's an almost-mechanical increase for the @2 and @3 versions due to one extra instruction. Second, the machine is really executing this code near-optimally, so things like port order don't end up mattering, though I suppose there's a weak preference for @2 over @3 since it seems more resilient to bad scheduling and in practice seems to have minutely lower latency as well. Lastly, TwoSum specifically seems poorly executed, throughput-wise, since I measure an (inverse) throughput of 2.2 cycles instead of 1.5. Not sure what that's about.
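For comparison, the classic branch-free TwoSum being measured here is the standard six-operation algorithm, which needs no precondition on the magnitudes of a and b; a textbook C version:

```c
// Classic TwoSum: 6 add/sub operations, no precondition on the
// relative magnitudes of a and b.
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double bb = *s - a;
    double aa = *s - bb;
    double db = b - bb;
    double da = a - aa;
    *e = da + db;
}
```

Its critical path is five serial operations (s, bb, aa, da, e) at 3 cycles each, which matches the 15-cycle latency in the table.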
By the way, I also tested the E-cores, and got these numbers:
Version | Throughput | Latency |
@0 | 4.0 | 15.0 |
@1 | 4.0 | 12.0 |
@2 | 4.5 | 11.0 |
@3 | 4.5 | 11.0 |
FastTwoSum | 1.5 | 9.0 |
TwoSum | 3.0 | 15.0 |
It's basically the same deal as the P-cores, except that the E-core only has two execution units for FP stuff, so the throughput is halved. But we still see the same general principles, like scheduling not being an issue and the machine executing near-optimally. Clearly my tests were running on the P-cores, so let's ignore the E-cores for the rest of this.
Conclusion
Overall, I think latency is more important than throughput for this routine; maximizing throughput is really hard. I suppose there might be some very clever way to schedule the classic TwoSum to achieve the hypothetical 1.5-cycle throughput, but in reality its throughput is 2.2 cycles, meaning that even the worst "@3" version had throughput not much higher.
Given that objective, the best double-double summation formula for the Apple M1 is the "@2" version:
```c
inline void at2_sort(double a, double b, double* s, double* err) {
    *s = a + b;
    double aa = (*s - b);
    double bb = (*s - a);
    *err = (fabs(b) > fabs(a)) ? a - aa : b - bb;
}
```
The "@3" version is basically equally good, and also seems a bit more resilient to the compilation process, so you might try it, too:
```c
inline void at3_sort(double a, double b, double* s, double* err) {
    *s = a + b;
    double ifa = b - (*s - a);
    double ifb = a - (*s - b);
    *err = (fabs(a) > fabs(b)) ? ifa : ifb;
}
```
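As a quick sanity check with made-up inputs, both routines have the defining property that err captures exactly the bits rounding discarded from a + b (I've added static here so the sketch links as a standalone file):

```c
#include <math.h>

static inline void at2_sort(double a, double b, double *s, double *err) {
    *s = a + b;
    double aa = (*s - b);
    double bb = (*s - a);
    *err = (fabs(b) > fabs(a)) ? a - aa : b - bb;
}
```

For example, 1e16 + 1.0 rounds to 1e16 in double precision, so s comes back as 1e16 and err recovers the lost 1.0 — in either argument order.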
I've updated the C code and uploaded the M1 simulator so you can play with these (and other) variations on your own.
I am now entirely out of ideas for further speeding up TwoSum; I spent quite a while trying to figure out a way to shrink the fcsel to a single cycle, but on AArch64 it doesn't seem possible.
Next time, if there is a next time, we'll look at x86 where (spoiler) compilers seem to make frequent, catastrophic compilation errors so the story is much more complicated.
Footnotes:
These are drawn from Dougall Johnson's tables for the P-cores.
More specifically, the @0 and @1 versions run in 12.3 cycles of latency, while the @2 and @3 versions take 11.3 cycles. The extra 0.3 cycles presumably include memory read/write, loop overhead, and so on; I'm ignoring them.
Am I annoyed? Yes I am.