Simulating Double-Double Addition Algorithms
I just read David's paper on floating-point accumulation networks. It's fantastic work: he's developed a new abstraction of floating-point numbers that allows him to verify error bounds for sequences of TwoSum-like operations, and then used that abstraction to design new, faster double-double addition algorithms.
Since I already had an Apple M1 simulator on hand, I figured I'd give his algorithms a spin.
About my simulator
I guess I was a bit glib when describing my simulator earlier. Let me give a few details. My simulator is a cycle-accurate simulator specifically for the Apple M1 P- and E-cores. Here's how it works.
First is the definition of the cores. These define each supported assembly instruction by giving its 1) latency in cycles and 2) execution ports. For example, here's the P-core:
    PCORE = {
        # instruction: (latency in cycles, usable execution ports)
        "fadd":  (3, [11, 12, 13, 14]),
        "fsub":  (3, [11, 12, 13, 14]),
        "fabs":  (2, [11, 12, 13, 14]),
        "fcmp":  (2, [11]),
        "fcsel": (2, [13, 14]),
    }
Next, you've got to define some assembly code. I have a cute little Python abstraction for this, so you can write "assembly" like so:
    @algorithm
    def TwoSum(code, a, b):
        s = code.fadd(a, b)
        bb = code.fsub(s, a)
        # s is the rounded sum; the second output is the rounding error
        return s, code.fadd(code.fsub(a, code.fsub(s, bb)), code.fsub(b, bb))
Despite the Pythonic appearance, the variables a, b, s, bb, and so on all represent virtual registers, not actual values. The calls to code.foo append instructions to the assembly code, and the variables are just there to manage dependencies in an ergonomic way.
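To make that concrete, here's a minimal sketch of what the abstraction might look like under the hood; the Instruction and Code classes and the input helper are my guesses for illustration, not the actual implementation:

    class Instruction:
        def __init__(self, id, name, args):
            self.id = id      # program counter value; doubles as the output register
            self.name = name  # key into the core definition, e.g. "fadd"
            self.args = args  # IDs of the instructions producing each input

    class Code:
        def __init__(self):
            self.instructions = []

        def _emit(self, name, *args):
            insn = Instruction(len(self.instructions), name, args)
            self.instructions.append(insn)
            return insn.id  # a fresh "virtual register"

        def input(self):
            # routine inputs are pseudo-instructions that are ready immediately
            return self._emit("input")

        def __getattr__(self, name):
            # so code.fadd(a, b), code.fsub(s, a), etc. each append one instruction
            return lambda *args: self._emit(name, *args)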
In this little assembler, here's my definition of "@2", my preferred alternative to TwoSum for the Apple M1:
    @algorithm
    def at2(code, a, b):
        s = code.fadd(a, b)
        aa = code.fsub(s, b)
        bb = code.fsub(s, a)
        # compare magnitudes, then select the matching operand and difference
        cond = code.fcmp(code.fabs(b), code.fabs(a))
        x = code.fcsel(cond, a, b)
        xx = code.fcsel(cond, aa, bb)
        return s, code.fsub(x, xx)
Note that fcsel takes the condition flags as an explicit argument, and fcmp returns the condition flags directly. So this is an abstraction of assembly from the CPU's point of view, not actual assembly code. If any of the instructions decoded into multiple micro-ops or did fusion, I'd have to represent those micro-ops / macro-ops directly, but floating-point code doesn't have that problem, so it's OK.
The return doesn't do anything when benchmarking a single routine, but when you compose these routines it becomes important. For example, here are the ddadd and madd algorithms from David's paper:1 [1 Here add is the obvious wrapper around fadd.]
    @algorithm
    def ddadd(code, x0, y0, x1, y1):
        x0, y0 = TwoSum(code, x0, y0)
        x1, y1 = TwoSum(code, x1, y1)
        y0 = add(code, y0, x1)
        x0, y0 = TwoSum(code, x0, y0)
        y0 = add(code, y0, y1)
        x0, y0 = TwoSum(code, x0, y0)
        return x0, y0

    @algorithm
    def madd(code, x0, y0, x1, y1):
        x0, y0 = TwoSum(code, x0, y0)
        x1, y1 = TwoSum(code, x1, y1)
        x0, x1 = TwoSum(code, x0, x1)
        y0 = add(code, y0, y1)
        y0 = add(code, y0, x1)
        x0, y0 = TwoSum(code, x0, y0)
        return x0, y0
Here you can see how returning registers allows us to compose different routines into larger ones. And let me be clear: there are no "calls" or anything of the sort here. All the routines are inlined; it's basically a high-level assembler.
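For instance, assuming the sketched Code class from earlier (and assuming @algorithm leaves the function directly callable), building one listing for ddadd might look like:

    code = Code()
    args = [code.input() for _ in range(4)]
    hi, lo = ddadd(code, *args)
    # code.instructions now contains every instruction from all four
    # TwoSum calls and both adds, fully inlined, in program order
    print(len(code.instructions))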
Running the code
The simulator itself is pretty straightforward. You simulate one listing at a time, a listing being a single assembly routine.
Inside a listing, each assembly instruction has an ID (a program counter value), a name (which is a key into the CPU definition), and a list of arguments. The arguments are identified by the IDs of the instructions that output them; for simplicity I assume there's no shortage of register names and that each instruction produces a single output, both of which are true here. I'm also not simulating all the various dispatch queues and so on; on the M1 they're pretty wide anyway, so I don't think they affect the results, though maybe under extreme concurrency they would. I don't know, I'm not an architecture guy.
To do the simulation, the simulator first creates N instances of the code, basically independent copies of each instruction with independent arguments. With one instance, it measures latency, while with infinitely many it measures (inverse) throughput; you can also pick a particular value to measure bounded concurrency. In practice, 12-way concurrency is the maximum, because there are four add ports on the P-core and they have 3-cycle latency, so with 12 instances you can keep every port busy on every cycle.
Then, each cycle, the simulator just checks each instruction in each instance. If that instruction's arguments have already been computed, the instruction is dispatched, which means we look up which ports that instruction can execute on that aren't already busy this cycle and pick one. We then record the instruction's completion time, so we can mark it done that many cycles later. Once the final instruction in an instance completes, all of that instance's instructions are marked incomplete again so the instance can be dispatched again.
Then I run 10,000 cycles and count completions. (No credit for partial completions, but at 10k cycles that hardly matters.)
Minor details: instructions are dispatched from back to front (prioritizing completion) and instances are checked in order. Ports are filled in a fixed priority order which I chose by hand; it's basically least-useful ports first. I did a bunch of pretty careful debugging to make sure we don't lose a cycle here or there due to how we handle dependencies. At least for basic stuff, my simulator and an actual CPU give similar latencies and throughputs.
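For illustration, here's a simplified sketch of that simulation loop, reusing the hypothetical Instruction and Code classes from earlier; it hard-codes a first-free port choice instead of my hand-tuned priority order and assumes the listing's last instruction is the one that marks completion:

    def simulate(core, instructions, n_instances, n_cycles=10_000):
        def fresh(cycle):
            # routine inputs are pseudo-instructions, ready immediately
            return [cycle if i.name == "input" else None for i in instructions]
        # done[i][pc] is the cycle where instruction pc of instance i
        # completes, or None if it hasn't been dispatched yet
        done = [fresh(0) for _ in range(n_instances)]
        completions = 0
        for cycle in range(n_cycles):
            for inst in range(n_instances):
                # finished instances are reset so they can run again
                if done[inst][-1] is not None and done[inst][-1] <= cycle:
                    completions += 1
                    done[inst] = fresh(cycle)
            busy = set()  # ports already claimed this cycle
            for inst in range(n_instances):
                # dispatch back to front, prioritizing completion
                for pc in reversed(range(len(instructions))):
                    insn = instructions[pc]
                    if done[inst][pc] is not None:
                        continue  # already dispatched or an input
                    # every argument must have completed by this cycle
                    if not all(done[inst][a] is not None and done[inst][a] <= cycle
                               for a in insn.args):
                        continue
                    latency, ports = core[insn.name]
                    free = [p for p in ports if p not in busy]
                    if free:  # claim a port; record when the result is ready
                        busy.add(free[0])
                        done[inst][pc] = cycle + latency
        return n_cycles / completions  # cycles per completed instance

Under those assumptions, a driver looks something like this, and should roughly reproduce the latency and 12-way throughput entries in the tables below:

    code = Code()
    TwoSum(code, code.input(), code.input())
    print(simulate(PCORE, code.instructions, 1))   # ≈ 15, the latency
    print(simulate(PCORE, code.instructions, 12))  # ≈ 1.5, the inverse throughput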
Results
David's paper highlights a specific concrete problem: adding four floating-point values and returning a double-double result. Using the various abstractions and search procedures he's devised, he is able to show that the standard solution, ddadd, isn't optimal, and that in fact there's a better algorithm called madd.
I timed both in my simulator, using either TwoSum or one of my faster TwoSum-like algorithms @1, @2, and @3.2 [2 @0 is obviously dumb so I don't test that.] I tested with concurrency of 1 through 6 instances, and also 12 instances. First, a reminder of how the TwoSum-like algorithms themselves perform, in cycles per completed instance:
Concurrency | 1 | 2 | 3 | 4 | 5 | 6 | 12 |
---|---|---|---|---|---|---|---|
TwoSum | 15 | 7.5 | 5.0 | 3.75 | 3.00 | 2.50 | 1.50 |
@1 | 12 | 6.0 | 4.0 | 3.00 | 2.57 | 2.32 | 2.06 |
@2 | 11 | 5.5 | 3.7 | 3.00 | 2.64 | 2.36 | 2.28 |
@3 | 11 | 5.5 | 3.7 | 2.93 | 2.59 | 2.40 | 2.32 |
You can see that with one instance, TwoSum takes the expected 15 cycles while my variants take the expected 12 or 11 cycles. But as the concurrency increases, TwoSum catches up. That's because all of my TwoSum-like algorithms use the fcmp instruction, which on the M1 P-core has only one available execution port, so they end up bottlenecked there. For example, the @2 algorithm has a latency of 11 cycles, and since each instance needs the lone fcmp port once, concurrency past 11 literally cannot help. Actually, the fcsel instructions, which have two available ports but appear twice in @2, end up being even more of a bottleneck.
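As a back-of-the-envelope check of my own (not something from David's paper): at saturation only port pressure matters, so total instruction count divided by the four floating-point ports lower-bounds the cycles per instance.

    # TwoSum issues 6 instructions (2 fadd + 4 fsub); @2 issues 9
    # (1 fadd + 3 fsub + 2 fabs + 1 fcmp + 2 fcsel). All of these
    # port slots come out of the same four FP ports.
    print(6 / 4)  # 1.50: exactly TwoSum's measured 12-way number
    print(9 / 4)  # 2.25: close to @2's measured 2.28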
But anyway, 12-way concurrency is insane. I know the M1 has a big lookahead buffer but we're talking hundreds of instructions at once, with no dependencies between them. Not happening. And you can see that my "@" variants are better out through six-way concurrency, which is a huge amount.
Now let's look at ddadd and madd. First, ddadd:
ddadd | 1 | 2 | 3 | 4 | 5 | 6 | 12 |
---|---|---|---|---|---|---|---|
TwoSum | 51 | 25.5 | 17.0 | 13.0 | 10.5 | 8.9 | 6.5 |
@1 | 43 | 21.6 | 14.9 | 11.8 | 10.1 | 9.2 | 8.6 |
@2 | 40 | 20.0 | 14.0 | 11.5 | 10.3 | 9.8 | 9.7 |
@3 | 40 | 20.0 | 14.0 | 11.4 | 10.2 | 9.8 | 9.6 |
Now madd:
madd | 1 | 2 | 3 | 4 | 5 | 6 | 12 |
---|---|---|---|---|---|---|---|
TwoSum | 37 | 18.5 | 12.4 | 9.8 | 8.3 | 7.4 | 6.5 |
@1 | 32 | 16.0 | 11.7 | 9.9 | 9.2 | 8.8 | 8.6 |
@2 | 30 | 16.0 | 12.0 | 10.5 | 9.9 | 9.6 | 9.6 |
@3 | 30 | 15.0 | 11.6 | 10.3 | 9.8 | 9.6 | 9.6 |
Look, I know you didn't read these data tables carefully, so let me share the high-level trends here:
- madd is always faster than ddadd, but the difference shrinks as you add more concurrency. This makes sense: both perform the same number of TwoSum-like operations; it's just a question of dependencies. Once you have enough concurrency, the dependencies stop mattering.
- My TwoSum-like algorithms shine with low concurrency. The "@3" variant seems best, even though the "@2" variant is slightly preferred solo. That's because both ddadd and madd allow for a substantial amount of concurrency within a single instance.
- But at high concurrency, TwoSum is best. I assume that's because with very high concurrency what really matters is just port contention, and since TwoSum has both fewer instructions and only the most flexible add/subtract instructions, it wins here.
So we have kind of two quadrants. To minimize latency you want madd with @3; to maximize throughput you want TwoSum with either algorithm.
But is there value to mixing multiple TwoSum implementations within a single madd or ddadd call? What about bigger FPANs? And, by the way, what changes for E-cores? (A lot, in my tests.) I guess that has to wait for next time, or at this point maybe a proper paper.