Simulating Double-Double Addition Algorithms
I just read David's paper on floating-point accumulation networks. It's fantastic work: he's developed a new abstraction of floating-point numbers that allows him to verify error bounds for sequences of TwoSum-like operations, and then used that abstraction to design new, faster double-double addition algorithms.
Since I already had an Apple M1 simulator on hand, I figured I'd give his algorithms a spin.
About my simulator
I guess I was a bit glib when describing my simulator earlier. Let me give a few details. My simulator is a cycle-accurate simulator specifically for the Apple M1 P- and E-cores. Here's how it works.
First is the definition of the cores. These define each supported assembly instruction by giving its 1) latency in cycles and 2) execution ports. For example, here's the P-core:
    PCORE = {
        # instruction: (latency in cycles, usable execution ports)
        "fadd":  (3, [11, 12, 13, 14]),
        "fsub":  (3, [11, 12, 13, 14]),
        "fabs":  (2, [11, 12, 13, 14]),
        "fcmp":  (2, [11]),
        "fcsel": (2, [13, 14]),
    }
Next, you've got to define some assembly code. I have a cute little Python abstraction for this, so you can write "assembly" like so:
    @algorithm
    def TwoSum(code, a, b):
        s = code.fadd(a, b)
        bb = code.fsub(s, a)
        # s is the rounded sum; the second output is the rounding error
        return s, code.fadd(code.fsub(a, code.fsub(s, bb)), code.fsub(b, bb))
Despite the Pythonic appearance, the variables a, b, s, bb, and so on all represent virtual registers, not actual values. The calls to code.foo append instructions to the assembly code, and the variables are just there to manage dependencies in an ergonomic way.
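To make that concrete, here's a minimal sketch of what the abstraction might look like under the hood; the Instruction and Code classes and the input helper are my guesses for illustration, not the actual implementation:

    class Instruction:
        def __init__(self, id, name, args):
            self.id = id      # program counter value; doubles as the output register
            self.name = name  # key into the core definition, e.g. "fadd"
            self.args = args  # IDs of the instructions producing each input

    class Code:
        def __init__(self):
            self.instructions = []

        def _emit(self, name, *args):
            insn = Instruction(len(self.instructions), name, args)
            self.instructions.append(insn)
            return insn.id  # a fresh "virtual register"

        def input(self):
            # routine inputs are pseudo-instructions that are ready immediately
            return self._emit("input")

        def __getattr__(self, name):
            # so code.fadd(a, b), code.fsub(s, a), etc. each append one instruction
            return lambda *args: self._emit(name, *args)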
In this little assembler, here's my definition of "@2", my preferred alternative to TwoSum for the Apple M1:
    @algorithm
    def at2(code, a, b):
        s = code.fadd(a, b)
        aa = code.fsub(s, b)
        bb = code.fsub(s, a)
        # compare magnitudes, then select the matching operand and difference
        cond = code.fcmp(code.fabs(b), code.fabs(a))
        x = code.fcsel(cond, a, b)
        xx = code.fcsel(cond, aa, bb)
        return s, code.fsub(x, xx)
Note that fcsel takes the condition flags as an explicit argument, and fcmp returns the condition flags directly. So this is an abstraction of assembly from the CPU's point of view, not actual assembly code. If any of the instructions decoded into multiple micro-ops or did fusion, I'd have to represent those micro-ops / macro-ops directly, but floating-point code doesn't have that problem, so it's OK.
The return doesn't do anything when benchmarking a single routine, but when you compose these routines it becomes important. For example, here are the ddadd and madd algorithms from David's paper:1 [1 Here add is the obvious wrapper around fadd.]
    @algorithm
    def ddadd(code, x0, y0, x1, y1):
        x0, y0 = TwoSum(code, x0, y0)
        x1, y1 = TwoSum(code, x1, y1)
        y0 = add(code, y0, x1)
        x0, y0 = TwoSum(code, x0, y0)
        y0 = add(code, y0, y1)
        x0, y0 = TwoSum(code, x0, y0)
        return x0, y0

    @algorithm
    def madd(code, x0, y0, x1, y1):
        x0, y0 = TwoSum(code, x0, y0)
        x1, y1 = TwoSum(code, x1, y1)
        x0, x1 = TwoSum(code, x0, x1)
        y0 = add(code, y0, y1)
        y0 = add(code, y0, x1)
        x0, y0 = TwoSum(code, x0, y0)
        return x0, y0
Here you can see how returning registers allows us to compose different routines into larger ones. And let me be clear: there are no "calls" or anything of the sort here. All the routines are inlined; it's basically a high-level assembler.
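For instance, assuming the sketched Code class from earlier (and assuming @algorithm leaves the function directly callable), building one listing for ddadd might look like:

    code = Code()
    args = [code.input() for _ in range(4)]
    hi, lo = ddadd(code, *args)
    # code.instructions now contains every instruction from all four
    # TwoSum calls and both adds, fully inlined, in program order
    print(len(code.instructions))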
Running the code
The simulator itself is pretty straightforward. You simulate one listing at a time, a listing being a single assembly routine.
Inside a listing, each assembly instruction has an ID (a program counter value), a name (which is a key into the CPU definition), and a list of arguments. The arguments are identified by the IDs of the instructions that output them; for simplicity I assume there's no shortage of register names and that each instruction produces a single output, both of which are true here. I'm also not simulating all the various dispatch queues and so on; on the M1 they're pretty wide anyway, so I don't think they affect the results, though maybe under extreme concurrency they would. I don't know, I'm not an architecture guy.
To do the simulation, the simulator first creates N instances of the code, basically independent copies of each instruction with independent arguments. With one instance, it measures latency, while with infinitely many it measures (inverse) throughput; you can also pick a particular value to measure bounded concurrency. In practice, 12-way concurrency is the maximum, because there are four add ports on the P-core and they have 3-cycle latency, so with 12 instances you can keep every port busy on every cycle.
Then, each cycle, the simulator just checks each instruction in each instance. If that instruction's arguments have already been computed, the instruction is dispatched, which means we look up which ports that instruction can execute on that aren't already busy this cycle and pick one. We then record the instruction's completion time, so we can mark it done that many cycles later. Once the final instruction in an instance completes, all of that instance's instructions are marked incomplete again so the instance can be dispatched again.
Then I run 10,000 cycles and count completions. (No credit for partial completions, but at 10k cycles that hardly matters.)
Minor details: instructions are dispatched from back to front (prioritizing completion) and instances are checked in order. Ports are filled in a fixed priority order which I chose by hand; it's basically least-useful ports first. I did a bunch of pretty careful debugging to make sure we don't lose a cycle here or there due to how we handle dependencies. At least for basic stuff, my simulator and an actual CPU give similar latencies and throughputs.
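For illustration, here's a simplified sketch of that simulation loop, reusing the hypothetical Instruction and Code classes from earlier; it hard-codes a first-free port choice instead of my hand-tuned priority order and assumes the listing's last instruction is the one that marks completion:

    def simulate(core, instructions, n_instances, n_cycles=10_000):
        def fresh(cycle):
            # routine inputs are pseudo-instructions, ready immediately
            return [cycle if i.name == "input" else None for i in instructions]
        # done[i][pc] is the cycle where instruction pc of instance i
        # completes, or None if it hasn't been dispatched yet
        done = [fresh(0) for _ in range(n_instances)]
        completions = 0
        for cycle in range(n_cycles):
            for inst in range(n_instances):
                # finished instances are reset so they can run again
                if done[inst][-1] is not None and done[inst][-1] <= cycle:
                    completions += 1
                    done[inst] = fresh(cycle)
            busy = set()  # ports already claimed this cycle
            for inst in range(n_instances):
                # dispatch back to front, prioritizing completion
                for pc in reversed(range(len(instructions))):
                    insn = instructions[pc]
                    if done[inst][pc] is not None:
                        continue  # already dispatched or an input
                    # every argument must have completed by this cycle
                    if not all(done[inst][a] is not None and done[inst][a] <= cycle
                               for a in insn.args):
                        continue
                    latency, ports = core[insn.name]
                    free = [p for p in ports if p not in busy]
                    if free:  # claim a port; record when the result is ready
                        busy.add(free[0])
                        done[inst][pc] = cycle + latency
        return n_cycles / completions  # cycles per completed instance

Under those assumptions, a driver looks something like this, and should roughly reproduce the latency and 12-way throughput entries in the tables below:

    code = Code()
    TwoSum(code, code.input(), code.input())
    print(simulate(PCORE, code.instructions, 1))   # ≈ 15, the latency
    print(simulate(PCORE, code.instructions, 12))  # ≈ 1.5, the inverse throughput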
Results
David's paper highlights a specific concrete problem: adding four floating-point values and returning a double-double result. Using the various abstractions and search procedures he's devised, he is able to show that the standard solution, ddadd, isn't optimal, and that in fact there's a better algorithm called madd.
I timed both in my simulator, using either TwoSum or one of my faster TwoSum-like algorithms @1, @2, and @3.2 [2 @0 is obviously dumb so I don't test that.] I tested with concurrency of 1 through 6 instances, and also 12 instances. First, a reminder of how the TwoSum-like algorithms themselves perform, in cycles per completed instance:
Concurrency | 1 | 2 | 3 | 4 | 5 | 6 | 12 |
---|---|---|---|---|---|---|---|
TwoSum | 15 | 7.5 | 5.0 | 3.75 | 3.00 | 2.50 | 1.50 |
@1 | 12 | 6.0 | 4.0 | 3.00 | 2.57 | 2.32 | 2.06 |
@2 | 11 | 5.5 | 3.7 | 3.00 | 2.64 | 2.36 | 2.28 |
@3 | 11 | 5.5 | 3.7 | 2.93 | 2.59 | 2.40 | 2.32 |
You can see that with one instance, TwoSum takes the expected 15 cycles while my variants take the expected 12 or 11 cycles. But as the concurrency increases, TwoSum catches up. That's because all of my TwoSum-like algorithms use the fcmp instruction, which on the M1 P-core has only one available execution port, so they end up bottlenecked there. For example, the @2 algorithm has a latency of 11 cycles, and since each instance needs the lone fcmp port once, concurrency past 11 literally cannot help. Actually, the fcsel instructions, which have two available ports but appear twice in @2, end up being even more of a bottleneck.
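As a back-of-the-envelope check of my own (not something from David's paper): at saturation only port pressure matters, so total instruction count divided by the four floating-point ports lower-bounds the cycles per instance.

    # TwoSum issues 6 instructions (2 fadd + 4 fsub); @2 issues 9
    # (1 fadd + 3 fsub + 2 fabs + 1 fcmp + 2 fcsel). All of these
    # port slots come out of the same four FP ports.
    print(6 / 4)  # 1.50: exactly TwoSum's measured 12-way number
    print(9 / 4)  # 2.25: close to @2's measured 2.28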
But anyway, 12-way concurrency is insane. I know the M1 has a big lookahead buffer but we're talking hundreds of instructions at once, with no dependencies between them. Not happening. And you can see that my "@" variants are better out through six-way concurrency, which is a huge amount.
Now let's look at ddadd and madd. First, ddadd:
ddadd | 1 | 2 | 3 | 4 | 5 | 6 | 12 |
---|---|---|---|---|---|---|---|
TwoSum | 51 | 25.5 | 17.0 | 13.0 | 10.5 | 8.9 | 6.5 |
@1 | 43 | 21.6 | 14.9 | 11.8 | 10.1 | 9.2 | 8.6 |
@2 | 40 | 20.0 | 14.0 | 11.5 | 10.3 | 9.8 | 9.7 |
@3 | 40 | 20.0 | 14.0 | 11.4 | 10.2 | 9.8 | 9.6 |
Now madd:
madd | 1 | 2 | 3 | 4 | 5 | 6 | 12 |
---|---|---|---|---|---|---|---|
TwoSum | 37 | 18.5 | 12.4 | 9.8 | 8.3 | 7.4 | 6.5 |
@1 | 32 | 16.0 | 11.7 | 9.9 | 9.2 | 8.8 | 8.6 |
@2 | 30 | 16.0 | 12.0 | 10.5 | 9.9 | 9.6 | 9.6 |
@3 | 30 | 15.0 | 11.6 | 10.3 | 9.8 | 9.6 | 9.6 |
Look, I know you didn't read these data tables carefully, so let me share the high-level trends here:
- madd is always faster than ddadd, but the difference shrinks as you add more concurrency. This makes sense: both perform the same number of TwoSum-like operations; it's just a question of dependencies. Once you have enough concurrency, the dependencies stop mattering.
- My TwoSum-like algorithms shine with low concurrency. The "@3" variant seems best, even though the "@2" variant is slightly preferred solo. That's because both ddadd and madd allow for a substantial amount of concurrency within a single instance.
- But at high concurrency, TwoSum is best. I assume that's because with very high concurrency what really matters is just port contention, and since TwoSum has both fewer instructions and only the most flexible add/subtract instructions, it wins here.
So we have kind of two quadrants. To minimize latency you want madd with @3; to maximize throughput you want TwoSum with either algorithm.
But is there value to mixing multiple TwoSum implementations within a single madd or ddadd call? What about bigger FPANs? And, by the way, what changes for E-cores? (A lot, in my tests.) I guess that has to wait for next time, or at this point maybe a proper paper.