Testing My M1 Simulator
Ok, here's the story so far. Way back when, I wanted to test whether sort + FastTwoSum is faster than TwoSum. But as I started optimizing it, I kept getting frustrated with how hard it is to benchmark low-level code, and eventually I wrote a simulator for my CPU, the Apple M1 chip. At this point, I was in too deep and wanted to keep improving the simulator.
Compiling to Assembly
As of the last blog post, I'd added a little DAG-building library so that I could easily write little assembly snippets and time them in my simulator. The simulator works great, so nothing there needed changing, and the DAG-builder was fine to work with too, but one thing I really wanted to know was how trustworthy the numbers coming out of that simulator really were. So I hatched a plan: compile my simulated assembly snippets to real assembly, and actually run them.
Internally, assembly code in my simulator is represented like this:
CODE = [ [0, "fabs", [-1]], [1, "fabs", [-2]], [2, "fcmp", [0, 1]], ... ]
In other words, it is a list of instructions, and each instruction has an output register (that's the number on the left) and a set of input registers (on the right). By convention, negative numbers are input arguments. (Actually, I initially didn't have input arguments at all, but then decided to add them.) Here I'm assuming an infinite number of registers, and I never reuse register names. That's because I'm mostly simulating the CPU back-end, so I'm assuming all the registers have already gone through register renaming.
Abstractly, compiling to assembly should be easy: pick actual registers for each register name, and then output each instruction. Since all of my code snippets are short, and since ARM has 32 floating-point registers, that should be easy. The loop basically looks like this:
    argstr = ", ".join(f"d{assignment[arg]}" for arg in args)
    print(f"    {op} d{assignment[out]}, {argstr}", file=fd)
Handling flags registers
One issue, however, was instructions that modify the flags, like fcmp. In my simulator, I represent the flags register as a direct output of the fcmp instruction (I should probably also represent it as an input, now that I think about it). But in actual assembly, it's implicit. My fix for this was special-casing fcmp in the assembly generator. I also don't need to assign an ISA register to hold that "output", so I modified my DAG-builder to record which registers are flags registers (that is, which registers are outputs from fcmp) so I can skip assigning them.
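Putting those two pieces together, the emit loop ends up looking roughly like this. To be clear, this is a sketch rather than my exact generator: the emit name and the flags argument are just for illustration, and I'm ignoring fcsel's condition-code operand. But it shows the idea: fcmp gets no destination operand, and flags registers never show up as operands at all.

    def emit(code, assignment, flags, fd):
        # code is the [out, op, args] list from above; assignment maps virtual
        # registers to ISA register numbers; flags is the set of flags registers.
        for out, op, args in code:
            # Flags registers are implicit in real assembly, so drop them here.
            operands = ", ".join(f"d{assignment[a]}" for a in args if a not in flags)
            if op == "fcmp":
                # fcmp's only "output" is the flags register, so no destination.
                print(f"    fcmp {operands}", file=fd)
            else:
                print(f"    {op} d{assignment[out]}, {operands}", file=fd)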
This also fixed a bit of a sharp edge in the DAG builder, where your assembly could act as if multiple flags registers are active at once:
CODE = [ [0, "fcmp", [-2, -1]], [1, "fcmp", [-3, -4]], [2, "fcsel", [0, -2, -3]] ... ]
This makes conceptual sense, since the CPU can allocate multiple flags registers during out-of-order execution, but you can't actually write assembly this way. So my DAG-builder now keeps track of the active flags register and doesn't let you use it out of order:
    class Assembler:
        def push_instruction(self, name, args):
            # ...
            if name == "fcmp":
                self.flags.add(out)
                self.curflags = out
            elif name == "fcsel":
                assert args[0] in self.flags, "fcsel argument is not a flags register"
                assert args[0] == self.curflags, "fcsel argument is not current flags"
            # ...
Doing register allocation
Now I have a whole bunch of abstract registers, defined via the DAG-builder, and I need to assign each an ISA register, d0 through d31. There are a lot of register allocation algorithms out there, but luckily I only needed to do this for extremely short assembly snippets, maybe a few dozen instructions long. So I figured I'd skip the complex algorithms, and also gain optimality, by using an ILP solver to do this.
I chose PuLP, mostly because it has a really nice Python API. The basic ILP setup is simple. We create \(O(M N)\) boolean variables, one for each combination of virtual register and ISA register. Each virtual register goes in exactly one ISA register. But if two virtual registers are live at the same time, then they have to go into different ISA registers.
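Here's roughly what that setup looks like in PuLP. This is a minimal sketch, not my exact code: I'm assuming virtual_regs is the list of virtual registers that need an ISA register and conflicts is the set of pairs that are live at the same time (computed from the live ranges described next).

    import pulp

    def allocate(virtual_regs, conflicts, num_isa=32):
        # Pure feasibility problem: any assignment satisfying the constraints works.
        prob = pulp.LpProblem("regalloc", pulp.LpMinimize)
        # One boolean variable per (virtual register, ISA register) pair.
        assign = {(v, r): pulp.LpVariable(f"assign_{i}_{r}", cat="Binary")
                  for i, v in enumerate(virtual_regs) for r in range(num_isa)}
        # Each virtual register goes in exactly one ISA register.
        for v in virtual_regs:
            prob += pulp.lpSum(assign[v, r] for r in range(num_isa)) == 1
        # Simultaneously-live virtual registers must get different ISA registers.
        for v1, v2 in conflicts:
            for r in range(num_isa):
                prob += assign[v1, r] + assign[v2, r] <= 1
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return {v: next(r for r in range(num_isa) if assign[v, r].value() > 0.5)
                for v in virtual_regs}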
To figure out the live region for a virtual register, I just search for the last instruction that reads from it. That's the end of the live region. Each virtual register is written to exactly once, and I name the register after the instruction that writes to it, so that's the start. Then I just check for overlaps to determine conflicts. Flags registers aren't considered.
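And here's a sketch of that conflict computation under the same assumptions (the helper name is mine; flags is the set of flags registers):

    def conflicts_for(code, flags):
        # Each virtual register is written once, by the instruction that names it;
        # its live range runs from that write to the last instruction that reads it.
        live = {}
        for i, (out, op, args) in enumerate(code):
            live[out] = (i, i)
            for a in args:
                start = live.get(a, (-1, i))[0]  # input arguments are live from the start
                live[a] = (start, i)
        regs = [r for r in live if r not in flags]  # flags registers never get an ISA register
        pairs = set()
        for a in regs:
            for b in regs:
                # a < b also skips pairing a register with itself (the source
                # of the "unsat" bug mentioned below).
                if a < b and not (live[a][1] < live[b][0] or live[b][1] < live[a][0]):
                    pairs.add((a, b))
        return pairs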
Funny bug: I accidentally kept getting "unsat" because I forgot to
ignore the conflict between a register and itself. ChatGPT o4-mini
found the bug for me.
The register allocations usually take under a second. I think this became especially important once I started doing concurrency, because then I really could start using up most of the available registers.
Calling conventions and loop nests
I did a standard calling convention for floating-point stuff, saving the eight callee-saved registers to the stack. I only needed one integer register (for a loop counter), so I picked x9, which is caller-saved. One nice thing about working in assembly, compared to C,
is that in C you constantly have to worry about outwitting the
compiler. Every loop iteration needs to have independent,
unpredictable inputs, and the outputs need to be saved to independent
array slots and you have to hash the array afterwards, and make sure
to print the hash, and all this jazz. Otherwise the evil compiler will
just delete your whole benchmark. Assembly is much easier. I just
initialize every loop input to 0, and enter the loop. No constant
folding here! Just to be careful, I do align the loop entry, too.
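For reference, here's a sketch of what the prologue and epilogue boil down to (the helper names are mine). Under the ARM64 calling convention, d8 through d15 are the callee-saved floating-point registers, so those get pushed and popped around the benchmark, and x9 needs no saving at all.

    def emit_prologue(fd):
        # Save the callee-saved d8-d15 on the stack.
        print("    stp d8, d9, [sp, #-64]!", file=fd)
        print("    stp d10, d11, [sp, #16]", file=fd)
        print("    stp d12, d13, [sp, #32]", file=fd)
        print("    stp d14, d15, [sp, #48]", file=fd)

    def emit_epilogue(fd):
        # Restore them in reverse order and return.
        print("    ldp d14, d15, [sp, #48]", file=fd)
        print("    ldp d12, d13, [sp, #32]", file=fd)
        print("    ldp d10, d11, [sp, #16]", file=fd)
        print("    ldp d8, d9, [sp], #64", file=fd)
        print("    ret", file=fd)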
My loop works like this. First, the loop body. Then, I fmov the last loop output register into every input register. This makes sure I have a dependency from each iteration to the next iteration, no matter what. This fmov is free, because a register-to-register move doesn't actually do anything; it's just handled by the register renamer. Funny enough, I spent so long trying to instead map the various loop outputs to the various loop inputs one-to-one. This is harder than it seems, because the number of outputs and inputs may differ, and also you have to copy over the registers without overwriting a value. And if you get this wrong, you might set up a too-short dependency chain and get wrong outputs. Anyway, none of this was necessary and a huge waste of time. It's fine to copy the one output into every input. The compiler won't do anything evil!
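Concretely, the tail of each loop iteration comes out looking something like this, sketched as generator code. The helper name and the local loop label 1: are mine; loop_inputs is the list of ISA registers the body reads, final_out holds the last result, and x9 is the loop counter from before.

    def emit_loop_tail(fd, loop_inputs, final_out):
        # Copy the one final output into every loop input, so every iteration
        # depends on the previous one. These fmovs are handled by the renamer.
        for r in loop_inputs:
            print(f"    fmov d{r}, d{final_out}", file=fd)
        print("    subs x9, x9, #1", file=fd)  # decrement the loop counter
        print("    b.ne 1b", file=fd)          # branch back to the 1: label at loop entry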
Timing
To do the timing properly, I link my assembly with a small C driver program. That driver uses mach_absolute_time(), which ChatGPT told me is macOS's cycle-accurate timer. Sadly, there's no rdtsc on ARM. mach_absolute_time() measures nanoseconds, not cycles; I multiply by 3.2, the nominal frequency of my chip in GHz, and make sure to run a warm-up loop to get the CPU to scale up.
For convenience, I also generate a "null loop" with no body, and subtract that off. The null loop takes two cycles per iteration (a sub and a b.ne), and it seems that in the real loop those two cycles aren't overlapped with anything. I'm not sure why: it seems like they should overlap? Anyway, subtracting it off results in the "expected" numbers.
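In code, turning the two raw timings into cycles per iteration is just this bit of arithmetic (a sketch; nanoseconds times the 3.2 GHz clock gives cycles):

    def cycles_per_iteration(loop_ns, null_ns, iters, ghz=3.2):
        # nanoseconds * (cycles per nanosecond) = cycles, then subtract the
        # two-cycle-per-iteration overhead measured by the null loop.
        return (loop_ns - null_ns) * ghz / iters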
Let me add that I have found ChatGPT terrible at cycle-level reasoning about microarchitectures. It invents all sorts of nonsense, and struggles to carefully count things one step at a time.
I do 100 million iterations, which takes a few seconds at most even for the most expensive routines I'm testing.
I also wrote in an option to put N copies of the routine inside the loop body, each with dependencies only on itself, to test N-way concurrency. Mostly I did this because my simulator implements it, and I wanted to do accurate comparisons. Another funny bug: I spent over an hour on a bug where I was dividing the total cycles by the number of copies twice, making it look like adding more copies made things faster. This was such a frustrating bug, because I was convinced I had generated the assembly wrong, but in fact it was fine and I was munging the output wrong instead.
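For what it's worth, here's roughly what generating the N copies amounts to (a sketch, not my actual generator): each copy gets its own block of output registers and its own input arguments, so the copies never depend on each other.

    def replicate(code, n):
        # code is the [out, op, args] list from above: outputs are numbered by
        # instruction index and inputs are -1, -2, ..., so shifting both gives
        # each copy its own private register namespace.
        n_inputs = max((-a for _, _, args in code for a in args if a < 0), default=0)
        out_code = []
        for k in range(n):
            def shift(r, k=k):
                return r + k * len(code) if r >= 0 else r - k * n_inputs
            for out, op, args in code:
                out_code.append([shift(out), op, [shift(a) for a in args]])
        return out_code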
All N copies go through the register allocator at once, so you do get optimal register usage across all of the copies, but of course the N you can do on a real chip is still limited. For example, my @3 algorithm can do five concurrent instances at most; by six, it starts requiring too many registers. One thing I've considered is allowing the register allocator to also shuffle instructions; right now each copy is contiguous and goes one after the other, which I think is OK on the Apple M1 (it has a huge reorder buffer), but it's possible that you could squeeze a few more registers out by shuffling the instructions around.
Results
Anyway, the point of all of this was to validate whether my simulator is getting the right results and to measure how much "slack" there is in its estimates. There's good news! In this table I have the simulated and measured numbers (written sim/real, in cycles) for the latency and the 2- and 3-way-concurrent throughput of a number of algorithms:
Algorithm | Latency | 2-concurrency | 3-concurrency |
---|---|---|---|
FastTwoSum | 9.00/9.58 | 4.50/4.74 | 3.00/3.25 |
TwoSum | 15.02/15.64 | 7.51/7.83 | 5.01/5.33 |
@0 | 15.02/15.89 | 7.53/8.34 | 5.01/5.61 |
@1 | 12.00/12.66 | 6.00/6.75 | 4.00/4.69 |
@2 | 11.00/11.53 | 5.50/5.91 | 3.67/4.29 |
@3 | 11.00/12.08 | 5.50/6.12 | 3.67/4.41 |
ddadd | 51.02/52.30 | 25.51/26.33 | - |
madd | 37.04/38.39 | 18.52/19.49 | - |
madd@3 | 30.03/31.45 | 15.04/16.69 | - |
If you look through the table, you'll see that the difference between simulated and real numbers is usually pretty minor, peaking at about 20% for @3 with 3-way concurrency. I think that's not too bad!
All told, this was a super fun project that, I think, gave me a much better understanding of CPU microarchitectures. I'm not sure if I want to take it further (which I think means adding an x86 mode), but even if not, I'm pretty happy with what I ended up with. The simulator code is still on this site.