Testing My M1 Simulator
Ok, here's the story so far. Way back when, I wanted to test whether sort + FastTwoSum is faster than TwoSum. But as I started optimizing it, I kept getting frustrated with how hard it is to benchmark low-level code, and eventually I wrote a simulator for my CPU, the Apple M1 chip. At this point, I was in too deep and wanted to keep improving the simulator.
Compiling to Assembly
As of the last blog post, I'd added a little DAG-building library so that I could easily write little assembly snippets and time them in my simulator. The simulator works great, so nothing there needed changing, and the DAG-builder was fine to work with too, but one thing I really wanted to know was how trustworthy the numbers coming out of that simulator really were. So I hatched a plan: compile my simulated assembly snippets to real assembly, and actually run them.
Internally, assembly code in my simulator is represented like this:
CODE = [ [0, "fabs", [-1]], [1, "fabs", [-2]], [2, "fcmp", [0, 1]], ... ]
In other words, it is a list of instructions, and each instruction has an output register (that's the number on the left) and a set of input registers (on the right). By convention, negative numbers are input arguments. (Actually, I initially didn't have input arguments at all, but then decided to add them.) Here I'm assuming an infinite number of registers, and I never reuse register names. That's because I'm mostly simulating the CPU back-end, so I'm assuming all the registers have already gone through register renaming.
Abstractly, compiling to assembly should be easy: pick actual registers for each register name, and then output each instruction. Since all of my code snippets are short, and since ARM has 32 floating-point registers, that should be easy. The loop basically looks like this:
    argstr = ", ".join(f"d{assignment[arg]}" for arg in args)
    print(f"    {op} d{assignment[out]}, {argstr}", file=fd)
Handling flags registers
One issue, however, was instructions that modify the flags, like fcmp. In my simulator, I represent the flags register as a direct output of the fcmp instruction (I should probably also represent it as an input, now that I think about it). But in actual assembly, it's implicit. My fix for this was special-casing fcmp in the assembly generator. I also don't need to assign an ISA register to hold that "output", so I modified my DAG-builder to record which registers are flags registers (that is, which registers are outputs from fcmp) so I can skip assigning them.
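Putting those two pieces together, the emit loop ends up looking roughly like this. To be clear, this is a sketch rather than my exact generator: the emit name and the flags argument are just for illustration, and I'm ignoring fcsel's condition-code operand. But it shows the idea: fcmp gets no destination operand, and flags registers never show up as operands at all.

    def emit(code, assignment, flags, fd):
        # code is the [out, op, args] list from above; assignment maps virtual
        # registers to ISA register numbers; flags is the set of flags registers.
        for out, op, args in code:
            # Flags registers are implicit in real assembly, so drop them here.
            operands = ", ".join(f"d{assignment[a]}" for a in args if a not in flags)
            if op == "fcmp":
                # fcmp's only "output" is the flags register, so no destination.
                print(f"    fcmp {operands}", file=fd)
            else:
                print(f"    {op} d{assignment[out]}, {operands}", file=fd)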
This also fixed a bit of a sharp edge in the DAG builder, where your assembly could act as if multiple flags registers are active at once:
CODE = [ [0, "fcmp", [-2, -1]], [1, "fcmp", [-3, -4]], [2, "fcsel", [0, -2, -3]] ... ]
This makes conceptual sense, since the CPU can allocate multiple flags registers during out-of-order execution, but you can't actually write assembly this way. So my DAG-builder now keeps track of the active flags register and doesn't let you use it out of order:
    class Assembler:
        def push_instruction(self, name, args):
            # ...
            if name == "fcmp":
                self.flags.add(out)
                self.curflags = out
            elif name == "fcsel":
                assert args[0] in self.flags, "fcsel argument is not a flags register"
                assert args[0] == self.curflags, "fcsel argument is not current flags"
            # ...
Doing register allocation
Now I have a whole bunch of abstract registers, defined via the DAG-builder, and I need to assign each an ISA register, d0 through d31. There are a lot of register allocation algorithms out there, but luckily I only needed to do this for extremely short assembly snippets, maybe a few dozen instructions long. So I figured I'd skip the complex algorithms, and also gain optimality, by using an ILP solver to do this.
I chose PuLP, mostly because it has a really nice Python API. The basic ILP setup is simple. We create \(O(M N)\) boolean variables, one for each combination of virtual register and ISA register. Each virtual register goes in exactly one ISA register. But if two virtual registers are live at the same time, then they have to go into different ISA registers.
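Here's roughly what that setup looks like in PuLP. This is a minimal sketch, not my exact code: I'm assuming virtual_regs is the list of virtual registers that need an ISA register and conflicts is the set of pairs that are live at the same time (computed from the live ranges described next).

    import pulp

    def allocate(virtual_regs, conflicts, num_isa=32):
        # Pure feasibility problem: any assignment satisfying the constraints works.
        prob = pulp.LpProblem("regalloc", pulp.LpMinimize)
        # One boolean variable per (virtual register, ISA register) pair.
        assign = {(v, r): pulp.LpVariable(f"assign_{i}_{r}", cat="Binary")
                  for i, v in enumerate(virtual_regs) for r in range(num_isa)}
        # Each virtual register goes in exactly one ISA register.
        for v in virtual_regs:
            prob += pulp.lpSum(assign[v, r] for r in range(num_isa)) == 1
        # Simultaneously-live virtual registers must get different ISA registers.
        for v1, v2 in conflicts:
            for r in range(num_isa):
                prob += assign[v1, r] + assign[v2, r] <= 1
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return {v: next(r for r in range(num_isa) if assign[v, r].value() > 0.5)
                for v in virtual_regs}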
To figure out the live region for a virtual register, I just search for the last instruction that reads from it. That's the end of the live region. Each virtual register is written to exactly once, and I name the register after the instruction that writes to it, so that's the start. Then I just check for overlaps to determine conflicts. Flags registers aren't considered.
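And here's a sketch of that conflict computation under the same assumptions (the helper name is mine; flags is the set of flags registers):

    def conflicts_for(code, flags):
        # Each virtual register is written once, by the instruction that names it;
        # its live range runs from that write to the last instruction that reads it.
        live = {}
        for i, (out, op, args) in enumerate(code):
            live[out] = (i, i)
            for a in args:
                start = live.get(a, (-1, i))[0]  # input arguments are live from the start
                live[a] = (start, i)
        regs = [r for r in live if r not in flags]  # flags registers never get an ISA register
        pairs = set()
        for a in regs:
            for b in regs:
                # a < b also skips pairing a register with itself (the source
                # of the "unsat" bug mentioned below).
                if a < b and not (live[a][1] < live[b][0] or live[b][1] < live[a][0]):
                    pairs.add((a, b))
        return pairs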
Funny bug: I accidentally kept getting "unsat" because I forgot to
ignore the conflict between a register and itself. ChatGPT o4-mini
found the bug for me.
The register allocations usually take under a second. I think this became especially important once I started doing concurrency, because then I really could start using up most of the available registers.
Calling conventions and loop nests
I did a standard calling convention for floating-point stuff, saving the eight callee-saved registers to the stack. I only needed one integer register (for a loop counter), so I picked x9, which is caller-saved. One nice thing about working in assembly, compared to C,
is that in C you constantly have to worry about outwitting the
compiler. Every loop iteration needs to have independent,
unpredictable inputs, and the outputs need to be saved to independent
array slots and you have to hash the array afterwards, and make sure
to print the hash, and all this jazz. Otherwise the evil compiler will
just delete your whole benchmark. Assembly is much easier. I just
initialize every loop input to 0, and enter the loop. No constant
folding here! Just to be careful, I do align the loop entry, too.
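For reference, here's a sketch of what the prologue and epilogue boil down to (the helper names are mine). Under the ARM64 calling convention, d8 through d15 are the callee-saved floating-point registers, so those get pushed and popped around the benchmark, and x9 needs no saving at all.

    def emit_prologue(fd):
        # Save the callee-saved d8-d15 on the stack.
        print("    stp d8, d9, [sp, #-64]!", file=fd)
        print("    stp d10, d11, [sp, #16]", file=fd)
        print("    stp d12, d13, [sp, #32]", file=fd)
        print("    stp d14, d15, [sp, #48]", file=fd)

    def emit_epilogue(fd):
        # Restore them in reverse order and return.
        print("    ldp d14, d15, [sp, #48]", file=fd)
        print("    ldp d12, d13, [sp, #32]", file=fd)
        print("    ldp d10, d11, [sp, #16]", file=fd)
        print("    ldp d8, d9, [sp], #64", file=fd)
        print("    ret", file=fd)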
My loop works like this. First, the loop body. Then, I fmov the last loop output register into every input register. This makes sure I have a dependency from each iteration to the next iteration, no matter what. This fmov is free, because a register-to-register move doesn't actually do anything; it's just handled by the register renamer. Funny enough, I spent so long trying to instead map the various loop outputs to the various loop inputs one-to-one. This is harder than it seems, because the number of outputs and inputs may differ, and also you have to copy over the registers without overwriting a value. And if you get this wrong, you might set up a too-short dependency chain and get wrong outputs. Anyway, none of this was necessary and a huge waste of time. It's fine to copy the one output into every input. The compiler won't do anything evil!
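Concretely, the tail of each loop iteration comes out looking something like this, sketched as generator code. The helper name and the local loop label 1: are mine; loop_inputs is the list of ISA registers the body reads, final_out holds the last result, and x9 is the loop counter from before.

    def emit_loop_tail(fd, loop_inputs, final_out):
        # Copy the one final output into every loop input, so every iteration
        # depends on the previous one. These fmovs are handled by the renamer.
        for r in loop_inputs:
            print(f"    fmov d{r}, d{final_out}", file=fd)
        print("    subs x9, x9, #1", file=fd)  # decrement the loop counter
        print("    b.ne 1b", file=fd)          # branch back to the 1: label at loop entry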
Timing
To do the timing properly, I link my assembly with a small C driver program. That driver uses mach_absolute_time(), which ChatGPT told me is macOS's cycle-accurate timer. Sadly, there's no rdtsc on ARM. mach_absolute_time() measures nanoseconds, not cycles; I multiply by 3.2, the nominal frequency of my chip in GHz, and make sure to run a warm-up loop to get the CPU to scale up.
For convenience, I also generate a "null loop" with no body, and subtract that off. The null loop takes two cycles per iteration (a sub and a b.ne), and it seems that in the real loop those two cycles aren't overlapped with anything. I'm not sure why: it seems like they should overlap? Anyway, subtracting it off results in the "expected" numbers.
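In code, turning the two raw timings into cycles per iteration is just this bit of arithmetic (a sketch; nanoseconds times the 3.2 GHz clock gives cycles):

    def cycles_per_iteration(loop_ns, null_ns, iters, ghz=3.2):
        # nanoseconds * (cycles per nanosecond) = cycles, then subtract the
        # two-cycle-per-iteration overhead measured by the null loop.
        return (loop_ns - null_ns) * ghz / iters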
Let me add that I have found ChatGPT terrible at cycle-level reasoning about microarchitectures. It invents all sorts of nonsense, and struggles to carefully count things one step at a time.
I do 100 million iterations, which takes a few seconds at most even for the most expensive routines I'm testing.
I also wrote in an option to put N copies of the routine inside the loop body, each with dependencies only on itself, to test N-way concurrency. Mostly I did this because my simulator implements it, and I wanted to do accurate comparisons. Another funny bug: I spent over an hour on a bug where I was dividing the total cycles by the number of copies twice, making it look like adding more copies made things faster. This was such a frustrating bug, because I was convinced I had generated the assembly wrong, but in fact it was fine and I was munging the output wrong instead.
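For what it's worth, here's roughly what generating the N copies amounts to (a sketch, not my actual generator): each copy gets its own block of output registers and its own input arguments, so the copies never depend on each other.

    def replicate(code, n):
        # code is the [out, op, args] list from above: outputs are numbered by
        # instruction index and inputs are -1, -2, ..., so shifting both gives
        # each copy its own private register namespace.
        n_inputs = max((-a for _, _, args in code for a in args if a < 0), default=0)
        out_code = []
        for k in range(n):
            def shift(r, k=k):
                return r + k * len(code) if r >= 0 else r - k * n_inputs
            for out, op, args in code:
                out_code.append([shift(out), op, [shift(a) for a in args]])
        return out_code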
All N copies go through the register allocator at once, so you do get optimal register usage across all of the copies, but of course the N you can do on a real chip is still limited. For example, my @3 algorithm can do five concurrent instances at most; by six, it starts requiring too many registers. One thing I've considered is allowing the register allocator to also shuffle instructions; right now each copy is contiguous and goes one after the other, which I think is OK on the Apple M1 (it has a huge reorder buffer), but it's possible that you could squeeze a few more registers out by shuffling the instructions around.
Results
Anyway, the point of all of this was to validate whether my simulator is getting the right results and to measure how much "slack" there is in its estimates. There's good news! In this table I have the simulated and measured numbers (written sim/real, in cycles) for the latency and the 2- and 3-way-concurrent throughput of a number of algorithms:
Algorithm | Latency | 2-concurrency | 3-concurrency |
---|---|---|---|
FastTwoSum | 9.00/9.58 | 4.50/4.74 | 3.00/3.25 |
TwoSum | 15.02/15.64 | 7.51/7.83 | 5.01/5.33 |
@0 | 15.02/15.89 | 7.53/8.34 | 5.01/5.61 |
@1 | 12.00/12.66 | 6.00/6.75 | 4.00/4.69 |
@2 | 11.00/11.53 | 5.50/5.91 | 3.67/4.29 |
@3 | 11.00/12.08 | 5.50/6.12 | 3.67/4.41 |
ddadd | 51.02/52.30 | 25.51/26.33 | - |
madd | 37.04/38.39 | 18.52/19.49 | - |
madd@3 | 30.03/31.45 | 15.04/16.69 | - |
If you look through the table, you'll see that the difference between simulated and real numbers is usually pretty minor, peaking at about 20% for @3 with 3-way concurrency. I think that's not too bad!
All told, this was a super fun project that, I think, gave me a much better understanding of CPU microarchitectures. I'm not sure if I want to take it further (which I think means adding an x86 mode), but even if not, I'm pretty happy with what I ended up with. The simulator code is still on this site.