# Optimal Heap Limits for Generational GCs

In MemBalancer, my student Marisa and I came up with a novel correctness property for garbage collection heap limits, which we called "compositionality", derived compositional heap limits for a basic one-generation garbage collector, and showed that this new heap limit led to big reductions in memory usage for Chrome's V8 JavaScript runtime. But V8 actually uses a generational GC, and compositional heap limits for that are much harder to derive.

## Compositional heap limits

Any time you have a garbage collector, you have to decide when it will
actually collect garbage. There are *lots* of possible ways to decide
this, but a popular one uses a "heap limit": you set a maximum amount
of memory that the program can use, and then if memory is allocated
past that limit the garbage collector runs.

So then the question is what the heap limit should be. It's a trade-off: a higher heap limit means using more memory but spending less time collecting garbage, and vice versa. So the choice might seem like just a matter of taste, but in the MemBalancer paper we introduced a new way to think about it.

The idea is: suppose you have *two* garbage-collected programs. Then
they have a *total* memory use and a *total* time collecting garbage. For
a single program, any heap limit is Pareto-optimal, but surprisingly
some combinations of heap limits for two programs are not! For
example, you could have one program that uses a huge amount of memory,
and another that uses basically no memory and spends a huge time
collecting. In total, there's a lot of memory and a lot of garbage
collection, and you could reallocate a bit of memory from the first to
the second to get a strictly better situation. But it *is* possible to
set the heap limit in just the right way, and in that case multiple
programs will also be pareto-optimal.

Formally, the MemBalancer paper works like this.

First, write down a theoretical **model**:

\[ T = f[g, s, \ldots](M) \]

This model is a function with one *input* \(M\), the heap limit, any
number of *parameters* like allocation rate \(g\) and garbage collection
speed \(s\), and an *output* which is the time spent collecting garbage
\(T\). The difference between the input and the parameters is that the
input is something we can control (we can set the heap limit to
whatever we want) while the parameters are out of our control (they
are up to the program and the runtime).

In the MemBalancer paper the model is:

\[ T = \frac{L}{s} \frac{g}{M - L} \]

This is basically a model that assumes:

- Allocation rate \(g\), garbage collection rate \(s\), and live memory \(L\) are static over time
- Marking / compacting takes all the time, so garbage collection time is proportional to live memory \(L\)
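
To make the model concrete, here's a minimal Python sketch of it (the function and variable names here are my own, for illustration, not anything from the paper's artifact):

```python
# Sketch of the MemBalancer cost model T = (L / s) * (g / (M - L)).
# All names here are illustrative only.

def gc_time(M, L, g, s):
    """Estimated fraction of time spent collecting garbage.

    M: heap limit, L: live memory (same units as M),
    g: allocation rate, s: collection speed (same units per second).
    """
    assert M > L, "the heap limit must exceed live memory"
    # Each collection traces L bytes at speed s, costing L / s seconds,
    # and a collection is triggered every (M - L) / g seconds of allocation.
    return (L / s) * (g / (M - L))
```

Note how the trade-off shows up directly: raising \(M\) shrinks \(T\), at the cost of holding more memory.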

Second, **solve** this equation for \(M\):

\[ -c = \frac{\partial f}{\partial M} \]

You can see the full explanation for this step in the MemBalancer paper, but basically: if this equation holds for each program, then the compositionality property holds. In the paper, the solution is:

\[ M = L + \sqrt{g L / c s} \]

Note that this solution computes the *input* in terms of the *parameters*.
So it gives us a way to set the heap limit.
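
As a sketch, the solved rule is just one line of code (illustrative names again; \(c\) is the constant trading memory for time):

```python
import math

# M = L + sqrt(g * L / (c * s)): the compositional heap limit derived in
# the MemBalancer paper.  c is the tunable memory/time trade-off constant.

def heap_limit(L, g, s, c):
    return L + math.sqrt(g * L / (c * s))
```

The square root is the interesting part: quadrupling the allocation rate \(g\) only doubles the headroom above \(L\).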

Third, we need to **measure** the parameters. If we can measure the
parameters, we can compute the heap limit using the formula derived in
step 2, but the parameters may not be arbitrarily observable. For
example, in the MemBalancer paper, we measure \(s\) and \(L\) after a
garbage collection, and measure \(g\) every second in a background
thread. Moreover, \(s\) and \(g\) are smoothed (with different smoothing
rates). Fine-tuning the measurement actually makes a pretty big
difference to the results, though even without the fine-tuning we saw
some improvement.
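
The smoothing is exponential; here's a minimal sketch (the smoothing rates below are made up for illustration, not the paper's values):

```python
# Exponentially smoothed measurement, as used for s and g (with
# different smoothing rates).  The alpha values below are illustrative.

class Smoothed:
    def __init__(self, alpha):
        self.alpha = alpha   # weight given to each new sample
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = sample   # first sample initializes the estimate
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

speed = Smoothed(alpha=0.5)        # s: updated after each collection
alloc_rate = Smoothed(alpha=0.95)  # g: updated every second, in a background thread
```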

So this is the MemBalancer methodology: modeling, solving, and measuring, in that order. The result is an algorithm for compositional heap limits.

## Modeling generational collectors

Our idea was to repeat this methodology, but for a generational garbage collector. In a generational collector, there are two heaps (a "young generation" and an "old generation"), and live data from the young generation is moved to the old generation after collection.

So our model would have two *inputs*: a major heap limit \(M_j\) and a minor heap limit \(M_i\), with the major and minor collection times as outputs. The major heap model can just copy the MemBalancer model:

\[ T_j = \frac{L}{s_j} \frac{p(M_i)}{M_j - L} \]

The only difference from the model above is that instead of a fixed
allocation rate \(g\), I now have the *promotion rate* \(p(M_i)\), the
fraction of bytes that are moved from the minor to the major heap,
which notably depends on the minor heap size.

The minor heap model is also pretty easy; we assume copying the data is what takes all the time, so it's just:

\[ T_i = \frac{p(M_i) g}{s_i} \]
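
Together, the two models can be sketched in code like this (names mine; the promotion rate is passed in as an already-evaluated number):

```python
# Two-generation cost model (illustrative names):
#   minor: T_i = p(M_i) * g / s_i                 -- copying promoted bytes
#   major: T_j = (L / s_j) * p(M_i) / (M_j - L)   -- MemBalancer with g -> p(M_i)

def minor_gc_time(p_Mi, g, s_i):
    # Time to copy the promoted fraction of allocated bytes.
    return p_Mi * g / s_i

def major_gc_time(p_Mi, L, s_j, M_j):
    # Same shape as the single-generation model, with promotion
    # standing in for the allocation rate.
    assert M_j > L
    return (L / s_j) * (p_Mi / (M_j - L))
```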

These models are pretty simple, and also all the parameters in them are evidently measurable: they are just variations of what we had in MemBalancer.

Except the \(p(M_i)\) term. What determines the promotion rate? How does it change with the minor heap size? Presumably it goes down: the larger the minor heap, the less often you collect it, and the better the chance any given object is dead when you check. But I'm not really aware of a good theoretical form for that curve. My students Pranav and Marisa collected data for a bunch of JavaScript benchmarks, and we concluded that the curve does point down, but there was too much noise to really say much about its functional form.

I'm not a garbage collection expert. Perhaps there are theoretical
models where the promotion rate \(p(M_i)\) is derivable. **If you know,
please email me.**

## Solving generational collectors

The cool thing about generational GC is that we don't even need to think about multiple programs in this scenario. We already have multiple heaps (the major and minor heaps) to balance against each other. It's possible, for example, that you could make the minor heap bigger, and the major heap smaller by the same amount, and that garbage collection time could go down (or up). So even in a single program you could get speedups, which is very attractive.

Even in the absence of a model for \(p(M_i)\), we can still try to solve the equation. We'll need to solve:

\[ -c = \frac{\partial (T_i + T_j)}{\partial M_i} = \frac{\partial (T_i + T_j)}{\partial M_j} \]

Note that this is two equations in two unknowns, so it should in theory have solutions. Also, \(T_i\), the time in minor collection, does not depend on \(M_j\), the size of the major heap, so we can solve that first.

The second half, the major heap, solves the same way as before, because \(T_i\) is independent of \(M_j\):

\[ M_j = L + \sqrt{L p(M_i) / s_j c} \]
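
As a sketch (illustrative names again), the major-heap rule is the MemBalancer rule with the promotion rate standing in for the allocation rate:

```python
import math

# M_j = L + sqrt(L * p(M_i) / (s_j * c)): the major heap limit, identical
# in shape to the single-generation rule with g replaced by p(M_i).

def major_heap_limit(L, p_Mi, s_j, c):
    return L + math.sqrt(L * p_Mi / (s_j * c))
```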

Let's solve the first half. We get:

\[ -c = p'(M_i) \frac{g}{s_i} + \frac{L}{s_j} \frac{p'(M_i)}{M_j - L} \]

But since we know the correct value of \(M_j\) we can substitute that in:

\[ -c = \frac{g}{s_i} p'(M_i) + \sqrt{\frac{L c}{s_j}} \frac{p'(M_i)}{\sqrt{p(M_i)}} \]

That's a pretty ugly equation! And of course we cannot solve this equation for \(M_i\), lacking a formula for \(p(M_i)\). Still, it shows what we'd need to do if we had one.

## Modeling promotion

Modeling promotion is hard because, as I mentioned, the data is pretty noisy and unhelpful, and I'm not aware of a good theoretical prior.

But what do we want from a model? It should be **empirically valid**, of
course; this is what actually matters, but this constraint is not as
constraining as I'd like. The model also has two more requirements. It
should be **solvable**, meaning that we should be able to substitute
\(p(M_i)\) into the equation above and then solve it for \(M_i\). And also
it should be **measurable**, meaning that whatever parameters our
promotion model has, it should be possible to measure those
parameters. This is a little tricky. We do observe promotion every
time we run a minor GC, so ideally the promotion model would have only
one parameter; then we'd be able to use one observation to find the
value of the one parameter.

Do these goals help narrow down \(p(M_i)\)? Surprisingly—this is the big trick—yes! Keep in mind this is all wild speculation, but let's focus on the solvability requirement instead of the empirical validity one.

The equation above is a differential equation. What if we solve it *for*
\(p(M_i)\)?

\[ p(M_i) = \left(a \sqrt{M_i} + b\right)^2 \]

for some pretty nasty constants \(a\) and \(b\).

Now, it's important to remember that this is not the actual formula for \(p(M_i)\). That wouldn't make sense; we haven't yet added any empirical validity. So what does this mathematical operation tell us? Perhaps you think of solving a differential equation as going from a differential equation to a function that satisfies it. But don't think of it that way. Think of it as converting a differential equation into an equivalent non-differential equation. Whatever the model of \(p(M_i)\), this is the equation we'll be solving! So if we're agnostic about the true form of \(p(M_i)\), we can at least choose a version where this specific equation is easy to solve. Given that the data we had looked vaguely linear, I'm tempted to use:

\[ p(M_i) = (x \sqrt{M_i} + y)^2 \]

Here \(x\) and \(y\) are now parameters for my model of promoted bytes.
Because this has basically the same shape as the solution of the
differential equation, solving that differential equation for \(M_i\)
would be easy (it would basically be linear), so this form of the
promotion model is vaguely empirically valid and *also* solvable.
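
To check that "basically linear" claim, here's a sketch of the closed-form solve under the assumed model (all names mine). The key facts are that \(p'(M) = x(x\sqrt{M} + y)/\sqrt{M}\) and \(p'(M)/\sqrt{p(M)} = x/\sqrt{M}\), so multiplying the equation through by \(\sqrt{M_i}\) leaves something linear in \(\sqrt{M_i}\):

```python
import math

# Solve -c = (g/s_i) p'(M) + sqrt(L*c/s_j) * p'(M)/sqrt(p(M))
# for M, assuming p(M) = (x*sqrt(M) + y)^2.  Under that model,
#   p'(M) = x*(x*sqrt(M) + y)/sqrt(M)   and   p'(M)/sqrt(p(M)) = x/sqrt(M),
# so multiplying through by sqrt(M) gives an equation linear in sqrt(M).
# x should be negative: promotion falls as the minor heap grows.

def minor_heap_limit(L, g, s_i, s_j, c, x, y=1.0):
    sqrt_M = -x * (g * y / s_i + math.sqrt(L * c / s_j)) / (c + g * x**2 / s_i)
    return sqrt_M ** 2
```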

Of course there is still a question of measuring \(x\) and \(y\), which would be especially hard since we have no idea what they physically represent. And if we only observe promotions at the end of each minor GC, having two parameters (for one observation) is not ideal. But I have another trick here (though I didn't get to test it).

We assume that \(p(0) = 1\); in other words, if the minor heap is tiny, then everything gets promoted. This is attractive because substituting it into the model above gives \(y = 1\), meaning we only need to measure \(x\) at runtime. We can measure the promotion rate at every minor collection and use that, plus the known \(M_i\) for that collection, to solve for \(x\). And then we can use that \(x\) to pick new optimal major and minor heap sizes.

So the final \(p(M_i)\) is:

\[ p(M_i) = (x \sqrt{M_i} + 1)^2 \]

Here \(x\) could be measured pretty easily: measure the promotion rate \(r\) at the end of a minor GC, and assume that the minor heap size was \(m\) for that minor GC. Then we have \(p(m) = r\) so

\[ x = \frac{\sqrt{r} - 1}{\sqrt{m}} \]
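
Putting the measurement together in code (names mine): one observation of promotion rate \(r\) at minor heap size \(m\) pins down \(x\), and then \(p\) can be predicted at any other minor heap size:

```python
import math

# p(M_i) = (x*sqrt(M_i) + 1)^2, with the single parameter x fitted from
# one observation: promotion rate r at minor heap size m.

def fit_x(r, m):
    # From p(m) = r: x*sqrt(m) + 1 = sqrt(r), so x = (sqrt(r) - 1)/sqrt(m).
    return (math.sqrt(r) - 1) / math.sqrt(m)

def promotion_rate(M_i, x):
    return (x * math.sqrt(M_i) + 1) ** 2
```

Note that \(x\) comes out negative whenever \(r < 1\), so the fitted curve slopes down, as expected, and \(p(0) = 1\) by construction.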

Anyway, the point of all this is that by combining weak empirical observations with guidance from what is mathematically solvable and measurable, we get a plausible model of generational GC that should make an implementable optimal heap size controller.

If all this works—a big if!

## Conclusion

Unfortunately, Pranav graduated before we could test out some hypotheses for \(p(M_i)\) and actually get good data of this running in production. I'd love to find another student interested in this project; email me if that's you!

But we did see some real evidence that gains are available here.
Specifically, by looking at the `splay` benchmark in JetStream 2, which
generates a huge amount of garbage and is thus heavily influenced by
both minor and major heap sizing, we found that you *could* reduce both
collection time and memory usage just by balancing the major and minor
heaps against each other.

There's a lot left. We've got to actually test some of these assumptions, like that the \(p(M_i)\) model above is reasonable and that it extrapolates out to \(p(0) = 1\). Then we've got to implement all of this and see if it works, including fine-tuning all of the measurements and smoothing parameters and so on. If it works, it'd be very exciting: not only because V8 is very important and getting better GC would be a big win, but also because generational MemBalancer should have wins even for single programs.

It's also worth noting that V8 is a little different from the model above: its GC doesn't promote immediately but after a second collection, and that might lead to a slightly different algorithm.