Optimal Heap Limits for Generational GCs

By Pavel Panchekha

Shared under CC-BY-SA.

In MemBalancer, my student Marisa and I came up with a novel correctness property for garbage collection heap limits, which we called "compositionality", derived compositional heap limits for a basic one-generation garbage collector, and showed that this new heap limit led to big reductions in memory usage for Chrome's V8 JavaScript runtime. But V8 actually uses a generational GC, and compositional heap limits for that are much harder to derive.

Compositional heap limits

Any time you have a garbage collector, you have to decide when it will actually collect garbage. There are lots of possible ways to decide this, but a popular one uses a "heap limit": you set a maximum amount of memory that the program can use, and then if memory is allocated past that limit the garbage collector runs.
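
As a minimal sketch (nothing like V8's real implementation), a heap-limit policy can be as simple as checking the limit on every allocation:

```python
class Heap:
    """Toy allocator: collect only when an allocation would exceed the heap limit."""

    def __init__(self, heap_limit):
        self.heap_limit = heap_limit  # the knob this post is about setting
        self.used = 0

    def allocate(self, size):
        if self.used + size > self.heap_limit:
            self.collect()
        self.used += size

    def collect(self):
        # A real GC would trace live objects; here we just pretend 70% was garbage.
        self.used = int(self.used * 0.3)
```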

So then the question is what the heap limit should be. It's a trade-off: a higher heap limit means using more memory but spending less time collecting, and vice versa. So the choice might look like just a matter of taste, but in the MemBalancer paper we introduced a new way to think about it.

The idea is: suppose you have two garbage-collected programs. Together they have a total memory use and a total time spent collecting garbage. For a single program, any heap limit is Pareto-optimal, but surprisingly some combinations of heap limits for two programs are not! For example, you could have one program that uses a huge amount of memory, and another that uses basically no memory but spends a huge amount of time collecting. In total, there's a lot of memory use and a lot of garbage collection, and you could reallocate a bit of memory from the first program to the second to get a strictly better situation. But it is possible to set the heap limits in just the right way so that the combination of multiple programs is also Pareto-optimal.

Formally, the MemBalancer paper works like this.

First, write down a theoretical model:

\[ T = f[g, s, \ldots](M) \]

This model is a function with one input \(M\), the heap limit, any number of parameters like allocation rate \(g\) and garbage collection speed \(s\), and an output which is the time spent collecting garbage \(T\). The difference between the input and the parameters is that the input is something we can control (we can set the heap limit to whatever we want) while the parameters are out of our control (they are up to the program and the runtime).

In the MemBalancer paper the model is:

\[ T = \frac{L}{s} \frac{g}{M - L} \]

This is basically a model that assumes:

  • Allocation rate \(g\), garbage collection rate \(s\), and live memory \(L\) are static over time
  • Marking / compacting takes all the time, so garbage collection time is proportional to live memory \(L\)
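
Under those assumptions, here is a small numeric sketch of why compositionality is not automatic (everything about the two "programs" below is made up): with a fixed total memory budget, an even split of heap limits can be strictly worse than a lopsided one.

```python
def gc_overhead(M, L, g, s):
    """MemBalancer model: fraction of time spent collecting, T = (L/s) * (g/(M-L))."""
    return (L / s) * (g / (M - L))

# Two hypothetical programs sharing a 200 MB budget (units: MB and MB/s).
A = dict(L=10.0, g=100.0, s=1000.0)   # allocates fast
B = dict(L=10.0, g=1.0,   s=1000.0)   # allocates slowly

def total_time(M_A, M_B):
    return gc_overhead(M_A, **A) + gc_overhead(M_B, **B)

# An even split wastes memory on B, which barely needs it ...
print(total_time(100.0, 100.0))
# ... while giving most of the budget to A is strictly better at the same total memory.
print(total_time(170.0, 30.0))
```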

Second, solve this equation for \(M\):

\[ -c = \frac{\partial f}{\partial M} \]

You can see the explanation for this step in the MemBalancer paper, but basically, if this equation holds, then the compositionality property holds. In the paper, the solution is:

\[ M = L + \sqrt{\frac{g L}{c s}} \]

Note that this solution computes the input in terms of the parameters. So it gives us a way to set the heap limit.
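
As a quick sketch (with made-up parameter values), the rule is a one-liner, and you can check numerically that at the resulting \(M\) the marginal trade-off \(\partial T / \partial M\) really equals \(-c\):

```python
from math import sqrt

def heap_limit(L, g, s, c):
    """MemBalancer heap limit: M = L + sqrt(g * L / (c * s))."""
    return L + sqrt(g * L / (c * s))

# Made-up parameters: live memory, allocation rate, collection speed, trade-off constant.
L, g, s, c = 50.0, 20.0, 500.0, 0.001
M = heap_limit(L, g, s, c)

def T(M):
    return (L / s) * (g / (M - L))

# Finite-difference check that dT/dM = -c at the chosen heap limit.
h = 1e-6
print(M, (T(M + h) - T(M - h)) / (2 * h), -c)  # the last two should roughly agree
```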

Third, we need to measure the parameters. If we can measure them, we can compute the heap limit using the formula derived in step 2, but the parameters aren't always easy to observe. For example, in the MemBalancer paper, we measure \(s\) and \(L\) after each garbage collection, and measure \(g\) every second in a background thread. Moreover, \(s\) and \(g\) are smoothed (with different smoothing rates). Fine-tuning the measurement actually makes a pretty big difference to the results, though even without the fine-tuning we were seeing some results.
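
Here is a rough sketch of what that measurement side could look like; the exponential smoothing constants below are placeholders, not the values tuned in the paper.

```python
class ParamEstimator:
    """Sketch of the measurement side: smooth g and s, sample L after each GC."""

    def __init__(self, alpha_g=0.95, alpha_s=0.5):  # placeholder smoothing rates
        self.alpha_g, self.alpha_s = alpha_g, alpha_s
        self.g = self.s = self.L = None

    def on_tick(self, bytes_allocated, seconds):
        # Called periodically (e.g. every second) from a background thread.
        rate = bytes_allocated / seconds
        self.g = rate if self.g is None else self.alpha_g * self.g + (1 - self.alpha_g) * rate

    def on_gc(self, live_bytes, bytes_traced, gc_seconds):
        # Called at the end of each garbage collection.
        speed = bytes_traced / gc_seconds
        self.s = speed if self.s is None else self.alpha_s * self.s + (1 - self.alpha_s) * speed
        self.L = live_bytes
```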

So this is the MemBalancer methodology: modeling, solving, and measuring, in that order. The result is an algorithm for compositional heap limits.

Modeling generational collectors

Our idea was to repeat this methodology, but for a generational garbage collector. In a generational collector, there are two heaps (a "young generation" and an "old generation"), and live data from the young generation is moved to the old generation after collection.
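
For concreteness, here is a toy sketch of that mechanism (real collectors like V8's are far more involved):

```python
class GenerationalHeap:
    """Toy two-generation heap: minor collections promote surviving objects."""

    def __init__(self, minor_limit):
        self.minor_limit = minor_limit  # size of the young generation
        self.young = []                 # newly allocated objects
        self.old = []                   # objects that survived a minor collection

    def allocate(self, obj, is_live):
        if len(self.young) >= self.minor_limit:
            self.minor_collect()
        self.young.append((obj, is_live))

    def minor_collect(self):
        survivors = [entry for entry in self.young if entry[1]]
        self.old.extend(survivors)      # promotion: live young data moves to the old heap
        self.young = []
```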

So our model would have two inputs, a major heap size \(M_j\) and a minor heap size \(M_i\), and two outputs, the time spent in major and minor collections. The major heap model can just copy the MemBalancer model:

\[ T_j = \frac{L}{s_j} \frac{p(M_i)}{M_j - L} \]

The only difference from the model above is that instead of a fixed allocation rate \(g\), I now have the promotion rate \(p(M_i)\), the fraction of bytes that are moved from the minor to the major heap, which notably depends on the minor heap size.

The minor heap model is also pretty easy; we assume copying the data is what takes all the time, so it's just:

\[ T_i = \frac{p(M_i) g}{s_i} \]

These models are pretty simple, and also all the parameters in them are evidently measurable: they are just variations of what we had in MemBalancer.
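
In code, the two models are a couple of lines each; in this sketch the promotion model \(p\) is passed in as a function because, as discussed next, we don't yet know its form.

```python
def minor_overhead(M_i, g, s_i, p):
    """T_i = p(M_i) * g / s_i: time share spent copying survivors out of the minor heap."""
    return p(M_i) * g / s_i

def major_overhead(M_i, M_j, L, s_j, p):
    """T_j = (L / s_j) * p(M_i) / (M_j - L): time share spent in major collections."""
    return (L / s_j) * p(M_i) / (M_j - L)

def total_overhead(M_i, M_j, L, g, s_i, s_j, p):
    return minor_overhead(M_i, g, s_i, p) + major_overhead(M_i, M_j, L, s_j, p)
```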

Except for the \(p(M_i)\) term. What determines the promotion rate? How does it change with the minor heap size? Presumably it goes down—the larger the minor heap, the less often you collect it and the better the chance any given object is dead when you check. But I'm not aware of a good theoretical form for that curve. My students Pranav and Marisa collected data for a bunch of JavaScript benchmarks, and we concluded that the curve does point down, but there was too much noise to say much about its functional form.

I'm not a garbage collection expert. Perhaps there are theoretical models where the promotion rate \(p(M_i)\) is derivable. If you know, please email me.

Solving generational collectors

The cool thing about generational GC is that we don't even need to think about multiple programs in this scenario. We already have multiple heaps (the major and minor heaps) to balance against each other. It's possible, for example, that you could make the minor heap bigger, and the major heap smaller by the same amount, and that garbage collection time could go down (or up). So even in a single program you could get speedups, which is very attractive.

Even in the absence of a model for \(p(M_i)\), we can still try to solve the equation. We'll need to solve:

\[ -c = \frac{\partial (T_i + T_j)}{\partial M_i} = \frac{\partial (T_i + T_j)}{\partial M_j} \]

Note that this is two equations in two unknowns, so it should in theory have solutions. Also, \(T_i\), the time in minor collection, does not depend on \(M_j\), the size of the major heap, so the equation for \(M_j\) involves only the major heap model and can be solved first.

That second equation, for the major heap, solves the same way as before, because \(T_i\) is independent of \(M_j\):

\[ M_j = L + \sqrt{\frac{L \, p(M_i)}{s_j c}} \]

Now let's tackle the first equation, for the minor heap. Taking the derivative with respect to \(M_i\) gives:

\[ -c = \frac{g}{s_i} p'(M_i) + \frac{L}{s_j} \frac{p'(M_i)}{M_j - L} \]

But since we know the correct value of \(M_j\) we can substitute that in:

\[ -c = \frac{g}{s_i} p'(M_i) + \sqrt{\frac{L c}{s_j}} \frac{p'(M_i)}{\sqrt{p(M_i)}} \]

That's a pretty ugly equation! And of course we cannot solve it for \(M_i\) without a formula for \(p(M_i)\). Still, it shows what we'd need to do if we had one.
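
Concretely, "what we'd need to do" is a one-dimensional root-finding problem: given any candidate promotion model \(p\) and its derivative, find the \(M_i\) where the condition above holds. A sketch (the bisection bracket, and the assumption that the residual changes sign on it, are mine, not from the paper):

```python
from math import sqrt

def solve_minor_heap(p, dp, L, g, s_i, s_j, c, lo=1e-3, hi=1e12):
    """Find M_i with -c = (g/s_i) p'(M_i) + sqrt(L*c/s_j) * p'(M_i)/sqrt(p(M_i)),
    for a candidate promotion model p with derivative dp, by bisection."""
    def residual(M):
        return c + (g / s_i) * dp(M) + sqrt(L * c / s_j) * dp(M) / sqrt(p(M))

    # Assumes residual(lo) and residual(hi) have opposite signs; a real controller
    # would check this and clamp M_i to a sane range otherwise.
    for _ in range(100):
        mid = (lo + hi) / 2
        if residual(lo) * residual(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```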

Modeling promotion

Modeling promotion is hard because, as I mentioned, the data is pretty noisy and unhelpful, and I'm not aware of a good theoretical prior.

But what do we want from a model? It should be empirically valid, of course (that is what actually matters), but this constraint is not as constraining as I'd like. There are also two more requirements. It should be solvable, meaning that we should be able to substitute \(p(M_i)\) into the equation above and then solve it for \(M_i\). And it should be measurable, meaning that whatever parameters our promotion model has, it should be possible to measure them. This is a little tricky. We do observe promotion every time we run a minor GC, so ideally the promotion model would have only one parameter; then we'd be able to use one observation to find the value of that one parameter.

Do these goals help narrow down \(p(M_i)\)? Surprisingly—this is the big trick—yes! Keep in mind this is all wild speculation, but let's focus on the solvability requirement instead of the empirical validity one.

The equation above is a differential equation. What if we solve it for \(p(M_i)\)?

\[ p(M_i) = \left(a \sqrt{M_i} + b\right)^2 \]

for some pretty nasty constants \(a\) and \(b\).

Now, it's important to remember that this is not the actual formula for \(p(M_i)\). That wouldn't make sense; we haven't yet added any empirical validity. So what does this mathematical operation tell us? Perhaps you think of solving a differential equation as going from a differential equation to a function that satisfies it. But don't think of it that way. Think of it as converting a differential equation into an equivalent non-differential equation. So whatever the model of \(p(M_i)\), this is the form we'll be solving! And if we're agnostic about the true form of \(p(M_i)\), we can at least choose a version where this specific equation is easy to solve. Given that the data we had looked vaguely linear, I'm tempted to use:

\[ p(M_i) = (x \sqrt{M_i} + y)^2 \]

Here \(x\) and \(y\) are now parameters for my model of promoted bytes. Because this has basically the same shape as the solution of the differential equation, solving the resulting equation for \(M_i\) would be easy (it would basically be linear), so this form of the promotion model is vaguely empirically valid and also solvable.

Of course there is still a question of measuring \(x\) and \(y\), which would be especially hard since we have no idea what they physically represent. And if we only observe promotions at the end of each minor GC, having two parameters (for one observation) is not ideal. But I have another trick here (though I didn't get to test it).

We assume that \(p(0) = 1\); in other words, if the minor heap is tiny then everything gets promoted. This is attractive because substituting it into the fake model above gives \(y = 1\), meaning that we actually only need to measure \(x\) at runtime. We can measure the promotion rate at every minor collection and use that, plus the known \(M_i\) for that collection, to solve for \(x\). And then we can use that \(x\) to pick new optimal major and minor heap sizes.

So the final \(p(M_i)\) is:

\[ p(M_i) = (x \sqrt{M_i} + 1)^2 \]

Here \(x\) can be measured pretty easily: measure the promotion rate \(r\) at the end of a minor GC, and suppose the minor heap size was \(m\) for that collection. Then we have \(p(m) = r\), so

\[ x = \frac{\sqrt{r} - 1}{\sqrt{m}} \]

Anyway, the point of this all is that by combining weak empirical observations with guidance on what is mathematically solvable and measurable, we get a plausible model of generational GC that should make an implementable optimal heap size controller.
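
If all of that holds up, the controller itself is short. Here is a hedged sketch, under all the assumptions above (the speculative promotion model, \(p(0) = 1\), and made-up parameter values); with this particular \(p\), the condition for \(M_i\) turns out to be linear in \(1/\sqrt{M_i}\) and solves in closed form.

```python
from math import sqrt

def fit_x(r, m):
    """One observation pins down x in p(M_i) = (x*sqrt(M_i) + 1)^2:
    promotion fraction r observed at minor heap size m means p(m) = r."""
    return (sqrt(r) - 1) / sqrt(m)

def optimal_heaps(r, m, L, g, s_i, s_j, c):
    """Sketch of the heap-size controller implied by this post's model."""
    x = fit_x(r, m)

    # With p(M_i) = (x*sqrt(M_i) + 1)^2 we get p'(M_i) = x^2 + x/sqrt(M_i) and
    # p'(M_i)/sqrt(p(M_i)) = x/sqrt(M_i), so the condition
    #   -c = (g/s_i) p'(M_i) + sqrt(L*c/s_j) p'(M_i)/sqrt(p(M_i))
    # is linear in 1/sqrt(M_i):
    sqrt_Mi = -x * (g / s_i + sqrt(L * c / s_j)) / (c + (g / s_i) * x * x)
    M_i = sqrt_Mi ** 2
    # (A real controller would clamp M_i to the range where 0 <= p(M_i) <= 1.)

    # Major heap limit, using the formula derived earlier.
    p_Mi = (x * sqrt(M_i) + 1) ** 2
    M_j = L + sqrt(L * p_Mi / (s_j * c))
    return M_i, M_j

# Hypothetical measurements (units MB and MB/s): 20% of bytes promoted at a
# 4 MB minor heap, 50 MB live in the old generation.
print(optimal_heaps(r=0.2, m=4.0, L=50.0, g=100.0, s_i=2000.0, s_j=800.0, c=0.001))
```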

If all this works—a big if!

Conclusion

Unfortunately, Pranav graduated before we could test out some hypotheses for \(p(M_i)\) and actually get good data of this running in production. I'd love to find another student interested in this project; email me if that's you!

But we did see some real evidence that gains are available here. Specifically, by looking at the splay benchmark in JetStream 2, which generates a huge amount of garbage and is thus heavily influenced by both minor and major heap sizing, we found that you could reduce both collection time and memory usage just by balancing the major and minor heaps against each other.

There's a lot left. We gotta actually test some of these assumptions, like that the \(p(M_i)\) model above is reasonable and that it extrapolates out to \(p(0) = 1\). Then we gotta implement all of this and see if it works, including fine-tuning all of the measurements and smoothing parameters and so on. If it works it'd be very exciting, not only because V8 is very important and getting better GC would be a big win, but also because generational MemBalancer should have wins even for single programs.

It's also worth noting that V8 is a little different from the model above: its GC doesn't promote immediately but after a second collection, and that might lead to a slightly different algorithm.