Pavel Panchekha

Share under CC-BY-SA.

Batches as Virtual Mu Types

This post is the third in a series of me trying to work out batches, a new data structure I am testing for Herbie and, more generally, for programming with recursive data structures. In my first post I laid out a type theory for batches, and in my second post I discussed how we can check that programs that use batches are safe. But I'm still after a "holy grail": how to make programming with batches "normal". Specifically, most programming with batches involves lots of indices, which all need to be used correctly. I'd like instead to write normal recursive programs and have them automatically be batchified.

This third step has been far more difficult than the type theory and temporal verification discussed in the last two posts; I've been especially thankful to Milind Kulkarni and his students. His group has done an enormous amount of really interesting work on tree traversals, which really shaped my thinking here. I particularly liked Grafter, which introduced the idea of using regular automata for expressing access sets and recursive calls.

The goal

Let me be explicit about the goal. I would ideally like to be able to write 1) a normal recursive program over expressions, and then 2) separately, write a "schedule" for how it should execute on batches. Ideally "execute recursively" would be a valid schedule so we could transition step by step. Ideally ideally, it would also have a nice Racket syntax so I can deploy it in Herbie.

The way I'm approaching this problem is to try to rigorously describe the relationship between recursive and batch code. Batches definitely have some quirks that recursive functions don't have: they involve mutation, the traversal order matters more, sometimes the deduplication has weird effects, and so on. If we could connect batch code to recursive code, then hopefully we could see all the degrees of freedom, and building a language around those degrees of freedom would give us the desired "schedule language".

So I've been trying to do this, specifically by starting with a lambda calculus and trying to see how to translate generic lambda terms to batches.

Mu Types (a quick introduction)

Normally in functional programming we define recursive data structures, like this:

data IntList = Cons Int IntList | Nil

But an alternative is to first define the recursive functor:

data IntList' r = Cons' Int r | Nil'

It's the same, except recursive IntLists are replaced by the r parameter. Then you can recover IntList with a special type, Mu:

data Mu f = Fold { unfold :: f (Mu f) }

Then IntList = Mu IntList' gives you the recursive data structure again. A sort of core idea here is that IntList' represents the "actual data" while Mu represents the recursive aspect of the data.

The Mu type is entirely defined by two functions:1

fold :: f (Mu f) -> Mu f
unfold :: Mu f -> f (Mu f)
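
For concreteness, here's how you'd build an actual value with these definitions; a minimal Haskell sketch, where the type alias IntList and the example value are mine, reusing IntList' and Mu from above:

type IntList = Mu IntList'

-- the list [1, 2]: each Fold wraps one layer of the structure
example :: IntList
example = Fold (Cons' 1 (Fold (Cons' 2 (Fold Nil'))))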

There's a similar trick for functions. Suppose you have some recursive function:

length l =
  case l of
  | Cons a l' => 1 + length l'
  | Nil => 0

You can replace the recursive instance of length with a new function:

length' r l =
  case l of
  | Cons a l' => 1 + r l'
  | Nil => 0

Now length' is not recursive, but we can recover the recursion with an extra function usually called fix:

fix f e = f (fix f) e

Then you can define length = fix length' and get back the recursive function. Again, length' defines the "actual computation" while fix adds the actual recursion.
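
Putting both tricks together, here's a minimal runnable Haskell sketch of length over Mu IntList', using the definitions above (the name listLength is mine, since length clashes with the Prelude):

-- fix supplies the recursion that length' leaves open
fix :: ((a -> b) -> (a -> b)) -> (a -> b)
fix f e = f (fix f) e

length' :: (IntList -> Int) -> (IntList -> Int)
length' r l =
  case unfold l of
    Cons' _ l' -> 1 + r l'
    Nil'       -> 0

listLength :: IntList -> Int
listLength = fix length'   -- listLength example == 2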

Mu types as data representations

Let's switch to a lower level. Suppose I want to define a linked list in Rust. I can't write:

pub enum IntList {
    Cons(i32, IntList),
    Nil,
}

The reason this doesn't make sense is that in Rust, a field of a struct or enum is laid out inline, and it doesn't make sense for an IntList to store another IntList inline; that would require IntList to be an infinitely large structure. So instead you have to add a Box:

pub enum IntList {
    Cons(i32, Box<IntList>),
    Nil,
}

I think this is straightforward, but here's the cool thing: Box<IntList> now looks a lot like Mu<IntList'>. You have to unfold (deref) a Box<IntList> to get an IntList, which contains other Box<IntList>s at the leaves. Or you can Box::new an IntList, with Box<IntList> already at the leaves, to get a new Box<IntList>.

In other words, the concept of a Mu type not only separates out the "recursive" aspect but also the "allocation" aspect. If you don't have a Mu type, that is, if you don't have any recursive data structures, then you could stack-allocate everything.2 [2 It might involve returning really large structures across the stack, of course.]

A batch represents a recursive data structure (possibly deduplicated) but it is also a place to store recursive data structures, kind of like a region in region-based allocation. So let's extend Mu types with an annotation showing which batch the data is allocated in. For example, the length function might start out as:

length :: Mu IntList' -> Integer

But if we want to store the input in a batch3 called A, its type would become:

length :: Mu A IntList' -> Integer

This also explains the role of indices. An index is basically a pointer, restricted to a specific set of allocations, which makes it the same as the runtime representation of Box<T>. In other words, with batches we just want Mu<A, F> to be represented by indices into A. The unfold function just indexes into the batch, while the fold function adds terms to a (mutable) batch.
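
Here's a minimal sketch of this representation in Haskell, assuming a purely functional Seq where a real batch would be a mutable, growable vector; the names (Batch, Idx, foldB, unfoldB) are mine, and the phantom parameter b plays the role of the batch annotation A:

import qualified Data.Sequence as Seq

newtype Idx b = Idx Int                          -- a "pointer" restricted to batch b
newtype Batch b f = Batch (Seq.Seq (f (Idx b)))  -- children are indices, not pointers

-- unfold: just index into the batch
unfoldB :: Batch b f -> Idx b -> f (Idx b)
unfoldB (Batch s) (Idx i) = Seq.index s i

-- fold: append a node whose children are already in the batch
foldB :: Batch b f -> f (Idx b) -> (Batch b f, Idx b)
foldB (Batch s) node = (Batch (s Seq.|> node), Idx (Seq.length s))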

By the way, we could also have a special "batch" called Heap to represent heap-allocated terms, so that Mu<Heap, F> would just be the normal representation of inductive data types, with real pointers.

Batch polymorphism

Naturally, you would never actually write a function that could only take inputs from a specific batch; instead, you'd want it to be generic over input batches:

length :: forall A, Mu A IntList' -> Integer

But if you think through what this means at the level of execution, it means that length now receives as input both a batch and an index into that batch. That is the same as the earlier idea of batchref objects, which are a pair of a batch and an index; now I see batchrefs as just one common pattern for batch functions.
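
So a batchref is, roughly, an existential package of a batch and an index into that same batch; here's a sketch in the toy Haskell encoding from above (BatchRef is my name):

{-# LANGUAGE ExistentialQuantification #-}

-- a batch paired with an index into that same batch,
-- with the batch name b hidden from the outside
data BatchRef f = forall b. BatchRef (Batch b f) (Idx b)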

Now let's consider a function that rewrites an expression in some way. Normally it would have type:

rewrite :: Mu Expr -> Mu Expr

But with the batch annotations, we now have a number of different rewrite functions. First, there's this one:

rewrite :: forall A B, Mu A Expr -> Mu B Expr

This one takes input from one batch and writes output to another batch, where the output batch is passed in by the user. A slightly different option is:

rewrite :: forall A, Mu A Expr -> exists B, Mu B Expr

Here the output batch is allocated and returned by rewrite. Finally, there's the weird option,

rewrite :: forall A, Mu A Expr -> Mu A Expr

which puts rewritten terms back into the same batch it read from. This is kind of what, say, e-graphs do, in that they apply rewrite rules to expressions and put the results back into the same "pool" as the original expressions, ready for the next iteration.

Functions as vectors

The other big idea with batches is that an (immutable) batch describes a closed, finite universe of objects, so a function over that universe can be represented (materialized) by a simple vector. So, for example, take the length function above, with type:

length :: forall A, Mu A IntList' -> Integer

Instead of applying this to a specific index in the batch, we could apply it to all the indices in the batch, store the results in an output vector, and then, instead of calling the function, index into the vector.

I think of this as a replacement for fix. When we fix a function from Mu<A, IntList'>, we could allocate a vector of length equal to the size of A, evaluate the function on all of the nodes in A, and just keep the vector. Note the order of operations here: you first pass in the batch A, and that returns a fix'd lambda, which is materialized to a vector. So you end up with length represented by:

mat_length :: forall A, Column A Integer

Where Column A X is defined as Vector X, but the type parameter A keeps track of how long it is.

If you recall that length is defined in terms of a fix, you can imagine the fix operation itself constructing this column and returning a function that indexes directly into it.
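
Here's a sketch of that materialization, continuing the toy encoding from above and assuming the batch is stored bottom-up (children at smaller indices than their parents), so a single left-to-right pass can fill the column; materializeB, lengthStep, and matLength are my names:

import qualified Data.Foldable as F

-- a fix replacement: fill in a column, where each row's computation
-- may look up earlier (already-valid) rows of the same column
materializeB :: ((Idx b -> a) -> f (Idx b) -> a) -> Batch b f -> Seq.Seq a
materializeB step (Batch nodes) = F.foldl' go Seq.empty nodes
  where
    go col node = col Seq.|> step (\(Idx i) -> Seq.index col i) node

-- the length' step from earlier, now over indices instead of Mu values
lengthStep :: (Idx b -> Int) -> IntList' (Idx b) -> Int
lengthStep r (Cons' _ l') = 1 + r l'
lengthStep _ Nil'         = 0

matLength :: Batch b IntList' -> Seq.Seq Int   -- i.e., Column b Integer
matLength = materializeB lengthStep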

Desugaring to batches

With all of that background out of the way, here's how I conceive of the batch problem now.

Imagine first writing a normal program over recursive data structures. This program is written in a normal lambda calculus with explicit Mu types and fix combinators. It's convenient if we give regular lambdas and fix'd functions different types: regular lambdas a -> b are eliminated by function calls f(x), while "queries" a ->> b are eliminated by special function calls query f x.

To get the equivalent program over batches, we have to annotate every Mu type with a batch. To do so, we'll need to:

  • Add terms to allocate new batches
  • Identify which Mu types go into which batches
  • Determine how those Mu types are quantified

Also, since fix terms now have effects, we will need to describe the order in which fix'd functions are materialized. At the type level, that means type abstraction (forall) over batches and then type application to eliminate those abstractions.

Here's a toy but somewhat complex example. Suppose you want to convert a binary tree into a height-annotated one:

data Tree r = Tree r r | Nil
data ATree r = ATree Int r r | Nil

height :: Mu Tree ->> Int
height =
  fix \r t.
    case (unfold t) of
    | Tree a b => 1 + max (r a) (r b)
    | Nil => 0

add_height :: Mu Tree ->> Mu ATree
add_height =
  fix \r t.
    case (unfold t) of
    | Tree a b =>
      fold (ATree (query height t) (r a) (r b))
    | Nil => fold Nil

This is the definition you'd normally write, except:

  • All recursive data types are written as functors, requiring explicit Mu constructors;
  • All uses of Mu types require manual fold and unfold calls;
  • Every recursive function requires an explicit fix call and thereby has type a ->> b;
  • Every call to a recursive function (like the call to height inside add_height) needs to use query to eliminate the ->> type.

To turn it into batches, we need to make a bunch of changes:

  • height needs to have type forall A, Mu A Tree ->> Int; this means height needs to be specialized to a specific batch A first, at which point you can compute the height of any or all of the trees in the batch A.
  • add_height probably needs to have type forall A B, Mu A Tree ->> Mu B ATree. In principle it could allocate the output batch itself, but I don't see the point; might as well pass in the output batch from the caller.
  • add_height queries height with a term of type Mu A Tree, so we need to specialize height's new forall term with A. We should do that specialization once, because each specialization is expensive.

More formally, if you take a normal program, with explicit Mu, fix, fold, unfold, and query terms, I think you can turn it into a batch program by:

  • For each Mu type, pick a fresh variable for the batch it is from.
  • Run unification in case arguments have to come from the same batch.
  • For each function, universally quantify over the batches.
  • Type check the body of each function. You need to figure out the type specializations of each fold, unfold, and query term.
  • Most of these will be type directed; in some cases maybe not (like the term unfold (fold Nil), which constructs a temporary batch and reads back out of it), in which case I think it's fine to use Heap as the batch.
  • Now each fold, unfold, and query has an explicit batch annotation, so everything uses a batch.
  • Lift out all arguments to query (like height A below) into let bindings.

That gets you to here:

data Tree r = Tree r r | Nil
data ATree r = ATree Int r r | Nil

height :: forall A, Mu A Tree ->> Int
height = \A.
  fix \r t.
    case (unfold A t) of
    | Tree a b => 1 + max (r a) (r b)
    | Nil => 0

add_height :: forall A B, Mu A Tree ->> Mu B ATree
add_height = \A B.
  let height_A = height A in
  fix \r t.
    case (unfold A t) of
    | Tree a b =>
      fold B (ATree (query height_A t) (r a) (r b))
    | Nil => fold B Nil

Note that when add_height calls height A, that returns a fix term, so add_height actually has this body:

add_height =
  let height_A = fix ... in
  fix ...

Since I said fix actually constructs a vector and evaluates the function on every element of the input batch, this means that specializing add_height computes columns of all the heights, then all the annotated trees, and a query of that specialized add_height just indexes into those columns.
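
In the column picture, a query of that materialized function is then just an indexing operation; a tiny sketch in the toy encoding from above (queryB is my name):

-- querying a materialized ->> function = indexing into its column
queryB :: Seq.Seq a -> Idx b -> a
queryB col (Idx i) = Seq.index col i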

Now suppose you want to run this on a specific tree:

main =
  let t = ... in
  print (query add_height t)

Then you need to specialize that to batches too, but this time you can actually go ahead and just allocate the batches:

main =
  let A = Batch.new in
  let t = Batch.add A (...) in
  let B = Batch.new in
  let add_height_AB = add_height A B in
  print (query add_height_AB t)

In summary: this is a sketch, but I think there is a systematic way to turn non-batch code into batch code, and it can be done in a fairly type-directed way. And there is some flexibility in that conversion, in terms of what order you specialize things in, and that flexibility is what the schedule comes down to.

Scheduling as semi-naive evaluation

Another, separate line of work that I found useful is the idea of semi-naive evaluation in Datalog. In Datalog we are interested in computing "facts", and facts are only ever discovered, never retracted (they are monotonic). This means that, in the middle of a computation, we have "new" facts and "old" facts, where "old" means that we've already processed their consequences.

I think the same idea is applicable to batches. In some cases you are materializing a recursive query over an immutable batch; there, you can just apply the function to every element of the batch. But in some of the more complicated, mutually recursive cases I've looked at, you need to compute data on elements of a mutable batch; for example, you might need to compute some semantic details of terms in a batch in order to construct more terms in that batch.

Here the idea behind "semi-naive" is that, as discussed last time, every column or batch is valid up to a certain index. You start by considering zero indices of the input columns, with no valid indices in the output columns. Everything is "quiescent": you've computed every fact that is currently possible. Next, you "reveal" one more row of the input columns; that makes a bunch of new rows of the output columns computable, and you compute them in some order. Often you have some freedom in choosing that order, what you allocate when, and so on. Then you repeat with the next row.

So the idea is that you'd write a schedule that looks something like this:

bottom-up A
  materialize (height A)
  materialize (add_height A B)

This schedule means that, for every row of A, we first compute height for that row, and then compute add_height for that row. So this fills in both height and add_height in parallel. Specifically, we imagine that A starts entirely invalid, and then we repeatedly "reveal" a row of A and run the body, which updates both height A and add_height A B to take advantage of the new row.

Alternatively, consider this schedule:

bottom-up A
  materialize (height A)
bottom-up A
  materialize (add_height A B)

Here we first fill in height, then we repeatedly "reveal" one row at a time of both height and A (since both have the same index space) and compute add_height. So in this case we compute all the heights first, then all the annotated trees.
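
To make the difference concrete, here's a sketch of both schedules in the toy Haskell encoding from above (heightStep, onePass, and twoPass are my names; the output batch B that add_height writes into is elided, so the sketch only fills columns):

-- the height step over batch indices
heightStep :: (Idx b -> Int) -> Tree (Idx b) -> Int
heightStep rec (Tree x y) = 1 + max (rec x) (rec y)
heightStep _ Nil          = 0

-- Schedule 1: one fused bottom-up pass filling two columns; the second
-- step may read the first column up to and including the current row.
onePass :: ((Idx b -> a) -> f (Idx b) -> a)
        -> ((Idx b -> a) -> (Idx b -> c) -> f (Idx b) -> c)
        -> Batch b f -> (Seq.Seq a, Seq.Seq c)
onePass step1 step2 (Batch nodes) = F.foldl' go (Seq.empty, Seq.empty) nodes
  where
    go (c1, c2) node =
      let c1' = c1 Seq.|> step1 (\(Idx i) -> Seq.index c1 i) node
          get1 (Idx i) = Seq.index c1' i   -- current row of height is valid
          get2 (Idx i) = Seq.index c2 i
      in (c1', c2 Seq.|> step2 get1 get2 node)

-- Schedule 2: two separate passes; the first column is fully
-- materialized before the second pass reads it.
twoPass :: ((Idx b -> a) -> f (Idx b) -> a)
        -> ((Idx b -> a) -> (Idx b -> c) -> f (Idx b) -> c)
        -> Batch b f -> (Seq.Seq a, Seq.Seq c)
twoPass step1 step2 batch =
  let c1 = materializeB step1 batch
      get1 (Idx i) = Seq.index c1 i
  in (c1, materializeB (step2 get1) batch)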

You can write incorrect schedules, like this:

bottom-up A
  materialize (add_height A B)
  materialize (height A)

What's incorrect about this is that add_height reads from both A and height, but at the moment it runs, those are valid up to different indices: the row of height that add_height needs hasn't been computed yet.

You can also write complex schedules, like this:

bottom-up A
  materialize reduce A B
  bottom-up B
    materialize gather-additive-terms B

This is the schedule from the previous post, where reduce does some simplification, adds to the batch B, and uses gather-additive-terms to perform normalization. So we first do the rewriting, and then we need to call gather-additive-terms for each added term. Note that the B batch is filled in one go by reduce, but the gather-additive-terms column is filled a bit at a time, as stuff is added to B.

Conclusion

I'm three days into working on this post, so it's time to post it even though it's a bit of a mess. But here's what I think are the most valuable ideas:

  • I think batches are all about specializing Mu and fix to refer to finite universes of terms, which allows them to be materialized into vectors. Mu corresponds to batches and fix to columns, including columns that map between index spaces.
  • You can convert a regular lambda calculus program into a batch one in a type-directed way, using type annotations and type specialization. There's probably a canonical / free way to do this.
  • Then the order in which fix terms actually fill in columns is the schedule.
  • That schedule is described by "semi-naive" evaluation where each index space is revealed step by step, and that then updates output columns.

I'm not 100% sure on the scheduling language here but I think something can be worked out. Maybe another post.

Footnotes:

1

Plus various identities linking these two functions with the fmap on F.

2

It might involve returning really large structures across the stack, of course.

3

Which, fun fact, would basically transform the linked list into a vector.4

4

A cool follow-on to that is that if you apply this same trick to nat, the canonical inductive data structure, what you get is a vector of basically unit type, which has zero length. So the vector is just an integer.