Egraphs Modulo T (Attempt 1)
Phil Zucker has been writing about Egraphs modulo T for a while now; see his latest blog post and latest paper for up-to-date efforts. I've been thinking about this too, and this post is my attempt to summarize. I'll be referencing Phil's ideas repeatedly, since all of my thoughts trace back to his.
What is an E-graph (abstractly)
An e-graph is for conjunctive reasoning with equality. Conjunctive means we don't do any DPLL-style guesswork and we therefore don't need to backtrack; it's an interesting sweet-spot, logically speaking. The traditional E-graph is for conjunctive reasoning about equality of uninterpreted functions. The broad goal of the Egraphs modulo T effort is to figure out how to generalize this to arbitrary theories.
Phil taught me that an e-graph is made of two things:
- A term bank, the set of terms in the e-graph. In a normal e-graph, the term bank is the hash-cons. The term bank plays a big role in Phil's bottom-up e-matching, but a smaller role in my thoughts.
- A canonicalization procedure. I mean: normalization of terms, modulo known equalities of other terms. In a normal e-graph, this is both the hash-cons (for normalizing e-nodes) and the union-find (for normalizing e-classes), and normalization for e-graphs kind of bounces between the two.
Before I continue, I want to point out how non-obvious this is. Years ago, I would tell people that an e-graph is a bipartite graph of e-nodes and e-classes. I think that's how the Denali paper introduces it, which is where I learned e-graphs. But the relational E-matching paper showed that you don't need a graph at all: the hash-cons and union find are what really matter, and the actual graph structure was just a (bad) indexing data structure.
And even then, we thought the hash-cons and union-find were two independent pieces. (By the way, the egg paper does introduce it this way, though we didn't understand the significance at the time.) But Phil pointed out that the hash-cons plays two roles (term bank and e-node normalizer), and it's better to think of these roles as separate, with their sharing a data structure being a convenient implementation trick.
Ok, anyway: an e-graph is two things: a term bank and a canonicalization procedure. The term bank is simple: a list of "interesting" terms, where "interesting" plays a role in pattern matching but is pretty free-form. Canonicalization is tricky, but it's something SMT solvers already have for many theories, so it's not something we're inventing from scratch.
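To make the traditional case concrete, here is a minimal sketch of that bouncing-between-the-two canonicalization, assuming (hypothetically) that e-nodes are (op, child e-class ids) pairs:

```python
from collections import namedtuple

# A minimal sketch of traditional e-graph canonicalization, assuming
# e-nodes are (op, child e-class ids) pairs, `find` is the union-find
# lookup, and `hashcons` maps canonical e-nodes to e-class ids.
Node = namedtuple("Node", ["op", "children"])

def canonicalize(node, find, hashcons):
    # The union-find normalizes each child e-class...
    node = Node(node.op, tuple(find(c) for c in node.children))
    # ...and the hash-cons normalizes the e-node itself to an e-class.
    return find(hashcons[node]) if node in hashcons else None
```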
Let me add where I diverge from Phil here: E-matching. I'll return to it later.
Building a generic e-graph
So Phil then proposes the following recipe for building an E-graph modulo T. First, create a generic term bank. Conceptually, this is just a set of terms, and that's not a bad implementation either. What you consider interesting is up to you, and probably not that important. The add method of an e-graph might add terms to the term bank.
Then you have an index structure. Conceptually, the index structure is a set of asserted term equalities, but in practice you use some fancy data structure to make normalization fast. The union method of an e-graph adds term equalities to the index structure.
Then there's a rebuild method. It does two things. First, it updates the index structure, doing whatever fancy thing is necessary to make normalization fast now that you've added new term equalities. This typically involves normalizing all of the equalities in the index structure. Second, the rebuild method might also normalize all the terms in the term bank, just to keep things nice and tidy.
Then there's the add method, again, which returns the normal form of the term that you add. You can test equality of terms by adding both to the e-graph and comparing the results, since equal terms normalize to the same thing. In principle, the "normalization" part of add and the "term bank" part of add could be separate methods, or not.
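Here is a minimal sketch of this recipe; every name in it is hypothetical, terms are assumed to be orderable values like strings, and congruence (which a real EUF index would restore during rebuild) is ignored:

```python
# A minimal sketch of the generic e-graph recipe, with all names
# hypothetical. The index structure is a dict mapping each term to a
# smaller equal term.
class EGraph:
    def __init__(self):
        self.term_bank = set()  # the "interesting" terms
        self.index = {}         # asserted equalities: term -> smaller term

    def normalize(self, term):
        # Follow asserted equalities down to a fixed point.
        while term in self.index:
            term = self.index[term]
        return term

    def add(self, term):
        self.term_bank.add(term)
        return self.normalize(term)  # add returns the normal form

    def union(self, s, t):
        s, t = self.normalize(s), self.normalize(t)
        if s != t:
            # Orient the equality larger -> smaller so normalize terminates.
            self.index[max(s, t)] = min(s, t)

    def rebuild(self):
        # Re-point every equality at its normal form, and tidy the bank.
        self.index = {t: self.normalize(u) for t, u in self.index.items()}
        self.term_bank = {self.normalize(t) for t in self.term_bank}
```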
The description above uses the "e-class-free" style, where e-classes aren't exposed to the user and might not even exist; there are just terms (e-nodes), which might be normalized or not. If you want the "e-class-full" description, then e-classes represent normalized terms while e-nodes represent un-normalized terms. In this world, since add returns normalized terms, it returns e-classes. That said, I think with general theories there often isn't a good separate data structure for normalized terms, so the e-class-free perspective is typically easier.
Matching
This is all well and good and matches how SMT solvers use e-graphs. But most e-graph users want rewrite rules. How do those work?
Phil has an easy but brilliant solution: leverage the term bank. That is, if you have a rewrite rule, instantiate it with every single term in the term bank, for every single variable. That gives you a whole bunch of valid equalities; filter for the ones where the LHS is "interesting" (meaning, in the term bank), and add the resulting equalities.
This is genius, in that it works for any theory whatsoever, but it's also dumb, because it has close to the worst possible performance, and the theories don't improve performance; they kind of don't do anything.
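For concreteness, here is a sketch of that instantiation loop, assuming the hypothetical EGraph above and a rewrite rule given as two Python functions from variable bindings to terms:

```python
from itertools import product

# A sketch of bottom-up instantiation: try every tuple of term-bank
# members as a variable binding, keep the "interesting" instantiations.
def bottom_up_match(egraph, lhs, rhs, arity):
    for binding in product(list(egraph.term_bank), repeat=arity):
        l = lhs(*binding)
        # Keep only instantiations whose LHS is already in the term bank.
        if l in egraph.term_bank:
            egraph.union(l, rhs(*binding))
```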
I think about matching differently, and the inspiration comes from the relational e-matching paper. Relational e-matching points out that a "pattern" like f(g(?x, ?y), ?x) is actually a conjunctive query, like this:
∃ x, ∃ y, ∃ z, ∃ w, z = f(w, x) ∧ w = g(x, y)
Note the structure of this query:
- There are quantifiers. Those quantifiers are in a certain order, which ends up being important to performance.
- Inside the quantifier there's a set of conjuncts, which are each a kind of minimal, atomic uninterpreted-function constraint.
- Just like the rest of the e-graph, the conjuncts are terms, but terms over the quantified variables.
It makes sense that we do conjunctive queries in our data structure that is all about conjunctive reasoning. The relational e-matching paper presents an interesting procedure to solve these queries.
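Here is a sketch of that translation, assuming (hypothetically) that patterns are nested tuples like ("f", ("g", "?x", "?y"), "?x") and pattern variables are strings starting with "?":

```python
from collections import namedtuple
from itertools import count

# A sketch of flattening a pattern into conjuncts. Each conjunct asserts
# out = op(args), where out is a fresh variable naming that subterm.
Conjunct = namedtuple("Conjunct", ["out", "op", "args"])

def compile_pattern(pattern):
    fresh, conjuncts = count(), []
    def walk(p):
        if isinstance(p, str):   # a pattern variable like "?x"
            return p
        op, *args = p
        args = tuple(walk(a) for a in args)
        out = f"?v{next(fresh)}" # fresh variable for this subterm
        conjuncts.append(Conjunct(out, op, args))
        return out
    return walk(pattern), conjuncts
```

On f(g(?x, ?y), ?x) this yields exactly the two conjuncts above, with the fresh variables ?v0 and ?v1 standing in for w and z.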
First, build an index for each conjunct in the query. The index depends on the variable order. This confused me for a while, so let me say it again. Build an index for each conjunct in the query. This is very important, so I'm going to elaborate on it at length.
In the relational e-matching paper, focused on uninterpreted functions, the term bank is structured as a set of relations. There's one relation per function symbol, and the arity of that relation is the function's arity plus one (for the output). The relation is a relation over e-classes. The index used is a trie. For a long time I thought you built one trie per relation. No. You build one trie per constraint in the query. If f(x, y) and f(y, x) both appear in the query, you need to build two tries for the f relation, because if the variable order is x, y, the two tries go arg1, arg2 and arg2, arg1.
So: one index per conjunct.
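Here is a sketch of building those per-conjunct tries, assuming each relation is a set of tuples of e-class ids:

```python
# A sketch of building one trie per conjunct. `order` lists that
# conjunct's columns in the query's variable order: e.g. (0, 1) for
# f(x, y) but (1, 0) for f(y, x) under the variable order x, y.
def build_trie(relation, order):
    trie = {}
    for row in relation:
        node = trie
        for col in order:  # insert the columns in variable order
            node = node.setdefault(row[col], {})
    return trie
```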
What does one do with the indices? Well, for each constraint, you look up the valid values of the first argument, x. Now you have one set of x values per constraint.[1] You intersect these sets; that gives you the set of x values that works for all constraints.
Then you recurse: for each possible value of x in this intersection, you substitute it into all constraints and repeat with the next variable.
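Here is a sketch of that recursion, assuming for simplicity that every conjunct mentions every variable; a conjunct that skips a variable would instead be carried along unchanged, per the footnote:

```python
# A sketch of generic join over the per-conjunct tries built above. At
# each level we intersect the tries' keys, bind the next variable to
# each surviving value, descend one trie level, and recurse.
def generic_join(tries, nvars, binding=()):
    if len(binding) == nvars:  # all variables bound: report a match
        yield binding
        return
    # Values of the next variable that survive every conjunct's trie.
    values = set.intersection(*(set(t) for t in tries))
    for v in values:
        yield from generic_join([t[v] for t in tries], nvars, binding + (v,))
```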
Matching is cell decomposition
To me this seems suspiciously similar to quantifier elimination. We go variable by variable; describe all valid values of the variable; and substitute those into the formula and recurse. Specifically, this reminds me of cell decomposition in cylindrical algebraic decomposition, and Zach pointed out it also seems like Cooper's algorithm for linear integer arithmetic.
In cell decomposition, you have a set of cells: primitive atomic predicates of a single variable x. Any formula in one variable can be rewritten as a disjunction of cells. Let me give a few examples of what this means:
- In relational algebra, cells are equalities x = a.
- In linear integer arithmetic, cells are inequalities x ≥ a intersected with modular congruences x ≡ a mod b.
- In nonlinear real arithmetic, cells are polynomial roots (and the intervals between them).
Next, we start with a logical formula:
\[ \exists x, \exists y, ..., \phi_1(x, y, ...) \land \phi_2(x, y, ...) \land ... \]
We then weaken this formula by considering each conjunct separately:
\[ \left(\exists x, \exists y, ..., \phi_1(x, y, ...)\right) \land \left(\exists x, \exists y, ..., \phi_2(x, y, ...)\right) \]
For each conjunct, we go under the ∃ x binder and eliminate the other quantifiers, resulting in:
\[ \phi_1'(x) \land \phi_2'(x) \land \dotsb \]
These are \(\phi_i'\), not \(\phi_i\), because we've eliminated quantifiers. That gives us quantifier-free formulas in x, each of which we can write as a disjunction of cells \(c_{ij}\):
\[ (c_{11}(x) \lor c_{12}(x) \lor \dotsb) \land (c_{21}(x) \lor c_{22}(x) \lor \dotsb) \land \dotsb \]
Now you intersect the cells; this has to be possible, because the above is just a formula in one variable, so it must have a cell decomposition. That gives you:
\[ c_1(x) \lor c_2(x) \lor \dotsb \]
Now you can go back to your original formula and make a substitution for each cell.
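As a concrete instance of the intersection step, here is a sketch for the linear-integer-arithmetic cells above, taking a cell to be x ≥ lo together with x ≡ r (mod m): bounds intersect by taking the max, congruences by the Chinese remainder theorem:

```python
from dataclasses import dataclass
from math import gcd

# A sketch of intersecting two linear-integer-arithmetic cells, each of
# the form x >= lo /\ x ≡ r (mod m).
@dataclass
class LiaCell:
    lo: int  # lower bound
    r: int   # residue
    m: int   # modulus

def intersect(a, b):
    g = gcd(a.m, b.m)
    if (a.r - b.r) % g:
        return None  # the two congruences are incompatible: empty cell
    lcm = a.m // g * b.m
    # Chinese remainder: solve a.r + a.m * t ≡ b.r (mod b.m) for t.
    t = (b.r - a.r) // g * pow(a.m // g, -1, b.m // g) % (b.m // g)
    return LiaCell(max(a.lo, b.lo), (a.r + a.m * t) % lcm, lcm)
```

For example, intersecting x ≥ 0 ∧ x ≡ 1 (mod 4) with x ≥ 3 ∧ x ≡ 3 (mod 6) gives x ≥ 3 ∧ x ≡ 9 (mod 12).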
So now let's recap what we need from a theory for pattern-matching to work. First, we need a set of patterns. These are the primitive logical formulas that can be conjoined together to make a query. Second, we need to be able to quantifier-eliminate patterns into disjunctions of cells. Third, we need to be able to flatten a CNF of cells back into a disjunction of cells. Fourth, we need to be able to substitute cells into patterns, resulting in patterns.
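To make the recap concrete, here is one way the per-theory interface might look; these names and signatures are hypothetical placeholders, not a settled design:

```python
from typing import Any, Iterable, Protocol

Pattern = Any  # placeholder: each theory supplies its own pattern type
Cell = Any     # placeholder: each theory supplies its own cell type

class Theory(Protocol):
    def eliminate(self, pattern: Pattern, var: str) -> Iterable[Cell]:
        """Quantifier-eliminate var from pattern, yielding a disjunction of cells."""
    def combine(self, cnf: Iterable[Iterable[Cell]]) -> Iterable[Cell]:
        """Flatten a conjunction of cell-disjunctions into one disjunction of cells."""
    def substitute(self, pattern: Pattern, var: str, cell: Cell) -> Pattern:
        """Substitute a cell for var in a pattern, yielding another pattern."""
```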
This also brings us back to the term bank. The theory of uninterpreted functions with equality, EUF, doesn't support quantifier elimination, which means the algorithm above shouldn't work. The way we make it work is simple: model-based quantifier instantiation. This requires us to have an explicit model, which is what the term bank offers. In this sense, the term bank is actually part of the index structure, much like the tries for the trie join.
Mixing theories
Finally, let's think about mixed e-graphs; all realistic uses of e-graphs involve uninterpreted functions plus some other theory.
In a mixed e-graph, terms mix subterms from each theory. That is, each theory provides a type of terms, and terms can have a type signature so that you mix theories in structured ways.
We also have an index structure for each theory which stores a list of asserted equalities in that theory and helps normalize that theory's terms. There might be a separate type for normalized terms (e-classes), but there doesn't have to be, and it doesn't make a big difference if there is one. The important thing is that there is normalization.
The add method normalizes a term (and, for EUF, updates the term bank). The union method asserts an equality. If you have a separate type for normalized terms, that's the return type of add and the argument type for union.
Each theory also provides a type of patterns, which each have a type signature over theory variables. These form the primitive conjuncts in a query. The query is existentially quantified over all its variables, and those variables are in some order.
Each theory also provides a type of cells, which should be a subtype of its patterns: it must be possible to convert any positive formula over patterns into a disjunction of cells, and every cell should itself be a pattern.
Finally, each theory should provide a quantifier elimination procedure. Given a pattern (from any theory), it must be possible to eliminate a quantifier (of the theory's type) and return a disjunction of cells for that variable. Those disjunctions can then be conjoined across all patterns and simplified back into a disjunction of cells; that final disjunction of cells is then used as the bindings for the variable and substituted into all the other patterns.
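Here is a sketch of that per-variable loop, using the hypothetical Theory interface above and glossing over the cross-theory plumbing discussed next:

```python
# A sketch of the per-variable matching loop: eliminate the variable
# from every pattern, flatten the resulting CNF of cells into one
# disjunction, then substitute each cell and recurse.
def solve(patterns, variables, theories, binding=()):
    if not variables:
        yield binding
        return
    var, rest = variables[0], variables[1:]
    th = theories[var]  # the theory owning this variable
    cells = th.combine(th.eliminate(p, var) for p in patterns)
    for cell in cells:
        substituted = [th.substitute(p, var, cell) for p in patterns]
        yield from solve(substituted, rest, theories, binding + (cell,))
```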
Note that there are two cross-theory operations: it should be possible to eliminate a T-variable x from an S-pattern P(x, y, ...) and to substitute a T-cell c into an S-pattern P(x, y, ...). These are weird: in a typical SMT setup, theories don't share anything except the notion of equality. It might be possible to fix this. For example, in cylindrical algebraic decomposition, you typically sample a value from each cell. Sampling converts a cell into a trivial cell x = c, so maybe different theories can exchange just these simple values; I'm not sure. But it should also be possible to write these cross-theory functions for a couple of hand-picked theories, and thereby get extensions of traditional e-graphs to a couple of extra theories. Next time, I'll try to do that.
Footnotes:
[1] A constraint might not use x; in that case we ignore it, or think of it as producing the universal set of x values.