https://cfallin.org/blog/2026/04/09/aegraph/
Today, I'll be writing about the aegraph, or acyclic egraph, the
data structure at the heart of Cranelift's mid-end optimizer. I
introduced this approach in
2022
and, after a somewhat circuitous path involving one full rewrite, a
number of interesting realizations and "patches" to the initial idea,
various discussions with the wider e-graph community (including a
talk
(slides)
at the EGRAPHS workshop at PLDI
2023 and a recent talk
and discussions at the e-graphs Dagstuhl
seminar),
and a whole bunch of contributed rewrite rules over the past three
years,
it is time that I describe the why (why an e-graph? what benefits
does it bring?), the how (how did we escape the pitfalls of full
equality saturation? how did we make this efficient enough to
productionize in Cranelift?), and the how much (does it help? how
can we evaluate it against alternatives?).
For those who are already familiar with Cranelift's mid-end and its
aegraph, note that I'm taking a slightly different approach in this
post. I've come to the viewpoint that the "sea-of-nodes" aspect of our
aegraph, and the translation passes we've designed to translate into
and out of it (with optimizations fused in along the way), are
actually more fundamental than the "multi-representation" part of the
aegraph, or in other words, the "equivalence class" part itself. I'm
choosing to introduce the ideas from sea-of-nodes-first in this post,
so we will see a "trivial eclass of one enode" version of the aegraph
first (no union nodes), then motivate unions later. In actuality, when I was experimenting with, and then building, this functionality in Cranelift in 2022, the desire to integrate e-graphs came first, and aegraphs were created to make them practical; the pedagogy and design taxonomy have only become clear to me over time. With that, let's jump in!
Initial context: Fixpoint Loops and the Pass-Ordering Problem
Around May of 2022, I had introduced a simple alias analysis and
related
optimizations
(removing redundant loads, and doing store-to-load forwarding). It
worked fine on all of the expected test cases, and we saw real speedup
on a few benchmarks (e.g., 5% on meshoptimizer here), but it also raised a new question: how should we integrate this pass
with our other optimization passes, which at the time included GVN
(global value numbering), LICM (loop-invariant code motion), constant
propagation and some algebraic rewrites?
To see why this is an interesting question, consider how GVN, which
canonicalizes values, and redundant load elimination interact, on the
following IR snippet:
v2 = load.i64 v0+8
v3 = iadd v2, v1 ;; e.g., array indexing
v4 = load.i8 v3
;; ... (no stores or other side effects here) ...
v10 = load.i64 v0+8
v11 = iadd v10, v1
v12 = load.i8 v11
Redundant load elimination (RLE) will be able to see that the load
defining v10 can be removed, and v10 can be made an alias of
v2, in a single pass. In a perfect world, we should then be able to
see that v11 becomes the same as v3 by means of GVN's
canonicalization, and subsequently, v12 becomes an alias of
v4. But those last two steps imply a tight cooperation between two
different optimization passes: we need to run one full pass of RLE
(result: v10 rewritten), then one full pass of GVN (result: v11
rewritten), then one additional full pass of RLE (result: v12
rewritten). One can see that an arbitrarily long chain of such
reasoning steps, bouncing through different passes, might require an
arbitrarily long sequence of pass invocations to fully simplify. Not
good!
This is known as the pass-ordering problem in the study of compilers
and is a classical heuristic question with no easy answers as long as
the passes remain separate, coarse-grained algorithms (i.e., not
interwoven). To permit some interesting cases to work in the initial
Cranelift integration of alias analysis-based rewrites, I made a
somewhat ad-hoc
choice
to invoke GVN once after the alias-analysis rewrite pass.
But this is clearly arbitrary, wastes compilation effort in the common
case, and we should be able to do better. In general, the solution
should reason about all passes' possible rewrites in a unified
framework, and interleave them in a fine-grained way: so, for example,
if we can apply RLE then GVN five times in a row just for one
localized expression, we should be able to do that, without running
each of these passes on the whole function body. In other words, we
want a "single fixpoint loop" that iterates until optimization is done
at a fine granularity.
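As a strawman, the coarse-grained alternative can be sketched as an outer fixpoint over whole-function passes (pass names and signatures here are illustrative, not Cranelift's actual API):

```python
# Naive "outer fixpoint": re-run every whole-function pass until none of
# them reports a change. This converges, but each round traverses the
# entire function even when only one small expression needed the next
# RLE -> GVN -> RLE step -- exactly the waste we want to avoid.
def optimize_to_fixpoint(func, passes):
    rounds = 0
    changed = True
    while changed:
        changed = False
        for run_pass in passes:
            # each pass returns True if it rewrote anything
            changed = run_pass(func) or changed
        rounds += 1
    return rounds
```

The number of rounds is bounded only by the length of the longest chain of cross-pass reasoning steps, which is why we want the fixpoint at the granularity of individual expressions instead.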
Three Building Blocks: Rewrites, Code Motion, and Canonicalization
Let's review the optimizations we had at this point:
- GVN (global value numbering), which is a canonicalization operation: within a given scope where a value is defined (for SSA IRs, the subtree of the dominance tree below a given definition), any identical computations of that value should be canonicalized to the original one.
- LICM (loop-invariant code motion), which is a code-motion operation: computations that occur within a loop, but whose value is guaranteed to be the same on each iteration, should be moved out. Loop invariance can be defined recursively: a value is loop-invariant if it is defined outside the loop, or if it is a pure operator inside the loop whose arguments are all loop-invariant. The transform doesn't change any operators; it only moves where they occur.
- Constant propagation (cprop) and algebraic rewrites: these are transforms like rewriting 1 + 2 to 3 (cprop) or x + 0 to x (algebraic). They can all be expressed as substitutions for expressions that match a given pattern.
- Redundant load elimination and store-to-load forwarding: these both replace load operators with the SSA value that operator is known to load.
- And one that we wanted to implement: rematerialization, which reduces register pressure for values that are easier to recompute on demand (e.g., integer constants) by re-defining them with a new computation. This can be seen as a kind of code motion as well.
As a start to thinking about frameworks, we can categorize the above
into code motion, canonicalization, and rewrites. Code motion is
what it sounds like: it involves moving where a computation occurs,
but not changing it otherwise. Canonicalization is the unifying of
more than one instance of a computation into one ("canonical")
instance. And rewrites are any optimization that replaces one
expression with another that should compute the same value. Said more
intuitively (and colloquially), these three categories attempt to
cover the whole space of possibilities for "simple" optimizations: one
can move code, merge identical code, or replace code with equivalent
code. (The notable missing possibility here is the ability to change
control flow and/or make use of control-flow-related reasoning; more
on that in a later section.) Thus, if we can build a framework that
handles these kinds of transforms, we should have a good
infrastructure for the next steps in Cranelift's evolution.
IR Design, Sea-of-Nodes, and Intermediate Points
From first principles, one might ask: how should a unifying
framework for these concerns look? Code motion and canonicalization
together imply that perhaps computations (operator nodes) should not
have a "location" in the program, whenever that can be avoided. In
other words, perhaps we should find a way to represent add v1, v2 in
our IR without putting it somewhere concrete in the control flow. Then
all instances of that same computation would be merged (because
duplicates would differ only by their location, which we removed), and
code motion is... inapplicable, because code does not have a location?
Well, not quite: the idea is that one starts with a conventional IR
(with control flow), and ends with it too, but in the middle one
can eliminate locations where possible. So in the transition to this
representation, we erase locations, and canonicalize; and in the
transition from this representation, we re-assign locations, and
code-motion can be a side-effect of how we do that.
What we just described above is called a
sea-of-nodes IR. A
sea-of-nodes IR is one that dispenses with a classical "sequential
order" for all instructions or operators in the program, and instead
builds a graph (the "sea") of operators (the "nodes") with edges to
denote the actual dependencies, either for dataflow or control flow.
In the purest form of this design, one can represent every IR
transform as a graph rewrite, because a graph is all there is. For
example, LICM, a kind of code motion that hoists a computation out of
a loop, is a purely algebraic rewrite on the subgraph representing the
loop body. This is because the loop itself is a kind of node in the
sea of nodes, with control-flow edges like any other edge; code motion
is not a "special" action outside the scope of the expression
language (nodes and their operands).
While that kind of flexibility is tempting, it comes with a
significant complexity tax as well: it means that reasoning through
and implementing classical compiler analyses and transforms is more
difficult, at least for existing compiler engineers with their
experience, because the IR is so different from the classical data
structure (CFG of basic blocks). The V8 team wrote about this
difficulty recently as
support for their decision to migrate away from a pure Sea-of-Nodes
representation.
However, we might achieve some progress toward our goal -- providing a
general framework for rewrites, code motion and canonicalization -- if
we take inspiration from sea-of-nodes' handling of pure
(side-effect-free) operators, and the way that they can "float" in the
sea, unmoored by any anchor other than actual inputs and outputs
(dataflow edges). Stated succinctly: what if we kept the CFG for the
side-effectful instructions (call it the "side-effect skeleton") and
used a sea-of-nodes for the rest?
[Figure: the CFG "skeleton" of side-effecting instructions, with pure operator nodes floating alongside it, attached only by dataflow edges.]
This would allow us to unify code motion, canonicalization and
rewrites, as described above: canonicalization works on pure
operators, because we remove distinctions based on location;
code-motion can occur when we put pure operators back in the CFG; and
rewrites can occur on pure operators. In fact rewrites are now both
(i) simpler to reason about, because we don't have to place expression
nodes at locations in an IR, only create them "floating in the air",
and (ii) more efficient, because they occur once on a canonicalized
instance of an expression, rather than all instances separately.
We'll call this representation a "sea-of-nodes with CFG".
Implementing Sea-of-Nodes-with-CFG
Now, to practical implementation: architecting the entire compiler
around sea-of-nodes for pure operators might make sense from first
principles, but as a modification of the existing Cranelift compiler
pipeline, we would not want to (or be able to) make such a radical
change in one step. Rather, I wanted to build this as a replacement
for the mid-end, taking CLIF (our conventional CFG-based SSA IR) as
input and producing CLIF as output. So we need a three-stage
optimizer:
- Lift all pure operators out of the CFG, leaving behind the skeleton. Put these operators into the "sea" of pure computation nodes, deduplicating (hash-consing) as we go.
- Perform rewrites on these operators, replacing some values with others according to whatever rules we have that preserve value equivalence.
- Convert this sea-of-pure-nodes back to sequential IR by scheduling nodes into the CFG. We'll call this process "elaboration" of the computations.
This is in fact how the heart of Cranelift's mid-end now works; we'll
go through each part above in turn.
Into Sea-of-Nodes-with-CFG: Canonicalization
Let's talk about how we get into the sea-of-nodes representation
first. The most straightforward answer, of course, would be to simply
"remove the nodes from the CFG" and let them free-float, referenced by
their uses that remain in the skeleton -- and that's it. But that
gives up on the obvious opportunity offered by the fact that these
operators are pure (have no side-effects, or implicit dependencies
on the rest of the world): an operator op v1, v2 always produces
the same value given the same inputs, and two separate instances of
this node have no distinguishing features or other properties that
should lead to different results. Hence, we should canonicalize, or
hash-cons, nodes.
Hash-consing is a
standard technique in systems that have value- or operator-nodes: the
idea is to keep a lookup table indexed by the contents of each value
or operator, perform lookups in this table when creating a new node,
and reuse existing nodes when a match occurs.
What is the equivalence class by which we deduplicate? (In other
words, more concretely, how do we define Eq and Hash on
sea-of-nodes values?) We adopt a very simple answer (and deal with
subtleties later, as is often the case!): the (shallow) content of a
given node is its identity. In other words, if we have iadd v1, v2,
then that is "equal to" (deduplicates with) any other such operator.
Now, this shallow notion of equality may not seem like enough to
canonicalize all instances of the same expression tree. Consider if we
had
v0 = ...
v1 = ...
v2 = iadd v0, v1
v3 = iconst 42
v4 = imul v2, v3
v5 = iadd v0, v1
v6 = iconst 42
v7 = imul v5, v6
Clearly any reasonable canonicalization algorithm should consider v4
and v7 to be the same, and condense uses of them into uses of one
canonical node. But the nodes are not shallowly equal. How do we
get from here to there?
One possible answer is induction: we could canonicalize a node only
after all of its operands have been canonicalized (and rewritten), so
we know that if subtrees are identical, we will have identical value
numbers. Thus, inductively, all values would be canonicalized deeply.
This requires processing definitions of a node before its uses,
however. Fortunately, the SSA CFG from which we are constructing the
sea-of-nodes-with-CFG provides us this property already if we traverse
it in a particular order: we need to visit blocks in the control-flow
graph in some preorder of the dominance tree (domtree), which we
usually have available already.
So we have an algorithm something like the following pseudo-code to
canonicalize the SSA CFG into a sea-of-nodes-with-CFG:
def canonicalize(basic_block):
    for inst in basic_block:
        if is_pure(inst):  # only dedup and move to sea-of-nodes for "pure" insts;
                           # leave the "skeleton" in place
            basic_block.remove(inst)
            inst.rename_values(rename_map)  # rewrite uses according to a value->value map
            if inst in hashcons_map:  # equality defined by shallow content
                rename_map[inst.value] = hashcons_map[inst]
            else:
                nodes.push(inst)  # add to the sea-of-nodes
                hashcons_map[inst] = inst.value
        else:
            # we still need to rename the CFG skeleton's uses to refer to sea-of-nodes
            inst.rename_values(rename_map)

    # recursive domtree-preorder traversal.
    for child in domtree.children(basic_block):
        canonicalize(child)
This will handle not only the above example, where we have "deep
equality" (because we will canonicalize and rename e.g. v5 into v2
before visiting v5's use), but also more complex examples with the
redundancies spread across basic blocks.
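To make the inductive argument concrete, here is a toy hash-consing table in the spirit of the pseudocode above (the representation -- value numbers indexing a node list -- is illustrative, not Cranelift's actual data structure). Because operands are renamed to canonical value numbers before a node is interned, shallow equality on (opcode, operand value numbers) gives deep structural equality:

```python
# Minimal hash-consing sketch: a node's identity is its shallow content,
# i.e. its opcode plus its (already-canonical) operand value numbers.
class Sea:
    def __init__(self):
        self.nodes = []     # value number -> (op, args)
        self.hashcons = {}  # (op, args) -> value number

    def make_node(self, op, args):
        key = (op, tuple(args))
        if key in self.hashcons:   # shallow equality: same op, same operand vns
            return self.hashcons[key]
        vn = len(self.nodes)
        self.nodes.append(key)
        self.hashcons[key] = vn
        return vn
```

Replaying the v2/v5, v4/v7 example through this table, the second iadd and imul dedup to the first ones because their operands were already canonicalized.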
Finally: how does the "-with-CFG" aspect of all of this work? So far, we
have very much glossed over any values that are defined in the CFG
skeleton, other than to imply above that they are never renamed (because
we never take the is_pure branch). But is this OK?
Yes, in a sense, by construction: we have defined all impure values to
have their own "identity", distinct from any other such value, even if
shallowly equal at a syntactic level. This aligns with the notion that
impure computations have implicit inputs: for example, load v0
appearing twice in the program may produce different values at those
two different times, so we cannot deduplicate it. This can be relaxed
if we have a dedicated analysis that can reason about such implicit
dependencies, and in fact for loads we do have one (alias analysis,
feeding into redundant-load elimination and store-to-load
forwarding). But in general, we cannot do anything with these
"roots". Rather, they stay in the skeleton, feed values into the sea
of nodes, and consume values back out of that sea of nodes.
Out of Sea-of-Nodes-with-CFG: Scoped Elaboration
Given a sea-of-nodes + skeleton representation of a program, how do we
go back to a conventional CFG, with fully linearized operators (i.e.,
each of which has a concrete program-point where it is computed), to
feed to the compiler backend and lower to machine code?
The basic task is to decide a location at which to put each
operator. Since nodes in the sea-of-nodes are "rooted" (referenced and
ultimately computed/used) by side-effectful operators in the CFG
skeleton, the first idea one might have is to copy pure nodes back
into the CFG where they are referenced. One could do this recursively:
if e.g. we have a side-effecting instruction store v1, v2, we can
place the (pure operator) definitions of v1 and v2 just before
this instruction; if those definitions require other values, likewise
compute them first. We could call this "elaboration".
Let's consider the single-basic-block case first and then define
something like the following pseudocode:
def demand_based_elaboration(bb):
    for inst in bb:
        elaborate_inst(inst, bb, before=inst)

def elaborate_inst(inst, bb, before):
    for value in inst.args:
        inst.rewrite_arg(value, elaborate_value(value, bb, before=before))
    if is_pure(inst):
        bb.insert_before(before, inst)
    return inst.def

def elaborate_value(value, bb, before):
    if defined_by_inst(value):  # some values are blockparam roots, not inst defs
        return elaborate_inst(value.inst, bb, before)
    else:
        return value
This would certainly work, but is far too simple: it duplicates
computation every time a value is used, and no value (other than
blockparam roots) is ever used more than once. This will almost
certainly result in extreme blowup in program size!
So if we use a value multiple times, it seems that we should compute
it once, some place in the program before any of the uses. For
example, perhaps we could augment the above algorithm with a map that
records the resulting value number the first time we elaborate a node,
and reuses it (i.e., memoizes the elaboration):
# ...
# ...
def elaborate_value(value, bb, before):
    if value in elaborated:
        return elaborated[value]
    elif defined_by_inst(value):
        result = elaborate_inst(value.inst, bb, before)
        elaborated[value] = result
        return result
    else:
        return value
This modified algorithm will handle the case of a single block with
reuse efficiently, computing a value the first time it is used ("on
demand") as expected.
Now let's consider multiple basic blocks. One might be tempted to
wrap the above with a traversal, as we did for the translation into
sea-of-nodes:
def elaborate_domtree(bb):
    demand_based_elaboration(bb)
    for child in domtree.children(bb):
        elaborate_domtree(child)

def elaborate(func):
    elaborate_domtree(func.entry)
But this, too, has an issue. Consider a program that began as a CFG
with many paths, two of which compute the same value:
[Figure: a CFG in which bb1 branches to bb2, bb3, and bb4; bb2 computes v2 and bb3 computes the identical value v3, while the path through bb4 computes neither.]
If we define some traversal over all basic blocks to perform an
elaboration as above, with a single map elaborated, we will
- Elaborate a computation of v2 in bb2 and use it there;
- Use it in bb3 as well in place of v3, since it has already been computed and is thus memoized;
- And thus generate invalid SSA, where a value is used on a path where it is never computed!
Perhaps we could hoist the computation to a "common ancestor" of all
of its uses instead. Here that would be bb1. But that creates yet
another problem: if control flows from bb1 to bb4, then we will
have computed the value and never used it -- in supposedly optimized
code! This is sometimes called a "partial redundancy": a computation
that is sometimes unused, depending on control flow. We would like to
avoid this if possible.
It turns out that this problem exactly corresponds to common
subexpression elimination (CSE), which aims to find one place to
compute a value possibly used multiple times. The usual approach in
SSA code, global value
numbering (GVN),
solves the problem by reasoning about scopes, where a "scope" is the
region in which a value has already been computed. The intuition is that a computation casts a "shadow" downward over the blocks it dominates, and redundant instances may be removed only within that shadow. So in our example program, if bb1 computed v2 then we could reuse it in bb2 and bb3; but because the value is computed independently in two sibling subtrees, neither of which dominates the other, we do nothing: we duplicate it (re-elaborate it).
SSA "scopes" -- regions in which a value can be used -- are defined by
the dominance relation, and so we can work with a domtree traversal to
implement the needed behavior. Concretely, we can do a domtree
preorder traversal; we can keep the elaborated map but separate it
into scope "overlays", and push a new overlay for each subtree. This
formalizes the "shadow" intuition above. We call this scoped
elaboration. Pseudo-code follows:
def find_in_scope(value, scope):
    if value in scope.map:
        return scope.map[value]
    elif scope.parent:
        return find_in_scope(value, scope.parent)
    else:
        return None

def elaborate_value(value, bb, before, scope):
    if find_in_scope(value, scope):
        # ...
    # ...

def elaborate_domtree(bb, scope):
    demand_based_elaboration(bb, scope)
    for child in domtree.children(bb):
        subscope = { map = {}, parent = scope }
        elaborate_domtree(child, subscope)

def elaborate(func):
    root_scope = { map = {}, parent = None }
    elaborate_domtree(func.entry, root_scope)
The real
implementation
of our scoped hashmap takes advantage of the fact that keys will not
overlap between overlay layers (because once defined, a value will not
be re-defined in a lower layer), and this enables us to have true O(1)
rather than O(depth) lookup using some tricks with a layer number
and generation-per-layer (see implementation for
details!). Nevertheless, the semantics are the same as above.
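The layer-and-generation trick can be sketched as follows (a simplification for illustration; the real implementation packs things more tightly). Each entry in one flat hashmap records the scope depth and that depth's generation at insertion time; popping a scope just bumps a generation counter, lazily invalidating that layer's entries in O(1):

```python
# Sketch of an O(1)-lookup scoped map. Correctness relies on the property
# noted above: a key inserted in one layer is never re-inserted in a
# deeper layer, so one flat dict entry per key suffices.
class ScopedMap:
    def __init__(self):
        self.map = {}           # key -> (depth, generation, value)
        self.generations = [0]  # generation counter per depth
        self.depth = 0

    def push_scope(self):
        self.depth += 1
        if len(self.generations) == self.depth:
            self.generations.append(0)

    def pop_scope(self):
        # bump this depth's generation: all entries inserted at this
        # depth become stale without touching them individually
        self.generations[self.depth] += 1
        self.depth -= 1

    def insert(self, key, value):
        self.map[key] = (self.depth, self.generations[self.depth], value)

    def get(self, key):
        entry = self.map.get(key)
        if entry is None:
            return None
        depth, gen, value = entry
        # valid only if inserted at a still-live depth whose generation
        # has not been bumped since
        if depth <= self.depth and self.generations[depth] == gen:
            return value
        return None
```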
As we foreshadowed above, just as the problem is closely related to
CSE and GVN, scoped elaboration is as well. In fact, the approach of
tracking a definition-within-scope for scopes that correspond to
subtrees in the domtree, given a preorder traversal on the domtree, is
exactly how Cranelift's old
implementation
works as well. We even borrowed the scoped hashmap implementation!
A few more observations are in order. First, it's fairly interesting
that we sometimes re-elaborate a node into multiple dom subtrees;
why is this? Does this introduce inefficiency (e.g. in code size) or
is it the best we can do?
The duplication is, in my opinion, best seen as a dual of the
canonicalization. The original code may have multiple copies of a pure
computation in multiple paths, with no common ancestor that computes
that value. When translating to sea-of-nodes, we will canonicalize
that computation, so we can optimize it once. But then when returning
to the original linearized IR, we may need to restore the original
duplication if there truly was no (non-redundancy-producing)
optimization opportunity. Additionally, and very importantly: we should never elaborate a value in more than one place unless it also appeared more than once in the original program. So we should not grow the program size beyond the original.
Another interesting observation is that by driving elaboration by
demand (from the roots in the side-effecting CFG skeleton), we do
dead-code elimination (DCE) of the pure operations for free. Their
existence in the sea of nodes may cost us some compile time if we
spend effort to optimize them (only to throw them away later); but
anything that becomes dead because of rewrites in sea-of-nodes will
then naturally disappear from the final result.
A third observation is that elaboration gives us a central location to
control when and where code is placed in the final program. In other
words, there is room for us to add heuristics beyond the simplest
version of the algorithm described above. For example: we stated that
we did not want to introduce any partial redundancies. But for
correctness, we don't need to adhere to this: our only real
restriction is that a pure computation cannot happen before its
arguments are computed (i.e., we have to obey dataflow dependencies).
So, for example, if we have the loop nest (structure of loops in the
program) available, if a pure computation within a loop does not use
any values that are computed within that loop, we know it is
loop-invariant and we may choose to elaborate it before the loop
begins (into the "preheader"), in a transform known as loop-invariant
code motion (LICM). This is redundant if the loop executes zero
iterations, but most loops execute at least once; and performing a
loop-invariant computation only once can be a huge efficiency
improvement.
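A minimal sketch of that decision, given the loop nest (the representation here -- innermost-first loops as block sets with a preheader -- is made up for illustration, not Cranelift's actual API): walk outward through the loops enclosing the use, and keep hoisting to the preheader as long as no argument of the pure node is defined inside the current loop.

```python
# Decide where to elaborate a pure node: its use block, or the preheader
# of the outermost enclosing loop in which it is loop-invariant.
def hoist_target(use_block, arg_def_blocks, loops):
    """loops: innermost-first list of (blocks_in_loop, preheader)
    for the loops enclosing use_block."""
    target = use_block
    for blocks_in_loop, preheader in loops:
        if any(b in blocks_in_loop for b in arg_def_blocks):
            break               # an argument is computed inside this loop: stop
        target = preheader      # loop-invariant here: hoist to the preheader
    return target
```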
In the other direction -- pushing computation downward rather than
upward -- we could choose to implement rematerialization by
strategically forgetting a value in the already-elaborated scope and
recomputing it at a new use. Why would we do this? Perhaps it is
cheaper to recompute than to thread the original value through the
program. For example, constant values are very cheap to "compute"
(typically 1 or 2 instructions) but burning a machine register to keep
a constant across a long function can be expensive.
There is a lot of room for heuristic code scheduling within
elaboration as well (LICM and rematerialization can be seen as
scheduling too, but here I mean the order that operations are
linearized within the block they are otherwise elaborated into). For a
modern
out-of-order
CPU, this may not matter too much to the hardware -- but it may
matter to the register allocator, because reordering instructions
changes the "interference graph", or the way that different live
register values compete for finite resources (hardware
registers). E.g., pushing an instruction that uses many values for the
last time "earlier" (to eliminate the need to store those values) is
great; but this minimization is not always straightforward. In fact,
ordering instructions that define and use values to minimize the
coloring count for the resulting live-range interference graph is an
NP-complete problem. So it goes, too often, in compiler engineering!
Despite the complexities that may arise in combining many heuristics,
these three dimensions -- LICM, rematerialization, and code scheduling
for register pressure -- are an interesting high-dimensional cost
optimization problem and one that we still haven't fully solved (see
e.g. #6159,
#6260 and
#8959).
Optimizing Pure Expression Nodes: Rewrite Framework
We've covered the transitions into and out of the
sea-of-nodes-with-CFG program representation. We've seen how merely
this translation gives us GVN (deduplication), DCE, LICM, and
rematerialization "for free" (not really free, but falling out as a
natural consequence of the algorithms). But we still haven't covered
one of the most classical sets of optimizations: algebraic (and other)
rewrites from one expression to another equivalent one (e.g., x+0 to x). How can we do this on the sea-of-nodes?
In principle, the answer is as "simple" as: build the logic that
pattern-matches the "left-hand side" of a rewrite (the part that we
have a "better" equivalent expression for), and then replaces it with
the "right-hand side". That is, in x + 0 -> x, the left-hand side is
x + 0 and the right-hand side is x. Such a framework is highly
amenable to a domain-specific language to express these rewrites:
ideally one doesn't want to write code that manually iterates through
nodes to find these patterns. Fortunately for us, in the Cranelift
project we have the ISLE (instruction-selection and
lowering-expressions) DSL
(RFC, language
reference,
blog post). I originally designed
ISLE in the context of instruction lowering, as the name implies, but
I was careful to keep a separation between the core language and its
"prelude" binding it to a particular environment. Hence we could adapt
it fairly easily to rewrite a graph of Cranelift IR operators as
well. The idea is that, as in instruction lowering, for mid-end
optimizations we invoke an ISLE constructor (entry point) on a
particular node and the ruleset produces a possibly better node.
That gives us the logic for one expression, but there is still an open
question how to apply these rewrites: to which nodes, in what order,
and how to manage or update any uses of a node when that node is
rewritten.
The two general design axes one might consider are:
- Eager or deferred: do we apply rewrites to a node as soon as it exists, or apply them later (perhaps as some sort of batch rewrite)?
- Single-rewrite or fixpoint loop: do we rewrite a node only once, or apply rewrite rules again to the result of a rewrite? Also, if the operand of a node is rewritten, do we (and how do we) rewrite users of that node as well, since more tree-matching patterns may now apply to the new subtree?
It is clear that different answers to these questions could lead to
different efficiency-quality tradeoffs: most obviously, applying
rewrites in a fixpoint should produce better code at the cost of
longer compile time. But also, it seems possible that either eager or
deferred rewrite processing could win, depending on the workload and
particular rules: batching (hence, deferred until one bulk pass) often
leads to efficiency advantages (see the egg
paper and discussion below!),
but also, deferral may require additional bookkeeping vs. eagerly
rewriting before making use of the (soon to be stale) original value.
For the overall design that we have described so far, there turns out
to be a fairly clear optimal answer, surprisingly: because we build an
acyclic sea-of-nodes, as long as we keep it acyclic during rewrites,
we should be able to do a single rewrite pass rather than a
fixpoint. And, to make that single pass work, we rewrite eagerly, as
soon as we create a node; then use the final rewritten version of that
node for any uses of the original value. Because we visit defs before
uses and do rewrites immediately at the def, we never need to update
(and re-canonicalize!) nodes after creation.
An aside is in order: while it is fairly clear why the
sea-of-nodes-with-CFG is initially acyclic -- because SSA permits
dataflow cycles only through block-parameters / phi-nodes, and those
remain in the CFG, which we don't "look through" when applying
rewrites -- it is less clear why rewrites should maintain
acyclicity, especially in the face of hashconsing, which may "tie the
knot" of a cycle if we're not careful. The answer lies in the previous
paragraph: once we create a node, we never update it. That's it! We've
now maintained acyclicity, by construction.
Perhaps surprisingly as well, this rewrite process can be fused with
the translation pass into the sea-of-nodes itself. So we can amend the
above canonicalize to
def canonicalize(basic_block):
    for inst in basic_block:
        if is_pure(inst):
            # ...
            if inst in hashcons_map:
                # ...
            else:
                inst = rewrite(inst)  # NEW
                nodes.push(inst)  # add to the sea-of-nodes
                hashcons_map[inst] = inst.value
        else:
            # ...
i.e., simply add the rewrite rule application at the place we create
nodes, and hashcons based on the final version of the instruction.
Now, note that this is not quite complete yet: inst = rewrite(inst)
is doing some heavy lifting, and is actually a bit too simplistic, in
the sense that this implies that a rewrite rule can only ever rewrite
to one instruction on the right hand side. This isn't quite right:
for example, one may want a DeMorgan rewrite rule ~(x & y) -> ~x | ~y. The right-hand side includes three operator nodes (instructions):
two bitwise-NOTs and the OR that uses them. What if x or y in this
pattern also match a subexpression that can be simplified with some
logic rule?
There seem to be two general answers: create the original right-hand
side nodes un-rewritten and later apply rewrites, or immediately and
recursively rewrite. As we observed above, deferral requires
additional bookkeeping and re-canonicalization as a node's inputs
change, so we choose the recursive approach. So, concretely, given
~((a & b) & (c & d)) and the one rewrite rule above, we would:
- Encounter the top-level ~, and try to match the rewrite rule's
  left-hand side. It would match with bindings x = (a & b) and y = (c & d).
- Apply the right-hand side ~x | ~y bottom-up, building nodes and
  rewriting them as we go:
  - First, ~x. This creates ~(a & b), which recursively fires the
    rule, which results in (~a | ~b).
  - Then, ~y. This creates ~(c & d), again recursively firing the
    rule, which results in (~c | ~d).
- We then create the top-level node on the right-hand side,
  resulting in (~a | ~b) | (~c | ~d).
One needs to limit the recursion if there is any concern that rule
chain depths may not be statically bounded or easily analyzable, but
otherwise this yields the correct answer in a single pass without the
need to track users of a node to later rewrite and recanonicalize it.
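To make this concrete, here is a minimal, hypothetical Python sketch of eager, depth-limited recursive rewriting with a single DeMorgan rule. The tuple node encoding, `build_and_rewrite`, and `try_rewrite` are all illustrative inventions, not Cranelift's actual API:

```python
# Hypothetical sketch: nodes are (op, tuple_of_children); leaves are
# ("var", (name,)). Not Cranelift's real representation.

MAX_DEPTH = 5  # bound rule-chain depth to prevent runaway recursion

def build_and_rewrite(op, args, depth=0):
    """Create a node, then immediately (and recursively) try to rewrite it."""
    node = (op, tuple(args))
    if depth >= MAX_DEPTH:
        return node
    rewritten = try_rewrite(node, depth)
    return rewritten if rewritten is not None else node

def try_rewrite(node, depth):
    op, args = node
    # One illustrative rule: DeMorgan, ~(x & y) -> ~x | ~y.
    if op == "not" and args[0][0] == "and":
        x, y = args[0][1]
        # Build the right-hand side bottom-up, rewriting each new node
        # as it is created -- so nested matches fire recursively:
        nx = build_and_rewrite("not", [x], depth + 1)
        ny = build_and_rewrite("not", [y], depth + 1)
        return build_and_rewrite("or", [nx, ny], depth + 1)
    return None

# e.g. ~((a & b) & (c & d)) rewrites, in one pass, to (~a | ~b) | (~c | ~d)
```

The depth parameter is the "limit the recursion" bound mentioned above; without it, a pair of mutually-inverting rules could recurse forever.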
And that's the whole pipeline: we now have a way to optimize code by
translating to sea-of-nodes-with-CFG, applying rewrites as we go, then
translating back to classical SSA CFG. In the process we've achieved
all the goals we set out with: GVN, LICM, DCE, rematerialization, and
algebraic rewrites.
E-graphs: Representing Many Possible Rewrites
So far, we've described a system that has zero or one deterministic
rewrite for any given node; this is analogous to a classical compiler
pipeline that destructively updates instructions/operators. This is
great for rewrite rules like x+0 -> x: the right-hand side is
unambiguously better if it is "smaller" (rewrites a whole expression
into only one of its parts). This is also fine when instructions have
clear and very distinct costs, such as integer divide (typically tens
of cycles or more even on modern CPUs) by a constant converted into
magic wrapping
multiplies.
But what about cases where the benefit of a rewrite is less clear, or
depends on context, or depends on how it may or may not be able to
compose with or enable other rewrites in a given program?
For example, consider the classical example from the 2021 paper on
egg, an e-graph
framework: if we have the
expression (x * 2) / 2 in our program, we would expect that to
simplify to x. To implement this simplification, we might have a
general rewrite rule (x * k) / k -> x. But we might also,
separately, have a rewrite rule that (x * 2^k) -> (x << k), i.e.,
convert a multiplication into a left-shift operation. If we performed
this latter rewrite eagerly, the former rewrite rule might never match.
(Now, you might complain that we could also convert the divide into
a right-shift, then we have another rewrite rule that simplifies (x << k) >> k -> x. In this particular example, that might be
reasonable. But (i) that required careful thinking about canonical
forms, where multiplies/divides by powers-of-2 are always
canonicalized down to shifts, and (ii) this same fortunate behavior
might not exist for all rulesets.)
In general, we also have a question at the rule-application level: if
multiple rules apply, which do we take? In the above example, we would
have needed some prioritization scheme to (say) apply
strength-reduction rules to convert to shifts before we examine
divide-of-multiply. That's an extra layer of heuristic engineering
that must be considered when designing the optimizer.
Onto the scene, then, comes a new data structure: the e-graph, or
equivalence graph, which is a kind of sea-of-nodes
program/expression representation that can represent many different
equivalent forms of a program at once. The key idea is that, rather
than have a single expression node as a referent for any value, we
have an e-class (equivalence class) that contains many e-nodes, and we
can pick any of these e-nodes to compute the value.
The idea is a sort of principled approach to the optimization
problem: let's model the state space explicitly, and then pick the
best result objectively. Typically one uses the result of an e-graph
by "extracting" one possible representation of the program according
to a cost metric. (More on this below, but a simple cost metric could
be a static number per operator kind, plus cost of inputs.)
The magic of e-graphs is how they can compress a very large
combinatorial space of equivalent programs into a small data
structure. A detailed exploration of how this works is beyond the
scope of this blog post (please read the egg paper: it's very good!)
but a very short intuitive summary might be something like:
- Ensuring that all value uses point to an e-class rather than a
particular node will propagate knowledge of equivalences to
maximally many places. That is, if we know that op1 v1, v2 is
equivalent to op2 v3, v4, all users of the op1 v1, v2 expression
should automatically get the knowledge propagated that they can use
any form. This knowledge propagation is the essence of "equality
saturation" that e-graphs enable.
- A strong regime of canonicalization and "re-interning"
(re-hashconsing), which the egg paper calls "rebuilding", ensures
that such information is maximally propagated. Basically, when we
discover that the op1 and op2 expressions above are equivalent,
we re-process all users of both op1 and op2, looking for more
follow-on consequences. Merging those two might in turn cause other
expressions to be equivalent or other rewrite rules to fire.
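As a concrete reference point, here is a toy sketch of a classical e-graph's merge-and-rebuild loop, in the spirit of the egg paper. This is a deliberately simplified illustration (a union-find, a hashcons map, and a parent list per e-class), not egg's actual API:

```python
class EGraph:
    """Toy classical e-graph: a union-find over e-class ids, a hashcons
    map from canonical e-nodes to e-classes, and a rebuild step that
    re-canonicalizes parents ("uses") after merges."""

    def __init__(self):
        self.parent = []    # union-find parents, indexed by e-class id
        self.hashcons = {}  # canonical (op, arg-classes) -> e-class id
        self.uses = {}      # e-class id -> [(parent enode, parent class)]
        self.pending = []   # classes touched by merges, awaiting rebuild

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def canon(self, enode):
        op, args = enode
        return (op, tuple(self.find(a) for a in args))

    def add(self, op, args=()):
        enode = self.canon((op, tuple(args)))
        if enode in self.hashcons:
            return self.find(self.hashcons[enode])
        eid = len(self.parent)
        self.parent.append(eid)
        self.hashcons[enode] = eid
        for a in enode[1]:
            self.uses.setdefault(a, []).append((enode, eid))
        return eid

    def merge(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.uses[a] = self.uses.get(a, []) + self.uses.pop(b, [])
            self.pending.append(a)

    def rebuild(self):
        # Re-canonicalize the users of merged classes; equal canonical
        # forms expose new congruences, which trigger further merges.
        while self.pending:
            eid = self.find(self.pending.pop())
            for enode, owner in self.uses.pop(eid, []):
                c = self.canon(enode)
                if c in self.hashcons:
                    self.merge(owner, self.hashcons[c])
                else:
                    self.hashcons[c] = self.find(owner)
                for a in c[1]:
                    self.uses.setdefault(a, []).append((c, self.find(owner)))
```

For example, after merging the classes of x and y, rebuilding discovers that mul(x, x) and mul(y, y) are congruent and merges their classes too. Note the bookkeeping this requires (parent lists, re-interning): this is exactly the cost that the aegraph design avoids.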
Practical Efficiency of Classical E-graphs
The two problems that arise with a "classical e-graph" (by which I
include the 2021 egg paper's batched-rebuilding formulation) are
blowup -- that is, too many rewrite rules apply and the e-graph
becomes too large -- and data-structure inefficiency.
The blowup problem is easier to understand: if we allow for
representing many different forms of the program, maybe we will
represent too many, and run out of memory and processing time. It is
often hard to control how rules will compose and lead to blowup, as
well: each rewrite rule may seem reasonable in isolation, but the
transitive closure of all possible programs under a well-developed set
of equivalences can be massive. So practical applications of e-graphs
usually need some kind of meta/strategy driver layer that uses "fuel"
to bound effort, and/or selectively applies rewrites where they are
likely to lead to better outcomes. Even then, this operating regime
often has compile-times measured in seconds or worse. This may be
appropriate for certain kinds of optimization problems where
compilation happens once or rarely and the quality of the outcome is
extremely important (e.g., hardware design), but not for a fast
compiler like Cranelift.
We can protect against such outcomes with careful heuristics, though,
and the possibility of allowing for objective choice of the best
possible expression is still very tempting. So in my initial
experiments, I applied the egg crate
to the problem and eventually, with custom tweaks, managed to get
e-graph roundtripping to 23%
overhead
-- with no rewrites applied. That's not bad at first glance but it
proposes to replace an optimization pipeline that itself takes only
10% of compile-time, and we haven't yet added the rewrites to the
23%. (And the 23% came after a good amount of data-structure
engineering to reduce storage; the initial overhead was over 2x.)
In profiling the optimizer's execution, the overheads were occurring
more or less in building the e-graph itself (that is, cache misses
throughout the code transcribing IR to the e-graph). And what does the
e-graph contain? Per e-class, it contains a "parent pointer" list: we
need to track users of every e-class so that we can re-canonicalize
them during the "rebuild" step when e-classes are merged (a new
equivalence is discovered). And, even more fundamentally, it stores
e-nodes separately from e-classes, which is an essential element of
the idea but means that we have (at least) two different entities for
each value, even when most e-classes have only one e-node.
Is there any way to simplify the data structures so that we don't have
to store so many different bits for one value?
Insight #1: Implicit E-graph in the SSA IR
The first major insight that enabled efficient implementation of an
e-graph in Cranelift was that we could redefine the existing IR into
an implicit e-graph, without copying over the whole function body
into an e-graph and back, thus avoiding the compile-time penalty of
this data movement. (Data movement can be very expensive when the main
loops of a program are otherwise fairly optimized! It is best to keep
and operate on data in-place whenever possible.)
We start with a sea-of-nodes-with-CFG, where we have an IR with SSA
values not placed in basic blocks. We can already build this
"in-place" in Cranelift's IR, CLIF, by removing existing SSA
definitions from the CFG but keeping their data in the data-flow graph
(DFG) data structures.
Then, to allow for multi-representation in an e-graph, the idea is to
discard the separation between e-classes and e-nodes, and instead
define a new kind of IR node that is a union node. Rather than two
index spaces, for e-nodes and e-classes, we have only one index space,
the SSA value space. An SSA value is either an ordinary operator
result or a block parameter (as before), or a union of two other SSA
values. Any arbitrary e-class can then be represented via a binary
tree of union nodes. We don't need to change anything about operator
arguments to make use of this representation: operators already refer
to value numbers, and an e-class of multiple e-nodes (defined by the
"top" union node in its union tree) already has a value number.
The coolest thing about this representation is: once we have a
sea-of-nodes, it is already implicitly an e-graph, with "trivial"
(one-member) e-classes for each e-node. Thus, the lift from
sea-of-nodes to e-graph is a no-op -- the best (and cheapest) kind of
compile-time pass. We only pay for multi-representation when we use
that capability, creating union nodes as needed.
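A sketch of this single index space might look as follows; the type names are illustrative, not Cranelift's actual CLIF data structures:

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical sketch: every SSA value is one of three node kinds, in
# one value-number space, and an e-class is a binary tree of Union nodes.

@dataclass(frozen=True)
class Op:              # an ordinary operator result
    opcode: str
    args: Tuple[int, ...]   # argument *value numbers*

@dataclass(frozen=True)
class BlockParam:      # a block parameter (phi)
    block: int
    index: int

@dataclass(frozen=True)
class Union:           # joins two equivalent values into one e-class
    a: int
    b: int

nodes = []  # value number -> node; append-only, so the graph stays acyclic

def make(node):
    nodes.append(node)
    return len(nodes) - 1

# e.g. build x+0, discover it equals x, and union the two forms;
# later uses refer to `eclass`:
x = make(BlockParam(block=0, index=0))
zero = make(Op("iconst", ()))
add = make(Op("iadd", (x, zero)))
eclass = make(Union(add, x))
```

Operators refer only to value numbers, so a multi-member e-class (named by its top union node) costs nothing extra at use sites: the lift from sea-of-nodes to e-graph really is a no-op.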
Insight #2: Acyclicity with Eager Rewrites
The other aspect of the classical e-graph data structure's cost has to
do with its need to rebuild, and in order to do so, to track all
uses of an e-class (its "parents" in egg's terminology). Cranelift
does not keep bidirectional use-def links, and the binary tree of
union nodes would make such links even more complex to track.
In trying to address this cost, I started with a somewhat radical
question: what would happen if we never rebuilt (to propagate
equalities)? How much "optimization goodness" would that give up?
If one (i) builds an e-graph then (ii) applies rewrite rules to find
better versions of nodes, adding to e-classes, then the answer is that
this would hardly work at all: this would mean that all users of a
value would see only its initial form and never its rewrites. The
rewritten forms would float in the sea-of-nodes, and union-nodes
joining them to the original forms would exist, but no users would
actually refer to those union nodes.
Instead, what is needed is to apply rewrites eagerly. When we create
a new node in the sea-of-nodes, we apply all rewrites immediately,
then join those rewrites with the original form with union nodes. The
"top" of that union tree is then the value number used as the
"optimized form" of that original value, referenced by all subsequent
uses.
The union-node representation plays a key part in this story: it acts
as an immutable data structure in a sense, where we always append
new knowledge and union it with existing values, and refer to that
"newer version" of an e-class; but we never go back and update
existing references.
This has a very nice implication for the graph structure of the sea of
nodes as well: it preserves acyclicity! Classical e-graphs, in their
rebuild step, can create cycles even when the input is acyclic because
they can condense nodes arbitrarily. But when we eagerly rewrite, then
freeze, we can never "tie the knot" and create a cycle.
This acyclicity is important because it permits a single pass for
the rewrites. In fact, taking our sea-of-nodes build algorithm above
as a baseline, we can add eager rewriting as a very small change: when
we apply rewrites, we build a "union-node spine" to join all rewritten
forms, rather than destructively take only the rewritten form.
def canonicalize_and_rewrite(basic_block):
    for inst in basic_block:
        if is_pure(inst):
            # ...
            if inst in hashcons_map:
                # ...
            else:
                optimized = rewrites(inst)                      # NEW
                union = join_with_union_nodes(inst, optimized)  # NEW
                optimized_form[inst.def] = union
        else:
            # ...
All of these aspects work together and cannot really be separated:
- Union nodes allow for a cheap, pay-as-you-go representation for
e-classes, without a two-level data structure (nodes and classes)
and without parent pointers.
- Eager rewriting, applied as we build the e-graph (sea of nodes),
allows for a single-pass algorithm and ensures all members of the
e-class are present before it is "sealed" by union nodes and
referenced by uses.
- Acyclicity, present in the input (because of SSA), is preserved by
the append-only, immutable nature of union nodes, and permits eager
rewriting to work in a single pass.
Note that here we are glossing over recursive rewrites. Due to space
constraints I will only outline the problem and solution briefly: the
right-hand side of a rewrite rule application (rewrites above) will
produce nodes that themselves may be able to trigger further
rewrites. Rather than leave this to another iteration of a rewrite
loop, as a classical e-graph driver might do, we want to eagerly
rewrite this right-hand side as well before establishing any uses of
it. So we recursively re-invoke rewrites; and this occurs within the
right-hand side of rules as pieces of the final expression are
created, as well. This recursion is tightly bounded (in levels and in
total rewrite invocations per top-level loop iteration) to prevent
blowup.
Finally, we are also glossing over details of how we apply our
pattern-matching/rewrite DSL, ISLE, to the rewrite problem when
multiple rewrites are now permitted. In brief, we extended the
language to permit "multi-extractors" and "multi-constructors": rather
than matching only one rule, and disambiguating by priority, we take
all applicable rules. The
RFC has more
details.
The Extraction Problem
So we now have a way to represent multiple expressions as
alternatives to compute the same value. How do we compile this
program? It surely wouldn't make sense to compile all of these
expressions: they produce the same bits, so we only need one. Which
one do we pick?
This is the extraction problem, and it is both easy to state and
deceptively hard (in fact, NP-hard): choose the easiest (cheapest)
expression to compute any given value.
Why is this hard? First, let's construct the case where it's easy.
Let's say that we have one root expression (say, returned from a
function) with all pure operators. This forms a tree of choices: each
eclass lets us choose one enode to compute it, and that enode has
arguments that themselves refer to eclasses with choices.
Given this tree of choices, with every choice independent, we can
pick the best choice for each subtree, and compute the cost of any
given expression node as best-cost-of-args plus that node's own cost
to compute. In more formal algorithmic terms, that is optimal
substructure.
Unfortunately, as soon as we permit references to shared nodes (a
DAG rather than a tree), this nice structure evaporates. To see why,
consider: we could have two eclasses we wish to compute
v0 = union v10, v11
v1 = union v10, v12
with computations (not shown) v10 that costs 10 units to compute,
and v11 and v12 that each cost 7 units to compute. The optimal
choice at each subproblem is to choose the cheaper computation (v11
or v12), but the program would actually be more globally optimal if
we computed only v10 (cost of 10 total). A solver that tries to
recognize this would either process each root (v0 and v1) one at a
time and "backtrack" at some point once it sees the additional use, or
somehow build a shared representation of the problem, which is no
longer deconstructed in a way that permits sub-problem solutions to
compose.
In fact, the extraction problem is NP-hard. To see why, I will show
a simple linear-time reduction (mapping) from a known NP-hard problem,
weighted set-cover, to eclass extraction.
Take each weighted set S_j with weight w_j and elements S_j = { x_1, x_2, ... }, and add an enode (operator with args) N_j, with self-cost
(not including args) w_j, and no arguments. Then for each element
x_i in the universe (the union of all sets' elements), define an
eclass C_i. Finally, for each set-element edge (for each i, j such
that x_i ∈ S_j), add an enode to C_i with the opaque zero-cost
operator SetElt_ij(y), where y is the value of the set node N_j
(so choosing SetElt_ij pays for computing N_j).
Performing an optimal (lowest-cost) extraction, with all eclasses
C_i taken as roots, will compute the lowest-weight set cover: the
choice of enode in each eclass C_i encodes which set we choose to
cover element x_i. Thus, because egraph extraction with shared
structure can compute the solution to an NP-hard problem (weighted set
cover), egraph extraction with shared structure is NP-hard.
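Here is a toy brute-force check of this reduction on a small instance. It assumes each zero-cost SetElt_ij takes the weighted set node N_j as its argument, so that (with shared-structure cost accounting) choosing it pays w_j exactly once across all elements it covers:

```python
from itertools import product

# Toy weighted set cover: universe {1, 2, 3}, sets name -> (weight, elements).
sets = {"A": (3, {1, 2}), "B": (4, {2, 3}), "C": (8, {1, 2, 3})}

# Extraction picks one SetElt enode per element eclass; with sharing,
# the total cost is the sum of weights of the *distinct* chosen sets.
universe = sorted(set().union(*(elems for _, elems in sets.values())))
choices_per_elem = [
    [name for name, (_, elems) in sets.items() if x in elems] for x in universe
]

best = min(
    sum(sets[name][0] for name in set(assignment))
    for assignment in product(*choices_per_elem)
)
assert best == 7  # cheapest cover: {A, B}, weight 3 + 4
```

The brute force enumerates every possible extraction (one choice per eclass), which is exactly the exponential search an optimal extractor would face.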
OK, but we want a fast compiler. What do we do?
The classical compiler-literature answer to this problem -- seen over
and over in a 50-year history -- is "solve a simpler approximation
problem". Register allocation, for example, is filled with simplified
problem models (linear scan, no live-range splitting, ...) that
reduce the decision space and allow for a simpler algorithm.
In our case, we solve the extraction problem with a simplifying
choice: we will not try to account for shared substructure and the way
that it complicates accounting of cost. In other words, we'll ignore
shared substructure, pretending that each use of a subtree counts that
subtree's cost anew. For each enode, having computed the cost of each
of its arguments, we can compute its own cost easily as the sum of its
arguments plus its own computation cost; and for each eclass, we can
pick the minimum-cost enode. That's it!
We implement this with a dynamic
programming
algorithm: we do a toposort of the aegraph (which can always be done,
because it's acyclic), then process nodes from leaves upward,
accumulating cost and picking minima at each subproblem. This is a
single
pass
and is a relatively fast and straightforward algorithm.
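A sketch of this single-pass, bottom-up extraction, with made-up operator costs (the node encoding and costs are illustrative, not Cranelift's):

```python
# Nodes appear in creation order, so children precede parents -- the
# toposort we need is already given by the append-only construction.
# Each node is ("op", opcode, self_cost, args) or ("union", a, b).

def extract(nodes):
    best = {}  # value number -> (total cost, chosen value number)
    for v, node in enumerate(nodes):
        if node[0] == "op":
            _, opcode, self_cost, args = node
            # Ignore sharing: each argument's cost is counted anew.
            total = self_cost + sum(best[a][0] for a in args)
            best[v] = (total, v)
        else:  # union: propagate the cheaper side's choice
            _, a, b = node
            best[v] = min(best[a], best[b])
    return best

# x*2 unioned with x<<1 (assumed costs: mul=3, shl=1, consts/params=0):
nodes = [
    ("op", "param", 0, ()),       # v0 = x
    ("op", "iconst_2", 0, ()),    # v1 = 2
    ("op", "imul", 3, (0, 1)),    # v2 = x * 2
    ("op", "iconst_1", 0, ()),    # v3 = 1
    ("op", "ishl", 1, (0, 3)),    # v4 = x << 1
    ("union", 2, 4),              # v5 = eclass {v2, v4}
]
best = extract(nodes)
assert best[5] == (1, 4)  # the shift wins
```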
After the Dagstuhl seminar in January, I had an ongoing discussion
with collaborators Alexa VanHattum and Nick Fitzgerald about whether
we could do better here. Alexa and Nick both prototyped a bunch of
interesting alternatives: dynamically updating (shortcutting to zero)
costs when subtrees become used ("sunk-cost" accounting), computing
costs by doing full top-down traversals rather than bottom-up dynamic
programming (and then mixing in memoization somehow), trying to
account for sharing by doing DP but tracking the full set of covered
leaves, and
some other things. This was an interesting exploration but in the end
we didn't find anything that looked better in the compile-time /
execution-time tradeoff space. We have an issue tracking
this and
more ideas are always welcome, of course.
Other Aspects
There are two other aspects of our aegraph implementation that I don't
have space to go into in this post:
- There is an interesting problem that arises with respect to the
domtree and SSA invariants when different values are merged together
with a union node and some of them have wider "scope" than
others. For example, via store-to-load forwarding we may know that a
load instruction produces a constant 0; so we might have a union
node with iconst 0. The load can only happen at its current
location, but iconst 0 can be computed anywhere. A user of this
eclass should be able to pick either value (said another way:
extraction should not be load-bearing for correctness). If the user
is within the dominance subtree under the load, then all is fine,
but if not, e.g. if some other user of iconst 0 elsewhere in the
function errantly happened upon the eclass-neighbor load
instruction, we might get an invalid program.
There are many ways one might be tempted to solve this, but in the
end we landed on an "available block" analysis that runs as we build
nodes. For every node, we record the "highest" block in the domtree
at which it can be computed: function entry for pure zero-argument
nodes, the current block for any impure node, and otherwise the
lowest block in the domtree among the available blocks of all
arguments. (Claim: the available blocks for all args of a node will
form an ancestor path in the domtree; one will always exist that is
dominated by all others. This follows from the properties of SSA.)
Then when we insert into the hashcons map, we insert at the level
that the final union is available.
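A sketch of the available-block computation (illustrative names; `depth` maps each block to its depth in the domtree):

```python
# Hypothetical sketch of the per-node available-block analysis. By the
# SSA claim in the text, the args' available blocks lie on a single
# ancestor path in the domtree, so the deepest one is dominated by all
# the others and is the correct answer.

ENTRY = 0  # the function-entry block

def available_block(node, current_block, avail, depth):
    """node = (is_pure, args); avail maps value -> its available block."""
    is_pure, args = node
    if not is_pure:
        return current_block  # impure: pinned to its current location
    if not args:
        return ENTRY          # pure zero-argument node: computable anywhere
    # Pure with args: the lowest (deepest) of the args' available blocks.
    return max((avail[a] for a in args), key=depth.get)
```

For a domtree entry → B1 → B2, a pure node whose args are available at entry and at B2 is itself available only at B2, while an impure node stays in whatever block it occupies.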
- We also have an important optimization that we call subsume. This
is an identity operator that wraps a value returned by a rewrite
rule. It is not required for correctness, but its semantics are: if
any value is marked "subsume", all "subsuming" values erase
existing members of the eclass. Usually, only one subsuming rule
will match (but this, also, is not necessary for correctness).
The usual use-case is for rules that have clear "directionality": it
is always better to say 2 than (iadd 1 1), so let's go ahead and
shrink the eclass so that all further matching, and eventual
extraction, is more efficient.
Evaluation
So how does all of this actually work? Do aegraphs benefit Cranelift's
strength as a compiler -- its ability to optimize code, its efficiency
in doing so quickly, or both?
This is the part where I offer a somewhat surprising conclusion: the
tl;dr of this post is that I believe the sea-of-nodes-with-CFG
aspect of this mid-end works great, but the aegraph itself -- the
ability to represent multiple options for one value -- may not (yet?)
be pulling its weight. It doesn't really hurt much either, so maybe
it's a reasonable capability to keep around. But in any case, it's an
interesting conclusion and we'll dig more into it below.
The main interesting evaluation is a two-dimension comparison of
compile time -- that is, how long Cranelift takes to compile code --
on the X-axis, versus execution time -- that is, how long the
resulting code takes to execute -- on the Y-axis. This forms a
tradeoff space: it may be good to spend a little more time to compile
if the resulting code runs faster (or vice-versa), for example. Of
course, reducing both is best. One point may be "strictly better" than
another if it reduces both -- then there is no tradeoff, because one
would always choose the configuration that both compiles faster and
produces better code. (One can then find the Pareto
frontier of points that
form a set in which none is strictly better than another -- these are
all "valid configuration points" that one may rationally choose
depending on one's goals.)
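For concreteness, here is how one might compute such a Pareto frontier over (compile-time, execution-time) points, where lower is better on both axes (the data points here are made up):

```python
def pareto_frontier(points):
    """Keep the points not strictly dominated by any other point; q
    strictly dominates p if q is <= p on both axes and != p."""
    def dominated(p, q):
        return q[0] <= p[0] and q[1] <= p[1] and q != p
    return sorted(p for p in points if not any(dominated(p, q) for q in points))

# (compile time, execution time), both normalized; lower is better.
configs = [(1.0, 1.30), (1.08, 1.10), (1.15, 1.08), (1.20, 1.12)]
assert pareto_frontier(configs) == [(1.0, 1.30), (1.08, 1.10), (1.15, 1.08)]
```

The last point is dropped because (1.15, 1.08) compiles faster and runs faster; every surviving point is a rational choice for some compile-time/run-time preference.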
Below we have a compile-time vs. execution-time plot for a number of
configurations of Cranelift, compiling and running the Sightglass
benchmark suite:
- No optimizations enabled;
- The (on by default) aegraph-based optimization pipeline, as
described in this post, with several variants (below);
- A "classical optimization pipeline" that does not form a
sea-of-nodes-with-CFG at all; instead, it applies exactly the same
rewrite rules, but in-place, and interleaves with classical GVN and
LICM passes;
- Variants of the aegraphs pipeline and classical pipeline with the
whole mid-end repeated 2 or 3 times (to test whether code continues
to get better).
Here's the main result:
[Figure: compile-time (x-axis) vs. execution-time (y-axis) plot across all configurations.]
A few conclusions are in order. First, the aegraph pipeline does
generate better code than the classical pipeline. This objective
result is "mission accomplished" with respect to the aegraph effort's
original motivation: we wanted to allow optimization passes to
interact more finely and optimize more completely. Note in particular
that repeating the classical pipeline multiple times does not get
the same result; we could not have obtained the ~2% speedup without
building a new optimization framework.
Second, though, there is clearly a Pareto frontier that includes "no
optimizations" and "classical pipeline" as well as the aegraph
variants: each takes more compilation time than the previous. In other
words, moving from a classical compiler pipeline to the design
described here, we spend about 7-8% more compile time. Notably, this
is not the result that we had when we first built the aegraphs
implementation in 2023 and switched over -- at that time, we were more
or less at parity. This is likely a result of the growth of the body
of rewrite rules over the intervening three years.
To get a better picture of how aegraph's various design choices
matter, let's zoom into the area in the red ellipse above, which
contains multiple variants of the aegraphs pipeline:
- "aegraph": Exactly as described in this post, and default Cranelift
configuration;
- "no multivalue (eager pick)": sea-of-nodes-with-CFG, without union
nodes; i.e., not actually representing more than one equivalent
value in an eclass. Instead, after evaluating rewrite rules, we pick
the best option and use that one option (destructively replacing the
original);
- "no rematerialization": testing the effect of this aspect of the
elaboration algorithm;
- "no subsume": testing this efficiency tweak of the rewrite-rule
application.
Here's the plot:
[Figure: zoomed compile-time vs. execution-time plot of the aegraph variants.]
One can see that there are some definite tradeoffs, but looking
closely at the axis scales, these effects are very very small. In
particular, moving from sea-of-nodes-with-CFG to true aegraph (taking
all rewritten values, and picking the best in a principled way with
cost-based extraction) nets us ~0.1% execution-time improvement, at
~0.005% compile-time cost. That's more-or-less in the noise.
Supporting that conclusion is the statistic that the average eclass
size after rewriting is 1.13 enodes: in other words, very few cases
with our ruleset and benchmark corpus actually result in more than one
option.
Finally, the most interesting question in my view: does the eager
aspect of aegraphs -- applying rewrite rules right away, and never
going back to "fill in" other equivalences -- matter? In other words,
does skipping equality saturation take the egraph goodness out of an
egraph(-alike)?
We can measure this, too: I instrumented our implementation to track
when a subtree of an eclass is not chosen by extraction, and then
any node in that subtree is later actually elaborated (in other words,
when we use a suboptimal choice because we could not see an equality
in the "wrong" direction). This should only happen if, in theory, our
rules rewrite f to g where cost(g) > cost(f), and we don't have
a rewrite g to f: then a user of g might never directly get a
rewrite of f eagerly, but a later coincidentally-occurring f might
rewrite onto g (but we'll never propagate that equality into the
original users of g).
It turns out that, in all of our benchmarks, with ~4 million value
nodes created overall, this happens two (2) times. Both instances
occur in spidermonkey.wasm (a large benchmark that consists of the
SpiderMonkey JS engine, compiled to WebAssembly, then run through
Wasmtime+Cranelift), and occur due to an ireduce-of-iadd rewrite rule
that violates this move-toward-lower-cost principle (explicitly, in
the name of simplicity). Overall, we conclude that the eager rewrites
are effective as long as the ruleset is designed with optimization
(rather than mere exploration of all equivalent expressions) in mind.
Discussion
The most surprising conclusion in all of the data was, for me, that
aegraphs (per se) -- multi-value representations -- don't seem to
matter. What?! That was the entire point of the project, and (proper)
e-graphs have seen great promise in other application areas.
I think the main reason for this is that our workload is somewhat
"small" in a combinatorial possibility-space sense: we are (i)
compiling workloads that are often optimized already (as Wasm modules)
before hitting the Cranelift compilation pipeline, and (ii) applying a
set of rewrite rules that, while large and growing (hundreds of
rules), explicitly do not include identities like associativity and
commutativity, or arbitrary algebraic identities, that do not
"simplify" somehow. In other words, if we're generally applying
rewrites that look more like simple, obvious "cleanups", we would
expect that we don't hold a "superposition" of multiple good
expression options very often.
Given that it doesn't cost us that much compile time to keep
aegraphs around, though, maybe this is... fine? Having the
capability to do principled cost-based extraction is great, versus
having to think about whether a rewrite rule should exist. We still do
try to be careful not to introduce rules that are never productive,
of course.
And, further into the future, one could imagine that workloads with
more optimization opportunity could cause more interesting situations
to occur within the aegraph, leading to more emergent composition in
the rewrites.
Future Directions
There are a bunch of directions we could (and should) take this in the
future. In terms of evaluation: finding the "corner of the use-case
domain" where aegraphs truly shine is still an open question. More
concretely: if we evaluate Cranelift with new and different workloads,
and/or pile on more rewrite rules, do we get to a point where the
classical benefit of "multi-representation with cost-based extraction"
pays off in a conventional compiler? I don't know!
There is also still a lot of room to improve the core algorithms:
- Better extraction, as mentioned above: something that accounts for
shared substructure would be great, as long as we don't have to pay
the NP-hard cost for it. Maybe there's a nice approximation
algorithm that's better than our current dynamic-programming
approach.
- We'd like to be able to handle more rewrites that alter the CFG
skeleton as well. Right now, we have a separate ISLE entry-point
that allows for destructive rewriting of skeleton instructions
(thanks to my colleague Nick Fitzgerald for building this!). However,
maybe we could remove redundant block parameters (phi nodes), for
example; and/or maybe we could fold branches; and/or maybe we could
apply path-sensitive knowledge to values when used in certain
control-flow contexts (x=1 in the dominance subtree under the
"true" branch for if x==1, goto ...). My former colleague Jamey
Sharp wrote up a few excellent, in-depth issues on these topics in
our tracker
(#5623,
#6109,
#6129)
and I think there is a lot of potential here.
(For the full version of this, again, something like
RVSDG as the node language seems
like the most principled option to express all useful forms of
control-flow rewrites; Jamey also has a prototype called
optir for this.)
- It would be interesting to experiment with incorporating our
lowering backend rules into the aegraph somehow: they are a rich,
fruitful target-specific database of natural "costs" for various
operations. For example, on AArch64 we can fold shifts and extends
into (some) arithmetic operations for "free"; maybe this alters the
extraction choices we make. Or likewise for the various odd corners
of addressing modes on each architecture.
The simple version of this idea is to incorporate lowering rules as
rewrites, and make the egraph's node language a union of CLIF and
the machine's instruction set. But maybe there's something better we
could do instead, allowing multi-extractors to see the aegraph
eclasses directly and keeping various VCode sequences. I need to
write up more of my ideas on this topic someday. Jamey also has more
thoughts on this in
#8529.
I'm sure there are other things that could be done here too!
Further Reading
- I gave a talk about aegraphs at EGRAPHS
2023:
slides,
re-recorded video (the original was
not recorded).
- I gave a talk about aegraphs at the January 2026 Dagstuhl e-graphs
seminar;
the slides are a
heavily updated and amended version of the 2023 talk, with the
experiments/data I presented here.
- There is a Cranelift RFC on aegraphs
here, and one on
ISLE (the rewrite DSL that we use to drive rewrites in the aegraph)
here.
- The main PR that implemented the current form of aegraphs is
here,
co-authored by my former colleague Jamey Sharp (this production
implementation was a fantastically fun and productive
pair-programming project!).
Acknowledgments
Thanks to many folks for discussion of the ideas around aegraphs
through the years: Nick Fitzgerald, Jamey Sharp, Trevor Elliott, Max
Willsey, Alexa VanHattum, Max Bernstein, and many others at the
Dagstuhl e-graphs seminar. None of them reviewed this post (it had
been languishing for too long already and I wanted to get it out) so all fault
for any errors herein is solely my own!