Top-down LR parsing
This semester, I am teaching Compilers, and I'm quite happy with how I covered parsing this year. Specifically, I taught students to build a top-down LR parser and it seemed to go relatively well.
Update: Laurence asks me to note that he's not as much of a parser theorist as this post makes him out to be.
Update: There's an ongoing discussion on Hacker News about this post. I learned a lot from it!
Update: Gábor Lehel pointed me to Russell Johnston's post on recursive ascent parsing. It's not quite the same, but similar in spirit to what's described here.
What I want out of parsing
I am not a big fan of the parsing portion of a compilers class, but I try to do a good job of it because I recognize that parsing is the part of my compilers class that students are most likely to reuse in their later programming. This means that I do want students to parse a real programming language with infix operators and precedence, not just S-expressions, and to use a general-enough parsing algorithm that they could re-use their skills on future parsing problems.
Additionally, the style of my Compilers class is that we write nearly everything from scratch, so I don't just want to teach them to use ANTLR or a similar tool. I also do not want to spend a lot of time on theory. While I find Laurence Tratt's argument for LR parsing persuasive, I find that most presentations of LR parsing are very theoretical, and building intuition for resolving conflicts takes longer than I want to spend on parsing (which is two weeks of lecture and three assignments).
Finally, what I did take away from Laurence Tratt's posts is that I do not want students to just write spaghetti code recursive descent parsers that just "do something". I want to teach a parsing methodology, not just get them to write enough code to get the parser working; I want students to derive their parser directly from their grammar. Especially because, in my experience, students are exceptionally clever when it comes to inventing bugs, and parser bugs are often quite hard to find.
Top-down LL
I start by teaching students to write recursive descent parsers for what I call LL(1) languages. Specifically, I say that a grammar is LL(1) if every production for a non-terminal N starts with a different terminal T. I make sure the grammar we compile has plenty of productions like that:
expr : <integer> | <float> | <variable> | [ <expr> , ... ] | { <expr> , ... }
Because I am teaching them a methodology, I have a very specific way I want them to write this parser.
I ask them to write a function called parse_expr
, and functions called parse_int_expr
, parse_float_expr
, and so on. The parse_expr
functions only ever do one thing: they switch on the results of peek_token
, and in each branch calls a parse_X_expr
function:
Expr parse_expr() { switch (peek_token()) { case INT: return parse_int_expr(); ... } }
The parse_X_expr
functions only ever do one thing: they call expect_token
for each token and parse_N
for each non-terminal N
, one after another, before finally building and returning an AST node of the appropriate type.
IntExpr parse_int_expr() { int n = expect_token(TokenType.INTVALUE); return new IntExpr(n); }
This year, I waited to introduce sequences and repetion until I had covered LR parsing, but in future years I might introduce the idea of a parse_exprs
function that contains a while loop, and teach them to write a single specific while loop to parse sequences.
The goal is that students should be able to translate an LL(1) grammar to code mechanically. This part is pretty standard, it's how top-down parsing is normally taught. Most parsing classes that aren't using a parser generator are going to have a section somewhat like this.
Top-down LR
I now explain that some grammars are not LL(1), because multiple productions can start with the same terminal. I stick to easy examples, like post-fix indexing operations, instead of doing tricky things like infix operations. So for example, we might add this to the grammar:1 [1 For those wondering, in the language I have them compile, square brackets are for N-dimensional tensors and curly braces are for tuples.]
expr : <expr> [ <expr> , ... ] | <expr> { <integer> }
This grammar isn't LL(1) because both these productions and other productions can start with, say, a variable token. Note that this is implicitly introducing the FIRST
set from LR table construction, but instead of defining it recursively, I just talk them through the concept of whether the first token of two productions can match. This is a natural follow-on from discussing LL(1) grammars.
Now, instead of introducing LR grammars and LR parsing, I introduce the idea of transforming this grammar into another one. Specifically, I ask students to transform the grammar into:
expr : <integer> <expr_cont> | <float> <expr_cont> | <variable> <expr_cont> | [ <expr> , ... ] <expr_cont> | { <expr> , ... } <expr_cont> expr_cont : [ <expr> , ... ] <expr_cont> | { <integer> } <expr_cont> |
This grammar isn't quite LL(1), because expr_cont
has an empty production, but the rule is just that all productions have to start with different tokens, except that there can be an empty production, and if there's a single production then it can start with a non-terminal.
Note that this grammar is equivalent to the original. Doing this transformation is not easy or intuitive for students, but we do lots of examples. I pick two different productions that can start with the same token and write out example programs that use those two productions. Then I ask: at what point do we know that they're using two different productions, because they correspond to two different tokens that can be seen at that point? That's the point where we want to add a _cont
grammar class. Again, this is implicitly introducing the FOLLOWS
set from LR parsing, but I introduce it intuitively instead of mathematically.
The parse_expr
and parse_int_expr
functions stay exactly the same; here, parse_int_expr
calls parse_expr_cont
at the end, just like a normal non-terminal. However, these continuation non-terminals are special (and I tell students to always name them x_cont
to keep track): instead of taking no arguments, they take terminals or non-terminals as an argument. So, for example, the parse_int_expr
parses an integer token, builds an IntExpr
AST node from it, and then calls parse_expr_cont
with that IntExpr
as an argument:
Expr parse_int_expr() { String n = expect_token(TokenType.INTVALUE); IntExpr e = new IntExpr(n); return parse_expr_cont(e); }
These parsing functions for continuation non-terminals then use that argument to build larger AST nodes, based on further parses:
Expr parse_tupleindex_expr_cont(Expr base) { expect_token(TokenType.LCURLY); String n = expect_token(TokenType.INTVALUE); expect_token(TokenType.RCURLY); Expr e = new TupleIndexExpr(base, n); return parse_expr_cont(e); }
One big advantage of this approach is that it leverages the idea of grammar transformation / elaboration / equivalence. This makes talking about ambiguity easier, because my approach to grammar ambiguity also involves rewriting the grammar into an equivalent, non-ambiguous grammar. Moreover, I can ask students to turn in elaborated grammars before they start coding, which cuts down on bugs dramatically and also isn't that much work for me (because the grammars are short).
Now, if you think about this in terms of LR parsing, all of the _cont
grammatical classes here correspond to different LR(0) states, and the fact that these _cont
classes always come at the end of a production means that the recursive descent parser is using tail calls, that is, just iterating instead of recursing. So once you have the parser in this form, you can create invert the control flow to get a normal LR parser with a table, so I'm not straying far from the theory here.
But most presentations of an LR parser introduce it with enough formality to write a parser generator—because most LR parsers use a table, and writing a table by hand is difficult for any non-trivial language. Instead of doing that, my top-down LR parsing approach requires students to compute the state table and transitions mostly by hand—but they only need to do it once, for the specific language that we are compiling, which means much less formality is needed. Students don't need to learn how the FIRST
and FOLLOW
sets are defined, though they do use those concepts when doing the transformation themselves. (And, of course, I control the langauge grammar so I've built it in a way to require _cont
tokens in a couple of places.) By framing that as a grammar transformation instead of table construction, I also get them to practice a skill they'll need later for grammar disambiguation.
Disambiguation
My discussion of disambiguation is pretty standard, so I won't discuss it in too much detail. I introduce ambiguity, explain the idea of banning of banning certain patterns (like a AddExpr
below a MultExpr
), show how you do that by splitting the expr
class into multiple classes, and so on. This is how most discussions of parsing cover it, and I do it too.
But one thing that has to do with parsing algorithms is associativity. LL parsers prefer to parenthesize to the right while LR parsers prefer to parenthesize to the left. However, in my top-down LR approach, the grammar is the same:
addexpr : <mulexpr> <addexpr_cont> addexpr_cont: + <mulexpr> <addexpr_cont> |
The main difference is how parse_addexpr_cont
works. You can specify its input to be a MulExpr
or an Expr
. In the first case, you're guaranteeing that the left hand side of the addition is a multiply node, meaning that you get right associativity. In the second case, you allow it to be an addition node and get left associativity. In code, option one looks like this:
Expr parse_add_addexpr_cont(MulExpr lhs) { expect_token(TokenType.PLUS); MulExpr middle = parse_mulexpr(); Expr rhs = parse_addexpr_cont(middle); return new AddExpr(lhs, rhs); }
Option two looks like this:
Expr parse_add_addexpr_cont(Expr first) { expect_token(TokenType.PLUS); MulExpr middle = parse_mulexpr(); Expr lhs = new AddExpr(first, middle); return parse_addexpr_cont(lhs); }
Note that the first option, which left associates, looks a lot like an LL parser, where the last thing we do is build an AST node. And the second option, which right associates, looks a lot like an LR parser, at least in the sense that we tail recurse, which is kind of like iterating. Importantly, students figured this out without hints.
Conclusion
I taught students to parse a reasonably complicated language this way, and didn't get too much complaining. Students did struggle with grammar transformations, but they would have struggled with them anyway when I talked about grammar ambiguity. And despite building a pretty complex parser, the parser could be almost entirely derived from the grammar, with the associativity decision being the only decision they have to make at the code level.
Now, one part of Laurence's pitch that I find compelling that I'm not making use of here is the idea that an LR parser generator is a kind of type checker for your grammar, making sure that it can be parsed unambiguously. It would be cool to build some grammar tooling for students to use, which could maybe use a standard LR parser generator under the hood to provably identify ambiguous grammars. But even in this case, I wouldn't need to actually teach the students what shift/reduce and reduce/reduce conflicts are, because those would just be translated into two productions starting with the same token. Also, I'd be interested to know what the standard name for this kind of parsing is; if you've seen it before, let me know.
Footnotes:
For those wondering, in the language I have them compile, square brackets are for N-dimensional tensors and curly braces are for tuples.