Why I'm using a recursive descent parser
Writing a parser by hand is hard, but it's worth it for better error reporting and control
If you search the web for the best parser architecture, you’ll be swamped with results about parser generators. These tools take a formal specification of the grammar of the language you want to parse and output parsing code. Generated code has real advantages, notably execution speed and the ability to change the grammar and quickly regenerate the parser.
In creating the Poly programming language, I’ve decided to hand-write my parser using a technique called recursive descent. This is a more manual process, but it provides more control over parsing and makes better error messages easier to implement. In theory, recursive descent can exhibit exponentially bad performance if a program is structured in certain pathological ways. In practice, most programming languages are not conducive to that worst case. In researching existing implementations, I found that among mainstream programming language toolchains, hand-written recursive descent parsers quietly reign supreme.
What is recursive descent?
Recursive descent is the primary way of constructing a parser by hand, because it mimics the ways a language’s rules are intuitively applied (top-down, recursively).
Let’s say you have a really simple language that can only add and subtract integers. The grammar might look like:
expression:
| expression + expression
| expression - expression
| number(int)
A recursive descent parser to recognize this grammar starts at the beginning of the program, recursively trying to apply the rules to successfully recognize the program. In ReasonML, a basic implementation might look like:
type token = | PLUS_TOKEN | MINUS_TOKEN | NUMBER_TOKEN(int);

type expression =
  | PLUS(expression, expression)
  | MINUS(expression, expression)
  | NUMBER(int)
  | PARSE_ERROR;

/* Try each grammar rule against the token list, recursing into subexpressions. */
let rec expression = (tokens) => {
  switch (tokens) {
  /* expression + expression */
  | [e1, PLUS_TOKEN, ...e2] =>
    PLUS(expression([e1]), expression(e2))
  /* expression - expression */
  | [e1, MINUS_TOKEN, ...e2] =>
    MINUS(expression([e1]), expression(e2))
  /* number(int) */
  | [NUMBER_TOKEN(n)] => NUMBER(n)
  /* anything else is not a valid expression */
  | _ => PARSE_ERROR
  }
};
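To show the same shape in another language, here is a hypothetical TypeScript sketch of a recursive descent parser for the same grammar (none of these names come from Poly; they are invented for illustration). It keeps the one-function-per-rule structure, but consumes tokens left to right with a cursor instead of pattern-matching whole lists:

```typescript
// Hypothetical token and AST types for the add/subtract grammar above.
type Token = { kind: "+" | "-" } | { kind: "number"; value: number };

type Expr =
  | { kind: "plus"; left: Expr; right: Expr }
  | { kind: "minus"; left: Expr; right: Expr }
  | { kind: "number"; value: number };

function parse(tokens: Token[]): Expr {
  let pos = 0;

  // expression: number ((+|-) number)*
  // Iterating here, instead of recursing on the left operand, sidesteps the
  // infinite loop a literal translation of the left-recursive rule would cause.
  function expression(): Expr {
    let left = num();
    while (pos < tokens.length) {
      const op = tokens[pos];
      if (op.kind !== "+" && op.kind !== "-") break;
      pos++;
      const right = num();
      left = { kind: op.kind === "+" ? "plus" : "minus", left, right };
    }
    return left;
  }

  // number(int)
  function num(): Expr {
    const tok = tokens[pos];
    if (tok === undefined || tok.kind !== "number") {
      throw new Error(`expected a number at token ${pos}`);
    }
    pos++;
    return { kind: "number", value: tok.value };
  }

  return expression();
}
```

Note the rewrite of the left-recursive grammar rule into a loop: this is the standard transformation a recursive descent author applies by hand, and it also makes the operators left-associative.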
What draws me to this parsing style isn’t just that it’s more feasible to construct parsers by hand; it’s that almost every practical compiler toolchain I’ve researched uses it.
Recursive descent in the wild
GCC, one of the dominant compiler toolchains for C and C++, replaced its Bison-generated parser with a hand-written recursive descent parser in 2004. In a mailing list post, contributor Joseph Myers wrote:
Writing a C parser is quicker than reading, analysing and replying to
hundreds of list messages … so I just wrote one over
the past week. This is a hand-written, recursive-descent parser for C
that parses the same GNU C language as the Bison parser, just
replacing the parser and not other code at the same time.
Essentially, C, one of the most famous and enduring programming languages, is compiled using a hand-written parser by its most popular toolchain. Clang, perhaps the second most popular toolchain for C/C++, also uses a hand-written recursive descent parser. Its website has a section about the parser that reads:
We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.
It’s not well-advertised, but in basically every mainstream programming language I have peeked at under the hood, I see recursive descent. Buried in Rust’s developer guide:
The parser translates the token stream from the lexer into an Abstract Syntax Tree (AST). It uses a recursive descent (top-down) approach to syntax analysis.
I dug through TypeScript’s compiler source code:
// Note: it should not be necessary to save/restore these flags during speculative/lookahead
// parsing. These context flags are naturally stored and restored through normal recursive
// descent parsing and unwinding.
The list goes on and on (OCaml, Go, C#).
I am sure there are counter-examples of programming languages with non-recursive descent parsers. But of the established programming languages I researched, recursive descent parsing is used in all of their toolchains.
When to use generated parsers
Theoretically, other parser architectures offer superior performance to hand-written parsers. Unfortunately, the code required to implement them is not very human-readable, which is why many tools exist to generate it, like Bison, ANTLR, and Menhir.
These tools are useful, even for programming language creation and ideation. But when clear error messages and extensible, custom behavior matter, generated parser code falls short.
Summary
I imagine the debate between writing a parser by hand and using a generated parser plays out in many programming language creators’ heads. What ultimately tipped me over the edge was seeing hand-written recursive descent parsing so widespread in practical, mainstream programming languages. I guess, in the end, compiler performance is not as important as usability and diagnostics.