Basic Lean parser infrastructure #

The Lean parser was developed with the following primary goals in mind:

flexibility: Lean's grammar is complex and includes indentation and other whitespace sensitivity. It should be possible to introduce such custom "tweaks" locally without having to adjust the fundamental parsing approach.
extensibility: Lean's grammar can be extended on the fly within a Lean file, and with Lean 4 we want to extend this to cover embedding domain-specific languages that may look nothing like Lean, down to using a separate set of tokens.
losslessness: The parser should produce a concrete syntax tree that preserves all whitespace and other "sub-token" information for the use in tooling.
performance: The overhead of the parser building blocks, and the overall parser performance on average-complexity input, should be comparable with that of the previous parser hand-written in C++. No fancy optimizations should be necessary for this.

Given these constraints, we decided to implement a combinatoric, non-monadic, lexer-less, memoizing recursive-descent parser. Using combinators instead of some more formal and introspectible grammar representation ensures ultimate flexibility as well as efficient extensibility: there is (almost) no pre-processing necessary when extending the grammar with a new parser. However, because the all results the combinators produce are of the homogeneous Syntax type, the basic parser type is not actually a monad but a monomorphic linear function ParserState → ParserState, avoiding constructing and deconstructing countless monadic return values. Instead of explicitly returning syntax objects, parsers push (zero or more of) them onto a syntax stack inside the linear state. Chaining parsers via >> accumulates their output on the stack. Combinators such as node then pop off all syntax objects produced during their invocation and wrap them in a single Syntax.node object that is again pushed on this stack. Instead of calling node directly, we usually use the macro leading_parser p, which unfolds to node k p where the new syntax node kind k is the name of the declaration being defined.

The lack of a dedicated lexer ensures we can modify and replace the lexical grammar at any point, and simplifies detecting and propagating whitespace. The parser still has a concept of "tokens", however, and caches the most recent one for performance: when tokenFn is called twice at the same position in the input, it will reuse the result of the first call. tokenFn recognizes some built-in variable-length tokens such as identifiers as well as any fixed token in the ParserContext's TokenTable (a trie); however, the same cache field and strategy could be reused by custom token parsers. Tokens also play a central role in the prattParser combinator, which selects a leading parser followed by zero or more trailing parsers based on the current token (via peekToken); see the documentation of prattParser for more details. Tokens are specified via the symbol parser, or with symbolNoWs for tokens that should not be preceded by whitespace.

The Parser type is extended with additional metadata over the mere parsing function to propagate token information: collectTokens collects all tokens within a parser for registering. firstTokens holds information about the "FIRST" token set used to speed up parser selection in prattParser. This approach of combining static and dynamic information in the parser type is inspired by the paper "Deterministic, Error-Correcting Combinator Parsers" by Swierstra and Duponcheel. If multiple parsers accept the same current token, prattParser tries all of them using the backtracking longestMatchFn combinator. This is the only case where standard parsers might execute arbitrary backtracking. At the moment there is no memoization shared by these parallel parsers apart from the first token, though we might change this in the future if the need arises.

Finally, error reporting follows the standard combinatoric approach of collecting a single unexpected token/... and zero or more expected tokens (see Error below). Expected tokens are e.g. set by symbol and merged by <|>. Combinators running multiple parsers should check if an error message is set in the parser state (hasError) and act accordingly. Error recovery is left to the designer of the specific language; for example, Lean's top-level parseCommand loop skips tokens until the next command keyword on error.