dev-resources.site
for different kinds of informations.
Pratt Parsing in MiniScript
I recently came across a delightful blog post called Pratt Parsers: Expression Parsing Made Easy. It describes a technique for parsing known as Pratt parsing, or top-down operator precedence parsing.
There's no point in me repeating the explanation here, as the author (Bob Nystrom) does an excellent job already. But, he illustrates by building a parser in Java. Naturally I thought: why not do the same in MiniScript?
So, I've done just that. You can find the code on GitHub here: https://github.com/JoeStrout/bantam-miniscript
Code Overview
Since MiniScript is not Java and does not require one file per class, our code is organized a little differently. So let me just give you a quick orientation to what's what:
- lexer.ms: a very simple lexer that breaks an input string into tokens. Note that Bantam does not support numbers; only named identifiers like "abc".
- tokens.ms: defines the token types and a little token class (which contains the type and the actual token text).
- precedence.ms: defines the nine different precedence levels needed by the Bantam language.
- expressions.ms: defines the abstract syntax tree, i.e., the internal representation of expressions in our language. The goal of the parser is to convert text input into a little tree of objects defined in this file.
- parselets.ms: defines the nine different little "parselet" classes needed (all deriving from one of two base classes, PrefixParselet or InfixParselet). There is one of these for all prefix operators (like the "-" in "-a"), one for all binary operators (like "+" in "a+b"), and the rest for handling special cases like assignment, grouping (parentheses), a simple identifier name, etc.
-
parser.ms: the base class for any language parser. The guts of the Pratt algorithm are in the
parseExpression
method, which we'll discuss a little more below. - bantamParser.ms: customizes Parser for the Bantam language. You can easily see here what parselet is used (and with what precedence) based on the type of the next token to be parsed.
- main.ms: finally, the main program just runs a bunch of tests; my version also provides an "interact" method that shows the parsing for any expression you type in.
Heart of the Pratt parser
The core of the algorithm is this method in parser.ms:
Parser.parseExpression = function(precedence = 0)
token = self.consume
prefix = self.prefixParselets.get(token.type)
if not prefix then
self.throwError "Could not parse """ + token.text + """."
return
end if
left = prefix.parse(self, token)
while precedence < self.getPrecedence and not self.error
token = self.consume
infix = self.infixParselets.get(token.type)
left = infix.parse(self, left, token)
end while
return left
end function
Not too bad, is it? The first line grabs the next token to be parsed, and tries to find a matching prefix parselet. (The prefix parselets handle anything that can start an expression, like a prefix operator or a simple identifier.) We then call on that prefix to actually parse, and store the result in local variable left
.
Then, we have a little loop that continues as long as the precedence of the next parser we would apply (which is found by self.getPrecedence
) is greater than our current precedence. The loop gets the appropriate infix (i.e. not starting an expression) parselet, tells it to parse, and makes the result the new left
.
And that's it. This simple loop, along with an appropriate set of parselets, does all the magic.
Could it be any simpler?
...Actually, I suspect it could. The algorithm itself is simple, but as written (a fairly direct port of the Java code), it still feels like a lot of overhead and bureaucracy to do a relatively simple job.
So, maybe next week I'll follow up with a clean-sheet rewrite of the same algorithm, but taking a simpler, more MiniScript-y approach.
What do you think? Let me know in the comments below!
Featured ones: