dev-resources.site
for different kinds of informations.
How to parse computer code, Advent of Code 2024 day 3
Having tackled some of the later Advent of Code challenges, I wanted to revisit Day 3, which presented an interesting parsing problem. The task involved extracting valid code from a noisy input, a great exercise in parser and lexer development. Join me as I explore my approach to this challenge.
A generated image showing my love for the puzzle (?) by Microsoft Copilot
When I first wrote about the ruler DSL, I relied on Hy for parsing. However, my recent exploration of generative AI has introduced a new parsing method: generated code using the funcparserlib library. This Advent of Code challenge allowed me to delve into the intricacies of funcparserlib and develop a much stronger grasp of the generated code's functionality.
Implementing the Lexer (Lexical Analysis)
The first step in processing our corrupted input is lexing (or tokenization). The lexer (or tokenizer) scans the input string and divides it into individual tokens, which are the basic building blocks for further processing. A token represents a meaningful unit in the input, categorized by its type. For this puzzle, we’re interested in these token types:
- Operators (OP): These represent function names, such as mul, do, and don't. For example, the input mul(2, 3) contains the operator token mul.
- Numbers: These are numerical values. For instance, in the input mul(2, 3), 2 and 3 would be recognized as number tokens.
- Commas: The comma character (,) acts as a separator between arguments.
- Parentheses: Opening ( and closing ) parentheses define the structure of the function calls.
- Gibberish: This category encompasses any characters or sequences of characters that don’t match the other token types. This is where the “corrupted” part of the input comes in. For example, %$#@ or any random letters would be considered gibberish.
While funcparserlib often uses magic strings in its tutorials, I prefer a more structured approach. Magic strings can lead to typos and make it difficult to refactor code. Using an Enum to define token types offers several advantages: better readability, improved maintainability, and enhanced type safety. Here's how I defined the token types using an Enum:
from enum import Enum, auto
class Spec(Enum):
OP = auto()
NUMBER = auto()
COMMA = auto()
LPAREN = auto()
RPAREN = auto()
GIBBERISH = auto()
By using Spec.OP, Spec.NUMBER, etc., we avoid the ambiguity and potential errors associated with using plain strings.
To seamlessly integrate Enum with funcparserlib, I created a custom decorator named TokenSpec_. This decorator acts as a wrapper around the original TokenSpec function from funcparserlib. It simplifies token definition by accepting a value from our Spec Enum as the spec argument. Internally, it extracts the string representation of the enum (spec.name) and passes that along with any other arguments to the original TokenSpec function.
from funcparserlib.lexer import TokenSpec
def TokenSpec_(spec: Spec, *args: Any, **kwargs: Any) -> TokenSpec:
return TokenSpec(spec.name, *args, **kwargs)
With the decorated TokenSpec_ function, this allows us to define the tokenizer. We use make_tokenizer from funcparserlib to create a tokenizer that takes a list of TokenSpec_ definitions. Each definition specifies a token type (from our Spec ENUM) and a regular expression to match it.
from funcparserlib.lexer import make_tokenizer
def tokenize(input: str) -> tuple[Token, ...]:
tokenizer = make_tokenizer(
[
TokenSpec_(
Spec.OP, r"mul(?=\(\d{1,3},\d{1,3}\))|do(?=\(\))|don\'t(?=\(\))"
),
TokenSpec_(Spec.NUMBER, r"\d{1,3}"),
TokenSpec_(Spec.LPAREN, r"\("),
TokenSpec_(Spec.RPAREN, r"\)"),
TokenSpec_(Spec.COMMA, r","),
TokenSpec_(Spec.GIBBERISH, r"[\s\S]"),
]
)
return tuple(
token for token in tokenizer(input) if token.type != Spec.GIBBERISH.name
)
The OP regular expression uses alternation (|) to match the different function formats. Specifically:
- mul(?=(\d{1,3},\d{1,3})): Matches mul only if it's followed by parentheses containing two numbers separated by a comma. The positive lookahead assertion (?=...) ensures that the parentheses and numbers are present but are not consumed by the match.
- do(?=()): Matches do only if followed by empty parentheses.
- don\'t(?=()): Matches don't only if followed by empty parentheses.
A graph representation of the regular expression
Finally, the tokenize function filters out any GIBBERISH tokens during tokenization to focus on the relevant parts of the input for further processing.
The process of interpreting code typically involves two main stages: lexical analysis (or lexing) and parsing. We’ve already implemented the first stage: our tokenize function acts as a lexer, taking the input string and converting it into a sequence of tokens. These tokens are the fundamental building blocks that the parser will use to understand the structure and meaning of the code. Now, let's explore how the parser uses these tokens.
Implementing the parser
The parsed tokens returned by the tokenize function are then sent to a parser for further processing. To bridge the gap between our Spec Enum and the tok function, we introduce a decorator named tok_.
from funcparserlib.parser import tok
def tok_(spec: Spec, *args: Any, **kwargs: Any) -> Parser[Token, str]:
return tok(spec.name, *args, **kwargs)
For example, if we have a Spec.NUMBER token, the returned parser will accept the token, and return a value, as follows:
>>> from funcparserlib.lexer import Token
>>> number_parser = tok_(Spec.NUMBER)
>>> number_parser.parse([Token(Spec.NUMBER.name, '123'])
'123'
The returned value can then be transformed into the desired data type using the >> operator, as shown below:
>>> from funcparserlib.lexer import Token
>>> from ast import literal_eval
>>> number_parser = tok_(Spec.NUMBER) >> literal_eval
>>> number_parser.parse([Token(Spec.NUMBER.name, '123'])
123
Typically, it’s best practice to use ast.literal_eval when parsing unknown input to avoid potential security vulnerabilities. However, the constraints of this particular Advent of Code puzzle—specifically, that all numbers are valid integers—allow us to use the more direct and efficient int function for converting string representations to integers.
number = tok_(Spec.NUMBER) >> int
We can define parsing rules to enforce specific token sequences and transform them into meaningful objects. For example, to parse a mul function call, we require the following sequence: left parenthesis, number, comma, another number, right parenthesis. We then transform this sequence into a Mul object:
from abc import ABC
class Expr(ABC):
pass
@dataclass
class Mul(Expr):
alpha: int
beta: int
def evaluate(self) -> int:
return self.alpha * self.beta
def __str__ (self) -> str:
return f"mul({self.alpha},{self.beta})"
mul = (
tok_(Spec.OP, "mul")
+ tok_(Spec.LPAREN)
+ number
+ tok_(Spec.COMMA)
+ number
+ tok_(Spec.RPAREN)
) >> (lambda elem: Mul(elem[2], elem[4]))
This rule combines the tok_ parsers for the required tokens (OP, LPAREN, COMMA, RPAREN) with the number parser. The >> operator then transforms the matched sequence into a Mul object, extracting the two numbers from the tuple elem at indices 2 and 4.
We can apply the same principle to define parsing rules for the do and don't operations. These operations take no arguments (represented by empty parentheses) and are transformed into Condition objects:
@dataclass
class Condition(Expr):
can_proceed: bool
do = (tok_(Spec.OP, "do") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
lambda elem: Condition(True)
)
dont = (tok_(Spec.OP, "don't") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
lambda elem: Condition(False)
)
The do rule creates a Condition object with can_proceed = True, while the don't rule creates one with can_proceed = False.
Finally, we combine these individual parsing rules (do, don't, and mul) into a single expr parser using the | (or) operator:
expr = do | dont | mul
This expr parser will attempt to match the input against each of the rules in turn, returning the result of the first successful match.
Our expr parser handles complete expressions like mul(2,3), do(), and don't(). However, the input might also contain individual tokens that are not part of these structured expressions. To handle these, we define a catch-all parser called everything:
everything = (
tok_(Spec.NUMBER) | tok_(Spec.LPAREN) | tok_(Spec.RPAREN) | tok_(Spec.COMMA)
)
This parser uses the | (or) operator to match any single token of type NUMBER, LPAREN, RPAREN, or COMMA. It's essentially a way to capture any stray tokens that aren't part of a larger expression.
With all the components defined, we can now define what constitutes a complete program. A program consists of one or more “calls,” where a “call” is an expression potentially surrounded by stray tokens.
The call parser handles this structure: it matches any number of stray tokens (many(everything)), followed by a single expression (expr), followed by any number of additional stray tokens. The operator.itemgetter(1) function then extracts the matched expression from the resulting sequence.
from funcparserlib.parser import finished, many
call = many(everything) + expr + many(everything) >> operator.itemgetter(1)
A full program, represented by the program parser, consists of zero or more calls, ensuring that the entire input is consumed by using the finished parser. The parsed result is then converted into a tuple of expressions.
program = many(call) + finished >> (lambda elem: tuple(elem[0]))
Finally, we group all these definitions into a parse function. This function takes a tuple of tokens as input and returns a tuple of parsed expressions. All the parsers are defined within the function body to prevent polluting the global namespace and because the number parser depends on the tok_ function.
def parse(tokens: tuple[Token, ...]) -> tuple[Expr, ...]:
number = tok_(Spec.NUMBER) >> int
everything = (
tok_(Spec.NUMBER) | tok_(Spec.LPAREN) | tok_(Spec.RPAREN) | tok_(Spec.COMMA)
)
mul = (
tok_(Spec.OP, "mul")
+ tok_(Spec.LPAREN)
+ number
+ tok_(Spec.COMMA)
+ number
+ tok_(Spec.RPAREN)
) >> (lambda elem: Mul(elem[2], elem[4]))
do = (tok_(Spec.OP, "do") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
lambda elem: Condition(True)
)
dont = (tok_(Spec.OP, "don't") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
lambda elem: Condition(False)
)
expr = do | dont | mul
call = many(everything) + expr + many(everything) >> operator.itemgetter(1)
program = many(call) + finished >> (lambda elem: tuple(elem[0]))
return program.parse(tokens)
Solving the puzzle
With our parser in place, solving Part 1 is straightforward. We need to find all mul operations, perform the multiplications, and sum the results. We start by defining an evaluation function that handles Mul expressions
def evaluate_skip_condition(expressions: tuple[Expr, ...]) -> int:
return reduce(
operator.add,
(
expression.evaluate()
for expression in expressions
if isinstance(expression, Mul)
),
)
To solve Part 1, we tokenize and parse the input, then use the function evaluate_skip_condition we just defined to get the final result:
def part1(input: str) -> int:
return evaluate_skip_condition(parse(tokenize(input.strip())))
For Part 2, we need to skip evaluating mul operations if a don't condition has been encountered. We define a new evaluation function, evaluate_with_condition, to handle this:
from functools import reduce
def evaluate_with_condition(expressions: tuple[Expr, ...]) -> int:
def reducer(current: tuple[bool, int], incoming: Expr) -> tuple[bool, int]:
condition, result = current
match incoming:
case Mul():
return (
condition,
(result + incoming.evaluate()) if condition else result,
)
case Condition():
return (incoming.can_proceed, result)
case _:
raise Exception("Unknown expression")
return reduce(reducer, expressions, (True, 0))[-1]
This function uses reduce with a custom reducer function to maintain a running sum and a boolean flag (condition). The condition flag is updated when a Condition expression (do or don't) is encountered. Mul expressions are only evaluated and added to the sum if the condition is True.
Previous Iteration
Initially, my approach to parsing involved two distinct passes. First, I would tokenize the entire input string, collecting all tokens regardless of their type. Then, in a separate step, I would perform a second tokenization and parsing specifically to identify and process mul operations.
# Simplified representation of the double parsing approach
tokens = tokenize(input) # First pass: all tokens
mul_tokens = tokenize_mul(input) # Second pass: only mul-related tokens
parsed_mul = parser_mul(mul_tokens) # Parse mul tokens
# ... further processing ...
The improved approach eliminates this redundancy by performing the tokenization and parsing in a single pass. We now have a single parser that handles all token types, including those related to mul, do, don't, and other individual tokens.
# Simplified representation of the single parsing approach
tokens = tokenize(input) # Single pass: all tokens
parsed_expressions = parse(tokens) # Parse all tokens in one go
# ... further processing ...
Instead of re-tokenizing the input to find mul operations, we leverage the token types identified during the initial tokenization. The parse function now uses these token types to directly construct the appropriate expression objects (Mul, Condition, etc.). This avoids the redundant scanning of the input and significantly improves efficiency.
That wraps up our parsing adventure for this week’s Advent of Code. While this post required a significant time commitment, the process of revisiting and solidifying my knowledge of lexing and parsing made it a worthwhile endeavour. This was a fun and insightful puzzle, and I’m eager to tackle more complex challenges in the coming weeks and share my learnings.
As always, thank you for reading, and I shall write again, next week.
Featured ones: