How I added support for nested functions in Python bytecode
I wanted to share some pretty cool stuff I've been learning about Python bytecode, including how I added support for nested functions, but my guy at the printing press said I needed to keep it under 500 words.
It's a holiday week, he shrugged. What do you expect me to do?
Excluding code snippets, I bargained.
Fine, he conceded.
Do you know why we use bytecode in the first place?
I just operate the printing press; I trust you, though.
Fair enough. Let's begin.
Why we use bytecode in the first place
Memphis, my Python interpreter written in Rust, has two execution engines. Neither can run all code but both can run some code.
My treewalk interpreter is what you would build if you didn't know what you were doing. 🤷‍♂️ You tokenize the input Python code, generate an abstract syntax tree (AST), and then walk the tree and evaluate each node. Expressions return values and statements modify the symbol table, which is implemented as a series of scopes that respect Python scoping rules. Just remember the easy mnemonic LEGB: local, enclosing, global, builtin.
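As a quick illustration in CPython itself (with throwaway names of my choosing), each scope in the LEGB chain can shadow the one outside it:

```python
x = "global"              # G: module scope

def outer():
    x = "enclosing"       # E: enclosing function scope
    def inner():
        x = "local"       # L: innermost scope wins
        return x
    return inner()

print(outer())            # local
print(x)                  # global, untouched by the inner assignments
print(len("builtin"))     # len itself resolves in the builtin scope (B)
```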
My bytecode VM is what you would build if you didn't know what you were doing but wanted to act like you did. Also 🤷‍♂️. For this engine, the tokens and AST work the same, but rather than walking we take off sprinting. We compile the AST into an intermediate representation (IR), hereafter known as bytecode. We then create a stack-based virtual machine (VM), which conceptually acts like a CPU, executing bytecode instructions in sequence, but is implemented entirely in software.
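Memphis defines its own bytecode, but CPython's dis module gives a feel for what a compiled IR looks like (the exact opcode names vary across CPython versions):

```python
import dis

def add_one(n):
    return n + 1

# Each instruction operates on an implicit value stack:
# load n, load the constant 1, add, return.
for instr in dis.Bytecode(add_one):
    print(instr.opname, instr.argrepr)
```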
(For a complete guide of both approaches without the ramblings, Crafting Interpreters is excellent.)
Why do we do this in the first place? Just remember the two Ps: portability and performance. Remember how in the early 2000s nobody would shut up about how Java bytecode was portable? All you need is a JVM and you can run a Java program compiled on any machine! Python chose not to go with this approach for both technical and marketing reasons, but in theory the same principles apply. (In practice, the compilation steps are different and I regret opening this can of worms.)
Performance is the biggie though. Rather than traversing an AST multiple times during the lifetime of a program, the compiled IR is a more efficient representation: we avoid the overhead of repeated tree walks, and the IR's flat structure often results in better branch prediction and cache locality at runtime.
(I don't blame you for not thinking about caching if you don't have a background in computer architecture; heck, I began my career in that industry and I think about caching far less than I think about how to avoid writing the same line of code twice. So just trust me on the performance piece. That's my leadership style: blind trust.)
Hey buddy, that's 500 words. We need to load up the frame and let 'er rip.
Already?! You excluded code snippets?
There are no code snippets, my man.
Okay okay. Just 500 more. I promise.
Context matters for Python variables
I got kinda far before tabling my bytecode VM implementation about a year ago: I could define Python functions and classes and call those functions and instantiate those classes. I clamped down this behavior with some tests. But I knew my implementation was messy and that I'd need to revisit the fundamentals before adding more fun stuff. Now it's Christmas week and I want to add fun stuff.
Consider this snippet for calling a function, keeping an eye on the TODO.
fn compile_function_call(
    &mut self,
    name: &str,
    args: &ParsedArguments,
) -> Result<Bytecode, CompileError> {
    let mut opcodes = vec![];
    // We push the args onto the stack in reverse call order so that we will pop
    // them off in call order.
    for arg in args.args.iter().rev() {
        opcodes.extend(self.compile_expr(arg)?);
    }
    let (_, index) = self.get_local_index(name);
    // TODO how does this know if it is a global or local index? this may not be the right
    // approach for calling a function
    opcodes.push(Opcode::Call(index));
    Ok(opcodes)
}
Are you done considering? We load the function arguments onto the stack and "call the function". In bytecode, all names are converted into indices (because index access is faster during the VM runtime), but we don't really have a way to know whether we are dealing with a local index or a global index here.
Now consider the improved version.
fn compile_function_call(
    &mut self,
    name: &str,
    args: &ParsedArguments,
) -> Result<Bytecode, CompileError> {
    let mut opcodes = vec![self.compile_load(name)];
    // We push the args onto the stack in reverse call order so that we will pop
    // them off in call order.
    for arg in args.args.iter().rev() {
        opcodes.extend(self.compile_expr(arg)?);
    }
    // Count the arguments themselves, not the opcodes: a single argument
    // expression may compile to multiple opcodes.
    let argc = args.args.len();
    opcodes.push(Opcode::Call(argc));
    Ok(opcodes)
}
Thank you for considering that code.
We now support nested function calls! What changed?
- The Call opcode now takes the number of positional arguments, rather than an index to the function. This instructs the VM how many arguments to pop off the stack before calling the function.
- After popping the arguments off the stack, the function itself will be left on the stack, and compile_load has already handled local versus global scope for us.
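For comparison, CPython's call opcodes made the same design choice: the callee sits on the stack beneath its arguments, and the call instruction carries an argument count (CALL_FUNCTION before 3.11, CALL after). A quick sketch with dis:

```python
import dis

def call_it(f, x):
    return f(x)

# Some CALL* opcode appears after the loads of f and x; its argument
# is the count of positional arguments to pop.
opnames = [instr.opname for instr in dis.Bytecode(call_it)]
print(opnames)
```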
LOAD_GLOBAL versus LOAD_FAST
Let's take a look at what compile_load is doing.
fn compile_load(&mut self, name: &str) -> Opcode {
    match self.ensure_context() {
        Context::Global => Opcode::LoadGlobal(self.get_or_set_nonlocal_index(name)),
        Context::Local => {
            // Check locals first
            if let Some(index) = self.get_local_index(name) {
                return Opcode::LoadFast(index);
            }
            // If not found locally, fall back to globals
            Opcode::LoadGlobal(self.get_or_set_nonlocal_index(name))
        }
    }
}
There are several key principles in action here:
- We match based on the current context. Adhering to Python semantics, we can consider Context::Global to be the top level of any module (not just your script's entrypoint), and Context::Local to be inside any block (i.e. a function definition or class definition).
- We now differentiate between a local index and a non-local index. (Because I was going crazy trying to decipher what the index 0 referred to in different places, I introduced typed integers. LocalIndex and NonlocalIndex provide type safety for otherwise untyped unsigned integers. I may write about this in the future!)
- We can tell at bytecode-compilation time whether a local variable exists with a given name, and if it does not, at runtime we will search for a global variable. This speaks to the dynamism built into Python: as long as a variable is present in the global scope of that module by the time a function executes, its value can be resolved at runtime. However, this dynamic resolution comes with a performance hit. While local variable lookups are optimized to use stack indices, global lookups require searching the global namespace dictionary, which is slower. This dictionary is a mapping of names to objects, which themselves may live on the heap. Who knew that the saying "Think globally, act locally" was actually referring to Python scopes?
What's in a varname?
The last thing I'll leave you with today is a peek into how these variable names are mapped. In the code snippet below, you'll notice that local indices are found in code.varnames and nonlocal indices are found in code.names. Both live on a CodeObject, which contains the metadata for a block of Python bytecode, including its variable and name mappings.
fn get_or_set_local_index(&mut self, name: &str) -> LocalIndex {
    if let Some(index) = self.get_local_index(name) {
        index
    } else {
        let code = self.ensure_code_object_mut();
        let new_index = code.varnames.len();
        code.varnames.push(name.to_string());
        Index::new(new_index)
    }
}

fn get_local_index(&self, name: &str) -> Option<LocalIndex> {
    let code = self.ensure_code_object();
    find_index(&code.varnames, name).map(Index::new)
}

fn get_or_set_nonlocal_index(&mut self, name: &str) -> NonlocalIndex {
    let code = self.ensure_code_object_mut();
    if let Some(index) = find_index(&code.names, name) {
        Index::new(index)
    } else {
        let new_index = code.names.len();
        code.names.push(name.to_string());
        Index::new(new_index)
    }
}
The difference between varnames and names tormented me for weeks (CPython calls these co_varnames and co_names), but it's actually fairly straightforward: varnames holds the names of all local variables in a given scope, and names does the same for all nonlocal names.
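You can poke at the CPython equivalents directly: on any code object, locals land in co_varnames and anything resolved non-locally lands in co_names (f below is just a demo function):

```python
def f(a):
    b = a + 1            # a and b are locals -> co_varnames
    return len(str(b))   # len and str are nonlocal names -> co_names

print(f.__code__.co_varnames)  # ('a', 'b')
print(f.__code__.co_names)     # ('len', 'str')
```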
Once we properly track this, everything else just works. At runtime, the VM sees a LOAD_GLOBAL or a LOAD_FAST and knows to look in the global namespace dictionary or the local stack, respectively.
Buddy! Mr Gutenberg is on the phone and says we can hold the presses no longer.
Okay! Fine! I get it! Let's ship it. 🚢
What's next for Memphis?
Shh! The printing press man doesn't know I'm writing a conclusion, so I will be brief.
With variable scoping and function calls in a solid place, I'm gradually turning my attention to features like stack traces and async support. If you've enjoyed this dive into bytecode or have questions about building your own interpreter, I'd love to hear from you, so drop a comment!
Subscribe & Save [on nothing]
If you'd like to get more posts like this directly to your inbox, you can subscribe here!
Work With Me
I mentor software engineers to navigate technical challenges and career growth in a supportive, sometimes silly environment. If you're interested, you can book a session here.
Elsewhere
In addition to mentoring, I also write about my experience navigating self-employment and late-diagnosed autism. Less code and the same number of jokes.
- Lake-Effect Coffee, Chapter 2 - From Scratch dot org