06. The Virtual Machine and Bytecode

6.1 The Two Phases of Compilation

Our compiler project is divided into 2 phases:

  1. Compile the syntax tree into instructions of an imaginary virtual machine (VM).
  2. Map virtual instructions to x64 instructions and generate Linux ELF executables.

There are many advantages to dividing our compiler into 2 phases, and production-grade compilers can have more than 2 phases. One of the advantages is that our virtual machine is an abstract representation of a real machine, which means it is free of architecture-specific details. The VM can be converted in phase 2 to any real CPU architecture, such as x64 or ARM (although we’ll only use x64).

The 2-phase design also allows for incremental implementation, which makes the book easier to follow.

Virtual instructions are widely used in compilers and interpreters and are often called “bytecode”. An efficient interpreter executes the program by interpreting the bytecode instead of the syntax tree, although this is not the goal of this book.

6.2 The Virtual Machine Design

There are many possible bytecode designs. But if we design our bytecode to be as close to a real machine as possible, then the conversion to the real machine will be easier. We’ll use the following design:

6.2.1 The Data Stack

There is a data stack for local variables and temporary variables. The stack area for a function call is called a “frame”.

| function 0 | function 1 |            function 2           |
|            |            |   arguments ... | variables ... |
|            |            | 0 | 1 | 2 | ... | n | n+1 | ... |
|   frame 0  |   frame 1  |              frame 2            |
                          |=> top

(The figure above: function 0 calls function 1, which calls function 2).

Variables on the current (top) frame are zero-indexed. Arguments are passed on the frame as if they were local variables. The return value of a function (if it has one) is placed at the position of the first argument.
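The frame layout above can be sketched as a small Python model, assuming a Python host as in the earlier interpreter chapters (the class and method names here are illustrative, not the book’s actual code):

```python
# A minimal sketch of the data stack. Each frame is a contiguous slice
# of one flat list; `frames` records where each frame begins.

class DataStack:
    def __init__(self):
        self.data = []      # the whole stack
        self.frames = []    # start offset of each frame

    def push_frame(self, args):
        # arguments occupy slots 0..len(args)-1 of the new frame
        self.frames.append(len(self.data))
        self.data.extend(args)

    def get(self, i):
        # variable i of the current (top) frame, zero-indexed
        return self.data[self.frames[-1] + i]

    def set(self, i, value):
        base = self.frames[-1]
        # grow the frame if a new local slot is needed
        while len(self.data) <= base + i:
            self.data.append(None)
        self.data[base + i] = value

    def pop_frame(self):
        # the return value (if any) sits at slot 0: the first argument's slot
        ret = self.get(0)
        del self.data[self.frames.pop():]
        return ret

stack = DataStack()
stack.push_frame([3, 4])                    # call with 2 arguments
stack.set(2, stack.get(0) + stack.get(1))   # local slot 2 = arg0 + arg1
stack.set(0, stack.get(2))                  # return value goes to slot 0
print(stack.pop_frame())                    # 7
```

Note how returning a value and passing arguments reuse the same slots; the caller reads the result from where it placed the first argument.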

You may have noticed a difference from real machines: the data stack doesn’t store the return address. How the return address is preserved is an architecture-specific detail that will be handled in phase 2.

6.2.2 Virtual Instructions

Our virtual machine also has no registers; variables live on the stack instead. Register allocation can be added in phase 2 to map variables to registers, although this is optional: we can simply load variables into fixed temporary registers for an arithmetic operation and store the result back.
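To see why fixed temporary registers are enough, here is a hypothetical sketch of how phase 2 might translate a register-free virtual “add” into x64 text assembly. The register names are real x64 registers, but the frame layout (slot i at `[rsp + 8*i]`) and the emitter function are assumptions for illustration, not the book’s actual encoding:

```python
# Hypothetical: map `dst = a + b` over frame slots to x64, using rax
# as a fixed temporary register.

def emit_add(dst, a, b):
    # dst, a, b are frame-slot indexes; we assume rsp points at the
    # frame base, so slot i lives at [rsp + 8*i].
    return [
        f'mov rax, [rsp + {8 * a}]',    # load operand a into the temporary
        f'add rax, [rsp + {8 * b}]',    # add operand b directly from memory
        f'mov [rsp + {8 * dst}], rax',  # store the result back to the frame
    ]

for line in emit_add(2, 0, 1):   # slot 2 = slot 0 + slot 1
    print(line)
```

Every virtual instruction can be expanded this way without any global register allocation; allocation only improves the generated code, it isn’t required for correctness.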

We’ll use the following notation for our bytecode.

6.2.3 Non-Local Variables

In many programming languages, functions can access variables besides local variables and arguments. In C, all functions are global and top-level, and variables can be either local or global. Global variables are defined at the top level and can be accessed by functions.

In other languages, such as Python or Go, functions can be nested, a nested function can read and update variables in the parent function, and the language even allows functions to be passed around. This feature is called a “closure”: a function that captures variables defined outside itself and can be passed around as a value.

There are some extra complications in allowing a nested function to access variables outside itself. If the captured variables live on the stack, the function must not be called after those variables have gone out of scope. GC-enabled languages solve this by lifting captured variables to the heap. Some languages also let you capture variables by value: the copy remains accessible forever, but updating it does not affect the original variable.
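The difference between the two capture strategies can be demonstrated in Python itself. Closures in Python capture by reference (CPython lifts such variables into heap-allocated cells, which is the heap-lifting mentioned above); capture by value can be simulated with a default argument, which is evaluated once when the function is defined:

```python
# Capture by reference vs. capture by value, illustrated in Python.

def counter():
    n = 0
    def by_reference():
        nonlocal n      # updates the variable in the parent function
        n += 1
        return n
    def by_value(n=n):  # n is copied once, when the function is defined
        return n
    return by_reference, by_value

inc, snapshot = counter()
print(inc(), inc())   # 1 2  -- the shared n is updated
print(snapshot())     # 0    -- the copy never sees the updates
```

Note that both inner functions outlive the call to `counter()`, which is exactly the situation that forces the captured variable off the stack.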

The compiler we’re going to implement does none of the above. Our language allows nested functions that update variables outside themselves, but has no support for passing functions around; a function is always executed within the scope in which it is defined. This is easier to implement than closures in GC-enabled languages, because we don’t need a runtime to support the program, and it is only slightly more complicated than C functions.

6.3 Static Typing

We didn’t worry about types when coding the naive interpreter, because we offloaded them to the host language: Python. Almost everything in Python is a heap-allocated “object” that has a type. The Python interpreter constantly checks the type of the objects in order to decide what to do with them. This is called “dynamic typing”.

This is in contrast to “static typing”, where the compiler knows the type of the values and knows what to do with them at compile time. For example, to execute the expression a + b in Python, the interpreter must check the types of both operands, and the execution takes different paths depending on their types: the interpreter may throw an exception, add two numbers, concatenate two strings, and so on.
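The dispatch a dynamic interpreter performs for `a + b` can be sketched roughly as follows (a simplification for illustration; CPython’s actual protocol goes through the operands’ type slots and is more involved):

```python
# A rough sketch of the runtime type dispatch behind `a + b`
# in a dynamically typed interpreter.

def binary_add(a, b):
    if isinstance(a, int) and isinstance(b, int):
        return a + b                      # integer addition
    if isinstance(a, str) and isinstance(b, str):
        return a + b                      # string concatenation
    if isinstance(a, list) and isinstance(b, list):
        return a + b                      # list concatenation
    raise TypeError(f'cannot add {type(a).__name__} and {type(b).__name__}')

print(binary_add(1, 2))        # 3
print(binary_add('a', 'b'))    # ab
```

A statically typed compiler resolves this branch once, at compile time, and emits only the one code path that matches the operands’ types.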

This is why Python isn’t compiled into native machine code the way C/C++ is: the compiler doesn’t have much work to do, as most of the work happens at runtime. However, there are just-in-time (JIT) compilers that collect type information at runtime and compile Python functions into machine code. This highlights the importance of typing.

We’ll use static typing in our language design; it’s easier to compile. It also helps in understanding how things work, because the CPU doesn’t deal in types or objects: it works mostly with integers and pointers!

Our language only supports a few types:

int         ;; int64, signed 64-bit integer
byte        ;; uint8, unsigned 8-bit integer
ptr int     ;; pointer to int
ptr byte    ;; pointer to byte
void        ;; only for functions that return nothing
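Since the compiler is written in Python, a type system this small can be represented with plain values. The representation below is one possible sketch (the names and the tuple encoding are assumptions for illustration, not the book’s actual code):

```python
# One possible in-compiler representation of the type system:
# scalar types as strings, pointer types as ('ptr', base) tuples.

INT, BYTE, VOID = 'int', 'byte', 'void'

def ptr(base):
    assert base in (INT, BYTE)   # only `ptr int` and `ptr byte` exist
    return ('ptr', base)

def sizeof(tp):
    # sizes on a 64-bit target: int and pointers are 8 bytes, byte is 1
    if tp == BYTE:
        return 1
    if tp == INT or (isinstance(tp, tuple) and tp[0] == 'ptr'):
        return 8
    raise ValueError(f'{tp} has no size')   # e.g. void

print(sizeof(INT), sizeof(BYTE), sizeof(ptr(INT)))   # 8 1 8
```

With so few types, type checking reduces to simple equality comparisons on these values, which is one payoff of the minimal design.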