One of my current projects is implementing a tiny compiler for a small language targeting x64, and it is a lot of fun.

01. Interpreter

There are a lot of approaches to tiny compilers and small languages that are fun by themselves. The reason for small languages is that they can be implemented with little work, and I can get something working in a short amount of time, leaving me more time for exploration and fun stuff.

The first step to my compiler is a simple interpreter, which is a Python script that executes an S-expression (sexpr). It has variables, control flows, and functions, which is sufficient to write programs in. The syntax looks like this:

(def foo (n)
    (if (le n 0)
        (then 0)
        (else (+ n (call foo (- n 1))))))

Interpreters like this are usually about 200 LoC or less in a scripting language, which isn’t much work, but it can be exciting on the first try.

There are good reasons for choosing sexpr as the syntax, one of them being ease of parsing. If I chose a C-like syntax, the parsing code alone would probably exceed 200 LoC, and I would be bored in the middle of coding.

02. Compiler

Coding a simple interpreter in a dynamic language like Python seems easy now, so let’s move on to the next challenge: a real compiler.

The line between compilers and interpreters can be blurry. Many languages are compiled into “bytecode”, which is then executed by “interpreting” the bytecode. And there are “real” compilers like GCC and Clang that generate machine code and native executables. Generating real machine code seems cooler than interpreting virtual bytecode, so this is the path I took.

My goal is to generate native x64 Linux ELF executables from a statically typed language from scratch. I don’t plan to use an assembler to generate the binary from the texture assembly. There are a few reasons for doing this:

  1. Generating machine code directly instead of using an assembler helps me understand x64 assembly better.

    Do you know that mov is actually multiple instructions in the same name? Or why you cannot say mov [rax], [rcx] in assembly? The assembly language is just a notation for machine code, so it makes sense to know the machine code as well, especially the instruction format.

  2. Generating native Linux executables equips me with knowledge about system calls, the ELF format, how the kernel starts a process, and so on. Some of these are important operating system concepts, which is more than just fun, unlike the simple interpreter.

The resulting binary is a “freestanding” executable without dependencies, all it can do is make Linux syscalls, which is very bare, but still sufficient to write programs in.

I also don’t bother with a GC or even a heap, so that I don’t need to implement a runtime to support the program. If a program can make syscalls, it can create the heap by itself via mmap, this is what I considered “sufficient to program in”.

03. Bytecode

I still use a bytecode design as an “intermediate representation”. The compiler is divided into 2 phases: generating bytecode, and mapping bytecode to machine code.

For example, the function bar is compiled into bytecode:

(def (bar int) ((n int))
    (if (le n 0) (then 0) (else (call bar (- n 1)))))
    const 0 1
    binop le 0 1 1
    jmpf 1 L1
    const 0 1
    jmp L0
    const 1 1
    binop - 0 1 1
    call 1 1 2 2
    ret 1

It’s a virtual instruction set that can later be mapped to real machine instructions.

The 2-phase design allows me to develop the compiler incrementally. I can ignore machine-specific things in the first phase and focus on the compiler, and deal with machine code only in the second phase. This also allows me to target other CPU architectures like ARM later.

04. The Project

The compiler is on GitHub and quite tiny at this point. It’s still a work in progress, as is the accompanying book “From Source Code To Machine Code”.