01. Introduction
1.1 Why Compiler?
Compilers are a fascinating topic. You use them every day. They turn your textual source code into some binary magic executed by computers. Ever wondered how they work? You can learn all of this by coding a mini-compiler by yourself, thus removing a mystery from your mind.
It is often said that learning C leads to a better understanding of computers. What’s even better than just learning C is learning compilers and assembly, because even after learning C, the machine and the compiler may still seem like black boxes to you. The book doesn’t require prior experience with C or assembly, though.
Compiler construction is a subject included in computer science and software engineering education. However, many of today’s (2023+) coders do not have a formal CS education. Basic things such as compilers, databases, operating systems, etc. are often seen as magical black boxes. That’s why I started the “Build Your Own X” book series. To learn and teach basic things by the “from scratch” approach, through succinct & condensed books.
It’s also considered a fun side project that just costs you some free time.
1.2 Components of Compilers
Let’s start with the source: some text files written in a programming language. To do anything useful with a source file (whether it’s a compiler or an IDE), the text must be “parsed” into some structures. The jargon for this process is “parser”.
The parser of a big language is often divided into 2 phases: first, the text is split into a stream of “tokens” (keywords, names, operators, think of the words in an article), then the parser assembles the tokens into structures (grammar). The first phase is often called a “tokenizer” or “scanner”, although this phase is not necessary.
The parsed structure is often called a “syntax tree”. One can simply simulate the program from the syntax tree. This is called an “interpreter”, and it’s a naive interpreter.
A less naive interpreter includes extra steps. Such as converting the syntax tree into a list of instructions called “bytecode”. The execution of a program works on the bytecode instead of the syntax tree. Why the extra step? You’ll find out later.
The component for producing the bytecode is also called a “compiler”, and the line between compiler and interpreter blurs from here. A compiler such as GCC ultimately outputs a binary executable — a container for machine code — which also contains a list of instructions, the real instructions that the CPU executes.
There are often multiple steps between the syntax tree and the final machine code and some “intermediate representations” (IR) for various purposes. Bytecode can also be an IR. However, what we describe in this section are components of typical compilers and interpreters; there are also many alternative compiler/interpreter designs and implementations, which you can have fun exploring them.
1.3 The Book
Like my other books, this book follows a step-by-step approach. You start with simple things, like a calculator, then a simple interpreter. Then you tackle the compiler bit-by-bit, compiling to bytecode, learning assembly, and finally generating native executables.
The final product is a small, statically typed language that resembles a subset of C, which compiles into x64 Linux ELF executables. And it has a machine-agnostic IR, so it’s not hard to target other platforms such as ARM, which is a challenge you can take up later.
The source code of the final compiler is available on GitHub:
https://github.com/byo-books/pretty_laughable_lang
The project is written from scratch in pure Python without any dependencies, although everything is language agnostic. You can find my other books on the official website: https://build-your-own.org