01. Complex systems are built from simple ideas

Complex software like databases, compilers, and browsers are treated like black boxes. You use them every day as a user, but you probably don’t understand them as a programmer, even though they are nothing but code. Why?

  • They have little in common with programmers’ daily task.
  • Their code bases are so large, so discouraging.

But that doesn’t mean you can’t learn these things. People have built small, toy versions of these things for learning purposes, such as C in 4 Functions, Crafting Interpreters, Web Browser Engineering. It turns out that if you minimize uninteresting details and focus on the essentials, you can recreate seemingly complex software and learn its core ideas.

For compilers, some of the core ideas include:

  • Parsing stuff with recursion.
  • Representing a program as a tree and simulate it.
  • Representing a program as a list of instructions with gotos.
  • Stack machine or register machine.

None of these ideas are complicated. So a compiler is a viable and useful exercise for programmers.

02. What are the core ideas of databases?

I’ve built a small database in 3000 lines from scratch in Go to learn the core ideas of databases. It’s not that complicated if you approach it in a certain way.

2.1 Power-loss atomicity

A database stores data on disk, but why not just use files? Why is the filesystem not a database? Because databases eliminate a common source of data corruption: partially written files caused by a crash or power loss.

Video games warn you not to turn off the power while saving, because a partially written save file can destroy all your progress. Building a database will teach you how to eliminate this concern. The solution is an append-only log with a checksum for each item.

  • Append-only means that writes will not destroy existing data.
  • Checksum means that partial writes will be detected and discarded, making appends power-loss atomic.
╔═══╤════╦═══╤════╦══╌╌╌══╦═══╤════╗
║sum│data║sum│data║more...║sum│data║
╚═══╧════╩═══╧════╩══╌╌╌══╩═══╧════╝
                          ╰───┬────╯
                    possible partial write

These are the first 2 ideas. But they are not enough to build a database.

2.2 Index data with data structures

Logs have uses for systems like Kafka, but you cannot just put everything in a log and forget about it. You have to find stuff in it, so you need data structures like B+trees, LSM-trees, hashtables. That’s where I started, I built a B+tree in 366 lines.

But if you just put a data structure on disk, you have the partial write problem. Is it possible to apply the log-based ideas to data structures? There are 2 ideas:

The first idea is to make the data structure append-only, like a log. There are ways to make a B+tree append-only, also called a copy-on-write B+tree. A copy-on-write tree does not overwrite existing nodes, it creates new nodes from leaf to root for the entire path. New nodes can be appended like a log.

    d           d         D*
   / \         / \       / \
  b   e  ──►  b   e  +  B*  e
 / \         / \       / \
a   c       a   c     a   C*
            original  updated

The other idea is to use both a log and a main data structure:

  1. Before updating the main data structure, store the intended updates in the log. Each log record contains a list of “write this data to that position”.
  2. Then apply the updates to the main data structure.
  3. When the databases start, it can always apply the last log record to fix possible partial writes in the main data structure, regardless of its status.

I chose the copy-on-write B+tree because it doesn’t require a log, which reduces the LoC. Moving the B+tree to disk involves interfacing with the filesystem, including the fsync(). In 601 lines, I have an append-only KV store on disk.

2.3 Recycle and reuse unused space

I have to handle the consequence of append-only because storage is not infinite. I simply added a free list to recycle unused B+tree nodes. In 731 lines, I have a practical KV store.

                     first_item
                         ↓
list_head ─► [ next |    xxxxx ]
                ↓
             [ next | xxxxxxxx ]
                ↓
list_tail ─► [ NULL | xxxx     ]
                         ↑
                     last_item

2.4 Relational DB on KV

A SQL query can be arbitrarily complex, but it boils down to 2 data structure operations: point query and range query. That’s why I started with B+tree data structures. Databases are built on data structures, not on some theories about relations.

I added a layer on top of KV to store records with primary keys and columns. Each record is encoded as KV pair. The LoC increased to 1107. But it’s still just a KV. So I added 2 more features:

  • Range query, 1294 lines.
  • Secondary index, 1438 lines.

I didn’t bother with SQL at this point because SQL is just a user interface. I built APIs for databases which is called a storage engine.

2.5 Concurrent control

I can ignore concurrency and make everything serialized. But using a copy-on-write B+tree means that readers get snapshot isolation for free, because updates do not destroy old versions. So readers will not be blocked by the writer.

I went a step further to allow write transactions to run concurrently, and then merge the updates at the end of the transaction. This requires detecting and rejecting conflicting updates. In 1702 lines, I have a transactional interface.

2.6 SQL-like query language

My user interface is just a set of APIs to operate the database. The next step is a query language. I chose a SQL-like syntax. I only implemented simple queries with a single point/range query, so there is no query planner.

Parsing SQL is not a database topic, but a compiler topic. It’s done by applying recursion in a specific way, which is hardly an idea once you’ve learned it.

A query language is not just about data, it can do computations, like SELECT a + b FROM c. So I need an interpreter.

Then I need lots of glue code between the query language and the database. The lines of code increased from 1702 to 2795, most of them uninteresting.

03. Database in 3000 lines, incrementally

This is my minimalist attempt to build a database from scratch. Each step adds an essential part while minimizing uninteresting details.

LoC Step
366 B+tree data structure.
601 Append-only KV.
731 Practical KV with a free list.
1107 Tables on KV.
1294 Range queries.
1438 Secondary indexes.
1461 Transactional interfaces.
1702 Concurrency control.
2795 SQL-like query language.

The lessons as a book

I’ve turned this into a book so that you can follow my steps. The first half of the book is free online and contains the most interesting parts of the storage engine, which is usable as a KV on its own.

Build Your Own Database

3000 lines is not much, but it’s an effective exercise. The first edition took me only 2 months of free time. The book is updated whenever I learn something new, so this is the second edition.