00. Introduction

Master fundamentals by building your own DB

What to learn?

Complex systems like databases are built on a few simple principles.

  1. Atomicity & durability. A DB is more than just files!
    • Persist data with fsync.
    • Crash recovery.
  2. KV store based on B-tree.
    • Disk-based data structures.
    • Space management with a free list.
  3. Relational DB on top of KV.
    • How tables and indexes are mapped to low-level B-trees.
    • SQL-like query language; parser & interpreter.
  4. Concurrency control for transactions.

Code a database in 3000 LoC, incrementally

It’s amazing that an interesting topic can be captured in 3000 LoC, which is orders of magnitude smaller than any real DB, while retaining the essentials. So anyone can try this in their spare time.

LoC Step
366 B+tree data structure.
601 Append-only KV.
731 Practical KV with a free list.
1107 Tables on KV.
1294 Range queries.
1438 Secondary indexes.
1461 Transactional interfaces.
1702 Concurrency control.
2795 SQL-like query language.

Learn by doing: principles instead of jargon

Feymann once said, “what I can’t build, I don’t understand”. So can you build a database by reading about databases? Unfortunately, the database literature is full of confusing, overloaded jargon with no consistent meaning. So it’s important to learn by doing.

While there is a lot to learn, the core of a DB concists of just a few principles.

Topic 1: durability and atomicity

What is a database?

In marketing, anything can be called a “database”, including spreadsheets, cache servers, etc. We’re only interested in what is traditionally called a database, namely MySQL, Postgres, SQLite, etc. What do they have in common?

  • They can persist data to disk.
  • They are disk-based, can work with larger-than-memory data.
  • They are implemented from scratch, not as wrappers over other databases.

There are only a few mature code bases that meet these criteria, and they are all quite large. For example, the SQLite3 source is measured in megabytes, compressed. A real DB is not the best place to learn the principles. Instead, we’ll code a DB from scratch in 3000 lines.

A DB is more than just a data format

Persisting data to disk (durability) is the #1 criterion of a traditional DB. This is why mobile phones use SQLite, a file-based DB. But why not just use files if the database is just a file anyway? Because “persisting” has some concrete requirements: At some point, the data added to the DB is guaranteed to persist even if the machine crashes (by power loss or whatever).

There are 2 requirements: to survive crashes; to confirm the state of durability. Considering that most databases run on filesystems, a filesystem must also meet these requirements. So filesystems are somewhat similar to databases. However, the big difference is that typical filesystem use (writing to files) has no durability guarantee, resulting in data loss or corruption after a power loss, while typical database use guarantees durability.

Making files durable means implementing half a database, which we’ll learn.

The `fsync` syscall

fsync is the filesystem operation that makes all previously written data durable. It’s used to request and confirm durability; a DB will only return success to the client after fsync. But what if the DB crashes before or during fsync? The most recent data may be lost, but it should not be half lost, half persisted. This all or nothing is called atomicity.

A database must recover from a crash to a reasonable state, which is harder than using fsync.

Topic 2: indexing data structures

Control latency and cost with data structures

A DB turns a query into a result without the user knowing how. But the result is not the only concern, latency and cost (memory, IO, computation) are also relevant, hence the distinction between analytical (OLAP) and transactional (OLTP).

  • OLAP can involve large amounts of data, with aggregations and joins. Indexing can be limited or non-existent. They are mostly column-based data stores. Used to execute ad hoc, offline, “analytical” queries that are not sensitive to latency.
  • OLTP touches small amounts of data using indexes. Low latency and cost. Based on either B+tree or LSM-tree data structures. Used to execute preprogrammed, user-facing queries that require immediate results.

The word “transactional” is not about DB transactions, it’s just jargon.

The 3 traditional DBs (MySQL, Postgres, SQLite) are all OLTP, OLTP DBs can also do some OLAP if the data is small. But it’s more effective to use separate DBs for different use cases. So some of the newer DBs are OLAP-only, such as ClickHouse, DuckDB, and all “big data” DBs.

OLAP and OLTP are different paths of database, we’ll pick OLTP and ignore OLAP, so indexing data structures is critical, while joins and aggregations are irrelevant.

In-memory data structures vs. on-disk data structures

There are extra challenges when putting an indexing data structure on disk. The 1st is which data structure to use. RAM and disk have different characteristics, especially latency. Even with the best SSDs, latency is 3 orders of magnitude greater than RAM. Studying these characteristics leads to just 2 data structures: B+tree and LSM-tree. So databases are much limited in data structures.

Medium Latency
RAM 50~100 ns
SSD 50000~100000 ns
HDD 5000000~10000000 ns

The 2nd challenge is persisting data to disk. A database is just a data structure on disk. So data structure is a prerequisite. (See my book “Build Your Own Redis” to practice data structures). You can learn data structures from textbooks, but the missing part is putting them on disk and updating them incrementally while maintaining atomicity and durability.

The 3rd challenge is concurrency. For in-memory data, it’s usually OK to serialize the data structure access with a single mutex. For disk-based data, the IO latency makes this impractical and requires more advanced concurrency control.

Topic 3: Relational DB on KV

Two layers of DB interfaces

SQL is almost a synonym for database. But SQL is just a user interface, it’s not fundamental to a DB. What’s important is the functionalities underneath.

Another much simpler interface is key-value (KV). You can get, set, and delete a single key, and most importantly, list a range of keys in sorted order. KV is simpler than SQL because it’s one layer lower. Relational DBs are built on top of KV-like interfaces called storage engines.

type KV interface {
   // get, set, del
   Get(key []byte) (val []byte, ok bool)
   Set(key []byte, val []byte)
   Del(key []byte)
   // range query
   FindGreaterThan(key []byte) Iterator
   // ...
}
type Iterator interface {
   HasNext() bool
   Next() (key []byte, val []byte)
}

The first half of the book is a KV on which the second half is based.

Query languages: parsers and interpreters

The last step is easy, despite the larger LoC. Both the parser and the interpreter are coded with nothing but recursion! The lesson can be applied to almost any computer language, or to creating your own programming language or DSL (See my book “From Source Code To Machine Code” for more challenges).

Questions | getting help

You can email the author at:

Build Your Own X book series

X includes Redis, web server and compiler. Read the web version on the website.

https://build-your-own.org