00. Introduction
Master fundamentals by building your own DB
What to learn?
Complex systems like databases are built on a few simple principles.
- Atomicity & durability. A DB is more than just
files!
- Persist data with
fsync
. - Crash recovery.
- Persist data with
- KV store based on B-tree.
- Disk-based data structures.
- Space management with a free list.
- Relational DB on top of KV.
- How tables and indexes are mapped to low-level B-trees.
- SQL-like query language; parser & interpreter.
- Concurrency control for transactions.
Code a database in 3000 LoC, incrementally
It’s amazing that an interesting topic can be captured in 3000 LoC, which is orders of magnitude smaller than any real DB, while retaining the essentials. So anyone can try this in their spare time.
LoC | Step |
---|---|
366 | B+tree data structure. |
601 | Append-only KV. |
731 | Practical KV with a free list. |
1107 | Tables on KV. |
1294 | Range queries. |
1438 | Secondary indexes. |
1461 | Transactional interfaces. |
1702 | Concurrency control. |
2795 | SQL-like query language. |
Learn by doing: principles instead of jargon
Feymann once said, “what I can’t build, I don’t understand”. So can you build a database by reading about databases? Unfortunately, the database literature is full of confusing, overloaded jargon with no consistent meaning. So it’s important to learn by doing.
While there is a lot to learn, the core of a DB concists of just a few principles.
Topic 1: durability and atomicity
What is a database?
In marketing, anything can be called a “database”, including spreadsheets, cache servers, etc. We’re only interested in what is traditionally called a database, namely MySQL, Postgres, SQLite, etc. What do they have in common?
- They can persist data to disk.
- They are disk-based, can work with larger-than-memory data.
- They are implemented from scratch, not as wrappers over other databases.
There are only a few mature code bases that meet these criteria, and they are all quite large. For example, the SQLite3 source is measured in megabytes, compressed. A real DB is not the best place to learn the principles. Instead, we’ll code a DB from scratch in 3000 lines.
A DB is more than just a data format
Persisting data to disk (durability) is the #1 criterion of a traditional DB. This is why mobile phones use SQLite, a file-based DB. But why not just use files if the database is just a file anyway? Because “persisting” has some concrete requirements: At some point, the data added to the DB is guaranteed to persist even if the machine crashes (by power loss or whatever).
There are 2 requirements: to survive crashes; to confirm the state of durability. Considering that most databases run on filesystems, a filesystem must also meet these requirements. So filesystems are somewhat similar to databases. However, the big difference is that typical filesystem use (writing to files) has no durability guarantee, resulting in data loss or corruption after a power loss, while typical database use guarantees durability.
Making files durable means implementing half a database, which we’ll learn.
The `fsync` syscall
fsync
is the filesystem operation that makes all
previously written data durable. It’s used to request and
confirm durability; a DB will only return success to the client
after fsync
. But what if the DB crashes before or during
fsync
? The most recent data may be lost, but it should not
be half lost, half persisted. This all or nothing is called
atomicity.
A database must recover from a crash to a reasonable state, which is
harder than using fsync
.
Topic 2: indexing data structures
Control latency and cost with data structures
A DB turns a query into a result without the user knowing how. But the result is not the only concern, latency and cost (memory, IO, computation) are also relevant, hence the distinction between analytical (OLAP) and transactional (OLTP).
- OLAP can involve large amounts of data, with aggregations and joins. Indexing can be limited or non-existent. They are mostly column-based data stores. Used to execute ad hoc, offline, “analytical” queries that are not sensitive to latency.
- OLTP touches small amounts of data using indexes. Low latency and cost. Based on either B+tree or LSM-tree data structures. Used to execute preprogrammed, user-facing queries that require immediate results.
The word “transactional” is not about DB transactions, it’s just jargon.
The 3 traditional DBs (MySQL, Postgres, SQLite) are all OLTP, OLTP DBs can also do some OLAP if the data is small. But it’s more effective to use separate DBs for different use cases. So some of the newer DBs are OLAP-only, such as ClickHouse, DuckDB, and all “big data” DBs.
OLAP and OLTP are different paths of database, we’ll pick OLTP and ignore OLAP, so indexing data structures is critical, while joins and aggregations are irrelevant.
In-memory data structures vs. on-disk data structures
There are extra challenges when putting an indexing data structure on disk. The 1st is which data structure to use. RAM and disk have different characteristics, especially latency. Even with the best SSDs, latency is 3 orders of magnitude greater than RAM. Studying these characteristics leads to just 2 data structures: B+tree and LSM-tree. So databases are much limited in data structures.
Medium | Latency |
---|---|
RAM | 50~100 ns |
SSD | 50000~100000 ns |
HDD | 5000000~10000000 ns |
The 2nd challenge is persisting data to disk. A database is just a data structure on disk. So data structure is a prerequisite. (See my book “Build Your Own Redis” to practice data structures). You can learn data structures from textbooks, but the missing part is putting them on disk and updating them incrementally while maintaining atomicity and durability.
The 3rd challenge is concurrency. For in-memory data, it’s usually OK to serialize the data structure access with a single mutex. For disk-based data, the IO latency makes this impractical and requires more advanced concurrency control.
Topic 3: Relational DB on KV
Two layers of DB interfaces
SQL is almost a synonym for database. But SQL is just a user interface, it’s not fundamental to a DB. What’s important is the functionalities underneath.
Another much simpler interface is key-value (KV). You can get, set, and delete a single key, and most importantly, list a range of keys in sorted order. KV is simpler than SQL because it’s one layer lower. Relational DBs are built on top of KV-like interfaces called storage engines.
type KV interface {
// get, set, del
(key []byte) (val []byte, ok bool)
Get(key []byte, val []byte)
Set(key []byte)
Del// range query
(key []byte) Iterator
FindGreaterThan// ...
}
type Iterator interface {
() bool
HasNext() (key []byte, val []byte)
Next}
The first half of the book is a KV on which the second half is based.
Query languages: parsers and interpreters
The last step is easy, despite the larger LoC. Both the parser and the interpreter are coded with nothing but recursion! The lesson can be applied to almost any computer language, or to creating your own programming language or DSL (See my book “From Source Code To Machine Code” for more challenges).
Questions | getting help
You can email the author at: [email protected]
Build Your Own X book series
X includes Redis, web server and compiler. Read the web version on the website.