Build Your Own Database From Scratch EBook·Paperback
⟵ prev Contents next ⟶

🔥 My other Book: Build Your Own Redis

00. Introduction

0.1 What is This Book About?

Databases are not black boxes. Understand them by building your own from scratch!

This book contains a walk-through of a minimal persistent database implementation. The implementation is incremental. We start with a B-Tree, then a simple KV store, and eventually end with a mini relational DB.

The book focuses on important ideas rather than implementation details. Real-world databases are complex and harder to grasp. We can learn faster and easier from a stripped-down version of a database. And the “from scratch” method forces you to learn deeper.

Although the book is short and the implementation is minimal, it aims to cover three important topics:

  1. Persistence. How not to lose or corrupt your data. Recovering from a crash.
  2. Indexing. Efficiently querying and manipulating your data. (B-tree).
  3. Concurrency. How to handle multiple (large number of) clients. And transactions.

If you have only vague ideas like “databases store my data” or “indexes are fast”, this book is for you.

0.2 How to Use This Book?

This book takes a step-by-step approach. Each step builds on the previous one and adds a new concept. The book uses Golang for sample code, but the topics are language agnostic. Readers are advised to code their own version of a database rather than just read the text.

The draft chapters can be accessed at the official website:

https://build-your-own.org

0.3 Topic One: Persistence

Why do we need databases? Why not dump the data directly into files? Our first topic is persistence.

Let’s say your process crashed middle-way while writing to a file, or you lost power, what’s the state of the file?

Any outcome is possible. Your data is not guaranteed to persist on a disk when you simply write to files. This is a concern of databases. And a database will recover to a usable state when started after an unexpected shutdown.

Can we achieve persistence without using a database? There is a way:

  1. Write the whole updated dataset to a new file.
  2. Call fsync on the new file.
  3. Overwrite the old file by renaming the new file to the old file, which is guaranteed by the file systems to be atomic.

This is only acceptable when the dataset is tiny. A database like SQLite can do incremental updates.

0.4 Topic Two: Indexing

There are two distinct types of database queries: analytical (OLAP) and transactional (OLTP).

Note that the word “transactional” is not related to database transactions as you may know. Computer jargon is often overloaded with different meanings. This book focuses only on OLTP techniques.

While many applications are not real-time systems, most user-facing software should respond in a reasonable (small) amount of time, using a reasonable amount of resources (memory, IO). This falls into the OLTP category. How do we find the data quickly (in O(log(n))), even if the dataset is large? This is why we need indexes.

If we ignore the persistence aspect and assume that the dataset fits in memory, finding the data quickly is the problem of data structures. Data structures that persist on a disk to look up data are called “indexes” in database systems. And database indexes can be larger than memory. There is a saying: if your problem fits in memory, it’s an easy problem.

Common data structures for indexing include B-Trees and LSM-Trees.

0.5 Topic Three: Concurrency

Modern applications do not just do everything sequentially, nor do databases. There are different levels of concurrency:

Even the file-based SQLite supports some concurrency. But concurrency is easier within a process, which is why most database systems can only be accessed via a “server”.

With the addition of concurrency, applications often need to do things atomically, such as the read-modify-write operation. This adds a new concept to databases: transactions.


⟵ prev Contents next ⟶