Build Your Own Database From Scratch PDF·EPUB·Paperback
⟵ prev Contents next ⟶

🆕 This chapter is part of the WIP book:
Build Your Own Database From Scratch

Subscribe to get notified of new chapters and the book's release.
🔥 My other Book: Build Your Own Redis

02. Indexing

2.1 Key-Value Storage and Relational DB

Although a relational DB supports many types of queries, almost all queries can be broken down into three types of disk operations:

  1. Scan the whole data set. (No index is used).
  2. Point query: Query the index by a specific key.
  3. Range query: Query the index by a range. (The index is sorted).

Database indexes are mostly about range queries and point queries, and it’s easy to see that a range query is just a superset of point queries. If we extract the functionality of the database indexes, it is trivial to make a key-value storage. But the point is that a database system can be built on the top of a KV store.

We’ll build a KV store before attempting the relational DB, but let’s explore our options first.

2.2 Hashtables

Hashtables are the first to be ruled out when designing a general-purpose KV store. The main reason is sorting — many real-world applications do require sorting and ordering.

However, it is possible to use hashtables in specialized applications. Another headache of using hashtables is the resizing operation. Naive resizing is O(n) and causes a sudden increase in disk space and IO. It’s possible to resize a hashtable incrementally, but this adds complexity.

2.3 B-Trees

Balanced binary trees can be queried and updated in O(log(n)) and can be range-queried. A B-tree is roughly a balanced n-ary tree. Why use an n-ary tree instead of a binary tree? There are several reasons:

  1. Less space overhead.

    Every leaf node in a binary tree is reached via a pointer from a parent node, and the parent node may also have a parent. On average, each leaf node requires 1~2 pointers.

    In contrast to B-trees, where multiple data in a leaf node share one parent. And n-ary trees also are shorter. Less space is wasted on pointers.

  2. Faster in memory.

    Due to modern CPU memory caching and other factors, B-trees can be faster than binary trees even if their big-O complexity is the same.

  3. Less disk IO.

    • B-Trees are shorter, which means fewer disk seeks.
    • The minimum size of disk IOs is usually the size of the memory page (probably 4K). The operating system will fill the whole 4K page even if you read a smaller size. It’s optimal if we make use of all the information in a 4K page (by choosing the node size of at least one page).

We’ll use B-trees in this book. But B-trees are not the only option.

2.4 LSM-Trees

Log-structured merge-tree. Here is a high-level overview of LSM-Tree operations.

How to query:

  1. An LSM-Tree contains multiple levels of data.
  2. Each level is sorted and split into multiple files.
  3. A point query starts at the top level, if the key is not found, the search continues to the next level.
  4. A range query merges the results from all levels, higher levels have more priority when merging.

How to update:

  1. When updating a key, the key is inserted into a file from the top level first.
  2. If the size of a file exceeds a threshold, merge it with the next level.
  3. The file size threshold increases exponentially with each level, which means that the amount of data also increases exponentially.

Let’s analyze how this works. For queries:

  1. Each level is sorted, keys can be found via binary search, and range queries are just sequential file IO. It’s efficient.

For updates:

  1. The top-level file size is small, so inserting into the top level requires only a small amount of IO.
  2. Data is eventially merged to a low level. Merging is sequential IO, which is an advantage.
  3. Higher levels trigger merging more often, but the merge is also smaller.
  4. When merging a file into a lower level, any lower files whose range intersects are replaced by the merged results (which can be multiple files). We can see why levels are split into multiple files — to reduce the size of merge.
  5. Merging can be done in the background. However, low-level merging can suddenly cause high IO usage, which can degrade system performance.

Readers can try using LSM-trees instead of B-trees while following later chapters. And compare the cons and pros between B-trees and LSM-trees.


⟵ prev Contents next ⟶