02. Indexing
2.1 Key-Value Store and Relational DB
Although a relational DB supports many types of queries, almost all queries can be broken down into three types of disk operations:
- Scan the whole data set. (No index is used).
- Point query: Query the index by a specific key.
- Range query: Query the index by a range. (The index is sorted).
Database indexes are mostly about range queries and point queries, and it’s easy to see that a point query is just a special case of a range query. If we extract the indexing functionality from a database, we get a key-value store. But the point is that a database system can be built on top of a KV store.
We’ll build a KV store before attempting the relational DB, but let’s explore our options first.
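To make the required interface concrete, here is a minimal in-memory sketch (illustrative Python, not the store we’ll actually build): keys are kept sorted so that both point queries and range queries are supported, which is exactly what an index must provide.

```python
import bisect

class SortedKV:
    """A toy in-memory KV store (hypothetical API, for illustration only)."""

    def __init__(self):
        self.keys = []  # sorted list of keys
        self.vals = []  # values, parallel to self.keys

    def set(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = val        # update in place
        else:
            self.keys.insert(i, key)  # keep keys sorted
            self.vals.insert(i, val)

    def get(self, key):
        """Point query: look up one key."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return None

    def scan(self, lo, hi):
        """Range query: all pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.vals[i:j]))
```

A whole-dataset scan is just `scan` over the full key range; a hashtable could implement `get` but not an efficient `scan`, which is why sorting matters.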
2.2 Hashtables
Hashtables are the first to be ruled out when designing a general-purpose KV store. The main reason is sorting: many real-world applications do require sorting and ordering. However, it is possible to use hashtables in specialized applications.
A headache of using hashtables is the resizing operation. Naive resizing is O(n) and causes a sudden increase in disk space and IO. It’s possible to resize a hashtable incrementally, but this adds complexity. Another problem with hashtables is deciding when to resize down; hashtables generally don’t shrink automatically, avoiding frequent and costly resizing at the cost of wasted disk space.
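To see what incremental resizing involves, here is a toy sketch of one such scheme (progressive rehashing with two tables; the class name, constants, and layout are made up for illustration):

```python
class IncrementalHash:
    """Toy open-addressing hashtable with incremental resizing.

    Instead of rehashing all n entries at once (an O(n) pause), a resize
    keeps the old table around and moves a few entries on every operation.
    """
    MIGRATE_PER_OP = 2  # entries moved per operation during a resize

    def __init__(self, cap=4):
        self.new = [None] * cap  # table being filled
        self.old = None          # table being drained (None = no resize)
        self.count = 0           # live entries, used as the resize trigger

    def _slot(self, table, key):
        # Linear probing: stop at the key itself or at an empty slot.
        i = hash(key) % len(table)
        while table[i] is not None and table[i][0] != key:
            i = (i + 1) % len(table)
        return i

    def _migrate_some(self):
        moved = 0
        while self.old is not None and moved < self.MIGRATE_PER_OP:
            j = next((i for i, e in enumerate(self.old) if e), None)
            if j is None:
                self.old = None  # old table fully drained
                break
            k, v = self.old[j]
            self.old[j] = None
            i = self._slot(self.new, k)
            if self.new[i] is None:  # don't clobber a newer write
                self.new[i] = (k, v)
            moved += 1

    def set(self, key, val):
        self._migrate_some()
        if self.old is None and self.count >= len(self.new) * 3 // 4:
            # Start a resize: swap in a table twice as big.
            self.old, self.new = self.new, [None] * (len(self.new) * 2)
        i = self._slot(self.new, key)
        if self.new[i] is None:
            self.count += 1
        self.new[i] = (key, val)

    def get(self, key):
        self._migrate_some()
        for table in (self.new, self.old):
            if table is not None:
                i = self._slot(table, key)
                if table[i] is not None:
                    return table[i][1]
        return None
```

The added complexity is visible: every operation must check two tables and do a bit of migration work, which is the price of avoiding the O(n) pause.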
2.3 B-Trees
Balanced binary trees can be queried and updated in O(log(n)) and can be range-queried. A B-tree is roughly a balanced n-ary tree. Why use an n-ary tree instead of a binary tree? There are several reasons:
Less space overhead.
Every leaf node in a binary tree is reached via a pointer from a parent node, and the parent node may also have a parent. On average, each leaf node requires 1~2 pointers.
This is in contrast to B-trees, where multiple keys in a leaf node share one parent. And n-ary trees are also shorter. Less space is wasted on pointers.
Less disk IO.
- B-trees are shorter, which means fewer disk seeks.
- The minimum size of a disk IO is usually the size of a memory page (typically 4K). The operating system will read the whole 4K page even if you request less. It’s optimal to make use of all the information in a 4K page (by choosing a node size of at least one page).
Faster in memory.
Even when the data is cached in memory and disk IO is out of the equation, n-ary trees can be faster than binary trees due to modern CPU memory caching and other factors, even though their big-O complexity is the same.
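The height argument above can be made concrete with some back-of-the-envelope arithmetic. The sizes below are assumptions for illustration (16-byte keys, 8-byte child pointers, one node per 4K page), not fixed properties of B-trees:

```python
import math

PAGE = 4096       # one node per 4K page (assumed)
KEY, PTR = 16, 8  # assumed key and child-pointer sizes in bytes

# How many children fit in one page-sized node?
branching = PAGE // (KEY + PTR)  # 170

def height(n, fanout):
    """Levels needed to index n keys with the given fanout."""
    return math.ceil(math.log(n, fanout))

n = 100_000_000
# A binary tree needs ~27 levels; the page-sized B-tree node needs ~4.
print(height(n, 2), height(n, branching))  # prints: 27 4
```

Fewer levels means fewer disk seeks per lookup, which is the whole point of the n-ary layout.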
We’ll use B-trees in this book. But B-trees are not the only option.
2.4 LSM-Trees
Log-structured merge-tree. Here is a high-level overview of how an LSM-tree works.
Let’s start with 2 files: a small file holding the recent updates and a big file holding the rest of the data. Updates go to the small file first, but it cannot grow forever; at some point it has to be merged with the big file to create a new, bigger file. Compared to the dumb approach of rewriting the whole database on every update, this is an improvement because it reduces writes.
writes => | new updates | => | accumulated data |
              file 1               file 2
And how do you query the database? You have to query both files, and the newer (smaller) file has higher priority. For point queries, you can query the small file first, and query the big file if it misses. For range queries, both files are queried simultaneously and the results are merged. Deletion is usually done by putting a mark in the small file to indicate that a key has been deleted. The actual deletion takes place when the files are merged.
Both files contain indexing data structures for queries. The advantage is that you can use simpler data structures because the files aren’t updated in place, since the update operations are replaced by the merge operation. Each file can simply be a list of sorted KVs indexed by an array of pointers — easier and less error-prone to implement than B-trees.
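A sketch of how querying the 2 files could look, assuming each “file” is simply a sorted Python list of (key, value) pairs and a None value is a deletion mark (a linear scan stands in for the per-file index structure):

```python
TOMBSTONE = None  # marks a key deleted in the newer file

def point_query(small, big, key):
    """Check the recent (small) file first; fall back to the big file."""
    for kvs in (small, big):
        for k, v in kvs:
            if k == key:
                return None if v is TOMBSTONE else v
    return None

def range_query(small, big, lo, hi):
    """Merge the two sorted files; on equal keys the small file wins."""
    out, i, j = [], 0, 0
    while i < len(small) or j < len(big):
        if j >= len(big) or (i < len(small) and small[i][0] <= big[j][0]):
            k, v = small[i]
            if j < len(big) and big[j][0] == k:
                j += 1  # skip the shadowed older entry
            i += 1
        else:
            k, v = big[j]
            j += 1
        if lo <= k < hi and v is not TOMBSTONE:
            out.append((k, v))
    return out
```

On equal keys the small file shadows the big one, and tombstones hide deleted keys until a merge physically removes them.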
Having 2 files is still not optimal regarding the amount of writes when merging files since the data in the big file is written to disk over and over again. Luckily, this idea can be generalized to more than 2 files, and each “file” is usually called a “level”. Data goes into the 1st level first, and when the 1st level gets too big, the 1st level is merged into the 2nd level, and the 2nd level is now bigger. Each level is merged into the next bigger and older level when it gets too big.
|level 1|
   ||
   \/
|------level 2------|
         ||
         \/
|-----------------level 3-----------------|
Why does this scheme write less than the 2-level scheme? Levels grow exponentially, so the multiplier of excess disk writes (called write amplification) is O(log(n)) with respect to the data size. For example, think of a list of files with sizes increasing exponentially by powers of two: when the 1st file doubles in size, it is now the same size as the 2nd file, so it is merged into the 2nd file; the merged file is in turn merged into the 3rd file when it gets too big, and so on.
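This doubling argument can be checked with a toy simulation. Assume unit-sized updates and levels that hold either nothing or exactly 2^i units, merging upward like a binary counter (a simplification made up for illustration; real levels are tunable):

```python
def simulate(n):
    """Insert n unit-sized updates into power-of-two levels and
    return the total units written to disk, counting every merge."""
    levels = {}  # level index -> units stored (0 or 2**i)
    written = 0
    for _ in range(n):
        carry, i = 1, 0           # one unit of new data enters level 0
        while levels.get(i, 0):
            carry += levels.pop(i)  # merge this full level into the carry
            i += 1
        levels[i] = carry
        written += carry          # the merged run is written out once
    return written
```

For n = 1024 the simulation writes 6144 units in total, a write amplification of 6, matching log2(n)/2 + 1 for this toy model; doubling n again and again grows the amplification only logarithmically.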
Real databases don’t use the power of two ratio between levels because it creates too many levels, which hurts query performance. The size ratio between levels is usually tunable to allow tradeoffs between write amplification and query performance.
Also, real databases usually implement levels as multiple sorted and non-overlapping files instead of one big sorted file. Merges are performed in small parts, which allows for smoother operation. This also reduces the disk space requirements; otherwise, merging the last level would double the disk space usage.
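To illustrate the partial-merge idea, here is a small hypothetical helper, with key ranges written as (lo, hi) pairs: it selects which next-level files overlap a run being pushed down, so everything else on that level can be left untouched.

```python
def pick_overlapping(run, files):
    """Given a run (lo, hi) from level i, and level i+1's files as
    sorted, non-overlapping (lo, hi) key ranges, return the files
    that must participate in the merge."""
    lo, hi = run
    return [f for f in files if not (f[1] < lo or hi < f[0])]
```

Merging the run with only the overlapping files rewrites a small part of the level at a time, instead of the whole level at once.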
Readers can try using LSM-trees instead of B-trees after finishing this book, and compare the pros and cons of B-trees and LSM-trees.