Build Your Own Database From Scratch Subscribe to get notified of new chapters and the book's release.
03. B-Tree: The Ideas
3.1 The Intuitions of the B-Tree and BST
Our first intuition comes from balanced binary trees (BST). Binary trees are popular data structures for sorted data. Keeping a tree in good shape after inserting or removing keys is what “balancing” means. As stated in a previous chapter, n-ary trees should be used instead of binary trees to make use of the “page” (minimum unit of IO).
B-trees can be generalized from BSTs. Each node of a B-tree contains multiple keys and multiple links to its children. When looking up a key in a node, all keys are used to decide the next child node.
[1, 4, 9]
/ | \
v v v
[1, 2, 3] [4, 6] [9, 11, 12]
The balancing of a B-tree is different from a BST, popular BSTs like RB trees or AVL trees are balanced on the height of sub-trees (by rotation). While the height of all B-tree leaf nodes is the same, a B-tree is balanced by the size of the nodes:
- If a node is too large to fit on one page, it is split into two nodes. This will increase the size of the parent node and possibly increase the height of the tree if the root node was split.
- If a node is too small, try merging it with a sibling.
If you are familiar with RB trees, you may also be aware of 2-3 trees that can be easily generalized as B-trees.
3.2 B-tree and Nested Arrays
Even if you are not familiar with the 2-3 tree, you can still gain some intuition using nested arrays.
Let’s start with a sorted array. Queries can be done by bisection.
But, updating the array is O(n)
which we need to tackle.
Updating a big array is bad so we split it into smaller arrays. Let’s
say we split the array into sqrt(n)
parts, and each part
contains sqrt(n)
keys on average.
[[1,2,3], [4,6], [9,11,12]]
To query a key, we must first determine which part contains the key,
bisecting on the sqrt(n)
parts is O(log(n))
.
After that, bisecting the key on the part is again
O(log(n))
— it’s no worse than before. And updating is
improved to O(sqrt(n))
.
This is a 2-level sorted nested array, what if we add more levels? This is another intuition of the B-tree.
3.3 B-Tree Operations
Querying a B-tree is the same as querying a BST.
Updating a B-tree is more complicated. From now on we’ll use a variant of B-tree called “B+ tree”, the B+ tree stores values only in leaf nodes, and internal nodes contain only keys.
Key insertion starts at a leaf. A leaf is just a sorted list of keys. Inserting the key into the leaf is trivial. But, the insertion may cause the node size to exceed the page size. In this case, we need to split the leaf node into 2 nodes, each containing half of the keys, so that both leaf nodes fit into one page.
An internal node consists of:
- A list of pointers to its children.
- A list of keys paired with the pointer list. Each of the keys is the first key of the corresponding child.
After splitting a leaf node into 2 nodes. The parent node replaces the old pointer and key with the new pointers and keys. And the size of the node increases, which may trigger further splitting.
parent parent
/ | \ => / | | \
L1 L2 L6 L1 L3 L4 L6
After the root node is split, a new root node is added. This is how a B-tree grows.
new_root
/ \
root N1 N2
/ | \ => / | | \
L1 L2 L6 L1 L3 L4 L6
Key deletion is the opposite of insertion. A node is never empty because a small node will be merged into either its left sibling or its right sibling.
And when a non-leaf root is reduced to a single key, the root can be replaced by its sole child. This is how a B-tree shrinks.
3.4 Immutable Data Structures
Immutable means never updating data in place. Some similar jargons are “append-only”, “copy-on-write”, and “persistent data structures” (the word “persistent” has nothing to do with the “persistence” we talked about ealier).
For example, when inserting a key into a leaf node, do not modify the node in place, instead, create a new node with all the keys from the to-be-updated node and the new key. Now the parent node must also be updated to point to the new node.
Likewise, the parent node is duplicated with the new pointer. Until we reach the root node, the entire path has been duplicated. This effectively creates a new version of the tree that coexists with the old version.
There are several advantages of immutable data structures:
- Avoid data corruption. Immutable data structures do not modify the existing data, they merely add new data, so the old version of data remains intact even if the update is interrupted.
- Easy concurrency. Readers can operate concurrently with writers since readers can work on older versions unaffected.
Persistence and concurrency are covered in later chapters. For now, we’ll code an immutable B+ tree first.