03. B-Tree & Crash Recovery

3.1 B-tree as a balanced n-ary tree

Height-balanced tree

Many practical binary trees, such as the AVL tree or the RB tree, are called height-balanced trees, meaning that the height of the tree (from root to leaves) is limited to O(log N), so a lookup is O(log N).

A B-tree is also height-balanced; the height is the same for all leaf nodes.

Generalizing binary trees

n-ary trees can be generalized from binary trees (and vice versa). An example is the 2-3-4 tree, which is a B-tree where each node can have either 2, 3, or 4 children. The 2-3-4 tree is equivalent to the RB tree. However, we won’t go into the details because they are not necessary for understanding B-trees.

Visualizing a 2-level B+tree of a sorted sequence [1, 2, 3, 4, 6, 9, 11, 12].

     [1,   4,   9]
     /     |     \
    v      v      v
[1, 2, 3] [4, 6] [9, 11, 12]

In a B+tree, only leaf nodes contain value, keys are duplicated in internal nodes to indicate the key range of the subtree. In this example, node [1, 4, 9] indicates that its 3 subtrees are within intervals [1, 4), [4, 9), and [9, +∞). However, only 2 keys are needed for 3 intervals, so the first key (1) can be omitted and the 3 intervals become (-∞, 4), [4, 9), and (9, +∞).

3.2 B-tree as nest arrays

Two-level nested arrays

Without knowing the details of the RB tree or the 2-3-4 tree, the B-tree can be understood from sorted arrays.

The problem with sorted arrays is the O(N) update. If we split the array into m smaller non-overlapping ones, the update becomes O(N/m). But we have to find out which small array to update/query first. So we need another sorted array of references to smaller arrays, that’s the internal nodes in a B+tree.

[[1,2,3], [4,6], [9,11,12]]

The lookup cost is still O(log N) with 2 binary searches. If we choose m as N, update become O(√N), that’s as good as 2-level sorted arrays can be.

Multiple levels of nested arrays

O(√N) is unacceptable for databases, but if we add more levels by splitting arrays even more, the cost is further reduced.

Let’s say we keep splitting levels until all arrays are no larger than a constant s, we end up with log (N/s) levels, and the lookup cost is O(log (N/s) + log (s)), which is still O(log N).

For insertion and deletion, after finding the leaf node, updating the leaf node is constant O(s) most of the time. The remaining problem is to maintain the invariants that nodes are not larger than s and are not empty.

3.3 Maintaining a B+tree

3 invariants to preserve when updating a B+tree:

  1. Same height for all leaf nodes.
  2. Node size is bounded by a constant.
  3. Node is not empty.

Growing a B-tree by splitting nodes

The 2nd invariant is violated by inserting into a leaf node, which is restored by splitting the node into smaller ones.

    parent              parent
   /  |  \     =>      /  | |  \
L1   L2   L6         L1  L3 L4  L6
     *                   *  *

After splitting a leaf node, its parent node gets a new branch, which may also exceed the size limit, so it may need to be split as well. Node splitting can propagate to the root node, increasing the height by 1.

                        new_root
                          / \
    root                 N1 N2
   /  |  \     =>      /  | |  \
L1   L2   L6         L1  L3 L4  L6

This preserves the 1st invariant, since all leaves gain height by 1 simultaneously.

Shrinking a B-tree by merging nodes

Deleting may result in empty nodes. The 3rd invariant is restored by merging empty nodes into a sibling node. Merging is the opposite of splitting. It can also propagate to the root node, so the tree height can decrease.

When coding a B-tree, merging can be done earlier to reduce wasted space: you can merge a non-empty node when its size reaches a lower bound.

3.4 B-Tree on disk

You can already code an in-memory B-tree using these principles. But B-tree on disk requires extra considerations.

Block-based allocation

One missing detail is how to limit node size. For in-memory B+tree, you can limit the maximum number of keys in a node, the node size in bytes is not a concern, because you can allocate as many bytes as needed.

For disk-based data structures, there are no malloc/free or garbage collectors to rely on; space allocation and reuse is entirely up to us.

Space reuse can be done with a free list if all allocations are of the same size, which we’ll implement later. For now, all B-tree nodes are the same size.

Copy-on-write B-tree for safe updates

We’ve seen 3 crash-resistant ways to update disk data: renaming files, logs, LSM-trees. The lesson is not to destroy any old data during an update. This idea can be applied to trees: make a copy of the node and modify the copy instead.

Insertion or deletion starts at a leaf node; after making a copy with the modification, its parent node must be updated to point to the new node, which is also done on its copy. The copying propagates to the root node, resulting in a new tree root.

  • The original tree remains intact and is accessible from the old root.
  • The new root, with the updated copies all the way to the leaf, shares all other nodes with the original tree.
    d           d         D*
   / \         / \       / \
  b   e  ==>  b   e  +  B*  e
 / \         / \       / \
a   c       a   c     a   C*
            original  updated

This is a visualization of updating the leaf c. The copied nodes are in uppercase (D, B, C), while the shared subtrees are in lowercase (a, e).

This is called a copy-on-write data structure. It’s also described as immutable, append-only (not literally), or persistent (not related to durability). Be aware that database jargon does not have consistent meanings.

2 more problems remain for the copy-on-write B-tree:

  1. How to find the tree root, as it changes after each update? The crash safety problem is reduced to a single pointer update, which we’ll solve later.
  2. How to reuse nodes from old versions? That’s the job of a free list.

Copy-on-write B-tree advantages

One advantage of keeping old versions around is that we got snapshot isolation for free. A transaction starts with a version of the tree, and won’t see changes from other versions.

And crash recovery is effortless; just use the last old version.

Another one is that it fits the multi-reader-single-writer concurrency model, and readers do not block the writer. We’ll explore these later.

Alternative: In-place update with double-write

While crash recovery is obvious in copy-on-write data structures, they can be undesirable due to the high write amplification. Each update copies the whole path (O(log N)), while most in-place updates touch only 1 leaf node.

It’s possible to do in-place updates with crash recovery without copy-on-write:

  1. Save a copy of the entire updated nodes somewhere. This is like copy-on-write, but without copying the parent node.
  2. fsync the saved copies. (Can respond to the client at this point.)
  3. Actually update the data structure in-place.
  4. fsync the updates.

After a crash, the data structure may be half updated, but we don’t really know. What we do is blindly apply the saved copies, so that the data structure ends with the updated state, regardless of the current state.

| a=1 b=2 |
    ||  1. Save a copy of the entire updated nodes.
    \/
| a=1 b=2 |   +   | a=2 b=4 |
   data           updated copy
    ||  2. fsync the saved copies.
    \/
| a=1 b=2 |   +   | a=2 b=4 |
   data           updated copy (fsync'ed)
    ||  3. Update the data structure in-place. But we crashed here!
    \/
| ??????? |   +   | a=2 b=4 |
   data (bad)     updated copy (good)
    ||  Recovery: apply the saved copy.
    \/
| a=2 b=4 |   +   | a=2 b=4 |
   data (new)     useless now

The saved updated copies are called double-write in MySQL jargon. But what if the double-write is corrupted? It’s handled the same way as logs: checksum.

  • If the checksum detects a bad double-write, ignore it. It’s before the 1st fsync, so the main data is in a good and old state.
  • If the double-write is good, applying it will always yield good main data.

Some DBs actually store the double-writes in logs, called physical logging. There are 2 kinds of logging: logical and physical. Logical logging describes high-level operations such as inserting a key, such operations can only be applied to the DB when it’s in a good state, so only physical logging (low-level disk page updates) is useful for recovery.

The crash recovery principle

Let’s compare double-write with copy-on-write:

  • Double-write makes updates idempotent; the DB can retry the update by applying the saved copies since they are full nodes.
  • Copy-on-write atomically switches everything to the new version.

They are based on different ideas:

  • Double-write ensures enough information to produce the new version.
  • Copy-on-write ensures that the old version is preserved.

What if we save the original nodes instead of the updated nodes with double-write? That’s the 3rd way to recover from corruption, and it recovers to the old version like copy-on-write. We can combine the 3 ways into 1 idea: there is enough information for either the old state or the new state at any point.

Also, some copying is always required, so larger tree nodes are slower to update.

We’ll use copy-on-write because it’s simpler, but you can deviate here.

3.5 What we learned

B+tree principles:

  • n-ary tree, node size is limited by a constant.
  • Same height for all leaves.
  • Split and merge for insertion and deletion.

Disk-based data structures:

  • Copy-on-write data structures.
  • Crash recovery with double-write.

We can start coding now. 3 steps to create a persistent KV based on B+tree:

  1. Code the B+tree data structure.
  2. Move the B+tree to disk.
  3. Add a free list.