04. B+Tree Node and Insertion

Steps to code the B+tree data structure:

  1. Create a B+tree node representation.
  2. Insert and delete keys from a node.
  3. Split or merge nodes.
  4. Propagate splits and merges to parents.

4.1 B+tree node

What’s in a node?

A B+tree node is a list of KV pairs and the corresponding child pointers:

// pseudo code
type Node struct {
    keys [][]byte
    // one of the following
    vals [][]byte   // for leaf nodes only
    kids []*Node    // for internal nodes only
}

In a B+tree, only leaf nodes store value, and only internal nodes have children. But to keep things simple, we will use the same format for both node types.

Fit a node into a page

A node is split into smaller ones when it gets too big. But how do we define “too big”? For an in-memory B+tree, we can limit the number of keys in a node. But for a disk-based B+tree, the serialized node must fit into a fixed-size unit, called a page.

func Encode(node *Node) []byte
func Decode(page []byte) (*Node, error)

Fixed-size pages make space allocation and reuse easier because all deleted nodes are interchangeable, which can be managed with a free list rather than reinventing malloc().

But how do we know the serialized size? Let’s say the node is serialized via JSON, the size is only known after it is serialized, which means the node is serialized on every update just to see its size.

To make the node size easily predictable, we can roll our own simpler node serialization format, like every other B+tree implementation.

Design a node format

Here is our node format. The 2nd row is the encoded field size in bytes.

| type | nkeys |  pointers  |  offsets   | key-values | unused |
|  2B  |   2B  | nkeys × 8B | nkeys × 2B |     ...    |        |

The format starts with a 4-bytes header:

  • type is the node type (leaf or internal).
  • nkeys is the number of keys (and the number of child pointers).
const (
    BNODE_NODE = 1 // internal nodes without values
    BNODE_LEAF = 2 // leaf nodes with values
)

Then an array of child pointers and the KV pairs follow. The format is the same for both leaf and internal nodes, so the pointer array is simply unused for leaf nodes.

Each KV pair is prefixed by its size. For internal nodes, the value size is 0.

| key_size | val_size | key | val |
|    2B    |    2B    | ... | ... |

The encoded KV pairs are concatenated. To find the nth KV pair, we have to read all previous pairs. This is avoided by storing the offset of each KV pair.

For example, a leaf node {"k1":"hi", "k3":"hello"} is encoded as:

| type | nkeys | pointers | offsets |            key-values           | unused |
|   2  |   2   | nil nil  |  8 19   | 2 2 "k1" "hi"  2 5 "k3" "hello" |        |
|  2B  |  2B   |   2×8B   |  2×4B   | 4B + 2B + 2B + 4B + 2B + 5B     |        |

The offset of the first KV pair is always 0, so it’s not stored. To find the position of the n-th pair, use the offsets[n-1]. In this example, 8 is the offset of the 2nd pair, 19 is the offset past the end of the 2nd pair.

A range is divided into subranges by keys

Keys in an internal node indicate the range of each child. A root node’s range is [−∞, +∞). The range is divided recursively from the root to the leaves. To divide a range into n subranges, we need n − 1 keys. For example, node ["p", "q"] divides its range [a, z) into 3 subranges: [a, p), [p, q), [q, z).

However, our format uses n keys instead. Each key represents the start of the subrange. For example, node ["p", "q"] divides its range [p, z) into 2 subranges: [p, q), [q, z).

This makes the visualization easier and removes some edge cases, but the 1st key in an internal node is redundant, because the range start is inherited from the parent node.

KV size limit

We’ll set the node size to 4K, which is the typical OS page size. However, keys and values can be arbitrarily large, exceeding a single node. There should be a way to store large KVs outside of nodes, or to make the node size variable. This is solvable, but not fundamental. So we’ll just limit the KV size so that they always fit into a node.

const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000

func init() {
    node1max := 4 + 1*8 + 1*2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
    assert(node1max <= BTREE_PAGE_SIZE) // maximum KV
}

The key size limit also ensures that an internal node can at least host 2 keys.

Page number

An in-memory pointer is an integer address to the location of a byte. For disk data, a pointer can mean a file offset. Either way, it’s just an integer.

In a disk-based B+tree, nodes are fixed-size pages, the entire file is an array of fixed-size pages. So a node pointer needs not to address bytes, but the index of pages, called the page number, which is the file offset divided by the page size.

4.2 Implement the node format

Decode the node on the fly

The format is simple, we can read data directly from it without deserializing it into a separate, in-memory only data type. So our node data type is a byte array.

type BNode []byte // can be dumped to the disk

We just need some helper functions to use it, a.k.a getters and setters.

// getters
func (node BNode) btype() uint16 {
    return binary.LittleEndian.Uint16(node[0:2])
}
func (node BNode) nkeys() uint16 {
    return binary.LittleEndian.Uint16(node[2:4])
}

Integers are encoded in little-endian, which is just a single memory load or store. This is not much different from using a struct.

// setter
func (node BNode) setHeader(btype uint16, nkeys uint16) {
    binary.LittleEndian.PutUint16(node[0:2], btype)
    binary.LittleEndian.PutUint16(node[2:4], nkeys)
}

Except that a struct can’t be variable-length, while a byte array can.

// read and write the child pointers array
func (node BNode) getPtr(idx uint16) uint64 {
    assert(idx < node.nkeys())
    pos := 4 + 8*idx
    return binary.LittleEndian.Uint64(node[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64) {
    assert(idx < node.nkeys())
    pos := 4 + 8*idx
    binary.LittleEndian.PutUint64(node[pos:], val)
}

Read KV pairs

The start offset of the n-th KV pairs is read from the offset array.

// read the `offsets` array
func (node BNode) getOffset(idx uint16) uint16 {
    if idx == 0 {
        return 0
    }
    pos := 4 + 8*node.nkeys() + 2*(idx-1)
    return binary.LittleEndian.Uint16(node[pos:])
}

The offset is adjusted by the data before. So node[node.kvPos(0):] is where the encoded pairs starts.

func (node BNode) kvPos(idx uint16) uint16 {
    assert(idx <= node.nkeys())
    return 4 + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}

Then the KV data is returned as a byte slice, after decoding their sizes.

| key_size | val_size | key | val |
|    2B    |    2B    | ... | ... |
func (node BNode) getKey(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node[pos:])
    return node[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node[pos+0:])
    vlen := binary.LittleEndian.Uint16(node[pos+2:])
    return node[pos+4+klen:][:vlen]
}

Using a node is just loading integers or returning sliced data. A slice is a reference, so the KV data can be passed around without extra copies.

Construct a B+tree node

Now that we can read data from a node, the next step is to put data into a node. For example, we will create a leaf node {"k1":"hi", "k3":"hello"}:

node := BNode(make([]byte, BTREE_PAGE_SIZE))
node.setHeader(BNODE_LEAF, 2)
//             ^type       ^ number of keys
nodeAppendKV(node, 0, 0, []byte("k1"), []byte("hi"))
//                 ^ 1st KV
nodeAppendKV(node, 1, 0, []byte("k3"), []byte("hello"))
//                 ^ 2nd KV

Starting with an empty page, the first thing is to set the number of keys with setHeader(), because the variable-length arrays depend on it.

Next, add KV pairs or pointers to the node.

func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte)
  • idx is the position of the item (a key, a value or a pointer).
  • ptr is the nth child pointer, which is unused for leaf nodes.
  • key and val is the KV pair. Use an empty value for internal nodes.

nodeAppendKV assumes that keys are added in order. It uses the previous value of the offsets array to determine where the KV should be.

func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {
    // ptrs
    new.setPtr(idx, ptr)
    // KVs
    pos := new.kvPos(idx)   // uses the offset value of the previous key
    // 4-bytes KV sizes
    binary.LittleEndian.PutUint16(new[pos+0:], uint16(len(key)))
    binary.LittleEndian.PutUint16(new[pos+2:], uint16(len(val)))
    // KV data
    copy(new[pos+4:], key)
    copy(new[pos+4+uint16(len(key)):], val)
    // update the offset value for the next key
    new.setOffset(idx+1, new.getOffset(idx)+4+uint16((len(key)+len(val))))
}

Copy-on-write nodes

Copy-on-write means no inplace updates; updates create new nodes instead. Example 1, node {"k1":"hi", "k2":"a", "k3":"hello"} is updated with "k2":"b":

new := BNode(make([]byte, BTREE_PAGE_SIZE))
new.setHeader(BNODE_LEAF, 3)
nodeAppendKV(node, 0, 0, old.getKey(0), old.getVal(0))
nodeAppendKV(node, 1, 0, []byte("k2"), []byte("b"))
nodeAppendKV(node, 2, 0, old.getKey(2), old.getVal(2))

Example 2, remove "k2" from that node:

new := BNode(make([]byte, BTREE_PAGE_SIZE))
new.setHeader(BNODE_LEAF, 2)
nodeAppendKV(node, 0, 0, old.getKey(0), old.getVal(0))
nodeAppendKV(node, 1, 0, old.getKey(2), old.getVal(2))

Example 3, insert "a":"b" into that node:

new := BNode(make([]byte, 2*BTREE_PAGE_SIZE))   // larger
new.setHeader(BNODE_LEAF, 4)
nodeAppendKV(node, 0, 0, []byte("a"), []byte("b"))
nodeAppendKV(node, 1, 0, old.getKey(0), old.getVal(0))
nodeAppendKV(node, 2, 0, old.getKey(1), old.getVal(1))
nodeAppendKV(node, 3, 0, old.getKey(2), old.getVal(2))

After insert, the node may exceed the page size. That’s why the in-memory node is allocated with a larger size, so that it can temporarily exceed the page size before it is split.

Calculate the node size

The reason we create a serialization format is to easily predict the serialized size, which is the fixed-size header + arrays + encoded KVs.

The size of the encoded KVs is the sum of each KV data size + the 4-byte encoded lengths. But there is a shortcut: The offsets array tells us the total size of the encoded KVs, so we don’t need to iterate through them.

// node size in bytes
func (node BNode) nbytes() uint16 {
    return node.kvPos(node.nkeys()) // uses the offset value of the last key
}

4.3 Insert or update the leaf node

Insert or update at a position

Insert starts at a leaf node. To insert a new key at position idx:

func leafInsert(
    new BNode, old BNode, idx uint16, key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys()+1)
    nodeAppendRange(new, old, 0, 0, idx)    // copy the keys before `idx`
    nodeAppendKV(new, idx, 0, key, val)     // the new key
    nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx)  // keys from `idx`
}

nodeAppendRange() is just a loop for copying keys, values, and pointers:

// copy multiple keys, values, and pointers into the position
func nodeAppendRange(
    new BNode, old BNode, dstNew uint16, srcOld uint16, n uint16,
) {
    for i := uint16(0); i < n; i++ {
        dst, src := dstNew+i, srcOld+i
        nodeAppendKV(new, dst,
            old.getPtr(src), old.getKey(src), old.getVal(src))
    }
}

If the key already exists, we should update the value instead:

func leafUpdate(
    new BNode, old BNode, idx uint16, key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys())
    nodeAppendRange(new, old, 0, 0, idx)
    nodeAppendKV(new, idx, 0, key, val)
    nodeAppendRange(new, old, idx+1, idx+1, old.nkeys()-(idx+1))
}

Find the insert position

Keys are sorted, we must find the update position that keeps the sort order.

// find the last postion that is less than or equal to the key
func nodeLookupLE(node BNode, key []byte) uint16 {
    nkeys := node.nkeys()
    var i uint16
    for i = 0; i < nkeys; i++ {
        cmp := bytes.Compare(node.getKey(i), key)
        if cmp == 0 {
            return i
        }
        if cmp > 0 {
            return i - 1
        }
    }
    return i - 1
}

This can be a binary search instead of a linear search, since getKey(i) is O(1) with the offsets array.

Insert or update after a key lookup

Let’s combine a lookup with an update:

idx := nodeLookupLE(node, key)  // node.getKey(idx) <= key
if bytes.Equal(key, node.getKey(idx)) {
    leafUpdate(new, node, idx, key, val)   // found, update it
} else {
    leafInsert(new, node, idx+1, key, val) // not found. insert
}

But what if all positions are greater than the key and idx is -1? We will make sure this doesn’t happen. Explained later.

4.4 Split a node

Split by balancing bytes

For an in-memory B+tree, an oversized node can be split into 2 nodes, each with half of the keys. For a disk-based B+tree, half of the keys may not fit into a page due to uneven key sizes. However, we can use the half position as an initial guess, then move it left or right if the half is too large.

// Split an oversized node into 2 nodes. The 2nd node always fits.
func nodeSplit2(left BNode, right BNode, old BNode) {
    assert(old.nkeys() >= 2)
    // the initial guess
    nleft := old.nkeys() / 2
    // try to fit the left half
    left_bytes := func() uint16 {
        return 4 + 8*nleft + 2*nleft + old.getOffset(nleft)
    }
    for left_bytes() > BTREE_PAGE_SIZE {
        nleft--
    }
    assert(nleft >= 1)
    // try to fit the right half
    right_bytes := func() uint16 {
        return old.nbytes() - left_bytes() + 4
    }
    for right_bytes() > BTREE_PAGE_SIZE {
        nleft++
    }
    assert(nleft < old.nkeys())
    nright := old.nkeys() - nleft
    // new nodes
    left.setHeader(old.btype(), nleft)
    right.setHeader(old.btype(), nright)
    nodeAppendRange(left, old, 0, 0, nleft)
    nodeAppendRange(right, old, 0, nleft, nright)
    // NOTE: the left half may be still too big
    assert(right.nbytes() <= BTREE_PAGE_SIZE)
}

Again, the offsets array helps to predict the node size.

Split multiple times

After splitting, the left half may still be too large, because while fitting the right half, the left half size grows. This can happen if there is a big key in the middle, requiring another split.

[foo...|BIG|bar...] -> [foo...|BIG] + [bar...] -> [foo...] + [BIG] + [bar...]
                         too big

Our size limits allow a single KV to take up almost the entire page, which allows for the big-key-in-the-middle case.

const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000

So the result of a split is either 2 or 3 nodes:

// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
    if old.nbytes() <= BTREE_PAGE_SIZE {
        old = old[:BTREE_PAGE_SIZE]
        return 1, [3]BNode{old} // not split
    }
    left := BNode(make([]byte, 2*BTREE_PAGE_SIZE)) // might be split later
    right := BNode(make([]byte, BTREE_PAGE_SIZE))
    nodeSplit2(left, right, old)
    if left.nbytes() <= BTREE_PAGE_SIZE {
        left = left[:BTREE_PAGE_SIZE]
        return 2, [3]BNode{left, right} // 2 nodes
    }
    leftleft := BNode(make([]byte, BTREE_PAGE_SIZE))
    middle := BNode(make([]byte, BTREE_PAGE_SIZE))
    nodeSplit2(leftleft, middle, left)
    assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
    return 3, [3]BNode{leftleft, middle, right} // 3 nodes
}

The 3-node case can be eliminated by lowering the KV size limit.

4.5 B+tree data structure

Isolate disk IO from data structures

A disk-based B+tree depends on disk IO to dereference pointers (page numbers). But we haven’t coded that yet. It’s isolated via callbacks, so the data structure can be tested alone.

type BTree struct {
    // root pointer (a nonzero page number)
    root uint64
    // callbacks for managing on-disk pages
    get func(uint64) []byte // read data from a page number
    new func([]byte) uint64 // allocate a new page number with data
    del func(uint64)        // deallocate a page number
}

Disk IO can be simulated with these callbacks. So a disk-based B+tree can easily be used as an in-memory B+tree, while the reverse is not true.

Tree insertion

Tree insertion is a top-down recursion until reaching the leaf node.

func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
    // The extra size allows it to exceed 1 page temporarily.
    new := BNode(make([]byte, 2*BTREE_PAGE_SIZE))
    // where to insert the key?
    idx := nodeLookupLE(node, key) // node.getKey(idx) <= key
    switch node.btype() {
    case BNODE_LEAF: // leaf node
        if bytes.Equal(key, node.getKey(idx)) {
            leafUpdate(new, node, idx, key, val)   // found, update it
        } else {
            leafInsert(new, node, idx+1, key, val) // not found, insert
        }
    case BNODE_NODE: // internal node, walk into the child node
        // ...
    default:
        panic("bad node!")
    }
    return new
}

The case of internal nodes is just insert into the subtree, split the new child node, then replace the link with the new nodes. Since nodes are copy-on-write, the child node is always replaced with new nodes, that’s why treeInsert() returns the updated node.

    case BNODE_NODE:
        // recursive insertion to the kid node
        kptr := node.getPtr(idx)
        knode := treeInsert(tree, tree.get(kptr), key, val)
        // after insertion, split the result
        nsplit, split := nodeSplit3(knode)
        // deallocate the old kid node
        tree.del(kptr)
        // update the kid links
        nodeReplaceKidN(tree, new, node, idx, split[:nsplit]...)

Propagate splits into parents

nodeReplaceKidN() is like leafUpdate(), but with multiple keys.

// replace a link with multiple links
func nodeReplaceKidN(
    tree *BTree, new BNode, old BNode, idx uint16,
    kids ...BNode,
) {
    inc := uint16(len(kids))
    new.setHeader(BNODE_NODE, old.nkeys()+inc-1)
    nodeAppendRange(new, old, 0, 0, idx)
    for i, node := range kids {
        nodeAppendKV(new, idx+uint16(i), tree.new(node), node.getKey(0), nil)
    }
    nodeAppendRange(new, old, idx+inc, idx+1, old.nkeys()-(idx+1))
}

What’s next?

The work is almost done. We just need to add these in the next chapter:

  1. Node merging and tree delete.
  2. A high-level KV interface.
  3. Fake page callbacks for tests.