04. B+Tree Node and Insertion

4.1 Design B+tree nodes

What we will do

The first big step is just the B+tree data structures, other DB concerns will be covered in later chapters. We’ll do it from the bottom up.

  1. Design a node format that contains all the necessary bits.
  2. Manipulate nodes in a copy-on-write fashion (insert and delete keys).
  3. Split and merge nodes.
  4. Tree insertion and deletion.

The node format

All B+tree nodes are the same size for later use of the free list. Although we won’t deal with disk data at this point, a concrete node format is needed because it decides the node size in bytes and when to split a node.

A node includes:

  1. A fixed-size header, which contains:
    • The type of node (leaf or internal).
    • The number of keys.
  2. A list of pointers to child nodes for internal nodes.
  3. A list of KV pairs.
  4. A list of offsets to KVs, which can be used to binary search KVs.
| type | nkeys |  pointers  |   offsets  | key-values | unused |
|  2B  |   2B  | nkeys * 8B | nkeys * 2B |     ...    |        |

This is the format of each KV pair. Lengths followed by data.

| klen | vlen | key | val |
|  2B  |  2B  | ... | ... |

Simplifications and limits

Our goal is to learn the basics, not to create a real DB. So some simplifications are made.

The same format is used for both leaf nodes and internal nodes. This wastes some space: leaf nodes don’t need pointers and internal nodes don’t need values.

An internal node of n branches contains n keys, each key is duplicated from the minimum key of the corresponding subtree. However, only n − 1 keys are needed for n branches, as you’ll see in other B-tree introductions. The extra key makes the visualization easier.

We’ll set the node size to 4K, which is the typical OS page size. However, keys and values can be arbitrarily large, exceeding a single node. There should be a way to store large KVs outside of nodes, or to make the node size variable. This problem is solvable, but not fundamental. So we’ll skip it by limiting the KV size so that they always fit inside a node.

const HEADER = 4

const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000

func init() {
    node1max := HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
    assert(node1max <= BTREE_PAGE_SIZE) // maximum KV
}

The key size limit also ensures that an internal node can always host 2 keys.

In-memory data types

In our code, a node is just a chunk of bytes interpreted by this format. Moving data from memory to disk is simpler without a serialization step.

type BNode []byte // can be dumped to the disk

Decouple data structure from IO

Space allocation/deallocation is required for both in-memory and on-disk data structures. We can abstract this away with callbacks, which is a boundary between the data structure and the rest of the DB.

type BTree struct {
    // pointer (a nonzero page number)
    root uint64
    // callbacks for managing on-disk pages
    get func(uint64) []byte // dereference a pointer
    new func([]byte) uint64 // allocate a new page
    del func(uint64)        // deallocate a page
}

For an on-disk B+tree, the database file is an array of pages (nodes) referenced by page numbers (pointers). We’ll implement these callbacks as follows:

  • get reads a page from disk.
  • new allocates and writes a new page (copy-on-write).
  • del deallocates a page.

We can use fake callbacks (mocks) to test the data structure in memory without the rest of the DB.

4.2 Decode the node format

Since the node type is just a chunk of bytes, we’ll define some helper functions to access it.

| type | nkeys |  pointers  |   offsets  | key-values | unused |
|  2B  |   2B  | nkeys * 8B | nkeys * 2B |     ...    |        |

| klen | vlen | key | val |
|  2B  |  2B  | ... | ... |
const (
    BNODE_NODE = 1 // internal nodes without values
    BNODE_LEAF = 2 // leaf nodes with values
)

func (node BNode) btype() uint16 {
    return binary.LittleEndian.Uint16(node[0:2])
}
func (node BNode) nkeys() uint16 {
    return binary.LittleEndian.Uint16(node[2:4])
}
func (node BNode) setHeader(btype uint16, nkeys uint16) {
    binary.LittleEndian.PutUint16(node[0:2], btype)
    binary.LittleEndian.PutUint16(node[2:4], nkeys)
}

Child pointers

// pointers
func (node BNode) getPtr(idx uint16) uint64 {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    return binary.LittleEndian.Uint64(node[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64) // trivial; omitted

KV offsets and pairs

The format packs everything back to back. Finding the nth KV can be done by reading each KV pair one by one. To make it easier, we have included an offset list to locate the nth KV in O(1). This also allows binary searches within a node.

Each offset is the end of the KV pair relative to the start of the 1st KV. The start offset of the 1st KV is just 0, so we use the end offset instead, which is the start offset of the next KV.

// offset list
func offsetPos(node BNode, idx uint16) uint16 {
    assert(1 <= idx && idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*(idx-1)
}
func (node BNode) getOffset(idx uint16) uint16 {
    if idx == 0 {
        return 0
    }
    return binary.LittleEndian.Uint16(node[offsetPos(node, idx):])
}
func (node BNode) setOffset(idx uint16, offset uint16)

kvPos returns the position of the nth KV pair relative to the whole node.

// key-values
func (node BNode) kvPos(idx uint16) uint16 {
    assert(idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
func (node BNode) getKey(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node[pos:])
    return node[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte

It also conveniently returns the node size (used space) with an off-by-one lookup.

// node size in bytes
func (node BNode) nbytes() uint16 {
    return node.kvPos(node.nkeys())
}

KV lookups within a node

The “seek” operation is used for both range and point queries. So they are fundamentally the same.

// returns the first kid node whose range intersects the key. (kid[i] <= key)
// TODO: binary search
func nodeLookupLE(node BNode, key []byte) uint16 {
    nkeys := node.nkeys()
    found := uint16(0)
    // the first key is a copy from the parent node,
    // thus it's always less than or equal to the key.
    for i := uint16(1); i < nkeys; i++ {
        cmp := bytes.Compare(node.getKey(i), key)
        if cmp <= 0 {
            found = i
        }
        if cmp >= 0 {
            break
        }
    }
    return found
}

The function is called nodeLookupLE because it uses the Less-than-or-Equal operator. For point queries, we should use the equal operator instead, which is a step we can add later.

4.3 Update B+tree nodes

Insert into leaf nodes

Let’s consider inserting a key into a leaf node. The 1st step is to use nodeLookupLE to get the insert position. Then copy everything to a new node with the extra key. That’s copy-on-write.

// add a new key to a leaf node
func leafInsert(
    new BNode, old BNode, idx uint16,
    key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys()+1) // setup the header
    nodeAppendRange(new, old, 0, 0, idx)
    nodeAppendKV(new, idx, 0, key, val)
    nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx)
}

Node copying functions

nodeAppendRange copies a range of KVs and nodeAppendKV copies a KV pair. This must be done in order because these functions rely on the previous offset.

// copy a KV into the position
func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {
    // ptrs
    new.setPtr(idx, ptr)
    // KVs
    pos := new.kvPos(idx)
    binary.LittleEndian.PutUint16(new[pos+0:], uint16(len(key)))
    binary.LittleEndian.PutUint16(new[pos+2:], uint16(len(val)))
    copy(new[pos+4:], key)
    copy(new[pos+4+uint16(len(key)):], val)
    // the offset of the next key
    new.setOffset(idx+1, new.getOffset(idx)+4+uint16((len(key)+len(val))))
}

// copy multiple KVs into the position from the old node
func nodeAppendRange(
    new BNode, old BNode,
    dstNew uint16, srcOld uint16, n uint16,
)

Update internal nodes

For internal nodes, the link to the child node is always updated with the copy-on-write scheme, which can become multiple links if the child node is split.

// replace a link with one or multiple links
func nodeReplaceKidN(
    tree *BTree, new BNode, old BNode, idx uint16,
    kids ...BNode,
) {
    inc := uint16(len(kids))
    new.setHeader(BNODE_NODE, old.nkeys()+inc-1)
    nodeAppendRange(new, old, 0, 0, idx)
    for i, node := range kids {
        nodeAppendKV(new, idx+uint16(i), tree.new(node), node.getKey(0), nil)
        //                ^position      ^pointer        ^key            ^val
    }
    nodeAppendRange(new, old, idx+inc, idx+1, old.nkeys()-(idx+1))
}

Note that the tree.new callback is used to allocate the child nodes.

4.4 Split B+tree nodes

Due to the size limits we imposed, a node can host at least 1 KV pair. In the worst case, an oversized node will be split into 3 nodes, with a large KV in the middle. So we may have to split it 2 times.

// split a oversized node into 2 so that the 2nd node always fits on a page
func nodeSplit2(left BNode, right BNode, old BNode) {
    // code omitted...
}

// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
    if old.nbytes() <= BTREE_PAGE_SIZE {
        old = old[:BTREE_PAGE_SIZE]
        return 1, [3]BNode{old} // not split
    }
    left := BNode(make([]byte, 2*BTREE_PAGE_SIZE)) // might be split later
    right := BNode(make([]byte, BTREE_PAGE_SIZE))
    nodeSplit2(left, right, old)
    if left.nbytes() <= BTREE_PAGE_SIZE {
        left = left[:BTREE_PAGE_SIZE]
        return 2, [3]BNode{left, right} // 2 nodes
    }
    leftleft := BNode(make([]byte, BTREE_PAGE_SIZE))
    middle := BNode(make([]byte, BTREE_PAGE_SIZE))
    nodeSplit2(leftleft, middle, left)
    assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
    return 3, [3]BNode{leftleft, middle, right} // 3 nodes
}

Note that the returned nodes are allocated from memory; they are just temporary data until nodeReplaceKidN actually allocates them.

4.5 B+tree insertion

We’ve implemented 3 node operations:

  • leafInsert updates a leaf node.
  • nodeReplaceKidN updates an internal node.
  • nodeSplit3 splits an oversized node.

Let’s put them together for a full B+tree insertion, which starts with key lookups in the root node until it reaches a leaf.

// insert a KV into a node, the result might be split.
// the caller is responsible for deallocating the input node
// and splitting and allocating result nodes.
func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
    // the result node.
    // it's allowed to be bigger than 1 page and will be split if so
    new := BNode{data: make([]byte, 2*BTREE_PAGE_SIZE)}

    // where to insert the key?
    idx := nodeLookupLE(node, key)
    // act depending on the node type
    switch node.btype() {
    case BNODE_LEAF:
        // leaf, node.getKey(idx) <= key
        if bytes.Equal(key, node.getKey(idx)) {
            // found the key, update it.
            leafUpdate(new, node, idx, key, val)
        } else {
            // insert it after the position.
            leafInsert(new, node, idx+1, key, val)
        }
    case BNODE_NODE:
        // internal node, insert it to a kid node.
        nodeInsert(tree, new, node, idx, key, val)
    default:
        panic("bad node!")
    }
    return new
}

leafUpdate is similar to leafInsert; it updates an existing key instead of inserting a duplicate key.

// part of the treeInsert(): KV insertion to an internal node
func nodeInsert(
    tree *BTree, new BNode, node BNode, idx uint16,
    key []byte, val []byte,
) {
    kptr := node.getPtr(idx)
    // recursive insertion to the kid node
    knode := treeInsert(tree, tree.get(kptr), key, val)
    // split the result
    nsplit, split := nodeSplit3(knode)
    // deallocate the kid node
    tree.del(kptr)
    // update the kid links
    nodeReplaceKidN(tree, new, node, idx, split[:nsplit]...)
}

Internal nodes are handled recursively, each call returns an updated node, and the caller will split it if it’s oversized and handle the allocation/deallocation.

4.6 What’s next?

The work is almost done. We just need to add these in the next chapter:

  1. Node merging and tree deletion.
  2. A high-level interface.
  3. Fake node callbacks for tests.