Build Your Own Database From Scratch EBook·Paperback
⟵ prev Contents next ⟶

🔥 My other Book: Build Your Own Redis

04. B-Tree: The Practice (Part I)

This chapter implements an immutable B+ tree in Golang. The implementation is minimal and thus is easy to follow.

4.1 The Node Format

Our B-tree will be persisted to the disk eventually, so we need to design the wire format for the B-tree nodes first. Without the format, we won’t know the size of a node and when to split a node.

A node consists of:

  1. A fixed-sized header containing the type of the node (leaf node or internal node) and the number of keys.
  2. A list of pointers to the child nodes. (Used by internal nodes).
  3. A list of offsets pointing to each key-value pair.
  4. Packed KV pairs.
| type | nkeys |  pointers  |   offsets  | key-values
|  2B  |   2B  | nkeys * 8B | nkeys * 2B | ...

This is the format of the KV pair. Lengths followed by data.

| klen | vlen | key | val |
|  2B  |  2B  | ... | ... |

To keep things simple, both leaf nodes and internal nodes use the same format.

4.2 Data Types

Since we’re going to dump our B-tree to the disk eventually, why not use an array of bytes as our in-memory data structure as well?

type BNode struct {
    data []byte // can be dumped to the disk

const (
    BNODE_NODE = 1 // internal nodes without values
    BNODE_LEAF = 2 // leaf nodes with values

And we can’t use the in-memory pointers, the pointers are 64-bit integers referencing disk pages instead of in-memory nodes. We’ll add some callbacks to abstract away this aspect so that our data structure code remains pure data structure code.

type BTree struct {
    // pointer (a nonzero page number)
    root uint64
    // callbacks for managing on-disk pages
    get func(uint64) BNode // dereference a pointer
    new func(BNode) uint64 // allocate a new page
    del func(uint64)       // deallocate a page

The page size is defined to be 4K bytes. A larger page size such as 8K or 16K also works.

We also add some constraints on the size of the keys and values. So that a node with a single KV pair always fits on a single page. If you need to support bigger keys or bigger values, you have to allocate extra pages for them and that adds complexity.

const HEADER = 4

const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000

func init() {
    node1max := HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
    assert(node1max <= BTREE_PAGE_SIZE)

4.3 Decoding the B-tree Node

Since a node is just an array of bytes, we’ll add some helper functions to access its content.

// header
func (node BNode) btype() uint16 {
    return binary.LittleEndian.Uint16(
func (node BNode) nkeys() uint16 {
    return binary.LittleEndian.Uint16([2:4])
func (node BNode) setHeader(btype uint16, nkeys uint16) {
    binary.LittleEndian.PutUint16([0:2], btype)
    binary.LittleEndian.PutUint16([2:4], nkeys)
// pointers
func (node BNode) getPtr(idx uint16) uint64 {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    return binary.LittleEndian.Uint64([pos:])
func (node BNode) setPtr(idx uint16, val uint64) {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    binary.LittleEndian.PutUint64([pos:], val)

Some details about the offset list:

// offset list
func offsetPos(node BNode, idx uint16) uint16 {
    assert(1 <= idx && idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*(idx-1)
func (node BNode) getOffset(idx uint16) uint16 {
    if idx == 0 {
        return 0
    return binary.LittleEndian.Uint16([offsetPos(node, idx):])
func (node BNode) setOffset(idx uint16, offset uint16) {
    binary.LittleEndian.PutUint16([offsetPos(node, idx):], offset)

The offset list is used to locate the nth KV pair quickly.

// key-values
func (node BNode) kvPos(idx uint16) uint16 {
    assert(idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
func (node BNode) getKey(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16([pos:])
func (node BNode) getVal(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16([pos+0:])
    vlen := binary.LittleEndian.Uint16([pos+2:])

And to determine the size of the node.

// node size in bytes
func (node BNode) nbytes() uint16 {
    return node.kvPos(node.nkeys())

4.4 The B-Tree Insertion

The code is broken down into small steps.

Step 1: Look Up the Key

To insert a key into a leaf node, we need to look up its position in the sorted KV list.

// returns the first kid node whose range intersects the key. (kid[i] <= key)
// TODO: bisect
func nodeLookupLE(node BNode, key []byte) uint16 {
    nkeys := node.nkeys()
    found := uint16(0)
    // the first key is a copy from the parent node,
    // thus it's always less than or equal to the key.
    for i := uint16(1); i < nkeys; i++ {
        cmp := bytes.Compare(node.getKey(i), key)
        if cmp <= 0 {
            found = i
        if cmp >= 0 {
    return found

The lookup works for both leaf nodes and internal nodes. Note that the first key is skipped for comparison, since it has already been compared from the parent node.

Step 2: Update Leaf Nodes

After looking up the position to insert, we need to create a copy of the node with the new key in it.

// add a new key to a leaf node
func leafInsert(
    new BNode, old BNode, idx uint16,
    key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys()+1)
    nodeAppendRange(new, old, 0, 0, idx)
    nodeAppendKV(new, idx, 0, key, val)
    nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx)

The nodeAppendRange function copies keys from an old node to a new node.

// copy multiple KVs into the position
func nodeAppendRange(
    new BNode, old BNode,
    dstNew uint16, srcOld uint16, n uint16,
) {
    assert(srcOld+n <= old.nkeys())
    assert(dstNew+n <= new.nkeys())
    if n == 0 {

    // pointers
    for i := uint16(0); i < n; i++ {
        new.setPtr(dstNew+i, old.getPtr(srcOld+i))
    // offsets
    dstBegin := new.getOffset(dstNew)
    srcBegin := old.getOffset(srcOld)
    for i := uint16(1); i <= n; i++ { // NOTE: the range is [1, n]
        offset := dstBegin + old.getOffset(srcOld+i) - srcBegin
        new.setOffset(dstNew+i, offset)
    // KVs
    begin := old.kvPos(srcOld)
    end := old.kvPos(srcOld + n)

The nodeAppendKV function copies a KV pair to the new node.

// copy a KV into the position
func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {
    // ptrs
    new.setPtr(idx, ptr)
    // KVs
    pos := new.kvPos(idx)
    binary.LittleEndian.PutUint16([pos+0:], uint16(len(key)))
    binary.LittleEndian.PutUint16([pos+2:], uint16(len(val)))
    copy([pos+4:], key)
    copy([pos+4+uint16(len(key)):], val)
    // the offset of the next key
    new.setOffset(idx+1, new.getOffset(idx)+4+uint16((len(key)+len(val))))

Step 3: Recursive Insertion

The main function for inserting a key.

// insert a KV into a node, the result might be split into 2 nodes.
// the caller is responsible for deallocating the input node
// and splitting and allocating result nodes.
func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
    // the result node.
    // it's allowed to be bigger than 1 page and will be split if so
    new := BNode{data: make([]byte, 2*BTREE_PAGE_SIZE)}

    // where to insert the key?
    idx := nodeLookupLE(node, key)
    // act depending on the node type
    switch node.btype() {
    case BNODE_LEAF:
        // leaf, node.getKey(idx) <= key
        if bytes.Equal(key, node.getKey(idx)) {
            // found the key, update it.
            leafUpdate(new, node, idx, key, val)
        } else {
            // insert it after the position.
            leafInsert(new, node, idx+1, key, val)
    case BNODE_NODE:
        // internal node, insert it to a kid node.
        nodeInsert(tree, new, node, idx, key, val)
        panic("bad node!")
    return new

The leafUpdate function is similar to the leafInsert function.

Step 4: Handle Internal Nodes

Now comes the code for handling internal nodes.

// part of the treeInsert(): KV insertion to an internal node
func nodeInsert(
    tree *BTree, new BNode, node BNode, idx uint16,
    key []byte, val []byte,
) {
    // get and deallocate the kid node
    kptr := node.getPtr(idx)
    knode := tree.get(kptr)
    // recursive insertion to the kid node
    knode = treeInsert(tree, knode, key, val)
    // split the result
    nsplit, splited := nodeSplit3(knode)
    // update the kid links
    nodeReplaceKidN(tree, new, node, idx, splited[:nsplit]...)

Step 5: Split Big Nodes

Inserting keys into a node increases its size, causing it to exceed the page size. In this case, the node is split into multiple smaller nodes.

The maximum allowed key size and value size only guarantee that a single KV pair always fits on one page. In the worst case, the fat node is split into 3 nodes (one large KV pair in the middle).

// split a bigger-than-allowed node into two.
// the second node always fits on a page.
func nodeSplit2(left BNode, right BNode, old BNode) {
    // code omitted...

// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
    if old.nbytes() <= BTREE_PAGE_SIZE { =[:BTREE_PAGE_SIZE]
        return 1, [3]BNode{old}
    left := BNode{make([]byte, 2*BTREE_PAGE_SIZE)} // might be split later
    right := BNode{make([]byte, BTREE_PAGE_SIZE)}
    nodeSplit2(left, right, old)
    if left.nbytes() <= BTREE_PAGE_SIZE { =[:BTREE_PAGE_SIZE]
        return 2, [3]BNode{left, right}
    // the left node is still too large
    leftleft := BNode{make([]byte, BTREE_PAGE_SIZE)}
    middle := BNode{make([]byte, BTREE_PAGE_SIZE)}
    nodeSplit2(leftleft, middle, left)
    assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
    return 3, [3]BNode{leftleft, middle, right}

Step 6: Update Internal Nodes

Inserting a key into a node can result in either 1, 2 or 3 nodes. The parent node must update itself accordingly. The code for updating an internal node is similar to that for updating a leaf node.

// replace a link with multiple links
func nodeReplaceKidN(
    tree *BTree, new BNode, old BNode, idx uint16,
    kids ...BNode,
) {
    inc := uint16(len(kids))
    new.setHeader(BNODE_NODE, old.nkeys()+inc-1)
    nodeAppendRange(new, old, 0, 0, idx)
    for i, node := range kids {
        nodeAppendKV(new, idx+uint16(i),, node.getKey(0), nil)
    nodeAppendRange(new, old, idx+inc, idx+1, old.nkeys()-(idx+1))

We have finished the B-tree insertion. Deletion and the rest of the code will be introduced in the next chapter.

⟵ prev Contents next ⟶