04. B+Tree Node and Insertion
Steps to code the B+tree data structure:
- Create a B+tree node representation.
- Insert and delete keys from a node.
- Split or merge nodes.
- Propagate splits and merges to parents.
4.1 B+tree node
What’s in a node?
A B+tree node is a list of KV pairs and the corresponding child pointers:
// pseudo code
type Node struct {
    keys [][]byte
    // one of the following
    vals [][]byte // for leaf nodes only
    kids []*Node  // for internal nodes only
}
In a B+tree, only leaf nodes store values, and only internal nodes have children. But to keep things simple, we will use the same format for both node types.
Fit a node into a page
A node is split into smaller ones when it gets too big. But how do we define “too big”? For an in-memory B+tree, we can limit the number of keys in a node. But for a disk-based B+tree, the serialized node must fit into a fixed-size unit, called a page.
func Encode(node *Node) []byte
func Decode(page []byte) (*Node, error)
Fixed-size pages make space allocation and reuse easier because all deleted nodes are interchangeable, which can be managed with a free list rather than reinventing malloc().
But how do we know the serialized size? If the node is serialized via JSON, the size is only known after serialization, which means the node must be serialized on every update just to check its size.
To make the node size easily predictable, we can roll our own simpler node serialization format, like every other B+tree implementation.
Design a node format
Here is our node format. The 2nd row is the encoded field size in bytes.
| type | nkeys | pointers | offsets | key-values | unused |
| 2B | 2B | nkeys × 8B | nkeys × 2B | ... | |
The format starts with a 4-byte header:
- type is the node type (leaf or internal).
- nkeys is the number of keys (and the number of child pointers).
const (
    BNODE_NODE = 1 // internal nodes without values
    BNODE_LEAF = 2 // leaf nodes with values
)
Then an array of child pointers and the KV pairs follow. The format is the same for both leaf and internal nodes, so the pointer array is simply unused for leaf nodes.
Each KV pair is prefixed by its size. For internal nodes, the value size is 0.
| key_size | val_size | key | val |
| 2B | 2B | ... | ... |
The encoded KV pairs are concatenated. To find the nth KV pair, we have to read all previous pairs. This is avoided by storing the offset of each KV pair.
For example, a leaf node {"k1":"hi", "k3":"hello"} is encoded as:
| type | nkeys | pointers | offsets | key-values | unused |
| 2 | 2 | nil nil | 8 19 | 2 2 "k1" "hi" 2 5 "k3" "hello" | |
| 2B | 2B | 2×8B | 2×2B | 4B + 2B + 2B + 4B + 2B + 5B | |
The offset of the first KV pair is always 0, so it's not stored. To find the position of the n-th pair, use offsets[n-1]. In this example, 8 is the offset of the 2nd pair, and 19 is the offset past the end of the last pair.
A range is divided into subranges by keys
Keys in an internal node indicate the range of each child. A root
node’s range is [−∞, +∞). The range is
divided recursively from the root to the leaves. To divide a range into
n subranges, we need n − 1 keys. For example, node
["p", "q"]
divides its range [a, z) into 3 subranges:
[a, p), [p, q), [q, z).
However, our format uses n
keys instead. Each key represents the start of the subrange. For
example, node ["p", "q"]
divides its range [p, z) into 2 subranges:
[p, q), [q, z).
This makes the visualization easier and removes some edge cases, but the 1st key in an internal node is redundant, because the range start is inherited from the parent node.
KV size limit
We’ll set the node size to 4K, which is the typical OS page size. However, keys and values can be arbitrarily large, exceeding a single node. There should be a way to store large KVs outside of nodes, or to make the node size variable. This is solvable, but not fundamental. So we’ll just limit the KV size so that they always fit into a node.
const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000
func init() {
    node1max := 4 + 1*8 + 1*2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
    assert(node1max <= BTREE_PAGE_SIZE) // maximum KV
}
The key size limit also ensures that an internal node can at least host 2 keys.
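The assert() used in these snippets is a small helper that this chapter doesn't define. Here is a minimal sketch (an assumption, not the book's code), together with a quick check of the 2-key claim above; node2keys is just an illustrative name:

// a minimal assert() helper (assumed; not shown in this chapter)
func assert(cond bool) {
    if !cond {
        panic("assertion failure")
    }
}

// sanity check of the claim above: an internal node holding 2 maximum-size
// keys (values are empty in internal nodes) still fits in a page.
func init() {
    node2keys := 4 + 2*8 + 2*2 + 2*(4+BTREE_MAX_KEY_SIZE) // = 2032
    assert(node2keys <= BTREE_PAGE_SIZE)
}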
Page number
An in-memory pointer is an integer address to the location of a byte. For disk data, a pointer can mean a file offset. Either way, it’s just an integer.
In a disk-based B+tree, nodes are fixed-size pages, and the entire file is an array of such pages. So a node pointer doesn't need to address bytes; it can be the index of a page, called the page number, which is the file offset divided by the page size.
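For example, reading a node by its page number is just a read at page_number × page_size. A rough sketch, assuming an os.File (import "os") and that page management is handled elsewhere; readPage is an illustrative name:

// read the page at page number `ptr` (illustrative sketch only)
func readPage(fp *os.File, ptr uint64) ([]byte, error) {
    page := make([]byte, BTREE_PAGE_SIZE)
    // the file offset is the page number times the page size
    if _, err := fp.ReadAt(page, int64(ptr)*BTREE_PAGE_SIZE); err != nil {
        return nil, err
    }
    return page, nil
}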
4.2 Implement the node format
Decode the node on the fly
The format is simple enough that we can read data directly from it, without deserializing it into a separate, in-memory-only data type. So our node data type is just a byte array.
type BNode []byte // can be dumped to the disk
We just need some helper functions to use it, a.k.a. getters and setters.
// getters
func (node BNode) btype() uint16 {
    return binary.LittleEndian.Uint16(node[0:2])
}
func (node BNode) nkeys() uint16 {
    return binary.LittleEndian.Uint16(node[2:4])
}
Integers are encoded in little-endian, so reading or writing one is just a single memory load or store. This is not much different from using a struct.
// setter
func (node BNode) setHeader(btype uint16, nkeys uint16) {
    binary.LittleEndian.PutUint16(node[0:2], btype)
    binary.LittleEndian.PutUint16(node[2:4], nkeys)
}
Except that a struct can’t be variable-length, while a byte array can.
// read and write the child pointers array
func (node BNode) getPtr(idx uint16) uint64 {
    assert(idx < node.nkeys())
    pos := 4 + 8*idx
    return binary.LittleEndian.Uint64(node[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64) {
    assert(idx < node.nkeys())
    pos := 4 + 8*idx
    binary.LittleEndian.PutUint64(node[pos:], val)
}
Read KV pairs
The start offset of the n-th KV pair is read from the offsets array.
// read the `offsets` array
func (node BNode) getOffset(idx uint16) uint16 {
    if idx == 0 {
        return 0
    }
    pos := 4 + 8*node.nkeys() + 2*(idx-1)
    return binary.LittleEndian.Uint16(node[pos:])
}
The offset is relative to the start of the KV area; kvPos() adds the size of the header and the arrays, so node[node.kvPos(0):] is where the encoded pairs start.
func (node BNode) kvPos(idx uint16) uint16 {
    assert(idx <= node.nkeys())
    return 4 + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
Then the KV data is returned as byte slices, after decoding the sizes.
| key_size | val_size | key | val |
| 2B | 2B | ... | ... |
func (node BNode) getKey(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node[pos:])
    return node[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node[pos+0:])
    vlen := binary.LittleEndian.Uint16(node[pos+2:])
    return node[pos+4+klen:][:vlen]
}
Using a node is just loading integers or returning sliced data. A slice is a reference, so the KV data can be passed around without extra copies.
Construct a B+tree node
Now that we can read data from a node, the next step is to put data into a node. For example, we will create a leaf node {"k1":"hi", "k3":"hello"}:
node := BNode(make([]byte, BTREE_PAGE_SIZE))
node.setHeader(BNODE_LEAF, 2)
//             ^type       ^ number of keys
nodeAppendKV(node, 0, 0, []byte("k1"), []byte("hi"))
//                 ^ 1st KV
nodeAppendKV(node, 1, 0, []byte("k3"), []byte("hello"))
//                 ^ 2nd KV
Starting with an empty page, the first thing is to set the number of
keys with setHeader()
, because the variable-length arrays
depend on it.
Next, add KV pairs or pointers to the node.
func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte)
- idx is the position of the item (a key, a value, or a pointer).
- ptr is the nth child pointer, which is unused for leaf nodes.
- key and val are the KV pair. Use an empty value for internal nodes.
nodeAppendKV
assumes that keys are added in order. It
uses the previous value of the offsets
array to determine
where the KV should be.
func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {
    // ptrs
    new.setPtr(idx, ptr)
    // KVs
    pos := new.kvPos(idx) // uses the offset value of the previous key
    // 4-byte KV sizes
    binary.LittleEndian.PutUint16(new[pos+0:], uint16(len(key)))
    binary.LittleEndian.PutUint16(new[pos+2:], uint16(len(val)))
    // KV data
    copy(new[pos+4:], key)
    copy(new[pos+4+uint16(len(key)):], val)
    // update the offset value for the next key
    new.setOffset(idx+1, new.getOffset(idx)+4+uint16(len(key)+len(val)))
}
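nodeAppendKV() relies on a setOffset() helper that isn't shown above. A minimal sketch that mirrors getOffset() and the same layout (the exact version is an assumption):

// write the `offsets` array; the 0 offset of the 1st pair is implicit and not stored
func (node BNode) setOffset(idx uint16, offset uint16) {
    assert(1 <= idx && idx <= node.nkeys())
    pos := 4 + 8*node.nkeys() + 2*(idx-1)
    binary.LittleEndian.PutUint16(node[pos:], offset)
}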
Copy-on-write nodes
Copy-on-write means no in-place updates; updates create new nodes instead. Example 1: node {"k1":"hi", "k2":"a", "k3":"hello"} is updated with "k2":"b":
new := BNode(make([]byte, BTREE_PAGE_SIZE))
new.setHeader(BNODE_LEAF, 3)
nodeAppendKV(new, 0, 0, old.getKey(0), old.getVal(0))
nodeAppendKV(new, 1, 0, []byte("k2"), []byte("b"))
nodeAppendKV(new, 2, 0, old.getKey(2), old.getVal(2))
Example 2: remove "k2" from that node:
new := BNode(make([]byte, BTREE_PAGE_SIZE))
new.setHeader(BNODE_LEAF, 2)
nodeAppendKV(new, 0, 0, old.getKey(0), old.getVal(0))
nodeAppendKV(new, 1, 0, old.getKey(2), old.getVal(2))
Example 3: insert "a":"b" into that node:
new := BNode(make([]byte, 2*BTREE_PAGE_SIZE)) // larger
new.setHeader(BNODE_LEAF, 4)
nodeAppendKV(new, 0, 0, []byte("a"), []byte("b"))
nodeAppendKV(new, 1, 0, old.getKey(0), old.getVal(0))
nodeAppendKV(new, 2, 0, old.getKey(1), old.getVal(1))
nodeAppendKV(new, 3, 0, old.getKey(2), old.getVal(2))
After an insert, the node may exceed the page size. That's why the in-memory node is allocated with a larger size, so that it can temporarily exceed the page size before it is split.
Calculate the node size
The reason we create a serialization format is to easily predict the serialized size, which is the fixed-size header + arrays + encoded KVs.
The size of the encoded KVs is the sum of each KV data size + the
4-byte encoded lengths. But there is a shortcut: The
offsets
array tells us the total size of the encoded KVs,
so we don’t need to iterate through them.
// node size in bytes
func (node BNode) nbytes() uint16 {
    return node.kvPos(node.nkeys()) // uses the offset value of the last key
}
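For the earlier example node {"k1":"hi", "k3":"hello"}, this gives 4 (header) + 2×8 (pointers) + 2×2 (offsets) + 19 (encoded KVs) = 43 bytes.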
4.3 Insert or update the leaf node
Insert or update at a position
Insert starts at a leaf node. To insert a new key at position idx:
func leafInsert(
    new BNode, old BNode, idx uint16, key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys()+1)
    nodeAppendRange(new, old, 0, 0, idx)                   // copy the keys before `idx`
    nodeAppendKV(new, idx, 0, key, val)                    // the new key
    nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx) // keys from `idx`
}
nodeAppendRange() is just a loop for copying keys, values, and pointers:
// copy multiple keys, values, and pointers into the position
func nodeAppendRange(
    new BNode, old BNode, dstNew uint16, srcOld uint16, n uint16,
) {
    for i := uint16(0); i < n; i++ {
        dst, src := dstNew+i, srcOld+i
        nodeAppendKV(new, dst,
            old.getPtr(src), old.getKey(src), old.getVal(src))
    }
}
If the key already exists, we should update the value instead:
func leafUpdate(
    new BNode, old BNode, idx uint16, key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys())
    nodeAppendRange(new, old, 0, 0, idx)
    nodeAppendKV(new, idx, 0, key, val)
    nodeAppendRange(new, old, idx+1, idx+1, old.nkeys()-(idx+1))
}
Find the insert position
Keys are sorted, so we must find the insert or update position that keeps the sort order.
// find the last position that is less than or equal to the key
func nodeLookupLE(node BNode, key []byte) uint16 {
    nkeys := node.nkeys()
    var i uint16
    for i = 0; i < nkeys; i++ {
        cmp := bytes.Compare(node.getKey(i), key)
        if cmp == 0 {
            return i
        }
        if cmp > 0 {
            return i - 1
        }
    }
    return i - 1
}
This can be a binary search instead of a linear search, since getKey(i) is O(1) with the offsets array.
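A sketch of such a binary-search variant (an alternative for illustration, not the function used in the rest of this chapter; nodeLookupLEBinary is a hypothetical name):

// find the last position that is less than or equal to the key (binary search)
func nodeLookupLEBinary(node BNode, key []byte) uint16 {
    lo, hi := uint16(0), node.nkeys()
    for lo < hi { // invariant: keys before `lo` are <= key; keys from `hi` are > key
        mid := (lo + hi) / 2
        if bytes.Compare(node.getKey(mid), key) <= 0 {
            lo = mid + 1
        } else {
            hi = mid
        }
    }
    return lo - 1 // wraps around if every key is greater, same as the linear version
}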
Insert or update after a key lookup
Let’s combine a lookup with an update:
idx := nodeLookupLE(node, key) // node.getKey(idx) <= key
if bytes.Equal(key, node.getKey(idx)) {
    leafUpdate(new, node, idx, key, val) // found, update it
} else {
    leafInsert(new, node, idx+1, key, val) // not found, insert
}
But what if all positions are greater than the key and
idx
is -1
? We will make sure this doesn’t
happen. Explained later.
4.4 Split a node
Split by balancing bytes
For an in-memory B+tree, an oversized node can be split into 2 nodes, each with half of the keys. For a disk-based B+tree, half of the keys may not fit into a page due to uneven key sizes. However, we can use the half position as an initial guess, then move it left or right if the half is too large.
// Split an oversized node into 2 nodes. The 2nd node always fits.
func nodeSplit2(left BNode, right BNode, old BNode) {
    assert(old.nkeys() >= 2)
    // the initial guess
    nleft := old.nkeys() / 2
    // try to fit the left half
    left_bytes := func() uint16 {
        return 4 + 8*nleft + 2*nleft + old.getOffset(nleft)
    }
    for left_bytes() > BTREE_PAGE_SIZE {
        nleft--
    }
    assert(nleft >= 1)
    // try to fit the right half
    right_bytes := func() uint16 {
        return old.nbytes() - left_bytes() + 4
    }
    for right_bytes() > BTREE_PAGE_SIZE {
        nleft++
    }
    assert(nleft < old.nkeys())
    nright := old.nkeys() - nleft
    // new nodes
    left.setHeader(old.btype(), nleft)
    right.setHeader(old.btype(), nright)
    nodeAppendRange(left, old, 0, 0, nleft)
    nodeAppendRange(right, old, 0, nleft, nright)
    // NOTE: the left half may be still too big
    assert(right.nbytes() <= BTREE_PAGE_SIZE)
}
Again, the offsets
array helps to predict the node
size.
Split multiple times
After splitting, the left half may still be too large, because while fitting the right half, the left half size grows. This can happen if there is a big key in the middle, requiring another split.
[foo...|BIG|bar...] -> [foo...|BIG] + [bar...] -> [foo...] + [BIG] + [bar...]
                         (too big)
Our size limits allow a single KV to take up almost the entire page, which allows for the big-key-in-the-middle case.
const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000
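At these limits, a single KV entry can occupy 4 + 1000 + 3000 = 4004 bytes of encoded KV data, plus 8 bytes for its pointer slot and 2 bytes for its offset slot, so one entry can indeed take up nearly the whole 4096-byte page by itself.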
So the result of a split is either 2 or 3 nodes:
// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
    if old.nbytes() <= BTREE_PAGE_SIZE {
        old = old[:BTREE_PAGE_SIZE]
        return 1, [3]BNode{old} // not split
    }
    left := BNode(make([]byte, 2*BTREE_PAGE_SIZE)) // might be split later
    right := BNode(make([]byte, BTREE_PAGE_SIZE))
    nodeSplit2(left, right, old)
    if left.nbytes() <= BTREE_PAGE_SIZE {
        left = left[:BTREE_PAGE_SIZE]
        return 2, [3]BNode{left, right} // 2 nodes
    }
    leftleft := BNode(make([]byte, BTREE_PAGE_SIZE))
    middle := BNode(make([]byte, BTREE_PAGE_SIZE))
    nodeSplit2(leftleft, middle, left)
    assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
    return 3, [3]BNode{leftleft, middle, right} // 3 nodes
}
The 3-node case can be eliminated by lowering the KV size limit.
4.5 B+tree data structure
Isolate disk IO from data structures
A disk-based B+tree depends on disk IO to dereference pointers (page numbers). But we haven’t coded that yet. It’s isolated via callbacks, so the data structure can be tested alone.
type BTree struct {
    // root pointer (a nonzero page number)
    root uint64
    // callbacks for managing on-disk pages
    get func(uint64) []byte // read data from a page number
    new func([]byte) uint64 // allocate a new page number with data
    del func(uint64)        // deallocate a page number
}
Disk IO can be simulated with these callbacks. So a disk-based B+tree can easily be used as an in-memory B+tree, while the reverse is not true.
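For example, the callbacks can be backed by an in-memory map for testing. A rough sketch of the idea (the book adds its own test pages in the next chapter; fakePages and newTestTree are illustrative names):

// in-memory fake pages for testing (illustrative sketch only)
type fakePages struct {
    pages map[uint64][]byte
    seq   uint64 // last allocated page number; 0 is reserved as "no page"
}

func newTestTree() (*BTree, *fakePages) {
    fp := &fakePages{pages: map[uint64][]byte{}}
    tree := &BTree{
        get: func(ptr uint64) []byte { return fp.pages[ptr] },
        new: func(node []byte) uint64 {
            fp.seq++
            fp.pages[fp.seq] = node
            return fp.seq
        },
        del: func(ptr uint64) { delete(fp.pages, ptr) },
    }
    return tree, fp
}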
Tree insertion
Tree insertion is a top-down recursion until reaching the leaf node.
func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
    // The extra size allows it to exceed 1 page temporarily.
    new := BNode(make([]byte, 2*BTREE_PAGE_SIZE))
    // where to insert the key?
    idx := nodeLookupLE(node, key) // node.getKey(idx) <= key
    switch node.btype() {
    case BNODE_LEAF: // leaf node
        if bytes.Equal(key, node.getKey(idx)) {
            leafUpdate(new, node, idx, key, val) // found, update it
        } else {
            leafInsert(new, node, idx+1, key, val) // not found, insert
        }
    case BNODE_NODE: // internal node, walk into the child node
        // ...
    default:
        panic("bad node!")
    }
    return new
}
The case of internal nodes is just: insert into the subtree, split the updated child node if needed, then replace the link with the new nodes. Since nodes are copy-on-write, the child node is always replaced with new nodes, which is why treeInsert() returns the updated node.
case BNODE_NODE:
    // recursive insertion to the kid node
    kptr := node.getPtr(idx)
    knode := treeInsert(tree, tree.get(kptr), key, val)
    // after insertion, split the result
    nsplit, split := nodeSplit3(knode)
    // deallocate the old kid node
    tree.del(kptr)
    // update the kid links
    nodeReplaceKidN(tree, new, node, idx, split[:nsplit]...)
Propagate splits into parents
nodeReplaceKidN() is like leafUpdate(), but with multiple keys.
// replace a link with multiple links
func nodeReplaceKidN(
    tree *BTree, new BNode, old BNode, idx uint16,
    kids ...BNode,
) {
    inc := uint16(len(kids))
    new.setHeader(BNODE_NODE, old.nkeys()+inc-1)
    nodeAppendRange(new, old, 0, 0, idx)
    for i, node := range kids {
        nodeAppendKV(new, idx+uint16(i), tree.new(node), node.getKey(0), nil)
    }
    nodeAppendRange(new, old, idx+inc, idx+1, old.nkeys()-(idx+1))
}
What’s next?
The work is almost done. We just need to add these in the next chapter:
- Node merging and tree delete.
- A high-level KV interface.
- Fake page callbacks for tests.