04. B+Tree Node and Insertion
4.1 Design B+tree nodes
What we will do
The first big step is just the B+tree data structures, other DB concerns will be covered in later chapters. We’ll do it from the bottom up.
- Design a node format that contains all the necessary bits.
- Manipulate nodes in a copy-on-write fashion (insert and delete keys).
- Split and merge nodes.
- Tree insertion and deletion.
The node format
All B+tree nodes are the same size for later use of the free list. Although we won’t deal with disk data at this point, a concrete node format is needed because it decides the node size in bytes and when to split a node.
A node includes:
- A fixed-size header, which contains:
- The type of node (leaf or internal).
- The number of keys.
- A list of pointers to child nodes for internal nodes.
- A list of KV pairs.
- A list of offsets to KVs, which can be used to binary search KVs.
| type | nkeys | pointers | offsets | key-values | unused |
| 2B | 2B | nkeys * 8B | nkeys * 2B | ... | |
This is the format of each KV pair. Lengths followed by data.
| klen | vlen | key | val |
| 2B | 2B | ... | ... |
Simplifications and limits
Our goal is to learn the basics, not to create a real DB. So some simplifications are made.
The same format is used for both leaf nodes and internal nodes. This wastes some space: leaf nodes don’t need pointers and internal nodes don’t need values.
An internal node of n branches contains n keys, each key is duplicated from the minimum key of the corresponding subtree. However, only n − 1 keys are needed for n branches, as you’ll see in other B-tree introductions. The extra key makes the visualization easier.
We’ll set the node size to 4K, which is the typical OS page size. However, keys and values can be arbitrarily large, exceeding a single node. There should be a way to store large KVs outside of nodes, or to make the node size variable. This problem is solvable, but not fundamental. So we’ll skip it by limiting the KV size so that they always fit inside a node.
const HEADER = 4
const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000
func init() {
:= HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
node1max (node1max <= BTREE_PAGE_SIZE) // maximum KV
assert}
The key size limit also ensures that an internal node can always host 2 keys.
In-memory data types
In our code, a node is just a chunk of bytes interpreted by this format. Moving data from memory to disk is simpler without a serialization step.
type BNode []byte // can be dumped to the disk
Decouple data structure from IO
Space allocation/deallocation is required for both in-memory and on-disk data structures. We can abstract this away with callbacks, which is a boundary between the data structure and the rest of the DB.
type BTree struct {
// pointer (a nonzero page number)
uint64
root // callbacks for managing on-disk pages
func(uint64) []byte // dereference a pointer
get new func([]byte) uint64 // allocate a new page
func(uint64) // deallocate a page
del }
For an on-disk B+tree, the database file is an array of pages (nodes) referenced by page numbers (pointers). We’ll implement these callbacks as follows:
get
reads a page from disk.new
allocates and writes a new page (copy-on-write).del
deallocates a page.
We can use fake callbacks (mocks) to test the data structure in memory without the rest of the DB.
4.2 Decode the node format
Since the node type is just a chunk of bytes, we’ll define some helper functions to access it.
| type | nkeys | pointers | offsets | key-values | unused |
| 2B | 2B | nkeys * 8B | nkeys * 2B | ... | |
| klen | vlen | key | val |
| 2B | 2B | ... | ... |
Header
const (
= 1 // internal nodes without values
BNODE_NODE = 2 // leaf nodes with values
BNODE_LEAF )
func (node BNode) btype() uint16 {
return binary.LittleEndian.Uint16(node[0:2])
}
func (node BNode) nkeys() uint16 {
return binary.LittleEndian.Uint16(node[2:4])
}
func (node BNode) setHeader(btype uint16, nkeys uint16) {
.LittleEndian.PutUint16(node[0:2], btype)
binary.LittleEndian.PutUint16(node[2:4], nkeys)
binary}
Child pointers
// pointers
func (node BNode) getPtr(idx uint16) uint64 {
(idx < node.nkeys())
assert:= HEADER + 8*idx
pos return binary.LittleEndian.Uint64(node[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64) // trivial; omitted
KV offsets and pairs
The format packs everything back to back. Finding the nth KV can be done by reading each KV pair one by one. To make it easier, we have included an offset list to locate the nth KV in O(1). This also allows binary searches within a node.
Each offset is the end of the KV pair relative to the start of the 1st KV. The start offset of the 1st KV is just 0, so we use the end offset instead, which is the start offset of the next KV.
// offset list
func offsetPos(node BNode, idx uint16) uint16 {
(1 <= idx && idx <= node.nkeys())
assertreturn HEADER + 8*node.nkeys() + 2*(idx-1)
}
func (node BNode) getOffset(idx uint16) uint16 {
if idx == 0 {
return 0
}
return binary.LittleEndian.Uint16(node[offsetPos(node, idx):])
}
func (node BNode) setOffset(idx uint16, offset uint16)
kvPos
returns the position of the nth KV pair relative
to the whole node.
// key-values
func (node BNode) kvPos(idx uint16) uint16 {
(idx <= node.nkeys())
assertreturn HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
func (node BNode) getKey(idx uint16) []byte {
(idx < node.nkeys())
assert:= node.kvPos(idx)
pos := binary.LittleEndian.Uint16(node[pos:])
klen return node[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte
It also conveniently returns the node size (used space) with an off-by-one lookup.
// node size in bytes
func (node BNode) nbytes() uint16 {
return node.kvPos(node.nkeys())
}
KV lookups within a node
The “seek” operation is used for both range and point queries. So they are fundamentally the same.
// returns the first kid node whose range intersects the key. (kid[i] <= key)
// TODO: binary search
func nodeLookupLE(node BNode, key []byte) uint16 {
:= node.nkeys()
nkeys := uint16(0)
found // the first key is a copy from the parent node,
// thus it's always less than or equal to the key.
for i := uint16(1); i < nkeys; i++ {
:= bytes.Compare(node.getKey(i), key)
cmp if cmp <= 0 {
= i
found }
if cmp >= 0 {
break
}
}
return found
}
The function is called nodeLookupLE
because it uses the
Less-than-or-Equal operator. For point
queries, we should use the equal operator instead, which is a step we
can add later.
4.3 Update B+tree nodes
Insert into leaf nodes
Let’s consider inserting a key into a leaf node. The 1st step is to
use nodeLookupLE
to get the insert position. Then copy
everything to a new node with the extra key. That’s copy-on-write.
// add a new key to a leaf node
func leafInsert(
new BNode, old BNode, idx uint16,
[]byte, val []byte,
key ) {
new.setHeader(BNODE_LEAF, old.nkeys()+1) // setup the header
(new, old, 0, 0, idx)
nodeAppendRange(new, idx, 0, key, val)
nodeAppendKV(new, old, idx+1, idx, old.nkeys()-idx)
nodeAppendRange}
Node copying functions
nodeAppendRange
copies a range of KVs and
nodeAppendKV
copies a KV pair. This must be done in order
because these functions rely on the previous offset.
// copy a KV into the position
func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {
// ptrs
new.setPtr(idx, ptr)
// KVs
:= new.kvPos(idx)
pos .LittleEndian.PutUint16(new[pos+0:], uint16(len(key)))
binary.LittleEndian.PutUint16(new[pos+2:], uint16(len(val)))
binarycopy(new[pos+4:], key)
copy(new[pos+4+uint16(len(key)):], val)
// the offset of the next key
new.setOffset(idx+1, new.getOffset(idx)+4+uint16((len(key)+len(val))))
}
// copy multiple KVs into the position from the old node
func nodeAppendRange(
new BNode, old BNode,
uint16, srcOld uint16, n uint16,
dstNew )
Update internal nodes
For internal nodes, the link to the child node is always updated with the copy-on-write scheme, which can become multiple links if the child node is split.
// replace a link with one or multiple links
func nodeReplaceKidN(
*BTree, new BNode, old BNode, idx uint16,
tree ...BNode,
kids ) {
:= uint16(len(kids))
inc new.setHeader(BNODE_NODE, old.nkeys()+inc-1)
(new, old, 0, 0, idx)
nodeAppendRangefor i, node := range kids {
(new, idx+uint16(i), tree.new(node), node.getKey(0), nil)
nodeAppendKV// ^position ^pointer ^key ^val
}
(new, old, idx+inc, idx+1, old.nkeys()-(idx+1))
nodeAppendRange}
Note that the tree.new
callback is used to allocate the
child nodes.
4.4 Split B+tree nodes
Due to the size limits we imposed, a node can host at least 1 KV pair. In the worst case, an oversized node will be split into 3 nodes, with a large KV in the middle. So we may have to split it 2 times.
// split a oversized node into 2 so that the 2nd node always fits on a page
func nodeSplit2(left BNode, right BNode, old BNode) {
// code omitted...
}
// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
if old.nbytes() <= BTREE_PAGE_SIZE {
= old[:BTREE_PAGE_SIZE]
old return 1, [3]BNode{old} // not split
}
:= BNode(make([]byte, 2*BTREE_PAGE_SIZE)) // might be split later
left := BNode(make([]byte, BTREE_PAGE_SIZE))
right (left, right, old)
nodeSplit2if left.nbytes() <= BTREE_PAGE_SIZE {
= left[:BTREE_PAGE_SIZE]
left return 2, [3]BNode{left, right} // 2 nodes
}
:= BNode(make([]byte, BTREE_PAGE_SIZE))
leftleft := BNode(make([]byte, BTREE_PAGE_SIZE))
middle (leftleft, middle, left)
nodeSplit2(leftleft.nbytes() <= BTREE_PAGE_SIZE)
assertreturn 3, [3]BNode{leftleft, middle, right} // 3 nodes
}
Note that the returned nodes are allocated from memory; they are just
temporary data until nodeReplaceKidN
actually allocates
them.
4.5 B+tree insertion
We’ve implemented 3 node operations:
leafInsert
updates a leaf node.nodeReplaceKidN
updates an internal node.nodeSplit3
splits an oversized node.
Let’s put them together for a full B+tree insertion, which starts with key lookups in the root node until it reaches a leaf.
// insert a KV into a node, the result might be split.
// the caller is responsible for deallocating the input node
// and splitting and allocating result nodes.
func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
// the result node.
// it's allowed to be bigger than 1 page and will be split if so
new := BNode{data: make([]byte, 2*BTREE_PAGE_SIZE)}
// where to insert the key?
:= nodeLookupLE(node, key)
idx // act depending on the node type
switch node.btype() {
case BNODE_LEAF:
// leaf, node.getKey(idx) <= key
if bytes.Equal(key, node.getKey(idx)) {
// found the key, update it.
(new, node, idx, key, val)
leafUpdate} else {
// insert it after the position.
(new, node, idx+1, key, val)
leafInsert}
case BNODE_NODE:
// internal node, insert it to a kid node.
(tree, new, node, idx, key, val)
nodeInsertdefault:
panic("bad node!")
}
return new
}
leafUpdate
is similar to leafInsert
; it
updates an existing key instead of inserting a duplicate key.
// part of the treeInsert(): KV insertion to an internal node
func nodeInsert(
*BTree, new BNode, node BNode, idx uint16,
tree []byte, val []byte,
key ) {
:= node.getPtr(idx)
kptr // recursive insertion to the kid node
:= treeInsert(tree, tree.get(kptr), key, val)
knode // split the result
, split := nodeSplit3(knode)
nsplit// deallocate the kid node
.del(kptr)
tree// update the kid links
(tree, new, node, idx, split[:nsplit]...)
nodeReplaceKidN}
Internal nodes are handled recursively, each call returns an updated node, and the caller will split it if it’s oversized and handle the allocation/deallocation.
4.6 What’s next?
The work is almost done. We just need to add these in the next chapter:
- Node merging and tree deletion.
- A high-level interface.
- Fake node callbacks for tests.