07. Free List: Reusing Pages
Since our B-tree is immutable, every update to the KV store creates new nodes in the path instead of updating current nodes, leaving some nodes unreachable from the latest version. We need to reuse these unreachable nodes from old versions, otherwise, the database file will grow indefinitely.
7.1 Design the Free List
To reuse these pages, we’ll add a persistent free list to keep track of unused pages. Update operations reuse pages from the list before appending new pages, and unused pages from the current version are added to the list.
The list is used as a stack (first-in-last-out), each update operation can both remove from and add to the top of the list.
// number of items in the list
func (fl *FreeList) Total() int
// get the nth pointer
func (fl *FreeList) Get(topn int) uint64
// remove `popn` pointers and add some new pointers
func (fl *FreeList) Update(popn int, freed []uint64)
The free list is also immutable like our B-tree. Each node contains:
- Multiple pointers to unused pages.
- The link to the next node.
- The total number of items in the list. This only applies to the head node.
| node1 | | node2 | | node3 |
+-----------+ +-----------+ +-----------+
| total=xxx | | | | |
| next=yyy | ==> | next=qqq | ==> | next=eee | ==> ...
| size=zzz | | size=ppp | | size=rrr |
| pointers | | pointers | | pointers |
The node format:
| type | size | total | next | pointers |
| 2B | 2B | 8B | 8B | size * 8B |
const BNODE_FREE_LIST = 3
const FREE_LIST_HEADER = 4 + 8 + 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8
Functions for accessing the list node:
func flnSize(node BNode) int
func flnNext(node BNode) uint64
func flnPtr(node BNode, idx int)
func flnSetPtr(node BNode, idx int, ptr uint64)
func flnSetHeader(node BNode, size uint16, next uint64)
func flnSetTotal(node BNode, total uint64)
7.2 The Free List Datatype
The FreeList
type consists of the pointer to the head
node and callbacks for managing disk pages.
type FreeList struct {
uint64
head // callbacks for managing on-disk pages
func(uint64) BNode // dereference a pointer
get new func(BNode) uint64 // append a new page
func(uint64, BNode) // reuse a page
use }
These callbacks are different from the B-tree because the pages used by the list are managed by the list itself.
- The
new
callback is only for appending new pages since the free list must reuse pages from itself. - There is no
del
callback because the free list adds unused pages to itself. - The
use
callback registers a pending update to a reused page.
type BTree struct {
// pointer (a nonzero page number)
uint64
root // callbacks for managing on-disk pages
func(uint64) BNode // dereference a pointer
get new func(BNode) uint64 // allocate a new page
func(uint64) // deallocate a page
del }
7.3 The Free List Implementation
Getting the nth item from the list is just a simple list traversal.
func (fl *FreeList) Get(topn int) uint64 {
(0 <= topn && topn < fl.Total())
assert:= fl.get(fl.head)
node for flnSize(node) <= topn {
-= flnSize(node)
topn := flnNext(node)
next (next != 0)
assert= fl.get(next)
node }
return flnPtr(node, flnSize(node)-topn-1)
}
Updating the list is tricky. It first removes popn
items
from the list, then adds the freed
to the list, which can
be divided into 3 phases:
- If the head node is larger than
popn
, remove it. The node itself will be added to the list later. Repeat this step until it is not longer possible. - We may need to remove some items from the list and possibly add some new items to the list. Updating the list head requires new pages, and new pages should be reused from the items of the list itself. Pop some items from the list one by one until there are enough pages to reuse for the next phase.
- Modify the list by adding new nodes.
// remove `popn` pointers and add some new pointers
func (fl *FreeList) Update(popn int, freed []uint64) {
(popn <= fl.Total())
assertif popn == 0 && len(freed) == 0 {
return // nothing to do
}
// prepare to construct the new list
:= fl.Total()
total := []uint64{}
reuse for fl.head != 0 && len(reuse)*FREE_LIST_CAP < len(freed) {
:= fl.get(fl.head)
node = append(freed, fl.head) // recyle the node itself
freed if popn >= flnSize(node) {
// phase 1
// remove all pointers in this node
-= flnSize(node)
popn } else {
// phase 2:
// remove some pointers
:= flnSize(node) - popn
remain = 0
popn // reuse pointers from the free list itself
for remain > 0 && len(reuse)*FREE_LIST_CAP < len(freed)+remain {
--
remain= append(reuse, flnPtr(node, remain))
reuse }
// move the node into the `freed` list
for i := 0; i < remain; i++ {
= append(freed, flnPtr(node, i))
freed }
}
// discard the node and move to the next node
-= flnSize(node)
total .head = flnNext(node)
fl}
(len(reuse)*FREE_LIST_CAP >= len(freed) || fl.head == 0)
assert
// phase 3: prepend new nodes
(fl, freed, reuse)
flPush
// done
(fl.get(fl.head), uint64(total+len(freed)))
flnSetTotal}
func flPush(fl *FreeList, freed []uint64, reuse []uint64) {
for len(freed) > 0 {
new := BNode{make([]byte, BTREE_PAGE_SIZE)}
// construct a new node
:= len(freed)
size if size > FREE_LIST_CAP {
= FREE_LIST_CAP
size }
(new, uint16(size), fl.head)
flnSetHeaderfor i, ptr := range freed[:size] {
(new, i, ptr)
flnSetPtr}
= freed[size:]
freed
if len(reuse) > 0 {
// reuse a pointer from the list
.head, reuse = reuse[0], reuse[1:]
fl.use(fl.head, new)
fl} else {
// or append a page to house the new node
.head = fl.new(new)
fl}
}
(len(reuse) == 0)
assert}
7.4 Manage Disk Pages
Step 1: Modify the Data Structure
The data structure is modified. Temporary pages are kept in a map keyed by their assigned page numbers. And removed page numbers are also there.
type KV struct {
// omitted...
struct {
page uint64 // database size in number of pages
flushed int // number of pages taken from the free list
nfree int // number of pages to be appended
nappend // newly allocated or deallocated pages keyed by the pointer.
// nil value denotes a deallocated page.
map[uint64][]byte
updates }
}
Step 2: Page Management for B-Tree
The pageGet
function is modified to also return
temporary pages because the free list code depends on this behavior.
// callback for BTree & FreeList, dereference a pointer.
func (db *KV) pageGet(ptr uint64) BNode {
if page, ok := db.page.updates[ptr]; ok {
(page != nil)
assertreturn BNode{page} // for new pages
}
return pageGetMapped(db, ptr) // for written pages
}
func pageGetMapped(db *KV, ptr uint64) BNode {
:= uint64(0)
start for _, chunk := range db.mmap.chunks {
:= start + uint64(len(chunk))/BTREE_PAGE_SIZE
end if ptr < end {
:= BTREE_PAGE_SIZE * (ptr - start)
offset return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
}
= end
start }
panic("bad ptr")
}
The function for allocating a B-tree page is changed to reuse pages from the free list first.
// callback for BTree, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
(len(node.data) <= BTREE_PAGE_SIZE)
assert:= uint64(0)
ptr if db.page.nfree < db.free.Total() {
// reuse a deallocated page
= db.free.Get(db.page.nfree)
ptr .page.nfree++
db} else {
// append a new page
= db.page.flushed + uint64(db.page.nappend)
ptr .page.nappend++
db}
.page.updates[ptr] = node.data
dbreturn ptr
}
Removed pages are marked for the free list update later.
// callback for BTree, deallocate a page.
func (db *KV) pageDel(ptr uint64) {
.page.updates[ptr] = nil
db}
Step 3: Page Management for the Free List
Callbacks for appending a new page and reusing a page for the free list:
// callback for FreeList, allocate a new page.
func (db *KV) pageAppend(node BNode) uint64 {
(len(node.data) <= BTREE_PAGE_SIZE)
assert:= db.page.flushed + uint64(db.page.nappend)
ptr .page.nappend++
db.page.updates[ptr] = node.data
dbreturn ptr
}
// callback for FreeList, reuse a page.
func (db *KV) pageUse(ptr uint64, node BNode) {
.page.updates[ptr] = node.data
db}
Step 4: Update the Free List
Before extending the file and writing pages to disk, we must update the free list first since it also creates pending writes.
func writePages(db *KV) error {
// update the free list
:= []uint64{}
freed for ptr, page := range db.page.updates {
if page == nil {
= append(freed, ptr)
freed }
}
.free.Update(db.page.nfree, freed)
db
// extend the file & mmap if needed
// omitted...
// copy pages to the file
for ptr, page := range db.page.updates {
if page != nil {
copy(pageGetMapped(db, ptr).data, page)
}
}
return nil
}
The pointer to the list head is added to the master page:
| sig | btree_root | page_used | free_list |
| 16B | 8B | 8B | 8B |
Step 5: Done
The KV store is finished. It is persistent and crash resistant, although it can only be accessed sequentially.
There is more to learn in part II of the book:
- Relational DB on the KV store.
- Concurrent access to the database and transactions.
codecrafters.io offers “Build Your Own X” courses in many programming languages.
Including Redis, Git, SQLite, Docker, and more.