Build Your Own Database From Scratch EBook·Paperback
⟵ prev Contents next ⟶

🔥 My other Book: Build Your Own Redis

07. Free List: Reusing Pages

Since our B-tree is immutable, every update to the KV store creates new nodes in the path instead of updating current nodes, leaving some nodes unreachable from the latest version. We need to reuse these unreachable nodes from old versions, otherwise, the database file will grow indefinitely.

7.1 Design the Free List

To reuse these pages, we’ll add a persistent free list to keep track of unused pages. Update operations reuse pages from the list before appending new pages, and unused pages from the current version are added to the list.

The list is used as a stack (first-in-last-out), each update operation can both remove from and add to the top of the list.

// number of items in the list
func (fl *FreeList) Total() int
// get the nth pointer
func (fl *FreeList) Get(topn int) uint64
// remove `popn` pointers and add some new pointers
func (fl *FreeList) Update(popn int, freed []uint64)

The free list is also immutable like our B-tree. Each node contains:

  1. Multiple pointers to unused pages.
  2. The link to the next node.
  3. The total number of items in the list. This only applies to the head node.
|   node1   |     |   node2   |     |   node3   |
+-----------+     +-----------+     +-----------+
| total=xxx |     |           |     |           |
|  next=yyy | ==> |  next=qqq | ==> |  next=eee | ==> ...
|  size=zzz |     |  size=ppp |     |  size=rrr |
|  pointers |     |  pointers |     |  pointers |

The node format:

| type | size | total | next |  pointers |
|  2B  |  2B  |   8B  |  8B  | size * 8B |
const BNODE_FREE_LIST = 3
const FREE_LIST_HEADER = 4 + 8 + 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8

Functions for accessing the list node:

func flnSize(node BNode) int
func flnNext(node BNode) uint64
func flnPtr(node BNode, idx int)
func flnSetPtr(node BNode, idx int, ptr uint64)
func flnSetHeader(node BNode, size uint16, next uint64)
func flnSetTotal(node BNode, total uint64)

7.2 The Free List Datatype

The FreeList type consists of the pointer to the head node and callbacks for managing disk pages.

type FreeList struct {
    head uint64
    // callbacks for managing on-disk pages
    get func(uint64) BNode  // dereference a pointer
    new func(BNode) uint64  // append a new page
    use func(uint64, BNode) // reuse a page
}

These callbacks are different from the B-tree because the pages used by the list are managed by the list itself.

type BTree struct {
    // pointer (a nonzero page number)
    root uint64
    // callbacks for managing on-disk pages
    get func(uint64) BNode // dereference a pointer
    new func(BNode) uint64 // allocate a new page
    del func(uint64)       // deallocate a page
}

7.3 The Free List Implementation

Getting the nth item from the list is just a simple list traversal.

func (fl *FreeList) Get(topn int) uint64 {
    assert(0 <= topn && topn < fl.Total())
    node := fl.get(fl.head)
    for flnSize(node) <= topn {
        topn -= flnSize(node)
        next := flnNext(node)
        assert(next != 0)
        node = fl.get(next)
    }
    return flnPtr(node, flnSize(node)-topn-1)
}

Updating the list is tricky. It first removes popn items from the list, then adds the freed to the list, which can be divided into 3 phases:

  1. If the head node is larger than popn, remove it. The node itself will be added to the list later. Repeat this step until it is not longer possible.
  2. We may need to remove some items from the list and possibly add some new items to the list. Updating the list head requires new pages, and new pages should be reused from the items of the list itself. Pop some items from the list one by one until there are enough pages to reuse for the next phase.
  3. Modify the list by adding new nodes.
// remove `popn` pointers and add some new pointers
func (fl *FreeList) Update(popn int, freed []uint64) {
    assert(popn <= fl.Total())
    if popn == 0 && len(freed) == 0 {
        return // nothing to do
    }

    // prepare to construct the new list
    total := fl.Total()
    reuse := []uint64{}
    for fl.head != 0 && len(reuse)*FREE_LIST_CAP < len(freed) {
        node := fl.get(fl.head)
        freed = append(freed, fl.head) // recyle the node itself
        if popn >= flnSize(node) {
            // phase 1
            // remove all pointers in this node
            popn -= flnSize(node)
        } else {
            // phase 2:
            // remove some pointers
            remain := flnSize(node) - popn
            popn = 0
            // reuse pointers from the free list itself
            for remain > 0 && len(reuse)*FREE_LIST_CAP < len(freed)+remain {
                remain--
                reuse = append(reuse, flnPtr(node, remain))
            }
            // move the node into the `freed` list
            for i := 0; i < remain; i++ {
                freed = append(freed, flnPtr(node, i))
            }
        }
        // discard the node and move to the next node
        total -= flnSize(node)
        fl.head = flnNext(node)
    }
    assert(len(reuse)*FREE_LIST_CAP >= len(freed) || fl.head == 0)

    // phase 3: prepend new nodes
    flPush(fl, freed, reuse)

    // done
    flnSetTotal(fl.get(fl.head), uint64(total+len(freed)))
}
func flPush(fl *FreeList, freed []uint64, reuse []uint64) {
    for len(freed) > 0 {
        new := BNode{make([]byte, BTREE_PAGE_SIZE)}

        // construct a new node
        size := len(freed)
        if size > FREE_LIST_CAP {
            size = FREE_LIST_CAP
        }
        flnSetHeader(new, uint16(size), fl.head)
        for i, ptr := range freed[:size] {
            flnSetPtr(new, i, ptr)
        }
        freed = freed[size:]

        if len(reuse) > 0 {
            // reuse a pointer from the list
            fl.head, reuse = reuse[0], reuse[1:]
            fl.use(fl.head, new)
        } else {
            // or append a page to house the new node
            fl.head = fl.new(new)
        }
    }
    assert(len(reuse) == 0)
}

7.4 Manage Disk Pages

Step 1: Modify the Data Structure

The data structure is modified. Temporary pages are kept in a map keyed by their assigned page numbers. And removed page numbers are also there.

type KV struct {
    // omitted...
    page struct {
        flushed uint64 // database size in number of pages
        nfree   int    // number of pages taken from the free list
        nappend int    // number of pages to be appended
        // newly allocated or deallocated pages keyed by the pointer.
        // nil value denotes a deallocated page.
        updates map[uint64][]byte
    }
}

Step 2: Page Management for B-Tree

The pageGet function is modified to also return temporary pages because the free list code depends on this behavior.

// callback for BTree & FreeList, dereference a pointer.
func (db *KV) pageGet(ptr uint64) BNode {
    if page, ok := db.page.updates[ptr]; ok {
        assert(page != nil)
        return BNode{page} // for new pages
    }
    return pageGetMapped(db, ptr) // for written pages
}

func pageGetMapped(db *KV, ptr uint64) BNode {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
        }
        start = end
    }
    panic("bad ptr")
}

The function for allocating a B-tree page is changed to reuse pages from the free list first.

// callback for BTree, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := uint64(0)
    if db.page.nfree < db.free.Total() {
        // reuse a deallocated page
        ptr = db.free.Get(db.page.nfree)
        db.page.nfree++
    } else {
        // append a new page
        ptr = db.page.flushed + uint64(db.page.nappend)
        db.page.nappend++
    }
    db.page.updates[ptr] = node.data
    return ptr
}

Removed pages are marked for the free list update later.

// callback for BTree, deallocate a page.
func (db *KV) pageDel(ptr uint64) {
    db.page.updates[ptr] = nil
}

Step 3: Page Management for the Free List

Callbacks for appending a new page and reusing a page for the free list:

// callback for FreeList, allocate a new page.
func (db *KV) pageAppend(node BNode) uint64 {
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := db.page.flushed + uint64(db.page.nappend)
    db.page.nappend++
    db.page.updates[ptr] = node.data
    return ptr
}

// callback for FreeList, reuse a page.
func (db *KV) pageUse(ptr uint64, node BNode) {
    db.page.updates[ptr] = node.data
}

Step 4: Update the Free List

Before extending the file and writing pages to disk, we must update the free list first since it also creates pending writes.

func writePages(db *KV) error {
    // update the free list
    freed := []uint64{}
    for ptr, page := range db.page.updates {
        if page == nil {
            freed = append(freed, ptr)
        }
    }
    db.free.Update(db.page.nfree, freed)

    // extend the file & mmap if needed
    // omitted...

    // copy pages to the file
    for ptr, page := range db.page.updates {
        if page != nil {
            copy(pageGetMapped(db, ptr).data, page)
        }
    }
    return nil
}

The pointer to the list head is added to the master page:

| sig | btree_root | page_used | free_list |
| 16B |     8B     |     8B    |     8B    |

Step 5: Done

The KV store is finished. It is persistent and crash resistant, although it can only be accessed sequentially.

There is more to learn in part II of the book:

See also:
codecrafters.io offers “Build Your Own X” courses in many programming languages.
Including Redis, Git, SQLite, Docker, and more.
Check it out

⟵ prev Contents next ⟶