07. Free List: Recyle & Reuse

The last step of the KV store is to reuse deleted pages, which is also a problem for in-memory data structures.

7.1 Memory management techniques

What we will do

Memory (space) management can be either manual or automatic. A garbage collector is automatic, it detects unused objects without any help from the programmer. The next problem is how to deal with (reuse) unused objects.

We don’t need a GC, because in a tree data structure, detecting unused nodes is trivial, as we have already done with the BTree.del callback. What we’ll do is to reimplement those callbacks.

List of unused objects

One of the reasons that disk space is managed in pages of the same size is that they become interchangeable after they are deleted; the DB can reuse any of them when it needs a page. This is simpler than generic memory management routines such as malloc, which deal with arbitrary sizes.

We need to store a list of unused pages, called a free list or object pool. For in-memory data, this can simply be an array of pointers, or a linked list embedded in objects.

Embedded linked list

The simplest scheme is to use an embedded (intrusive) linked list. The list pointer sits inside the object itself; it borrows space from the object, so no extra space is needed for the data structure.

  head
   ↓
[ next | space... ] (unused object 1)
   ↓
[ next | space... ] (unused object 2)
   ↓
  ...

However, this conflicts with copy-on-write, as it overwrites during an update.

External list

The other scheme is to store pointers to unused pages in an external data structure. The external data structure itself takes up space, which is a problem we’ll solve.

Let’s say our free list is just a log of unused page numbers; adding items is just appending. The problem is how to remove items so that it doesn’t grow infinitely.

7.2 Linked list on disk

Free list requirements

Let’s image the free list as a sequence of items, like a log. In a copy-on-write tree, each update requires new nodes and deletes old nodes, so the free list is both added to and removed from per update. If items are removed from the end, then the added items overwrite old data, requiring extra crash recovery mechanisms discussed in chapter 03.

If items are removed from the beginning, how do you reclaim the space from the removed items? We’re back to the original problem.

To solve the problem, the free list should also be page-based, so that it can manage itself. A page-based list is just a linked list, except that a page can hold multiple items, like a B+tree node. This is also called an unrolled linked list.

In summary:

  • Our free list is a standalone data structure: a linked list of pages.
    • It will try to get a page from itself when it grows a new node.
    • Removed list nodes are added to itself for reuse.
  • Each page can contain multiple items (page numbers).
    • Pages are updated in-place, but it’s still append-only within a page.
  • Items are appended to the tail node and consumed from the head.
    • It’s easier to make the tail node append-only this way.

Free list disk layout

Each node starts with a pointer to the next node. Items are appended next to it.

// node format:
// | next | pointers | unused |
// |  8B  |   n*8B   |   ...  |
type LNode []byte

const FREE_LIST_HEADER = 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8

// getters & setters
func (node LNode) getNext() uint64
func (node LNode) setNext(next uint64)
func (node LNode) getPtr(idx int) uint64
func (node LNode) setPtr(idx int, ptr uint64)

We also store pointers to both the head node and the tail node in the meta page. The pointer to the tail node is needed for O(1) insertion.

                     first_item
                         ↓
head_page -> [ next |    xxxxx ]
                ↓
             [ next | xxxxxxxx ]
                ↓
tail_page -> [ NULL | xxxx     ]
                         ↑
                     last_item

Update free list nodes

Without the free list, the meta page is the only page that is updated in-place, which is how copy-on-write made crash recovery easy. Now there are 2 more in-place page updates in list nodes: the next pointer and the appended items.

Although list nodes are updated in-place, no data is overwritten within a page. So if an update is interrupted, the meta page still points to the same data; no extra crash recovery is needed. And unlike the meta page, atomicity is not required.

Following this analysis, the embedded list can also work iff the next pointer is reserved in the B+tree node. Here you can deviate from the book. Although this doubles write amplification.

7.3 Free list implementation

Free list interface

type KV struct {
    Path   string
    // internals
    fd   int
    tree BTree
    free FreeList // added
    // ...
}

FreeList is the extra data structure in KV.

type FreeList struct {
    // callbacks for managing on-disk pages
    get func(uint64) []byte // read a page
    new func([]byte) uint64 // append a new page
    set func(uint64) []byte // update an existing page
    // persisted data in the meta page
    headPage uint64 // pointer to the list head node
    headSeq  uint64 // monotonic sequence number to index into the list head
    tailPage uint64
    tailSeq  uint64
    // in-memory states
    maxSeq uint64 // saved `tailSeq` to prevent consuming newly added items
}

// get 1 item from the list head. return 0 on failure.
func (fl *FreeList) PopHead() uint64
// add 1 item to the tail
func (fl *FreeList) PushTail(ptr uint64)

Like BTree, page management is isolated via 3 callbacks:

  • get reads a page; same as before,
  • new appends a page; previously used for BTree.
  • set returns a writable buffer to capture in-place updates.
  • del is not there because the free list manages free pages itself.
func (db *KV) Open() error {
    // ...
    // B+tree callbacks
    db.tree.get = db.pageRead      // read a page
    db.tree.new = db.pageAlloc     // (new) reuse from the free list or append
    db.tree.del = db.free.PushTail // (new) freed pages go to the free list
    // free list callbacks
    db.free.get = db.pageRead      // read a page
    db.free.new = db.pageAppend    // append a page
    db.free.set = db.pageWrite     // (new) in-place updates
    // ...
}

Free list data structure

Since a node contains a variable number of items up to FREE_LIST_CAP, we need to know where the 1st item is in the head node (headSeq), and where the items end in the tail node (tailSeq).

type FreeList struct {
    // ...
    // persisted data in the meta page
    headPage uint64 // pointer to the list head node
    headSeq  uint64 // monotonic sequence number to index into the list head
    tailPage uint64
    tailSeq  uint64
    // in-memory states
    maxSeq uint64 // saved `tailSeq` to prevent consuming newly added items
}

headSeq, tailSeq are indexes into the head and tail nodes, except that they are monotonically increasing. So the wrapped-around index is:

func seq2idx(seq uint64) int {
    return int(seq % FREE_LIST_CAP)
}

We make them monotonically increasing so that they become a unique identifier of the list position; to prevent the list head from overrunning the list tail, simply compare the 2 sequence numbers.

During an update, the list is both added to and removed from, and when we remove from the head, we cannot remove what we just added to the tail. So we need to …

  1. At the beginning of the update, save the original tailSeq to maxSeq.
  2. During the update, headSeq cannot overrun maxSeq.
  3. At the beginning of the next update, maxSeq is advanced to tailSeq.
// make the newly added items available for consumption
func (fl *FreeList) SetMaxSeq() {
    fl.maxSeq = fl.tailSeq
}

Consuming from the free list

Removing an item from the head node is simply advancing headSeq. And when the head node becomes empty, move to the next node.

// remove 1 item from the head node, and remove the head node if empty.
func flPop(fl *FreeList) (ptr uint64, head uint64) {
    if fl.headSeq == fl.maxSeq {
        return 0, 0 // cannot advance
    }
    node := LNode(fl.get(fl.headPage))
    ptr = node.getPtr(seq2idx(fl.headSeq)) // item
    fl.headSeq++
    // move to the next one if the head node is empty
    if seq2idx(fl.headSeq) == 0 {
        head, fl.headPage = fl.headPage, node.getNext()
        assert(fl.headPage != 0)
    }
    return
}

The free list self-manages; the removed head node is fed back to itself.

// get 1 item from the list head. return 0 on failure.
func (fl *FreeList) PopHead() uint64 {
    ptr, head := flPop(fl)
    if head != 0 { // the empty head node is recycled
        fl.PushTail(head)
    }
    return ptr
}

What if the last node is removed? A linked list with 0 nodes implies nasty special cases. In practice, it’s easier to design the linked list to have at least 1 node than to deal with special cases. That’s why we assert(fl.headPage != 0).

Pushing into the free list

Appending an item to the tail node is simply advancing tailSeq. And when the tail node is full, we immediately add a new empty tail node to ensure that there is at least 1 node in case the previous tail node is removed as a head node.

func (fl *FreeList) PushTail(ptr uint64) {
    // add it to the tail node
    LNode(fl.set(fl.tailPage)).setPtr(seq2idx(fl.tailSeq), ptr)
    fl.tailSeq++
    // add a new tail node if it's full (the list is never empty)
    if seq2idx(fl.tailSeq) == 0 {
        // try to reuse from the list head
        next, head := flPop(fl) // may remove the head node
        if next == 0 {
            // or allocate a new node by appending
            next = fl.new(make([]byte, BTREE_PAGE_SIZE))
        }
        // link to the new tail node
        LNode(fl.set(fl.tailPage)).setNext(next)
        fl.tailPage = next
        // also add the head node if it's removed
        if head != 0 {
            LNode(fl.set(fl.tailPage)).setPtr(0, head)
            fl.tailSeq++
        }
    }
}

Again, the free list is self-managing: it will try to get a node from itself for the new tail node before resorting to appending.

7.4 KV with a free list

Page management

Now that pages can be reused, reused pages are overwritten in-place, so a map is used to capture pending updates.

type KV struct {
    // ...
    page struct {
        flushed uint64            // database size in number of pages
        nappend uint64            // number of pages to be appended
        updates map[uint64][]byte // pending updates, including appended pages
    }
}

BTree.new is now KV.pageAlloc, it uses the free list before resorting to appending.

// `BTree.new`, allocate a new page.
func (db *KV) pageAlloc(node []byte) uint64 {
    if ptr := db.free.PopHead(); ptr != 0 { // try the free list
        db.page.updates[ptr] = node
        return ptr
    }
    return db.pageAppend(node) // append
}

KV.pageWrite returns a writable page copy to capture in-place updates.

// `FreeList.set`, update an existing page.
func (db *KV) pageWrite(ptr uint64) []byte {
    if node, ok := db.page.updates[ptr]; ok {
        return node // pending update
    }
    node := make([]byte, BTREE_PAGE_SIZE)
    copy(node, db.pageReadFile(ptr)) // initialized from the file
    db.page.updates[ptr] = node
    return node
}

Another change is that we may read a page again after it has been updated, so KV.pageRead should consult the pending updates map first.

// `BTree.get`, read a page.
func (db *KV) pageRead(ptr uint64) []byte {
    if node, ok := db.page.updates[ptr]; ok {
        return node // pending update
    }
    return db.pageReadFile(ptr)
}
func (db *KV) pageReadFile(ptr uint64) []byte {
    // same as `KV.pageRead` in the last chapter ...
}

Update the meta page

The meta page now includes free list pointers (head and tail) that are updated atomically along with the tree root.

| sig | root_ptr | page_used | head_page | head_seq | tail_page | tail_seq |
| 16B |    8B    |     8B    |     8B    |    8B    |     8B    |    8B    |

Remember that the free list always contains at least 1 node, we’ll assign an empty node to it when initializing an empty DB.

func readRoot(db *KV, fileSize int64) error {
    if fileSize == 0 { // empty file
        // reserve 2 pages: the meta page and a free list node
        db.page.flushed = 2
        // add an initial node to the free list so it's never empty
        db.free.headPage = 1 // the 2nd page
        db.free.tailPage = 1
        return nil // the meta page will be written in the 1st update
    }
    // ...
}

Since headSeq is blocked by maxSeq, maxSeq is updated to tailSeq between updates to allow reuse of pages from the last version.

func updateFile(db *KV) error {
    // ...
    // prepare the free list for the next update
    db.free.SetMaxSeq()
    return nil
}

We still assume sequential access in this chapter. When we add concurrency later, headSeq will be blocked by the oldest reader instead.

7.5 Conclusion of the KV store

What we have done:

  • File layout for a copy-on-write B+tree.
  • Durability and atomicity with fsync.
  • Managing disk pages with a free list.

That’s enough for a KV store with get, set, del. But there is more in part II:

  • Relational DB on KV store.
  • Concurrent transactions.