07. Free List: Recyle & Reuse
The last step of the KV store is to reuse deleted pages, which is also a problem for in-memory data structures.
7.1 Memory management techniques
What we will do
Memory (space) management can be either manual or automatic. A garbage collector is automatic, it detects unused objects without any help from the programmer. The next problem is how to deal with (reuse) unused objects.
We don’t need a GC, because in a tree data structure, detecting
unused nodes is trivial, as we have already done with the
BTree.del callback. What we’ll do is to reimplement those
callbacks.
List of unused objects
One of the reasons that disk space is managed in pages of
the same size is that they become interchangeable after they
are deleted; the DB can reuse any of them when it needs a page. This is
simpler than generic memory management routines such as
malloc, which deal with arbitrary sizes.
We need to store a list of unused pages, called a free list or object pool. For in-memory data, this can simply be an array of pointers, or a linked list embedded in objects.
Embedded linked list
The simplest scheme is to use an embedded (intrusive) linked list. The list pointer sits inside the object itself; it borrows space from the object, so no extra space is needed for the data structure.
head
↓
[ next | space... ] (unused object 1)
↓
[ next | space... ] (unused object 2)
↓
...
However, this conflicts with copy-on-write, as it overwrites during an update.
External list
The other scheme is to store pointers to unused pages in an external data structure. The external data structure itself takes up space, which is a problem we’ll solve.
Let’s say our free list is just a log of unused page numbers; adding items is just appending. The problem is how to remove items so that it doesn’t grow infinitely.
7.2 Linked list on disk
Free list requirements
Let’s image the free list as a sequence of items, like a log. In a copy-on-write tree, each update requires new nodes and deletes old nodes, so the free list is both added to and removed from per update. If items are removed from the end, then the added items overwrite old data, requiring extra crash recovery mechanisms discussed in chapter 03.
If items are removed from the beginning, how do you reclaim the space from the removed items? We’re back to the original problem.
To solve the problem, the free list should also be page-based, so that it can manage itself. A page-based list is just a linked list, except that a page can hold multiple items, like a B+tree node. This is also called an unrolled linked list.
In summary:
- Our free list is a standalone data structure: a linked list of
pages.
- It will try to get a page from itself when it grows a new node.
- Removed list nodes are added to itself for reuse.
- Each page can contain multiple items (page numbers).
- Pages are updated in-place, but it’s still append-only within a page.
- Items are appended to the tail node and consumed from the head.
- It’s easier to make the tail node append-only this way.
Free list disk layout
Each node starts with a pointer to the next node. Items are appended next to it.
// node format:
// | next | pointers | unused |
// | 8B | n*8B | ... |
type LNode []byte
const FREE_LIST_HEADER = 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8
// getters & setters
func (node LNode) getNext() uint64
func (node LNode) setNext(next uint64)
func (node LNode) getPtr(idx int) uint64
func (node LNode) setPtr(idx int, ptr uint64)We also store pointers to both the head node and the tail node in the meta page. The pointer to the tail node is needed for O(1) insertion.
first_item
↓
head_page -> [ next | xxxxx ]
↓
[ next | xxxxxxxx ]
↓
tail_page -> [ NULL | xxxx ]
↑
last_item
Update free list nodes
Without the free list, the meta page is the only page that is updated in-place, which is how copy-on-write made crash recovery easy. Now there are 2 more in-place page updates in list nodes: the next pointer and the appended items.
Although list nodes are updated in-place, no data is overwritten within a page. So if an update is interrupted, the meta page still points to the same data; no extra crash recovery is needed. And unlike the meta page, atomicity is not required.
Following this analysis, the embedded list can also work iff the next pointer is reserved in the B+tree node. Here you can deviate from the book. Although this doubles write amplification.
7.3 Free list implementation
Free list interface
type KV struct {
Path string
// internals
fd int
tree BTree
free FreeList // added
// ...
}FreeList is the extra data structure in
KV.
type FreeList struct {
// callbacks for managing on-disk pages
get func(uint64) []byte // read a page
new func([]byte) uint64 // append a new page
set func(uint64) []byte // update an existing page
// persisted data in the meta page
headPage uint64 // pointer to the list head node
headSeq uint64 // monotonic sequence number to index into the list head
tailPage uint64
tailSeq uint64
// in-memory states
maxSeq uint64 // saved `tailSeq` to prevent consuming newly added items
}
// get 1 item from the list head. return 0 on failure.
func (fl *FreeList) PopHead() uint64
// add 1 item to the tail
func (fl *FreeList) PushTail(ptr uint64)Like BTree, page management is isolated via 3
callbacks:
getreads a page; same as before,newappends a page; previously used forBTree.setreturns a writable buffer to capture in-place updates.delis not there because the free list manages free pages itself.
func (db *KV) Open() error {
// ...
// B+tree callbacks
db.tree.get = db.pageRead // read a page
db.tree.new = db.pageAlloc // (new) reuse from the free list or append
db.tree.del = db.free.PushTail // (new) freed pages go to the free list
// free list callbacks
db.free.get = db.pageRead // read a page
db.free.new = db.pageAppend // append a page
db.free.set = db.pageWrite // (new) in-place updates
// ...
}Free list data structure
Since a node contains a variable number of items up to
FREE_LIST_CAP, we need to know where the 1st item is in the
head node (headSeq), and where the items end in the tail
node (tailSeq).
type FreeList struct {
// ...
// persisted data in the meta page
headPage uint64 // pointer to the list head node
headSeq uint64 // monotonic sequence number to index into the list head
tailPage uint64
tailSeq uint64
// in-memory states
maxSeq uint64 // saved `tailSeq` to prevent consuming newly added items
}headSeq, tailSeq are indexes into the head
and tail nodes, except that they are monotonically increasing. So the
wrapped-around index is:
func seq2idx(seq uint64) int {
return int(seq % FREE_LIST_CAP)
}We make them monotonically increasing so that they become a unique identifier of the list position; to prevent the list head from overrunning the list tail, simply compare the 2 sequence numbers.
During an update, the list is both added to and removed from, and when we remove from the head, we cannot remove what we just added to the tail. So we need to …
- At the beginning of the update, save the original
tailSeqtomaxSeq. - During the update,
headSeqcannot overrunmaxSeq. - At the beginning of the next update,
maxSeqis advanced totailSeq. - …
// make the newly added items available for consumption
func (fl *FreeList) SetMaxSeq() {
fl.maxSeq = fl.tailSeq
}Consuming from the free list
Removing an item from the head node is simply advancing
headSeq. And when the head node becomes empty, move to the
next node.
// remove 1 item from the head node, and remove the head node if empty.
func flPop(fl *FreeList) (ptr uint64, head uint64) {
if fl.headSeq == fl.maxSeq {
return 0, 0 // cannot advance
}
node := LNode(fl.get(fl.headPage))
ptr = node.getPtr(seq2idx(fl.headSeq)) // item
fl.headSeq++
// move to the next one if the head node is empty
if seq2idx(fl.headSeq) == 0 {
head, fl.headPage = fl.headPage, node.getNext()
assert(fl.headPage != 0)
}
return
}The free list self-manages; the removed head node is fed back to itself.
// get 1 item from the list head. return 0 on failure.
func (fl *FreeList) PopHead() uint64 {
ptr, head := flPop(fl)
if head != 0 { // the empty head node is recycled
fl.PushTail(head)
}
return ptr
}What if the last node is removed? A linked list with 0 nodes implies
nasty special cases. In practice, it’s easier to design the
linked list to have at least 1 node than to deal with special
cases. That’s why we assert(fl.headPage != 0).
Pushing into the free list
Appending an item to the tail node is simply advancing
tailSeq. And when the tail node is full, we immediately add
a new empty tail node to ensure that there is at least 1 node in case
the previous tail node is removed as a head node.
func (fl *FreeList) PushTail(ptr uint64) {
// add it to the tail node
LNode(fl.set(fl.tailPage)).setPtr(seq2idx(fl.tailSeq), ptr)
fl.tailSeq++
// add a new tail node if it's full (the list is never empty)
if seq2idx(fl.tailSeq) == 0 {
// try to reuse from the list head
next, head := flPop(fl) // may remove the head node
if next == 0 {
// or allocate a new node by appending
next = fl.new(make([]byte, BTREE_PAGE_SIZE))
}
// link to the new tail node
LNode(fl.set(fl.tailPage)).setNext(next)
fl.tailPage = next
// also add the head node if it's removed
if head != 0 {
LNode(fl.set(fl.tailPage)).setPtr(0, head)
fl.tailSeq++
}
}
}Again, the free list is self-managing: it will try to get a node from itself for the new tail node before resorting to appending.
7.4 KV with a free list
Page management
Now that pages can be reused, reused pages are overwritten in-place, so a map is used to capture pending updates.
type KV struct {
// ...
page struct {
flushed uint64 // database size in number of pages
nappend uint64 // number of pages to be appended
updates map[uint64][]byte // pending updates, including appended pages
}
}BTree.new is now KV.pageAlloc, it uses the
free list before resorting to appending.
// `BTree.new`, allocate a new page.
func (db *KV) pageAlloc(node []byte) uint64 {
if ptr := db.free.PopHead(); ptr != 0 { // try the free list
db.page.updates[ptr] = node
return ptr
}
return db.pageAppend(node) // append
}KV.pageWrite returns a writable page copy to capture
in-place updates.
// `FreeList.set`, update an existing page.
func (db *KV) pageWrite(ptr uint64) []byte {
if node, ok := db.page.updates[ptr]; ok {
return node // pending update
}
node := make([]byte, BTREE_PAGE_SIZE)
copy(node, db.pageReadFile(ptr)) // initialized from the file
db.page.updates[ptr] = node
return node
}Another change is that we may read a page again after it has been
updated, so KV.pageRead should consult the pending updates
map first.
// `BTree.get`, read a page.
func (db *KV) pageRead(ptr uint64) []byte {
if node, ok := db.page.updates[ptr]; ok {
return node // pending update
}
return db.pageReadFile(ptr)
}
func (db *KV) pageReadFile(ptr uint64) []byte {
// same as `KV.pageRead` in the last chapter ...
}Update the meta page
The meta page now includes free list pointers (head and tail) that are updated atomically along with the tree root.
| sig | root_ptr | page_used | head_page | head_seq | tail_page | tail_seq |
| 16B | 8B | 8B | 8B | 8B | 8B | 8B |
Remember that the free list always contains at least 1 node, we’ll assign an empty node to it when initializing an empty DB.
func readRoot(db *KV, fileSize int64) error {
if fileSize == 0 { // empty file
// reserve 2 pages: the meta page and a free list node
db.page.flushed = 2
// add an initial node to the free list so it's never empty
db.free.headPage = 1 // the 2nd page
db.free.tailPage = 1
return nil // the meta page will be written in the 1st update
}
// ...
}Since headSeq is blocked by maxSeq,
maxSeq is updated to tailSeq between
updates to allow reuse of pages from the last version.
func updateFile(db *KV) error {
// ...
// prepare the free list for the next update
db.free.SetMaxSeq()
return nil
}We still assume sequential access in this chapter. When we add
concurrency later, headSeq will be blocked by the oldest
reader instead.
7.5 Conclusion of the KV store
What we have done:
- File layout for a copy-on-write B+tree.
- Durability and atomicity with
fsync. - Managing disk pages with a free list.
That’s enough for a KV store with get, set,
del. But there is more in part II:
- Relational DB on KV store.
- Concurrent transactions.