07. Free List: Recyle & Reuse
The last step of the KV store is to reuse deleted pages, which is also a problem for in-memory data structures.
7.1 Memory management techniques
What we will do
Memory (space) management can be either manual or automatic. A garbage collector is automatic, it detects unused objects without any help from the programmer. The next problem is how to deal with (reuse) unused objects.
We don’t need a GC, because in a tree data structure, detecting
unused nodes is trivial, as we have already done with the
BTree.del
callback. What we’ll do is to reimplement those
callbacks.
List of unused objects
One of the reasons that disk space is managed in pages of
the same size is that they become interchangeable after they
are deleted; the DB can reuse any of them when it needs a page. This is
simpler than generic memory management routines such as
malloc
, which deal with arbitrary sizes.
We need to store a list of unused pages, called a free list or object pool. For in-memory data, this can simply be an array of pointers, or a linked list embedded in objects.
Embedded linked list
The simplest scheme is to use an embedded (intrusive) linked list. The list pointer sits inside the object itself; it borrows space from the object, so no extra space is needed for the data structure.
head
↓
[ next | space... ] (unused object 1)
↓
[ next | space... ] (unused object 2)
↓
...
However, this conflicts with copy-on-write, as it overwrites during an update.
External list
The other scheme is to store pointers to unused pages in an external data structure. The external data structure itself takes up space, which is a problem we’ll solve.
Let’s say our free list is just a log of unused page numbers; adding items is just appending. The problem is how to remove items so that it doesn’t grow infinitely.
7.2 Linked list on disk
Free list requirements
Let’s image the free list as a sequence of items, like a log. In a copy-on-write tree, each update requires new nodes and deletes old nodes, so the free list is both added to and removed from per update. If items are removed from the end, then the added items overwrite old data, requiring extra crash recovery mechanisms discussed in chapter 03.
If items are removed from the beginning, how do you reclaim the space from the removed items? We’re back to the original problem.
To solve the problem, the free list should also be page-based, so that it can manage itself. A page-based list is just a linked list, except that a page can hold multiple items, like a B+tree node. This is also called an unrolled linked list.
In summary:
- Our free list is a standalone data structure: a linked list of
pages.
- It will try to get a page from itself when it grows a new node.
- Removed list nodes are added to itself for reuse.
- Each page can contain multiple items (page numbers).
- Pages are updated in-place, but it’s still append-only within a page.
- Items are appended to the tail node and consumed from the head.
- It’s easier to make the tail node append-only this way.
Free list disk layout
Each node starts with a pointer to the next node. Items are appended next to it.
// node format:
// | next | pointers | unused |
// | 8B | n*8B | ... |
type LNode []byte
const FREE_LIST_HEADER = 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8
// getters & setters
func (node LNode) getNext() uint64
func (node LNode) setNext(next uint64)
func (node LNode) getPtr(idx int) uint64
func (node LNode) setPtr(idx int, ptr uint64)
We also store pointers to both the head node and the tail node in the meta page. The pointer to the tail node is needed for O(1) insertion.
first_item
↓
head_page -> [ next | xxxxx ]
↓
[ next | xxxxxxxx ]
↓
tail_page -> [ NULL | xxxx ]
↑
last_item
Update free list nodes
Without the free list, the meta page is the only page that is updated in-place, which is how copy-on-write made crash recovery easy. Now there are 2 more in-place page updates in list nodes: the next pointer and the appended items.
Although list nodes are updated in-place, no data is overwritten within a page. So if an update is interrupted, the meta page still points to the same data; no extra crash recovery is needed. And unlike the meta page, atomicity is not required.
Following this analysis, the embedded list can also work iff the next pointer is reserved in the B+tree node. Here you can deviate from the book. Although this doubles write amplification.
7.3 Free list implementation
Free list interface
type KV struct {
string
Path // internals
int
fd
tree BTree// added
free FreeList // ...
}
FreeList
is the extra data structure in
KV
.
type FreeList struct {
// callbacks for managing on-disk pages
func(uint64) []byte // read a page
get new func([]byte) uint64 // append a new page
func(uint64) []byte // update an existing page
set // persisted data in the meta page
uint64 // pointer to the list head node
headPage uint64 // monotonic sequence number to index into the list head
headSeq uint64
tailPage uint64
tailSeq // in-memory states
uint64 // saved `tailSeq` to prevent consuming newly added items
maxSeq }
// get 1 item from the list head. return 0 on failure.
func (fl *FreeList) PopHead() uint64
// add 1 item to the tail
func (fl *FreeList) PushTail(ptr uint64)
Like BTree
, page management is isolated via 3
callbacks:
get
reads a page; same as before,new
appends a page; previously used forBTree
.set
returns a writable buffer to capture in-place updates.del
is not there because the free list manages free pages itself.
func (db *KV) Open() error {
// ...
// B+tree callbacks
.tree.get = db.pageRead // read a page
db.tree.new = db.pageAlloc // (new) reuse from the free list or append
db.tree.del = db.free.PushTail // (new) freed pages go to the free list
db// free list callbacks
.free.get = db.pageRead // read a page
db.free.new = db.pageAppend // append a page
db.free.set = db.pageWrite // (new) in-place updates
db// ...
}
Free list data structure
Since a node contains a variable number of items up to
FREE_LIST_CAP
, we need to know where the 1st item is in the
head node (headSeq
), and where the items end in the tail
node (tailSeq
).
type FreeList struct {
// ...
// persisted data in the meta page
uint64 // pointer to the list head node
headPage uint64 // monotonic sequence number to index into the list head
headSeq uint64
tailPage uint64
tailSeq // in-memory states
uint64 // saved `tailSeq` to prevent consuming newly added items
maxSeq }
headSeq
, tailSeq
are indexes into the head
and tail nodes, except that they are monotonically increasing. So the
wrapped-around index is:
func seq2idx(seq uint64) int {
return int(seq % FREE_LIST_CAP)
}
We make them monotonically increasing so that they become a unique identifier of the list position; to prevent the list head from overrunning the list tail, simply compare the 2 sequence numbers.
During an update, the list is both added to and removed from, and when we remove from the head, we cannot remove what we just added to the tail. So we need to …
- At the beginning of the update, save the original
tailSeq
tomaxSeq
. - During the update,
headSeq
cannot overrunmaxSeq
. - At the beginning of the next update,
maxSeq
is advanced totailSeq
. - …
// make the newly added items available for consumption
func (fl *FreeList) SetMaxSeq() {
.maxSeq = fl.tailSeq
fl}
Consuming from the free list
Removing an item from the head node is simply advancing
headSeq
. And when the head node becomes empty, move to the
next node.
// remove 1 item from the head node, and remove the head node if empty.
func flPop(fl *FreeList) (ptr uint64, head uint64) {
if fl.headSeq == fl.maxSeq {
return 0, 0 // cannot advance
}
:= LNode(fl.get(fl.headPage))
node = node.getPtr(seq2idx(fl.headSeq)) // item
ptr .headSeq++
fl// move to the next one if the head node is empty
if seq2idx(fl.headSeq) == 0 {
, fl.headPage = fl.headPage, node.getNext()
head(fl.headPage != 0)
assert}
return
}
The free list self-manages; the removed head node is fed back to itself.
// get 1 item from the list head. return 0 on failure.
func (fl *FreeList) PopHead() uint64 {
, head := flPop(fl)
ptrif head != 0 { // the empty head node is recycled
.PushTail(head)
fl}
return ptr
}
What if the last node is removed? A linked list with 0 nodes implies
nasty special cases. In practice, it’s easier to design the
linked list to have at least 1 node than to deal with special
cases. That’s why we assert(fl.headPage != 0)
.
Pushing into the free list
Appending an item to the tail node is simply advancing
tailSeq
. And when the tail node is full, we immediately add
a new empty tail node to ensure that there is at least 1 node in case
the previous tail node is removed as a head node.
func (fl *FreeList) PushTail(ptr uint64) {
// add it to the tail node
(fl.set(fl.tailPage)).setPtr(seq2idx(fl.tailSeq), ptr)
LNode.tailSeq++
fl// add a new tail node if it's full (the list is never empty)
if seq2idx(fl.tailSeq) == 0 {
// try to reuse from the list head
, head := flPop(fl) // may remove the head node
nextif next == 0 {
// or allocate a new node by appending
= fl.new(make([]byte, BTREE_PAGE_SIZE))
next }
// link to the new tail node
(fl.set(fl.tailPage)).setNext(next)
LNode.tailPage = next
fl// also add the head node if it's removed
if head != 0 {
(fl.set(fl.tailPage)).setPtr(0, head)
LNode.tailSeq++
fl}
}
}
Again, the free list is self-managing: it will try to get a node from itself for the new tail node before resorting to appending.
7.4 KV with a free list
Page management
Now that pages can be reused, reused pages are overwritten in-place, so a map is used to capture pending updates.
type KV struct {
// ...
struct {
page uint64 // database size in number of pages
flushed uint64 // number of pages to be appended
nappend map[uint64][]byte // pending updates, including appended pages
updates }
}
BTree.new
is now KV.pageAlloc
, it uses the
free list before resorting to appending.
// `BTree.new`, allocate a new page.
func (db *KV) pageAlloc(node []byte) uint64 {
if ptr := db.free.PopHead(); ptr != 0 { // try the free list
.page.updates[ptr] = node
dbreturn ptr
}
return db.pageAppend(node) // append
}
KV.pageWrite
returns a writable page copy to capture
in-place updates.
// `FreeList.set`, update an existing page.
func (db *KV) pageWrite(ptr uint64) []byte {
if node, ok := db.page.updates[ptr]; ok {
return node // pending update
}
:= make([]byte, BTREE_PAGE_SIZE)
node copy(node, db.pageReadFile(ptr)) // initialized from the file
.page.updates[ptr] = node
dbreturn node
}
Another change is that we may read a page again after it has been
updated, so KV.pageRead
should consult the pending updates
map first.
// `BTree.get`, read a page.
func (db *KV) pageRead(ptr uint64) []byte {
if node, ok := db.page.updates[ptr]; ok {
return node // pending update
}
return db.pageReadFile(ptr)
}
func (db *KV) pageReadFile(ptr uint64) []byte {
// same as `KV.pageRead` in the last chapter ...
}
Update the meta page
The meta page now includes free list pointers (head and tail) that are updated atomically along with the tree root.
| sig | root_ptr | page_used | head_page | head_seq | tail_page | tail_seq |
| 16B | 8B | 8B | 8B | 8B | 8B | 8B |
Remember that the free list always contains at least 1 node, we’ll assign an empty node to it when initializing an empty DB.
func readRoot(db *KV, fileSize int64) error {
if fileSize == 0 { // empty file
// reserve 2 pages: the meta page and a free list node
.page.flushed = 2
db// add an initial node to the free list so it's never empty
.free.headPage = 1 // the 2nd page
db.free.tailPage = 1
dbreturn nil // the meta page will be written in the 1st update
}
// ...
}
Since headSeq
is blocked by maxSeq
,
maxSeq
is updated to tailSeq
between
updates to allow reuse of pages from the last version.
func updateFile(db *KV) error {
// ...
// prepare the free list for the next update
.free.SetMaxSeq()
dbreturn nil
}
We still assume sequential access in this chapter. When we add
concurrency later, headSeq
will be blocked by the oldest
reader instead.
7.5 Conclusion of the KV store
What we have done:
- File layout for a copy-on-write B+tree.
- Durability and atomicity with
fsync
. - Managing disk pages with a free list.
That’s enough for a KV store with get
, set
,
del
. But there is more in part II:
- Relational DB on KV store.
- Concurrent transactions.