06. Persist to Disk
The B-tree data structure from the previous chapter can be dumped to disk easily. Let’s build a simple KV store on top of it. Since our B-tree implementation is immutable, we’ll allocate disk space in an append-only manner; reusing disk space is deferred to the next chapter.
6.1 The Method for Persisting Data
As mentioned in previous chapters, persisting data to disk is more than just dumping data into files. There are a couple of considerations:
- Crash recovery: This includes database process crashes, OS crashes, and power failures. The database must be in a usable state after a reboot.
- Durability: After a successful response from the database, the data involved is guaranteed to persist, even after a crash. In other words, persistence occurs before responding to the client.
There are many materials describing databases using the ACID jargon (atomicity, consistency, isolation, durability), but these concepts are neither orthogonal nor easy to explain, so let’s focus on our practical example instead.
- The immutable aspect of our B-tree: updating the B-tree does not touch the previous version of the B-tree, which makes crash recovery easy; should the update go wrong, we can simply recover to the previous version.
- Durability is achieved via the fsync Linux syscall. Normal file IO via write or mmap goes to the page cache first; the system has to flush the page cache to the disk later. The fsync syscall blocks until all dirty pages are flushed.
How do we recover to the previous version if an update goes wrong? We can split the update into two phases:
- An update creates new nodes; write them to the disk.
- Each update creates a new root node; we need to store the pointer to the root node somewhere.
The first phase may involve writing multiple pages to the disk; this is generally not atomic. But the second phase involves only a single pointer and can be done in a single atomic page write. This makes the whole operation atomic: the update will simply not happen if the database crashes.
The first phase must be persisted before the second phase; otherwise, the root pointer could point to a corrupted (partly persisted) version of the tree after a crash. There should be an fsync between the two phases (to serve as a barrier). And the second phase should also be fsync’d before responding to the client.
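To make the two phases concrete, here is a minimal sketch. It is not this chapter’s actual code (that is the writePages and syncPages functions later); the newPages map of file offsets to page data and the master buffer are hypothetical inputs assumed to be prepared by the caller.

// a minimal sketch of the two-phase update; `newPages` and `master`
// are hypothetical inputs assumed to be computed already.
func updateSketch(fp *os.File, newPages map[int64][]byte, master []byte) error {
    // phase 1: write the new nodes. this may touch many pages and is not atomic.
    for off, page := range newPages {
        if _, err := fp.WriteAt(page, off); err != nil {
            return err
        }
    }
    // fsync as a barrier: the new nodes must be durable before the root pointer.
    if err := fp.Sync(); err != nil {
        return err
    }
    // phase 2: a single-page write of the master page (the root pointer); atomic.
    if _, err := fp.WriteAt(master, 0); err != nil {
        return err
    }
    // fsync again so the update is durable before responding to the client.
    return fp.Sync()
}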
6.2 mmap-Based IO
The contents of a disk file can be mapped to a virtual address using the mmap syscall. Reading from this address initiates transparent disk IO, which is the same as reading the file via the read syscall, but without the need for a user-space buffer and the overhead of a syscall. The mapped address is a proxy to the page cache; modifying data via it is the same as the write syscall.
mmap is convenient, and we’ll use it for our KV store. However, the use of mmap is not essential.
// create the initial mmap that covers the whole file.
func mmapInit(fp *os.File) (int, []byte, error) {
    fi, err := fp.Stat()
    if err != nil {
        return 0, nil, fmt.Errorf("stat: %w", err)
    }

    if fi.Size()%BTREE_PAGE_SIZE != 0 {
        return 0, nil, errors.New("File size is not a multiple of page size.")
    }

    mmapSize := 64 << 20
    assert(mmapSize%BTREE_PAGE_SIZE == 0)
    for mmapSize < int(fi.Size()) {
        mmapSize *= 2
    }
    // mmapSize can be larger than the file

    chunk, err := syscall.Mmap(
        int(fp.Fd()), 0, mmapSize,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
    )
    if err != nil {
        return 0, nil, fmt.Errorf("mmap: %w", err)
    }

    return int(fi.Size()), chunk, nil
}
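Note that assert is not a Go builtin; it is a small helper used throughout the code. A minimal version might look like this (an assumption, not the book’s exact definition):

func assert(cond bool) {
    if !cond {
        panic("assertion failure")
    }
}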
The above function creates the initial mapping, which is at least the size of the file. The size of the mapping can be larger than the file size, and the range past the end of the file is not accessible (SIGBUS), but the file can be extended later.
We may have to extend the range of the mapping as the file grows. The syscall for extending an mmap range is mremap. Unfortunately, we may not be able to keep the starting address when extending a range by remapping. Our approach to extending mappings is to use multiple mappings: create a new mapping for the overflow file range.
type KV struct {
    Path string
    // internals
    fp   *os.File
    tree BTree
    mmap struct {
        file   int      // file size, can be larger than the database size
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}
// extend the mmap by adding new mappings.
func extendMmap(db *KV, npages int) error {
    if db.mmap.total >= npages*BTREE_PAGE_SIZE {
        return nil
    }

    // double the address space
    chunk, err := syscall.Mmap(
        int(db.fp.Fd()), int64(db.mmap.total), db.mmap.total,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
    )
    if err != nil {
        return fmt.Errorf("mmap: %w", err)
    }

    db.mmap.total += db.mmap.total
    db.mmap.chunks = append(db.mmap.chunks, chunk)
    return nil
}
The size of the new mapping increases exponentially so that we don’t have to call mmap frequently.
Below is how we access a page from the mapped address.
// callback for BTree, dereference a pointer.
func (db *KV) pageGet(ptr uint64) BNode {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
        }
        start = end
    }
    panic("bad ptr")
}
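For example, assuming BTREE_PAGE_SIZE is 4096 and a single 64 MB chunk: the chunk holds 16384 pages, so a pointer of 5 falls into it and maps to the byte range [5*4096, 6*4096) = [20480, 24576) within the chunk.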
6.3 The Master Page
The first page of the file is used to store the pointer to the root; let’s call it the “master page”. The total number of pages is needed for allocating new nodes, so it is also stored there.
|     the_master_page    | pages... | tree_root | pages... |
| btree_root | page_used |                ^                ^
      |            |                      |                |
      +------------+----------------------+                |
                   |                                       |
                   +---------------------------------------+
The function below reads the master page when initializing a database:
const DB_SIG = "BuildYourOwnDB06"
// the master page format.
// it contains the pointer to the root and other important bits.
// | sig | btree_root | page_used |
// | 16B |     8B     |     8B    |
func masterLoad(db *KV) error {
    if db.mmap.file == 0 {
        // empty file, the master page will be created on the first write.
        db.page.flushed = 1 // reserved for the master page
        return nil
    }

    data := db.mmap.chunks[0]
    root := binary.LittleEndian.Uint64(data[16:])
    used := binary.LittleEndian.Uint64(data[24:])

    // verify the page
    if !bytes.Equal([]byte(DB_SIG), data[:16]) {
        return errors.New("Bad signature.")
    }
    bad := !(1 <= used && used <= uint64(db.mmap.file/BTREE_PAGE_SIZE))
    bad = bad || !(0 <= root && root < used)
    if bad {
        return errors.New("Bad master page.")
    }

    db.tree.root = root
    db.page.flushed = used
    return nil
}
Below is the function for updating the master page. Unlike the code for reading, it doesn’t use the mapped address for writing. This is because modifying a page via mmap is not atomic. The kernel could flush the page midway and corrupt the disk file, while a small write that doesn’t cross the page boundary is guaranteed to be atomic.
// update the master page. it must be atomic.
func masterStore(db *KV) error {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    // NOTE: Updating the page via mmap is not atomic.
    //       Use the `pwrite()` syscall instead.
    _, err := db.fp.WriteAt(data[:], 0)
    if err != nil {
        return fmt.Errorf("write master page: %w", err)
    }
    return nil
}
6.4 Allocating Disk Pages
We’ll simply append new pages to the end of the database until we add a free list in the next chapter.
And new pages are kept temporarily in memory until copied to the file later (after possibly extending the file).
type KV struct {
    // omitted...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}

// callback for BTree, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
    // TODO: reuse deallocated pages
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := db.page.flushed + uint64(len(db.page.temp))
    db.page.temp = append(db.page.temp, node.data)
    return ptr
}

// callback for BTree, deallocate a page.
func (db *KV) pageDel(uint64) {
    // TODO: implement this
}
Before writing the pending pages, we may need to extend the file first. The corresponding syscall is fallocate.
// extend the file to at least `npages`.
func extendFile(db *KV, npages int) error {
    filePages := db.mmap.file / BTREE_PAGE_SIZE
    if filePages >= npages {
        return nil
    }

    for filePages < npages {
        // the file size is increased exponentially,
        // so that we don't have to extend the file for every update.
        inc := filePages / 8
        if inc < 1 {
            inc = 1
        }
        filePages += inc
    }

    fileSize := filePages * BTREE_PAGE_SIZE
    err := syscall.Fallocate(int(db.fp.Fd()), 0, 0, int64(fileSize))
    if err != nil {
        return fmt.Errorf("fallocate: %w", err)
    }

    db.mmap.file = fileSize
    return nil
}
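Note that syscall.Fallocate is Linux-specific. On platforms without it, one fallback (an assumption, not this chapter’s code) is to extend the file with Truncate, which does not reserve disk space the way fallocate does:

// hypothetical fallback: extend the file without reserving space
if err := db.fp.Truncate(int64(fileSize)); err != nil {
    return fmt.Errorf("truncate: %w", err)
}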
6.5 Initializing the Database
Putting together what we have done so far:
func (db *KV) Open() error {
    // open or create the DB file
    fp, err := os.OpenFile(db.Path, os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        return fmt.Errorf("OpenFile: %w", err)
    }
    db.fp = fp

    // create the initial mmap
    sz, chunk, err := mmapInit(db.fp)
    if err != nil {
        goto fail
    }
    db.mmap.file = sz
    db.mmap.total = len(chunk)
    db.mmap.chunks = [][]byte{chunk}

    // btree callbacks
    db.tree.get = db.pageGet
    db.tree.new = db.pageNew
    db.tree.del = db.pageDel

    // read the master page
    err = masterLoad(db)
    if err != nil {
        goto fail
    }

    // done
    return nil

fail:
    db.Close()
    return fmt.Errorf("KV.Open: %w", err)
}
// cleanups
func (db *KV) Close() {
    for _, chunk := range db.mmap.chunks {
        err := syscall.Munmap(chunk)
        assert(err == nil)
    }
    _ = db.fp.Close()
}
6.6 Update Operations
Unlike queries, update operations must persist the data before returning.
// read the db
func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}

// update the db
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return flushPages(db)
}

func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, flushPages(db)
}
flushPages is the function for persisting new pages.
// persist the newly allocated pages after updates
func flushPages(db *KV) error {
    if err := writePages(db); err != nil {
        return err
    }
    return syncPages(db)
}
It is split into two phases as mentioned earlier.
func writePages(db *KV) error {
    // extend the file & mmap if needed
    npages := int(db.page.flushed) + len(db.page.temp)
    if err := extendFile(db, npages); err != nil {
        return err
    }
    if err := extendMmap(db, npages); err != nil {
        return err
    }

    // copy data to the file
    for i, page := range db.page.temp {
        ptr := db.page.flushed + uint64(i)
        copy(db.pageGet(ptr).data, page)
    }
    return nil
}
And the fsync is in between and after them.
func syncPages(db *KV) error {
    // flush data to the disk. must be done before updating the master page.
    if err := db.fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]

    // update & flush the master page
    if err := masterStore(db); err != nil {
        return err
    }
    if err := db.fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    return nil
}
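Below is a quick usage sketch of the KV store we’ve built (a hypothetical test program, assuming the fmt and log packages are imported and test.db is writable):

func main() {
    db := KV{Path: "test.db"}
    if err := db.Open(); err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // insert a key, then read it back
    if err := db.Set([]byte("key"), []byte("value")); err != nil {
        log.Fatal(err)
    }
    val, ok := db.Get([]byte("key"))
    fmt.Println(ok, string(val)) // true value
}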
Our KV store is functional, but the file can’t grow forever as we update the database. We’ll finish our KV store by reusing disk pages in the next chapter.