
06. Persist to Disk

The B-tree data structure from the previous chapter can be dumped to disk easily. Let’s build a simple KV store on top of it. Since our B-tree implementation is immutable, we’ll allocate disk space in an append-only manner; reusing disk space is deferred to the next chapter.

6.1 The Method for Persisting Data

As mentioned in previous chapters, persisting data to disk is more than just dumping data into files. There are a couple of considerations:

  1. Crash recovery: This includes database process crashes, OS crashes, and power failures. The database must be in a usable state after a reboot.
  2. Durability: After a successful response from the database, the data involved is guaranteed to persist, even after a crash. In other words, persistence occurs before responding to the client.

There are many materials describing databases using the ACID jargon (atomicity, consistency, isolation, durability), but these concepts are neither orthogonal nor easy to explain, so let’s focus on our practical example instead.

  1. The immutable aspect of our B-tree: Updating the B-tree does not touch the previous version of the B-tree, which makes crash recovery easy; should the update go wrong, we can simply recover to the previous version.
  2. Durability is achieved via the fsync Linux syscall. Normal file IO via write or mmap goes to the page cache first; the system flushes the page cache to the disk later. The fsync syscall blocks until all dirty pages are flushed, as the sketch after this list shows.
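
To make the durability rule concrete, below is a minimal sketch of the write-then-fsync ordering. The respond callback is a hypothetical stand-in for answering the client; this function is not part of our KV store.

// a minimal sketch of "persist before responding", for illustration only.
// the `respond` callback is a hypothetical stand-in for answering the client.
func updateAndRespond(fp *os.File, data []byte, respond func()) error {
    if _, err := fp.Write(data); err != nil {
        return err
    }
    // fp.Sync() wraps the fsync syscall; it blocks until the
    // dirty pages of the file are flushed to the disk.
    if err := fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    respond() // respond to the client only after the data is persisted
    return nil
}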

How do we recover to the previous version if an update goes wrong? We can split the update into two phases:

  1. An update creates new nodes; write them to the disk.
  2. Each update creates a new root node; we need to store the pointer to the new root node somewhere.

The first phase may involve writing multiple pages to the disk, which is generally not atomic. But the second phase involves only a single pointer and can be done in a single atomic page write. This makes the whole operation atomic: the update simply does not happen if the database crashes.

The first phase must be persisted before the second phase, otherwise, the root pointer could point to a corrupted (partly persisted) version of the tree after a crash. There should be an fsync between the two phases (to serve as a barrier).

And the second phase should also be fsync’d before responding to the client.
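
The two phases and the two fsyncs are summarized by the sketch below. The helpers writeNewNodes and writeRootPointer are placeholders; the real code appears later in this chapter as writePages and masterStore.

// a sketch of the update order. `writeNewNodes` and `writeRootPointer`
// are placeholders for the writePages and masterStore functions
// defined later in this chapter.
func updateSketch(fp *os.File) error {
    if err := writeNewNodes(fp); err != nil { // phase 1: not atomic
        return err
    }
    if err := fp.Sync(); err != nil { // the barrier between the two phases
        return err
    }
    if err := writeRootPointer(fp); err != nil { // phase 2: a single page write
        return err
    }
    return fp.Sync() // make it durable before responding to the client
}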

6.2 mmap-Based IO

The contents of a disk file can be mapped to a virtual address using the mmap syscall. Reading from this address initiates transparent disk IO, the same as reading the file via the read syscall, but without the need for a user-space buffer or the overhead of a syscall. The mapped address is a proxy to the page cache; modifying data via it is the same as using the write syscall.

mmap is convenient, and we’ll use it for our KV store. However, the use of mmap is not essential; pages can also be read with plain file IO, as sketched below.
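
For reference, below is how a page could be read with plain file IO instead: ReadAt wraps the pread syscall. This is for illustration only; the hypothetical readPageNoMmap is not used by our KV store.

// for illustration only: dereference a page with plain file IO (pread)
// instead of mmap. our KV store does not use this.
func readPageNoMmap(fp *os.File, ptr uint64) (BNode, error) {
    node := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
    // ReadAt wraps the pread syscall: read at a given file offset.
    if _, err := fp.ReadAt(node.data, int64(ptr*BTREE_PAGE_SIZE)); err != nil {
        return BNode{}, fmt.Errorf("pread: %w", err)
    }
    return node, nil
}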

// create the initial mmap that covers the whole file.
func mmapInit(fp *os.File) (int, []byte, error) {
    fi, err := fp.Stat()
    if err != nil {
        return 0, nil, fmt.Errorf("stat: %w", err)
    }

    if fi.Size()%BTREE_PAGE_SIZE != 0 {
        return 0, nil, errors.New("file size is not a multiple of page size")
    }

    mmapSize := 64 << 20
    assert(mmapSize%BTREE_PAGE_SIZE == 0)
    for mmapSize < int(fi.Size()) {
        mmapSize *= 2
    }
    // mmapSize can be larger than the file

    chunk, err := syscall.Mmap(
        int(fp.Fd()), 0, mmapSize,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
    )
    if err != nil {
        return 0, nil, fmt.Errorf("mmap: %w", err)
    }

    return int(fi.Size()), chunk, nil
}

The above function creates the initial mapping, which is at least as large as the file. The mapping can be larger than the file size; the range past the end of the file is not accessible (SIGBUS), but the file can be extended later.

We may have to extend the range of the mapping as the file grows. The syscall for extending a mmap range is mremap. Unfortunately, we may not be able to keep the starting address when extending a range by remapping. Our approach to extending mappings is to use multiple mappings — create a new mapping for the overflow file range.

type KV struct {
    Path string
    // internals
    fp   *os.File
    tree BTree
    mmap struct {
        file   int      // file size, can be larger than the database size
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}
// extend the mmap by adding new mappings.
func extendMmap(db *KV, npages int) error {
    if db.mmap.total >= npages*BTREE_PAGE_SIZE {
        return nil
    }

    // double the address space
    chunk, err := syscall.Mmap(
        int(db.fp.Fd()), int64(db.mmap.total), db.mmap.total,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
    )
    if err != nil {
        return fmt.Errorf("mmap: %w", err)
    }

    db.mmap.total += db.mmap.total
    db.mmap.chunks = append(db.mmap.chunks, chunk)
    return nil
}

The size of the new mapping increases exponentially so that we don’t have to call mmap frequently.

Below is how we access a page from the mapped address.

// callback for BTree, dereference a pointer.
func (db *KV) pageGet(ptr uint64) BNode {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
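        // each chunk maps a contiguous range of pages;
        // `end` is one past the last page number in this chunk.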
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
        }
        start = end
    }
    panic("bad ptr")
}

6.3 The Master Page

The first page of the file is used to store the pointer to the root; let’s call it the “master page”. The total number of pages is needed for allocating new nodes, thus it is also stored there.

|     the_master_page    | pages... | tree_root | pages... |
| btree_root | page_used |                ^                ^
      |            |                      |                |
      +------------+----------------------+                |
                   |                                       |
                   +---------------------------------------+

The function below reads the master page when initializing a database:

const DB_SIG = "BuildYourOwnDB06"

// the master page format.
// it contains the pointer to the root and other important bits.
// | sig | btree_root | page_used |
// | 16B |     8B     |     8B    |
func masterLoad(db *KV) error {
    if db.mmap.file == 0 {
        // empty file, the master page will be created on the first write.
        db.page.flushed = 1 // reserved for the master page
        return nil
    }

    data := db.mmap.chunks[0]
    root := binary.LittleEndian.Uint64(data[16:])
    used := binary.LittleEndian.Uint64(data[24:])

    // verify the page
    if !bytes.Equal([]byte(DB_SIG), data[:16]) {
        return errors.New("Bad signature.")
    }
    bad := !(1 <= used && used <= uint64(db.mmap.file/BTREE_PAGE_SIZE))
    bad = bad || !(0 <= root && root < used)
    if bad {
        return errors.New("Bad master page.")
    }

    db.tree.root = root
    db.page.flushed = used
    return nil
}

Below is the function for updating the master page. Unlike the code for reading, it doesn’t use the mapped address for writing. This is because modifying a page via mmap is not atomic: the kernel could flush the page midway and corrupt the disk file, while a small write that doesn’t cross a page boundary can be assumed to be atomic.

// update the master page. it must be atomic.
func masterStore(db *KV) error {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    // NOTE: Updating the page via mmap is not atomic.
    //       Use the `pwrite()` syscall instead.
    _, err := db.fp.WriteAt(data[:], 0)
    if err != nil {
        return fmt.Errorf("write master page: %w", err)
    }
    return nil
}

6.4 Allocating Disk Pages

We’ll simply append new pages to the end of the database until we add a free list in the next chapter.

And new pages are kept temporarily in memory until copied to the file later (after possibly extending the file).

type KV struct {
    // omitted...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}
// callback for BTree, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
    // TODO: reuse deallocated pages
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := db.page.flushed + uint64(len(db.page.temp))
    db.page.temp = append(db.page.temp, node.data)
    return ptr
}

// callback for BTree, deallocate a page.
func (db *KV) pageDel(uint64) {
    // TODO: implement this
}

Before writing the pending pages, we may need to extend the file. The corresponding syscall is fallocate.

// extend the file to at least `npages`.
func extendFile(db *KV, npages int) error {
    filePages := db.mmap.file / BTREE_PAGE_SIZE
    if filePages >= npages {
        return nil
    }

    for filePages < npages {
        // the file size is increased exponentially,
        // so that we don't have to extend the file for every update.
        inc := filePages / 8
        if inc < 1 {
            inc = 1
        }
        filePages += inc
    }

    fileSize := filePages * BTREE_PAGE_SIZE
    err := syscall.Fallocate(int(db.fp.Fd()), 0, 0, int64(fileSize))
    if err != nil {
        return fmt.Errorf("fallocate: %w", err)
    }

    db.mmap.file = fileSize
    return nil
}

6.5 Initializing the Database

Putting together what we have done.

func (db *KV) Open() error {
    // open or create the DB file
    fp, err := os.OpenFile(db.Path, os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        return fmt.Errorf("OpenFile: %w", err)
    }
    db.fp = fp

    // create the initial mmap
    sz, chunk, err := mmapInit(db.fp)
    if err != nil {
        goto fail
    }
    db.mmap.file = sz
    db.mmap.total = len(chunk)
    db.mmap.chunks = [][]byte{chunk}

    // btree callbacks
    db.tree.get = db.pageGet
    db.tree.new = db.pageNew
    db.tree.del = db.pageDel

    // read the master page
    err = masterLoad(db)
    if err != nil {
        goto fail
    }

    // done
    return nil

fail:
    db.Close()
    return fmt.Errorf("KV.Open: %w", err)
}
// cleanups
func (db *KV) Close() {
    for _, chunk := range db.mmap.chunks {
        err := syscall.Munmap(chunk)
        assert(err == nil)
    }
    _ = db.fp.Close()
}

6.6 Update Operations

Unlike queries, update operations must persist the data before returning.

// read the db
func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}

// update the db
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return flushPages(db)
}

func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, flushPages(db)
}

flushPages is the function for persisting new pages.

// persist the newly allocated pages after updates
func flushPages(db *KV) error {
    if err := writePages(db); err != nil {
        return err
    }
    return syncPages(db)
}

It is split into two phases as mentioned earlier.

func writePages(db *KV) error {
    // extend the file & mmap if needed
    npages := int(db.page.flushed) + len(db.page.temp)
    if err := extendFile(db, npages); err != nil {
        return err
    }
    if err := extendMmap(db, npages); err != nil {
        return err
    }

    // copy data to the file
    for i, page := range db.page.temp {
        ptr := db.page.flushed + uint64(i)
        copy(db.pageGet(ptr).data, page)
    }
    return nil
}

An fsync is used in between the two phases, and again after the second.

func syncPages(db *KV) error {
    // flush data to the disk. must be done before updating the master page.
    if err := db.fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]

    // update & flush the master page
    if err := masterStore(db); err != nil {
        return err
    }
    if err := db.fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    return nil
}

Our KV store is functional, but the file can’t grow forever as we update the database. We’ll finish our KV store by reusing disk pages in the next chapter.
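
To wrap up, here is a minimal usage sketch (with log.Fatal as a stand-in for real error handling):

func main() {
    db := KV{Path: "test.db"}
    if err := db.Open(); err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Set returns only after the update is persisted
    if err := db.Set([]byte("k"), []byte("v")); err != nil {
        log.Fatal(err)
    }
    val, ok := db.Get([]byte("k"))
    fmt.Println(string(val), ok) // v true
}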
