06. Append-Only KV Store

6.1 What we will do

We’ll create a KV store with a copy-on-write B+tree backed by a file.

type KV struct {
    Path   string // file name
    // internals
    fd   int
    tree BTree
    // more ...
}
func (db *KV) Open() error // open or create

func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return updateFile(db)
}
func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, updateFile(db)
}
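
Here’s a usage sketch of this interface (the file name and keys are arbitrary examples, not part of the design):

func example() error {
    db := KV{Path: "test.db"} // any file path
    if err := db.Open(); err != nil {
        return err
    }
    if err := db.Set([]byte("key"), []byte("value")); err != nil {
        return err
    }
    val, ok := db.Get([]byte("key"))
    fmt.Println(ok, string(val)) // true value
    _, err := db.Del([]byte("key"))
    return err
}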

The scope of this chapter is durability + atomicity:

  • The file is append-only; space reuse is left to the next chapter.
  • We will ignore concurrency and assume sequential access within 1 process.

We’ll implement the 3 B+tree callbacks that deal with disk pages:

type BTree struct {
    root uint64
    get func(uint64) []byte // read a page
    new func([]byte) uint64 // append a page
    del func(uint64)        // ignored in this chapter
}

6.2 Two-phase update

Atomicity + durability

As discussed in chapter 03, for a copy-on-write tree, the root pointer is updated atomically. Then fsync is used to request and confirm durability.

The atomicity of the root pointer itself is insufficient; to make the whole tree atomic, new nodes must be persisted before the root pointer. And the write order is not the order in which the data is persisted, due to factors like caching. So another fsync is used to ensure the order.

func updateFile(db *KV) error {
    // 1. Write new nodes.
    if err := writePages(db); err != nil {
        return err
    }
    // 2. `fsync` to enforce the order between 1 and 3.
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    // 3. Update the root pointer atomically.
    if err := updateRoot(db); err != nil {
        return err
    }
    // 4. `fsync` to make everything persistent.
    return syscall.Fsync(db.fd)
}

Alternative: durability with a log

The alternative double-write scheme also has 2 fsync’ed phases (a code sketch follows the list):

  1. Write the updated pages with checksum.
  2. fsync to make the update persistent (for crash recovery).
  3. Update the data in-place (apply the double-writes).
  4. fsync to enforce the order between 3 and reusing the double-writes (they can then be reused or deleted).
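
To make the ordering concrete, here is a sketch of this pattern. It is not part of our copy-on-write design; the two IO steps are passed in as placeholder functions since we don’t implement an actual double-write buffer, and the log and the data are assumed to share one file descriptor.

// a sketch of the double-write (or log) pattern; `writeLog` and
// `applyInPlace` are placeholders for the two IO steps listed above
func logThenApply(fd int, writeLog, applyInPlace func() error) error {
    if err := writeLog(); err != nil { // 1. updated pages + checksum
        return err
    }
    if err := syscall.Fsync(fd); err != nil { // 2. the update is now durable
        return err
    }
    if err := applyInPlace(); err != nil { // 3. update the data in-place
        return err
    }
    return syscall.Fsync(fd) // 4. the double-writes can now be reused or deleted
}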

A difference with copy-on-write is the order of the phases: the data is persistent after the 1st fsync; the DB can return success and do the rest in the background.

The double-write is comparable to a log, which also needs only 1 fsync for an update. And it can be an actual log to buffer multiple updates, which improves performance. This is another example of logs in DBs, besides the LSM-tree.

We won’t use a log as copy-on-write doesn’t need it. But a log still offers the benefits discussed above; it’s one of the reasons logs are ubiquitous in databases.

Concurrency of in-memory data

Atomicity for in-memory data (w.r.t. concurrency) can be achieved with a mutex (lock) or some atomic CPU instructions. There is a similar problem: memory reads/writes may not appear in order due to factors like out-of-order execution.

For an in-memory copy-on-write tree, new nodes must be made visible to concurrent readers before the root pointer is updated. The mechanism that enforces this ordering is called a memory barrier; it is analogous to fsync, although fsync does more than enforce order.

Synchronization primitives such as mutexes, or any OS syscalls, will enforce memory ordering in a portable way, so you don’t have to mess with CPU-specific atomics or barriers (which are inadequate for concurrency anyway).
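
As a minimal illustration (none of this is part of our KV type), a mutex-protected root pointer gives an in-memory copy-on-write tree both atomicity and the required ordering:

// sketch: an in-memory copy-on-write tree root shared by threads
type memNode struct {
    // keys, kids, ... (as in the in-memory B+tree from earlier chapters)
}

type memTree struct {
    mu   sync.Mutex
    root *memNode
}

// the mutex serializes writers and enforces memory ordering, so a reader
// never sees a root that points to partially-constructed nodes
func (t *memTree) setRoot(newRoot *memNode) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.root = newRoot // newRoot and its new nodes are fully built before this
}

func (t *memTree) getRoot() *memNode {
    t.mu.Lock()
    defer t.mu.Unlock()
    return t.root
}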

6.3 Database on a file

The file layout

Our DB is a single file divided into “pages”. Each page is a B+tree node, except for the 1st page, which contains the pointer to the latest root node and other auxiliary data; we call it the meta page.

|     the_meta_page    | pages... | root_node | pages... | (end_of_file)
| root_ptr | page_used |                ^                ^
      |          |                      |                |
      +----------|----------------------+                |
                 |                                       |
                 +---------------------------------------+

New nodes are simply appended like a log, but we cannot use the file size to count the number of pages, because after a power loss the file size (metadata) may become inconsistent with the file data. This is filesystem dependent, so we avoid it by storing the number of pages in the meta page.

`fsync` on directory when creating a file

As mentioned in chapter 01, fsync must be used on the parent directory after a rename. This is also true when creating new files, because there are 2 things to be made persistent: the file data, and the directory that references the file.

We’ll preemptively fsync after potentially creating a new file with O_CREATE. To fsync a directory, open the directory in O_RDONLY mode.

func createFileSync(file string) (int, error) {
    // obtain the directory fd
    flags := os.O_RDONLY | syscall.O_DIRECTORY
    dirfd, err := syscall.Open(path.Dir(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open directory: %w", err)
    }
    defer syscall.Close(dirfd)
    // open or create the file
    flags = os.O_RDWR | os.O_CREATE
    fd, err := syscall.Openat(dirfd, path.Base(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open file: %w", err)
    }
    // fsync the directory
    if err = syscall.Fsync(dirfd); err != nil {
        _ = syscall.Close(fd)  // may leave an empty file
        return -1, fmt.Errorf("fsync directory: %w", err)
    }
    return fd, nil
}

The directory fd can be used by openat to open the target file, which guarantees that the file is from the same directory we opened before, in case the directory path is replaced in between (race condition). This is only a minor concern. Moving a directory with a running DB in it is not a sane scenario.

`mmap`, page cache and IO

mmap is a way to read/write a file as if it’s an in-memory buffer. Disk IO is implicit and automatic with mmap.

func Mmap(fd int, offset int64, length int, ...) (data []byte, err error)

To understand mmap, let’s review some operating system basics. An OS page is the minimum unit for mapping between virtual address and physical address. However, the virtual address space of a process is not fully backed by physical memory all the time; part of the process memory can be swapped to disk, and when the process tries to access it:

  1. The CPU triggers a page fault, which hands control to the OS.
  2. The OS then …
    1. Reads the swapped data into physical memory.
    2. Remaps the virtual address to it.
    3. Hands control back to the process.
  3. The process resumes with the virtual address mapped to real RAM.

mmap works in a similar way: the process gets an address range from mmap, and when it touches a page in that range, a page fault occurs; the OS reads the data into the page cache and remaps the page to it. That’s the automatic IO in a read-only scenario.

The CPU also takes note (via a dirty bit) when the process modifies a page, so the OS can write the page back to disk later. fsync is used to request and wait for that IO. This is how data is written via mmap; it’s not very different from write on Linux, because write goes through the same page cache.

You don’t have to use mmap at all, but it’s important to know the basics.

6.4 Manage disk pages

We’ll use mmap for these page management callbacks because it’s just convenient.

func (db *KV) Open() error {
    db.tree.get = db.pageRead   // read a page
    db.tree.new = db.pageAppend // append a page
    db.tree.del = func(uint64) {}
    // ...
}
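
The elided part of Open can be wired up from createFileSync above plus extendMmap and readRoot defined later in this chapter. A sketch, with the hypothetical helper name openKVFile and error cleanup (closing the fd) omitted:

// hypothetical helper called from Open after the callbacks are set
func openKVFile(db *KV) error {
    fd, err := createFileSync(db.Path) // open or create the DB file
    if err != nil {
        return err
    }
    db.fd = fd
    var st syscall.Stat_t
    if err := syscall.Fstat(db.fd, &st); err != nil { // get the file size
        return fmt.Errorf("stat: %w", err)
    }
    if err := extendMmap(db, int(st.Size)); err != nil { // initial mapping
        return err
    }
    return readRoot(db, st.Size) // load the meta page (root, page count)
}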

Invoke `mmap`

A file-backed mmap can be read-only, read-write, or copy-on-write. To create a read-only mmap, use the PROT_READ and MAP_SHARED flags.

syscall.Mmap(fd, offset, size, syscall.PROT_READ, syscall.MAP_SHARED)

The mapped range can be larger than the current file size, which is a fact that we can exploit because the file will grow.

`mmap` a growing file

mremap remaps a mapping to a larger range; it’s like realloc. That’s one way to deal with the growing file. However, the address may change, which can hinder concurrent readers in later chapters. Our solution is to add new mappings to cover the expanded file.

type KV struct {
    // ...
    mmap struct {
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
}

// `BTree.get`, read a page.
func (db *KV) pageRead(ptr uint64) []byte {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return chunk[offset : offset+BTREE_PAGE_SIZE]
        }
        start = end
    }
    panic("bad ptr")
}

Adding a new mapping each time the file is expanded results in lots of mappings, which is bad for performance because the OS has to keep track of them. This is avoided with exponential growth, since mmap can go beyond the file size.

func extendMmap(db *KV, size int) error {
    if size <= db.mmap.total {
        return nil // enough range
    }
    alloc := max(db.mmap.total, 64<<20) // double the current address space
    for db.mmap.total+alloc < size {
        alloc *= 2 // still not enough?
    }
    chunk, err := syscall.Mmap(
        db.fd, int64(db.mmap.total), alloc,
        syscall.PROT_READ, syscall.MAP_SHARED, // read-only
    )
    if err != nil {
        return fmt.Errorf("mmap: %w", err)
    }
    db.mmap.total += alloc
    db.mmap.chunks = append(db.mmap.chunks, chunk)
    return nil
}

You may wonder why we don’t just create a very large mapping (say, 1TB) and forget about the growing file, since virtual address space that is never touched costs nothing. This is OK for a toy DB on 64-bit systems.
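
For reference, that alternative is a single call. This is only a sketch and assumes a 64-bit system, where reserving unused address space is essentially free:

// map 1TB up front, regardless of the current file size; pages past the
// end of the file are simply never touched by a correct reader
syscall.Mmap(fd, 0, 1<<40, syscall.PROT_READ, syscall.MAP_SHARED)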

Capture page updates

The BTree.new callback collects new pages from B+tree updates and allocates page numbers from the end of the DB.

type KV struct {
    // ...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}

func (db *KV) pageAppend(node []byte) uint64 {
    ptr := db.page.flushed + uint64(len(db.page.temp)) // just append
    db.page.temp = append(db.page.temp, node)
    return ptr
}

These pages are then written (appended) to the file after the B+tree update.

func writePages(db *KV) error {
    // extend the mmap if needed
    size := (int(db.page.flushed) + len(db.page.temp)) * BTREE_PAGE_SIZE
    if err := extendMmap(db, size); err != nil {
        return err
    }
    // write data pages to the file
    offset := int64(db.page.flushed * BTREE_PAGE_SIZE)
    if _, err := unix.Pwritev(db.fd, db.page.temp, offset); err != nil {
        return err
    }
    // discard in-memory data
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]
    return nil
}

pwritev is a variant of write with an offset and multiple input buffers. We have to control the offset because we also need to write the meta page later. The multiple input buffers are combined by the kernel.
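
Its effect is like concatenating the buffers and doing one pwrite at the offset. A sketch of that equivalence with plain pwrite calls (pwriteVec is a hypothetical name; short-write handling and atomicity w.r.t. concurrent readers are ignored):

// write the buffers back to back, starting at `offset`
func pwriteVec(fd int, bufs [][]byte, offset int64) error {
    for _, buf := range bufs {
        if _, err := syscall.Pwrite(fd, buf, offset); err != nil {
            return err
        }
        offset += int64(len(buf))
    }
    return nil
}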

6.5 The meta page

Read the meta page

We’ll also add some magic bytes to the meta page to identify the file type.

const DB_SIG = "BuildYourOwnDB06" // not compatible between chapters

// | sig | root_ptr | page_used |
// | 16B |    8B    |     8B    |
func saveMeta(db *KV) []byte {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    return data[:]
}

func loadMeta(db *KV, data []byte)
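
Its implementation mirrors saveMeta and the layout above (a minimal sketch; verifying the signature is left to readRoot below):

func loadMeta(db *KV, data []byte) {
    db.tree.root = binary.LittleEndian.Uint64(data[16:])
    db.page.flushed = binary.LittleEndian.Uint64(data[24:])
}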

If the file is empty, the 1st page is still reserved for the meta page; it will be written on the 1st update.

func readRoot(db *KV, fileSize int64) error {
    if fileSize == 0 { // empty file
        db.page.flushed = 1 // the meta page is initialized on the 1st write
        return nil
    }
    // read the page
    data := db.mmap.chunks[0]
    loadMeta(db, data)
    // verify the page
    // ...
    return nil
}

Update the meta page

Writing a small amount of page-aligned data to a real disk, modifying only a single sector, is likely power-loss-atomic at the hardware level. Some real databases depend on this. That’s how we update the meta page too.

// 3. Update the meta page. It must be atomic.
func updateRoot(db *KV) error {
    if _, err := syscall.Pwrite(db.fd, saveMeta(db), 0); err != nil {
        return fmt.Errorf("write meta page: %w", err)
    }
    return nil
}

However, atomicity means different things at different levels, as you’ve seen with rename. write is not atomic w.r.t. concurrent readers at the system call level, which is likely a consequence of how the page cache works.

We’ll consider read/write atomicity when we add concurrent transactions, but we have already seen a solution: in an LSM-tree, the 1st level is the only thing that is updated, and it’s duplicated as a MemTable, which moves the concurrency problem to memory. We can keep an in-memory copy of the meta page and synchronize it with a mutex, thus avoiding concurrent disk reads/writes.

Even if the hardware is not atomic w.r.t. power loss, atomicity is still achievable with a log + checksum. We could alternate between 2 checksummed meta pages on each update, so that at least one of them is intact after a power loss. This is called double buffering, which is essentially a rotating log with 2 entries.
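
A sketch of that idea, assuming each meta page additionally carries a version counter and a checksum (neither exists in our actual 32-byte format):

// hypothetical double-buffered meta page; not part of our format
type metaSlot struct {
    version  uint64 // incremented on every update
    checksum uint32 // covers the rest of the slot
    // root_ptr, page_used, ...
}

// on each update, overwrite the *older* slot; after a restart, load the
// checksum-valid slot with the highest version
func pickMeta(a, b metaSlot, aValid, bValid bool) (metaSlot, bool) {
    switch {
    case aValid && bValid:
        if a.version > b.version {
            return a, true
        }
        return b, true
    case aValid:
        return a, true
    case bValid:
        return b, true
    default:
        return metaSlot{}, false // no usable meta page
    }
}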

6.6 Error handling

Scenarios after IO errors

The bare minimum of error handling is to propagate errors with if err != nil. Next, consider the possibility of using the DB after an IO error (fsync or write).

  • Read after a failed update? Should behave as if nothing happened.
  • Update it again after a failure?
    • If the error persists, it’s expected to fail again.
    • If the error is temporary, can we recover from the previous error?
  • Restart after the problem is resolved? See crash recovery in chapter 03.

Revert to the previous version

There is a survey on the handling of fsync failures, from which we can learn that the topic is filesystem dependent. If we read after an fsync failure, some filesystems return the failed data, as the page cache no longer matches the disk. So reading back failed writes is problematic.

But since we’re copy-on-write, this is not a problem; we can revert to the old tree root to avoid the problematic data. The tree root is stored in the meta page, but we never read the meta page from disk after opening a DB, so we’ll just revert the in-memory root pointer.

func (db *KV) Set(key []byte, val []byte) error {
    meta := saveMeta(db) // save the in-memory state (tree root)
    if err := db.tree.Insert(key, val); err != nil {
        return err // length limit
    }
    return updateOrRevert(db, meta)
}
func updateOrRevert(db *KV, meta []byte) error {
    // 2-phase update
    err := updateFile(db)
    // revert on error
    if err != nil {
        // the in-memory states can be reverted immediately to allow reads
        loadMeta(db, meta)
        // discard temporaries
        db.page.temp = db.page.temp[:0]
    }
    return err
}

So after a write failure, it’s still possible to use the DB in read-only mode. Reads can fail too, but since we’re using mmap, the process is simply killed by SIGBUS on a read error. That’s one of the drawbacks of mmap.

Recover from temporary write errors

Some write errors are temporary, such as “no space left”. If an update fails and then the next succeeds, the end state is still good. The problem is the intermediate state: between the 2 updates, the content of the meta page on disk is unknown!

If fsync fails on the meta page, the meta page on disk can be either the new or the old version, while the in-memory tree root is the old version. So the 2nd successful update will overwrite the data pages of the newer version, which can be left in a corrupted intermediate state if crashed.

The solution is to rewrite the last known meta page on recovery.

type KV struct {
    // ...
    failed bool // Did the last update fail?
}

func updateOrRevert(db *KV, meta []byte) error {
    // ensure the on-disk meta page matches the in-memory one after an error
    if db.failed {
        // write and fsync the previous meta page
        // ...
        db.failed = false
    }
    err := updateFile(db)
    if err != nil {
        // the on-disk meta page is in an unknown state;
        // mark it to be rewritten on later recovery.
        db.failed = true
        // ...
    }
    return err
}
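
The elided rewrite is just the last phase of updateFile repeated: write the known-good in-memory meta page, then fsync. A sketch, reusing updateRoot from above (recoverMeta is a hypothetical name):

// re-sync the on-disk meta page with the known-good in-memory copy
func recoverMeta(db *KV) error {
    if err := updateRoot(db); err != nil { // write the meta page at offset 0
        return err
    }
    return syscall.Fsync(db.fd) // make sure it reaches the disk
}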

We rely on filesystems to report errors correctly, but there is evidence that they don’t always do so. So whether the system as a whole can handle errors is still doubtful.

6.7 Summary of the append-only KV store

  • File layout for a copy-on-write B+tree.
  • Durability and atomicity with fsync.
  • Error handling.

A B+tree on disk is a major step. We just have to add a free list to make it practical.