
06. Append-Only KV Store

6.1 What we will do

We’ll create a KV store with a copy-on-write B+tree backed by a file.

type KV struct {
    Path   string // file name
    // internals
    fd   int
    tree BTree
    // more ...
}
func (db *KV) Open() error

func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return updateFile(db)
}
func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, updateFile(db)
}
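
To make the intent concrete, here is a minimal usage sketch of this interface (a hypothetical test program; assumes the usual fmt and log imports, and the Close method used later in this chapter):

func main() {
    db := KV{Path: "test.db"}
    if err := db.Open(); err != nil {
        log.Fatalf("KV.Open: %v", err)
    }
    defer db.Close()
    // set a key, then read it back
    if err := db.Set([]byte("hello"), []byte("world")); err != nil {
        log.Fatalf("KV.Set: %v", err)
    }
    val, ok := db.Get([]byte("hello"))
    fmt.Println(ok, string(val)) // true world
}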

The scope of this chapter is durability + atomicity.

We’ll implement the 3 B+tree callbacks that deal with disk pages:

type BTree struct {
    root uint64
    get func(uint64) BNode // read a page
    new func(BNode) uint64 // append a page
    del func(uint64)       // ignored in this chapter
}

6.2 Two-phase update

Atomicity + durability

As discussed in chapter 03, for a copy-on-write tree, the root pointer is updated atomically. Then fsync is used to request and confirm durability.

The atomicity of the root pointer alone is insufficient; to make the whole tree update atomic, the new nodes must be persisted before the root pointer. But the order in which writes are issued is not the order in which the data reaches the disk, due to factors like caching, so another fsync is used to enforce that order.

func updateFile(db *KV) error {
    // 1. Write new nodes.
    if err := writePages(db); err != nil {
        return err
    }
    // 2. `fsync` to enforce the order between 1 and 3.
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    // 3. Update the root pointer atomically.
    if err := updateRoot(db); err != nil {
        return err
    }
    // 4. `fsync` to make everything persistent.
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    return nil
}

Durability and log

The double-write scheme (updated pages are first saved to a temporary place, then applied in-place) also has 2 fsync’ed phases:

  1. Write the updated pages, with checksums, to the double-write buffer.
  2. fsync to make the update persistent (for crash recovery).
  3. Update the data in-place (apply the double-writes).
  4. fsync for the order between 3 and 1 (reuse or delete the double-writes).

A difference with copy-on-write is the order of the phases: the data is persistent after the 1st fsync; the DB can respond to the client after only 1 fsync and do the rest in the background.

The double-write is comparable to a log, which also needs only 1 fsync for an update. And it can be implemented as an actual log to buffer multiple updates, which improves performance. This is another example of using a log in databases, besides the LSM-tree.

We won’t use a log as copy-on-write doesn’t need it. But a log still offers the benefits discussed above; it’s one of the reasons logs are ubiquitous in databases.
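
For illustration only, here is a minimal sketch of a checksummed log record (a hypothetical format, using hash/crc32, encoding/binary, and syscall; this book does not use a log). The checksum lets crash recovery detect a torn append, so a single fsync after the append suffices.

// hypothetical record format: | crc32 (4B) | len (4B) | data |
func appendLogRecord(fd int, data []byte) error {
    buf := make([]byte, 8+len(data))
    binary.LittleEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(data))
    binary.LittleEndian.PutUint32(buf[4:8], uint32(len(data)))
    copy(buf[8:], data)
    if _, err := syscall.Write(fd, buf); err != nil {
        return fmt.Errorf("append log: %w", err)
    }
    return syscall.Fsync(fd) // the only fsync needed per update (or per batch)
}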

Concurrency of in-memory data

Atomicity for in-memory data (w.r.t. concurrency) can be achieved with a mutex (lock) or some atomic CPU instructions. There is a similar problem: memory reads/writes may not appear in order due to factors like out-of-order execution.

For an in-memory copy-on-write tree, new nodes must be made visible to concurrent readers before the root pointer is updated. This is called a memory barrier and is analogous to fsync, although fsync is more than enforcing order.

Synchronization primitives such as mutexes, or any OS syscalls, will enforce memory ordering in a portable way, so you don’t have to mess with CPU-specific atomics or barriers (which are inadequate for concurrency anyway).
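
As an in-memory analogy (a sketch only, using atomic.Pointer from sync/atomic, Go 1.19+; the memNode type here is hypothetical and unrelated to our on-disk BNode), the atomic store of the root pointer is what publishes the newly built nodes to concurrent readers:

type memNode struct {
    keys [][]byte
    kids []*memNode
}

var memRoot atomic.Pointer[memNode]

// writer: build new nodes first, then swap the root pointer.
func publish(newRoot *memNode) {
    memRoot.Store(newRoot) // new nodes are visible before the new root is
}

// reader: one atomic load yields a consistent snapshot of the tree.
func snapshot() *memNode {
    return memRoot.Load()
}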

6.3 Database on a file

Although files are insufficient, databases can be built on top of them.

The file layout

Our DB is a single file divided into “pages”. Each page is a B+tree node, except for the 1st page, which contains the pointer to the latest root node and other auxiliary data; we call this the meta page.

|     the_meta_page    | pages... | root_node | pages... | (end_of_file)
| root_ptr | page_used |                ^                ^
      |            |                    |                |
      +------------+--------------------+                |
                   |                                     |
                   +-------------------------------------+

New nodes are simply appended like a log, but we cannot use the file size to count the number of pages, because after a power loss the file size (filesystem metadata) may become inconsistent with the file data. This is filesystem-dependent; we avoid the problem by storing the number of pages in the meta page.
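
With this layout, a page number translates to a file offset by simple multiplication (a helper sketch; BTREE_PAGE_SIZE is the constant from the B+tree chapters):

// page 0 is the meta page; node pages start at page 1.
func pageOffset(ptr uint64) int64 {
    return int64(ptr) * BTREE_PAGE_SIZE
}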

`fsync` on directory

As mentioned in chapter 01, fsync must be used on the parent directory after a rename. This is also true when creating new files, because there are 2 things to be made persistent: the file data, and the directory that references the file.

We’ll preemptively fsync after potentially creating a new file with O_CREATE. That way we don’t have to worry about it in later updates.

To fsync a directory, we must first get an fd by opening it in O_RDONLY mode.

// open or create a file and fsync the directory
func createFileSync(file string) (int, error) {
    // obtain the directory fd
    flags := os.O_RDONLY | syscall.O_DIRECTORY
    dirfd, err := syscall.Open(path.Dir(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open directory: %w", err)
    }
    defer syscall.Close(dirfd)
    // open or create the file
    flags = os.O_RDWR | os.O_CREATE
    fd, err := syscall.Openat(dirfd, path.Base(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open file: %w", err)
    }
    // fsync the directory
    err = syscall.Fsync(dirfd)
    if err != nil { // may leave an empty file
        _ = syscall.Close(fd)
        return -1, fmt.Errorf("fsync directory: %w", err)
    }
    // done
    return fd, nil
}

The directory fd can also be passed to openat to open the target file, which guarantees that the file comes from the same directory we opened before, in case someone else has replaced the directory in between (a race condition). This is not a concern in our case, since we don’t expect other processes to mess with the DB.

`mmap`, page cache and IO

mmap is a way to read/write a file as if it’s an in-memory buffer. Disk IO is implicit and automatic with mmap.

func Mmap(fd int, offset int64, length int, ...) (data []byte, err error)

To understand mmap, let’s review some operating system basics. An OS page is the minimum unit for mapping between virtual address and physical address. However, the virtual address space of a process is not fully backed by physical memory all the time; part of the process memory can be swapped to disk, and when the process tries to access it:

  1. The CPU triggers a page fault, which hands control to the OS.
  2. The OS then …
    1. Reads the swapped data into physical memory.
    2. Remaps the virtual address to it.
    3. Hands control back to the process.
  3. The process resumes with the virtual address mapped to real RAM.

mmap works in a similar way: the process gets an address range from mmap, and when it touches a page in that range, a page fault occurs; the OS reads the data into the page cache and remaps the page to it. That’s the automatic IO in a read-only scenario.

The CPU also takes note (via a dirty bit) when the process modifies a page, so the OS can write the page back to disk later. fsync is used to request and wait for that IO. This is writing data via mmap; it is not very different from write on Linux, because write goes through the same page cache.

There are pros and cons of using mmap in databases, which are outside the scope of a toy DB. We’ll use mmap because it’s just convenient.

6.4 Manage disk pages

Invoke `mmap`

To read a file via mmap, use the PROT_READ and MAP_SHARED flags.

// create the initial mmap that covers the whole file.
func initMmap(fd int, fileSize int64) ([]byte, error) {
    mmapSize := 64 << 20
    for mmapSize < int(fileSize) {
        mmapSize *= 2 // can be larger than the file
    }
    chunk, err := syscall.Mmap(
        fd, 0, mmapSize, syscall.PROT_READ, syscall.MAP_SHARED,
    )
    if err != nil {
        return nil, fmt.Errorf("mmap: %w", err)
    }
    return chunk, nil
}

The mapped range can be larger than the file, which is convenient because we don’t have to update the mapping each time the file grows. In x64, the virtual address space is 48-bit (256TB), and for most use cases, a single 1TB mmap will probably last forever. However, this isn’t viable on 32-bit systems.

Expand `mmap`

mremap can be used to remap a mapping to a larger range when the file is expanded. However, the start address may change, which can hinder concurrent readers in later chapters. Our solution is to add new mappings to cover the expanded file.

type KV struct {
    Path   string
    // internals
    fd   int
    tree BTree
    mmap struct {
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
}

Adding a new mapping each time the file is expanded results in lots of mappings, which is bad for performance because the OS has to keep track of them. This is avoided with exponential growth, as mmap can go beyond the current file size.

// extend the mmap by adding new mappings.
func extendMmap(db *KV, size int) error {
    for db.mmap.total < size {
        // double the address space
        chunk, err := syscall.Mmap(
            db.fd, int64(db.mmap.total), db.mmap.total,
            syscall.PROT_READ, syscall.MAP_SHARED,
        )
        if err != nil {
            return fmt.Errorf("mmap: %w", err)
        }

        db.mmap.total += db.mmap.total
        db.mmap.chunks = append(db.mmap.chunks, chunk)
    }
    return nil
}

Read from `mmap`

The BTree.get callback simply returns a slice of the mmap’ed buffer.

// `BTree.get`, read a page.
func (db *KV) pageGet(ptr uint64) BNode {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
        }
        start = end
    }
    panic("bad ptr")
}

Append pages

The BTree.new callback collects new page data in memory and returns page numbers allocated from the end of the file.

type KV struct {
    // ...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}

// `BTree.new`, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := db.page.flushed + uint64(len(db.page.temp)) // just append
    db.page.temp = append(db.page.temp, node.data)
    return ptr
}
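
The assert helper used above is not a Go builtin; if you don’t already have one from the earlier chapters, a minimal version is:

func assert(cond bool) {
    if !cond {
        panic("assertion failure")
    }
}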

The file is only updated after the data structure is updated.

// 1. Write new nodes.
func writePages(db *KV) error {
    // extend the mmap if needed
    size := (int(db.page.flushed) + len(db.page.temp)) * BTREE_PAGE_SIZE
    if err := extendMmap(db, size); err != nil {
        return err
    }
    // write data to the file
    offset := int64(db.page.flushed * BTREE_PAGE_SIZE)
    if _, err := unix.Pwritev(db.fd, db.page.temp, offset); err != nil {
        return err
    }
    // discard the in-memory data
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]
    return nil
}

New pages are appended with the pwritev syscall, which is a variant of write with an offset and multiple input buffers.
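
unix.Pwritev comes from the golang.org/x/sys/unix package. If it’s not available on your platform, here is a fallback sketch using one syscall.Pwrite per buffer (a hypothetical helper, not part of the book’s code):

// write the buffers back-to-back starting at `offset`.
func pwriteVec(fd int, bufs [][]byte, offset int64) error {
    for _, buf := range bufs {
        if _, err := syscall.Pwrite(fd, buf, offset); err != nil {
            return err
        }
        offset += int64(len(buf))
    }
    return nil
}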

6.5 The meta page

Open a database

Let’s review the first things to do with a DB.

// open or create a DB file
func (db *KV) Open() error {
    var err error
    var chunk []byte
    // B+tree callbacks
    db.tree.get = db.pageGet
    db.tree.new = db.pageNew
    db.tree.del = func(uint64) {}
    // open or create the DB file
    if db.fd, err = createFileSync(db.Path); err != nil {
        return err
    }
    // get the file size
    finfo := syscall.Stat_t{}
    if err = syscall.Fstat(db.fd, &finfo); err != nil {
        goto fail
    }
    // create the initial mmap
    if chunk, err = initMmap(db.fd, finfo.Size); err != nil {
        goto fail
    }
    db.mmap.total = len(chunk)
    db.mmap.chunks = [][]byte{chunk}
    // read the meta page
    if err = readRoot(db, finfo.Size); err != nil {
        goto fail
    }
    return nil
    // error
fail:
    db.Close()
    return fmt.Errorf("KV.Open: %w", err)
}

The remaining thing is the meta page.

Read the meta page

Besides the root pointer and the number of pages, we’ll also add some magic bytes to the meta page to identify the file type.

const DB_SIG = "BuildYourOwnDB06"

// the 1st page stores the root pointer and other auxiliary data.
// | sig | root_ptr | page_used |
// | 16B |    8B    |     8B    |
func readRoot(db *KV, fileSize int64) error {
    if fileSize%BTREE_PAGE_SIZE != 0 {
        return errors.New("file is not a multiple of pages")
    }
    if fileSize == 0 { // empty file
        db.page.flushed = 1 // the meta page is initialized on the 1st write
        return nil
    }
    // read the page
    data := db.mmap.chunks[0]
    db.tree.root = binary.LittleEndian.Uint64(data[16:])
    db.page.flushed = binary.LittleEndian.Uint64(data[24:])
    // verify the page
    bad := !bytes.Equal([]byte(DB_SIG), data[:16])
    // pointers are within range?
    maxpages := uint64(fileSize / BTREE_PAGE_SIZE)
    bad = bad || !(1 <= db.page.flushed && db.page.flushed <= maxpages)
    bad = bad || !(0 < db.tree.root && db.tree.root < db.page.flushed)
    if bad {
        return errors.New("bad meta page")
    }
    return nil
}

Update the meta page

Writing a small amount of page-aligned data to a real disk, modifying only a single sector, is likely power-loss-atomic at the hardware level. Some real databases depend on this. That’s how we update the meta page too.

// 3. Update the root pointer atomically.
func updateRoot(db *KV) error {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    // NOTE: atomic?
    _, err := syscall.Pwrite(db.fd, data[:], 0)
    if err != nil {
        return fmt.Errorf("write meta page: %w", err)
    }
    return nil
}

However, atomicity means different things at different levels, as you’ve seen with rename. write is not atomic w.r.t. concurrent readers at the system call level: a reader running concurrently with a write may observe a partially applied write, which is likely a consequence of how the page cache works.

We’ll consider read/write atomicity when we add concurrent transactions, but we have already seen a solution: in an LSM-tree, the 1st level is the only thing that is updated, and it’s duplicated as a MemTable, which moves the concurrency problem to memory. We can keep an in-memory copy of the meta page and synchronize it with a mutex, thus avoiding concurrent disk reads/writes.
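
A sketch of that idea (a hypothetical type, using sync; not part of this chapter’s code): keep the meta fields in memory behind a mutex, so readers never touch the on-disk meta page while it’s being written.

// an in-memory copy of the meta page, guarded by a mutex.
type metaCache struct {
    mu      sync.Mutex
    root    uint64
    flushed uint64
}

func (m *metaCache) load() (root uint64, flushed uint64) {
    m.mu.Lock()
    defer m.mu.Unlock()
    return m.root, m.flushed
}

func (m *metaCache) store(root uint64, flushed uint64) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.root, m.flushed = root, flushed
}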

Even if the hardware is not atomic w.r.t. power loss, atomicity is achievable with a log + checksum. We could alternate between 2 checksummed meta pages on each update, ensuring that at least one of them is intact after a power loss. This is called double buffering, which is effectively a rotating log with 2 entries.
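
A sketch of double buffering (not used in this book; the slot layout and size are made up for illustration): each update writes a version number, the meta fields, and a checksum to the slot not used by the previous update; recovery picks the valid slot with the highest version.

// hypothetical slot: | version (8B) | crc32 (4B) | root_ptr (8B) | page_used (8B) |
func writeMetaSlot(fd int, version uint64, root uint64, used uint64) error {
    var buf [28]byte
    binary.LittleEndian.PutUint64(buf[0:8], version)
    binary.LittleEndian.PutUint64(buf[12:20], root)
    binary.LittleEndian.PutUint64(buf[20:28], used)
    binary.LittleEndian.PutUint32(buf[8:12], crc32.ChecksumIEEE(buf[12:]))
    offset := int64(version%2) * 512 // alternate between 2 fixed slots
    if _, err := syscall.Pwrite(fd, buf[:], offset); err != nil {
        return fmt.Errorf("write meta slot: %w", err)
    }
    return syscall.Fsync(fd)
}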

6.6 Summary of the append-only KV store

What we have done:

  1. A file layout for the copy-on-write B+tree: node pages plus a meta page.
  2. Durable and atomic updates with the two-phase scheme (write new nodes, fsync, update the root, fsync).
  3. Disk page management with mmap and pwritev.

Putting the data structure on disk is a major milestone. The next thing is a free list, since an append-only file is simply impractical.
