06. Append-Only KV Store
6.1 What we will do
We’ll create a KV store with a copy-on-write B+tree backed by a file.
type KV struct {
    Path string // file name
    // internals
    fd   int
    tree BTree
    // more ...
}
func (db *KV) Open() error // open or create
func (db *KV) Get(key []byte) ([]byte, bool) {
return db.tree.Get(key)
}
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return updateFile(db)
}
func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, updateFile(db)
}
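To make the intended semantics concrete, here is a runnable sketch that exercises the same `Open`/`Get`/`Set`/`Del` method set against an in-memory, map-backed stand-in (not the real file-backed store):

```go
package main

import "fmt"

// KV is an in-memory stand-in with the same method set as the
// file-backed store, just to illustrate the intended semantics.
type KV struct {
    Path string // unused by this stand-in
    m    map[string]string
}

func (db *KV) Open() error { db.m = map[string]string{}; return nil }
func (db *KV) Get(key []byte) ([]byte, bool) {
    v, ok := db.m[string(key)]
    return []byte(v), ok
}
func (db *KV) Set(key []byte, val []byte) error {
    db.m[string(key)] = string(val)
    return nil
}
func (db *KV) Del(key []byte) (bool, error) {
    _, ok := db.m[string(key)]
    delete(db.m, string(key))
    return ok, nil
}

func main() {
    db := &KV{Path: "test.db"}
    if err := db.Open(); err != nil {
        panic(err)
    }
    _ = db.Set([]byte("a"), []byte("1"))
    v, ok := db.Get([]byte("a"))
    fmt.Println(ok, string(v))
    deleted, _ := db.Del([]byte("a"))
    fmt.Println(deleted)
}
```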
The scope of this chapter is durability + atomicity:
- The file is append-only; space reuse is left to the next chapter.
- We will ignore concurrency and assume sequential access within 1 process.
We’ll implement the 3 B+tree callbacks that deal with disk pages:
type BTree struct {
    root uint64
    get  func(uint64) []byte // read a page
    new  func([]byte) uint64 // append a page
    del  func(uint64)        // ignored in this chapter
}
6.2 Two-phase update
Atomicity + durability
As discussed in chapter 03, for a copy-on-write tree, the root pointer is updated atomically, then `fsync` is used to request and confirm durability.

The atomicity of the root pointer itself is insufficient; to make the whole tree atomic, new nodes must be persisted before the root pointer. And the write order is not the order in which the data is persisted, due to factors like caching. So another `fsync` is used to enforce the order.
func updateFile(db *KV) error {
    // 1. Write new nodes.
    if err := writePages(db); err != nil {
        return err
    }
    // 2. `fsync` to enforce the order between 1 and 3.
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    // 3. Update the root pointer atomically.
    if err := updateRoot(db); err != nil {
        return err
    }
    // 4. `fsync` to make everything persistent.
    return syscall.Fsync(db.fd)
}
Alternative: durability with a log
The alternative double-write scheme also has 2 `fsync`'ed phases:

1. Write the updated pages with checksums.
2. `fsync` to make the update persistent (for crash recovery).
3. Update the data in-place (apply the double-writes).
4. `fsync` for the order between 3 and 1 (reuse or delete the double-writes).
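The phases above can be sketched in the same shape as the `updateFile` function earlier. This is non-runnable pseudocode, and the helper names (`writeDoubleWriteBuffer`, `applyDoubleWrites`) are hypothetical:

```go
// Pseudocode: the double-write counterpart of `updateFile`.
// Unlike copy-on-write, durability is reached after step 2,
// so success can be reported before steps 3-4 finish.
func updateDoubleWrite(db *KV) error {
    // 1. Write the updated pages with checksums to the double-write buffer.
    if err := writeDoubleWriteBuffer(db); err != nil {
        return err
    }
    // 2. `fsync`: the update is now persistent (crash recovery replays it).
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    // 3. Apply the buffered pages to their in-place locations.
    if err := applyDoubleWrites(db); err != nil {
        return err
    }
    // 4. `fsync` for the order between 3 and 1 (the buffer can be reused).
    return syscall.Fsync(db.fd)
}
```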
A difference with copy-on-write is the order of the phases: the data is persistent after the 1st `fsync`; the DB can return success and do the rest in the background.
The double-write is comparable to a log, which also needs only 1 `fsync` for an update. And it can be an actual log that buffers multiple updates, which improves performance. This is another example of logs in DBs, besides the LSM-tree.
We won’t use a log as copy-on-write doesn’t need it. But a log still offers the benefits discussed above; it’s one of the reasons logs are ubiquitous in databases.
Concurrency of in-memory data
Atomicity for in-memory data (w.r.t. concurrency) can be achieved with a mutex (lock) or some atomic CPU instructions. There is a similar problem: memory reads/writes may not appear in order due to factors like out-of-order execution.
For an in-memory copy-on-write tree, new nodes must be made visible to concurrent readers before the root pointer is updated. This is called a memory barrier and is analogous to `fsync`, although `fsync` is more than enforcing order.

Synchronization primitives such as mutexes, or any OS syscalls, will enforce memory ordering in a portable way, so you don't have to mess with CPU-specific atomics or barriers (which are inadequate for concurrency anyway).
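In Go, `sync/atomic` provides both the atomicity and the ordering: `atomic.Pointer.Store` is a release store, so nodes created before it are visible to any reader that `Load`s the new root. A minimal sketch, with a copy-on-write linked list standing in for the tree:

```go
package main

import (
    "fmt"
    "sync/atomic"
)

type node struct {
    key  string
    val  string
    next *node // immutable once published
}

var root atomic.Pointer[node] // readers Load this without locks

// insert copies the update path (here: just prepends a node) and then
// publishes the new root; the Store acts like the atomic root-pointer
// switch, and its release semantics play the role of the first fsync.
func insert(key, val string) {
    n := &node{key: key, val: val, next: root.Load()}
    root.Store(n)
}

func get(key string) (string, bool) {
    for n := root.Load(); n != nil; n = n.next {
        if n.key == key {
            return n.val, true
        }
    }
    return "", false
}

func main() {
    insert("a", "1")
    insert("b", "2")
    v, ok := get("a")
    fmt.Println(v, ok)
}
```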
6.3 Database on a file
The file layout
Our DB is a single file divided into “pages”. Each page is a B+tree node, except for the 1st page: it contains the pointer to the latest root node and other auxiliary data; we call this the meta page.
|  the_meta_page  | pages... | root_node | pages... | (end_of_file)
| root_ptr | page_used |           ^                 ^
     |          |                  |                 |
     +----------|------------------+                 |
                +------------------------------------+
New nodes are simply appended like a log, but we cannot use the file size to count the number of pages, because after a power loss the file size (metadata) may become inconsistent with the file data. This is filesystem-dependent; we avoid it by storing the number of pages in the meta page.
`fsync` on directory when creating a file
As mentioned in chapter 01, `fsync` must be used on the parent directory after a `rename`. This is also true when creating new files, because there are 2 things to be made persistent: the file data, and the directory that references the file.

We'll preemptively `fsync` after potentially creating a new file with `O_CREATE`. To `fsync` a directory, `open` the directory in `O_RDONLY` mode.
func createFileSync(file string) (int, error) {
    // obtain the directory fd
    flags := os.O_RDONLY | syscall.O_DIRECTORY
    dirfd, err := syscall.Open(path.Dir(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open directory: %w", err)
    }
    defer syscall.Close(dirfd)
    // open or create the file
    flags = os.O_RDWR | os.O_CREATE
    fd, err := syscall.Openat(dirfd, path.Base(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open file: %w", err)
    }
    // fsync the directory
    if err = syscall.Fsync(dirfd); err != nil {
        _ = syscall.Close(fd) // may leave an empty file
        return -1, fmt.Errorf("fsync directory: %w", err)
    }
    return fd, nil
}
The directory fd can be used by `openat` to open the target file, which guarantees that the file is from the same directory we opened before, in case the directory path is replaced in between (race condition). This is only a minor concern; moving a directory with a running DB in it is not a sane scenario.
`mmap`, page cache and IO
`mmap` is a way to read/write a file as if it were an in-memory buffer. Disk IO is implicit and automatic with `mmap`.
func Mmap(fd int, offset int64, length int, ...) (data []byte, err error)
To understand `mmap`, let's review some operating system basics. An OS page is the minimum unit for mapping between virtual addresses and physical addresses. However, the virtual address space of a process is not fully backed by physical memory all the time; part of the process memory can be swapped to disk, and when the process tries to access it:
- The CPU triggers a page fault, which hands control to the OS.
- The OS then …
- Reads the swapped data into physical memory.
- Remaps the virtual address to it.
- Hands control back to the process.
- The process resumes with the virtual address mapped to real RAM.
`mmap` works in a similar way: the process gets an address range from `mmap`, and when it touches a page in it, it page faults, and the OS reads the data into a cache and remaps the page to the cache. That's the automatic IO in a read-only scenario.
The CPU also takes note (called a dirty bit) when the process modifies a page, so the OS can write the page back to disk later. `fsync` is used to request and wait for the IO. This is writing data via `mmap`; it is not very different from `write` on Linux, because `write` goes through the same page cache.
You don't have to use `mmap` at all, but it's important to know the basics.
6.4 Manage disk pages
We'll use `mmap` for these page management callbacks because it's just convenient.
func (db *KV) Open() error {
    db.tree.get = db.pageRead   // read a page
    db.tree.new = db.pageAppend // append a page
    db.tree.del = func(uint64) {}
    // ...
}
Invoke `mmap`
A file-backed `mmap` can be either read-only, read-write, or copy-on-write. To create a read-only `mmap`, use the `PROT_READ` and `MAP_SHARED` flags.
syscall.Mmap(fd, offset, size, syscall.PROT_READ, syscall.MAP_SHARED)
The mapped range can be larger than the current file size; we can exploit this fact because the file will grow.
`mmap` a growing file
`mremap` remaps a mapping to a larger range, like `realloc`. That's one way to deal with the growing file. However, the address may change, which can hinder concurrent readers in later chapters. Our solution is to add new mappings to cover the expanded file.
type KV struct {
    // ...
    mmap struct {
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
}
// `BTree.get`, read a page.
func (db *KV) pageRead(ptr uint64) []byte {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return chunk[offset : offset+BTREE_PAGE_SIZE]
        }
        start = end
    }
    panic("bad ptr")
}
Adding a new mapping each time the file is expanded results in lots of mappings, which is bad for performance because the OS has to keep track of them. This is avoided with exponential growth, since `mmap` can go beyond the file size.
func extendMmap(db *KV, size int) error {
    if size <= db.mmap.total {
        return nil // enough range
    }
    alloc := max(db.mmap.total, 64<<20) // double the current address space
    for db.mmap.total+alloc < size {
        alloc *= 2 // still not enough?
    }
    chunk, err := syscall.Mmap(
        db.fd, int64(db.mmap.total), alloc,
        syscall.PROT_READ, syscall.MAP_SHARED, // read-only
    )
    if err != nil {
        return fmt.Errorf("mmap: %w", err)
    }
    db.mmap.total += alloc
    db.mmap.chunks = append(db.mmap.chunks, chunk)
    return nil
}
You may wonder why not just create a very large mapping (say, 1TB) and forget about the growing file, since an unrealized virtual address costs nothing. This is OK for a toy DB on 64-bit systems.
Capture page updates
The `BTree.new` callback collects new pages from B+tree updates, and allocates the page number from the end of the DB.
type KV struct {
    // ...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}

func (db *KV) pageAppend(node []byte) uint64 {
    ptr := db.page.flushed + uint64(len(db.page.temp)) // just append
    db.page.temp = append(db.page.temp, node)
    return ptr
}
These pages are written (appended) to the file after B+tree updates.
func writePages(db *KV) error {
    // extend the mmap if needed
    size := (int(db.page.flushed) + len(db.page.temp)) * BTREE_PAGE_SIZE
    if err := extendMmap(db, size); err != nil {
        return err
    }
    // write data pages to the file
    offset := int64(db.page.flushed * BTREE_PAGE_SIZE)
    if _, err := unix.Pwritev(db.fd, db.page.temp, offset); err != nil {
        return err
    }
    // discard in-memory data
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]
    return nil
}
`pwritev` is a variant of `write` with an offset and multiple input buffers. We have to control the offset because we also need to write the meta page later. Multiple input buffers are combined by the kernel.
6.5 The meta page
Read the meta page
We’ll also add some magic bytes to the meta page to identify the file type.
const DB_SIG = "BuildYourOwnDB06" // not compatible between chapters
// | sig | root_ptr | page_used |
// | 16B | 8B | 8B |
func saveMeta(db *KV) []byte {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    return data[:]
}
func loadMeta(db *KV, data []byte)
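The body of `loadMeta` is just the inverse of `saveMeta`. A self-contained sketch, with a minimal stand-in for the `KV` fields (the real struct nests them under `tree` and `page`, and signature verification happens in `readRoot`):

```go
package main

import (
    "encoding/binary"
    "fmt"
)

const DB_SIG = "BuildYourOwnDB06"

// minimal stand-in for the KV fields the meta page covers
type KV struct {
    root    uint64 // db.tree.root in the real struct
    flushed uint64 // db.page.flushed in the real struct
}

// | sig | root_ptr | page_used |
// | 16B |    8B    |    8B     |
func saveMeta(db *KV) []byte {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.root)
    binary.LittleEndian.PutUint64(data[24:], db.flushed)
    return data[:]
}

// loadMeta decodes the root pointer and the page count;
// the signature check is left to the caller.
func loadMeta(db *KV, data []byte) {
    db.root = binary.LittleEndian.Uint64(data[16:])
    db.flushed = binary.LittleEndian.Uint64(data[24:])
}

func main() {
    a := &KV{root: 3, flushed: 7}
    b := &KV{}
    loadMeta(b, saveMeta(a)) // round-trip through the 32-byte layout
    fmt.Println(b.root, b.flushed)
}
```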
Even if the file is empty, page 0 is reserved for the meta page.
func readRoot(db *KV, fileSize int64) error {
    if fileSize == 0 { // empty file
        db.page.flushed = 1 // the meta page is initialized on the 1st write
        return nil
    }
    // read the page
    data := db.mmap.chunks[0]
    loadMeta(db, data)
    // verify the page
    // ...
    return nil
}
Update the meta page
Writing a small amount of page-aligned data to a real disk, modifying only a single sector, is likely power-loss-atomic at the hardware level. Some real databases depend on this. That’s how we update the meta page too.
// 3. Update the meta page. It must be atomic.
func updateRoot(db *KV) error {
    if _, err := syscall.Pwrite(db.fd, saveMeta(db), 0); err != nil {
        return fmt.Errorf("write meta page: %w", err)
    }
    return nil
}
However, atomicity means different things at different levels, as you've seen with `rename`. `write` is not atomic w.r.t. concurrent readers at the system call level. This is likely due to how the page cache works.
We'll consider read/write atomicity when we add concurrent transactions, but we have already seen a solution: in an LSM-tree, the 1st level is the only thing that is updated, and it's duplicated as a MemTable, which moves the concurrency problem to memory. We can keep an in-memory copy of the meta page and synchronize it with a mutex, thus avoiding concurrent disk reads/writes.
Even if the hardware is not atomic w.r.t. power loss, atomicity is achievable with a log + checksum. We could switch between 2 checksummed meta pages for each update, to ensure that one of them is good after a power loss. This is called double buffering, which is a rotating log with 2 entries.
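A hedged sketch of the double-buffering idea (the slot layout and names are illustrative, not this chapter's format): each slot carries a version and a checksum; the writer alternates slots, and the reader picks the newest slot whose checksum verifies.

```go
package main

import (
    "encoding/binary"
    "fmt"
    "hash/crc32"
)

// A meta slot: | version 8B | payload 8B | crc32 4B |.
// Two slots alternate; after a torn write of one slot, the other
// still has a valid checksum and is used instead.
func encodeSlot(version, payload uint64) []byte {
    buf := make([]byte, 20)
    binary.LittleEndian.PutUint64(buf[0:], version)
    binary.LittleEndian.PutUint64(buf[8:], payload)
    binary.LittleEndian.PutUint32(buf[16:], crc32.ChecksumIEEE(buf[:16]))
    return buf
}

// pickSlot returns the payload of the newest slot with a good checksum.
func pickSlot(slots [2][]byte) (uint64, bool) {
    best, ok := uint64(0), false
    var bestVer uint64
    for _, s := range slots {
        if crc32.ChecksumIEEE(s[:16]) != binary.LittleEndian.Uint32(s[16:]) {
            continue // torn or corrupted slot
        }
        ver := binary.LittleEndian.Uint64(s[0:])
        if !ok || ver > bestVer {
            best, bestVer, ok = binary.LittleEndian.Uint64(s[8:]), ver, true
        }
    }
    return best, ok
}

func main() {
    slots := [2][]byte{encodeSlot(1, 100), encodeSlot(2, 200)}
    slots[0] = encodeSlot(3, 300)
    slots[0][5] ^= 0xff // simulate a torn write of slot 0
    v, ok := pickSlot(slots)
    fmt.Println(v, ok) // falls back to the intact slot (version 2)
}
```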
6.6 Error handling
Scenarios after IO errors
The bare minimum of error handling is to propagate errors with `if err != nil`. Next, consider the possibility of using the DB after an IO error (`fsync` or `write`).
- Read after a failed update? It should behave as if nothing happened.
- Update it again after a failure?
  - If the error persists, it's expected to fail again.
  - If the error is temporary, can we recover from the previous error?
- Restart after the problem is resolved? See crash recovery in chapter 03.
Revert to the previous version
There is a survey on the handling of `fsync` failures, from which we can learn that the topic is filesystem-dependent. If we read after an `fsync` failure, some filesystems return the failed data, as the page cache doesn't match the disk. So reading back failed writes is problematic.
But since we’re copy-on-write, this is not a problem; we can revert to the old tree root to avoid the problematic data. The tree root is stored in the meta page, but we never read the meta page from disk after opening a DB, so we’ll just revert the in-memory root pointer.
func (db *KV) Set(key []byte, val []byte) error {
    meta := saveMeta(db) // save the in-memory state (tree root)
    if err := db.tree.Insert(key, val); err != nil {
        return err // length limit
    }
    return updateOrRevert(db, meta)
}
func updateOrRevert(db *KV, meta []byte) error {
    // 2-phase update
    err := updateFile(db)
    // revert on error
    if err != nil {
        // the in-memory states can be reverted immediately to allow reads
        loadMeta(db, meta)
        // discard temporaries
        db.page.temp = db.page.temp[:0]
    }
    return err
}
So after a write failure, it's still possible to use the DB in read-only mode. Reads can also fail, but since we're using `mmap`, on a read error the process is simply killed by `SIGBUS`. That's one of the drawbacks of `mmap`.
Recover from temporary write errors
Some write errors are temporary, such as “no space left”. If an update fails and then the next succeeds, the end state is still good. The problem is the intermediate state: between the 2 updates, the content of the meta page on disk is unknown!
If `fsync` fails on the meta page, the meta page on disk can be either the new or the old version, while the in-memory tree root is the old version. So the 2nd, successful update will overwrite the data pages of the newer version, and a crash at that point can leave the DB in a corrupted intermediate state.
The solution is to rewrite the last known meta page on recovery.
type KV struct {
    // ...
    failed bool // Did the last update fail?
}

func updateOrRevert(db *KV, meta []byte) error {
    // ensure the on-disk meta page matches the in-memory one after an error
    if db.failed {
        // write and fsync the previous meta page
        // ...
        db.failed = false
    }
    err := updateFile(db)
    if err != nil {
        // the on-disk meta page is in an unknown state;
        // mark it to be rewritten on later recovery.
        db.failed = true
        // ...
    }
    return err
}
We rely on filesystems to report errors correctly, but there is evidence that they don't always do so. So whether the system as a whole can handle errors is still doubtful.
6.7 Summary of the append-only KV store
- File layout for a copy-on-write B+tree.
- Durability and atomicity with `fsync`.
- Error handling.
A B+tree on disk is a major step. We just have to add a free list to make it practical.