06. Append-Only KV Store
6.1 What we will do
We’ll create a KV store with a copy-on-write B+tree backed by a file.
type KV struct {
    Path string // file name
    // internals
    fd   int
    tree BTree
    // more ...
}
func (db *KV) Open() error
func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return updateFile(db)
}
func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, updateFile(db)
}
The scope of this chapter is durability + atomicity:
- The file is append-only; space reuse is left to the next chapter.
- We will ignore concurrency and assume sequential access within 1 process.
We’ll implement the 3 B+tree callbacks that deal with disk pages:
type BTree struct {
    root uint64
    get  func(uint64) BNode  // read a page
    new  func(BNode) uint64  // append a page
    del  func(uint64)        // ignored in this chapter
}
6.2 Two-phase update
Atomicity + durability
As discussed in chapter 03, for a copy-on-write tree, the root pointer is updated atomically. Then `fsync` is used to request and confirm durability.
The atomicity of the root pointer itself is insufficient; to make the whole tree atomic, new nodes must be persisted before the root pointer. And the write order is not the order in which the data is persisted, due to factors like caching. So another `fsync` is used to enforce the order.
func updateFile(db *KV) error {
    // 1. Write new nodes.
    if err := writePages(db); err != nil {
        return err
    }
    // 2. `fsync` to enforce the order between 1 and 3.
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    // 3. Update the root pointer atomically.
    if err := updateRoot(db); err != nil {
        return err
    }
    // 4. `fsync` to make everything persistent.
    if err := syscall.Fsync(db.fd); err != nil {
        return err
    }
    return nil
}
Durability and log
The double-write scheme also has 2 `fsync`’ed phases:
1. Write the updated pages with checksums.
2. `fsync` to make the update persistent (for crash recovery).
3. Update the data in-place (apply the double-writes).
4. `fsync` for the order between 3 and 1 (reuse or delete the double-writes).
A difference with copy-on-write is the order of the phases: the data is persistent after the 1st `fsync`; the DB can respond to the client after only 1 `fsync` and do the rest in the background.
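To make the order concrete, here’s a rough sketch of the 4 phases above. It is not part of our KV store; the function, the reserved double-write area at `dwOffset`, and the equal-sized pages are all assumptions for illustration.
// Sketch only: the 4 phases of a double-write update, assuming all
// updated pages have the same size and a reserved double-write area
// at `dwOffset`. A real implementation also records a checksum and
// the home offsets inside the double-write area.
func doubleWriteUpdate(fd int, pages [][]byte, homeOffsets []int64, dwOffset int64) error {
    // 1. Write the updated pages to the double-write area.
    for i, page := range pages {
        if _, err := syscall.Pwrite(fd, page, dwOffset+int64(i*len(page))); err != nil {
            return err
        }
    }
    // 2. `fsync`: the update is now durable; the DB could already
    //    acknowledge the client here.
    if err := syscall.Fsync(fd); err != nil {
        return err
    }
    // 3. Apply the pages to their home locations (in-place update).
    for i, page := range pages {
        if _, err := syscall.Pwrite(fd, page, homeOffsets[i]); err != nil {
            return err
        }
    }
    // 4. `fsync` for the order between 3 and 1, so the double-write
    //    area can be reused or deleted.
    return syscall.Fsync(fd)
}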
The double-write is comparable to a log, which also needs only 1 `fsync` for an update. And it can be implemented as an actual log to buffer multiple updates, which improves performance. This is another example of using a log in databases, besides the LSM-tree.
We won’t use a log, as copy-on-write doesn’t need it. But a log still offers the benefits discussed above; it’s one of the reasons logs are ubiquitous in databases.
Concurrency of in-memory data
Atomicity for in-memory data (w.r.t. concurrency) can be achieved with a mutex (lock) or some atomic CPU instructions. There is a similar problem: memory reads/writes may not appear in order due to factors like out-of-order execution.
For an in-memory copy-on-write tree, new nodes must be made visible to concurrent readers before the root pointer is updated. This is called a memory barrier and is analogous to `fsync`, although `fsync` does more than enforce ordering.
Synchronization primitives such as mutexes, or any OS syscalls, will enforce memory ordering in a portable way, so you don’t have to mess with CPU-specific atomics or barriers (which are inadequate for concurrency anyway).
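As a rough illustration (the `memNode` and `memTree` types below are made up, not the B+tree in this book), publishing a new in-memory copy-on-write root under a mutex looks like this; the mutex provides the ordering, so readers never see the new root before the nodes it references.
// memNode is a stand-in for a real in-memory tree node.
type memNode struct {
    key, val []byte
}

type memTree struct {
    mu   sync.Mutex
    root *memNode
}

// Writer: build the new nodes first, then publish the new root.
// Releasing the mutex orders the node writes before the root update.
func (t *memTree) publish(newRoot *memNode) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.root = newRoot
}

// Reader: take a snapshot of the root pointer. Since old and new
// nodes are immutable, the lookup itself needs no lock.
func (t *memTree) snapshot() *memNode {
    t.mu.Lock()
    defer t.mu.Unlock()
    return t.root
}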
6.3 Database on a file
Although files are insufficient, databases can be built on top of them.
The file layout
Our DB is a single file divided into “pages”. Each page is a B+tree node, except for the 1st page: it contains the pointer to the latest root node and other auxiliary data. We call this the meta page.
| the_meta_page | pages... | root_node | pages... | (end_of_file)
| root_ptr | page_used |         ^                  ^
     |          |                |                  |
     +----------|----------------+                  |
                |                                   |
                +-----------------------------------+
New nodes are simply appended like a log, but we cannot use the file size to count the number of pages, because after a power loss the file size (filesystem metadata) may become inconsistent with the file data. This behavior is filesystem-dependent, so we avoid it by storing the number of pages in the meta page.
`fsync` on directory
As mentioned in chapter 01, `fsync` must be used on the parent directory after a `rename`. This is also true when creating new files, because there are 2 things to be made persistent: the file data, and the directory that references the file.
We’ll preemptively `fsync` after potentially creating a new file with `O_CREATE`. That way we don’t have to worry about it in later updates.
To `fsync` a directory, we must first get an fd by `open`ing it in `O_RDONLY` mode.
// open or create a file and fsync the directory
func createFileSync(file string) (int, error) {
    // obtain the directory fd
    flags := os.O_RDONLY | syscall.O_DIRECTORY
    dirfd, err := syscall.Open(path.Dir(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open directory: %w", err)
    }
    defer syscall.Close(dirfd)
    // open or create the file
    flags = os.O_RDWR | os.O_CREATE
    fd, err := syscall.Openat(dirfd, path.Base(file), flags, 0o644)
    if err != nil {
        return -1, fmt.Errorf("open file: %w", err)
    }
    // fsync the directory
    err = syscall.Fsync(dirfd)
    if err != nil { // may leave an empty file
        _ = syscall.Close(fd)
        return -1, fmt.Errorf("fsync directory: %w", err)
    }
    // done
    return fd, nil
}
The directory fd can also be used by `openat` to open the target file, which guarantees that the file is from the same directory we opened before, in case someone else has replaced the directory in between (race condition). This is not a concern in our case, since we don’t expect the DB to be messed with by other processes.
`mmap`, page cache and IO
`mmap` is a way to read/write a file as if it’s an in-memory buffer. Disk IO is implicit and automatic with `mmap`.
func Mmap(fd int, offset int64, length int, ...) (data []byte, err error)
To understand `mmap`, let’s review some operating system
basics. An OS page is the minimum unit for mapping between virtual
address and physical address. However, the virtual address space of a
process is not fully backed by physical memory all the time; part of the
process memory can be swapped to disk, and when the process tries to
access it:
- The CPU triggers a page fault, which hands control to the OS.
- The OS then reads the swapped data into physical memory, remaps the virtual address to it, and hands control back to the process.
- The process resumes with the virtual address mapped to real RAM.
`mmap` works in a similar way: the process gets an address range from `mmap`, and when it touches a page in it, it page faults and the OS reads the data into a cache and remaps the page to the cache. That’s the automatic IO in a read-only scenario.
The CPU also takes note (via a dirty bit) when the process modifies a page, so the OS can write the page back to disk later. `fsync` is used to request and wait for that IO. This is writing data via `mmap`; it is not very different from `write` on Linux, because `write` goes to the same page cache.
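As a small illustration (not how our KV writes data), modifying a writable `MAP_SHARED` mapping dirties page-cache pages just like `write` does, and `fsync` flushes them. The `writeViaMmap` function below is a sketch, assuming the file is already at least `fileSize` bytes long.
// Sketch only: write through a mapping and flush it with fsync.
// Assumes fileSize > 0 and the file is at least fileSize bytes.
func writeViaMmap(fd int, fileSize int) error {
    buf, err := syscall.Mmap(fd, 0, fileSize,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
    if err != nil {
        return err
    }
    defer syscall.Munmap(buf)
    copy(buf, "hello")       // dirties the first page in the page cache
    return syscall.Fsync(fd) // request and wait for the write-back
}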
There are pros and cons of using `mmap` in databases, which are outside the scope of a toy DB. We’ll use `mmap` because it’s just convenient.
6.4 Manage disk pages
Invoke `mmap`
To read a file via `mmap`, use the `PROT_READ` and `MAP_SHARED` flags.
// create the initial mmap that covers the whole file.
func initMmap(fd int, fileSize int64) ([]byte, error) {
    mmapSize := 64 << 20
    for mmapSize < int(fileSize) {
        mmapSize *= 2 // can be larger than the file
    }
    chunk, err := syscall.Mmap(
        fd, 0, mmapSize, syscall.PROT_READ, syscall.MAP_SHARED,
    )
    if err != nil {
        return nil, fmt.Errorf("mmap: %w", err)
    }
    return chunk, nil
}
The mapped range can be larger than the file, which is convenient
because we don’t have to update the mapping each time the file grows. In
x64, the virtual address space is 48-bit (256TB), and for most use
cases, a single 1TB mmap
will probably last forever.
However, this isn’t viable on 32-bit systems.
Expand `mmap`
`mremap` can be used to remap a mapping to a larger range when the file is expanded. However, the start address may change, which can hinder concurrent readers in later chapters. Our solution is to add new mappings to cover the expanded file.
type KV struct {
    Path string
    // internals
    fd   int
    tree BTree
    mmap struct {
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
}
Adding a new mapping each time the file is expanded results in lots of mappings, which is bad for performance because the OS has to keep track of them. This is avoided with exponential growth, as `mmap` can go beyond the current file size.
// extend the mmap by adding new mappings.
func extendMmap(db *KV, size int) error {
    for db.mmap.total < size {
        // double the address space
        chunk, err := syscall.Mmap(
            db.fd, int64(db.mmap.total), db.mmap.total,
            syscall.PROT_READ, syscall.MAP_SHARED,
        )
        if err != nil {
            return fmt.Errorf("mmap: %w", err)
        }
        db.mmap.total += db.mmap.total
        db.mmap.chunks = append(db.mmap.chunks, chunk)
    }
    return nil
}
Read from `mmap`
The `BTree.get` callback simply returns a slice of the `mmap`’ed buffer.
// `BTree.get`, read a page.
func (db *KV) pageGet(ptr uint64) BNode {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
        }
        start = end
    }
    panic("bad ptr")
}
Append pages
The `BTree.new` callback collects new page data in memory, and allocates and returns the page number from the end of the file.
type KV struct {
    // ...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}
// `BTree.new`, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := db.page.flushed + uint64(len(db.page.temp)) // just append
    db.page.temp = append(db.page.temp, node.data)
    return ptr
}
The file is only updated after the data structure is updated.
// 1. Write new nodes.
func writePages(db *KV) error {
    // extend the mmap if needed
    size := (int(db.page.flushed) + len(db.page.temp)) * BTREE_PAGE_SIZE
    if err := extendMmap(db, size); err != nil {
        return err
    }
    // write data to the file
    offset := int64(db.page.flushed * BTREE_PAGE_SIZE)
    if _, err := unix.Pwritev(db.fd, db.page.temp, offset); err != nil {
        return err
    }
    // discard the in-memory data
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]
    return nil
}
New pages are appended with the `pwritev` syscall, which is a variant of `write` with an offset and multiple input buffers.
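For comparison, a rough equivalent using plain `pwrite` in a loop is sketched below (`pwriteAll` is illustrative, not part of our KV store); `pwritev` does the same work in a single system call.
// Sketch only: write the buffers back-to-back starting at `offset`,
// one pwrite syscall per buffer instead of one syscall in total.
func pwriteAll(fd int, bufs [][]byte, offset int64) error {
    for _, b := range bufs {
        if _, err := syscall.Pwrite(fd, b, offset); err != nil {
            return err
        }
        offset += int64(len(b))
    }
    return nil
}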
6.5 The meta page
Open a database
Let’s review the first things to do with a DB.
// open or create a DB file
func (db *KV) Open() error {
    var err error
    var chunk []byte
    // B+tree callbacks
    db.tree.get = db.pageGet
    db.tree.new = db.pageNew
    db.tree.del = func(uint64) {}
    // open or create the DB file
    if db.fd, err = createFileSync(db.Path); err != nil {
        return err
    }
    // get the file size
    finfo := syscall.Stat_t{}
    if err = syscall.Fstat(db.fd, &finfo); err != nil {
        goto fail
    }
    // create the initial mmap
    if chunk, err = initMmap(db.fd, finfo.Size); err != nil {
        goto fail
    }
    db.mmap.total = len(chunk)
    db.mmap.chunks = [][]byte{chunk}
    // read the meta page
    if err = readRoot(db, finfo.Size); err != nil {
        goto fail
    }
    return nil
    // error
fail:
    db.Close()
    return fmt.Errorf("KV.Open: %w", err)
}
The remaining thing is the meta page.
Read the meta page
Besides the root pointer and the number of pages, we’ll also add some magic bytes to the meta page to identify the file type.
const DB_SIG = "BuildYourOwnDB06"
// the 1st page stores the root pointer and other auxiliary data.
// | sig | root_ptr | page_used |
// | 16B | 8B | 8B |
func readRoot(db *KV, fileSize int64) error {
    if fileSize%BTREE_PAGE_SIZE != 0 {
        return errors.New("file is not a multiple of pages")
    }
    if fileSize == 0 { // empty file
        db.page.flushed = 1 // the meta page is initialized on the 1st write
        return nil
    }
    // read the page
    data := db.mmap.chunks[0]
    db.tree.root = binary.LittleEndian.Uint64(data[16:])
    db.page.flushed = binary.LittleEndian.Uint64(data[24:])
    // verify the page
    bad := !bytes.Equal([]byte(DB_SIG), data[:16])
    // pointers are within range?
    maxpages := uint64(fileSize / BTREE_PAGE_SIZE)
    bad = bad || !(1 <= db.page.flushed && db.page.flushed <= maxpages)
    bad = bad || !(0 < db.tree.root && db.tree.root < db.page.flushed)
    if bad {
        return errors.New("bad meta page")
    }
    return nil
}
Update the meta page
Writing a small amount of page-aligned data to a real disk, modifying only a single sector, is likely power-loss-atomic at the hardware level. Some real databases depend on this. That’s how we update the meta page too.
// 3. Update the root pointer atomically.
func updateRoot(db *KV) error {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    // NOTE: atomic?
    _, err := syscall.Pwrite(db.fd, data[:], 0)
    if err != nil {
        return fmt.Errorf("write meta page: %w", err)
    }
    return nil
}
However, atomicity means different things at different levels, as you’ve seen with `rename`. `write` is not atomic w.r.t. concurrent readers at the system call level; a concurrent reader may observe a half-applied write. This is likely a consequence of how the page cache works.
We’ll consider read/write atomicity when we add concurrent transactions, but we have already seen a solution: in an LSM-tree, the 1st level is the only thing that is updated, and it’s duplicated as a MemTable, which moves the concurrency problem to memory. Likewise, we can keep an in-memory copy of the meta page and synchronize it with a mutex, thus avoiding concurrent disk reads/writes.
Even if the hardware is not atomic w.r.t. power loss, atomicity is achievable with a log + checksum: we could switch between 2 checksummed meta pages for each update, to ensure that one of them is good after a power loss. This is called double buffering, which is effectively a rotating log with 2 entries.
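Here’s a rough sketch of such double buffering (not used by our KV store; the slot layout and the CRC32 checksum are assumptions): each of the 2 slots carries a version and a checksum, the reader picks the newest valid slot, and the writer overwrites the other one.
// Sketch only: a meta slot is | version 8B | root 8B | used 8B | crc32 4B |.
type metaSlot struct {
    version, root, used uint64
}

func encodeSlot(m metaSlot) []byte {
    buf := make([]byte, 28)
    binary.LittleEndian.PutUint64(buf[0:], m.version)
    binary.LittleEndian.PutUint64(buf[8:], m.root)
    binary.LittleEndian.PutUint64(buf[16:], m.used)
    binary.LittleEndian.PutUint32(buf[24:], crc32.ChecksumIEEE(buf[:24]))
    return buf
}

func decodeSlot(buf []byte) (metaSlot, bool) {
    if crc32.ChecksumIEEE(buf[:24]) != binary.LittleEndian.Uint32(buf[24:]) {
        return metaSlot{}, false // torn or corrupted write
    }
    return metaSlot{
        version: binary.LittleEndian.Uint64(buf[0:]),
        root:    binary.LittleEndian.Uint64(buf[8:]),
        used:    binary.LittleEndian.Uint64(buf[16:]),
    }, true
}

// On startup: pick the newest valid slot; the next update goes to the other.
func chooseSlot(a, b []byte) (meta metaSlot, writeTo int, err error) {
    ma, okA := decodeSlot(a)
    mb, okB := decodeSlot(b)
    switch {
    case okA && okB && ma.version >= mb.version:
        return ma, 1, nil
    case okB:
        return mb, 0, nil
    case okA:
        return ma, 1, nil
    }
    return metaSlot{}, 0, errors.New("no valid meta slot")
}
After a crash mid-update, the half-written slot fails its checksum, so the previous slot is used and the meta page update is effectively atomic.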
6.6 Summary of the append-only KV store
What we have done:
- File layout for the copy-on-write B+tree.
- Durability and atomicity with `fsync`.
Getting the data structure onto disk is a major milestone. The next step is a free list, since an append-only file is just impractical.