01. From Files To Databases
Let’s start with files, and examine the challenges we face.
1.1 Updating files in-place
Let’s say you need to save some data to disk; this is a typical way to do it:
```go
func SaveData1(path string, data []byte) error {
	fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
	if err != nil {
		return err
	}
	defer fp.Close()

	_, err = fp.Write(data)
	if err != nil {
		return err
	}
	return fp.Sync() // fsync
}
```
This code creates the file if it does not exist, or truncates the existing one before writing the content. And most importantly, the data is not persistent unless you call `fsync` (`fp.Sync()` in Go).
It has some serious limitations:
- It updates the content as a whole; only usable for tiny data. This is why you don’t use Excel as a database.
- If you need to update the old file, you must read and modify it in memory, then overwrite the old file. What if the app crashes while overwriting the old file?
- If the app needs concurrent access to the data, how do you prevent readers from getting mixed data and writers from conflicting with each other? That’s why most databases are client-server: you need a server to coordinate concurrent clients. (Concurrency is more complicated without a server, see SQLite).
1.2 Atomic renaming
Replacing data atomically by renaming files
Many problems are solved by not updating data in-place. You can write a new file and delete the old file. Not touching the old file data means:
- If the update is interrupted, you can recover from the old, intact file.
- Concurrent readers won’t get half written data.
How will readers find the new file? This is solved by the renaming pattern:
func SaveData2(path string, data []byte) error {
:= fmt.Sprintf("%s.tmp.%d", path, randomInt())
tmp , err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
fpif err != nil {
return err
}
defer func() { // 4. discard the temporary file if it still exists
.Close() // not expected to fail
fpif err != nil {
.Remove(tmp)
os}
}()
if _, err = fp.Write(data); err != nil { // 1. save to the temporary file
return err
}
if err = fp.Sync(); err != nil { // 2. fsync
return err
}
= os.Rename(tmp, path) // 3. replace the target
err return err
}
Renaming a file to an existing one replaces it atomically. But pay attention to the meaning of the jargon: whenever you see “X is atomic”, you should ask “X is atomic with respect to what?” In this case:
- Rename is atomic w.r.t. concurrent readers; a reader opens either the old or the new file.
- Rename is NOT atomic w.r.t. power loss; it’s not even durable. You need an extra `fsync` on the parent directory, which is discussed later.
Why does renaming work?
Filesystems keep a mapping from file names to file data, so replacing a file by renaming simply points the file name to the new data without touching the old data. That’s why atomic renaming is possible in filesystems. And the operation cost is constant regardless of the data size.
On Linux, the replaced old file may still exist if it’s still being opened by a reader; it’s just not accessible from a file name. Readers can safely work on whatever version of the data they got, while writers won’t be blocked by readers. However, there must be a way to prevent concurrent writers. The level of concurrency is multi-reader-single-writer, which is what we will implement.
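As a sketch, a single writer can be enforced with an advisory file lock. This example uses `flock(2)` via Go’s `syscall` package, so it is Unix-only, and the lock-file path is just an illustration:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// lockFile takes an exclusive advisory lock on the given file, so that
// only one writer can hold it at a time (Unix-only; uses flock(2)).
func lockFile(path string) (*os.File, error) {
	fp, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0664)
	if err != nil {
		return nil, err
	}
	// LOCK_NB makes the call fail instead of blocking if already locked.
	if err := syscall.Flock(int(fp.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		fp.Close()
		return nil, fmt.Errorf("another writer holds the lock: %w", err)
	}
	return fp, nil // the lock is released on Close or process exit
}

func main() {
	fp, err := lockFile("/tmp/demo.lock") // example path
	if err != nil {
		fmt.Println(err)
		return
	}
	defer fp.Close()
	fmt.Println("acquired the writer lock")
}
```

Note that `flock` is advisory: it only stops other writers that also acquire the lock, and readers remain unaffected, which matches the multi-reader-single-writer model.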
1.3 Append-only logs
Safe incremental updates with logs
One way to do incremental updates is to just append the updates to a file. This is called a “log” because it’s append-only. It’s safer than in-place updates because no data is overwritten; you can always recover the old data after a crash.
The reader must consider all log entries when using the log. For example, here is a log-based KV with 4 entries:
0 1 2 3
| set a=1 | set b=2 | set a=3 | del b |
The final state is `a=3`.
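The replay step can be sketched in a few lines. This is a minimal in-memory sketch; the `Entry` type and its fields are assumptions for illustration, not a real on-disk format:

```go
package main

import "fmt"

// Entry is a single log record (an assumed format for illustration).
type Entry struct {
	Op  string // "set" or "del"
	Key string
	Val string
}

// replay computes the final state by applying all entries in order.
func replay(log []Entry) map[string]string {
	state := map[string]string{}
	for _, e := range log {
		switch e.Op {
		case "set":
			state[e.Key] = e.Val
		case "del":
			delete(state, e.Key)
		}
	}
	return state
}

func main() {
	log := []Entry{
		{"set", "a", "1"},
		{"set", "b", "2"},
		{"set", "a", "3"},
		{"del", "b", ""},
	}
	fmt.Println(replay(log)) // map[a:3]
}
```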
Logs are an essential component of many databases. But logs alone are not enough to build a DB because:
- It’s not an indexing data structure; readers must read all entries.
- It has no way to reclaim space from deleted data.
Atomic log updates with checksums
While a log won’t corrupt old data, you still have to deal with the last entry if it gets corrupted after a crash. Many possibilities:
- The last append simply does not happen; the log is still good.
- The last entry is half written.
- The size of the log is increased but the last entry is not there.
The way to deal with these cases is to add a checksum to each log entry. If the checksum is wrong, the update did not happen, making log updates atomic (w.r.t. both readers and durability).
This scenario is about incomplete writes (torn writes in DB jargon) that occur before a successful `fsync`. Checksums can also detect other forms of corruption after `fsync`, but that’s not something a DB can recover from.
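Here is a sketch of checksummed log entries. The entry layout (checksum, then length, then payload) is an assumption for illustration; real log formats vary. A bad checksum or a truncated entry is treated as “the last append did not happen”:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// appendEntry encodes one entry as: checksum (4B) | length (4B) | payload.
func appendEntry(log []byte, payload []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	log = append(log, hdr[:]...)
	return append(log, payload...)
}

// readEntry decodes the entry at the given offset. A truncated or
// corrupted entry returns ok=false, meaning the append did not happen.
func readEntry(log []byte, off int) (payload []byte, next int, ok bool) {
	if off+8 > len(log) {
		return nil, 0, false // truncated header
	}
	sum := binary.LittleEndian.Uint32(log[off : off+4])
	n := int(binary.LittleEndian.Uint32(log[off+4 : off+8]))
	if off+8+n > len(log) {
		return nil, 0, false // truncated payload
	}
	payload = log[off+8 : off+8+n]
	if crc32.ChecksumIEEE(payload) != sum {
		return nil, 0, false // checksum mismatch; discard the entry
	}
	return payload, off + 8 + n, true
}

func main() {
	log := appendEntry(nil, []byte("set a=1"))
	log = appendEntry(log, []byte("set b=2"))
	// simulate a torn write: only half of the last entry made it to disk
	log = append(log, appendEntry(nil, []byte("set a=3"))[:10]...)

	for off := 0; ; {
		e, next, ok := readEntry(log, off)
		if !ok {
			break // stop at the first bad entry; the log before it is good
		}
		fmt.Printf("%s\n", e)
		off = next
	}
}
```

Only the two complete entries survive the replay; the torn third entry is discarded, so readers never see a half-written update.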
1.4 `fsync` gotchas
After renaming files or creating new files, you must call `fsync` on the parent directory. A directory is a mapping from file names to files, and like file data, it’s not durable unless you use `fsync`. See this example of `fsync` on the directory.
Another issue with `fsync` is error handling. If `fsync` fails, the DB update fails, but what if you read the file afterwards? You may get the new data even if `fsync` failed (because of the OS page cache)! This behavior is filesystem-dependent.
1.5 Summary of database challenges
What we have learned:
- Problems with in-place updates.
  - Avoid in-place updates by renaming files.
  - Avoid in-place updates using logs.
- Append-only logs.
  - Incremental updates.
  - Not a full solution; no indexing and space reuse.
- `fsync` usage.
What remains a question:
- Indexing data structures and how to update them.
- Reuse space from append-only files.
- Combining a log with an indexing data structure.
- Concurrency.