01. From Files To Databases
Let’s start with a seemingly simple problem: save data in a file and try to make it atomic and durable, i.e., resistant to crashes.
1.1 Updating files in-place
The typical way to save and overwrite a file is like this:
func SaveData1(path string, data []byte) error {
, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
fpif err != nil {
return err
}
defer fp.Close()
, err = fp.Write(data)
_if err != nil {
return err
}
return fp.Sync() // fsync
}
This is how to use a filesystem as a KV, with the file name as the
“key” and the file data as the “value”. The O_CREATE
flag
creates the file if it doesn’t exist, and the O_TRUNC
flag
discards the previous data if the file exists.
fsync
is the syscall that requests and
confirms that the file data is really written. This is required
because there may be many levels of buffering: the OS page cache; the
on-device RAM. After a successful fsync
, the data is
considered permanent on the final medium. The Golang Sync()
just calls fsync
.
This method will fail the atomicity and durability requirements.
- Truncating a file means it’s possible to end up with an empty file.
- Not truncating means it’s possible to end up with a half written file.
This is the problem of updating data in-place.
1.2 Atomic renaming
Replacing data atomically by renaming files
To avoid updating data in-place, we need to write a new chunk of data
and switch to it atomically. The “switch” operation is just
rename()
in the filesystem, it’s specifically designed to
make atomic files possible.
┌───🬼 create ┌───🬼 ┌───🬼 rename ┌───🬼
1 │ ────────→ 1 │ + 2 │ ────────→ 2 │
└───┘ └───┘ └───┘ └───┘
old old temp new
func SaveData2(path string, data []byte) error {
:= fmt.Sprintf("%s.tmp.%d", path, randomInt())
tmp , err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
fpif err != nil {
return err
}
defer func() { // 4. discard the temporary file if it still exists
.Close() // not expected to fail
fpif err != nil {
.Remove(tmp)
os}
}()
if _, err = fp.Write(data); err != nil { // 1. save to the temporary file
return err
}
if err = fp.Sync(); err != nil { // 2. fsync
return err
}
= os.Rename(tmp, path) // 3. replace the target
err return err
}
Types of atomicity
“Atomicity” means different things in different contexts. There are at least 2 kinds atomicity:
- Power-loss atomic: Will a reader observe a bad state after a crash?
- Readers-writer atomic: Will a reader observe a bad state with a concurrent writer?
SaveData1
fails both atomicities; a reader can observe
an empty file with or without a crash.
SaveData2
is readers-writer atomic, but NOT power-loss
atomic. This is surprising, explained later.
Why does renaming work?
Filesystems keep a mapping from file names to file data, so replacing a file by renaming simply points the file name to the new data without touching the old data. This mapping is just a “directory”. The mapping is many-to-one, multiple names can reference the same file, even from different directories, this is the concept of “hard link”. A file with 0 references is automatically deleted.
The atomicity and durability of rename()
depends on
directory updates. But unfortunately, updating a directory is only
readers-writer atomic, it’s not power-loss atomic or durable. So
SaveData2
is still incorrect.
`fsync` gochas
Both creating a file and renaming a file update the containing
directory. So there must be a way to make directories durable, thus
fsync
can also be called on directories. To do so, you need
to obtain a handle (file descriptor) of the directory. Fixing
SaveData2
is an exercise for the reader.
Another issue with fsync
is error handling. If
fsync
fails, the DB update fails, but what if you read the
file afterwards? You may get the new data even if fsync
failed (because of the OS page cache)! This behavior is filesystem
dependent.
1.3 Append-only logs
The file overwriting problem shows the problem of in-place updates
and fsync
gotchas. However, it’s nowhere close to
databases. Because we don’t know how to update data
incrementally.
Incremental updates with logs
A log can be used for incremental updates; it stores an ordered list of descriptions of each update. The entire state can be reconstructed by interpreting each log entry. For example, a log for KV updates:
0 1 2 3
| set a=1 | set b=2 | set a=3 | del b |
The final state is a=3
.
Atomic log updates with checksums
Let’s ignore the problem of discarding old log entries and assume that the log can only grow. This is called an “append-only” log, in-place updates are avoided, the next problem is to make the “append” atomic.
Each log append is made durable with fsync
. But if the
DB crashed between append and fsync
, the last entry may be
half written. Corrupted log entries are inevitable, the solution is to
detect the corrupted last entry and ignore it when recovering
from a crash.
The detection is done by adding a checksum to each log entry. A log entry can start with a fixed-size header containing the entry size and the checksum. The DB checks the last log entry on startup, if the size or checksum is wrong, it simply ignores it.
Checksums may have other use cases, such as detecting silent data corruption, but silent data corruption is NOT a database concern, the checksums in logs are only intended to detect incomplete writes (torn writes in DB jargon).
Log as a database component
A log is still not a database; it must be combined with the indexing data structure. Log entries are merged with the data structure when the log gets too large. This solves the forever growth problem. The remaining problem is how to atomically update the main data structure. Logs can be part of the solution, which we’ll learn about later.
Logs also have other uses in databases that we won’t discuss now. Most databases use a log, MySQL even uses 2 logs. However, databases are possible without logs, as we’ll learn.
1.4 Summary of database challenges
What we have learned:
- Problems with in-place updates.
- Avoid in-place updates by renaming files.
- Avoid in-place updates with logs.
- Append-only logs.
- Incremental updates.
- Not a full solution; no indexing and space reuse.
fsync
usage.- Both file data and directory need
fsync
.
- Both file data and directory need
What remains a question:
- Indexing data structures.
- Combining a log with an indexing data structure.
- Concurrency.