01. From Files To Databases

Let’s start with files, and examine the challenges we face.

1.1 Updating files in-place

Let’s say you need to save some data to disk; this is a typical way to do it:

func SaveData1(path string, data []byte) error {
    fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
    if err != nil {
        return err
    }
    defer fp.Close()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }
    return fp.Sync() // fsync
}
This code creates the file if it does not exist, or truncates the existing one before writing the content. And most importantly, the data is not persistent unless you call fsync (fp.Sync() in Go).

It has some serious limitations:

  1. It updates the content as a whole; only usable for tiny data. This is why you don’t use Excel as a database.
  2. If you need to update the old file, you must read and modify it in memory, then overwrite the old file. What if the app crashes while overwriting the old file?
  3. If the app needs concurrent access to the data, how do you prevent readers from getting mixed data and writers from conflicting with each other? That’s why most databases are client-server: you need a server to coordinate concurrent clients. (Concurrency is more complicated without a server; see SQLite.)

1.2 Atomic renaming

Replacing data atomically by renaming files

Many problems are solved by not updating data in-place. You can write a new file and delete the old file.

Not touching the old file data means:

  1. If the update is interrupted, you can recover from the old file since it remains intact.
  2. Concurrent readers won’t get half-written data.

The problem is how readers will find the new file. A common pattern is to rename the new file to the old file path.

func SaveData2(path string, data []byte) error {
    tmp := fmt.Sprintf("%s.tmp.%d", path, randomInt())
    fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
    if err != nil {
        return err
    }
    defer func() {
        fp.Close()
        if err != nil {
            os.Remove(tmp) // discard the temporary file on failure
        }
    }()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }
    err = fp.Sync() // fsync
    if err != nil {
        return err
    }

    return os.Rename(tmp, path)
}

Renaming a file to an existing one replaces it atomically; deleting the old file beforehand is neither needed nor correct.

Pay attention to the meaning of the jargon. Whenever you see “X is atomic”, you should ask “X is atomic with respect to what?” In this case:

  1. Rename is atomic w.r.t. concurrent readers: a reader opens either the old file or the new file, never a mix of the two.
  2. Rename alone is not durable: the result can still be lost after a crash until you fsync the parent directory (discussed in section 1.4).

Why does renaming work?

Filesystems keep a mapping from file names to file data, so replacing a file by renaming simply points the file name to the new data without touching the old data. That’s why atomic renaming is possible in filesystems. And the operation cost is constant regardless of the data size.

On Linux, the replaced old file may still exist if it’s still opened by a reader; it’s just no longer accessible via a file name. Readers can safely work on whatever version of the data they got, while the writer won’t be blocked by readers. However, there must be a way to prevent concurrent writers. The level of concurrency is multi-reader-single-writer, which is what we will implement.
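Excluding concurrent writers is commonly done with an advisory file lock. Here is a minimal, Linux-specific sketch using flock; the `acquireLock` helper and the lock file path are assumptions for illustration, not part of the text above:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// acquireLock takes an exclusive advisory lock on the given lock file,
// failing immediately (instead of blocking) if another writer holds it.
func acquireLock(path string) (*os.File, error) {
	fp, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0664)
	if err != nil {
		return nil, err
	}
	// LOCK_EX: exclusive lock; LOCK_NB: fail instead of waiting.
	if err := syscall.Flock(int(fp.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		fp.Close()
		return nil, fmt.Errorf("another writer is active: %w", err)
	}
	return fp, nil // the lock is released on Close (or process exit)
}

func main() {
	fp, err := acquireLock("/tmp/db.lock")
	if err != nil {
		fmt.Println("locked out:", err)
		return
	}
	defer fp.Close()
	fmt.Println("acquired the writer lock")
}
```

Because the lock is released automatically when the process dies, a crashed writer cannot leave the database permanently locked.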

1.3 Append-only logs

Safe incremental updates with logs

One way to do incremental updates is to just append the updates to a file. This is called a “log” because it’s append-only. It’s safer than in-place updates because no data is overwritten; you can always recover the old data after a crash.

The reader must consider all log entries when using the log. For example, here is a log-based KV with 4 entries:

     0         1         2        3
| set a=1 | set b=2 | set a=3 | del b |

The final state is a=3.
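The replay rule can be sketched as a single in-order pass over the entries. The `Entry` type and `replay` function below are hypothetical names, just to illustrate that a reader must process every entry:

```go
package main

import "fmt"

// Entry is one log record: either a "set" or a "del" of a key.
type Entry struct {
	op, key, val string
}

// replay computes the final KV state by applying entries in order;
// later entries override earlier ones.
func replay(log []Entry) map[string]string {
	state := map[string]string{}
	for _, e := range log {
		switch e.op {
		case "set":
			state[e.key] = e.val
		case "del":
			delete(state, e.key)
		}
	}
	return state
}

func main() {
	log := []Entry{
		{"set", "a", "1"}, {"set", "b", "2"}, {"set", "a", "3"}, {"del", "b", ""},
	}
	fmt.Println(replay(log)) // map[a:3]
}
```

Note that answering a single query this way requires scanning the whole log, which hints at the indexing problem discussed next.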

Logs are an essential component of many databases. However, a log is only a description of each update, which means:

  1. It is not an indexing data structure; a query must scan the whole log.
  2. There is no way to reclaim space from deleted or overwritten entries.

So logs alone are not enough to build a DB; they must be combined with other indexing data structures.

Atomic log updates with checksums

While a log won’t corrupt old data, you still have to deal with the last entry if it gets corrupted after a crash. There are several possibilities:

  1. The last append simply does not happen; the log is still good.
  2. The last entry is half-written.
  3. The size of the log is increased but the last entry is not there.

The way to deal with these cases is to add a checksum to each log entry. If the checksum is wrong, the update did not happen, making log updates atomic (w.r.t. both readers and durability).

This scenario is about incomplete writes (torn writes in DB jargon) that occur before a successful fsync. Checksums can also detect other forms of corruption after fsync, but that’s not something a DB can recover from.
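As a sketch of the checksum idea: frame each entry with a CRC-32 and a length, and have the reader treat any entry that fails to decode as “the update did not happen”. The exact layout here (4-byte checksum, 4-byte length, then the payload) is an assumption; real formats vary.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// appendEntry frames one log entry as: checksum (4B) | length (4B) | payload.
func appendEntry(log []byte, payload []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	return append(append(log, hdr[:]...), payload...)
}

// readEntry decodes the entry at offset off. ok=false means the tail is
// truncated or corrupt, so the last update simply "did not happen".
func readEntry(log []byte, off int) (payload []byte, next int, ok bool) {
	if off+8 > len(log) {
		return nil, 0, false // truncated header
	}
	sum := binary.LittleEndian.Uint32(log[off : off+4])
	n := int(binary.LittleEndian.Uint32(log[off+4 : off+8]))
	if off+8+n > len(log) {
		return nil, 0, false // truncated payload (a torn write)
	}
	payload = log[off+8 : off+8+n]
	if crc32.ChecksumIEEE(payload) != sum {
		return nil, 0, false // corrupt entry
	}
	return payload, off + 8 + n, true
}

func main() {
	log := appendEntry(nil, []byte("set a=1"))
	log = appendEntry(log, []byte("set b=2"))
	torn := log[:len(log)-3] // simulate a crash mid-append
	for off := 0; ; {
		e, next, ok := readEntry(torn, off)
		if !ok {
			break // stop at the torn tail; earlier entries are intact
		}
		fmt.Printf("%s\n", e)
		off = next
	}
}
```

The reader recovers every complete entry and silently discards the torn one, which is exactly the atomicity the text describes.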

1.4 `fsync` gotchas

After renaming files or creating new files, you must also call fsync on the parent directory. A directory is a mapping from file names to files, and like file data, it is not durable unless you fsync it.
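A minimal sketch of the directory fsync, assuming it runs right after a successful rename; the `syncDir` helper name is an assumption:

```go
package main

import (
	"os"
	"path/filepath"
)

// syncDir fsyncs the directory containing path, making a preceding
// rename or file creation in that directory durable.
func syncDir(path string) error {
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync() // fsync on the directory fd
}

func main() {
	// e.g. after os.Rename(tmp, "/tmp/example.db") has succeeded:
	if err := syncDir("/tmp/example.db"); err != nil {
		panic(err)
	}
}
```

Opening a directory and calling Sync on it works on Linux; without this step, the rename itself may not survive a crash even though the file contents were fsync’d.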

Another issue with fsync is error handling. If fsync fails, the DB update fails, but what if you read the file afterwards? You may get the new data even if fsync failed (because of the OS page cache)! This behavior is filesystem dependent.

1.5 Summary of database challenges

What we have learned:

  1. Problems with in-place updates.
    • Avoid in-place updates by renaming files.
    • Avoid in-place updates using logs.
  2. Append-only logs.
    • Incremental updates.
    • Not a full solution; no indexing and space reuse.
  3. fsync usage.

What remains a question:

  1. Indexing data structures and how to update them.
  2. Reuse space from append-only files.
  3. Combining a log with an indexing data structure.
  4. Concurrency.