01. From Files To Databases

Let’s start with files, and examine the challenges we face.

1.1 Updating files in-place

Let’s say you need to save some data to disk; this is a typical way to do it:

func SaveData1(path string, data []byte) error {
    fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
    if err != nil {
        return err
    }
    defer fp.Close()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }
    return fp.Sync() // fsync
}

This code creates the file if it does not exist, or truncates the existing one before writing the content. Most importantly, the data is not durable unless you call fsync (fp.Sync() in Go); a plain write only updates the OS page cache.

It has some serious limitations:

  1. It updates the content as a whole; only usable for tiny data. This is why you don’t use Excel as a database.
  2. If you need to update the old file, you must read and modify it in memory, then overwrite the old file. What if the app crashes while overwriting the old file?
  3. If the app needs concurrent access to the data, how do you prevent readers from getting mixed data and writers from conflicting with each other? That’s why most databases are client-server: you need a server to coordinate concurrent clients. (Concurrency is more complicated without a server, see SQLite).

1.2 Atomic renaming

Replacing data atomically by renaming files

Many problems are solved by not updating data in-place. You can write a new file and delete the old file. Not touching the old file data means:

  1. If the update is interrupted, you can recover from the old, intact file.
  2. Concurrent readers won’t get half written data.

How will readers find the new file? This is solved by the renaming pattern:

func SaveData2(path string, data []byte) error {
    tmp := fmt.Sprintf("%s.tmp.%d", path, randomInt())
    fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
    if err != nil {
        return err
    }
    defer func() { // 4. discard the temporary file if it still exists
        fp.Close() // not expected to fail
        if err != nil {
            os.Remove(tmp)
        }
    }()

    if _, err = fp.Write(data); err != nil { // 1. save to the temporary file
        return err
    }
    if err = fp.Sync(); err != nil { // 2. fsync
        return err
    }
    err = os.Rename(tmp, path) // 3. replace the target
    return err
}

Renaming a file to an existing one replaces it atomically. But pay attention to the meaning of the jargon: whenever you see “X is atomic”, you should ask “X is atomic with respect to what?” In this case:

  • Rename is atomic w.r.t. concurrent readers; a reader opens either the old or the new file.
  • Rename is NOT atomic w.r.t. power loss; it’s not even durable. You need an extra fsync on the parent directory, which is discussed later.

Why does renaming work?

Filesystems keep a mapping from file names to file data, so replacing a file by renaming simply points the file name to the new data without touching the old data. That’s why atomic renaming is possible in filesystems. And the cost of the operation is constant regardless of the data size.

On Linux, the replaced old file may still exist if it’s still opened by a reader; it’s just no longer accessible by name. Readers can safely work on whatever version of the data they got, and the writer won’t be blocked by readers. However, there must be a way to prevent concurrent writers. The level of concurrency is multi-reader, single-writer, which is what we will implement.

1.3 Append-only logs

Safe incremental updates with logs

One way to do incremental updates is to just append the updates to a file. This is called a “log” because it’s append-only. It’s safer than in-place updates because no data is overwritten; you can always recover the old data after a crash.

The reader must consider all log entries when using the log. For example, here is a log-based KV with 4 entries:

     0         1         2        3
| set a=1 | set b=2 | set a=3 | del b |

The final state is a=3.

Logs are an essential component of many databases. But logs alone are not enough to build a DB because:

  • It’s not an indexing data structure; readers must read all entries.
  • It has no way to reclaim space from deleted data.

Atomic log updates with checksums

While a log won’t corrupt old data, you still have to deal with the last entry, which may be corrupted after a crash. There are several possibilities:

  1. The last append simply does not happen; the log is still good.
  2. The last entry is half written.
  3. The size of the log is increased but the last entry is not there.

The way to deal with these cases is to add a checksum to each log entry. If the checksum is wrong, the reader treats the entry as if the update never happened, which makes log updates atomic (w.r.t. both readers and durability).

This scenario is about incomplete writes (torn writes in DB jargon) that occur before a successful fsync. Checksums can also detect other forms of corruption after fsync, but that’s not something a DB can recover from.

1.4 `fsync` gotchas

After renaming files or creating new files, you must call fsync on the parent directory. A directory is a mapping from file names to files, and like file data, it’s not durable unless you use fsync. See this example of fsync on the directory.

Another issue with fsync is error handling. If fsync fails, the DB update fails, but what if you read the file afterwards? You may get the new data even if fsync failed (because of the OS page cache)! This behavior is filesystem dependent.

1.5 Summary of database challenges

What we have learned:

  1. Problems with in-place updates.
    • Avoid in-place updates by renaming files.
    • Avoid in-place updates using logs.
  2. Append-only logs.
    • Incremental updates.
    • Not a full solution; no indexing and space reuse.
  3. fsync usage.

What remains to be answered:

  1. Indexing data structures and how to update them.
  2. Reusing space from append-only files.
  3. Combining a log with an indexing data structure.
  4. Concurrency.