01. From Files To Databases

Let’s start with a seemingly simple problem: save data in a file and try to make it atomic and durable, i.e., resistant to crashes.

1.1 Updating files in-place

The typical way to save and overwrite a file is like this:

func SaveData1(path string, data []byte) error {
    fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
    if err != nil {
        return err
    }
    defer fp.Close()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }
    return fp.Sync() // fsync
}

This is how to use a filesystem as a KV, with the file name as the “key” and the file data as the “value”. The O_CREATE flag creates the file if it doesn’t exist, and the O_TRUNC flag discards the previous data if the file exists.

fsync is the syscall that requests and confirms that the file data is really written. This is required because there may be many levels of buffering: the OS page cache; the on-device RAM. After a successful fsync, the data is considered permanent on the final medium. The Golang Sync() just calls fsync.

This method will fail the atomicity and durability requirements.

  • Truncating a file means it’s possible to end up with an empty file.
  • Not truncating means it’s possible to end up with a half written file.

This is the problem of updating data in-place.

1.2 Atomic renaming

Replacing data atomically by renaming files

To avoid updating data in-place, we need to write a new chunk of data and switch to it atomically. The “switch” operation is just rename() in the filesystem, it’s specifically designed to make atomic files possible.

┌───🬼  create   ┌───🬼   ┌───🬼  rename   ┌───🬼
 1 │ ────────→  1 │ +  2 │ ────────→  2 │
└───┘           └───┘   └───┘           └───┘
 old             old     temp            new
func SaveData2(path string, data []byte) error {
    tmp := fmt.Sprintf("%s.tmp.%d", path, randomInt())
    fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
    if err != nil {
        return err
    }
    defer func() { // 4. discard the temporary file if it still exists
        fp.Close() // not expected to fail
        if err != nil {
            os.Remove(tmp)
        }
    }()

    if _, err = fp.Write(data); err != nil { // 1. save to the temporary file
        return err
    }
    if err = fp.Sync(); err != nil { // 2. fsync
        return err
    }
    err = os.Rename(tmp, path) // 3. replace the target
    return err
}

Types of atomicity

“Atomicity” means different things in different contexts. There are at least 2 kinds atomicity:

  • Power-loss atomic: Will a reader observe a bad state after a crash?
  • Readers-writer atomic: Will a reader observe a bad state with a concurrent writer?

SaveData1 fails both atomicities; a reader can observe an empty file with or without a crash.

SaveData2 is readers-writer atomic, but NOT power-loss atomic. This is surprising, explained later.

Why does renaming work?

Filesystems keep a mapping from file names to file data, so replacing a file by renaming simply points the file name to the new data without touching the old data. This mapping is just a “directory”. The mapping is many-to-one, multiple names can reference the same file, even from different directories, this is the concept of “hard link”. A file with 0 references is automatically deleted.

The atomicity and durability of rename() depends on directory updates. But unfortunately, updating a directory is only readers-writer atomic, it’s not power-loss atomic or durable. So SaveData2 is still incorrect.

`fsync` gochas

Both creating a file and renaming a file update the containing directory. So there must be a way to make directories durable, thus fsync can also be called on directories. To do so, you need to obtain a handle (file descriptor) of the directory. Fixing SaveData2 is an exercise for the reader.

Another issue with fsync is error handling. If fsync fails, the DB update fails, but what if you read the file afterwards? You may get the new data even if fsync failed (because of the OS page cache)! This behavior is filesystem dependent.

1.3 Append-only logs

The file overwriting problem shows the problem of in-place updates and fsync gotchas. However, it’s nowhere close to databases. Because we don’t know how to update data incrementally.

Incremental updates with logs

A log can be used for incremental updates; it stores an ordered list of descriptions of each update. The entire state can be reconstructed by interpreting each log entry. For example, a log for KV updates:

     0         1         2        3
| set a=1 | set b=2 | set a=3 | del b |

The final state is a=3.

Atomic log updates with checksums

Let’s ignore the problem of discarding old log entries and assume that the log can only grow. This is called an “append-only” log, in-place updates are avoided, the next problem is to make the “append” atomic.

Each log append is made durable with fsync. But if the DB crashed between append and fsync, the last entry may be half written. Corrupted log entries are inevitable, the solution is to detect the corrupted last entry and ignore it when recovering from a crash.

The detection is done by adding a checksum to each log entry. A log entry can start with a fixed-size header containing the entry size and the checksum. The DB checks the last log entry on startup, if the size or checksum is wrong, it simply ignores it.

Checksums may have other use cases, such as detecting silent data corruption, but silent data corruption is NOT a database concern, the checksums in logs are only intended to detect incomplete writes (torn writes in DB jargon).

Log as a database component

A log is still not a database; it must be combined with the indexing data structure. Log entries are merged with the data structure when the log gets too large. This solves the forever growth problem. The remaining problem is how to atomically update the main data structure. Logs can be part of the solution, which we’ll learn about later.

Logs also have other uses in databases that we won’t discuss now. Most databases use a log, MySQL even uses 2 logs. However, databases are possible without logs, as we’ll learn.

1.4 Summary of database challenges

What we have learned:

  1. Problems with in-place updates.
    • Avoid in-place updates by renaming files.
    • Avoid in-place updates with logs.
  2. Append-only logs.
    • Incremental updates.
    • Not a full solution; no indexing and space reuse.
  3. fsync usage.
    • Both file data and directory need fsync.

What remains a question:

  1. Indexing data structures.
  2. Combining a log with an indexing data structure.
  3. Concurrency.