02. Indexing Data Structures

2.1 Types of queries

Most SQL queries can be broken down into 3 types:

  1. Scan the whole data set. (No index is used).
  2. Point query: Query the index by a specific key.
  3. Range query: Query the index by a range. (The index is sorted).

There are ways to make scanning fast, such as column-based storage. But a scan is O(N) no matter how fast it is; our focus is on queries that can be served in O(logN) using data structures.

A range query consists of 2 phases:

  1. Seek: find the starting key.
  2. Iterate: find the previous/next key in sorted order.

A point query is just seek without iterate; a sorting data structure is all we need.

2.2 Hashtables

Hashtables are viable if you only consider point queries (get, set, del), so we will not bother with them because of the lack of ordering.

However, coding a hashtable, even an in-memory one, is still a valuable exercise. It’s far easier than the B-tree we’ll code later, though some challenges remain:

2.3 Sorted arrays

Ruling out hashtables, let’s start with the simplest sorting data structure: the sorted array. You can binary search on it in O(logN). For variable-length data such as strings (KV), use an array of pointers (offsets) to do binary searches.

Updating a sorted array is O(N), either in-place or not. So it’s not practical, but it can be extended to other updatable data structures.

One way to reduce the update cost is to split the array into several smaller non-overlapping arrays — nested sorted arrays. This extension leads to B+tree (multi-level n-ary tree), with the additional challenge of maintaining these small arrays (tree nodes).

Another form of “updatable array” is the log-structured merge tree (LSM-tree). Updates are first buffered in a smaller array (or other sorting data structures), then merged into the main array when it becomes too large. The update cost is amortized by propagating smaller arrays into larger arrays.

2.4 B-tree

A B-tree is a balanced n-ary tree, comparable to balanced binary trees. Each node stores variable number of keys (and branches) up to n and n > 2.

Reducing random access with shorter trees

A disk can only perform a limited number of IOs per second (IOPS), which is the limiting factor for tree lookups. Each level of the tree is a disk read in a lookup, and n-ary trees are shorter than binary trees for the same number of keys (lognN vs. log2N), thus n-ary trees are used for fewer disk reads per lookup.

How is the n chosen? There is a trade-off:

IO in the unit of pages

While you can read any number of bytes at any offset from a file, disks do not work that way. The basic unit of disk IO is not bytes, but sectors, which are 512-byte contiguous blocks on old HDDs.

However, disk sectors are not an application’s concern because regular file IOs do not interact directly with the disk. The OS caches/buffers disk reads/writes in the page cache, which consists of 4K-byte memory blocks called pages.

In any way, there is a minimum unit of IO. DBs can also define their own unit of IO (also called a page), which can be larger than an OS page.

The minimum IO unit implies that tree nodes should be allocated in multiples of the unit; a half used unit is half wasted IO. Another reason against small n!

The B+tree variant

In the context of databases, B-tree means a variant of B-tree called B+tree. In a B+tree, internal nodes do not store values, values exist only in leaf nodes. This leads to shorter tree because internal nodes have more space for branches.

B+tree as an in-memory data structure also makes sense because the minimum IO unit between RAM and CPU caches is 64 bytes (cache line). The performance benefit is not as great as on disk because not much can fit in 64 bytes.

Data structure space overhead

Another reason why binary trees are impractical is the number of pointers; each key has at least 1 incoming pointer from the parent node, whereas in a B+tree, multiple keys in a leaf node share 1 incoming pointer.

Keys in a leaf node can also be packed in a compact format or compressed to further reduce the space.

2.5 Log-structured storage

Update by merge: amortize cost

The most common example of log-structured storage is log-structure merge tree (LSM-tree). Its main idea is neither log nor tree; it’s “merge” instead!

Let’s start with 2 files: a small file holding the recent updates, and a large file holding the rest of the data. Updates go to the small file first, but it cannot grow forever; it will be merged into the large file when it reaches a threshold.

writes => | new updates | => | accumulated data |
               file 1               file 2

Merging 2 sorted files results in a newer, larger file that replaces the old large file and shrinks the small file.

Merging is O(N), but can be done concurrently with readers and writers.

Reduce write amplification with multiple levels

Buffering updates is better than rewriting the whole dataset every time. What if we extend this scheme to multiple levels?

                 |level 1|
           |------level 2------|
|-----------------level 3-----------------|

In the 2-level scheme, the large file is rewritten every time the small file reaches a threshold, the excess disk write is called write amplification, and it gets worse as the large file gets larger. If we use more levels, we can keep the 2nd level small by merging it into the 3rd level, similar to how we keep the 1st level small.

Intuitively, levels grow exponentially, and the power of two growth (merging similarly sized levels) results in the least write amplification. But there is a trade-off between write amplification and the number of levels (query performance).

LSM-tree indexes

Each level contains indexing data structures, which could simply be a sorted array, since levels are never updated (except for the 1st level). But binary search is not much better than binary tree in terms of random access, so a sensible choice is to use B-tree inside a level, that’s the “tree” part of LSM-tree. Anyway, data structures are much simpler because of the lack of updates.

To better understand the idea of “merge”, you can try to apply it to hashtables, a.k.a. log-structured hashtables.

LSM-tree queries

Keys can be in any levels, so to query an LSM-tree, the results from each level are combined (n-way merge for range queries).

For point queries, Bloom filters can be used as an optimization to reduce the number of searched levels.

Since levels are never updated, there can be old versions of keys in older levels, and deleted keys are marked with a special flag in newer levels (called tombstones). Thus, newer levels have priority in queries.

The merge process naturally reclaims space from old or deleted keys. Thus, it’s also called compaction.

Real-world LSM-tree: SSTable, MemTable and log

These are jargons about LSM-tree implementation details. You don’t need to know them to build one from principles, but they do solve some real problems.

Levels are split into multiple non-overlapping files called SSTables, rather than one large file, so that merging can be done gradually. This reduces the free space requirement when merging large levels, and the merging process is spread out over time.

The 1st level is updated directly, a log becomes a viable choice because the 1st level is bounded in size. This is the “log” part of the LSM-tree, an example of combining a log with other indexing data structures.

But even if the log is small, a proper indexing data structure is still needed. The log data is duplicated in an in-memory index called MemTable, which can be a B-tree, skiplist, or whatever. It’s a small, bounded amount of in-memory data, and has the added benefit of accelerating the read-the-recent-updates scenario.

2.6 Summary of indexing data structures

There are 2 options: B+tree and LSM-tree.

LSM-tree solves many of the challenges from the last chapter, such as how to update disk-based data structures and resue space. While these challenges remain for B+tree, which will be explored later.