10. The AVL Tree: Coding & Testing

While Redis is called a key-value store, the “value” part is more than byte strings:

  • List. Double-ended queue. Trivial to code.
  • Hashmap and hashset. We’ve already coded a hashtable.
  • Sorted set. The sorting data structure in this chapter.

10.1 Sorted Set for Sorting

Range Query

A sorted set is a collection of (score, name) pairs indexed in 2 ways:

  1. Find the score by name. Like a hashmap.
  2. Range query: Getting a subset of the sorted pairs:
    • Seek to the closest pair from a target (score, name) pair.
    • Iterate in ascending or descending order from the starting pair.

The range query (seek & iterate) is like using a SQL index.

Rank Query

Sorted set can also do queries related to the position in the list (rank):

  • Find a pair’s rank.
  • Find a pair by its rank.
  • Offset a pair. Like offsetting a SQL query.

Unlike SQL indexes, these are O(log(n)) in Redis.

Sorted Set Use Cases

Suppose you have a high score board using sorted set. You can:

  • Find a user’s score.
  • Range query: Show the list starting from a given score (pagination).
  • Rank query:
    • Find a user’s rank.
    • Find the user with a given rank.
    • Paginate the list by offset instead of score. Handy!

In addition to score boards, a sorted set can sort anything. Suppose you have a list of posts ordered by time. You can:

  • Find the post time (score) by post ID (name).
  • Range query: List posts starting from a given time (score).
  • Rank query:
    • Find the position of a post in the list. Not useful.
    • Find the post ID at a given position. Not useful.
    • Paginate the list by offset instead of time. Handy!

The score is a 64-bit float; if the sort key cannot be encoded as the score, you can use the name for sorting instead (fixed score).

Big-O of Sorted Set

  • Find the score by name: O(1) by a hashtable.
  • Insert a pair: O(log(n)).
  • Range query: O(log(n)).
  • Rank query:
    • Find a pair’s rank: O(log(n)).
    • Find by rank: O(log(n)).
    • Offsetting: O(log(offset)).

The average cost of both inserting and searching is O(log(n)) for all practical sorting data structures.

Paginating a query by offset is avoided in SQL because the cost is O(offset). However, this is practical in Redis as the cost is reduced to O(log(offset)).

10.2 Sorting Data Structures Quick Review

Sorted Arrays

The simplest sorting data structure is a sorted array. You can bisect it in O(log(n)), but the update is O(n), so it’s not practical.

The cost of updating a sorted array can be amortized by buffering the updates in a smaller array first, and then merging them into the main array later. This idea leads to the log structured merge tree (LSM-tree), although it has nothing to do with trees and is rarely used as an in-memory data structure.

If we split the array into several non-overlapping ones, the update cost can be reduced. This is a 2-level tree, and the idea leads to n-ary trees like B+tree.

Tree Data Structures

Almost all other sorting data structures are trees:

  • Binary trees:
    • AVL tree. Balanced.
    • RB tree. Balanced.
    • Treap. Randomized.
    • Splay tree. Require randomized usage patterns.
  • N-ary trees:
    • Trie.
    • B-tree. Balanced.
    • Skip list. Randomized, hierarchical.

Trees are hierarchical and recursive. A tree node divides the dataset into non-overlapping subtrees, that’s how to sort data with trees.

Balance By Randomness

The simplest tree structure is the unbalanced binary tree, where new nodes are simply added as leaves, so the tree can be irregular and deep. For example, after inserting a monotonic sequence, the tree is just a linear list.

There are ways to affect the shape of the tree to make binary trees practical. One way is to use randomness. If you insert and delete at random, the average tree depth is O(log(n)).

Splay trees also use randomness of lookups, it rotates the target node all the way up to the root node, so random lookups also shuffle the shape of the tree. However, it’s only usable when worst cases do not matter, because the worst case can be triggered by common usage patterns such as sequential lookups.

Treaps and Skip lists are better than Splay trees because the randomness is generated. Their worst cases are a matter of luck instead of usage patterns.

Height-Balanced Trees

Another way to make trees practical is to limit the tree height by some invariants.

  • RB tree:
    • Nodes are either black or red.
    • Equal number of black nodes in all paths from root to leaves.
    • No 2 consecutive red nodes in a path.
  • AVL tree:
    • 2 subtrees may differ in height, but only by 1.
  • B-tree:
    • All leaves have the same height. (Only possible for n-ary trees).

The height of the trees in the worst case is O(log(n)).

Comparison of Trees

Tree Worst case Branch Random Difficulty
AVL tree O(log(n)) 2 No Medium
RB tree O(log(n)) 2 No High
Treap O(n) 2 Yes Low
Splay tree O(n) 2 Usage Low
B-tree O(log(n)) n No High
Skip list O(n) n Yes Medium

Splay trees are not general-purpose due to their worst case nature. They are designed for fast access to hot data.

B-trees are simple in concept, but coding one is non-trivial. They are used for disk storage where binary trees are impractical.

RB trees are simple to insert, but deleting is complicated and non-obvious. So I prefer the much simpler AVL tree.

The remaining 3 candidates are AVL tree, Treap, and Skip list. AVL trees have bounded worst cases and are the most well-known, so that’s what we’ll use.

10.3 Implementing Binary Trees

These things are the same for all kinds of binary trees.

Search and Insert

Starting from the root, the target is either in the left or right subtree, if not in the current node.

An insert is simply placing the new node at where the search ends.

When the tree is implemented as an intrusive data structure, there is no dedicate function for searching or inserting because it’s so trivial.

Delete

  • Remove a leaf node: simple.
  • Remove an internal node:
    • If the right subtree is not empty:
      1. Swap it with its successor in the right subtree.
      2. Remove the successor node (which is at a lower position).
    • If the right subtree is empty:
      • Return the left subtree.

Notes:

  • The process is recursive, it ends eventually because the target to remove is lower and lower.
  • The case for leaves is redundant, since the case for internal nodes also works.
  • Trees are symmetric, so things also work if left and right are swapped.

Iterate

The successor node is the leftmost node in the right subtree, except when the right subtree is empty, you have to go back to the parent node. That’s why a tree node includes a parent pointer.

But parent pointers are not necessary, you can also keep a stack of nodes (path), this saves memory but makes the code clumsy.

10.4 AVL Tree

Invariant: Height Difference Only By One

An AVL tree is a binary tree with an extra step, inserts and deletions are done as normal, then the tree is fixed by some rules.

The height of a subtree is the maximum distance to leaves. For each node in an AVL tree, the height of its 2 subtrees can differ, but only by 1.

The rules for fixing the heights can be the same for insert and delete. That’s why AVL trees are far easier than RB trees.

Rotations Keep the Order

Rotations change the shape of a subtree while keeping its order. A visualization of a left rotation:

  2           4
 / \         / \
1   4  ==>  2   5
   / \     / \
  3   5   1   3

In Splay trees and Treaps, rotations are used to move the target node up.

Rotations Adjust Heights

Inserting or removing a node changes the subtree height by 1, this can cause a height difference of 2 for the parent node, which is restored by rotations.

Rule 1: A right rotation restores balance if:

  • Condition 1: The left subtree (B) is deeper by 2.
  • Condition 2: The left left subtree (A) is deeper than the left right subtree (C).
      D (h+3)               B (h+2)
     /      \              /      \
   B (h+2)   E (h)  ==>  A (h+1)   D (h+1)
  /      \                        /      \
A (h+1)   C (h)                  C (h)    E (h)
  • Condition 1 is initially introduced by insertion or deletion.
  • Condition 2 can be obtained by applying rule 2 if unmet.

Rule 2: A right rotation makes the right subtree deeper than the left subtree while keeping the rotated height if:

  • The left subtree (B) is deeper by 1.
      D (h+2)               B (h+2)
     /      \              /      \
   B (h+1)   E (h)  ==>  A (h-1)   D (h+1)
  /      \                        /      \
A (h-1)   C (h)                  C (h)    E (h)

The height of A does not matter in the above graph, node A can cause an imbalance but the following rule 1 still works.

Both rules are left-right mirrored. The left rotation rule 2 is paired with the right rotation rule 1.

The fix start from the initial node:

  • Check for imbalance (height difference of 2).
    • Stop if balanced.
  • If rule 1 is not applicable (condition 2 unmet).
    • Apply the opposite rotation rule 2.
  • Apply rule 1.
  • Go to parent node and repeat.

10.5 Coding an AVL Tree

Step 1: Node Definition

struct AVLNode {
    uint32_t depth = 0;     // subtree height
    uint32_t cnt = 0;       // subtree size
    AVLNode *left = NULL;
    AVLNode *right = NULL;
    AVLNode *parent = NULL;
};

static void avl_init(AVLNode *node) {
    node->depth = 1;
    node->cnt = 1;
    node->left = node->right = node->parent = NULL;
}

The subtree height is stored in the node, which is the most intuitive way to code an AVL tree. However, the exact height is unnecessary; some implementations store the height difference instead, which takes only 2 bits.

If only 2 bits are needed, they can be packed into one of the pointers since x64 pointers don’t use all the bits. However, this optimization is useless to us because we also have an extra field cnt.

cnt stores the subtree size, it’s used for rank queries in the next chapter.

Some helper functions for updating and using these auxiliary data:

static uint32_t avl_depth(AVLNode *node) {
    return node ? node->depth : 0;
}
static uint32_t avl_cnt(AVLNode *node) {
    return node ? node->cnt : 0;
}
// maintain the depth and cnt field
static void avl_update(AVLNode *node) {
    node->depth = 1 + max(avl_depth(node->left), avl_depth(node->right));
    node->cnt = 1 + avl_cnt(node->left) + avl_cnt(node->right);
}

Step 2: Rotations

  2           4
 / \         / \
1   4  ==>  2   5
   / \     / \
  3   5   1   3

2 lines for rotation, the rest is for maintaining parent pointers and auxiliary data.

static AVLNode *rot_left(AVLNode *node) {
    AVLNode *new_node = node->right;
    if (new_node->left) {
        new_node->left->parent = node;
    }
    node->right = new_node->left;   // rotation
    new_node->left = node;          // rotation
    new_node->parent = node->parent;
    node->parent = new_node;
    avl_update(node);
    avl_update(new_node);
    return new_node;
}

static AVLNode *rot_right(AVLNode *node);   // mirrored

Rotation functions must return the rotated node, because the tree root may be updated.

Step 3: Rules for Rotations

Rule 1: A right rotation restores balance if the left subtree is deeper by 2, and the left left subtree is deeper than the left right subtree.

Rule 2: A left rotation makes the left subtree deeper than the right subtree while keeping the rotated height if the right subtree is deeper by 1.

// the left subtree is too deep
static AVLNode *avl_fix_left(AVLNode *root) {
    if (avl_depth(root->left->left) < avl_depth(root->left->right)) {
        root->left = rot_left(root->left);  // rule 2
    }
    return rot_right(root); // rule 1
}

Step 4: Fix Imbalances

avl_fix() fixes imbalances after an insertion or deletion. It loops from the initial node to the root node and returns the new root.

// fix imbalanced nodes and maintain invariants until the root is reached
static AVLNode *avl_fix(AVLNode *node) {
    while (true) {
        avl_update(node);
        uint32_t l = avl_depth(node->left);
        uint32_t r = avl_depth(node->right);
        AVLNode **from = NULL;
        if (AVLNode *p = node->parent) {
            from = (p->left == node) ? &p->left : &p->right;
        }
        if (l == r + 2) {
            node = avl_fix_left(node);
        } else if (l + 2 == r) {
            node = avl_fix_right(node);
        }
        if (!from) {
            return node;
        }
        *from = node;
        node = node->parent;
    }
}

Step 5: Delete

The general binary tree delete:

  • If the right subtree is not empty:
    1. Swap it with the successor in the right subtree.
    2. Remove the successor node (which is at a lower position).
  • If the right subtree is empty:
    • Return the left subtree.

The avl_fix() in the non-recursive case is the extra step of the AVL tree.

// detach a node and returns the new root of the tree
static AVLNode *avl_del(AVLNode *node) {
    if (node->right == NULL) {
        // no right subtree, replace the node with the left subtree
        // link the left subtree to the parent
        AVLNode *parent = node->parent;
        if (node->left) {
            node->left->parent = parent;
        }
        if (parent) {   // attach the left subtree to the parent
            (parent->left == node ? parent->left : parent->right) = node->left;
            return avl_fix(parent);     // AVL-specific!
        } else {        // removing root?
            return node->left;
        }
    } else {
        // detach the successor
        AVLNode *victim = node->right;
        while (victim->left) {
            victim = victim->left;
        }
        AVLNode *root = avl_del(victim);
        // swap with it
        *victim = *node;
        if (victim->left) {
            victim->left->parent = victim;
        }
        if (victim->right) {
            victim->right->parent = victim;
        }
        if (AVLNode *parent = node->parent) {
            (parent->left == node ? parent->left : parent->right) = victim;
            return root;
        } else {        // removing root?
            return victim;
        }
    }
}

Step 6: Search and Insert

Searching a binary tree is not much more work than looping through a linked list. A search function is not required for an intrusive tree, since the tree node is exposed to the user, so we will just stop there. Our AVL tree API consists of:

  • avl_fix(): Fixes the tree after insertion.
  • avl_del(): Detaches a node. (Also calls avl_fix()).

10.6 Testing the AVL Tree

Set up the Tests

Read the hashtable chapter if unfamiliar with intrusive data structures.

struct Data {
    AVLNode node;
    uint32_t val = 0;
};

struct Container {
    AVLNode *root = NULL;
};

Regular binary tree insertion with avl_fix():

static void add(Container &c, uint32_t val) {
    Data *data = new Data();    // allocate the data
    avl_init(&data->node);
    data->val = val;

    AVLNode *cur = NULL;        // current node
    AVLNode **from = &c.root;   // the incoming pointer to the next node
    while (*from) {             // tree search
        cur = *from;
        uint32_t node_val = container_of(cur, Data, node)->val;
        from = (val < node_val) ? &cur->left : &cur->right;
    }
    *from = &data->node;        // attach the new node
    data->node.parent = cur;
    c.root = avl_fix(&data->node);
}

Using incoming pointer avoids the special case of adding the first node as the root. (See the hashtable chapter on list node deletion).

Search and delete with avl_del().

static bool del(Container &c, uint32_t val) {
    AVLNode *cur = c.root;
    while (cur) {
        uint32_t node_val = container_of(cur, Data, node)->val;
        if (val == node_val) {
            break;
        }
        cur = val < node_val ? cur->left : cur->right;
    }
    if (!cur) {
        return false;
    }

    c.root = avl_del(cur);
    delete container_of(cur, Data, node);
    return true;
}

Verify the Tree Structures

You should verify each node that:

static void avl_verify(AVLNode *parent, AVLNode *node) {
    if (!node) {
        return;
    }
    // verify subtrees recursively
    avl_verify(node, node->left);
    avl_verify(node, node->right);
    // 1. The parent pointer is correct.
    assert(node->parent == parent);
    // 2. The auxiliary data is correct.
    assert(node->cnt == 1 + avl_cnt(node->left) + avl_cnt(node->right));
    uint32_t l = avl_depth(node->left);
    uint32_t r = avl_depth(node->right);
    assert(node->depth == 1 + max(l, r));
    // 3. The height invariant is OK.
    assert(l == r || l + 1 == r || l == r + 1);
    // 4. The data is ordered.
    uint32_t val = container_of(node, Data, node)->val;
    if (node->left) {
        assert(node->left->parent == node);
        assert(container_of(node->left, Data, node)->val <= val);
    }
    if (node->right) {
        assert(node->right->parent == node);
        assert(container_of(node->right, Data, node)->val >= val);
    }
}

You should also compare against a reference data structure.

static void extract(AVLNode *node, std::multiset<uint32_t> &extracted) {
    if (!node) {
        return;
    }
    extract(node->left, extracted);
    extracted.insert(container_of(node, Data, node)->val);
    extract(node->right, extracted);
}

static void container_verify(
    Container &c, const std::multiset<uint32_t> &ref)
{
    avl_verify(NULL, c.root);
    assert(avl_cnt(c.root) == ref.size());
    std::multiset<uint32_t> extracted;
    extract(c.root, extracted);
    assert(extracted == ref);
}

Randomized Test Cases

You can start by hard coding some test data. But this is a waste of time, as a random number generator will do more with less effort.

    // random insertion
    for (uint32_t i = 0; i < 100; i++) {
        uint32_t val = (uint32_t)rand() % 1000;
        add(c, val);
        ref.insert(val);
        container_verify(c, ref);
    }
    // random deletion
    for (uint32_t i = 0; i < 200; i++) {
        // omitted ...
    }

You can seed the PRNG to make it easier to debug failed cases.

Directed Test Cases

Just feeding random numbers is not good enough, because low-probability bugs are hard to reach. And random cases may be too complicated (large) to debug.

We can try something more directed: for a given tree, try to insert into or delete from all possible positions. This avoids lots of duplicate testing in random cases.

static void test_insert(uint32_t sz) {
    // for each position in the tree
    for (uint32_t val = 0; val < sz; ++val) {
        Container c;
        std::multiset<uint32_t> ref;
        // create the tree of the given size
        for (uint32_t i = 0; i < sz; ++i) {
            if (i == val) {
                continue;
            }
            add(c, i);
            ref.insert(i);
        }
        container_verify(c, ref);
        // insert into the position
        add(c, val);
        ref.insert(val);
        container_verify(c, ref);
        dispose(c);
    }
}

We start from small tree sizes, so failed cases are likely to be small, which makes debugging easier.

This doesn’t replace the random cases, because the shape of the tree depends on the insertion order. You can try to enumerate every tree shape or insertion order, but this is only feasible for tiny trees.

10.7 What We Learned

  • The sorted set data type and its use cases.
  • Various tree data structures.
  • General binary tree implementation.
  • 2 simple rules for coding an AVL tree.
  • Randomized and directed test cases.

Source code: