10. The AVL Tree: Coding & Testing
While Redis is called a key-value store, the “value” part is more than byte strings:
- List. Double-ended queue. Trivial to code.
- Hashmap and hashset. We’ve already coded a hashtable.
- Sorted set. The sorting data structure in this chapter.
10.1 Sorted Set for Sorting
Range Query
A sorted set is a collection of (score, name) pairs indexed in 2 ways:
- Find the score by name. Like a hashmap.
- Range query: Getting a subset of the sorted pairs:
  - Seek to the closest pair from a target (score, name) pair.
  - Iterate in ascending or descending order from the starting pair.
The range query (seek & iterate) is like using a SQL index.
Rank Query
A sorted set can also do queries related to a pair's position in the sorted list (its rank):
- Find a pair’s rank.
- Find a pair by its rank.
- Offset a pair. Like offsetting a SQL query.
Unlike SQL indexes, these are O(log(n)) in Redis.
Sorted Set Use Cases
Suppose you have a high score board backed by a sorted set. You can:
- Find a user’s score.
- Range query: Show the list starting from a given score (pagination).
- Rank query:
- Find a user’s rank.
- Find the user with a given rank.
- Paginate the list by offset instead of score. Handy!
In addition to score boards, a sorted set can sort anything. Suppose you have a list of posts ordered by time. You can:
- Find the post time (score) by post ID (name).
- Range query: List posts starting from a given time (score).
- Rank query:
- Find the position of a post in the list. Not useful.
- Find the post ID at a given position. Not useful.
- Paginate the list by offset instead of time. Handy!
The score is a 64-bit float; if the sort key cannot be encoded as the score, you can use the name for sorting instead (with a fixed score).
Big-O of Sorted Set
- Find the score by name: O(1) by a hashtable.
- Insert a pair: O(log(n)).
- Range query: O(log(n)).
- Rank query:
- Find a pair’s rank: O(log(n)).
- Find by rank: O(log(n)).
- Offsetting: O(log(offset)).
The average cost of both inserting and searching is O(log(n)) for all practical sorting data structures.
Paginating a query by offset is avoided in SQL because the cost is O(offset). However, this is practical in Redis as the cost is reduced to O(log(offset)).
10.2 Sorting Data Structures Quick Review
Sorted Arrays
The simplest sorting data structure is a sorted array. You can bisect it in O(log(n)), but the update is O(n), so it’s not practical.
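To make the costs concrete, here is a minimal sketch (not from the book's code) of seek and insert on a sorted std::vector:

#include <algorithm>
#include <cstdint>
#include <vector>

// seek: O(log(n)), bisect to the first element >= key
static size_t seek(const std::vector<uint32_t> &a, uint32_t key) {
    return std::lower_bound(a.begin(), a.end(), key) - a.begin();
}

// update: the bisect is O(log(n)), but shifting elements makes it O(n)
static void insert_sorted(std::vector<uint32_t> &a, uint32_t key) {
    a.insert(a.begin() + seek(a, key), key);
}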
The cost of updating a sorted array can be amortized by buffering the updates in a smaller array first, and then merging them into the main array later. This idea leads to the log-structured merge-tree (LSM-tree), although it has nothing to do with trees and is rarely used as an in-memory data structure.
If we split the array into several non-overlapping ones, the update cost can be reduced. This is a 2-level tree, and the idea leads to n-ary trees like B+tree.
Tree Data Structures
Almost all other sorting data structures are trees:
- Binary trees:
- AVL tree. Balanced.
- RB tree. Balanced.
- Treap. Randomized.
- Splay tree. Requires randomized usage patterns.
- N-ary trees:
- Trie.
- B-tree. Balanced.
- Skip list. Randomized, hierarchical.
Trees are hierarchical and recursive. A tree node divides the dataset into non-overlapping subtrees; that's how trees sort data.
Balance By Randomness
The simplest tree structure is the unbalanced binary tree, where new nodes are simply added as leaves, so the tree can be irregular and deep. For example, after inserting a monotonic sequence, the tree is just a linear list.
There are ways to affect the shape of the tree to make binary trees practical. One way is to use randomness. If you insert and delete at random, the average tree depth is O(log(n)).
Splay trees also use randomness, but the randomness of lookups: a lookup rotates the target node all the way up to the root, so random lookups shuffle the shape of the tree. However, they are only usable when worst cases do not matter, because the worst case can be triggered by common usage patterns such as sequential lookups.
Treaps and Skip lists are better than Splay trees because their randomness is generated internally; their worst cases are a matter of luck instead of usage patterns.
Height-Balanced Trees
Another way to make trees practical is to limit the tree height by some invariants.
- RB tree:
- Nodes are either black or red.
- Equal number of black nodes in all paths from root to leaves.
- No 2 consecutive red nodes in a path.
- AVL tree:
- 2 subtrees may differ in height, but only by 1.
- B-tree:
- All leaves have the same height. (Only possible for n-ary trees).
The height of the trees in the worst case is O(log(n)).
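Why O(log(n))? For the AVL invariant, the standard counting argument (not spelled out in this chapter) goes: let N(h) be the minimum number of nodes in an AVL tree of height h. The invariant forces N(h) = 1 + N(h-1) + N(h-2), which grows exponentially like the Fibonacci sequence, so a tree of n nodes has height O(log(n)). Similar arguments apply to the RB tree and B-tree invariants.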
Comparison of Trees
| Tree       | Worst case | Branch | Random | Difficulty |
|------------|------------|--------|--------|------------|
| AVL tree   | O(log(n))  | 2      | No     | Medium     |
| RB tree    | O(log(n))  | 2      | No     | High       |
| Treap      | O(n)       | 2      | Yes    | Low        |
| Splay tree | O(n)       | 2      | Usage  | Low        |
| B-tree     | O(log(n))  | n      | No     | High       |
| Skip list  | O(n)       | n      | Yes    | Medium     |
Splay trees are not general-purpose due to their worst-case behavior. They are designed for fast access to hot data.
B-trees are simple in concept, but coding one is non-trivial. They are used for disk storage where binary trees are impractical.
RB trees are simple to insert, but deleting is complicated and non-obvious. So I prefer the much simpler AVL tree.
The remaining 3 candidates are AVL tree, Treap, and Skip list. AVL trees have bounded worst cases and are the most well-known, so that’s what we’ll use.
10.3 Implementing Binary Trees
These things are the same for all kinds of binary trees.
Search and Insert
Starting from the root: if the target is not in the current node, it is in either the left or the right subtree.
An insert simply places the new node where the search ends.
When the tree is implemented as an intrusive data structure, there is no dedicated function for searching or inserting, because they are so trivial.
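As a sketch of how trivial it is (using the AVLNode type defined later in this chapter; get_key() is a hypothetical accessor that recovers the key from an intrusive node, e.g. via container_of):

// minimal sketch: search an intrusive binary tree for a key
static AVLNode *tree_search(AVLNode *node, uint32_t key) {
    while (node) {
        uint32_t node_key = get_key(node);  // hypothetical accessor
        if (key == node_key) {
            return node;
        }
        node = (key < node_key) ? node->left : node->right;
    }
    return NULL;    // not found
}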
Delete
- Remove a leaf node: simple.
- Remove an internal node:
  - If the right subtree is not empty:
    - Swap it with its successor in the right subtree.
    - Remove the successor node (which is at a lower position).
  - If the right subtree is empty:
    - Return the left subtree.
Notes:
- The process is recursive; it ends eventually because the target to remove moves lower and lower.
- The case for leaves is redundant, since the case for internal nodes also works.
- Trees are symmetric, so things also work if left and right are swapped.
Iterate
The successor node is the leftmost node in the right subtree; but when the right subtree is empty, you have to go back up through the parent nodes instead. That's why a tree node includes a parent pointer.
Parent pointers are not strictly necessary, though: you can instead keep a stack of nodes (the path from the root). This saves memory but makes the code clumsy.
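A sketch of successor iteration with parent pointers (the function name is ours; AVLNode is defined in the next section):

// successor: the leftmost node of the right subtree, or else the
// nearest ancestor whose left subtree contains the node
static AVLNode *avl_successor(AVLNode *node) {
    if (node->right) {
        node = node->right;
        while (node->left) {
            node = node->left;
        }
        return node;
    }
    while (node->parent && node == node->parent->right) {
        node = node->parent;
    }
    return node->parent;    // NULL if there is no successor
}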
10.4 AVL Tree
Invariant: Height Difference Only By One
An AVL tree is a binary tree with an extra step: inserts and deletes are done as normal, then the tree is fixed up by some rules.
The height of a subtree is the maximum distance to leaves. For each node in an AVL tree, the height of its 2 subtrees can differ, but only by 1.
The rules for fixing the heights can be the same for insert and delete. That’s why AVL trees are far easier than RB trees.
Rotations Keep the Order
Rotations change the shape of a subtree while keeping its order. A visualization of a left rotation:
  2             4
 / \           / \
1   4   ==>   2   5
   / \       / \
  3   5     1   3
In Splay trees and Treaps, rotations are used to move the target node up.
Rotations Adjust Heights
Inserting or removing a node changes the subtree height by 1. This can cause a height difference of 2 at the parent node, which is restored by rotations.
Rule 1: A right rotation restores balance if:
- Condition 1: The left subtree (B) is deeper by 2.
- Condition 2: The left-left subtree (A) is deeper than the left-right subtree (C).
      D (h+3)              B (h+2)
     /       \            /       \
    B (h+2)   E (h)  ==> A (h+1)   D (h+1)
   /       \                      /       \
  A (h+1)   C (h)                C (h)     E (h)
- Condition 1 is initially introduced by insertion or deletion.
- Condition 2 can be obtained by applying rule 2 if unmet.
Rule 2: A right rotation makes the right subtree deeper than the left subtree while keeping the rotated height if:
- The left subtree (B) is deeper by 1.
      D (h+2)              B (h+2)
     /       \            /       \
    B (h+1)   E (h)  ==> A (h-1)   D (h+1)
   /       \                      /       \
  A (h-1)   C (h)                C (h)     E (h)
The height of A does not matter in the above diagram; node A can even cause an imbalance, but the following rule 1 still works.
Both rules are left-right mirrored. The left rotation rule 2 is paired with the right rotation rule 1.
The fix starts from the initial node:
- Check for imbalance (height difference of 2).
- Stop if balanced.
- If rule 1 is not applicable (condition 2 unmet), apply the opposite rotation rule 2.
- Apply rule 1.
- Go to the parent node and repeat.
10.5 Coding an AVL Tree
Step 1: Node Definition
struct AVLNode {
    uint32_t depth = 0;     // subtree height
    uint32_t cnt = 0;       // subtree size
    AVLNode *left = NULL;
    AVLNode *right = NULL;
    AVLNode *parent = NULL;
};

static void avl_init(AVLNode *node) {
    node->depth = 1;
    node->cnt = 1;
    node->left = node->right = node->parent = NULL;
}
The subtree height is stored in the node, which is the most intuitive way to code an AVL tree. However, the exact height is unnecessary; some implementations store the height difference instead, which takes only 2 bits.
If only 2 bits are needed, they can be packed into one of the pointers, since x64 pointers don't use all of their bits. However, this optimization is useless to us because we also have the extra cnt field.

cnt stores the subtree size; it's used for rank queries in the next chapter.
Some helper functions for updating and using these auxiliary data:
static uint32_t avl_depth(AVLNode *node) {
    return node ? node->depth : 0;
}
static uint32_t avl_cnt(AVLNode *node) {
    return node ? node->cnt : 0;
}

// maintain the depth and cnt field
static void avl_update(AVLNode *node) {
    node->depth = 1 + max(avl_depth(node->left), avl_depth(node->right));
    node->cnt = 1 + avl_cnt(node->left) + avl_cnt(node->right);
}
Step 2: Rotations
  2             4
 / \           / \
1   4   ==>   2   5
   / \       / \
  3   5     1   3
Only 2 lines do the actual rotation; the rest maintains parent pointers and auxiliary data.
static AVLNode *rot_left(AVLNode *node) {
    AVLNode *new_node = node->right;
    if (new_node->left) {
        new_node->left->parent = node;
    }
    node->right = new_node->left;   // rotation
    new_node->left = node;          // rotation
    new_node->parent = node->parent;
    node->parent = new_node;
    avl_update(node);
    avl_update(new_node);
    return new_node;
}

static AVLNode *rot_right(AVLNode *node);   // mirrored
Rotation functions must return the rotated node, because the tree root may be updated.
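rot_right is only declared above because it's the exact mirror image; a sketch of what the mirrored code looks like:

// mirror image of rot_left: left and right swapped
static AVLNode *rot_right(AVLNode *node) {
    AVLNode *new_node = node->left;
    if (new_node->right) {
        new_node->right->parent = node;
    }
    node->left = new_node->right;   // rotation
    new_node->right = node;         // rotation
    new_node->parent = node->parent;
    node->parent = new_node;
    avl_update(node);
    avl_update(new_node);
    return new_node;
}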
Step 3: Rules for Rotations
Rule 1: A right rotation restores balance if the left subtree is deeper by 2, and the left left subtree is deeper than the left right subtree.
Rule 2: A left rotation makes the left subtree deeper than the right subtree while keeping the rotated height if the right subtree is deeper by 1.
// the left subtree is too deep
static AVLNode *avl_fix_left(AVLNode *root) {
    if (avl_depth(root->left->left) < avl_depth(root->left->right)) {
        root->left = rot_left(root->left);  // rule 2
    }
    return rot_right(root);                 // rule 1
}
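avl_fix_right, used by avl_fix() in the next step, is the mirror image; a sketch:

// the right subtree is too deep (mirror of avl_fix_left)
static AVLNode *avl_fix_right(AVLNode *root) {
    if (avl_depth(root->right->right) < avl_depth(root->right->left)) {
        root->right = rot_right(root->right);   // rule 2
    }
    return rot_left(root);                      // rule 1
}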
Step 4: Fix Imbalances
avl_fix() fixes imbalances after an insertion or deletion. It loops from the initial node to the root node and returns the new root.
// fix imbalanced nodes and maintain invariants until the root is reached
static AVLNode *avl_fix(AVLNode *node) {
    while (true) {
        avl_update(node);
        uint32_t l = avl_depth(node->left);
        uint32_t r = avl_depth(node->right);
        AVLNode **from = NULL;
        if (AVLNode *p = node->parent) {
            from = (p->left == node) ? &p->left : &p->right;
        }
        if (l == r + 2) {
            node = avl_fix_left(node);
        } else if (l + 2 == r) {
            node = avl_fix_right(node);
        }
        if (!from) {
            return node;
        }
        *from = node;
        node = node->parent;
    }
}
Step 5: Delete
The general binary tree delete:
- If the right subtree is not empty:
- Swap it with the successor in the right subtree.
- Remove the successor node (which is at a lower position).
- If the right subtree is empty:
- Return the left subtree.
The avl_fix() call in the non-recursive case is the extra step of the AVL tree.
// detach a node and return the new root of the tree
static AVLNode *avl_del(AVLNode *node) {
    if (node->right == NULL) {
        // no right subtree, replace the node with the left subtree
        // link the left subtree to the parent
        AVLNode *parent = node->parent;
        if (node->left) {
            node->left->parent = parent;
        }
        if (parent) {   // attach the left subtree to the parent
            (parent->left == node ? parent->left : parent->right) = node->left;
            return avl_fix(parent); // AVL-specific!
        } else {        // removing root?
            return node->left;
        }
    } else {
        // detach the successor
        AVLNode *victim = node->right;
        while (victim->left) {
            victim = victim->left;
        }
        AVLNode *root = avl_del(victim);
        // swap with it
        *victim = *node;
        if (victim->left) {
            victim->left->parent = victim;
        }
        if (victim->right) {
            victim->right->parent = victim;
        }
        if (AVLNode *parent = node->parent) {
            (parent->left == node ? parent->left : parent->right) = victim;
            return root;
        } else {        // removing root?
            return victim;
        }
    }
}
Step 6: Search and Insert
Searching a binary tree is not much more work than looping through a linked list. A search function is not required for an intrusive tree, since the tree node is exposed to the user, so we will just stop there. Our AVL tree API consists of:
- avl_fix(): Fixes the tree after insertion.
- avl_del(): Detaches a node. (Also calls avl_fix().)
10.6 Testing the AVL Tree
Set up the Tests
Read the hashtable chapter if unfamiliar with intrusive data structures.
struct Data {
    AVLNode node;
    uint32_t val = 0;
};

struct Container {
    AVLNode *root = NULL;
};
Regular binary tree insertion with avl_fix():
static void add(Container &c, uint32_t val) {
    Data *data = new Data();    // allocate the data
    avl_init(&data->node);
    data->val = val;

    AVLNode *cur = NULL;        // current node
    AVLNode **from = &c.root;   // the incoming pointer to the next node
    while (*from) {             // tree search
        cur = *from;
        uint32_t node_val = container_of(cur, Data, node)->val;
        from = (val < node_val) ? &cur->left : &cur->right;
    }
    *from = &data->node;        // attach the new node
    data->node.parent = cur;
    c.root = avl_fix(&data->node);
}
Using the incoming pointer avoids the special case of adding the first node as the root. (See the hashtable chapter on list node deletion.)
Search and delete with avl_del():
static bool del(Container &c, uint32_t val) {
    AVLNode *cur = c.root;
    while (cur) {
        uint32_t node_val = container_of(cur, Data, node)->val;
        if (val == node_val) {
            break;
        }
        cur = val < node_val ? cur->left : cur->right;
    }
    if (!cur) {
        return false;
    }
    c.root = avl_del(cur);
    delete container_of(cur, Data, node);
    return true;
}
Verify the Tree Structures
For each node, verify that:
static void avl_verify(AVLNode *parent, AVLNode *node) {
    if (!node) {
        return;
    }
    // verify subtrees recursively
    avl_verify(node, node->left);
    avl_verify(node, node->right);
    // 1. The parent pointer is correct.
    assert(node->parent == parent);
    // 2. The auxiliary data is correct.
    assert(node->cnt == 1 + avl_cnt(node->left) + avl_cnt(node->right));
    uint32_t l = avl_depth(node->left);
    uint32_t r = avl_depth(node->right);
    assert(node->depth == 1 + max(l, r));
    // 3. The height invariant is OK.
    assert(l == r || l + 1 == r || l == r + 1);
    // 4. The data is ordered.
    uint32_t val = container_of(node, Data, node)->val;
    if (node->left) {
        assert(node->left->parent == node);
        assert(container_of(node->left, Data, node)->val <= val);
    }
    if (node->right) {
        assert(node->right->parent == node);
        assert(container_of(node->right, Data, node)->val >= val);
    }
}
You should also compare against a reference data structure.
static void extract(AVLNode *node, std::multiset<uint32_t> &extracted) {
    if (!node) {
        return;
    }
    extract(node->left, extracted);
    extracted.insert(container_of(node, Data, node)->val);
    extract(node->right, extracted);
}

static void container_verify(Container &c, const std::multiset<uint32_t> &ref) {
    avl_verify(NULL, c.root);
    assert(avl_cnt(c.root) == ref.size());
    std::multiset<uint32_t> extracted;
    extract(c.root, extracted);
    assert(extracted == ref);
}
Randomized Test Cases
You can start by hard coding some test data. But this is a waste of time, as a random number generator will do more with less effort.
// random insertion
for (uint32_t i = 0; i < 100; i++) {
    uint32_t val = (uint32_t)rand() % 1000;
    add(c, val);
    ref.insert(val);
    container_verify(c, ref);
}
// random deletion
for (uint32_t i = 0; i < 200; i++) {
    // omitted ...
}
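The omitted loop is left as an exercise; one plausible shape, assuming the same c and ref as in the insertion loop:

// random deletion: keep the tree and the reference in sync,
// and check that they agree on whether the value was present
for (uint32_t i = 0; i < 200; i++) {
    uint32_t val = (uint32_t)rand() % 1000;
    auto it = ref.find(val);
    if (it == ref.end()) {
        assert(!del(c, val));
    } else {
        assert(del(c, val));
        ref.erase(it);      // erase one copy, not all duplicates
    }
    container_verify(c, ref);
}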
You can seed the PRNG to make it easier to debug failed cases.
Directed Test Cases
Just feeding random numbers is not good enough, because low-probability bugs are hard to reach. And random cases may be too complicated (large) to debug.
We can try something more directed: for a given tree, try to insert into or delete from all possible positions. This avoids lots of duplicate testing in random cases.
static void test_insert(uint32_t sz) {
    // for each position in the tree
    for (uint32_t val = 0; val < sz; ++val) {
        Container c;
        std::multiset<uint32_t> ref;
        // create the tree of the given size
        for (uint32_t i = 0; i < sz; ++i) {
            if (i == val) {
                continue;
            }
            add(c, i);
            ref.insert(i);
        }
        container_verify(c, ref);
        // insert into the position
        add(c, val);
        ref.insert(val);
        container_verify(c, ref);
        dispose(c);
    }
}
We start from small tree sizes, so failed cases are likely to be small, which makes debugging easier.
This doesn’t replace the random cases, because the shape of the tree depends on the insertion order. You can try to enumerate every tree shape or insertion order, but this is only feasible for tiny trees.
10.7 What We Learned
- The sorted set data type and its use cases.
- Various tree data structures.
- General binary tree implementation.
- 2 simple rules for coding an AVL tree.
- Randomized and directed test cases.
Source code: