11. Sorted Set From AVL Tree

Our sorted set is based on AVL tree. Real Redis uses Skip list, but the principle is the same.

11.1 Sorted Set Commands

Simplified Sorted Set Commands

We’ll implement some simplified commands to cover all sorted set operations.

Insert a pair: `ZADD key score name`

Find and remove by name:

• `ZREM key name`
• `ZSCORE key name`

Range query: `ZQUERY key score name offset limit`

• Iterate through the sublist where `pair >= (score, name)`.
• Offset the sublist and limit the result size.

Rank query: The `offset` in the `ZQUERY` command.

Range Query Commands

In real Redis, there are many range query commands:

• `ZrangeByScore`: Range query by score without name.
• `ZrangeByLex`: Assumes same score, range query by name.
• `Zrange`: Query by rank, `ZrangeByScore`, or `ZrangeByLex`.
• `Zrev*`: In descending order.
• `Zrem*`: Range delete.

Lots of work for the same functionality. So we invented the generic `ZQUERY` command, which can do everything except delete ranges, reverse order, and the `max` argument.

• `ZrangeByScore`: `ZQUERY` with an empty name.
• `ZrangeByLex`: `ZQUERY` with the fixed score.
• `Zrange` by rank: `ZQUERY` with `(-inf, "")` and offset.

11.2 Sorted Set Data Type

Multiple Ways to Index Data

A sorted set is indexed in 2 ways:

• Hashmap by name. The primary key.
• Sorted by the `(score, name)` pair. The secondary index.
``````struct ZSet {
AVLNode *tree = NULL;
HMap hmap;
};

struct ZNode {
AVLNode tree;   // index by (score, name)
HNode hmap;     // index by name
double score = 0;
size_t len = 0;
char name[0];   // variable length
};``````

Variable Length Structures

The variable length array at the end of the structure is a common way to reduce allocation overhead.

``````static ZNode *znode_new(const char *name, size_t len, double score) {
ZNode *node = (ZNode *)malloc(sizeof(ZNode) + len);
avl_init(&node->tree);
node->hmap.next = NULL;
node->hmap.hcode = str_hash((uint8_t *)name, len);
node->score = score;
node->len = len;
memcpy(&node->name[0], name, len);
return node;
}``````

You can also apply this to the key object `Entry`.

Intrusive Data Structures

One of the advantages of intrusive data structures is sharing data with multiple collections (multiple indexes). We’ll illustrate this with an example in STL:

``````// pseudo code!
struct ZSet {
std::unordered_map<std::string, double> hmap;   // index by name
std::set<std::pair<double, std::string>> tree;  // index by (score, name)
};``````

Drawbacks compared to our intrusive data structures:

• Data is duplicated in the secondary index.
• Extra heap allocations in the secondary index.

Multiple Key Types

The key object `Entry` now distinguishes between different key types.

``````enum {
T_STR = 0,  // string
T_ZSET = 1, // sorted set
};

// the structure for the key
struct Entry {
struct HNode node;
std::string key;
uint32_t type = 0;
std::string val;    // string
ZSet *zset = NULL;  // sorted set
};``````

Here we simply added a new pointer field for the sorted set type. You can also embed `ZSet` into it and make the “value” part a union to reduce allocations.

11.3 Sorted Set Lookup, Insert, and Delete

Lookup By Name

Lookup by name is just a hashtable lookup.

``````ZNode *zset_lookup(ZSet *zset, const char *name, size_t len) {
if (!zset->tree) {
return NULL;
}
HKey key;
key.node.hcode = str_hash((uint8_t *)name, len);
key.name = name;
key.len = len;
HNode *found = hm_lookup(&zset->hmap, &key.node, &hcmp);
return found ? container_of(found, ZNode, hmap) : NULL;
}``````

The compare function doesn’t have to use the same type as the node type. We used the `HKey` helper to refer to the arguments.

``````struct HKey {
HNode node;
const char *name = NULL;
size_t len = 0;
};

static bool hcmp(HNode *node, HNode *key) {
ZNode *znode = container_of(node, ZNode, hmap);
HKey *hkey = container_of(key, HKey, node);
if (znode->len != hkey->len) {
return false;
}
return 0 == memcmp(znode->name, hkey->name, znode->len);
}``````

AVL Tree Insert

Inserting into the tree is the same as in the last chapter.

``````static void tree_add(ZSet *zset, ZNode *node) {
AVLNode *cur = NULL;            // current node
AVLNode **from = &zset->tree;   // the incoming pointer to the next node
while (*from) {                 // tree search
cur = *from;
from = zless(&node->tree, cur) ? &cur->left : &cur->right;
}
*from = &node->tree;            // attach the new node
node->tree.parent = cur;
zset->tree = avl_fix(&node->tree);
}

static bool zless(AVLNode *lhs, AVLNode *rhs);  // lhs < rhs``````

Insert or Update

The `ZADD` command must handle the case where the `name` already exists.

``````bool zset_add(ZSet *zset, const char *name, size_t len, double score) {
ZNode *node = zset_lookup(zset, name, len);
if (node) {     // update the score of an existing pair
zset_update(zset, node, score);
return false;
} else {        // add a new node
node = znode_new(name, len, score);
hm_insert(&zset->hmap, &node->hmap);
return true;
}
}``````

Detaching and re-inserting the AVL node will fix the order if the score changes.

``````static void zset_update(ZSet *zset, ZNode *node, double score) {
if (node->score == score) {
return;
}
zset->tree = avl_del(&node->tree);
node->score = score;
avl_init(&node->tree);
}``````

Delete By Name

Since `zset_add()` allocates the node, deallocation should be managed by the same module so that the memory management mechanism is not leaked to the caller.

``````// lookup and detach a node by name
ZNode *zset_pop(ZSet *zset, const char *name, size_t len);
// deallocate the node
void znode_del(ZNode *node);``````

Unlinking a node from a collection and deallocating are separate steps, because the caller may use the data before deallocating it.

11.4 Sorted Set Range Query

We’ll implement the query command: `ZQUERY key score name offset limit`.

1. Seek to the pair where `pair >= (score, name)`.
2. Walk to the n-th successor/predecessor (offset).
3. Iterate and output.
``````ZNode *zset_query(ZSet *zset, double score, const char *name, size_t len) {
AVLNode *found = NULL;
for (AVLNode *cur = zset->tree; cur; ) {
if (zless(cur, score, name, len)) {
cur = cur->right;
} else {
found = cur;    // candidate
cur = cur->left;
}
}
return found ? container_of(found, ZNode, tree) : NULL;
}``````

Offset and Iterate

Iterate is just offset by ±1, which is just walking the AVL tree.

``````// offset into the succeeding or preceding node.
ZNode *znode_offset(ZNode *node, int64_t offset) {
AVLNode *tnode = node ? avl_offset(&node->tree, offset) : NULL;
return tnode ? container_of(tnode, ZNode, tree) : NULL;
}

AVLNode *avl_offset(AVLNode *node, int64_t offset);``````

`avl_offset()` will be coded later. We can code the `ZQUERY` now.

``````// ZQUERY key score name offset limit
static void do_zquery(std::vector<std::string> &cmd, std::string &out) {
// omitted ...

// 1. seek
ZNode *znode = zset_query(ent->zset, score, name.data(), name.size());
// 2. offset
znode = znode_offset(znode, offset);
// 3. iterate & output
void *arr = begin_arr(out);
uint32_t n = 0;
while (znode && (int64_t)n < limit) {
out_str(out, znode->name, znode->len);
out_dbl(out, znode->score);
znode = znode_offset(znode, +1);    // successor
n += 2;
}
end_arr(out, arr, n);
}``````

11.5 Sorted Set Rank Query

Rank Query

As mentioned before, the `offset` argument (rank) in sorted sets is special: it costs O(log(offset)) as opposed to walking through the tree nodes one by one (O(offset)). Paginating a large sorted set by offset is viable in Redis, unlike SQL.

Paginating by score may be impossible when the entire page has the same score, as you cannot use the full sort key (score + name) as the paginator in real Redis.

Walk the Tree in O(log(offset))

The subtree size maintained in the tree node tells the rank difference from its parent, which is the size of a subtree + 1.

``````    A (rank=a)
\
C (rank=a+s+1)
/             \
B (size=s)          D``````

The subtree size also tells you the maximum rank difference from the current rank, so you can tell if the target is in a subtree.

``````AVLNode *avl_offset(AVLNode *node, int64_t offset) {
int64_t pos = 0;    // relative to the starting node
while (offset != pos) {
if (pos < offset && pos + avl_cnt(node->right) >= offset) {
// the target is inside the right subtree
node = node->right;
pos += avl_cnt(node->left) + 1;
} else if (pos > offset && pos - avl_cnt(node->left) <= offset) {
// the target is inside the left subtree
node = node->left;
pos -= avl_cnt(node->right) + 1;
} else {
// go to the parent
AVLNode *parent = node->parent;
if (!parent) {
return NULL;    // out of range
}
if (parent->right == node) {
pos -= avl_cnt(node->left) + 1;
} else {
pos += avl_cnt(node->right) + 1;
}
node = parent;
}
}
return node;
}``````

`avl_offset()` works in 2 phases without knowing the absolute rank:

1. If the target is not in one of the subtrees, go up to the parent.
2. Otherwise, go down that subtree to meet the target.

Quick quiz: Find the absolute rank of a node (`ZRANK`, `ZCOUNT`).

Order Statistic Tree

The idea of `avl_offset()` is based on the rank difference (subtree size) between parent and child, which is also used by the Skip list of the real Redis.

If you add the subtree size to the B+tree used in relational DBs, they can also do fast offset and `count(*)`. Unfortunately, relational DBs are very limited in data structures, and the added information increases write amplification by requiring every update to touch every node in the path.

A tree augmented by the subtree size is called an order statistic tree. Besides sorted data, it’s also used to create array-like structures with O(log(n)) insertions at arbitrary positions.

Testing

Use the idea from the last chapter to quickly exercise all cases. Pay attention to unhappy cases such as off by one.

``````static void test_case(uint32_t sz) {
// create a tree of a given size
Container c;
for (uint32_t i = 0; i < sz; ++i) {
}
AVLNode *min = c.root;
while (min->left) {
min = min->left;
}
// for each starting rank
for (uint32_t i = 0; i < sz; ++i) {
AVLNode *node = avl_offset(min, (int64_t)i);
assert(container_of(node, Data, node)->val == i);
// test all possible offset
for (uint32_t j = 0; j < sz; ++j) {
int64_t offset = (int64_t)j - (int64_t)i;
AVLNode *n2 = avl_offset(node, offset);
assert(container_of(n2, Data, node)->val == j);
}
// out of range by one
assert(!avl_offset(node, -(int64_t)i - 1));
assert(!avl_offset(node, sz - i));
}
dispose(c.root);
}``````

11.6 What We Learned

• Multi-indexed data with intrusive data structures.
• Order statistic tree: A fast way to offset a tree node.

Source code: