11. Sorted Set From AVL Tree
Our sorted set is based on AVL tree. Real Redis uses Skip list, but the principle is the same.
11.1 Sorted Set Commands
Simplified Sorted Set Commands
We’ll implement some simplified commands to cover all sorted set operations.
Insert a pair: ZADD key score name
Find and remove by name:
ZREM key name
ZSCORE key name
Range query: ZQUERY key score name offset limit
- Iterate through the sublist where
pair >= (score, name)
. - Offset the sublist and limit the result size.
Rank query: The offset
in the ZQUERY
command.
Range Query Commands
In real Redis, there are many range query commands:
ZrangeByScore
: Range query by score without name.ZrangeByLex
: Assumes same score, range query by name.Zrange
: Query by rank,ZrangeByScore
, orZrangeByLex
.Zrev*
: In descending order.Zrem*
: Range delete.
Lots of work for the same functionality. So we invented the generic
ZQUERY
command, which can do everything except delete
ranges, reverse order, and the max
argument.
ZrangeByScore
:ZQUERY
with an empty name.ZrangeByLex
:ZQUERY
with the fixed score.Zrange
by rank:ZQUERY
with(-inf, "")
and offset.
11.2 Sorted Set Data Type
Multiple Ways to Index Data
A sorted set is indexed in 2 ways:
- Hashmap by name. The primary key.
- Sorted by the
(score, name)
pair. The secondary index.
struct ZSet {
*tree = NULL;
AVLNode ;
HMap hmap};
struct ZNode {
; // index by (score, name)
AVLNode tree; // index by name
HNode hmapdouble score = 0;
size_t len = 0;
char name[0]; // variable length
};
Variable Length Structures
The variable length array at the end of the structure is a common way to reduce allocation overhead.
static ZNode *znode_new(const char *name, size_t len, double score) {
*node = (ZNode *)malloc(sizeof(ZNode) + len);
ZNode (&node->tree);
avl_init->hmap.next = NULL;
node->hmap.hcode = str_hash((uint8_t *)name, len);
node->score = score;
node->len = len;
node(&node->name[0], name, len);
memcpyreturn node;
}
You can also apply this to the key object Entry
.
Intrusive Data Structures
One of the advantages of intrusive data structures is sharing data with multiple collections (multiple indexes). We’ll illustrate this with an example in STL:
// pseudo code!
struct ZSet {
std::unordered_map<std::string, double> hmap; // index by name
std::set<std::pair<double, std::string>> tree; // index by (score, name)
};
Drawbacks compared to our intrusive data structures:
- Data is duplicated in the secondary index.
- Extra heap allocations in the secondary index.
Multiple Key Types
The key object Entry
now distinguishes between different
key types.
enum {
= 0, // string
T_STR = 1, // sorted set
T_ZSET };
// the structure for the key
struct Entry {
struct HNode node;
std::string key;
uint32_t type = 0;
std::string val; // string
*zset = NULL; // sorted set
ZSet };
Here we simply added a new pointer field for the sorted set type. You
can also embed ZSet
into it and make the “value” part a
union to reduce allocations.
11.3 Sorted Set Lookup, Insert, and Delete
Lookup By Name
Lookup by name is just a hashtable lookup.
*zset_lookup(ZSet *zset, const char *name, size_t len) {
ZNode if (!zset->tree) {
return NULL;
}
;
HKey key.node.hcode = str_hash((uint8_t *)name, len);
key.name = name;
key.len = len;
key*found = hm_lookup(&zset->hmap, &key.node, &hcmp);
HNode return found ? container_of(found, ZNode, hmap) : NULL;
}
The compare function doesn’t have to use the same type as the node
type. We used the HKey
helper to refer to the
arguments.
struct HKey {
;
HNode nodeconst char *name = NULL;
size_t len = 0;
};
static bool hcmp(HNode *node, HNode *key) {
*znode = container_of(node, ZNode, hmap);
ZNode *hkey = container_of(key, HKey, node);
HKey if (znode->len != hkey->len) {
return false;
}
return 0 == memcmp(znode->name, hkey->name, znode->len);
}
AVL Tree Insert
Inserting into the tree is the same as in the last chapter.
static void tree_add(ZSet *zset, ZNode *node) {
*cur = NULL; // current node
AVLNode **from = &zset->tree; // the incoming pointer to the next node
AVLNode while (*from) { // tree search
= *from;
cur = zless(&node->tree, cur) ? &cur->left : &cur->right;
from }
*from = &node->tree; // attach the new node
->tree.parent = cur;
node->tree = avl_fix(&node->tree);
zset}
static bool zless(AVLNode *lhs, AVLNode *rhs); // lhs < rhs
Insert or Update
The ZADD
command must handle the case where the
name
already exists.
bool zset_add(ZSet *zset, const char *name, size_t len, double score) {
*node = zset_lookup(zset, name, len);
ZNode if (node) { // update the score of an existing pair
(zset, node, score);
zset_updatereturn false;
} else { // add a new node
= znode_new(name, len, score);
node (&zset->hmap, &node->hmap);
hm_insert(zset, node);
tree_addreturn true;
}
}
Detaching and re-inserting the AVL node will fix the order if the score changes.
static void zset_update(ZSet *zset, ZNode *node, double score) {
if (node->score == score) {
return;
}
->tree = avl_del(&node->tree);
zset->score = score;
node(&node->tree);
avl_init(zset, node);
tree_add}
Delete By Name
Since zset_add()
allocates the node, deallocation should
be managed by the same module so that the memory management
mechanism is not leaked to the caller.
// lookup and detach a node by name
*zset_pop(ZSet *zset, const char *name, size_t len);
ZNode // deallocate the node
void znode_del(ZNode *node);
Unlinking a node from a collection and deallocating are separate steps, because the caller may use the data before deallocating it.
11.4 Sorted Set Range Query
We’ll implement the query command:
ZQUERY key score name offset limit
.
- Seek to the pair where
pair >= (score, name)
. - Walk to the n-th successor/predecessor (offset).
- Iterate and output.
Seek is Just a Tree Search
*zset_query(ZSet *zset, double score, const char *name, size_t len) {
ZNode *found = NULL;
AVLNode for (AVLNode *cur = zset->tree; cur; ) {
if (zless(cur, score, name, len)) {
= cur->right;
cur } else {
= cur; // candidate
found = cur->left;
cur }
}
return found ? container_of(found, ZNode, tree) : NULL;
}
Offset and Iterate
Iterate is just offset by ±1, which is just walking the AVL tree.
// offset into the succeeding or preceding node.
*znode_offset(ZNode *node, int64_t offset) {
ZNode *tnode = node ? avl_offset(&node->tree, offset) : NULL;
AVLNode return tnode ? container_of(tnode, ZNode, tree) : NULL;
}
*avl_offset(AVLNode *node, int64_t offset); AVLNode
avl_offset()
will be coded later. We can code the
ZQUERY
now.
// ZQUERY key score name offset limit
static void do_zquery(std::vector<std::string> &cmd, std::string &out) {
// omitted ...
// 1. seek
*znode = zset_query(ent->zset, score, name.data(), name.size());
ZNode // 2. offset
= znode_offset(znode, offset);
znode // 3. iterate & output
void *arr = begin_arr(out);
uint32_t n = 0;
while (znode && (int64_t)n < limit) {
(out, znode->name, znode->len);
out_str(out, znode->score);
out_dbl= znode_offset(znode, +1); // successor
znode += 2;
n }
(out, arr, n);
end_arr}
11.5 Sorted Set Rank Query
Rank Query
As mentioned before, the offset
argument (rank) in
sorted sets is special: it costs O(log(offset)) as opposed to walking
through the tree nodes one by one (O(offset)). Paginating a large sorted
set by offset is viable in Redis, unlike SQL.
Paginating by score may be impossible when the entire page has the same score, as you cannot use the full sort key (score + name) as the paginator in real Redis.
Walk the Tree in O(log(offset))
The subtree size maintained in the tree node tells the rank difference from its parent, which is the size of a subtree + 1.
A (rank=a)
\
C (rank=a+s+1)
/ \
B (size=s) D
The subtree size also tells you the maximum rank difference from the current rank, so you can tell if the target is in a subtree.
*avl_offset(AVLNode *node, int64_t offset) {
AVLNode int64_t pos = 0; // relative to the starting node
while (offset != pos) {
if (pos < offset && pos + avl_cnt(node->right) >= offset) {
// the target is inside the right subtree
= node->right;
node += avl_cnt(node->left) + 1;
pos } else if (pos > offset && pos - avl_cnt(node->left) <= offset) {
// the target is inside the left subtree
= node->left;
node -= avl_cnt(node->right) + 1;
pos } else {
// go to the parent
*parent = node->parent;
AVLNode if (!parent) {
return NULL; // out of range
}
if (parent->right == node) {
-= avl_cnt(node->left) + 1;
pos } else {
+= avl_cnt(node->right) + 1;
pos }
= parent;
node }
}
return node;
}
avl_offset()
works in 2 phases without knowing the
absolute rank:
- If the target is not in one of the subtrees, go up to the parent.
- Otherwise, go down that subtree to meet the target.
Quick quiz: Find the absolute rank of a node (ZRANK
,
ZCOUNT
).
Order Statistic Tree
The idea of avl_offset()
is based on the rank difference
(subtree size) between parent and child, which is also used by the Skip
list of the real Redis.
If you add the subtree size to the B+tree used in relational DBs,
they can also do fast offset and count(*)
. Unfortunately,
relational DBs are very limited in data structures, and the added
information increases write amplification by requiring every update to
touch every node in the path.
A tree augmented by the subtree size is called an order statistic tree. Besides sorted data, it’s also used to create array-like structures with O(log(n)) insertions at arbitrary positions.
Testing
Use the idea from the last chapter to quickly exercise all cases. Pay attention to unhappy cases such as off by one.
static void test_case(uint32_t sz) {
// create a tree of a given size
;
Container cfor (uint32_t i = 0; i < sz; ++i) {
(c, i);
add}
*min = c.root;
AVLNode while (min->left) {
= min->left;
min }
// for each starting rank
for (uint32_t i = 0; i < sz; ++i) {
*node = avl_offset(min, (int64_t)i);
AVLNode assert(container_of(node, Data, node)->val == i);
// test all possible offset
for (uint32_t j = 0; j < sz; ++j) {
int64_t offset = (int64_t)j - (int64_t)i;
*n2 = avl_offset(node, offset);
AVLNode assert(container_of(n2, Data, node)->val == j);
}
// out of range by one
assert(!avl_offset(node, -(int64_t)i - 1));
assert(!avl_offset(node, sz - i));
}
(c.root);
dispose}
11.6 What We Learned
- Multi-indexed data with intrusive data structures.
- Order statistic tree: A fast way to offset a tree node.
Source code: