11. Sorted Set From AVL Tree

Our sorted set is based on AVL tree. Real Redis uses Skip list, but the principle is the same.

11.1 Sorted Set Commands

Simplified Sorted Set Commands

We’ll implement some simplified commands to cover all sorted set operations.

Insert a pair: ZADD key score name

Find and remove by name:

  • ZREM key name
  • ZSCORE key name

Range query: ZQUERY key score name offset limit

  • Iterate through the sublist where pair >= (score, name).
  • Offset the sublist and limit the result size.

Rank query: The offset in the ZQUERY command.

Range Query Commands

In real Redis, there are many range query commands:

  • ZrangeByScore: Range query by score without name.
  • ZrangeByLex: Assumes same score, range query by name.
  • Zrange: Query by rank, ZrangeByScore, or ZrangeByLex.
  • Zrev*: In descending order.
  • Zrem*: Range delete.

Lots of work for the same functionality. So we invented the generic ZQUERY command, which can do everything except delete ranges, reverse order, and the max argument.

  • ZrangeByScore: ZQUERY with an empty name.
  • ZrangeByLex: ZQUERY with the fixed score.
  • Zrange by rank: ZQUERY with (-inf, "") and offset.

11.2 Sorted Set Data Type

Multiple Ways to Index Data

A sorted set is indexed in 2 ways:

  • Hashmap by name. The primary key.
  • Sorted by the (score, name) pair. The secondary index.
struct ZSet {
    AVLNode *tree = NULL;
    HMap hmap;
};

struct ZNode {
    AVLNode tree;   // index by (score, name)
    HNode hmap;     // index by name
    double score = 0;
    size_t len = 0;
    char name[0];   // variable length
};

Variable Length Structures

The variable length array at the end of the structure is a common way to reduce allocation overhead.

static ZNode *znode_new(const char *name, size_t len, double score) {
    ZNode *node = (ZNode *)malloc(sizeof(ZNode) + len);
    avl_init(&node->tree);
    node->hmap.next = NULL;
    node->hmap.hcode = str_hash((uint8_t *)name, len);
    node->score = score;
    node->len = len;
    memcpy(&node->name[0], name, len);
    return node;
}

You can also apply this to the key object Entry.

Intrusive Data Structures

One of the advantages of intrusive data structures is sharing data with multiple collections (multiple indexes). We’ll illustrate this with an example in STL:

// pseudo code!
struct ZSet {
    std::unordered_map<std::string, double> hmap;   // index by name
    std::set<std::pair<double, std::string>> tree;  // index by (score, name)
};

Drawbacks compared to our intrusive data structures:

  • Data is duplicated in the secondary index.
  • Extra heap allocations in the secondary index.

Multiple Key Types

The key object Entry now distinguishes between different key types.

enum {
    T_STR = 0,  // string
    T_ZSET = 1, // sorted set
};

// the structure for the key
struct Entry {
    struct HNode node;
    std::string key;
    uint32_t type = 0;
    std::string val;    // string
    ZSet *zset = NULL;  // sorted set
};

Here we simply added a new pointer field for the sorted set type. You can also embed ZSet into it and make the “value” part a union to reduce allocations.

11.3 Sorted Set Lookup, Insert, and Delete

Lookup By Name

Lookup by name is just a hashtable lookup.

ZNode *zset_lookup(ZSet *zset, const char *name, size_t len) {
    if (!zset->tree) {
        return NULL;
    }
    HKey key;
    key.node.hcode = str_hash((uint8_t *)name, len);
    key.name = name;
    key.len = len;
    HNode *found = hm_lookup(&zset->hmap, &key.node, &hcmp);
    return found ? container_of(found, ZNode, hmap) : NULL;
}

The compare function doesn’t have to use the same type as the node type. We used the HKey helper to refer to the arguments.

struct HKey {
    HNode node;
    const char *name = NULL;
    size_t len = 0;
};

static bool hcmp(HNode *node, HNode *key) {
    ZNode *znode = container_of(node, ZNode, hmap);
    HKey *hkey = container_of(key, HKey, node);
    if (znode->len != hkey->len) {
        return false;
    }
    return 0 == memcmp(znode->name, hkey->name, znode->len);
}

AVL Tree Insert

Inserting into the tree is the same as in the last chapter.

static void tree_add(ZSet *zset, ZNode *node) {
    AVLNode *cur = NULL;            // current node
    AVLNode **from = &zset->tree;   // the incoming pointer to the next node
    while (*from) {                 // tree search
        cur = *from;
        from = zless(&node->tree, cur) ? &cur->left : &cur->right;
    }
    *from = &node->tree;            // attach the new node
    node->tree.parent = cur;
    zset->tree = avl_fix(&node->tree);
}

static bool zless(AVLNode *lhs, AVLNode *rhs);  // lhs < rhs

Insert or Update

The ZADD command must handle the case where the name already exists.

bool zset_add(ZSet *zset, const char *name, size_t len, double score) {
    ZNode *node = zset_lookup(zset, name, len);
    if (node) {     // update the score of an existing pair
        zset_update(zset, node, score);
        return false;
    } else {        // add a new node
        node = znode_new(name, len, score);
        hm_insert(&zset->hmap, &node->hmap);
        tree_add(zset, node);
        return true;
    }
}

Detaching and re-inserting the AVL node will fix the order if the score changes.

static void zset_update(ZSet *zset, ZNode *node, double score) {
    if (node->score == score) {
        return;
    }
    zset->tree = avl_del(&node->tree);
    node->score = score;
    avl_init(&node->tree);
    tree_add(zset, node);
}

Delete By Name

Since zset_add() allocates the node, deallocation should be managed by the same module so that the memory management mechanism is not leaked to the caller.

// lookup and detach a node by name
ZNode *zset_pop(ZSet *zset, const char *name, size_t len);
// deallocate the node
void znode_del(ZNode *node);

Unlinking a node from a collection and deallocating are separate steps, because the caller may use the data before deallocating it.

11.4 Sorted Set Range Query

We’ll implement the query command: ZQUERY key score name offset limit.

  1. Seek to the pair where pair >= (score, name).
  2. Walk to the n-th successor/predecessor (offset).
  3. Iterate and output.
ZNode *zset_query(ZSet *zset, double score, const char *name, size_t len) {
    AVLNode *found = NULL;
    for (AVLNode *cur = zset->tree; cur; ) {
        if (zless(cur, score, name, len)) {
            cur = cur->right;
        } else {
            found = cur;    // candidate
            cur = cur->left;
        }
    }
    return found ? container_of(found, ZNode, tree) : NULL;
}

Offset and Iterate

Iterate is just offset by ±1, which is just walking the AVL tree.

// offset into the succeeding or preceding node.
ZNode *znode_offset(ZNode *node, int64_t offset) {
    AVLNode *tnode = node ? avl_offset(&node->tree, offset) : NULL;
    return tnode ? container_of(tnode, ZNode, tree) : NULL;
}

AVLNode *avl_offset(AVLNode *node, int64_t offset);

avl_offset() will be coded later. We can code the ZQUERY now.

// ZQUERY key score name offset limit
static void do_zquery(std::vector<std::string> &cmd, std::string &out) {
    // omitted ...

    // 1. seek
    ZNode *znode = zset_query(ent->zset, score, name.data(), name.size());
    // 2. offset
    znode = znode_offset(znode, offset);
    // 3. iterate & output
    void *arr = begin_arr(out);
    uint32_t n = 0;
    while (znode && (int64_t)n < limit) {
        out_str(out, znode->name, znode->len);
        out_dbl(out, znode->score);
        znode = znode_offset(znode, +1);    // successor
        n += 2;
    }
    end_arr(out, arr, n);
}

11.5 Sorted Set Rank Query

Rank Query

As mentioned before, the offset argument (rank) in sorted sets is special: it costs O(log(offset)) as opposed to walking through the tree nodes one by one (O(offset)). Paginating a large sorted set by offset is viable in Redis, unlike SQL.

Paginating by score may be impossible when the entire page has the same score, as you cannot use the full sort key (score + name) as the paginator in real Redis.

Walk the Tree in O(log(offset))

The subtree size maintained in the tree node tells the rank difference from its parent, which is the size of a subtree + 1.

    A (rank=a)
     \
      C (rank=a+s+1)
     /             \
B (size=s)          D

The subtree size also tells you the maximum rank difference from the current rank, so you can tell if the target is in a subtree.

AVLNode *avl_offset(AVLNode *node, int64_t offset) {
    int64_t pos = 0;    // relative to the starting node
    while (offset != pos) {
        if (pos < offset && pos + avl_cnt(node->right) >= offset) {
            // the target is inside the right subtree
            node = node->right;
            pos += avl_cnt(node->left) + 1;
        } else if (pos > offset && pos - avl_cnt(node->left) <= offset) {
            // the target is inside the left subtree
            node = node->left;
            pos -= avl_cnt(node->right) + 1;
        } else {
            // go to the parent
            AVLNode *parent = node->parent;
            if (!parent) {
                return NULL;    // out of range
            }
            if (parent->right == node) {
                pos -= avl_cnt(node->left) + 1;
            } else {
                pos += avl_cnt(node->right) + 1;
            }
            node = parent;
        }
    }
    return node;
}

avl_offset() works in 2 phases without knowing the absolute rank:

  1. If the target is not in one of the subtrees, go up to the parent.
  2. Otherwise, go down that subtree to meet the target.

Quick quiz: Find the absolute rank of a node (ZRANK, ZCOUNT).

Order Statistic Tree

The idea of avl_offset() is based on the rank difference (subtree size) between parent and child, which is also used by the Skip list of the real Redis.

If you add the subtree size to the B+tree used in relational DBs, they can also do fast offset and count(*). Unfortunately, relational DBs are very limited in data structures, and the added information increases write amplification by requiring every update to touch every node in the path.

A tree augmented by the subtree size is called an order statistic tree. Besides sorted data, it’s also used to create array-like structures with O(log(n)) insertions at arbitrary positions.

Testing

Use the idea from the last chapter to quickly exercise all cases. Pay attention to unhappy cases such as off by one.

static void test_case(uint32_t sz) {
    // create a tree of a given size
    Container c;
    for (uint32_t i = 0; i < sz; ++i) {
        add(c, i);
    }
    AVLNode *min = c.root;
    while (min->left) {
        min = min->left;
    }
    // for each starting rank
    for (uint32_t i = 0; i < sz; ++i) {
        AVLNode *node = avl_offset(min, (int64_t)i);
        assert(container_of(node, Data, node)->val == i);
        // test all possible offset
        for (uint32_t j = 0; j < sz; ++j) {
            int64_t offset = (int64_t)j - (int64_t)i;
            AVLNode *n2 = avl_offset(node, offset);
            assert(container_of(n2, Data, node)->val == j);
        }
        // out of range by one
        assert(!avl_offset(node, -(int64_t)i - 1));
        assert(!avl_offset(node, sz - i));
    }
    dispose(c.root);
}

11.6 What We Learned

  • Multi-indexed data with intrusive data structures.
  • Order statistic tree: A fast way to offset a tree node.

Source code: