8b. Data Structure: Hashtables (Part 2)
8.6 Fixed-size chaining hashtable
We’ll break the implementation into 2 phases:
- Fixed-size hashtable with intrusive linked lists.
- Resizable hashtable with progressive rehashing.
Step 0: Choose a hash function
There are several classes of hash functions with different properties for different uses:
- Cryptographic hash functions, such as MD5, SHA1.
- Checksum hash functions, such as CRC32, Adler32.
- Hash functions for hashtables, such as FNV, Murmur. This is what to use!
They are all called hash functions, but their uses do not overlap. Do not use cryptographic hash functions for hashtables because they are slow and overkill. Do not use hashtable hash functions for anything else because they are not strong enough.
Step 1: Define the intrusive list node
struct HNode {
*next = NULL;
HNode uint64_t hcode = 0; // hash value
};
Just an intrusive linked list node with the hash value of the key. An intrusive hashtable doesn’t care about the data, but it still needs the hash value for the insertion.
Step 2: Define the fixed-size hashtable
struct HTab {
**tab = NULL; // array of slots
HNode size_t mask = 0; // power of 2 array size, 2^n - 1
size_t size = 0; // number of keys
};
hash(key) % N
maps a hash value to a slot. Modulo or
division are slow CPU operations, so it’s common to use powers of 2
sizes. Modulo by a power of 2 is just taking the lower bits, so it can
be done by a fast bitwise and: hash(key) & (N-1)
.
static void h_init(HTab *htab, size_t n) {
assert(n > 0 && ((n - 1) & n) == 0); // n must be a power of 2
->tab = (HNode **)calloc(n, sizeof(HNode *));
htab->mask = n - 1;
htab->size = 0;
htab}
Step 3: Linked list insertion
static void h_insert(HTab *htab, HNode *node) {
size_t pos = node->hcode & htab->mask;
*next = htab->tab[pos];
HNode ->next = next;
node->tab[pos] = node;
htab->size++;
htab}
New nodes are inserted at the front, so insertion is O(1) in the worst case. Being
intrusive means that there’s no allocation in the data structure code,
unlike std::list
.
Linked lists are the most common collections in large C projects like
the Linux kernel, and almost all practical linked lists are
intrusive, non-intrusive linked lists like
std::list
are rarely useful.
Step 6: Hashtable lookup
The generic search function needs a callback to do the equality comparison, as it knows nothing about the data. Note that it compares the hash value before calling the callback, this is an optimization to rule out candidates early.
static HNode **h_lookup(HTab *htab, HNode *key, bool (*eq)(HNode *, HNode *)) {
if (!htab->tab) {
return NULL;
}
size_t pos = key->hcode & htab->mask;
**from = &htab->tab[pos]; // incoming pointer to the target
HNode for (HNode *cur; (cur = *from) != NULL; from = &cur->next) {
if (cur->hcode == key->hcode && eq(cur, key)) {
return from; // Q: Why not return `cur`? This is for deletion.
}
}
return NULL;
}
The function should return the node found, but it actually returns the address of its parent pointer, requiring a pointer dereference to get the target pointer. Why? Because we need the address of the pointer to delete it.
Step 7: Hashtable deletion
To remove a linked list node, we need to update the pointer in its
parent node. That’s why h_lookup()
returns the address of
the parent pointer. No need for a separate search-and-delete
function!
static HNode *h_detach(HTab *htab, HNode **from) {
*node = *from; // the target node
HNode *from = node->next; // update the incoming pointer to the target
->size--;
htabreturn node;
}
But what if we remove the first node without a parent node? This seems to be a special case.
- If the node is the first node, update the array slot.
- Otherwise, update its parent node.
The above cases are unnecessary, because h_lookup()
returns the address of the to-be-updated pointer, it doesn’t matter if
it’s from a slot or a node.
**from = &htab->tab[pos]; // incoming pointer to the target
HNode for (HNode *cur; (cur = *from) != NULL; from = &cur->next) {
if (cur->hcode == key->hcode && eq(cur, key)) {
return from; // may be a node, may be a slot
}
}
When you think harder, special cases may not be special at all.
- Removing the first list node is not special.
- No need for a search-and-delete function.
8.7 Resizable hashtable
Step 8: Define the hashtable interfaces
The resizable HMap
is based on the fixed-size
HTab
. It contains 2 of them for the progressive
rehashing.
struct HMap {
;
HTab newer;
HTab oldersize_t resizing_pos = 0;
};
The get, set, del interfaces:
*hm_lookup(HMap *hmap, HNode *key, bool (*eq)(HNode *, HNode *));
HNode void hm_insert(HMap *hmap, HNode *node);
*hm_delete(HMap *hmap, HNode *key, bool (*eq)(HNode *, HNode *)); HNode
Step 9: Deal with 2 hashtables during rehashing
Normally, HMap::newer
is used while
HMap::older
is unused. When the load factor is too high,
HMap::newer
is moved to HMap::older
, and
HMap::newer
is replaced by a larger, empty table. We used
power of 2 sizes, so the newer table is simply doubled.
static void hm_trigger_rehashing(HMap *hmap) {
->older = hmap->newer; // (newer, older) <- (new_table, newer)
hmap(&hmap->newer, (hmap->newer.mask + 1) * 2);
h_init->resizing_pos = 0;
hmap}
During rehashing, lookup or delete may need to query both tables.
*hm_lookup(HMap *hmap, HNode *key, bool (*eq)(HNode *, HNode *)) {
HNode **from = h_lookup(&hmap->newer, key, eq);
HNode if (!from) {
= h_lookup(&hmap->older, key, eq);
from }
return from ? *from : NULL;
}
*hm_delete(HMap *hmap, HNode *key, bool (*eq)(HNode *, HNode *)) {
HNode if (HNode **from = h_lookup(&hmap->newer, key, eq)) {
return h_detach(&hmap->newer, from);
}
if (HNode **from = h_lookup(&hmap->older, key, eq)) {
return h_detach(&hmap->older, from);
}
return NULL;
}
Step 10: Trigger rehashing by the load factor
Insertions always update the newer table. It triggers rehashing when the load factor is high.
void hm_insert(HMap *hmap, HNode *node) {
if (!hmap->newer.tab) {
(&hmap->newer, 4); // initialized it if empty
h_init}
(&hmap->newer, node); // always insert to the newer table
h_insertif (!hmap->older.tab) { // check whether we need to rehash
size_t shreshold = (hmap->newer.mask + 1) * k_max_load_factor;
if (hmap->newer.size >= shreshold) {
(hmap);
hm_trigger_rehashing}
}
(hmap); // migrate some keys
hm_help_rehashing}
For a chaining hashtable, the load factor limit should be greater than 1 because each slot is intended to hold multiple items.
const size_t k_max_load_factor = 8;
hm_help_rehashing()
moves some keys to the newer table.
We can also trigger this from lookup and delete, it doesn’t hurt as it
does O(1) work.
*hm_lookup(HMap *hmap, HNode *key, bool (*eq)(HNode *, HNode *)) {
HNode (hmap);
hm_help_rehashing// ...
}
*hm_delete(HMap *hmap, HNode *key, bool (*eq)(HNode *, HNode *)) {
HNode (hmap);
hm_help_rehashing// ...
}
Step 11: Progressive rehashing
hm_help_rehashing()
scans each slot, moves a constant
number of items, and then exits. It remembers the last slot index in
HMap::resizing_pos
.
const size_t k_resizing_work = 128; // constant work
static void hm_help_rehashing(HMap *hmap) {
size_t nwork = 0;
while (nwork < k_resizing_work && hmap->older.size > 0) {
// find a non-empty slot
**from = &hmap->older.tab[hmap->resizing_pos];
HNode if (!*from) {
->resizing_pos++;
hmapcontinue; // empty slot
}
// move the first list item to the newer table
(&hmap->newer, h_detach(&hmap->older, from));
h_insert++;
nwork}
// discard the old table if done
if (hmap->older.size == 0 && hmap->older.tab) {
(hmap->older.tab);
free->older = HTab{};
hmap}
}
8.8 Redis command: get, set, del
Step 1: Define data types
static struct {
; // top-level hashtable
HMap db} g_data;
// KV pair for the top-level hashtable
struct Entry {
struct HNode node; // hashtable node
std::string key;
std::string val;
};
struct Entry
is the KV pair with an embedded hashtable
node. The equality callback shows how to use the
container_of
macro.
static bool entry_eq(HNode *lhs, HNode *rhs) {
struct Entry *le = container_of(lhs, struct Entry, node);
struct Entry *re = container_of(rhs, struct Entry, node);
return le->key == re->key;
}
Step 2: Implement get, set, del
We used a dummy Entry
with just a key in it for the
comparisons during the lookup.
static void do_get(std::vector<std::string> &cmd, Response &out) {
// a dummy `Entry` just for the lookup
;
Entry key.key.swap(cmd[1]);
key.node.hcode = str_hash((uint8_t *)key.key.data(), key.key.size());
key// hashtable lookup
*node = hm_lookup(&g_data.db, &key.node, &entry_eq);
HNode if (!node) {
.status = RES_NX;
outreturn;
}
// copy the value
const std::string &val = container_of(node, Entry, node)->val;
assert(val.size() <= k_max_msg);
.data.assign(val.begin(), val.end());
out}
Note that the key doesn’t need to be an Entry
, the key
can be other types that embed an HNode
, as long as you
update entry_eq()
accordingly. Right now Entry
is just a KV pair, it will get larger as we go. We can define a leaner
struct for lookups.
struct LookupKey { // for lookup only
;
HNode nodestd::string key;
};
= {...};
LookupKey key *node = hm_lookup(&g_data.db, &key.node, &key_eq); HNode
8.8 What we learned
- Scalable hashtable implementations.
- Avoid special cases via pointer indirection in linked list code.
Source code: