09. Data Serialization

9.1 The need for serialization

So far, the response is a string, but some Redis commands return other data types. For example, dbsize returns the number of keys as an integer, keys returns a list of strings, and some sorted set commands return a list of (string, float) pairs. Encoding them into bytes is called serialization, which is part of the protocol.

Popular serialization formats include XML, JSON, MessagePack, Protobuf, Thrift.

Simple data types

Simple data types are string, integer, float, boolean, null.

While all of these can be represented simply as strings, it’s not a good idea. Strings are ambiguous and require the extra work of parsing. There is a need to distinguish different data types in the serialization format. Many common serialization formats, except XML, define data types other than string. That’s why XML fell out of fasion.

Complex data types

Complex data types are array, map, struct. They are collections of other data types.

Not all serialization formats have map or struct. A struct can be encoded as a map or an array, a map can be encoded as an array of KV pairs.

Schema or schema-less

Formats like XML or JSON are self-describing; they can be fully parsed without extra information. While Protobuf keeps the data type description separate, the serialized data may not contain enough type information. The separate description is called a schema.

                 encode
schema + input ──────────► bytes

                 decode
schema + bytes ──────────► output

A schema ensures that the data is in a certain shape, but it also has costs:

  • Schemas are extra moving parts that must be synchronized with consumers.
  • Schemas require extra libraries or toolings such as code generators.

Self-describing means that schemas are optional; they can still be added, for example: JSON Schema. Although Thrift is self-describing, it’s typically used with schemas.

JSON as a reference model

JSON is often a good enough format; it has the following desirable properties:

  • It defines simple data types found in most languages: string, number, bool, null.
  • JSON arrays and objects are easily converted to native arrays, maps, structs.
  • JSON is self-describing. Schemas are optional.

We can use JSON as a reference to understand the requirements of serialization. But it also has many drawbacks. Serialization formats can be both simpler and better.

9.2 Serialization formats

Delimited formats

Delimiters are used to determine the length of strings, arrays, maps. JSON also uses different delimiters to identify different data types.

  • JSON string, delimited by quotes: "foo".
  • JSON array, delimited by brackets and commas: [1, 2, 3].
  • JSON array, delimited by braces, colons, commas: {"k":"v", "foo":"bar"}.

Delimiters aren’t the only way to handle variable-length data, but they make the format human-writable with a text editor. JSON is the simplest practical, textual format. Parsing JSON is not hard, but it’s still more complicated than necessary. Many JSON parsers are buggy due to the complications of textual formats:

  • Optional whitespace.
  • Complicated escaping rules due to delimiters.
  • Ambiguous string representation of numbers.

Fortunately, all of these complications can be avoided by not being textual.

Tag-Length-Value

Most binary serialization formats follow the formula of Tag-Length-Value (TLV).

  1. “Tag” or “Type” comes first, it makes the data self-describing.
  2. “Length” is for strings, arrays, maps, it removes the need for delimiters.
  3. “Value” is the encoded data.

Example: [123, "foo"] in TLV format.

 array   2    int         str    3
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ tag │ len │ tag │ 123 │ tag │ len │ foo │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘

You get the idea. The remaining details are encoding TLV tuples as bits and bytes.

You should read the details of popular formats, compare their pros and cons, and learn the lessons. But I’ll just summarize here.

Name Text Self-desc Simple types Complex types
XML Yes Yes str markup
JSON Yes Yes null, bool, number, str array, map
MessagePack No Yes null, bool, int, float, str array, map
Thrift No Yes bool, int, float, str array, map, struct
Protobuf No No bool, int, float, str struct

XML loses to JSON because it doesn’t convert cleanly to common data types (array, map, struct). An XML element can have a map<string, string> as attributes, or an array of child elements, or text content, or a combination of these, or none of these. XML tries to be anything, but ends like nothing.

MessagePack is just binary JSON with some encoding optimizations. It also adds some useful data types that JSON lacks: integer, float, binary bytes.

Thrift is like MessagePack but with structs. It’s intended to be used with a schema where structs can replace some maps. Unlike map keys, struct field names are not serialized, they are replaced by integer IDs. Thrift is also more “statically typed” in that array items and map KV pairs are limited to the same type. It doesn’t have a null value, but struct fields can be optional.

Protobuf is similar to Thrift in that both use schema-defined structs. But it has many weirdness, it doesn’t have arrays, instead it allows struct fields to repeat. A repeating field that repeats only once is indistinguishable from a non-repeating field. And different data types share the same “tag”. So it’s not self-describing; schemas are required to interpret Protobuf data.

Most binary formats use TLV, just with different encodings. What we’ll do is implement the simplest TLV encoding. Many projects, including the real Redis, have rolled their own serialization format because it’s so easy.

9.3 A simple serialization format

We’ll support the following data types:

enum {
    TAG_NIL = 0,    // nil
    TAG_ERR = 1,    // error code + msg
    TAG_STR = 2,    // string
    TAG_INT = 3,    // int64
    TAG_DBL = 4,    // double
    TAG_ARR = 5,    // array
};

A response is one of these type. Like JSON, each array element can be of any type, including nested arrays. We don’t need a map type for now.

Binary format

Tag is 1 byte, length is 4 bytes. Strings are arbitrary bytes. Array length is the number of array elements.

  nil       int64           str                   array
┌─────┐   ┌─────┬─────┐   ┌─────┬─────┬─────┐   ┌─────┬─────┬─────┐
│ tag │   │ tag │ int │   │ tag │ len │ ... │   │ tag │ len │ ... │
└─────┘   └─────┴─────┘   └─────┴─────┴─────┘   └─────┴─────┴─────┘
   1B        1B    8B        1B    4B   ...        1B    4B   ...

Integers and lengths are encoded in little-endian, which is just memcpying values on all relevant platforms.

Output the serialized data

We’ve discarded struct Response and output serialized response directly to Conn::outgoing.

static void do_get(std::vector<std::string> &cmd, Buffer &out) {
    // ...
    return out_str(out, val.data(), val.size());    // string value
}
static void do_set(std::vector<std::string> &cmd, Buffer &out) {
    // ...
    return out_nil(out);                // nil
}
static void do_del(std::vector<std::string> &cmd, Buffer &out) {
    // ...
    return out_int(out, node ? 1 : 0);  // the number of deleted keys
}
static void out_nil(Buffer &out) {
    buf_append_u8(out, TAG_NIL);
}
static void out_str(Buffer &out, const char *s, size_t size) {
    buf_append_u8(out, TAG_STR);
    buf_append_u32(out, (uint32_t)size);
    buf_append(out, (const uint8_t *)s, size);
}
static void out_int(Buffer &out, int64_t val) {
    buf_append_u8(out, TAG_INT);
    buf_append_i64(out, val);
}
static void out_arr(Buffer &out, uint32_t n) {
    buf_append_u8(out, TAG_ARR);
    buf_append_u32(out, n);
}

Buffer is just std::vector if you haven’t replaced it with something better in chapter 6b. These helper functions are places to handle endian-ness.

typedef std::vector<uint8_t> Buffer;

static void buf_append_u8(Buffer &buf, uint8_t data) {
    buf.push_back(data);
}
static void buf_append_u32(Buffer &buf, uint32_t data) {
    buf_append(buf, (const uint8_t *)&data, 4); // assume littlen-endian
}

Handle the message header

Now that we’re outputing the response body to Conn::outgoing. The size of the response is not known in advance. So we must reserve 4 bytes of space for the response header.

static bool try_one_request(Conn *conn) {
    // ...
    size_t header_pos = 0;
    response_begin(conn->outgoing, &header_pos);
    do_request(cmd, conn->outgoing);
    response_end(conn->outgoing, header_pos);
    // ...
}

The position of the reserved header is saved in header_pos, so that the header can be updated later.

static void response_begin(Buffer &out, size_t *header) {
    *header = out.size();       // messege header position
    buf_append_u32(out, 0);     // reserve space
}
static size_t response_size(Buffer &out, size_t header) {
    return out.size() - header - 4;
}
static void response_end(Buffer &out, size_t header) {
    size_t msg_size = response_size(out, header);
    if (msg_size > k_max_msg) {
        out.resize(header + 4);
        out_err(out, ERR_TOO_BIG, "response is too big.");
        msg_size = response_size(out, header);
    }
    // message header
    uint32_t len = (uint32_t)msg_size;
    memcpy(&out[header], &len, 4);
}

Add the `keys` command

hm_foreach() invokes the callback on each item. It’s trivial if you have actually coded the hashtable yourself.

static bool cb_keys(HNode *node, void *arg) {
    Buffer &out = *(Buffer *)arg;
    const std::string &key = container_of(node, Entry, node)->key;
    out_str(out, key.data(), key.size());
    return true;
}

static void do_keys(std::vector<std::string> &, Buffer &out) {
    out_arr(out, (uint32_t)hm_length(&g_data.db));
    hm_foreach(&g_data.db, &cb_keys, (void *)&out);
}

This command is added to illustrate the array type. You can add more commands now, the “value” part of KV is not just strings, it can be other data structure like hashtables, sorted sets. In the next chapter we’ll focus on sorting data structures.

Source code: