09. Data Serialization
9.1 The need for serialization
So far, the response is a string, but some Redis commands return
other data types. For example, dbsize
returns the number of
keys as an integer, keys
returns a list of strings, and
some sorted set commands return a list of (string, float) pairs.
Encoding them into bytes is called serialization, which is part
of the protocol.
Popular serialization formats include XML, JSON, MessagePack, Protobuf, Thrift.
Simple data types
Simple data types are string, integer, float, boolean, null.
While all of these can be represented simply as strings, it’s not a good idea. Strings are ambiguous and require the extra work of parsing. There is a need to distinguish different data types in the serialization format. Many common serialization formats, except XML, define data types other than string. That’s why XML fell out of fasion.
Complex data types
Complex data types are array, map, struct. They are collections of other data types.
Not all serialization formats have map or struct. A struct can be encoded as a map or an array, a map can be encoded as an array of KV pairs.
Schema or schema-less
Formats like XML or JSON are self-describing; they can be fully parsed without extra information. While Protobuf keeps the data type description separate, the serialized data may not contain enough type information. The separate description is called a schema.
encode
schema + input ──────────► bytes
decode
schema + bytes ──────────► output
A schema ensures that the data is in a certain shape, but it also has costs:
- Schemas are extra moving parts that must be synchronized with consumers.
- Schemas require extra libraries or toolings such as code generators.
Self-describing means that schemas are optional; they can still be added, for example: JSON Schema. Although Thrift is self-describing, it’s typically used with schemas.
JSON as a reference model
JSON is often a good enough format; it has the following desirable properties:
- It defines simple data types found in most languages: string, number, bool, null.
- JSON arrays and objects are easily converted to native arrays, maps, structs.
- JSON is self-describing. Schemas are optional.
We can use JSON as a reference to understand the requirements of serialization. But it also has many drawbacks. Serialization formats can be both simpler and better.
9.2 Serialization formats
Delimited formats
Delimiters are used to determine the length of strings, arrays, maps. JSON also uses different delimiters to identify different data types.
- JSON string, delimited by quotes:
"foo"
. - JSON array, delimited by brackets and commas:
[1, 2, 3]
. - JSON array, delimited by braces, colons, commas:
{"k":"v", "foo":"bar"}
.
Delimiters aren’t the only way to handle variable-length data, but they make the format human-writable with a text editor. JSON is the simplest practical, textual format. Parsing JSON is not hard, but it’s still more complicated than necessary. Many JSON parsers are buggy due to the complications of textual formats:
- Optional whitespace.
- Complicated escaping rules due to delimiters.
- Ambiguous string representation of numbers.
Fortunately, all of these complications can be avoided by not being textual.
Tag-Length-Value
Most binary serialization formats follow the formula of Tag-Length-Value (TLV).
- “Tag” or “Type” comes first, it makes the data self-describing.
- “Length” is for strings, arrays, maps, it removes the need for delimiters.
- “Value” is the encoded data.
Example: [123, "foo"]
in TLV format.
array 2 int str 3
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ tag │ len │ tag │ 123 │ tag │ len │ foo │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘
You get the idea. The remaining details are encoding TLV tuples as bits and bytes.
Comparison of popular serialization formats
You should read the details of popular formats, compare their pros and cons, and learn the lessons. But I’ll just summarize here.
Name | Text | Self-desc | Simple types | Complex types |
---|---|---|---|---|
XML | Yes | Yes | str | markup |
JSON | Yes | Yes | null, bool, number, str | array, map |
MessagePack | No | Yes | null, bool, int, float, str | array, map |
Thrift | No | Yes | bool, int, float, str | array, map, struct |
Protobuf | No | No | bool, int, float, str | struct |
XML loses to JSON because it doesn’t convert cleanly to common data
types (array, map, struct). An XML element can have a
map<string, string>
as attributes, or an array of
child elements, or text content, or a combination of these, or none of
these. XML tries to be anything, but ends like nothing.
MessagePack is just binary JSON with some encoding optimizations. It also adds some useful data types that JSON lacks: integer, float, binary bytes.
Thrift is like MessagePack but with structs. It’s intended to be used with a schema where structs can replace some maps. Unlike map keys, struct field names are not serialized, they are replaced by integer IDs. Thrift is also more “statically typed” in that array items and map KV pairs are limited to the same type. It doesn’t have a null value, but struct fields can be optional.
Protobuf is similar to Thrift in that both use schema-defined structs. But it has many weirdness, it doesn’t have arrays, instead it allows struct fields to repeat. A repeating field that repeats only once is indistinguishable from a non-repeating field. And different data types share the same “tag”. So it’s not self-describing; schemas are required to interpret Protobuf data.
Most binary formats use TLV, just with different encodings. What we’ll do is implement the simplest TLV encoding. Many projects, including the real Redis, have rolled their own serialization format because it’s so easy.
9.3 A simple serialization format
We’ll support the following data types:
enum {
= 0, // nil
TAG_NIL = 1, // error code + msg
TAG_ERR = 2, // string
TAG_STR = 3, // int64
TAG_INT = 4, // double
TAG_DBL = 5, // array
TAG_ARR };
A response is one of these type. Like JSON, each array element can be of any type, including nested arrays. We don’t need a map type for now.
Binary format
Tag is 1 byte, length is 4 bytes. Strings are arbitrary bytes. Array length is the number of array elements.
nil int64 str array
┌─────┐ ┌─────┬─────┐ ┌─────┬─────┬─────┐ ┌─────┬─────┬─────┐
│ tag │ │ tag │ int │ │ tag │ len │ ... │ │ tag │ len │ ... │
└─────┘ └─────┴─────┘ └─────┴─────┴─────┘ └─────┴─────┴─────┘
1B 1B 8B 1B 4B ... 1B 4B ...
Integers and lengths are encoded in little-endian, which is just
memcpy
ing values on all relevant platforms.
Output the serialized data
We’ve discarded struct Response
and output serialized
response directly to Conn::outgoing
.
static void do_get(std::vector<std::string> &cmd, Buffer &out) {
// ...
return out_str(out, val.data(), val.size()); // string value
}
static void do_set(std::vector<std::string> &cmd, Buffer &out) {
// ...
return out_nil(out); // nil
}
static void do_del(std::vector<std::string> &cmd, Buffer &out) {
// ...
return out_int(out, node ? 1 : 0); // the number of deleted keys
}
static void out_nil(Buffer &out) {
(out, TAG_NIL);
buf_append_u8}
static void out_str(Buffer &out, const char *s, size_t size) {
(out, TAG_STR);
buf_append_u8(out, (uint32_t)size);
buf_append_u32(out, (const uint8_t *)s, size);
buf_append}
static void out_int(Buffer &out, int64_t val) {
(out, TAG_INT);
buf_append_u8(out, val);
buf_append_i64}
static void out_arr(Buffer &out, uint32_t n) {
(out, TAG_ARR);
buf_append_u8(out, n);
buf_append_u32}
Buffer
is just std::vector
if you haven’t
replaced it with something better in chapter 6b. These helper functions
are places to handle endian-ness.
typedef std::vector<uint8_t> Buffer;
static void buf_append_u8(Buffer &buf, uint8_t data) {
.push_back(data);
buf}
static void buf_append_u32(Buffer &buf, uint32_t data) {
(buf, (const uint8_t *)&data, 4); // assume littlen-endian
buf_append}
Handle the message header
Now that we’re outputing the response body to
Conn::outgoing
. The size of the response is not known in
advance. So we must reserve 4 bytes of space for the response
header.
static bool try_one_request(Conn *conn) {
// ...
size_t header_pos = 0;
(conn->outgoing, &header_pos);
response_begin(cmd, conn->outgoing);
do_request(conn->outgoing, header_pos);
response_end// ...
}
The position of the reserved header is saved in
header_pos
, so that the header can be updated later.
static void response_begin(Buffer &out, size_t *header) {
*header = out.size(); // messege header position
(out, 0); // reserve space
buf_append_u32}
static size_t response_size(Buffer &out, size_t header) {
return out.size() - header - 4;
}
static void response_end(Buffer &out, size_t header) {
size_t msg_size = response_size(out, header);
if (msg_size > k_max_msg) {
.resize(header + 4);
out(out, ERR_TOO_BIG, "response is too big.");
out_err= response_size(out, header);
msg_size }
// message header
uint32_t len = (uint32_t)msg_size;
(&out[header], &len, 4);
memcpy}
Add the `keys` command
hm_foreach()
invokes the callback on each item. It’s
trivial if you have actually coded the hashtable yourself.
static bool cb_keys(HNode *node, void *arg) {
&out = *(Buffer *)arg;
Buffer const std::string &key = container_of(node, Entry, node)->key;
(out, key.data(), key.size());
out_strreturn true;
}
static void do_keys(std::vector<std::string> &, Buffer &out) {
(out, (uint32_t)hm_length(&g_data.db));
out_arr(&g_data.db, &cb_keys, (void *)&out);
hm_foreach}
This command is added to illustrate the array type. You can add more commands now, the “value” part of KV is not just strings, it can be other data structure like hashtables, sorted sets. In the next chapter we’ll focus on sorting data structures.
Source code: