04. Application Protocol

4.1 Multiple requests in a single connection

The server loop

Let’s ignore concurrent connections for now. We’ll make the server process multiple requests in a single connection with a loop.

    while (true) {
        // accept
        struct sockaddr_in client_addr = {};
        socklen_t socklen = sizeof(client_addr);
        int connfd = accept(fd, (struct sockaddr *)&client_addr, &socklen);
        if (connfd < 0) {
            continue;   // error
        }

        // only serves one client connection at once
        while (true) {
            int32_t err = one_request(connfd);
            if (err) {
                break;
            }
        }
        close(connfd);
    }

The one_request function will read 1 request and write 1 response. The problem is, how does it know how many bytes to read? This is the primary function of an application protocol. Usually a protocol has 2 levels of structures:

  1. A high-level structure to split the byte stream into messages.
  2. The structure within a message, a.k.a. deserialization.

A simple binary protocol

The first step is to split the byte stream into messages. What’s inside the message (serialization) is added later. For now, both the request and response messages are just strings. The client sends a variable-length string and the server responds with the same protocol.

+-----+------+-----+------+--------
| len | msg1 | len | msg2 | more...
+-----+------+-----+------+--------

Each message consists of a 4-byte little-endian integer giving the length of the payload, followed by the variable-length payload itself. This is not the real Redis protocol. We’ll discuss alternative protocol designs later.
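The framing above can be sketched in a few lines. This is only an illustration, not the code used later in this chapter: the names encode_frame and decode_frame are made up here, and the length is written byte by byte so the sketch does not depend on the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: prepend a 4-byte little-endian length to the payload.
static std::vector<uint8_t> encode_frame(const std::string &payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // the length must fit in 4 bytes
    uint32_t len = (uint32_t)payload.size();
    std::vector<uint8_t> out;
    out.reserve(4 + payload.size());
    for (int i = 0; i < 4; i++) {
        out.push_back((uint8_t)(len >> (8 * i)));   // least significant byte first
    }
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Hypothetical helper: try to take one message from the front of a byte buffer.
// Returns the number of bytes consumed, or 0 if a full message is not there yet.
static size_t decode_frame(const std::vector<uint8_t> &buf, std::string &payload) {
    if (buf.size() < 4) {
        return 0;   // header incomplete
    }
    uint32_t len = 0;
    for (int i = 0; i < 4; i++) {
        len |= (uint32_t)buf[i] << (8 * i);
    }
    if (buf.size() - 4 < len) {
        return 0;   // body incomplete
    }
    payload.assign(buf.begin() + 4, buf.begin() + 4 + len);
    return 4 + (size_t)len;
}
```

Encoding "hello" this way gives 9 bytes: the header 05 00 00 00 followed by the 5 payload bytes.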

4.2 Parse the protocol

Check the return value of read/write

read/write returns the number of bytes read/written. The return value is -1 on error. read also returns 0 after EOF (end of file/connection).

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);

To read a message, first read the 4-byte integer, then read the payload. You may imagine the read side like this:

// Bad example!
uint32_t n;
char payload[MAX_PAYLOAD];
rv = read(fd, &n, 4);
if (rv != 4) { /* error */ }
rv = read(fd, &payload, n);
if (rv != n) { /* error */ }

And imagine the write side like this:

// Bad example!
rv = write(fd, &n, 4);
if (rv != 4) { /* error */ }
rv = write(fd, &payload, n);
if (rv != n) { /* error */ }

Both are wrong ways to handle a TCP socket, because read/write can return fewer bytes than requested under normal conditions (no error, no EOF). This is documented in the read(2) and write(2) man pages. Why they behave this way is explained later in this chapter.

People often code like this because they assume that a read somehow corresponds to a write from the peer. This is a common mistake; a byte stream has no message boundaries!

`read_full` and `write_all`

To actually read/write n bytes from/to a TCP socket, you must do it in a loop.

static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error, or unexpected EOF
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

Whatever a read returns is accumulated in a buffer. What matters is how much data you have accumulated, not how much a single read returns.

Parse the request and produce the response

In the server program, read_full and write_all are used instead of read and write.

const size_t k_max_msg = 4096;

static int32_t one_request(int connfd) {
    // 4 bytes header
    char rbuf[4 + k_max_msg + 1];
    errno = 0;
    int32_t err = read_full(connfd, rbuf, 4);
    if (err) {
        if (errno == 0) {
            msg("EOF");
        } else {
            msg("read() error");
        }
        return err;
    }
    uint32_t len = 0;
    memcpy(&len, rbuf, 4);  // assume little endian
    if (len > k_max_msg) {
        msg("too long");
        return -1;
    }
    // request body
    err = read_full(connfd, &rbuf[4], len);
    if (err) {
        msg("read() error");
        return err;
    }
    // do something
    rbuf[4 + len] = '\0';
    printf("client says: %s\n", &rbuf[4]);
    // reply using the same protocol
    const char reply[] = "world";
    char wbuf[4 + sizeof(reply)];
    len = (uint32_t)strlen(reply);
    memcpy(wbuf, &len, 4);
    memcpy(&wbuf[4], reply, len);
    return write_all(connfd, wbuf, 4 + len);
}

4.3 Client and testing

static int32_t query(int fd, const char *text) {
    uint32_t len = (uint32_t)strlen(text);
    if (len > k_max_msg) {
        return -1;
    }
    // send request
    char wbuf[4 + k_max_msg];
    memcpy(wbuf, &len, 4);  // assume little endian
    memcpy(&wbuf[4], text, len);
    if (int32_t err = write_all(fd, wbuf, 4 + len)) {
        return err;
    }
    // 4 bytes header
    char rbuf[4 + k_max_msg + 1];
    errno = 0;
    int32_t err = read_full(fd, rbuf, 4);
    if (err) {
        if (errno == 0) {
            msg("EOF");
        } else {
            msg("read() error");
        }
        return err;
    }
    memcpy(&len, rbuf, 4);  // assume little endian
    if (len > k_max_msg) {
        msg("too long");
        return -1;
    }
    // reply body
    err = read_full(fd, &rbuf[4], len);
    if (err) {
        msg("read() error");
        return err;
    }
    // do something
    rbuf[4 + len] = '\0';
    printf("server says: %s\n", &rbuf[4]);
    return 0;
}

Test our server by sending several commands:

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }

    // code omitted ...

    // send multiple requests
    int32_t err = query(fd, "hello1");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello2");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello3");
    if (err) {
        goto L_DONE;
    }

L_DONE:
    close(fd);
    return 0;
}

Running the server and client:

$ ./server
client says: hello1
client says: hello2
client says: hello3
EOF

$ ./client
server says: world
server says: world
server says: world

4.4 Understand read/write

TCP Socket vs. disk file

Why is read_full needed? Reading a disk file and reading a socket behave differently despite sharing the same read/write API. When a read from a disk file returns less than requested, it means either EOF or an error. But a socket can return less data even under normal conditions. This can be explained by pull-based IO and push-based IO.

Data over a network is pushed by the remote peer. The remote does not need the read call before sending data. The kernel will allocate a receive buffer to store the received data. read just copies whatever is available from the receive buffer to the userspace buffer, since it’s unknown if there is more inflight data.

Data from a local file is pulled from disk. The data is always considered “ready” and the file size is known. There is no reason to return less than requested unless it’s EOF.
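The push-based behavior is easy to observe locally. The sketch below is only a demonstration under an assumption: it uses a Unix domain socket pair as a stand-in for a TCP connection, and short_read_demo is a made-up name. It sends a few bytes, then asks read for more than was sent.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical demo: request `want` bytes when only `avail` bytes were sent.
// Returns what read() reported, or -1 if the setup failed.
static ssize_t short_read_demo(size_t avail, size_t want) {
    int fds[2];
    // assumption: a local stream socket pair stands in for a TCP connection
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }
    char out[64] = {};
    char in[64];
    assert(avail <= sizeof(out) && want <= sizeof(in));
    ssize_t rv = -1;
    if (write(fds[0], out, avail) == (ssize_t)avail) {
        // copies whatever is in the receive buffer; does not wait for more
        rv = read(fds[1], in, want);
    }
    close(fds[0]);
    close(fds[1]);
    return rv;
}
```

Calling short_read_demo(3, 10) returns 3, not 10: read hands over the 3 buffered bytes without an error or EOF, which is exactly the case the bad examples mishandle.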

Interrupted syscalls

Why is write_all needed? Normally, write just appends data to a kernel-side buffer; the actual network transfer is deferred to the OS. The buffer size is limited, so when the buffer is full, the caller must wait for it to drain before copying the remaining data. During the wait, the syscall may be interrupted by a signal, causing write to return with partially written data.

read can also be interrupted by a signal because it must wait if the buffer is empty. In this case, 0 bytes are read, but the return value is -1 and errno is EINTR. This is not an error. An exercise for the reader: handle this case in read_full.
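One possible answer to the exercise, as a sketch rather than the definitive solution: retry the read when it fails with EINTR. The name read_full_retry is made up here to keep it apart from the read_full shown earlier. Resetting errno before the retry is a choice made so that callers which test errno == 0 to detect EOF, like one_request, keep working.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <unistd.h>

// Hypothetical variant of read_full that tolerates interrupted syscalls.
static int32_t read_full_retry(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv < 0 && errno == EINTR) {
            errno = 0;  // keep "errno == 0 means EOF" valid for the caller
            continue;   // interrupted before any data was read; just retry
        }
        if (rv <= 0) {
            return -1;  // error, or unexpected EOF
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}
```

The same retry can be added to write_all, since write can also fail with EINTR when it is interrupted before writing anything.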

4.5 More on protocols

Text vs. binary

Instead of messing with binary data, why not use something simpler and nicer like HTTP and JSON? Plain text seems “simple” because it’s human-readable. But text protocols aren’t very machine-readable: parsing them correctly takes considerable implementation complexity.

A human-readable protocol deals with strings, and strings are variable-length, so you are constantly checking the length of things, which is tedious and error-prone. A binary protocol avoids unnecessary strings; nothing is simpler than memcpying a struct.
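To make the “memcpying a struct” remark concrete, here is a small sketch. The MsgHeader struct and its fields are invented for illustration and are not part of this chapter’s protocol. Like the memcpy calls elsewhere in this chapter, it assumes both peers share the same byte order and struct layout.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-size header; the fields are for illustration only.
struct MsgHeader {
    uint32_t len;
    uint16_t type;
    uint16_t flags;
};
static_assert(sizeof(MsgHeader) == 8, "no padding expected with this field layout");

// Serialize: copy the struct bytes into the output buffer.
static void put_header(uint8_t *buf, const MsgHeader &h) {
    memcpy(buf, &h, sizeof(h));
}

// Deserialize: copy the bytes back into a struct. No string parsing involved.
static MsgHeader get_header(const uint8_t *buf) {
    MsgHeader h;
    memcpy(&h, buf, sizeof(h));
    return h;
}
```

There is nothing to scan or validate character by character; the only check needed is that at least sizeof(MsgHeader) bytes have arrived.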

Length prefixes vs. delimiters

This chapter follows a common pattern:

  • Start with a fixed-size part.
  • Variable-length data follows, with the length indicated by the fixed-size part.

When parsing a protocol like this, you always know how much data to read.

The other pattern is to use delimiters to indicate the end of the variable-length thing. To parse a delimited protocol, keep reading until the delimiter is found. But what if the payload contains the delimiter? You now need escape sequences, which add even more complexity.
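A sketch of what the escaping costs, under an invented scheme: messages end with a newline, a newline inside the payload is written as backslash-n, and a backslash is written as two backslashes. The names escape_msg and unescape_msg are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical scheme: '\n' ends a message; inside a payload,
// '\n' is written as "\\n" and '\\' as "\\\\".
static std::string escape_msg(const std::string &payload) {
    std::string out;
    for (char c : payload) {
        if (c == '\n') {
            out += "\\n";
        } else if (c == '\\') {
            out += "\\\\";
        } else {
            out += c;
        }
    }
    out += '\n';    // the delimiter
    return out;
}

// Reverse of escape_msg for exactly one message, delimiter included.
// Returns false on malformed input (missing delimiter or bad escape).
static bool unescape_msg(const std::string &wire, std::string &payload) {
    payload.clear();
    for (size_t i = 0; i < wire.size(); i++) {
        char c = wire[i];
        if (c == '\n') {
            return i + 1 == wire.size();    // delimiter must be the last byte
        }
        if (c != '\\') {
            payload += c;
            continue;
        }
        if (++i == wire.size()) {
            return false;   // dangling backslash
        }
        if (wire[i] == 'n') {
            payload += '\n';
        } else if (wire[i] == '\\') {
            payload += '\\';
        } else {
            return false;   // unknown escape
        }
    }
    return false;   // no delimiter found
}
```

Compare this with the length-prefixed format: every byte of the payload must be inspected on both sides, and there are now several ways for the input to be malformed.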

Case study: real-world protocols

HTTP headers are strings delimited by \r\n, and each header is a KV pair delimited by a colon. A URL may contain special characters such as \r\n, so the URL in the request line must be escaped/encoded. You might forget that \r\n is not allowed in header values, which has caused some security vulnerabilities.

GET /index.html HTTP/1.1
Host: example.com
Foo: bar

If you code HTTP as an exercise, you’ll probably end up with a buggy subset because there is so much work like encoding/escaping things, checking for forbidden characters, etc. HTTP is a lesson in how NOT to design network protocols.

The real Redis protocol is also human-readable but not as crazy as HTTP. It uses both delimiters and length prefixes. Strings are length prefixed, but the length is a decimal number delimited by a newline. There is a newline after a string, but that’s just for readability. Example:

$5\r\nhello\r\n
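Parsing a string in this format can be sketched as follows. This is an illustration under assumptions, not a full implementation of the Redis protocol: parse_bulk_string is a made-up name, only non-negative decimal lengths are handled, and the 4096-byte cap is an arbitrary limit borrowed from k_max_msg.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical parser for one length-prefixed string such as "$5\r\nhello\r\n".
// Returns bytes consumed, 0 if more data is needed, or -1 on malformed input.
static long parse_bulk_string(const std::string &buf, std::string &out) {
    if (buf.empty()) {
        return 0;
    }
    if (buf[0] != '$') {
        return -1;
    }
    size_t eol = buf.find("\r\n", 1);
    if (eol == std::string::npos) {
        return 0;   // the length line is incomplete
    }
    if (eol == 1) {
        return -1;  // no digits
    }
    size_t len = 0;
    for (size_t i = 1; i < eol; i++) {
        if (buf[i] < '0' || buf[i] > '9') {
            return -1;
        }
        len = len * 10 + (size_t)(buf[i] - '0');
        if (len > 4096) {
            return -1;  // assumption: arbitrary cap, mirrors k_max_msg
        }
    }
    size_t total = eol + 2 + len + 2;   // "$N\r\n" + data + trailing "\r\n"
    if (buf.size() < total) {
        return 0;   // the data is incomplete
    }
    if (buf[total - 2] != '\r' || buf[total - 1] != '\n') {
        return -1;
    }
    out = buf.substr(eol + 2, len);
    return (long)total;
}
```

Note that the delimiter is only searched for in the short length line. The payload itself is taken by length, so it may contain \r\n without any escaping.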

You can try to implement the real Redis protocol as a challenge since it requires more work. But don’t spend too much effort, because the next step, the event loop, is more important, and you cannot reuse code from this chapter.
