04. Message-Oriented Protocol
4.1 Multiple requests in a single connection
The server loop
Let’s ignore concurrent connections for now. We’ll make the server process multiple requests in a single connection with a loop.
while (true) {
    // accept
    struct sockaddr_in client_addr = {};
    socklen_t socklen = sizeof(client_addr);
    int connfd = accept(fd, (struct sockaddr *)&client_addr, &socklen);
    if (connfd < 0) {
        continue;   // error
    }
    // only serves one client connection at once
    while (true) {
        int32_t err = one_request(connfd);
        if (err) {
            break;
        }
    }
    close(connfd);
}
The one_request
function will read 1 request and write 1
response. The problem is, how does it know how many bytes to read? This
is the primary function of an application protocol. Usually a protocol
has 2 levels of structures:
- A high-level structure to split the byte stream into messages.
- The structure within a message, a.k.a. deserialization.
A simple binary protocol
The first step is to split the byte stream into messages. For now, both the request and response messages are just strings.
┌─────┬──────┬─────┬──────┬────────
│ len │ msg1 │ len │ msg2 │ more...
└─────┴──────┴─────┴──────┴────────
  4B    ...    4B    ...
Each message consists of a 4-byte little-endian integer giving the length of the payload, followed by the variable-length payload itself. This is not the real Redis protocol. We’ll discuss alternative protocol designs later.
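To make the framing concrete, here is a minimal sketch of encoding one message. The names encode_frame and decode_len are hypothetical and not part of the chapter's code. Unlike the memcpy used later in this chapter, which assumes a little-endian host, this sketch writes the length byte by byte, so the wire format is the same on any host.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: build one frame, a 4-byte little-endian length
// followed by the payload bytes.
static std::vector<uint8_t> encode_frame(const std::string &payload) {
    uint32_t len = (uint32_t)payload.size();
    std::vector<uint8_t> out;
    out.reserve(4 + payload.size());
    for (int i = 0; i < 4; i++) {
        out.push_back((uint8_t)(len >> (8 * i)));   // lowest byte first
    }
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Hypothetical helper: decode the 4-byte header, independent of host byte order.
static uint32_t decode_len(const uint8_t *hdr) {
    return (uint32_t)hdr[0] | ((uint32_t)hdr[1] << 8) |
           ((uint32_t)hdr[2] << 16) | ((uint32_t)hdr[3] << 24);
}
```

Encoding "hello" this way yields 9 bytes: 05 00 00 00 followed by the 5 payload bytes.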
4.2 Parse the protocol
Check the return value of read/write
read/write
returns the number of bytes read/written. The
return value is -1 on error. read
also returns 0 after EOF
(end of file/connection).
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
To read a message, first read the 4-byte integer, then read the payload. You may imagine the read side like this:
// Bad example!
uint32_t n;
char payload[MAX_PAYLOAD];
rv = read(fd, &n, 4);
if (rv != 4) { /* error */ }
rv = read(fd, &payload, n);
if (rv != n) { /* error */ }
And imagine the write side like this:
// Bad example!
rv = write(fd, &n, 4);
if (rv != 4) { /* error */ }
rv = write(fd, &payload, n);
if (rv != n) { /* error */ }
Both are wrong ways to handle a TCP socket, because
read/write
can return less than the requested
number of bytes under normal conditions (no error, no EOF). Why
do they behave this way? Explained later.
A common mistake is to assume that a read
somehow
corresponds to a write
from the peer. This is not possible
because a byte stream does not preserve any boundaries.
`read_full` and `write_all`
To actually read/write n bytes from/to a TCP socket, you must do it in a loop.
static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error, or unexpected EOF
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}
static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}
Whatever a read
returns is accumulated in a buffer. It’s
how much data you have that matters, how much a single read
returns matters not.
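The accumulation idea can be sketched without any sockets. The names FrameBuf, feed, and try_pop below are hypothetical; feed stands in for "append whatever a read returned", and try_pop only succeeds once a whole message has accumulated, no matter how the bytes were chunked on arrival.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical accumulator: bytes from each read() are appended here.
struct FrameBuf {
    std::vector<uint8_t> data;
};

// Append one chunk, as if it were the result of a single read().
static void feed(FrameBuf &fb, const void *chunk, size_t n) {
    const uint8_t *p = (const uint8_t *)chunk;
    fb.data.insert(fb.data.end(), p, p + n);
}

// Remove and return one complete message if enough bytes have accumulated.
static bool try_pop(FrameBuf &fb, std::string &msg) {
    if (fb.data.size() < 4) {
        return false;   // header incomplete
    }
    const uint8_t *d = fb.data.data();
    uint32_t len = (uint32_t)d[0] | ((uint32_t)d[1] << 8) |
                   ((uint32_t)d[2] << 16) | ((uint32_t)d[3] << 24);
    if (fb.data.size() < 4 + (size_t)len) {
        return false;   // payload incomplete
    }
    msg.assign((const char *)d + 4, len);
    fb.data.erase(fb.data.begin(), fb.data.begin() + 4 + len);
    return true;
}
```

Feeding the same byte stream in chunks of 3, 4, and 8 bytes produces the same messages as feeding it all at once, which is the point: chunk boundaries carry no meaning.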
Parse the request and produce the response
In the server program, read_full and write_all are used instead of read and write.
const size_t k_max_msg = 4096;
static int32_t one_request(int connfd) {
    // 4 bytes header
    char rbuf[4 + k_max_msg];
    errno = 0;
    int32_t err = read_full(connfd, rbuf, 4);
    if (err) {
        msg(errno == 0 ? "EOF" : "read() error");
        return err;
    }
    uint32_t len = 0;
    memcpy(&len, rbuf, 4);  // assume little endian
    if (len > k_max_msg) {
        msg("too long");
        return -1;
    }
    // request body
    err = read_full(connfd, &rbuf[4], len);
    if (err) {
        msg("read() error");
        return err;
    }
    // do something
    printf("client says: %.*s\n", len, &rbuf[4]);
    // reply using the same protocol
    const char reply[] = "world";
    char wbuf[4 + sizeof(reply)];
    len = (uint32_t)strlen(reply);
    memcpy(wbuf, &len, 4);
    memcpy(&wbuf[4], reply, len);
    return write_all(connfd, wbuf, 4 + len);
}
`errno` gotchas
errno
is set to the error code if the syscall failed.
However, errno
is NOT set to 0 if the syscall
succeeded; it simply keeps the previous value. That’s why the
above code sets errno = 0
before read_full()
to distinguish the EOF case.
You can read errno
if and only if the call failed. But
some libc functions have no way to tell if the call failed other than by
clearing errno
first:
errno = 0;
long val = strtol(str, NULL, 10);   // returns LONG_MAX/LONG_MIN on overflow, but those are also valid results
if (errno) { /* failed */ }
errno is a bad old idea in C. The Linux kernel doesn’t use it at all: syscalls actually return the error code as a negative integer, and it’s the syscall wrappers in libc that put the error code in errno. Having the error code mixed with the result is still a bad idea; a more sensible way is like this:
int32_t read(int fd, void *buf, size_t size, size_t *actually_read);
// returns the error code, outputs the result via a pointer.
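Such an interface can be layered on top of the existing syscall wrapper. The sketch below is only an illustration: read_ec is a hypothetical name (chosen to avoid clashing with the real read), and returning the errno value as the error code is one possible convention among several.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <unistd.h>

// Hypothetical wrapper: returns 0 on success or the errno value on failure.
// The byte count goes through the out-parameter, so the result and the
// error code are never mixed in one return value.
static int32_t read_ec(int fd, void *buf, size_t size, size_t *actually_read) {
    *actually_read = 0;
    ssize_t rv = read(fd, buf, size);
    if (rv < 0) {
        return errno;   // errno is only read right after a failed call
    }
    *actually_read = (size_t)rv;    // 0 bytes with no error means EOF
    return 0;
}
```

With this shape, the caller checks the return value for errors and checks *actually_read == 0 for EOF, and never has to clear errno beforehand.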
4.3 Client and testing
static int32_t query(int fd, const char *text) {
    uint32_t len = (uint32_t)strlen(text);
    if (len > k_max_msg) {
        return -1;
    }
    // send request
    char wbuf[4 + k_max_msg];
    memcpy(wbuf, &len, 4);  // assume little endian
    memcpy(&wbuf[4], text, len);
    if (int32_t err = write_all(fd, wbuf, 4 + len)) {
        return err;
    }
    // 4 bytes header
    char rbuf[4 + k_max_msg + 1];
    errno = 0;
    int32_t err = read_full(fd, rbuf, 4);
    if (err) {
        msg(errno == 0 ? "EOF" : "read() error");
        return err;
    }
    memcpy(&len, rbuf, 4);  // assume little endian
    if (len > k_max_msg) {
        msg("too long");
        return -1;
    }
    // reply body
    err = read_full(fd, &rbuf[4], len);
    if (err) {
        msg("read() error");
        return err;
    }
    // do something
    printf("server says: %.*s\n", len, &rbuf[4]);
    return 0;
}
Test our server by sending several commands:
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }
    // code omitted ...
    // send multiple requests
    int32_t err = query(fd, "hello1");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello2");
    if (err) {
        goto L_DONE;
    }
L_DONE:
    close(fd);
    return 0;
}
Running the server and client:
$ ./server
client says: hello1
client says: hello2
EOF
$ ./client
server says: world
server says: world
4.4 Understand read/write
TCP Socket vs. disk file
Why is read_full needed? There are differences between reading disk files and reading sockets despite sharing the same read/write API. When reading a disk file returns less than requested, it means either EOF or an error. But a socket can return less data even under normal conditions. This can be explained by pull-based IO and push-based IO.
Data over a network is pushed by the remote peer. The remote
does not need the read
call before sending data. The kernel
will allocate a receive buffer to store the received data.
read
just copies whatever is available from the receive
buffer to the userspace buffer, since it’s unknown if there is more
inflight data.
Data from a local file is pulled from disk. The data is always considered “ready” and the file size is known. There is no reason to return less than requested unless it’s EOF.
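The push-based behavior is easy to observe locally. The sketch below uses a Unix domain socket pair rather than a TCP connection (an assumption made so it needs no network setup; both are byte streams), and short_read_demo is a hypothetical name. Only 3 bytes have been pushed into the receive buffer, so a read asking for 16 bytes returns just those 3 without error or EOF.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Writes 3 bytes into one end of a connected stream socket pair, then asks
// for up to 16 bytes on the other end. Returns whatever read() reported,
// or -1 if the setup failed.
static ssize_t short_read_demo() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    ssize_t rv = -1;
    if (write(sv[0], "abc", 3) == 3) {
        char buf[16];
        // read() copies what is in the receive buffer and returns;
        // it does not wait for the other 13 bytes.
        rv = read(sv[1], buf, sizeof(buf));
    }
    close(sv[0]);
    close(sv[1]);
    return rv;
}
```

A caller that insisted on rv == 16 here would report an error on a perfectly healthy connection, which is why read_full loops instead.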
Interrupted syscalls
Why is write_all needed? Normally, write just appends data to a kernel-side buffer; the actual network transfer is deferred to the OS. The buffer size is limited, so when the buffer is full, the caller must wait for it to drain before copying the remaining data. During the wait, the syscall may be interrupted by a signal, causing write to return with partially written data.
read can also be interrupted by a signal because it must wait if the buffer is empty. In this case, 0 bytes are read, but the return value is -1 and errno is EINTR. This is not an error. An exercise for the reader: handle this case in read_full.
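One possible answer to the exercise, as a sketch rather than the definitive version: retry the read when it fails with EINTR, and treat every other failure as before. The name read_full_retry is hypothetical, used here only to keep it apart from the chapter's read_full.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <unistd.h>

// Like read_full, but an EINTR failure is retried instead of reported.
static int32_t read_full_retry(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv < 0 && errno == EINTR) {
            continue;   // interrupted before any data was read; try again
        }
        if (rv <= 0) {
            return -1;  // real error, or unexpected EOF
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}
```

The same check could be applied to write_all. A partial write caused by a signal needs no special handling there, because the loop already continues from where the last write stopped.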
4.5 More on protocols
Text vs. binary
Instead of messing with binary data, why not use something simpler & nicer like HTTP and JSON? Plain text seems “simple” because it’s human-readable. But text protocols aren’t very machine-readable: parsing them correctly takes considerably more implementation work.
A human-readable protocol deals with strings, and strings are variable-length, so you are constantly checking the length of things, which is tedious and error-prone. A binary protocol avoids unnecessary strings; nothing is simpler than memcpying a struct.
Length prefixes vs. delimiters
This chapter follows a common pattern:
- Start with a fixed-size part.
- Variable-length data follows, with the length indicated by the fixed-size part.
When parsing a protocol like this, you always know how much data to read.
The other pattern is to use delimiters to indicate the end of the variable-length thing. To parse a delimited protocol, keep reading until the delimiter is found. But what if the payload contains the delimiter? You now need escape sequences, which adds even more complexity.
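To see the extra work that delimiters bring, consider a made-up scheme (not any real protocol) where a newline ends each message. A payload that contains a newline must then be rewritten before sending and restored after receiving. The functions escape and unescape below are hypothetical, and the backslash-based encoding is just one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sender side: make the payload safe to put before a '\n' delimiter.
// Newline becomes backslash + 'n', backslash becomes two backslashes.
static std::string escape(const std::string &payload) {
    std::string out;
    for (char c : payload) {
        if (c == '\n') {
            out += "\\n";
        } else if (c == '\\') {
            out += "\\\\";
        } else {
            out += c;
        }
    }
    return out;
}

// Receiver side: undo escape(). Returns false on malformed input.
static bool unescape(const std::string &wire, std::string &payload) {
    payload.clear();
    for (size_t i = 0; i < wire.size(); i++) {
        if (wire[i] != '\\') {
            payload += wire[i];
            continue;
        }
        if (++i == wire.size()) {
            return false;   // dangling backslash at the end
        }
        if (wire[i] == 'n') {
            payload += '\n';
        } else if (wire[i] == '\\') {
            payload += '\\';
        } else {
            return false;   // unknown escape sequence
        }
    }
    return true;
}
```

Note that every byte of every payload is now inspected twice, and the receiver has two new ways to fail. With a length prefix, none of this code exists.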
Case study: real-world protocols
HTTP headers are strings delimited by \r\n, and each header is a KV pair delimited by a colon. The URL may contain \r\n, so the URL in the request line must be escaped/encoded. You might forget that \r\n is not allowed in header values, which has caused some security vulnerabilities.
GET /index.html HTTP/1.1
Host: example.com
Foo: bar
If you code HTTP as an exercise, you’ll probably get a buggy subset because there is so much work like encoding/escaping things, checking for forbidden characters, etc. HTTP is a lesson in how NOT to design network protocols.
The real Redis protocol is also human-readable but not as crazy as HTTP. It uses both delimiters and length prefixes. Strings are length prefixed, but the length is a decimal number delimited by a newline. There is a newline after a string, but that’s just for readability. Example:
$5\r\nhello\r\n
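A rough sketch of parsing just this one string form shows how the two styles mix: the length is found by scanning for a delimiter, then the payload is taken by length. The function parse_bulk and the cap k_max_bulk are hypothetical, and the sketch deliberately ignores every other part of the real Redis protocol.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const size_t k_max_bulk = 4096;     // hypothetical cap on the payload size

// Parses "$<len>\r\n<bytes>\r\n" from the start of `in`.
// Returns the number of bytes consumed, 0 if more data is needed,
// or -1 if the input is malformed.
static long parse_bulk(const std::string &in, std::string &out) {
    if (in.empty()) {
        return 0;
    }
    if (in[0] != '$') {
        return -1;
    }
    size_t eol = in.find("\r\n");   // the length is delimited, not fixed-size
    if (eol == std::string::npos) {
        return 0;
    }
    if (eol == 1) {
        return -1;                  // no digits at all
    }
    size_t len = 0;
    for (size_t i = 1; i < eol; i++) {
        if (in[i] < '0' || in[i] > '9') {
            return -1;
        }
        len = len * 10 + (size_t)(in[i] - '0');
        if (len > k_max_bulk) {
            return -1;
        }
    }
    size_t start = eol + 2;
    if (in.size() < start + len + 2) {
        return 0;                   // payload or trailing newline incomplete
    }
    if (in.compare(start + len, 2, "\r\n") != 0) {
        return -1;
    }
    out.assign(in, start, len);     // taken by length, so it may contain \r\n
    return (long)(start + len + 2);
}
```

Because the payload is length-prefixed, it needs no escaping even if it contains \r\n itself; only the decimal length has to be scanned for.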
You can try to implement the real Redis protocol as a challenge since it requires more work. But don’t spend too much effort because the next step of the event loop is more important and you cannot reuse code from this chapter.