05. Concurrent IO Models

Prerequisites: Basic operating system concepts like threads, processes, concurrency.

5.1 Thread-based concurrency

A connection-oriented request-response protocol can be used for any number of request-response pairs, and the client can hold the connection as long as it wants. So there is a need to handle multiple connections simultaneously, because while the server is waiting on one client, it cannot do anything with the other clients. This is solved by multi-threading. Pseudo code:

fd = socket()
bind(fd, address)
listen(fd)
while True:
    conn_fd = accept(fd)
    new_thread(do_something_with, conn_fd)
    # continue to accept the next client without blocking

def do_something_with(conn_fd):
    while not_quiting(conn_fd):
        req = read_request(conn_fd)     # blocks thread
        res = process(req)
        write_response(conn_fd, res)    # blocks thread
    close(conn_fd)

Why aren’t threads enough?

We will not bother with threading because most modern server apps use event loops to handle concurrent IO without creating new threads. What are the drawbacks of thread-based IO?

  • Memory usage: Many threads means many stacks. Stacks are used for local variables and function calls, memory usage per thread is hard to control.
  • Overhead: Stateless clients like PHP apps will create many short-lived connections, adding overhead to both latency and CPU usage.

Forking new processes is older than multi-threading, and it costs even more. It’s in the same league as multi-threading. Multi-threading and multi-processing are still used when there is no need to scale to large numbers of connections, and they have a big advantage over event loops: they are easier and less error-prone.

5.2 Event-based concurrency

Concurrent IO is possible without threading. Let’s start by examining the read() syscall. The Linux TCP stack handles sending and receiving IP packets transparently, placing incoming data in a per-socket kernel-side buffer. read() merely copies data from the kernel-side buffer, and when the buffer is empty, read() suspends the calling thread until more data is ready.

Similarly, write() does not interact directly with the network; it merely put the data into a kernel-side buffer for the TCP stack to consume, and when the buffer is full, write() suspends the calling thread until there is room.

The need for multi-threading comes from the need to wait for each socket to become ready (to read or write). If there is a way to wait for multiple sockets at once, and then read/write whichever ones are ready, only a single thread is needed!

while running:
    want_read = [...]           # socket fds
    want_write = [...]          # socket fds
    can_read, can_write = wait_for_readiness(want_read, want_write) # blocks!
    for fd in can_read:
        data = read_nb(fd)      # non-blocking, only consume from the buffer
        handle_data(fd, data)   # application logic without IO
    for fd in can_write:
        data = pending_data(fd) # produced by the application
        n = write_nb(fd, data)  # non-blocking, only append to the buffer
        data_written(fd, n)     # n <= len(data), limited by the available space

This involves 3 operating system mechanisms:

  • Readiness notification: Wait for multiple sockets, return when one or more are ready. “Ready” means the read buffer is not empty or the write buffer is not full.
  • Non-blocking read: Assuming the read buffer is not empty, consume from it.
  • Non-blocking write: Assuming the write buffer is not full, put some data into it.

This is called an event loop. Each loop iteration waits for any readiness events, then reacts to events without blocking, so that all sockets are processed without delay.

Callback-based programming

Callbacks are common in JS. To read from something in JS, first register a callback on some event, then the data is delivered to the callback. This is what we will do next. Except in JS, the event loop is hidden, while in this project, the event loop is coded by us. We will have a better understanding of this important mechanism.

5.3 Non-blocking IO

Non-blocking read & write behavior

If the read buffer is not empty, both blocking and non-blocking reads will return the data immediately. Otherwise, a non-blocking read will return with errno = EAGAIN, while a blocking read will wait for more data. Non-blocking reads can be called repeatedly to fully drain the read buffer.

If the write buffer is not full, both blocking and non-blocking writes will fill the write buffer and return immediately. Otherwise, a non-blocking write will return with errno = EAGAIN, while a blocking write will wait for more room. Non-blocking writes can be called repeatedly to fully fill the write buffer. If the data is larger than the available write buffer, a non-blocking write will do a partial write, while a blocking write may block.

Non-blocking `accept()`

accept() is similar to read() in that it just consumes an item from a queue, so it has a non-blocking mode and can provide readiness notifications.

    for fd in can_read:
        if fd is a listening socket:
            conn_fd = accept_nb(fd)     # non-blocking accept()
            handle_new_conn(conn_fd)
        else:   # fd is a connection socket
            data = read_nb(fd)          # non-blocking read()
            handle_data(fd, data)

Enable non-blocking mode

Non-blocking reads & writes use the same syscalls as blocking reads & writes. The O_NONBLOCK flag puts a socket in non-blocking mode.

static void fd_set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);  // get the flags
    flags |= O_NONBLOCK;                // modify the flags
    fcntl(fd, F_SETFL, flags);          // set the flags
    // TODO: handle errno
}

The fcntl() syscall gets and sets file flags. Only O_NONBLOCK is accepted for sockets.

5.4 Readiness API

Waiting for IO readiness is platform specific, and there are several ones on Linux.

    can_read, can_write = wait_for_readiness(want_read, want_write)

The simplest one on Linux is poll().

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

struct pollfd {
    int   fd;
    short events;   // request: want to read, write, or both?
    short revents;  // returned: can read? can write?
};

poll() takes an array of fds, each with an input flag and an output flag:

  • The events flag indicates whether you want to read (POLLIN), write (POLLOUT), or both (POLLIN|POLLOUT).
  • The revents flag returned from the syscall indicates the readiness.

The timeout argument is used to implement timers later.

Other readiness APIs

  • select() is like poll() and is present on both Windows and Unix, but it can only use 1024 fds, which is a tiny number. It should not be used!
  • epoll_wait() is Linux specific. Unlike poll(), the fd list is not passed as an argument, but stored in the kernel. epoll_ctl() is used to add or modify the fd list. It’s more scalable than poll() because passing a huge number of fds is inefficient.
  • kqueue() is BSD specific. It’s like epoll, but requires fewer syscalls because it can batch update the fd list.

We will use poll() because it’s the simplest. But note that epoll is the default choice on Linux as it’s more scalable and should be used in real projects instead. All the readiness APIs are just differs in shapes, using them isn’t much different.

Readiness APIs cannot be used with files

All the readiness APIs can only be used with sockets, pipes, and some special stuff like signalfd. They cannot be used with disk files! Why? Because when a socket is ready to read, it means that the data is in the read buffer, so the read is guaranteed not to block, but for a disk file, no such buffer exists in the kernel, so the readiness for a disk file is undefined.

These APIs will always report a disk file as ready, but the IO will block. So file IO must be done outside the event loop, in a thread pool, which we’ll learn later.

On Linux, file IO within an event loop may be possible with io_uring, which is a unified interface for both file IO and socket IO. But io_uring is a very different API, so we will not pursue it.

5.5 Summary of concurrent IO techniques

Type Method API Scalability
Socket Thread per connection pthread Low
Socket Process per connection fork() Low
Socket Event loop poll(), epoll High
File Thread pool pthread
Any Event loop io_uring High