12. Timer and Timeout

12.1 The need for timers

There are many uses for timers in a Redis server:

  • Time-to-live (TTL) and cache expiration.
  • Network IO timeouts.
  • Closing idle connections.

Cache expiration

TTL is often used in caching because:

  • Memory is finite. TTL is a cache eviction method.
  • Some lazy systems do not synchronize the DB with the cache and rely on cache expiration to keep the data fresh.

Network IO timeouts

Network timeouts are an often-missing piece in real-world software. We need to handle the case where the remote peer is unreachable or simply gone.

A TCP connection will eventually time out and be closed by the OS, assuming TCP keep-alive works. But this can take a very long time, measured in minutes or hours, while applications use timeouts measured in seconds or milliseconds. So timeouts are an explicit consideration for networked apps.

Idle timeouts

While in the middle of reading a request, a reasonable timeout value is measured in seconds or milliseconds, since the typical case is waiting out just a few TCP retransmissions. It’s better to report errors than to be stuck for a long time.

However, while waiting for a new request, the timeout value should be larger, since long-lived connections are the intended use case. Still, it’s not infinite.

12.2 Timeouts in syscalls

Blocking IO

For regular files, timeouts are not possible on Linux.

For blocking sockets, timeouts are set with setsockopt() and the SO_RCVTIMEO and SO_SNDTIMEO options. See socket(7) for details.

  • A blocking read() or write() will fail with EAGAIN after the timeout.
  • A blocking connect() will fail with ETIMEDOUT instead.
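
For example, a 5-second receive timeout might be set like this (a minimal sketch; error checking omitted, and fd is assumed to be a blocking socket):

struct timeval tv = {5, 0};     // 5 seconds, 0 microseconds
setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));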

Nonblocking IO

Timeouts are irrelevant for nonblocking syscalls. In an event loop, the readiness API is the only blocking syscall, so it’s the only syscall that can time out.

int poll(struct pollfd *fds, nfds_t nfds, int timeout); // milliseconds

A timeout value of -1 means no timeout: poll() blocks indefinitely until an fd is ready.

Sleeping

In blocking code, you can wait for a future time by sleeping. This is not possible in an event loop. However, poll() can wake up from timeouts, which is a form of sleeping.
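
In fact, calling poll() with zero file descriptors is effectively a sleep (a sketch):

// sleep for roughly 100 milliseconds
poll(NULL, 0, 100);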

timerfd

timerfd_create() creates an in-kernel timer and returns a timer handle called a timerfd. Timer expirations are then notified via poll(), with the timer handles watched along with the socket handles for IO readiness.

// return a timer handle
int timerfd_create(int clockid, int flags);
// set the expiration time
int timerfd_settime(int fd, int flags,
    const struct itimerspec *new_value, struct itimerspec *old_value);
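
A minimal usage sketch (one-shot timer from <sys/timerfd.h>; error checking omitted):

// arm a timer that expires once, 5s from now
int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
struct itimerspec ts = {};      // zeroed it_interval means one-shot
ts.it_value.tv_sec = 5;
timerfd_settime(tfd, 0, &ts, NULL);
// poll() the tfd for POLLIN; on expiration, read() yields the
// number of expirations as a uint64_t.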

timerfd can be used with event loops, but this isn’t very useful because timers can be implemented in userspace without any syscalls.

12.3 Timer data structures

Wait for the nearest timer

poll() has a single timeout value, but we want to use multiple timers. The naive way is to wake from poll() periodically, say every 1s, then loop through each timer to check for expirations. This solution has limited resolution and costs O(N) per wakeup.

There are many special data structures invented for timers. But the general solution is based on sorting: having things happen at a certain time means having things happen in a certain order, and ordering things by time is a sorting problem. So the problem can be restated as: output a list of timestamps in sorted order.

If timers are sorted, poll() only needs the timeout value of the nearest timer. This is how to implement timers in an event loop.

Fully sorted data structures

We can use the AVL tree to sort timers:

  • Adding, updating, or removing a timer is O(log N).
  • Finding the nearest timer (the leftmost node) is O(log N).

The minimum node can be cached to make finding the nearest timer O(1).

Partially sorted data structures

Timers don’t need to be fully sorted; the only requirement is to find the nearest timer. This is satisfied by the heap data structure. A heap can find the minimum in O(1) and updates are O(log N). It’s an array-based implicit tree without child or parent pointers, which takes up less space. It’s the most popular choice for timers.
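
For illustration, the C++ standard library provides such a heap as std::priority_queue (a sketch; the names g_timers, add_timer, and nearest_deadline are made up for this example):

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// a min-heap of timer deadlines in milliseconds;
// std::greater puts the smallest (nearest) deadline on top
static std::priority_queue<
    uint64_t, std::vector<uint64_t>, std::greater<uint64_t>> g_timers;

static void add_timer(uint64_t deadline_ms) {
    g_timers.push(deadline_ms);     // O(log N)
}
static uint64_t nearest_deadline() {
    return g_timers.top();          // O(1); caller ensures non-empty
}

Note that std::priority_queue cannot update or remove an arbitrary element; a timer heap that supports updates usually stores each element’s array index alongside it.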

Fixed timeout values and FIFO

The socket timeout problem can be solved with much simpler data structures. Note that the timeout value is the same for all sockets, so a newly added timer is always the farthest one. In other words, with a fixed timeout value, the timers are ordered by the time they are added.

This is just a FIFO (first-in-first-out). New timers are added to the back and the nearest timer is removed from the front, which can be done with a linked list in O(1).

If you have multiple timeout values, e.g., IO timeout and idle timeout, just use multiple linked lists.

12.4 Implement timers

Getting the time

clock_gettime() is the syscall to get the time.

int clock_gettime(clockid_t clockid, struct timespec *tp);

struct timespec {
    time_t     tv_sec;   /* Seconds */
    long       tv_nsec;  /* Nanoseconds [0, 999'999'999] */
};

The clockid argument is the type of timestamp. The most common types are wall time and monotonic time.

#define CLOCK_REALTIME  0   // wall time, the Unix timestamp
#define CLOCK_MONOTONIC 1   // monotonic time

A wall timestamp is related to timekeeping in the real world. The most common is the Unix timestamp: the number of whole days since 1970 multiplied by 24×3600, plus the seconds of the current day. That’s roughly, but not exactly, the number of seconds since 1970, because leap seconds are ignored. The wall time in a computer can be adjusted, so the duration between 2 timestamps can be unstable or even negative. It’s not usable for timers.

A monotonic time cannot be adjusted; it only moves forward. Its value is not related to any real-world timekeeping, so it can only be used to measure durations.

poll() uses CLOCK_MONOTONIC internally for timeouts, so that’s what we will use.

static uint64_t get_monotonic_msec() {
    struct timespec tv = {0, 0};
    clock_gettime(CLOCK_MONOTONIC, &tv);
    return uint64_t(tv.tv_sec) * 1000 + tv.tv_nsec / 1000 / 1000;
}
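
For example, measuring an elapsed duration with it (do_work() is a hypothetical placeholder):

uint64_t t0 = get_monotonic_msec();
do_work();      // anything that takes time
uint64_t elapsed_ms = get_monotonic_msec() - t0;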

Doubly-linked list

We’ll time out idle connections using a linked list. Each time a socket does IO, its timeout is set to a fixed offset from now, and the socket is moved to the back of the list.

The linked list must be doubly-linked so that nodes can be detached from the middle.

struct DList {
    DList *prev = NULL;
    DList *next = NULL;
};

// unlink a node from its neighbors in O(1); assumes the node is linked
inline void dlist_detach(DList *node) {
    DList *prev = node->prev;
    DList *next = node->next;
    prev->next = next;
    next->prev = prev;
}

The list node is embedded into struct Conn.

struct Conn {
    // ...
    // timer
    uint64_t last_active_ms = 0;
    DList idle_node;
};

The list head is a dummy node, not a list member. It’s used to avoid the empty list special case.

// global states
static struct {
    HMap db;
    // a map of all client connections, keyed by fd
    std::vector<Conn *> fd2conn;
    // timers for idle connections
    DList idle_list;    // list head
} g_data;

The dummy node is linked to itself, forming a circle:

inline void dlist_init(DList *node) {
    node->prev = node->next = node;
}

An empty list is a list with only the dummy node:

inline bool dlist_empty(DList *node) {
    return node->next == node;
}

Because of the dummy node, insertion doesn’t need to handle the empty list case.

inline void dlist_insert_before(DList *target, DList *rookie) {
    DList *prev = target->prev;
    prev->next = rookie;
    rookie->prev = prev;
    rookie->next = target;
    target->prev = rookie;
}

Create and update timers

A new Conn is added to the back of the list by inserting it before the dummy node:

    // create a `struct Conn`
    Conn *conn = new Conn();
    conn->fd = connfd;
    conn->want_read = true;
    conn->last_active_ms = get_monotonic_msec();
    dlist_insert_before(&g_data.idle_list, &conn->idle_node);

Each time the socket is ready for IO, it’s moved to the back of the list:

        // handle connection sockets
        for (size_t i = 1; i < poll_args.size(); ++i) {
            uint32_t ready = poll_args[i].revents;
            if (ready == 0) {
                continue;
            }
            Conn *conn = g_data.fd2conn[poll_args[i].fd];
            // update the idle timer by moving conn to the end of the list
            conn->last_active_ms = get_monotonic_msec();
            dlist_detach(&conn->idle_node);
            dlist_insert_before(&g_data.idle_list, &conn->idle_node);
            // ...
        }

Find the timeout value from the nearest timer

The poll() timeout is calculated from the first list item:

static int32_t next_timer_ms() {
    if (dlist_empty(&g_data.idle_list)) {
        return -1;  // no timers, no timeouts
    }
    uint64_t now_ms = get_monotonic_msec();
    Conn *conn = container_of(g_data.idle_list.next, Conn, idle_node);
    uint64_t next_ms = conn->last_active_ms + k_idle_timeout_ms;
    if (next_ms <= now_ms) {
        return 0;   // missed?
    }
    return (int32_t)(next_ms - now_ms);
}
        int32_t timeout_ms = next_timer_ms();
        int rv = poll(poll_args.data(), (nfds_t)poll_args.size(), timeout_ms);
        // ...
        process_timers();
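
next_timer_ms() recovers the Conn from its embedded list node with container_of(), the usual intrusive-container macro from the earlier chapters, shown here in its common form:

// recover the enclosing struct from a pointer to its member;
// offsetof() is from <stddef.h>
#define container_of(ptr, T, member) \
    ((T *)( (char *)(ptr) - offsetof(T, member) ))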

Process timers at each event loop iteration

Check the nearest timers to see if they have expired. poll() can also wake up due to IO, so there may be no expired timers. The timers are sorted, so stop at the first non-expired one.

static void process_timers() {
    uint64_t now_ms = get_monotonic_msec();
    while (!dlist_empty(&g_data.idle_list)) {
        Conn *conn = container_of(g_data.idle_list.next, Conn, idle_node);
        uint64_t next_ms = conn->last_active_ms + k_idle_timeout_ms;
        if (next_ms > now_ms) {
            break;  // not expired (a deadline equal to now counts as expired, matching next_timer_ms())
        }
        fprintf(stderr, "removing idle connection: %d\n", conn->fd);
        conn_destroy(conn);
    }
}

Timed-out sockets are destroyed. Remember to remove them from the list:

static void conn_destroy(Conn *conn) {
    (void)close(conn->fd);
    g_data.fd2conn[conn->fd] = NULL;
    dlist_detach(&conn->idle_node);
    delete conn;
}

Done

The server should close idle connections after a timeout. Test it with socat:

socat tcp:127.0.0.1:1234 -
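
If the connection stays idle past the timeout, the server should log something like the following (the fd number will vary), and socat should see the connection closed:

removing idle connection: 4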

Exercise: Add IO timeouts for read() and write() (with a different timeout value).

Source code: