I have had the opportunity to explore the underlying support interface on which most server systems are built: epoll(7). I read its manpage; the first manpage I ever read from beginning to end, nodding my head, supposedly, in understanding.
So this is my own attempt to document what I learned about this interface. I will reference a lot of the material I had to read in search of answers to some gray-area questions. Let's start with a very basic and not very useful example of epoll:
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	// Create an epoll instance; the returned descriptor refers to it.
	epollFd, err := syscall.EpollCreate1(0)
	if err != nil {
		log.Fatalf("epoll_create1: %v", err)
	}
	defer syscall.Close(epollFd)

	... // socket setup elided; sockFd holds the socket's descriptor

	// Register the socket with the epoll instance, asking to be told
	// when it becomes readable or writable.
	err = syscall.EpollCtl(epollFd, syscall.EPOLL_CTL_ADD, sockFd, &syscall.EpollEvent{
		Fd:     int32(sockFd),
		Events: syscall.EPOLLIN | syscall.EPOLLOUT,
	})
	if err != nil {
		log.Fatalf("epoll_ctl: %v", err)
	}

	// Wait (indefinitely) for events on the registered descriptors.
	events := make([]syscall.EpollEvent, 64)
	n, err := syscall.EpollWait(epollFd, events, -1)
	if err != nil && err != syscall.EINTR {
		log.Fatalf("epoll_wait: %v", err)
	}
	for i := 0; i < n; i++ {
		// handle events
		fmt.Printf("Event on fd: %d\n", events[i].Fd)
	}
}
This is a basic demonstration of the epoll interface used to monitor a single I/O event report on a socket. Before continuing we have some backtracking to do: preliminary discussions about file I/O that are necessary to understand epoll's essence. They cover my understanding of sockets, epoll, and the I/O operations that facilitate communication between processes on the same machine or between two different machines.
Slow & fast files
Let's begin by categorizing files as fast and slow. This categorization is not about the actual speed of read and write operations but rather about the predictability of their response time. Files that respond to read and write requests within a predictable amount of time are considered fast, while those that could take an "infinitely" long time to respond are slow. For example, reading a block from a regular file is predictable (the kernel knows the data is there, it just has to go get it), so regular files are fast in that sense. But reading from a socket or pipe is not predictable (the kernel doesn't know when, if ever, a connection or data will arrive on a socket), and as a result sockets are slow files.
File mode
There's another consideration to keep in mind when working with files: file mode. File mode dictates how your read and write requests will behave. By default, files are in blocking mode, meaning a read request will block if there is no data to be read yet; likewise, a write request will block until all data has been written successfully or an error occurs. This behavior can easily be changed by switching the file to non-blocking mode. Then a read request returns immediately if there is no data available, while a write request writes as much data as possible before returning. This is particularly useful for event-driven systems, where you want to avoid blocking the entire process while waiting for I/O.
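As a quick illustration (a minimal sketch of my own, not part of any package discussed here), switching an already-open file descriptor to non-blocking mode boils down to setting O_NONBLOCK via fcntl(2), which Go wraps as syscall.SetNonblock:
// setNonblocking is a hypothetical helper: it flips an already-open file
// descriptor into non-blocking mode (fcntl(2) with O_NONBLOCK under the hood).
func setNonblocking(fd int) error {
	return syscall.SetNonblock(fd, true)
}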
Sockets
The two paragraphs above point to our next subject of discussion, the socket(2). While not a physical file on disk, a socket is considered a virtual file: it is represented by a file descriptor and accessed through the same system calls used to interact with regular files. And as mentioned earlier, we can't tell exactly when a connection will happen on an open, listening socket. Also, like any file, a socket can be in blocking or non-blocking mode for accept(2), read(2) and write(2) requests.
Let's start with a blocking socket setup first, targeting the Linux platform:
package netiolite

import (
	"os"
	"syscall"
)

type listener struct {
	fd int
	sa *syscall.SockaddrInet4
}

// conn wraps an accepted connection's descriptor and peer address.
type conn struct {
	fd    int
	saddr syscall.Sockaddr
}

func NewListener(ip []byte, port int) (*listener, error) {
	syscall.ForkLock.Lock()
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP)
	if err != nil {
		syscall.ForkLock.Unlock()
		return nil, os.NewSyscallError("socket", err)
	}
	syscall.ForkLock.Unlock()
	if err := syscall.SetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_REUSEADDR, 1); err != nil {
		syscall.Close(fd)
		return nil, os.NewSyscallError("setsockopt", err)
	}
	sa := &syscall.SockaddrInet4{Port: port}
	copy(sa.Addr[:], ip)
	if err := syscall.Bind(fd, sa); err != nil {
		syscall.Close(fd)
		return nil, os.NewSyscallError("bind", err)
	}
	if err := syscall.Listen(fd, syscall.SOMAXCONN); err != nil {
		syscall.Close(fd)
		return nil, os.NewSyscallError("listen", err)
	}
	return &listener{fd, sa}, nil
}

func (ln *listener) Accept() (*conn, error) {
	// Blocks until a connection request arrives on the listening socket.
	nfd, sa, err := syscall.Accept(ln.fd)
	if err != nil {
		return nil, os.NewSyscallError("accept", err)
	}
	return &conn{fd: nfd, saddr: sa}, nil
}
This setup involves the socket(2), setsockopt(2), bind(2), listen(2), and the blocking accept(2) system calls. For a simple use case, this setup serves well: it opens a listening port, accepts connections, and sends and receives data on those connections. The accept on the socket blocks until the first connection request hits the listening socket. Likewise, the read and send requests on a connection block, as discussed earlier. But for a more complex system than the code excerpt above, we need a setup that can carry on with other tasks when no data can be read or written at the moment.
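For illustration, a hypothetical caller of this blocking setup (living in the same package) could look like the sketch below; the loopback address, port and echo-style handling are placeholders of mine, not part of the package:
// serveBlocking is a sketch of driving the blocking listener: every call
// below parks the goroutine until the kernel has something for it.
func serveBlocking() error {
	ln, err := NewListener([]byte{127, 0, 0, 1}, 8080)
	if err != nil {
		return err
	}
	for {
		// Blocks until a client connects.
		c, err := ln.Accept()
		if err != nil {
			return err
		}
		// Blocks until the peer sends data (or closes the connection).
		buf := make([]byte, 4096)
		if n, err := syscall.Read(c.fd, buf); err == nil && n > 0 {
			syscall.Write(c.fd, buf[:n]) // echo the data back
		}
		syscall.Close(c.fd)
	}
}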
Achieving a non-blocking setup involves setting the O_NONBLOCK flag on the files in question. Here, we can create non-blocking listening and connection sockets by passing a non-blocking flag when they are created: the type parameter of socket(2) and the flags parameter of accept4(2) both accept SOCK_NONBLOCK.
func NewListener(ip []byte, port int, flags int) (*listener, error) {
	syscall.ForkLock.Lock()
	// Passing SOCK_NONBLOCK in flags creates the socket in non-blocking mode.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM|flags, syscall.IPPROTO_TCP)
	if err != nil {
		syscall.ForkLock.Unlock()
		return nil, os.NewSyscallError("socket", err)
	}
	...
}

func (ln *listener) Accept(flags int) (*conn, error) {
	nfd, sa, err := syscall.Accept4(ln.fd, flags)
	if err != nil {
		if err == syscall.EAGAIN {
			// Non-blocking listener: no connection is ready right now.
			return nil, err
		}
		return nil, os.NewSyscallError("accept4", err)
	}
	return &conn{fd: nfd, saddr: sa}, nil
}
The downside of this non-blocking setup is the need to constantly poll the system calls, since you otherwise won't know whether there's a connection to accept, data to read, or buffer space to write into.
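To make that downside concrete, here is a rough sketch (my own illustration, assuming the non-blocking Accept above and a "time" import) of what busy-polling a listener looks like without any readiness notification:
// busyAcceptLoop keeps retrying a non-blocking accept: without readiness
// notifications we can only spin (or sleep) on EAGAIN, wasting CPU.
func busyAcceptLoop(ln *listener) {
	for {
		c, err := ln.Accept(syscall.SOCK_NONBLOCK)
		if err == syscall.EAGAIN {
			time.Sleep(time.Millisecond) // nothing ready yet; try again soon
			continue
		}
		if err != nil {
			return
		}
		_ = c // hand the connection off for processing
	}
}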
Epoll
This new concern leads us to the select(2), poll(2), kqueue(2) (the BSD equivalent) and epoll(7) APIs. We rely on these calls for event notifications on file descriptors: we submit both the listening and connection socket file descriptors to the kernel for monitoring. They act as I/O multiplexers, reporting readiness events for a set of file descriptors.
package netiolite

import (
	"syscall"
	"time"
)

type poll struct {
	fd     int
	events []syscall.EpollEvent
	evfds  []int32
}

func newPoll(flags int) (*poll, error) {
	fd, err := syscall.EpollCreate1(flags)
	if err != nil {
		return nil, err
	}
	p := new(poll)
	p.fd = fd
	p.events = make([]syscall.EpollEvent, 64)
	p.evfds = make([]int32, 0, len(p.events))
	return p, nil
}

// wait blocks until at least one registered descriptor is ready (or the
// timeout expires) and returns the file descriptors with pending events.
func (p *poll) wait(msec time.Duration) ([]int32, error) {
	var err error
	var n int
	if msec >= 0 {
		n, err = syscall.EpollWait(p.fd, p.events, int(msec/time.Millisecond))
	} else {
		n, err = syscall.EpollWait(p.fd, p.events, -1)
	}
	if err != nil && err != syscall.EINTR {
		return nil, err
	}
	p.evfds = p.evfds[:0]
	for i := 0; i < n; i++ {
		p.evfds = append(p.evfds, p.events[i].Fd)
	}
	return p.evfds, nil
}

// addEvents registers a new file descriptor with the epoll instance.
func (p *poll) addEvents(fd int32, flags int) error {
	return syscall.EpollCtl(p.fd, syscall.EPOLL_CTL_ADD, int(fd), &syscall.EpollEvent{
		Fd:     fd,
		Events: uint32(flags),
	})
}

// modEvents changes the event mask of an already-registered descriptor.
func (p *poll) modEvents(fd int32, flags int) error {
	return syscall.EpollCtl(p.fd, syscall.EPOLL_CTL_MOD, int(fd), &syscall.EpollEvent{
		Fd:     fd,
		Events: uint32(flags),
	})
}

// removeFd deregisters a file descriptor from the epoll instance.
func (p *poll) removeFd(fd int32) error {
	return syscall.EpollCtl(p.fd, syscall.EPOLL_CTL_DEL, int(fd), nil)
}
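To tie the pieces together, here is a sketch of my own (assuming the listener and poll types above live in the same package) that registers a non-blocking listening socket with the wrapper and reacts to readiness events:
// serveWithEpoll wires the non-blocking listener into the poll wrapper.
func serveWithEpoll(ln *listener) error {
	p, err := newPoll(0)
	if err != nil {
		return err
	}
	// Ask the kernel to report readability (pending connections) on the
	// listening socket; level-triggered by default.
	if err := p.addEvents(int32(ln.fd), syscall.EPOLLIN); err != nil {
		return err
	}
	for {
		// Block until at least one registered descriptor is ready.
		fds, err := p.wait(-1)
		if err != nil {
			return err
		}
		for _, fd := range fds {
			if int(fd) == ln.fd {
				c, err := ln.Accept(syscall.SOCK_NONBLOCK)
				if err != nil {
					continue // EAGAIN or a transient accept error
				}
				_ = c // register c.fd with p.addEvents, handle its I/O, etc.
			}
		}
	}
}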
This completes the setup for the blog post, but there are some epoll quirks worth discussing. These quirks hinge on the different behaviors and less predictable outcomes of epoll.
Quirks
Let's start with the two different ways events on file descriptors can be reported. Epoll reports events in two modes: level-triggered (LT) and edge-triggered (ET). By default, epoll operates in LT mode, behaving exactly like poll(2). Say a socket receives 2 connections; a call to epoll_wait(2) will return the socket's file descriptor. Now say we call accept() once, handling only a single connection on the listening socket. The next call to epoll_wait(2) will still return that file descriptor as part of the set of "ready" file descriptors, because there is still a pending event (connection) to be handled; the file descriptor is effectively still in the "readable" state. Only once all its pending events have been handled (accepted) does the state change and the descriptor stop being returned in the "ready" set.
Running epoll in ET mode has the opposite behavior: you're only notified about a file descriptor's readiness once, when its state changes. Meaning on the second call to epoll_wait(2), after the first returned the file descriptor in its set, the file descriptor won't be returned again even though there are pending events on it to be handled. It is left to you, the consumer, to make sure all events are handled; only then will you receive fresh notifications for new events. Setting the EPOLLET flag when registering a file descriptor with epoll makes epoll report events for that file descriptor in ET mode.
err := p.addEvents(fd, syscall.EPOLLIN|syscall.EPOLLET)
Which mode to use is down to you and your system's requirements; you can even mix LT and ET file descriptors monitored by the same epoll instance. Just be careful to fully consume events on ET descriptors.
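"Fully consuming events" for an ET listening socket means draining it in a loop; a sketch of mine, reusing the non-blocking Accept from earlier:
// drainAccept accepts until the kernel reports EAGAIN: in ET mode a single
// notification may stand for several queued connections.
func drainAccept(ln *listener) {
	for {
		c, err := ln.Accept(syscall.SOCK_NONBLOCK)
		if err == syscall.EAGAIN {
			return // backlog fully drained; wait for the next edge
		}
		if err != nil {
			return
		}
		_ = c // hand the connection off
	}
}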
Another consideration is how unpredictable the outcome of scaling accept() and read() operations across several threads (or processes) can be in ET mode. Imagine a server with a short burst (backlog) of connections that need accepting. It is logical to share this workload across multiple threads/processes, but this leads to the "thundering herd" problem: all of the threads/processes are woken up (their blocked epoll_wait calls return), yet only one receives the events while the rest fail with EAGAIN. This wastes CPU. We can use the EPOLLEXCLUSIVE flag to ask the kernel to wake up only one (or a few) of the waiting threads/processes to handle the events on a file descriptor rather than waking all of them.
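As a sketch of what that registration could look like: note that the frozen syscall package does not export EPOLLEXCLUSIVE (the flag arrived in Linux 4.5), so you would pull it from golang.org/x/sys/unix or declare the kernel's value yourself, as assumed below.
// EPOLLEXCLUSIVE is not in the frozen syscall package; golang.org/x/sys/unix
// exports it, or the kernel's value can be declared directly (assumption).
const epollExclusive = 1 << 28

// Each worker registers the shared listening socket on its own epoll
// instance with the exclusive flag, so the kernel wakes only one (or a few)
// of the waiters per event instead of all of them.
err := p.addEvents(int32(ln.fd), syscall.EPOLLIN|epollExclusive)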
Let's look at another quirk to be aware of. Say we're monitoring a socket file descriptor and have scaled epoll_wait across two threads, A and B, both waiting for activity on this socket. A connection arrives and the following plays out:
- In ET mode, thread A's epoll_wait returns, and it accepts the connection.
- Another connection comes in, and thread B's epoll_wait returns.
- But before thread B can call accept(), thread A, which is still active, calls accept again and consumes the second connection.
- Thread B finally gets around to calling accept and receives EAGAIN (non-blocking mode): an unnecessary wake-up.
We can circumvent this unpredictable behavior by setting the EPOLLONESHOT flag. This disarms the file descriptor: the kernel will disable monitoring for the file descriptor once it has been returned from epoll_wait, until it is rearmed using EPOLL_CTL_MOD in an epoll_ctl call. That way, thread A can keep calling accept on the socket until it receives EAGAIN, and then rearm the file descriptor for monitoring. This is a race-condition problem, and proper synchronization (e.g., mutexes) is another way to handle it.
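A sketch of that one-shot flow with the poll wrapper from earlier (EPOLLONESHOT is exported by the syscall package; drainAccept is the hypothetical helper sketched above):
// Register the listener so it is reported once and then disarmed.
err := p.addEvents(int32(ln.fd), syscall.EPOLLIN|syscall.EPOLLONESHOT)
...
// In the worker that received the notification: drain until EAGAIN,
// then rearm the descriptor so epoll will report it again.
drainAccept(ln)
err = p.modEvents(int32(ln.fd), syscall.EPOLLIN|syscall.EPOLLONESHOT)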
A reminder to always explicitly deregister a file descriptor from all epoll instances (using EPOLL_CTL_DEL) before closing the file descriptor. See the Q&A section of epoll(7) for a better understanding of why this is important.
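In terms of the wrapper above, that simply means something like:
// Deregister the connection from the epoll instance first, then close it.
p.removeFd(int32(c.fd))
syscall.Close(c.fd)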
Conclusion
In this post, we explored the epoll interface, its use in monitoring file descriptors, and its quirks. We also discussed the differences between blocking and non-blocking I/O and how epoll can be used to build efficient, event-driven systems.
Learning about this interface was quite interesting for me: I set up an observable system that combines the different concepts touched on throughout this blog post to see the different behaviors and outcomes.
Part of my next exploration is to extend the "netiolite" package I'm currently building with epoll to include an event loop and call stack system.
Links: