Worth noting that there's an equivalent of epoll on most platforms.
On Windows there's IOCP, on Mac and BSD derivatives there's kqueue, and Linux has epoll, but to a first approximation they all do basically the same thing; instead of giving you a full-sized array of "active or not" results that you have to iterate across to detect activity on each of your sockets (as you get from the standard Berkeley sockets 'select' and 'poll' APIs), they only inform you about the sockets that are actually active, so you can spend less CPU time iterating through that big results array.
I can say from personal experience that when you've got a single server process that's monitoring 25,000 sockets on a Pentium Pro 200 MHz box, it makes a huge difference!
I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?
epoll and kqueue really are just edge-triggered select/poll.
However, IOCP and the newer io_uring are different beasts: they are completion-based APIs rather than readiness-based.
To quickly explain the difference:
readiness based: tell me all sockets that are ready to be read from
completion based: do this, tell me when you are done
The "tell me when you are done" part is usually handled in the form a message on a queue (or ring buffer, hence the name io_uring, with the u being for userspace). Which also generally means really high scalability of submitting tons of tasks and also processing tons of completions.
Completion based APIs are superior IMO and it was always sad to me that Windows had one and Linux didn't so it's awesome Jens Axboe got his hands dirty to implement it. It beats the pants off of libaio, eventfd, epoll and piles of hacks.
A point that people seem to miss: epoll supports both level-triggered and edge-triggered. Most similar APIs only support level-triggered.
Edge-triggered is theoretically less work for the kernel than level-triggered, but requires that your application not be buggy. People tend to either assume "nobody uses edge-triggered" or "everybody uses edge-triggered".
Completion-based is far from trivial; since the memory traffic can happen at any time, the kernel has to consider "what if somebody changes the memory map between the start syscall and the end syscall". It complicates the application too, since now you have to keep ownership of a buffer but you aren't allowed to touch it.
AIX and Solaris apparently also support completion-based APIs, but I've never seen anyone actually run these OSes.
(aside, `poll` is the easiest API to use for just a few file descriptors, and `select` is more flexible than it appears if you ignore the value-based API assumptions and do your own allocation)
Edge-triggered requires an extra read/write on every epoll cycle relative to level-triggered, though, because you must keep reading until you hit EAGAIN, so it can actually be much slower (libuv considered switching at one point, but it wasn't clear the extra syscalls required by edge triggering were worthwhile).
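For concreteness, here's a rough sketch of the drain-until-EAGAIN read loop that EPOLLET obliges you to write (assuming the fd was registered with EPOLLIN | EPOLLET and is non-blocking):

```c
/* Minimal sketch of the drain-until-EAGAIN pattern that EPOLLET requires.
 * Assumes `fd` was registered with EPOLLIN | EPOLLET and is non-blocking. */
#include <errno.h>
#include <unistd.h>

static void drain_socket(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            /* hand the bytes to the application's parser here */
            continue;
        }
        if (n == 0)
            break;              /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;              /* kernel buffer drained; wait for the next edge */
        if (errno == EINTR)
            continue;
        break;                  /* real error: close the connection, clean up state */
    }
}
```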
Only on reads. For writes you always want to loop until the kernel buffer really is full (remember the kernel can do I/O while you're working). Writes, incidentally, are a case where epoll is awkward since you have to EPOLL_CTL_MOD it every single time the buffer empties/fills (though you should only do this after a full tick of the event loop of course ... but the bursty nature means that you often do have to, thus you get many more syscalls than `select`).
Even for reads, there exist plenty of scenarios where you will get short reads despite more data being available by the time you check. Though ... I wonder if deferring that and doing a second pass over all your FDs might be more efficient, since that gives more time for real data to arrive again?
True, I don't remember the details for writes, and the complexity of managing high/low water marks makes it even trickier for optimal code. And large kernel send buffers mostly avoid the performance problem here anyway. But on a short write, I am not sure I see the value in testing for EAGAIN over looping through epoll and getting a fresh set of events for everything instead of just this one fd.
Right, for reads, epoll will happily tell you if there is more data still there. If the read buffer size is reasonable, short reads should not be common. And if the buffer is huge, a trip around the event loop is probably better at that point to avoid starvation of the other events
Perhaps it's just that I cut my teeth on the classic Mac OS and absorbed its way of thinking, but after using its asynchronous, callback-driven IO API, the multithreaded polling/blocking approach dominant in the Unix world felt like a clunky step backward. I've been glad to see a steady shift toward asynchronous state machines as the preferred approach for IO.
> to a first approximation they all do basically the same thing; instead of giving you an array you have to iterate across they inform you about the sockets that are actually active
Well, that's half the story. The other half is that select/poll is stateless, meaning the application–kernel bridge is flooded with data about which events you are interested in, despite the fact that this set usually doesn't change much between calls.
kqueue and the like are stateful instead: you tell the kernel which events you are interested in and then it remembers that.
Stateless is simpler to implement and easier to scale across multiple nodes, but comes with additional overhead. For polling the kernel about sockets, that overhead turned out to be a bigger cost than the implementation complexity or the horizontal scaling. (Implementation complexity is still a problem – see the discussions regarding the quality of epoll – and horizontal scaling is just not something desktop kernels do.)
It's a subscription-based kernel API. Instead of passing roughly the same set of descriptors in each successive iteration of the event loop, you tell the kernel once that you are interested in a given fd. It can then append to the "ready list" as each of the descriptors becomes available, even when you're not waiting on epoll_wait.
With the older select() and poll() interfaces, you have to pass the whole set every time you wait. The kernel has to construct individual wait queues for each fd. Epoll is just a more convenient and efficient API, allowing an efficient implementation on the side of the kernel.
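A minimal sketch of that register-once/wait-many shape, assuming `listen_fd` is a non-blocking listening socket created elsewhere:

```c
/* Sketch of the subscribe-once, wait-many pattern. `listen_fd` is assumed
 * to be an already-created, non-blocking listening socket. */
#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

int run_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return -1; }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    /* Register interest once; the kernel remembers it from now on. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {
        perror("epoll_ctl");
        return -1;
    }

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Only ready descriptors come back; no rescanning of the whole set. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* accept() new connections and EPOLL_CTL_ADD them here */
            } else {
                /* read/write the ready connection here */
            }
        }
    }
}
```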
I tried to use IOCP with Python and the pywin32 module a few years ago. I was never able to make it work, even reading C++ code as a guiding source. Documentation and resources for IOCP were also very scarce when I looked; it seemed like an obscure topic to dive into, and I finally gave up. On the other hand, kqueue and epoll are almost trivial to use. And if everything fails, there is always select, which is easy to use as well.
Making IOCP work underneath stateful protocols like TLS is more complicated still, and then add on multithreaded handlers and things get messy real quick. It's a similar story with io_uring. It can be done (I've done it) but it's not easy to reason about (at least for me, maybe I just don't have the mental horsepower or proper brain wiring to grasp it easily).
> Aside: All of the above work on many operating systems and support API’s other than epoll, which is Linux specific. The Internet is mostly made of Linux, so epoll is the API that matters.
That quote is talking about epoll-using software (Go, nginx and 'most programming languages', including Rust), not polling-based APIs. You should instead quote:
> without BSD’s kqueue (which preceded epoll by two years), we’d really be in trouble because the only alternatives were proprietary (/dev/poll in Solaris 8 and I/O Completion Ports in Windows NT 3.5).
> I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?
Are you aware that Linux is used e.g. by Google, Microsoft or Amazon? Can there be occasions when they handle more than 5 simultaneous connections per box? Maybe epoll can make a difference for them, no?
Also this specific article is kinda hazy on pointing out epoll's improvements over previous APIs. One half is that event subscriptions are registered/submitted for a period longer than one syscall (which the article kinda-sorta says.) But the other half is that you also have an opaque field (mostly used as pointer) that the kernel returns back to you on event notifications, which removes an extra level of lookup in userspace.
And lastly, FreeBSD has in the meantime acquired epoll support, and Linux is moving to the next-yet-again replacement in io_uring (calling it an epoll/poll replacement is an insult; it does much more).
Almost nobody actually uses epoll at that kind of scale though. Almost everybody is just fine with `accept` from just one thread.
and `SO_REUSEPORT` is safe if you're using a fixed thread pool; it only fails if you use a variable-size thread pool or if you're using multiple processes and one dies (though frankly, if a process dies, you're going to lose something anyway).
The "but what if I `fork` and `close`" problem is purely theoretical I'm pretty sure.
I remember the sysadmins trying to figure out why squid was so slow by tracing the system calls. It was because it did not use kqueue/epoll. When they asked the maintainer to improve on this, he said squid was fast enough. This was the start of Varnish, where one of the main ideas was to use modern OS calls to speed things up.
It's a fundamental Linux API for writing event loops of any kind. In addition to watching sockets, you also have eventfd (alternative to semaphores and condition variables), timerfd (timers managed by the kernel), signalfd (handle signals safely and synchronously), pidfd (handle child process events).
They allow you to handle heterogeneous events inside that same loop, reducing the need for threads.
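A minimal sketch of what that looks like, with a periodic timerfd and an eventfd multiplexed through one epoll instance (error handling omitted):

```c
/* Minimal sketch: a periodic timerfd and an eventfd in one epoll loop. */
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    int efd  = eventfd(0, EFD_NONBLOCK);          /* poke it from another thread */
    int tfd  = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);

    struct itimerspec ts = { .it_interval = { .tv_sec = 1 },
                             .it_value    = { .tv_sec = 1 } };
    timerfd_settime(tfd, 0, &ts, NULL);           /* fire every second */

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = efd; epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
    ev.data.fd = tfd; epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

    for (;;) {
        struct epoll_event events[8];
        int n = epoll_wait(epfd, events, 8, -1);
        for (int i = 0; i < n; i++) {
            uint64_t val;
            read(events[i].data.fd, &val, sizeof val);   /* both fds yield a u64 */
            if (events[i].data.fd == tfd)
                printf("timer expired %llu time(s)\n", (unsigned long long)val);
            else
                printf("eventfd signalled with %llu\n", (unsigned long long)val);
        }
    }
}
```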
Of course, io_uring has had many security vulnerabilities, as it essentially forms a new ad-hoc programming language with its sequencing operations and doesn't seem to have a cohesive story around security. It will likely suffer many more vulnerabilities. This is presumably why many corporations, including Google, limit its use in production.
As a lot of the complications and issues seem to be around sequencing and the ad-hoc language, hopefully the community will eventually settle on using eBPF for all programming within io_uring [1].
Do you know of a good resource for learning how to write event loops using these primitives? I feel like there are some extremely common and robust patterns there, but my day job mostly involves high-level abstractions like async/await so I never get to dig into how it really works.
I am no expert in these, but I found Tornado's ioloop readable enough to learn more about event loops, mostly because it is implemented in pure Python.
You might be interested in a pure Zig implementation of these primitives by Mitchell in his "libxev" event loop library: https://github.com/mitchellh/libxev
Have a look at how memcached does it. It's a bit more interesting than most examples I've seen, in that (insofar as I can remember) it combines epoll with multithreading. Be aware that many of the tutorials and criticisms of epoll are a bit outdated, as the API has evolved a little over time.
io_uring and epoll can be combined, in both directions. Either add poll SQEs to the ring for an epollfd or have the uring submit readiness events to an eventfd that can be added to epoll.
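A rough sketch of the second direction, assuming liburing: register an eventfd with the ring so completions wake an existing epoll loop:

```c
/* Sketch of the second direction: completions on the ring raise an eventfd,
 * and that eventfd sits in an existing epoll set. Assumes liburing. */
#include <liburing.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int attach_ring_to_epoll(struct io_uring *ring, int epfd)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    if (efd < 0)
        return -1;

    /* Every CQE posted to the ring now also signals this eventfd. */
    if (io_uring_register_eventfd(ring, efd) < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = efd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
}
```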
epoll is a horrible API, which has grown from the even worse select and poll APIs. I believe its entrenchment, despite the existence of APIs like IOCP, is one of the failings of Linux, and it has even negatively influenced programming languages (I am looking at you, Rust). Others have already posted the "fundamentally broken" article, but to add an anecdote: after working closely with io_uring, I simply can not go back to the horrors of epoll, and I wish we, as an industry, would migrate away from it as soon as possible.
tldr: you create a bunch of processes, each of them creates its own socket (bound to the same local port) and processes requests. The kernel splits incoming connections amongst the processes.
The issue: "short" and "long" requests are managed in the same way, so one process could be overloaded by "long" requests. There is no feedback loop, as far as I know.
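Roughly, each worker does something like this (a sketch, not production code):

```c
/* Sketch: each worker process/thread creates its own listening socket bound
 * to the same port with SO_REUSEPORT; the kernel spreads connections across them. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

int make_reuseport_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 || listen(fd, 128) < 0)
        return -1;
    return fd;   /* each worker accept()s on its own copy */
}
```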
Would someone be kind enough to explain how a C program is structured to do asynchronous work with epoll without the use of async/await? Do you need to keep custom structures for each network socket that maintains a snapshot of the state of how things are progressing?
I have a somewhat popular HTTP server library for Zig [1]. It started off as a thread-per-connection (with an optional thread pool), but when it became apparent that async wasn't going to be added back into the language any time soon, I switched to using epoll/kqueue.
Both APIs allow you to associate arbitrary data (a `void *` in kqueue, or a union of `int/uint32_t/uint64_t/void *` in epoll) with the event that you're registering. So when you're notified of the event, you can access this data. In my case, it's a big Conn struct. It contains things like the number of requests on this connection (to enforce a configured max requests per connection) and a timestamp for when it should time out if there's no activity. The Conn is part of an intrusive linked list, so it has a next: *Conn and prev: *Conn. But what you're probably most curious about is that it has a Request.State. This has a static buffer ([]u8) that can grow as needed to hold all the data received up to that point (or, if we're writing, the buffered data that we still have to write). It's important to have a max number of connections and a max request size so you can enforce an upper limit on the memory the library might use. It also acts as a state machine to track up to what point it's parsed the request (since you don't want to have to re-parse the entire request as more bytes trickle in).
It's all half-baked. I can do receiving/sending asynchronously, but the application handler is called synchronously, and if that, for example, calls PG, that's probably also synchronous (since there's no async PG library in Zig). It makes me feel that any modern language needs a cohesive (as in standard library, or de facto standard) concurrency story.
I wasn't aware that you could store a reference when you register an fd with epoll. I have used select and poll in the past, and you'd need to maintain a mapping of fd->structure somehow, and you know C doesn't come with a handy data structure like a map to make this efficient and easy when scaling to say 100k+ fds. So being able to store and retrieve a reference is incredibly useful.
How do you handle the application handler code, would you run that in a separate thread to not block handling of other fds?
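For reference, the C version of that pattern looks roughly like this; the `conn` fields are just illustrative:

```c
/* Sketch: a per-connection struct whose pointer travels in epoll_event.data.ptr,
 * so each notification hands back your own state. Field names are illustrative. */
#include <stdlib.h>
#include <sys/epoll.h>

struct conn {
    int    fd;
    size_t bytes_read;     /* how far request parsing has progressed */
    char   buf[8192];
};

static struct conn *register_conn(int epfd, int client_fd)
{
    struct conn *c = calloc(1, sizeof *c);
    if (!c)
        return NULL;
    c->fd = client_fd;

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.ptr = c;                      /* the kernel stores this verbatim */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev) < 0) {
        free(c);
        return NULL;
    }
    return c;
}

/* Later, in the event loop:
 *   struct conn *c = events[i].data.ptr;
 *   ... read into c->buf + c->bytes_read ...
 */
```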
In my project, you can start N workers, each of which accept(2)s (thanks to REUSEPORT[_LB]) and manages its own epoll/kqueue. When a request is complete, the application's handler is called directly by that worker.
I considered what you're suggesting: having a threadpool to dispatch application handlers on. It's obviously better. But you do have to synchronize a little more, especially if you don't trust the client. While dispatched, the client shouldn't be able to send another request, so you need to remove the READ notification for the socket and then once the response is written, re-add it. Seemed a bit tedious considering I'm hoping to throw it all out when async is re-added as a first class citizen to the language.
The main benefit of my half-baked solution is that a slow or misbehaving connection won't slow (or block!) other connections. Application latency is an issue (since the worker can't process more requests while the application handler is executing), but at least that's not open to an attack.
> and you know C doesn't come with a handy data structure like a map to make this efficient and easy when scaling to say 100k+ fds.
There are plenty of map-like things available – not in the language standard, but in libcs or OS includes – but fds are ints, and kernel APIs that return 'new' fds are contracted to return the lowest-numbered fd that isn't currently in use... So you can get the max number of fds (the OS will tell you if you ask) and allocate an array of that size.
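Something like this, as a sketch:

```c
/* Sketch: size a flat array by the fd limit and index it directly by fd. */
#include <stdlib.h>
#include <sys/resource.h>

struct conn_state;   /* whatever per-connection state you track */

struct conn_state **make_fd_table(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return NULL;
    /* One slot per possible descriptor; new fds are always the lowest free one. */
    return calloc(rl.rlim_cur, sizeof(struct conn_state *));
}
```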
> Do you need to keep custom structures for each network socket that maintains a snapshot of the state of how things are progressing?
Yup.
Everything is a callback and you dispatch those from your event loop. You're managing/implementing your own closures by (manually/explicitly) throwing all necessary data into some data structure (essentially the "async" part). Then make appropriate callback registration calls and return from your code (equivalent to "await").
Normally it's just a matter of "store a buffer of bytes received and a buffer of bytes to be sent". The "application" part doesn't do syscalls directly, just uses the buffers. Then when the application part is done (for now) you go back to the event loop which makes all the syscalls for you.
There is potential for bugs if you try to be "clever" with partially parsing the buffer before all the message's data is received, but this is still far simpler than the kind of bugs we get with async/await.
A convenient fact is that file descriptors are just numbers, and the lowest available number is always used when creating a new socket, so it's naturally easy to keep all your structures in a giant array and use the fd to index into it to retrieve the structure that represents the connection.
From there it's just choosing a function to dispatch to based upon epoll event and being exceptionally careful to always use non blocking IO on the underlying fd.
> Do you need to keep custom structures for each network socket that maintains a snapshot of the state of how things are progressing?
That's one way of doing it.
Ideally, you track state for every connection (you need to do this even with async/await, it's just that here you need to be explicit) and you dispatch event handling to a thread pool using a command queue. One thread is your demultiplexor calling epoll/kqueue/MsgReceive/waitfor, and all your other threads are workers in a pool; communication is through a synchronized queue.
It's not fancy. It's low overhead and fairly easy to reason about.
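A bare-bones sketch of that shape with pthreads (a real queue would also handle backpressure and shutdown; the mutex and condvar need pthread_mutex_init/pthread_cond_init before use):

```c
/* Sketch: one epoll thread pushes ready fds onto a mutex/condvar-protected
 * queue, and a fixed pool of worker threads pops from it. */
#include <pthread.h>

#define QUEUE_CAP 1024

struct work_queue {
    int             items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
};

void queue_push(struct work_queue *q, int fd)
{
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {            /* a real queue would block or grow */
        q->items[q->tail] = fd;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
    }
    pthread_mutex_unlock(&q->lock);
}

int queue_pop(struct work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int fd = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return fd;
}

/* Demultiplexor thread: epoll_wait(), then queue_push(q, ready_fd) per event.
 * Worker threads:       for (;;) handle(queue_pop(q));                        */
```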
It was eons ago that I looked at this, so I don't remember the technical details, but here's a C++ framework that used epoll: https://github.com/williame/hellepoll
Back in the '90s, ACE was the library that introduced/popularized reactor/proactor to C++. These days there are many libraries available. ASIO is quite popular, seastar, POCO, etc.
It starts with a simple single-request-at-a-time HTTP server implemented in Rust, then progresses to examples with multi-threading, non-blocking, epoll-based multiplexing, futures and async/await, showing the limitations and advantages at each step.
Linux is somewhat unique because they tried many times and it never worked quite well; "epoll" is the penultimate attempt as of today, and "io_uring" is the latest.
Linux is also somewhat unique because most of the Internet (with a bit of hand-waving, you could say all of the Internet) runs on Linux. So even if other OSes have asynchronous I/O, it doesn't affect the Internet.
Epoll is a readiness notification API. You tell epoll which fds you're interested in, then you run a poll operation which blocks until at least one of the fds you're interested in is ready. Then you separately need to do the actual IO (which is essentially just a memcpy() between kernel and user space).
In contrast io_uring is a completion notification API. You give it a buffer to read or write from/to, and it'll notify you when the IO has been completed.
io_uring is also more flexible, and can for example be used for disk IO, which epoll can't.
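A minimal liburing sketch of that "do this, tell me when you're done" flow (assuming liburing is available):

```c
/* Sketch of the completion model with liburing: submit a read, then harvest
 * the completion; the kernel fills `buf` before you hear about it. */
#include <liburing.h>
#include <stdio.h>

int read_with_uring(int fd)
{
    struct io_uring ring;
    char buf[4096];

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);   /* "do this..." */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                    /* "...tell me when done" */
    int res = cqe->res;                                /* bytes read, or -errno */
    printf("read completed: %d\n", res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return res;
}
```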
Makes sense. It also makes for much more efficient data polling, since you get the completed buffer rather than just a ready fd on which you still have to perform another syscall.