Epoll is broken – part 2/2 (popcount.org)
108 points by protomyth 243 days ago | 39 comments



I always found it interesting that WinNT got this one right very early on (IOCP; first appeared in NT 3.5 in 1994), and thus avoided all these problems, and a series of subsequent APIs trying to correct them. The way you do efficient asynchronous I/O in Win32 today is still the same as it was 20 years ago.

For those unfamiliar with that model, here's a brief description: https://web.archive.org/web/20101101112358/http://doc.sch130...

My understanding is that kqueue in FreeBSD is also conceptually similar, but I never had a chance to take a close look at that.


I dunno, NT tied IOs to the initiating thread until Vista. This means that you had to keep the initiating thread around or your IOs would be cancelled.

NT also has the same flaw mentioned in the article; it ties the IOCP registration to the file object instead of the file handle. This means that you can't use a handle with an IOCP in one process then hand it off to another process.

The NT IO model has some nice properties but also some pretty serious warts. I think no one got it right.


kqueue got it right. Not just wrt descriptors but with many other aspects of the API.

The BSDs tend to put more thought and abstraction into their designs. Linux and Windows designs tend to be primarily driven by the [expected] most common use case, which often leads to premature design and performance optimizations that over the years prove ill conceived or shortsighted as programming patterns shift.

OTOH, Linux and Windows APIs tend to be more immediately useable. On BSD things tend to be more "some assembly required".

That's all very general but I have very specific examples in mind, like IOCP vs polling, signalfd vs EVFILT_SIGNAL, inotify v EVFILT_VNODE, containers v jail, and seccomp v pledge[1], among others.

[1] With seccomp v pledge, seccomp seems like it needs more assembly. But I think the most common expected use case, given that seccomp restrictions are inherited across fork, was that seccomp sandboxes would be created by a core system utility which then invoked other services, sandboxing them without having to modify the services. pledge requires source code modification. pledge, I think, is the better and more useful approach but it necessarily requires some assembly. Same story for containers vs jail, at least until recently.


Regarding pledge: I agree it's a good pragmatic design, but I wouldn't see it as an example of BSD vs. Linux broadly. To my mind it's a very specifically OpenBSD style design, a kind of ruthless reduction to a dozen or so hardcoded cases that cover 90% of the benefit, sane defaults, opposition to configurability if it can at all be avoided, and screw coming up with a clean "general" solution. Other BSDs have gone in different directions. FreeBSD's solution here is capsicum, which predates pledge and is more powerful in principle, but is more difficult to use, so not that many programs in FreeBSD actually use it.


Ah, good point. Capsicum is a much better example of the difference in approaches.


The NT team had the advantage that they (Cutler et al) had worked on a previous OS and had the opportunity to revisit design decisions in many places. I have no idea if/how VMS solved this problem.


While that can be (and is, really) an excuse for select(2) and poll(2) (which I believe date back to '83) it is not one for epoll (October 2002) as it was preceded by both IOCP ('94) and kqueue (July 2000).


Yes, both kqueue and IOCP are modelled the same: notify on operation completion, rather than when a file descriptor can perform an operation.


kqueue is not completion based, at least not principally.

kqueue is a kernel framework and userspace API for event signaling. The functional components are called event filters, and most of them signal readiness. On FreeBSD there is a kqueue filter for async I/O, EVFILT_AIO, an extension of POSIX AIO. It's analogous to Linux's io_submit IOCB_FLAG_RESFD flag that will signal AIO completion through an eventfd. EVFILT_AIO is a more consistent API than IOCB_FLAG_RESFD, but doesn't strictly distinguish BSD from Linux in terms of capability.[1]

What makes kqueue better than the Linux epoll family of interfaces (epoll, eventfd, signalfd, inotify, etc) is that the framework was much more carefully designed and generally has saner behavior. For example, from the very beginning kqueue had EVFILT_SIGNAL for signal events. It wasn't until 7 years later that Linux got signalfd. But on Linux only a single signalfd listener will receive notification for a particular signal number, as it's basically a hack built atop the traditional signal delivery mechanism, for which there can only be a single handler. On BSD there can be multiple EVFILT_SIGNAL listeners for a signal number, and all will receive a notification. So on Linux, if you have multiple components that might want to reset themselves on SIGHUP, they can't do that independently. Thus, signalfd is not much better from a functionality standpoint than just manually installing a signal handler which uses eventfd; it's merely more convenient for most cases where there's only a single listener, but it doesn't compose well and so fails Interface Design 101.

Apropos the article, kqueue will properly delete a descriptor event when you close it, regardless of whether it's been dup'd, and even though both Linux and BSD share the same distinctions between descriptors and file table entry objects.[2]

It's things like that which make kqueue superior as a practical matter, above and beyond the better architectural design. So many of the annoying aspects of Linux semantics have to do with premature optimization of particular use cases.

Somewhat similarly, so much that's annoying about Windows is that their interfaces are too high-level. If Microsoft's architectural decisions aren't well suited for your particular use case you're basically SOL. For example, IOCP is implemented in userland basically using something like epoll combined with a thread pool (think libdispatch on macOS where the number of slave threads is globally balanced). Microsoft could expose that readiness signaling mechanism so you could roll your own event loops (and thus make integration of projects like Node.js much simpler and cleaner), but they refuse to do so. And so while IOCP is great for the 80% of cases where you really just want completion and are happy to use their well-optimized framework, for the other 20% of cases where the design or performance isn't a good fit you have no good recourse. Windows, and to a lesser extent Linux, often violates the core design principle of keeping easy tasks easy and difficult tasks possible. They make easy things very easy but the difficult stuff often remains impossible as a practical matter.

[1] Actually, EVFILT_READ and EVFILT_WRITE do support setting a low water mark, so you can theoretically condition readiness signaling on kernel buffer availability. But this is basically a kqueue-based extension to the traditional BSD Sockets API options SO_RCVLOWAT and SO_SNDLOWAT. You could theoretically get the same behavior on Linux, at least on a per socket basis (as opposed to the per event basis for kqueue) by setting those options. Alas, Linux doesn't obey SO_RCVLOWAT or SO_SNDLOWAT when polling. :( Speculating, I bet it's probably a performance optimization netting some single-digit performance improvement for the original expected use cases but, once again, keeping something difficult impossible.

In any event, I hesitate to call this completion signaling because the buffer contents could be modified but the already posted event remain outstanding (if not in kernel land, in userspace land). By its nature Windows IOCP forecloses that possibility completely.

[2] In Unix there are universally three kernel data structures involved with any file-like resource: the file descriptor object, the open file table entry object, and the resource object itself. So a single socket object could be referenced by multiple file table entry objects, which in turn could each be referenced by multiple file descriptor objects. dup() will create a new descriptor object referencing the same file table entry object. Opening /proc/self/fd/N will actually create a new file table entry object referencing the resource object. So if you ever find yourself actually facing the epoll issue discussed in the article, one hack solution (unverified!) is to create new descriptor references using /proc/self/fd/N rather than dup(). I discovered this distinction when investigating the difference between /dev/fd on BSD and /dev/fd on Linux. On Linux /dev/fd is a symlink to /proc/self/fd, and so opening /dev/fd/N creates a new file table entry object, which can have different access flags and (IIRC) file position cursors from the original descriptor. On BSD /dev/fd/N is analogous to dup() and so will share the same access flags and position cursor. I imagine that someday confusion on these matters will indirectly lead to an application exploit, just because it's so esoteric and involves semantics people rarely consider. Redirecting /dev/fd/N from the shell might result in unintentional sharing of access flags or cursors. Whereas using /dev/fd/N or /proc/self/fd/N on Linux might result in an unintentional lack of sharing if the programmer was expecting dup-like behavior.
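To make the dup() vs /proc/self/fd/N distinction concrete, here is a minimal Python sketch, assuming Linux with /proc mounted. The function name and the temp file are made up for illustration, and it only checks the file position cursor, not the access flags.

```python
import os
import tempfile


def compare_dup_and_proc_reopen():
    """Return (offset seen via dup, offset seen via a /proc reopen)."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello world")
        tmp.flush()

        fd = os.open(tmp.name, os.O_RDONLY)
        os.read(fd, 5)  # advance this file table entry's cursor to 5

        dup_fd = os.dup(fd)  # new descriptor, same file table entry
        # Linux-specific: reopening via /proc creates a new file table entry.
        proc_fd = os.open(f"/proc/self/fd/{fd}", os.O_RDONLY)
        try:
            dup_offset = os.lseek(dup_fd, 0, os.SEEK_CUR)
            proc_offset = os.lseek(proc_fd, 0, os.SEEK_CUR)
        finally:
            for f in (fd, dup_fd, proc_fd):
                os.close(f)
    return dup_offset, proc_offset
```

On Linux this should return (5, 0): the dup'd descriptor shares the cursor, while the /proc reopen starts with its own cursor at 0.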

I forget what the semantics of /dev/fd on Solaris or /.vol/DEV#/INO# on macOS are. It would pay to check, and to document your code, before relying on them.


I think NT's model is only possible when the API is intimately tied to implementation details of the threading model. POSIX derived APIs probably do not want to be quite as prescriptive.


That's already the case for Linux and epoll, isn't it? Epoll isn't POSIX and is Linux-specific. If it's going to be platform-specific, it had better be right.

A different way to look at it is that NT was designed by a single cohesive team, working from start to finish, whereas the Linux kernel evolved over many years, via the contributions of many people, in a less cohesive way.


From the man page...

> Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically?

> A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
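That last sentence is easy to reproduce. A minimal Python sketch, assuming Linux (select.epoll is Linux-only); the pipe and the function name are only there for illustration:

```python
import os
import select


def events_for_closed_fd():
    """Register a pipe's read end, dup it, close the original, then poll."""
    r, w = os.pipe()
    keep = os.dup(r)  # second descriptor, same open file description
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        os.close(r)        # the registered descriptor number is now invalid
        os.write(w, b"x")  # make the underlying description readable
        events = ep.poll(1.0)
    finally:
        ep.close()
        os.close(keep)
        os.close(w)
    return r, events
```

The returned event list names the descriptor number that was already closed, because the description behind it is still held open by the dup.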

EDIT:

Lots of comments here seem to think this should be unexpected or is a bug. Closing a FD you are using is a bug. I think epoll does a fairly good job of letting the user know that it is watching the description and not the descriptor. Failing to read the man page for dup would also leave you in a blind spot. I have been writing code for Linux a while now and I did not think it was any secret that a file is still open until all of the fds pointing to it are closed. That is why you have to take care and close your duplicated fds at the right time, otherwise you will end up with file handles leaking. The example code provided illustrates this perfectly.

As a side note, using dup2 to point the original FD number (the one you passed to epoll) back at the still-open description via the duped fd should allow you to remove it.


> Closing a FD you are using is a bug.

Sort of, but think of the problem epoll is addressing: having an epoll-registered descriptor is semantically equivalent to having an active read() on the thing. That's sort of the whole point, of course. Calling close() on a descriptor blocked in read() works fine and produces the expected (synchronous EOF) behavior. Doing it on the epoll one produces surprising behavior depending on whether or not it got dup'd somewhere (which can have been somewhere outside your process!).

It's a bug for sure. Not sure it rises to the level of "fundamentally broken", but it's a bug.


POSIX does not require EOF to be returned from a blocking read() if the same file descriptor is passed to close() in another thread. Even if it appears to work on some particular OS, consider the case where the reading thread may have called read() but is preempted just before entering the kernel. If another thread then closes the file descriptor, expecting the first thread to see EBADF or EOF, but then the file descriptor is reused (e.g. by accept()) before the first thread resumes, the reading thread may block reading from an unrelated socket. Even if you can guarantee that accept(), open(), or other calls that explicitly allocate a file descriptor will not be called, ordinary library functions are generally allowed to use file descriptors internally (e.g. to read a configuration file), so closing the file descriptor while another thread is reading from it is almost certainly a bug.
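The descriptor reuse this argument relies on is easy to see, since the kernel always hands out the lowest free number. A minimal single-threaded Python sketch (the function name is made up for illustration):

```python
import os


def closed_number_is_reused():
    """Close a descriptor, open something unrelated, compare the numbers."""
    r, w = os.pipe()
    os.close(r)  # r is now a dangling number
    # The next allocation gets the lowest free descriptor number.
    new_fd = os.open(os.devnull, os.O_RDONLY)
    reused = (new_fd == r)
    os.close(new_fd)
    os.close(w)
    return reused
```

Any thread still holding the old number would now be talking to /dev/null instead of the pipe, without getting EBADF.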


epoll and read are different calls though, expecting them to behave the same when they have very different uses is weird to me.

If it is a bug, then why does the man page call out this case, and other cases with dup and dup2 as use cases? Was the man page edited after the fact to cover up the bug? If so why not change how it works?

I really disagree calling something somebody finds surprising a bug, even more so when the man page covers the "surprising" behaviour.

Edit:

Something was bugging me so I checked it out myself. You are actually wrong, and read does work the same way epoll works. If you dup your fd, launch a pthread that will wait a while and then close the fd, and then start your blocking read call on the fd, then when the thread closes your original fd while the read is active you will not get an EOF, and you will continue to receive data from the server. Only passing the FD to a system call after the fd has been closed will generate an error.

This is very much what we are seeing with epoll. We pass a fd to a system call. It is now doing its thing. We then close the fd, and are now surprised that the system call has no clue what you are talking about? The thing to remember is that there is a table from FD to file (or description). That table is how we talk to the kernel about files. If that table is missing an entry it won't know what file we are talking about.
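A minimal sketch of that experiment, assuming Python on Linux and a pipe in place of the server socket (this is not the code actually used above). The sleep is only a heuristic to let the reader block first, so treat it as an illustration rather than a guarantee:

```python
import os
import threading
import time


def read_survives_close():
    """Close the fd a blocked read() is using; the read still completes."""
    r, w = os.pipe()
    keep = os.dup(r)  # keeps the description open after r is closed

    def close_then_write():
        time.sleep(0.3)  # heuristic: let the main thread block in read()
        os.close(r)      # drop the descriptor out from under the reader
        os.write(w, b"late data")

    helper = threading.Thread(target=close_then_write)
    helper.start()
    data = os.read(r, 64)  # expected to be inside the kernel when close() runs
    helper.join()
    os.close(keep)
    os.close(w)
    return data
```

The read returns the data written after the close, since by then the kernel is working with the file object and not the number.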


> Lots of comments here seem to think this should be unexpected or is a bug. Closing a FD you are using is a bug. I think epoll does a fairly good job of letting the user know that it is watching the description and not the descriptor.

Then why is it reporting events for the file descriptor (not description) that was closed? Am I misunderstanding that?

There's nothing wrong with dup'ing an fd and closing either the original one or the second one while continuing to use the other. It's very strange that something would emit events on the fd that was closed.


You are not getting the events on the FD. You are getting the events on the file. When you set up the event you told the kernel: let me know about changes on a file, but let's use the int when we talk about this file. This actually happens at open. You set up the system to say: hey, when I do things on this file let's shorthand it and just use this number, which we will call a FD from now on.

The confusion is thinking the FD is the file. It is not.

Also a FD can't be open or closed; it is simply in the FD lookup table or it is not. Files are what are opened and closed. A FD is a designation.

Open sets up the designation; close removes it. So to say you are getting events on a fd is okay, but not painting the full picture. You are getting events from the kernel about a file and the kernel thinks you know this file by its designation.

Closing a FD does not close a file unless it is the last FD referencing that file.

Also please see my other big post. I hate typing on a phone.


Maybe I'm skimming over some important point, but it seems to me that the biggest problem here is the way the man page answers that question. I think it would be better with the first sentence deleted altogether, and much better if that sentence were actually a summary of the following explanation instead of basically contradicting it.


What's the rationale for watching the underlying file description instead of the fd?


epoll works with files at the kernel level. You have to use descriptors from the user side to tell the kernel what file you are wanting to work with. Because epoll uses the kernel object's ref count, closing a fd that is NOT the last fd referencing the kernel object will not remove it from the set or trigger the close.

Let's not forget that system calls are not user libraries, but are controlled access to the kernel.


So why is Linux literally the only one who does it like that? It results in a behaviour that's surprising to the application developer and is then hard to control and get right. All I'm hearing are philosophical excuses, but nobody seems to have studied the systems that came before and can point out a concrete reason why the Linux way is better. Meanwhile everyone else is having trouble using the interface correctly.


> So why is Linux literally the only one who does it like that?

That is such a weird question. That is like asking why are you you. It's Linux. That's why. This is how its system calls, fds, and files work. Also you should know, epoll is just ONE of the ways you can get events on files, and people are mostly interested in it for the same reasons that this article hates it. Because it happens at the kernel level it is event driven, vs a polling system.

> It results in a behaviour that's surprising to the application developer and is then hard to control and get right.

It is only surprising because the user of the function did not read the man page and made assumptions about how things work without fully understanding how they do work.

> All I'm hearing are philosophical excuses, but nobody seems to have studied the systems that came before and can point out a concrete reason why the Linux way is better.

There have not been any excuses, it's a matter of fact of how things work and how system calls work. Could things have been done differently? For sure. But even those differently done things could have been done differently.

The article and most of the posts in this thread are not talking about "why Linux is way better". And I don't think anything I have said has argued either way. That being said, maybe it is not better. But it is what runs the internet and most of the top companies you frequent on the internet today -- so it can't be all that bad.

Also who is this that you expected to study? Why have you not done it? You do realize Linux is one of the most successful open source projects, with thousands of developers around the world working to make it better every day? And that a lot of thought goes into adding new system calls and improving the existing systems? Even if you wanted epoll to make magic ponies for you it still could only do so within the constraints the systems it works with impose. For you to come along after 20 years and casually ask these questions is just insane.

> Meanwhile everyone else is having trouble using the interface correctly.

If you look closely, lots of people who don't read man pages before using system calls find things surprising and likely use them incorrectly. Just because people refuse to understand the system they are wanting to work with in full before using it -- and make assumptions about how it should work without understanding how the layers that live below it work -- does not make something faulty, broken, or wrong. As a systems programmer it is your responsibility to not only understand the APIs you are using, but understand what is under the hood of those APIs. If you fail to do that you are going to have a bad day. Meanwhile an awful lot of good software uses epoll and has no problems.

For all the moaning about how epoll works, or anything related to Linux: you guys/girls do know it is open? If you truly think the system call interface or a particular set of system calls could be improved on, then I encourage you to put your pitchforks and ranty blog posts down and do something about it.


Interesting writeup, seems like this only applies to people writing epoll abstraction layers though. Everybody else is able to avoid this bug by deregistering the fds before calling close. So I'm not sure calling it "fundamentally broken" is necessarily accurate.
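Deregistering first is a two-line helper. A minimal Python sketch, assuming Linux; close_registered and the demo function are made-up names, not part of any library:

```python
import os
import select


def close_registered(ep, fd):
    """Deregister fd from the epoll set while the number is still valid, then close it."""
    ep.unregister(fd)
    os.close(fd)


def poll_after_clean_close():
    """Even with a dup keeping the description open, no events should remain."""
    r, w = os.pipe()
    keep = os.dup(r)
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        close_registered(ep, r)
        os.write(w, b"x")  # readable, but nothing is registered any more
        return ep.poll(0.1)
    finally:
        ep.close()
        os.close(keep)
        os.close(w)
```

This only helps if every close in the program goes through the helper, which is exactly the constraint an abstraction layer can't easily enforce.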


Yeah, I think 'fundamentally broken' is going into clickbait territory - these are API calls that are successfully used by a huge number of programs. epoll() is the most recent of these multiplexer syscalls and it's been around for a decade or so. The underlying message is that there are corner cases that exist and should be thought about, but that is true of most syscalls.

The epoll() call itself is a strange API. As the series of articles says, it was introduced to solve the problem of the kernel having to register and unregister every listed FD on every invocation. However, as a trade-off, your program now needs to make a syscall each time it wants to add/remove an FD. So where you once had a single poll() call per event loop, you now have to fire off extra syscalls for each single FD on each new connection.

Compare this to the API of FreeBSD's kqueue - it is another solution to avoid the linear costs of select()/poll(), but instead of making a syscall for registering/unregistering each new FD, kevent() is called with a list of FDs that you want to change. So an event loop using kqueue consists of just one syscall per loop.


> Compare this to the API of FreeBSD's kqueue - it is another solution to avoid the linear costs of select()/poll(), but instead of making a syscall for registering/unregistering each new FD, kevent() is called with a list of FDs that you want to change. So an event loop using kqueue consists of just one syscall per loop.

It would be simple enough to add a version of epoll_ctl that took array arguments, but the existing epoll_ctl makes a lot of sense for the most common use cases.

You either know all the descriptors ahead of time or you don't. If you do then you can add them all on startup and never touch it again, and one-time costs usually aren't worth optimizing. If you don't then you're usually modifying them one at a time anyway.

And if the syscall takes them one at a time then you don't have to worry about maintaining the state yourself or memory management based on how many descriptors you have or anything like that.


Error prone?

It definitely creates situations where bugs crop up. The example of close() is actually quite important, and something which is perhaps unexpected until you hit that bug and spend hours or days tracking it down.

Definitely has happened to me.


It sounds like you can avoid the bug except you know, the part where you can't avoid bugs.

Example: Let's say your program is multi-threaded and you have a bug where you close(2) a dangling file descriptor, which normally harmlessly returns EBADF but this time its integer value happens due to unlucky timing to get re-used for a brand-new socket which is registered with epoll... Then suddenly perhaps this becomes relevant in your program, even though any time you intentionally close the fd you de-register from epoll first.

In other words, when debugging a program using epoll, I think it's still a relevant question. These corner-case type things where it seems like "obviously" you're never going to screw up are often where the most interesting bugs can surface.


If you ever see EBADF, abort the program. You've screwed up something badly somewhere. Ignoring EBADF is like ignoring a double free: even if you can get away with it in the short term, it's indicative of something horrible down the road.


Yes, for the reasons I already illustrated in my example. To be clear I am not advocating EBADF to be something you should live with. I am describing it as unintentional and as a bug. It is a time bomb for harmful race conditions.


> Interesting writeup, seems like this only applies to people writing epoll abstraction layers though. [...] So I'm not sure calling it "fundamentally broken" is necessarily accurate.

If you can't write an abstraction around it, it sounds pretty broken.


The bug is only there because the epoll interface necessitates you write the code to avoid it. That, to me, is the definition of a fundamentally broken interface.


https://illumos.org/man/5/epoll

While a best effort has been made to mimic the Linux semantics, there are some semantics that are too peculiar or ill-conceived to merit accommodation. In particular, the Linux epoll facility will -- by design -- continue to generate events for closed file descriptors where/when the underlying file description remains open. For example, if one were to fork(2) and subsequently close an actively epoll'd file descriptor in the parent, any events generated in the child on the implicitly duplicated file descriptor will continue to be delivered to the parent -- despite the fact that the parent itself no longer has any notion of the file description! This epoll facility refuses to honor these semantics; closing the EPOLL_CTL_ADD'd file descriptor will always result in no further events being generated for that event description.


"Using these toolkits is like trying to make a bookshelf out of mashed potatoes." - Jamie Zawinski


For people reading this thread who want to better understand the distinctions between files, file descriptions, and file descriptors, I recommend Michael Kerrisk's book The Linux Programming Interface, which has great coverage of this topic.


I am making a new post because I feel like this is getting out of hand. There is a lot of misinformation that needs to be cleared up.

When you are working with system calls you use fds to tell the kernel what file you are working with. These are simply numbers, an int to be exact. When you call a system call that takes in a fd you pass the int that references the file you want to work with and it is looked up in a table to find the kernel structure that represents the file. When you dup a fd all that does is add an entry to the lookup table to point to the same kernel structure -- increasing the refcount at the same time. The ref count lives on the kernel structure, not with the int. The fd system is real simple: open, dup, dup2, etc. will add an entry, while close will remove the entry. If the entry is removed the fd is now useless for any communication to the kernel over system calls. This is because the lookup table will not be able to translate your fd to a kernel structure.

Now the question: is epoll broken, or does it have a bug? I don’t think so. You just have to keep in mind you are using a system call and not a library. This is controlled access to kernel level functions. In the example, adding a fd to epoll, then duping it, followed by a close, provides some surprising results. Understanding how userspace talks to kernel space and how system calls translate into running kernel code is key to understanding why this is not a bug but just how things work. You might also be surprised that other system calls behave the same way!

When you tell epoll to wait you use a fd to tell it which file you want to wait on. You are not telling a library, or a userspace program to do this. You are telling the kernel. So the fd is translated into a kernel object and that is what epoll is working with. Only when adding and removing is this translation done. The kernel subsystem that provides epoll will continue to work with kernel objects -- not fds.

When you call dup you are adding a new entry to your program's fd lookup table for a new fd but pointing to the same kernel structure -- at the same time you are also incrementing the refcount of the kernel structure. The file is not actually closed until both the original fd and the duped fd have had close called on them, or more correctly, until the refcount of the kernel structure is zero. This means that if you tell epoll to wait on a file and do so by an fd, and then close the fd -- you will only see a close event if the refcount is zero. By calling dup before closing your original fd you have incremented the ref count and thus the file remains open -- no close event.

If you then try to remove your file from epoll using the now invalid (closed) fd it will fail. This is because the fd is no longer in the fd lookup table and the system call can not translate your fd into a kernel object. Now you might say this is not how other system calls work. As another thread pointed out, “read” supposedly does not behave this way. To the contrary -- read actually works the same way. If you dup your socket fd, and use a thread to close your socket fd while in an active read -- your read will not fail. It will finish out delivering the data to you. This is because the fd is not the actual object you are working with. It is just a reference to the object. Once you enter the read function you are in kernel land. And the kernel only knows about the kernel object. Because you duped the fd before closing it the kernel object has a ref count of 2 and your close while read leaves the ref count at 1, thus there is nothing for the kernel to do other than finish out its read. If you attempt to use the now closed and invalid fd for a subsequent read or close you will get an EBADF error -- much like you do when you attempt to use it with epoll's functions to remove the file.
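A small sketch of that refcount behaviour, assuming Python on Linux and a pipe standing in for the file (all names are illustrative only). It relies on CPython ignoring SIGPIPE by default, so the final write raises instead of killing the process:

```python
import errno
import os


def last_close_wins():
    """Show that the read end stays open until its last descriptor is closed."""
    r, w = os.pipe()
    r2 = os.dup(r)
    os.close(r)            # one descriptor for the read end remains
    os.write(w, b"ping")   # succeeds: the read end is still open
    data = os.read(r2, 4)

    try:
        os.read(r, 4)      # the old number no longer maps to anything
        stale_errno = 0
    except OSError as exc:
        stale_errno = exc.errno

    os.close(r2)           # last reference gone, the read end really closes
    try:
        os.write(w, b"pong")
        write_errno = 0
    except OSError as exc:
        write_errno = exc.errno
    os.close(w)
    return data, stale_errno, write_errno
```

Closing the first descriptor changes nothing for the writer; only the second close produces EPIPE, while the stale number gives EBADF.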


But no other syscall interface presumes that you're directly twiddling kernel structs and I struggle to think of any other syscall that operates on fds, for which you need to keep in mind that it actually operates on internal kernel structs.

It would be trivial for the kernel to attach the epoll struct to the file descriptor and give you a close notification when the descriptor is closed (even if there are other open descriptors to the same file), but it's really hard to emulate that behaviour in userland, except this is the precise behaviour most people want! The only way to emulate this behaviour is to write a full abstraction over all file syscalls and make sure you never ever call them directly.


The entire point of my post was to say yes, all the other system calls do work with kernel structs internally once the system call passes into kernel land. Again, please do not forget a system call is a controlled exposure to kernel code -- it is not a library.

Every function call works the same way as epoll works once the fd is translated into an actual kernel structure. Conflating the fd with the file is where all assumptions go wrong. The fd is the means by which you tell epoll what file you want to be notified about. A file is NOT closed until there are no more references. This is simply how Linux works. To change that would be non-trivial and break years of software.

Let's explore your idea a bit more. If you were to want to know when a "fd" was closed, what would you be using that for? You really want to know when a file was closed right? If so, any code that you execute after you get the close on the fd would be wrong and cause even more surprising results. This is because the file is not actually closed. It is open. Whoever has the remaining handles can modify it while you think the file is closed -- that is a bug.


I would think you could use dup (or dup2/ioctl) to de-register (assuming nothing grabbed that FD)?


That could possibly work, but only if you dup() the same original FD. If you just try dup()ing any random FD the internal file description pointer won't match, so epoll won't recognize it.

But of course, if you have any threads then it's a big race condition.
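A minimal sketch of that dup2 trick, assuming Python on Linux, a single thread, and that nothing else has grabbed the old descriptor number in the meantime (otherwise dup2 would silently close whatever now lives there). The helper and demo names are hypothetical:

```python
import os
import select


def unregister_closed_fd(ep, closed_fd, surviving_fd):
    """Re-create closed_fd from surviving_fd just long enough to unregister it."""
    # closed_fd now names the same open file description again.
    os.dup2(surviving_fd, closed_fd)
    try:
        ep.unregister(closed_fd)
    finally:
        os.close(closed_fd)


def demo_unregister_after_close():
    """Return the events seen before and after applying the workaround."""
    r, w = os.pipe()
    keep = os.dup(r)
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        os.close(r)                # closed without deregistering
        os.write(w, b"x")
        before = ep.poll(0.5)      # still reports the closed descriptor
        unregister_closed_fd(ep, r, keep)
        after = ep.poll(0.1)       # nothing registered any more
    finally:
        ep.close()
        os.close(keep)
        os.close(w)
    return before, after
```

It works because epoll matches on the descriptor number together with the file description, which is also why a dup of some unrelated fd would not do.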



