I admit, sometimes when I read these articles, I have trouble understanding why these people are having so much difficulty getting it to work.
It's also hard for me to imagine a scenario where accept() takes longer than servicing a request and becomes a bottleneck. That is, why would you need multiple threads accepting on the same socket?
Author here. I showed an example - short-lived HTTP 1.0 connections. I haven't done benchmarks, but my hunch is that you can run accept() in the low tens of thousands of calls per second from one CPU. If you have more than, say, 10k qps, accept() might well be the bottleneck. You can imagine a box with 64 CPUs doing short-lived connections and being limited by accept() running on only one CPU, since it doesn't scale.
The second issue is cache locality. If you do accept() in one thread only, then you need to hand the newly accepted client socket over to another worker thread. Depending on the details this might not be efficient - aRFS comes to mind. (Frankly, epoll alone won't help here; you need SO_REUSEPORT with SO_INCOMING_CPU.)
Even if you don't agree that scaling out accept() is a real concern - that's missing the point. The point is: the epoll model should take this into account and at least support solving this problem, or loudly say that scaling out accept() with epoll is not possible. Neither happened. Up until kernel 4.5 it was impossible to do correctly, and this was undocumented; from 4.5 on you can use the EPOLLEXCLUSIVE flag, which I feel is a hack.