However, I will have to side with jacquesm in saying that regular internet servers will usually benefit from epoll. It is unfathomable that web servers have lots of active file descriptors especially considering the "keep-alive" and that even broadband latency is an eternity for modern cpus to wait. On the other hand though, using poll makes sense for a heavily trafficked file server.
If the ATR > 0.6, poll wins. If ATR < 0.6, epoll wins.
That means, there's potential gains to be had by using both, and at a minimum you can use both and it won't hurt you very much.
Additionally, Mongrel2 isn't a regular server. Remember all those pesky 0MQ backends? Those basically crank data fast as shit, so this research could potentially benefit them. It also points out potential flaws in previous analyses, so people can now potentially use ATR to tune their server's work loads appropriately.
But notice I keep saying potentially? So far I'm the only one offering everything up, all data, everything for anyone to prove me wrong, and I actually don't think I'm totally right.
I don't see very many other people who criticize the idea doing the same. Just more comments on HN.
If the ATR > 0.6, poll wins. If ATR < 0.6, epoll wins.
You left out the part where you accused people who advocate using epoll of being bullshitting touts who perpetuate fallacies:
"why the touted benefits of epoll and most of the information out there is mostly bullshit because of some falacies people seem to have about epoll (mostly perpetuated by epoll's proponents)."
Which you proceeded to back up with a strawman argument about claims of O(1) performance and a rant about how graphs that show that epoll accomplishes exactly what epoll is designed to accomplish "flat out get it wrong".
You're using "ATR" to find the inflection point in the performance curve from epoll's constant overhead per active fd, for the purpose of micro-optimizing Mongrel2. That's fine. But when you claim that people who just use epoll because it solves the problem they actually need solved are making decisions based on "bullshit", you shouldn't act so surprised that they don't respond positively.
I have evidence that contradicts both of the main assertions of epoll, so until someone comes up with counter evidence I can safely say that epoll information is bullshit, and my argument is nothing like a straw man.
So yes, it is bullshit, and if that offends you then awesome. Because nothing's worse than blindly believing some load of crap when you could have a better understanding of what you're doing.
The man page says nothing about O(1):
The original epoll writeup says nothing about O(1), and shows graphs explicitly for dead connection scenarios (iow, it never claims better performance per active fd):
The Dan Kegel writeup that's pretty much universally cited on this issue says nothing about O(1):
The link you threw up yesterday that "flat out gets it wrong" says nothing about O(1) and simply shows epoll doing exactly what it was designed to do (again, it never claims better performance per active fd), including graphs that actually do show poll outperforming epoll with small numbers of dead connections:
Can you show any authoritative sources who claim epoll is O(1)? If not, then yes, you built a fluffy little strawman and took him down hard.
I have no problem with your work on superpoll. I'm skeptical it will help much, guessing there will be situations it will hurt, and overall not convinced it's worth the effort to find out, but it's your time. Have fun with it, and I'm eager to see the results.
I just don't get why you feel the need to imply that everyone who ever used epoll without finding the ATR performance cut-off is a blind moron snowed into doing so by some shadowy malevolent epoll cabal spreading misinformation about how it works.
I'm fine with the amount of time taken scaling linearly with the amount of work to do, I'm not cool with it scaling linearly with the number of flowers in the garden. :-)
Only if your ATR crosses the .6 boundary frequently. If it consistently remains on either side, the performance would be worse at the cost of added complexity.
I agree with your assumption we need more instrumentation. I strongly suspect my ATRs are consistently closer to zero, but I never looked under the hood to check.
Also, the ATR should vary according to what you are serving. Longer streams would drive the ATR up while serving shorter bursts would drive it down.
Now I've got some evicence that epoll isn't faster, and in fact it's a wash at 40% "utilization", which is sort of ridiculous. I'd want people to go and test and see what they get, or at least know the implications so they pick wisely.
Ultimately though, I'd rather have the Linux kernel just fix epoll so it's always faster than poll.
Unless they break poll, that's very unlikely. The implementation of poll is much simpler and should be faster per fd. What could happen is that epoll gets some improvement that moves up the .6 boundary but I seriously doubt epoll could ever be faster than poll with 100% active fds - and I never encountered that kind of usage.
As with any optimization, we should be concerned with end result: how much slower does using epoll really makes the system. I am on the skeptical side here.
Just read() or write(), as if something told you they were ready! Your code has to be able to handle EAGAIN anyway, right? If you're at 1.0 ATR, you're saving the whole cost of poll or epoll!
I'm actually serious here, BTW. It would need some measurement to figure out the threshold at which this is okay, and if it's close enough to 0.6, I'm guessing you could go straight from epoll to nothing, and take the small (potential) performance hit between 0.6 and 0.X.
Hmm, I'm an idiot: this is basically epoll in edge-triggered mode.
Mongrel is probably not used by Joe the VPS user, serving 20 simultaneous connections per hour, because Joe is doing just fine with the Apache install that came with his slice. Mongrel is deployed for high-traffic file servers, as the current weapon of choice.
It's very likely that that will continue to be its niche (though it will expand out somewhat as larger shared-hosting providers start to default to it), so it's very likely to mostly see use in high-ATR deployment scenarios. Under that assumption, Zed's experiment is solid. I would still like to see a prologue tacked on about the composition of the Mongrel user base, though, if Zed has it :)
Edit: rephrased to... sound less like Zed?
This is probably true for most non-apache servers now, but I don't think it will necessarily be true for Mongrel2.
Mongrel2 gives us a new development model - http calls -> messages, language doesn't matter. I can easily mix python and c++/Haskell, for instance . It's rather different from what we've had before.
Mongrel isn't just a faster apache; people might use Mongrel2 for features rather than performance.
 Last time I tried to do this, I basically put a halfassed version of mongrel2 behind apache/django. I.e., apache/django handled http and sent messages to my c++ daemons. Fugly.
Re: > 60% active fd's being real-life or not, I don't really see what's the problem in wanting to handle that case in a project whose goal is basically to keep humming along nicely regardless of what you throw at it.
Then in the kernel you've got something that demultiplexes it back into N separate streams, which it accumulates in N different buffers.
The server software is aware of all N of these streams. Through some mechanism the kernel notifies the server software that data is available on some subset of the N buffers, and the server software has to ask for that data.
If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!
Why not get rid of some of this demultiplexing and re-multiplexing? Let me have a single file descriptor, and everything that comes in destined to my port comes through that single file descriptor, as a message with a header that tells what remote IP address and port it came from and the length, followed by that many bytes of data.
Instead of N separate buffers in the kernel, just have one big buffer per port. You'd still have to keep track of how many bytes in that buffer are associated with each remote connection, for purposes of dealing with TCP flow control, but I think this could be made to work reasonably (yes, I'm hand waving right now).
Now you don't need poll or epoll. Heck, you don't even need to use select. I'll just have a blocking thread whose job is to just read that one file descriptor using blocking reads in a loop to pull in data and put it on my queue.
Looking at this slightly different--everything destined for, say, port 80 is going to the web server, so why does the web server want this to come in through N different file descriptors? Especially when you consider that the code its going to run is the same no matter which it comes in on? All the multiple file descriptors are really accomplishing is providing a way to tell which bytes came from which client. Is that really the best use of file descriptors, as opposed to just having the kernel tell you which client the data came from when you do your read system call?
I can see that the N file descriptor model made sense in the days before threads, when to get concurrency you had to use multiple processes. You then need to have data coming in on port X from different clients not necessarily all going to the same process.
But for servers that are going to use the "single process with multiple threads" model to handle concurrency, a single stream network model would seem to be a lot nicer and efficient.
This would indeed make no sense were it not for the fact that only a subset of the data received on the network card is actually intended for your server app.
You want a single file descriptor on which you can read data only destined for your port. At that point the data must already be demultiplexed by the kernel or you'd get the data destined for all possible ports.
I'm not sure not demultiplexing data will win you that much because the kernel had to demultiplex it anyway regardless. But if you want to implement this and try it out, by all means go ahead.
Wait a minute, let me get this right... you want to write your own TCP congestion handling code?? Sounds like you're interested in raw sockets, which are already available.
maybe if you're trying to serve 50,000 video streams from a single machine or something, but then you'll probably hit the limitations of your connection before you see any CPU usage from the networking code.
This whole discussion, as I've said before, seems like a classic case of premature optimization.
Networking code doesn't eat CPU. It's negligible.
Concentrate on all the BS dynamic functional language database crap people seem to insist on using at the higher levels. That's where all the wastage is.
And that's what the kernel did for you all along. The abstraction of file descriptors for clients is a useful one, it allows you to decide what to do. The decision here is not 'do I want all this data in one stream' or 'do I want it demultiplexed', the kernel does a fairly good job of the demultiplexing that needs to be done anyway (re-assembling a stream of packets in to functioning connections is non-trivial) and it exports that interface to all user processes to avoid having to recreate that useful abstraction in those user processes.
It's possible there is some gain to be had by 'rolling your own' and by getting rid of the kernel abstraction by collapsing all these layers in to a single user mode program.
But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.
I once ran in to a situation like the one you describe. I was the programmer in charge of coding up a driver for a high speed serial card (high speed at the time, nowadays we'd laugh at it) doing X.25. I'd nicely laid it out as two separate pieces of software talking to each other using message passing on a weird but wonderful little os called QnX. One process to handle the data link layer, one to handle the X.25 protocol.
After a series of benchmarks with 8 such cards in a single ISA industrial enclosure, together with a (for the time) very expensive micronics 33 MHz motherboard I figured that maybe we can gain some performance by merging the two processes, to save on the IPC. The end result? Identical performance, in spite of the message passing overhead the bit-twiddling, checksumming and actual IO was such a large portion of the work done that the IPC didn't matter at all performance wise.
But it did matter to the code, the second version was decidedly harder to debug and maintain, and eventually that 'branch' was scrapped in favour of the one where the layers were clearly separated.
Poll and Epoll are communications mechanisms, they communicate with the kernel about the state of a bunch of fds. Both implement the same functionality (as long as you don't use the actual events of epoll) from a programmers point of view. Both have slightly different use cases, and for some situations it may be advantageous to use the one or the other. There may be a difference, but on the whole it will probably not be a very large one once you factor in all the other stuff that needs doing.
You're not understanding OP's proposal. He's not talking about implementing TCP in user-space. He said the kernel would still be handling TCP flow control, and would only be sending data destined for a specific port to this fd. So TCP is still in the kernel.
He's just proposing that the payloads for all connections come in on a single fd instead of one fd per connection. But each read from the fd tells you what remote host it came from.
If you processed incoming network data in strict FIFO order, a single host on your LAN could totally DoS your system by flanking you with incoming data. That host would never get TCP flow-controlled because if you're processing anything, you're processing his obnoxiously voluminous input.
If he was just on FD among many, you would service him once in a while, but then you would service other clients. While you're servicing other clients, he gets TCP flow-controlled, so he doesn't singlehandedly demand all your attention.
See lively discussion at http://news.ycombinator.com/item?id=1570694
Personally I think that's nonsense. I downloaded the code, checked it out, there is nothing in there that changes my mind on the whole thing, the meat is in the C code and it's the same as what was linked from the article. The rest is a wrapper script and some stuff to visualise the results, which are what he showed them to be. The premise that web servers have the majority of their fds active is where the issue lies.
FWIW I did some measuring and the results are pretty much in line with what I expected (though the spread is still puzzling, I really intend to find out the cause of that): http://news.ycombinator.com/item?id=1573145
However, I'm glad you finally went and got some metrics to show one webserver instead of the useless rhetoric you were spewing before. Not that it proves a damn thing since it has no external validity, but at least it's a step in the right direction.
I also say that your findings of 40% are much higher than what you implied in your previous comments. You made it seem like it's close to the 10% range, but 40% is damn near ready to leap over into the equal performance levels. It's interesting that even after that you still can't see the value of either making epoll faster or using both, but whatever, you can't convince every idiot on the internet, even with Science.
If I get around to it (I'm on a holiday right now) I'm going to instrument some more sites to measure their active-to-total ratios to see if there is some kind of number (or a set of them) that we might be able to agree on when it comes to real life web traffic. One of them is a filedump, which serves up large media files to clients, that's the use case that I can come up with in which your scheme might pay off, and in which it is possibly advantageous (even if it is a very small advantage) to use poll.
You keep 'challenging' people to come up with numbers. So, I did, from a real life webserver serving real customers, lots of them. If 60% is the cutoff point (and we're assuming that you are right about that) then 40% is comfortably under that, not 'much higher' than what I wrote in my previous comments, the majority (60 to 90%) of the fds is idle. So it's not in to 'equal performance levels' yet.
But as far as I'm concerned even that is probably moot. (But that needs more testing) because I suspect that even this fairly high traffic server spends only very small amount of time on polling the OS, and a much larger time on shovelling bits in and out of the system.
You keep using that word 'science' as though measuring anything at all is automatically science.
But what you're doing is more like selecting the data that fits your 'who needs epoll' and 'epoll is bad' hypothesis and rejecting out of hand any experience and / or experimental data that seems to prove you wrong.
That's not science. Science is to use all the data available to you, not just the juicy bits. Not just those bits that allow you to make bold claims about discoveries when in fact all you've done is to show that the underlying implementation of one systemcall is different and has different trade-offs than another, which may be specific to the way your system is set up.
I agree that to test something you need to reduce it to the minimum of code required to do your tests. But when you're going that route you are doing what so many other database, CPU and application benchmarkers forget. The real world is nothing like a benchmark.
You to suggest that this is 'not a localhost test' (http://news.ycombinator.com/item?id=1573195) and you say so pretty loud. That would mean that there may be some validity to your data that I have not been able to find to date in the stuff you've provided.
I'm challenging you to provide the data that apparently is still missing that proves that this was not 'a localhost test'.
And if you don't supply that data then I think you should take back (and try to do it gracriously for a change) that this wasn't a local host test, or you're going to have to come with some definition of what you consider a localhost test to be.
For me, that definition is simple, a localhost test is a benchmark that never generates any network traffic.
Feel free to disagree.
You may argue that nobody is making any such claim, but I can tell you that the sections of the blogosphere I skim via HN, reddit, etc. do imply it. Maybe the authors just assume everyone can infer that every decision has tradeoffs and there's no need to discuss them in depth for this particular case, or maybe—and the nature of the blogosphere leaves me inclined to think that this is the case—most of those authors never consciously thought about those tradeoffs because epoll happens to be a good fit for their particular problem. Either way, Zed is the first person I've seen trying to explicitly identify and quantify those trade-offs for poll vs. epoll. As such, I think he's doing the community a service.
If this research leads to a better implementation of epoll, or a server that requires less tuning to do well under a particular load, that is great for everyone. Even if it doesn't lead to any great improvements, a better-educated developer population that makes more informed decisions is still a net win for the world.
Now, there's an experiment to figure out what the threshold is, and frankly, it's a bit lower than I had hoped (I would have guessed 0.9, 0.8 at worst), which is exactly why having the actual data is awesome.
I think what he was saying in relation to it being not 'a localhost test' is that it isn't a test that touches the network stack, even to the loopback interface -- you seem to be equating localhost with anything that happens on a single local machine regardless of whether it has anything to do with the network or not. A difference in use of terminology, perhaps.
You don't measure a specific thing by inventing some full on "real world" evaluation. You test that specific thing. Why is it this simple concept seems to baffle so many coders?
I still of the opinion that a >0.6 ATR means you're fucked anyway (where processing all those fds can't be done as fast as data arrives), but I'm trying to come up with a good small test for that hypothesis (like you did).
Until I do, it's just an opinion/hypothesis, and it's kind of hard to argue about it strongly either way.
Basically, you're a FUD slinger who's got a beef and doesn't know enough to hold his own in a real discussion about measurement and analysis.