Hacker News new | past | comments | ask | show | jobs | submit login
Zed Shaw: "poll, epoll, science, and superpoll" with R (sheddingbikes.com)
227 points by tdmackey on Aug 3, 2010 | hide | past | favorite | 141 comments

In real-life web serving situations, and not in benchmarks, the majority of the fds is not active. It's the slow guys that kill you.

A client on a fast connection will come in and will pull the data as fast as the server can spit it out, keeping the process and the buffers occupied for the minimum amount of wall clock time and the number of times the 'poll' cycle is done is very small.

But the slowpokes, the ones on dial up and on congested lines will get you every time. They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently, first to see if they've finally completed sending you a request, then to see if they've finally received the last little bit of data that you sent them.

The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.

So let's take your assertions and take them apart:

> the ones on dial up and on congested lines will get you every time.

Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently

Again, you have no numbers on the active/total ratio in your server, so unless you do this statement doesn't refute what I found. I've presented evidence that just shows the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.

> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.

Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throws around latency numbers like those matter when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server every goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.

But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:

> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.

So, here are the numbers from one of the webservers that I instrumented to log the active-to-total ratio over a couple of hours.

The webserver is custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high traffic website, typically the requests are very short in nature (a 500 byte request followed by a < 10K answer).

After about two hours of running the active-to-total ratio varied between 10% to 40% for 5 minute intervals, with the majority of the 5 minute buckets around the 30% mark. I'm actually quite surprised at the spread.

The bigger portion of the time seems to be spent waiting for the clients to send the request, most if not all of the output data should fit in the TCP output buffers, so that actually skews the results upwards, for longer running requests sending more data to the clients the active-to-total ratios would probably be a bit lower.

So 10% to 40% of all the sockets were active at any given time, the rest was idle, waiting for data to be received or for buffer space to be freed up so data could be written.

In this situation epoll would be faster than poll because epoll only sends the user process those fds that it actually has to deal with rather than all of them, so the loop that takes the output of the system call will have less iterations.

So, as I wrote before, I think the typical web server is, when it is dealing with the client facing side more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.

This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.

Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.

If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.

Read "on dial-up" as "slow". The argument depends only on there being a certain distribution of client speeds. It's not about dial-up in particular.

And, if there's a distribution of speeds then you can measure the distribution and see what works best. Again, my challenge still stands:

Measure it or STFU.

You've already got a benchmarking system configured; why not benchmark using a congested (or at least, emulated) pipe?

It doesn't necessarily depend on dial-up, either. Imagine the number of people who leave bittorrent open in the background, stream porn, or whatever else that leaves their individual HTTP connections slow. Hell, latency alone (it takes at least a second for my connection to reach the east coast of the USA) would have an effect, and you can't underestimate the increasing number of mobile devices on slow(-ish, depending on congestion) 3G networks.

I'd provide statistics from my server (I serve an NZ gaming community), but I suspect my numbers would be disproportionate compared to the average workload. Here in NZ, we have far more people on crappy pipes (our DSL network is, famously, a gigantic pile of shit - although that has improved over the past couple of years and continues to), and far less people on smartphones (iPhones cost ~$800USD here).

Still, I believe the commenter has a point which you shouldn't ignore, or at least shouldn't pass off so easily :). I'd love to do some testing myself, but unfortunately between working a day-job, and spending my evenings trying to get a startup off the ground, I've got no time spare.

How about you get measurements of ATR from real-world deployments instead of the wild conjectures you've laid in this thread? Your challenge applies even moreso to yourself:

Measure it or GTFO.

Oh, you mean do what I'm already doing? Measuring and developing ideas then testing them?

It helps if you're going to comment that you actually read the words I use, not the ones you have in your head that make you sound like you're super smart.

Yes, you measured the ideal ATR inflection point for poll vs. epoll in your synthetic microbenchmark.

But you guessed wildly about what ATRs people see in the real world: http://news.ycombinator.com/item?id=1572292 http://news.ycombinator.com/item?id=1572418

Yeah, 'cause there's no way I'll be able to test a real web server that I actually wrote based on this small test. This is a small test to test one specific thing, doing more would confound the test. Confounding. Look it up.

Incidentally, this is the same test everyone else uses, so if you thought it was bullshit why did you support it when people testing epoll with it were using it? Oh, because they used it to confirm your bias rather than disagree with it.

I think there's less disagreement about the well-controlled result derived and more about whether

   a) Your controls were right, and
   b) What the most optimal decision is in light of this new information.
(a) is well known to be one of the most difficult parts of scientific reasoning and is almost always open to endless debate and improvement. In short, it's the question of whether ATR is a human-sensible metric. (b) however has an interesting direct answer: figure out the distribution of "live" ATRs on an interesting population of real servers and then, to borrow Eliezer's phrase, shut up and multiply.

If a lot of servers that you're targeting with M2 fall across that 60% divide (under circumstances similar to your controlled microbenchmark) then of course Superpoll is a good compromise.

Jacques is arguing a combination of (a) and (b). Perhaps ATR is not a sufficient metric to understand all interesting server loads. Moreover, perhaps many interesting servers live at really low or high ATRs all the time and so Superpoll must gracefully degrade to either poll or epoll.

In any case, driving for empirical data is noble, but possessing data is never sufficient to whitewall all detractors. It's really nice to have strong empirical support for the breakeven point between the two (ie. the ratio of their constant time components) via your benchmark, but science isn't just statistics.

(edit: I'll also add that pushing the pipetest microbenchmark past where people are usually making hyperbolic claims is a pretty big deal and a good catch.)

I think he was suggesting that you find the ATR on real-world production webservers, which has nothing to do with your test and would not confound it. Maybe in practice nobody sees an ATR > 30%, in which case your test, if correct, is irrelevant. What sort of loads produce an ATR > 60%? How common are they?

One problem with getting such numbers is that real world, the load is given, not the ATR, and the ATR for a given load may depend on implementation choices. Total connections is a function of server latency, and on a loaded system, latency can be a function of polling method. So switching methods may change your ATR.

On the other hand maybe ATR for a given load doesn't change significantly depending on implementation. Your test found what is best for a given ATR, but not necessarily for a given load, or how ATR depends on load for a given implementation. Depending on the results, you may want to add some hysteresis to superpoll.

If only you practiced what you preached. Imagine the amount of self-righteous bile that your servers wouldn't serve.

I am practicing what I preach, you stupid 20%-er troll. That's why the whole blog post is full of measurements, testing hypotheses, and the assumption that I might be wrong.

Because unlike you, I actually go do shit rather than spout off in a comment thread.

You may think you are. Far out. Why you think you can violate basic social norms, while others apparently support it, is beyond me.

Basic social norms also involve not saying ruthless comments you wouldn't say to my face.

> Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.

I agree, but "high enough" is apparently just 60%. The standing question is, what's the actual level in different kinds of servers?

In other words, I've given a metsric, ATR at 60% is the break even point for poll vs. epoll. So far the only responses I've got haven't even tried to give out a metric, let alone say what their actual ATR is but they claim that it's low.

I'm a scientist, so in the same way I don't believe my own research, I don't believe their rhetoric.

> I'm a scientist

I've never come across a scientist that took criticism of their work the way you do and that responded in the way you do. Shouting down, deriding, insulting and in general being a jerk to those that don't agree with you because 'you're a scientist' is not the way of science.

I dunno, that is how a lot of scientists I know interact with people. :/

That's pretty sad.

I'm shouting you down because you're a FUD slinging troll. Very first thing you did was immediately reply to every branch of the comments with your agenda. I actually have no idea what your problem is, since I'm just presenting some information and working on my own software with it, but you've got some weird "epoll religion" you like to spread.

So, consider me the Richard Dawkins of epoll.

Zed, you might reconsider your social strategy. Your article lays out a brilliant theoretical approach, but it's hard figure out how to apply it without more real-world data. Pragmatically, it seems like your goal should be to encourage others who are in a position to gather this data to do so, and to share it with you.

It looks like Jacques looks has a pretty good start at making these measurements: http://news.ycombinator.com/item?id=1573145. If the numbers he provides aren't helpful, or aren't complete, you might try encouraging him to fix them. Calling him a "FUD slinging troll" seems more likely to cause him to tune out and ignore you, to the detriment of us all.

Realize that you've been thinking about this problem for a while now, while others have just started their thought process. Your goal is to get them up to speed so they can move your argument further, but this won't be instantaneous. Treating them as potential allies during this formative stage might pay off. If you can hold off with the insults for a couple hours or even days, you might get better results than if you shout them down immediately. :)

  Very first thing you did was immediately reply to every branch of the
  comments with your agenda.
You're suffering from paranoia. If you post an article about security, you can bet tptacek is all over the comments, informing and correcting people. In this case, the article was about something jacquesm happens to know a bit about, so he participates actively in the comments. To suggest he is pushing an 'agenda' is ridiculous: there's nothing at stake for him. The only thing he tries to do is help you, by noting that he thinks you have overlooked something.

  I actually have no idea what your problem is [..]
That's because he doesn't have a problem: it's your mind that's filling in the blanks. It suggests that while writing the article, you were already sure people would challenge you based on 'religious conviction' instead of on fact. jacquesm's point was a simple, critical question: what are actual real-world ATR's for the servers that Mongrel2 should be able to replace?

Allow me to make an observation of a psychological nature: you are thoroughly miffed that it was so easy for someone to provide possibly devastating criticism to an idea about which you started caring WAY to much. What you should realize it nobody thinks lesser of you because of that criticism: the article is still interesting and provides a sound basis. There is no reason to react in such a aggressive way; it's even counterproductive.

Yeah, but there's a fetishization of "high concurrency" (being able to support a huge number of connections) rather than absolute performance.

For instance, you might have a system which has a latency of 1 second, and at a given workload, you have 10,000 connections. In the Java culture, people think you're a genius if you can increase those connections to 100,000 and increase the latency to 10 seconds.

End users, on the other hand, would be happier if you cut the latency to 0.1 seconds, but there are a lot of people who'll then think you're a loser who can only manage to handle 1000 concurrent connections.

Of course, getting that latency down is a holistic process that requires you to think about the client, the server, and what exactly goes over the wire.

If you could increase the number of connections to 100,000 you would indeed be a genius because when you bind to a network interface using IPV4 there is a hard limit of the short integer used to indicate the port number which automatically limits you to 65536 connections (actually a few less, usually you'll lose 3 for stdin,stdout and stderr (which you can close to reuse them) and one for the listen socket).

As far as I know the only way around this is to use multiple IPS (possibly aliases on the same interface) but that would still require a new process.

So even if your per-process limit for fds can be larger than 64K the network layer or the mapper that turns fds in to socket ids for the network stack to work with may impose a restriction. I don't know enough about the linux kernel to figure out what exactly causes this.

I use the 64K limit on some high throughput machines (mostly video and image servers), but when I go over that I need to start another process. Possibly there's a way around that but the expense of another process is fairly small so I haven't put in much time to see if I can work around that. Socket to fd mapping presumably takes in to account the address as well as the port so it shouldnt't be a problem but on the kernel of the machines where I have to resort to these tricks it appears to be a limit.

Maybe someone with more knowledge of the guts of the linux kernel can point out why this happens.

TCP connections are identified by the (src ip, src port, dest ip, dest port) tuple. The server only needs one port. So theoretically a server can handle 64k connections per client.

You can see this in the 1M connection test done here: http://www.metabrew.com/article/a-million-user-comet-applica... Look at the "Turning it up to 1 Million" section where he details the need to use 17 IPs for the client side.

Yeah, and that's on the client side, as is indicated by the first sentence of that section:

Creating a million tcp connections from one host is non-trivial.

The key words being "from one host". With a single client machine connecting to a single server endpoint, the (src ip, src port, dest ip, dest port) is reduced to being unique only on src port (from the client's perspective), so that's where the 65k limit, and the need for more IPs to do that, comes from. Using multiple source IPs on the same machine is like using multiple client hosts.

...using IPV4 there is a hard limit of the short integer used to indicate the port number which automatically limits you to 65536 connections (actually a few less, usually you'll lose 3 for stdin,stdout and stderr (which you can close to reuse them) and one for the listen socket).

The file descriptor limit is independent of the 65k total possible source ports. The source port limit is part of TCP/UDP. The file descriptor limit is set by ulimit (nofile in limits.conf) on a per-process basis and in /proc for system-wide. If you need more file descriptors, you can reuse 0, 1 and 2, but that's going to free up some ports so a single process can make more connections to the same server endpoint.

Now that is a test. Thanks for posting that, it is the most interesting thing I've seen all day.

But, a server can have multiple IPS, so a server should be able to handle more than 64K connections from multiple clients without a problem. In practice there appears to be some kind of limit.

The server doesn't need multiple IPs to handle > 65535 connections. All the server connections to a given IP are to the same port. For a given client, the unique key for an http connection is (client-ip, PORT, server-ip, 80). The only number that can vary is PORT, and that's a value on the client. So, the client is limited to 65535 connections to the server. But, a second client could also have another 65K connections to the same server-ip:port.

edit: You may be limited by number of open sockets or file handles. It's likely a per-process limit. Google or some linux guru could help you track down what limit it actually is, but it's not the number of server ports available. It might be a number you could raise.

Right, that makes perfect sense. But it really makes me wonder why I run in to that hard limit, I've tried just about everything to get around it and no matter what I do that seems to be the magic number.

I should go and do some testing to see what's causing this, you make me feel like the solution is right around the corner.

re. your edit, ulimit will happily raise the number > 64K, all the /proc/* settings seem to be ok so that's not it, it has to be some other layer in the stack that causes this. I'll definitely spend some time on this, it's been bugging me for a long time.

edit2: there seems to be a max_user_watches upper limit to what epoll will handle.

For simple testing purposes, it is easy to set up a forwarding proxy that drops n% of the packets it receives - for some high value of n. The World Wide Web is far more sadistic, but it still uncovers some performance or usability problems that are invisible over normal `localhost` traffic. I bet you can also use web servers with traffic shaping to mimic lots of slow connections at once, but I haven't tried that

Mongrel2 is supposed to handle WebSockets as well as HTTP, so I think open connections with sporadic traffic are a use case Zed has to worry about.

Zed isn't the only one who has found epoll to be slower than poll. The author of libev basically says the same thing. See http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod and search for EVBACKEND_EPOLL.

I wonder how kqueue behaves compares to poll and epoll. Kqueue has a less stupid interface because it allows you to perform batch updates with a single syscall.

It is worth pointing out that the original epoll benchmarks were focused on how performance scaled with the number of dead connections, not performance in general:


And as jacquesm points out, in a web-facing server, that's the case you should care about. A 15-20% performance hit in a situation a web-facing server is never going to see doesn't matter when you consider that the 'faster' method is 80% slower (or worse) in lots of real world scenarios.

I'll be interested to see how the superpoll approach ends up working, but my first impression is 'more complexity, not much more benefit'.

> And as jacquesm points out, in a web-facing server, that's the case you should care about.

Yes, but where's the evidence what people see for active/total ratios in the real world? I'm showing that unless it's below about 60% (probably more like 50%) then poll is the way to go.

60% active isn't entirely unrealistic at all. I can see quite a few servers hitting those thresholds, so in that cases, poll vs. epoll doesn't matter.

I think what's more important in what I'm finding is that you really need both. It's entirely possible that you have servers that are at 80-90% ATR all the time. Others that are 10% ATR. The key is either you have to measure that, which nobody does, or you have to make a server that can adapt.

> but where's the evidence what people see for active/total ratios in the real world?

Yes Zed, where the fuck is it? You're claiming SCIENCE! based on your worst-case synthetic localhost benchmarks, and then turning around and wildly guessing as to real-world performance characteristics with internet latencies.

Worse, your whole thesis hinges off of ATR but you made no effort to measure it anywhere, instead you're passive-aggressively berating us to do it.

Wow here we are again, you not reading my article. I ran the same test that everyone else runs for poll vs. epoll, then used R to craft graphs and tested hypothesis. It was not a localhost test.

So far all you've got is trolling HN comments. YOU WIN!

Pipes, localhost, who is counting, as far as I'm concerned that's the same thing, making it seem as if for the purpose of this test that's a significant difference is simply conversational trickery.

If you have tested this on real live servers then there is no evidence of that in your posting, and to suggest that this:


is anything but a localhost test is simply bogus.

The only use case where you may be right that poll is advantageous as far as I can see is streaming media servers (video, audio, other large files), image servers are the ones with the worst active-to-total ratios, especially if the images are small. I should know, I only serve up a few billion of them every day. A few years ago or so I was stupid enough to think that video was hard, man was I ever wrong. Repeated connections to the same host, that's a much bigger killer than pumping bits.

Testing this on real live servers is confounding. Man you guys really don't get this. If you want to test epoll and poll over file descriptors you test that. You don't test a billion other things in a network server. That confounds your results.

But what's really amazing is this is the test the proponents epoll have been using for 8 years. Where was your objection back when they were using it for that?

That you didn't say anything about the machines you ran it on was a huge red flag. It looks like jacquesm has you nailed on the localhost testing, I hadn't noticed the link to the code before.

I'll trust that you accurately measured the ATR boundary between poll and epoll in your specific synthetic benchmark. That's then completely undermined by your handwaving in this thread about what ATR looks like in the wild, and the lack of any way for us to relate your microbenchmark with the real world.

The original article had the link to the code so don't even try to imply that I was hiding jack squat. He hasn't nailed shit, it's the same test epoll proponents have used for years. It's not a "localhost" test, it's a test that makes a bunch of file descriptors and then compares the poll vs. epoll performance as active vs. total changes.

But hey, you can live in your own little fantasy world where you think you've won some kind of battle of the HN because you listened to some epoll fanatic weirdo and cheered him on.

It's entirely possible that you have servers that are at 80-90% ATR all the time

I'd be curious if you have any evidence that this occurs in practice. Even a busy server with clients of uniform + low latency, intuitively I'd expect fairly low ATRs.

I think what's more important in what I'm finding is that you really need both.

I'm not sure you do: the performance advantage of poll seems marginal at best. When ATR is high, you're presumably doing enough real work that the slight overhead of epoll vs. poll is probably not super important.

This is the point where talking about it does nothing. Go measure it like I have. In fact, I'll give you your hypothesis to test:

"There are no servers that have an ATR of > 80%."

That's easy to test, and I'm damn positive you could find some that disprove your assertion.

More importantly though, you have this assertion:

"Using both poll and epoll has no advantage in performance."

Again, who knows, that's why I'm testing and trying out. That's the science part, since I've got no idea, but I'll give it a shot. And now that I've done an analysis that tells me what really matters, I'll be able to do very good tests for the different kinds of loads.

Incidentally, when people run performance tests against web servers to see how fast they serve files they're testing the server with an ATR at around 100%. Food for thought.

You have measured nothing about real world workloads. You have applied completely artificial benchmarks and formed some possible conclusions from them that mean absolutely nothing until you demonstrate that this is a real problem in the real world. There's ample evidence to believe that FDs spend the majority of their existence idle--between HTTP keepalive, processing time for the queries themselves, network bandwidth and the generally sparse nature of HTTP traffic, it is extremely plausible that the ATR, as you call it, is well below 60% almost all the time. Your numbers demonstrate that there is a tradeoff between epoll and poll, which is of some interest, but unless you actually measure the ATR of a real site you're shooting off your mouth.

I cannot collect this data because I don't possess a sufficiently high load web server. Go forth and measure, but measure useful information.

I used the same test everyone has used for the last 8 years to compare poll vs. epoll performance. I also ran it in order to test a hypothesis then assumed I was wrong then ran it more and presented the information openly so others could try it.

Of course I'm going to keep doing this, but if you say that my test is invalid, then all of the tests people did to justify epoll are invalid.

The tests are perfectly valid, it's the conclusion that is wrong. Of course there is a trade-off, of course you'll find a cross-over point because there is a trade off. The fact that the crossover point is at roughly 60% for your kernel config and machine setup is essentially meaningless, other than that it it at least falls within the expected range (50 to 80% or so).

Your hypothesis is that web servers have the majority of their fds active most of the time, and that's where the problem lies. I've put up the numbers elsewhere in this thread, feel free to do some measurement on your own high traffic websites to come up with more data points.

"Your hypothesis is that web servers have the majority of their fds active most of the time,"

Aha! Totally wrong. My hypothesis has not been that at all. You totally didn't even understand the hypothesis, and I stated it very clearly. My hypothesis has always been:

epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

Nowhere in there do I say that all web servers have the majority of their FDs active. NOWHERE. I say some might, I say who knows, I say we need to go measure, but nowhere do I say anything like what you say.

  My hypothesis has always been:

  epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is
  faster than epoll when the active/total ratio is > 0.6.
That wasn't your hypothesis: that was your intermediate conclusion after the first tests. Your hypothesis was: the common knowledge that epoll yields better performance, and that I should obviously use epoll for Mongrel2, is wrong. [1]

It's obvious he didn't mean 'hypothesis', but 'implicit assumption'. If you implement superpoll, you implicitly assume it will be useful. It will only be useful if actually deployed Mongrel2 servers will have an ATR > 0.6 at least some of the time.

[1] You can replace 'is' by 'may be', if you feel the strong version puts words in your mouth. It doesn't, because the hypothesis for an experiment may also be "The half life of protons is shorter than a trillion years", when I expect to reject that hypothesis. It's not an assertion of your opinion, but a statement of a fact you intend to accept or reject based on the outcome of the experiment.

IE will keep a connection open for about 60 seconds, how much of that's going to be active? I don't have the exact number, and of course it will vary, but of course it's going to be far less than 60% in the vast majority of cases.

If a site gets spiked with the typical 'read-and-leave' traffic a link from reddit or huffpo or wherever generates, how does superpoll compare to straight epoll? Based on your description so far, I can only see it hurting - you're not just wasting time on dead connections in your poll bin, you're now also incurring the overhead of managing the migration over to the epoll bin.

Pardon my ignorance, I haven't built high performance servers at this low a level, but I'm intrigued:

What exactly is the definition of an "active" file descriptor in this context?

My best guess after reading the man pages is that poll() takes an array of file descriptors to monitor and sets flags in the relevant array entries, which your code then needs to scan linearly for changes, whereas epoll_wait() gives you an array of events, thus avoiding checking file descriptors which haven't received any events. Active file descriptors would therefore be those that did indeed receive an event during the call.

EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely missed that paragraph in the article, which makes the following paragraph redundant.

If this is correct, it sounds to me (naive as I am) as if some kind of hybrid approach would be the most efficient: stuff the idling/lagging connections into an epoll pool and add the pool's file descriptor to the array of "live" connections you use with poll(). That of course assumes you can identify a set of fds which are indeed most active.

An active file descriptor is one that you can read from or write to without blocking or getting EAGAIN as error. The whole point of poll/epoll/kqueue/select is to figure out which file descriptors are in such a state.

The difference between poll and epoll is that, given an input of N file descriptors, poll returns all N file descriptors and you need to loop through each one of them to check whether the 'active' flag is set on there. epoll just returns all the active file descriptors so that you don't need to loop through the inactive ones.

A hybrid approach, as Zed has suggested, would appear to be more efficient on the surface. It remains to be seen whether it can actually be implemented efficiently because migrating fds from/to epoll is extremely expensive, requiring a single syscall per fd.

But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches. Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.

I totally absolutely agree they should fix epoll, but the way they've designed I don't see it happening. Of course they could fix the call for doing the actual select and make it at least as fast as poll, but the fact that you have to do a syscall for every file descriptor is idiotic.

Some people actually did fix epoll, benchmarked the results and concluded it wasn't worth it.


Thanks for the detailed explanation. Sounds like I was at least on the right track.

But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches.

That does indeed sound like a better conclusion.

Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.

I don't see a reason why the syscall-per-fd couldn't easily be replaced/augmented with a single mass add/remove syscall which takes an array. The worse performance seems similarly baffling; it almost sounds as if they had some kind of inefficient data structure holding the file descriptor pool; considering poll() uses a flat array and epoll uses set operations I assume it's pretty tricky to make it perform well, even with a hash table. Maybe set operations aren't the best way to handle this data structure; but only some profiling in the kernel code can tell us that.

Obviously it'll take until 2.6.37 at least for any changes to enter the mainstream kernel, and until then a hybrid approach sounds sensible for those unwilling to patch. But still, fixing the root problem seems like a worthwhile cause.

I highly doubt that in production it will make any difference at all. In the end it is not the poll/epoll overhead that determines your overall throughput. If you call poll/epoll more frequently than you should then it starts to add up but in reality one call per several thousand file system operations doing real IO is not going to make a big difference.

Of course all the little bits help and I'm happy to see someone pay attention to detail like this but normally speaking you should get to the point where you're shifting data in real life situations and you can hook up a profiler to make the decision. You have less to blog about like that but the difference between poll and epoll is not large enough that you would spend more time going from the one to the other than was spent analysing this and writing the post.

Optimisations like this are best left to when you have things working, first make it work, then make it fast.

I keep seeing this...misconception? I don't know what to call it, over most comments. The article doesn't talk about optimizations. It talks about design decisions. Someone actually took the time to sit down, ponder what kind of workloads will be handled by his application, came across an interesting dilemma, measured and tested them both, and finally reached an educated conclusion.

That right there is how you correctly choose how to implement things, contrary to popular belief. Think first, write code later.

  Someone actually took the time to sit down, ponder what kind of
  workloads will be handled by his application
The most important point made in this thread is that Zed actually didn't do that, but did benchmarks for a range of workloads, suggesting he intends Mongrel2 to be able to handle all of them. The question is whether that is necessary. If ATR > 0.6 never happens in practice, it will only unnecessarily complicate Mongrel2.

Writing the "which fds are ready" check twice would be crazy, if neither version were markedly more efficient than the other. Optimization is the only reason to make this particular design decision, and measurement is definitely the right way to go about it.

what good thinking gives you when your assumptions are wrong?

You again? Man you got some beef.

Anyway, we do have something working. You can deploy Mongrel2 right now, and I do all the time. You can actually measure how fast it is, although that's going to be slow since we haven't done much tuning so kind of pointless.

So, again you're wrong. We are at a point in the design where this measurement and analysis matters, and we have something that actually works to put it in. You should probably maybe go do some actual reading instead of posting here like you know what you're talking about.

Absolutely, although I can see how it makes sense to look into this stuff for Mongrel2 which seems to do little processing of its own, let alone disk I/O and mostly acts as a kind of demultiplexer.

It's actually a really simple concept what's "active" in poll vs. epoll. Your call to poll and epoll basically looks like this:

active_fds = poll(big_ass_array_of_fds, total_fds)

epoll is slightly different but same concept. You have a total number of FDs you're want to know about, and each call returns a number that have had activity.

And that's it. You then just do active_fds/total_fds and that gives the ATR. If this is < 0.6 after your call to poll, then that call to poll would have been better done with epoll. If the active_fd/total_fd is > 0.6 then it's better to stick with poll.

Of course, it's more complicated than that, but this gives you a simple metric of the break point where one is better than another.

An active file descriptor is a filedescriptor that you want to read from that has data available and one that you want to write to that has buffer space available.

To put it in another way, if you were to use blocking IO then an operation on an active descriptor would not block. Of course poll and epoll are all about asynchronous IO (so non-blocking by definition) but that's a good way to describe the difference.

Zed's 'superpoll' is precisely what you suggest.

Thanks for the explanation, I didn't think of the part about the socket being free for writing vs. whether there was data available.

Zed's 'superpoll' is precisely what you suggest.

Facepalm. Thanks, I mysteriously missed that part of the article.


You know, I don't really like this sort of comment. Someone goes to great effort to empirically verify something, something that is against common wisdom, then someone gets to just float in with 20/20 hindsight and say "Well, yeah, duh."

If it was so "yeah, duh", where's comment your about how blindingly obvious it was that common wisdom was wrong that dates from before somebody demonstrated it with concrete data and large test runs?

All you'd have to do to improve your comment is remove the last "Stop the presses" snark.

Sounds like premature optimization to me. Is this really the bottleneck? Is the extra complexity and logic really going to be a net win?

A conclusion reached by measurement is not premature. This looks like an attempt to write a better server than the 80/20 rule allows. If he's wrong and only one polling method is useful in production, the live servers will pick the good one and nobody will suffer because he jumped to conclusions. Since he's written Mongrel, I trust that he has a reason to worry about polling that may not have appeared in the post

> A conclusion reached by measurement is not premature.

That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program.

By doing these tests in isolation instead of while running on a profiling kernel under production load it is very well possible that the bottleneck will not be the polling code at all but something entirely different. I'd say that this is a textbook example of what premature optimisation is all about.

Assuming you have a finite budget of time to spend on a project any optimisations done that take time out of that budget that could have been spent more effective elsewhere is premature.

Now there is a chance that this would have been the bottleneck in the completed system, but before you've got a complete system you can't really tell. My guess based on real world experience with lots of system level code that used both, including web servers, video servers, streaming audio servers and so on is that the overhead of poll/epoll will be relatively minor compared to other parts of the code and the massive amount of IO that typically follows a poll or an epoll call.

If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, you'll then be doing IO on all of those for that single call to poll/epoll.

Each of those IO calls is probably going to be as much or more work to process than the poll call was.

Again with this idea that Mongrel2 isn't working. You sir have no freaking idea what you're talking about.

"That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program."

No, that's just plain wrong. Premature optimisation is actually implementing something convoluted thinking it's optimized without knowing whether it actually is or not. It's voodoo cargo cult science. It's going against occam's razor.

There's nothing in Mongrel2 that's premature optimized. It's all very simple algorithms chosen for the right task, and later on I'll be testing them to see if they're still right. So your claim that this is premature optimization is just a buzzword and completely offensive. I took a long time to actually test my ideas before implementing them.

That's the total inverse of premature optimization.

What is total voodoo junk science is most of what you say. So far I haven't seen one set of data or any scientific experimental design or even a single testable hypothesis backing what you claim. It's all just rhetoric.

Until you've got hard numbers backing what you say, everything you're saying is inferior to what I'm doing: science.

So, if it's working why not throw a load of real world traffic at it in stead of this 'science' that you're performing here ?

After all, that is where the rubber meets the road and it would be a very easy way to determine if your hunch is right or not.

Epoll was specifically created with that sort of workloads in mind, your 'surprising' conclusion is not rooted in the fact that epoll is somehow behaving in a way that is contrary to expectation, in fact it behaves exactly as it is designed to do.

Benchmarking it like this is nothing like the real world, and that's where epoll shines, not when you test it the way you just did.

As for the numbers, we're serving about 10Gbps continuously using a combination of varnish and java code to several million uniques daily, html, images, video. Poll over epoll is a run race, as far as I'm concerned you're wasting your time with this.

But by all means, ignore all this and do what you have to, those are the lessons learned best anyway, and it's your time, not mine.

If you feel like getting another view on this I'd suggest to contact the author of Varnish, he really knows his stuff and he might be able to convince you where I can not.

Because "real world traffic" is a bullshit test. Who's real world traffic? Mine? Yours? Google's? Yahoo's? Science is repeatable, which means you can replicate the same inputs to the test every time. The lab experiments involved in science are often highly sanitized and not "real world" at all. Determining what the results mean in the real world comes later.

If you're going to complain about science, at least understand how it works.

There are basically three sorts of traffic that I have experience with and would expect to be the major portion of whatever goes over the web:

- underwater ajax requests

- regular website content (images, dynamic html, css and other relatively small (say < 250K) files)

- media servers (filedumps, video servers, streaming audio servers)

Each of those requires fairly specific tuning of the TCP stack to get the most out of it, so you're not likely going to find all of these on one and the same machine unless it is a small operation (and in that case this whole discussion is moot).

A benchmark done in isolation is meaningless because in the end, real world traffic is what it is all about. So, I personally don't care whose site(s) you test with, as long as there are enough of them to get a statistically valid result.

Google's or Yahoo's would be fine with me, I've given my results above, if I have the time I'll do the same thing on a couple of other high volume sites.

I've (unfortunately) studied this problem quite a bit because of the size of the websites that I'm involved with and so far I've learned that you can play around on your testbench all day long it doesn't matter one little bit for production purposes unless you are very careful (such as in that other test linked from this page) to simulate users clients.

You could do a lot worse than to play back a log file in order to make an experiment repeatable. I assume that real world performance is what Zed is after, not theoretical performance.

You know what I find troubling about your behavior on this thread? It's just so weirdly manipulative. I gotta think you have like 40% stock invested in Epoll, Inc. or something. You make wild claims about what I'm saying that aren't true, you imply that I know nothing of real world performance when I've written some pretty bad ass real world software. You reply to every single thread with a constant stream of FUD and nitpicking everything you can then blowing it out of proportion.

I mean, are you sure you didn't used to work for Microsoft and then got hired by Linus to work your FUD spreading magic?

  Because "real world traffic" is a bullshit test [..]
The hell it is. A statistically significant sample of 'real world ...' is the foundation for most engineering decisions. When you build a bridge, you take the actual loads it has to support into account. Intel bases their chipbaking on the actual purity of the silicon their suppliers can provide.

Exactly, there's no concept of confounding at all. You use "real world tests" (whatever the hell that is) when you have an actual specific setup to test. You use a small model experiment like this to test one specific thing like poll vs. epoll.

I think what he's saying is that it should be pretty easy to change between poll, epoll, and "superpoll", as you can easily abstract a common interface. Then, you could worry about the performance of that particular bit only when you encounter it, and use your time arguably more effectively on actual feature needed to productionize Mongrel2 better.

I sort of agree with him, except with an important detail: this question is basically about prioritization of your time, and I'd say that this is nobody's business but yours! You can optimize memcpy() all day, for all I care. ;-)

There's one aspect where you'd be quite right to do some investigation before implementing: if the outcome changes the interface you'd need to implement.

For example, here's an hypothesis: using epoll's edge-triggered mode could drastically reduce the number of events (since you only get an event when an fd becomes readable/writable the first time, instead of every time it's in that state). Since epoll is O(N) on the number of events returned (not on the number of fds that are currently readable/writable), you'd lower the effective ATR a whole lot. In fact, a really busy server would have fewer events, since a readable fd would stay that way for longer if data is received at a great rate (the write-side story might be less brilliant). You'd also have to do much fewer calls to epoll_ctl, since you could just stop caring about the reading side while you're trying to write the last batch of data on the other side (no need to remove it from the interest set, you won't get events for it). You only need to set it when flipping from read to write, and the other way (after receiving/sending headers and bodies).

Now, if that's true, that's a big deal, because now you have to change your design a fair bit. You have to remember that an fd is readable until you get EAGAIN from read() yourself, so there's some more state management, moving that fd from one list to another, etc. Finding out that this would be a million times faster (or slower!) now would save you a ton of work, either way.

But finding whether poll or epoll is faster, or an hybrid solution with the same interface? Meh, it could wait.

(about my hypothesis, that's actually what Davide Libenzi designed epoll for, which might explain some of the weirder bits)

"I need polling" => "Here are my options, which one is better?" => "They're good for different things" => "I'll pick the best one for the environment" is a reasonable design process. More so than some decisions that I make! Yes, there's a fixed time budget. But you're suggesting selectively ignoring evidence when designing a program, preferring random guessing and pattern matching to actual numbers. Should he have collected them to begin with? Maybe not, but sometimes you can't help your curiosity on a hobby project :). I understand your concern about this hypothetical production system, but the fact of the matter is that there is no production system right now, and no way to measure how it will handle certain things, but there are benchmark numbers. Better than nothing, I say!

> => "I'll pick the best one for the environment"

Which, in reality is, "I'll spend a lot of design and implementation effort designing a new one which may or may not improve the measurable, global performance of my new web server because it's not yet at the point where I can benchmark these sorts of things to verify that I'm not wasting a whole ton of effort that could be better spent by deciding that epoll is fast enough."

Maybe Zed knows from his previous server experience that {e}poll is where he hits a bottleneck; it's just that if there's any chance that it's not, he could be wasting a bunch of time implementing "superpoll".

(Or maybe he just wants to do it because it's neat, or because it's innovative (which it is), or for any number of other reasons. I'm just pointing out that he's doing much more than picking "the best one for the environment")

It's an idea I had after actually measuring. If it doesn't work then I tried something out.

What you really should be getting from it though is that epoll is not faster. It is not O(1). It is not faster on smaller vs. larger lists of FDs. Pretty much all the things you were told as advantages of epoll are total crap.

The only advantage of epoll is it's O(N=active) when poll is O(N=total). That's it.

So at a minimum I've done some education and spent some time learning something.

I tried to make sure I gave lots of reasons that justify your work; I do think it's cool.

I just wanted to say that it is not an unquestionable design decision.

Rock on with the superpoll, I hope it's awesome and very successful.

... or it's good "PR" so people get behind the project. It screams "I know what I'm doing!!! I'm even optimizing this!!!". ;)

"I need a filesystem" => "Here are my options, which one is better?" => "They're good for different things"...


For most things, it doesn't matter. A filesystem is a filesystem.

You only need to make decisions like that when you properly measure and decide that X may be a bottleneck, or you need features that Y has but X doesn't.

In nearly every large system I've been involved in designing, a "filesystem is a filesystem" would give you a great chance of rendering the system non-functional or performing so bad it might as well be non-functional when making simple design decisions based on knowledge about the problem made it easy to avoid in the first place.

Premature optimization is a sin, but making design decisions you know from experience has a major impact is not, as long as you measure to confirm afterwards.

We don't design by throwing dice - most decisions we make are based on experience or assumptions about what works and what doesn't. Measurements are important to challenge those assumptions, but it doesn't take away the value of making use of experience to create a reasonable baseline.

Where "premature optimization" comes in is where you start expending unnecessary extra effort to implement a more complicated solution without evidence to back up the need for it.

Spending a little bit of time to think through the requirements for major aspects of your system is not extra effort.

The funny thing is that he's either going to pick the wrong solution for the workload or spend a lot of time on creating a hybrid which will work as good as epoll would without augmentation.

Zed is clearly out to change the world and I would very much like him to succeed but he just seems to be missing the obvious here, which is that idle connections are the ones to worry about (because they're very expensive!) so his benchmark at this point in time is useless.

First off, again you miss the point that I've already got poll working in Mongrel2. If this fails (which I'll know since I like, measure stuff rather than care a bottle of cheetah blood around like you) then I'll just go put epoll in there or leave it with poll.

But hey, if I don't make it back alive from my complete dangerous experiment in disproving that epoll is always the way to go always you can come get me. Bring a big gun because this stuff is so scawy and howwible I might not make out alive.

If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, ...<snip>

That's half true - it doesn't hold for low ATR traffic (lots of hanging connections, clients that GET something, spend time elsewhere while in the meantime the browser keeps the connection alive). In short, there's nothing typical about it because, while those two kinds of loads have been studied extensively in both bibliography and practice, their combination and the practical consequences are not well understood, afaik. Links to relevant studies are more than welcome, of course.

Large in an absolute sense, say 3K for a pool of 10K sockets. That's a sizeable number of active connections to deal with after a single system call. Typically for each of those fds you'll then do a read or a write. So the epoll/poll overhead is fairly small, with epoll coming out a bit faster than poll in that situation.

FWIW, I've written an async webserver which handles a few thousand concurrent users, and does thousands of HTTP reqs/second. I've never seen poll/epoll as a bottleneck, but I'm using Java NIO which really seems to work extremely well (I don't remember how/which it uses, I think epoll).

Maybe Java does some of this cool stuff already so perhaps I'm shielded from the pain of dealing with things directly.

In the past I've written Java NIO code that dealt with around 60,000 concurrent connections pretty well. The time spent doing poll seemed to be completely insignificant. CPU usage was negligible.

It'd be good to see some numbers though - for example:

For average mongrel application, 40% of CPU time is spent in poll / average of 30ms latency is due to poll etc.

But I'm skeptical those numbers are true. That was my point.

If you don't start with those numbers and measurements, optimizations like this, whilst interesting, may end up being of no real use to anyone.

> But I'm skeptical those numbers are true. That was my point.

Yeah, I'm skeptical those numbers are true too, but then again we're talking about totally different numbers.

Looks like he's already written a large bulk of his server's code, so maybe this optimization isn't really premature :)

You're probably right that when you actually use Mongrel2 as your app server your app-specific code higher up will be a larger bottleneck, but that's code that you have to deal with and this is code that he has to deal with so optimizing the hell out of it doesn't sound like a bad idea.

Lets say 95% of time is in your code, and 5% is in Mongrel2. Lets say that within Mongrel2, 10% of time is in this poll/epoll stuff.

That's 0.5% of your total time being spent here. So even if it's made twice as fast, your app will only speed up by say 100ms -> 99.75ms

Find the big things that matter and optimize them. Adding extra complexity to small things that don't matter is a recipe for more bugs and more issues.

I haven't actually used Mongrel2 yet, but I get the impression you can use it as a front-end for multiple physical servers, in which case the relative load of the M2 process on its server increases. Besides, unless Zed is writing your application and this is eating away at his time for that, it's not clear what exactly your complaint is.

> it's not clear what exactly your complaint is.

Any complexity introduced in the code increases its long term maintenance cost. I strongly suspect this is one of those cases where the performance gain will not justify the long-term effort of maintaining a more complex architecture.

I remember having a similar discussion circa 1992 about advantages and disadvantages of using ODBC versus native MS SQL/Sybase libraries. I instrumented the program I was writing and showed it spent 99+% of the time idling, 1-% of the time computing and, of that time, about 78% waiting for the database to return something. Using native libraries would yield a minuscule improvement at the cost of a huge headache.

So, you're comparing my track record with writing simple maintainable well documented code to something you did in 1992 with ODBC? That's your experience that's causing all the paranoia?

The worst two things that afflict programmers today is:

1. They never update their information, even after 18 years (18! You realize that right!? Things change man!)

2. They have an irrational paranoia about trying new things, as if me trying this out is going to destroy the universe.

That really needs to change.

1) Will you maintain Mongrel forever? It's not your track record that's the question, but the one of all future maintainers of Mongrel that will have to deal with the added complexity this change creates.

2) The experience from 1992 still seems current. Adding complexity to any software project adds cost to maintain it in the future. My experience in 1992 showed how added complexity for a marginal performance gain did not pay off then and still won't pay off today (unless you are programming a computer so expensive even a marginal increase in performance means lots of money).

3) It's your project and you may do with it whatever pleases you. What I wrote was intended as friendly advice from someone who is in this business for a long time. You are, of course, free not to accept the advice.

4) I encourage you to try new things and I am usually the first to propose workload-adaptable solutions. I, however, had my share of extremely clever optimizations that bit me back later when things as subtle as processor caches changed and it's not very funny (albeit it is fun to dig deep in the system to find out why X runs 33% slower on the 50% faster box). Nowadays, I consider every program line not written a line gained.

Will you maintain Mongrel forever?

With all due respect, this seems a ridiculous question. Do you obtain written statements to the above effect from all maintainers of software (open source or otherwise) before using it? Yes, it'll have to be maintained, but Zed's article alone is more documentation than you could ever hope for from most programmers. It's clear he isn't "most programmers" but that's no reason to hold him to ridiculous standards.

I am not holding him to ridiculous standards. It's obvious he won't maintain Mongrel forever. What I am confronting him is with the fact that he may be able to navigate any arbitrarily clever construct he invents, but that future maintainers who inherit Mongrel may not be as capable.

Again, it's his project and he's free to do whatever he feels like with it. Heck... I am not even a user. I offered him advice he is free to disregard. The fact my cleverness has been biting me ever since my 6502 days (quite likely before Zed was born) is a problem I sort of learned to deal with long ago - but it still comes to bite me from time to time.

While we're at it, let's lobby our governments to ban all software innovation!

Seriously, I bet the operating system you're using relies on far more of that evil "cleverness" than the subject of this thread.

This whole thread reeks of thinly veiled ad hominem attacks to me.

Zed tends to do the ad-hominem part very well without any need for external help. And you are using a straw-man. I never came even close to advocate for banning all software innovation.

What I said, and repeat, is that if you want to introduce an expensive to maintain piece of code, you have to weight the added cost against the performance gain. In this case, the performance gain seems marginal, the assumption of usage envisioned seems wrong and the added complexity seems just pointless.

It's his project, his code and I am not even a Mongrel user. I offer this as a friendly piece of advice, from old programmer to young programmer.

You know: it doesn't matter if you are beam-racing a 6502 or writing networking code to run on 64-bit deeply pipelined processors, there are things that remain true. This is one of them.

I don't know. Nobody has tried to find out and Zed is the first. Whatever his results are the gained knowledge will be useful to future developers.

> Is the extra complexity and logic really going to be a net win?

Fortunately, Zed is the right guy to find this out. I'm certainly looking forward to the results of this--which I bet we'll have an initial answer to by tomorrow.

The blog post does not say if the epoll code uses level triggering or edge triggering. It would be interesting to see the results for both modes. The smaller number of system calls required for edge triggering might make a difference in performance.

That's entirely possible, but then you pay a penalty in complexity because you have to keep track of missed events yourself. I think (unproven) that it's actually a wash because of this.

At most, you need to track a couple of booleans per socket, one for read and one for write.

Depending on what you are doing, you might not even need to track these booleans. For example, on the read side you can ignore read events when you are not interested in reading. When you switch back to read interest, you can read the socket to see if data arrived while you ignored events. A similar strategy can be used on the write side.

Is it just me, or did Zed not describe his testing methodology in any detail?

I can't even find a reference to his OS configuration and version details that he's developing on, which seems to me like a critical detail.

There's the pipetest.c file that everyone uses (since 2002) linked off that blog post, but I got tired and went to sleep.

Today I'm crafting how I ran the tests and releasing all the code and asking everyone to test my results. I am completely assuming I am wrong so looking for other people to test it.

Incidentally, if you google for "pipetest.c" you'll it's kind of the gold standard for this comparison, so if that code is wrong, then the entire assumption that epoll is better needs to be redone.

Okay. And I appreciate that, I'll look.

To make your process scientific, I'd like to suggest you add the following things to the post when you find it convenient:

1. A detailed explanation of your methodology, preferably with source code. This is so we can reproduce the tests. The ability to reproduce your work is a critical part of any process calling itself science.

2. A detailed list of the hardware you used & its deployment. (For reasons listed above).

3. Your raw data should be made available upon request so other people can work it as well.

P.S., aren't you concerned about I/O overhead with your superpoll proposal? It seems like the added resource allocation and the time spent in zeromq is going to eat up the small advantages you gain?

Cool experiment Mr Zed, but what about kqueue?

It seems superior to both *poll minions. Would be great if you proved/falsified this thesis as well.

kqueue is on OpenBSD and FreeBSD, while epoll is from Linux. (poll and select are on both)

I'm aware of it (you forgot to mention that kqueue is on the OS X as well). So what?

There are probably hordes of people who will be willing to run Mongrel2 on *BSD platforms, precisely because of the performance reasons. And Zed is a famous tinkerer rather than a religious zealot, so very probably he could be interested in checking kqueue as well.

"Why not" is also a good reason for a hacker when he's lacking other reasons.

My point was that comparing something that only runs on Linux against something that only runs on (various) BSDs adds a lot of other noise to the comparison - it's no longer the same hardware, install, and tuning, with just a different kernel call.

I see your point. Still it would be useful to see some typical BSD/kqueue in action compared to typical Linux/*poll. I bet Zed is not doing a big sysctl tuning at this stage. Just leaving default system settings as they are still is some starting point for further investigations.

What about NetBSD? Zed has already said he uses NetBSD, so if kqueue is there, he might add it to the mix.

Yes, it does. (I haven't used NetBSD at all, and I forgot to mention it.)

NetBSD also supports kqueue

Well, I haven't tried kqueue, but IIRC it has its own set of problems. Mainly that you can't kqueue certain types of file descriptors like ptys. I'd have to look into that, but I'm sure I'll have some kind of thing going about it soon.

Lets assume we have 20k opened FDs.

In case of poll(), you have to transfer this array of FDs from the userland vm to the kernel vm each time you call poll(). Now compare this with epoll() (let's assume we are using EPOLLET trigger), when you only have to transfer the file descriptors once.

You might say the copying won't matter, but it will matter when you have a lot of events coming on the 20k FDs which eventually leads to calling xpoll() at a higher rate, hence more copying of data between the userland and kernel (4bytes * 20k, ~80kbytes each call).

Yep, that's what I thought too, that at least epoll would be as fast. Turns out it's not though, but then I could be wrong.

Also, your assumption of EPOLLET is potentially wrong. I think (unproven) that the extra overhead and complexity of using edge trigger right makes EPOLLET pointless.

Sorry, I meant level-triggered. :) I think edge-triggered does add an extra overhead as you stated.

Why would there be extra overhead when using edge triggered? There's definitely extra complexity on the client side, but it's close to what you're trying to do with super-poll (the extra complexity is basically to find out when an fd isn't busy anymore).

I think it might even be faster, kernel-side. From what I remember of the implementation, both modes have to walk the same list of ready fds, but that list is shorter in edge triggered mode, because they get removed from the list as it goes.

Edge triggered might have more overhead if many fds change between ready/not-ready quickly, but that's quite the wacky situation (and if it has an even distribution, would ensure your ATR is about 0.5, so probably still winning).

Why would there by any copying? The kernel can directly read userspace memory.

For the kernel to execute a system call, it has to place the arguments on its stack. a system call doesn't execute in the userland.

Yes but the argument to poll is a pointer. The pointer would be copied but the kernel can still follow the pointer to userspace, right?

The pointer referred to by the process is not accessible by the kernel because when the user process was running, it had a different vm space than the kernel vm space. So if it just passes the pointer (without copying the pointer's data), then the kernel will point to a virtual address that won't exist until the user process gets swapped in again.

This sounds really strange to me. The kernel has full access to the page tables so can't it lookup things in userspace?

When the kernel is executing a function call placed on the stack, all the addresses on the stack are assumed in the same vm space. It does not know that an address is actually a virtual memory address belonging to process X and tries to figure out the value in the physical memory.

Yeah but it's possible to look up things in userspace right? So just change poll() to assume that the pointer points to userspace. I don't see the need for copying.

When the kernel calls poll, poll will access memory in the kernel address space because that's where it is running. All the addresses accessed in any system call are in the kernel address space. They don't go back and forth and swap vm pointers to fetch data from other processes. That's not how kernels work. And no you cannot change poll().. You write epoll/kqueue

Zed, whats with all the premature optimization? Surely Mongrel2 should first be able to make coffee, build you an island and f@!in transform into a jet and fly you there, before you start to make it faster!

Just kidding. It's always nice to see science in action. Great work! I suspect there's an impact on ZeroMQ's own poll/epoll strategy.

0.6 is so arbitrary. it should be 1.0/golden-ratio.

I was hoping for e, but alas no luck.

The best would be if it would be possible to code up superpoll to be adaptive, and in effect, benchmark itself to come to the same conclusion, dynamically. So if one day the kernel people fixed epoll to be better all the time, Mongrel2 would magically not use poll() much on systems using that kernel, and favour epoll.

Of course, that's often Kinda Hard To Do (tm). ;-)

It's pretty darn close to 1 - 1/e.

The first four digits of 1/0.6 would be 1666, the Annus Mirabilis. So you could compare Mongrel2's multiple request handlers to Isaac Newton first splitting light with a prism.

Question: as the ATR is going higher, so would the proportional time spent in poll or epoll, no?

So if you have a thousand fds, and they're all active, you have to deal with a thousand fds, which would make the difference between poll and epoll insignificant (only twice as fast, not even an order of magnitude!)?

This would make the micro-benchmark quite micro! Annoyingly enough, I think that means that the real way to find out would be an httpperf run, with each backends. A lot more work...

Very nice write-up. Little details such as this should make Mongrel2 very solid. It's nice to see how he analyzed the issues around poll and epoll and then figured out how to make use of both for optimum performance no matter what happens in production. Many other programs could benefit from this sort of analysis although at different levels... e.g. Sorted vectors may be better for smaller containers but hash tables better for larger containers, etc.

interesting article! Is 'super-poll' done yet? i would have liked to see a super poll line on some of those graphs to see how it compares to just vanilla poll and ePoll at different ATRs. Though i guess you would also have to test for situations where ATR varies over time (so that you could measure the impact of moving fds back and forth).

It is a little wonder why this kind of people think that everyone else are just stupid to realize such things. What they want is a fame and followers. (btw, don't you forget to donate!)

hint: nginx/src/event/modules/ngx_epoll_module.c

May be one should learn how to use epoll and, perhaps, how to program? ^_^

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact