
Zed Shaw: "poll, epoll, science, and superpoll" with R - tdmackey
http://sheddingbikes.com/posts/1280829388.html
======
jacquesm
In real-life web serving situations, as opposed to benchmarks, the majority of
the fds are not active. It's the slow guys that kill you.

A client on a fast connection will come in and pull the data as fast as the
server can spit it out, keeping the process and the buffers occupied for the
minimum amount of wall-clock time, so the number of times the 'poll' cycle
runs is very small.

But the slowpokes, the ones on dial-up and on congested lines, will get you
every time. They keep the processes busy far longer than you'd want and you
have to hit the 'poll' cycle far more frequently, first to see if they've
finally completed sending you a request, then to see if they've finally
received the last little bit of data that you sent them.

The impact of this is very easy to underestimate, and if you're benchmarking
web servers for real world conditions you could do a lot worse than to run a
test across a line that is congested on purpose.

~~~
zedshaw
So let's take your assertions and take them apart:

> the ones on dial up and on congested lines will get you every time.

Do you have numbers on the dial-up users for your server? My understanding is
that there are far fewer of them, so this is bogus. Show evidence of high
dial-up penetration first.

> They keep the processes busy far longer than you'd want and you have to hit
> the 'poll' cycle far more frequently

Again, you have no numbers on the active/total ratio in your server, so unless
you do, this statement doesn't refute what I found. I've presented evidence
that just shows the math of O(N=active) / O(N=total) holds up. Simple math.
The only way epoll wins for all load types is if it is as fast as poll all the
time. My tests show it's not, which stands to reason since it's implemented
using more syscalls than poll.

> The impact of this is very easy to underestimate, and if you're benchmarking
> web servers for real world conditions you could do a lot worse than to run a
> test across a line that is congested on purpose.

Again, you have no definition of "congestion". If you adopt a simple metric
like ATR then we can talk. As it is, you (and everyone else) just throw
around latency numbers as if those matter, when really the performance break is
in the ATR. In addition, my numbers show the performance break being at about
60% ATR, so if you're saying that _no_ server ever goes above 60% activity
levels then you're totally wrong. 60% is not completely unreasonable on a
loaded server.

But, I think you're missing a key point: You need both in a server like
Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't
read the article). I said something very exact and measurable:

> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll
> is faster than epoll when the active/total ratio is > 0.6.

If you don't think that's the case in "the real world" then go measure it and
report back. That's the science part. I totally don't believe it yet myself,
which is why I'm measuring it and showing the methods to everyone so they can
confirm it for me.
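The claim above is just two cost curves crossing. A toy model makes the arithmetic concrete; the constants here are invented purely to put the crossover near 60% ATR (the real values come out of the benchmark, not this sketch):

```python
# Toy cost model, not the measured benchmark: poll scans every registered
# fd, while epoll pays a larger per-event constant but only touches the
# active ones. c_event = 1.7 is made up to place the crossover near 0.59.

def poll_cost(total, c_scan=1.0):
    return c_scan * total            # O(N = total)

def epoll_cost(active, c_event=1.7):
    return c_event * active          # O(N = active), bigger constant

TOTAL = 1000
for atr in (0.2, 0.6, 0.9):
    active = int(TOTAL * atr)
    winner = "epoll" if epoll_cost(active) < poll_cost(TOTAL) else "poll"
    print(f"ATR {atr:.1f}: {winner} wins")
# → epoll wins at 0.2; poll wins at 0.6 and 0.9 with these constants
```

The crossover sits wherever the ratio of the two constants puts it, which is exactly why it has to be measured rather than asserted.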

~~~
bdr
Read "on dial-up" as "slow". The argument depends only on there being a
certain distribution of client speeds. It's not about dial-up in particular.

~~~
zedshaw
And, if there's a distribution of speeds then you can measure the distribution
and see what works best. Again, my challenge still stands:

Measure it or STFU.

~~~
blasdel
How about you get measurements of ATR from real-world deployments instead of
the wild conjectures you've laid out in this thread? Your challenge applies
even more so to yourself:

Measure it or GTFO.

~~~
zedshaw
Oh, you mean do what I'm already doing? Measuring and developing ideas then
testing them?

It helps if you're going to comment that you actually read the words I use,
not the ones you have in your head that make you sound like you're super
smart.

~~~
blasdel
Yes, you measured the ideal ATR inflection point for poll vs. epoll in your
synthetic microbenchmark.

But you guessed wildly about what ATRs people see in the real world:
<http://news.ycombinator.com/item?id=1572292>
<http://news.ycombinator.com/item?id=1572418>

~~~
zedshaw
Yeah, 'cause there's no way I'll be able to test a real web server that I
actually wrote based on this small test. This is a small test of one
specific thing; doing more would confound the results. Confounding. Look it up.

Incidentally, this is the same test everyone else uses, so if you thought it
was bullshit, why did you support it when people were using it to test epoll?
Oh, because they used it to confirm your bias rather than disagree with it.

~~~
tel
I think there's less disagreement about the well-controlled result derived and
more about whether:

    a) your controls were right, and
    b) what the optimal decision is in light of this new information.

(a) is well known to be one of the most difficult parts of scientific
reasoning and is almost always open to endless debate and improvement. In
short, it's the question of whether ATR is a human-sensible metric. (b)
however has an interesting direct answer: figure out the distribution of
"live" ATRs on an interesting population of real servers and then, to borrow
Eliezer's phrase, shut up and multiply.

If a lot of servers that you're targeting with M2 fall across that 60% divide
(under circumstances similar to your controlled microbenchmark) then of course
Superpoll is a good compromise.

Jacques is arguing a combination of (a) and (b). Perhaps ATR is not a
sufficient metric to understand all interesting server loads. Moreover,
perhaps many interesting servers live at really low or high ATRs all the time
and so Superpoll must gracefully degrade to either poll or epoll.

In any case, driving for empirical data is noble, but possessing data is never
sufficient to silence all detractors. It's really nice to have strong
empirical support for the break-even point between the two (i.e. the ratio of
their constant time components) via your benchmark, but science isn't _just_
statistics.

( _edit:_ I'll also add that pushing the pipetest microbenchmark past where
people are usually making hyperbolic claims _is_ a pretty big deal and a good
catch.)

------
FooBarWidget
Zed isn't the only one who has found epoll to be slower than poll. The author
of libev basically says the same thing. See
<http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod> and search for
EVBACKEND_EPOLL.

I wonder how kqueue compares to poll and epoll. Kqueue has a less
stupid interface because it allows you to perform batch updates with a single
syscall.

------
jfager
It is worth pointing out that the original epoll benchmarks were focused on
how performance scaled with the number of dead connections, not performance in
general:

<http://www.xmailserver.org/linux-patches/nio-improve.html>

And as jacquesm points out, in a web-facing server, that's the case you should
care about. A 15-20% performance hit in a situation a web-facing server is
never going to see doesn't matter when you consider that the 'faster' method
is 80% slower (or worse) in lots of real world scenarios.

I'll be interested to see how the superpoll approach ends up working, but my
first impression is 'more complexity, not much more benefit'.

~~~
zedshaw
> And as jacquesm points out, in a web-facing server, that's the case you
> should care about.

Yes, but where's the evidence of what people see for active/total ratios in the
real world? I'm showing that unless it's below about 60% (probably more like
50%) then poll is the way to go.

60% active isn't unrealistic at all. I can see quite a few servers hitting
those thresholds, and in those cases, poll vs. epoll doesn't matter.

I think what's more important in what I'm finding is that you really need
both. It's entirely possible that you have servers that are at 80-90% ATR all
the time. Others that are 10% ATR. The key is either you have to measure that,
which nobody does, or you have to make a server that can adapt.

~~~
blasdel
> but where's the evidence of what people see for active/total ratios in the
> real world?

Yes Zed, where the fuck is it? You're claiming _SCIENCE!_ based on your worst-
case synthetic localhost benchmarks, and then turning around and wildly
guessing as to real-world performance characteristics with internet latencies.

Worse, your whole thesis hinges on ATR but you made no effort to measure it
anywhere; instead you're passive-aggressively berating us to do it.

~~~
zedshaw
Wow here we are again, you not reading my article. I ran the same test that
everyone else runs for poll vs. epoll, then used R to craft graphs and test
hypotheses. It was _not_ a localhost test.

So far all you've got is trolling HN comments. YOU WIN!

~~~
jacquesm
Pipes, localhost, who is counting? As far as I'm concerned that's the same
thing, and making it seem as if that's a significant difference for the
purposes of this test is simply conversational trickery.

If you have tested this on real live servers then there is no evidence of that
in your posting, and to suggest that this:

<http://dpaste.de/32o8/>

is anything but a localhost test is simply bogus.

The only use case where you _may_ be right that poll is advantageous, as far
as I can see, is streaming media servers (video, audio, other large files).
Image servers are the ones with the worst active-to-total ratios, especially
if the images are small. I should know, I only serve up a few billion of them
every day. A few years ago I was stupid enough to think that video was hard;
man, was I ever wrong. Repeated connections to the same host, that's a much
bigger killer than pumping bits.

~~~
zedshaw
Testing this on real live servers is confounding. Man, you guys really don't
get this. If you want to test epoll and poll over file descriptors you test
that. You don't test a billion other things in a network server. That
confounds your results.

But what's really amazing is that this is the test the proponents of epoll
have been using for 8 years. Where was your objection back when they were using it for
that?

------
pmjordan
Pardon my ignorance, I haven't built high performance servers at this low a
level, but I'm intrigued:

What exactly is the definition of an "active" file descriptor in this context?

My best guess after reading the man pages is that poll() takes an array of
file descriptors to monitor and sets flags in the relevant array entries,
which your code then needs to scan linearly for changes, whereas epoll_wait()
gives you an array of events, thus avoiding checking file descriptors which
haven't received any events. Active file descriptors would therefore be those
that did indeed receive an event during the call.

EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely
missed that paragraph in the article, which makes the following paragraph
redundant.

If this is correct, it sounds to me (naive as I am) as if some kind of hybrid
approach would be the most efficient: stuff the idling/lagging connections
into an epoll pool and add the _pool_'s file descriptor to the array of
"live" connections you use with poll(). That of course assumes you can
identify a set of fds which are indeed most active.
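That hybrid is feasible because an epoll instance is itself a file descriptor, and it polls readable whenever any fd registered in it has pending events. A minimal sketch with Python's select wrappers on Linux (the hot/cold names are illustrative, not anything from Mongrel2):

```python
import os
import select

hot_r, hot_w = os.pipe()    # a "live" connection, watched directly by poll
cold_r, cold_w = os.pipe()  # an "idle" connection, parked in epoll

ep = select.epoll()
ep.register(cold_r, select.EPOLLIN)

p = select.poll()
p.register(hot_r, select.POLLIN)
p.register(ep.fileno(), select.POLLIN)  # the whole idle set costs one slot

os.write(cold_w, b"x")                  # the "idle" connection wakes up
events = dict(p.poll(0))
assert ep.fileno() in events            # poll reports it via the epoll fd
print(ep.poll(0))                       # drill down: [(cold_r, EPOLLIN)]
```

The catch, as discussed elsewhere in the thread, is the epoll_ctl syscall paid every time a connection migrates between the hot and cold sets.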

~~~
FooBarWidget
An active file descriptor is one that you can read from or write to without
blocking or getting EAGAIN as an error. The whole point of
poll/epoll/kqueue/select is to figure out which file descriptors are in such a
state.

The difference between poll and epoll is that, given an input of N file
descriptors, poll returns all N file descriptors and you need to loop through
each one of them to check whether the 'active' flag is set. epoll
just returns all the active file descriptors so that you don't need to loop
through the inactive ones.
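The difference can be seen through Python's select module (epoll is Linux-only). One caveat: Python's poll object hides the revents scan that C code does by hand, but the kernel-side check of all N registered fds still happens on every call:

```python
import os
import select

pipes = [os.pipe() for _ in range(4)]
os.write(pipes[2][1], b"x")          # make only pipe #2 readable

# poll: the interest list is (re)submitted on every call; in C you get the
# whole pollfd array back and scan revents yourself.
p = select.poll()
for r, _w in pipes:
    p.register(r, select.POLLIN)
ready = p.poll(0)
print(ready)                         # just [(fd_of_pipe_2, POLLIN)]

# epoll: the interest list lives in the kernel (one epoll_ctl per fd),
# and each wait returns only the fds that are actually active.
ep = select.epoll()
for r, _w in pipes:
    ep.register(r, select.EPOLLIN)
print(ep.poll(0))                    # same single entry
```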

A hybrid approach, as Zed has suggested, would appear to be more efficient on
the surface. It remains to be seen whether it can actually be implemented
efficiently, because migrating fds to and from epoll is extremely expensive,
requiring a separate syscall per fd.

But if you ask me, the _real_ solution is to have the kernel team fix their
epoll implementation performance issues instead of forcing people to work
around it with hybrid approaches. Other than the stupid single-syscall-per-fd
requirement, there's nothing in epoll's interface that would force it to
perform worse than poll when the active/total ratio is high.

~~~
zedshaw
I totally, absolutely agree they should fix epoll, but the way they've designed
it I don't see it happening. Of course they could fix the call for doing the
actual select and make it at least as fast as poll, but the fact that you have
to do a syscall for _every_ file descriptor is idiotic.

~~~
jacquesm
Some people actually did fix epoll, benchmarked the results and concluded it
wasn't worth it.

[http://www.linuxinsight.com/ols2004_comparing_and_evaluating...](http://www.linuxinsight.com/ols2004_comparing_and_evaluating_epoll_select_and_poll_event_mechanisms.html)

------
axod
Sounds like premature optimization to me. Is this _really_ the bottleneck? Is
the extra complexity and logic really going to be a net win?

~~~
jakevoytko
A conclusion reached by measurement is not premature. This looks like an
attempt to write a better server than the 80/20 rule allows. If he's wrong and
only one polling method is useful in production, the live servers will pick
the good one and nobody will suffer because he jumped to conclusions. Since
he's written Mongrel, I trust that he has a reason to worry about polling that
may not have appeared in the post.

~~~
jacquesm
> A conclusion reached by measurement is not premature.

That's just plain wrong. Premature optimisation does not refer to having to
measure before you optimise, it refers to optimising things that in practice
may have little or no effect on the actual performance of the program.

By doing these tests in isolation instead of while running on a profiling
kernel under production load it is very well possible that the bottleneck will
not be the polling code at all but something entirely different. I'd say that
this is a textbook example of what premature optimisation is all about.

Assuming you have a finite budget of time to spend on a project, any
optimisation that takes time out of that budget which could have been spent
more effectively elsewhere is premature.

Now there is a chance that this would have been the bottleneck in the
completed system, but before you've got a complete system you can't really
tell. My guess based on real world experience with lots of system level code
that used both, including web servers, video servers, streaming audio servers
and so on is that the overhead of poll/epoll will be relatively minor compared
to other parts of the code and the massive amount of IO that typically follows
a poll or an epoll call.

If you have 10K sockets open then typically poll/epoll will return a large
number of 'active' descriptors; you'll then be doing IO on all of those for
that single call to poll/epoll.

Each of those IO calls is probably going to be as much or more work to process
than the poll call was.

~~~
jakevoytko
"I need polling" => "Here are my options, which one is better?" => "They're
good for different things" => "I'll pick the best one for the environment" is
a reasonable design process. More so than some decisions that I make! Yes,
there's a fixed time budget. But you're suggesting selectively ignoring
evidence when designing a program, preferring random guessing and pattern
matching to actual numbers. Should he have collected them to begin with? Maybe
not, but sometimes you can't help your curiosity on a hobby project :). I
understand your concern about this hypothetical production system, but the
fact of the matter is that there is no production system right now, and no way
to measure how it will handle certain things; there are, however, benchmark numbers.
Better than nothing, I say!

~~~
llimllib
> => "I'll pick the best one for the environment"

Which, in reality, is: "I'll spend a lot of design and implementation effort
designing a new one which may or may not improve the measurable, global
performance of my new web server because it's not yet at the point where I can
benchmark these sorts of things to verify that I'm not wasting a whole ton of
effort that could be better spent by deciding that epoll is _fast enough_."

Maybe Zed knows from his previous server experience that {e}poll is where he
hits a bottleneck; it's just that if there's any chance that it's not, he
could be wasting a bunch of time implementing "superpoll".

(Or maybe he just wants to do it because it's neat, or because it's innovative
(which it is), or for any number of other reasons. I'm just pointing out that
he's doing much more than picking "the best one for the environment")

~~~
zedshaw
It's an idea I had after actually measuring. If it doesn't work then I tried
something out.

What you really should be getting from it though is that epoll is not faster.
It is not O(1). It is not faster on smaller vs. larger lists of FDs. Pretty
much all the things you were told as advantages of epoll are total crap.

The only advantage of epoll is that it's O(N=active) while poll is O(N=total).
That's it.

So at a minimum I've done some education and spent some time learning
something.

~~~
llimllib
I tried to make sure I gave lots of reasons that justify your work; I _do_
think it's cool.

I just wanted to say that it is not an unquestionable design decision.

Rock on with the superpoll, I hope it's awesome and very successful.

------
frognibble
The blog post does not say if the epoll code uses level triggering or edge
triggering. It would be interesting to see the results for both modes. The
smaller number of system calls required for edge triggering might make a
difference in performance.

~~~
zedshaw
That's entirely possible, but then you pay a penalty in complexity because you
have to keep track of missed events yourself. I think (unproven) that it's
actually a wash because of this.

~~~
frognibble
At most, you need to track a couple of booleans per socket, one for read and
one for write.

Depending on what you are doing, you might not even need to track these
booleans. For example, on the read side you can ignore read events when you
are not interested in reading. When you switch back to read interest, you can
read the socket to see if data arrived while you ignored events. A similar
strategy can be used on the write side.
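The bookkeeping exists because edge-triggered epoll reports each readiness transition exactly once. A small Linux-only sketch (Python's select wrappers) shows the discipline: drain until EAGAIN, or remember the state yourself:

```python
import os
import select

r, w = os.pipe()
os.set_blocking(r, False)   # ET only makes sense with non-blocking fds

ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)

os.write(w, b"hello")
first = ep.poll(0)
second = ep.poll(0)
print(first, second)        # the edge is reported once, then not again,
                            # even though unread data is still waiting

chunks = []
while True:                 # so a correct ET reader drains until EAGAIN
    try:
        chunks.append(os.read(r, 2))
    except BlockingIOError:
        break
print(b"".join(chunks))     # → b'hello'
```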

------
KirinDave
Is it just me, or did Zed not describe his testing methodology in any detail?

I can't even find a reference to his OS configuration and version details that
he's developing on, which seems to me like a critical detail.

~~~
zedshaw
There's the pipetest.c file that everyone uses (since 2002) linked off that
blog post, but I got tired and went to sleep.

Today I'm crafting how I ran the tests and releasing all the code and asking
everyone to test my results. I am completely assuming I am wrong so looking
for other people to test it.

Incidentally, if you google for "pipetest.c" you'll see it's kind of the gold
standard for this comparison, so if that code is wrong, then the entire
assumption that epoll is better needs to be redone.

~~~
KirinDave
Okay. And I appreciate that, I'll look.

To make your process scientific, I'd like to suggest you add the following
things to the post when you find it convenient:

1. A detailed explanation of your methodology, preferably with source code.
This is so we can reproduce the tests. The ability to reproduce your work is a
critical part of any process calling itself _science_.

2. A detailed list of the hardware you used & its deployment. (For reasons
listed above).

3. Your raw data should be made available upon request so other people can
work it as well.

P.S., aren't you concerned about I/O overhead with your superpoll proposal? It
seems like the added resource allocation and the time spent in zeromq are
going to eat up the small advantages you gain.

------
kunley
Cool experiment Mr Zed, but what about kqueue?

It seems superior to both *poll minions. It would be great if you
proved or falsified this thesis as well.

~~~
silentbicycle
kqueue is on OpenBSD and FreeBSD, while epoll is from Linux. (poll and select
are on both)

~~~
kunley
I'm aware of it (you forgot to mention that kqueue is on OS X as well). So
what?

There are probably hordes of people who will be willing to run Mongrel2 on
*BSD platforms, precisely because of the performance reasons. And Zed is a
famous tinkerer rather than a religious zealot, so very probably he could be
interested in checking kqueue as well.

"Why not" is also a good reason for a hacker when he's lacking other reasons.

~~~
silentbicycle
My point was that comparing something that only runs on Linux against
something that only runs on (various) BSDs adds a lot of other noise to the
comparison - it's no longer the same hardware, install, and tuning, with just
a different kernel call.

~~~
kunley
I see your point. Still it would be useful to see some typical BSD/kqueue in
action compared to typical Linux/*poll. I bet Zed is not doing a big sysctl
tuning at this stage. Just leaving the default system settings as they are is
still a starting point for further investigations.

------
kqueue
Let's assume we have 20k open FDs.

In case of poll(), you have to transfer this array of FDs from userland to
the kernel each time you call poll(). Now compare this with epoll (let's
assume we are using the EPOLLET trigger), where you only have to transfer the
file descriptors once.

You might say the copying won't matter, but it will when you have a lot of
events coming in on the 20k FDs, which eventually leads to calling xpoll() at
a higher rate, hence more copying of data between userland and the kernel
(4 bytes * 20k, ~80 kbytes each call).
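A quick sanity check on that figure (with one assumed correction: on Linux a struct pollfd is 8 bytes, an int fd plus two shorts, so the per-call copy is roughly double the 4-byte estimate above):

```python
POLLFD_SIZE = 8           # bytes: 4 (int fd) + 2 (events) + 2 (revents)
FDS = 20_000

per_call = POLLFD_SIZE * FDS
print(per_call)           # → 160000 bytes copied in on every poll() call

# At, say, 1000 wakeups/sec that is ~160 MB/s of user/kernel copying,
# while epoll pays its copy once per fd at registration time.
print(per_call * 1000)    # → 160000000
```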

~~~
zedshaw
Yep, that's what I thought too, that at least epoll would be as fast. Turns
out it's not though, but then I could be wrong.

Also, your assumption of EPOLLET is potentially wrong. I think (unproven) that
the extra overhead and complexity of using edge triggering correctly makes EPOLLET
pointless.

~~~
kqueue
Sorry, I meant level-triggered. :) I think edge-triggered does add an extra
overhead as you stated.

------
phintjens
Zed, what's with all the premature optimization? Surely Mongrel2 should first
be able to make coffee, build you an island and f@!in transform into a jet and
fly you there, before you start to make it faster!

Just kidding. It's always nice to see science in action. Great work! I suspect
there's an impact on ZeroMQ's own poll/epoll strategy.

------
jaekwon
0.6 is so arbitrary. It should be 1.0/golden-ratio.

~~~
zedshaw
I was hoping for e, but alas no luck.

~~~
aston
It's pretty darn close to 1 - 1/e.

~~~
mhd
The first four digits of 1/0.6 would be 1666, the Annus Mirabilis. So you
could compare Mongrel2's multiple request handlers to Isaac Newton first
splitting light with a prism.

------
pphaneuf
Question: as the ATR goes higher, so does the proportional time spent in poll
or epoll, no?

So if you have a thousand fds, and they're all active, you have to deal with a
thousand fds, which would make the difference between poll and epoll
insignificant (only _twice_ as fast, not even an order of magnitude!)?

This would make the micro-benchmark quite micro! Annoyingly enough, I think
that means that the real way to find out would be an httperf run with each
backend. A lot more work...

------
16s
Very nice write-up. Little details such as this should make Mongrel2 very
solid. It's nice to see how he analyzed the issues around poll and epoll and
then figured out how to make use of both for optimum performance no matter
what happens in production. Many other programs could benefit from this sort
of analysis, although at different levels... e.g. sorted vectors may be better
for smaller containers but hash tables better for larger ones, etc.

------
lukesandberg
Interesting article! Is 'super-poll' done yet? I would have liked to see a
super-poll line on some of those graphs to see how it compares to just vanilla
poll and epoll at different ATRs. Though I guess you would also have to test
for situations where ATR varies over time (so that you could measure the
impact of moving fds back and forth).

------
c00p3r
It is little wonder that these kinds of people think that everyone else is
just too stupid to realize such things. What they want is fame and followers.
(btw, don't forget to donate!)

hint: nginx/src/event/modules/ngx_epoll_module.c

Maybe one should learn how to use epoll and, perhaps, how to program? ^_^

