
Let's Do Some Science: Zed's Test Stack for poll vs. epoll - gthank
http://sheddingbikes.com/posts/1280882826.html
======
codexon
Zed is probably right that poll is faster than epoll when all descriptors are
active. There's no reason to doubt this, since epoll registers a callback for
every file descriptor, while with poll the OS only has to fill in an array of
flags once per call.

<http://lwn.net/Articles/14168/>
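
To make the structural difference concrete, here's a minimal sketch (not Zed's
test harness, just an illustration of the two wait loops) of why poll's cost
scales with the total number of descriptors handed in, while epoll_wait only
returns the descriptors that are already ready:

    /* Minimal sketch of the two wait loops; illustrative only. */
    #include <poll.h>
    #include <sys/epoll.h>

    void poll_once(struct pollfd *fds, int nfds) {
        /* the kernel scans all nfds entries on every call, ready or not */
        int n = poll(fds, nfds, -1);
        for (int i = 0; i < nfds && n > 0; i++) {
            if (fds[i].revents & POLLIN) {
                /* handle fds[i].fd */
                n--;
            }
        }
    }

    void epoll_once(int epfd, struct epoll_event *events, int maxevents) {
        /* the kernel returns only the fds it has already marked ready */
        int n = epoll_wait(epfd, events, maxevents, -1);
        for (int i = 0; i < n; i++) {
            /* handle events[i].data.fd */
        }
    }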

However, I will have to side with jacquesm in saying that regular internet
servers will usually benefit from epoll. It is hard to believe that web servers
would have lots of active file descriptors, especially considering keep-alive
connections and the fact that even broadband latency is an eternity for a
modern CPU to wait through. On the other hand, using poll makes sense for a
heavily trafficked file server.

~~~
zedshaw
There's nothing in anything I've said that disagrees with your assertion that
regular web servers (not internet servers) have lots of idle connections. What
I've been saying all along is something very simple.

If the ATR > 0.6, poll wins. If ATR < 0.6, epoll wins.

That means there are _potential_ gains to be had by using both, and at a
minimum you can use both and it won't hurt you very much.
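
For illustration, the rule above boils down to something like this (a
hypothetical sketch of the decision, not the actual superpoll code; the active
and total counts would come from the previous wait cycle):

    /* Hypothetical sketch of the 0.6 ATR cutoff described above. */
    enum backend { USE_POLL, USE_EPOLL };

    enum backend pick_backend(int active, int total) {
        double atr = total > 0 ? (double)active / total : 0.0;
        return atr > 0.6 ? USE_POLL : USE_EPOLL;
    }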

Additionally, Mongrel2 isn't a regular server. Remember all those pesky 0MQ
backends? Those basically crank data fast as shit, so this research could
potentially benefit them. It also points out potential flaws in previous
analyses, so people can now potentially use ATR to tune their server's
workloads appropriately.

But notice I keep saying potentially? So far I'm the only one offering
everything up, all data, everything for anyone to prove me wrong, and I
actually don't think I'm totally right.

I don't see very many other people who criticize the idea doing the same. Just
more comments on HN.

~~~
jfager
_What I've been saying all along is something very simple.

If the ATR > 0.6, poll wins. If ATR < 0.6, epoll wins._

You left out the part where you accused people who advocate using epoll of
being bullshitting touts who perpetuate fallacies:

"why the touted benefits of epoll and most of the information out there is
mostly bullshit because of some falacies people seem to have about epoll
(mostly perpetuated by epoll's proponents)."

Which you proceeded to back up with a strawman argument about claims of O(1)
performance and a rant about how graphs that show that epoll accomplishes
exactly what epoll is designed to accomplish "flat out get it wrong".

You're using "ATR" to find the inflection point in the performance curve from
epoll's constant overhead per active fd, for the purpose of micro-optimizing
Mongrel2. That's fine. But when you claim that people who just use epoll
because it solves the problem they actually need solved are making decisions
based on "bullshit", you shouldn't act so surprised that they don't respond
positively.

~~~
zedshaw
The touted benefits of epoll are mostly bullshit, since it was advocated as
being O(1) and frequently mentioned by others as being O(1) and always faster
than poll. In fact, right after I mentioned that I thought epoll wasn't faster,
the _first_ thing people said was that it was O(1). Every person I talked to
said it.

I have _evidence_ that contradicts both of the main assertions of epoll, so
until someone comes up with counter evidence I can safely say that epoll
information is bullshit, and my argument is _nothing_ like a straw man.

So yes, it is bullshit, and if that offends you then awesome. Because
nothing's worse than blindly believing some load of crap when you could have a
better understanding of what you're doing.

~~~
jfager
I googled 'epoll "O(1)"'. You're the first hit (congratulations), followed by
some random person on stackoverflow, the documentation for a CPAN module, the
repetition of that documentation on github and in various package repos, and
some random bloggers and mailing list responders who are either just referring
to the work that you have to do to find an active fd when the wait call
returns, or are confused.

The man page says nothing about O(1):

<http://www.kernel.org/doc/man-pages/online/pages/man4/epoll.4.html>

The original epoll writeup says nothing about O(1), and shows graphs
explicitly for dead connection scenarios (iow, it never claims better
performance per active fd):

<http://www.xmailserver.org/linux-patches/nio-improve.html>

The Dan Kegel writeup that's pretty much universally cited on this issue says
nothing about O(1):

<http://www.kegel.com/c10k.html#nb.epoll>

The link you threw up yesterday that "flat out gets it wrong" says nothing
about O(1) and simply shows epoll doing exactly what it was designed to do
(again, it never claims better performance per active fd), including graphs
_that actually do show poll outperforming epoll with small numbers of dead
connections_:

<http://lse.sourceforge.net/epoll/index.html>

Can you show any authoritative sources who claim epoll is O(1)? If not, then
yes, you built a fluffy little strawman and took him down hard.

I have no problem with your work on superpoll. I'm skeptical it will help
much, guessing there will be situations where it will hurt, and overall not
convinced it's worth the effort to find out, but it's your time. Have fun with
it, and I'm eager to see the results.

I just don't get why you feel the need to imply that everyone who ever used
epoll without finding the ATR performance cut-off is a blind moron snowed into
doing so by some shadowy malevolent epoll cabal spreading misinformation about
how it works.

------
lhorie
I'd be interested to see whether the idea of switching between epoll and poll
at 60% ATR would have so much overhead as to be impractical.

Re: > 60% active fds being real-life or not, I don't really see what the
problem is with wanting to handle that case in a project whose goal is
basically to keep humming along nicely regardless of what you throw at it.

~~~
pphaneuf
Adding/removing from the poll() array is more or less free, but with epoll you
have to use epoll_ctl(), so you'd definitely want some hysteresis to avoid
bouncing things around too much.
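
Something like the following is what that hysteresis might look like (a sketch
only; the 0.7/0.5 watermarks are made up for illustration, not measured
values):

    /* Switch to poll only above a high watermark and back to epoll only
       below a low one, so a workload hovering near the 0.6 cutoff doesn't
       pay epoll_ctl() churn on every cycle. */
    enum backend { USE_POLL, USE_EPOLL };

    enum backend update_backend(enum backend cur, double atr) {
        if (cur == USE_EPOLL && atr > 0.7)
            return USE_POLL;   /* mostly-active set: scanning the array is cheap */
        if (cur == USE_POLL && atr < 0.5)
            return USE_EPOLL;  /* mostly-idle set: let the kernel track readiness */
        return cur;            /* inside the band: stay put */
    }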

~~~
zedshaw
Interesting, you're like the 3rd or 4th person who's mentioned hysteresis.

~~~
rbanffy
It's a fairly common idea.

------
tzs
The network interface in general on most operating systems seems kind of
absurd from a server point of view. You've got data coming in from N clients.
The data from all those clients gets physically multiplexed into a single data
stream that comes in over the wire.

Then in the kernel you've got something that demultiplexes it back into N
separate streams, which it accumulates in N different buffers.

The server software is aware of all N of these streams. Through some mechanism
the kernel notifies the server software that data is available on some subset
of the N buffers, and the server software has to ask for that data.

If the server software is using a queue-and-thread model, it likely then takes
that data, which the kernel has gone to great effort to demultiplex into N
streams and provide through N separate file descriptors--and multiplexes it
back onto a single queue for the worker threads to pull from!

Why not get rid of some of this demultiplexing and re-multiplexing? Let me
have a single file descriptor, and everything that comes in destined for my
port comes through that single file descriptor, as a message with a header
that tells which remote IP address and port it came from plus the length,
followed by that many bytes of data.

Instead of N separate buffers in the kernel, just have one big buffer per
port. You'd still have to keep track of how many bytes in that buffer are
associated with each remote connection, for purposes of dealing with TCP flow
control, but I think this could be made to work reasonably (yes, I'm hand
waving right now).

Now you don't need poll or epoll. Heck, you don't even need to use select.
I'll just have a thread whose job is to read that one file descriptor using
blocking reads in a loop, pulling in data and putting it on my queue.
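
To make that concrete, here's a rough sketch of the message framing and reader
loop being proposed; no such kernel interface actually exists, so the header
layout and the port_fd descriptor are purely hypothetical:

    /* Hypothetical framing for the proposed single-descriptor model. */
    #include <stdint.h>
    #include <unistd.h>
    #include <netinet/in.h>

    struct conn_msg_hdr {
        struct in_addr remote_addr;  /* which client this chunk came from */
        uint16_t       remote_port;
        uint32_t       length;       /* payload bytes that follow */
    };

    void reader_loop(int port_fd) {
        struct conn_msg_hdr hdr;
        char payload[65536];         /* assume length <= 64 KB for the sketch */

        for (;;) {
            /* one blocking read for the header, one for its payload */
            if (read(port_fd, &hdr, sizeof hdr) != (ssize_t)sizeof hdr)
                break;
            if (hdr.length > sizeof payload ||
                read(port_fd, payload, hdr.length) != (ssize_t)hdr.length)
                break;
            /* put (hdr, payload) on the worker queue */
        }
    }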

Looking at this slightly differently--everything destined for, say, port 80 is
going to the web server, so why does the web server want this to come in
through N different file descriptors? Especially when you consider that the
code it's going to run is the same no matter which descriptor the data comes
in on? All the multiple file descriptors really accomplish is providing a way
to tell which bytes came from which client. Is that really the best use of
file descriptors, as opposed to just having the kernel tell you which client
the data came from when you do your read system call?

I can see that the N-file-descriptor model made sense in the days before
threads, when to get concurrency you had to use multiple processes. You then
need data coming in on port X from different clients to not necessarily all go
to the same process.

But for servers that are going to use the "single process with multiple
threads" model to handle concurrency, a single-stream network model would seem
to be a lot nicer and more efficient.

~~~
jacquesm
Technically nothing stops you from opening the raw device and starting to read
packets. Sooner or later you'll need to figure out when your packets can be
forwarded to the next layer for processing, so work can be done for a
particular connection.

And that's what the kernel did for you all along. The abstraction of file
descriptors for clients is a useful one; it allows you to decide what to do.
The decision here is not 'do I want all this data in one stream' or 'do I want
it demultiplexed': the kernel does a fairly good job of the demultiplexing
that needs to be done anyway (re-assembling a stream of packets into
functioning connections is non-trivial), and it exports that interface to all
user processes to avoid having to recreate that useful abstraction in those
user processes.

It's possible there is some gain to be had by 'rolling your own' and getting
rid of the kernel abstraction by collapsing all these layers into a single
user-mode program.

But I think you'd soon find that you are simply re-implementing all that code,
and that instead of a clean (poll or epoll, either is fairly clean) interface
you'd be bogged down in maintaining your own tcp stack.

I once ran into a situation like the one you describe. I was the programmer
in charge of coding up a driver for a high-speed serial card (high speed at
the time, nowadays we'd laugh at it) doing X.25. I'd nicely laid it out as two
separate pieces of software talking to each other using message passing on a
weird but wonderful little OS called QNX. One process to handle the data link
layer, one to handle the X.25 protocol.

After a series of benchmarks with 8 such cards in a single ISA industrial
enclosure, together with a (for the time) very expensive Micronics 33 MHz
motherboard, I figured that maybe we could gain some performance by merging
the two processes, to save on the IPC. The end result? Identical performance:
in spite of the message-passing overhead, the bit-twiddling, checksumming and
actual IO were such a large portion of the work done that the IPC didn't
matter at all performance-wise.

But it did matter to the code: the second version was decidedly harder to
debug and maintain, and eventually that 'branch' was scrapped in favour of the
one where the layers were clearly separated.

Poll and epoll are communication mechanisms; they communicate with the kernel
about the state of a bunch of fds. Both implement the same functionality (as
long as you don't use the actual events of epoll) from a programmer's point of
view. Both have slightly different use cases, and for some situations it may
be advantageous to use one or the other. There may be a difference, but on the
whole it will probably not be a very large one once you factor in all the
other stuff that needs doing.

~~~
haberman
> But I think you'd soon find that you are simply re-implementing all that
> code, and that instead of a clean (poll or epoll, either is fairly clean)
> interface you'd be bogged down in maintaining your own tcp stack.

You're not understanding OP's proposal. He's not talking about implementing
TCP in user-space. He said the kernel would still be handling TCP flow
control, and would only be sending data destined for a specific port to this
fd. So TCP is still in the kernel.

He's just proposing that the payloads for all connections come in on a single
fd instead of one fd per connection. But each read from the fd tells you what
remote host it came from.

------
wglb
Duplicate.

See lively discussion at <http://news.ycombinator.com/item?id=1570694>

~~~
jacquesm
Not exactly; this is where Zed presumably shows that using pipes is equivalent
to real-world testing and has nothing to do with using the loopback/localhost
device.

Personally I think that's nonsense. I downloaded the code and checked it out,
and there is nothing in there that changes my mind on the whole thing: the
meat is in the C code, and it's the same as what was linked from the article.
The rest is a wrapper script and some stuff to visualise the results, which
are what he showed them to be. The premise that web servers have the majority
of their fds active is where the issue lies.

FWIW I did some measuring and the results are pretty much in line with what I
expected (though the spread is still puzzling, I really intend to find out the
cause of that): <http://news.ycombinator.com/item?id=1573145>

~~~
zedshaw
If the test harness is wrong, then every other test that everyone doing epoll
comparisons with poll has done is wrong. This is the one they've all been
using. So either you write your own test to evaluate the same thing, without
confounding it with other external factors (like this test does), or you use
this test and prove me wrong by finding different data.

However, I'm glad you finally went and got some metrics to show one webserver
instead of the useless rhetoric you were spewing before. Not that it proves a
damn thing since it has no external validity, but at least it's a step in the
right direction.

I'll also say that your finding of 40% is _much_ higher than what you implied
in your previous comments. You made it seem like it's close to the 10% range,
but 40% is damn near ready to leap over into the equal performance levels.
It's interesting that even after that you still can't see the value of either
making epoll faster or using both, but whatever, you can't convince every
idiot on the internet, even with Science.

~~~
jacquesm
You should read more carefully. 40% was the highest observed, 10% the lowest.
Also, I said 'the majority', which is not the same as making it seem like it
is only 10%. So it's not '10%' or '40%': for this particular workload, on
average you're hovering somewhere between 25 and 30%; since you seem to prefer
the higher number, make that 30%. And if you read a bit more you would see
that there are some factors that could pull that number down considerably in
different situations, and only one major one (large file serving, streaming)
that would pull it up.

If I get around to it (I'm on holiday right now) I'm going to instrument
some more sites to measure their active-to-total ratios, to see if there is
some kind of number (or a set of them) that we might be able to agree on when
it comes to real-life web traffic. One of them is a filedump, which serves up
large media files to clients; that's the use case I can come up with in which
your scheme might pay off, and in which it is possibly advantageous (even if
it is a very small advantage) to use poll.

You keep 'challenging' people to come up with numbers. So I did, from a real-
life webserver serving real customers, lots of them. If 60% is the cutoff
point (and we're assuming that you are right about that) then 40% is
comfortably under that, not 'much higher' than what I wrote in my previous
comments: the majority (60 to 90%) of the fds are idle. So it's not into
'equal performance levels' yet.

But as far as I'm concerned even that is probably moot (though it needs more
testing), because I suspect that even this fairly high-traffic server spends
only a very small amount of time polling the OS, and a much larger amount
shovelling bits in and out of the system.

You keep using that word 'science' as though measuring anything at all is
automatically science.

But what you're doing is more like selecting the data that fits your 'who
needs epoll' and 'epoll is bad' hypothesis and rejecting out of hand any
experience and/or experimental data that seems to prove you wrong.

That's not science. Science is using all the data available to you, not just
the juicy bits. Not just those bits that allow you to make bold claims about
discoveries when in fact all you've done is show that the underlying
implementation of one system call is different and has different trade-offs
than another, which may be specific to the way _your_ system is set up.

I agree that to test something you need to reduce it to the minimum of code
required to do your tests. But when you go that route you risk doing what so
many other database, CPU and application benchmarkers forget: the real world
is _nothing_ like a benchmark.

You suggest that this is 'not a localhost test'
(<http://news.ycombinator.com/item?id=1573195>) and you say so pretty loudly.
That would mean that there may be some validity to your data that I have not
been able to find to date in the stuff you've provided.

I'm challenging you to provide the data that apparently is still missing that
proves that this was not 'a localhost test'.

And if you don't supply that data then I think you should take back (and try
to do it graciously for a change) the claim that this wasn't a localhost test,
or you're going to have to come up with some definition of what you consider a
localhost test to be.

For me, that definition is simple, a localhost test is a benchmark that never
generates any network traffic.

Feel free to disagree.

~~~
gthank
Having followed most of the comments on both these threads, I don't think it's
fair to characterize Zed's position as saying "epoll is always bad". I think a
better characterization would be: "epoll is not automagically better under
every imaginable scenario".

You may argue that nobody is making any such claim, but I can tell you that
the sections of the blogosphere I skim via HN, reddit, etc. do _imply_ it.
Maybe the authors just assume everyone can infer that every decision has
tradeoffs and there's no need to discuss them in depth for this particular
case, or maybe—and the nature of the blogosphere leaves me inclined to think
that this is the case—most of those authors never consciously thought about
those tradeoffs because epoll happens to be a good fit for their particular
problem. Either way, Zed is the first person I've seen trying to explicitly
identify and quantify those trade-offs for poll vs. epoll. As such, I think
he's doing the community a service.

If this research leads to a better implementation of epoll, or a server that
requires less tuning to do well under a particular load, that is great for
everyone. Even if it doesn't lead to any great improvements, a better-educated
developer population that makes more informed decisions is still a net win for
the world.

~~~
zedshaw
Thank you. I very much appreciate your comments.

~~~
gthank
I appreciate the work you're doing. I'm not sure I've worked on a system that
would really benefit from Mongrel2, but I think it's a pretty interesting
concept, and there's always a chance I work on something more interesting in
the future. Even if I don't, I almost always learn something or gain a
different perspective from reading your stuff, and that's reason enough to be
thankful.

