

The C10M problem (2013) - gaigepr
http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html

======
nkurz
Lots of great discussion here about a year ago:
[https://news.ycombinator.com/item?id=5699552](https://news.ycombinator.com/item?id=5699552)

The line that jumped out at me from the article this time was "It costs 300
clock cycles to go out to main memory, at which time the CPU isn’t doing
anything." Since I last read this article, I've learned that this isn't
really true.

First, with current memory and bus speeds (Sandy/Ivy/Haswell), it's closer to
100 cycles (or 150 with a single level of TLB miss, although this miss can
often be avoided with HugePages). But more importantly, there is no reason for
the CPU to be doing nothing during this latency.

A core can have about 10 outstanding memory requests at once, so if you queue
your requests for an instant you can issue a prefetch well ahead of when you
need the data and keep working on the current request. By the time a request
reaches the top of the queue, its data is in L1 and the load takes under 10
cycles. This doesn't reduce the latency of any individual request, but it can
help a lot when you're working against a tight cycle budget.

~~~
eloff
You can't always prefetch though. Look at linked lists and trees. You have to
execute the code and wait the 100+ cycles at every node access, because you
can't know where the next node is until the current one has loaded.

~~~
twoodfin
"Doctor, it hurts when I do this!"

If you're hoping to be able to service a request in a few hundred (dozen?)
cycles, you'll find your choice of data structures severely limited.

That being said, it would be interesting to see how much smarter a CPU could
make prefetch. I know there has been a lot of research over the years into
prefetch helper threads[1] that would speculatively execute code along both
sides of branches to attempt to pull forward as many memory requests as
possible. As I understand it, most attempts to implement this in practical
systems have been failures.

[1]
[http://cseweb.ucsd.edu/~swanson/papers/ASPLOS2011Prefetching...](http://cseweb.ucsd.edu/~swanson/papers/ASPLOS2011Prefetching.pdf)

~~~
eloff
Well, I'm going to try something like a prefetch thread soon with some common
request types that just fetch/modify a single object. It will be interesting
to see whether it makes any difference to throughput. Something less
speculative that helps a lot is simply hoisting the loads in your code as far
ahead of where you use their results as possible.

------
mturmon
Besides the discussion a year ago pointed out by @nkurz, there is also this
from 77 days ago --

[https://news.ycombinator.com/item?id=7250505](https://news.ycombinator.com/item?id=7250505)

~~~
gaigepr
I must have missed that; thanks for pointing it out.

------
iseyler
This article highlights the goals for BareMetal OS
([https://github.com/ReturnInfinity/BareMetal-
OS](https://github.com/ReturnInfinity/BareMetal-OS)) - let the app do the
heavy lifting.

~~~
gaigepr
I'm surprised things like this aren't more common. Places like Google strike
me as great candidates for bare-metal OS setups to most effectively serve
billions of requests per day.

