The C10M problem (2013) (highscalability.com)
34 points by gaigepr on May 5, 2014 | 14 comments

Lots of great discussion here about a year ago: https://news.ycombinator.com/item?id=5699552

The line that jumped out at me from the article this time was "It costs 300 clock cycles to go out to main memory, at which time the CPU isn’t doing anything." Since the last time I read this article, I've learned this isn't really true.

First, with current memory and bus speeds (Sandy/Ivy/Haswell), it's closer to 100 cycles (or 150 with a single level of TLB miss, although this miss can often be avoided with HugePages). But more importantly, there is no reason for the CPU to be doing nothing during this latency.

You can have 10 outstanding memory requests per core, so if you queue your requests briefly, you can issue a prefetch well ahead of when you need the data and keep working on the current request. That way you wait only sub-10 cycles for a load from L1 by the time the request reaches the top of the queue. This doesn't reduce the response latency of each individual request, but it can help a lot with overcoming a limited cycle budget.

Don't forget that most CPUs' parallelism isn't just hyperthreading or prefetch. Most CPUs nowadays have 5-way or wider superscalar architectures: they can execute 5 or more instructions at the same time, in a single thread. If one instruction is stalling on memory, they can still run several other independent instructions. Of course, depending on the actual code, inter-instruction dependencies may not allow the processor to do this.

This article is short on actual benchmarks, and it never says what actual work is done per connection. If you're concerned about the cost of memory misses, trying to cram more connections onto one core is just going to make it worse. Is the goal static file serving? A chat server? You can't do much more than that in 1400 cycles.

You can't always prefetch though. Look at linked lists and trees. You have to execute the code and wait the 100+ cycles at every node access because you can't predict where the next node will be.

"Doctor, it hurts when I do this!"

If you're hoping to be able to service a request in a few hundred (dozen?) cycles, you'll find your choice of data structures severely limited.

That being said, it would be interesting to see how much smarter CPU prefetching could get. I know there has been a lot of research over the years into prefetch helper threads[1] that speculatively execute code along both sides of branches to pull forward as many memory requests as possible. As I understand it, most attempts to implement this in practical systems have been failures.

[1] http://cseweb.ucsd.edu/~swanson/papers/ASPLOS2011Prefetching...

Well, I'm going to try something like a prefetch thread soon with some common request types that just fetch/modify a single object. It will be interesting to see if that makes any difference to throughput. Something less speculative that helps a lot is just hoisting the loads in your code as far ahead of where you use them as possible.

Yes, but this puts a lower limit on the latency of each individual request, not on the throughput. Consider having 4 3-level tree lookups that you need to perform. If the first level hits L1, the second L3, and the 3rd is in RAM, you'd have about 10 + 40 + 100 = ~150 cycles of minimum latency for each request.

Serially, you'd have ~600 cycles of total latency for the 4 lookups. But if you can do the lookups in batches of 4, you can overlap your latencies and increase your throughput: issue the 4 first-level lookups in parallel, then the 4 second-level, then the 4 third-level, in very close to the same ~150 cycles that a single complete lookup requires.

As an alternative to prefetch, I don't think there's a technical barrier to Intel increasing the number of hyperthreads that can be scheduled on a single core. POWER8 gets up to eight.

Here is a somewhat related question on stack overflow.


> Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?

> If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.

> Not long ago, I was doing performance testing on a machine with two quad-core CPUs running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.

> One thing for sure: 4k threads will take longer. That's a lot of context switches.

This is not related at all. Hyperthreads are hardware threads that share the execution resources of a core. They have nothing to do with OS threads, other than that an OS thread is scheduled to run on a hardware thread.

This article highlights the goals for BareMetal OS (https://github.com/ReturnInfinity/BareMetal-OS) - let the app do the heavy lifting.

I'm surprised things like this aren't more common. Places like Google strike me as great candidates for bare-metal OS setups, to most effectively serve billions of requests per day.

If I understand that project correctly, the assumption is that hand-written assembly can be optimized better than what a compiler produces?

Besides the discussion a year ago pointed out by @nkurz, there is also this from 77 days ago --


I must have missed that; thanks for pointing it out.
