The line that jumped out at me from the article this time was "It costs 300 clock cycles to go out to main memory, at which time the CPU isn’t doing anything." Since the last time I read this article, I've learned this isn't really true.
First, with current memory and bus speeds (Sandy/Ivy/Haswell), it's closer to 100 cycles (or 150 with a single level of TLB miss, although this miss can often be avoided with HugePages). But more importantly, there is no reason for the CPU to be doing nothing during this latency.
You can have roughly 10 outstanding memory requests per core, so if you queue your work up, you can issue a prefetch well ahead of when you need the data and keep working on the current request in the meantime. By the time a request reaches the front of the queue, you're only waiting the sub-10 cycles of an L1 load. This doesn't reduce the response latency of any individual request, but it can help a lot with squeezing work into a limited cycle budget.
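A minimal sketch of that pattern, using GCC/Clang's `__builtin_prefetch` (the function name, the indirect-sum workload, and the lookahead distance of 8 are all illustrative assumptions, not anything from the article):

```c
#include <stddef.h>

/* Sketch: sum values reached through an index array. While working on
 * entry i, issue a prefetch for the entry needed a few iterations
 * ahead, so its cache miss overlaps with the current work. The
 * lookahead distance (8 here) is a tunable guess, not a magic number. */
long sum_indirect(const long *values, const size_t *idx, size_t n)
{
    const size_t lookahead = 8;
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + lookahead < n)
            __builtin_prefetch(&values[idx[i + lookahead]], 0, 1);
        total += values[idx[i]];
    }
    return total;
}
```

The payoff depends on there being enough work per iteration to hide behind, and on the lookahead distance roughly matching the miss latency; both need measuring on real hardware.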
This article is really short on actual benchmarks, and it doesn't say what actual work is done per connection. If you're concerned about the cost of memory misses, trying to cram more connections onto one core is just going to make it worse. Is the goal static file serving? A chat server? You can't do much more than that in 1400 cycles.
If you're hoping to be able to service a request in a few hundred (dozen?) cycles, you'll find your choice of data structures severely limited.
That being said, it would be interesting to see how much smarter a CPU could make prefetch. I know there has been a lot of research over the years into prefetch helper threads that would speculatively execute code along both sides of branches to attempt to pull forward as many memory requests as possible. As I understand it, most attempts to implement this in practical systems have been failures.
Serially, you'd have ~600 cycles of total latency for the 4 lookups. But if you can do the lookups in batches of 4, you can overlap your latencies and increase your throughput. You can issue the 4 parallel first-level lookups, then the 4 second-level, and then the 4 third-level in very close to the same ~150 cycles that a single complete lookup requires.
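Here's one way that batching might look in code. This is a hypothetical 3-level, 256-way lookup structure invented for illustration; the point is only that the four misses at each level are issued back-to-back so they overlap, rather than each key walking all three levels serially:

```c
#include <stddef.h>

/* Hypothetical 3-level trie: each node has 256 children, leaves hold
 * a value. Keys are 3 bytes, one byte per level. */
struct node { struct node *child[256]; long value; };

/* Look up 4 keys at once. At each level we advance all 4 walks and
 * prefetch the node each one will touch next, so the 4 cache misses
 * per level are in flight simultaneously. */
void lookup_batch4(struct node *root, unsigned char (*keys)[3], long *out)
{
    struct node *cur[4];
    for (int j = 0; j < 4; j++) {          /* level 1 */
        cur[j] = root->child[keys[j][0]];
        __builtin_prefetch(cur[j]);
    }
    for (int j = 0; j < 4; j++) {          /* level 2 */
        cur[j] = cur[j]->child[keys[j][1]];
        __builtin_prefetch(cur[j]);
    }
    for (int j = 0; j < 4; j++)            /* level 3: read the leaves */
        out[j] = cur[j]->child[keys[j][2]]->value;
}
```

An out-of-order core can discover some of this parallelism on its own, but restructuring the loop this way makes the independence explicit instead of hoping the reorder window is deep enough.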
> Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?
> If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point, they cause some performance degradation.
> Not long ago, I was doing performance testing on a 2 quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
> One thing for sure: 4k threads will take longer. That's a lot of context switches.