This isn't really correct, because it ignores the fact that modern out-of-order CPU's also have out-of-order memory pipelines.
The reason the four slow CPU's are faster in the author's example is that they are issuing four concurrent requests at once, while the single CPU is issuing one request at a time. If requests have fixed and high memory latency, the four CPU's will get back four times as many results in each waiting period as the single CPU. This presupposes that the problem has enough memory parallelism to support four memory requests in flight at once.
But if the problem is highly memory parallel, then the faster CPU can take advantage of that memory parallelism just as easily. It doesn't have to only issue one request at a time, then wait until it is fulfilled. It can shoot off four requests at a time and wait for them all to come in. Indeed, "brawny" CPU's tend to also have "brawny" memory controllers that can support more simultaneous requests.
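To make that concrete, here is a rough sketch (mine, not the author's) of what memory parallelism in a single instruction stream looks like: the four loads per iteration are independent of each other, so an out-of-order core can keep several cache misses outstanding at once instead of paying for them one at a time.

    #include <stddef.h>

    /* Hypothetical sketch: sum values gathered through an index array.
     * The four table[] loads per iteration don't depend on each other,
     * so the out-of-order engine can overlap their cache misses. */
    long gather_sum(const long *table, const int *idx, size_t n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += table[idx[i + 0]];   /* independent miss #1 */
            s1 += table[idx[i + 1]];   /* independent miss #2 */
            s2 += table[idx[i + 2]];   /* independent miss #3 */
            s3 += table[idx[i + 3]];   /* independent miss #4 */
        }
        for (; i < n; i++)             /* leftover tail */
            s0 += table[idx[i]];
        return s0 + s1 + s2 + s3;
    }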
>This isn't really correct, because it ignores the fact that modern out-of-order CPU's also have out-of-order memory pipelines.
Not only that, it's assuming that a processor with 4X the clock speed will be 4X faster, and then says "well if it isn't then having 4X more cores can be better than having 4X the clock speed." Which is obviously true, but ignores that Amdahl's law isn't about clock speed, it's about single thread performance. If the brawny core has 4X the single thread performance (clock speed being irrelevant) then it still always wins over four cores each with a quarter of the single thread performance.
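A quick back-of-the-envelope version of that claim (my numbers, just to make the argument explicit): call the wimpy core's single thread performance 1 unit and the brawny core's 4 units, let W be the total work and s the serial fraction. Then:

    % one brawny core (4x single-thread performance)
    T_{brawny} = \frac{W}{4}
    % four wimpy cores (1x each); the serial fraction runs on one of them
    T_{wimpy} = sW + \frac{(1-s)W}{4} = \frac{W}{4} + \frac{3sW}{4} \ge \frac{W}{4}
    % equality only at s = 0, so the brawny core never loses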
If there is anything to be salvaged from this, it's the idea that putting, say, two brawny cores next to a whole bunch of wimpy cores is a good design for heterogeneous workloads. Which is kind of what we have with CPU (few brawny cores) + GPU (many wimpy cores). Though what might be advantageous is an architecture where the brawny cores and the wimpy cores run the same instruction set, so that the OS can schedule code on the brawny cores first (or put the highest-priority threads there) and then, if there are still any ready threads left to run, put them on the wimpy cores, with the programmer not having to distinguish between the two when writing the code.
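From user space today, something in that spirit can already be approximated with affinity masks. This is only a sketch under assumptions the comment doesn't make - the core IDs, the big/little split, and the thread handles are all made up:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Hypothetical policy: pin high-priority work to the assumed "brawny"
     * cores (here 0-1) and background work to the assumed "wimpy" cores
     * (here 2-7). Same ISA, so the code itself doesn't change; only the
     * affinity mask does. */
    static int pin_to_cores(pthread_t t, const int *cores, int ncores)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < ncores; i++)
            CPU_SET(cores[i], &set);
        return pthread_setaffinity_np(t, sizeof(set), &set);
    }

    /* usage (assumed core numbering and thread handles):
     *   int big[]    = {0, 1};
     *   int little[] = {2, 3, 4, 5, 6, 7};
     *   pin_to_cores(latency_critical_thread, big, 2);
     *   pin_to_cores(background_thread, little, 6);
     */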
I didn't claim to disprove Amdahl's law - I argued with an assumption based on "Amdahl's law" that I put in quotes, exactly because there's a leap from real Amdahl's law to something that doesn't follow from it in that line of reasoning.
What you "salvaged" is not bad though :) And - ARM might be getting there if a licensee takes big.LITTLE to that direction as I pointed out.
I'm not sure you are really clear about the distinction between serial and parallel workloads. It seems you are referring mostly to memory access patterns which cause cache misses and dominate actual processing. Of course larger cores can throw more resources at this problem through resource sharing, hyperthreading, caching, and parallelism, not to mention that they can issue memory requests faster because they don't spend as much time processing the data they get in. Most designs for smaller cores are built around tasks that exhibit uniform data access patterns for precisely this reason.
Either:
1) you throw more cache at the problem or increase memory speed
2) you redesign your algorithm (almost always the best choice; a toy sketch follows below)
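As a toy illustration of option 2 (mine, not the parent's), the classic redesign is to stop chasing pointers and lay the data out contiguously, so the accesses become sequential and prefetchable:

    #include <stddef.h>

    /* Pointer-chasing version: every ->next is a dependent, likely-missing load. */
    struct node { struct node *next; double value; };

    double sum_list(const struct node *n)
    {
        double s = 0.0;
        for (; n != NULL; n = n->next)
            s += n->value;
        return s;
    }

    /* Redesigned version: the same data packed into a flat array, which the
     * hardware prefetcher handles almost for free. */
    double sum_array(const double *values, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += values[i];
        return s;
    }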
If memory access latencies are causing problems you aren't going to stop the bleeding by band-aiding the problem for very minimal gain. For most workloads it isn't worth making this trade. The Sparc T1 was very effective at hiding memory latency but suffered from very slow performance at each task, and that was with very heavily tuned workloads that exhibit high concurrency. It didn't matter if a task was blocked because another one was always ready.
The big.LITTLE design simply wouldn't run this type of workload. It isn't a case of "concurrency is hard"; I don't see the theoretical benefits here at all.
I agree with everything except for "just as easily". It's trivial to extract memory parallelism from separate instruction streams and heroic to impossible for a single stream. Think processing 1000 linked lists sequentially vs processing 4 groups of 250 lists each. Which OOO CPU will parallelize the processing of 4 lists given a program processing 1000 lists serially?
Even "rather wimpy" cores - single-issue cores branded as in-order - can issue "outstanding loads" (issue a request, keep executing until the result is needed). It's way way better than nothing but it's also very far from succeeding as consistently as separate instruction streams.
> I agree with everything except for "just as easily". It's trivial to extract memory parallelism from separate instruction streams and heroic to impossible for a single stream. Think processing 1000 linked lists sequentially vs processing 4 groups of 250 lists each.
You're not comparing like with like. Do you have one linked list with 1,000 nodes, or do you have four linked lists with 250 nodes each? Those are two different problems. The latter problem has 4x as much memory parallelism as the former one (and indeed, it is probably the textbook example for explaining memory parallelism). A single out-of-order CPU with an out-of-order memory pipeline can traverse four linked lists in parallel just as easily as four CPU's can do so sequentially. Just keep four cursors and advance each one in a round-robin fashion.
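Something like this sketch (the node layout and names are mine, not from the thread): one thread, four cursors, each pass advances every non-exhausted list by one node, so up to four ->next misses can be in flight at the same time.

    #include <stddef.h>

    struct node { struct node *next; long value; };

    /* One thread, four cursors: each pass advances every non-empty list
     * by one node, so up to four ->next misses are in flight at once. */
    long sum_four_lists(struct node *heads[4])
    {
        struct node *cur[4] = { heads[0], heads[1], heads[2], heads[3] };
        long total = 0;
        int live;
        do {
            live = 0;
            for (int i = 0; i < 4; i++) {
                if (cur[i] != NULL) {
                    total += cur[i]->value;
                    cur[i] = cur[i]->next;  /* independent miss per cursor */
                    live++;
                }
            }
        } while (live > 0);
        return total;
    }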
I specifically said "1000 linked lists" with an unknown number of nodes in each.
"Just keep four cursors" - why four and not N? This involves rather ugly and not quite portable/"scalable" code (different processors can issue different amounts of outstanding memory requests). For a large N, you run out of register names for variables, and register renaming only gets you so far.
There are cases when ILP is easier to express than TLP and there are the opposite cases; it's not "just as easy" - in many cases one is much easier than the other. Having to keep multiple cursors is the perfect example of when TLP is easier: consider the fact that different lists have different lengths, which a load-balancing scheduler can take care of nicely without any programming effort, and which your single-threaded version will have trouble with.
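For comparison, the single-threaded analogue of that load-balancing scheduler ends up looking roughly like the sketch below (WINDOW, the number of in-flight cursors, is a machine-dependent tuning knob - which is exactly the portability complaint): whenever one cursor hits the end of its list, its slot is refilled from the pool of remaining lists.

    #include <stddef.h>

    struct node { struct node *next; long value; };

    #define WINDOW 8   /* assumed number of in-flight cursors; machine-dependent */

    /* Process 'nlists' lists with a fixed window of cursors, refilling a
     * slot from the remaining lists whenever one list runs out. This is
     * the bookkeeping a thread scheduler would otherwise do for free. */
    long sum_many_lists(struct node *heads[], size_t nlists)
    {
        struct node *cur[WINDOW] = { 0 };
        size_t next_list = 0;
        long total = 0;

        for (;;) {
            int live = 0;
            for (int i = 0; i < WINDOW; i++) {
                if (cur[i] == NULL && next_list < nlists)
                    cur[i] = heads[next_list++];   /* refill an empty slot */
                if (cur[i] != NULL) {
                    total += cur[i]->value;
                    cur[i] = cur[i]->next;
                    live++;
                }
            }
            if (live == 0)
                break;
        }
        return total;
    }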
> It's trivial to extract memory parallelism from separate instruction streams and heroic to impossible for a single stream.
That's why the brawny CPUs are multithreaded and come as chip multiprocessors. But if you add too many instruction streams, bandwidth quickly becomes a serious problem. That's why a chip full of wimpy processors is not such a good idea. (except for a few special cases of course)
After all, there is a reason people still have brawny CPUs in their computers, instead of using the GPU for everything. The CPUs have to do all the dirty serial work while the GPUs play with their pixels...
This is an excellent reference; I also thought of it when I read this post. Here is the version from IEEE Computer. It's the same article, but has more colors. ;-)
Of course this fight is not over, and I agree with your points to some degree or other (especially "they might not wait as often" - memory access being the bottleneck is, in my experience, less likely for a moderately sized multi-CPU system; and as to room to improve power savings - muscle needs calories :)
In fact my point was only that the fight is never over, as a counterpoint to "brawny cores beat wimpy cores most of the time" - not that the reverse statement is true.
This isn't correct because in either case you're almost certainly sharing the same memory controller, which still has to service requests one at a time (if they are randomly dispersed in memory). Furthermore if the memory access takes long enough the single-core can context switch to another process and run non-blocked instructions there. The 4-core version doesn't magically have 4 times the memory pipelines, and the 1-core version won't stupidly sit waiting for a memory access to return.
You don't need 4 times the memory pipelines. If you have one pipeline and cores issue few enough requests for bandwidth to not be a problem, then contention between cores only costs you 0-3 cycles of latency, which is a tiny cost compared to DRAM latency these days.
A more relevant factor is how many banks the DRAM has compared to how many cores are issuing bank-missing requests in parallel.