
Amdahl's law in reverse: the wimpy core advantage - mmastrac
http://www.yosefk.com/blog/amdahls-law-in-reverse-the-wimpy-core-advantage.html
======
rayiner
This isn't really correct, because it ignores the fact that modern out-of-
order CPUs also have out-of-order memory pipelines.

The reason the four slow CPUs are faster in the author's example is that they
are issuing four memory requests concurrently, while the single CPU is
issuing one request at a time. If requests have a fixed, high memory latency,
the four CPUs will get back four times as many results in each waiting period
as the single CPU. This presupposes that the problem has enough memory
parallelism to support four memory requests in flight at once.

But if the problem is highly memory-parallel, then the faster CPU can take
advantage of that memory parallelism just as easily. It doesn't have to issue
only one request at a time and then wait until it is fulfilled. It can shoot
off four requests at a time and wait for them all to come in. Indeed,
"brawny" CPUs tend to also have "brawny" memory controllers that can support
more simultaneous requests.
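
The arithmetic here is just Little's law: throughput = requests in flight /
latency. A minimal sketch in C, with the 100 ns DRAM latency an assumed
round number rather than a measurement:

    #include <stdio.h>

    int main(void) {
        /* Assumed DRAM latency; real numbers vary by system. */
        double latency_s = 100e-9;
        /* Little's law: throughput = outstanding requests / latency.
           Four requests in flight (four wimpy cores, or one brawny
           core's out-of-order memory pipeline) complete four times
           as many loads per second as one. */
        for (int in_flight = 1; in_flight <= 4; in_flight++)
            printf("%d in flight: %.0f Mreq/s\n",
                   in_flight, in_flight / latency_s / 1e6);
        return 0;
    }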

~~~
_yosefk
I agree with everything except for "just as easily". It's trivial to extract
memory parallelism from separate instruction streams and heroic to impossible
for a single stream. Think processing 1000 linked lists sequentially vs
processing 4 groups of 250 lists each. Which OOO CPU will parallelize the
processing of 4 lists given a program processing 1000 lists serially?

Even "rather wimpy" cores - single-issue cores branded as in-order - can issue
"outstanding loads" (issue a request, keep executing until the result is
needed). It's way way better than nothing, but it's also very far from
succeeding as consistently as separate instruction streams.
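
One rough way to express such an outstanding load in C is the GCC/Clang
__builtin_prefetch intrinsic; a sketch under that assumption, not a claim
about how any particular core schedules it:

    struct node { struct node *next; long val; };

    /* Ask for the next node's cache line before working on the
       current one, so the miss can overlap with the work.  With
       trivial per-node work there is almost nothing to overlap it
       with, which is exactly the limit of a single stream. */
    long sum_list(struct node *head) {
        long sum = 0;
        for (struct node *n = head; n; n = n->next) {
            if (n->next)
                __builtin_prefetch(n->next);
            sum += n->val;  /* stand-in for real per-node work */
        }
        return sum;
    }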

~~~
rayiner
> I agree with everything except for "just as easily". It's trivial to extract
> memory parallelism from separate instruction streams and heroic to
> impossible for a single stream. Think processing 1000 linked lists
> sequentially vs processing 4 groups of 250 lists each.

You're not comparing like with like. Do you have one linked list with 1,000
nodes, or do you have four linked lists with 250 nodes each? Those are two
different problems. The latter has 4x as much memory parallelism as the
former (and is probably the canonical pedagogical example of memory
parallelism). A single out-of-order CPU with an out-of-order memory pipeline
can traverse four linked lists in parallel just as easily as four CPUs can
each traverse one sequentially. Just keep four cursors and advance each one
in round-robin fashion.
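
A minimal sketch of that four-cursor traversal in C (the node layout and the
summing are illustrative, not from the thread):

    struct node { struct node *next; long val; };

    /* Four independent cursors advanced in lockstep: the four
       cur[i]->next loads don't depend on each other, so an
       out-of-order memory pipeline can keep all four cache misses
       in flight at once. */
    long sum_four_lists(struct node *heads[4]) {
        struct node *cur[4] = { heads[0], heads[1], heads[2], heads[3] };
        long sum = 0;
        int live;
        do {
            live = 0;
            for (int i = 0; i < 4; i++) {
                if (cur[i]) {
                    sum += cur[i]->val;
                    cur[i] = cur[i]->next;
                    live++;
                }
            }
        } while (live > 0);
        return sum;
    }

Lists of unequal length leave fewer live cursors toward the end, which is
the load-balancing gap raised in the reply below.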

~~~
_yosefk
I specifically said "1000 linked lists" with an unknown amount of nodes in
each.

"Just keep four cursors" - why four and not N? This involves rather ugly and
not quite portable/"scalable" code (different processors can issue different
amounts of outstanding memory requests). For a large N, you run out of
register names for variables, and register renaming only gets you so far.

There are cases where ILP is easier to express than TLP, and there are the
opposite cases; it's not "just as easy" - in many cases one is much easier
than the other. Having to keep multiple cursors is a perfect example of TLP
being easier; consider the fact that different lists have different lengths,
which a load-balancing scheduler can take care of nicely without any
programming effort, and which your single-threaded version will have trouble
with.
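
For concreteness, a sketch of the TLP version in C, with a shared atomic
counter standing in for the load-balancing scheduler (the names and the
per-list work are made up for illustration):

    #include <pthread.h>
    #include <stdatomic.h>

    struct node { struct node *next; long val; };

    struct work {
        struct node **lists;   /* heads of the 1000 lists */
        int nlists;
        atomic_int next;       /* next unclaimed list index */
        atomic_long sum;
    };

    /* Each thread claims whole lists from the shared counter; a
       thread that draws short lists simply claims more of them, so
       uneven list lengths balance out with no per-list tuning. */
    static void *worker(void *arg) {
        struct work *w = arg;
        long local = 0;
        for (;;) {
            int i = atomic_fetch_add(&w->next, 1);
            if (i >= w->nlists) break;
            for (struct node *n = w->lists[i]; n; n = n->next)
                local += n->val;
        }
        atomic_fetch_add(&w->sum, local);
        return NULL;
    }

    long sum_all(struct node **lists, int nlists, int nthreads) {
        struct work w = { lists, nlists, 0, 0 };
        pthread_t t[nthreads];
        for (int i = 0; i < nthreads; i++)
            pthread_create(&t[i], NULL, worker, &w);
        for (int i = 0; i < nthreads; i++)
            pthread_join(t[i], NULL);
        return w.sum;
    }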

------
johnpapdu
That's not Amdahl's law in reverse, that's just Amdahl's law as applied to
multicore. This was all worked out over a decade ago [1].

[1] M. Hill and M. Marty. Amdahl’s law in the multicore era. Computer, 1998.

~~~
ChuckMcM
Two things: first, here is a better link:

[http://minds.wisconsin.edu/bitstream/handle/1793/60554/TR159...](http://minds.wisconsin.edu/bitstream/handle/1793/60554/TR1593.pdf?sequence=1)

And second, that was done in 2007 (did they do an earlier version?)

It's worth a read.

~~~
solarexplorer
This is an excellent reference, I also thought of it when I read this post.
Here is the version from IEEE Computer. It's the same article, but has more
colors. ;-)

[http://research.cs.wisc.edu/multifacet/papers/ieeecomputer08...](http://research.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf)

------
qznc
Do you define "brawny" via higher clock frequency or more internal logic (out-
of-order execution,prefetching,etc) or both? I would say "both".

> Faster processors are exactly as slow as slower processors when they're
> stalled - say, because of waiting for memory;

But they might not wait as often, because they have bigger caches, better
prefetching logic, etc.

> Many slow processors are actually faster than few fast ones when stalled,
> because they're waiting in parallel;

Only if your memory supports those parallel accesses. If the memory access is
a bottleneck, they might just as well wait for each other.

> All this on top of area & power savings of many wimpy cores compared to few
> brawny ones.

The "brawny" cores still have a lot of room to improve their power saving.

This fight is not over. I believe reality will just stay complex, and the
answer will be "it depends" forever.

~~~
_yosefk
Of course this fight is not over, and I agree with your points to some degree
or other (especially "they might not wait as often" - memory access being the
bottleneck is in my experience less likely for a moderately sized multi-CPU
system; and as to room to improve power savings - muscle needs calories :)

In fact my point was only that the fight is never over, as a counterpoint to
"brawny cores beat wimpy cores most of the time" - not that the reverse
statement is true.

------
iyulaev
This isn't correct, because in either case you're almost certainly sharing
the same memory controller, which still has to service requests one at a
time (if they are randomly dispersed in memory). Furthermore, if the memory
access takes long enough, the single core can context-switch to another
process and run non-blocked instructions there. The 4-core version doesn't
magically have 4 times the memory pipelines, and the 1-core version won't
stupidly sit waiting for a memory access to return.

~~~
_yosefk
You don't need 4 times the memory pipelines. If you have one pipeline and
the cores issue few enough requests for _bandwidth_ not to be a problem,
then contention between cores only costs you 0-3 cycles of _latency_, which
is a tiny cost compared to DRAM latency these days (on the order of a couple
hundred cycles at multi-GHz clocks).

A more relevant factor is how many banks the DRAM has compared to how many
cores are issuing bank-missing requests in parallel.

------
wmf
Other than scrypt, it's hard to find examples of such memory-hard code.

