
Analysis: more than 16 cores may well be pointless - alexandros
http://arstechnica.com/news.ars/post/20081207-analysis-more-than-16-cores-may-well-be-pointless.html
======
magoghm
The UltraSparc T2 has 8 cores and each core can run 8 concurrent threads, so
you have 64 concurrent threads running "simultaneously" on a single chip:
<http://en.wikipedia.org/wiki/UltraSPARC_T2> One of the techniques they use to
solve the memory bandwidth problem is to have four memory controllers.

~~~
lallysingh
Yeah, but the way it does that is by switching between hardware threads
whenever the current one stalls on memory.

It's brilliant, and seems effective, but it's still just 8 cores per socket.
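That switching trick can be sketched with a toy simulation. The numbers below (4 compute cycles between loads, 8-cycle memory latency) are made up for illustration, not real T2 timings; the point is just that a core which skips stalled threads keeps its pipeline far busier:

```python
def utilization(n_threads, cycles=10_000, work=4, latency=8):
    """Toy model of fine-grained multithreading: each hardware thread
    computes for `work` cycles, then issues a load and stalls for
    `latency` cycles.  Each cycle the core runs the first non-stalled
    thread it finds; return the fraction of cycles doing useful work."""
    ready_at = [0] * n_threads      # cycle at which each thread unstalls
    budget = [work] * n_threads     # compute cycles left before next load
    useful = 0
    for cycle in range(cycles):
        for i in range(n_threads):  # scan for a runnable thread
            if ready_at[i] <= cycle:
                useful += 1
                budget[i] -= 1
                if budget[i] == 0:  # thread issues a load and stalls
                    ready_at[i] = cycle + latency
                    budget[i] = work
                break               # one instruction issued per cycle
    return useful / cycles

print(utilization(1))  # single thread: mostly stalled
print(utilization(8))  # 8 threads: stalls hidden almost completely
```

With one thread, utilization is roughly work/(work+latency); with eight, another thread is nearly always ready to run.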

~~~
Retric
This is useful for getting around latency issues: it lets you use more of
your memory bandwidth before stalling a core.

------
andr
Why not do for memory controllers the same thing we are doing for cores? If 8
cores is the practical limit of a single memory controller, why not ship
16-core CPUs with 2 independent controllers, one for each set of eight cores?

True, the two groups of cores would probably not be able to share memory (at
least not at full speed), but that's something the OS scheduler and memory
manager could take care of.
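The OS-side half of that could look something like this sketch: pin each worker to the core group behind one controller, so its memory stays local. The core-to-controller mapping here is invented (cores 0-7 on controller 0, 8-15 on controller 1), and `os.sched_setaffinity` is Linux-only:

```python
import os

# Hypothetical mapping of core groups to independent memory controllers.
GROUPS = {0: set(range(0, 8)), 1: set(range(8, 16))}

def pin_to_group(pid, group):
    """Restrict a process to the cores behind one memory controller,
    so the OS memory manager can keep its pages on local memory."""
    os.sched_setaffinity(pid, GROUPS[group])

# pin_to_group(0, 1)  # would move the calling process onto cores 8-15
```

The actual call is commented out since it assumes a 16-core Linux box; the point is that the scheduler, not the application, decides which controller a process lives on.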

For one, there are plenty of pins. The socket used by the Core i7 more than
doubled the number of pins (it's in the 1300s now), yet a QuickPath
Interconnect link (Intel's replacement for the FSB) only takes 84 pins.
Surely they can squeeze those in.

~~~
comatose_kid
I would guess that having a high speed cache on chip would be a better use of
real-estate.

Going with multiple mem controllers means that performance would still
probably be pretty variable, depending on how the mem controllers are mapped
to your physical addresses + how your data is laid out.

------
kragen
Memory latency problems can be worked around with pipelined requests, if you
have enough concurrency in the CPU; memory bandwidth problems can be worked
around with more independent banks of memory and more memory channels to the
CPU, which means more wires. Both can be worked around by putting more memory
locally.
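The pipelining workaround is just Little's law: sustained bandwidth equals requests in flight times line size divided by latency. The figures below are assumptions for illustration, not any real part's numbers:

```python
# Little's law applied to the memory system: how much bandwidth can
# one core sustain if it keeps several cache-line requests in flight?
latency_ns = 60     # DRAM access latency (assumed)
line_bytes = 64     # cache line size (assumed)
outstanding = 10    # memory requests in flight per core (assumed)

bw_per_core = outstanding * line_bytes / (latency_ns * 1e-9)  # bytes/s
print(f"{bw_per_core / 1e9:.1f} GB/s per core")
```

With these numbers a single core can sustain about 10.7 GB/s, but only if it actually has ten independent requests to issue, which is exactly the concurrency requirement mentioned above.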

Absent such workarounds, it's true that there are some problems (large-mesh
finite element analysis, maybe) that more cores won't help with, and other
problems that exhibit enough locality (large cellular automata) that more
cores _will_ help with. This article observes that some problems are in the
first category. It would be absurd to claim that no problems are in the second
category.

------
tlrobinson
Perhaps this would be an extremely inefficient use of chip real estate, but
what if extra cores were used to execute the same set of instructions in
parallel with both outcomes of conditional branches (i.e. completely replace
branch prediction) and take whichever one ends up being correct?
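In software terms the idea looks like the sketch below. This is a hypothetical analogue only: real eager execution would happen in hardware, per branch, with the wrong path squashed before it retires.

```python
from concurrent.futures import ThreadPoolExecutor

def eager_branch(cond, if_true, if_false):
    """Run both sides of a branch in parallel, then keep whichever
    result the condition actually selects -- no prediction needed."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        taken = pool.submit(if_true)        # both paths start...
        not_taken = pool.submit(if_false)
        # ...and the (possibly slow) condition picks the survivor
        return taken.result() if cond() else not_taken.result()

print(eager_branch(lambda: 3 > 2, lambda: "taken", lambda: "fallthrough"))
```

The cost is obvious even in the sketch: both paths always execute, so half the work is thrown away every time.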

On second thought, it's probably not worth it. It seems modern branch
predictors are at least 90% accurate
(<http://en.wikipedia.org/wiki/Branch_predictor>), and only a few percent of
instructions are conditional branches anyway
([http://bloggablea.wordpress.com/2007/04/27/so-does-anyone-
ev...](http://bloggablea.wordpress.com/2007/04/27/so-does-anyone-even-use-all-
these-darn-cpu-instructions/)).

------
jbert
If a problem is parallelisable to that extent, are the cores actually hitting
the same memory? (Presumably not, since you don't want multiple threads
mutating the same memory).

So the problem is really that we have:

[lots of cores] <=chokepoint of memory bus=> [lots of ram]

when we could have:

k times: [1/k of our cores] <=single mem bus=> [1/k of our RAM]

That increases our aggregate memory bandwidth by a factor of k. k has to be
<= the number of cores, and the problem being solved needs to be
partitionable into k chunks.
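The factor-of-k claim in numbers, with an assumed per-bus bandwidth:

```python
# k independent bus+RAM pairs give k times the aggregate bandwidth,
# provided the working set partitions cleanly across them.
bus_gbps = 12.8     # bandwidth of one memory bus (assumed)
cores = 16

for k in (1, 2, 4):
    agg = k * bus_gbps
    print(f"k={k}: {agg:.1f} GB/s aggregate, "
          f"{agg / cores:.1f} GB/s per core")
```

Each doubling of k doubles the per-core share, which is the whole appeal of the clustering approach below.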

This is basically the clustering approach (individual proc+ram working on the
problem), with the added advantage that we can leave some of the RAM unsplit
so we get 'local' shared RAM for free.

~~~
wmf
No, the crossbar between the cores and memory controllers is not a problem.
The problem is getting enough pins to attach more than four memory channels.

~~~
jbert
OK, thanks. So the problem is that transistor density is going up faster than
pin density? Makes sense.

I guess one approach to that is a bigger die size, so you end up with a fixed
transistor/cm^2? That would mean die sizes doubling with Moore's law I guess.
So you'd want other ways of packing them in (use the flipside of the mobo, try
3d arrays and suffer heat problems).

Or the cores come with attached (non-shared) RAM, which is of course where we
are with adding cache.

Other random thoughts: why have memory busses stayed parallel when peripheral
busses (SCSI, USB) have gone serial? That would reduce the pin count per
connection.

Can we avoid going to full 'macro' pins for the CPU-memory bus (and thus pack
more pins into the same area for memory connections)? Instead have a smaller,
denser collection of pins which are attached as a group to each memory
connector?

Sorry for being clueless and thinking aloud, but it's an interesting problem.

~~~
wmf
I don't think the die size is a problem; AFAIK you can get very many pins out
of a die, but getting them out of the package is where it gets expensive.

FB-DIMM is already a serial-style high-frequency interconnect, but it isn't
needed on low-end systems.

I think this is really about business, not technology. What they want is not
what the mainstream wants, so they must choose between cheap but memory
starved systems or very expensive balanced systems. HPC people have been
whining about killer micros for 20 years; this is just another version of it.

------
petercooper
_I think there is a world market for maybe five computers._

~~~
vizard
Did you actually read the article? How is it in any way related to the "five
computers" thing? The article is talking about an architectural limitation,
not a business/consumer one.

~~~
petercooper
Claiming limitations when it comes to technology usually proves short-sighted.
The quotes about five computers, 640k of memory, etc, are not formally
comparable but they give the anecdotal hint that limitations will always be
beaten.

The point made in this article surrounds current architectures and their
limitations. The conclusion that more than 16 cores makes no sense might well
be a good conclusion for now, but "more than 16 cores may well be pointless"
is by no means a conclusion for the long or even mid term.

~~~
jshen
"they give the anecdotal hint that limitations will ALWAYS be beaten."

Don't you mean usually? I haven't seen the big breakthroughs in AI that were
expected. I haven't seen a solution to the halting problem, etc, etc.

~~~
rbanffy
Because, by definition, when an AI breakthrough is achieved, it's no longer
AI...

~~~
jshen
that's a good soundbite, but it's nonsense.

AI hasn't lived up to the promises made a few decades ago.

* 1958, H. A. Simon and Allen Newell: "within ten years a digital computer will be the world's chess champion" and "within ten years a digital computer will discover and prove an important new mathematical theorem."[53]

* 1965, H. A. Simon: "machines will be capable, within twenty years, of doing any work a man can do."[54]

* 1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved."[55]

* 1970, Marvin Minsky (in Life Magazine): "In from three to eight years we will have a machine with the general intelligence of an average human being."[56]

~~~
Retric
I like how you found 2 people who support your argument, but there is a gap
separating predictions from promises. Many classical definitions of AI have
been surpassed, but because people are still better at a variety of tasks we
say we don't have AI.

PS: A digital computer is the world's chess champion, or would be if we let
them compete. Making a useful CAPTCHA is hard, but computers don't compose
poetry, so we can still say we don't have AI.

~~~
jshen
poetry is a bad example. A computer still can't translate from one language to
another at a level anyone would trust for something important.

~~~
Retric
Computer translations fall somewhere between a freshman HS language student
and an expert. I am not going to trust them for diplomatic negotiations, but I
have still used them for some verifiable tasks. Voice-to-text fits the same
pattern: if you can't type then it's OK, but if you need a high level of
accuracy, use a person. Which IMO describes the state of most computer AI:
it's picking stocks and rejecting parts, but I still want a real doctor.

~~~
jshen
Exactly: AI hasn't lived up to the promises, which was my original point.
Remember, I was responding to this quote,

"they give the anecdotal hint that limitations will ALWAYS be beaten."

Limitations will not always be beaten.

------
jsmcgd
How does this development affect the issue then?

<http://news.ycombinator.com/item?id=389857>

------
s3graham
Assuming "stacking memory chips on top of the processor" means local stores
for each processor, it's pretty much a given at this point. That's how GPUs
work (~16k local), and that's how the PS3 SPUs are (256k local).

It's _extraordinarily_ painful to code for, but hey, all performance
optimization is an exercise in caching. So it goes.

~~~
lallysingh
Yeah, essentially the processor cores start looking like ccNUMA boxes. The
late 1990s called, they want their architecture back :-)

IMHO, it looks like we'll need some smarter memory bus management. If we're
looking at the 1990s, anyone remember the crossbar switches SGI used to put in
their short-lived x86 boxes? Thoughts on effectiveness?

~~~
ars
That's what I was going to post. Use NUMA - each core gets its own memory.
Doesn't Linux support NUMA?

It seems to me that each thread already has its own memory space; just make
sure the memory space for the thread is on the same CPU the thread runs on.

It isn't really necessary for each core to be able to access all memory (or at
least it'll be way slower to access memory outside its area).

------
biohacker42
Don't super computers have shared direct memory access of all the memory for
all CPUs?

Perhaps it is time for home desktops to adopt the super computer architecture!

~~~
biohacker42
Alright I'm not a hardware engineer I don't even like assembly.

Can somebody explain how mainframe DMA is different from the DMA in your home
PC?

I know mainframes have more than 16 CPUs. The memory problem we're talking
about here does not apply to them, right?

------
markessien
Why stack? Just make it an onion.

------
dilanj
"640K ought to be enough for anybody." \- (Never actually said by) Bill Gates

~~~
tlrobinson
That quote is irrelevant, except that both the quote and title of the article
are misleading.

The real title should have been "With the current architectures and/or memory
speeds, more than 16 cores may well be pointless".

~~~
DenisM
"may or may not be pointless, becasue author couldn't find the data to backup
his point".

There, FTFY.

~~~
tlrobinson
Well, the basic premise of the article is common sense. The memory needs to
feed the processor instructions and data.

If the clock speed and number of cores increases at a faster rate than memory
speed increases (assuming all the cores share memory) then at some point the
memory can't keep up.

Whether or not 16 cores is the magic number, I don't know.
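That premise can be put in toy numbers (all assumed for illustration): demand scales with core count while the shared bandwidth doesn't, so at some core count memory stops keeping up.

```python
# Supply vs demand on a shared memory bus as cores are added.
mem_bw = 25.6            # GB/s from the shared memory system (assumed)
demand_per_core = 2.0    # GB/s each core wants on cache misses (assumed)

for cores in (4, 8, 16, 32):
    demand = cores * demand_per_core
    served = min(1.0, mem_bw / demand)
    print(f"{cores} cores: {served:.0%} of demand served")
```

With these invented figures the bus saturates somewhere between 8 and 16 cores; whether the real crossover is 16 depends entirely on the numbers you plug in.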

~~~
Retric
You're assuming a linear relationship between memory access and the number of
cores, but the size of the L2/L3 cache mitigates the problem to some degree.
We already have chips with 12+ MB of cache, which is plenty of RAM to run
Windows 3.11. The real question is what type of workload you have and how well
it plays with the number of cores you're using, etc.

