

“256 cores by 2013”? - hamidr
http://herbsutter.com/2012/11/30/256-cores-by-2013/

======
varelse
W/r/t GPUs, I prefer to consider the SMs and SMXs as "cores" rather than the
ALU units within an SM/SMX. With that definition, a single K20 SMX can issue up to
256 instructions per cycle (4 dual-issues of instructions across 4 32-way
warps) and they each carry effectively 368K of L1 cache (64K true L1 plus an
enormous 256K register file plus 48K of read-only cache). Speaking from
experience, ~40 warps are needed to saturate the instruction pipeline. The
1.25 MB L2 is shared among the 15 SMXs. 208 GB/s of main memory bandwidth is
shared between the 15 SMXs providing each one with ~14 GB/s.

Each Xeon Phi core has a 32K L1 data cache and a 512K L2 data cache. A Xeon
Phi core can issue 16 SIMD operations/cycle and it needs 4 threads to saturate
the instruction pipeline. The 60 cores in a 5110P have to share 320 GB/s of
main memory bandwidth or 5.33 GB/s per core.

And all of this means that code customized for one processor (OpenCL) will run
like crap on the other one. The Xeon Phi needs that 30 MB of total individual
L2 cache to avoid slamming into memory bus contention while the K20 needs to
operate entirely inside its L1 cache and register file to hit peak
performance. Main memory fetches are less fatal to the K20 both because
optimized code will be running 40+ warps to bury the latency and because of
the significantly higher bandwidth.

What strikes me at this point is the absolute paucity of compelling Xeon Phi
benchmarks. All we see are SGEMM, DGEMM, and a bunch of synthetic tests.
They've had 6 years to get this right so why didn't they go after all the
jewels in NVIDIA's many-core crown from the get-go?

Finally, languages like OpenCL and CUDA subsume SIMD, multithreading, multi-
core, and cache optimization into the programming model, all but implicitly
forcing the programmer into optimizing many-core performance. In contrast,
Intel continues to expect programmers to use processor-specific intrinsics,
which change with vendor and processor generation, to hit peak performance.
Sure, it's easier to write serial code for a serial Intel core. But I thoroughly
disagree that it's easier to write many-core applications by adding processor
intrinsics and a threading library to fundamentally serial code.
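
To make concrete what "processor-specific intrinsics" means in practice, here's
a hypothetical sketch (not from any real codebase): the portable loop below may
or may not get vectorized for you, while the intrinsic version is welded to
8-wide AVX and has to be rewritten for SSE, AVX-512, or the Phi's 16-wide
vectors.

    #include <immintrin.h>  /* AVX intrinsics; needs -mavx and an AVX-capable CPU */

    /* Portable version: the compiler may or may not auto-vectorize this. */
    void add_scalar(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* Intrinsic version: tied to 8-wide AVX registers. A different vector
     * width or ISA generation means rewriting this function. */
    void add_avx(const float *a, const float *b, float *out, int n)
    {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)          /* scalar tail for the leftovers */
            out[i] = a[i] + b[i];
    }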

------
karterk
The mainstream languages currently do not offer enough abstraction for
developers to make use of hardware parallelism effectively. Threads do get the
job done, but nowhere near as elegantly as I would want. Until that happens, we
won't get to reap the full benefits of multi-core systems.

Edit: Worth mentioning that there have been lots of interesting things
happening in Haskell to support concurrency:
[http://stackoverflow.com/questions/3063652/whats-the-
status-...](http://stackoverflow.com/questions/3063652/whats-the-status-of-
multicore-programming-in-haskell)

~~~
pjmlp
Yeah, this is why functional languages are so appealing for multicore
programming.

Already in the mid-90s there was some HPC research with Lisp as the main
language.

Functional languages have the ability to work similarly to how SQL handles
queries, by abstracting away how the DB engine works.

Of course we are seeing these kinds of language constructs come to the
mainstream languages as well, because you really need some kind of
intelligent runtime to fully exploit parallelism.

It is a bit like assembly programming vs compilers. Sure, clever programmers
can beat most compilers when targeting simple processors, but when you level
up with out-of-order execution, multiple instruction pairing, multi-level
caches, NUMA, and so on, the compiler optimizer usually wins hands down.

~~~
pdhborges
> It is a bit like assembly programming vs compilers. Sure, clever programmers
> can beat most compilers when targeting simple processors, but when you level
> up with out-of-order execution, multiple instruction pairing, multi-level
> caches, NUMA, and so on, the compiler optimizer usually wins hands down.

I disagree. Real HPC software has its computation kernels hand-written (or
generated) in assembly. As for relying on the smartness of the compiler to
take all those factors into account, I invite you to recall or read up on the
story of the Itanium architecture.

~~~
pjmlp
Funny, I never saw hand-written assembly while doing research at CERN.

That seems like pretty real HPC to me.

~~~
cmccabe
I work on Hadoop, and we do dip into assembly sometimes. It's very rare. One
example is the CRC calculating code in HDFS.

If you are working in a research context, it's not really worth writing
assembly-- just getting it to work is more important. If you have to buy more
hardware, then just do it. Traditional HPC is not really known for being very
cost-sensitive.

When you're working in a commercial context, performance starts to matter
more. It's the same reason why a one-off hand-soldered electronic device isn't
built to the same standards as an iPod. If you're only building one, don't
waste time on polish.

~~~
pjmlp
> I work on Hadoop, and we do dip into assembly sometimes. It's very rare. One
> example is the CRC calculating code in HDFS.

Why aren't you using compiler vector operations for such a case?

On the project I used to work on, we were building the infrastructure to
perform real time data analysis for the data coming straight out of the
accelerator.

We have written custom memory allocators, our own network stack and protocols,
measured and optimized every operation in the years before the accelerator
went live.

The code is massively parallel in-core and distributed across the cluster.

No extra need for Assembly.

~~~
cmccabe
Intel has a CRC instruction built into the x86 instruction set on newer CPUs.
It turns out that using this instruction is a substantial performance win.

The problem with the "sufficiently smart compiler" argument, as always, is
that the compiler may have a lot of optimizations, but it's still not an
artificial intelligence. It can't tell that what you are trying to do with
your set of operations is actually perform CRC, and there is hardware support
for that.

Check it out at: [http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-
comm...](http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-
project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32.c)
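
To give a feel for what "using this instruction" looks like, here's a minimal
sketch of CRC32C via the SSE4.2 intrinsics (purely illustrative; Hadoop's
bulk_crc32.c linked above is more involved, with pipelining and a software
fallback):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <nmmintrin.h>  /* SSE4.2 intrinsics; needs -msse4.2 on x86-64 */

    /* CRC32C (Castagnoli) of a buffer using the hardware crc32 instruction.
     * Note this is CRC32C, not the CRC-32 polynomial used by zip/gzip. */
    uint32_t crc32c_hw(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFF;
        while (len >= 8) {                   /* 8 bytes per instruction */
            uint64_t chunk;
            memcpy(&chunk, buf, 8);          /* safe unaligned load */
            crc = (uint32_t)_mm_crc32_u64(crc, chunk);
            buf += 8;
            len -= 8;
        }
        while (len--)                        /* remaining tail bytes */
            crc = _mm_crc32_u8(crc, *buf++);
        return ~crc;
    }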

 _We have written custom memory allocators, our own network stack and
protocols, measured and optimized every operation in the years before the
accelerator went live._

I would actually argue that you don't need to do these things in most cases.
SCTP and DCCP are alternatives to TCP/IP that have been in the Linux kernel
for a while now. If you want to trash TCP/IP and go full custom, you can do so
without writing a line of protocol code. However, a better approach is to use
TCP/IP with tweaks like Fast Open (which is what Google uses internally). You can then
continue to buy commodity hardware.
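
As a concrete example of the "tweak TCP rather than replace it" approach,
enabling Fast Open on a Linux listening socket is roughly this (a sketch;
assumes a kernel/libc new enough to define TCP_FASTOPEN, error handling
omitted, and clients also have to opt in, e.g. via sendto() with MSG_FASTOPEN):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>    /* TCP_FASTOPEN (Linux 3.7+) */
    #include <sys/socket.h>

    int make_tfo_listener(int port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Allow up to 16 pending Fast Open requests: data can now ride
         * along with the SYN instead of costing an extra round trip. */
        int qlen = 16;
        setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));

        listen(fd, 128);
        return fd;
    }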

Similarly, memory allocators have been done to death. Just use tcmalloc or
jemalloc rather than rolling yet another malloc(). The exception is if you
want to create something like a slab allocator where you just hand out lots of
mostly identically-sized buffers, or a database-like application where you
manage your own writeback to disk.
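
To make the slab-allocator exception concrete, the entire idea is roughly this
(a deliberately minimal sketch: fixed capacity, no thread safety, no growth):

    /* Hand out fixed-size buffers from one statically allocated slab,
     * using a free list threaded through the unused objects. */
    #define OBJ_SIZE   64       /* every object is the same size */
    #define OBJ_COUNT  1024     /* slab capacity */

    static char  slab[OBJ_SIZE * OBJ_COUNT];
    static void *free_list;

    void slab_init(void)
    {
        for (int i = 0; i < OBJ_COUNT; i++) {
            void **obj = (void **)(slab + i * OBJ_SIZE);
            *obj = free_list;           /* push onto the free list */
            free_list = obj;
        }
    }

    void *slab_alloc(void)              /* O(1): pop the head */
    {
        void *obj = free_list;
        if (obj)
            free_list = *(void **)obj;
        return obj;
    }

    void slab_free(void *obj)           /* O(1): push it back */
    {
        *(void **)obj = free_list;
        free_list = obj;
    }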

~~~
pjmlp
> The problem with the "sufficiently smart compiler" argument, as always, is
> that the compiler may have a lot of optimizations, but it's still not an
> artificial intelligence. It can't tell that what you are trying to do with
> your set of operations is actually perform CRC, and there is hardware
> support for that.

This might be a special case, as you're taking advantage of a special-purpose
instruction, which portable code can't rely on anyway.

I have seen assembly programmers put to shame on modern processors by C and
C++ developers, just by making use of better data structures, better
algorithms, and the right mix of compiler flags.

Anyway thanks for the follow up, very interesting.

~~~
cmccabe
Thank you too. It's very interesting to see what folks in HPC are doing. If
your new layer-3 protocol is open source, send me a link -- I worked on an
alternative layer-3 protocol at one point, so I'd be interested to see where
you went with it.

~~~
pjmlp
I am no longer at CERN, but you can get some information here

<http://cdsweb.cern.ch/record/616089>

A more up to date link is available here, [http://atlas-proj-hltdaqdcs-
tdr.web.cern.ch/atlas-proj-hltda...](http://atlas-proj-hltdaqdcs-
tdr.web.cern.ch/atlas-proj-hltdaqdcs-tdr/latest/PDF/TDR-2up.pdf)

ROBins are the special purpose network cards.

------
pooriaazimi
There's something I always wanted to know, but was too lazy to look up myself
(I have a CS degree _(or will, in a year)_ and have taken microprocessor,
assembly, computer architecture and courses like that, but none of them talked
about multi-core, so I don't know anything about them).

When you have lots of cores, I think your bottleneck would be cache and, to a
lesser degree, memory bandwidth. You can't have multiple cores manipulate the
same memory address simultaneously.

Am I right?

~~~
ivany
Spot-on. I came to post this. A CPU core != a Xeon Phi core != a GPU "core".
Memory architecture matters a LOT, and much of CPU area is devoted to cache
and memory infrastructure. In a GPU, my understanding is every ~32 cores share
a small cache and MMU, which makes some SIMD operations / algorithms more
challenging to implement. Good for some algorithms, not so great for others.

I think the whole focus on "cores" misses lots of issues. Memory
infrastructure is only one. Instruction-level parallelism (superscalar and
out-of-order execution) is another - even single-core processors like ye olde
Pentium could execute multiple instructions per cycle. It's very easy and
tempting to look at the number of cores and use that as a rough estimate of
system performance. But this approach will land you WAY off of real-world
figures. It's akin to using the number of cylinders in a car's engine to
determine how fast it is - sure, to the first order, cylinder count is
correlated with engine output and hence car speed, but it's only a very rough
correlation.
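
A small, hypothetical illustration of how much the memory side matters
independent of core count: two threads incrementing two *different* counters
can still crawl if the counters share a cache line, because the line ping-pongs
between the cores' caches.

    #include <pthread.h>
    #include <stdio.h>

    /* Without the padding, 'a' and 'b' share a 64-byte cache line, so each
     * core's write invalidates the other core's copy even though the threads
     * never touch the same address. Padding puts them on separate lines. */
    struct counters {
        volatile long a;
        char pad[64];           /* remove this to see false sharing */
        volatile long b;
    };

    static struct counters c;

    static void *bump_a(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000; i++) c.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000; i++) c.b++;
        return NULL;
    }

    int main(void)      /* build with: cc -O2 -pthread ... */
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);
        return 0;
    }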

~~~
lgeek
I'm wondering why you say that a 'CPU core' is different from a Xeon Phi core.
I've looked a bit into the Xeon Phi microarchitecture and I don't see any
fundamental differences (well, I guess it depends on how you define
'fundamental') compared to other processors.

------
SageRaven
Are hardware threads really that useful in any scenario? I tend to disable the
functionality and use actual cores for capacity planning, especially in
virtualization. Running "make -j<cores>" generally finishes faster than "make
-j<threads>" on the same hardware in the scenarios I've tried.

Is there solid empirical data on the threads vs cores thing?

~~~
sp332
Sorry if this seems unhelpful, but it depends _entirely_ on your workload. And
things get even more complicated when you get into "turbo" mode, which only
applies if you use < x% of the cores, and then AMD's new architecture with two
decode and integer units for each floating-point unit... It's really hard to
predict in advance how all that will interact with a particular workload.
Sometimes just tweaking the scheduler can give 10-20% boost in throughput.

~~~
zurn
It depends even more heavily on the "threading" implementation on your
hardware. Some CPUs are designed to run a bunch of threads to keep the chip
busy (like the Niagara family Ultrasparcs) and others have SMT bolted on as an
afterthought that mostly doesn't help (Pentium 4 "HyperThreading" or whatever
Intel calls it). Then there's the AMD Bulldozer approach which is kind of
between multicore and SMT, with some shared and some dedicated resources per
thread. Then there's the Tera MTA style approach where you basically switch
threads on every instruction to hide memory latency. Etc etc.

------
rubinelli
I'm under the impression that the reason Intel isn't producing 256-core CPUs
has less to do with the technical problems involved than with the fact that,
outside CPU-heavy mathematical computing, nobody needs them. The few
embarrassingly parallel tasks we have are most of the time well served by
GPUs, or distributed in a cluster of cheap servers to increase I/O.

~~~
moe
_or distributed in a cluster of cheap servers to increase I/O_

If a smaller cluster of 256-core servers had better bang/buck than a bigger
cluster of 16-core servers then that would still be a win for many
applications (not everything CPU bound is also I/O bound, e.g. web application
servers written in hilariously inefficient languages, virtualization).

------
derda
I know people who have worked with the Intel 48-core SCC (Single-Chip Cloud
Computer). While I don't remember the technical details, it was very hard for
them to build software that actually benefited from that many cores. If I
remember correctly, sharing data between the cores was a big bottleneck.

~~~
merijnv
I disagree; benefiting from 48 cores on the SCC would be easy if the other
(solvable) problems plaguing the SCC didn't get in the way. The bigger issues
were:

The use of the old P54C core: while understandable from a hardware point of
view (the design already existed, no need to validate/test), this decision had
big consequences. The core can only support a single outstanding memory
operation, crippling memory performance. Doubly so because it does not support
out-of-order execution, meaning the core is stalled until memory requests
finish.

Furthermore, the lack of L2 cache coherence lets you experiment nicely with
NUMA, and the combination of reprogrammable lookup tables and more physical
memory than address space (34-bit vs 32-bit) lets you do nice tricks to
share/communicate data between cores with zero copying needed. Unfortunately
(due to time/budget constraints) Intel didn't implement any programmable
expiration/invalidation/flushing of the L2 cache. This means that any
application wanting to use these tricks had no solution beyond reading in a
full 256KB (with the proper striding to make sure you actually cache miss on
all of those 256KB!) to flush the cache. Such a flush takes approximately 500k
CPU cycles, almost guaranteeing it kills any performance gained from zero-copy
tricks.
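
For reference, that flush-by-reading-256KB workaround boils down to something
like this (my own sketch, assuming a 64-byte line size and a simple sequential
sweep; the real thing had to pick the stride so every set and way actually
missed):

    #include <stdint.h>

    #define L2_SIZE   (256 * 1024)  /* per-core L2 on the SCC */
    #define LINE_SIZE 64            /* assumed cache line size */

    static volatile uint8_t scratch[L2_SIZE];

    /* Evict the whole L2 by touching one byte in every cache line of a
     * scratch buffer the size of the cache: ~4096 forced misses, i.e. the
     * ~500k-cycle cost mentioned above, just to drop stale data that a
     * programmable flush/invalidate instruction could have removed cheaply. */
    void flush_l2_the_hard_way(void)
    {
        for (uint32_t off = 0; off < L2_SIZE; off += LINE_SIZE)
            (void)scratch[off];     /* volatile read forces the access */
    }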

The fact that the cores were 32-bit is also extremely limiting, as you quickly
consume your entire address space when you start playing around with the
above-mentioned sort of memory shenanigans.

There were some other, fairly minor issues, but these are the ones that pop
into my head first.

------
aneth4
I can't be the only one who stops reading articles when the entire first
section is contorted self-congratulation?

This is up there with The Next Web always vainly touting what they "told us"
in the last article.

Say something interesting. Don't tell me how great you are. It immediately
diminishes my opinion of people and companies because it makes it clear they
are more interested in recognition and taking credit than in being
interesting.

------
mitchi
It's much smarter for Intel to work on power, efficiency, size, and memory (I
hear that RAM is going to disappear soon) rather than building mammoth-like
processors. If you want parallel computing, you buy many computers and make
them work together.

------
shocks
CPU core counts may not be on the up, but GPU core counts certainly are.

IIRC my GTX 670s have 1344 cores each, compared to 4 in my 3770K.

~~~
Symmetry
In GPUs "core" means something entirely different from what they do with CPUs.
If you counted the 3770k's cores the way GPU manufacturers do, it would have
16 cores. And If you count the GTX 670's cores the way a CPU manufacturer
would it would only have a few dozen.

------
cmccabe
I usually like Herb Sutter's articles, but this feels a bit like special
pleading to me. He made a prediction-- that the amount of hardware parallelism
we'd have would be between 16 and 256-- and it didn't really turn out to be
true. I would say that we now have something like 8- to 12-way parallelism on
most systems. If you take a 6-core i7 with hyperthreading, that's 12-way
parallelism.

Talking about specialty hardware like the Xeon Phi HPC accelerator (which most
developers have never heard of, let alone programmed for), just doesn't make
sense. Similarly, GPUs may be ubiquitous, but how many programmers have
actually written for them? Very few.

The reality is that CPU architects have done everything they possibly can to
keep down the amount of parallelism. When you get more transistors, you can use
them for things like L1 cache instead of adding more execution units.

The bottom line is that it's hard to predict the future. We can read stuff
from the past, like mailing list posts about how Linux is irrelevant because
"we'll all be using Sparcs in a few years," but the lesson that it's hard to
make predictions never really seems to sink in.

I hope that we'll see more manycore chips in the future for developers to play
with. The realities of physics seem to be dragging CPU architects kicking and
screaming into the multicore world. But it may not be as fast as we once
thought.

~~~
lambda
But don't most machines that will be using CPUs that big have two sockets? Big
servers these days come with 2 CPU sockets, 6 cores per CPU, and 2 threads per
core. So you do see 24-way parallelism pretty often, for server-side code at
least.

Now, this was pretty much the low end of his original prediction: 'If we stick
with "just more of the same" as in Figure 2's extrapolation, we'd expect
aggressive early hardware adopters to be running 16-core machines (possibly
double that if they're aggressive enough to run dual-CPU workstations with two
sockets), and we'd likely expect most general mainstream users to have 4-, 8-
or maybe a smattering of 16-core machines (accounting for the time for new
chips to be adopted in the marketplace).'

It is true that new, mainstream systems offer 4- to 8-way parallelism, and that
you see up to 24-way parallelism on the high end. So his "more of the same"
prediction was perfectly correct.

We did not see the large jump that he hypothesized might happen. But if you
notice, he said "But the gating factor is software that can use them
effectively; specifically, the availability of scalable parallel mainstream
killer applications. The only thing I can foresee that could prevent the
widespread adoption of manycore mainstream systems in the next decade would be
a complete failure to find and build some key parallel killer apps, ones that
large numbers of people want and that work better with lots of cores."

As you say, most programmers don't develop for these big many-core
monstrosities. So he was right; software that takes advantage of multi-core
machines is still a gating factor. The chips are available, but because that
many cores do not help most software that hasn't been specially crafted for
them, they are not mainstream.

I think the one major fault is in his optimism that it's possible for most
software to take good advantage of that many cores. Parallelism is hard.
Correct software is more important than fast software. Heck, a lot of software
these days is written in languages like JavaScript, Python, and Ruby, which
are not known for their speed, but rather for their ease of development.
Telling people to get on the multi-core bandwagon, when in most cases just
writing large, correct software is more important than squeezing out the last
ounce of performance, is not exactly productive.

As GPUs have shown us, the best way to take advantage of that extra
parallelism is to write special purpose libraries for the computationally
intensive parts, and then just call out to that from simpler, less parallel
code. So I'm not sure that the average programmer is going to spend that much
time adapting their code for the parallel world, other than using existing
libraries and frameworks for taking advantage of that.

A database backed web app is a great example of this kind of parallelism. The
hard parts of parallelism are generally handled by the database itself. The
web app is usually written completely single threaded, but you can spin up
multiple processes all talking to the same database to take advantage of your
multiple cores.
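
The shape of that, in the smallest possible terms (a hypothetical sketch; real
app servers add a proper master/worker protocol, restarts, and so on): each
worker is ordinary single-threaded code, and all the parallelism comes from
running several of them against the same shared database.

    #include <sys/wait.h>
    #include <unistd.h>

    #define NUM_WORKERS 8           /* roughly one per core */

    static void worker_loop(void)
    {
        for (;;) {
            /* accept a request, query the shared database, respond ... */
        }
    }

    int main(void)
    {
        for (int i = 0; i < NUM_WORKERS; i++) {
            if (fork() == 0) {      /* child becomes a worker */
                worker_loop();
                _exit(0);
            }
        }
        while (wait(NULL) > 0)      /* parent just supervises */
            ;
        return 0;
    }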

I suspect that is how we'll see parallelism play out in general. Most software
written single threaded, with parallelism only by running multiple processes,
talking to specially tailored databases, message queues, rendering libraries,
numerical computing libraries, and the like.

~~~
genwin
Good points. Parallel code is getting easier to write (e.g. with the Go
language), and is thus likely to become more common. I think web servers alone
(serving more users per machine) could drive multi-core advances.

------
goggles99
HAHA, whatever happened to "Intel pledges 80 cores in five years"? (article from
2006) [http://news.cnet.com/Intel-pledges-80-cores-in-five-
years/21...](http://news.cnet.com/Intel-pledges-80-cores-in-five-
years/2100-1006_3-6119618.html)

The Nvidia Tesla K20 has 2496 cores (I know that it is a GPU and not a CPU)

~~~
compilercreator
Well, about the 80-core Intel chip: you may be interested in the Intel Xeon
Phi, which has about 60 cores.

~~~
marshray
Xeon Phi (née Larrabee) looks interesting, but it's not a real product yet;
i.e., you can't buy one for your desktop computer.

As others have pointed out, with AMD Opterons you can stuff 64 cores on a
server motherboard today.

~~~
compilercreator
Well, it will be available in about 2 months. From Intel: "The Intel
Xeon Phi coprocessor 5110P is shipping today with general availability on Jan.
28 with recommended customer price of $2,649."

