
Wow: Intel unveils 1 teraflop chip with 50-plus cores - jhack
http://seattletimes.nwsource.com/html/technologybrierdudleysblog/2016775145_wow_intel_unveils_1_teraflop_c.html
======
mrb
"Wow?" This is actually disappointingly low raw TFLOPS performance.

Intel's Knights Ferry GPGPU ASIC is not yet available, but it is already
outperformed by 2-year-old chips from AMD and Nvidia, both of whom have been
selling GPU ASICs breaking the 1 TFLOPS barrier (single precision) for over
_two years now_. The AMD Radeon HD 5870 and HD 6970 both reach 2.7 TFLOPS, and
AMD makes a dual-ASIC PCIe card, the HD 6990, reaching 5.1 TFLOPS. Nvidia's
mid-level GTX 275 (1.01 TFLOPS) was released in April 2009.

In fact, Knights Ferry evolved from the Larrabee GPU project, whose
performance disappointed Intel so much that they decided to forgo the GPU
market (as Larrabee was clearly not going to be competitive there) and to
remain focused only on the GPGPU market.

The one strong advantage of Knights Ferry is not performance but x86
compatibility, which in theory makes it easy to port programs to. In practice
one would still have to rewrite the app to use the LRBni instruction set
(512-bit regs) to fully exploit the computing performance... or else be
limited to a quarter of its potential with SSE (128-bit regs).
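
To make that width difference concrete, here is a rough sketch in plain C
using SSE intrinsics (4 floats per instruction). I'm only assuming a 512-bit
unit would do the same thing 16 floats at a time, which is where the "quarter
of its potential" figure comes from:

    #include <xmmintrin.h>  /* SSE: 128-bit registers, 4 packed floats */

    /* y[i] += a * x[i], vectorized 4-wide with SSE. A 512-bit vector unit
       would process 16 floats per instruction instead of 4. n is assumed to
       be a multiple of 4 to keep the sketch short. */
    void saxpy_sse(float a, const float *x, float *y, int n)
    {
        __m128 va = _mm_set1_ps(a);
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);
            __m128 vy = _mm_loadu_ps(&y[i]);
            vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
            _mm_storeu_ps(&y[i], vy);
        }
    }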

Another relative advantage of Knights Ferry is that each of the 50+ cores
will probably be able to execute its own instruction every clock cycle, so
the chip can issue 50+ unique instructions per cycle, making it very
flexible. (Compare the HD 6970, which has 384 "cores", or VLIW units, but can
only execute 24 unique instructions per cycle: the ASIC is organized as 24
SIMD engines of 16 VLIW units each, and the 16 VLIW units in a SIMD engine
all execute the same instruction in 16 different thread contexts, for a total
of 384 threads.)
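
Roughly, the scheduling difference looks like this toy C sketch (not real GPU
code, just modeling the counts above: 24 engines each issuing one instruction
to 16 lockstep lanes, versus one independent instruction stream per core):

    /* Toy model of one cycle of the GPU-style machine described above: each
       of the 24 SIMD engines picks a single instruction and all 16 of its
       VLIW units execute it in lockstep on their own thread's data. A
       MIC-style chip instead lets every core run its own instruction. */
    enum { NUM_ENGINES = 24, LANES_PER_ENGINE = 16 };

    typedef void (*instr_fn)(int thread_id);

    void gpu_style_cycle(instr_fn instr_for_engine[NUM_ENGINES])
    {
        for (int e = 0; e < NUM_ENGINES; e++)      /* 24 unique instructions... */
            for (int lane = 0; lane < LANES_PER_ENGINE; lane++)
                instr_for_engine[e](e * LANES_PER_ENGINE + lane); /* ...16 threads each */
    }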

 _Edit:_ my bad, it looks like Intel claims 1 TFLOPS in double precision,
which would put it at the level of upcoming AMD chips (the HD 7970 is rumored
to provide 4.1 SP TFLOPS or 1.0 DP TFLOPS in early 2012).

~~~
stuntprogrammer
A few problems in your comment:

1) The article text is wrong (and doesn't match the pics of the slides). The
chip demonstrated today is Knights Corner which is a new part, not the older
Knights Ferry SDV.

2) When counting flops we need to distinguish between single precision flops
and double precision flops. Your comparison isn't valid -- Knights Corner was
shown sustaining over 1TF on a double precision code. Nvidia's most recent
flagship GPU has a theoretical peak of 515GF/s but sustains less than 225GF/s
on the same DGEMM operation. Knights Corner is sustaining 4-5x that, which
implies its theoretical peak is higher again. AMD's GPUs also cannot touch
this with a single chip. Their dual chip 6990 has what looks like the same
theoretical peak but far lower practical performance, due to being more of a
graphics part than a compute part (e.g. look at the cache structures).
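
For anyone unfamiliar with the benchmark: DGEMM is just dense double-precision
matrix multiply. A deliberately naive C sketch of what it computes follows --
real implementations are blocked, vectorized and threaded, but the flop count
(~2*n^3 for n x n matrices) is the same, which is how the sustained FLOPS
number gets measured:

    /* Naive DGEMM sketch: C = alpha*A*B + beta*C on dense double-precision
       matrices, row-major. Only meant to show what the benchmark computes;
       sustained FLOPS is roughly 2*n^3 / elapsed time. */
    void dgemm_naive(int n, double alpha, const double *A, const double *B,
                     double beta, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i*n + k] * B[k*n + j];
                C[i*n + j] = alpha * acc + beta * C[i*n + j];
            }
    }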

You are correct that these are real cores, each with a wide vector unit. If we
wanted the equivalent of GPU "cores" we should multiply out by the vector
width per core.

~~~
Retric
This Intel chip has a theoretical max performance of 1TF/s; actual benchmarks
are clearly going to be lower than that. The only thing slightly interesting
about this is x86, but considering the large vector unit and anemic cache,
you're not going to be able to port high performance code to this without
massive changes anyway. And while comparing new chips against existing chips
is always a tradeoff, next to the Radeon HD 6970 (released Dec 15, 2010) with
2.7 TFLOPs single precision and 683 GFLOPs double precision, this is a
relatively minor jump in single chip double precision performance, and unless
they're releasing it next week it would probably still be far slower than its
competitors.

That's also raw performance; considering this is a brand new architecture,
it's likely to have some significant bottlenecks limiting its performance for
the next 2-3 product cycles.

PS: Considering so few details were provided, it's hard to look at this as
anything but Intel saying "Please don't port your code; we will have
competitive x86 chips out at some point in time."

~~~
stuntprogrammer
It's doing 1TF sustained and no one ever sustains 100% of their theoretical
peak, so we know the peak is higher than 1TF. Consider also the sandbagging of
number of cores as "50+". How big is the "+"? In reality the design will have
a larger number that is then binned by yield and so forth to give a range of
SKUs, as usual.

Next, you're comparing a sustained number on DGEMM with theoretical peaks on
other machines. Nvidia sustains <225GF on DGEMM with Fermi so this is 4-5x.
Last I looked, AMD were sustaining ~500GF with Cayman, so this is 2x, and a
much easier machine to sustain perf on for other codes compared to Cayman. If
you consider a potentially sandbagged 2x sustained perf to be "relatively
minor" then so be it.

There are few public details provided but many of us have been programming
with the Knights Ferry SDV kit in preparation for this part. So we have
experience with the tools, with the use of lots of similar cache coherent x86
cores, etc. I can tell you this -- it's much easier to work with this than
GPUs, and I've written a ton of code on all kinds of whacky machines,
production compute code on GPUs included.

~~~
onemoreact
I don't see any mention of this chip doing DGEMM at 1TF, just that it's
sustaining 1TF performance. You can write assembler code that gets within 1%
of theoretical peak flops if you're not trying to get anything done, but if
you have a source feel free to give it. (Not that that even means much; AMD
is getting 80% of theoretical max FLOPS on that benchmark, and I assume Intel
would pick the optimum benchmark for its chip even if they had to design the
chip around the benchmark.)
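
To be concrete about the "within 1% of peak" point, here's the kind of
do-nothing loop I mean, sketched in C (a real peak-chasing kernel would be
hand-scheduled vector assembly, but the idea is the same):

    /* A do-nothing FLOPS burner: eight independent multiply-add chains keep
       the FP pipelines full, 16 flops per iteration, zero useful work. A
       real "peak" run would use hand-written vector assembly; this is only
       the idea of posting high FLOPS while computing nothing of value. */
    double flops_burner(long iters)
    {
        double a0 = 1.0, a1 = 1.1, a2 = 1.2, a3 = 1.3;
        double a4 = 1.4, a5 = 1.5, a6 = 1.6, a7 = 1.7;
        for (long i = 0; i < iters; i++) {
            a0 = a0 * 1.000000001 + 1e-9;   /* 2 flops per line */
            a1 = a1 * 1.000000001 + 1e-9;
            a2 = a2 * 1.000000001 + 1e-9;
            a3 = a3 * 1.000000001 + 1e-9;
            a4 = a4 * 1.000000001 + 1e-9;
            a5 = a5 * 1.000000001 + 1e-9;
            a6 = a6 * 1.000000001 + 1e-9;
            a7 = a7 * 1.000000001 + 1e-9;
        }
        return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7; /* defeat dead-code removal */
    }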

Also, I don't see anything that suggests it's anywhere close to a production
chip. _More important, Knights Ferry chips may help engineers build the next
generation of supercomputing systems, which Intel and its partners hope to
delivery by 2018._ Not to mention you're comparing a preproduction chip with
a year-old chip that's running on a 2-year-old process, when AMD, Intel, and
Nvidia are all about to do a die shrink.

~~~
stuntprogrammer
Sustaining 1TF on DGEMM was explicitly mentioned by Intel in the
presentation/briefing.

It's also mentioned in the press release:

[http://newsroom.intel.com/community/intel_newsroom/blog/2011...](http://newsroom.intel.com/community/intel_newsroom/blog/2011/11/15/intel-reveals-details-of-next-generation-high-performance-computing-platforms)

"The first presentation of the first silicon of “Knights Corner” co-processor
showed that Intel architecture is capable of delivering more than 1 TFLOPs of
double precision floating point performance (as measured by the Double-
precision, General Matrix-Matrix multiplication benchmark -- DGEMM). This was
the first demonstration of a single processing chip capable of achieving such
a performance level."

Does it mean much? It means something to me, and is a great first step for
those of us running compute intensive codes. They really wouldn't get far if
they designed the chip only around being able to do this.

As I mentioned elsewhere in the thread, the article text is incorrect. The
chip we're discussing is Knights Corner, not Knights Ferry. The latter has
been in early users' hands for quite some time now and I've spent plenty of
time hacking on it. Knights Corner is the new chip that is working its way to
production via the usual process, with ship for revenue in 2012.

The 2018 target is for an exascale machine, not shipment of initial MIC
devices. TACC have already announced they'll be building out a 10 petaflop MIC
based system next year to go operational by 2013.

Yes, I'm comparing a chip that has not shipped, but given the perf advantage,
given the tools and productivity advantage, and given the multiyear process
advantage Intel is sustaining, this is not a chip to be ignored. Knights
Corner is shipping on 22nm. Other vendors have notoriously had difficulty on
previous processes, depend on fabs like TSMC who are doing 28nm for them, and
will be later to 14nm, etc.

~~~
onemoreact
Thanks for clearing that up; my google-fu is weak when they use the wrong
names.

Still, it looks like they really do design for benchmarks: "Xeon E5 delivers
up to 2.1* times more performance in raw FLOPS (Floating Point Operations Per
Second as measured by Linpack) and _up to 70 percent more performance using
real-HPC workloads compared to the previous generation of Intel Xeon 5600
series processors._ " 110% on benchmark = 70% in real world apps.

Granted, if this works out, great. I have seen Intel blow too many new 'high
performance' chips to expect much, but they might just pull this one off.
Unlike, say, the <http://en.wikipedia.org/wiki/Itanium> etc.

PS: I always look at what Intel gets x86 to do much like how Microsoft
develops software: it's not that the capability is awesome so much as
watching a mountain of hacks dance. They have a huge process advantage and
can throw piles of money and talent at the process, but they are stuck with
optimizations made when computers were less than 1% as powerful.

~~~
stuntprogrammer
We should distinguish between designing for a benchmark and designing for a
set of workloads. Everyone chooses representative workloads they care about
and evaluates design choices on a variety of metrics from simulating
execution of parts of those workloads.

Linpack is a common go-to number because, for all its flaws, it's widely
quoted, e.g. it's used in the top500 ranking. It tends to let the CPU crank
away without stressing the interconnect, and is widely viewed as an upper
bound on perf for the machine. In the E5 case it'll be particularly helped by
the move to AVX-enabled cores, and will take more advantage of that than
general workloads. Realistic HPC workloads stress a lot more of the machine
beyond the CPU -- interconnect performance in particular.

People like to dump on x86 but it's not that bad. There are plenty of
features no one really uses that we still carry around, but those features
often end up microcoded and don't gunk up the rest of the core. The big issue
is decoder power and performance: x86 decode is complex. On the flip side,
the code density is pretty good, and that is important. Secondly, Intel and
others have added various improvements that help avoid the downsides, e.g.
caching of decode, post-decode loop buffers, uop caches, etc. Plus the new
ISA extensions are much kinder.

~~~
Retric
The problem with x86 is that when you scale the chip to N cores you have N
copies of all that dead weight. You might not save many transistors by, say,
dropping support for 16 bit floats, relative to how much people would hate
you for doing so. However, there are plenty of things you can drop in a GPU
or vector processor, and when you start having hundreds of cores it's a real
issue.

Still, with enough of a process advantage and enough manpower you can end up
with something like the i7 2600, which has a near useless GPU and a
ridiculous pin count and still dominates all competition in its price range.

~~~
stuntprogrammer
Is there a cost? Of course. But arguably it's in the noise on these chips.
Knights Ferry and Corner use a scalar x86 core derived from the P54C. How
many transistors was that? About 3.3 million. By contrast, Nvidia's 16-core
Fermi is a 3 billion transistor design. (No, Fermi doesn't have 512 cores;
that's a marketing number based on declaring that a SIMD lane is a "cuda
core". If we do the same trick with MIC we start doing 50+ cores * 16 wide
and claiming 800 cores.)

How can we resolve this dissonance? Easy -- ignoring the fixed function and
graphics-only parts of Fermi, most of the transistors are going to be in the
caches, the floating point units and the interconnect. These are places MIC
will also spend billions of transistors, but they're not carrying legacy dead
weight from x86 history -- the FPU is 16 wide and by definition must have a
new ISA. The cost of the scalar cores will not be remotely dominant.

I'm not sure why you are concerned about the pin count on the processor,
except perhaps if you are complaining about changing socket designs, which is
a different argument. The i7 2600 fits in an LGA 1155 socket (i.e. 1155
pins), whereas Fermi was using a 1981-pin design on the compute SKUs. The
sandy bridge CPU design is a fine one. The GPU is rapidly improving (e.g. ivy
bridge should be significantly better, and will be a 1.4 billion transistor
design on the same 22nm as Knights Corner).

------
modeless
I see questions about why this is better than a GPU for anything. Two main
things:

1\. The double-precision floating point performance is a lot better.

2\. Unlike GPUs which have baroque memory access restrictions and many
performance cliffs, this is a much more familiar SMP architecture with a
unified coherent cache hierarchy.
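
A minimal sketch of what point 2 buys you in practice (OpenMP-flavored C,
purely illustrative; the point is that there are no device buffers or
explicit host<->device transfers to manage):

    /* On a cache-coherent SMP-style part, ordinary shared-memory threading
       code like this runs as-is: every core sees the same array and the
       hardware keeps caches coherent. On a GPU the same computation needs
       explicit buffer allocation, a host->device copy, a kernel launch, and
       a device->host copy back. */
    void scale_in_place(double *data, long n, double factor)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            data[i] *= factor;
    }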

~~~
Klinky
1\. It's estimated to come out a year later than AMD's Radeon part, which
will boast similar double-precision floating point performance. Both could be
delayed, but if Southern Islands comes out on time and Knights Corner gets
delayed, by the time KC sees the light of day AMD or nVidia might have
another part out that offers even higher DPFP performance.

2\. I am not sold that modeling the cores after x86/SMP means special care
won't be needed to feed the Intel MIC architecture properly. I'd like to see
some real world numbers on purchasable hardware.

------
joshu
Heh. Article is nearly incoherent:

> If you're building a new system and want to future-proof it, the Knights
> Ferry chip uses a double PCI Express slot. Chrysos said the systems are also
> likely to run alongside a few Xeon processors.

------
rbanffy
The memory bus must be saying "Great. Another 50 mouths to feed".

You have to design your program very carefully if you don't want the cores to
starve.

------
jwatte
So it doesn't run the general x64 system architecture? Then how is this
different from GPGPU? I thought NVIDIA broke a teraflop on a dual slot a while
back (dunno if it was single GPU.) Slot based coprocessors have always been a
_very_ niche kind of thing.

Basically, if I can't hook it up to my SSD array and also my GPU, then it's
not a "real" computer -- like the reporter was talking about a laptop. And if
I can't rent it by the hour from Amazon, then it's not really a good
investment (Amazon already has GPU instances.)

Or, you know, maybe this time it will work, when every time before, a co-
processor platform has failed...

------
r3demon
The AMD Radeon HD 6990 already has over 1 TFLOPS of double-precision
performance, and there's no problem buying it; Intel is too late.

~~~
phamilton
Yes, there is no problem buying it, but have you ever tried programming in
OpenCL? Complexity aside, GPGPU hits a big bottleneck when dealing with large
datasets: there just isn't enough memory available on the GPU, and transfers
to and from the device are costly.
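
Back-of-envelope on the transfer cost (the ~8 GB/s figure is an assumed
practical number for something like PCIe 2.0 x16, not a measurement):

    #include <stdio.h>

    /* Time just to move a dataset over the bus, ignoring all compute. */
    int main(void)
    {
        double dataset_gb = 100.0; /* dataset far larger than card memory */
        double bus_gb_s   = 8.0;   /* assumed effective PCIe bandwidth    */
        printf("one full pass costs ~%.1f s of transfer alone\n",
               dataset_gb / bus_gb_s);
        return 0;
    }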

~~~
marcf
There are AMD video cards with 4GB of memory these days, and I understand
special cards have up to 16GB.

~~~
phamilton
16GB isn't adequate for the work I've done on them. Genome assembly (an
O(n^2 k^2) process) generally involves around 100GB of data: each segment
needs to be compared against each other segment (O(n^2)), and when comparing
two segments each datum needs to be compared to each other datum (O(k^2)).

So while you can transfer 4GB of data to the card at a time, it really
doesn't cut it.
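
Here's roughly what the chunking ends up looking like (hypothetical helper
names, not a real API); the killer is that every chunk has to cross the bus
many times, not once:

    /* Hypothetical all-pairs comparison when the data doesn't fit on the
       card. With ~100 GB of segments and a few GB usable on the device, the
       data gets split into dozens of tiles and every tile pair is compared,
       so each tile is re-uploaded on the order of num_tiles times. */
    void upload_tile_to_slot(int tile_index, int slot);   /* placeholder */
    void compare_tiles_on_gpu(int slot_a, int slot_b);    /* placeholder */

    void all_pairs(int num_tiles)
    {
        for (int i = 0; i < num_tiles; i++) {
            upload_tile_to_slot(i, 0);          /* host -> device copy     */
            for (int j = i; j < num_tiles; j++) {
                upload_tile_to_slot(j, 1);      /* re-sent again and again */
                compare_tiles_on_gpu(0, 1);
            }
        }
    }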

~~~
r3demon
I'm sure Intel won't cut it either, PCI Express speed is limited, and RAM
woudn't catch up with simple computation speed. Find a better algorithm.

------
cultureulterior
I'll be very interested to see how this does with raytracing.

------
nextparadigms
This was to be expected. In a classic disruptive innovation fashion, Intel is
starting to move upmarket, where the profits are higher, and in a few years
they'll be leaving the mobile and notebook/PC market to ARM.

------
ck2
These aren't x86 cores, are they?

I mean 50 atom cores would be downright silly.

50 i3 cores, well then you might have something.

~~~
rayiner
Yes, these are x86 cores. Actually quite a bit like the Pentium (p55c) with a
512-bit vector unit bolted on.

------
suivix
What is the significance of this over standard GPUs that can already do over a
teraflop?

See table: <http://en.wikipedia.org/wiki/Northern_Islands_(GPU_family)>

~~~
stuntprogrammer
The AMD 6990 has ~1.37TF double precision, but uses 2 GPU chips to do it,
whereas this is a single chip at that perf level.

It is difficult to get good performance out of GPUs for a very wide range of
highly parallel programs. Effectively, you are programming a part that is
primarily a graphics part, since that is where the volume is, with just
enough compute compromises to try to grow that market. MIC is designed to be
a compute processor from the get-go. How about this for a difference: it can
boot linux all on its own! You can ssh into it and run programs. You can even
run 'reverse offload' programs that call out to code on the CPU! Try doing
any of that with a GPU.

BTW, this MIC chip has a large number of cores (50+). These are real cores,
and they're not doing the GPU marketing trick of counting SIMD lanes as
"cores"; you could multiply 50+ * 16 to get the equivalent number of GPU
"cores". Each core is cache coherent, with a decent memory hierarchy designed
for compute. There's no graphics tax here.

I have much more expectation that Intel can leverage their massive process
advantage to keep MIC ahead on compute performance each generation. It'll be a
relief to have compute parts rather than repurposed GPUs.

------
ciderpunx
I should probably get one of these for my laptop.

