
Intel's 72-Core “Knight's Landing” Xeon Phi Chip Cleared for Takeoff - taspeotis
http://hothardware.com/news/intels-72-core-knights-landing-xeon-phi-supercomputer-chip-cleared-for-takeoff
======
dkbrk
I was wondering how radically this architecture differs from more conventional
x86 cores, and found this: "An Overview of Programming for Intel Xeon
processors and Intel Xeon Phi coprocessors (2012)" [0]. This paper refers to a
slightly older architecture, "Knights Corner", but it should be roughly the
same. Essentially, it has a large number of simple in-order x86 cores with 4 hardware threads
per core. Each core has a 512 KB L2 cache and the entire coprocessor is cache-
coherent through a high-speed ring bus. There is also some quantity of GDDR5
(up to 8GB) on said ring bus. The great advantage over Xeon is its scaling to
very high levels of parallelism: though threads on the same core will benefit
from data locality, the architecture scales uniformly with more cores unlike a
NUMA multi-socket setup. There's more information about the sorts of
optimisations required to get good performance in the paper, but they're much
the same sort of things as for Xeon. Also, according to the paper, any
algorithm well suited for a GPU is well suited to the Phi architecture, though
the Phi architecture is far more general-purpose in its applications.

[0] - [https://software.intel.com/sites/default/files/article/33016...](https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf)
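
For a rough sense of scale, here is the arithmetic for the Knights Corner generation the paper describes (61 cores is the count on the top-end KNC parts; purely illustrative numbers):

```python
# Back-of-the-envelope totals for a 61-core Knights Corner part.
cores = 61
threads_per_core = 4          # in-order hardware threads per core
l2_per_core_kb = 512          # per-core L2, coherent over the ring bus

hw_threads = cores * threads_per_core          # 244 hardware threads
total_l2_mb = cores * l2_per_core_kb / 1024    # ~30.5 MB of coherent L2

print(hw_threads, total_l2_mb)
```

So code needs on the order of a couple hundred runnable threads just to keep the chip busy, which is why GPU-friendly algorithms map well onto it.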

~~~
knweiss
The Knights Landing (KNL) memory hierarchy is different from Knights Corner
because it now features MCDRAM (HBM) - even with three different modes [0].

[0] - [http://www.anandtech.com/show/9794/a-few-notes-on-intels-kni...](http://www.anandtech.com/show/9794/a-few-notes-on-intels-knights-landing-and-mcdram-modes-from-sc15)

------
hendzen
The real question is when they will release a socketed version. The
disadvantages of the PCI-E form factor are somewhat mitigated by the on-board
memory, but it would still be nicer not to have to program against it as an
accelerator card.

EDIT: Looks like this is in the works already [0].

[0] - [http://www.nextplatform.com/2015/03/25/more-knights-landing-...](http://www.nextplatform.com/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/)

~~~
rbanffy
I'd love to see these parts on the desks of developers. I can't imagine there
are only 4 things my computer could be doing right now. And even if there
are, a 72-core part means roughly 16 times fewer context switches.

------
mrb
It took Intel 10 years, starting with Larrabee in 2006, to finally produce a
MIC architecture that reaches the same level of performance as high-end GPUs:
AMD's, Nvidia's, and Intel's top chips now all hit 6-8 single-precision
teraflops per chip. This will hopefully bring some competition to the GPU
compute market which is currently owned by Nvidia.

The two main reasons Knights Landing is so competitive compared to the
previous-generation Knights Corner are that (1) it doubled the raw compute
performance per core thanks to its two 512-bit VPUs (vector processing units)
per core compared to only 1 VPU per core for Knights Corner, and (2) it upped
the number of cores from 61 to 72 while maintaining and even slightly
increasing the clock frequency from 1.25 GHz to 1.3 GHz. All this was possible
because Knights Landing is manufactured in 14nm while Knights Corner was 22nm,
so its logic gates are 2-2.5x denser. Meanwhile all AMD and Nvidia discrete
GPUs are still stuck at 28nm. The only reason GPUs still perform comparably to
Knights Landing is that their execution units are simpler and smaller than
Intel MIC cores.
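
The standard peak-FLOPS arithmetic makes those two factors concrete (assuming an FMA counts as two FLOPs and a 512-bit vector holds 16 single-precision lanes; clocks as above):

```python
def peak_sp_tflops(cores, vpus_per_core, ghz, lanes=16, flops_per_lane=2):
    """Peak single-precision TFLOPS: cores x VPUs x SIMD lanes x FMA x clock."""
    return cores * vpus_per_core * lanes * flops_per_lane * ghz / 1000.0

knc = peak_sp_tflops(61, 1, 1.25)   # Knights Corner: ~2.4 TFLOPS
knl = peak_sp_tflops(72, 2, 1.30)   # Knights Landing: ~6.0 TFLOPS
print(knc, knl)
```

Doubling the VPUs, adding ~18% more cores, and a small clock bump multiply out to roughly 2.5x the previous generation's peak.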

------
chubot
Hm I thought Xeon Phi was sort of a failed launch? I remember the previous
version being heavily discounted at one point, and it didn't seem like they
were releasing new versions very quickly.

What are people using the Xeon Phi for?

nVidia seems to be making a killing off machine learning applications. The
entire GTC 2015 was about deep learning.

I assume that Xeon Phi is not nearly as effective for machine learning (?).

~~~
jandrewrogers
Xeon Phi is somewhat misunderstood, and its early incarnations have not helped
that. The most important point to understand is that it is neither a CPU nor a
GPU architecture. Code designed for either of those architectures will be very
suboptimal even if it does run adequately.

The Xeon Phi is a modern reincarnation of a _barrel processor_. It understands
x86-64 opcodes but your data structures and algorithms need to be designed
quite differently than vanilla CPU code if you want to optimize throughput. No
one is doing that though.

In principle, if your code is correctly designed for the architecture, the
Xeon Phi should significantly outperform both CPUs and GPUs for a wide range
of use cases. The strength of these architectures is that their throughput is
relatively insensitive to both latency and lack of trivial parallelism, which
are the major bottlenecks to a lot of modern software performance. It is why
Intel resurrected this style of computing architecture.
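
A toy model shows why that works (my own illustrative sketch, not Intel's actual scheduling policy): a barrel core issues from a different hardware thread each cycle, so a per-instruction stall becomes invisible once enough threads are resident.

```python
def issue_efficiency(threads, stall_cycles):
    """Fraction of cycles a round-robin barrel core can issue work when
    every instruction stalls its thread for `stall_cycles` cycles."""
    # Each thread gets an issue slot every `threads` cycles, so stalls
    # are fully hidden once threads >= stall_cycles.
    return min(1.0, threads / stall_cycles)

print(issue_efficiency(1, 4))   # single thread: only 25% of issue slots used
print(issue_efficiency(4, 4))   # four resident threads: stalls fully hidden
```

The same logic applies to longer memory stalls: throughput depends on having enough independent threads, not on any one thread's latency.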

Basically, very few people know how to design software for these architectures
even though it is pretty easy ( _much_ easier than GPU code). I have
experience designing software for exotic architectures like this, and I know what
they are capable of in terms of throughput, though I have never worked with a Xeon
Phi. Nonetheless, the silicon specs suggest that someone who actually knows
what they are doing should be able to significantly outperform e.g. GPUs for
things like machine learning. Right now, it is basically being wasted because
most developers treat them like weird CPUs.

I'd love to play with one of the new Xeon Phi processors to characterize its
true performance but I am unlikely to see one. But I would not dismiss their
performance; it is an extremely efficient kind of architecture for a
surprisingly wide range of workloads if used well.

~~~
chubot
Hm, that's interesting. If true, it sounds like they need to take a lesson
from nVidia and start providing more tools -- for example, the CUDA compiler,
and libraries like cuDNN and cuFFT.

I'm sure you can do better by hand than the code generated by the CUDA
compiler. But people need something to get started with, and it's probably
good enough for a lot of applications. In order to get adoption, the vendor
has to meet developers closer to the application.

There are very few people writing assembly code from scratch anymore... and
those that do are probably the kind of people who are designing their own
hardware anyway!

It makes more sense for the vendor to be writing libraries in assembly, since
they know the architecture best. Then apps can build on top of that in higher
level languages.

~~~
rbanffy
They have good OpenCL and MPI support. Intel has tools that help characterize
the performance of programs running on it. They are, obviously, complex tools
that take some time and dedication to master.

[http://shop.oreilly.com/product/9780124104143.do](http://shop.oreilly.com/product/9780124104143.do)
is an introduction and weighs in at more than 400 pages.

~~~
14113
According to an Intel employee I spoke to, OpenCL won't be available from day
one on Knights Landing - all it will support is OpenMP and MPI.
Apparently those are the two main programming models their target customers (HPC)
require, so they're not putting any effort into OpenCL until they're sure it's
wanted.

~~~
rbanffy
Intel already supports OpenCL on previous versions of the Phi. I see no reason
to doubt that code will run faster on Knights Landing than
it did on previous versions, even if the generated code doesn't squeeze out
every last bit of performance from the x86 side of the cores.

~~~
14113
I'm not sure you understand - I mean they won't actually provide an OpenCL
runtime/compiler. It's not that the code will be slower/less performant, it's
that it won't be possible to run it at all.

------
curiousAl
How many years of therapy does this come with for the person who has to push
down the motherboard lever?

~~~
sdrothrock
None, since it's a PCI-E card, not a processor.

(I'm assuming you were referring to the possibility of bending pins by trying
to mount a processor incorrectly? Might be way off.)

~~~
simcop2387
He might have been referring to the non-PCI-E version that's in the works, as
hendzen pointed out: [http://www.nextplatform.com/2015/03/25/more-knights-landing-...](http://www.nextplatform.com/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/)

It looks really beefy.

------
rtl49
On a related subject, can someone shed some light on the reasons and maybe the
legal justification for export controls for these processors? I recall reading
an article a few months ago about limitations imposed on exports to China of
Xeon processors. Are these such a competitive advantage?

~~~
SXX
It wasn't a ban just on Intel Xeons - it covered Nvidia and AMD as well. So it
is a competitive advantage in the sense that there are no alternatives. Obviously
such restrictions won't affect China's access to the actual hardware, but they
do prevent cooperation with the CPU/GPU vendors.

------
afsina
AMD will probably have a 32-core CPU by then (if they are not bankrupt).
It might offer better performance, since individual Xeon Phi cores are probably weaker.

[http://wccftech.com/amd-exascale-heterogeneous-processor-ehp...](http://wccftech.com/amd-exascale-heterogeneous-processor-ehp-apu-32-zen-cores-hbm2/)

~~~
cma
32 core, or "32 core":

[http://www.pcworld.com/article/3003113/components-processors...](http://www.pcworld.com/article/3003113/components-processors/lawsuit-alleges-amds-bulldozer-cpus-arent-really-8-core-processors.html)

~~~
greggyb
AMD's new Zen architecture implements SMT, similar to Intel's hyperthreading.

The contention about core count on the Bulldozer architecture seems spurious
to me. At the time Bulldozer was being developed, multi-core systems were just
being introduced to the consumer market. It was unclear what sort of
architecture would be most performant. AMD made a (bad only in hindsight) bet
that Bulldozer would be a viable architecture for general purpose compute
loads. It turns out that combining 1 FPU with two integer pipelines is not as
effective as an SMT architecture.

At the time Bulldozer was developed and released, what exactly constituted a
core was still not precisely defined. It turns out that Intel's SMT
architecture is much more effective, and thanks to market- and mind-share,
people associate the definition of a core with Intel's specific
implementation.

AMD's new development is on an SMT architecture known as Zen. Like all AMD
news and marketing, it sounds exciting. Hopefully they execute well and it
actually turns out to be exciting.

------
jd3
nice. I remember reading about this (what feels like) ages ago at this point.
Patiently waiting to see if Terry Davis shows up. reference:[0]

[0]:
[https://usesthis.com/interviews/terry.davis/](https://usesthis.com/interviews/terry.davis/)

------
dman
Are there any publicly accessible clusters/clouds with these?

------
modeless
NVIDIA Titan X today: 336 GB/s, 7 teraflops

Intel Xeon Phi next year: 400 GB/s, 6 teraflops

NVIDIA Pascal next year: 1 TB/s, >10? teraflops

NVIDIA still wins for my applications. Intel is aiming too low.

~~~
rdtsc
Why do you think they keep aiming low? You'd think they have enough people for
someone to check NVIDIA's numbers.

Are they offering something closer to general purpose CPUs, something that
could schedule OS threads to run on it?

~~~
ScottBurson
According to the article hendzen linked to, the Phi processors are absolutely
general purpose CPUs -- they can be the main CPU on the board and can boot and
run stock OSes.

This is much more powerful than a GPU, which is fundamentally SIMD -- the Phi can
hit TFLOPS speeds running hundreds of threads _doing different things_. While
not everyone will need this power, I suspect it will open up entire new
classes of applications.

~~~
makomk
It still only hits teraflops speed when you're running very wide SIMD code -
non-SIMD stuff is a lot slower, probably roughly the same speed penalty as
running non-SIMD code on a GPU based on what they've released. Not only that,
most researchers' very wide SIMD code is written for CUDA, which this doesn't
support. Oh, and the AVX-512 instructions required to take advantage of the
full speed of this aren't used by any Intel product currently in existence -
you basically have to develop code specifically for this board.
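
The scalar penalty is easy to ballpark (illustrative arithmetic only, assuming 512-bit vectors hold 16 single-precision lanes and an FMA counts as 2 FLOPs per lane):

```python
lanes = 512 // 32        # 16 single-precision lanes per AVX-512 register
fma_flops = 2            # a fused multiply-add counts as two FLOPs

# Scalar, non-FMA code leaves all but one lane idle and issues a single
# FLOP per op, so it sees at most 1/32 of the vector peak per core.
scalar_fraction = 1 / (lanes * fma_flops)
print(scalar_fraction)   # 0.03125
```

That ~32x gap (before accounting for memory behavior) is the same order of penalty you'd eat running divergent scalar code on a GPU.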

