
Intel Shipping Nervana Neural Network Processor First Silicon Before Year End - kartD
https://www.anandtech.com/show/11942/intel-shipping-nervana-neural-network-processor-first-silicon-before-year-end
======
omarforgotpwd
“Alright guys we already blew it on mobile. People are starting to realize
that they can put ARM in their data center and save a ton on electricity. We
better not blow it on this AI thing with Nvidia”

~~~
mtgx
I think they already did? Highly unlikely this chip is ahead of Volta. And
Nvidia has already teased much higher performance than Volta with its next-
generation, at least for inference (320 TOPS at 30W).

~~~
scottlegrand2
Not that I want to give Intel any credit here whatsoever because they don't
deserve any, but that 320 Tera Ops number involves multiple chips and to the
best of my knowledge, Nvidia has not specified whether that's 8, 16, or 32 bit
math.

And they're not the only ones: we still don't know the underlying math model
in Google's second-generation TPU, now do we?

We are swimming in a sea of FUD and Benchmarksmanship right now, but if I had
to make a bet, the next three years are owned by Nvidia. After that, things
get hazy.

------
kbumsik
I hope Intel will try to integrate this with existing ML libraries like
TensorFlow and Caffe, rather than making their own separate ecosystem.

~~~
alvern
I'm betting it shares a lot with the Movidius Neural Compute Stick and uses
Caffe for the ML.

[https://developer.movidius.com/](https://developer.movidius.com/)

~~~
nrp
Nervana and Movidius were separate acquisitions by Intel and were relatively
recent. I’d be surprised if they managed to merge the software efforts that
quickly.

------
vonnik
Intel has been saying all along that they would ship Nervana's chips by year
end, so not really news. The news will be if they miss their deadline.

------
dragontamer
What's the key advantage of this Nervana architecture over GPUs?

I can theorycraft... Hypothetically, GPUs have a memory architecture
structured like this (using OpenCL terms here): Global Memory <-> Local Memory <->
Private Memory, corresponding to the GPU <-> Work Group <-> Work Item.

On AMD's systems, the "Local" / Work Group tier is a group of roughly 256 work
items (in CUDA terms, a 256-thread "Block"), and all 256 work items can access
Local Memory at outstanding speeds (and then there's even faster "Private"
memory per work item / thread, which is basically a hardware register).

On, say, a Vega 64, there are 64 compute units (each of which has 256 work items
running in parallel). The typical way Compute Unit 1 can talk to Compute Unit
2 is to write data from CU1 into Global Memory (which is off-chip) and then
read it back in on CU2. In effect, GPUs are designed for high-bandwidth
communication WITHIN a work group (a thread "Block" in CUDA terms), but they
have slow communication ACROSS work groups / blocks.

In effect, there's only "one" Global Memory on a GPU. And in the case of a
Vega64 GPU, that's 16384 work items that might be trying to hit Global Memory
at the same time. True, there are caching layers and other optimizations, but
any methodology built on a global resource will naturally slow down code and
hamper parallelism.
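
To make that concrete, here's a minimal CUDA sketch (the kernel names and the
256-thread block size are just assumptions matching the numbers above) of the two
paths: threads in the same block exchange data through fast on-chip shared memory,
while getting a result across blocks means a round trip through off-chip global
memory, usually via a second kernel launch.

```cuda
#include <cuda_runtime.h>

// Stage 1: each 256-thread block reduces its slice entirely in shared memory.
__global__ void intra_block_exchange(const float* in, float* block_sums) {
    __shared__ float tile[256];          // fast on-chip "Local Memory" in OpenCL terms
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // cheap synchronization, but only within this block

    // Tree reduction: all of this traffic stays on-chip.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    // Anything another block needs has to go out to off-chip global memory.
    if (tid == 0) block_sums[blockIdx.x] = tile[0];
}

// Stage 2: the usual way blocks "communicate" -- a second kernel reads back
// what the first one wrote to global memory.
__global__ void combine_blocks(const float* block_sums, float* out, int n_blocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < n_blocks; ++i) total += block_sums[i];
        *out = total;
    }
}
```

The first kernel's traffic is the fast "within a work group" case; the hand-off
between the two kernels is the slow "across work groups" case.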

Neural networks could possibly get faster message passing with a different
memory architecture. Imagine if the compute units allowed quick communication
in a torus, for example, so compute unit #1 could talk directly to compute
unit #2.

This would roughly correspond to "Layer 1" neurons passing signals to "Layer 2"
neurons and vice versa (say, for backpropagation of errors).

Alas, I don't see much information on what Nervana is doing differently. When
"Parallella" came out a few years ago, they were crystal clear about how their
memory architecture was grossly different from a GPU's... it'd be nice if
Nervana's marketing material were similarly clear.

\----------

Hmm, this page is a bit more technical: [https://www.intelnervana.com/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/](https://www.intelnervana.com/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/)

It seems like the big selling points are:

* "Flexpoint" \-- They're a bit light on the details, but they argue that "Flexpoint" is better than Floating Point. It'd be nice if they were a bit more transparent on what "Flexpoint" is, but I'll imagine that its like a Logarithmic Number System ([https://en.wikipedia.org/wiki/Logarithmic_number_system](https://en.wikipedia.org/wiki/Logarithmic_number_system)) or similar, which probably would be better for low-precision Neural Network computations.

* "Better Memory Architecture" \-- I can't find any details on why their memory architecture is better. They just sorta... claim its better.

Ultimately, GPUs were designed for graphics problems, so I'm sure there's a
better architecture out there for neural network problems. It's just
fundamentally a different kind of parallelism. (Image processing / shaders
handling the top-left corner of a frame don't need to know what's going on in
the bottom-right corner, so GPUs don't have high-bandwidth communication lines
between those units. Neural networks require a bit more communication than the
image processing problems of the past did.) But I'm not really seeing "why"
this architecture is better yet.

~~~
tachyonbeam
The main competitor for this is other dedicated neural network hardware, such
as the Google TPU: [https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu](https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu)

One potential advantage is that neural network weights could be as small as 8
bits. If you're working with 8-bit quantities, you save on memory bandwidth and
power, and you can fit more computational units on the chip. Neural network
hardware also isn't going to need hardware implementations of things like
sine/cosine or square root. It can potentially avoid control-flow and branching
hardware (or have less of it), and it may not need the whole complex register
file and thread-swapping machinery that GPUs have.
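
A rough sketch of that 8-bit point (the symmetric per-tensor scheme and the
function names are just illustrative assumptions, not how any particular chip
does it): map each fp32 weight to an int8 value plus one shared scale, and
accumulate dot products in 32-bit integers.

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>

// Symmetric per-tensor quantization: q = round(w / scale), where scale is chosen
// so the largest-magnitude weight maps to +/-127 (so every value fits in int8).
void quantize_int8(const float* w, int8_t* q, int n, float* scale_out) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; ++i) max_abs = fmaxf(max_abs, fabsf(w[i]));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; ++i) q[i] = (int8_t)lrintf(w[i] / scale);
    *scale_out = scale;
}

// 8-bit dot product: multiply int8 values, accumulate in 32 bits, rescale once at the end.
float int8_dot(const int8_t* a, const int8_t* b, int n, float scale_a, float scale_b) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc * scale_a * scale_b;
}

int main() {
    float w[4] = {0.5f, -1.0f, 0.25f, 2.0f}, x[4] = {1.0f, 1.0f, -2.0f, 0.5f};
    int8_t qw[4], qx[4];
    float sw, sx;
    quantize_int8(w, qw, 4, &sw);
    quantize_int8(x, qx, 4, &sx);
    printf("%f (true answer is 0.0)\n", int8_dot(qw, qx, 4, sw, sx));
    return 0;
}
```

The weights take a quarter of the memory bandwidth of fp32, and the
multiply-accumulate hardware gets correspondingly smaller, which is where the
density and power argument comes from.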

~~~
dragontamer
Flexpoint looks like it's 16 bits, however:
[https://www.intelnervana.com/nervana-engine-delivers-deep-learning-at-ludicrous-speed/](https://www.intelnervana.com/nervana-engine-delivers-deep-learning-at-ludicrous-speed/)

And I'm not really sure that stripping "Sin" and "Cos" out of the FPU units
would really save that much power and/or improve efficiency.

Generally speaking: the biggest problem today is the memory architecture of
your computing device (be it a CPU, GPU, Parallella, or FPGA). Efficient CPU
programming requires a deep understanding of the L1 / L2 / L3 caches and
off-chip memory. And similarly, efficient GPU code requires a deep
understanding of the memory hierarchy.
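
As a tiny host-side illustration of why that hierarchy matters (nothing
Nervana-specific, just the standard example): the same O(N^2) summation runs far
faster when the inner loop walks memory contiguously than when it strides across
rows, purely because of how L1/L2/L3 cache lines get used.

```cuda
#include <vector>

// Inner loop walks consecutive addresses: each cache line is fully used.
float sum_row_major(const std::vector<float>& m, int N) {
    float s = 0.0f;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i * N + j];
    return s;
}

// Inner loop strides by N floats: for large N, roughly one cache miss per element.
float sum_col_major(const std::vector<float>& m, int N) {
    float s = 0.0f;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i * N + j];
    return s;
}
```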

\-------------------

[https://www.intelnervana.com/nervana-engine-delivers-deep-learning-at-ludicrous-speed/](https://www.intelnervana.com/nervana-engine-delivers-deep-learning-at-ludicrous-speed/)

Looking at the Nervana posts... it looks like the "memory" question is:

1\. HBM: Except both AMD Fury and AMD Vega64 are already using HBM. NVidia
also has an HBM offering at the very very high end.

2\. 16-bit Flexpoint: Except NVidia and AMD cards already can efficiently
calculate half-floats (16-bit floats).

3\. MAC: Multiply and Accumulate. AMD and NVidia cards have been optimizing
this instruction and can already perform matrix multiplications at ludicrous
speeds (see the sketch after this list).
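
For reference on item 3: the inner loop of a matrix multiply really is just a
chain of multiply-accumulates. A deliberately naive, unoptimized CUDA kernel
makes that explicit (the tiled / tensor-core versions the vendors ship are much
fancier, but the core operation is the same):

```cuda
// One output element per thread; the loop body is a single MAC per iteration.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc = fmaf(A[row * N + k], B[k * N + col], acc);   // multiply-accumulate
    C[row * N + col] = acc;
}
```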

Maybe the Google TPU can be a theoretical competitor... but until Google's TPU
becomes available from newegg.com, it really isn't a consideration for most
people.

\----------------

The main "advantage" of Nervana looks like its improved memory architecture
with "separate pipelines for computation and data management". (aka: vague
marketing talk).

And I really do think this is the key: depending on how memory is architected
on the Nervana system, it really can make or break the speed of applications.
But it's also the part that's the most mysterious, with the least
documentation so far...

~~~
tachyonbeam
Google plans to sell the TPU as a cloud service. It's also their secret
weapon, so to speak. They don't need people to buy it in order for it to be an
asset, they mostly want to give themselves an advantage. That is, all the
people at DeepMind now have access to faster hardware with more memory. They
can run bigger experiments than everyone else.

As for the memory hierarchy: in hardware, there is always something to gain
from specialization. Something like a convolution in a CNN is going to have
very predictable memory access patterns, and it's possible to take advantage
of that. Google claims to already have about 10x speedups over GPUs, IIRC.
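
A sketch of what "predictable access patterns" means in practice (dimensions and
names are illustrative; this is not how the TPU actually implements it): a direct
2D convolution touches the input in a fixed sliding-window order, so every address
is known ahead of time and the data can be staged into on-chip buffers instead of
being fetched on demand.

```cuda
// "Valid" 2D convolution (really cross-correlation, as in most CNN frameworks):
// each thread computes one output pixel from a K x K window of the input.
__global__ void conv2d_valid(const float* in, const float* kernel, float* out,
                             int H, int W, int K) {
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int outH = H - K + 1, outW = W - K + 1;
    if (y >= outH || x >= outW) return;

    float acc = 0.0f;
    // The window slides in a completely regular pattern, which specialized
    // hardware can exploit with streaming / prefetching into on-chip buffers.
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            acc += in[(y + ky) * W + (x + kx)] * kernel[ky * K + kx];
    out[y * outW + x] = acc;
}
```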

------
novaRom
Unfortunately for Intel, it is probably too late. The specs of the new
processors are not impressive even in comparison with the previous Tesla
generation.

~~~
adventured
Well, first they're Intel. Second, they have $25 billion in cash and $10b per
year in profit. They have the resources to be persistent if it's important.
This is still the first inning. It'd be like calling it too late in 3D
graphics chips when 3dfx shipped the Voodoo. It's nowhere near too late, the
market will be a decade in the making.

~~~
doomlaser
You could have said the same thing about Microsoft's late entry into the
touchscreen smartphone market.

~~~
adventured
You could have said the same thing about X Y Z.

Some of those calls turn out right, and some turn out wrong. It's nearly
meaningless as a generalized premise.

Nintendo's NES arrived into the global video game console market long after
Atari. Sony was very far behind in arriving into the console market with the
first PlayStation.

Google arrived years after Excite and AltaVista.

The portable mp3 player market was years old when the iPod arrived.

Microsoft was years behind in most cases in productivity software. It almost
always chased from behind, and that includes Windows.

In the early years Dell was _tiny_ in terms of volume compared to the computer
majors at the time. How could they possibly catch up?

AMD was practically bankrupt on multiple occasions. They lost a billion
dollars across 2014 and 2015. What chance did they have of coming back to life
and challenging Intel again, having to chase from behind like they have?

The tech industry is overflowing with examples throughout its history of
companies entering existing markets and taking them, or otherwise recovering
from running behind.

~~~
doomlaser
None of those were examples of companies buying their way into markets late
with me-too strategies financed by dominant war chests.

~~~
lodi
Okay, how about Microsoft and the original Xbox? They lost $4B+ muscling their
way in, and now have the market-leading console.

[https://www.engadget.com/2005/09/26/forbes-xbox-lost-microsoft-4-billion-and-counting/](https://www.engadget.com/2005/09/26/forbes-xbox-lost-microsoft-4-billion-and-counting/)

