
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning? - okket
https://www.nextplatform.com/2017/03/21/can-fpgas-beat-gpus-accelerating-next-generation-deep-learning/
======
mabbo
I think the two can complement each other very well.

GPUs are flexible and scalable when you don't yet know what the large-scale
parameters of the network you want to build will look like and you need a lot
of them for training. Let a fleet of cloud-based GPUs do the heavy lifting of
training and learning.

But once training is over, an FPGA or even an ASIC could implement the
trained model and run it crazy fast at low power. A piece of hardware like
that could potentially handle things like real-time DNN video processing.
Very handy for things like self-driving vehicles.
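
(Not from the article, but to make the "freeze then deploy" step concrete: the usual move is post-training quantization, squashing the trained float weights down to fixed point so they map onto cheap fixed-function hardware. A toy numpy sketch, with made-up sizes and a naive per-tensor scale:)

    import numpy as np

    def quantize_int8(w):
        """Symmetric post-training quantization of a trained weight matrix."""
        scale = np.abs(w).max() / 127.0                   # one scale per tensor (toy choice)
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    # pretend these are the float32 weights that came out of the GPU training run
    w = np.random.randn(256, 256).astype(np.float32)
    x = np.random.randn(256).astype(np.float32)

    q, scale = quantize_int8(w)
    y_ref = w @ x                                         # what the GPU computed
    y_fixed = (q.astype(np.float32) * scale) @ x          # what the fixed hardware would approximate
    print("max abs error:", float(np.max(np.abs(y_ref - y_fixed))))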

~~~
DocSavage
Are there any TensorFlow-tuned ASICs like Google's TPU available or planned
for general release?

[https://rcpmag.com/articles/2016/10/10/microsoft-google-
ai-s...](https://rcpmag.com/articles/2016/10/10/microsoft-google-ai-
showdown.aspx)

If the deep learning architecture stabilizes for a problem with sufficient
market demand, seems like ASICs could be economical.

~~~
etrautmann
TrueNorth, from Paul Merolla and co. at IBM, and Nervana Systems (now Intel)
both have hardware optimized for neural networks.

~~~
p1esk
TrueNorth was built to run spiking neural networks, which have little to do
with deep learning (even though they managed to get it to run a small
convolutional NN), and Nervana has never actually built any hardware.

------
payne92
The historical impediment to broader FPGA adoption has always been their
proprietary nature.

Until vendors are willing to release bitstream details that enable open source
tools and a vibrant ecosystem, applications will be limited.

~~~
blackguardx
I'm all for open toolchains, but I don't think this is the primary reason for
FPGAs' lack of broader appeal. I think the lack of open toolchains and the lack
of broad appeal both stem from FPGA vendors' almost exclusive focus on
high-margin, high-end applications. There doesn't seem to be much push to go
after larger markets with lower margins.

That said, Lattice has recently started to push into these new areas, but they
haven't been that successful. If they start to see more success, I think we
will see open toolchains. Lattice also has the advantage of being able to lean
on the open toolchain work done by people like Clifford Wolf [1].

[1] [http://www.clifford.at/icestorm/](http://www.clifford.at/icestorm/)

~~~
bayesian_horse
I've heard people say the open source iCE40 toolchain is less painful to work
with.

~~~
bostand
It is!

It doesn't beat the commercial tools in area / performance right now, but dear
god it's so much more pleasant to work with!

~~~
thegp
True. Plus Clifford is an awesome guy :)

------
user5994461
>>> Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?

Anyone who has worked with FPGAs knows that they are completely different to
program than CPUs/GPUs.

They are not competing at all. You can't just take some software developers and
have them work with FPGAs. That'd be like taking a dude who knows XML and
putting him on optimizing low-level C++ algorithms.

For starters, an FPGA doesn't run programs: you describe hardware components.

~~~
zamalek
There is the evolutionary approach taken by Thompson.[1] There were some
spectacular results; though each circuit only worked on the exact FPGA that it
was trained on - so that particular approach has no practical applications.

[1]:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.9691&rep=rep1&type=pdf)

~~~
bayesian_horse
I don't actually believe this is true. I find that paper fascinating, and I
believe today you could simply simulate the hardware definition rather than
working straight on the bitstream and evaluating it in hardware.

I also believe that today's FPGAs are more robust against these kinds of
defects; after all, even with ordinary HDL code, such defects or cross-talk
incidents would result in hard-to-debug errors.

------
anonymousDan
Microsoft Research has some interesting recent work on using FPGAs for
accelerating the convolution steps of DNN training at least:
[https://www.microsoft.com/en-
us/research/publication/acceler...](https://www.microsoft.com/en-
us/research/publication/accelerating-deep-convolutional-neural-networks-using-
specialized-
hardware/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F240715%2Fcnn%2520whitepaper.pdf)

The big win for them is a 10x reduction in power usage, which matters more in a
datacenter/cloud environment. Still at the research stage, though.

~~~
moftz
Switch those FPGAs out for ASICs and they could reduce power usage and
increase speed by orders of magnitude.

------
sqeaky
Everything in that article keeps assuming lower precision, which is often okay,
but then they kept testing the Titan X with 32-bit floats. Doesn't the newer
Nvidia stuff, like the Titan X, support 16-bit floats, and doesn't it run them
at almost double the speed?

~~~
znfi
The Titan X(P) does not really support 16-bit floats; or rather, they are
supported, but only at 1/64th the speed of 32-bit floats.

Source:
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture)](https://en.wikipedia.org/wiki/Pascal_\(microarchitecture\))

Section 2.4 (Chips) says the Titan X(P) uses the GP102 chip, and section 3
(Performance) gives the speed for 16-bit float computation.

~~~
gwern
They (including the 1080 Ti, which is basically a Titan) do support 4x faster
INT8, though, so if you're comparing against a reduced-precision ternary net
running on an FPGA, that seems relevant. (They mention using INT8 in some of
the GPU benchmarks, but I'm not sure which graphs are supposed to represent
that.)
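
(For anyone who hasn't looked at the INT8 path: it's 8-bit multiplies accumulated into 32-bit integers, which is what those reduced-precision GEMM comparisons are really measuring. A numpy stand-in for that arithmetic, not NVIDIA's actual dp4a kernel:)

    import numpy as np

    # int8 weights and activations, int32 accumulation (what dp4a does in 4-element groups)
    a = np.random.randint(-128, 128, size=(64, 96), dtype=np.int8)
    b = np.random.randint(-128, 128, size=(96, 32), dtype=np.int8)

    acc = a.astype(np.int32) @ b.astype(np.int32)   # products and sums fit comfortably in int32
    print(acc.dtype, acc.min(), acc.max())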

------
molticrystal
How many Titan Xs can you purchase for the price of an Intel Stratix 10,
including the kit required to start development?

I saw some of their Stratix II-V kits costing $1k-$100k+, but didn't find the
Stratix 10 mentioned in the article in their kit list:
[https://www.buyaltera.com/Search?keywords=stratix+kit&pNum=1](https://www.buyaltera.com/Search?keywords=stratix+kit&pNum=1)

>Intel Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/sec)
than Titan X Pascal GPU on GEMMs for sparse, Int6, and binarized DNNs

My guess is that while electricity costs would be much higher, it would still
be better at this time to just buy ceil(1.1), ceil(1.5), or ceil(5.4) Titans
instead.
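
(Back-of-the-envelope version of that argument, using a placeholder Titan X street price; the actual Stratix 10 board price is the unknown being asked about:)

    import math

    # article's claim: Stratix 10 vs Titan X Pascal throughput (TOP/s) on GEMMs
    speedup = {"sparse": 1.1, "int6": 1.5, "binarized": 5.4}
    titan_price = 1200                 # placeholder street price per Titan X Pascal

    for workload, s in speedup.items():
        n = math.ceil(s)               # whole Titans needed to match one Stratix 10
        print(workload, ":", n, "Titan(s), roughly $", n * titan_price)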

~~~
user5994461
>>> How many Titan Xs can you purchase for the price of an Intel Stratix 10,
including the kit required to start development?

Actually, they are similar in price.

Medium to high-end consumer FPGAs and GPUs go up to around $1k.

Then there are the enterprisey GPUs (Quadro and FireThing, as I recall) going
for a few k.

It's similar for FPGAs. The very high end (Stratix/Virtex) goes for a few k as
well, peaking at $5k or $10k for the top models (bare FPGA chip only).

I recall negotiating some FPGA devkits in the €10-15k range; that seems to be
the top end. If I remember right, there was an option to get four Virtex chips
on the same dev board for €30k or €50k. That's as high as it gets.

My memory ain't perfect, but that's about the range.

------
sgt101
"Another emerging trend introduces sparsity (the presence of zeros) in DNN
neurons and weights by techniques such as pruning, ReLU, and ternarization,
which can lead to DNNs with ~50% to ~90% zeros." Does anyone have a reference
for this?
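
(Not a reference, but the three mechanisms are easy to see in toy form; the thresholds below are made up, only the shape of the effect matters:)

    import numpy as np

    x = np.random.randn(100000).astype(np.float32)    # stand-in for activations/weights

    relu = np.maximum(x, 0)                           # ReLU zeroes every negative value (~50%)
    pruned = np.where(np.abs(x) < 1.0, 0, x)          # magnitude pruning below a threshold (~68%)
    ternary = np.sign(x) * (np.abs(x) > 1.5)          # ternarize to {-1, 0, +1} (~87% zeros)

    for name, v in [("relu", relu), ("pruned", pruned), ("ternary", ternary)]:
        print(name, "zeros:", round(float(np.mean(v == 0)), 2))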

~~~
znfi
Just randomly googling "ternarization deep learning" led me to this article
from ICLR 2017
[https://openreview.net/pdf?id=r1fYuytex](https://openreview.net/pdf?id=r1fYuytex)
which in turn seems to reference further work in the area.

~~~
sgt101
Thanks - I wouldn't have thought of ternarization in the context of sparsity!
Good idea :)

------
jhallenworld
What's missing from this analysis is a price comparison between FPGAs and
GPUs.

------
legulere
I guess the problem is that FPGAs are even more horrendous to program than
GPUs.

~~~
PeterisP
It's not like most people in the field are programming GPUs.

If an FPGA vendor made an FPGA solution (both the hardware and the software
libraries to integrate with one or two machine learning frameworks) that did
basic matrix/tensor calculations faster or cheaper than GPUs, then they'd be
able to take a lot of market off Nvidia. Users wouldn't need to program the
FPGA directly if they could work at the level of matrix operations.
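
(Roughly what that looks like from the user's side: the framework calls a numpy-shaped GEMM and the vendor's library decides where it runs. The fpga_blas module and its gemm call below are hypothetical, nothing like it ships today:)

    import numpy as np

    try:
        import fpga_blas                   # hypothetical vendor-supplied library, does not exist
    except ImportError:
        fpga_blas = None

    def matmul(a, b):
        """Run the GEMM on the accelerator if its library is present, else fall back to CPU."""
        if fpga_blas is not None:
            return fpga_blas.gemm(a, b)    # placeholder API name
        return a @ b

    y = matmul(np.random.randn(128, 256), np.random.randn(256, 64))
    print(y.shape)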

~~~
user5994461
That means making a whole networked black box, based on FPGA(s), that exposes
an API for external use.

It's certainly possible to build. It's also a very expensive, very specialized
appliance. All of that just to do some matrix manipulations.

~~~
PeterisP
People are buying large quantities of very expensive GPUs just to do some
matrix manipulations, so why not FPGAs?

But the point is that you don't have that much FPGA-specific code - once
someone does the matrix manipulations and the proper integration, everyone
else can just run e.g. TensorFlow code on it faster and/or cheaper without
specific expertise; if an FPGA vendor can make this one-time investment in
software tools, then they can compete for a slice of the large pie of hardware
revenue that Nvidia now has to itself.

~~~
radarsat1
But, assuming that "matrix manipulation" code has existed for FPGAs for a
while, since verilog and VHDL are quite old, the question remains: why hasn't
(a) an FPGA vendor already done this and already actively selling a tensorflow
solution or (b) NVidia pursuing this? I have a feeling there are more factors
at work than just whether or not it's possible.

------
krapht
This news article is fairly useless. FPGAs have always been an option; that's
why they are used in many DSP-heavy workloads in communications, etc. However,
they are expensive. GPUs are popular and cost-efficient because millions of
video gamers purchase them, driving prices down. The amount of 32-bit
floating-point compute on a modern video card is absurd for the price.

~~~
user5994461
FPGAs are cheap. What's expensive are the engineers who can speak HDL and make
use of an FPGA.

------
p0nce
How about FPGAs programmed using OpenCL?

