I think the two can complement each other very well.
GPUs are flexible and scalable when you don't know what the large-scale parameters of the network you want to build will look like and you need a lot of them to do training. Let a fleet of cloud-based GPUs do the heavy lifting of training and learning.
But once training is over, an FPGA or even an ASIC could implement the trained model and run it at crazy-fast speed with low power. A piece of hardware like that could potentially handle things like real-time video processing through a DNN. Very handy for things like self-driving vehicles.
If you're dealing at scales where you can use the word "fleet" then it will usually make sense to just build an ASIC on a trailing process node rather than go for FPGAs. They'll be cheaper in bulk and more performant even with a large process disadvantage.
ADDENDUM: But fundamentally, in spaces like this, the underlying algorithms that can be accelerated are fairly simple. In most cutting edge AI these days the heavy lifting is performed by convolutional neural networks and the specialized silicon that works to speed up one set of convolutional neural network operations will speed up another just as well. Baking the network itself into the hardware shouldn't tend to be any better than loading it into specialized memory pools unless you get really exotic and do your neural network in analog electronics.
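To put that concretely, here's a toy sketch (my own illustration, nothing from the article): the convolution arithmetic is identical no matter whose network you're running, which is why one accelerator design transfers across models and why the weights can just live in memory instead of being baked into the silicon.

    # Minimal sketch: a convolution is the same arithmetic regardless of which
    # network's weights you load, so one accelerator design serves many models.
    import numpy as np

    def conv2d(feature_map, kernel):
        """Naive valid-mode 2D convolution; the weights are just data, not hardware."""
        h, w = feature_map.shape
        kh, kw = kernel.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(feature_map[i:i+kh, j:j+kw] * kernel)
        return out

    # Swapping models just means swapping the kernel array held in memory.
    edge_kernel = np.array([[1., 0., -1.]] * 3)
    blur_kernel = np.ones((3, 3)) / 9.0
    image = np.random.rand(32, 32)
    print(conv2d(image, edge_kernel).shape, conv2d(image, blur_kernel).shape)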
I think there is a big enough space between GPU and ASIC technology for FPGAs. The main reason is the lifetime of the models. The shorter that lifespan, the more expensive it is to exchange the ASICs. At the very least you have to produce new ASICs every few months, and replace them in special sockets, or even reflow/solder them to new cards.
My assumption is that the ASIC is executing code that changes every month, but that it's using instructions and a memory hierarchy geared towards convolutional neural networks. If that stops being true then of course you'd need a different ASIC, but then again if that stops being true there's no guarantee that a GPU or ASIC will do any better than a CPU. You could end up with something like alpha-beta pruning, where parallelism doesn't make much difference. A reasonable chip won't be able to contain enough transistors to have separate execution resources for each layer. It's going to have to work by loading a layer, convolving it, loading the next layer, convolving it, and so on. So you'll be able to change your network without changing the ASIC you're running it on while still taking advantage of your dedicated ganged operations. The FPGA version can be optimized for the exact sizes of the network layers in a way the more flexible ASIC version can't, but I expect that benefit to be much smaller than the gain from moving to an ASIC in the first place.
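To make that loop concrete, here's a toy sketch (illustrative names, not any real accelerator's API) of the load-a-layer-then-run-it pattern, with weights streamed in from memory rather than baked into the chip:

    # Rough sketch of the "load a layer, run it, load the next" loop described
    # above; layer weights live in memory, so the same chip can run a retrained
    # or resized network without being re-fabbed.
    import numpy as np

    def run_network(input_activations, layer_weights, layer_fns):
        x = input_activations
        for weights, fn in zip(layer_weights, layer_fns):
            # On an accelerator this is the DMA step: stream this layer's
            # weights into on-chip buffers before launching the ganged compute.
            x = fn(x, weights)
        return x

    relu = lambda x: np.maximum(x, 0.0)
    dense = lambda x, w: relu(x @ w)

    weights = [np.random.randn(128, 64), np.random.randn(64, 10)]
    out = run_network(np.random.randn(1, 128), weights, [dense, dense])
    print(out.shape)  # (1, 10)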
From the papers I do believe they are hardcoding the layer weights in the hardware definition of the FPGA. These FPGAs also have no significant RAM on chip, but the Intel FPGAs they use do seem to have an even larger number of LUTs than the usual embedded FPGAs, and even dedicated floating-point units.
At the very least they talk about omitting weights which are 0 in the synthesis.
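A hedged guess at what that looks like before synthesis (my own sketch, not from the papers): keep only the non-zero (index, value) pairs so the generated design never instantiates a multiply-by-zero.

    # Toy pruning/packing step: drop zero weights and count how many
    # multipliers (or sparse multiply-accumulates) you actually need.
    import numpy as np

    def prune_and_pack(weights, threshold=0.0):
        mask = np.abs(weights) > threshold
        indices = np.argwhere(mask)
        values = weights[mask]
        return indices, values, mask.mean()  # mean = fraction of weights kept

    w = np.random.randn(256, 256)
    w[np.abs(w) < 1.0] = 0.0                 # crude pruning for the example
    idx, vals, density = prune_and_pack(w)
    print(f"kept {density:.1%} of weights -> {len(vals)} multiplies instead of {w.size}")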
TrueNorth was built to run spiking neural networks, which have little to do with deep learning (even though they managed to get it to run a small convolutional NN), and Nervana has never actually built any hardware.
Yes, there are at least a dozen companies with specialized hardware accelerators in some stage of development. For smaller parts, some of the existing DSP companies like CEVA and Cadence Tensilica are also adapting their architectures for deep neural net workloads.
Yet it's still not clear whether building a custom chip makes sense, because the next Nvidia chip might make it obsolete. Or the one after that (still too soon to tell).
I'm all for open toolchains, but I don't think this is the primary reason for FPGAs' lack of broader appeal. I think the lack of open toolchains and the lack of broad appeal both stem from FPGA vendors' almost exclusive focus on high-margin, high-end applications. There doesn't seem to be much push to focus on larger markets with lower margins.
That said, Lattice has recently started to push into these new areas, but they haven't been that successful. If they start to see more success, I think we will see open toolchains. Lattice also has the advantage of being able to lean on the open toolchain work done by people like Clifford Wolf [1].
You are wrong about the focus being only on high-end FPGAs. There are plenty of FPGAs for low-level stuff. For example, one project I was involved in used an FPGA for debouncing, complete with the programming interface for an ADSP-21xx. They replaced a dozen different chips with one FPGA.
>>> Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?
Anyone who has worked with FPGAs knows that they are completely different to program than a CPU/GPU.
They are not competing at all. You can't take some computer developers and have them work with FPGAs. That'd be like taking a dude who knows XML and putting him on optimizing low-level C++ algorithms.
For starters, an FPGA doesn't run programs; it describes hardware components.
True, but if the benefit is there, the software folks will find the right people to assist.
Bitcoin is a good example. I'm reasonably sure that FPGAs eclipsed GPUs there because some "software person" called the right person in. And it certainly was a competition. Similar for the eventual move to ASICs.
If there's enough incremental benefit, it will happen.
There is the evolutionary approach taken by Thompson. [1] There were some spectacular results, though each circuit only worked on the exact FPGA it was trained on, so that particular approach has no practical applications.
I don't actually believe this is true. I find that paper fascinating, and I believe today you could just simulate the hardware definition, rather than work straight on the bitstream and evaluate it in hardware.
I also believe that today's FPGAs are more robust against these kinds of bugs, because even with ordinary HDL code, those defects or cross-talk incidents could result in difficult-to-debug errors.
They are in competition, because increasingly the deep learning models are run by people who are not experts in machine learning: in deployment, in self-driving cars, and in other machinery.
And if it is just about bang-for-buck and scale, FPGAs seem to be quite competitive.
The big win for them is a 10x reduction in power usage, since in a datacenter/cloud environment this is more important. Still at the research stage though.
Everything in that article keeps assuming lower precision, which is often okay, but then they kept testing the Titan X with 32-bit-wide floats. Doesn't the newer Nvidia stuff, like the Titan X, support 16-bit-wide floats, and doesn't that run at almost double the speed?
They (including the 1080 Ti, which is basically a Titan) do support 4x faster INT8, though, so if comparing to a reduced-precision ternary net running on an FPGA, that seems relevant. (They mention using INT8 in some of the GPU benchmarks but I'm not sure which graphs are supposed to represent that.)
GP100 supports FP16 FMAD
GP102 supports INT16 and INT8 MAD with 32-bit accumulation
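For anyone who hasn't played with it, here's a rough numpy sketch (my own, not vendor code) of what that INT8 multiply-accumulate-into-32-bit pattern buys you: quantize both operands, accumulate the dot product in int32, and rescale at the end.

    # Illustrative DP4A-style pattern: int8 operands, int32 accumulator.
    import numpy as np

    def quantize_int8(x):
        scale = np.abs(x).max() / 127.0
        return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

    a, b = np.random.randn(1024), np.random.randn(1024)
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)

    acc = np.dot(qa.astype(np.int32), qb.astype(np.int32))  # 32-bit accumulation
    approx = acc * sa * sb
    print(approx, np.dot(a, b))  # close, at a fraction of the memory traffic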
Overall, not impressed with Stratix 10. It won't be cost effective, it's not much more power-efficient, and Volta will likely leapfrog it across the board within a year.
Wasn't this thing supposed to sample in late 2014? Back then it would have been a gamechanger at any price. Now, 1080Ti for $700 beats it across the board in throughput/$. NVIDIA's confusing messaging about using consumer versus professional HW is about the only thing that might make it viable for deep learning. Although I note the absence of training perf numbers here, just (apparently) inference.
"GP100 however uses more flexible FP32 cores that are able to process one single-precision or two half-precision numbers in a two-element vector. Nvidia intends to address the calculation of algorithms related to deep learning with those." So 16-bit ops are more than twice as fast as 32-bit ops, if you pack them into the 32-bit cores two at a time and also use the dedicated 16-bit core.
>>> How many Titan Xs can you purchase for the price of an Intel Stratix 10, including the kit required to start development?
Actually, they are similar in price.
Medium to high-end consumer FPGAs and GPUs go up to around $1k.
Then there are the enterprisey GPUs (Quadro and FireThing, as I recall) going for a few k.
It's similar for FPGAs. The very high end (Stratix/Virtex) will cost a few k as well, peaking at 5k or 10k for the top models (bare FPGA chip only).
I recall negotiating some FPGA dev kits in the 10-15k€ range; that seems to be the top end. If I remember well, there was an option to get 4 Virtex chips on the same dev board for 30k or 50k. That's as high as it gets.
My memory ain't perfect, but that's roughly the range.
I have a feeling that much of the cost is from the smaller market. It takes a massive amount of engineering resources to put out a new chip, and you need a lot of sales to compensate. If FPGAs become commonplace in servers, then we will see prices decrease as it becomes easier to recoup the overhead.
"Another emerging trend introduces sparsity (the presence of zeros) in DNN neurons and weights by techniques such as pruning, ReLU, and ternarization, which can lead to DNNs with ~50% to ~90% zeros. " Does anyone have a reference for this ?
Just randomly googling "ternarization deep learning" led me to this article from ICLR 2017 https://openreview.net/pdf?id=r1fYuytex which in turn seems to reference further work in the area.
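As a quick toy sanity check (my own numbers, not from that paper), you can see where the zeros come from: ReLU kills roughly half of random pre-activations, and ternarizing to {-1, 0, +1} with a dead zone adds more.

    import numpy as np

    x = np.random.randn(1_000_000)

    # ReLU: everything negative becomes exactly 0.
    relu_sparsity = np.mean(np.maximum(x, 0.0) == 0.0)

    # Ternarization with a dead zone around 0 (threshold heuristic assumed here).
    threshold = 0.7 * np.mean(np.abs(x))
    ternary = np.sign(x) * (np.abs(x) > threshold)
    ternary_sparsity = np.mean(ternary == 0.0)

    print(f"ReLU zeros: {relu_sparsity:.0%}, ternary zeros: {ternary_sparsity:.0%}")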
It's not like most people in the field are programming GPUs.
If an FPGA vendor made an FPGA solution (both the hardware and the software libraries to integrate with one or two machine learning frameworks) that did basic matrix/tensor calculations faster/cheaper than GPUs, then they'd be able to take a lot of market off Nvidia. Users wouldn't need to program the FPGA directly if they could work at the level of matrix operations.
People are buying large quantities of very expensive GPUs just to do some matrix manipulations, so why not FPGAs?
But the point is that you don't have that much FPGA-specific code. Once someone does the matrix manipulations and the proper integration, everyone else can just run e.g. TensorFlow code on it faster and/or cheaper without specific expertise; if an FPGA vendor can make this one-time investment in software tools, then they can compete for a slice of the large pie of hardware revenue that Nvidia now has for itself.
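To be concrete about what "working at the level of matrix operations" would mean for the user, here's a toy Python sketch (every name here is hypothetical, not a real vendor API): the backend is the only thing that changes; the model code doesn't.

    import numpy as np

    class NumpyBackend:                 # stand-in for the "GPU" path today
        matmul = staticmethod(np.matmul)
        relu = staticmethod(lambda x: np.maximum(x, 0.0))

    class FpgaBackend(NumpyBackend):
        # Inherits the numpy ops here; a vendor would override these with
        # calls into their FPGA runtime instead.
        pass

    def dense_layer(backend, x, w):
        return backend.relu(backend.matmul(x, w))

    x, w = np.random.randn(8, 128), np.random.randn(128, 10)
    for backend in (NumpyBackend, FpgaBackend):
        print(backend.__name__, dense_layer(backend, x, w).shape)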
But, assuming that "matrix manipulation" code has existed for FPGAs for a while, since Verilog and VHDL are quite old, the question remains: why hasn't (a) an FPGA vendor already done this and started actively selling a TensorFlow solution, or (b) Nvidia pursued it? I have a feeling there are more factors at work than just whether or not it's possible.
This news article is fairly useless. FPGAs have always been an option; that's why they are used in many DSP heavy workloads in communications etc. However, they are expensive. GPUs are popular and cost-efficient because millions of video gamers purchase them, driving prices down. The amount of 32-bit floating point compute on a modern video card is absurd for the price.