I think the two can complement each other very well.
GPUs are flexible and scalable when you don't know what the large-scale parameters of the network you want to build will look like and you need a lot of them to do training. Let a fleet of cloud-based GPUs do the heavy lifting of training and learning.
But once training is over, an FPGA or even an ASIC could implement the trained model and run it at crazy-fast speed with low power. A piece of hardware like that could potentially handle things like real-time video processing through a DNN. Very handy for things like self-driving vehicles.
If you're dealing at scales where you can use the word "fleet" then it will usually make sense to just build an ASIC on a trailing process node rather than go for FPGAs. They'll be cheaper in bulk and more performant even with a large process disadvantage.
ADDENDUM: But fundamentally, in spaces like this, the underlying algorithms that can be accelerated are fairly simple. In most cutting edge AI these days the heavy lifting is performed by convolutional neural networks and the specialized silicon that works to speed up one set of convolutional neural network operations will speed up another just as well. Baking the network itself into the hardware shouldn't tend to be any better than loading it into specialized memory pools unless you get really exotic and do your neural network in analog electronics.
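To put that concretely, here's a toy sketch (my own illustration, nothing from the article): the convolution arithmetic is identical no matter whose network you're running, which is why one accelerator design transfers across models and why the weights can just live in memory instead of being baked into the silicon.

    # Minimal sketch: a convolution is the same arithmetic regardless of which
    # network's weights you load, so one accelerator design serves many models.
    import numpy as np

    def conv2d(feature_map, kernel):
        """Naive valid-mode 2D convolution; the weights are just data, not hardware."""
        h, w = feature_map.shape
        kh, kw = kernel.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(feature_map[i:i+kh, j:j+kw] * kernel)
        return out

    # Swapping models just means swapping the kernel array held in memory.
    edge_kernel = np.array([[1., 0., -1.]] * 3)
    blur_kernel = np.ones((3, 3)) / 9.0
    image = np.random.rand(32, 32)
    print(conv2d(image, edge_kernel).shape, conv2d(image, blur_kernel).shape)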
I think there is a big enough space between GPU and ASIC technology for FPGAs. The main reason is the lifetime of the models. The shorter that lifespan, the more expensive it is to exchange the ASICs. At the very least you have to produce new ASICs every few months, and replace them in special sockets, or even reflow/solder them to new cards.
My assumption is that the ASIC is executing code that changes every month, but that it's using instructions and a memory hierarchy geared towards convolutional neural networks. If that stops being true then of course you'd need a different ASIC, but then again if that stops being true there's no guarantee that a GPU or ASIC will do any better than a CPU. You could end up with something like alpha-beta pruning, where parallelism doesn't make much difference. A reasonable chip won't be able to contain enough transistors to have separate execution resources for each layer. It's going to have to work by loading a layer, convolving it, loading the next layer, convolving it, and so on. So you'll be able to change your network without changing the ASIC you're running it on while still taking advantage of your dedicated ganged operations. The FPGA version can be optimized for the exact sizes of the network layers in a way the more flexible ASIC version can't, but I expect that benefit to be much smaller than the gain from moving to an ASIC in the first place.
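To make that loop concrete, here's a toy sketch (illustrative names, not any real accelerator's API) of the load-a-layer-then-run-it pattern, with weights streamed in from memory rather than baked into the chip:

    # Rough sketch of the "load a layer, run it, load the next" loop described
    # above; layer weights live in memory, so the same chip can run a retrained
    # or resized network without being re-fabbed.
    import numpy as np

    def run_network(input_activations, layer_weights, layer_fns):
        x = input_activations
        for weights, fn in zip(layer_weights, layer_fns):
            # On an accelerator this is the DMA step: stream this layer's
            # weights into on-chip buffers before launching the ganged compute.
            x = fn(x, weights)
        return x

    relu = lambda x: np.maximum(x, 0.0)
    dense = lambda x, w: relu(x @ w)

    weights = [np.random.randn(128, 64), np.random.randn(64, 10)]
    out = run_network(np.random.randn(1, 128), weights, [dense, dense])
    print(out.shape)  # (1, 10)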
From the papers I do believe they are hardcoding the layer weights in the hardware definition of the FPGA. These FPGAs also have no significant RAM on chip, but the Intel FPGAs they use do seem to have an even larger number of LUTs than the usual embedded FPGAs, and even dedicated floating-point units.
At the very least they talk about omitting weights which are 0 in the synthesis.
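A hedged guess at what that looks like before synthesis (my own sketch, not from the papers): keep only the non-zero (index, value) pairs so the generated design never instantiates a multiply-by-zero.

    # Toy pruning/packing step: drop zero weights and count how many
    # multipliers (or sparse multiply-accumulates) you actually need.
    import numpy as np

    def prune_and_pack(weights, threshold=0.0):
        mask = np.abs(weights) > threshold
        indices = np.argwhere(mask)
        values = weights[mask]
        return indices, values, mask.mean()  # mean = fraction of weights kept

    w = np.random.randn(256, 256)
    w[np.abs(w) < 1.0] = 0.0                 # crude pruning for the example
    idx, vals, density = prune_and_pack(w)
    print(f"kept {density:.1%} of weights -> {len(vals)} multiplies instead of {w.size}")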
TrueNorth was built to run spiking neural networks, which have little to do with deep learning (even though they managed to get it to run a small convolutional NN), and Nervana has never actually built any hardware.
Yes, there are at least a dozen companies with specialized hardware accelerators in some stage of development. For smaller parts, some of the existing DSP companies like CEVA and Cadence Tensilica are also adapting their architectures for deep neural net workloads.
Yet it's still not clear whether building a custom chip makes sense, because the next Nvidia chip might make it obsolete. Or the one after that (still too soon to tell).
I'm all for open toolchains, but I don't think this is the primary reason for FPGAs' lack of broader appeal. I think the lack of open toolchains and the lack of broad appeal both stem from FPGA vendors' almost exclusive focus on high-margin, high-end applications. There doesn't seem to be much push to focus on larger markets with lower margins.
That said, Lattice has recently started to push into these new areas, but they haven't been that successful. If they start to see more success, I think we will see open toolchains. Lattice also has the advantage of being able to lean on the open toolchain work done by people like Clifford Wolf [1].
You are wrong about the focus being only on high-end FPGAs. There are plenty of FPGAs for low-level stuff. For example, one project I was involved in used an FPGA for debouncing, complete with the programming interface for an ADSP-21xx. They replaced a dozen different chips with one FPGA.
>>> Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?
Anyone who has worked with FPGAs knows that they are completely different to program than a CPU/GPU.
They are not competing at all. You can't take some computer developers and have them work with FPGAs. That'd be like taking a dude who knows XML and putting him on optimizing low-level C++ algorithms.
For starters, an FPGA doesn't run programs; it describes hardware components.
True, but if the benefit is there, the software folks will find the right people to assist.
Bitcoin is a good example. I'm reasonably sure that FPGAs eclipsed GPUs there because some "software person" called the right person in. And it certainly was a competition. Similar for the eventual move to ASICs.
If there's enough incremental benefit, it will happen.
There is the evolutionary approach taken by Thompson. [1] There were some spectacular results, though each circuit only worked on the exact FPGA it was trained on, so that particular approach has no practical applications.
I don't actually believe this is true. I find that paper fascinating, and I believe today you could just simulate the hardware definition, rather than work straight on the bitstream and evaluate it in hardware.
I also believe that today's FPGAs are more robust against these kinds of bugs, because even with ordinary HDL code, those defects or cross-talk incidents could result in difficult-to-debug errors.
They are in competition, because increasingly the deep learning models are run by people who are not experts in machine learning: in deployment, in self-driving cars, and in other machinery.
And if it is just about bang-for-buck and scale, FPGAs seem to be quite competitive.
The big win for them is a 10x reduction in power usage, since in a datacenter/cloud environment this is more important. Still at the research stage though.
Everything in that article keeps assuming lower precision, which is often okay, but then they kept testing the Titan X with 32-bit-wide floats. Doesn't the newer Nvidia stuff, like the Titan X, support 16-bit-wide floats, and doesn't that run at almost double the speed?
They (including the 1080 Ti, which is basically a Titan) do support 4x faster INT8, though, so if comparing to a reduced-precision ternary net running on an FPGA, that seems relevant. (They mention using INT8 in some of the GPU benchmarks but I'm not sure which graphs are supposed to represent that.)
GP100 supports FP16 FMAD
GP102 supports INT16 and INT8 MAD with 32-bit accumulation
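For anyone who hasn't played with it, here's a rough numpy sketch (my own, not vendor code) of what that INT8 multiply-accumulate-into-32-bit pattern buys you: quantize both operands, accumulate the dot product in int32, and rescale at the end.

    # Illustrative DP4A-style pattern: int8 operands, int32 accumulator.
    import numpy as np

    def quantize_int8(x):
        scale = np.abs(x).max() / 127.0
        return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

    a, b = np.random.randn(1024), np.random.randn(1024)
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)

    acc = np.dot(qa.astype(np.int32), qb.astype(np.int32))  # 32-bit accumulation
    approx = acc * sa * sb
    print(approx, np.dot(a, b))  # close, at a fraction of the memory traffic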
Overall, not impressed with Stratix 10. It won't be cost effective, it's not much more power-efficient, and Volta will likely leapfrog it across the board within a year.
Wasn't this thing supposed to sample in late 2014? Back then it would have been a gamechanger at any price. Now, 1080Ti for $700 beats it across the board in throughput/$. NVIDIA's confusing messaging about using consumer versus professional HW is about the only thing that might make it viable for deep learning. Although I note the absence of training perf numbers here, just (apparently) inference.
"GP100 however uses more flexible FP32 cores that are able to process one single-precision or two half-precision numbers in a two-element vector. Nvidia intends to address the calculation of algorithms related to deep learning with those." So 16-bit ops are more than twice as fast as 32-bit ops, if you pack them into the 32-bit cores two at a time and also use the dedicated 16-bit core.
>>> How many Titan Xs can you purchase for the price of an Intel Stratix 10, including the kit required to start development?
Actually, they are similar in price.
Medium to high-end consumer FPGAs and GPUs go up to around $1k.
Then there are the enterprisey GPUs (Quadro and FireThing, as I recall) going for a few k.
It's similar for FPGAs. The very high end (Stratix/Virtex) will cost a few k as well, peaking at 5k or 10k for the top models (bare FPGA chip only).
I recall negotiating some FPGA dev kits in the 10-15k€ range; that seems to be the top end. If I remember well, there was an option to get 4 Virtex chips on the same dev board for 30k or 50k. That's as high as it gets.
My memory ain't perfect, but that's roughly the range.
I have a feeling that much of the cost is from the smaller market. It takes a massive amount of engineering resources to put out a new chip, and you need a lot of sales to compensate. If FPGAs become commonplace in servers, then we will see prices decrease as it becomes easier to recoup the overhead.
"Another emerging trend introduces sparsity (the presence of zeros) in DNN neurons and weights by techniques such as pruning, ReLU, and ternarization, which can lead to DNNs with ~50% to ~90% zeros. " Does anyone have a reference for this ?
Just randomly googling "ternarization deep learning" led me to this article from ICLR 2017 https://openreview.net/pdf?id=r1fYuytex which in turn seems to reference further work in the area.
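As a quick toy sanity check (my own numbers, not from that paper), you can see where the zeros come from: ReLU kills roughly half of random pre-activations, and ternarizing to {-1, 0, +1} with a dead zone adds more.

    import numpy as np

    x = np.random.randn(1_000_000)

    # ReLU: everything negative becomes exactly 0.
    relu_sparsity = np.mean(np.maximum(x, 0.0) == 0.0)

    # Ternarization with a dead zone around 0 (threshold heuristic assumed here).
    threshold = 0.7 * np.mean(np.abs(x))
    ternary = np.sign(x) * (np.abs(x) > threshold)
    ternary_sparsity = np.mean(ternary == 0.0)

    print(f"ReLU zeros: {relu_sparsity:.0%}, ternary zeros: {ternary_sparsity:.0%}")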
It's not like most people in the field are programming GPUs.
If an FPGA vendor made an FPGA solution (both the hardware and the software libraries to integrate with one or two machine learning frameworks) that did basic matrix/tensor calculations faster/cheaper than GPUs, then they'd be able to take a lot of market off Nvidia. Users wouldn't need to program the FPGA directly if they could work at the level of matrix operations.
People are buying large quantities of very expensive GPUs just to do some matrix manipulations, so why not FPGAs?
But the point is that you don't have that much FPGA-specific code. Once someone does the matrix manipulations and the proper integration, everyone else can just run e.g. TensorFlow code on it faster and/or cheaper without specific expertise; if an FPGA vendor can make this one-time investment in software tools, then they can compete for a slice of the large pie of hardware revenue that Nvidia now has for itself.
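To be concrete about what "working at the level of matrix operations" would mean for the user, here's a toy Python sketch (every name here is hypothetical, not a real vendor API): the backend is the only thing that changes; the model code doesn't.

    import numpy as np

    class NumpyBackend:                 # stand-in for the "GPU" path today
        matmul = staticmethod(np.matmul)
        relu = staticmethod(lambda x: np.maximum(x, 0.0))

    class FpgaBackend(NumpyBackend):
        # Inherits the numpy ops here; a vendor would override these with
        # calls into their FPGA runtime instead.
        pass

    def dense_layer(backend, x, w):
        return backend.relu(backend.matmul(x, w))

    x, w = np.random.randn(8, 128), np.random.randn(128, 10)
    for backend in (NumpyBackend, FpgaBackend):
        print(backend.__name__, dense_layer(backend, x, w).shape)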
But, assuming that "matrix manipulation" code has existed for FPGAs for a while, since Verilog and VHDL are quite old, the question remains: why hasn't (a) an FPGA vendor already done this and started actively selling a TensorFlow solution, or (b) Nvidia pursued it? I have a feeling there are more factors at work than just whether or not it's possible.
This news article is fairly useless. FPGAs have always been an option; that's why they are used in many DSP heavy workloads in communications etc. However, they are expensive. GPUs are popular and cost-efficient because millions of video gamers purchase them, driving prices down. The amount of 32-bit floating point compute on a modern video card is absurd for the price.