* They actually started deploying them in 2015, so they're probably already hard at work on a new version!
* The TPU only operates on 8-bit integers (and 16-bit at half speed), whereas CPUs/GPUs typically work in 32-bit floating point (see the quantization sketch after this list). They point out in the discussion section that they did have an 8-bit CPU version of one of the benchmarks, and the TPU was ~3.5x faster.
* Used via TensorFlow.
* They don't really break out hardware vs. hardware for each model; it seems like the TPU suffers whenever there's a really large number of weights and layers to handle. But since per-model performance isn't reported individually, it's hard to see whether the TPU offers an advantage over the GPU for arbitrary networks.
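For anyone unsure what "operating on 8-bit integers" means in practice, here is a minimal sketch of symmetric linear quantization of float32 weights to int8. This is my own illustration; the paper does not spell out Google's exact quantization scheme.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of a float32 array to int8 (illustrative only)."""
    scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)         # toy weight matrix
q, scale = quantize_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, scale) - w).max())
```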
It's something that keeps getting rediscovered. I know the embedded industry shoehorns all kinds of problems into 8- and 16-bitters; some even use 4-bit MCUs. It might be worthwhile for someone to survey all the things you can handle easily, or without too much work, on 8-16-bit cores. This might help people building systems out of existing parts or people trying to design heterogeneous SoCs.
I do like that they highlighted the importance of low-latency output, though... that's even more critical for future non-"Web" applications that have to run in real time.
3.5x faster than CPU doesn't sound special, but when you're building inference capacity by the megawatt, you get a lot more of that 3.5x faster TPU inside that hard power constraint.
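A rough back-of-the-envelope on why that matters at a fixed power budget. The TDP figures below are illustrative assumptions, not exact specs from the paper, and the 3.5x is the per-chip speedup mentioned above.

```python
# Fixed power budget: how much inference capacity fits in 1 MW?
budget_w = 1_000_000                       # 1 MW of inference capacity
tpu_tdp, cpu_tdp = 75, 145                 # assumed watts per TPU / per CPU socket
tpu_speedup = 3.5                          # ~3.5x vs the 8-bit CPU baseline, per chip

tpus = budget_w // tpu_tdp
cpus = budget_w // cpu_tdp
relative_throughput = (tpus * tpu_speedup) / cpus
print(f"{tpus} TPUs vs {cpus} CPU sockets -> ~{relative_throughput:.1f}x throughput per MW")
```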
Here is another paper demonstrating very good results with just 6 bit gradients: https://arxiv.org/abs/1606.06160
I wanted to have some basic idea about hardware so I did some "research" (googling) and ended up giving a short informal talk. My slides with some links are here:
(Used in TensorFlow)
Intel's latest chips will likely be even further behind the next-generation TPU than Haswell was behind TPU 1.0.
So they are telling us about inference hardware. I'm much more curious about training hardware.
Let's say you want to use a genetic algorithm to find a good set of weights: you generate, mutate, combine, and select many random networks, and repeat this process many times. How many networks and how many times? That depends on the length of your chromosome and the complexity of the task. Networks that work well for image classification need at least a million weights, and the entire set of weights is a single chromosome.
You realize now how computationally intractable this task is on modern hardware?
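To make that setup concrete, here is a bare-bones sketch of the GA-over-weight-vectors idea. Toy sizes and a placeholder fitness function; in reality each chromosome is ~1,000,000 weights and every fitness evaluation is a full forward pass over a dataset, which is where the cost explodes.

```python
import numpy as np

N_WEIGHTS, POP_SIZE, GENERATIONS = 10_000, 100, 50   # toy sizes; real nets need ~1e6 weights

def fitness(weights):
    return -np.sum(weights ** 2)                     # placeholder for "evaluate the network"

pop = [np.random.randn(N_WEIGHTS).astype(np.float32) for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    scores = [fitness(ind) for ind in pop]
    parents = [pop[i] for i in np.argsort(scores)[-POP_SIZE // 2:]]       # select
    children = []
    while len(parents) + len(children) < POP_SIZE:
        a, b = np.random.choice(len(parents), 2, replace=False)
        mask = np.random.rand(N_WEIGHTS) < 0.5                            # combine
        child = np.where(mask, parents[a], parents[b])
        child = child + 0.01 * np.random.randn(N_WEIGHTS)                 # mutate
        children.append(child.astype(np.float32))
    pop = parents + children
```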
You've created your own straw man here.
> "You realize now how computationally intractable this task is on modern hardware?"
Here are the people who show it isn't computationally intractable: https://blog.openai.com/evolution-strategies/ - but to say they've discovered a new breakthrough method would be overselling the result.
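For reference, the core of the evolution-strategies update described in that post looks roughly like this. A minimal sketch on a toy objective, not OpenAI's actual code: sample Gaussian perturbations of the parameters, score each one, and move the parameters toward the better-scoring noise.

```python
import numpy as np

def es_step(theta, score_fn, pop_size=50, sigma=0.1, lr=0.01):
    noise = np.random.randn(pop_size, theta.size)                 # one perturbation per worker
    returns = np.array([score_fn(theta + sigma * eps) for eps in noise])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad_estimate = noise.T @ advantages / (pop_size * sigma)     # weighted sum of the noise
    return theta + lr * grad_estimate

# toy objective: maximize -||theta - target||^2
target = np.ones(100)
theta = np.zeros(100)
for _ in range(200):
    theta = es_step(theta, lambda t: -np.sum((t - target) ** 2))
```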
Perhaps the reason is simply that they don't have them in their servers, but we'll see if Jeff Dean replies on G+.
TPU excited me too at first, but when I realized that it is not related to training new networks (research) and is useful only for large scale deployment, I toned down my enthusiasm a little.
When I google around a bit, I see several results talking about the software licensing cost model for the M-series GPUs.
Part of the fault lay with GDDR5's limitations, which required trickery to make the Kepler series work.
Pascal is coming with ECC because HBM2 comes with ECC built-in.
Most of us are probably better off building a few workstations at home with high-end cards. The hardware will be more efficient for the money. But if you're considering hiring someone to manage all your machines, power-efficiency and stability become more important than the performance/upfront $ ratio.
There are also FPGAs, but they tend to be much lower quality than the chips Intel or Nvidia put out, so unless you know why you'd want them, you don't need them.
(I work on GCP)
Then consider the possible applications of that at Google scale -- there are "an awful lot" of images on the web: over 13PB of photos in Google Photos last year, a gajiggle of photos in Street View and Google Maps, an elephant's worth in Google Plus, and probably a few trillion I'm not even thinking of. :)
The same applies, of course, to Translate and to RankBrain, also mentioned as NNs running on the TPU: 100B words per day translated, and many, many, many Google searches per day, even if RankBrain primarily targets the 15% of never-before-seen queries.
Add that to the fact that GPUs are poorly suited to realtime inference because of their large batch-size requirements, and it's a solid first target.
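Rough arithmetic on why batch-filling hurts realtime latency. The arrival rate and per-batch compute times below are made up for illustration: the first request in a batch of B waits roughly (B - 1) / r seconds just for the batch to fill, before any compute happens.

```python
arrival_rate = 100                        # hypothetical requests/sec hitting one server
compute_ms = {1: 5, 8: 9, 64: 30}         # hypothetical per-batch compute times

for batch, ms in compute_ms.items():
    fill_wait_ms = (batch - 1) / arrival_rate * 1000
    print(f"batch={batch:3d}  fill wait ~{fill_wait_ms:5.0f} ms  compute ~{ms} ms")
```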
(work at Google Brain on Mondays, but speakin' for myself here.)
> The TPU server has 17 to 34 times better total-performance/Watt than Haswell, which makes the TPU server 14 to 16 times the performance/Watt of the K80 server. The relative incremental-performance/Watt—which was our company’s justification for a custom ASIC—is 41 to 83 for the TPU, which lifts the TPU to 25 to 29 times the performance/Watt of the GPU.
I'd much rather see a general purpose CPU that uses something like an array of many hundreds or thousands of fixed-point ALUs with local high speed ram for each core on-chip. Then program it in a parallel/matrix language like Octave or as a hybrid with the actor model from Erlang/Go. Basically give the developer full control over instructions and let the compiler and hardware perform those operations on many pieces of data at once. Like SIMD or VLIW without the pedantry and limitations of those instruction sets. If the developer wants to have a thousand realtime linuxes running Python, then the hardware will only stand in the way if it can’t do that, and we’ll be left relying on academics to advance the state of the art. We shouldn’t exclude the many millions of developers who are interested in this stuff by forcing them to use notation that doesn’t build on their existing contextual experience.
I think an environment where the developer doesn’t have to worry about counting cores or optimizing interconnect/state transfer, and can run arbitrary programs, is the only way that we’ll move forward. Nothing should stop us from devoting half the chip to gradient descent and the other half to genetic algorithms, or simply experiment with agents running as adversarial networks or cooperating in ant colony optimization. We should be able to start up and tear down algorithms borrowed from others to solve any problem at hand.
But not being able to have that freedom - in effect being stuck with the DSP approach taken by GPUs, is going to send us down yet another road to specialization and proprietary solutions that result in vendor lock-in. I’ve said this many times before and I’ll continue to say it as long as we aren’t seeing real general-purpose computing improving.
Convolutional networks easily get up there, especially if you add a third dimension that the network can travel across (either space, in 3D convnets for medical scans, or time, for videos in some experimental architectures). Say you want to look at a heart with a 3D convnet: that could easily be 512x512x512 for the input alone.
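Just the arithmetic for that example (my numbers, assuming a single-channel input volume):

```python
# A 512^3 volume is already large before any feature maps are allocated.
voxels = 512 ** 3
print(f"{voxels:,} voxels -> {voxels * 4 / 2**20:.0f} MiB as float32, "
      f"{voxels / 2**20:.0f} MiB as int8 (per channel, per example)")
```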
In fully connected models, for training efficiency, many features are implemented as one-hot encoded parameters, which turns a single category like "state" into 50 parameters. I think there is some active research into sparse representations of this with the same efficiency, but I've never seen those techniques used, just people piling on more parameters.
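A tiny illustration of the one-hot point, using a hypothetical US "state" feature: one categorical input becomes 50 input columns, so every unit in the first fully connected layer picks up 50 weights for what is conceptually a single feature.

```python
import numpy as np

states = ["AL", "AK", "AZ"]                      # ... 50 in total
state_index = {s: i for i, s in enumerate(states)}

def one_hot(state, n_states=50):
    v = np.zeros(n_states, dtype=np.float32)
    v[state_index[state]] = 1.0
    return v

x = one_hot("AZ")
print(x.shape)   # (50,) -- 50 weights per downstream unit for one category
```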
A further point is that even if the model has relatively few parameters, there are advantages to having more memory: namely, you can do inference on larger batch sizes in one go.
- No control flow instructions (though apparently some operations can have a repeat count)
- Fundamentally simple architecture
This allows them to get through validation and tapeout very quickly.
The SOTA networks are around 300MB+...
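Rough arithmetic relating that figure to parameter counts (assuming float32 weights; the int8 figure is just the same parameters at one byte each):

```python
model_bytes = 300 * 2**20                         # ~300 MB model
params = model_bytes // 4                         # float32 -> ~78M parameters
print("float32 params:", params)
print("same model as int8 (MiB):", params / 2**20)
```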
Since it appears you're in the deep learning hardware business, what would be the impediment to using eDRAM or similar? eDRAM is too costly at those sizes for general purpose processors, but I imagine the reduced latency and increased bandwidth would be a huge win for a ridiculously parallel deep learning processor, and would definitely be a tradeoff worth making.
Okay, so about eDRAM. There are two types: on-die and on-package. On-die eDRAM refers to manufacturing DRAM cells on the logic die, which would be a big boon in terms of density since eDRAM cells can be almost 3x as dense as SRAM. The problem, however, is that on-die eDRAM has proven impossible to scale past 40nm, which erases much of the advantage you would get from using it.
On-package eDRAM is more interesting, but the primary cost in memory access is the physical transportation of the data, which is a physical limit and can't be circumvented. You can call it all sorts of fancy names such as "eDRAM", but the fact of the matter is that you're still moving data. For reference, the projected cost of moving a 64-bit word at 10nm (ON CHIP), according to Lawrence Livermore National Laboratory, is ~1pJ, and a 64-bit FLOP is also estimated at ~1pJ. In other words, just moving the data on-chip already costs as much as the arithmetic itself, and moving it off-chip costs substantially more.
You do gain a lot compared to off-package DRAM, of course, but HBM can offer the same efficiency gains.
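Back-of-the-envelope using the ~1pJ figures above (the model size is an assumption, and each weight is treated as one 64-bit word for simplicity): without on-chip reuse, every weight fetched costs about as much energy as the multiply-accumulate it feeds.

```python
pj_per_64bit_move = 1.0        # on-chip movement, per the LLNL projection cited above
pj_per_64bit_flop = 1.0

params = 25_000_000            # assumed mid-sized fully connected model
move_uj = params * pj_per_64bit_move / 1e6
flop_uj = params * pj_per_64bit_flop / 1e6   # one MAC per weight per inference
print(f"weight movement ~{move_uj:.0f} uJ vs arithmetic ~{flop_uj:.0f} uJ per inference")
```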
Didn't mean to be rude with the first response.
Let me know if you have any other questions, I'd be happy to answer them :)
As an example, the ALVINN self-driving vehicle used several such arrays for its on-board processing.
I'm not absolutely certain that this is the same, but it has the "smell" of it.