Software. Software. Software. Just two companies, Google and NVIDIA, have publicly launched a viable service or software stack. Just two companies have successfully written a "sufficiently advanced compiler". Just two companies actually have a product. And Google refuses to step into the arena and actually compete with NVIDIA. Man, what a time we live in.
Applied Brain Research has software called Nengo (www.nengo.ai) explicitly for developing neural network models and compiling them to different backends, including CPUs, GPUs, and neuromorphic hardware (Intel's Loihi, Spinnaker, Spinnaker 2, BrainDrop). It's been battle-tested over more than 10 years of model development, was used to build the world's largest functional brain model (https://bit.ly/2VNGgSX), and integrates deep learning with spiking neural networks. Would be interested to hear your thoughts on it.
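For anyone who hasn't used it, a minimal Nengo model looks roughly like this (just a sketch assuming the standard open-source nengo package; swapping the Simulator class for a backend-specific one such as nengo_dl or nengo_loihi is how you target different hardware):

    # Minimal Nengo sketch: a sine input driving an ensemble of spiking neurons.
    import numpy as np
    import nengo

    with nengo.Network() as model:
        stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))   # time-varying input
        ens = nengo.Ensemble(n_neurons=100, dimensions=1)    # 100 spiking LIF neurons
        nengo.Connection(stim, ens)
        probe = nengo.Probe(ens, synapse=0.01)               # record the decoded value

    # The reference CPU simulator; backend packages provide their own Simulator class.
    with nengo.Simulator(model) as sim:
        sim.run(1.0)

    print(sim.data[probe].shape)  # (timesteps, 1)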
I wonder if WebGPU will reduce dependence on CUDA, especially as TensorFlow is being ported to WebGPU. With WebGPU's improved performance and utility, and the fact that it runs on top of Vulkan, Metal, and D3D with any GPU that has drivers for those, I wonder if DL folks will find it more tempting to use TFJS/WebGPU via Electron or the browser and just be done with CUDA (i.e. break or soften NVIDIA's monopoly).
Well, the memory limit of a WebGPU process would be one limiting factor for training. In addition, if training in a data-parallel fashion, the bandwidth between the nodes and the parameter server is another bottleneck.
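To put a rough number on the bandwidth point: in data-parallel training every synchronization moves the full gradient, so even a modest model stresses a slow link. A back-of-the-envelope sketch (all figures below are illustrative assumptions, not measurements):

    # Rough cost of one gradient exchange with a parameter server.
    num_params = 25_000_000      # assume a ResNet-50-sized model
    bytes_per_param = 4          # FP32 gradients
    grad_bits = num_params * bytes_per_param * 8

    for gbit_per_s in (1, 10, 100):  # hypothetical link speeds
        seconds = grad_bits / (gbit_per_s * 1e9)
        print(f"{gbit_per_s:>3} Gbit/s link: ~{seconds * 1000:.0f} ms per gradient exchange")

At 1 Gbit/s that's nearly a second per step just for communication, which is why the interconnect matters as much as the compute.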
Some of the numbers in that table do not make any sense, which makes me question the quality of the entire article.
Where are the numbers for the Cerebras chip coming from?
- How do you have a TDP of 180W for an entire wafer of chips?
- Why is there a peak FP32 number when they are clearly working with FP16?
Each of these chips is a completely different architecture and it makes no sense to compare them at this level. The only meaningful comparison is actual performance in applications because that reflects how the entire system will be used.
In the table, the figures are for a single die in the wafer. This is to make a meaningful comparison with the other chips listed (there is a table footnote for this). The 15 kW is the power consumption of the whole wafer (a detail I think was mentioned in the Hot Chips presentation). Why are they clearly working with FP16? Are there any public details on this?
One of the numbers about the Cerebras chip that jumped out at me as very unusual was this one: "Speculated clock speed of ~1 GHz and 15 kW power consumption."
"Ascend 910 is used for AI model training. In a typical training session based on ResNet-50, the combination of Ascend 910 and MindSpore is about two times faster at training AI models than other mainstream training cards using TensorFlow."
I think it would be great to clarify that what is commonly referred to as "TPU v2" (e.g. in GCP pricing, and what is shown in the image in this article) consists of 4 such modules with 8 cores total, which gives the more commonly quoted value of 180 TFLOPS.
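To spell out the arithmetic (assuming the publicly quoted 180 TFLOPS board total is spread evenly across the four chips):

    # TPU v2 naming sanity check: 180 TFLOPS usually refers to the whole
    # 4-chip / 8-core "Cloud TPU v2" board, not to a single chip or core.
    tflops_per_board = 180
    chips_per_board = 4
    cores_per_chip = 2

    print(tflops_per_board / chips_per_board)   # 45.0 TFLOPS per chip
    print(chips_per_board * cores_per_chip)     # 8 cores per board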
Thanks. I've updated the article with clarifications and correct TPU numbering. The text already mentions TPU v1 is inference only, and I think it's useful to include as context.
Are deep neural networks really so widely applicable that it's profitable to design custom chips for them? What about other models of AI that involve, say, discrete math or graph search?
Yes. They are far beyond any other AI technique in speech recognition, speech synthesis, translation, OCR, object recognition, playing Go, and many other diverse tasks. And their performance continues to increase with added computing power, with no limit that we've seen yet, so custom hardware improves results.
I don’t know whether it’ll be profitable, but MATMUL, for example, is useful for a variety of programs beyond propagation. My guess is most of this stuff will be packaged (e.g. Apple’s “neural engine” on their A-series SoCs).
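As a toy illustration of matmul showing up outside neural nets (and touching on the graph-search question upthread), powers of an adjacency matrix count paths in a graph; this is a hypothetical example I'm adding, not something from the article:

    import numpy as np

    # Adjacency matrix of a small directed graph (a 3-node cycle).
    A = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])

    # Entry (i, j) of A @ A counts the length-2 paths from i to j, so the
    # same matmul primitive is useful for graph problems as well.
    print(A @ A)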
Consumer Turing cards are about the closest you get right now. They're pretty reasonable bang for buck for training. They have tensor cores - not quite as many as Volta, but the entire chip runs at a higher clock rate and the price/performance is better if you don't mind losing a gig or so of RAM and some memory bandwidth.
Kendryte K210 maybe? It's cheap as chips (pun intended!); I think I got mine for £40 including shipping. https://kendryte.com/ Note this is only for inference. For training you'll have to use a GPGPU or one of the chips in this article.
Stuff they acquired. This originates from Nervana Systems, and I think there are also some Altera chips out there. Intel's custom foundry offering has historically been poor, so chances are anyone they buy will have been using someone else (why take the risk and change that?).
'''CONCLUSION
Graphics has just been reinvented. The new NVIDIA Turing GPU architecture is the most advanced and efficient GPU architecture ever built. Turing implements a new Hybrid Rendering model that combines real-time ray tracing, rasterization, AI, and simulation. Teamed with the next generation graphics APIs, Turing enables massive performance gains and incredibly realistic graphics for PC games and professional applications.'''
AWS GPU compute is extremely expensive. If this is due to datacenter licensing costs, I hope they come out with their own hardware soon to reduce these costs. If, on the other hand, it's because their value-add is not in renting out the hardware but in burst scalability, then I'm less optimistic that they'd cannibalize their own cloud product.
Currently, it only takes about a month to break even if you buy a consumer GPU like the RTX 2080 Ti, compared with AWS time. For training purposes, AWS doesn't seem to make sense.
- Just looked up the numbers, and Google TPUs are pretty similar in terms of pricing. I think any AWS equivalent would probably be just as expensive compared to a DIY PC.
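For a rough sense of the break-even math (the prices below are my own ballpark assumptions for an RTX 2080 Ti and a single-GPU on-demand instance, not figures from the article):

    # Back-of-the-envelope break-even: buy a consumer GPU vs. rent in the cloud.
    gpu_price_usd = 1200.0      # assumed street price of an RTX 2080 Ti
    cloud_usd_per_hour = 3.0    # assumed on-demand rate for a single-GPU instance

    breakeven_hours = gpu_price_usd / cloud_usd_per_hour
    print(f"~{breakeven_hours:.0f} GPU-hours, "
          f"i.e. ~{breakeven_hours / 24:.0f} days of continuous training")

Depending on utilization and the exact instance you compare against, that lands anywhere from a couple of weeks to a couple of months, which is roughly consistent with the break-even figure above.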
Exactly. AWS is simply too expensive. You can buy a Lambda Quad GPU workstation and it pays for itself in a couple of months. If you want to save more you can just build it yourself.
> I am surprised Amazon has not jumped in the game
Why should they? There is not a lot of money to be gained from renting a niche product in comparison to the enormous capital expenditure for anything hardware-related.
Lots of dotcom companies burned themselves badly while chasing trendsetters with custom silicon. A cookie-cutter 40nm SoC may cost "just" 10M today, but by getting into the custom silicon game you risk losing at it. Not to mention that your operations troubles will increase n-fold.
Managing operations of a hosting business with hundreds of thousands of customers is hard enough: logistics, server lifecycle, DC management, managing procurement contracts with unruly OEMs... Now add all the trouble you'd have with chipmakers on top of that. It becomes a nightmare.
Edit: I didn't see that the parent meant a hardware accelerator created by Amazon itself. Thx to @jsty and @paol for pointing this out. An ASIC by Amazon was announced last year and is known as 'AWS Inferentia'.
I think they were referring to the fact that those are all bought in from outside vendors, rather than Amazon producing their own custom accelerator chips like e.g. Google's TPUs.
And no, AMD doesn't count. ROCm is a mess.