New chips for machine intelligence (jameswhanlon.com)
172 points by jwhanlon on Oct 7, 2019 | 51 comments

Software. Software. Software. Just two companies, Google and NVIDIA, have publicly launched a viable service or software stack. Just two companies have successfully written a "sufficiently advanced compiler". Just two companies actually have a product. And Google refuses to step into the arena and actually compete with NVIDIA. Man, what a time we live in.

And no, AMD doesn't count. ROCm is a mess.

Disclaimer: am a co-founder of this company

Applied Brain Research has software called Nengo (www.nengo.ai) explicitly for developing neural network models and compiling them to different backends, including CPUs, GPUs, and neuromorphic hardware (Intel's Loihi, Spinnaker, Spinnaker 2, BrainDrop). It has been battle-tested over more than 10 years of model development, was used to build the world's largest functional brain model (https://bit.ly/2VNGgSX), and integrates deep learning with spiking neural networks. Would be interested to hear your thoughts on it.

I wonder if WebGPU will reduce dependence on CUDA, especially as TensorFlow is being ported to WebGPU. With WebGPU's improved performance and utility, and the fact that it runs on top of Vulkan, Metal and D3D with any GPU that has drivers for those, I wonder if DL folks will find it more tempting to use TFJS/WebGPU via Electron or the browser and just be done with CUDA (i.e. break or soften NVIDIA's monopoly).

I've never heard anyone suggest that WebGPU would be appropriate for ML training workloads. Maybe inference, but not training.

Well, the memory limit of a WebGPU process would be the limiting factor for training. In addition, the bandwidth between the nodes and the parameter server, if doing training in data-parallel fashion, is another limiting factor.
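To put a rough number on the bandwidth point, here is a back-of-the-envelope sketch in Python. The parameter count, value size, and link speed are illustrative assumptions, not figures from this thread, and the model assumes each worker pushes a full gradient and pulls full parameters once per step, with no compression and no overlap with compute.

```python
def sync_seconds_per_step(num_params, bytes_per_value, link_bytes_per_sec):
    """Estimate time one worker spends talking to a parameter server per step.

    Assumes a full gradient push plus a full parameter pull, no compression,
    and no overlap with compute.
    """
    payload_bytes = 2 * num_params * bytes_per_value  # push gradients + pull weights
    return payload_bytes / link_bytes_per_sec


# Hypothetical inputs: a 25-million-parameter model, 4-byte (FP32) values,
# and a 1 Gbit/s link (125e6 bytes/s).
seconds = sync_seconds_per_step(25_000_000, 4, 125e6)
print(f"{seconds:.1f} s of communication per step")
```

Under these assumed numbers the exchange alone takes longer than a typical training step on an accelerator, which is the sense in which the link becomes the bottleneck.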

There is one more company with a viable compiler: https://ai.facebook.com/tools/glow/

I don't have first hand experience using it, but from people I know, it does work.

Is there a good summary of the state of art in this field of software infra for ML?

Some of the numbers in that table do not make any sense and make me question the quality of the entire article.

Where are the numbers for the Cerebras chip coming from?:

- How do you have a TDP of 180W for an entire wafer of chips?

- Why is there a peak FP32 number when they are clearly working with FP16?

Each of these chips is a completely different architecture and it makes no sense to compare them at this level. The only meaningful comparison is actual performance in applications because that reflects how the entire system will be used.

In the table, the figures are for a single die in the wafer. This is to make a meaningful comparison with the other chips listed (there is a table footnote for this). The 15 kW is the power consumption of the whole wafer (a detail I think was mentioned in the Hot Chips presentation). Why are they clearly working with FP16? Are there any public details on this?
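As a rough consistency check on those two figures, dividing the whole-wafer power by the per-die TDP questioned earlier gives the number of dies they jointly imply. This sketch uses only the 15 kW and 180 W values quoted in this thread and assumes power is split evenly across identical dies, which is a simplification.

```python
wafer_power_w = 15_000  # whole-wafer figure quoted in this thread (15 kW)
die_tdp_w = 180         # per-die TDP from the article's table, as quoted in this thread

# Assumption: power is divided evenly across identical dies.
implied_dies = wafer_power_w / die_tdp_w
print(f"Implied die count: {implied_dies:.1f}")
```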

One of the numbers that jumped out at me as being very unusual about the Cerebras chip was this one: "Speculated clock speed of ~1 GHz and 15 kW power consumption."

15 kW power consumption for 1 chip?!?

Reportedly, it's an insanely large (whole wafer) single chip. But, then, why is it listed with a size that's not commensurate with that?

It's a very big chip. Comes with fancy custom water-cooling to handle the heat, they say.

I heard that from an engineer there. He was aghast too.

And this is ignoring all the bit players, some of which are doing crazy things like using analog multiplication.


That's just slideware at the moment.

Huawei looks like it has a strong game here:

"Ascend 910 is used for AI model training. In a typical training session based on ResNet-50, the combination of Ascend 910 and MindSpore is about two times faster at training AI models than other mainstream training cards using TensorFlow."


edit: The software framework "MindSpore will go open source in the first quarter of 2020."

When I first read about it I wondered if it would be a mobile chip, but apparently not (with a TDP of 300 W).

I wonder how brittle the performance will be vs other models such as transformers and DRL vs CNN and ResNet.

Small corrections:

> I’m focusing on chips designed for training

TPU 1 is designed for inference AFAIK.

> TPU v2: 45 TFLOPs

I think it would be great to clarify that what is commonly referred to as "TPU v2" (e.g. on GCP pricing, also what is shown in the image in this article), consists of 4 such modules with 8 cores total, which gives a more commonly quoted value of 180 TFLOPs.
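The arithmetic behind the two figures, as a small sketch. The 45 TFLOPs per module, the 4 modules, and the 8 cores all come from the comment above; the variable names are only for illustration.

```python
tflops_per_module = 45  # per-module figure quoted from the article
modules_per_board = 4   # modules in what is commonly called a "TPU v2"
cores_per_board = 8     # total cores across those modules

board_tflops = tflops_per_module * modules_per_board
tflops_per_core = board_tflops / cores_per_board
print(board_tflops, tflops_per_core)
```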

Thanks. I've updated the article with clarifications and correct TPU numbering. The text already mentions TPU v1 is inference only, and I think it's useful to include as context.

FYI: "DISCLAIMER: I work at Graphcore, and all of the information given here is lifted directly from the linked references below."

I'm excited to see how the Habana Labs Gaudi performs in the real world:


They claim to have some great features. Anyone know when, or if, a consumer version is coming, or whether any release dates have been promised?

Are deep neural networks really that widely applicable that it's profitable to design custom chips for them? What about other models of AI that involve, say, discrete math or graph search?

Yes. They are far beyond any other AI technique in speech recognition, speech synthesis, translation, OCR, object recognition, playing Go, and many other diverse tasks. And their performance continues to increase with added computing power with no limit that we've seen yet, so custom hardware improves results.

Alas, you do not usually train models from scratch. I think that transfer learning will dominate, and it does not need this power.

I don’t know whether it’ll be profitable, but MATMUL, for example, is useful for a variety of programs beyond propagation. My guess is most of this stuff will be packaged (e.g. Apple’s “neural engine” on their A-series SoCs).

Will there be a consumer-grade TPU in the near future?

Or would they be unable to match (NVIDIA) GPUs on price/performance?

Consumer Turing cards are about the closest you get right now. They're pretty reasonable bang for buck for training. They have tensor cores (not quite as many as Volta), but the entire chip runs at a higher clock rate and the price/performance is better if you don't mind losing a gig or so of RAM and some memory bandwidth.

Apple’s neural engine in their A-series of SoCs.

This article is about hardware for training neural networks, not the inference chips that are in most phones today.

Kendryte K210 maybe? It's cheap as chips (pun intended!) I think I got mine for £40 including shipping. https://kendryte.com/ Note this is only for inference. For training you'll have to use a GPGPU or one of the chips in this article.

Do the Google Coral products count as consumer grade?


No, those are designed for inference.

Would the nVidia Jetson count?

Jetson is for inference, not training.

ARM SoCs already have NPUs, but I guess they don't count.

This list is missing mobile phone chips that are specially designed for Deep Learning.

Existing mobile phone chips are designed for inference, not training. The list is explicitly restricted to chips that are designed for training.

There is a lot of training happening on the edge (mobile devices) as well; look up 'federated learning': https://medium.com/syncedreview/federated-learning-the-futur...
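For readers unfamiliar with the term, the core idea can be sketched as federated averaging: each device trains locally and sends only model weights to a server, which combines them weighted by how much data each device saw. The function name and toy numbers below are hypothetical, and real systems add client sampling, secure aggregation, and many rounds.

```python
def federated_average(client_weights, client_sizes):
    """Combine per-client weight vectors into one global vector.

    Each client's contribution is weighted by its number of local examples.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * size / total
    return global_weights


# Toy example: two devices with a 2-parameter model (made-up values).
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
print(federated_average(clients, sizes))
```

The device with more local data pulls the global weights closer to its own, and the raw training data never leaves either device.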

> Intel NNP-T TSMC 16FF+

Intel has stuff made by other foundries?

Stuff they acquired. This is something originating from Nervana Systems and I think there are also some Altera chips out there. Intel's custom foundry offering has historically been poor so chances are anyone they buy will have been using someone else (why take the risk and change that).

'''CONCLUSION Graphics has just been reinvented. The new NVIDIA Turing GPU architecture is the most advanced and efficient GPU architecture ever built. Turing implements a new Hybrid Rendering model that combines real-time ray tracing, rasterization, AI, and simulation. Teamed with the next generation graphics APIs, Turing enables massive performance gains and incredibly realistic graphics for PC games and professional applications.'''

Quoted from Nvidia Turing datasheet

I am surprised Amazon has not jumped in the game, renting out an accelerator like Google does with the TPUs

They do, although for inference (not training) so far. They have a custom chip called AWS Inferentia[0].

I believe it is available via Elastic Inference[1][2] (or maybe soon will be).

[0] https://aws.amazon.com/machine-learning/inferentia/

[1] https://docs.aws.amazon.com/elastic-inference/latest/develop...

[2] https://aws.amazon.com/machine-learning/elastic-inference/

AWS GPU compute is extremely expensive. If this is due to datacenter licensing costs, I hope they come out with their own hardware soon to reduce these costs. If, on the other hand, it's because their value-add is not in renting out the hardware but in burst scalability, then I'm less optimistic that they'd cannibalize their own cloud product.

Currently, it only takes about 1 month to break even if you buy a consumer GPU like the RTX 2080 Ti, compared with paying for AWS time. For training purposes, AWS doesn't seem to make sense.

- Just looked up the numbers, and Google TPUs are pretty similar in terms of pricing. I think any AWS equivalent would probably be just as expensive compared to a DIY PC.
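The break-even claim can be checked with a one-line calculation. The prices below are placeholders chosen for illustration, not quotes from AWS or any retailer, and the sketch ignores electricity, the rest of the PC, and resale value.

```python
def break_even_days(purchase_price, cloud_rate_per_hour, hours_per_day=24):
    """Days of cloud rental whose cost equals the one-off purchase price."""
    return purchase_price / (cloud_rate_per_hour * hours_per_day)


# Hypothetical numbers: a 1200 USD card vs. a 3 USD/hour cloud GPU instance
# running around the clock.
days = break_even_days(1200, 3.0)
print(f"Break-even after about {days:.0f} days")
```

Lower utilization stretches the result proportionally: at 12 hours per day instead of 24, the same assumed prices take twice as many days to break even.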

That's because NVIDIA's EULA requires you to buy an NVIDIA Tesla for datacentre deployments, and Teslas are marked up by thousands of dollars.

Exactly. AWS is simply too expensive. You can buy a Lambda Quad GPU workstation and pay it back in a couple of months. If you want to save more you can just build it yourself.


> I am surprised Amazon has not jumped in the game

Why should they? There is not a lot of money to be gained from renting out a niche product, compared with the enormous capital expenditure required for anything hardware related.

Lots of dotcom companies burned themselves badly while chasing trendsetters with custom silicon. A cookie-cutter 40nm SoC may cost "just" 10M today, but by getting involved in the custom silicon game you risk losing at it. Not to mention that your operations troubles will increase n-fold.

Managing the operations of a hosting business with hundreds of thousands of customers is hard enough. Logistics, server lifecycle, DC management, managing procurement contracts with unruly OEMs... Now try to add all the troubles you have with chipmakers on top of that. It will become a nightmare.

You mean something not like EC2 Accelerated Computing Instances?


Edit: I didn't see the parent meant a hardware accelerator created by Amazon itself. Thx to @jsty and @paol for pointing this out. An ASIC by Amazon was announced last year and is known as 'AWS Inferentia'



I think they were referring to the fact that those are all bought in from outside vendors, rather than having their own custom accelerator chips produced like e.g. Google's TPUs.

Those are just regular GPUs (and one FPGA offering)

What are your thoughts on how federated learning might change the HW landscape for edge devices?
