
Modern AI/ML is increasingly about neural nets (deep learning), whose performance is based on floating point math - mostly matrix multiplication and multiply-and-add operations. These neural nets are increasingly massive, e.g. GPT-3 has 175 billion parameters, meaning that each pass thru the net (each word generated) is going to involve in excess of 175B floating point multiplications!
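To put a rough number on that (a back-of-the-envelope figure, counting one multiply and one add per parameter per forward pass, and ignoring attention over the context and other smaller terms):

    175e9 parameters x 2 ops (multiply + add) ~= 350 billion floating point ops per generated token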

When you're multiplying two large matrices together (or other similar operations) there are thousands of individual multiply operations that need to be performed, and they can be done in parallel since these are all independent (one result doesn't depend on the other).
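To make that concrete, here's a minimal sketch of a (deliberately naive) CUDA matrix multiply kernel - each GPU thread computes one output element of C = A * B, and no thread depends on any other thread's result:

    // Naive matrix multiply: C = A * B, all N x N, row-major.
    // Each thread computes one element of C completely independently.
    __global__ void matmul(const float* A, const float* B, float* C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];   // multiply-and-add
            C[row * N + col] = acc;
        }
    }

(Real libraries like cuBLAS use heavily tiled versions of this, but the independence of the output elements is the key property.)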

So, to train/run these ML/AI models as fast as possible requires the ability to perform massive numbers of floating point operations in parallel, but a desktop CPU only has a limited capacity to do that, since it is designed as a general purpose device, not just for math. A modern CPU has multiple "cores" (individual processors that can run in parallel), but only a small number (~10), and each core's floating point math is handled by specialized FPU/SIMD units that can only do a handful of operations per clock.

This is where GPU/TPU/etc "AI/ML" chips come in, and what makes them special. They are designed specifically for this job - to do massive numbers of floating point multiplications in parallel. A GPU of course can run games too, but it turns out the requirements for real-time graphics are very similar - a massive amount of parallelism. In contrast to the CPU's ~10 cores, GPUs have thousands of cores (e.g. the NVIDIA GTX 4070 has 5,888) running in parallel, and these are all floating-point capable. This results in the ability to do huge numbers of floating point operations per second (FLOPS), e.g. the GTX 4070 can do 30 TFLOPS (tera-FLOPS) - i.e. 30,000,000,000,000 floating point operations per second!
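That headline number is easy to sanity-check, roughly - assuming 2 FLOPs per core per clock (one fused multiply-add) and the published boost clock of about 2.5 GHz:

    5,888 cores x 2 FLOPs/clock x ~2.5e9 clocks/sec ~= 29 TFLOPS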

This brings us to the second specialization of these GPU/TPU chips - since they can do this ridiculous number of FLOPS, they need to be fed data at an equally ridiculous rate to keep them busy, so they need massive memory bandwidth - way more than the CPU needs to be kept busy. The normal RAM in a desktop computer is too slow for this, and is in any case in the wrong place - on the motherboard, where it can only be accessed across the PCIe bus, which is again way too slow to keep up. GPUs solve this memory speed problem by having a specially designed memory architecture and lots of very fast RAM co-located very close to the GPU chip. For example, that GTX 4070 has 12GB of RAM and can move data from it into its processing cores at a speed (memory bandwidth) of about 500GB/sec!
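A back-of-the-envelope comparison (using the rough figures above) shows why this is still the bottleneck - if every floating point operation had to fetch a fresh 4-byte value from VRAM, the memory system could never keep up:

    ~30e12 FLOPS x 4 bytes   ~= 120 TB/sec of data demand
    actual memory bandwidth  ~= 0.5 TB/sec

so each value loaded from VRAM has to be reused a couple of hundred times (via registers, caches and tiling) just to keep the cores fed.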

The exact designs of the various chips differ a bit (and a lot is proprietary), but they are all designed to provide these two capabilities - massive floating point parallelism, and massive memory bandwidth to feed it.

If you want to get into this in detail, the best place to start would be to look into low level CUDA programming for NVIDIA's cards. CUDA is the lowest level API that NVIDIA provides to program their GPUs.
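As a flavor of what that looks like, here's a minimal (sketch-level) host program that would launch the matmul kernel from earlier - allocate memory the GPU can see, launch a grid of roughly a million threads, wait for them to finish:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Host-side driver for the matmul kernel sketched earlier.
    int main() {
        const int N = 1024;
        size_t bytes = (size_t)N * N * sizeof(float);

        float *A, *B, *C;                      // unified memory keeps the example short
        cudaMallocManaged(&A, bytes);
        cudaMallocManaged(&B, bytes);
        cudaMallocManaged(&C, bytes);
        for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

        dim3 block(16, 16);                    // 256 threads per block
        dim3 grid((N + 15) / 16, (N + 15) / 16);
        matmul<<<grid, block>>>(A, B, C, N);   // 64 x 64 blocks of 256 threads = ~1M threads
        cudaDeviceSynchronize();

        printf("C[0] = %f\n", C[0]);           // expect 2048 (= N * 1.0 * 2.0)
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

Compile it with nvcc and time it against a plain CPU loop to see the difference for yourself.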




A few finer points:

1 - It's RTX 4070, not GTX 4070

2 - the 30 TFLOPS you mention are at the very top when overclocked; they go for 22 normally.

3 - Also, those are single precision TFLOPS, as in 32 bit. What really matters nowadays is double precision. And in double precision a 4070 is 0.35 TFLOPS (or 350 GFLOPS) - 2 orders of magnitude lower, still impressive though.
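(That 0.35 comes straight from the FP32 figure, by the way: these consumer cards run FP64 at 1/64 of the FP32 rate, so ~22.6 TFLOPS / 64 ~= 0.35 TFLOPS.)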


For neural nets it's actually the opposite - half-precision bfloat16 is enough. You need large range, but not much accuracy.
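bfloat16 is literally just the top 16 bits of an IEEE float32: same sign bit and same 8-bit exponent (so the same ~1e38 range), but only 7 mantissa bits instead of 23 (roughly 2-3 decimal digits of precision). A quick sketch of a truncating conversion, just to illustrate the format (real hardware and libraries round rather than blindly truncate):

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Truncating float32 -> bfloat16: keep the high 16 bits
    // (sign + 8 exponent bits + 7 mantissa bits), drop the rest.
    uint16_t float_to_bf16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        return (uint16_t)(bits >> 16);
    }

    float bf16_to_float(uint16_t h) {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }

    int main() {
        float x = 3.14159265f;
        printf("%f -> %f\n", x, bf16_to_float(float_to_bf16(x)));  // 3.141593 -> 3.140625
        return 0;
    }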

Yes, the exact numbers are going to vary, but I was just giving a data point to indicate the magnitude of the numbers. If you want to quibble, there's CPU SIMD too.


For gaming, it's double precision that matters. And we were talking about a particular GPU, which is used for gaming, not AI. Hence why AI chips exist in the first place - dedicated hardware for dedicated tasks (ASIC for short).


The NVIDIA cards are all dual-use for gaming and compute/ML. Some features like the RTX 4070's Tensor Cores (incl. bfloat16) are there primarily for ML, and other features like ray tracing are there for gaming.


The NVIDIA cards were used for mining crypto-coins too, and they did that successfully for years before being made obsolete in that area by ASICs. Now the same thing is happening in AI/ML, which is why AI chips are being developed - they are the ASICs for this domain. That's the big picture. In 2 to 3 years no one is going to use NVIDIA gaming cards for AI/ML anymore, no matter how many GFLOPS the future 5000/6000 series offer. They will be for gaming only. End of story.


ASICs aren't magic - they are just chips designed to do a single function fast (e.g. run a crypto mining algorithm) as an alternative to using a general purpose CPU/GPU, whose generality comes at the cost of some performance overhead.

If your application calls for generality - like a gaming card's need to run custom shaders, or an ML model's need to run custom compute kernels - then an ASIC won't help you. These applications still need a general purpose processor, just one that provides huge parallelism.

It seems you may be thinking that all an ML chip does is matrix multiplication, and so a specialized ASIC would make sense, but that's not the case - an ML chip needs to run the entire model - think of it as a PyTorch accelerator, not a matmul accelerator.

Finally, the market for consumer (vs data center) ML cards is tiny relative to the gaming market, and these chips/cards are expensive to develop. Unless this changes, it doesn't make sense for companies like NVIDIA to develop ML-only consumer cards when, with minimal effort, they can leverage their data center designs and build dual-use GPU/compute consumer cards.



