Ask HN: What is an A.I. chip and how does it work?
145 points by rramadass 9 months ago | 93 comments
With all the current news about NVIDIA AI/ML chips;

Can anybody give an overview of AI/ML/NPU/TPU/etc chips and pointers to detailed technical papers/books/videos about them? All I am able to find are marketing/sales/general overviews which really don't explain anything.

Am looking for a technical deep dive.

It may help your digging and search if you have in mind what those chips really try to do: Accelerate numerical linear algebra calculations.

If you are familiar with linear algebra these specialized chips literally etch silicon so as to perform vector (and more general multi-array or tensor) computations faster than a general purpose CPU. They do that by loading and operating a whole set of numbers (a chunk of a vector or a matrix) simultaneously (whereas the CPU would operate mostly serially - one at a time).

The advantage is (in a nutshell) that you can get a significant speedup. How much depends on the problem and how big a chunk you can process simultaneously but it can be a significant factor.

There are disadvantages that people ignore in the current AI hype:

* The speedup is a one-off gain; Moore's law is equally dead for "AI chips" and CPUs

* It is extremely specialized and fine-tuned software you need to develop and run and it only applies to the above linear algebra problems.

* In the past such specialized numerical algebra hardware was the domain of HPC (high performance computing). Many a supercomputer vendor went bankrupt in the past because the cost versus market size was not there.

While a given generation of accelerator can only target model architectures that are comparatively proven out, and there’s a lag time, it’s measured in years not decades.

I remember when NVIDIA didn’t have hardware for ReLU.

The fact of the matter on Moore’s law is that we’ve got transistors, but not TDP to burn and have for years. These stupid big L3 caches are just: “fuck it, I’ve got die to burn”.

This is an old story, things migrate in and out of the “CPU”, but the current outlook is that we’ll be targeting specialized hardware more rather than less for the foreseeable future.

There have been some really strange instruction sets conceived. As a student at Stanford we had a time-share system that was home-grown (as I remember). It had opcodes to reverse bits in a bitstring! And odder things. Somebody needed that for some research project I guess. And then repurposed the damn thing for timeshare.

It was pretty sad timeshare as I recall. The only machine(s?) available to a population of what? 20,000? And about 40 terminals total. You had to sign up for 15-minute measured timeslots.

I came from a state school that had over 1000 terminals on a dozen machines, all unlimited time to students. It was a big shock to find the star of Silicon Valley had such crappy student services.

> It had opcodes to reverse bits in a bitstring

Bit reversal is used in Fast Fourier Transforms. It's not entirely surprising to me that you'd have specialized hardware for that operation.

Ref: https://en.wikipedia.org/wiki/Bit-reversal_permutation#Appli...
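To make the FFT connection concrete, here is a minimal sketch of the bit-reversal permutation that a radix-2 FFT uses to reorder its input. The function names are made up for illustration; a hardware opcode would do the inner loop in a single instruction.

```python
def reverse_bits(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # move the lowest bit of value into result
        value >>= 1
    return result


def bit_reversal_permutation(n):
    """Index order for shuffling the input of a radix-2 FFT of size n."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    width = n.bit_length() - 1  # number of index bits
    return [reverse_bits(i, width) for i in range(n)]


print(bit_reversal_permutation(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

In software this costs a loop per index, which is presumably why someone thought a dedicated opcode was worth it.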

Also to count ones in a bitstring. And so on.

> I remember when NVIDIA didn’t have hardware for ReLU.

Could you elaborate? ReLU is max(x,0) and CUDA had fmaxf since CUDA 1.0 (2007).

I could easily be wrong, but IIUC the software/hardware stack was, as of 2014 or so, fusing down to a specific set of circuits for broadcast activation functions under the broader banner of “Tensor Cores” (along with a bunch of devilish FMA hot-pathing under the sheets).

This is a fairly random press piece off Google but there are a ton of them. Hard to tell whether it’s the hardware, the blob, or the process node unless you work there.


Not sure if or what the change was, but fmax(x, 0) only requires checking the sign bit instead of doing a full floating point comparison (putting aside NaN handling).

A hypothetical relu instruction could probably get away with much less power and die space?
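A rough software sketch of the sign-bit idea, assuming IEEE 754 float32 values. `relu_signbit` is a made-up name, and this says nothing about how any real GPU implements ReLU; it only shows that the result can be chosen from a single bit.

```python
import struct


def relu_signbit(x):
    """ReLU that decides using only the float32 sign bit (NaN handling set aside)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))  # raw 32-bit pattern of x
    sign = bits >> 31  # 1 for negative values (and -0.0), 0 otherwise
    return 0.0 if sign else x


for value in (3.5, -2.0, 0.0, -0.0):
    print(value, relu_signbit(value))
```

No exponent or mantissa comparison is involved, which is the saving the comment above is speculating about.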

> the current outlook is that we’ll be targeting specialized hardware more rather than less for the foreseeable future.

I think there are some important question marks still unresolved that bear on how things will play out. E.g. how the training versus inference balance will land in terms of usage and economics.

Inference is inherently more "mass market". You need it locally without lags from moving data around. But inference is just numerical linear algebra. Ultimately augmenting the CPU to provide inference natively might be the optimal arrangement.

Good explanation. That also gives you an idea why GPUs are decent at accelerating computations for neural networks (think CUDA) -- they are already optimized for doing many small computations in parallel, albeit with slower processors, and have a lot of dedicated RAM (VRAM).

>* The speedup is a one-off gain; Moore's law is equally dead for "AI chips" and CPUs

This does not seem to be true in reality. The H100 is about 2.5x 'better' for AI than the A100 (obviously depends exactly what you are doing), and they released about 2 years apart. That is roughly in line with Moore's law.

The difference seems to be: more units for parallel calculation, but the speed of calculation in itself doesn't double anymore. In other words: Moore's law has stopped for raw speed and perhaps other areas, but is still alive in other areas. This has some weird consequences: Some models can't be processed in smaller chips (because swapping in parts is too slow to be useful), but after a threshold is crossed, suddenly the large models run efficiently.

We will probably see large AI models used in smaller devices in the next few years because there's another way to optimize: use more efficient representations of the model weights. I'm thinking of posits, a different floating point system where even 6 bits are perhaps usable. When models can switch to 6 bit posits from f16 (half floats), hardware can load more than three times larger models. We will see whether hardware for this will be mass-produced.

Moore's law was never about raw speed. It is about transistors per unit area and it is still very much active. That transistor count is going into cache and core count, but it can just as easily go into specialized linear algebra units, which is what AI chips do.

Why from 16 to 6 is "more than three times larger"? Why not 16/6=2.667 times?
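The arithmetic behind this question, as a quick check. The 8 GiB memory budget is an arbitrary assumption for illustration, and per-weight overheads (scales, alignment) are ignored.

```python
BITS_F16 = 16
BITS_POSIT = 6  # the 6-bit width discussed above


def weights_that_fit(memory_bytes, bits_per_weight):
    """How many weights of a given width fit into a memory budget."""
    return (memory_bytes * 8) // bits_per_weight


ratio = BITS_F16 / BITS_POSIT
print(f"{ratio:.3f}")  # 2.667

budget = 8 * 2**30  # hypothetical 8 GiB of weight memory
print(weights_that_fit(budget, BITS_F16), weights_that_fit(budget, BITS_POSIT))
```

On bit width alone the gain is 16/6, a bit under 2.7x, so "more than three times" would need some other saving on top.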

> Accelerate numerical linear algebra calculations.

Technically, for AI, you need to accelerate numerical non-linear algebra calculations, which take the general form of matrix multiplication, but interpose non-linear functions at key points in the calculation.

The calculations within a neural network involve both linear and non-linear operations, but the fundamental mathematical framework underlying AI remains rooted in linear algebra.
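A tiny sketch of what "interposing non-linear functions" buys you. The weights below are hand-picked for illustration, not taken from any real model: with ReLU between the two matrix multiplications the network computes |x0 - x1| for this input, while without it the two layers collapse into a single linear map.

```python
def matvec(W, x):
    """Matrix-vector product, the linear part of a layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise non-linearity."""
    return [max(0.0, a) for a in v]


def two_layer(x, W1, W2, activation=relu):
    return matvec(W2, activation(matvec(W1, x)))


W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]

print(two_layer([2.0, 0.5], W1, W2))                # [1.5]
print(two_layer([2.0, 0.5], W1, W2, lambda v: v))   # [0.0]
```

Almost all of the arithmetic is still in `matvec`, which is the part the accelerators target.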

Is this why my M1 MacBook Air can run the same R code at least 10x faster than my giant Linux tower?

Apple Silicon Macs have special matrix multiplication units (AMX) that can do matrix multiplication fast and with low energy requirements [1]. These AMX units can often beat matrix multiplication on AMD/Intel CPUs (especially those without a very large number of cores). Since a lot of linear algebra code uses matrix multiplication and using the AMX units is only a matter of linking against Accelerate (for its BLAS interface), a lot of software that uses BLAS is faster on Apple Silicon Macs.

That said, the GPUs in your M1 Mac are faster than the AMX units and any reasonably modern NVIDIA GPU will wipe the floor with the AMX units or Apple Silicon GPUs in raw compute. However, a lot of software does not use CUDA by default and for small problem sets AMX units or CPUs with just AVX can be faster because they don't incur the cost of data transfers from main memory to GPU memory and vice versa.

[1] Benchmarks:


https://explosion.ai/blog/metal-performance-shaders (scroll down a bit for AMX and MPS numbers)

> That said, the GPUs in your M1 Mac are faster than the AMX units

Not for double, which is what R mostly uses IIRC.

Ah, thanks for the correction! I never use R, so I assumed that it uses/supports single-precision floating point.

> Accelerate numerical linear algebra calculations.

Like compute eigenvalues/eigenvectors of large matrices, compute SVDs, solve large sparse systems of equations, etc?

Nothing that fancy. Usually matrix-matrix and matrix-scalar multiplication.

This is indeed the bread-and-butter, but there is use of all sorts of standard linear algebra algorithms. You can check various xla-related (accelerated linear algebra) folders in tensorflow or torch folders in pytorch to see the list of what is used [1],[2]

[1] https://github.com/tensorflow/tensorflow/tree/8d9b35f442045b...

[2] https://github.com/pytorch/pytorch/blob/6e3e3dd477e0fb9768ee...

But is this used for AI or is this "what can we do with an AI chip since we happen to have one anyway?"

> If you are familiar with linear algebra

Could you please write what are some common day to day life applications of linear algebra in computing?

The obvious fields are computer graphics (all kinds: 2d, 3d rasterized and 3d raytraced are all heavy on linear algebra, though 3d rasterized is the easiest to speed up with the lockstep SIMD architectures we call GPUs) and neural networks.

Computer graphics mostly because you can view the real world as 3d space and the screen as 2d space, and linear algebra gives you all the tools to manipulate something in 3d space and project it into 2d space. Neural networks because you can treat them as matrix multiplications.
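A minimal sketch of the 3d-to-2d idea: rotate a point with a 3x3 matrix, then project it onto a screen plane. The pinhole model and the names here are illustrative assumptions, not any particular graphics API. Note that the final divide by z is the one step that is not linear (real pipelines handle it with homogeneous coordinates).

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rotation_y(angle):
    """3x3 rotation matrix about the y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def project(point, focal_length=1.0):
    """Pinhole perspective projection onto the plane z = focal_length."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


print(project([1.0, 1.0, 4.0]))  # (0.25, 0.25)
rotated = matvec(rotation_y(math.pi / 2), [1.0, 0.0, 0.0])
print([round(c, 6) for c in rotated])
```

A GPU runs this same handful of multiply-adds for every vertex in the scene, in lockstep.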

Linear Algebra is like a lot of math in CS; you don't necessarily see it initially, but once you get some familiarity you start seeing it everywhere.

Others have commented on computer graphics, but (as it turns out) the exact same algorithms apply for Collision Detection in 3D space. And since games already are manipulating graphics, you add on another set of Linear Algebra transforms that change the position/rotation/shear of those vertices. In a similar way, science (especially physics) uses linear algebra to build simulations of all kinds of systems.

One surprising use is in Advertising (and other user preference aggregators). Turns out a preference acts like a magnitude of a one-dimensional vector. String N preference vectors together and you get an N-Dimension Vector that you can perform Linear Algebra operations upon. One common application is the Dot Product, which is a fancy way of taking two N-Dimension vectors and measuring how close those vectors point in the same direction in a [1, -1] range.

Yet another common place to find Linear Algebra is in computer science papers. Most of the time this is simply notation; a lot of common programming forms can be represented by MxN matrices. However some of those algorithms will use LA as a way of parsing and breaking down a problem. You will see this in compiler papers often, but it's transferable to many other domains.

As a final and personal observation, I found that Linear Algebra helped me grasp Functional Programming. In both cases I am applying a transform to some input, and often stringing together many such transforms. Also in both cases, the transformations are sensitive to their order, and a bad ordering produces nonsense just like garbage data.

> One common application is the Dot Product, which is a fancy way of taking two N-Dimension vectors and measuring how close those vectors point in the same direction in a [1, -1] range.

That's cosine similarity, or normalised dot product. The dot product can take any value when the vectors are not unit norm.
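The distinction in code, as a small sketch with made-up preference vectors: the raw dot product grows with the vectors' lengths, while the normalised version stays within [-1, 1].

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both norms; always within [-1, 1]."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice as long
c = [-3.0, -4.0]  # opposite direction

print(dot(a, b))                # 50.0
print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```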

Nvidia exists because linear algebra is essential for 3D graphics. They got a second boost from crypto-currency, which isn't strictly about linear algebra but it can put those computation units to good use. Now Nvidia is riding high again because neural nets are all about linear algebra, and LLMs are big neural nets.

I would also add scientific simulations to the list of tasks that GPUs are used for. They parallelise well on many cores.

I have really lost touch with common day to day life at this point, but Excel would be an excellent example. If you have a column with a million data points and do some basic calculation -- subtract or multiply another column -- whether or not the software you use can translate that into a vectorized operation under the hood using linear algebra can significantly speed up your operation.

This isn't a technical deep dive, but here's a simplified explanation.

It's a matrix multiplication (https://en.wikipedia.org/wiki/Matrix_multiplication) accelerator chip. Matrix multiplication is important for a few reasons. 1) it's the slow part of a lot of AI algorithms, 2) it's a 'high intensity' algorithm (naively n^3 computation vs. n^2 data), 3) it's easily parallelizable, and 4) it's conceptually and computationally simple.
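To put numbers on the "n^3 computation vs. n^2 data" point, here is a sketch of the naive algorithm with a multiply counter. It is deliberately the textbook triple loop, not how any accelerator schedules the work.

```python
def matmul_counted(A, B):
    """Naive matrix multiply that also counts scalar multiplications."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    multiplies = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
                multiplies += 1
    return C, multiplies


def multiplies_per_value(n):
    """Multiplications per matrix entry touched (A, B and C) for n x n inputs."""
    A = [[1.0] * n for _ in range(n)]
    _, mults = matmul_counted(A, A)
    return mults / (3 * n * n)


for n in (4, 8, 16):
    print(n, multiplies_per_value(n))
```

The ratio grows linearly with n, so bigger matrices give the hardware more arithmetic to do per value it has to fetch.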

The situation is kind of analogous to FPUs (floating point math co-processor units) when they were first introduced before they were integrated into computers.

There's more to it than this, but that's the basic idea.

There are very different architectures in the wild. Some are simply standard GPUs (maybe with additional support for bf16/float16) (Rockchip RK1808 has one like that). You give it a list of instructions pretty much like a CPU (except massively parallel), and it'll execute it. BTW when I say standard GPU, I'm not saying "kinda like a GPU", but really literally GPU architecture. Linux mainline support for Amlogic A311D2's NPU is 10 lines, and it's declaring a no-output vivante GPU.

Some are just hardware pipeline to compute 2d/3d convolutions + activation function (Rockchip RK3588 has one like that). You give it the memory address + dimensions of the input matrix, the memory address + dimensions of the weights, memory address + dimensions of the output, which activation function you want (there are only like 4 supported), then you tell it RUN, you wait for a bit, and you have the result at the output memory address.
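As a mental model only, the "fill in a descriptor, then tell it RUN" flow could be sketched like this. Everything here is hypothetical (the `ConvJob` fields, the `run` function, the dict standing in for RAM); it is not Rockchip's actual register layout or driver interface.

```python
from dataclasses import dataclass

# Hypothetical fixed menu of activations, mimicking "only like 4 supported".
ACTIVATIONS = {
    "none": lambda v: v,
    "relu": lambda v: max(0.0, v),
}


@dataclass
class ConvJob:
    """Hypothetical job descriptor: where to read, where to write, which activation."""
    input_addr: str
    weights_addr: str
    output_addr: str
    activation: str = "relu"


def run(job, memory):
    """Simulate the fixed pipeline: 2D convolution (no padding), then activation."""
    image = memory[job.input_addr]
    kernel = memory[job.weights_addr]
    act = ACTIVATIONS[job.activation]
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    memory[job.output_addr] = [
        [act(sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw)))
         for c in range(out_w)]
        for r in range(out_h)
    ]


memory = {
    "in": [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]],
    "w": [[-1.0, 0.0],
          [0.0, 1.0]],
}
run(ConvJob("in", "w", "out"), memory)
print(memory["out"])  # [[4.0, 4.0], [4.0, 4.0]]
```

The host never issues individual instructions in this model; it only describes buffers and picks from a fixed list, which is why anything outside that list has to fall back to the CPU.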

(I took Rockchip example to show that even in one microcosm it can change a lot)

And then you can imagine any architecture in-between.

AFAIK they all work with some RAM as input and some RAM as output, but some may have its own RAM, some may share RAM with the system, some might have mixed usage (RK3588 has some SRAM, and when you ask it to compute the convolution, you can tell it either to write to SRAM or system RAM)

It's possible that there are some components that are borderline between an ISP (Image Signal Processing) and an NPU, where the input is the direct camera stream, but my guess is that they do some very small processing on the camera stream, then dump it to RAM, then do all the heavy work from RAM to RAM. I think that Pixel 4-5 had something like that.

The RK3588 seems like a beast for performance per dollar, I have one running desktop Linux for emulators and games, but I haven't used the NPU yet, because I haven't taken the time to figure out how to get OpenCV to talk to it.

Is Rockchip's stuff "good", as NPUs go? I'm thinking of buying another 3588 SoC for my robotics hobby - you seem like you'd know if that's a decent idea or not.

Well, I already said it :P RK3588's NPU can only do convolutions and a small list of activation functions. Realistically it's usable only on images (2D, 3D data). The software side is pretty weird: It's amazing the number of models they support despite the limited fixed-function hardware, but there has been literally 0 development in the last 6 months and it does have a lot of bugs that don't seem hard to fix. Even when it comes to images, it doesn't support changing the resolution of the input (you need to ""recompile the model"" for that), which is super weird since the hardware pipeline doesn't care much about the size.

Anyway, I really don't recommend it, unless you're making your own model, you know before-hand what's supported and what isn't, and your input is fixed resolution (which is a pretty fair usage in an embedded system) (fixed = doesn't change at every frame. handling hotplug from one webcam with a resolution to another with another resolution is fine)

I think looking at the examples gives you a reasonable idea of what it can do: https://github.com/rockchip-linux/rknn-toolkit/tree/master/e... It's mobilenet, yolov3, resnet50. There aren't more examples because they didn't have more examples. There aren't more examples because that's pretty much all you can reasonably run.

As far as I can tell, modern image models using transformer/vit won't be runnable on it. (it acts enough as a coprocessor that it's possible to do some parts in CPU some parts in NPU - and Rockchip framework handles that -, so maybe it's somehow possible)

(Note: I say this as a huge Rockchip lover, their mainline support is top-notch, they make very durable product (their 2015's RK3288 is still far from obsolete), and I bought a RK3588 SBC to play with a NPU accelerator (whose full specification is publicly available btw), in the hope to have a self-hosted LLM voice assistant)

Yeah its performance vs cost is honestly nuts. 6 TOPS is a pretty solid NPU, but I don’t know what their software is like. Programming those accelerators is often difficult, especially if you’re a small time customer.

Curious if anyone can weigh in on their SW usability. A quick search for their user level tools showed examples/documentation in Chinese(?)

Yeah it's the first under $100 system I've tried, (and I'm a fan of the genre) that's truly a desktop replacement in terms of being a snappy responsive desktop even with lots of browser tabs, etc.

I do know they have their own special sauce to talk to the NPU. I was discouraged from making the effort myself because their special sauce to talk to the VPU has barely any ffmpeg support, it p much only uses gstreamer, and I'm neither a masochist nor French so that's a non-starter.

Yeah I’m very surprised to see a single board computer at less than $100 for a processor like that. Hard to tell what their actual 1ku price is, but if the random Alibaba I found for $20 is right, then that price for the overall board is absurd.

By VPU are you talking about stuff like ISP, video encoder/decoder, or something else?

Among embedded processors I’ve seen touting vision acceleration, gstreamer support is fairly widespread. I bit the bullet to learn it because my role requires it. Maybe it’s Stockholm Syndrome talking, but I’ve somehow grown to like gstreamer. The learning curve was awkward. I struggled with documentation and learned more by analyzing some examples and trial-and-error.

By VPU, I meant the square on the board that eats h.265 hw_enc and hw_dec and things like that, yes. I've been told it's a separate square from the GPU, by someone who is full-time waist-deep on getting Arch running on that hardware, so I take it as fact.

Oh no. The prospect of gstreamer being the only way... Oh no.

Maybe there's a zsh plugin or smth that autocompletes sane defaults? AAAAAAAAHHHHHH there surely isn't, anyone merciful enough to make one would just use ffmpeg instead...

H.T. Kung’s 1982 paper on systolic arrays is the genesis of what are now called TPUs: http://www.eecs.harvard.edu/~htk/publication/1982-kung-why-s...

Short story: CPUs can do calculations, but they do them one at a time. Think of something like 1+1 = 2. If you had 1 million equations like these, CPUs will generally do them one at a go, i.e. the first one, then the second, etc.

GPUs were optimised to draw, so were able to do dozens of these at a go. So these can be used for AI/ML in both gradient descent and inference (forward passes). Because you can do many at a go, in parallel, they speed things up dramatically. Geoff Hinton experimented with GPUs exploiting their ability to do this, but they aren't actually optimised to do that. It just turned out that it is the best way available to do it at the time, and still currently.

AI chips are optimised to do either inference or gradient descent. They are not good at drawing like GPUs are. They are optimised for machine learning and for joining other AI chips together so you can have massive networks of chips that can compute in parallel.

One other class of chips that has not yet shown up are ASICs that mimic the transformers architectures for even more speed - though it changes too much at the moment for it to be useful.

Also because of the mechanics of scale manufacturing: GPUs are currently cheaper per flop of compute as the aggregate of scale is shared with graphical uses. Though with time if there is enough scale AI chips should end up cheaper

Do you have any sources for that information? I find it really hard to find stuff for what you describe. Also do you know about the details of producing those ASICs? Are they CMOS or flash (in-memory-compute?)

All current AI accelerators (that aren't a research project) are ordinary CMOS. Google published some papers about TPUv3. You should read them if you want to know more about the architecture of these kinds of chips.

I’d start with CUDA, because knowing what a chip does won’t click until you see how it can be programmed to do massive parallel computation and matmul.

I read the first book in this list about 10 years ago, and though it’s pretty old the concepts are solid.


CUDA abstracts most of the parallelism. The magic of CUDA is that it gave developers a C/C++ API (or language, if you will) that doesn’t really require them to think about it: they can continue writing their programs as they did when programming for mostly single-core, single-threaded CPUs back in the day, and CUDA takes care of the rest.

Even “manual” CUDA optimizations deal more with concurrency and data residency than parallelism and even those are usually limited to following the compute guide for your specific hardware and feature set and the driver does the majority of the heavy lifting.

Relevant: Stanford Online course - CS217 Hardware acceleration for machine learning


Course website with lecture notes: https://cs217.stanford.edu/

Reading list: https://cs217.stanford.edu/readings


Google's TPU which they sell via Coral is just a systolic array of multiply-accumulates arranged in a grid.

Here's a decent overview from the horse's mouth. https://cloud.google.com/blog/products/ai-machine-learning/a...

It's called a systolic array because the data moves through it in waves similar to what an engineer imagines the heart looks like :)
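A toy cycle-by-cycle simulation may help picture the "waves". This is a textbook output-stationary variant written for illustration, with each grid cell holding one multiply-accumulate; it is an assumption that this resembles the real thing, and actual TPU dataflow may differ (for example by keeping weights stationary).

```python
def systolic_matmul(A, B):
    """Simulate an n x m grid of multiply-accumulate cells computing A @ B.

    Rows of A enter from the left and columns of B from the top, each
    delayed by one extra cycle per row/column so matching operands meet.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0.0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0.0] * m for _ in range(n)]  # values travelling rightwards
    b_reg = [[0.0] * m for _ in range(n)]  # values travelling downwards
    cycles = n + m + k - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m - 1, 0, -1):  # shift right, last column first
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i                      # row i is skewed by i cycles
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0.0
        for j in range(m):
            for i in range(n - 1, 0, -1):  # shift down, last row first
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j                      # column j is skewed by j cycles
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0.0
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19.0, 22.0], [43.0, 50.0]] 4
```

Each cell only ever talks to its neighbours, so operands are fetched from memory once at the edges and then reused as they ripple through the grid.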

Trying to make a list of AI accelerator chip families, anything missing?

- GPU (Graphics Processing Unit)

- TPU (Tensor Processing Unit): ASIC designed for TensorFlow

- IPU (Intelligence Processing Unit): Graphcore

- HPU (Habana Processing Unit): Intel Habana Labs' Gaudi and Goya AI

- NPU (Neural Processing Unit): Huawei, Samsung, Microsoft Brainwave

- VPU (Vision Processing Unit): Intel Movidius

- DPU (Data Processing Unit): NVIDIA data center infrastructure processing unit

- Amazon's Inferentia: Amazon's accelerator chip focused on low cost

- Cerebras Wafer Scale Engine (WSE)

- SambaNova Systems DataScale

- Groq Tensor Streaming Processor (TSP)

Nice and very much in line with what I am looking for!

If you don't mind, could you add links to authoritative wiki pages/whitepapers/articles for each of the above? I think it will give us a good starting point to start our study/research from.

Veritasium did a pretty good video on some of them: https://www.youtube.com/watch?v=GVsUOuSjvcg

That video is about analog computers

The video is about analog chips for ML/NNs. His profile of this company was particularly interesting: https://mythic.ai/

OP asked about chips for AI.

Do companies pay him or does he do these ads for free?

He gets paid. It says so at the start of the video, but I guess that could depend on where you live.

Starting from https://cloud.google.com/tpu/docs/system-architecture-tpu-vm what are you looking for?

Yes, this is what I am looking for.

System Architecture of these chips with detailed Functional Units and how they are used by the AI algorithm Instruction/Data streams.

There is a YouTube channel TechTechPotato [1] that has a podcast on AI hardware called "The AI Hardware Show". Pretty small and it gives you a view on how niche this market is - but if you want the 10k foot view from young budding tech journalists then I think this fits the bill.

Some random examples of video titles from the last 6 months of the channel:

* A Deep Dive into IBM's New Machine Learning Chip

* Does my PC actually use Machine Learning?

* Intel's Next-Gen 2023 Max CPU and Max GPU

* A Deep Dive into Avant, the new chip from Lattice Semiconductor (White Paper Video)

* The AI Hardware Show 2023, Episode 1: TPU, A100, AIU, BR100, MI250X

I think the podcaster's background is actually in HPC (High Performance Computing), i.e. supercomputers. But that overlaps just enough with AI hardware that he saw an opportunity to capitalize on the new AI hype.

1. https://www.youtube.com/c/TechTechPotato

Nice, looks like a good starting point to survey the field.

There's a lot of information here about chips which are mostly built for training neural networks.

It's worth noting there are very widely deployed chips primarily built for inference (running the network) especially on mobile phones.

Depending on the device and manufacturer sometimes this is implemented as part of the CPU itself, but functionally it's the same idea.

The Apple Neural Engine is a good example of this. This is separate to the GPU which is also on the CPU.

Further information is here: https://machinelearning.apple.com/research/neural-engine-tra...

The Google Tensor CPU used in the Pixel has a similar coprocessor called the EdgeTPU.

I've worked in this space for the past five years. The chips are essentially highly parallel processors. There's no unifying architecture. You have the graph-based / hpc-simulator chips like Cerebras, Graphcore, etc which are basically a stick-as-many-cores-as-possible situation with a high-speed networking fabric. You have the 'tensor' cores like Groq where the chip operates as a whole and is just well suited for tensor processing (parallelizable, high-speed memory, etc).

At the end of the day, it's matrix multiplication acceleration mostly, and then IO optimization. Literally most of the optimization has nothing to do with compute. We can compute faster than we can ingest.
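"We can compute faster than we can ingest" is roughly what a roofline estimate captures. A minimal sketch, where the peak compute and bandwidth figures are round hypothetical numbers rather than specs of any real chip:

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline estimate: limited by compute or by data delivery, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


PEAK = 100e12      # hypothetical 100 TFLOPS of compute
BANDWIDTH = 1e12   # hypothetical 1 TB/s of memory or IO bandwidth

for intensity in (1, 10, 100, 1000):  # FLOPs performed per byte moved
    rate = attainable_flops(PEAK, BANDWIDTH, intensity)
    print(intensity, f"{rate / 1e12:.0f} TFLOPS")
```

Under these assumed numbers, anything doing fewer than 100 FLOPs per byte moved is limited by the data path, not the arithmetic units.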

>The chips are essentially highly parallel processors

Right. AFAIK we already were doing SIMD, Vector Processing, VLIW etc. to speed up parallel processing for numerical calculations in AI/ML. What then is the reason for the explosion of these different categories of chips? Are they just glorified ASICs designed for their specific domains or FPGAs programmed accordingly? If so what is their architecture i.e. what are their functional units and how do they work together?

Realistically, a good AI chip will have provisions for high throughput IO (the most important thing, and the key differentiator) and the actual processing really doesn't matter because with enough engineering effort you'll be able to saturate the chip.

GPUs have a high speed output in the form of an HDMI link. However, there is no high speed input. Reads/writes to/from the GPU are slow. The Cerebras wafer chip for example has 8-16 FPGA driven IO chips that directly read from TCP/IP onto the chip and off again in parallel. Each FPGA connects to its own ethernet port. So you can get the data on/off the chip as fast as possible. That's it really.

As for the processing engines. They're usually just standard cores with a high speed interconnect and maybe some matrix multiplication optimizations. Some, like Groq, have a high speed fabric with specialized processors at various locations.

I want to latch on this question a bit -- which company out there is primed to bring us a CUDA competitor. AMD has failed, so any wise words from the people in the industry?

There's a five part series outlining AI accelerator chips that came out last year. Starting with Introduction, Motivation, Technical Foundations, and Existing Solutions:


Excellent; just what I was looking for!

Please do submit this as a top-level HN submission so that it gets greater visibility.

Here's one from Google (paper link at the end): https://cloud.google.com/blog/topics/systems/tpu-v4-enables-...

Great video from asianometry explaining the roots of AI chips (GPGPUs, general purpose GPUs) in GPUs (graphics processing units) -- how did we get here and what do these chips do?


You might find this talk interesting, The AI Chip Revolution with Andrew Feldman of Cerebras, https://youtu.be/JjQkNzzPm_Q

It's the founder of a new AI chip company and they talk a bit on the differences

An "AI" chip is marketing. But as other posts say, "linear algebra coprocessor" doesn't roll off the tongue as well.

Incidentally there used to be a proper "AI" chip. The original perceptron was intended to be implemented in hardware. But general purpose chips evolved much faster.


On top of what others have said here about TPUs and their kin, you can make things really scream by taping out an ASIC for a specific frozen neural network (i.e. including the weights and parameters).

If you never have to change the network - for instance to do image segmentation or object recognition - then you can’t get any more efficient than a custom silicon design that bakes in the weights as transistors.

What would be an affordable/ cheap way to get hands on with this type of hardware? Right now I have zero knowledge.

I'm pretty sure you can't buy TPUs, but people usually buy GPUs instead. If you're building a personal rig, these days, you can get an Nvidia RTX 3090 for about $720 USD on ebay used, which is pretty cheap for 24GB VRAM. There's also the A6000 with 48GB VRAM but that'll cost about $5000 on Amazon. Of course, there's new cards that are faster with more VRAM like the 4090 and RTX 6000, but they're also more expensive.

Of course, this is all pretty expensive still. If your models are small enough you can get away with even older GPUs with less VRAM like a GTX 1080 Ti. And then of course there's services like Google Colab and vast.ai where you can rent a TPU or GPU in the cloud.

I'd check out Tim Dettmers' guide for buying GPUs: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

AFAIK Google Coral is an inexpensive TPU you can buy right now: https://coral.ai/products/accelerator/

The problem is that this is an "inferencing" accelerator - i.e. it can only execute pretrained models. You cannot train a model on one of these, you need a training accelerator. And pretty much all of those are either NVidia GPUs or cloud-only offerings.

Very cool! Although it seems to have no memory to speak of, so many use cases like LLMs go away because of that, I guess?

It has 8 MB of memory. It also supports live streaming the neural network to the chip, although that is slower than when it is cached in the memory.
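A back-of-the-envelope check of what 8 MB means for model size. The assumptions are mine: one byte per parameter (8-bit quantized weights) and the whole memory available for weights, which is optimistic. The 175 billion figure is the GPT-3 parameter count.

```python
ON_CHIP_BYTES = 8 * 1024 * 1024   # the 8 MB mentioned above
GPT3_PARAMS = 175_000_000_000     # GPT-3 parameter count


def params_that_fit(memory_bytes, bytes_per_param=1):
    """Parameters that fit if the whole memory holds weights (an optimistic assumption)."""
    return memory_bytes // bytes_per_param


def bytes_needed(params, bytes_per_param=1):
    return params * bytes_per_param


print(params_that_fit(ON_CHIP_BYTES))                  # 8388608
print(bytes_needed(GPT3_PARAMS) // ON_CHIP_BYTES)      # how many times too big
```

So a few million parameters fit on chip, which suits small vision models; a large language model is thousands of times over budget and would have to be streamed in.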

Depending on your definition of affordable, the Windows Dev Kit 2023 makes a big deal out of its NPU, but you'll have to deal with Windows 11 to access it, unfortunately.

Qualcomm has an SDK [1] where you can run software on a DSP/NSP simulator.

[1] https://developer.qualcomm.com/software/hexagon-dsp-sdk

The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.


Remember crypto miners (ASICs)? Exact same thing but built for the math around AI work instead of blockchain work.

Interestingly, this might be well answered by the LLMs built on the technology you’re interested in.

An AI chip is basically a chip that does matrix calculations better than general purpose CPUs.

AI chips are just regular chips that do AI stuff faster

So dedicated hardware to do math stuff

It is what the ASIC was for Bitcoin. A new era for AI models.

Modern AI/ML is increasingly about neural nets (deep learning), whose performance is based on floating point math - mostly matrix multiplication and multiply-and-add operations. These neural nets are increasingly massive, e.g. GPT-3 has 175 billion parameters, meaning that each pass through the net (each word generated) is going to involve in excess of 175B floating point multiplications!
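As a back-of-the-envelope sketch of that arithmetic: if each parameter takes part in roughly one multiply and one add per pass, the cost per generated word is about twice the parameter count. The function names are hypothetical, and the estimate ignores attention over the context, memory limits, and real-world utilization, so treat it as a lower bound rather than a benchmark.

```python
# Rough forward-pass cost, assuming one multiply and one add per parameter.

def flops_per_token(n_params: int) -> int:
    """Approximate floating point operations for one pass through the net."""
    return 2 * n_params


def seconds_per_token(n_params: int, peak_flops: float) -> float:
    """Best-case time per pass on hardware sustaining peak_flops (never reached in practice)."""
    return flops_per_token(n_params) / peak_flops


GPT3_PARAMS = 175_000_000_000

gpt3_flops = flops_per_token(GPT3_PARAMS)          # 350 billion operations
gpt3_time = seconds_per_token(GPT3_PARAMS, 30e12)  # at an idealized 30 TFLOPS
```

Even at an idealized 30 TFLOPS that is on the order of ten milliseconds per word, before accounting for the fact that the weights would not fit in a single consumer card's memory.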

When you're multiplying two large matrices together (or other similar operations) there are thousands of individual multiply operations that need to be performed, and they can be done in parallel since these are all independent (one result doesn't depend on the other).
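A small sketch of that independence, in plain Python: each output cell is computed by its own function call that reads only the inputs, so the cells can be visited in any order (here a shuffled one) and the result is the same. On a GPU each of those calls would be handed to a different thread; the helper names are made up for illustration.

```python
import random


def matmul_cell(a, b, i, j):
    """Compute one output cell; depends only on row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_any_order(a, b, seed=0):
    """Fill the output in a shuffled order to show no cell depends on another."""
    rows, cols = len(a), len(b[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    random.Random(seed).shuffle(cells)  # arbitrary order, same answer
    out = [[0] * cols for _ in range(rows)]
    for i, j in cells:
        out[i][j] = matmul_cell(a, b, i, j)
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = matmul_any_order(a, b)  # [[19, 22], [43, 50]]
```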

So training/running these ML/AI models as fast as possible requires the ability to perform massive numbers of floating point operations in parallel, but a desktop CPU only has a limited capacity to do that, since it is designed as a general purpose device, not just for math. A modern CPU has multiple "cores" (individual processors that can run in parallel), but only a small number, ~10. Each core has its own floating point (SIMD) units, but these only process a handful of values per instruction, so the total floating point throughput stays modest.

This is where GPU/TPU/etc "AI/ML" chips come in, and what makes them special. They are designed specifically for this job - to do massive numbers of floating point multiplications in parallel. A GPU of course can run games too, but it turns out the requirements for real-time graphics are very similar - a massive amount of parallelism. In contrast to the CPU's ~10 cores, GPUs have thousands of cores (e.g. NVIDIA GTX 4070 has 5,888) running in parallel, and these are all floating-point capable. This results in the ability to do huge numbers of floating point operations per second (FLOPS), e.g. the GTX 4070 can do 30 TFLOPS (Tera-FLOPS) - i.e. 30,000,000,000,000 floating point operations per second!!
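For a feel of where a headline TFLOPS figure comes from, a common rule of thumb is cores x operations per core per cycle x clock speed, counting a fused multiply-add as two operations. The sketch below uses the 5,888 cores quoted above; the clock speeds (about 2.475 GHz boost and 1.92 GHz base) are assumptions here, so check the spec sheet before relying on them.

```python
# Peak throughput rule of thumb: cores * ops per cycle * clock.
# Assumption: one fused multiply-add (2 floating point ops) per core per cycle.

def peak_tflops(cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOPS; real workloads reach only a fraction of this."""
    return cores * ops_per_cycle * clock_ghz / 1000.0


boost = peak_tflops(5888, 2.475)  # assumed boost clock -> about 29.1 TFLOPS
base = peak_tflops(5888, 1.92)    # assumed base clock  -> about 22.6 TFLOPS
```

These are theoretical ceilings: they assume every core issues a multiply-add on every cycle, which real code rarely manages.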

This brings us to the second specialization of these GPU/TPU chips - since they can do this ridiculous number of FLOPS, they need to be fed data at an equally ridiculous rate to keep them busy, so they need massive memory bandwidth - way more than the CPU needs to be kept busy. The normal RAM in a desktop computer is too slow for this, and is in any case in the wrong place - on the motherboard, where the GPU can only access it across the PCIe bus, which is again way too slow to keep up. GPUs solve this memory speed problem by having a specially designed memory architecture and lots of very fast RAM co-located very close to the GPU chip. For example, that GTX 4070 has 12GB of RAM and can move data from it into its processing cores at a speed (memory bandwidth) of roughly 500 GB/sec!!
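One way to make the "keep it fed" point concrete is the usual roofline-style comparison: how many floating point operations the chip can do per byte it can fetch, versus how many operations a workload performs per byte it touches. The sketch below uses illustrative round numbers (30 TFLOPS, 1 TB/s), not measured specs, and assumes square float32 matrices that are each read or written exactly once.

```python
# Roofline-style estimate with illustrative round numbers, not measured specs.

def machine_balance(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the chip can perform per byte it can move from memory."""
    return peak_flops / bandwidth_bytes_per_s


def matmul_intensity(n: int, bytes_per_value: int = 4) -> float:
    """FLOPs per byte for an n x n matmul: 2*n^3 ops over three n x n matrices."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_value
    return flops / bytes_moved


balance = machine_balance(30e12, 1e12)  # 30 FLOPs per byte
small = matmul_intensity(16)            # well below the balance: memory bound
large = matmul_intensity(180)           # matches the balance: cores can stay busy
```

Workloads whose intensity falls below the machine balance leave the cores waiting on memory, which is why bandwidth matters as much as raw FLOPS.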

The exact designs of the various chips differ a bit (and a lot is proprietary), but they are all designed to provide these two capabilities - massive floating point parallelism, and massive memory bandwidth to feed it.

If you want to get into this in detail, the best place to start would be to look into low level CUDA programming for NVIDIA's cards. CUDA is the lowest level API that NVIDIA provides to program their GPUs.

A few finer points:

1 - It's RTX 4070, not GTX 4070

2 - The 30 TFLOPS you mention is the very top figure when overclocked; it normally runs at about 22.

3 - Also, those are single precision TFLOPS, as in 32 bit. What really matters nowadays is double precision. And in double precision a 4070 is 0.35 TFLOPS (or 350 GFLOPS). That's 2 orders of magnitude lower, still impressive though.

For neural nets it's actually the opposite - half-precision bfloat16 is enough. You need large range, but not much accuracy.
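To illustrate "large range, not much accuracy": bfloat16 keeps the same 8 exponent bits as a 32-bit float but only 7 mantissa bits, so it is effectively the top half of a float32. The sketch below emulates that by truncating the low 16 bits; real hardware typically rounds rather than truncates, and the function name is made up for this example.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE half-precision value, for comparison


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    This truncates instead of rounding, which is a simplification.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


coarse = to_bfloat16(3.14159)  # 3.140625: only 2-3 significant digits survive
huge = to_bfloat16(1e38)       # still finite, far beyond FLOAT16_MAX
```

So a value like 1e38 stays representable (ordinary float16 tops out at 65504), at the cost of keeping only two or three significant decimal digits.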

Yes, the exact numbers are going to vary, but just giving a data point to indicate the magnitude of the numbers. If you want to quibble there's CPU SIMD too.

For gaming, double precision does matter. And we were talking about a specific GPU, which is used for gaming, not AI. That's why AI chips exist in the first place - dedicated hardware for dedicated tasks (or ASIC for short).

The NVIDIA cards are all dual-use for gaming and compute/ML. Some features like the RTX 4070's Tensor Cores (incl. bfloat16) are there primarily for ML, and other features like ray tracing are there for gaming.

The NVIDIA cards were used for mining crypto-coins too, and they did that successfully for years, before being made obsolete in that area by ASICs. Now it's time for the same thing to happen in AI/ML, which is why AI chips are being developed: they are the ASICs for this domain. That's the big picture. In 2 to 3 years no one is going to use NVIDIA gaming cards for AI/ML anymore, no matter how many GFLOPS the future 5000/6000 series are going to offer. They will be for gaming only. End of story.

ASICs aren't magic - they are just chips designed to do a single function fast (e.g. run a crypto mining algorithm) as an alternative to using a general purpose CPU/GPU whose generality comes at the cost of some performance overhead.

If your application calls for generality - like a gaming card's need to run custom shaders, or an ML model's need to run custom compute kernels, then an ASIC won't help you. These applications still need a general purpose processor, just one that provides huge parallelism.

It seems you may be thinking that all an ML chip does is matrix multiplication, and so a specialized ASIC would make sense, but that's not the case - an ML chip needs to run the entire model - think of it as a PyTorch accelerator, not a matmul accelerator.

Finally, the market for consumer (vs data center) ML cards is tiny relative to the gaming market, and these chips/cards are expensive to develop. Unless this changes, it doesn't make sense for companies like NVIDIA to develop ML-only cards when with minimal effort they can leverage their data center designs and build dual-use GPU/compute consumer cards.
