Demystifying NPUs: Questions and Answers (thechipletter.substack.com)
99 points by rbanffy 14 days ago | 33 comments



This article is useless; it gives no clue about what an NPU's basic operations are or how to program one. The comments in this thread about that are far more informative.

Thanks. I'm the author. I added some comments to this thread to try to help readers understand NPUs better. However, I'm afraid that seeing this - which is pretty rude - as the top-voted comment means that I won't be reading or commenting on HN threads about my writing anymore.

Skimmed about two-thirds of this. It's unclear to me whether code needs to be specially written to run on the NPU.

Normally I think of something like CUDA to get code running on a GPU. Can code targeting a CPU or GPU automatically be sped up by running on an NPU? Or does code explicitly need to target the NPU?


Most NPUs are not directly end-user programmable. The vendor usually provides a custom SDK that allows you to run models created with popular frameworks on their NPUs. Apple is a good example since they have been doing it for a while. They provide a framework called CoreML and tools for converting ML models from frameworks such as PyTorch into a proprietary format that CoreML can work with.
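To make that concrete, here's roughly what the PyTorch-to-CoreML path looks like (a minimal sketch; the exact coremltools arguments vary by version, and the model here is just a placeholder):

    import torch
    import coremltools as ct

    # Placeholder model; in practice this would be your trained network.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 64),
        torch.nn.ReLU(),
    ).eval()

    example = torch.rand(1, 128)
    traced = torch.jit.trace(model, example)          # capture the graph

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=example.shape)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_NE,      # prefer the Neural Engine
    )
    mlmodel.save("model.mlpackage")

Note that even here you never tell the NPU what to do at the instruction level; CoreML decides at runtime whether a given layer actually runs on the Neural Engine, the GPU, or the CPU.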

The main reason for this lack of direct programmability is that NPUs are fast-evolving, optimized technology. Hiding the low-level interface allows the designer to change the hardware implementation without affecting end-user software. For example, some NPUs can only work with specific data formats or layer types. Early NPUs were very simple convolution engines based on DSPs; newer designs also have built-in support for common activation functions, normalization, and quantization.
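As a rough illustration of the quantization part, this is the kind of int8 mapping that newer NPUs implement in hardware (plain NumPy, purely to show the arithmetic, not any vendor's actual scheme):

    import numpy as np

    def quantize(x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
        scale = np.abs(x).max() / qmax              # symmetric, per-tensor scale
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(64, 32).astype(np.float32)
    q, s = quantize(w)
    print("max round-trip error:", np.abs(w - dequantize(q, s)).max())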

Maybe one day these things will mature enough to have a standard programming interface, but I am skeptical about that becoming a reality any time soon. Some companies (like Tenstorrent) are specifically working on open architectures that will be directly programmable; I'm not sure whether their approach translates to embedded NPUs, though. What would be nice is an open graph-based API and a model format for specifying and encoding ML models.


My understanding is that the low-level APIs are different: CUDA vs. whatever Android provides vs. whatever Apple provides. However, a higher-level abstraction like PyTorch may be able to target different platforms with fewer code changes.
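For example, at the PyTorch level the same model code can be pointed at whichever backend is present (a sketch; note that "mps" targets the Apple GPU via Metal, not the Neural Engine, which is only reachable through the CoreML export path):

    import torch

    # Pick whatever accelerator backend this build of PyTorch can see.
    if torch.cuda.is_available():
        device = torch.device("cuda")                 # NVIDIA GPU via CUDA
    elif torch.backends.mps.is_available():
        device = torch.device("mps")                  # Apple GPU via Metal
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(16, 4).to(device)
    x = torch.rand(1, 16, device=device)
    print(model(x))                                   # same code, different backend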


CUDA is, of course, NVIDIA-proprietary.

It will be interesting to see whether other high-level accelerator-supporting languages like Chapel, Futhark, or JAX end up getting NPU backends; it might give them a nice boost over the proprietary C++-inspired language.

Edit: JAX has TPU support.


Support for high-level backends almost always comes down to C++ either way. That goes for NPUs (which are really fragmented anyway, i.e. there is no uniform API), and also for JAX's TPU backend (which IIRC uses XLA).

C++ stings less when it's only an API used in the low-level machinery under the hood, as long as you, as an application author, don't have to write code in it.

I haven't done an in-depth look, but most matrix-math accelerators (e.g. AMD, Intel, and Apple) seem to provide C/C++/Python APIs for describing the computations; the code executing on the NPU is not compiled from user C++ code.

Apparently, e.g. in Intel's stack, there's a custom runtime compiler consuming this kind of IR (intermediate representation) in the accelerator software stack: https://docs.openvino.ai/2022.3/openvino_ir.html & https://github.com/intel/linux-npu-driver

And from the user's POV, AMD doesn't seem too different: https://ryzenai.docs.amd.com/en/latest/devflow.html
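For the Intel/OpenVINO case, the user-facing flow is roughly this (a sketch; the "NPU" device name and API details depend on the OpenVINO version and driver, and "model.xml" is a placeholder IR file produced offline):

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.xml")              # OpenVINO IR (xml + bin)
    compiled = core.compile_model(model, "NPU")       # runtime compiles for the NPU plugin

    x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # hypothetical input shape
    result = compiled([x])                            # run one inference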


Yes, but this is not what I am getting at.

"Will be interesting to see if other high level accelerator supporting languages like Chapel or Futhark or JAX end up getting NPU backends, it might give them a nice boost over the proprietary C++ inspired language."

As you say, the GPU (or NPU, TPU, ...) doesn't run C++ or anything derived from it. The "runtime" (~backend) will usually emit some kind of hardware-dependent format (or, again, an IR) like SPIR-V, PTX, etc.

But the backend itself is usually written in C++ (for performance reasons), and there is really no way to get around that. Interacting with it from Python (or JAX) is a usability win, but there is zero difference in functionality. I.e. there is no proprietary C++-inspired language in play here, and hence no way to get a boost.
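You can actually watch that happen from the Python side; recent JAX versions let you inspect the IR that jit hands to the (C++) XLA backend before it's compiled for the device (illustrative only; this API has moved around between releases):

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sin(x) @ x.T

    x = jnp.ones((4, 4))
    lowered = jax.jit(f).lower(x)      # trace Python down to StableHLO
    print(lowered.as_text()[:400])     # the IR the XLA backend compiles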


Right. I was focusing more on the CUDA-kernels-on-NPU line of thought from patrikthebold's message and its alternatives, which, as you say, are not a reality now either.

In the JAX-style implementation scenario, the compiler part of JAX is a better inspiration, maybe along the lines of this case study of a path tracer running on a TPU: https://blog.evjang.com/2019/11/jaxpt.html - I don't think Chapel or Futhark would adopt the same approach as such, but it's at least some kind of existence proof of a compiler targeting it from a high-level language for non-machine-learning code.
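A minimal flavour of that route: an arbitrary, non-ML numeric kernel traced once and compiled by XLA for whatever backend is present (CPU, GPU, or TPU); nothing about the Python code is device-specific:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def ray_sphere_hits(origins, dirs, radius=1.0):
        # toy path-tracer-style test: does each unit ray hit a sphere at the origin?
        b = 2.0 * jnp.sum(origins * dirs, axis=-1)
        c = jnp.sum(origins * origins, axis=-1) - radius ** 2
        return (b * b - 4.0 * c) >= 0.0

    origins = jnp.ones((1024, 3)) * 3.0
    dirs = -origins / jnp.linalg.norm(origins, axis=-1, keepdims=True)
    print(ray_sphere_hits(origins, dirs).sum())       # all 1024 rays hit
    print(jax.devices())                              # whichever backend XLA compiled for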


I'm also wondering whether such an NPU can be targeted (in a way that can be understood) from the assembly/machine-language level, or whether it needs an opaque kitchen sink of libraries, blobs, and other abstractions.


You can write assembly for NPUs, although the instruction set may be quite janky compared to CPUs. Once you've written NPU code you need some libraries to load it, but that's not particularly different from the fact that CPUs now need massive firmware to boot.

Back in reality, that's not how any vendor intends their NPUs to be used. They provide high-level libraries that implement ONNX, CoreML, DirectML, or whatever, and they expect you to just use those.
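In practice that looks something like this with ONNX Runtime (a sketch; provider names are vendor-specific, e.g. "QNNExecutionProvider" for Qualcomm NPUs or "DmlExecutionProvider" for DirectML, and "model.onnx" plus the input shape are placeholders):

    import numpy as np
    import onnxruntime as ort

    # Ask for an NPU-capable execution provider if this build has one,
    # otherwise fall back to the CPU provider.
    wanted = ["QNNExecutionProvider", "CPUExecutionProvider"]
    providers = [p for p in wanted if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=providers)

    input_name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder input
    outputs = session.run(None, {input_name: x})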


I believe it's the latter; each NPU vendor has their own software stack. Take a look at Tomeu Vizoso's work:

https://blog.tomeuvizoso.net/search/label/npu


Opaque blobs I believe in almost all cases!


I agree, I don't understand the difference. Are the calculations an NPU is capable of doing different from a GPU's?

Are they not basically identical hardware?


NPUs are basically specialized for matrix multiplication; GPUs are for more general parallel operations on multiple data (Single Instruction, Multiple Threads), although modern GPUs also contain matrix multiplication units.

There may be a degree of software compatibility at the highest level - e.g. PyTorch - but the underlying software will be very different.


By the same reasoning, a CPU is no different from a GPU. They can both do matrix calculations.

A GPU is optimised for 3D rendering (and is useful for parallel computations in general). An NPU is optimised for neural network inferencing. These algorithms both involve matrix mathematics, but they are not the same. The NPU hardware design matches the deep neural network inferencing algorithm; for example, it has an "Activation Function" block dedicated to computing the activation function between neural network layers. It is optimised and specialised for one very specific algorithm: inferencing. A GPU would beat an NPU for training and for any other parallel computation besides inferencing.
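A toy sketch of the dataflow being described: a low-precision MAC array feeding a dedicated activation stage (plain NumPy, purely to illustrate the shape of the pipeline, not any vendor's actual design):

    import numpy as np

    def npu_like_layer(x_q, w_q, scale, activation=np.maximum):
        acc = x_q.astype(np.int32) @ w_q.astype(np.int32)   # MAC array, int8 inputs
        y = acc.astype(np.float32) * scale                   # rescale / dequantize
        return activation(y, 0.0)                            # activation-function block (ReLU)

    x = np.random.randint(-128, 128, size=(1, 64), dtype=np.int8)
    w = np.random.randint(-128, 128, size=(64, 32), dtype=np.int8)
    print(npu_like_layer(x, w, scale=0.01).shape)            # (1, 32)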


Modern GPUs have cores optimized for different things, including specifically the things that NPUs (as defined by the article) are all about. E.g. here's Nvidia:

https://developer.nvidia.com/blog/programming-tensor-cores-c...


High-end Nvidia GPUs are not optimized for 3D rendering; they are optimized for machine learning and inference.


You're right that "GPU" is a misnomer these days. The top-end chips are for HPC and scientific computing in general: good for training, inference, and non-AI applications such as molecular dynamics and other physics simulations, and they can accelerate FFTs and linear solvers. Their applications are broader than AI/ML.

Depends on the chips. I've been hearing from people that the Nvidia ones (due to various design changes that favour low precision over high precision) are becoming more and more useless for traditional HPC, and so there's a move to switch to AMD (noting that the software side of AMD is still not good). https://news.ycombinator.com/item?id=40655965 suggests Apple's hardware would have similar issues.

They're not really GPUs then, are they? Even if they're capable of generating a video output.


They are GPUs, but standing for General Processing Unit. "GPU" the acronym has persisted from sheer force of tradition and habit, but what it means has changed drastically over the past several years between cryptocurrency and now "AI".

I wonder if NPU will supersede GPU as the General Processing Unit now that it has finally entered the wider lexicon, relegating GPU back to Graphics Processing Unit or video cards.

And no, GPGPU (General-Purpose Graphics Processing Unit) is a bloody stupid term, to be bluntly honest.


Acronyms don’t just change their meaning through a lack of tradition and habit. Etymology doesn’t have an expiration date.

Very rarely, an acronym is “retconned” into a more appropriate expansion or simply starts being considered a regular word not standing for anything.

I’d strongly challenge your assertion that this has happened for “GPU”.


That might be how you use the term, but GPU is still almost universally used to mean graphics processing unit. If you want to use a ubiquitous acronym to mean something different, you need to define it first.

>GPU is still almost universally used to mean graphics processing unit.

No, GPU is almost universally used to mean GPU. There is nothing graphical about cryptocurrency, "AI" (sans image generation), protein crunching, or whatever else they are being used for.

I question how many people are even aware the G is supposed to stand for Graphics anymore. The nomenclature is outdated and doesn't reflect reality anymore.


> I question how many people are even aware the G is supposed to stand for Graphics anymore.

Most people aren't aware, because the industry never called attention to the morphing definition of GPU.

At that, CPUs aren't Central (for large-scale array-oriented computing workloads) anymore either. (They still are for enterprise or web workloads. Or, they're "Central" in terms of coordinating GPUs and moving data around. But no longer "Central" in terms of "does most of the computing".)


The current GPUs aren't really that generally applicable though, as they are most useful when you need to do a bunch of number crunching on a lot of data of the same structure—which is why we still run compilers on CPUs.

If you are offended by the term GPGPU, maybe we could use the name Compute Processing Unit :-).


>If you are offended by the term GPGPU, maybe we could use the name Compute Processing Unit :-)

I can certainly drink to that.


An NPU can do only a very small subset of the operations supported by a GPU.

An NPU does only the operations required for ML inference, which use low-precision data types, i.e. 16-bit or 8-bit types.


Different optimization choices.


> Are they not basically identical hardware?

For example, Apple's NPU can't do FP32 precision; it can only do FP16 and lower.
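A quick illustration of what FP16-only hardware gives up: float16 has an 11-bit significand, so above 2048 not every integer is representable and small increments round away.

    import numpy as np

    print(np.float16(2048) + np.float16(1))   # 2048.0 -- the +1 is rounded away
    print(np.float32(2048) + np.float32(1))   # 2049.0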


Is there a good overview somewhere of all the arithmetic primitives used in deep learning? I want to understand why libcudnn* is taking 1.7GB of disk space.


