
>These are primarily focused on machine learning training where you do backpropagation through huge matrix operations

How is this different from normal matrix operations?




Scale, and only scale. As with most such extensions, there's negative value if you're only doing a small number of calculations, but it pays off at larger scales like ML training.

We've been multiplying matrices since forever, but AMX has only recently become a thing; Intel just introduced their own AMX instructions.
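Rough sketch of what the Intel flavour looks like, using the AMX-INT8 tile intrinsics from immintrin.h (assumes a Sapphire Rapids-class CPU, Linux, and gcc -O2 -mamx-tile -mamx-int8; the all-ones inputs dodge the VNNI data-layout details, so read it as the shape of the API rather than a real kernel):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux: request permission for AMX tile state */
    #define XFEATURE_XTILEDATA  18

    /* 64-byte tile configuration in the layout LDTILECFG expects. */
    typedef struct {
        uint8_t  palette_id;
        uint8_t  start_row;
        uint8_t  reserved[14];
        uint16_t colsb[16];   /* bytes per row, per tile register */
        uint8_t  rows[16];    /* rows, per tile register */
    } __attribute__((packed)) tilecfg;

    int main(void) {
        /* The kernel keeps AMX state disabled until a process asks for it. */
        if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
            puts("no AMX permission / not supported");
            return 1;
        }

        tilecfg cfg = {0};
        cfg.palette_id = 1;
        cfg.rows[0] = 16; cfg.colsb[0] = 64;  /* tmm0: 16x16 int32 accumulator */
        cfg.rows[1] = 16; cfg.colsb[1] = 64;  /* tmm1: 16x64 int8 "A" tile */
        cfg.rows[2] = 16; cfg.colsb[2] = 64;  /* tmm2: 16x64 int8 "B" tile */
        _tile_loadconfig(&cfg);

        int8_t  a[16][64], b[16][64];
        int32_t c[16][16];
        memset(a, 1, sizeof a);  /* all-ones inputs: every output element should be 64 */
        memset(b, 1, sizeof b);

        _tile_zero(0);
        _tile_loadd(1, a, 64);            /* load A with a 64-byte row stride */
        _tile_loadd(2, b, 64);            /* load B */
        _tile_dpbssd(0, 1, 2);            /* tmm0 += int8 dot products of tmm1 and tmm2 */
        _tile_stored(0, c, 64);           /* write the int32 accumulator back to memory */
        _tile_release();

        printf("c[0][0] = %d (expect 64)\n", c[0][0]);
        return 0;
    }

A single _tile_dpbssd there does a 16x16x64 chunk of multiply-accumulates in one instruction, which is exactly the kind of throughput that only pays off once the matrices are big enough to keep the tiles fed.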


I don't know. If you want to perform DL inference on 4K/8K video in real time, you're gonna need some heavy-duty matrix multiplication resources. A GPU is great for batched inference, but for quick, no-PCIe-transfer, small-to-no-batching inference you want something close to the CPU...
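For the no-PCIe-transfer point, a back-of-envelope calculation (the ~12 GB/s effective PCIe 3.0 x16 figure is my assumption, not something from the thread):

    #include <stdio.h>

    int main(void) {
        double frame_bytes = 3840.0 * 2160.0 * 3.0;  /* one 4K RGB frame, ~24.9 MB */
        double pcie_bps    = 12e9;                   /* assumed ~12 GB/s effective, PCIe 3.0 x16 */
        double one_way_ms  = frame_bytes / pcie_bps * 1e3;
        printf("4K frame: %.1f MB, one-way PCIe copy: ~%.1f ms\n",
               frame_bytes / 1e6, one_way_ms);
        /* At 60 fps the whole frame budget is ~16.7 ms, so roughly 2 ms each way
           spent on copies alone is a real chunk before any inference happens. */
        return 0;
    }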


Then it's good that the A13/A14/M1 have a neural inference engine, the latter featuring 11 trillion operations per second, using shared memory with the CPU.
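To put that 11 trillion ops/s in matmul terms (it's the advertised peak, so an upper bound; the 1024-cube size is just an example):

    #include <stdio.h>

    int main(void) {
        double peak_ops   = 11e12;              /* advertised 11 trillion ops/s */
        double n          = 1024.0;
        double matmul_ops = 2.0 * n * n * n;    /* 2*n^3: n multiplies + n adds per output element */
        printf("1024x1024x1024 matmul at peak: ~%.2f ms\n",
               matmul_ops / peak_ops * 1e3);    /* ~0.2 ms, ignoring memory and dispatch */
        return 0;
    }

That of course ignores how quickly the engine can be fed and woken up, which is the question raised in the reply below.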


We're talking about INT4/INT8 or bfloat16 TOPS, right? And if it's similar to other neural inference engines, VPUs, TPUs, etc., it's probably powered off except for heavy-duty work and slow to power back up, whereas re-powering an in-CPU matmul block might be faster?

Otherwise I don't see why there'd be a need for something like a matmul-dedicated instruction set. What's your guess?



