Interesting article, thanks, IMHO mostly for the low level performance analysis.
When it comes to actual computation of convolutions, the fast Fourier transform should at least be mentioned, even if only in passing. Early in grad school I peeked at the source for R's density() function, and was blown away that it was using the FFT, and that I had not picked up that trick in my math classes (or maybe I had just forgotten it...)
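For anyone who hasn't seen the trick: convolution in the time domain is pointwise multiplication in the frequency domain, which drops the cost from O(n·m) to O(n log n). A minimal sketch with numpy (not R's actual implementation, just the same idea):

```python
import numpy as np

def fft_convolve(a, b):
    # Linear convolution via FFT: zero-pad both signals to the
    # full output length, multiply their spectra, and invert.
    n = len(a) + len(b) - 1
    A = np.fft.rfft(a, n)
    B = np.fft.rfft(b, n)
    return np.fft.irfft(A * B, n)

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.5])
direct = np.convolve(a, b)   # O(n*m) direct convolution
fast = fft_convolve(a, b)    # O(n log n) via FFT
assert np.allclose(direct, fast)
```

scipy.signal.fftconvolve does the same thing (with more care about picking FFT-friendly sizes), and is what you'd reach for in practice.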
These sorts of computations generally just get fed bigger inputs as compute gets better.
Also, plenty of Threadrippers already exist out there; if you get access to a cluster, it might have any type of chip in it. If I have access to a cluster with many 7995's, I don't really care too much about what's available on the consumer side.
Also, I checked and apparently NVIDIA CUTLASS now supports generic convolutions: https://github.com/NVIDIA/cutlass