
Block-Sparse GPU Kernels - stablemap
https://blog.openai.com/block-sparse-gpu-kernels/
======
chroem-
The thing is, it's a common misconception that neural networks are somehow
intrinsically related to linear algebra. Matrix multiplications are just a
convenient way to build functions with lots of tuneable degrees of freedom.
Just as in finite element simulations, sparse matrices tend to hint that the
underlying problem is really graph-based in nature.
While I have no way to prove this, I've strongly suspected for a while that
most of the weights in dense-matrix deep learning models don't actually have
an effect on the output, and that we've been unnecessarily burning cycles to
compute their products. The trouble of course is figuring out which ones are
useful and which ones aren't.

~~~
sanxiyn
This is the idea behind "Learning both Weights and Connections for Efficient
Neural Networks (2015)", and yes, it works:
[https://arxiv.org/abs/1506.02626](https://arxiv.org/abs/1506.02626)

How to figure out which ones are useful and which ones aren't? Why, you can
try the simplest thing that could possibly work. Quoting the paper: "All
connections with weights below a threshold are removed from the network". Is
that all? Yes it is.
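
For concreteness, here's a minimal sketch of that magnitude-based pruning step (my own illustration in NumPy, not code from the paper): zero out every weight whose absolute value falls below a threshold, and keep a mask so the pruned connections stay removed during retraining.

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out connections whose |weight| is below the threshold.

    Returns the pruned weight matrix and a boolean mask of surviving
    connections, so retraining can keep pruned weights at zero.
    """
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Toy example: a small dense weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))

W_pruned, mask = magnitude_prune(W, threshold=0.05)
print("kept", mask.sum(), "of", mask.size, "weights")

# During retraining, gradients for pruned weights would be masked out too,
# e.g. W -= learning_rate * (grad * mask)
```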

~~~
grenoire
Interesting research; I would have suspected that there would be a snowball
effect where those minuscule weights add up to significant changes in the end,
but apparently not.

