Sparsity Enables 50x Performance Acceleration in Deep Learning Networks [pdf] (numenta.com)
30 points by eightysixfour 17 days ago | 3 comments



This is partly down to the FPGA implementation rather than GPUs.

Interestingly, GPUs (an unoptimized 32-bit implementation, per the authors) outperform FPGAs in this white paper for dense networks. Sparse networks were not tested on GPU. My takeaway here is that a sparse integer implementation on GPU could possibly reach similar performance?


Probably not as much. The paper describes the sparsity they use as follows (a rough code sketch of both mechanisms follows the quotes):

1. We initialized the weights using a sparse random mask, so that only a fraction of the weights contain non-zero values.

2. We created sparse activations by maintaining only the top-k active units of each layer; the rest are set to zero.
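
A rough NumPy sketch of those two mechanisms, not Numenta's code; the layer sizes, weight density and k are made-up values:

    # Sketch of the paper's two sparsity mechanisms (illustrative values only)
    import numpy as np

    rng = np.random.default_rng(0)

    # 1. Sparse weights: a random mask so only a fraction of weights are non-zero
    n_in, n_out, weight_density = 128, 64, 0.1          # assumed values
    mask = rng.random((n_in, n_out)) < weight_density
    W = rng.standard_normal((n_in, n_out)) * mask

    # 2. Sparse activations: keep only the top-k units of a layer, zero the rest
    def topk_activations(x, k=8):                       # k is an assumed value
        out = np.zeros_like(x)
        idx = np.argpartition(x, -k)[-k:]               # indices of the k largest
        out[idx] = x[idx]
        return out

    x = rng.standard_normal(n_in)
    y = topk_activations(np.maximum(W.T @ x, 0.0))      # ReLU, then keep top-k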

So it's basically random which units are dropped. GPUs are SIMD machines (32-128 execution pipelines share an instruction decoder), so you have to run a large number of threads in lockstep; this kind of unstructured sparsity just means you do a lot of multiplications by zero, or run no-ops. You can save power on no-ops, but you can't gain performance. To actually get speedups on GPUs you can, for example (rough sketch of 1 and 2 after the list):

1. Do what Nvidia did on Ampere and define a constant-bitrate compression where you drop exactly 2 of every 4 values. This lets you specialise your HW for that compression, but your speedup is fixed at ~2x.

2. Create sparsity (i.e. variable-bitrate compression) in your network where you drop entire NxN matrix blocks (where on e.g. Nvidia N = 128), so you can skip an entire HW pass over the data. This is more complicated than simply thresholding the activations, but it's the most efficient.

3. Use elementwise sparsity but compress the representation using e.g. bitmasks. This is only worth it if you have a large degree of sparsity (because handling the indices takes more instructions than simply multiplying by zero).
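
A rough NumPy illustration of options 1 and 2, just to show the structure; not vendor code, the shapes, block size and keep ratio are made-up, and the real gains need HW/kernel support (Ampere's sparse tensor cores, block-sparse kernels):

    import numpy as np

    rng = np.random.default_rng(1)

    # Option 1: 2:4 structured pruning -- keep the 2 largest-magnitude values
    # in every group of 4, zero the other 2 (fixed ~2x compression)
    def prune_2_of_4(W):
        W = W.copy()
        groups = W.reshape(-1, 4)                         # needs size % 4 == 0
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
        np.put_along_axis(groups, drop, 0.0, axis=1)
        return groups.reshape(W.shape)

    # Option 2: block sparsity -- zero (and later skip) whole NxN tiles,
    # so a kernel can skip an entire pass over each dropped tile
    def block_mask(shape, n=32, keep=0.5):                # n, keep are assumed
        tiles = rng.random((shape[0] // n, shape[1] // n)) < keep
        return np.kron(tiles, np.ones((n, n)))            # expand tiles to elements

    W = rng.standard_normal((128, 128))
    W_24 = prune_2_of_4(W)
    W_block = W * block_mask(W.shape)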


We've come full circle. Hawkins -> Redwood -> Sparsity -> Hawkins.



