Interestingly, GPUs (using an unoptimized 32-bit implementation, per the authors) outperform FPGAs in this white paper for dense networks. Sparse networks were not tested on GPU. My takeaway here is that a sparse integer implementation on GPU could possibly reach similar performance?
1. We initialized the weights using a sparse random mask, so that only a fraction of the weights contain non-zero values.
2. We created sparse activations by maintaining only the top-k active units of each layer; the rest are set to zero.
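The two steps above can be sketched in numpy. This is my own toy illustration of the described scheme, not the authors' code; the function names and the `density` parameter are assumptions:

```python
import numpy as np

def sparse_init(shape, density=0.1, rng=None):
    # Weight matrix where only a `density` fraction of entries are non-zero
    # (the sparse random mask from step 1).
    rng = rng or np.random.default_rng(0)
    w = rng.standard_normal(shape)
    mask = rng.random(shape) < density
    return w * mask

def topk_activations(x, k):
    # Keep the k largest activations in each row; zero out the rest (step 2).
    idx = np.argsort(x, axis=-1)[..., :-k]  # indices of all but the top-k
    out = x.copy()
    np.put_along_axis(out, idx, 0.0, axis=-1)
    return out
```

Note that which weights survive `sparse_init` is entirely random, which matters for the hardware discussion below.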
So which units are dropped is essentially random. GPUs being SIMD (meaning 32-128 execution pipelines share an instruction decoder) means you have to run a large number of threads in lockstep, so this kind of unstructured sparsity simply means you do a lot of multiplications by zero, or run no-ops. You can save power on no-ops but can't gain performance. To get fast execution on GPUs you can, for example:
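A quick sketch of why randomly-placed zeros don't help: the dense matmul below performs the same number of multiply-adds whether 90% of the weights are zero or none are, because every lane still executes the multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w *= rng.random((256, 256)) < 0.1   # ~90% of entries are now exactly zero
x = rng.standard_normal((1, 256))

# Same 256*256 multiply-adds as the fully dense case; the zeros are
# multiplied like any other value, so no time is saved.
y = x @ w
```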
1. Do what Nvidia did on Ampere and define a constant-rate compression where you drop exactly 2 of every 4 values. This allows you to specialise your HW for that pattern, but the speedup is fixed at ~2x.
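A minimal sketch of that 2:4 pruning pattern (magnitude-based selection is my assumption here; the hardware only requires that exactly 2 of every 4 consecutive values are zero):

```python
import numpy as np

def prune_2_of_4(w):
    # In each contiguous group of 4 weights, keep the 2 largest magnitudes
    # and zero the other 2. Requires w.size to be a multiple of 4.
    flat = w.reshape(-1, 4)
    idx = np.argsort(np.abs(flat), axis=1)[:, :2]  # 2 smallest per group
    out = flat.copy()
    np.put_along_axis(out, idx, 0.0, axis=1)
    return out.reshape(w.shape)
```

Because the ratio is fixed, the HW can store just the 2 survivors plus a small index per group, which is where the fixed ~2x comes from.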
2. Create structured sparsity (i.e. variable-rate compression) in your network where you drop entire NxN matrix blocks (where on e.g. Nvidia N = 128), so you can skip an entire HW pass over the data. This is possible but more complicated than simply thresholding the activations, but it's the most efficient.
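The block-skipping idea can be sketched as a matvec that only touches non-zero blocks; real block-sparse kernels work the same way, with the block list built once at pruning time rather than checked per call:

```python
import numpy as np

def block_sparse_matvec(w, x, block=4):
    # Multiply only the non-zero blocks of w; an all-zero block means
    # the corresponding HW pass over that tile is skipped entirely.
    n, m = w.shape
    y = np.zeros(n)
    for i in range(0, n, block):
        for j in range(0, m, block):
            blk = w[i:i+block, j:j+block]
            if np.any(blk):  # zero block -> skip
                y[i:i+block] += blk @ x[j:j+block]
    return y
```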
3. Use elementwise sparsity but compress the representation using e.g. bitmasks. This is only worth it if you have a large degree of sparsity, because handling the data indices takes more instructions than simply multiplying by zero.
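A sketch of the bitmask representation (my own illustration): store only the non-zero values plus one bit per position. The round trip below shows the extra index-handling work that has to be amortised by the sparsity.

```python
import numpy as np

def compress(w):
    # Non-zero values in row-major order, plus a packed bitmask of positions.
    mask = w != 0
    return w[mask], np.packbits(mask)

def decompress(values, packed, shape):
    # Rebuild the dense array by scattering values back to the masked slots.
    n = int(np.prod(shape))
    mask = np.unpackbits(packed, count=n).astype(bool)
    out = np.zeros(n)
    out[mask] = values
    return out.reshape(shape)
```

At, say, 50% sparsity the unpack/scatter overhead easily exceeds the skipped multiplies; at 95%+ the compressed form wins on both memory traffic and compute.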