
Efficient Methods and Hardware for Deep Learning [pdf] - godelmachine
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture15.pdf
======
varelse
See lecture 15 of cs231n 2017:
https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PLC1qU-LWwrF64f4QKQT-Vg5Wr4qEE1Zxk

Further, while all these tricks are impressive, they are not currently of
much use to fungible data scientists at FAANG and in the greater Fortune 500,
because none of them have been added to the major frameworks in a
braindead/automagic way. And most of them are only one GPU generation away
from general availability if they catch on, as FP16/32 mixed precision did.
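
(To illustrate the bar: mixed precision is the one trick that did get the
braindead treatment. In PyTorch it now looks roughly like the sketch below,
via the torch.cuda.amp module; nothing else on this list has an equivalent.)

    import torch

    # Mixed precision as a near-one-liner: autocast picks FP16/FP32 per op,
    # GradScaler handles FP16 loss scaling automatically.
    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()

    x = torch.randn(32, 1024, device="cuda")
    y = torch.randn(32, 1024, device="cuda")

    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(opt)                # unscales grads, then steps
    scaler.update()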

That said, the acquisition of DeePhi by Xilinx ought to be a sufficiently
strong signal to add them to said frameworks, no?

https://www.xilinx.com/news/press/2018/xilinx-announces-the-acquisition-of-deephi-tech.html

Finally, my gut instinct (based on the monotonically decreasing speech
recognition quality of my increasingly powerful Android phones over time, and
the bizarre spelling errors they now make, such as "tech kneeq") is that one
cannot judge the quality of these techniques from a single metric, and it
seems the decision to foist them on users may have been based on exactly one.
#ItsAdversariesAllTheWayDown

~~~
borramakot
I spent a little time trying to understand exactly what DeePhi provides, but
couldn't find any whitepapers or anything. Do you know of any documentation
on which of these approximations DeePhi supports?

~~~
varelse
I don't know which of these techniques they actually built. My experience
with them is several encounters with salespeople from multiple vendors who
present them running on FPGAs and then compare performance to
non-approximated versions of the same network on a GPU, to show amazing speed
gains with minimal reduction in test-set metrics. This is, of course, not the
whole story.

My response is to tell them to implement the same tricks on a GPU and compare
against that if they expect me to believe the throughput delta. I usually get
some hand-waving about how that's not possible or pointless, yadda yadda
yadda, and the conversation goes downhill from there. That doesn't mean I
don't like these tricks or that I don't think they're useful; it means that,
based on the scars from a very long industry career, I don't buy snake oil.

------
gandreani
Wow, I'm a total novice, but as far as I know most of these model
optimization methods (pruning, weight sharing, quantization, low-rank
approximation, binary/ternary nets, the Winograd transformation) aren't
really implemented in most ML frameworks. Would be fun to try them out one of
these weekends.
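
If anyone wants a head start: two of these fit in a few lines of numpy. A
minimal sketch of magnitude pruning and symmetric uniform quantization (just
the core idea, not how a real framework would do it):

    import numpy as np

    def magnitude_prune(w, sparsity=0.9):
        # Unstructured pruning: zero out the smallest-magnitude weights.
        k = int(w.size * sparsity)
        thresh = np.partition(np.abs(w).ravel(), k)[k]
        return np.where(np.abs(w) < thresh, 0.0, w)

    def quantize_uniform(w, bits=8):
        # Symmetric uniform quantization to signed ints, then dequantize.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        return np.round(w / scale).clip(-qmax, qmax) * scale

    w = np.random.default_rng(0).standard_normal((64, 64))
    w_sparse = magnitude_prune(w)
    w_q = quantize_uniform(w_sparse)
    print((w_sparse == 0).mean())        # ~0.9 sparsity
    print(np.abs(w_q - w_sparse).max())  # rounding error <= scale / 2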

~~~
twtw
The Winograd transform probably wouldn't be exposed at the framework layer. I
wouldn't be surprised if cuDNN, TensorRT, or the TPU driver already use it
internally.

~~~
liuliu
It would actually be interesting to quantize the Winograd-transformed weights
directly and see whether we can squeeze some accuracy out (since Winograd is
numerically more accurate than direct convolution). It's probably too much
work for a potentially marginal improvement, though, and you will get better
results by incorporating quantization into your training process anyway.
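
A toy numpy sketch of that comparison, assuming the standard F(2x2, 3x3)
Winograd matrices from Lavin & Gray and a naive symmetric int8 quantizer (not
anyone's production scheme):

    import numpy as np

    # Standard Winograd F(2x2, 3x3) transform matrices (Lavin & Gray, 2015).
    BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                   [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
    G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
    AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

    def quantize(w, bits=8):
        # Naive symmetric uniform quantization, then dequantize.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        return np.round(w / scale).clip(-qmax, qmax) * scale

    def winograd(d, U):
        # 2x2 output tile from a 4x4 input tile d and transformed filter U.
        return AT @ ((BT @ d @ BT.T) * U) @ AT.T

    rng = np.random.default_rng(0)
    g = rng.standard_normal((3, 3))  # spatial 3x3 filter
    d = rng.standard_normal((4, 4))  # 4x4 input tile

    exact = winograd(d, G @ g @ G.T)
    # (a) quantize the spatial filter, then transform it
    err_a = np.abs(winograd(d, G @ quantize(g) @ G.T) - exact).max()
    # (b) quantize the transformed filter directly, as suggested above
    err_b = np.abs(winograd(d, quantize(G @ g @ G.T)) - exact).max()
    print(err_a, err_b)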

~~~
borramakot
Is there a way to do nonlinearity (ReLU) in the Winograd domain?

~~~
liuliu
Assuming you mean the Fourier domain: it is not very useful to keep the
activations in that domain. The Winograd transform operates on 4x4 tiles (it
is effective because 16 FMAs on a 4x4 tile generate a 2x2 block of
activations, roughly speaking, while direct convolution requires 3x3 = 9 FMAs
to generate one activation). If you keep the activations in that domain, you
are looking at keeping 4x4 / 2x2 = 4x more data than is needed, which is
neither useful nor desirable.

Thus, most implementations just keep the transformed weights (you are still
looking at 4x4 / 3x3 ≈ 1.8x more data), which can be somewhat justified by
the computation savings. Keeping the transformed activations is not useful.
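
To make those numbers concrete, here is a minimal numpy check that an
F(2x2, 3x3) Winograd tile matches direct convolution while spending 16
multiplies per 2x2 output block (4 per output) versus 9 per output for the
direct form:

    import numpy as np

    # Winograd F(2x2, 3x3): Y = AT [(G g G^T) * (BT d BT^T)] AT^T
    BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                   [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
    G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
    AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

    rng = np.random.default_rng(1)
    d = rng.standard_normal((4, 4))  # input tile
    g = rng.standard_normal((3, 3))  # filter

    # 16 elementwise multiplies yield a full 2x2 output block.
    y_wino = AT @ ((G @ g @ G.T) * (BT @ d @ BT.T)) @ AT.T

    # Direct form: 9 multiplies per output, 36 for the same 2x2 block.
    y_direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                         for i in range(2)])

    assert np.allclose(y_wino, y_direct)  # same result, 2.25x fewer multiplies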

