Efficient Methods and Hardware for Deep Learning [pdf] (stanford.edu)
122 points by godelmachine 5 months ago | 12 comments



See lecture 15 of cs231n 2017: https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PLC1qU-LWwr...

Further, while all these tricks are amazing, they are not currently of much use to fungible data scientists at FAANG and in the greater Fortune 500, because they haven't been added to any of the major frameworks in a braindead/automagic manner. Also, most of them are but one GPU generation away from general availability if they become popular, as FP16/FP32 mixed precision did.

That said, the acquisition of DeePhi by Xilinx ought to be a sufficiently strong signal to add them to said frameworks, no?

https://www.xilinx.com/news/press/2018/xilinx-announces-the-...

Finally, my gut instinct (based on the monotonically decreasing speech recognition quality of my increasingly powerful Android phones over time, and the bizarre spelling errors it now makes, "tech kneeq" being one example) is that one cannot rate the quality of these techniques from a single metric, and it seems like the decision to foist them on users may be based on one. #ItsAdversariesAllTheWayDown


I spent a little bit of time trying to understand exactly what DeePhi provides, but couldn't find any whitepapers or anything. Do you know any documentation on which of these approximations are supported by DeePhi?


I don't know which of these techniques they actually built. My experience with these techniques is several encounters with salespeople from multiple vendors who present them running on FPGAs and then compare performance to non-approximated versions of the same network on a GPU to show the amazing speed gains with minimal reduction in test set metrics. This is, of course, not the whole story.

My response is to tell them to implement and compare against the same tricks on a GPU if they expect me to believe the throughput performance delta. I usually get some hand-waving about how that's not possible or pointless, yadda yadda yadda, and the conversation goes downhill from there. That doesn't mean I don't like these tricks or that I don't think they're useful. It means that, based on the scars from a very long industry career, I don't buy snake oil.


Wow, I'm a total novice, but as far as I know most of the optimization methods for models (Pruning, Weight Sharing, Quantization, Low Rank Approximation, Binary / Ternary Net, Winograd Transformation) aren't really implemented in most ML frameworks. Would be fun to try them out one of these weekends.
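For a weekend experiment, two of the methods in that list are easy to try without any framework. Below is a minimal pure-Python sketch of magnitude pruning followed by symmetric uniform quantization on a flat list of weights. The function names (`magnitude_prune`, `quantize_uniform`, `dequantize`) and the toy weights are made up for illustration; this is not any framework's API, and real pipelines usually retrain after pruning.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    # Indices of the k smallest-magnitude weights.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_uniform(weights, bits=8):
    """Symmetric uniform quantization: return integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical toy weights, chosen only for illustration.
weights = [0.82, -0.05, 0.31, -0.67, 0.02, 0.44, -0.09, 0.15]
pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale = quantize_uniform(pruned, bits=8)
restored = dequantize(codes, scale)
```

The rounding error per surviving weight is at most half of `scale`, which is why the choice of bit width and scale matters more as the weight range grows.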


The Winograd transform probably wouldn't be exposed at the framework layer. I wouldn't be surprised if cuDNN/TensorRT/TPU drivers already use it internally.


It would actually be interesting to quantize Winograd-transformed weights directly and then see whether we can squeeze out some accuracy (since Winograd is more accurate numerically than direct convolution). Maybe it's just too much work for a potentially marginal improvement, and you will get a better result by incorporating quantization into your training process anyway.


Is there a way to do nonlinearity (ReLU) in the Winograd domain?


Assuming you mean the Fourier domain: it is not very useful to keep the activations in that domain. The Winograd transform operates on 4x4 tiles (it is effective because 16 FMAs on a 4x4 tile can generate 2x2 activations, roughly speaking, while direct convolution requires 3x3 = 9 FMAs to generate 1 activation). If you keep the activations in that domain, you are keeping 4x4 / 2x2, thus 4x more information than is needed, which is not useful or desirable.

Thus, most optimizations just keep the transformed weights (you are still looking at 4x4 / 3x3, about 1.8x more data), but that can be somewhat justified by the computation cost. The transformed activations are not useful.
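To make the FMA counting above concrete, here is a rough pure-Python sketch of the F(2x2, 3x3) case: one 4x4 input tile and one 3x3 kernel produce a 2x2 output with 16 elementwise multiplies, versus 4 x 9 = 36 for the direct method. The matrices `B_T`, `G` and `A_T` are the commonly published ones for this tile size; the function names are hypothetical, and this is only an illustration, not how cuDNN or any framework implements it.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


# Transform matrices for F(2x2, 3x3).
B_T = [[1, 0, -1, 0],
       [0, 1, 1, 0],
       [0, -1, 1, 0],
       [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
A_T = [[1, 1, 1, 0],
       [0, 1, -1, -1]]


def winograd_2x2_3x3(tile, kernel):
    """2x2 output from a 4x4 tile and a 3x3 kernel via the Winograd transform."""
    u = matmul(matmul(G, kernel), transpose(G))      # 4x4 transformed weights
    v = matmul(matmul(B_T, tile), transpose(B_T))    # 4x4 transformed input
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return matmul(matmul(A_T, m), transpose(A_T))    # back to 2x2


def direct_2x2_3x3(tile, kernel):
    """Same 2x2 output by direct sliding-window correlation (36 multiplies)."""
    return [[sum(tile[i + r][j + c] * kernel[r][c]
                 for r in range(3) for c in range(3))
             for j in range(2)] for i in range(2)]


# Hypothetical toy inputs, chosen only for illustration.
tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
fast = winograd_2x2_3x3(tile, kernel)
slow = direct_2x2_3x3(tile, kernel)
```

Note that `u` depends only on the kernel, so it can be computed once and stored, which is the 4x4 / 3x3 storage overhead mentioned above.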


Not an answer, but I've never heard it called the "Winograd domain." The Winograd transform is an algorithm for computing small discrete Fourier transforms, so it's the same Fourier domain.


cuDNN certainly does, if you use the CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD option (and you can ask it to automatically pick the best approach if you like).


Pruning, Weight Sharing and Quantization are commonly used techniques and can be done in tf.
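As a framework-agnostic illustration of weight sharing (this is plain Python, not TensorFlow's API), the idea can be sketched as 1-D k-means over the weights: each weight is replaced by an index into a small codebook of shared values. The function names, the evenly spaced initialization and the toy weights are assumptions made for this sketch.

```python
def nearest(value, codebook):
    """Index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda c: abs(value - codebook[c]))


def share_weights(weights, k, iters=20):
    """Cluster weights into k shared values; return (indices, codebook)."""
    lo, hi = min(weights), max(weights)
    if k > 1:
        # Evenly spaced starting centroids across the weight range.
        codebook = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        codebook = [sum(weights) / len(weights)]
    for _ in range(iters):
        assign = [nearest(w, codebook) for w in weights]
        for c in range(k):
            members = [w for w, a in zip(weights, assign) if a == c]
            if members:  # leave empty clusters where they are
                codebook[c] = sum(members) / len(members)
    # Final assignment against the final codebook.
    indices = [nearest(w, codebook) for w in weights]
    return indices, codebook


# Hypothetical toy weights, chosen only for illustration.
weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1]
indices, codebook = share_weights(weights, k=3)
shared = [codebook[i] for i in indices]  # weights as the model would see them
```

With k shared values, each weight needs only about log2(k) bits for its index plus the small codebook, which is where the storage saving comes from.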




