Further, while all these tricks are amazing, they are not currently of much use to fungible data scientists at FAANG and in the greater Fortune 500 because they haven't been added to any of the major frameworks in a manner that is braindead/automagic. Moreover, most of these are but one GPU generation away from general availability if they become popular the way FP16/32 mixed precision did.
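For concreteness, here is a minimal sketch of what that "braindead/automagic" level of framework integration eventually looked like for FP16/32 mixed precision (assuming PyTorch and a CUDA device; the model, shapes, and loop are placeholder choices of mine, not anything claimed above): one context manager plus a gradient scaler, no manual cast management.

```python
import torch

# Toy model and optimizer; the point is the framework-level AMP API, not the network.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

# Synthetic batch so the sketch is self-contained.
x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # ops run in FP16 where safe, FP32 elsewhere
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                # loss scaling guards against FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```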
That said, the acquisition of DeePhi by Xilinx ought to be a sufficiently strong signal to add them to said frameworks, no?
Finally, my gut instinct (based on the monotonically decreasing speech recognition quality of my increasingly powerful Android phones over time, and the bizarre spelling errors the recognizer now produces, such as "tech kneeq") is that one cannot rate the quality of these techniques from a single metric, and it seems like the decision to foist them on users may be based on one. #ItsAdversariesAllTheWayDown
My response is to tell them to implement and compare against the same tricks on a GPU if they expect me to believe the throughput performance delta. I usually get some hand-waving about how that's not possible or pointless, yadda yadda yadda, and the conversation goes downhill from there. That doesn't mean I don't like these tricks or that I don't think they're useful; it means that, based on the scars from a very long industry career, I don't buy snake oil.
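For what it's worth, the comparison I'm asking for is not exotic. A rough sketch (PyTorch, with placeholder shapes and iteration counts of my own choosing) of timing the same workload on a GPU at full and reduced precision, with proper synchronization, looks like this:

```python
import time
import torch

def gpu_throughput(dtype, n=4096, iters=100):
    """Measure matmul throughput in FLOP/s for a given dtype on the GPU."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()                 # make sure setup kernels have finished
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()                 # wait for all queued kernels before stopping the clock
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed        # 2*n^3 FLOPs per n-by-n matmul

print("fp32:", gpu_throughput(torch.float32))
print("fp16:", gpu_throughput(torch.float16))   # the "same trick" (reduced precision) on the GPU
```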
Thus, the most common optimization is to just keep the transformed weights (a 4x4 transformed tile instead of the original 3x3 kernel, i.e. 16/9 ≈ 1.8x more data), which can be somewhat justified by the transform computation it saves. The transformed activations, by contrast, are not worth keeping.
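A minimal NumPy sketch of that trade-off, assuming the standard Winograd F(2x2, 3x3) weight transform from Lavin & Gray (the exact tiling isn't stated above, so this is illustrative): each 3x3 kernel is pre-transformed once into a 4x4 tile, trading ~1.8x storage for skipping the weight transform at inference time.

```python
import numpy as np

# Weight-transform matrix G for Winograd F(2x2, 3x3).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def transform_weight(g):
    """Pre-compute the 4x4 Winograd-domain tile U = G g G^T for a 3x3 kernel g."""
    return G @ g @ G.T

g = np.random.randn(3, 3)        # one 3x3 filter
U = transform_weight(g)          # cached once, reused for every input tile
print(g.size, U.size)            # 9 vs 16 elements -> roughly 1.8x more data to keep around
```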