1. Don’t use LSTMs (4 vector-matrix multiplies per step) or GRUs (3 multiplies). Use a fixed HiPPO matrix to update the state instead. That’s just 1 multiply, and since the matrix is fixed you can unroll the recurrence during training, which is much faster than backprop through time.
2. Write SIMD intrinsics by hand. None of the libraries are as fast.
3. Don’t use sigmoid or tanh functions as your nonlinear activation. Instead, approximate them with the softsign function, x / (1 + |x|), which is much cheaper because it needs no exponentials.
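
To make points 1 and 3 concrete, here's a minimal pure-Python sketch, not tuned for speed. It assumes the state update is linear, state = A·state + B·u, with A and B fixed. The function names are made up for illustration, and A and B are whatever you pass in (a real model would use a discretized HiPPO matrix, which this sketch does not construct). It shows that the sample-by-sample recurrence and the unrolled weighted sum reach the same final state, plus softsign as the cheap nonlinearity.

```python
def softsign(x):
    """x / (1 + |x|): a tanh-like S-curve in (-1, 1) that never calls exp."""
    return x / (1.0 + abs(x))


def matvec(A, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def step(A, B, state, u):
    """One recurrent update with fixed A and B: the single multiply per sample."""
    return [ax + b * u for ax, b in zip(matvec(A, state), B)]


def run_recurrent(A, B, inputs):
    """Sequential form: feed one input sample at a time, starting from zero state."""
    state = [0.0] * len(B)
    for u in inputs:
        state = step(A, B, state, u)
    return state


def unrolled_kernel(A, B, length):
    """Precompute kernel[k] = A^k B. Only possible because A and B never change."""
    kernel = [list(B)]
    for _ in range(length - 1):
        kernel.append(matvec(A, kernel[-1]))
    return kernel


def run_unrolled(A, B, inputs):
    """Unrolled form: final state = sum over k of (A^k B) * inputs[T - 1 - k]."""
    T = len(inputs)
    kernel = unrolled_kernel(A, B, T)
    return [
        sum(kernel[k][i] * inputs[T - 1 - k] for k in range(T))
        for i in range(len(B))
    ]
```

At inference you'd run the step form one sample at a time. In the unrolled form no time step depends on the previous one, so training can presumably compute the whole sequence in parallel instead of looping through it.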
Depends on the exact architecture, but these optimizations have yielded a 10-30x improvement for single-threaded CPU real-time audio applications.
When GPU audio matures, all this may be unnecessary.