I was referring to how the context vectors help avoid vanishing gradients by beh...

I was referring to how the context vectors help avoid vanishing gradients by behaving very similarly to skip-connections, but yes, they aren't skip-connections as-such. That's been my understanding, at least.

We haven't tried truncated BPTT, but we certainly should.

Funnily enough, we adopted AWD-LSTMs, Ranger21, and Mish in the paper I linked after I heard about them through the fast.ai community (we also trialled QRNNs for a bit too). fast.ai has been hugely influential in my work.