Thanks a lot! I always felt weird about positional embeddings, because positions are not a set, they're a continuum. My initial guess for why they don't extrapolate was that the extrapolated embeddings step on the others' turf once a few computations or layers are applied, confusing the model about order, as if random concepts were inserted here and there. (Overfitting to the training positions does seem like it would contribute as well, though.)
Is ALiBi still the sota for this setting, or have there been advances beyond this in the last 8 months? I know there has been a lot of interest in longer context lengths recently.
If I understand it correctly, you are only attending preceding tokens in your paper. Can the constant bias matrix be made symmetric for unmasked tasks?
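For what it's worth, nothing in the formulation seems to forbid a symmetric version. Here's a minimal NumPy sketch (the `alibi_bias` helper and its signature are my own, not from the paper) of what an unmasked, symmetric distance bias could look like next to the causal one:

```python
import numpy as np

def alibi_bias(n, slope, causal=True):
    # Relative-distance bias: bias[i, j] = -slope * |i - j|,
    # which is symmetric by construction.
    pos = np.arange(n)
    bias = -slope * np.abs(pos[:, None] - pos[None, :]).astype(float)
    if causal:
        # Causal variant as in the paper: future positions are masked out.
        bias[np.triu_indices(n, k=1)] = -np.inf
    return bias
```

With `causal=False` the matrix is symmetric, so attention in both directions is penalized purely by distance, which is presumably what an encoder-style (unmasked) task would want.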
I’m curious as to whether this inductive bias wouldn’t hurt on tasks where the first sentence of a long corpus would contain the most useful information.
Nonetheless, very clever trick and congrats on the great paper!
How does ALiBi compare to rotary positional embeddings? That method makes similar claims. I find ALiBi much easier to understand, but that's probably not the best reason to choose it over other methods.
This seems suboptimal compared to the simple output of the original Vaswani PE, which rests on a well-grounded foundation: its components are, via the discrete Fourier transform, the eigenvectors of the circulant matrix of a linear chain graph, which is the natural structure of a sentence.
There's not. The positional encodings are generated using sines and cosines such that the encoding at any offset can be described as a linear function of the encoding at the original position. Using the DFT here would not make sense, as the positional encodings are fixed anyway; during inference the method generalizes nicely because of the geometric progression in the frequencies of the positional encoding functions.
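To make the linearity claim concrete, here's a small NumPy sketch (the helper names `sinusoidal_pe` and `offset_matrix` are my own) showing that the encoding at `pos + k` is a fixed, position-independent linear function (a block-diagonal rotation) of the encoding at `pos`:

```python
import numpy as np

def sinusoidal_pe(pos, d_model=8):
    # Vaswani et al. encoding: the frequencies form a geometric progression.
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    angles = pos * freqs
    # Interleave as [sin_0, cos_0, sin_1, cos_1, ...]
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

def offset_matrix(k, d_model=8):
    # Block-diagonal matrix M with PE(pos + k) = M @ PE(pos), independent of pos.
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # Angle-addition identities: sin(a+b) = sin a cos b + cos a sin b,
        #                            cos(a+b) = cos a cos b - sin a sin b.
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M
```

Note that `M` depends only on the offset `k`, never on the absolute position, which is exactly the property the encoding was designed for.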
There isn't a DFT applied directly; the statement here is more basic than that.
Every circulant matrix (here, that of the linear chain graph of words) has the same eigenvectors: the DFT basis, which diagonalizes it.
The PE in the original Vaswani paper is based on this; they just didn't spell out all the details. So effectively the model gets a hint from the PE that the input is a linear graph, because the encodings are those eigenvectors.
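You can check the diagonalization claim numerically. A quick NumPy sketch (my own construction, using the cyclic shift matrix as the simplest circulant) showing that the DFT basis diagonalizes it:

```python
import numpy as np

n = 8
# Cyclic shift matrix: C[i, j] = 1 iff j = i + 1 (mod n).
# This is the simplest circulant matrix, modeling a chain/ring of tokens.
C = np.roll(np.eye(n), 1, axis=1)

# Normalized DFT matrix; its columns are the shared eigenvectors
# of every circulant matrix of size n.
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)

# Conjugating by F yields a diagonal matrix of eigenvalues.
D = F.conj().T @ C @ F
```

The off-diagonal entries of `D` come out (numerically) zero, and the same `F` would diagonalize any other circulant matrix of this size.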
Thanks for posting this! You can view a video where I explain what we did and why it's useful at: https://www.youtube.com/watch?v=Pp61ShI9VGc