Thanks a lot! I always felt weird about positional embeddings, because positions are not a set, they're a continuum. My initial guess for why they don't extrapolate was that the extrapolated embeddings step on the others' turf once a few computations or layers are applied, confusing the model about order, as if random concepts were inserted here and there. (Overfitting to the training positions does seem like it would contribute as well, though.)
Is ALiBi still the sota for this setting, or have there been advances beyond this in the last 8 months? I know there has been a lot of interest in longer context lengths recently.
If I understand it correctly, you are only attending preceding tokens in your paper. Can the constant bias matrix be made symmetric for unmasked tasks?
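For what it's worth, nothing in the formulation seems to forbid a symmetric version. Here's a minimal NumPy sketch (the `alibi_bias` helper and its signature are my own, not from the paper) of what an unmasked, symmetric distance bias could look like next to the causal one:

```python
import numpy as np

def alibi_bias(n, slope, causal=True):
    # Relative-distance bias: bias[i, j] = -slope * |i - j|,
    # which is symmetric by construction.
    pos = np.arange(n)
    bias = -slope * np.abs(pos[:, None] - pos[None, :]).astype(float)
    if causal:
        # Causal variant as in the paper: future positions are masked out.
        bias[np.triu_indices(n, k=1)] = -np.inf
    return bias
```

With `causal=False` the matrix is symmetric, so attention in both directions is penalized purely by distance, which is presumably what an encoder-style (unmasked) task would want.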
I’m curious as to whether this inductive bias wouldn’t hurt on tasks where the first sentence of a long corpus would contain the most useful information.
Nonetheless, very clever trick and congrats on the great paper!
How does ALiBi compare to rotary positional embeddings? That method makes similar claims. I find ALiBi much easier to understand, but that's probably not the best reason to choose it over other methods.
This seems suboptimal compared to the simple output of the original Vaswani PE, which rests on a well-grounded foundation: its components are, via the discrete Fourier transform, the eigenvectors of the circulant matrix of a linear chain graph, which is the natural structure of a sentence.
There's not. The positional encodings are generated using sines and cosines such that the encoding at any offset can be described as a linear function of the encoding at the original position. Using the DFT here would not make sense, as the positional encodings are fixed anyway; during inference the method generalizes nicely because of the geometric progression in the frequencies of the positional encoding functions.
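To make the linearity claim concrete, here's a small NumPy sketch (the helper names `sinusoidal_pe` and `offset_matrix` are my own) showing that the encoding at `pos + k` is a fixed, position-independent linear function (a block-diagonal rotation) of the encoding at `pos`:

```python
import numpy as np

def sinusoidal_pe(pos, d_model=8):
    # Vaswani et al. encoding: the frequencies form a geometric progression.
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    angles = pos * freqs
    # Interleave as [sin_0, cos_0, sin_1, cos_1, ...]
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

def offset_matrix(k, d_model=8):
    # Block-diagonal matrix M with PE(pos + k) = M @ PE(pos), independent of pos.
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # Angle-addition identities: sin(a+b) = sin a cos b + cos a sin b,
        #                            cos(a+b) = cos a cos b - sin a sin b.
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M
```

Note that `M` depends only on the offset `k`, never on the absolute position, which is exactly the property the encoding was designed for.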
There isn't a DFT applied directly; the statement here is more basic than that.
Every circulant matrix (here, that of the linear chain graph of words) has the same eigenvectors: the DFT basis, which diagonalizes it.
The PE in the original Vaswani paper is based on this; they just didn't spell out all the details. So effectively the model gets a hint from the PE that the input is a linear graph, because the encodings are those eigenvectors.
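You can check the diagonalization claim numerically. A quick NumPy sketch (my own construction, using the cyclic shift matrix as the simplest circulant) showing that the DFT basis diagonalizes it:

```python
import numpy as np

n = 8
# Cyclic shift matrix: C[i, j] = 1 iff j = i + 1 (mod n).
# This is the simplest circulant matrix, modeling a chain/ring of tokens.
C = np.roll(np.eye(n), 1, axis=1)

# Normalized DFT matrix; its columns are the shared eigenvectors
# of every circulant matrix of size n.
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)

# Conjugating by F yields a diagonal matrix of eigenvalues.
D = F.conj().T @ C @ F
```

The off-diagonal entries of `D` come out (numerically) zero, and the same `F` would diagonalize any other circulant matrix of this size.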
Thanks for posting this! You can view a video where I explain what we did and why it's useful at: https://www.youtube.com/watch?v=Pp61ShI9VGc