Attention with Linear Biases (ALiBi) (arxiv.org)
58 points by pmoriarty on May 14, 2023 | 15 comments



(I wrote ALiBi)

Thanks for posting this! You can view a video where I explain what we did and why it's useful at: https://www.youtube.com/watch?v=Pp61ShI9VGc
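
For readers who just want the gist in code: instead of adding position embeddings to the word embeddings, we add a static, head-specific linear penalty to the attention scores. Here's a minimal numpy sketch of that idea (single head; the function names are mine, this is not the code from our repo):

    import numpy as np

    def alibi_slopes(n_heads):
        # Head-specific slopes: a geometric sequence starting at 2**(-8/n_heads)
        # (this is the recipe for n_heads a power of two).
        start = 2.0 ** (-8.0 / n_heads)
        return np.array([start ** (i + 1) for i in range(n_heads)])

    def alibi_attention(q, k, v, slope):
        # q, k, v: (seq_len, d_head) for one head; slope: the scalar m for this head.
        seq_len, d = q.shape
        pos = np.arange(seq_len)
        scores = q @ k.T / np.sqrt(d)                        # (seq_len, seq_len)
        scores = scores - slope * (pos[:, None] - pos[None, :])  # bias: -m * (i - j)
        scores = np.where(pos[None, :] > pos[:, None], -np.inf, scores)  # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v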


Thanks a lot! I always felt weird about positional embeddings, because positions are not a set, they’re a continuum. My initial guess for why they don’t extrapolate was that the extrapolated embeddings step on the others’ turf once a few computations or layers are applied, confusing the model about order, as if random concepts were inserted here and there. (Overfitting to the positions seen in training probably plays a role too, though.)

Have you experimented with nonlinear biases?


Is ALiBi still the SotA for this setting, or have there been advances beyond it in the last 8 months? I know there has been a lot of interest in longer context lengths recently.


xPos is SoTA right now: https://arxiv.org/pdf/2212.10554.pdf


Thanks!


If I understand it correctly, you only attend to preceding tokens in your paper. Can the constant bias matrix be made symmetric for unmasked tasks?
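
For concreteness, by "symmetric" I mean something like penalizing by absolute distance in both directions; this is just an illustration of the question, not anything from the paper:

    import numpy as np

    def symmetric_bias(seq_len, slope):
        # Hypothetical bidirectional variant: penalize attention by the absolute
        # distance |i - j| instead of only looking backwards.
        pos = np.arange(seq_len)
        return -slope * np.abs(pos[:, None] - pos[None, :])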


I’m curious whether this inductive bias would hurt on tasks where the first sentence of a long corpus contains the most useful information.

Nonetheless, very clever trick and congrats on the great paper!


How does ALiBi compare to rotary positional embeddings? That method makes similar claims. I find ALiBi much easier to understand, but that’s probably not the best reason to choose it over other methods.
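
For reference, rotary embeddings don't add a bias to the scores; they rotate each pair of query/key dimensions by an angle proportional to the position, so the dot product depends only on the relative offset. A rough sketch of the standard formulation (not code from either paper):

    import numpy as np

    def rope(x, base=10000.0):
        # x: (seq_len, d) with d even; rotate consecutive dimension pairs by
        # pos * theta_i, with theta_i a geometric sequence of inverse frequencies.
        seq_len, d = x.shape
        pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
        theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
        ang = pos * theta
        cos, sin = np.cos(ang), np.sin(ang)
        out = np.empty_like(x)
        out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
        out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
        return out

    # Applied to both queries and keys before the dot product:
    #   scores = rope(q) @ rope(k).T / np.sqrt(d)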


The "one weird trick" to squeeze limes for extra juice


This seems suboptimal compared to the simple sinusoidal PE from the original Vaswani paper, which is solidly grounded in the eigenvectors of the discrete Fourier transform and their relation to the circulant matrix of a linear chain graph, which is natural language of the sentence.


The ALiBi paper shows that our method beats the sinusoidal PE you refer to across many benchmarks. https://arxiv.org/abs/2108.12409


I don't recall there being a DFT in the original attention is all you need paper.


There's not. The positional encodings are generated using sines and cosines such that the encoding at any fixed offset can be written as a linear function of the encoding at the original position. Using a DFT here wouldn't make sense, since the positional encodings are fixed anyway, and during inference the method generalizes nicely because the wavelengths of the positional encoding functions form a geometric progression.
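
Concretely, a minimal sketch of the standard sinusoidal encoding (the comment at the end is the "linear function of the original encoding" property):

    import numpy as np

    def sinusoidal_pe(seq_len, d, base=10000.0):
        # PE(pos, 2i) = sin(pos * w_i), PE(pos, 2i+1) = cos(pos * w_i),
        # with the wavelengths forming a geometric progression (w_i = base**(-2i/d)).
        pos = np.arange(seq_len)[:, None]
        w = base ** (-np.arange(0, d, 2) / d)
        pe = np.zeros((seq_len, d))
        pe[:, 0::2] = np.sin(pos * w)
        pe[:, 1::2] = np.cos(pos * w)
        return pe

    # For a fixed offset k, each (sin, cos) pair at pos+k is a rotation of the
    # pair at pos:
    #   sin((p+k)w) =  cos(kw)*sin(pw) + sin(kw)*cos(pw)
    #   cos((p+k)w) = -sin(kw)*sin(pw) + cos(kw)*cos(pw)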


There isn't a DFT directly; the statement is more basic than that. The circulant matrix (the linear graph of words) always has the same eigenvectors and is diagonalized by the DFT.

The PE in the original Vaswani paper is based on this, they just didn't spell out all the details. So effectively the model gets hints from the PE that the input is a linear graph, because these are its eigenvectors.
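
You can check that fact numerically; here with the adjacency matrix of a 6-node ring, since a strict circulant needs the wrap-around (the open chain is only approximately circulant):

    import numpy as np

    n = 6
    first_row = np.array([0., 1., 0., 0., 0., 1.])           # node 0 linked to 1 and 5
    C = np.array([np.roll(first_row, i) for i in range(n)])  # circulant adjacency
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)                   # unitary DFT matrix
    D = np.conj(F).T @ C @ F
    print(np.allclose(D, np.diag(np.diag(D))))               # True: the DFT diagonalizes C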


Could you clarify your suffix "which is natural language of the sentence"? Are you referring to the positional encoding?





