
> Second, you shouldn't apply multihead attention, which has additional inner weights that will mess with the rotations you have just done

Wait, does that mean that rotary embeddings don't work with multi-head attention? That's the first I've heard of this. Wouldn't this be an issue with position embeddings as well (for example, sinusoidal position embeddings are a special case of rotary embeddings)?




Afaiu, the whole idea behind rotary embeddings is kind of a hack to swap out the similarity metric (the one that compares queries to keys) inside scaled_dot_product_attention without having to rewrite its optimized code.
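To make the "hack" concrete, here is a minimal NumPy sketch of the standard RoPE formulation (names and the head dimension are illustrative): each query/key vector is rotated, pair of dimensions by pair of dimensions, by an angle proportional to its position, and the unmodified dot product then only sees relative positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a single query/key vector x according to its position.

    Dimension pair (2i, 2i+1) is rotated by angle pos * theta_i, with one
    frequency theta_i per pair (same frequency schedule as sinusoidal
    embeddings). The attention code itself is untouched: the plain dot
    product of two rotated vectors depends only on their *relative* offset.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-d pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # split into rotation pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # standard 2-d rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

# The "custom similarity metric" is just: rotate, then ordinary dot product.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score = rope(q, 3) @ rope(k, 7)
```

Shifting both positions by the same amount leaves the score unchanged, which is exactly the relative-position invariance described above.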

This custom similarity metric has some properties engineered into it, mainly invariance under relative positioning, and a learnable decay with increasing distance (key-query similarity decreases as positions get farther apart, and the network can learn how important position distance is compared to feature-space distance). It's a strong prior that works well when relative positioning is what matters.

It's a refinement of traditional attention, with a different and more ambitious aim than sinusoidal position embeddings, which just provide some position information to the network so it can distinguish keys, and let it learn whatever use of that information it sees fit.

Sinusoidal position embeddings can learn some relative positioning fairly easily thanks to trigonometric identities, but they still have to learn it. Rotary embeddings have relative positioning baked in: everything is relative to the query position (a point of view quite similar to a convolutional network's), and the only thing left to learn is how important small position distances should be compared to large ones.
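The "because of trigonometry" point can be made precise: for a sinusoidal embedding, the embedding of position m+k is a fixed linear transform (a block rotation depending only on the offset k) of the embedding of position m, so a network *can* learn relative offsets as linear maps, but nothing hands them to it for free. A small sketch (the function names are illustrative, not from any library):

```python
import numpy as np

def sinusoidal(pos, d, base=10000.0):
    """Classic sinusoidal position embedding: interleaved sin/cos pairs."""
    theta = base ** (-np.arange(0, d, 2) / d)
    emb = np.empty(d)
    emb[0::2] = np.sin(pos * theta)
    emb[1::2] = np.cos(pos * theta)
    return emb

def offset_matrix(k, d, base=10000.0):
    """Block-diagonal rotation mapping embedding(m) to embedding(m + k).

    Follows from sin((m+k)t) = sin(mt)cos(kt) + cos(mt)sin(kt) and
    cos((m+k)t) = cos(mt)cos(kt) - sin(mt)sin(kt); it depends only on k,
    never on the absolute position m.
    """
    theta = base ** (-np.arange(0, d, 2) / d)
    c, s = np.cos(k * theta), np.sin(k * theta)
    R = np.zeros((d, d))
    for i in range(d // 2):
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c[i], s[i]
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = -s[i], c[i]
    return R
```

So a "shift by k" is one matrix the network would have to learn per offset; rotary embeddings instead apply the rotation themselves, which is the sense in which relative positioning is baked in rather than learned.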



