
There is a bug: while in this SwiGLU implementation beta is a learnable parameter, in the reference paper the feed-forward network fixes beta to a constant, FFN_SwiGLU = Swish_1... https://arxiv.org/pdf/2002.05202.pdf (Eq. 6)

In the official llama implementation, the constant beta has been removed: https://github.com/facebookresearch/llama/blob/main/llama/mo...

In the blog's log we see several lines like " feedforward.1.beta', 0.0 ", which means that during training beta has degenerated to 0, whereas it should be a constant 1.
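For reference, a minimal sketch of the fixed-beta version (Swish with beta = 1 is just SiLU, so there is nothing to learn); the class and dimension names here are illustrative, not the blog's actual code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUFFN(nn.Module):
        """Feed-forward block with SwiGLU, beta fixed to 1 (Swish_1 == SiLU)."""
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.w = nn.Linear(d_model, d_ff, bias=False)   # gate projection
            self.v = nn.Linear(d_model, d_ff, bias=False)   # value projection
            self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # FFN_SwiGLU(x) = (SiLU(x W) * x V) W2 -- no learnable beta anywhere
            return self.w2(F.silu(self.w(x)) * self.v(x))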




I guess this goes to show how challenging it can be to implement transformer neural networks correctly. There are so many ways to make mistakes at various steps, and there is no surefire way of knowing; you'll just get slightly worse performance than you would have otherwise. And in many cases, if you make a change to the network, intentionally or not, the network adapts to it, and there are many examples of different variants of the architecture performing similarly once trained. (Though in these cases, one might ask whether it really matters if you match the original or not.)

One method I've seen people use to identify these kinds of mistakes is precisely matching model outputs against a reference implementation. HuggingFace does this with tiny-random models: the weights are randomized, but the output is expected to match exactly; if it doesn't, that's an indicator of a bug. This approach only catches bugs that show up during inference, though; detecting issues in data processing, optimizers, or anything that only happens during training is harder.
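Roughly, such a test looks like the sketch below; the tiny-random checkpoint name, the MyLlama class, and its from_pretrained loader are placeholders for whatever reimplementation you are testing, not real APIs of any particular project:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical reimplementation under test; assumed to take input_ids
    # and return logits with the same shape as the reference model.
    from my_package import MyLlama

    ref_name = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(ref_name)
    reference = AutoModelForCausalLM.from_pretrained(ref_name).eval()
    mine = MyLlama.from_pretrained(ref_name).eval()  # assumed loader for the same weights

    inputs = tokenizer("hello world", return_tensors="pt")
    with torch.no_grad():
        ref_logits = reference(**inputs).logits
        my_logits = mine(inputs["input_ids"])

    # Any drift beyond float noise usually points at a bug in the reimplementation.
    torch.testing.assert_close(my_logits, ref_logits, rtol=1e-4, atol=1e-4)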


And since Hugging Face transformers exists, you can also test against that, which is what we do in Curated Transformers (transformers is only a test-time dependency).


The model really wants to learn, but it will use any shortcut to do it.


Wow, great catch. I will update this in the morning!


Cool, there are additional issues with the RoPEAttention you might want to fix as well:

The reference paper for rotary embeddings is RoFormer: https://arxiv.org/pdf/2104.09864v4.pdf

First, you shouldn't rotate the values, only the keys and queries. This is wrong: v_out = (torch.bmm(v.transpose(0,1), self.R[:m, ...])).transpose(0,1)

Second, you shouldn't apply multihead attention, which has additional inner weights that will mess with the rotations you have just done. This is wrong: activations, attn_weights = self.multihead(q_out, k_out, v_out)

Instead you should use scaled_dot_product_attention(q_out, k_out, v_out)

Third, every attention head should be treated the same way, with the same rotation frequencies (see the sketch below).
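Putting the three fixes together, a rough sketch; the R tensor of per-position rotation matrices and the tensor shapes are assumptions for illustration, not the blog's exact API:

    import torch
    import torch.nn.functional as F

    def rope_attention(q, k, v, R):
        """q, k, v: (batch, n_heads, seq_len, head_dim); R: (seq_len, head_dim, head_dim)
        rotation matrices, shared across all heads."""
        m = q.shape[-2]
        # Rotate queries and keys only; values are left untouched.
        q_rot = torch.einsum("bhmd,mde->bhme", q, R[:m])
        k_rot = torch.einsum("bhmd,mde->bhme", k, R[:m])
        # No extra projections here: nn.MultiheadAttention would re-project q/k/v
        # and destroy the rotations, so call the bare attention kernel instead.
        return F.scaled_dot_product_attention(q_rot, k_rot, v, is_causal=True)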


> Second, you shouldn't apply multihead attention, which has additional inner weights that will mess with the rotations you have just done

Wait, does that mean that rotary embeddings don't work with multiheaded attention? This is the first I've heard of that. Wouldn't this be an issue with position embeddings as well (for example, sinusoidal position embeddings are a special case of rotary embeddings)?


Afaiu, the whole idea behind rotary embeddings is kind of a hack to switch the similarity metric (that compares queries to keys) inside scaled_dot_product_attention without having to rewrite its optimized code.

This custom similarity metric has some properties engineered into it, mainly invariance to relative positioning and a learnable decay with increasing distance (key-query similarity decreases with increasing distance in position space, and the network can learn how important position distance is compared to feature-space distance). It's a strong prior that works well when relative positioning is important.

It's a refinement of traditional attention, and a different, more ambitious aim than what sinusoidal position embeddings are trying to achieve, which is just to provide some position information to the neural network so it can distinguish keys, and let it learn whatever it sees fit.

Sinusoidal position embeddings can learn some relative positioning quite easily thanks to trigonometry, but they have to learn it. Rotary embeddings have relative positioning baked in: everything is relative to the query position (quite similar in spirit to a convolutional network), and the only thing they learn is how important a small position distance should be compared to a large one.
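A quick numerical check of that "baked in" property, using a single 2x2 rotation (one RoPE frequency): the query-key score depends only on the offset n - m, not on the absolute positions. The frequency value here is just an illustrative constant.

    import math
    import torch

    def rot(theta: float) -> torch.Tensor:
        """2x2 rotation matrix -- one RoPE frequency acting on one pair of dims."""
        c, s = math.cos(theta), math.sin(theta)
        return torch.tensor([[c, -s], [s, c]])

    torch.manual_seed(0)
    q = torch.randn(2)
    k = torch.randn(2)
    freq = 0.1  # illustrative base frequency

    def score(m: int, n: int) -> torch.Tensor:
        # Dot product between a query rotated to position m and a key rotated to position n.
        return (rot(m * freq) @ q) @ (rot(n * freq) @ k)

    # Same offset (n - m = 3) at different absolute positions -> identical score.
    print(score(0, 3), score(10, 13), score(100, 103))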


Generally biases in transformers don't work so well.

Personally I think it's because of the autoregressive, ODE-like nature of them, but who am I to say anything on that. ;PPPP



