It turned out that using no positional encodings at all, relying entirely on the causal attention mask to break permutation symmetry, did better than all the other methods tried: learned positional embeddings, sinusoidal encodings, RoPE, and placing an LSTM below the transformer.
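To make the setup concrete, here is a minimal PyTorch sketch (not from the original experiments) of such a "NoPE" model: token embeddings feed directly into causally masked self-attention layers, with no positional term added anywhere. The class name `NoPEDecoder` and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoPEDecoder(nn.Module):
    """Decoder-only transformer with no positional encodings (NoPE).

    Position information reaches the model only implicitly, through
    the causal attention mask: token i can attend to positions <= i,
    which is enough to break permutation symmetry over the sequence.
    """
    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # Token embeddings only; deliberately no positional embedding table
        # and no sinusoidal/rotary term is ever added.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: -inf above the diagonal, so each position
        # only attends to itself and earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.embed(tokens)  # note: nothing position-dependent added here
        h = self.layers(h, mask=mask)
        return self.out(h)

model = NoPEDecoder()
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```

The only architectural difference from a standard decoder is the absence of the positional term at the embedding step; the causal mask itself is unchanged from any autoregressive transformer.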