Not an expert, but took my masters in data science some years ago:
My interpretation is the sinusoidal converts the values from 0..N to 0..1. This may result in more predictable changes than using large integer numbers, such that the same word in different positions doesn't lose all its meaning.
Using integers and having a layer to compute this operation could also work, so maybe this is an optimization, eg it reduces training time or yields better results.
My interpretation is the sinusoidal converts the values from 0..N to 0..1. This may result in more predictable changes than using large integer numbers, such that the same word in different positions doesn't lose all its meaning.
Using integers and having a layer to compute this operation could also work, so maybe this is an optimization, eg it reduces training time or yields better results.