Suppose you had 4096-sized activations (llama-2). Maybe you make do with 3072 activations and concatenate 1024 positional activations onto that.
Then you pass that to Mk, Mq, Mv and generate K, Q, V.
The only thing that would change would be Mff-out, which would now be a (big)x3072 matrix instead of (big)x4096.
In any case you would be retraining, so changing the dims of the tensors is, I think, not a big deal... In fact, in this case they would be smaller (at the cost of fewer interlayer activations), but you would have the same number of tensors. A minimal sketch of that front end is below.
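Here is a rough sketch of what that concat-instead-of-sum input could look like, under the dimensions from the example above (3072 token dims + 1024 positional dims = 4096). It assumes learned positional embeddings; the class name, max_len, and the use of nn.Linear for Mq/Mk/Mv are all illustrative choices, not anything from llama-2 itself.

```python
import torch
import torch.nn as nn

class ConcatPositionalAttentionInput(nn.Module):
    """Concatenate positional activations onto token activations, then project to Q, K, V."""

    def __init__(self, d_token=3072, d_pos=1024, max_len=4096):
        super().__init__()
        d_model = d_token + d_pos                     # 4096, the llama-2-sized width
        self.pos = nn.Embedding(max_len, d_pos)       # learned positional activations (an assumption)
        self.Mq = nn.Linear(d_model, d_model, bias=False)
        self.Mk = nn.Linear(d_model, d_model, bias=False)
        self.Mv = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                             # x: (batch, seq, d_token)
        b, t, _ = x.shape
        p = self.pos(torch.arange(t, device=x.device))    # (seq, d_pos)
        p = p.unsqueeze(0).expand(b, -1, -1)              # (batch, seq, d_pos)
        h = torch.cat([x, p], dim=-1)                     # concat, not sum: (batch, seq, 4096)
        return self.Mq(h), self.Mk(h), self.Mv(h)
```

Note the asymmetry this implies for the rest of the block: the attention and feed-forward inputs stay 4096-wide, but the output projection (Mff-out) only needs to produce 3072 dims, since the 1024 positional dims get re-concatenated at the next layer.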
> Actually summing might learn a concat on its own.
But you see the point? You're forcing the model to learn something that maybe it didn't need to. That's like saying "well, a fully connected network might learn convolution on its own". Historically, breakthroughs in capability have been accompanied by one of: [more data | more layers | smarter constraints on activations].
Unless you have some argument that forcing it to learn position has carryover value in generating activations, it seems, naively, like a bad idea.