
Both of your points are basically true, but I think a better way to model the problem is as a set of similar-length vectors being linearly combined by a probability vector.

Mathematically, we can write v_out = V * w,

where v_out is the output vector of the attention unit, w is the probability vector from the softmax, and V is the matrix of input vectors, with each column being one input vector.
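To make that concrete, here's a tiny numpy sketch of the readout. The dimensions and scores below are made up for illustration; in a real attention head the scores would come from query-key dot products.

    import numpy as np

    # Toy version of the readout v_out = V @ w.
    d, n = 8, 4                                # embedding dim, number of input vectors (arbitrary)
    rng = np.random.default_rng(0)
    V = rng.normal(size=(d, n))                # each column is an input vector
    scores = rng.normal(size=n)                # stand-in attention scores
    w = np.exp(scores) / np.exp(scores).sum()  # softmax -> probability vector
    v_out = V @ w                              # convex combination of the columns
    print(w.sum())                             # 1.0
    print(np.linalg.norm(v_out))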

For a moment, pretend that the columns of V are orthonormal. This might not be true, but it's an interesting case.

When the model wants the output to be small, it can set every coordinate of w to 1/n, where n is the number of columns in V.

In that case, the length ||v_out|| is exactly 1/sqrt(n): the output is (1/n) times the sum of n orthonormal columns, so its squared length is n * (1/n)^2 = 1/n. That's small compared to the input lengths of 1 (since we're pretending they're orthonormal).
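You can check that numerically. This is just a sketch under the orthonormal assumption; I'm using a QR decomposition of a random matrix to get orthonormal columns:

    import numpy as np

    # Orthonormal columns + uniform weights: the output norm is exactly 1/sqrt(n).
    d, n = 64, 16
    rng = np.random.default_rng(0)
    V, _ = np.linalg.qr(rng.normal(size=(d, n)))  # d x n matrix with orthonormal columns
    w = np.full(n, 1.0 / n)                       # uniform probability vector
    v_out = V @ w
    print(np.linalg.norm(v_out), 1 / np.sqrt(n))  # both 0.25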

Now stop pretending they're orthonormal. The worst case is that the columns are all the same vector, in which case the weights w can't change anything. But that's a mighty weird case. In high dimensions, if a set of vectors has any randomness at all, they tend to point in wildly different directions, with dot products close to zero. In that case the same intuition as the orthonormal case applies, and we'd expect a uniform distribution coming out of the softmax to give us a vector that's much smaller than any of the input vectors.
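Here's the same kind of check for the random high-dimensional case (again just a sketch, with Gaussian columns standing in for whatever the model actually produces):

    import numpy as np

    # Random unit vectors in high dimensions are nearly orthogonal,
    # so uniform weights shrink the output roughly like the orthonormal case.
    d, n = 1024, 16
    rng = np.random.default_rng(0)
    V = rng.normal(size=(d, n))
    V /= np.linalg.norm(V, axis=0)        # normalize each column to length 1
    G = V.T @ V                           # Gram matrix of pairwise dot products
    off_diag = G[~np.eye(n, dtype=bool)]
    print(np.abs(off_diag).max())         # largest pairwise dot product: small compared to 1
    w = np.full(n, 1.0 / n)
    print(np.linalg.norm(V @ w))          # close to 1/sqrt(n) = 0.25

With d = 1024 the off-diagonal dot products come out on the order of 1/sqrt(d), and the output norm lands close to the orthonormal value of 1/sqrt(n).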



