
That is a fair question, and I'm also unsure whether a simple metric like perplexity would pick it up.

However, I do think that if perplexity showed a smaller drop-off under quantization with this modified softmax, that would be an exciting finding and enough to indicate that further experiments would be worth doing.

But you are right - if it doesn't show an improvement, that doesn't necessarily rule out that the modification is helping.
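
For concreteness, here's roughly what that check could look like: measure perplexity once in full precision and once after quantizing, and compare the drop-off. A minimal sketch, assuming the Hugging Face OPT-125M checkpoint and PyTorch's dynamic int8 quantization as stand-ins for whatever setup the paper actually uses:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model, tokenizer, text, max_len=512):
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_len)
        with torch.no_grad():
            # labels=input_ids gives the mean next-token cross-entropy loss
            out = model(**enc, labels=enc["input_ids"])
        return torch.exp(out.loss).item()

    name = "facebook/opt-125m"  # small enough to run on CPU
    tok = AutoTokenizer.from_pretrained(name)
    fp32 = AutoModelForCausalLM.from_pretrained(name)

    # Post-training dynamic int8 quantization of the linear layers; the
    # paper's actual quantization scheme may well differ from this.
    int8 = torch.ao.quantization.quantize_dynamic(
        AutoModelForCausalLM.from_pretrained(name),
        {torch.nn.Linear}, dtype=torch.qint8,
    )

    text = open("eval.txt").read()  # any held-out text; filename is a placeholder
    print("fp32 ppl:", perplexity(fp32, tok, text))
    print("int8 ppl:", perplexity(int8, tok, text))

If the modified-softmax model shows a smaller fp32-to-int8 gap than the baseline, that's the signal worth chasing.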

Edit: In the Qualcomm AI paper mentioned in this post, they experiment on BERT uncased (109M params) and OPT 125M, and are able to show the effects using perplexity.

I hadn't read the paper when I suggested the same approach, so I guess that's good validation that it's worth trying.

Edit2: Actually, they also test on a ViT (22M params), which would be even quicker to try, I think.
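
For anyone who wants to try it, and assuming the "modified softmax" in question is the off-by-one variant (an extra 1 in the denominator, equivalent to giving the attention head an implicit zero logit it can dump weight onto), a numerically stable version is only a few lines:

    import torch

    def softmax_one(x, dim=-1):
        # softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j))
        # Shift by max(x, 0) so exp() can't overflow; clamping at 0 keeps
        # the implicit extra zero logit inside the shift as well.
        m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0.0)
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

If that is the right variant, swapping it in for torch.softmax in the attention layers is the whole change; attention rows can then sum to less than 1 instead of being forced to distribute weight somewhere.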



