
You could test this on a toy Transformer-based model trainable on a consumer GPU.
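For what it's worth, here is a minimal sketch of the kind of modified softmax being discussed, assuming the "+1 in the denominator" variant; the function name and the max-shift stability trick are mine, not lifted from the article:

    import torch

    def softmax_plus_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
        """Softmax with an extra 1 in the denominator, i.e.
        exp(x_i) / (1 + sum_j exp(x_j)), so an attention head can put
        (near) zero weight everywhere instead of being forced to sum to 1."""
        # Shift by max(x, 0) for numerical stability; exp(-m) is the
        # implicit "zero logit" after the same shift.
        m = x.amax(dim=dim, keepdim=True).clamp(min=0)
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

Swapping this in for the usual softmax in a small GPT-style model's attention is all the change needed to run the comparison.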



Do outlier features emerge in sub-100M-parameter models? I haven't seen any research discussing it below the 124M scale (BERT-base). At that scale, training a model takes ~4 days on an 8xA100 node.
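You could also look for them directly rather than inferring from a downstream metric. A rough sketch, assuming an HF-style model called with a dict of tensors; the ~6 threshold is borrowed from the LLM.int8() outlier definition and is just a knob here:

    import torch

    @torch.no_grad()
    def outlier_stats(model, batch, threshold: float = 6.0):
        """Hook every LayerNorm output and record max activation magnitude,
        kurtosis, and how many hidden channels exceed `threshold`."""
        stats, handles = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                a = output.detach().float()
                per_channel_max = a.abs().amax(dim=tuple(range(a.dim() - 1)))
                stats[name] = {
                    "abs_max": a.abs().max().item(),
                    "kurtosis": ((a - a.mean()) / a.std()).pow(4).mean().item(),
                    "outlier_channels": int((per_channel_max > threshold).sum()),
                }
            return hook

        for name, module in model.named_modules():
            if isinstance(module, torch.nn.LayerNorm):
                handles.append(module.register_forward_hook(make_hook(name)))
        model(**batch)
        for h in handles:
            h.remove()
        return stats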


That is a fair question, and on top of that I'm not sure a simple metric like perplexity would pick it up.

However, I do think that if perplexity showed a smaller drop-off under quantization with this modified softmax, that would be an exciting finding and enough to indicate that further experiments are worth doing.

But you are right: if it doesn't show an improvement, that doesn't necessarily rule out that the modified softmax is helping.
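For concreteness, measuring that drop-off could look roughly like this, a sketch assuming a Hugging Face-style causal LM that returns a .loss when labels are passed (the sliding-window setup is my own simplification):

    import math
    import torch

    @torch.no_grad()
    def perplexity(model, token_ids: torch.Tensor, ctx_len: int = 512) -> float:
        """Sliding-window perplexity over a 1-D tensor of token ids."""
        model.eval()
        nll, n_tokens = 0.0, 0
        for start in range(0, token_ids.numel() - 1, ctx_len):
            # ctx_len + 1 tokens -> ctx_len next-token predictions per window
            chunk = token_ids[start : start + ctx_len + 1].unsqueeze(0)
            out = model(input_ids=chunk, labels=chunk)
            nll += out.loss.item() * (chunk.numel() - 1)
            n_tokens += chunk.numel() - 1
        return math.exp(nll / n_tokens)

Then it's just perplexity(fp16_model, eval_tokens) versus perplexity(quantized_model, eval_tokens) for each softmax variant.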

Edit: In the Qualcomm AI paper mentioned in this post, they experiment on BERT-base uncased (109M params) and OPT-125M, and are able to show the effects using perplexity.

I hadn't read the paper when I suggested the same approach, so I guess that is good validation it is worth trying.

Edit 2: Actually, they also test on a 22M-parameter ViT, which I think would be even quicker to try.


It would be hard to say whether either of two completely crap models is more or less crap, though. Maybe by repeating the runs and seeing consistent results while varying other things?


Not at all.

I suggest ways to measure it here: https://news.ycombinator.com/item?id=36855881, but the TL;DR is: pick a metric and compare how much performance the quantized LM loses with the modified softmax versus how much the same LM loses when quantized without it.
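Concretely, using a perplexity helper like the one sketched upthread, the comparison boils down to something like this; quantize, baseline_lm, and softmax1_lm are placeholders for whatever quantization routine and model pair you actually train:

    def quantization_gap(model, eval_tokens, quantize):
        """Perplexity penalty a model pays when quantized.
        `quantize` is a placeholder for your int8/int4 routine of choice;
        `perplexity` is the sliding-window helper sketched upthread."""
        ppl_full = perplexity(model, eval_tokens)
        ppl_quant = perplexity(quantize(model), eval_tokens)
        return ppl_quant - ppl_full

    # The claim to test: the modified-softmax model should pay a smaller
    # penalty than the vanilla-softmax model trained the same way.
    # gap_vanilla  = quantization_gap(baseline_lm, eval_tokens, quantize)
    # gap_modified = quantization_gap(softmax1_lm, eval_tokens, quantize)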


> we'd expect little difference

I think that likely holds for the quantized versions too; the difference could very well be within the error bars.
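A few repeated seeds would at least make "within the error bars" checkable rather than a guess. A tiny sketch, where run_pipeline is a placeholder for the full train-quantize-evaluate loop:

    import statistics

    def gap_with_error_bars(run_pipeline, seeds=range(5)):
        """Repeat train -> quantize -> evaluate over several seeds and report
        mean and sample std of the perplexity gap; `run_pipeline(seed)` is a
        placeholder returning the gap for one seeded run."""
        gaps = [run_pipeline(seed) for seed in seeds]
        return statistics.mean(gaps), statistics.stdev(gaps)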

Anyway, it's been posted to r/LocalLLaMA, so I'm sure someone will try it within the hour and report back soon :P



