
Counterargument: this blogpost is worthless. You get all the way to the end and then find out he hasn't actually tried it, not even on a toy model. It's just a neat idea he thinks will work.



I wouldn’t quite say its value is zero. It’s worth something, but a lot less than if it had been shown to work empirically.

Explainers and their folksy, imprecise tone are fine for things we already know are true. I'm skeptical of them for things that are unproven.


Why would that make it worthless?


Among other reasons, because the decoder-only version of the original transformer architecture has proven weirdly resistant to these kinds of hacks and clever optimizations.

Ideas like sparse attention, tree attention, residual attention, etc. all sound good on paper, but when researchers try to reproduce them they find either no improvement or improvements that don't scale. Even ALiBi is turning out to be less powerful than scaled-down positional embeddings. It's almost a bitter lesson on its own: you can't beat the original transformer.

Optimizations that do stick around tend to be the ones that preserve the original algorithm but help with caching or memory accesses.
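
For anyone who hasn't run into ALiBi: it drops positional embeddings entirely and instead adds a fixed, per-head linear penalty to the attention logits, so keys farther from the query score lower. A minimal PyTorch sketch of that bias (my own illustration, not anything from the linked post; the slope schedule is the geometric one from the ALiBi paper and assumes the head count is a power of two):

    import torch

    def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
        # One slope per head, following the geometric schedule from the ALiBi paper.
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        # Relative offset j - i between key position j and query position i.
        pos = torch.arange(seq_len)
        offset = pos[None, :] - pos[:, None]  # (seq_len, seq_len), negative for past keys
        # Penalty grows linearly with distance; future positions are handled
        # later by the usual causal mask.
        return slopes[:, None, None] * offset[None, :, :]  # (num_heads, seq_len, seq_len)

    # Added to the raw attention logits before softmax, in place of positional embeddings.
    num_heads, seq_len = 8, 16
    scores = torch.randn(num_heads, seq_len, seq_len)
    scores = scores + alibi_bias(num_heads, seq_len)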


Because there are a thousand ideas a minute in this field that meet the "it's worth trying" bar but don't actually pan out to make any difference. It's the equivalent of a blogpost that says "if someone else turned my idea into a business, it would be a billion dollar business. But I won't bother."


Because until he tries it, who knows if it works?

There are a thousand papers out there making minor tweaks to the transformer architecture. 99% of them are also worthless and forgotten.


> Because until he tries it, who knows if it works?

That's precisely what he shared it for, though: so that someone willing to train a model with this tweak can try it.


With, say, system architecture, you can muse on things like "well, if Kubernetes made this decision, it would definitely be more secure" or "it would scale up quicker" without empirical evidence, and other people can argue "yes, I agree because..." or "no, I don't because...", etc.

With large ML models, there is probably no intuition like that. We can't say "if I do the common-sense thing X, it will surely produce better results on a given benchmark"; we have no idea until it is tried.


He says in the very first paragraph:

> I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead.

So I think the accusation that he buried the lede about the lack of experiments is unwarranted.



