
Counterargument: this blogpost is worthless. You get all the way to the end and then find out he hasn't actually tried it, not even on a toy model. It's just a neat idea he thinks will work.



I wouldn’t quite say its value is zero. It’s worth something, but a lot less than if it had been shown to work empirically.

Explainers and their folksy, imprecise tone are fine for things we already know are true. I'm skeptical of them for things that are unproven.


Why would that make it worthless?


Among other reasons, because the decoder-only version of the original transformer architecture has proven weirdly resistant to these kinds of hacks and clever optimizations.

Ideas like sparse attention, tree attention, residual attention, etc. all sound good on paper, but when researchers try to reproduce them they find either no improvement or improvements that don't scale. Even ALiBi is turning out to be less powerful than scaled-down positional embeddings. It's almost a bitter lesson on its own: you can't beat the original transformer.

Optimizations that do stick around tend to be the ones that preserve the original algorithm but help with caching or memory accesses.
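
For anyone who hasn't run into ALiBi: it drops positional embeddings entirely and instead adds a fixed, per-head linear penalty to the attention logits, so keys farther from the query score lower. A minimal PyTorch sketch of that bias (my own illustration, not anything from the linked post; the slope schedule is the geometric one from the ALiBi paper and assumes the head count is a power of two):

    import torch

    def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
        # One slope per head, following the geometric schedule from the ALiBi paper.
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        # Relative offset j - i between key position j and query position i.
        pos = torch.arange(seq_len)
        offset = pos[None, :] - pos[:, None]  # (seq_len, seq_len), negative for past keys
        # Penalty grows linearly with distance; future positions are handled
        # later by the usual causal mask.
        return slopes[:, None, None] * offset[None, :, :]  # (num_heads, seq_len, seq_len)

    # Added to the raw attention logits before softmax, in place of positional embeddings.
    num_heads, seq_len = 8, 16
    scores = torch.randn(num_heads, seq_len, seq_len)
    scores = scores + alibi_bias(num_heads, seq_len)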


Because there are a thousand ideas a minute in this field that meet the "it's worth trying" bar but don't actually pan out to make any difference. It's the equivalent of a blogpost that says "if someone else turned my idea into a business, it would be a billion dollar business. But I won't bother."


Because until he tries it, who knows if it works?

There are a thousand papers out there making minor tweaks to the transformer architecture. 99% of them are also worthless and forgotten.


> Because until he tries it, who knows if it works?

That's precisely what he shared it for, though: so that someone willing to train a model with this tweak can try it.


With, say, system architecture, you can muse on things like "well, if Kubernetes made this decision, it would definitely be more secure" or "it would scale up quicker" without empirical evidence, and other people can argue "yes, I agree because..." or "no, I don't because...", etc.

With large ML models, there is probably no intuition like that. We can't say "if I do the common-sense thing X, it will surely produce better results on a given benchmark"; we have no idea until it is tried.


He says in the very first paragraph:

> I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead.

So I think the accusation that he buried the lede about the lack of experiments is unwarranted.



