
I’m talking about attention-only transformers. Those don’t use autodiff but still learn. The math is actually really cool.



> attention only transformers

Can you share any good link on the subject?



Maybe I am missing something, but I don't see any learning without autodiff.


I thought you were asking about attention-only transformers. This paper touches on some of it: https://arxiv.org/abs/2212.10559v2
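
If it helps, here is a toy numpy sketch (my own notation, not the paper's code) of the identity the paper builds on: softmax-free linear attention over the in-context demonstrations equals a sum of outer products applied to the query, which has the same algebraic shape as a gradient-descent update to a linear layer (W <- W + sum_i e_i x_i^T).

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_demo = 4, 8                        # embedding size, number of demo tokens

    W_K, W_V = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    X_demo = rng.normal(size=(d, n_demo))   # in-context demonstration tokens
    q = rng.normal(size=(d,))               # query vector for the test token

    # View 1: softmax-free (linear) attention over the demonstrations, V K^T q.
    K, V = W_K @ X_demo, W_V @ X_demo
    attn_out = V @ (K.T @ q)

    # View 2: the same computation as an implicit "weight update":
    # delta_W = sum_i v_i k_i^T is a sum of outer products, the same form as
    # a gradient step on a linear layer (error signal times input, summed).
    delta_W = sum(np.outer(V[:, i], K[:, i]) for i in range(n_demo))
    gd_style_out = delta_W @ q

    print(np.allclose(attn_out, gd_style_out))   # True: the two views coincide

The paper's actual claim is stronger, and softmax attention only approximates this, but that's the flavor of the math.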


The paper speculates that in-context learning is analogous to gradient descent and empirically shows the two behave similarly, but it is not a rigorous proof of any kind.

The momentum experiment they ran also does not seem directly related: it just adds past values to V, which extends the effective context length.
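
Concretely, my reading of "adds past values to V" is something like the sketch below; the exponential decay weighting (beta) is my guess, not necessarily the paper's exact formulation.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attn_with_value_momentum(q, K, V, beta=0.9):
        """Ordinary attention output plus a decayed sum of past value vectors."""
        base = V @ softmax(K.T @ q)                              # standard attention
        t = V.shape[1]
        past = sum(beta ** (t - i) * V[:, i] for i in range(t))  # decayed sum over past values
        return base + past

Written that way, it mostly looks like a way to keep old value vectors in play, i.e. a longer effective context, rather than momentum in the optimizer sense.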


> but it is not a rigorous proof of any kind.

Such is the nature of early theories.



