
lucidrains on April 21, 2024 | on: Self-reasoning tokens: teaching models to think ah...

could even try it with a fraction of the attention heads, instead of introducing new tokens

sdenton4 on April 21, 2024

An important piece here is that there's still a training signal making it to the weights. See SimSiam for a similar example.

lucidrains on April 21, 2024

indeed, SimSiam is a great example of the effectiveness of using stop gradient
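For anyone who wants to see the stop-gradient point concretely, here is a rough sketch in plain Python. It is not SimSiam's actual code: the `Value` class is a hypothetical toy scalar autodiff written for this example, the loss is a plain negative product standing in for negative cosine similarity, and `w`, `h`, `x1`, `x2` are made-up names for a shared encoder weight, a predictor weight, and two views of an input. The thing to notice is that the shared weight still gets a gradient through the predictor branch even when the target branch is detached.

```python
class Value:
    """Toy scalar reverse-mode autodiff node (illustrative only, not a real library)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._grad_fns = grad_fns  # one local-derivative function per parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def __neg__(self):
        return Value(-self.data, (self,), (lambda g: -g,))

    def detach(self):
        # Stop gradient: same value, no parents, so backward() halts here.
        return Value(self.data)

    def backward(self):
        # Build a topological order (parents before children), then walk it in reverse.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, fn in zip(v._parents, v._grad_fns):
                p.grad += fn(v.grad)


def weight_grads(x1, x2, w0, h0, stop_gradient):
    """Return (dloss/dw, dloss/dh) for a toy two-view, shared-weight loss."""
    w = Value(w0)            # shared encoder weight (assumed scalar for the sketch)
    h = Value(h0)            # predictor weight
    z1 = w * Value(x1)       # encoding of view 1
    z2 = w * Value(x2)       # encoding of view 2
    p1 = h * z1              # prediction from view 1
    target = z2.detach() if stop_gradient else z2
    loss = -(p1 * target)    # stand-in for negative similarity
    loss.backward()
    return w.grad, h.grad


if __name__ == "__main__":
    print("with stop gradient:   ", weight_grads(1.0, 2.0, 0.5, 3.0, True))
    print("without stop gradient:", weight_grads(1.0, 2.0, 0.5, 3.0, False))
```

With these toy numbers the gradient on `w` is -3.0 with the stop gradient and -6.0 without it, so detaching the target halves the signal here but does not remove it. The predictor weight `h` gets the same gradient either way, since it only sits on the non-detached branch.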
