
could even try it with a fraction of the attention heads, instead of introducing new tokens


An important piece here is that there's still a training signal making it to the weights. See SimSiam for a similar example.


Indeed, SimSiam is a great example of how effective stop-gradient can be.
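
For anyone who wants to see the point concretely, here is a toy sketch (not the actual SimSiam code). It assumes a tiny linear encoder and a negative-cosine loss, and it leaves out SimSiam's predictor MLP and the symmetrized loss. The names `encode`, `loss`, and `grad_with_stop_gradient` are made up for illustration. Stop-gradient is emulated by passing the same weights in twice and differentiating only with respect to the first copy, so the target branch is treated as a constant. The gradient on the online branch is still nonzero, which is the "training signal still reaches the weights" part.

```python
import math

def encode(w, x):
    # Toy linear "encoder": w is a flattened 2x2 matrix, x is a 2-vector.
    return [w[0] * x[0] + w[1] * x[1], w[2] * x[0] + w[3] * x[1]]

def neg_cosine(p, z):
    # Negative cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def loss(w_online, w_target, x1, x2):
    # w_target stands in for stop_gradient(w): same values, treated as constant.
    return neg_cosine(encode(w_online, x1), encode(w_target, x2))

def grad_with_stop_gradient(w, x1, x2, eps=1e-6):
    # Central finite differences w.r.t. the online branch only;
    # the target branch always sees the unperturbed weights.
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((loss(up, w, x1, x2) - loss(down, w, x1, x2)) / (2 * eps))
    return grad

# Hypothetical weights and two "augmented views" of one input.
w = [1.0, 0.5, -0.3, 0.8]
x1, x2 = [1.0, 0.2], [0.9, 0.4]

g = grad_with_stop_gradient(w, x1, x2)

# One plain gradient-descent step on the online branch.
w_new = [wi - 0.1 * gi for wi, gi in zip(w, g)]
```

In a real framework the same thing would be a detach / stop-gradient call on the target branch rather than a second argument.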



