
I'm talking about the rows of the new K and V matrices introduced by the paper, not the rows of the input sequence. The ordering of those rows does matter in one sense: rows further down were added later in training, when new parameter tokens were appended during scaling. So those newer parameters may represent knowledge that is less fundamental and closer to fine-tuning on the training set.
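To make the distinction concrete, here is a minimal NumPy sketch, not the paper's code. It assumes plain softmax attention of the inputs over the parameter rows (the paper may normalize differently), and the names `param_attention` and `grow`, the shapes, and the small random init for new rows are all made up for illustration. It shows two things: jointly permuting the rows of K and V leaves the output unchanged, and appending rows leaves the old rows at their original indices, so row position records when a row was added.

```python
import numpy as np

def param_attention(X, K, V):
    # Stand-in: plain softmax attention of inputs X over parameter rows K, V.
    scores = X @ K.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def grow(K, V, n_new, rng):
    # Hypothetical scaling step: append n_new rows below the existing ones.
    # Existing rows keep their indices, so a larger index means "added later".
    K_new = rng.normal(scale=0.02, size=(n_new, K.shape[1]))
    V_new = rng.normal(scale=0.02, size=(n_new, V.shape[1]))
    return np.vstack([K, K_new]), np.vstack([V, V_new])

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))    # 3 input tokens, width 8
K0 = rng.normal(size=(4, 8))   # 4 original parameter rows
V0 = rng.normal(size=(4, 8))

K1, V1 = grow(K0, V0, 2, rng)  # rows 4 and 5 are the newer ones

# Shuffling K and V rows together does not change the output.
perm = rng.permutation(K1.shape[0])
same = np.allclose(param_attention(X, K1, V1),
                   param_attention(X, K1[perm], V1[perm]))
```

So the order is irrelevant to the computation itself; it only carries information about training history. Whether the later rows really end up holding less fundamental knowledge depends on how training proceeds after the rows are added.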



But after the new rows are added, I think the entire network is retrained, so the older rows keep getting updated too.



