You punish it if parts of the answer can be found in its training data, and reward it otherwise.
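Concretely, that could be an extra penalty term applied at post-training time: score each sampled answer against an index of training-data n-grams and subtract the overlap from whatever reward you were already using. A minimal sketch, where the n-gram size, the corpus index, and the penalty weight are all made-up placeholders for illustration:

  # Hypothetical overlap penalty: the more of the answer appears verbatim in the
  # training corpus, the lower the reward. Names and weights are illustrative only.
  def ngrams(tokens, n=8):
      return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

  def overlap_penalty(answer_tokens, training_ngram_index, n=8):
      grams = ngrams(answer_tokens, n)
      if not grams:
          return 0.0
      copied = sum(1 for g in grams if g in training_ngram_index)
      return copied / len(grams)  # fraction of the answer found verbatim in training data

  def shaped_reward(base_reward, answer_tokens, training_ngram_index, weight=1.0):
      return base_reward - weight * overlap_penalty(answer_tokens, training_ngram_index)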



But the whole point of training is that you reward it for correctly predicting the next token.


That's not the whole point of training. It's just (very loosely) the loss used during pre-training. A typical model also goes through post-training and alignment stages that are designed to reward high-quality responses.
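To make the distinction concrete, here is a minimal sketch of the two signals (PyTorch; the shapes, names, and the reward_model call are illustrative placeholders, not anyone's actual pipeline):

  import torch
  import torch.nn.functional as F

  # Pre-training: cross-entropy on the next token at every position.
  # logits: (batch, seq_len, vocab), tokens: (batch, seq_len)
  def pretraining_loss(logits, tokens):
      return F.cross_entropy(
          logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 0..T-2
          tokens[:, 1:].reshape(-1),                    # targets are the same tokens shifted by one
      )

  # Post-training (RLHF-style): a learned reward model scores the whole response,
  # and the policy is updated to increase that scalar reward.
  def response_reward(reward_model, prompt_ids, response_ids):
      return reward_model(torch.cat([prompt_ids, response_ids], dim=-1))  # one scalar per sequence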

Technically, yes, it's impossible to guarantee that it won't just regurgitate source material (which mostly happens around the tails of the data distribution), but the whole point of training is to build generalized intelligence.


I guess I used the wrong wording, but it doesn't change the argument. Yes, the whole point of training is to build generalized intelligence (or at least that's what we __hope__ for). But as far as I understand, we do it __mainly__ by training the model to predict the next token in the sequence.

PS: You speak of "pre-training" and "post-training", so I'm curious: which do you consider the main part of training?



