You punish it if parts of the answer can be found in its training data, and reward it otherwise.
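Concretely, that could be an extra penalty term applied at post-training time: score each sampled answer against an index of training-data n-grams and subtract the overlap from whatever reward you were already using. A minimal sketch, where the n-gram size, the corpus index, and the penalty weight are all made-up placeholders for illustration:

  # Hypothetical overlap penalty: the more of the answer appears verbatim in the
  # training corpus, the lower the reward. Names and weights are illustrative only.
  def ngrams(tokens, n=8):
      return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

  def overlap_penalty(answer_tokens, training_ngram_index, n=8):
      grams = ngrams(answer_tokens, n)
      if not grams:
          return 0.0
      copied = sum(1 for g in grams if g in training_ngram_index)
      return copied / len(grams)  # fraction of the answer found verbatim in training data

  def shaped_reward(base_reward, answer_tokens, training_ngram_index, weight=1.0):
      return base_reward - weight * overlap_penalty(answer_tokens, training_ngram_index)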



But the whole point of training is that you reward it for correctly predicting the next token.


That's not the whole point of training. It's just (very loosely) the loss used during pre-training. A typical model also goes through post-training and alignment stages that are designed to reward high-quality responses.
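To make the distinction concrete, here is a minimal sketch of the two signals (PyTorch; the shapes, names, and the reward_model call are illustrative placeholders, not anyone's actual pipeline):

  import torch
  import torch.nn.functional as F

  # Pre-training: cross-entropy on the next token at every position.
  # logits: (batch, seq_len, vocab), tokens: (batch, seq_len)
  def pretraining_loss(logits, tokens):
      return F.cross_entropy(
          logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 0..T-2
          tokens[:, 1:].reshape(-1),                    # targets are the same tokens shifted by one
      )

  # Post-training (RLHF-style): a learned reward model scores the whole response,
  # and the policy is updated to increase that scalar reward.
  def response_reward(reward_model, prompt_ids, response_ids):
      return reward_model(torch.cat([prompt_ids, response_ids], dim=-1))  # one scalar per sequence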

Technically, yes, it's impossible to guarantee that it won't just regurgitate source material (which mostly happens around the tails of the data distribution), but the whole point of training is to build generalized intelligence.


I guess I used the wrong wording, but it doesn't change the argument. Yes, the whole point of training is to build generalized intelligence (or at least that's what we __hope__ for). But as far as I understand, we do it __mainly__ by training the model to predict the next token in the sequence.

PS: You speak of "pre-training" and "post-training", so I'm curious: which do you consider the main part of training?



