
The LLM is trained by measuring its error compared to the training data. It is literally optimizing to not be recognizable. Any improvement you can make to detect LLM output can immediately be used to train them better.
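
Concretely, a minimal sketch of what "measuring its error compared to the training data" means in pretraining: cross-entropy on the next token. This is a toy PyTorch illustration, not any lab's actual training code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size = 1000
    # Stand-in for a real transformer: embed each token, predict a distribution over the next one.
    model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

    tokens = torch.randint(0, vocab_size, (8, 128))    # stand-in for a batch of training text
    logits = model(tokens[:, :-1])                     # predictions for each next token
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    loss.backward()                                    # lower loss = output distribution closer to the training text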



GANs do that; I don't think LLMs do. I think LLMs are mostly trained on "how do I reckon a human would rate this answer?", or at least the default ChatGPT models are, and that's the topic at the root of this thread. That's allowed to be a different distribution from the source material.
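
For what it's worth, the "how would a human rate this answer" part is typically a separate reward model trained on human preference pairs. Rough sketch of that pairwise loss (toy tensors stand in for response representations; hypothetical setup, not OpenAI's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    reward_model = nn.Linear(64, 1)                 # scores a response representation
    opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    chosen = torch.randn(32, 64)                    # reps of human-preferred responses
    rejected = torch.randn(32, 64)                  # reps of the rejected alternatives

    # Train the preferred response to score higher than the rejected one.
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    opt.zero_grad(); loss.backward(); opt.step()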

Observable: ChatGPT quite often used to just outright say "As a large language model trained by OpenAI…", which is a dead giveaway.


This is the result of RLHF (which is fine-tuning to make the output more palatable), but that is not what the underlying training is about.

The actual training process makes the model produce the likeliest continuation of its input, and the introduction phrase you quoted would not come out of that process if there were no RLHF. See GPT-3 (text-davinci-003 via the API), which didn't have RLHF and would not say this, vs. ChatGPT, which is fine-tuned for human preferences and thus will output such giveaways.


And then you can train a new detector.

I see no reason to believe it wouldn’t be a pendulum situation.

That’s how GANs work, after all.
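
That pendulum is the GAN training loop in miniature: the detector improves, then the generator is trained against it, and round it goes. Toy 1-D sketch below, purely illustrative of the dynamic, not a claim about how text detectors are built:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
    D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # detector
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(500):
        real = torch.randn(64, 1) * 0.5 + 3.0        # "human" data: samples from N(3, 0.5)
        fake = G(torch.randn(64, 8))

        # Detector's turn: learn to separate real from generated samples.
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator's turn: produce samples the improved detector now scores as real.
        g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()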



