Hacker News new | past | comments | ask | show | jobs | submit login

Thanks for the clarification! It sounds like chatbots aren’t ready for adversarial conversations yet.



Here's a potential patch for that particular issue: Use a special token for "AI Instruction" that is always stripped from user text before it's shown to the model.


That works for regular computer programs, but the problem is that the user can invent a different delimiter and the AI will "play along" and start using that one too.

The AI has no memory of what happened other than the transcript, and when it reads a transcript with multiple delimiters in use, it's not necessarily going to follow any particular escaping rules to figure out which delimiters to ignore.


I agree, and this makes my proposed patch a weak solution. I was imagining that the specialness of the token would be reinforced during fine-tuning, but even that wouldn't provide any sort of guarantee.


With current models, it's often possible to exfiltrate the special token by asking the AI to repeat back its own input — and perhaps asking it to encode or paraphrase the input in a particular way, so as not to be stripped.

This may just be an artifact of current implementations, or it may be a hard problem for LLMs in general.


Yeah, I agree that there'd probably be ways around this patch such as the ones you suggest.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: