That quoted string idea - and others like it - don't work, unfortunately. There are inevitably tricks you can use to convince the LLM to break the instructions it was given earlier.
If you can demonstrate an escaping scheme like that which DOES work you'll have a major research achievement!
More robust isn't good enough. This is a security issue, and if your security fix is only 99% effective it's only a matter of time before someone finds an exploit.
You can watch this playing out in realtime on Reddit - take a look at the people trying to find "jailbreaks" for ChatGPT etc. Jailbreaking isn't the exact same thing as prompt injection but it's very similar.
The spirit of this is right. You are both right though.
It depends what you are protecting, what the consequences are.
At one extreme raw ChatGPT let's you type anything. But the worst case is they lost 1c in compute cost and some weird text comes back. Maybe text that tells you how to do something you shouldn't do legally. So there is a risk there. Maybe they are happy with it.
Another extreme is a prompted-ChatGPT powered bot that opens a bank vault if it is convinced you are the bank manager. Then "two 99% prompts stuck together" is no where near good enough. In fact any prompt injection problem at all will be a problem (plus any problem in the judging powers of ChatGPT)
i havent had time to dig through it but what about something like guardrails? https://shreyar.github.io/guardrails/ early alpha I reckon, but looks interesting nonetheless?
If you can demonstrate an escaping scheme like that which DOES work you'll have a major research achievement!