I think there was a breakdown in communication here.
If I train a classic deep net as a classifier and there are 5 possible classes, it will only ever output those 5 classes (unless there's a bug).
ChatGPT, by contrast, could theoretically introduce a 6th class even if you explicitly told it not to - what I would call an alien failure mode.
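To make the contrast concrete, here's a minimal NumPy sketch (the dimensions, weights, and inputs are made up purely for illustration) of why the classic classifier can't produce a 6th class: the output space is fixed by the shape of the final layer, whereas an LLM's free-text reply has no such structural guarantee.

```python
import numpy as np

NUM_CLASSES = 5  # the closed label set baked into the architecture

def classify(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> int:
    """Final layer of a classic classifier: 5 logits -> softmax -> argmax.

    Whatever the input is, the prediction is an index in {0, ..., 4};
    a 6th class is structurally impossible, not merely discouraged.
    """
    logits = features @ W + b               # shape (NUM_CLASSES,)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs))            # always in range(NUM_CLASSES)

# Toy usage with random weights (hypothetical 16-dim features).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)
x = rng.normal(size=16)
assert classify(x, W, b) in range(NUM_CLASSES)
```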
I think formally / provably constraining the output of LLM APIs would help mitigate these issues, rather than having to use an embedding API, i.e. treating the LLM as a featurizer and training another model on top of it.
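As a rough sketch of what "constraining the output" could mean at the application level (the label set and the `llm_call` stand-in below are hypothetical, not any real API): validate the model's raw reply against the closed label set, so an out-of-set answer is surfaced as a failure instead of silently becoming a new class. Grammar- or schema-constrained decoding would push the same guarantee into the sampling step itself.

```python
from typing import Callable

ALLOWED_LABELS = {"cat", "dog", "bird", "fish", "horse"}  # hypothetical 5-class label set

def constrained_classify(prompt: str, llm_call: Callable[[str], str]) -> str:
    """Constrain an LLM-as-classifier to a closed label set.

    `llm_call` stands in for whatever LLM API is being used; the constraint
    is enforced outside the model, so a "6th class" never leaks downstream.
    """
    raw = llm_call(prompt).strip().lower()
    if raw in ALLOWED_LABELS:
        return raw
    # Anything outside the label set is treated as a failure,
    # not passed along as a new class.
    raise ValueError(f"LLM produced an out-of-set label: {raw!r}")

# Usage with a stubbed LLM call (a real API call would go here instead).
print(constrained_classify("Classify: 'It barked all night.'", lambda p: "dog"))
```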
Formal proof is problematic because English has no formal specification. Some people are working on this; it's a nascent area that brings formal methods (model checking) to neural-network models of computation. But an interesting fundamental issue arises there: if you can't even specify the design intentions, how do you prove anything about them?