yeah, as usual these models can barely sustain a conversation and fall apart the moment actual instructions are given. A typical prompt they fail to understand:
"what is pistacchio? explain the question, not the answer."
all these toy LLMs: "pistacchio is..."
GPT is the only one that consistently understands these instructions: "The question "what is pistachio?" is asking for an explanation or description of the food item..."
This makes these LLMs basically useless for obtaining anything but hallucinated data.
Asking LLMs about things they learned in training mostly results in hallucinations, and in general you have no way to tell to what extent they are hallucinating: these models are unable to reflect on their own output, and average output token probability is a lousy proxy for confidence-scoring their results.
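(To be clear about what proxy I mean — a toy sketch with made-up logprob values, not tied to any particular API:)

```python
import math

# Per-token log-probabilities as a completion API might return them
# (the values here are made up purely for illustration).
token_logprobs = [-0.02, -0.15, -1.30, -0.05, -2.70]

# The naive "confidence" proxy: exponentiated mean log-probability,
# i.e. the geometric mean of the per-token probabilities.
avg_logprob = sum(token_logprobs) / len(token_logprobs)
confidence = math.exp(avg_logprob)

print(f"naive confidence score: {confidence:.2f}")
# A fluent hallucination can still score high here, which is the problem:
# the model is confident about the wording, not about the facts.
```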
On the other hand, no amount of prompt engineering seems to make these LLMs able to do question answering over source documents, which is the only realistic way to retrieve factual information from them.
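By "question answering over source documents" I mean the usual pattern of pasting the relevant passages into the prompt and telling the model to answer only from them — roughly this sketch (the passages and the scoring are placeholders, not a real retriever):

```python
# Toy sketch of document-grounded Q&A prompting: pick the passages that
# overlap most with the question and paste them into the prompt.
passages = [
    "Pistachio trees were first cultivated in Central Asia and the Middle East.",
    "The pistachio is a small tree producing seeds that are eaten as nuts.",
    "Unrelated passage about almond harvesting schedules.",
]
question = "Where were pistachio trees first cultivated?"

def overlap(passage: str, query: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(passage.lower().split()) & set(query.lower().split()))

top = sorted(passages, key=lambda p: overlap(p, question), reverse=True)[:2]

prompt = (
    "Answer the question using only the sources below. "
    "If the sources do not contain the answer, say so.\n\n"
    + "\n".join(f"Source {i + 1}: {p}" for i, p in enumerate(top))
    + f"\n\nQuestion: {question}\nAnswer:"
)
print(prompt)
```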
You're welcome to bring examples of it tho if you're so confident.
I've had ChatGPT build a fully functioning website, write a DNS server, fill in significant portions of specs, all without the problems you describe. I'm never going back to doing things from scratch - it's saving me immense amounts of time every single day. The only reasonable conclusion is that the way you're prompting it is counterproductive.
You can get useful results out of a whole lot of them as long as you actually prompt them in a way suitable for the models. The point I made originally was that if you just feed them an ambiguous question, then sure, you will get extremely variable and mostly useless results out.
And I mentioned ChatGPT because, from the context of your comments here, it was unclear on a first read-through what you meant. Maybe consider that it's possible your prompting is not geared for the models you've tried.
Not least because if you expect a model to know how to follow instructions when most of them have not been through RLHF, you're using them wrong. A lot of them need prompts shaped as a completion, not a conversation.
I have nothing to gain from spending time testing models for you, because whatever I pick will just seem like cherry-picking to you, and it doesn't matter to me whether or not you agree on the usability of these models. They work for me, and that's all that matters to me. Try a few completions instead of a question. Or don't.
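To show what I mean by a completion shape versus a conversation shape (the strings below are just illustrations; adapt them to whatever model you're running):

```python
# Conversation-shaped prompt: fine for RLHF'd chat models,
# but base/completion models often ignore the instruction part.
chat_style = "what is pistacchio? explain the question, not the answer."

# Completion-shaped prompt: set up a pattern and let the model continue it.
# A base model mostly just extends whatever text it is given.
completion_style = (
    "Below, each question is followed by an explanation of what the "
    "question itself is asking, not by an answer.\n\n"
    "Question: what is the capital of France?\n"
    "Explanation of the question: it asks which city serves as the seat "
    "of government of France.\n\n"
    "Question: what is pistacchio?\n"
    "Explanation of the question:"
)
print(completion_style)
```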
"what is pistacchio? explain the question, not the answer."
all these toy llm: "pistacchio is..."
gpt is the only one that consistently understand these instructions: "The question "what is pistachio?" is asking for an explanation or description of the food item..."
this makes these llm basically useless for obtaining anything but hallucinated data.