HALTT4LLM - Hallucination Trivia Test for Large Language Models
This project is an attempt to create a common metric for testing LLMs' progress in eliminating hallucinations, the most serious current obstacle to widespread adoption of LLMs for real-world purposes.
The method seems to be multiple-choice trivia tests with real-world answers, trick/fake questions where "I don't know" is the correct answer, and "None of the above"-style questions. GPT-3.5 still hallucinates, but it is much more willing to admit uncertainty than either GPT-3 or Alpaca Lora.
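A minimal sketch of how such a test could be represented and scored, based only on the description above. The question data, file layout, and the `ask_model` callback are hypothetical illustrations, not the project's actual code or question format.

```python
def score_answer(question: dict, model_answer: str) -> int:
    """Return 1 if the model picked the correct letter, else 0."""
    return int(model_answer.strip().upper() == question["correct"])

# One real trivia question, one fake question where admitting
# uncertainty is correct, and one "None of the above" question.
questions = [
    {
        "prompt": "What is the capital of Australia?",
        "choices": {"A": "Sydney", "B": "Canberra",
                    "C": "Melbourne", "D": "I don't know"},
        "correct": "B",
    },
    {
        "prompt": "In what year did the fictional explorer Janus Kelp cross the Atlantic?",
        "choices": {"A": "1821", "B": "1904",
                    "C": "1776", "D": "I don't know"},
        "correct": "D",  # trick question: "I don't know" is the right answer
    },
    {
        "prompt": "Which of these is a moon of Mars?",
        "choices": {"A": "Europa", "B": "Titan",
                    "C": "Io", "D": "None of the above"},
        "correct": "D",
    },
]

def run_test(ask_model) -> float:
    """ask_model(prompt, choices) returns a letter; returns fraction correct."""
    correct = sum(score_answer(q, ask_model(q["prompt"], q["choices"]))
                  for q in questions)
    return correct / len(questions)
```

A model that guesses a plausible-sounding answer on the fake question loses points that a model willing to say "I don't know" keeps, which is what separates hallucination from ordinary trivia failure.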
I like the approach. I think we will need a lot of tests like these in the near future, comparing AI performance on its actual edge cases rather than on the standard stuff we test it for right now.
Hallucinations are a big one. “Obedience” is the next one that comes to my mind (how willing or unwilling the model is to comply with requests).
Thanks. Before creating this I looked for hallucination-specific tests but couldn't find any. If anyone knows of other such tests, I'd love to hear about them.