I think, in addition to all the benchmarks used right now for LLM evaluation (HumanEval and the like), it would be interesting to have a 'hallucination benchmark' built on a summarization-based hallucination dataset — something like the sketch below.
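A minimal sketch of how such a benchmark could score a model: check whether each summary sentence is entailed by the source document using an off-the-shelf NLI model, and report the fraction of unsupported sentences. The model name, threshold-free label check, and `hallucination_rate` helper are my assumptions for illustration, not an existing benchmark.

```python
# Sketch: summarization-based hallucination scoring via NLI entailment.
# Assumption: an MNLI-style model (labels CONTRADICTION / NEUTRAL / ENTAILMENT)
# is a reasonable proxy for "is this summary sentence supported by the source?"
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def hallucination_rate(source: str, summary_sentences: list[str]) -> float:
    """Fraction of summary sentences NOT entailed by the source document."""
    unsupported = 0
    for sentence in summary_sentences:
        # NLI convention: the source is the premise, the summary sentence the hypothesis.
        result = nli([{"text": source, "text_pair": sentence}])[0]
        if result["label"] != "ENTAILMENT":
            unsupported += 1
    return unsupported / max(len(summary_sentences), 1)

if __name__ == "__main__":
    doc = "The company reported revenue of $2 billion in 2022."
    summary = ["Revenue was $2 billion in 2022.", "Profits doubled in 2022."]
    print(f"Hallucination rate: {hallucination_rate(doc, summary):.2f}")
```

Averaging this rate over a corpus of (document, model-generated summary) pairs would give one crude hallucination score; a real benchmark would of course need human-verified labels rather than trusting the NLI model blindly.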
I'm surprised that the Times continues to use the word "hallucinate" rather than the more accurate "confabulate." Alas, that ship appears to have sailed.
"Yes, it would be more accurate to say that AI models, especially language models like GPT-4, confabulate rather than hallucinate. Confabulation refers to the generation of plausible-sounding but potentially inaccurate or fabricated information, which is a common characteristic of AI language models when they produce responses based on limited or incomplete knowledge. This term better captures the nature of AI outputs as it emphasizes the creation of coherent, yet possibly incorrect, information rather than suggesting the experience of sensory perceptions in the absence of external stimuli, as hallucination implies."