You may be interested in "Deterministic Quoting"[1]. It doesn't completely "solve" hallucinations, but I'd argue it gets "good enough" in several applications.
Yes, extractive QA is one of the improvements beyond the "minimalist implementation" from the article. In our lingo, we'd say that's another way to create a deterministic quotation.
So far, we haven't found extractive QA (or any other technique) to significantly improve overall answer quality when compared to matching sub-string similarity. (I'd be interested to hear if you have different experience!)
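To make "matching sub-string similarity" concrete, here's a minimal sketch of one way such a check could work, using Python's stdlib `difflib`. The function name, threshold, and coverage metric are illustrative assumptions, not the article's actual implementation:

```python
import difflib

def verify_quote(quote: str, chunks: list[str], threshold: float = 0.9):
    """Check an LLM-emitted quote against the retrieved source chunks.

    Coverage = length of the longest contiguous run of the quote found
    verbatim in a chunk, as a fraction of the quote's length. Returns
    (chunk_index, coverage) for the best chunk, or None if nothing clears
    the threshold -- in which case the quote is treated as unverified.
    """
    if not quote:
        return None
    best = None
    for i, chunk in enumerate(chunks):
        m = difflib.SequenceMatcher(None, chunk, quote, autojunk=False).find_longest_match(
            0, len(chunk), 0, len(quote))
        coverage = m.size / len(quote)
        if best is None or coverage > best[1]:
            best = (i, coverage)
    return best if best[1] >= threshold else None
```

A slightly garbled quote ("500mg" instead of "50mg", say) drops coverage well below the threshold, so it's rejected rather than displayed as verbatim.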
There aren't a lot of applications that can be solved purely with substrings of source documentation, so having both LLM prose and quotations in the answer provides benefit (e.g. the ability to quote multiple passages). Now, we could modify the constrained-generation side of things to allow for these, but that gets complicated. Or it can be done with recursive calls to the LLM, but that again requires some kind of DQ check on top.
Ultimately, both styles seem to perform similarly - and suffer from the same downsides (choosing the wrong quote and occasionally omitting useful quotes).
(Good writeup by the way, I've forwarded it to my team, thanks!)
That's really cool, do you think this might be the basis for natural-language navigation? (When going over a document, instead of having to search by keyword or regex, one could search for more complicated concepts in English.)
If not, what extra work is needed to bring it to that level?
I think you could get a pretty good solution for that using RAG and some tricks with prompt engineering and semantic chunking. With Google's very-long-context models (Gemini) you may also get good results simply with some prompt engineering. Preprocessing steps like asking the LLM to summarise the themes of each section can be helpful too (in RAG, this info would go in the 'metadata' stored with each chunk, presented to the LLM alongside each chunk).
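The preprocessing step above could be sketched roughly like this. `summarise` stands in for a real LLM call (stubbed here), and the field names are hypothetical:

```python
# Store an LLM-written section summary as metadata alongside each chunk,
# then prepend that metadata when presenting the chunk at query time.

def summarise(section_text: str) -> str:
    # Stub for an LLM "summarise the themes of this section" call:
    # here we just take the first sentence.
    return section_text.split(".")[0]

def preprocess(sections):
    """sections: list of (title, text) pairs -> list of chunk dicts."""
    chunks = []
    for title, text in sections:
        chunks.append({
            "text": text,
            "metadata": {"section": title, "themes": summarise(text)},
        })
    return chunks

def render_for_llm(chunk) -> str:
    # What the LLM actually sees: metadata header first, then the chunk.
    m = chunk["metadata"]
    return f"[Section: {m['section']} | Themes: {m['themes']}]\n{chunk['text']}"
```

In a real system the chunk dicts would live in a vector store, with the metadata carried along by whatever retrieval library you use.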
A key engineering challenge will be speed ... when you're navigating a document you want a fast response time.
It's all grey, isn't it? Vanilla RAG is a big step along the spectrum from LLM towards search; DQ is perhaps another small step. I'm no expert in search, but I've read that those systems are coming from the other direction; perhaps they'll meet in the middle.
There are three "lookups" in a system with DQ: (1) the original top-k chunk extraction (in the minimalist implementation, that's unchanged from vanilla RAG: just a vector-embeddings match); (2) the LLM call, which takes its pick from (1); and (3) the call-back deterministic lookup after the LLM has written its answer.
(3) is much more bounded, because it's only working with those top-k chunks, at least for today's context-constrained systems.
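The three lookups can be sketched end-to-end. The retrieval and LLM steps are stubbed (a real system would use a vector store and an LLM API); all the function names here are hypothetical, not from the article:

```python
# Minimal sketch of the three "lookups" in a DQ pipeline.

def embed_search(query, corpus, k=3):
    # (1) Top-k chunk retrieval -- stubbed as naive keyword overlap
    # instead of a real vector-embeddings match.
    scored = sorted(corpus, key=lambda c: -len(set(query.split()) & set(c.split())))
    return scored[:k]

def call_llm(query, chunks):
    # (2) The LLM takes its pick from (1) and drafts an answer.
    # Stub: always quote the first chunk verbatim.
    return {"answer": "See the quoted passage.", "quote": chunks[0]}

def deterministic_lookup(quote, chunks):
    # (3) Call-back lookup: only a verbatim match against the SAME
    # top-k chunks counts; anything else is rejected, never displayed.
    return quote if any(quote in c for c in chunks) else None

def answer_with_dq(query, corpus):
    chunks = embed_search(query, corpus)
    draft = call_llm(query, chunks)
    verified = deterministic_lookup(draft["quote"], chunks)
    return {"answer": draft["answer"], "verified_quote": verified}
```

The point of the sketch is the shape: (3) can only ever surface text that already sits in the chunks returned by (1), regardless of what (2) writes.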
In any case, another way to think of DQ is as a "band-aid" that can sit on top of that, essentially a "UX feature", until the underlying systems improve enough.
I also agree about the importance of chunk-size. It has "non-linear" effects on UX.
Unfortunately I don't believe that accuracy will scale "multiplicatively". You'll typically only marginally improve beyond 95%... and how much is enough?
Even with such a system, which will still have some hallucination rate, adding Deterministic Quoting on top will still help.
It feels to me we are a long way off from LLM systems with trivial rates of hallucination.
(1) If the <title> contents (a unique reference string) don't match, then it's trivially detected. Typically the query is re-run (non-determinism comes in handy sometimes), or if problems persist we show an error message to the doctor.
(2) If a valid <title> is hallucinated, then the wrong quote is indeed displayed on the blue background. It's still a verbatim quote, but it's up to the user to handle this.
In testing, when we have maliciously shown the wrong quote, users seem to be able to identify it easily. It seems "irrelevant" is easier to detect than "wrong".
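A toy sketch of the <title> call-back and its two failure modes, with made-up reference strings and passages:

```python
# The LLM cites a quote by its unique reference string; only exact
# matches against the stored passages are ever displayed verbatim.
passages = {
    "doc3:sec2:para1": "Contraindicated in patients with renal impairment.",
    "doc3:sec4:para7": "Monitor liver function every three months.",
}

def lookup_quote(title: str):
    """Failure mode (1): an invalid title is trivially detected here,
    and the caller can re-run the query or show an error.
    Failure mode (2): a *valid but wrong* title passes this check --
    the displayed quote is still verbatim, but relevance is left to
    the user."""
    if title not in passages:
        return None  # hallucinated reference string -> retry / error
    return passages[title]
```

This is why (1) is cheap to catch automatically while (2) has to lean on the user's "irrelevant vs. wrong" judgement.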
The Galactica training paper from FAIR investigated citation hallucination quite thoroughly; if you haven't seen it, it's probably worth a look. Trained-in hashes of citations were much more reliable than a natural-language representation.
I recommend it. I feel I can now read an arbitrary paper, frown a lot, and eventually understand what it's talking about - to the point where I can implement my own buggy version. And hey, I built my own stable diffusion!!
I found the previous version of this course[1] to be a good complement: it's older (predates SD) but I feel it explains core concepts slightly better. Very understandable given how close to the bleeding edge this new version is...
Perhaps an even better complement was Karpathy's famous course[2] - similar material but builds towards GPT instead of SD. The fastai coding style is somewhat esoteric (to me) so it was helpful to contrast with Karpathy's more familiar style. I recommend doing both courses. Also I believe the fastai folks are planning a part 3 which covers LLMs; looking forward to that.
Concepts from part 2 helped my hobby project, a 7-day forecast of renewable electricity and power price[3].
Feels pretty great to have built my own Stable Diffusion and GPT! I am grateful to Jeremy and Andrej.
Depends. Some would benefit from simultaneous, others sequential. I did them in "chunks", starting with fastai, but that was more driven by the release schedule. Personally I'd recommend trying both and seeing which style you prefer; focus on whichever one makes you more excited to get your hands dirty and play with stuff.
I didn't really track... apparently there's 30-ish hours of video, but lectures are just the beginning. The real learning happens when you play and build. The first lesson was released in October I think.
Disclosure: author on [1]
[1] https://mattyyeung.github.io/deterministic-quoting