You may be interested in "Deterministic Quoting"[1]. It doesn't completely "solve" hallucinations, but I'd argue it gets "good enough" in several applications.
Yes, extractive QA is one of the improvements beyond the "minimalist implementation" from the article. In our lingo, we'd say that's another way to create a deterministic quotation.
So far, we haven't found extractive QA (or any other technique) to significantly improve overall answer quality when compared to matching sub-string similarity. (I'd be interested to hear if you have different experience!)
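To make "matching sub-string similarity" concrete, here's a minimal sketch of one way such a check could work, using Python's stdlib `difflib`. The function name, threshold, and coverage metric are illustrative assumptions, not the article's actual implementation:

```python
import difflib

def verify_quote(quote: str, chunks: list[str], threshold: float = 0.9):
    """Check an LLM-emitted quote against the retrieved source chunks.

    Coverage = length of the longest contiguous run of the quote found
    verbatim in a chunk, as a fraction of the quote's length. Returns
    (chunk_index, coverage) for the best chunk, or None if nothing clears
    the threshold -- in which case the quote is treated as unverified.
    """
    if not quote:
        return None
    best = None
    for i, chunk in enumerate(chunks):
        m = difflib.SequenceMatcher(None, chunk, quote, autojunk=False).find_longest_match(
            0, len(chunk), 0, len(quote))
        coverage = m.size / len(quote)
        if best is None or coverage > best[1]:
            best = (i, coverage)
    return best if best[1] >= threshold else None
```

A slightly garbled quote ("500mg" instead of "50mg", say) drops coverage well below the threshold, so it's rejected rather than displayed as verbatim.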
There aren't a lot of applications that can be solved purely with substrings of source documentation, so having both LLM prose and quotations in the answer provides benefit (e.g. the ability to quote multiple passages). Now, we could modify the constrained-generation side of things to allow for these, but that gets complicated. Or it can be done with recursive calls to the LLM, but that again requires some kind of DQ check on top.
Ultimately, both styles seem to perform similarly - and suffer from the same downsides (choosing the wrong quote and occasionally omitting useful quotes).
(Good writeup by the way, I've forwarded it to my team, thanks!)
That's really cool, do you think this might be the basis for natural-language navigation? (When going over a document, instead of having to search by keyword or regex, one could search for more complicated concepts in English.)
If not, what extra work is needed to bring it to that level?
I think you could get a pretty good solution for that using RAG and some tricks with prompt engineering and semantic chunking. With Google's very-long-context models (Gemini) you may also get good results simply with some prompt engineering. Preprocessing steps like asking the LLM to summarise the themes of each section can be helpful too (in RAG, this info would go in the 'metadata' stored with each chunk, presented to the LLM alongside each chunk).
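The preprocessing step above could be sketched roughly like this. `summarise` stands in for a real LLM call (stubbed here), and the field names are hypothetical:

```python
# Store an LLM-written section summary as metadata alongside each chunk,
# then prepend that metadata when presenting the chunk at query time.

def summarise(section_text: str) -> str:
    # Stub for an LLM "summarise the themes of this section" call:
    # here we just take the first sentence.
    return section_text.split(".")[0]

def preprocess(sections):
    """sections: list of (title, text) pairs -> list of chunk dicts."""
    chunks = []
    for title, text in sections:
        chunks.append({
            "text": text,
            "metadata": {"section": title, "themes": summarise(text)},
        })
    return chunks

def render_for_llm(chunk) -> str:
    # What the LLM actually sees: metadata header first, then the chunk.
    m = chunk["metadata"]
    return f"[Section: {m['section']} | Themes: {m['themes']}]\n{chunk['text']}"
```

In a real system the chunk dicts would live in a vector store, with the metadata carried along by whatever retrieval library you use.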
A key engineering challenge will be speed ... when you're navigating a document you want a fast response time.
It's all grey, isn't it? Vanilla RAG is a big step along the spectrum from LLM towards search; DQ is perhaps another small step. I'm no expert in search, but I've read that those systems are coming from the other direction; perhaps they'll meet in the middle.
There are three "lookups" in a system with DQ: (1) the original top-k chunk extraction (in the minimalist implementation, that's unchanged from vanilla RAG: just a vector-embeddings match); (2) the LLM call, which takes its pick from (1); and (3) the call-back deterministic lookup after the LLM has written its answer.
(3) is much more bounded, because it's only working with those top-k chunks, at least for today's context-constrained systems.
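The three lookups can be sketched end-to-end. The retrieval and LLM steps are stubbed (a real system would use a vector store and an LLM API); all the function names here are hypothetical, not from the article:

```python
# Minimal sketch of the three "lookups" in a DQ pipeline.

def embed_search(query, corpus, k=3):
    # (1) Top-k chunk retrieval -- stubbed as naive keyword overlap
    # instead of a real vector-embeddings match.
    scored = sorted(corpus, key=lambda c: -len(set(query.split()) & set(c.split())))
    return scored[:k]

def call_llm(query, chunks):
    # (2) The LLM takes its pick from (1) and drafts an answer.
    # Stub: always quote the first chunk verbatim.
    return {"answer": "See the quoted passage.", "quote": chunks[0]}

def deterministic_lookup(quote, chunks):
    # (3) Call-back lookup: only a verbatim match against the SAME
    # top-k chunks counts; anything else is rejected, never displayed.
    return quote if any(quote in c for c in chunks) else None

def answer_with_dq(query, corpus):
    chunks = embed_search(query, corpus)
    draft = call_llm(query, chunks)
    verified = deterministic_lookup(draft["quote"], chunks)
    return {"answer": draft["answer"], "verified_quote": verified}
```

The point of the sketch is the shape: (3) can only ever surface text that already sits in the chunks returned by (1), regardless of what (2) writes.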
In any case, another way to think of DQ is as a "band-aid" that can sit on top of that, essentially a "UX feature", until the underlying systems improve enough.
I also agree about the importance of chunk-size. It has "non-linear" effects on UX.
Unfortunately I don't believe that accuracy will scale "multiplicatively". You'll typically only marginally improve beyond 95%... and how much is enough?
Even with such a system, which will still have some hallucination rate, adding Deterministic Quoting on top will still help.
It feels to me we are a long way off from LLM systems with trivial rates of hallucination.
(1) If the <title> contents (a unique reference string) don't match, then it's trivially detected. Typically the query is re-run (non-determinism comes in handy sometimes), or if problems persist we show an error message to the doctor.
(2) If a valid <title> is hallucinated, then the wrong quote is indeed displayed on the blue background. It's still a verbatim quote, but it's up to the user to handle this.
In testing, when we have maliciously shown the wrong quote, users seem to be able to identify it easily. It seems "irrelevant" is easier to detect than "wrong".
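A toy sketch of the <title> call-back and its two failure modes, with made-up reference strings and passages:

```python
# The LLM cites a quote by its unique reference string; only exact
# matches against the stored passages are ever displayed verbatim.
passages = {
    "doc3:sec2:para1": "Contraindicated in patients with renal impairment.",
    "doc3:sec4:para7": "Monitor liver function every three months.",
}

def lookup_quote(title: str):
    """Failure mode (1): an invalid title is trivially detected here,
    and the caller can re-run the query or show an error.
    Failure mode (2): a *valid but wrong* title passes this check --
    the displayed quote is still verbatim, but relevance is left to
    the user."""
    if title not in passages:
        return None  # hallucinated reference string -> retry / error
    return passages[title]
```

This is why (1) is cheap to catch automatically while (2) has to lean on the user's "irrelevant vs. wrong" judgement.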
The Galactica training paper from FAIR investigated citation hallucination quite thoroughly; if you haven't seen it, it's probably worth a look. Trained-in hashes of citations were much more reliable than a natural-language representation.
I recommend it. I feel I can now read an arbitrary paper, frown a lot, and eventually understand what it's talking about - to the point where I can implement my own buggy version. And hey, I built my own stable diffusion!!
I found the previous version of this course[1] to be a good complement: it's older (predates SD) but I feel it explains core concepts slightly better. Very understandable given how close to the bleeding edge this new version is...
Perhaps an even better complement was Karpathy's famous course[2] - similar material but builds towards GPT instead of SD. The fastai coding style is somewhat esoteric (to me) so it was helpful to contrast with Karpathy's more familiar style. I recommend doing both courses. Also I believe the fastai folks are planning a part 3 which covers LLMs; looking forward to that.
Concepts from part 2 helped my hobby project, a 7-day forecast of renewable electricity and power price[3].
Feels pretty great to have built my own Stable Diffusion and GPT! I am grateful to Jeremy and Andrej.
Depends. Some would benefit from simultaneous, others sequential. I did them in "chunks", starting with fastai, but that was more driven by the release schedule. Personally I'd recommend trying both and seeing which style you prefer; focus on whichever one makes you more excited to get your hands dirty and play with stuff.
I didn't really track... apparently there's 30-ish hours of video, but lectures are just the beginning. The real learning happens when you play and build. The first lesson was released in October I think.
Disclosure: author on [1]
[1] https://mattyyeung.github.io/deterministic-quoting