Teach your LLM to answer with facts, not fiction (myscale.com)
175 points by jinqueeny 10 months ago | 145 comments



'Facts' aren't as black and white as people think.

   "What does Charmander evolve into?"
   "What does the spell 'avada kedavra' do?"
   "What is the Sindarin word for 'friend'?"
   "What are the names of Santa's reindeer?"
   "Where did Robin Hood live?"
   "Where did Achilles die?"
These are all 'factual questions' you can find answers to from reputable sources like Wikipedia. Google displays 'fact boxes' for several of them. Wolfram Alpha provides answers for three of them. The answers to some of these questions are part of what passes for 'general knowledge' in some societies.

It's no surprise that LLMs trained on human writings produce text claiming that things which aren't true are facts. Humans do that all the time.

There are well attested reputable sources that will tell you Abraham Lincoln was a vampire hunter, others that say he was a Lego Master Builder, and others still will tell you that among his notable quotes is "Party on dudes - be excellent to each other". So what's an LLM to do when it's trying to extend a paragraph of information about Abraham Lincoln?

When an LLM is suggesting what might come next in a piece of text... it doesn't know if it's supposed to guess a probable word from a Wikipedia article, an Onion article, a Project Gutenberg manuscript, or an Archive Of Our Own fanfic. So you get a bit of all that.


> These are all 'factual questions'

Because of elision.

"[Homer wrote] that Achilles died of an arrow in the heel"

This is why the Wiener Kreis taught to use protocolar statements: "<There> and <at that time> <that individual> witnessed <that fact>".


Tangential: I was going to suggest "protocoli(s|z)ed" instead of protocollar, but I Googled "protocollar statements" just to check and found 2 things. First, this page was the top result! Second, "protocolar" (one ell) and "protocolary" are apparently real words. New to me, thanks.


You had me check a few sources for the expressions found in use for the concept: you can find simply "protocols" (meaning that), "protocol statements", the "protocol-sentence debate", "protocollar propositions"...

Edit: oh, by the way, in case of interest: https://plato.stanford.edu/entries/vienna-circle/


In German they were called "Protokollsätze", which translates to "protocol sentences".


(By the way:

> "protocoli(s|z)ed"

the use of '-ize' (a graecism) is indicated by the OED as International English, as opposed to British, American etc. In fact, some call International English "British spelling with -ize" - it is not exactly that, but close. One exception is 'analyse', but that is because linguists compromised on the "difficult" original 'analysize'.)


What's "analysize"? That's not a Greek word.


It's so determined by Fowler; please check this: https://www.etymonline.com/search?q=analyse


Ah, it would have been the correct way to transfer it to English, it says, not that it's in any way original.


Sorry, my imprecision. You spend ages trying to find proper expression, and yet... Well, this proves the importance of the effort.


Haha, that, it does.


I think 'protocollar' is, in this context, a misspelling of 'protocolar' - hence its high placement for "protocollar statements". If I google "protocolar statements", this is the highest result (for me).

https://www.britannica.com/topic/protocol-sentence


> a misspelling of protocolar

It could be. I cannot bring to mind the rules for doubling right now. They both occur, 'protocolar' much more often. I will correct my original post.


> When an LLM is suggesting what might come next in a piece of text... it doesn't know if it's supposed to guess a probable word from a Wikipedia article, an Onion article, a Project Gutenberg manuscript, or an Archive Of Our Own fanfic.

The obvious start seems to be having separate fiction and nonfiction LLMs and not training the nonfiction ones on Archive Of Our Own. People also end up confused about the truth when nobody points out the difference between fiction and nonfiction.


But there's a fundamental issue here. The real strength of LLMs is not just information retrieval, but being able to dynamically recombine that information. Of course that's also their weakness. The reason GPT will regularly produce code with nonexistent API calls is not because it's been trained on 'fictional APIs', but because it's combining various real calls to make new fictional ones.

The obvious answer then is to tell it to make sure that what it's finally outputting is really part of the "real" API, but there's clearly some technical hitch there: it's safe to say OpenAI spent quite a lot of energy trying to solve code hallucinations, and ultimately was unable to do so. I'd guess that the more you restrict its recombination ability, the more you end up with it inappropriately (and incorrectly) regurgitating large chunks of its training input verbatim. Basically it becomes more like a keyword-hunting search engine, and less like a generative LLM.


Yes, and information is much, much different than knowledge.


I kinda like this, but e.g. are research papers fact or fiction?

How about an economics textbook, or an article in The Economist? "A History of the English-Speaking Peoples" by Winston Churchill?

If we restrict to "ground truth we feel very sure about" it feels like available training data might be quite small.


and what if the economics textbook contains "much like Charmander evolves into Charizard, free markets evolve into monopolies"?


Right. A lot of the magic of LLMs probably comes from the broader appreciation of language and cultural reference that they get from being trained on a diverse corpus, rather than just a bunch of dictionaries and reference books.

And anyway - answers to all my ‘fictional facts’ questions above can be sourced from Wikipedia - there’s tons of made up stuff on there.


Hopefully such statements are sufficiently rare that they don't get reinforced, I guess. I don't know. A very real problem occurs with people too when fictional things are repeated often enough without direct mention of their fictional nature.


Which of these is more true: a newspaper article about a battle in the War of 1812, or the Star-Spangled Banner, which was written by someone witnessing a battle in the War of 1812?

Hint: how many stadiums are filled with people standing up to recite a newspaper article about a battle in the War of 1812?


Rather than leaving some text out, it'd be better to label it with its source when training.


> it doesn't know if it's supposed to guess a probable word from a Wikipedia article, an Onion article, a Project Gutenberg manuscript, or an Archive Of Our Own fanfic. So you get a bit of all that.

This is true of base LLMs that are just trained on missing-word prediction over the training corpus, but one of the main points of RLHF[1] is to tune the model to make these kinds of inferences the way a human would expect. For example, if you asked an untuned model to write a poem in the style of ... etc., a valid internet response might be "hmm no thanks, you go first"; you need to steer the model away from replying like this.

I'm not saying it's perfect, but it's wrong to say e.g. GPT-4 has had no information about the difference between a good and bad response and is just generating internet-like text at random, the big players have made progress on this already.

[1] https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...


Right.

Reinforcement learning trains them that question and answer sessions contain answers which statistically correlate with factual statements in their broader learning corpus.

When formulating answers, this leads them to formulate answers that reflect the factual information on which they were trained.

My point is that the source data contains a far muddier range of information than just unarguable facts.

We largely want LLM based Q&A bots to answer questions about fictional or mythical characters in their own terms. As I said, those questions above all have reasonably ‘correct’ answers.

The fact that, given all that, LLMs do as well as they do is remarkable. But it also seems to require us to assume that LLMs are capable of a remarkable degree of cultural nuance, media literacy and contextual awareness in order to figure out the differing authorship, salience, trustworthiness, agendas, biases, and assumptions of all the gigareams of text they've ingested.


“ When an LLM is suggesting what might come next in a piece of text... it doesn't know if it's supposed to guess a probable word from a Wikipedia article, an Onion article, a Project Gutenberg manuscript, or an Archive Of Our Own fanfic”

LLMs are very good at inferring context, so that only really applies if you’re using an un-RLHFed base model with no context given


Here, "supposed to guess" means "having the goal of..."

So no LLM knows what it's supposed to do. If you prefer, you could say it only ever has one goal: to generate a sequence of tokens which are jointly the most probable to occur along with the prompt tokens, given such probabilities in a historical corpus.
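
To make that concrete, here is a minimal sketch of what that single "goal" amounts to under greedy decoding; `next_token_logprobs` is a hypothetical stand-in for the model itself:

    def greedy_decode(next_token_logprobs, prompt_tokens, max_new_tokens=50, stop_token="<eos>"):
        # next_token_logprobs(tokens) -> {token: log P(token | tokens)}; a hypothetical model call.
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            logprobs = next_token_logprobs(tokens)
            # The only "objective": append whichever token the corpus statistics favour most.
            best = max(logprobs, key=logprobs.get)
            if best == stop_token:
                break
            tokens.append(best)
        return tokens

The loop never consults anything beyond those scores; whether that counts as "knowing" or "inferring" is exactly the dispute here.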

This imitates knowledge, goal-directedness, "inferring context", etc. without doing any of those things. Consider what the aim of knowing, goal-directedness, inferring, etc. is --- it is never "consistency with a historical text corpus".

For knowing: that beliefs correspond to the way the world is; for goal-directedness: that one's acts+desires can realise changes; for 'inferring context': that one is sensitive to reasons to speak outside of what is literally spoken.

LLMs are never sensitive to reasons to speak outside of what has been spoken.


What does RLHF do then? I feel like you completely ignored the central point of GP's comment.

RLHF is the difference between GPT-3.5 and ChatGPT, and it's the whole reason why LLMs are suddenly such a big deal. ChatGPT demonstrated that it's possible to give language models a goal beyond just "complete most likely next word" and that they can actually be somewhat competent at achieving those goals despite not being explicitly trained for them.


> competent at achieving those goals despite not being explicitly trained for them.

Well, (1) it doesn't achieve goals, since a "goal" is observer-relative. We have goals; the LLM has a formal optimisation objective which gives it the appearance of goal-directed behaviour (in a similar way, e.g., that it appears pens want to fall when dropped).

And (2), reading your "goal" here even in observer-relative ways, I don't think there's much evidence of this. These models are "trained" on everything ever written, including all of the internet and basically all digitised books. I don't see any evidence of much generalisation -- if you can find it via Google, then the LLM has it stored compressed (i.e., in the "weights").

The innovation in LLMs is being able to compute `max P(answer|prompt, historical_corpus)` for increasingly longer prompts --- there's no innovation in goal-directed behaviour.

That's VC propaganda to disguise the fact that LLMs are mostly an innovation in copyright laundering.


(1) This is a tired, pointless semantic argument. "It doesn't have a goal, it just acts like it has a goal for all intents and purposes. But, you see, it's actually a machine and not a human and therefore it can't really have goals according to my narrow definition of the term." Either point to an actually relevant difference in the resulting behavior or stop objecting when people use human behavioral terms to describe the behavior of machine learning systems. We're all well aware it's a program; that's not the point. (Sorry, just a frustration I have with the larger discussion around this topic.)

(2) "I don't see any evidence of much generalisation" Seriously? So when I tell ChatGPT to rewrite a paragraph in the style of Shakespeare and it does it, despite never being trained to do that, never seeing the source or target paragraph before, and having no information other than my text prompt and its past training, that's not evidence of generalization? And that's only one of millions of different possible tasks that the same model excels at, despite being trained on nothing but a bunch of unstructured text and a few examples indicating its goal should be to follow instructions given in the prompt text. Up until a couple years ago this level of flexibility in a machine learning model would have been considered science fiction by nearly everyone, and now it's "[not] evidence of much generalization". Okay.


Well (1), the reason this distinction is relevant is so we can separate out whether the system has developed a capacity or an apparent capacity.

Is the child a genius or are they just reading out of a textbook? Can the toddler really compose a sonata or did they just press play on the piano keyboard?

(2) This is indeed the power of interpolating between the data points of "everything ever written in human history" as digitised and compressed by ChatGPT.

If you have 1 billion circles of radii 0 to 1, it isn't generalisation for the machine to produce one with a radius of 0.0000100003000001, i.e., one not in the set but a mere interpolation of points within it.

It would be expensive, but imagine "reversing" ChatGPT from its output to the sources which made a non-trivial difference to generating that output.

So the function there is: response -> verbatim text in the training corpus.

Then, maybe, "bolded" by how much each paragraph would "make a difference" to its output.

What you'd find is thousands of pages: everything Shakespeare ever wrote, all papers about Shakespeare, all books about Shakespeare; and so on.

Then, when the bolding was applied and the result summarised a little, the trick would be revealed: it would be apparent how a naive statistical interpolation between sequences of characters could produce the effect.

ChatGPT exists because of ebooks and social media: without them, it could do almost nothing. That is, the appearance of these capacities is strictly derivative of the work of a billion people who had them.

Without vast, unimaginable amounts of work produced on Shakespeare, this system wouldn't work. It's just a copyright laundering system. All the school essays on Reddit, all the forum posts, all of Usenet. All PDFs, all digitised works. All academic papers.

Is this generalisation? Is this a system which starts with little and makes a lot?

Or is it a system which is more like a child reading from a textbook? I.e., one with a haphazard ability to repeat what's already written.

The size of the weights of a modern LLM is sufficient to compress everything ever written in human history: and that's exactly what they do.


It isn't apparent that anything you've just described is relevant. You've described how it works (in a highly simplified way), but that doesn't discredit the end result.

If there's truly a difference between "a capacity [and] an apparent capacity" then you should be able to point out what that difference actually is in practice. A child pressing play on a piano can only play one song. An LLM composing poems can compose billions upon billions of unique, never-before-seen poems about every conceivable topic. Whether under the hood it does that by "interpolating numbers in n-dimensional spaces" or "some incomprehensible arrangement of neurons linked together" or some other, yet-to-be-invented process doesn't matter if the result is the same. The fact that you can explain how something works doesn't make it less real.


This is something which GPT generally isn't confused about though: it knows the answer to these questions and it knows that these are questions and statements about well-known works of fiction. I don't really think this is the source of the tendency for LLMs to make stuff up.


Mellon. The rest are left as an exercise to the reader.

It does always amaze me that we trained LLMs on a dump of the internet and then people are shocked that they're about as trustworthy as a random web page.


People are not shocked and poor training data is not the main reason LLMs are not trustworthy.


The issue here is one of semiotics and morphemology. Mapping meaning into a narrative and ontological protocol is going to be the requisite work if we want the engine to be "smart." As explored in the discussion at hand, tokenization creates a great mimic but it's a parlor trick. We must employ a robust thinking-thing that correlates not only a static, contextually indexed dictionary <lexicography>, we must also route that through a network to distill meaning itself into tokens. Perhaps languages which rely on morphemes for written language - a logosyllabary - are somewhat more or less suited for this task? I ask as a dummy.

There also exists the consideration of allographemical contextualization, the nature of relevance, pragmatics, conjunct identification of context, semantics. To be honest the linguistics side alone is vast. Knowledge and cognition however. . . A whole other ballgame. But the only tool we have to really get down to the bottom of how knowledge works is language, it's to epistemological pursuit what math is to physics.

While GPT is super impressive and can do a lot of quasi-brute-force things, we're only finding now the rudiments of the machined intelligence paradigm, and it will behoove any reader to brush up on their classics, true pursuants of philosophy and many order logic are about to be in high demand if I had to reckon.


I prefer to think that most humans actually distinguish the fictional context, and so should an LLM. As such, if it is to be of any use, it'd better figure out it's fiction if someone's flying on a winged horse, levitating trolls, or (obviously harder) running around a forest with a bow.

And when answering a question, unambiguously specify this fictional context, or at least indicate that it might be fiction if unsure.


How do we handle historical fiction, or even more perilous, how do we handle stories that are "based on historical events"?


I'm not sure a higher level "intelligence" (which some folks think AI is moving towards) should be overly-reliant on human "intelligence", lest it inherit flaws which may outnumber benefits. (Humans believe a variety of outlandish things, such as "Q-Anon has the real facts", etc.)


Absolutely agreed. And I'll just put this here:

    "You need to believe in things that aren't true. How else can they become?"
    - "Hogfather" by Terry Pratchett


source of truth: wikipedia-inference.db.2023

  charmander -> pokemon -> fiction
  avada kedavra -> harry potter -> fiction
  sindarin -> ??? -> infer(fiction or nonfiction)
  Robin Hood -> disambiguation -> ask(user input -> do you mean?)
  ...

This just seems like a categorization and data annotation problem, which I would assume a bunch of projects are trying to solve like this one.
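
As a very rough sketch of what such an annotation pass might look like (the entity table and the fallback behaviour here are made up for illustration):

    FICTION_SOURCES = {
        "charmander": "pokemon",
        "avada kedavra": "harry potter",
        "sindarin": "tolkien legendarium",
    }

    def classify(entity):
        # Known fictional entities map straight to their source universe.
        key = entity.lower()
        if key in FICTION_SOURCES:
            return ("fiction", FICTION_SOURCES[key])
        # Ambiguous names (e.g. "Robin Hood") would go to disambiguation or a
        # "do you mean?" prompt; everything else falls back to inference.
        return ("unknown", None)

    print(classify("Charmander"))  # ('fiction', 'pokemon')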


> What does Charmander evolve into?

wait why is this implied to not be black and white? Charmeleon is the only correct answer.


Charmanders don't evolve into anything; they don't exist in the natural world.


By this logic, "Is Moby Dick a sperm whale" also can't be answered factually because Moby Dick is a fictional creation and doesn't exist in the natural world?


"The Hitch Hiker’s Guide to the Galaxy; is an indispensable companion to all those who are keen to make sense of life in an infinitely complex and confusing universe. For though it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate. In cases of major discrepancy it is always reality that’s got it wrong. So, for instance, when the Guide was sued by the families of those who had died as a result of taking the entry on the planet Traal literally - it said “Ravenous Bugblatter Beasts often make a very good meal for visiting tourists” instead of “Ravenous Bugblatter Beasts often make a very good meal of visiting tourists” - the editors claimed that the first version of the sentence was the more aesthetically pleasing; summoned a qualified poet to testify under oath that beauty was truth, truth beauty, and hoped thereby to prove that the guilty party in this case was life itself for failing to be either beautiful or true."


It is not a good start that they begin with a dictionary definition of Hallucinations. While the similarities to what an LLM does are apparent enough for the term to be used, LLMs are under no obligation to behave similarly to the dictionary definition of Hallucinations.

In general, facts are not the answer to Hallucinations. You can't possibly have every fact for every situation. The true solution to Hallucinations is figuring out how to make a model say "I don't know".


"Hallucination" makes it sound like ChatGPT drank some of the punch without realizing it was laced with LSD. "Bullshit" sounds more like what comes out of an overconfident ass who should or could know better with some better education.


Hallucination is a better descriptor for what an LLM is doing though. A bullshitter knows they don't know, an LLM just strings words together in ways that fit what it "saw" from training data. IMO the main problem with calling them hallucinations is the implication that the true things they say are true on purpose. It's hallucinating the true things too.


Exactly, but no, there should be no «implication that the true things they» output are not part of that: required modules for foundational thinking (let us say "critical thinking") are missing.

(Issue is, now some are convinced that people in general would do the same and just blurt out the feedforward output of their "internal neural network", as opposed to having built knowledge in a loop of critical evaluation.)


In Swedish the term for perception and misapprehension is the same word. Because it is the same thing.


_Both_ are an inappropriate degree of anthropomorphism IMO.


"Confabulation" is the correct and precise term that comports with the English language, rather than being jargon requiring a neologism.


> "Confabulation"

And why would that be? "Hallucination" means "erratic wandering", implying one is lost - similarly to "delirium" (a metaphor using the plough) and "error". Part of the idea is that of "instead of witnessing the correct, reporting the false" - a very ancient, traditional idea, and akin to the concept of "intelligence" (intus-legere).

"Confabulation" means locutor and interlocutor are talking, exchanging narrations.


> In psychology, confabulation is a memory error defined as the production of fabricated, distorted, or misinterpreted memories about oneself or the world. It is generally associated with certain types of brain damage (especially aneurysm in the anterior communicating artery) or a specific subset of dementias.

https://en.wikipedia.org/wiki/Confabulation


Interesting, I will check that more analytically as soon as I am back at the console,

but I am not sure - provisionally - that it can be a good idea to relate strictly human neurology to ANNs, if based on phenomena as opposed to structural issues. You do not have that problem when staying with natural language.


You cannot fix what you cannot measure; here is an attempt to do just that with HallMeter: https://why.network/

Still have to figure out a measurement unit.


How about "falsehood quotient"? Count the number of counterfactual assertions in a given text, then divide by the number of sentences. Of course, the question of what is a falsehood is an exercise for the reader, but this would at least give a unit of measurement, flawed as it is.


The “exercise for the reader” in this case is the entire point of the metric. If it were possible to do at scale it could be incorporated into existing models right now.

Second, it isn’t even necessarily better to have fewer lies if those few lies are more subtle. Plenty of propaganda works by twisting facts and using misleading statements. Perhaps the worst offenders won’t even have any outright falsehoods at all.


That is very interesting input, thanks! Also, splitting into sentences, similar to what sentiment analysis does, would raise the level of complexity of the output metric.


This article suggests that LLMs should use a database as a reference for factual information. Rather than asking LLMs to provide their own answers, it is recommended that they summarize based on the facts extracted from the database. This approach reduces the likelihood of hallucinations among LLMs.
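
In outline, that retrieve-then-summarize flow looks something like the sketch below; `search_facts` and `llm` are hypothetical stand-ins for the database lookup and the model call, not the article's actual API:

    def retrieve_then_answer(question, search_facts, llm, k=5):
        # 1. Pull candidate facts from the database instead of trusting the model's memory.
        facts = search_facts(question, limit=k)
        context = "\n".join(f"- {fact}" for fact in facts)
        # 2. Ask the model to answer only from the retrieved context.
        prompt = (
            "Answer the question using only the facts below. "
            "If they are not sufficient, say you don't know.\n\n"
            f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return llm(prompt)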


We already had databases of facts, like Wolfram Alpha, decades before LLMs, and we largely ignored them. It's ironic that when trying to solve AI problems we keep reverting to these old patterns we've tried since the 80s, and they kept failing. Habits die hard, I guess.

There's a categorical difference between knowing a fact, and looking up a fact. When you know a fact you can recognize it in a situation where you wouldn't know to look it up, and you'd know to utilize it in a larger solution rather than simply parrot it when specifically asked about it.

Databases of facts have and will still have their place, but that is absolutely not the solution to LLMs telling apart fact from fiction. They have to have this innately in their model. I don't believe the nature of LLMs is to hallucinate. It's instead a side effect of how we train them. We train them to guess, to be close, but not necessarily to be correct. And why is it a surprise that that's precisely what they do?

Also, LLMs are too small to be accurate. They're tiny. GPT-4 is roughly 40 times smaller than a human brain. And GPT-4 is very large compared to GPT-3, and GPT-3 is very large compared to LLaMA 2.

We'll need for hardware to catch up so we can scale things up pragmatically and see what happens to their ability to grasp facts. But also architectural changes, of course.


Not to mention our wetbrain software is analog, resonant with the environment, continuous and has a single uptime, in most cases. We should consider developing llms in proto human history style, random noise meets environs, evolve useful signs and symbols based on clusters of semantic embedding. Uno reverse it through a dynamic parallel narrative simulator circuit with range of values for interpretative feedback of context analysis. Assign allomorphic symbols to conceptual clusters. Refine resolution. Add modules for memory and inputs for updating knowledge, shine it up with some polish and you've got AGI


> I don't believe the nature of LLM is to hallucinate. It's instead a side effect of how we train them. We train them to guess, to be close, but not to be correct necessarily.

Throughout this comment you speak about LLMs as if they're animals, or real physical objects. An LLM is a formal model whose job is just to generate a sequence of tokens maximally probabilistically consistent with a corpus of historical text.

A digital machine running an LLM program is a physical object which necessarily generates text based on "guessing", because that's the algorithm it's running. LLMs are "guessing algorithms"; all of Machine Learning is -- it is dumb brute-force analysis of conditional probability.

> GPT4 is roughly 40 times smaller than a human brain

This doesn't make any sense. GPT4 is an abstract algorithm with no "size". The brain has 10^{big number} cells, and GPT4 can be specified with a single real number. Is that the comparison to make? No, both comparisons are incoherent.

A physical device running GPT4 can be given a "size", but it would again have nothing to do with a brain.

LLMs aren't living things where we can "measure their size" and "train them to know, rather than to guess". They are just the equation `max P(answer|prompt, historical_corpus)`.

A machine running GPT4 is just an electrical device generating text according to the rule given above. There is no sense of "training it to do something other than guesswork", and no sense of "size"


Larger models are not always more accurate. Overbuilding a model often leads to "overfitting" the dataset. A good example: the iPhone text prediction model. It now has so much data that the suggested completed words are often useless and irrelevant in context.


This is assuming LLMs are intelligent and can think "hey, I am dumb, I'll look that up".

What they are literally doing is guessing the next word, one word at a time, but doing it really, really well, producing statistically average output over a very large number of inputs.

There is no distinction between understanding "the" vs "a" and telling me 1+1=3. It is all token generation.


What they are doing depends entirely on what decoding algorithm you use. An LLM is mostly a token probability function, but it's not just that - a transformer model is capable of learning anything. Tokens are the interface, not necessarily the implementation.


A transformer can only memorize, it doesn't learn to do.

For what that concerns us here: LLMs will never learn to fact-check anything. They'll blindly regurgitate the facts they have been "taught", but never consider or evaluate "the paper cited for this fact on wikipedia is a bunch of bullshit".

Any attempt to use them to produce "facts" is ultimately just folly, in the same way Google's attempt to do so with its search engine index is.


> [LLMs] never consider or evaluate "the paper cited for this fact on wikipedia is a bunch of bullshit".

Nor do people, though! This is setting the bar way too high.

The whole point of having edited reference sources like "encyclopedias" is that we can rely on the expertise of the editors in lieu of having to develop the expertise ourselves[1].

No, an LLM that simply knows a priori (via prompt hacking) which sources are trustworthy would be absolutely comparable to the way an educated-but-non-expert human approaches sources.

[1] Which is a chicken and egg problem anyway. Everyone starts with edited reference sources as tutorial material. Quite frankly everyone starts learning with wikipedia.


> This is setting the bar way too high.

No. If these things are claimed to be sources of truth, then the bar needs to be that high.

It is precisely because people don't fact-check that the bar has to be so high.


> If these things are claimed to be sources of truth

That's a strawman, though. No service, nor human, "claims to be a source of truth" in the kind of profound sense you seem to be using. It stops, everywhere, at "Wikipedia (or whatever) said it and I trust it".

The only way to get access to deeper expertise is to (1) BE an expert and (2) engage in a discussion with another.


No, a transformer is a universal function approximator and is capable of learning to do anything to some degree of accuracy.

GPT doesn't do math correctly but it also doesn't just memorize it.


It seems to me that LLMs are basically an algorithmic encoding of Occam's Razor. The issue seems to be that what is most probable does not always correspond to what happens, or what makes the most sense to an embodied person.

What is most probable is not always what is most correct or most accurate.


Isn't this a serious simplification? Tokens are just the medium.


This blog is only having an LLM assess what column it should run a query against.

Why is that necessary? Why have an LLM guess where the facts are?

Put all of that data in a place where it's normalized and ready to vector search.


> This approach reduces the likelihood of hallucinations among LLMs.

This has not been my experience. Did you create any benchmarks as a part of this project?


I am the author of this article. Actually, what we tried to do was to replicate the simplest implementation of Retrieval Augmented Language Models by prompting the LLM. There has been a lot of research on this topic recently, like the work from Meta (https://arxiv.org/pdf/2208.03299v3.pdf). I think it can give you a picture of how those RALMs boost performance on general QA tasks.


This idea is a simplified version of Retrieval-Augmented Generation (RAG), and RAG has been studied in various research papers, such as the one available at https://arxiv.org/abs/2005.11401


My experience with RAG is that while it reduces the incidence of hallucinations* significantly (especially if you reduce the LLM temperature to zero at the same time), it doesn't eliminate them.

My startup has a product for lawyers that uses RAG to answer legal queries (https://lawlight.ai/). We have a disclaimer that "... (we) do not guarantee the accuracy of answers. You are responsible for reviewing the cited case law and drawing your own independent conclusions."

(This works within the specific context—lawyers are domain experts; and they are supposed to read through all cases they cite in court anyway.)

* I dislike the term "hallucinations." By definition LLMs hallucinate. It's just that much (or most) of the time, the hallucinations reflect reality.


You know, aside from this being a blatant feature-length advertisement for what they're selling, I almost thought this was a clever idea.

I thought it involved prompting the LLM to write SQL code to query a knowledge base of documents, and index into them, so that you'd know where to look in the original documents for your authoritative answer. So it would be a meta-search agent.

But apparently, they intend the queried documents to feed back into training the LLM? That's just gasoline on a dumpster fire.


I cannot figure out why LLMs are relevant to their solution. This whole thing comes down to a similarity search via vectors.

The LLM layer seems completely unnecessary. Why do you have a schema that requires an LLM to decide which column to query (which is the LLM's only unique value in this proposal)? Why are you not normalizing into a single column?


> so that you'd know where to look in the original documents

Oh, we have something similar: perplexity.ai

It provides a number of sources after prompting its textual result.


I am mostly a novice to the field of LLMs, but as a layman who has a basic but admittedly very rough understanding of how they work algorithmically, I have a hunch that the same thing that makes these LLMs powerful AIs that have interesting emergent behaviors is also what makes them occasionally get things wildly wrong and claim to know things that they do not know. They are supposed to be AIs, not carefully vetted encyclopedias.

Sure, I get that people want to eventually use AI to do life-critical stuff like surgery, and at that point "hallucinations" become a real problem. But we are nowhere close to that point yet, I think, so I feel that the focus on "hallucinations" may be misleading. It is one thing to try to get a 30 year old doctor to not make nonsense up on the fly while at work, that makes sense. But if you try to prevent a 3 year old kid from making up nonsense, that will actually probably hurt his development into a more powerful intelligence.

Note: I know that the current popular LLMs do not actually learn past the scope of a single session, but I am sure that they soon will.


This is correct. Current LLMs work by predicting the next word based on a bunch of preceding words. In other words, they are autocomplete. You can often form a valid sentence on your phone if you click on any text field and then press the automatic suggestions several times.

Transformer-based LLMs are interesting because they are such good version of autocomplete that they can, for example, complete a news article about scientists discovering unicorns, using just the first sentence (this was one of the first public demonstrations of GPT-2). But fundamentally they are still just auto-complete.


LLM chat models are not autocomplete. They can recognize and respond to user text, which is not the same thing as completing it. If you prompted GPT-2 with a question you'd get another question, not an answer.


They are autocomplete; they're just completing something in the form:

Assistant: ...

User: ...

Assistant: ...

And the output is stopped when they start generating the equivalent of "User: " and the reins are handed back to you.

This isn't a problem, autocomplete at the level of "what would a person say next" is outrageously powerful, but it is how they're working afaik.
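
Concretely, that chat "wrapper" is little more than a transcript template plus a stop sequence; `complete` here is a hypothetical raw completion call, not any particular vendor's API:

    def chat_turn(history, user_message, complete):
        # history: list of (role, text) pairs; the model just continues the transcript.
        transcript = "".join(f"{role}: {text}\n" for role, text in history)
        prompt = transcript + f"User: {user_message}\nAssistant:"
        # Cut generation off as soon as the model starts writing the user's next turn.
        return complete(prompt, stop=["\nUser:"])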


> They are supposed to be AIs

I.e. synthetic professionals. (Reliable things. Problem solvers.)


Situation: people try to use these predictive text chatbots as search engines.

Problem: LLMs are not search engines. They extrapolate, interpolate, and approximate (so-called “hallucinations”) so they can always produce somewhat-plausible text completions.

Solution: Create a search engine so good at returning relevant results that even an LLM can make use of it… then go to significant lengths to plug that search engine into the LLM, preventing people from reading the search results directly.

Why not simply give people access to the search engine‽ People know how to use search engines!! This is the fifth time I've seen an article like this, and I'm still… baffled. It's https://xkcd.com/2021/ all over again.


Yeah, this "fix" just shifts the problem to the vector DB and its embedding algorithm. People keep forgetting to mention that embeddings aren't 100% accurate either. The net accuracy may be better but it's not magic.


Yea… my opinion on this is that startups are attempting to force a market for chat bots, instead of accepting that LLM embeddings are best utilized as a search feature, not new product surface area


IMO the most obvious place LLMs will be used is for interfaces.

They enable voice-based interfaces to be practical for normal users for the first time since you really can talk to them in a convincing way.

Translating user input into a set of well-defined commands seems like a better use than searching data to me.


They really want agents to work, and they just don’t yet.


I’m sure you’ve heard of Google-fu. Someone is better at searching than others. I believe they propose that LLM can be superior at producing search queries. Whether they succeed, that remains to be seen, but the idea isn’t that baffling.


Well, search engines and my ability to query them aren't good enough for highly specific or poorly worded questions yet. LLMs are sometimes better in this space.


Indeed, a frequent issue with search engines is knowing the right terms to plug into them, particularly when researching a topic beyond one's scope of knowledge.

Half the power of LLMs as they currently exist is that they can often extract the intention of the user's question in a way that search engines usually can't, allowing them to provide a more useful answer or at least point the user in the right direction.

Perhaps it would make sense for search engines to utilize LLMs to perform this query extraction and suggest more appropriate search terms, engaging conversational interaction only if the suggestions are wrong and the LLM requires further clarification.


> Half the power of LLMs as they currently exist is that they can often extract the intention of the user's question in a way that search engines usually can't

I don’t know about that. Google is pretty good at including “similar” questions that others have asked to what I have queried and often that’s exactly what I needed.


In my experience, Google's ability to suggest "similar" queries is often limited if I don't know the terminology associated with the subject in question. It's decent if you're already in the ballpark, but to extend the analogy if you're stuck trying to figure out where the park is in the first place it's much more hit or miss.


I “fish” with Google when I don’t know the terminology, is what I’m saying. With some luck, Google will have the correct question on the first page of the search results.


I agree. Use the LLM to construct a better search query, then return those results and the query to the user. The user can modify the "optimized" query and repeat until a useful answer is obtained.
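
A sketch of that loop, with `llm` and `search` as hypothetical stand-ins; the point is that the user always sees both the rewritten query and the raw results:

    def assisted_search(question, llm, search):
        # The LLM proposes a query, but the user inspects and controls it.
        query = llm(f"Rewrite this question as a concise search query: {question}")
        while True:
            print(f"Query: {query}")
            for result in search(query)[:10]:
                print("-", result)
            edited = input("Press Enter if satisfied, or type a revised query: ").strip()
            if not edited:
                return query
            query = edited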


Why not show the "raw" search results as well?


I think the author of that title could well do with a refresher course in epistemology and physics, as it is just not possible to do what they suggest. But even more unfortunate is how many people fall for deceptive marketing that really should not even fool the average 16-year-old.


It's weird to see how in such articles (which concern topics that are deeply philosophical by nature), philosophical terms like "facts", "consciousness", "knowledge" etc. are just thrown around as if there was any consensus on what those words even mean.

The whole debate is revolving around hot air, because nobody knows whether the other person is talking about the same thing as themselves.


> I think the author of that title could well do with a refresher course in epistemology and physics, as it is just not possible to do what they suggest.

If you define "facts" as "things actually stored in the LLM's weights", then research shows it is possible to determine if an output is a "fact" or not.

Although looking on arxiv I found a paper saying it doesn't work (https://arxiv.org/pdf/2307.00175.pdf) so maybe not.


If “allowing the execution of arbitrary database queries written by an LLM inside a SaaS application” is the answer, I’d love to know what the question is.


The question is how to make money from LLM hype.


You do that the same way as making money from the blockchain hype. You don't actually use the blockchain - just say that you do.


Or, how to get distracted from the real problem of making it reason. (That comedic elephant in the room.)


Please don’t dump untreated content marketing in the reading fountain.


And how do you solve the problem of discrimination?

(Oh, what a matter:
* all the epistemological debate - hardly a deterministic solution;
* the fact that we cannot train a function approximator through supervised learning;
* the challenge of unsupervised learning;
* the scientific and teleological problem that, if we have an ANN find a solution, what we may want is to go "Ok black box, now teach us how you do it to expand our knowledge (not just our dumb capabilities)...")


Update: I will clarify what I have written above as I realize it is unclear.

You cannot solve the problem of discrimination (in non-trivial cases of true and false, of good and bad) through a deterministic solution, as the epistemological debate did not solve the general problem. You cannot train a function approximator (e.g. an ANN) as a Discriminator through supervised learning, because the problem remains such for human judgement. As for creating a Discriminator through unsupervised learning, I'd like to see how one would frame a proposal; and anyway, if we could create reliable filters, the main question - as usual for a progressive approach to AI - would be to have the oracle in the system teach us the knowledge that we could not achieve with good old thinking.


For the life of me I can't understand why so many are obsessed with LLMs as search engines and knowledge databases when it seems like it's impossible for them to be that.


I’m making some youtube videos with my 4 year old (mostly as a bonding activity). They’re fairly generic fare for that age group and genre - a bit of plot, I set a puzzle for her to solve, it reveals some clues or saves the day.

For the first puzzle she had to put the cities Tokyo, Paris, LA, and Brisbane (where we live) in order. Using colours, since she can’t read yet!

Then we had to discuss why those cities were in that order (answer below).

For the video, I figured I would doctor a Google search to land on a page explaining the answer. The actual article I found that mostly worked was about 12 results down.

Then I tried Bing’s Chat instead. It spat out the correct answer in 2 sentences. Deus ex machina indeed! (The four cities are the Summer Olympic hosts, 2020-2032).

So I disagree that it’s impossible. And I can absolutely see the value - asking “why are these cities in this order?” is a real question, like “what might be causing the squeaking sound when my car brakes?” or “what was the movie Audrey Hepburn made with the photographer?”

Search Engines aren’t great for those kind of questions - just google “What’s a good brownie recipe?”. LLMs can give the user exactly what they want, or even prompt for the additional extra context.

Not that they will always be correct; hence the flaw.


What's not jumping out to me is why you need an LLM for this? What unique code is the LLM actually generating when de-hallucinating?

From the looks of it, they're just advertising a SQL extension that adds vectorization and vector search. Further, it looks like the only thing that the LLM is doing here is deciding which column to run the vector search on. Why is that even necessary? Why are you not pre-processing "vector'd" columns into a normalized format to query against?

They're basically adding an unnecessary LLM step to what amounts to a vector search. In fact, the LLM is essentially blindly deciding which column is the best column to pull an answer from.

-----

EDIT: Just struck me how terrifyingly dangerous this blog is. Really tired of seeing this crap in the LLM community.

The basic premise of this blog is "give an LLM complete access to your database. Let it decide how and where it should pull data from". This is basically useless without talking about how you prevent the LLM from pulling data from places you don't want it to.

A far better and safer approach remains to push your relevant fields to a separate place for your LLM. In the spirit of this blog, you should just index to a new table. More realistically, you should just put this in a vector store.
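
For comparison, a plain similarity search over a single, pre-normalized embedding table needs no LLM in the loop at all; `embed` is a hypothetical embedding function, and the rows are assumed to have been embedded offline at index time:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    def vector_search(query, rows, embed, k=5):
        # rows: list of (text, embedding) pairs built once, offline.
        q = embed(query)
        scored = sorted(rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in scored[:k]]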


That will help a bit, but it's not going to fix it.

Using GPT4 and Code Interpreter, I have asked it to write a function and test it, given some inputs and expected outputs. The function returned different values when it tested it, but it lied and said it worked as expected.

You need to read the code and the test outputs yourself. Or maybe have it write an automated test?

Despite this, it seems quite promising. I expect that in a year or two, some IDEs will come with a useful pair programming feature.


With code interpreter, it actually runs the code, so how did failing tests get interpreted as being correct?


I don't know. Why does a language model do anything?

The "test" was a print statement, not a unit test. There wasn't a failure message. It had to read the output and compare it to the expected value I gave it.

It claimed it got a different result. I guess it didn't really read the result because it strongly expected something else?

If you use an assertEquals() that loudly complains, maybe it's less likely to do this? I haven't seen it ignore stack traces.


How to solve problem X (... but not really) ... use our product.


I think LLMs need to be taught to say "I don't know"/"I am not sure" or something to that effect. Another approach might be to introduce an adversarial "censor" model to guard against hallucination (or inappropriate answers).


Isn't the fundamental issue that it doesn't have any way to tell if what it thinks it knows is or isn't true?

This article sounds like an idea I had independently not too long ago, but with a different goal:

LLMs are great at natural language comprehension, but also have a lot of neurons dedicated to factoids. Using neurons that way is really inefficient, can we split the "language" capability from the "knowledge" capability and have the former just look things up in a database?

My question was more about reducing the size of the network rather than reducing hallucinations, but it's still a separately updatable knowledge resource.

(The answer may actually be "no"; I don't study this professionally, but technical jargon is kinda both factual domain knowledge and also linguistic comprehension, which is why Oracle isn't competing with Starbucks for Java beans).


Many classic statistical modelling techniques have ways to produce some measure of confidence for their predictions; perhaps LLMs could incorporate that as well, e.g. assign probabilities(/perplexity?) to each of the tokens they generate.

> LLMs are great at natural language comprehension, but also have a lot of neurons dedicated to factoids. Using neurons that way is really inefficient, can we split the "language" capability from the "knowledge" capability and have the former just look things up in a database?

I think the beauty of LLMs is exactly that all we need to do is feed them raw text --- the hope, I guess, had been that the models will be able to develop human-like insights by learning to understand and "speak" languages on their own. Introducing "feature engineering" (e.g. the distinction you suggested) would defeat that goal.


> perhaps LLMs could incorporate that as well, e.g. assign probabilities(/perplexity?) to each of the tokens they generate.

They do, that's how tokens are selected. Locally run models or non-chat ones from openai can return the probabilities and you can do things like modify or filter them.


I was thinking more of the entropy of the distribution over tokens, sort of thing --- one may say a model is uncertain about its statements if all its potential variants are equally probable, and vice versa.
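
A rough sketch of that idea: take the per-token probability distributions the model already produces and use their entropy (flat = unsure, peaked = confident) as a crude uncertainty signal. `token_distributions` is hypothetical here; it stands for whatever per-token probabilities a given model or API exposes:

    import math

    def mean_entropy(token_distributions):
        # token_distributions: one {token: probability} dict per generated position.
        entropies = [
            -sum(p * math.log(p) for p in dist.values() if p > 0)
            for dist in token_distributions
        ]
        return sum(entropies) / len(entropies)

    # Higher mean entropy means the model was choosing among many near-equal options;
    # at best this is a weak proxy for "unsure about the claim", not a fact-check.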


I don't think that's possible in the current autocomplete based paradigm. But the LLM can be made to ignore most of its own knowledge. For example, Bing answers most questions by performing an Internet search, even if it could have answered without one. (Sometimes this makes it actually worse, e.g. when it is asked to solve a puzzle, and it gives answers to similar ones from the web, without trying to solve it itself.)


I don't see why this isn't a focus in RLHF. Just heavily discourage confident lies and encourage admission of ignorance. It can't be that much harder than "safety" training. But then we would probably end up with things like "As an AI language model, my knowledge of the world is limited, but if I were to guess, ..."


That sounds like admitting there is no secret sauce?

The superpower has been the ability to synthesize output from very disparate training sources, and any answer for "I don't know" would come from needing to synthesize disparate training sources.


One of the fun things I like to do with these examples is follow along at home. I don't have access to GPT-4 because I refuse to give money to OpenAI, but with GPT-3.5, "What is an LLM hallucination" actually gives:

- The usual knowledge cutoff warning

- An explanation of what a hallucination is

If (in a separate conversation) I give GPT-3.5 the exact prompt that explains what an LLM is, I get gaslighted instead. GPT-3.5 attempts to tell me that LLM stands for "Legal Master of Laws". Then it gives the knowledge cutoff warning, and then the same correct explanation Myscale got.

The rest of this article appears to be trying to turn GPT into a frontend for search engines. I don't know why people keep trying to do this.


LLM, confusingly, does stand for legal master of laws, despite the acronym not quite fitting. It's an advanced law degree some lawyers get. ChatGPT isn't gaslighting you! That's a true fact.

https://law.pepperdine.edu/blog/posts/llm-versus-jd-degree.h...


As jwells89 mentioned, LLMs can extract the intention from questions and generate better queries for a search engine or database.


I don't need LLM to assume my intention, I need exact matches for keywords.


Then you don't want an LLM at all; exact keyword matches are something we did in the mid 90s, one of the specific value-adds of an LLM is that it doesn't get stuck when you've only got a half-remembered inexact quote or vague description.


I really REALLY wish people would stop assigning sentience to LLMs.

> In other words, a hallucination is an error in (or a false) perception of something real or concrete.

An LLM has no "perception"; it doesn't "believe" or "think" that the answers it provides are "correct" or "true" or even "false". It's just autocompleting strings with the most probable next words.

If we keep treating these things as if they're sentient entities that "want" to provide "correct" answers we're going to keep tripping over our false assumptions about their answers.


Not arguing the sentience bit; we don't know what that is, don't have a definition for it, and can't even agree on what it might generally be.

That being said, in what way is what you've described different from how a human first learns, and from what many never grow beyond?


I was hoping this would be a way to train an LLM that somehow knew when to seek out external knowledge and when not to.

I guess this is a pretty unsolvable problem with current architectures. There's just no concrete "confidence" value. I mean, an LLM will give you a probable value for what confidence could be given the words preceding it, but that's an entirely different thing


Does anyone know if there is a “wikipedia-only” LLM, or a way to constrain the knowledge of an LLM (trained on a more limited set of sources, or constrained by limited sources of facts)?

I think an LLM interface to Wikipedia could be useful; at least I imagine it would be.


Would be nice if they could show a gradient score on results showing how certain it is of its answers...

It should be fairly trivial for it to tell you how often it's straight-up lied about something.


It's essentially impossible. Confidence is often a situational measure.

"Is the sky blue?"

* Generally, yes. If you ask a 5 year old, the answer is yes.

* Is the sky blue right now? Maybe, maybe not. You need to look outside. Even then, you might have wild fire haze. Is it still blue? Is it orange? When does blue become orange?

* Is the sky blue in Blade Runner? Doesn't really seem like it.

-----

Further, who are you talking to? Is this trivia night where your best guess is better than no guess? Is this a scientific panel? Do you have alternative options? Do those alternative options align with your opinion? If you're wrong, how wrong are you?


> It should be fairly trivial for it to tell you how often it's straight-up lied about something.

By what mechanism would it "lie" to you? How is it even capable of lying in this sense of the word:

https://en.wiktionary.org/wiki/lie#Verb_2

> To give false information intentionally with intent to deceive.

I guess it could "lie" in other senses, but calling that lying is not really adding clarity to the situation.


It isn't trivial. That's the whole problem.

Luckily my use cases have a manual check built in but even a proxy for confidence would be amazing.


The worst part is when you start questioning how the LLM came up with an answer, and it just adds your suggestion to its answer.


I don't know why so many people keep repeating the trivial fact that we can't "eliminate" hallucinations. We can't eliminate misinformation from Google, social media, or people we know either. The best we can try:

  1) better filter the training data  
  2) design better retrieval and reranking algorithms  
  3) when context information is provided, make it use the sources and cite the sources (use extractive QA to highlight which part of the source is relevant. This is the type of hallucinations that we should focus on as we can compare the generated result and the context to detect the hallucinations)  
  4) make the LLM break down its reasoning into small steps that can be validated individually (CoT, PAL)
There is some research on how to manipulate the logits during decoding to make the generated text satisfy certain constraints. I suspect that we can use these techniques to make the LLM stick to the provided context (a rough sketch of the idea follows the paper list below).

  - Controllable Text Generation with Language Constraints  
  - Classifiers are Better Experts for Controllable Text Generation  
  - Stay on topic with Classifier-Free Guidance
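
As a very rough illustration of the logit-manipulation idea (not a reimplementation of any of the papers above): boost tokens that occur in the provided context and penalize the rest before sampling. A real classifier-guided or classifier-free-guidance setup is considerably more involved.

    def context_biased_logits(logits, context_token_ids, bonus=2.0, penalty=-1.0):
        # logits: {token_id: score} for the next position.
        # context_token_ids: tokens that appear in the retrieved/provided context.
        # Crude additive bias, for illustration only.
        allowed = set(context_token_ids)
        return {
            tok: score + (bonus if tok in allowed else penalty)
            for tok, score in logits.items()
        }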


blogspam


I believe that LLMs should be banned, but if they have to exist, we should teach them ethics first before anything else.


Whose ethics? Should we tell it that sex is a positive thing or a horrible sin?


Those crusaders that prevail are the most ethical.


Do you think your phone keyboard's predictive text should be taught ethics? How? LLMs are just predictive text scaled way up: they don't know or think anything, they just predict the next word repeatedly. They can't learn ethics, but can learn to string words together into sentences about ethics, again just by predicting the next word.


I don't use predictive text on my phone.


What's your rationale for believing they should be banned? How can such algorithms be suppressed? And, is there any precedent for this, in particular that has worked?



