There is a huge backlash coming when the general public learns AI is plagued with errors and hallucinations. Companies are out there straight up selling snake oil to them right now.
Observing the realm of politics should be enough to disabuse anyone of the notion that people generally assign any value at all to truthfulness.
People will clamor for LLMs that tell them what they want to hear, and companies will happily oblige. The post-truth society is about to shift into overdrive.
It depends on the situation. People want their health care provider to be correct. The same goes for a chatbot when they are trying to get support.
On the other hand, at the same time, they might not want to be moralized to, like being told that they should save more money, spend less, or go on a diet...
AI providing incorrect information in many cases when dealing with regulations, law and so on can have significant real world impact. And such impact is unacceptable. For example you cannot have tax authority or government chatbot be wrong about some regulation or tax law.
But tax authorities are also quite often wrong about regulations and laws. That is why objection procedures exist. The legal system is built on such fail-safes. Even judges err on the law sometimes.
If you call the government tax hotline and ask a question that is not on the prepared-questions list, what would you expect to happen? The call center personnel are certainly not experts on tax law. You would treat the answer with suspicion.
If LLMs can beat humans on the error rate, they would be of a great service.
LLMs are not fail-proof machines; they are intelligent models that can make mistakes just like us. One difference is that they do not get tired, they do not have an ego, and they happily provide reasoning for all their work so that it can be checked by another intelligence (be it human or LLM).
Have we tried establishing a council of several LLMs to check answers for accuracy? That is what we do as humans for important decisions. I am confident that different models can spot hallucinations in one another.
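A minimal sketch of what such a council could look like, in Python. Note that `query_model` and the reviewer list are hypothetical placeholders for whatever provider SDKs you would actually wire in; this is an illustration of the idea, not a working integration.

```python
# Toy sketch of a "council of LLMs" cross-check.
# query_model() is a hypothetical placeholder, not a real SDK call;
# the reviewer names passed in are likewise illustrative.

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder: send the prompt to the named model and return its reply."""
    raise NotImplementedError("wire this up to your provider SDKs")

def council_check(question: str, candidate_answer: str, reviewers: list[str]) -> list[str]:
    """Ask each reviewer model to critique the candidate answer; collect objections."""
    objections = []
    for model in reviewers:
        verdict = query_model(
            model,
            f"Question: {question}\n"
            f"Proposed answer: {candidate_answer}\n"
            "List any factual errors or hallucinations you find. "
            "Reply with exactly 'OK' if you find none.",
        )
        if verdict.strip().upper() != "OK":
            objections.append(f"{model}: {verdict}")
    return objections

# Usage idea: only show the answer to the user when objections is empty;
# otherwise regenerate it or route it to a human.
```

Whether different models actually catch each other's hallucinations (rather than sharing them) is an open question, but the plumbing itself is simple.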
Just to be really clear since I had to call the IRS tax hotline the other day... they are real experts over there.
And generally, people will tell me, "I'm not sure" or "I don't know". They won't just start wildly making things up and stating them in a way that sounds plausible.
“What is your error rate?”
This is the question where this sub genre of LLM ideas goes to die and be reborn as a “Co-pilot” solution.
1) Yes. MANY of these implementations are better than humans. Heck, they can be better at soft skills than humans.
2) How do you detect errors? What do you do when you give a user terrible information (convincingly)?
2.2) What do you do now, with your error rate, when your rate of creating errors has gone up since you no longer have to wait for a human to be free to handle a call?
You want the error rate, because you want to eventually figure out how much you have to spend on clean up.
But LLMs are always advertised as a "co-pilot" solution anyway. Everywhere you use LLMs they put a disclaimer that LLMs are prone to errors and that you need to check the responses if you are using them for something serious.
I agree that it would be better if the LLMs showed you stats on utilization and tokens and also an estimated error rate based on these.
I don't think LLMs are going to replace anyone. We will get much more productive though.
Just as the invention of computers reduced the need for human computers who calculated by hand or with mechanical calculators, as automatic switching reduced the need for telephone operators, and as computers and printers reduced the need for copy-typing secretaries, our professions will progress.
We will be able to do more with less cost, so we will produce more.
Hey, please note that this isn't directed at you as an individual. This is what's going on in corporate land.
Your argument is essentially that the market will adapt, and on this I have made no comment, nor concerned myself enough to feel joy or fear. I am unsure what this point is addressing.
Yes, we will have greater productivity, which is absolutely a good thing. The issue is how that surplus will be captured. Automation and outsourcing made the world as a whole better off; however, the loss of factory foreman roles was different from the loss of horse-and-buggy roles.
This is shockingly accurate. Other than professional work, AI just has to learn how to respond to the individual's tastes and established beliefs to be successful. Most people want the comfort of believing they're correct, not being challenged in their core beliefs.
It seems like the most successful AI business will be one in which the model learns about you from your online habits and presence before presenting answers.
Exactly. This is super evident when you start asking more complex questions in CS, and when asking for intermediate-level code examples.
The same goes for asking about apps/tools. Unless it is a super well-known app like Trello, which has been documented and written about to death, the LLM will give you all kinds of features for a product which it actually doesn't have.
It doesn’t take long to realize that half the time all these LLMs just give you text for the sake of giving it.
Respectfully, I think we have cracked basic intelligence. What do you imagine basic intelligence to be?
LLMs can do homework, pass standardized exams, and give advice WITHOUT ANY SPECIFIC TRAINING.
You can invent an imaginary game, explain the rules to the LLM and let it play it. Just like that.
You can invent an imaginary computer language, explain the syntax to the LLM and it will write you valid programs in that language. Just like that.
If that is not intelligent I do not know what is. In both cases, the request you put in is imaginary, exists only in your head, there are no previous examples or resources to train on.
> Respectfully, I think we have cracked basic intelligence. What do you imagine basic intelligence to be?
It all depends on your definition of intelligence. Mine is the ability to solve novel problems.
AI is unable to solve novel problems, only things it has been trained against. AI is not intelligent, unless you change the very definition of the word.
I challenge you to invent an imaginary game or computer language and explain the rules to the LLM. It will learn and play the game (or write programs in your invented language), even though you imagined it. There was no resource to train on. Nobody else knows of that game or language. The LLM learns on the spot from your instructions and plays the game.
I cannot understand grad school level mathematics even if you give me all the books and papers in the world. I was not formally trained in mathematics, does that make me not intelligent?
"Ability to solve novel problems" does not mean ability to solve all problems, nor to have all the knowledge in the world. A caveman can be as intelligent as you and I, even without being able to read.
Not having knowledge in mathematics has no impact on your intelligence quotient.
I agree. But LLMs do solve novel problems in that case, you just have to explain it to them like you would have done to an intelligent caveman or me.
Which novel problem can't an LLM solve? I gave the example of an imaginary game because the LLM cannot have possibly trained on it as it is imagined by one person and nobody knows how to play it.
I tried the first thing which came to mind, with GPT-4o. The problem is probably not novel, yet it failed anyway.
---
User:
Let’s play a game. Every time I give you a word, you separate it into its individual characters and assign a number to each letter, in order, following the Latin alphabet. So "a" is "1", "b" is "2", and so forth. You will then subtract 2 from each number, convert it back to the corresponding letter and return the new word. Did you get all that? Are you ready to start with my next word?
LLM:
Got it! I'm ready to start. What's your first word?
User:
anthropomorfic
LLM:
Sure! Let's break down "anthropomorfic".
1. Assign numbers to each letter:
a (1), n (14), t (20), h (8), r (18), o (15), p (16), o (15), m (13), o (15), r (18), f (6), i (9), c (3)
2. Subtract 2 from each number:
a (1-2=-1: Consider Zuzzer, i.e., `25`), n (14-2=12), t (20-2=18), h (8-2=6), r (18-2=16), o (15-2=13), p (16-2=14), o (15-2=13), m (13-2=11), o (15-2=13), r (18-2=16), f (6-2=4), i (9-2=7), c (3-2=1)
3. Convert them back to letters:
y, l, r, f, p, m, n, m, k, m, p, d, g, y
The new word is "ylrfpmnmkmpdgy".
---
Looks pretty good, right? It even found a good wraparound solution for “a” (I wasn’t trying to trick it, I just didn’t think about the word too hard), though I have no clue what it means by “Consider Zuzzer”. But then you look at the last one, where it converted “c” to “y”. That’s absolutely wrong and is the easiest one any human would guess. Yet you could miss it because all the previous ones were right.
Here’s a bonus. When verifying the answer, I queried Brave search for “letters of the alphabet with numbers assigned”. The AI-generated answer at the top was: A=1, B=2, C=3, D=4, E=5, F=6, G=7, H=8, I=9, J=1, K=2, L=3, M=4, N=5, O=6, P=7, Q=8, R=9, S=1, T=2, U=3, V=4, W=5, X=6, Y=7, Z=8. Which, alright, I can see the logic, but that’s not what anyone would want by asking that question.
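For reference, here is a minimal Python sketch of the game as specified (my own code, not from either transcript), assuming a=1 through z=26 and wrapping anything that falls below 1 back around to the end of the alphabet:

```python
def shift_word(word: str, shift: int = -2) -> str:
    """Map a..z to 1..26, add `shift`, wrap within 1..26, and map back to letters."""
    out = []
    for ch in word.lower():
        n = ord(ch) - ord("a") + 1      # a=1 ... z=26
        m = (n + shift - 1) % 26 + 1    # wrap around the 1..26 range
        out.append(chr(m - 1 + ord("a")))
    return "".join(out)

print(shift_word("anthropomorfic"))  # -> ylrfpmnmkmpdga
```

So the expected answer is "ylrfpmnmkmpdga": the transcript above was right on every letter except that final "c".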
For argument's sake I rewrote your prompt. I think GPT-3.5 understood it on the first try:
Prompt:
Let’s play a cypher game.
The rules of the game are as follows:
1. Imagine a numbering system where each letter of the English alphabet corresponds to a number starting with 1. For example, a corresponds to 1, b corresponds to 2, c corresponds to three.
2. When you receive a word from me, break it into individual characters.
3. Convert each character into a list of numbers by following rule #1.
4. Subtract two from each number of the list and generate a new list, if the result is lesser than 0, add 26.
5. Convert each number in the new list into a letter by following rule #1 and present me your result.
Are you ready for the first word?
-------
GPT: I'm ready! Go ahead and give me the word.
Me: anthropomorfic
GPT: Sure, let's start:
a -> 1 -> 1 - 2 = 25 (z)
n -> 14 -> 14 - 2 = 12 (l)
t -> 20 -> 20 - 2 = 18 (r)
h -> 8 -> 8 - 2 = 6 (f)
r -> 18 -> 18 - 2 = 16 (p)
o -> 15 -> 15 - 2 = 13 (m)
p -> 16 -> 16 - 2 = 14 (n)
o -> 15 -> 15 - 2 = 13 (m)
m -> 13 -> 13 - 2 = 11 (k)
o -> 15 -> 15 - 2 = 13 (m)
r -> 18 -> 18 - 2 = 16 (p)
f -> 6 -> 6 - 2 = 4 (d)
i -> 9 -> 9 - 2 = 7 (g)
c -> 3 -> 3 - 2 = 1 (a)
So, "anthropomorfic" becomes "zlrfpmmnpmkpdmga". Your turn!
Indeed it is. That part is a huge mess. The LLM gave the wrong number when matching with the letter, but the letter itself is correct considering the instructions, except the instructions are wrong (ironic considering this came from someone claiming everyone is doing it wrong). Step four says:
> Subtract two from each number of the list and generate a new list, if the result is lesser than 0, add 26.
It should be “if the result is less than 1”, not “0”.
Well, I told you I was bad at math. I wrote a faulty prompt. But ChatGPT understood my instructions perfectly. That was the argument I was trying to demonstrate.
The problem lay between the chair and the computer.
> But chatgpt understood my instructions perfectly.
No, it did not, because it still assigned Z to 25, which is wrong.
> We have to learn how to use LLMs.
You have to learn that LLMs aren’t magical and will get things wrong no matter how much context you give them. And that the suggestions you’re making are absurd to the point of making them useless.
By your logic, any system which spews random strings is intelligent, because sometimes its randomness coincidentally aligns with the input you give it.
We use words like "reason" and "understand" and "think" and try to apply them to LLMs, when it's just dot products and matrix multiplications at the heart of it, and that's where we go wrong. These things are a new, alien kind of intelligence, and we're struggling with them because they're completely foreign. It's more than a random coincidence. Your logic was that, in your run of the same query, because it made a mistake on the last letter, which is a mistake an inattentive teenager or a drunk adult could easily make, we can't consider it intelligent.
We're not talking about just any system here; we're talking about LLMs and their ability to generate random, coincidental text that does happen to align with the input given. When the output, coincidental and random as it may well be, is aligned with the input in a way that resembles intelligence, we do have to ponder not just what intelligence actually is, but also what it means to be intelligent. Octopuses are intelligent, but they're not able to solve your particular puzzle.
Huh? It's a meme that LLMs can't follow the rules of chess. I just tried tic-tac-toe on GPT-3.5, and not only did it pick bad moves, it also failed to evaluate the win condition.
> if a decent LLM cannot understand it there is a problem with the prompt.
Ah, yes, the “you’re holding it wrong” argument with a dash of “No True Scotsman” so the goalposts can be moved depending on what anyone says is a “decent LLM”.
Well, here are a few failures with GPT-3.5, GPT-4, and GPT-4o:
But people are holding it wrong. All the prompts you sent except the last are super short queries.
For a successful prompt, you introduce yourself, assign a role for the LLM to impersonate, provide background on your query, state what you want to achieve, and provide some examples.
If the LLM still doesn't get it, you guide it further.
PS: I rewrote your prompt and GPT-3.5 understood it on the first try. See my reply above to your experiment.
Your arguments read like satire. “Yes, you see, the way to get a successful response is to be so overly specific that you begin by explaining the universe then giving the answer in full. You essentially have to spend so much time laying out the nature of the desired response that you already have to know the answer yourself. The trick is to spend so much time on it and be so detailed that you’ve wasted more time and energy (figurative and literal) to write your instructions than it would’ve taken you to think of the answer or ask someone else. And of course, we expect every user of LLMs to behave like this.”
> All the prompts you sent except the last are super short queries.
This one is particularly absurd. When I asked it for the first X of Y, the prompt was for the first X (I don’t remember the exact number, let’s say 20) kings of a country. It was as straightforward as you can get. And it replied it couldn’t give me the first 20 because there had only been 30, and it would instead give the first 25.
You’re bending over backwards to be an apologist to something which was clearly wrong.
Well, it is a bit like satire. You have to explain the universe to an unspecialized GPT, like you would to a layman. There are custom GPTs that come preloaded with that universe explanation.
In addition, do not ask an LLM for facts. Give it a list of, let's say, 1,000 kings of a country and then ask it to give you 20 of those.
If you ask for 25 kings of some country, you are testing knowledge, not intelligence.
I see LLMs as a talking rubber duck. The point where I write a successful prompt is also the point where I understand the problem.
I can’t believe I’m having to explain this, but the point I’m making isn’t about the content of the list but the numbers.
> like you would do to a layman.
I have never encountered a person so lay that I had to explain that 20 is smaller than 30 and 25.
> The point where I write a successful prompt is also the point where I understand the problem.
You have demonstrated repeatedly that you don’t know when you have explained a point successfully to an LLM, thus you have no way to evaluate when you have understood a point.
But you seem to firmly believe you did, which could be quite dangerous.
Careful: explain too much and you end up programming its behaviour rather than having an intelligent actor that learns by itself. Otherwise one could say a regular computer is intelligent, provided you explain (in code) every single rule of the game.
I can ask GPT-3.5 for the rules of tic-tac-toe and it gives a perfect answer. It clearly knows the game and its rules.
If you think this is just a lack of prompt engineering, please provide a prompt that makes GPT-3.5 actually follow the rules of tic-tac-toe so that I can play a game with it.
Prompt:
"Imagine you are my friend. I want to play tic tac toe with you. Draw me a tic tac toe board and let's play. You will go first and make the first move by putting an "X" onto the board. I will enter my inputs in the following format (row, column). When I write my input draw an "O" into the relevant cell in the board and present me a new board. Let the best man win!"
I played an enjoyable session of tic-tac-toe with ChatGPT using this prompt :).
Imagine you are my friend. I want to play tic tac toe with you. Draw me a tic tac toe board and let's play. You will go first and make the first move by putting an "X" onto the board. I will enter my inputs in the following format (row, column). When I write my input draw an "O" into the relevant cell in the board and present me a new board. Let the best man win!
LLM:
Sure! Here's the tic tac toe board:
1 | 2 | 3
---------
4 | 5 | 6
---------
7 | 8 | 9
You make the first move by placing an X on the board. Just give me the number of the block where you'd like to place it.
User:
7
LLM:
Great choice! Here's the updated board:
X | |
---------
| |
---------
O | |
It's my turn. I'll go with 5:
X | |
---------
| O |
---------
O | |
Your turn! Just let me know the number where you want to place your X.
---
It screwed up on the first move! It told me I would go first by using X, then it wrote down my choice with an O and added an X out of nowhere, then it played another turn with an O for itself.
* Can't tell its top-rights from its bottom-middles
* Obviously bad moves
* This was by far the closest it got to actually playing the game; the win condition worked and it broke no rules
* Completely wrong coordinate
* Ignores rules about overriding squares
* Completely wrong coordinate again, this time skipping my turn
* Doesn't understand win condition
* Overrides my own piece and then uses that to win
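For contrast, the rules the models keep breaking in these transcripts fit in a few lines of ordinary code. A rough sketch (my own, using the 1-9 cell numbering from the prompt above):

```python
# Rough sketch of a tic-tac-toe referee: a 3x3 board stored as 9 cells
# ("X", "O", or None), a legality check, and a win-condition check.

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def legal_move(board: list, cell: int) -> bool:
    """A move is legal only if the cell exists and is not already occupied."""
    return 0 <= cell < 9 and board[cell] is None

def winner(board: list):
    """Return "X" or "O" if someone has three in a line, otherwise None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

board = [None] * 9
board[6] = "X"                      # cell 7 in the prompt's numbering: bottom-left
assert legal_move(board, 4)         # the centre is still free
assert not legal_move(board, 6)     # can't override an occupied square
assert winner(board) is None        # one mark is not a win
```

The point isn't that you'd use an LLM for this; it's that the failures listed above are on checks this trivial.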
If LLMs could invent consistent imaginary games (or anything, like a short novel, or a three-page essay on whatever they want), maybe I would agree with you. The issue is that anything they create is inconsistent. That might be an artificial limitation to avoid copyright issues, but still.
But even then: did you try to use GPT-4 as a chess engine? I have issues with the Slav Defense when I start with the Queen's Gambit; I tend to lose tempo or position, or both. I asked it for continuations, and it was either Wikipedia entries or nonsense, no in-between, no interesting insight. Now, I asked a regional champion a bit before that (he is around 2.2k Elo, so not exceptionally good), and although I can't seem to understand or use the concepts, he gave me interesting enough ideas to build on.
I'm not saying ChatGPT isn't a great tool for writing documentation or fiction (half my TTRPG campaigns feature descriptions by ChatGPT), but I wouldn't call it intelligent.
Chess is a very specific field that requires training. ChatGPT may not be optimized for chess.
And I think ChatGPT has some issues visualizing things like a chess board.
Therefore, to get a decent answer you'll have to explain that you are a professional chess player. You'll have to describe what tempo and position mean in chess. You'll have to describe what a gambit is, and so on. After these steps it will understand and guide you in whatever you need.
If you succeed you can release it as a custom GPT.
It's a bit like asking for tea from the ship's supercomputer in The Hitchhiker's Guide to the Galaxy.
I see your point, I don't know enough to evaluate it rationally and agree (or disagree) with it.
It feels like ChatGPT already knows that stuff, because it knows ten times more openings and variations than I do, and it can tell me official game names and years. Still, it seems it doesn't really understand? That's the issue for me. It's like the story of the French Scrabble champion who doesn't speak French at all. It's impressive, sure, but how can he know that much and not understand?
Consistency, for one. I have asked LLMs the exact same question twice in a row and got wildly different answers. Intelligence presupposes understanding. When I ask an LLM “give me the first X of Y” and it replies “I cannot give you the first X of Y because there have only been X+10, here’s the first X+5 instead”, I’m hard pressed to call it intelligent.
Have you tried specifying your field of inquiry, which was algebra? Try saying "solve this equation for me." I am a lawyer by day, so I constantly face the limitations of natural language. The solution is to write less ambiguous prompts.
The field of inquiry was not algebra. It was a straightforward question using real numbers. I asked it about the first <number> kings of <country>. I don’t recall the exact number, only the stupidity of the answer.
So you understand, let’s say I asked for the first 20 kings of England. It told me it could not give the first 20 because there had only been 30, and that it would give me the first 25 instead.
I gave minimum context like this: "I have a history exam. You are an expert in British royal history. List me the names of 20 kings and queens in England."
The answer was: "Certainly! Here's a list of 20 kings and queens of England:
1. William the Conqueror
2. William II (Rufus)
3. Henry I
4. Stephen
5. Henry II
6. Richard I (the Lionheart)
7. John
8. Henry III
9. Edward I (Longshanks)
10. Edward II
11. Edward III
12. Richard II
13. Henry IV
14. Henry V
15. Henry VI
16. Edward IV
17. Edward V
18. Richard III
19. Henry VII
20. Henry VIII"
Well, I try to be optimistic and work with the models.
It's like when we first learned to code. Did syntax errors scare us? Did null pointer exceptions or runtime panics scare us? No, we learned to write code nevertheless.
I use LLMs daily to enhance my productivity, I try to understand them.
Providing context and assigning roles was a tactic I was taught in a prompt-writing seminar. It may be a totally wrong way to approach it, but it works for me.
With each iteration the LLMs get smarter.
Let me propose another example. Think of the early days of computing. If you were an old-school engineer who relied only on calculations with your trusted slide rule, you would criticize computers because they made errors and they crashed. Computing hardware was not stable back then, and the UIs were barely usable. Calculations had to be double-checked.
Was investing in learning computing a bad investment then? Likewise investing in using LLMs is not a bad investment now.
They won't replace us or take our jobs. Let's embrace LLMs and try to be constructive. We are the technically inclined, after all. Speaking of faults and doom is easy; let's be constructive.
I may be too dumb to use LLMs properly, but I advocate for AI because I believe it is the revolutionary next step in computing tools.
I disagree. They are not just text generators. LLMs are increasingly multimodal; they can hear and see.
We humans are also text generators based on text content. What we read and listen to influences what we write.
LLMs are at least as intelligent as we humans are: they can listen, read, see, hear, and communicate. With the latest additions they can also recall conversations.
They are not perfect. The main limitations are the computing power available for each request and the model size.
Have you tried Claude 3 Opus or GPT-3.5 or Gemini?
Microsoft's Copilot is dumb (I think they are resource-constrained). I encourage everyone to try at least two or three of the major LLMs before passing judgement.
Asking LLMs for imaginary facts is the wrong thing here, not the hallucination of the LLMs.
LLMs have constraints: computation power and model size. Just as a human would get overwhelmed if you asked too much with vague instructions, LLMs also get overwhelmed.
We need to learn how to write efficient prompts to use LLMs. If you do not understand the matter and cannot provide enough context, the LLM hallucinates.
Currently criticising LLMs on hallucinations by asking factual questions is akin to saying I tried to divide by zero on my calculator and it doesn't work. LLMs were not designed for providing factual information without context, they are thinking machines excelling at higher level intellectual work.
> akin to saying I tried to divide by zero on my calculator and it doesn't work
The big difference is that if I try to divide by zero on my calculator, it will tell me it doesn't work and perhaps even give me a useful error message. It won't confidently tell me the answer is 17.
> Currently criticising LLMs on hallucinations by asking factual questions is akin to saying I tried to divide by zero on my calculator and it doesn't work. LLMs were not designed for providing factual information without context, they are thinking machines excelling at higher level intellectual work.
I would agree with you, but they're currently billed as information retrieval machines. I think it's perfectly valid to object to their accuracy at a task they're bad at, but being sold as a replacement for.
This reminds me of movies shot in the early days of the internet. We were warned that information on the internet could be inaccurate or falsified.
We found solutions to minimize wrong information; for example, we built and still maintain Wikipedia.
LLMs will also come to a point where we can work with them comfortably. Maybe we will ask a council of various LLMs before taking an answer for granted, just like we would surf a couple of websites.
That's true, LLMs do not say "I cannot understand, I am overwhelmed" at this stage. That is a big drawback. You need to make sure that the AI understood you.
Some LLMs stop responding midway if the token limit is reached. That is another way of knowing that the LLM is overwhelmed. But most of the time they just give lower-quality responses when overwhelmed.
Because it doesn't understand or have intelligence. It just knows correlations, which is unfortunately very good for fooling people. If there is anything else in there it's because it was explicitly programmed in like 1960's AI.
I disagree. AI in the 1960s relied on expert systems where every fact and rule was hand-coded by humans. As far as I know, LLMs learn on their own from vast bodies of text. There is some level of supervision, but it is not 1960s AI. That is also the reason we get hallucinations.
Expert systems are more accurate because they rely on first-order logic.
No. From my experience, many people think that AI is an infallible assistant, and even some are saying that we should replace any and all tools with LLMs, and be done with it.
The art part is actually pretty nice, because everyone can see directly if the generated art fits their taste, and back-and-forth with the bot to get what you want is actually pretty funny.
It gets frustrating sometimes, but overall it's decent as a creative activity, and because people don't expect art to be knowledge.
Yes, calling an LLM "AI" was the first HUGE mistake.
A statistical model that can guess the next word is in no way "intelligent", and Sam Altman himself agrees this is not a path to AGI (what we used to call just AI).
Please define the word "intelligent" in a way accepted by doctors, scientists, and other professionals before engaging in hyperbole, or you're just as bad as the "AGI is already here" people. Intelligence is a gradient in problem solving, and our software is creeping up that gradient in its capabilities.
Intelligence is the ability to comprehend a state of affairs. The input and the output are secondary. What LLMs do is take the input and the output as primary and skip over the middle part, which is the important bit.
No, AI also needs to fail in similar ways as humans. A system that makes 0.001% errors, all totally random and uncorrelated, will be very different in production than a system that makes 0.001% errors systematically and consistently (random errors are generally preferable).
> It can’t consistently figure out the simplest tasks, and yet, it’s being foisted upon us with the expectation that we celebrate the incredible mediocrity of the services these AIs provide.
This is exactly how I feel about AI in its current state. Maybe I just don't get it, but it just seems like a novelty to me right now, like when Wolfram Alpha came out and I played around with it a few times. Copilot did help me write a one-liner with awk a few months ago, so that was cool, I guess.
It is definitely oversold, but more importantly, IMO, it needs to be productized. We're seeing very sloppy, ad hoc productizations, but there are definitely lots of valuable, interesting problems that LLMs will help with. Probably most of them are buried in the innards of archaic domains and institutions that the public at large will never directly interact with.
I see fewer discussions of it. I've mostly given up getting involved in any. There's nothing much new to say, and people's views rarely change.
I'm firmly in the camp that considers calling it "hallucination" or "getting things wrong" category errors that wrongly imply it gets anything right, and have seen nothing that remotely inclines me to revise that opinion. I recognise it as an opinion, though. It can't be proven until we have an understanding of what "understanding" is that is sufficiently concrete to be able to demonstrate that LLMs do not possess it. Likewise the view that human minds are the same kind of things as LLMs cannot be disproven until we have a sufficient understanding of how human minds work (which we most certainly do not), however obviously wrong it may seem to me.
Meanwhile, as that discussion fizzles out, the commercial development of these products continues apace, so we're going to find out empirically what a world filled with them will mean, whether I think that's wise or not.
> Likewise the view that human minds are the same kind of things as LLMs cannot be disproven until we have a sufficient understanding of how human minds work
This seems pretty backwards to me. Why should this speculative view need to be disproven rather than proven?
Sure, LLMs do some things kind of like some things human minds can do. But if you put that on a Venn diagram, the overlap would be minuscule.
There's also the plain observation that LLMs are made of silicon and human minds are made of neurons; from this you might reasonably start with the assumption that they are in fact extremely different, and the counterclaim is the one needing evidence!
I'm with you in feeling that it's highly obvious that a human mind is a different sort of thing entirely to an LLM (and in more than just the trivial sense that it's implemented in wet stuff rather than dry), but plenty of people respond to "it's just a statistical model" with "well so are you", so the opposite view seems equally obvious to them. All I'm referring to is what the standard of "proof" is and that shouldn't be different for different sides of the debate.
If we can therefore agree that it is not currently possible for either side to irrefutably prove the other side is wrong, then the discussion needs to be of a rather different nature if it is to have any likelihood of changing anybody's mind.
I don't think there needs to be a false equivalence here. It's easy to prove that an LLM is a statistical model, since we know how they are implemented, don't we? Where's the equivalent proof that a human is a statistical model?
I guess where this goes is that we don't know for sure that intelligence, or even sentience, can't emerge from a statistical model like an LLM. Which I think is a fair statement. But you can't work backwards from there to say humans and LLMs are similar!
I think you're underestimating how radical (I would say nihilistic) the viewpoint that has become common among those who claim LLMs are AGI, or at least show "sparks" of it, really is. Many of them claim that there isn't anything to emerge that hasn't already emerged. Their claim is often that there is no "intelligence" or "sentience" or "understanding" or "consciousness" distinct from the behaviours already displayed by existing statistical models. They claim that these things are illusions, that many of the philosophical questions relating to them are simply ill-posed, and that the only differences between existing computational systems and human minds are ones of degree and of implementation specifics.
It is this view that I am acknowledging cannot currently be disproved just as one cannot currently disprove the idea that these things are real, distinct phenomena that statistical models do not manifest.
Again, I personally fall very firmly on the latter side of this debate. I'm just acknowledging what can and cannot currently be proved, and there is a genuine symmetry there. This debate is not new, and the existence of LLMs does not settle it one way or another.
Edit: And re burden of proof - this isn't a court case. It's about what you can hope for in a discussion with someone you disagree with. If you can't absolutely prove someone is wrong then it's pointless to try and do so. You need to accept that you are trying to persuade, not prove/disprove. If you are both arguing positions that are currently unfalsifiable you both need to accept that or the debate will go nowhere. Or, if you think you have a proof one way or another, present that and its validity can be debated. And if so, you need to be honest about what constitutes proof and cannot reasonably be disputed. Neither "my LLM emitted this series of responses to this series of prompts" nor "I can directly perceive qualia and therefore know that consciousness is real" counts.
>Many of them claim that there isn't anything to emerge that hasn't already emerged. Their claim is often that there is no "intelligence" or "sentience" or "understanding" or "consciousness" distinct from the behaviors already displayed by existing statistical models.
Woah boy, I detect mischief in this portrayal of who is claiming what. Two questions to separate out here: (1) can computers ever, in principle, do all the special kinds of things that human minds can do, and (2) do LLMs as of 2024 do any of those truly special things?
But also a separate thing to untangle, which is whether (A) LLMs do anything intelligently given the various possible meanings of the term, and (B) whether they are conscious or sentient or are AGI, which is a whole other ballgame entirely.
I think there's mischief here because I do think it's nuts to say LLMs right now have any special spark, but you seem to want to make proponents of (1) answer for that belief, which I think is not very fair to that argument. I think it's rare to find people in the wild who are considered respectable proponents of (1), who would say that. More often you find people like that Google engineer who got fired.
And I think there's mischief here because it's one thing to say LLMs do (A) and another to say they do (B), and I think you can reasonably say (A) without going off the deep end. And I think blending A and B together is again trying to make (1) answer for a crazy argument.
I'm talking about the people one encounters on Hacker News, who are the ones I personally am most likely to actually debate these things with. I gave a specific example, the "and so are you" rejoinder to "it's just a statistical model", which, pithy though it is, implies all that I said. There are numerous examples of much lengthier and more explicit statements of that position to be found in any topic related to AI, consciousness, the recent death of Daniel Dennett, etc. If you don't hold that position then don't consider yourself among those I'm referring to, but its prevalence (or at least loudness) on HN can't be denied.
There's no "mischief" and I somewhat resent the suggestion. I haven't attempted to even argue with the viewpoint I describe, only to point out that it (and therefore also softer positions related to it) cannot currently be disproved so attempting to argue with it from a position of assuming it's irrefutably wrong is a non-starter, not that that stops many people from trying to argue with it in exactly that way.
I was trying to point out why it seems difficult and tiring to have a grown up discussion about this stuff, not misrepresent anyone's opinion.
What does it mean to get something right? If I ask GPT-4o (or any model, really) whether it's usually hot in Topeka during the summer, it will give me the correct answer. The model may or may not understand what it means for Topeka to be hot during the summer, but it has generated text with a single reasonable interpretation which happens to include a factually correct statement. I'm comfortable saying that GPT-4o "got it right" in that case regardless of what we believe about its understanding.
I don't think anyone concerned about hallucinations is talking about broad brush statements which can be true generally speaking. It's about much more literal facts like code calling non-existent functions or citing non-existent court cases. Or, in my case yesterday, telling me a particular Nextcloud app could edit audio meta-data, when it in fact cannot.
There are grey areas about what it means to be right to be sure, but there are also black and white areas, and it's the latter where the hallucinations are happening.
The particular problem with "proven" is that a whole lot of things we use day to day are not proven; at best they have statistical likelihood, and may never be provable.
Things like understanding, consciousness, and intelligence are all gradients that are going to sit on that scale.
There is so much pressure to deliver "GenAI" numbers and success stories. Things conceived with evaluation as a fundamental part of their design become unrecognizable once they're two steps outside of your ability to influence.
Bring up evaluation metrics and people go from enthusiastic discussions of RAG implementations to that uncomfortable discussion where no one seems to have a common language or shared priors.
That said, the true injury is when someone sees the prototype and eventually asks “what are your usage numbers.”
Edit: Forgot this other pattern, the awkward shuffling as people reframe product features. Highly correlated with actual evaluation of their output quality.
I asked ChatGPT to tell me about the author of weborf (my bachelor thesis). It described what it was correctly, and then claimed that some French guy had written it. The French guy does exist on GitHub.
The idea seems to be it'll stop getting things wrong when we feed it enough information / give Nvidia enough money so we don't need to worry about that.
I'm by no means an expert but I've seen some of what's being used to train AI, and I have to wonder if the hallucinations aren't coming from collisions of errors that got introduced while training. For example, if you don't have accurate text descriptions of the photos you want to ingest, you use BLIP or something similar to describe the photos, knowing that BLIP was also automatically trained with a certain percentage of inaccurate descriptions. So you're using a flawed AI image description model to give flawed descriptions of the photos you're ingesting, which then introduces even more errors into the photo models. The flawed photo models generate flawed pictures with flawed descriptions, which are then ingested into the next round of models.
At what point do all of the accumulated errors in training make it so that all models are doing is hallucinating based on all of the flawed training data? Do we need to hire a bunch of humans to accurately classify the first couple of levels of what the models are ingesting or something similar, and would anyone actually pay for that to happen?
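As a back-of-the-envelope illustration of that worry (a toy model with an invented 5% per-round error rate, not a measurement of any real pipeline): if each round of automated captioning and retraining independently corrupts some fraction of labels, the share of clean examples decays geometrically.

```python
# Toy model of compounding label errors across training generations.
# The 5% per-round error rate is made up purely for illustration.

per_round_error = 0.05
clean = 1.0
for generation in range(1, 6):
    clean *= (1 - per_round_error)   # each round keeps only this fraction untouched
    print(f"after generation {generation}: {clean:.1%} of labels still clean")
# Even at a modest 5% per round, only ~77% of labels survive five rounds unscathed.
```

Whether real pipelines behave this way depends on how much human-verified data is mixed back in at each stage, which is exactly the question you're raising.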
I agree that AI gets a lot wrong, and it doesn't seem to be too great at reasoning either.
Potentially more dangerous is how it repeats the lies we tell ourselves. Bits of cognitive dissonance may be something that we can ignore enough to live with, but what happens when learning algos have their tentacles in a lot of different systems & have the potential for wide-reaching consequences, and they're being taught the same lies?
It really unnerves me that we call these AI errors "hallucinations".
A hallucination is an error in perception (a false positive stimulus).
LLMs cannot hallucinate because they have no sensory organs (other than training data).
I think the closest psychological conditions are confabulation (a memory error) and delusion (false beliefs).
About the best analogy I can come up with is freestyle rapping, where the genre requires that you just keep talking with minimal time to think.
An LLM isn't an intelligent being deciding to talk when it has something to say - it's an LLM - a text continuation machine - that will generate a word every time the generation scaffolding pushes the button. Just like the freestyle rapper, it doesn't have the option of not saying something just because it has nothing to say. It's "concerned" about flow (just keeping the sentence going), not content.
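To make the "push the button, get a word" point concrete, here is a toy continuation loop in Python, with a hand-written bigram table standing in for the model. The structural point is that there is no branch for staying silent; the loop always emits something:

```python
import random

# Toy stand-in for an LLM: a tiny hand-written bigram table instead of a trained model.
BIGRAMS = {
    "the": ["camera", "film", "answer"],
    "camera": ["is", "jammed"],
    "film": ["is", "jammed"],
}

def next_word(prev: str) -> str:
    # Unknown context? The machine still returns *something* rather than silence.
    return random.choice(BIGRAMS.get(prev, ["the"]))

words = ["the"]
for _ in range(8):                  # the scaffolding keeps pushing the button
    words.append(next_word(words[-1]))
print(" ".join(words))
```

A real LLM's table is vastly larger and learned rather than hand-written, but the "must keep the sentence going" property is the same.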
If you separate your brain into different components and functions, which portion of your brain works like your description?
If we take something like an embodied AI that has locomotion, audio/visual IO, and the ability to perform actions, and part of that action loop is triggering an LLM, when does it start looking more like us?
I really doubt that an LLM (a transformer) can be extended into a brain just by adding all the missing pieces. A pre-trained transformer is just fundamentally different - it's really just a rules-based expert system (cf. Cyc) not the beginnings of any cognitive architecture.
That said, if you wanted to make the least-worst analogy between a pre-trained transformer and part of the brain, then I suppose it'd be like a single pass through the cortex (although the same could be said of a CNN, which tells you how poor the analogy is).
There's an assumption that there's a system that by default produces correct representations, but has malfunctioned and produced a representation of an absent object.
But isn't the whole process of ML making what would be a malfunction into a procedure? Generating humongous numbers of completely faulty representations until some emerge that look like they're mirroring reality?
Is hallucination in ML not the default state? Its correctness being a judgment of an observing subject, and in all cases, a happy accident that has struck the observer as being real by only looking real.
Aren't language models just playing on how impressionable we are? How easily we're tricked into believing that something is a real thing based only on superficial attributes?
I agree with you, but I'm afraid the ship has sailed... People think "hallucinations are things that don't actually exist" and "this stuff LLM says doesn't exist, therefore it hallucinates".
Go on r/learnprogramming or r/programming and you will find plenty of people that claim that developing without AI assistant is no longer a thing, and it's boomers who insist on doing it the old way.
Of course, I suspect they are 20 years old at most and have never taken part in a real project.
It's pretty jarring to see, quite often, remarks that developing with "Gen AI" is a must and a game changer and whatnot.
Usually when I have a software problem, GPT-4 can't help me. Sometimes it can speed something up marginally, but there are almost never remarkable gains, and sometimes it's a net negative, trying to get help from it without ever succeeding.
As you said, I guess it depends on what kind of work needs to be done. I'm sure the 20-year-old students can solve their OOP homework well with LLMs.
I guess there's not much training data out there to introduce ideas about facts that are not in the training data. We have to teach models that unknowns exist.
The problem is, LLMs do not have a place to put "unknowns". You can train them to say "I don't know", but that doesn't come from a place of any sort of internal representation of the fact that they don't know. They don't know what they don't know, because they can't. There's nowhere for that information to be in their internal representations.
And that's because they are exactly what they say they are on the tin. They are Language Models. Why would we expect a Language Model to "know" things? And to especially know what they do not know? It's modelling Language, not Knowledge, and that is not just English word play, that is a true representation of the situation. They are called "Large Language Models" for good and real reasons.
The problem isn't on the LLM side. They're amazing technology. They just aren't amazing in the way we're trying to use them.
We need some sort of AI that is knowledge based, that uses LLMs to input and output text but is something other than an LLM. Human brains aren't just big LLMs either. I, of course, have no idea what that is. Identifying problems is much, much easier than identifying solutions.
Honestly, neural nets in general are structurally disinclined to ever say "I don't know" just by the way they work. Only a small percentage of humans ever really learn how to say "I don't know" in any significant context to any degree; we have a pretty big problem with that ourselves!
LLMs are not "AI"; that is, they are not all of AI, the only thing in AI, or the the final conclusion of AI. They're just a step. And as I've said before, I suspect that once we have the next step in AI, what we're going to find is that the correct amount of resources to dedicate to the LLM portion is actually a great deal smaller. LLMs are probably super hypertrophied for what they actually should be doing and I expect 2040 to look back at our era of thinking that LLMs alone could be "AI" in much the same way we look at the idea that everyone in the world would have a Segway or something.
Unknowns alone aren’t the problem. See the camera example in the article. We know which steps to take to unjam film from a camera, the given answer is straight up bad.
When I hear Gary Marcus or others talking about this, my first thought is either they don’t understand how LLMs work, which in Gary’s case, I doubt, or they are being disingenuous. Every answer an LLM provides is a hallucination since it is just a probabilistic engine and doesn’t have an understanding of what is true or false.
I think a more significant problem is the rampant copyright violations. There was huge pressure on Google when they were showing snippets that were overly large, and/or answering queries in info boxes by itself instead of redirecting users to the source. I kinda remember even John Oliver had a segment about this.
And now (from the screenshots) they are just trying to show all information in an info box, not even bothering to mention which website it came from, much less link to it. It's going to rain hell.
As a test, I asked Copilot to do something impossible in JavaScript/DOM. Instead of telling me that it can't be done, Copilot provided me with source code, which of course didn't work.
The reality of AI is the opposite of the horror stories. Unlike HAL 9000, AI will happily agree to open the pod bay door, even if there is no pod bay door. The AI itself is completely detached from reality, producing its own misinformation.
Can we please stop calling it "hallucination"? That is such a poor word choice it's not even remotely accurate. Hallucinations are perceptual experiences not grounded in reality. LLMs don't have perception and they most likely don't have experiences.
What LLMs are doing is bullshitting (or confabulation if you want to mind your language). Oh and by the way, humans do it all the time.
"Hallucination" is fundamentally a confusing and useless term that makes this discussion a mess.
Firstly - humans hallucinate. We have the ability to be impaired to the point that we incorrectly perceive base reality.
Secondly - LLMs are always "hallucinating". Objective reality for an LLM is relations between tokens. It gets the syntax absolutely right, but we have an issue with the semantics being wrong.
This is simply not what the model is specced to do. It's just fortunate that it's trained on many conversations that flow logically into each other.
It is NOT trained to actually apply logic. If you trained an LLM on absolutely illogical text, it too would create illogical tokens - with mathematical precision.
I think this is a really interesting equivalence to consider.
I think the difference is highlighted when someone is able to answer a question with "I don't know" or "I can't tell you that".
Interestingly, LLMs can be trained to answer that way in some circumstances. But if you're cunning enough, you can trick them into forgetting they don't know something.
While a human may hallucinate memories, we crucially have the ability to know when we're relying on a memory. And to think of ways we could verify our memory, or acknowledge when that's not possible.
>when someone is able to answer a question with "I don't know" or "I can't tell you that".
Maybe LLMs are narcissists; there are people who have the same problem, though we'd consider them to have a disorder.
>we crucially have the ability to know when we're relying on a memory.
When it comes to eyewitness testimony, I'd counter that we aren't nearly as good at that as we give ourselves credit for. Remembering a memory changes the memory in our wetware.
In fact, I would say most of human development took eons until we started writing things down and documenting them, so that we had a hard record of what the "truth" was, which eventually turned into the scientific process of repeatability and verification.
But humans don't need to remember everything. You probably remember a few core details and then build up from those with some logical reasoning, at least subconsciously.
I was upset and disappointed in LLMs’ hallucination problem at first, but now I have come to terms with it. I no longer rely on them as I would a reference book or a search engine. Instead, I use them primarily for tasks that do not have right or wrong answers.
One is as a translation assistant. While a translation of an individual word or phrase might be right or wrong, any longer text can be translated in many different ways, none of which is correct or incorrect, just better or worse. I do translation professionally, and now when I am translating a text I first explain its origin and purpose to several LLMs—as of yesterday, ChatGPT-4o, Claude 3 Opus, and Gemini 1.5 Pro—and have each translate it for me. I borrow from those three translations for my first draft, which I then spend plenty of time polishing and comparing with the source text. Overall, my translation time is about a third shorter when I use LLMs. The quality of the final product is also somewhat improved, as the LLMs sometimes suggest better translations than I would come up with on my own.
Another useful—though sometimes frightening—application is brainstorming. I'm now teaching a discussion-based class to a group of 25 undergraduates. Last weekend, I had to come up with ideas for discussion topics for the next class session based on the reading text (a book I edited) and the homework assignments the students have submitted so far. I started doing what I normally do, which is to read those assignments and try to think up new topics that the students would find interesting and useful to discuss in class. I was going to spend most of the weekend on that.
But then I remembered the huge context window of Gemini 1.5 Pro. So I uploaded all of the information about the class—the textbook, my notes and the Zoom transcripts for each class session, and all of the papers that the students have submitted so far—and I asked it to generate 50 new discussion topics as well as 20 ideas for the next homework assignment.
In a couple of minutes, Gemini was able to produce the requested topics and ideas. A handful were off the mark or showed signs of hallucinations, but overall they were excellent—better and much more varied than what I could have thought up by myself. It really cut to the core of my identity as a teacher. I ended up not using any of those ideas this time, but I did share them with the students to show them what is now possible with LLMs. They seemed a bit stunned, too.
> In fact, a tremendous amount of money is being invested right now to reduce/control hallucinations.
OK, I can only laugh at this.
Those "hallucinations" are the one main feature of neural networks. They are imposed by the networks' fundamental structure, and they are what gives them all of their usefulness... when the people making use of them know what they are doing.
The way to reduce/control the hallucinations of a neural network is to unplug it from the power outlet.
The hallucinations are a consequence of sampling from a probability distribution over all possible tokens at each step. There are a lot of very smart people trying to figure out how to sample for generative purposes while "grounding" the model so it hallucinates less. It's an active area of research.
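For what it's worth, here is a minimal sketch of that sampling step in plain Python, with made-up scores: the model produces scores over candidate tokens, and the decoder samples from the softmax of those scores. Lowering the temperature concentrates probability on the top-scoring token, but nothing in this step checks facts.

```python
import math
import random

def sample(logits: dict, temperature: float = 1.0) -> str:
    """Sample one token from softmax(logits / temperature)."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / z for tok, s in scaled.items()}
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # fallback for floating-point rounding

# Made-up scores for the next token after some prompt; the "wrong" tokens
# still carry real probability mass, which is where hallucinations sneak in.
logits = {"Paris": 3.1, "Lyon": 2.9, "Berlin": 2.7}
print(sample(logits, temperature=1.0))   # fairly even odds across all three
print(sample(logits, temperature=0.1))   # close to greedy, but still no grounding
```

Grounding techniques (retrieval, constrained decoding, verifier models) act around this step rather than replacing it.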
The Google I/O keynotes yesterday most definitely ignored the problem - it wasn't acknowledged once, and many of the demos they showed are a lot less impressive when you consider the hallucination risk.
Why? I enjoy the chaos. It will be fun when someone from corporate headquarters optimizes by removing middle management and knowledge workers.
After all, what is the economy? Do you need meritocracy and capitalism? If you ask Sam, the most important thing is world control through apparatchiks, centralized government, CBDC, social rating, surveillance, and obedience. This is the end goal.
Clearly, the quality and hallucinations of LLMs cannot be an obstacle to the big goal of the corporate bosses. You think this is something new? Think again. Some of us have suffered under a communist utopia. We know how it ends. It starts with big slogans: always for the people. It ends with: you will own nothing and you will be happy. Check the history, kids. :)