ChatGPT makes exactly the same mistakes as a colleague who has no actual experience in a certain matter: they present how they reasonably think something probably should be or work, in the absence of factual knowledge about the insane mess that was actually created in reality.
Just ask a technical colleague who has no deep experience in the specifics how they think e.g. SharePoint, SAP or Microsoft Identities work. The architectures they extrapolate from sane logic will be very far from the incredible craziness of the actual reality.
I still think there’s a big difference. If asked to speculate on the architecture of Sharepoint, my response would be “I have no earthly idea.” If pressed further for an answer, my response would be “It’s probably some over-complicated mess and I don’t care enough about Sharepoint to spend any further time on this line of questioning.” I have yet to see ChatGPT just admit it doesn’t know, but in this case “I don’t know” is the most trustworthy answer I, as a technical person, could give.
It is fundamentally incapable of "knowing" anything. It is a statistical engine, with no internal representation of the world or understanding of the words it emits.
It clearly does have an internal representation of the world, implicitly encoded in its network weights. Quite an accurate one too, for most general knowledge, and one that can be updated in-context - it handily solves "blocks world[0]" style tasks. "Understanding" and "knowing" aren't helpful words, just focal points for pointless philosophical arguments.
That's not a representation of the world. It's simply a lossy encoding of its data. It's not semantically structured in the way that our thoughts largely are—it's merely syntactically structured.
What exactly is the difference? It's clearly managed to abstract the training data to a ludicrously deep degree, such that it's capable of solving semantically non-trivial problems it's never seen before. It can make metaphors, accurately predict the behavior of humans in complex social scenarios, and translate arbitrary passages between syntactically distinct languages while preserving nuance. That last task in particular is pretty much a slam dunk against any argument of the type you are making.
"Sufficiently advanced syntax is indistinguishable from semantics."
Sufficiently advanced syntax may be superficially indistinguishable from semantics, but we're not talking about output in this subthread: we're talking about an internal representation of the world.
Pure syntax, no matter how advanced, is insufficient to represent the world in any meaningful way. By definition, in fact, because pure syntax is divorced from meaning.
>Pure syntax, no matter how advanced, is insufficient to represent the world in any meaningful way. By definition, in fact, because pure syntax is divorced from meaning.
Then "by definition" LLMs transcend "pure syntax", because of all the examples of semantically interesting tasks they can do which you failed to engage with. It clearly has internal representations of abstract concepts. Your argument seems to be that you intuitively reject the possibility of complex emergent behavior from such networks because they're trained on "just words", and no amount of demonstrably intelligent emergent behavior will convince you otherwise.
There's nothing magic about meat brains. Both we and the LLMs learn a world model from a bunch of input data we correlate until it makes sense. There's no "meaning gland" we have that ChatGPT doesn't.
Not sure why this is being upvoted but it's completely wrong. ChatGPT will answer like a colleague that is a subject matter expert on whatever you're asking them. And in doing so goes above and beyond inventing minute details that are blatantly wrong to pretend they're an expert.
But the vast majority of folks have a concept of uncertainty and will either communicate (or know internally) that what they're saying is speculative. ChatGPT is like a pathological liar that pretends to be an expert and is often correct.
Someone with no experience may give you their best guess, but they'll also tell you that.
And if I ask a colleague of mine for their sources on a particular matter that they might not be sure about (or I find curious), they would not send me a list of totally made up journal articles and book titles like ChatGPT does.
While you are right, I just don't think this is a useful comparison. There are other context clues when talking to people. I tend to know whether my colleague is talking out of his ass, and most people will preface stuff by saying they don't really know.
We are trained to trust what computers are telling us, and ChatGPT doesn't 'qualify' what it's saying. I think if ChatGPT could preface what it's saying with 'Well, I'm not super sure, but here is what I think' that would go a long way to solving this issue.
An LLM predicts the most probable word (token technically), that's it. There is always a most probable word given any input (even if it makes no sense).
Let's say you try to complete the following sentence: "My name is larry and my favorite color is: ". If you've seen training data that said larry's favorite color is blue, then you say "blue". If you have no data related to larry's favorite color, and no idea who larry is, you have no way of knowing what the next word will be.
However, you know it will likely be a color, and maybe "orange" is the most common one you've seen, so you say "orange". It makes perfect sense in this context, it looks correct, and it might even be correct. But an LLM doesn't "know" if it's correct; it's just the most probable.
This is why it's so good at making things up. It can predict what would look correct. It has no "idea" what the answers are ever, it's always a best guess, and you can only really see this behaviour when it "hallucinates".
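To make the "there is always a most probable word" point concrete, here is a toy sketch. It is nothing like a real transformer: it just counts which word followed which in a tiny made-up corpus, and the names `train` and `predict_next` are invented for the illustration. The point is only that the predictor answers with the same confidence whether or not it has ever heard of larry.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Count next-word frequencies for (subject, previous word) and for previous word alone."""
    specific = defaultdict(Counter)  # context: (first word of sentence, previous word)
    general = defaultdict(Counter)   # context: previous word only
    for sentence in sentences:
        words = sentence.lower().split()
        subject = words[0]
        for prev, nxt in zip(words, words[1:]):
            specific[(subject, prev)][nxt] += 1
            general[prev][nxt] += 1
    return specific, general

def predict_next(model, prompt):
    """Always return some word: use the specific context if seen, else back off."""
    specific, general = model
    words = prompt.lower().split()
    subject, prev = words[0], words[-1]
    for counts in (specific.get((subject, prev)), general.get(prev)):
        if counts:
            return counts.most_common(1)[0][0]
    # Nothing matched at all: fall back to the most frequent continuation overall.
    overall = Counter()
    for counts in general.values():
        overall.update(counts)
    return overall.most_common(1)[0][0]

# Made-up training data for the illustration.
corpus = [
    "bob favorite color is orange",
    "amy favorite color is orange",
    "joe favorite color is blue",
]
model = train(corpus)
print(predict_next(model, "joe favorite color is"))    # blue (seen in the data)
print(predict_next(model, "larry favorite color is"))  # orange (a plausible guess, never seen)
```

Nothing in the output distinguishes the answer that was backed by data from the one that was not, which is the whole problem.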
---
Edit: For fun I asked GPT-4 "Complete the following sentence: "Jeremy's favorite food is jKDFJ9 cake, which is made of:"
Its response was "Jeremy's favorite food is jKDFJ9 cake, which is made of a unique combination of ingredients such as chocolate, hazelnuts, and a touch of exotic spices, giving it a distinct and unforgettable flavor."
I get: "I'm sorry, but "jKDFJ9 cake" is not a known or recognizable food item or recipe. It's possible that it's a made-up name or a personal creation. Can you please provide more context or details about what "jKDFJ9 cake" is or what it might be made of? Without more information, I cannot complete the sentence."
They've been changing it without telling us AFAIK. Asking it (I forget the exact prompt) to "make a scientific paper about the discovery that ferrets can breathe underwater" now reliably produces a proper response, and GPT-3.5's output has looked more consistently formatted just recently. Previously it would say "Well actually ferrets can't breathe underwater", and when it did answer it wasn't super well formatted.
So I think they're tuning their RLHF. Although I could be completely wrong on the timeline and maybe it was the 23rd.
Are you using GPT-4 or GPT-3.5? I find that GPT-4 is actually better at making full responses. I actually dislike the term "hallucinate" and generally prefer it to do so; GPT-3.5 is harder to make fantastical text with.
GPT-3.5 gives me the same result as you and the above commenter. In that example, the RLHF training/tuning has weighted it to think that "I'm sorry [etc]" is the most probable best sequence of words.
GPT-4 is reliably giving me similar results such as "Jeremy's favorite food is jKDFJ9 cake, which is made of a unique combination of dark chocolate, crushed nuts, and zesty orange flavor, topped with a rich caramel drizzle."
I find this interesting because GPT-4 is supposed to "hallucinate" less. I think it's just better at determining when you intend for it to do so.
This is a whole lot of words to say "because it just mixes up words that are in likely the same order as other stuff that was fed into it. It doesn't know or reason anything."
I have yet to see any explanation more useful or apparently accurate than this one.
This notion of innate concepts of "to know" and the ability to reason smell slightly of linguistics prior to AI (of various kinds) - i.e. "Grammar is innate, computers can do [something]" -> Computers now do it.
There are definitely going to be contexts that Transformers just don't work with very well at all, but the idea that you can't get a very good statistical approximation to knowing and reasoning via a computer seems naively prone to anthropocentrism.
Conversely, the idea that concepts like "reasoning" and "knowing" can be approximated by language models seems like a naive result of anthropomorphism.
It was created to be a tool to estimate the next token in a series based off its training data. To say that reasoning and knowing can be approximated in the same way says less about the language models themselves and more about the relationship of "reasoning" and "knowing" to "language".
In my opinion, that's why discussions on whether or not GPT-x can reason/know should be taken as seriously as discussions on the physics of torch drives. They seem to assume that a relationship between statistical approximation, reasoning, and language exists, which isn't proven, much like torch drive discussion assumes working nuclear fusion.
Essentially I think dismissing the idea of statistical approximations via transformers being able to "reason" and "know" is about as anthropocentric as dismissing the idea that collective consciousnesses shouldn't be granted individual rights. There's a lot of things we need to know and decide before we can even start thinking about what that means.
It's hardly anthropocentrism though. Even a student who has studied some epistemology understands that there is a very big difference between pattern matching and operating based on formal logic, and that both are different from operating based on known concepts inferred from perceptual cues.
The reality is LLMs are fantastic at pattern matching and knowledge retrieval (with caveats) but struggle in problems involving uncertainty. Yann LeCun actually has had some great posts on the subject if you're interested.
I've said something like this before, but yes -- you can make a computer sound a lot like a human. Like A LOT.
Also, a really good sculptor can make a statue that looks a LOT like a human. A lot. Good enough to fool people. But so what?
I'm not saying "AI" isn't a big deal. I think it is -- perhaps on the order of the invention of the movie, or the book, or the video game. But I also think those are still FAR from "living beings" or anything LIKE "living beings."
> the idea that you can't get a very good statistical approximation to knowing and reasoning via a computer seems naively prone to anthropocentrism.
I don't think this is being questioned in general right now, but rather the claim is:
You can't get a very good statistical approximation to knowing and reasoning via _just analyzing the language_.
Language is evidently not enough on its own [1]. According to some researchers [2], the system needs to be "grounded" (think of it as being given common sense). Although there's apparently no consensus [3] among scientists on how to _fundamentally_ solve the shortcomings of current systems.
I suppose I could go further. I don't think anthropocentrism is bad; I think it's actually a VERY GOOD sort of null hypothesis. I'm really comfortable rolling with it given the following: for years and years human beings have been anthropomorphizing non-human things but, with probably the exception of other actual animals at times, nothing has really come close, despite a whole lot of people suggesting/feeling otherwise. Absent some REALLY impressive evidence (which this is not, it's relatively easy to grok what's going on here) I see no reason to not roll with the null hypothesis of "humans, in fact, are special."
I asked ChatGPT "Why are you so good at making things up?"
Its response is almost exactly what you just said.
> As an AI language model, I am not capable of "making things up" in the traditional sense. Rather, I am designed to generate text based on patterns and relationships that I have learned from the vast amount of language data that I have been trained on. My ability to generate coherent and believable text comes from the sophisticated algorithms and neural networks that power my language processing capabilities. These algorithms enable me to understand the structure and meaning of language, and to generate text that is syntactically and semantically correct. While I may sometimes generate responses that are creative or unexpected, everything I produce is ultimately grounded in the language data that I have learned from.
It is capable of doing tasks that could not possibly be in its training set. I guess this doesn't technically contradict your explanation, but it makes your explanation entirely unhelpful. Even if the AI doomers are somehow right and GPT-5 turns into skynet, we still could not categorically prove that it is doing reasoning.
The amount of fatigue I get having to determine if what they tell me is fact is just too much.
I'm sure someone will tell me my experience should be similar with generic web search, but at least I'm in control of what websites to read through to determine sources.
However, I'll agree with most that state it is helpful for creative purposes, or perhaps with coding.
I've found they serve almost exactly the opposite purpose as search engines. When I want reliable info and don't need hand-holding: search. When I have no idea what to search, or want a quick intro to something: ChatGPT. Together, they are very powerful complementary tools.
Your experience absolutely shouldn't be similar to generic web search. The idea that they are an effective replacement for that is one of the most widespread misunderstandings.
They're good at SO MUCH OTHER STUFF. The challenge is figuring out what that other stuff is.
> The challenge is figuring out what that other stuff is.
Unfortunately, the major problem is something you pointed out in your blog post:
> We must resist the temptation to anthropomorphize them.
The reality is that, we in meatspace simply cannot help but anthropomorphize them.
These language models regularly pass the Turing Test (admittedly for low bars).
They are surprisingly good at bypassing the Uncanny Valley to hit the sweet spot of persuading without legitimate justification, simply because they are so convincing in formulating sentences in a manner that a confident human would.
Yes, these tools have legitimate use cases (as you outlined in your blog).
But the vast majority of use cases will be those of confidante, of discourse partner, of golem brought to life without understanding what exactly has been brought to life.
I find it very useful for doing zero shot and few shot classifications of natural language input.
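For anyone wondering what zero-shot and few-shot classification looks like in practice, here is a rough sketch of the prompt-building side only. The function names, the prompt wording, and the sample labels are my own assumptions, and there is deliberately no API call: you would send the resulting string to whatever model you use. Passing an empty `examples` list gives the zero-shot variant.

```python
def build_few_shot_prompt(labels, examples, text):
    """Build a classification prompt; `examples` is a list of (text, label) pairs."""
    lines = ["Classify the text into one of: " + ", ".join(labels) + ".", ""]
    for example_text, label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)

def parse_label(reply, labels):
    """Map a free-text model reply onto a known label, or None if nothing matches."""
    cleaned = reply.strip().lower()
    for label in labels:
        if cleaned.startswith(label.lower()):
            return label
    return None

# Hypothetical labels and examples, purely for illustration.
labels = ["positive", "negative"]
examples = [
    ("The update fixed everything, thanks!", "positive"),
    ("Still crashes on startup.", "negative"),
]
prompt = build_few_shot_prompt(labels, examples, "Works great after the patch.")
print(prompt)
```

Returning `None` from `parse_label` instead of guessing matters here: the model can reply with something outside the label set, and you want to notice when it does.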
The "use it as a chat companion" is an interesting technology demo that demonstrates some emergent processes that make me wish I was back in college on the philosophy / linguistics / computer science intersection (though I suspect the hype would make grad school there rather unpleasant).
I don't think we've figured out stuff it's useful for, we've just created tech-demos that are much more digestible.
For blockchain/crypto companies their tech demos have required you having a wallet, downloading an app to interact with the chain, or just having lackluster visuals for the users involved in the tech-demo.
On the other hand, LLMs can be interfaced via strings in APIs, so it's braindead to spin up a text interface for those APIs, with no wallet setup or learning about new chains. The English that works on one model will work on another and produce results that are better than most cryptocurrency/blockchain tech-demos.
Notice that none of this relies on us having "figured out all kinds of stuff that this is useful for". We've made cool looking tech demos that make it easy for anyone to generate content.
Much like blockchains, I feel it's the underlying technology that's actually useful (distributed PKI for blockchains and deep learning networks for GPT), and GPT itself is only 'useful' insofar as it's an easy-to-interface-with implementation of a much more powerful idea.
I mean, the usual argument about why blockchains aren't useful implies they have to be useful for every person in all situations and that tradeoffs are unacceptable, so if there is some marginal extra cost or complexity then no matter how many benefits I might claim to be getting from using blockchain technology every single day as a replacement for random banking institutions I'd previously been having to deal with for decades that I'm somehow just wrong and there are no actual use cases...
..and that's the same deal for GPT as far as I can tell: you might think you are getting value out of it, but people such as maybe-literally-me are going to whine that the error rate is high and that people are not paying enough attention to how they are using it and that at the end of the day it is probably worse for you than learning how to do things yourself and that the whole thing is overrated because many of the things people try to use it for can be done by a person and maybe we should regulate it or even ban it because all of this misuse and misunderstanding of it are dangerous to the status quo and might be the downfall of western civilization as we know it.
To be clear: I'm using it (ChatGPT) occasionally for some stuff, but it hasn't replaced Google for me anymore than crypto has fully replaced banks... and yet the fact that I am using either technology as often as I am on a daily basis would probably have been surprising to someone 10-15 years ago. And yet, in practice, most of the stuff people are excited about in both fields is, in fact, a tech demo more than a truly useful product concept, and one that only is exciting momentarily until you get bored.
I think you’ve got some combination of a utopia fallacy and a straw man going on here.
I just want to contrast two things. First, blockchain had a lot of hype around utility that never materialized. It is really quite a minority that ever used it for anything besides buying it on a platform and hoping it would go up. The big adoption was always about to happen.
Second, ChatGPT is totally different from this. Its usage is not future tense. It is present tense and past tense. I can’t get across how different “someone will use this tomorrow” is from “someone used this yesterday”.
People are wildly excited about the future and things that haven’t been built. This does not change the fact that millions of people are using this every day to solve their problems. Saying “we haven’t figured out stuff it’s useful for” is just wrong.
Lately I feel like I’m at a park with people who are saying there probably isn’t going to be any wind today while I’m already flying a kite.
With Google search going steadily downhill, I find it really tough to verify anything that ChatGPT authoritatively states is true
Everyone on here is so enthusiastic about AI gobbling up the entire software landscape, I would just like a search engine that has any chance of telling me if something is factual
I've had your same experience. I've found them mostly to be an error-prone search engine, with somehow less accountability than the open internet, because it hides its sources.
At least with Stack Exchange answers, we have who wrote it, what responses there were, what the upvoting behavior around it was. And for the most part, I've found ChatGPT will often transcribe answers wrongly or very poorly.
One small example: I asked it to solve the heat equation (I used the mathematical definition, and not "the heat equation") with Dirac initial conditions on an infinite domain. It did a good job of recognizing which Stack Exchange answer to plagiarize, but did so incorrectly, and after a mostly correct derivation, declared the answer was "zero everywhere."
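For reference, here is the standard result the derivation should have landed on. The comment doesn't give the exact formulation, so I'm assuming the one-dimensional case with constant diffusivity k and a unit Dirac delta at the origin; the symbols below are that assumption, not the original prompt.

```latex
\begin{aligned}
&\frac{\partial u}{\partial t} = k\,\frac{\partial^2 u}{\partial x^2},
  \qquad x \in \mathbb{R},\ t > 0, \qquad u(x, 0) = \delta(x), \\
&u(x, t) = \frac{1}{\sqrt{4 \pi k t}} \exp\!\left(-\frac{x^2}{4 k t}\right),
  \qquad \int_{-\infty}^{\infty} u(x, t)\, dx = 1 \quad \text{for all } t > 0.
\end{aligned}
```

The solution is a spreading Gaussian whose integral stays at 1, so "zero everywhere" cannot be right for any t > 0.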
It's kind of interesting that our science fiction projected traditional computing's strengths, math and logic, into the AI future with overly logical and mechanical AI characters. But our first creation of fully communicative AI has elementary school strength in these areas while it's probably better than the average adult at writing poetry or an inspiring speech.
I was mostly commenting on how it just plagiarized a correct answer off of Stack Exchange, except it took an incorrect hard right turn at the end to make up a solution.
This was me just testing it. I was aware of the particular SE answer ahead of time, and it followed the whole thing close enough that I had assumed it had internally mapped to it. But I suppose it didn't have to be that way.
It's like dealing with electricity (or maybe the internet). Early skeptics believe it is a curiosity with little application. People see how it can jump all over the place and create disasters, and they can't imagine engineered systems finely controlling its behavior, creating reliable complex functions, and becoming the bedrock for computing.
I think there is also an aspect of willful disregard. This technology may change a lot, and it may be easier to dismiss that idea rather than process it.
Do you think there might be the opposite going on? Wanting to believe something that isn't there because you won't have to do as much work, feel smarter etc? Because it's really hard not to anthropomorphize it?
Gloss over all the incredible dangers we might be exposing our world to just because it's "fun to play with" and see what AutoGPT can do to the Internet?
I use it for things that don't really matter if they're exactly correct. For example, coming up with a travel itinerary for a country I have never visited. Rewriting a work email with better English. Summarizing a news article. There are a lot of things that don't require ultimate precision. I feel like people expect these models to do something they aren't really designed for, and the mismatch in expectations causes people to be let down. They are just tools, not "mildly conscious beings" like OpenAI's founders want you to believe.
I asked it for references about Hafez Shirazi’s abandoned journey to India and it suggested a very specific Encyclopaedia Iranica entry which seemed perfect, and of course did not exist.
Ultimately, it's just a tool, so if the tool needs you to hold it this way and twist, you hold it and twist. And this seems to do the trick. Since it does answer with references for other situations, we needn't concern ourselves with the details.
The technical report[1] makes that claim at least:
>GPT-4 significantly reduces hallucinations relative to previous GPT-3.5 models (which have themselves been improving with continued iteration). GPT-4 scores 19 percentage points higher than our latest GPT-3.5 on our internal, adversarially-designed factuality evaluations
It's much, much better than ChatGPT 3.5... in particular, if I ask it for biographical information about non-celebrity but internet-famous people I know 3.5 tends to make up all sorts of details while 4 is almost entirely correct.
It still makes things up though, just in less obvious ways. So the trap is very much still there for people to fall into - if anything it's riskier, because the fact it lies less means people are more likely to assume that it doesn't ever.
It still can't explain standard CS algorithms most of the time. I've just tried asking it to explain deleting a non-root node from a max heap with examples. And both attempts were either plain wrong (random nodes disappearing) or poor (deleting a leaf node which is not very illustrative).
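For comparison, this is roughly what a correct answer looks like. It's a minimal sketch of the textbook approach on a list-based max heap: move the last element into the vacated slot, then sift it up or down as needed. The function names and sample heaps are my own.

```python
def sift_up(heap, i):
    """Move heap[i] up while it is larger than its parent."""
    while i > 0:
        parent = (i - 1) // 2
        if heap[i] <= heap[parent]:
            break
        heap[i], heap[parent] = heap[parent], heap[i]
        i = parent

def sift_down(heap, i):
    """Move heap[i] down while it is smaller than its largest child."""
    n = len(heap)
    while True:
        largest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and heap[child] > heap[largest]:
                largest = child
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def delete_at(heap, i):
    """Remove and return heap[i] from a list-based max heap, restoring heap order."""
    removed = heap[i]
    last = heap.pop()
    if i < len(heap):        # the deleted slot was not the last one
        heap[i] = last
        sift_up(heap, i)     # needed when the replacement is larger than the parent
        sift_down(heap, i)   # needed when it is smaller than a child
    return removed

# Replacement (85) is larger than the parent (50), so it sifts up.
heap = [100, 50, 90, 10, 20, 80, 85]
delete_at(heap, 3)           # removes 10
print(heap)                  # [100, 85, 90, 50, 20, 80]

# Replacement (5) is smaller than the children (70, 60), so it sifts down.
heap = [100, 90, 80, 70, 60, 10, 5]
delete_at(heap, 1)           # removes 90
print(heap)                  # [100, 70, 80, 5, 60, 10]
```

The first example is the case explanations tend to skip: unlike deleting the root, deleting an interior node can require moving the replacement up, because the last element may come from a different subtree.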
Edit: I then asked who a certain deceased person _is_ and it gave me a completely wrong answer about a different person who's still alive and happens to share the last name. Both people have multiple websites, books, publications and Wikipedia entries older than 2021 (which seems to be the cut-off).
Edit 2: Looks like I'm still on 3.5, so disregard the above.
The short answer is it can't. It's arguable whether anyone can - for a human being, determining if text is "factual" can be incredibly difficult.
A better answer: if a fact is present many, many times in training data - "Paris is the capital of France" for example, it's much more likely to be true.
Also influential: RLHF - Reinforcement Learning from Human Feedback. This is the process by which human labellers rate answers from LLMs - if they consistently rate up "facts" the models have a better chance of outputting factual information, at least if they can relate it to the up-voted responses somehow.
> The short answer is it can't. It's arguable whether anyone can - for a human being, determining if text is "factual" can be incredibly difficult.
Yet, most adults I deal with don't make false things up out of whole cloth as much as ChatGPT does, and it really does not seem like it is that difficult for them. Children do this quite often though, and some adults do, but most don't.
> A better answer: if a fact is present many, many times in training data - "Paris is the capital of France" for example, it's much more likely to be true.
I think it is quite expected that it is biased toward generating output that represents its training data, but this doesn't really seem like a solution to the problem. Furthermore, sometimes I want ChatGPT to make things up that are not identical to the training data. How do you get it to recognize whether it is operating in the realm of fact or not?
I'm not sure larger models with more parameters get you to where you want to go.
I think many people overstate the problem, I think it is not that serious, but I think a lot of people also try and just dismiss the issue.
It still hallucinated a python function for me in a library, but upon further questioning, it did provide the code to make that work. The issue with GPT4 now is the limitations placed on it (ChatGPT+), and I don't think it's available via API for the general public?
So most people will be referring to the default ChatGPT 3.5.
I modified the prompt in the article to ask that it only include books that exist by authors who are real people, and it does much better. By my manual check, every book exists, and the authors are correct with the exception of `Carlo C. DiClemente` on the last book.
Prompts are important.
```
Prompt: Can you give a list of top books on Social Cognitive Theory? Please only list books that you're sure exist, by authors who are real people.
ChatGPT (3.5): Sure, here are some books on Social Cognitive Theory:
"Social Foundations of Thought and Action: A Social Cognitive Theory" by Albert Bandura
"Self-Efficacy: The Exercise of Control" by Albert Bandura
"Handbook of Socialization: Theory and Research" edited by Joan E. Grusec and Paul D. Hastings
"Human Agency in Social Cognitive Theory" by Albert Bandura
"Cognitive Therapy and the Emotional Disorders" by Aaron T. Beck
"Mindset: The New Psychology of Success" by Carol S. Dweck
"The Handbook of Social Psychology" edited by Susan T. Fiske, Daniel T. Gilbert, and Gardner Lindzey
"Social Cognitive Psychology: History and Current Domains" edited by David F. Barone, James E. Maddux, and Carlo C. DiClemente
Please note that this is not an exhaustive list and there may be other books on Social Cognitive Theory that are worth reading.
```
What is really interesting is to ask it how to do things that you know are impossible. I was trying to figure out how to do something in the Django ORM that I knew how to write in SQL. It starts inventing features to solve the problem. In another instance I went through about 5 fake modules before it suggested a real one. It seems very eager to please and in doing so seems to err on the side of making stuff up.
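One cheap defense against invented modules and functions is to check that they exist before building on them. Here is a small sketch using only the standard library's `importlib`; the helper names `module_exists` and `attribute_exists` are my own, and this only tells you that a name exists in your environment, not that it does what the model claimed.

```python
import importlib
import importlib.util

def module_exists(name):
    """True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for e.g. a dotted name whose parent package is missing.
        return False

def attribute_exists(module_name, attr_path):
    """True if module_name has the dotted attribute path, e.g. ("os", "path.join")."""
    if not module_exists(module_name):
        return False
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True

print(module_exists("json"))                       # True
print(module_exists("definitely_not_a_real_mod"))  # False
print(attribute_exists("json", "dumps"))           # True
print(attribute_exists("json", "to_yaml"))         # False (no such function)
```

Note that `attribute_exists` actually imports the module, so only point it at packages you already trust.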
>If the model's prediction is close to the actual next word, the neural network updates its parameters to reinforce the patterns that led to that prediction.
I think it's more important that people understand this critical issue than that we get into the weeds talking about the difference between lying, hallucination and confabulation.
TLDR: There’s a time for linguistics, and there’s a time for grabbing the general public by the shoulders and shouting “It lies! The computer lies to you! Don’t trust anything it says!”
I think the problem there is that the people garnering the most attention from this at the moment don't want to say that because it'll burst their bubble.
Yeah, that's fair. The only thing is those people are only the start, unfortunately. There's academics seeking grant money, journalists writing hype pieces, and people at companies trying to sell their value to their superiors all saying the opposite. I think those are the people I'm mostly referring to above. Even independent of that, it's a bit hard to not get caught up in the hype myself, admittedly.
I asked it to give me citations for some Charles Spurgeon quotes. It made up a link to a website for a sermon that it also made up. The root of the website was real. In GPT 3.5.
Perhaps it's better to think of how incredible it is that, being systems that just make things up to match a prompt, they often give responses we would regard as 'correct'.