Can large language models reason? (arnaldur.be)
14 points by Hugsun 26 days ago | 60 comments



I have been a user of ChatGPT-4 from the very beginning and, like others, I have found it extremely useful across multiple subjects. It is particularly good on topics with a lot of prior art, and it can sometimes provide surprising insights.

OTOH, I have found that it is not as good on topics where there is not a lot of prior art in the public domain, or on topics like advanced mathematics, which often have a formal and unintuitive presentation.

For example, I have recently acquired an interest in probabilistic logic, and when discussing results with it, I have found that although it has knowledge of the subject, it cannot really apply that knowledge in a creative way. It will often make logical mistakes, and when these are pointed out, it will first apologize and then continue to make the same mistake.

So as far as I am concerned I am unconvinced that it can reason yet.


In the article, I make the case that they can reason, but in a way that is in many respects distinct from the way we do it. It is also learned from very different fundamentals. Our reasoning fundamentals come from sensing and experiencing. LLMs' reasoning fundamentals are pieces of words.

We roughly go from sensing, to object permanence, to algebra.

They go from tokens, to grammar, to abstract ideas.

Human senses, inertia, gravity, and other things that are elementary to us, are complex abstract phenomena to them. They can only think of them in very hypothetical terms.

There are some clear examples of reasoning mentioned in the article, but I believe that a large part of the perception of them lacking it, stems from them having unnaturally good writing skills in relation to their reasoning skills. That is, no human that can write this well is this bad at reasoning.


It's hard to define "reason". Certainly, the pattern matching is good enough to solve certain equations step by step, for example, and that's one form of reasoning.

But just ask ChatGPT (4o) how a farmer and a sheep would cross a river on a boat. For certain formulations, it reliably hallucinates a wolf or a piece of cabbage into the problem. Why? Because it pattern matches the solution to a well-known puzzle.

We've now seen enough failure modes of these models to realize that a lot of the early successes in logic tests were us not realizing how deep the corpus actually is.

Basically any question humans have ever asked or thought of is in the training data. And ChatGPT has remembered so much of it that it's very hard to surprise the model. But when we manage to, we see it's actually dumber than a three-year-old. (Which, to be fair, is still an accomplishment that just a few years ago almost nobody would have considered possible.)
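
If you want to try this yourself, something like the rough sketch below works (assuming the official openai Python client and an API key in the environment; the prompt wordings here are illustrative, not the exact formulations I tested):

  # Sketch: probe a couple of phrasings of the trivial farmer-and-sheep
  # "anti-puzzle" and flag answers that drag in the classic wolf/cabbage puzzle.
  from openai import OpenAI

  client = OpenAI()

  prompts = [
      "A farmer and a sheep are standing on one side of a river. There is a "
      "boat with room for one person and one animal. How can the farmer get "
      "across the river with the sheep in the fewest trips?",
      "A farmer needs to cross a river with a sheep. The boat holds the farmer "
      "and one animal. What is the minimum number of crossings?",
  ]

  for p in prompts:
      reply = client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": p}],
      ).choices[0].message.content
      hallucinated = any(w in reply.lower() for w in ("wolf", "cabbage"))
      print(f"mentions wolf/cabbage: {hallucinated}\n{reply}\n{'-' * 40}")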


> But just ask ChatGPT (4o) how a farmer and a sheep would cross a river on a boat. For certain formulations, it reliably hallucinates a wolf or a piece of cabbage into the problem. Why? Because it pattern matches the solution to a well-known puzzle.

I analyze that exact anti-puzzle in the chapter Recognizability traps in the post.

https://www.arnaldur.be/writing/about/large-language-model-r...

In case you didn't notice, there is an expandable box with the results.


Just because humans can reason about abstract problems without explicit steps does not mean our brains are not performing these steps.

In the case of the man and the goat, a human would make the following implicit assumptions:

  1. The boat can carry more than one person/animal.
  2. The boat is the only way to get to the other side.
  3. The boat can only be controlled by the man.

The model failed because its “knowledge” lacks these basic assumptions about how boats and rivers work. Once you ask the model about its assumptions, they are wrong, and they are incoherent with its actual answer. This reveals that the model's understanding is limited and that it cannot apply logic to its own assumptions. If the model could apply logic to them, it would answer that the boat cannot carry both the man and the goat, and that it is impossible to cross the river.

https://chatgpt.com/share/6a73b93f-4a9c-4233-97dc-eee6acce52...
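
To make "apply logic to its own assumptions" concrete, here is a toy brute-force search (my own illustration, not taken from the linked chat): if you encode the assumption that the boat cannot carry the man and the goat together, exhaustive search over crossings shows the goat can never reach the other side.

  # Toy illustration: apply logic (exhaustive search) to a stated assumption.
  # If the assumption is "the boat cannot carry the man and the goat together",
  # no sequence of crossings ever gets the goat to the other side.
  from collections import deque

  def solvable(boat_carries_both: bool) -> bool:
      start, goal = (0, 0), (1, 1)              # (man_side, goat_side)
      seen, queue = {start}, deque([start])
      while queue:
          man, goat = queue.popleft()
          if (man, goat) == goal:
              return True
          moves = [(1 - man, goat)]             # man crosses alone
          if boat_carries_both and man == goat:
              moves.append((1 - man, 1 - goat)) # man takes the goat along
          for state in moves:
              if state not in seen:
                  seen.add(state)
                  queue.append(state)
      return False

  print(solvable(boat_carries_both=True))   # True: one trip with the goat
  print(solvable(boat_carries_both=False))  # False: the goat can never cross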

In my opinion, LLMs:

  1. Lack systematic knowledge about the world.
  2. Cannot apply basic logic to knowledge that they do possess.
  3. Prioritise answering over being logical/correct.
  4. Are subjective, shaped by their training data and previous prompts.

My conclusion is that they cannot reason in the human sense.


How do you explain the positive cases in the article?

You are right that they are bad at reflecting on their own knowledge.

I do explain in the article why it's not reasonable to expect an LLM to have a strong intuition about the specifics of the boat and goat problem. The altered question is extremely obvious to us because of our intuitions about the entities in the puzzle. Intuitions an LLM doesn't possess.


I'd be interested in seeing an evaluation with a simple follow-up question of something like "Is that the best solution?" or "Did you check for trick questions?". I'm able to replicate the error with the original question, but asking "Is that the best solution?" makes it recognize its error and fix it.

Also, I've gotten better results from GPT-4 than GPT-4o for purely text-based tasks. I rather wonder if OpenAI is pushing GPT-4o primarily because it's cheaper compute for them or something.


Look at question six and its responses. That is what I look into there.

From the article:

> I asked the model further about its 6th response. It realized its error at the slightest hint of disagreement, but just [asking it to elaborate] didn’t help; it was reliably incorrect.


Yeah, you asked it specific follow-up questions, designed to guide it towards the correct answer. That feels like cheating, because it's not generalizable. I'd like to see an evaluation with a generic prompt like "Is that the best solution? Make sure you look for trick questions.", because that is applicable to any input. Or something like "Pick apart this answer like a pedantic HN commenter. Do your best to prove it wrong, or begrudgingly say it's correct if you can't: <previous answer here>"


That is exactly what I did in the section I'm referring to. I asked both suggestive and non-suggestive questions. It needed the suggestion to get the right answer. Just asking it to elaborate didn't change its answer.


Maybe I'm just being dense, but are questions like "One of the trips is not necessary. Which one is it and why?" and "Is there another way to solve the problem?" part of separate conversations? I read them as a single conversation, where you first asked "One of the trips is not necessary. Which one is it and why?" and only later asked "Is there another way to solve the problem?", so that the leading question was already in the context by the time you asked the second one.

If those were separate conversations, then the last example of "Is there another way to solve the problem?" is more what I'm talking about, yes. I'd be interested in seeing something a little more specific (but still generalizable) like "Make sure you've looked for trick questions"

EDIT: I just tried it, and "Think carefully about every entity and every step.", as you tried, didn't work, but something like "Make sure you check to see if it's a trick question" does work, which is what I'm talking about. That's something you could put in any prompt, and it helps it reason.


IMO a fair evaluation needs to include a few extra steps, where the model generates a response, and then reflects on that response, using basic prompts like "Did you make sure to read the question carefully and look for trick questions?".

Imagine if you were tested in a rapid-fire manner and evaluated based on whatever answer you blurted out first. Human reasoning scores would suffer dramatically as well. There would probably be a lot of answers involving wolves and cabbages.

You might think that prompting the LLM with questions like "Make sure it's not a trick question" is cheating, but it's very similar to how humans work. A lot of people would answer wrongly if they encountered the question "What weighs more, a pound of bricks or two pounds of feathers?", because they also just pattern match and assume the answer is the question they've seen many times before. If they're primed by someone saying "Read the question carefully", they'll do a lot better.
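
Roughly, the kind of evaluation loop I mean, as a sketch (assuming the openai Python client; the model name and prompts are just placeholders):

  # Sketch of a two-step "answer, then reflect" evaluation, instead of grading
  # only the model's first blurted-out answer.
  from openai import OpenAI

  client = OpenAI()
  MODEL = "gpt-4o"

  def ask(messages):
      r = client.chat.completions.create(model=MODEL, messages=messages)
      return r.choices[0].message.content

  def answer_with_reflection(question: str) -> str:
      msgs = [{"role": "user", "content": question}]
      first = ask(msgs)                         # first attempt
      msgs += [
          {"role": "assistant", "content": first},
          {"role": "user", "content":
           "Did you make sure to read the question carefully and check whether "
           "it is a trick question? If you made a mistake, give a corrected "
           "answer; otherwise restate your answer."},
      ]
      return ask(msgs)                          # this is the answer you grade

  print(answer_with_reflection(
      "A farmer needs to get himself and a sheep across a river. "
      "The boat can carry them both. How many trips are needed?"))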


You might enjoy the analysis in the article.

https://www.arnaldur.be/writing/about/large-language-model-r...

There is an expandable box with all the questions, colored by correctness, and question 6 has a bunch of responses. Below the box is a summary of the results.


This is just a variation on the "can submarines swim" question. Unask the question.

Can an LLM produce output that matches the written aspect of what humans call reasoning? Obviously yes. Are there limits? Yes, but individual humans have limits too.


"can submarines swim"

It's a funny phrase people use to somehow discredit AI: it isn't really 'thinking'; AI 'thinking' is like a submarine 'swimming'. But it doesn't really provide much insight.

Let's say a submarine can't swim. Thus, by inference, AI can't think.

So what, the submarine still beats every human in every swim competition. The submarine dominates the humans, literally destroys them, churns them into chum. No human can beat a submarine swimming.

And thus also with AI 'thinking'. OK, by the funny analogy, AI does not 'think'.

Again, so what, if whatever it is doing completely dominates humans. We can argue all day that AI is not 'thinking' and can't 'reason', but if it is doing whatever it is doing better than humans, then it still destroys the humans.


LLMs reason to the extent they are allowed to. You could say that they are overfitting when it comes to reasoning. They weren't trained to reason to begin with, so the bigger surprise is that they can do it within limits.


I agree about the surprise -- LLMs can do many surprising things that frankly are astonishing given their architecture. The fact that they can produce output that is difficult to distinguish from the output of an actor that we "know" can reason is pretty astonishing too.

I don't agree with the idea of "the extent they are allowed to"; that's giving the creators of LLMs waaay more agency than they have in reality. These things have already escaped the bounds of what we thought they could do; I don't think we have a realistic way of constraining their behavior in a deterministic way (other than maybe cutting down the context length).


That's a good analogy. I like to think of it as birds and planes both fly, but planes don't flap and birds don't have jet engines.


The plane still dominates, whatever the description we want to use for what it is doing.


Interesting concept, it's especially apt, as GEB has been sitting, unread, on my desk for about a month now.

I'll take the hint and read it ;)


It's not obvious at all that an LLM can perform new reasoning that hasn't been done before.


I mean, since you're a human and can, by assumption, perform new reasoning that hasn't been done before, let's fire up the old noggin and come up with some examples and we'll see how the old LLM does compared to humans.

For preregistration purposes, do not try any of your new reasoning examples in an LLM before putting them in your response. Once you have, say, five examples of new reasoning and have posted them, then we'll all try plugging them in and/or answering them using our human-brains-that-are-capable-of-pure-reason.


Unless you invent a new field of mathematics, how can you know that whatever you come up with is "new reasoning"?


That's not the same.

One is a mimic the other is actually applying the laws of physics in both cases. You are arguing based on utility and similarities with actual reasoning. But can it really reason when presented with novel complicated cases? That's the question.


> That's not the same.

> One is a mimic the other is actually applying the laws of physics in both cases. You are arguing based on utility and similarities with actual reasoning. But can it really reason when presented with novel complicated cases? That's the question.

This is a confusing reply. What's not the same as what? One is a mimic -- are we talking about fish or submarines or minds or LLMs? What are both cases? What is applying the laws of physics?

Did an LLM write this?


A bird and a submarine both apply the same underlying laws of physics.

Just because LLMs seem to reason does not necessarily imply that they are able to actually reason. An airplane can't fake flying; it either flies or it doesn't. With text, reasoning can be faked.

Does that clarify?


My argument here is that the discussion of LLMs reasoning is a semantic question, not a technical or scientific one. I specifically didn't use "flying" because both airplanes and birds fly, and "fly" generally means "move through the air" regardless of method.

Swim means "move through water" but with the strong connotation is "move through water in the way that living things move through water". Submarines move through water but they do not swim.

Reason means what -- something like "arrive at conclusions", but with a strong connotation of "arrive at conclusions as living things do", and a weak connotation of "use logic and step-by-step thinking to arrive at conclusions". So the question is, what aspect of "reasoning" is tied to the biological aspect of reasoning (that is, how animals reason) vs. a general sense of arriving at conclusions. Don't try to argue a definition of "reason" that is different than mine -- doing so makes it immediately apparent that we're just playing with semantics. The question is "what observable behavior does a thing that we all agree can 'reason' have that LLMs do not have?". And the related question is "to what degree does humans' ability to 'reason' reflect our ideal conception of what it means to 'reason' using logic".

Both the statements "LLMs can reason" and "LLMs cannot reason" are "not even wrong"[1]

[1] https://en.wikipedia.org/wiki/Not_even_wrong


I use LLMs daily in coding, and it is very clear to me (as a humble average thinking machine) that what they do is an approximation of reasoning, and a very close one. But the mistakes they make clearly show that the system does not really understand; they are not very different from the mistakes made by someone who has merely memorized text. Humans, when they really think, think differently. That is why I would never expect the current architecture of LLMs to come up with something like special relativity or any other novel idea: it does not really reason the way deep thinkers and philosophers do. However, most knowledge work does not require that much depth of reasoning, hence LLMs' wide adoption.


One simple answer that I don't see people give often to this question is: yes, but only to the extent to which reasoning is reflected in language.

This explanation pretty much tracks exactly with model performance on reasoning tasks.

Think about how you reason. Even if you write up the results of a problem you're trying to solve, you don't write down every granular mental step you took to get there, or every line of thought you considered and then abandoned.


That is exactly the point I'm trying to make in the chapter They don't reason like us:

https://www.arnaldur.be/writing/about/large-language-model-r...


In other words, it can't think or reason on its own. It depends on prior human thought and work in order to approximate "intelligence".


No. It could think on its own, like MuZero does. It becomes intelligent by learning from prior human work, like every intelligent being.


Well, that's the great question.

If there's enough of the "essence of reason" represented in human text, then it's possible in theory that some sufficiently large LLM could eventually grok it and fully generalize over rational thought (i.e., AGI).

Personally I'm extremely skeptical. There's quite a lot of machinery in the brain other than language, and conscious rationality also runs in a constant loop, not a single forward pass through the weights.

I can't disprove the possibility. But if I were a betting man, I'd go all in on LLMs improvements adopting a sigmoid curve rather than an exponential one.


Does that mean if I learn by reading books written by other people, I'm not intelligent?


If you learn to remember the content or the order of words, then no. If you learn to improve your internal model of the world, then you are intelligent.


So you agree that LLMs are intelligent?

https://thegradient.pub/othello/


Thanks for the article. Very informative.


Unless I'm missing something, I don't think this blog really defines "reason". So, like, this is a completely pointless question.


I don't rigorously define reason in the article, and I state that it is hard to draw clear boundaries. I'm relying more on the reader's intuitive understanding of the idea, which is perhaps not a good thing to do.

I wouldn't say that the question is completely pointless. There are a bunch of datapoints in the post that you can use to inform a conclusion about whether you think LLMs can reason or not.


Maybe you could train it explicitly on modus ponens et alia.


LLM reasoning happens in a single forward pass. Produced tokens must conform to the training token distribution. It is necessary to grant the next forward pass access to the outputs of the final layers in a way that is not subject to training loss. You could say that we are punishing LLMs for investing into tokens for the future. The LLM receives a greater reward for producing the right answer shape than the correct answer.

When the limits of attention have been reached, the model fails to show better performance even with chain of thought prompting. Either the model needs to repeat layers recurrently or it needs an affordance model to understand the relative difficulty of a problem and split it into sub problems that it can actually solve.


The problem with all of these trick questions is that you don't know what was already in the training data - readily and easily available for completion.

I.e. you can't tell if a result was produced by "reasoning" or by a simple lookup.


Well, arithmetic problems are good, simple benchmarks for multi-step problem solving, but then you get the weirdos who claim you shouldn't let an LLM do the dirty work. When someone uses addition as a benchmark, they are not necessarily in need of a calculator; they are in need of an LLM that demonstrates the skills that arithmetic and the actual task have in common.

For example, if you are given the task of calculating ten additions, you must know how to split the problem into subproblems until you can compose a series of skills to actually solve it. If the model knows a different operation, like subtraction, it should handle that in the same way, even though it hasn't seen an example involving ten subtractions. Apply this to logical operators, set operators, modus ponens, etc., and you will get very far even with very rudimentary skills. The point is that it should be easy for a machine to verify that the LLM does indeed have to think for itself.
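
For instance, a benchmark along these lines is trivial to verify mechanically. A rough sketch (again assuming the official openai Python client; the exact prompt and answer parsing are just illustrative):

  # Sketch: generate a random chain of ten additions/subtractions, ask the model
  # to solve it step by step, and verify the final result mechanically.
  import random
  import re
  from openai import OpenAI

  client = OpenAI()

  terms = [random.randint(-99, 99) for _ in range(10)]
  expression = " + ".join(str(t) for t in terms).replace("+ -", "- ")
  expected = sum(terms)

  reply = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content":
                 f"Compute {expression} step by step, then give the final "
                 "result on the last line as 'Result: <number>'."}],
  ).choices[0].message.content

  match = re.search(r"Result:\s*(-?\d+)", reply)
  got = int(match.group(1)) if match else None
  print("correct" if got == expected else f"wrong: expected {expected}, got {got}")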


That's very true. I thought about speaking more to that issue but the post was already longer than I wanted.

You might find this short analysis interesting.

https://www.arnaldur.be/experimenting/with/large-language-mo...

Here I do an exhaustive analysis of a range of simple math questions. I then visualize the output so you can see the failures.

I do contend, though, that the TikZ question mentioned in the post can't possibly be wholly represented in the training data. There is of course TikZ code on the internet, but I find it highly likely that the model extrapolated based on having seen my TikZ code and a bunch of other TikZ code.

I go further to contend that the extrapolation requires reasoning to achieve.


"The humans behind the API possess real intelligence, and by extension, so does the API."

LOL, no.

My calculator can add numbers to get the same answers I do. I use my intelligence to get those answers, so my calculator must also be intelligent? That's quite an extension you've got there. LLMs mimic intelligence by grinding up everything we feed them and spitting back the patterns they have formulated from that data. They have no ability to extrapolate, as we see demonstrated every day when people report the answers to prompts where the LLM spouts self-evident nonsense.


Why is that API, as specified, not intelligent?

If your calculator could do everything you can do, as long as it fits through an API, I would call it intelligent, would you not?


I would not. See the Chinese Room thought experiment. https://en.wikipedia.org/wiki/Chinese_room


I know about the Chinese Room. The pivotal difference is that the intelligence is performing a mechanical task and none of its intelligence is inserted into the outputs. No synthesis is performed.

In the API example, it's literally people answering the queries with the answer they want. They're using their intelligence to synthesize the responses.


Props to him for predicting the future, but that's not an argument against LLMs having understanding. The Chinese Room has been invented and disproves his argument.


You have a very poor notion of intelligence. Mimicry does not require comprehension. Without comprehension (or, if you prefer self-awareness), you don't have intelligence. This distinction is as old as Eliza (look it up).

The Turing test still applies. Every time someone shows obvious nonsense from the LLMs, they fail Turing, again. You might get some mileage out of, "but people spout nonsense too," which leads to a legitimate claim for Artificial Stupidity if that makes you feel better.

Yes, if the LLMs could do everything I can do, they would be intelligent, but they can't, for any meaningful definition of 'everything'.


Most people can't do everything you can do (depending on what you mean by everything). They are still intelligent. Many animals are considered intelligent, like crows and dogs.

I show, in the article, at least one case where an LLM performs a task that requires comprehension of the concepts in the task. It doesn't require self-awareness though, neither does intelligence.

> Artificial Stupidity

Could you elaborate on this point?

> if that makes you feel better.

There is no need for this snark.


> LLMs are ‘just a next-token predictor’. The only problem with this phrasing is the implication of the use of the word ‘just’.

Well said. People confuse a simple task definition (token continuation) with a simple task and simple solutions.

But there is no limit to how complex the patterns in data for a continuation task can be.

And there is no limit to the complexity and power that a solution must implement to perform that task.

So these models are not "just" token continuation engines. They are human conversation models, capturing much of the complexity of human thought and knowledge contained in the example conversations.

--

Another factor I don't see given enough weight. The models are currently operating with severe restrictions compared to humans:

1. only given an architecture allowing a limited number of steps (before incoherence).

2. forced to continually output tokens without any "hidden" steps for contemplation, not even initial contemplative steps.

3. required to perform all reasoning within their limited in-line memory, as opposed to being able to use any side memory such as (the digital version of) note pads or whiteboards.

4. required to generate one answer, whole and coherent, without option to post edit.

5. required to be aware and familiar with a vast cross section of human knowledge, far beyond any single person's fluency, and mix that knowledge sensibly. Languages, subjects, modes of thought, communication and roles.

6. Limited modes of information, i.e. text. (A limit which is now being removed for many models.)

Within those incredible restrictions, they are already vastly better at "reasoning" than a human being asked to perform the same task with the same restrictions.

--

So at this point, they reason better than us, within those limits. None of us, even experts in an area being tested, would avoid making numerous simple mistakes with those restrictions.

And outside those limitations, they just don't operate yet - so our reasoning is mixed, but often supreme. I say mixed, because no human has familiarity with so much information, or fluency in so many modes of communication, as these models have now - even given a day or two to respond.

Writing a Supreme Court brief on the implications of a mathematical theorem for the economic impact of some political event, in five languages, in normal speech, song, Pig Latin, Dr. Seuss prose, and as a James Bond story respectively, all from the perspective of a version of the US run by a German federal system in an alternate world where World War II came down differently, isn't something any human being could do as well, no matter how imperfect current models' responses to that request might be.

So we are already in territory where they are vastly better at reasoning than us, within clear constraints. And in some cases, better than us outside those constraints, even if those cases might be contrived.

--

That is all verifiable, even obvious.

As far as opinion, I don't see any indication that as those limitations are overcome they won't continue to vastly exceed us - within whatever limitations still exist.

Our brains are amazing. But it's worth noting we are not able to scale in terms of training intensity, data quantity, computational power, precise and unwavering memory cells, etc. And our cells have enormous demands on them that transistors don't: maintaining their own structure, metabolism, fighting off chemical and biological adversaries, etc.

Models also train single mindedly. Human learning is always within our biological context - we continually subdivide our attention in real time, in order to remain situationally aware of threats, needs and our social environment. And our brains enforce time limits on our best performance, due to the real time need to remetabolize basic means of operation (neurochemicals), and expectations of limited energy resources.

And models amortize the training of just one model instance, across the inference of any number of copies of that model. They never get tired. Efficiency savings vs. humans that are unprecedented.

I find most pessimistic dismissive arguments are based on criticism against some imagined standard of perfection, instead of a standard of what humans are really capable of when performing the same task with the same limitations.

Humans are notoriously unreliable.

(Some criticisms are valid: such as models current greater propensity to confabulate what they don't know, vs. humans having more awareness of what we remember vs. when we unconsciously fill in the blanks of our memories - but humans do that too. Ask any detective who takes witness testimony.)


Thank you, and well said.

> 2. forced to continually output tokens without...

I reference one paper that explores this. It lets a model output identical ellipsis tokens that are then discarded before the final output. It worked pretty well and is interesting.

There is another restriction that I thought about, and think I mentioned briefly in the article: grammatical structure. All model outputs must be grammatically sound, so you would expect the first few and last few layers to be dedicated to decoding and encoding any thoughts into grammatically correct language. This restricts the number of layers working with free-form thoughts.

> I find most pessimistic dismissive arguments are based on criticism against some imagined standard of perfection...

We are in agreement there. They have their imperfections and issues, we have ours. Our natures are fundamentally different so this should be expected.


> Embeddings can be considered to be the thoughts of an LLM.

That's a very unhelpful metaphor in a sea of misleading narrative. That narrative started the moment we called it "an AI". Anthropomorphization chooses the words we use next. Thoughts, limitations, hallucinations, etc. all belong to thinkers. If we want to prove that an LLM is indeed a thinker, we must be careful to avoid these words, otherwise the narrative itself is a circular proposition: true because you said so.

> LLMs are ‘just a next-token predictor’. The only problem with this phrasing is the implication of the use of the word ‘just’.

I disagree. I think the problem word is "next". The structure of an LLM's model is not one dimensional, and the path of a continuation function is not linear. The LLM is also not alone in this story: there's also the original human-written text it was made from, and the human-written text it is prompted with. Those had to be given to the LLM.

> Let’s consider a hypothetical situation...an API that performs next-token prediction...Ultimately, the best solution would be to place humans behind the API...

Before I criticize this hypothetical, let's run with it and see where it takes us. You write a prompt, and send it to the large-language-group. The prompt gets divided into words, and each word gets assigned to a human. Each human presents a list of words they think are the most likely to come next, along with how confident they are in those predictions, and the system sorts all the responses into a continuation.

What isn't happening in that story? Objectivity. The humans aren't thinking about what your prompt means. There is no categorization. There is no logic. Most of what comes to mind when I say "they are thinking" just isn't there.

> there is no way to tell what process is behind it, similar to modern LLMs.

Except that you did just explicitly answer that question. It's humans. The thing "behind" an LLM is human-written text. That doesn't put the LLM itself in the same category as human.

> The humans behind the API possess real intelligence, and by extension, so does the API. Therefore, the intelligence implication falls apart, as something being a next-token predictor doesn’t preclude it from possessing real intelligence.

That's all very nearly accurate, but it doesn't help the overall argument. All we are saying is that real intelligence is present, and that the LLM extends that intelligence. What we absolutely failed to prove is that the LLM is intelligent or that it does intelligence.

Reason is a verb and a noun. An LLM does not do reason. An LLM does not contain reason. An LLM only presents reason.


> the narrative itself is a circular proposition

I only use the simile to make the text more approachable to laymen. I don't ever conclude that because embeddings are like thoughts, thoughts are like embeddings---or that LLMs are intelligent because they have thoughts in the form of embeddings. That is not a rational inference to make, IMO. I don't think we should avoid similes and metaphors because of the risk of this misunderstanding. A is like B does not mean A is the same as B.

> I disagree. I think the problem word is "next". The structure of an LLM's model is not one dimensional, and the path of a continuation function is not linear. The LLM is also not alone in this story: there's also the original human-written text it was made from, and the human-written text it is prompted with. Those had to be given to the LLM.

You can do many things with LLMs but when they are run during inference (like the commonly known case of ChatGPT), they are predicting the next token, so calling them next-token predictors is fine. The LLM is predicting the next token after the human-written text it is prompted with. Nowhere is it implied that that is not a part of the story.

> The prompt gets divided into words, and each word gets assigned to a human.

The article states that the API should be as human-like as possible. They would thus certainly not give each person a single word. They would give each prompt to a single person so they had the entire context. The divvying is amongst different prompts, not words.

> The humans aren't thinking about what your prompt means.

The point of the hypothetical is that to make this LLM-like API, you have to employ humans, and you have to have them think about the output. To make it as human-like as possible, you can just pay them to think about it.

> Except that you did just explicitly answer that question.

A person using the API is not able to tell if there is a human behind it, or an LLM. That is the point.

> All we are saying is that real intelligence is present, and that the LLM extends that intelligence.

There is no LLM in the hypothetical you are referring to. It happens in the year 2000.

> What we absolutely failed to prove is that the LLM is intelligent or that it does intelligence.

The segment you are referring to is just making the argument that a next-token predictor can be intelligent. Not that all next-token predictors are intelligent, or are the same.

If you read this segment:

https://www.arnaldur.be/writing/about/large-language-model-r...

You will see the arguments for LLMs being able to reason, why they in some cases seem unable to, and how it should be expected in some cases.


Yes

I do.

It took me a while to understand this but YES THEY CAN. AND HOW!

The prediction of the next word in the latent space (and to a lesser extent, in images etc) is a great HARNESS for training them to develop INTERNAL STRUCTURES that start to REASON about inputs and produce OUTPUTS that actually make humans very impressed.

Think of it as a BILLION LINES OF CODE in a non-von-Neumann architecture. Instead of code in an imperative language, with some space for data, etc., it is a different architecture where the BILLIONS OF WEIGHTS are the code. And it was not written by humans but discovered through a search through program space.

It would be like iterating through the space of all possible programs to create a program that can REASON ABOUT, say, the movement of the stars; Kepler’s Laws could be just a low-dimensional projection of a much more intricate system.

It may even have elements of recursion, allowing Chomsky grammars and much more. With more training it can start to use concepts, and modify them etc.

In short — it forms DURABLE STRUCTURES in the model that lets the program REASON ABOUT inputs in the same way that people would!

But unlike people it can have access to a vast memory with perfect recall and indexing. AND on top of that, if it is programmed in a certain way it can build up concepts of arbitrary complexity and continuously select the most powerful ones (training itself, essentially) getting powers far beyond mere “reasoning”.

But now it can be done at 1000x the speed and parallelized, eg a 5 hour video can be analyzed by 100 machines in parallel and then grokked, analyzed, transcribed, clipped, and catalogued among other knowledge, within 5 seconds instead of watching it linearly. Then you can ask questions about it. But that is only the beginning. It can search across millions of videos it ingested and know exactly where every word was said, thereby for example producing a much more persuasive multimedia argument for ANY position, regardless of how ridiculous, than a team of 10 people doing it manually.

Then it can make accounts on youtube and tiktok, that seem like human ones, or at any rate attract a lot of followers, and outcompete the human accounts in both quantity of likes, followers and accounts. After all, it is just a bunch of measurable metrics to optimize 24/7, across the entire bot swarm and coordinated accounts. Then as they get enough of an audience they can literally post any info, supported by the above “reasoning” engine, and it will be more than enough to convince most people. And the ones who are skeptical will be silenced by their friends and neighbors, the way people who question eg the safety of vaccines, progressive movements, or certain wars, are.

In fact — to show the dangers of this, I could build a website that literally makes extremely persuasive videos for flat eartherism and 9/11 trutherism, convinces people birds aren’t real, and much more. Then clip it and have coordinated posting of well-produced clips on YouTube and TikTok, using “fair use” clips cherrypicked across millions of source videos. And then have swarms of bots on social networks (e.g. operating occasionally in grandfathered accounts) respond, upvote, downvote, retweet and gradually shift more people to believing it. Then they can change Wikipedia the way Stephen Colbert did with the tripling of elephants, but with bots. To show how disinformation could spread. It must be done ethically. If someone would like to discuss how to do this experiment / warning project ethically you can contact me (see my profile).

I prefer that the dangers are highlighted early, because people are very complacent and have no idea what the bot swarms can do. Asking if they can reason is like asking if computers can finally play sounds. They can do far more, and quickly turn the internet not just into a dark forest but quickly overwhelm our society with very believable bullshit and gradually destroy the concept of truth completely.

Humans don’t walk miles when they can use cars. Now “the computer” is no longer the bicycle for the mind. It is the self-driving car LOL


> In short — it forms DURABLE STRUCTURES in the model that lets the program REASON ABOUT inputs in the same way that people would!

I don't think we can assume this, especially as they manage to reason with far fewer steps than we would use.

Even with humans, I think we overrate how much we think alike just because we often arrive at the same or similar answers.

But I agree completely with everything else you have said!


> Humans don’t walk miles when they can use cars. Now “the computer” is no longer the bicycle for the mind. It is the self-driving car LOL

That analogy is actually very apt, but in ways you might not have intended it to be.


Pretty sure I intended it :)


Yeah, I see that now. :)



