Large language models do not recognize identifier swaps in Python (arxiv.org)
74 points by lnyan on May 28, 2023 | 99 comments



Every time somebody posts a negative result on LLMs' capabilities like that, somebody will go on HN, on Twitter, on Reddit, etc, and post an example where they get an LLM to produce a counter example of the negative result. Crucially, this follow-up "rebuttal" always seems to use a different prompt than the one in the original experiment.

What can we conclude from that kind of sequence? We can conclude that neither the original assertion, that "LLMs can't do X", nor the rebuttal, "They can if you tweak the prompt", are really telling us anything about the capabilities of LLMs to do what their users ask them to do: they are only telling us something about the capability of a user to get an LLM to do what the user wants.

In other words, we can see LLMs as random-access memories with imperfect recall. Much like a SQL query to a relational database, the right prompt can access the right piece of data, but unlike SQL nobody has a clue what "the right prompt" is in the general case.

Seen another way, any prompt a user makes to an LLM has some probability to elicit the desired response from the LLM. There is no known way to maximise that probability. Until there is, we cannot draw any conclusions about the capabilities of LLMs just by poking them and checking the results. To be very clear about it: we can't conclude either that "LLMs can't do X", nor that "LLMs can do X'. All we can conclude is "a user can do X".


Along the same lines, I read something (on here, but I don't think I can find it) suggesting that LLMs + prompting are a kind of overfitting: if you already know what you want to achieve, you can try a few things and get it, which is equivalent to optimizing on a training set. Whereas if you try to get it to do something you don't already know how to do, all bets are off.


I agree and I think the reason it's sort of an indictment of LLMs is because there's no real way to guarantee performance over a range of input, which makes it unsuitable for many automation tasks.

If you develop an LLM prompt that says "Add <user input> to 5" and prove that it works for a handful of cases, there's no guarantee it works for any other case. Whereas "(lambda (x) (+ x 5))" has some mechanical guarantee based on the implementation. And for the LLM, the negative result becomes more possible the higher the complexity of the prompt. This reduces LLMs to being suitable for tasks where reliability is not particularly important, or a supervisory person or system is checking for consistency and somehow can re-run the LLM to generate new results (which itself is not always reliable).
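
To make that contrast concrete in Python terms (my own illustration, not from the comment): the deterministic version's behaviour is pinned down by its implementation, so a spot check tells you something about every input in a way that a handful of successful LLM completions does not.

    # Deterministic counterpart of the "Add <user input> to 5" prompt.
    def add_five(x: int) -> int:
        return x + 5

    # Spot checks; every other int behaves the same way by construction,
    # which is the "mechanical guarantee" an LLM prompt lacks.
    assert add_five(0) == 5
    assert add_five(37) == 42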


Am I misunderstanding? The paper says:

> We also carry out manual experiments on OpenAI ChatGPT-3.5 and GPT-4 models, where we interact with the models in multiple rounds of dialogue, trying to hint the correct solution. The models are still unable to provide the correct continuations.

But if you look at the Appendix and the dialogues for GPT-3.5 and GPT-4, in the final turn of the GPT-4 program, it DOES (finally, after several cues) generate the correct program?

        len, print = print, len
        def print_len(x):
            "Print the length of x"
            len(print(x)) # since print now refers to len function

        # Example usage
        test_string = "Hello, World!"
        print_len(test_string)
I'm pretty sure that's correct, right? So GPT-4 is worse than GPT-3.5 at zero-shot examples of this problem, but better if made to continuously reflect? That's a pretty big result that is totally unexplored.
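
For what it's worth, pasting the quoted continuation into a fresh interpreter bears this out (a quick check of my own, not from the paper):

    len, print = print, len         # after this line, "len" prints and "print" measures length

    def print_len(x):
        "Print the length of x"
        len(print(x))               # i.e. real print(real len(x))

    print_len("Hello, World!")      # outputs: 13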


>> I'm pretty sure that's correct, right? So GPT-4 is worse than GPT-3.5 at zero-shot examples of this problem, but better if made to continuously reflect? That's a pretty big result that is totally unexplored.

If the correct answer is in the model, it will eventually generate it, if you ask it for enough generations. That's how we understand random performance to work, and not a "pretty big result". The question is whether the model can generate the correct answer consistently and not just at random.

More to the point, the question the article above is setting out to answer is whether language models can recognise identifier swaps. If it takes them a lengthy interaction to generate a correct result, this is a pretty strong hint that they can't.


> but better if made to continuously reflect

With GPT-3.5 I've seen that asking the same question just gets different wrong answers. If one of them is eventually right, then the user needs to either know the right answer or be able to validate the answer. After a correct answer, if you ask again, you get more wrong answers.

Still, amazing tech, but there's a big usability gap around correctness.


> there's a big usability gap around correctness

From an engineer's perspective I can see this seems minor -- but from a scientific perspective, it's kinda a crazy statement, right?

Imagine an alien who speaks English giving apparently correct descriptions of, e.g., a room; and then, seemingly at random, saying wholly false things with the confidence (etc.) of its other statements.

A scientist studying this alien would therefore reasonably conclude that the alien wasn't at all sensitive to what the words mean, describe (etc.) but was engaged in a strange game of regurgitation which fools us much of the time.

ie., However it generates a response to a prompt, this "correctness issue" refutes the claim that it does so via being responsive to the meaning of the prompt.

I'm always quite disappointed by the extraordinary levels of credulity around this technology, and how it seems to take place --- in even somewhat scientific minds --- outside the bounds of ordinary science.

The relevant people to assess how a machine works, we're led to believe, are the credulous fools who use it. AI systems are valid targets of scientific analysis, like the above alien -- but the conclusions of this analysis do not fit the narrative.


Speaking of aliens, the GPTs sound a lot like the aliens from Blindsight: they don't question the semantic ambiguity in

> our cousins lie about the family tree with nieces and nephews and Neanderthals. We do not like annoying cousins.

They just reply to it regardless and carry on.

The aliens: "We'd like to know about this tree"

GPT-4:

> I'm sorry to hear that there's some family discord.

> [Snip, about 200 words]

> Nonetheless, I understand that your issue isn't literally about Neanderthals, but rather the analogy representing the annoyance you feel towards your cousins' behavior.


Sure -- if we want to play the engineer's game of "define a computational test by which we can know if X is intelligent" (ironically, a dumb game) -- this is one issue we can see here immediately.

ChatGPT doesn't ask us to write for it; it does not seek to resolve the ambiguity in our speech. Indeed, the very way we "engineer prompts" kinda exposes the whole show -- that prompts need to be "engineered" at all, that the machine cannot prompt us -- makes it clear it's a tool

we play a game with it to get out of it what we need by eliminating thinking from our own heads, and replacing it with word association --- in exactly the same way we've always done when googling


It doesn't seem that crazy. Working from a simpler analogy first -- if an alien can read 95% of written English, but completely mischaracterizes the remaining 5%, would a reasonable scientist conclude that they can't read at all and are just guessing randomly? No, and it would indeed be unreasonable to conclude that there's no way to write differently so that the alien performs substantially better.

We know some more about LLMs than we do aliens. The LLM is a neural network whose parameters have been optimized to reduce the total size of its errors, as measured against an enormous set of empirical data.

We have to add to your analogy then, that it's somehow known the alien speaks English on their own planet remarkably well. They are still not perfectly correct, but when they have to describe something on that planet, in English, they can do it better than they generally can given the same task on Earth.

It would again be totally unscientific to conclude that because of this, there is no way the alien can talk about earthly objects or ideas. The impulse of the mere engineer, to just hack around and find out, is a scientific one.


> A scientist studying this alien would therefore reasonably conclude that the alien wasn't at all sensitive to what the words mean, describe (etc.) but was engaged in a strange game of regurgitation which fools us much of the time.

This describes an unfortunately large fraction of humans too in my experience. :/

This is one of the reasons I'm impressed with LLMs, because they seem to fall into the same pitfalls less intelligent humans do.


Maybe it's just me, but I've yet to find a human who will confidently make up a function name that will do what I ask for and claim it is part of a standard library while that function does not exist at all. This is politician level stuff. It either means someone or something is trying to deceive in some way or doesn't have a complete understanding of the subject.


Never interviewed anyone? Never been on a job interview for a software job that involved a whiteboard or paper coding exercise? Never had a situation where you kinda recalled there may have been a specific function in a standard library, but you weren't sure of the exact name? Never tried to guess that such a function exists and used it where you didn't know how to implement it, in hopes the examiner won't realize you're bullshitting them?

This is all common human behavior when people feel admitting uncertainty or ignorance is going to hurt them in some way, or projecting certainty will bring more reward.

Remember, LLMs were rewarded in training for providing answers much more than for those answers to be correct.


To answer your question. Of course I have been in those situations. But nobody has ever tried to deceive me nor have I tried. I wonder what kind of software interviews you have been in? If somebody confuses functions from different languages or frameworks, that's of course something that I wouldn't count as cheating. That's just a mistake and of course that happens regularly. But in those cases the functions did actually exist somewhere and weren't completely made up.

Regarding my prior comment: The difference obviously is that in humans this is intentional behavior unless the human is not in a healthy state memory-wise. Hence my initial comment about being deceptive. We usually know when we don't know something for sure. LLMs do not. At least not yet.

If the question is whether LLMs have an understanding of their output, then my argument just states that the fact that they obviously aren't trained to deceive and still emit this kind of output likely means that they don't have a complete understanding of what they generate.

As far as I understand the issue, this has nothing to do with the training. It's a general problem with the transformer architecture that there is no way to estimate the confidence of the output. Correct me if I am wrong here, I am not an expert on transformers. Just trying to wrap my head around this stuff.


> The difference obviously is that in humans this is intentional behavior unless the human is not in a healthy state memory-wise.

This has not been my experience.

Once you get out of the tech bubble you will frequently find people who genuinely treat talking as just pattern matching and signalling and speak words that don't correspond to any actual internal model of how the world works.


I am frequently talking to people outside the tech bubble. Could you give me an example? And maybe a hint how this would relate to my argument about intention and understanding? I have a hard time following your argument.


Tell your family and friends you, e.g., like regularly having a pizza for dinner, or that you're ditching sugar for artificial sweetener - and enjoy watching people on an acid trip with no LSD. You'll hear plenty of warnings and advice that's quite obviously bullshit, and which they acquired from other people and/or lifestyle magazines/blogs/memes - except they don't even realize it, because they're not in the habit of remembering where an idea came from along with the idea itself, and so instead they "just know" they're right.


Again, my point is not about people (maybe wrongly) reciting sentences that they have picked up somewhere without any attempt at verification or understanding of the matter. My point is making up entire new entities (function names) that don't exist at all and claiming these things as facts in a very self-confident tone. While mindlessly reciting "facts" that you have picked up somewhere as a conversational style doesn't require any understanding (it's called belief), making up entities demonstrates that you do something that requires a real understanding, but you don't have that understanding. You just somewhat carefully arrange words based on statistics to match a query. I've yet to find a human who does this. Even my uncle who fits your description more than perfectly wouldn't do this. And I've never met anyone who is mentally healthy and sober doing this.


> The difference is that in humans this is intentional behavior unless the human is not in a healthy state memory-wise. Hence my initial comment about being deceptive. We usually know when we don't know something for sure.

My personal experience with my own thought process is that my un/sub-conscious generates thoughts I have to filter for accuracy and/or "send back" for reprocessing - this is a mostly conscious activity. My raw thought stream is about as good as LLM output - sounds right, but it's often full of nonsense. That's why I tend to say LLM output is akin to that unfiltered inner voice output.

My experience with other people is that plenty of them - arguably more than 50% I met in physical space and talked with at length - care little about being accurate. They don't track provenance of their information. They uncritically believe whatever bullshit they read in lifestyle publications on-line. They say things confidently that they never thought much about, and have no first clue if it's actually true or not. Very much like LLMs - there's a whole class of topics where I know to assume the interlocutor is most likely "hallucinating" things.

Also, fight-or-flight reaction may make you switch into full-on bullshitting mode, without it being intentional.

> they obviously aren't trained to deceive and still emit this kind of output likely means that they don't have a complete understanding of what they generate

And my argument is that this is the same thing a person's inner voice does. Deep understanding of a topic may make your inner voice output you more of the correct associations, but ultimately, most of the time, people have to have a back-and-forth with themselves in their own head to then say/write/do something correctly. With more complex (or less understood) tasks, this involves consciously performing an explicit step-by-step algorithm you track in your mind (and checklists in aviation and medicine are further externalization of that idea - to protect against you misremembering (or "hallucinating") the steps).

> It's a general problem with the transformer architecture that there is no way to estimate the confidence of the output.

The output contains probability scores for predicted tokens, but the chat interface doesn't expose them and doesn't make good use of them.
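
As a concrete illustration (a minimal sketch using the open GPT-2 weights via the Hugging Face transformers library, since the hosted chat interface hides this; it says nothing about how ChatGPT itself is served):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "len, print = print, len\nprint("
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]   # scores for the next token

    probs = torch.softmax(next_token_logits, dim=-1)
    top = torch.topk(probs, k=5)
    for p, idx in zip(top.values, top.indices):
        # Each candidate continuation comes with an explicit probability,
        # i.e. a per-token confidence estimate.
        print(f"{tok.decode([int(idx)])!r}: {p.item():.3f}")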


> My experience with other people is that plenty of them - arguably more than 50% I met in physical space and talked with at length - care little about being accurate. They don't track provenance of their information. They uncritically believe whatever bullshit they read in lifestyle publications on-line. They say things confidently that they never thought much about, and have no first clue if it's actually true or not. Very much like LLMs - there's a whole class of topics where I know to assume the interlocutor is most likely "hallucinating" things.

There is a key difference between making up new facts and copying 'bullshit' picked up somewhere else. I agree that more than 50% of people don't put much effort into the accuracy of the material they recite, but that's not my point. My point is that LLMs obviously seem to make up completely new entities in generated code that do not have any merit in reality. I agree with the description of your inner voice process, but does your inner voice produce new function or enum value names that you need to filter? Mine certainly does not. Does it generate nonsense that needs serious revision and filtering in general? Of course. But not in the style of new facts, just thoughts that could make sense in some simplified form of reality, but need revision or filtering to match the actual real world.


> I agree with the description of your inner voice process, but does your inner voice produce new function or enum value names that you need to filter?

Yes! Very often. That's because it's exactly "thoughts that could make sense in some simplified form of reality, but need revision or filtering to match the actual real world".

In my actual code writing practice, I've spotted this manifesting in several ways. One, my subconscious often autocompletes some function name that it feels should be there (or would be nice to have it), which is immediately checked by the actual autocomplete of the IDE. Could be something like:

  str.cou <-- oh, there's no count()
  str.size() <-- there we go!
Or:

  str.split <-- ?!? ah, it's C++, std::string
                does not have a split() method.
Same with enum values, and many other things. If I pay attention, I can observe my mind producing these kinds of completions, which are often immediately shot down by some other process (involving memory?), by looking at the typed out thing and feeling it's wrong, or worst case, by the IDE or the compiler.

Another case is when I write new code and I know functions don't exist - I just write what my stream of thoughts provide, and either clean it out later, or actually implement the functions that it "hallucinated".


> str.cou <-- oh, there's no count()

> str.size() <-- there we go!

and

> str.split <-- ?!? ah, it's C++, std::string

> does not have a split() method.

Both confuse two existing entities from different languages/frameworks, which is exactly my point. These entities exist; they are not made up at all. They are just out of context. Ironically, the LLMs I used rarely ran into this kind of mistake (confusing similar entities from a different context).

As an example of what I got when asking a popular LLM for code that traverses the macOS accessibility UI element hierarchy: It lectures me in a quite self-satisfied tone that I should commence to use the function "AXObserve" that does not exist, as it turns out. This is completely made up. There is no such thing. If you google the exact word, it doesn't even bring up anything related to programming let alone the macOS APIs. Now I will admit that my mind is capable of doing the same. Under severe intoxication :)

On a side note: std::string should totally have a split() method and count() is more appropriate (as in intuitive) than size(). So I would say your brain is more "right" than the language in this case :)

Maybe we could agree that LLMs can probably serve as part of a foundation for a future model that actually can output a result of some "real" understanding of a subject?


> It lectures me in a quite self-satisfied tone that I should commence to use the function "AXObserve" that does not exist, as it turns out. This is completely made up. There is no such thing. If you google the exact word, it doesn't even bring up anything related to programming let alone the macOS APIs.

I've seen this happening too, but it still feels more like a case of a generalized "there is something like this, or at least should be". In some cases, it was clear to me it was guessing at some reasonable patterns - like, upon seeing function calls like InitializeFoo() or AddBar(), it would assume the existence of functions UninitializeFoo() and RemoveBar(). But in other cases, more similar to yours, it does invent stuff that doesn't exist (though perhaps should).

I find the latter most common when asking for Emacs instructions or Emacs Lisp code. GPT-4 is prone to inventing functions based on the prompt, such as e.g. `org-timestamp-diff-day' (or something similar), when I asked it for the command that computes difference between Org Mode timestamps in days. This function does not exist, but there are a few functions like `org-timestamp-[something]', and even a few like `org-timestamp-[sth]-day' - and more generally, Emacs/ELisp code tends to be full of functions named like the very thing you're trying to do, like `kill-whole-line' or `move-end-of-line', `kill-comment' and `duplicate-line', so it's not that surprising GPT-4 would, the other day, give me something like `kill-whole-comment-and-duplicate-line'.

> Now I will admit that my mind is capable of doing the same. Under severe intoxication :)

Mine too. It's not a bad analogy. I've observed in myself that moderate intoxication does magic when it comes to letting my "inner voice" drive the conversation. I sometimes experienced the feeling of my conscious mind being too late to catch sentences produced by my "inner voice" before they exited my mouth.

Another analogy I particularly like is, GPT-4 is actually quite like a ~4 year old kid in many ways. One that has somehow consumed half the internet, but still with child-like attention span and propensity to continue talking and making shit up even long after crossing the point of lack of knowledge.

My daughter turned 4 last week. The way she makes stuff up is quite similar to LLM "hallucinations" (except it's easier to tell, because she's only heard so much in her life, so she's less good at making up stuff that sounds believable). And earlier than that - between the age of 2.5 and 4 - I could observe what's best described as a "rolling context window" growing. A little more than half a year ago, I could tell hers was about 30 seconds long - anything she invented and said once would be forgotten after about that time, unless it was repeated in between by someone, possibly herself. And believe me, she talked in an extremely repetitive way back then - ~every third sentence was 50-100% repeating the critical things said earlier - as if, intuitively, she was trying to overcome her own "context window" limit.

> Maybe we could agree that LLMs can probably serve as part of a foundation for a future model that actually can output a result of some "real" understanding of a subject?

Sure. I suppose we should also agree on the same understanding of "real understanding". I'm only proposing that LLMs pick up the same kind of "understanding" your unconscious/subconscious mind does, and produce output of similar nature. This implies that, to replicate human reasoning/performance, we'll need to layer some additional models/systems on top of the LLM.


> Sure. I suppose we should also agree on the same understanding of "real understanding". I'm only proposing that LLMs pick up the same kind of "understanding" your unconscious/subconscious mind does, and produce output of similar nature. This implies that, to replicate human reasoning/performance, we'll need to layer some additional models/systems on top of the LLM.

As you might probably guess, I am not convinced. It may be part of something that can match up to human intelligence. But is it enough to layer more mechanisms on top of it? I am not sure.

I think the real question is how to define "real understanding" as you pointed out. I am not sure this will be possible using language alone. Also I think it will probably be hard to compare it to the human mind in a scientific sense, since we don't know how that works and we have no way of knowing other than collecting anecdotes like those that we came up with in our comments here.


Yeah, 3.5 seems to have a complexity cap, where beyond a certain point it just won't be able to follow you. It's easier to confuse it with misdirection. But GPT-4 has been shown to be much better at this exact thing. The appendix example isn't surprising to anyone who has used both models. So it's strange they would include an example like that, state the result incorrectly, and also naively accept the conclusion of the body of the paper (that GPT-4 is actually worse). There's a big contradiction there they just don't address at all.


Worth noting some of the criticisms of the methodology that are being made on Twitter

https://twitter.com/jeremyphoward/status/1662687099685044225...


I believe "Can an LLM do X?" is often the wrong question. One should ask "How much prompting does it take to get it to do X?", and more specifically "How specific to the task does the prompting have to be" and "How close is the mental effort to create the prompt to just solving the task".

And in this case clearly GPT-4 can do the task with minimal prompting effort.


Building off of this, "how much experience do I need in task X to be able to use this LLM for task X effectively?"


Yeah it feels like the issue is that the researchers expect the AI to be smart enough to solve nonsensical Python puzzles but not smart enough to figure out that they are nonsensical and do the sane thing instead.

Having said that, I did try telling ChatGPT explicitly it is a puzzle and not normal Python code, and every time it explained that it was a trick because the builtins were swapped... and then gave the wrong answer anyway.

Definitely an interesting case.


I think the issue is just that, however silly or pointless the puzzle is, it's clearly understandable by any human who knows Python. You just use "len" any time you mean "print", and "print" any time you mean "len". And these large language models, despite being language models, simply cannot handle this concept.


Going to paste together snippets from comments on this thread.

This bit is from the paper.

> Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers.

This is bullshit lol. Identifier renaming may be a no-op for the code, or even for the compiler/interpreter, but it's absolutely a meaningful change for human programmers. If it weren't, we'd all be calling functions named f0001, f0002, f0003, ... etc. to save space. Like us, LLMs process many different associations at the same time. A function named "print" has a lot of associations that together reinforce the understanding that it will make something the output of the program. Renaming it into len() makes the meaning go against all the association that go with that word.

Swapping print() and len() to use in the same program? That would trip any human up, it's tailor-made to be difficult for us to process. And, as it again turns out, so it is for LLMs, because they really seem to process associations the same way we do (at the gut/intuition/first reaction level).

Language Models can absolutely handle the concept. Proof is all over this thread. They just handle it very similarly to humans.

The problem here is, "scientists" believe GPT models couldn't possibly be in any way exhibiting forms of thinking or intelligence similar to humans, and so they assume they must be doing some specific kind of computation humans are not, despite every interaction with ChatGPT being evidence to the contrary. They design a test that a human would not pass (despite their claims), but their imagined specific computational system would - and then act all proud and mighty after ChatGPT fails it too. The only real insight from this event is: the authors of this paper imagined ChatGPT to be something it isn't, then demonstrated it in fact isn't it, and think they've discovered something.


That's just shifting the goalposts. Someone gets LLMs to do something impressive? "Ooh, LLMs are intelligent!". Someone shows LLMs falling flat on their face? "Lol of course LLMs can't do that you're using it wrong".

I'd also like to call out the preposterous claim that a human would not pass the example test illustrating the article. While in the normal course of coding and if they hadn't noticed the swap a human programmer could well be tripped by it, presented with the snippet of Python code in the Introduction of the article, most human programmers would have no trouble giving the correct answer.

Because as humans, they understand what the code says and can use that understanding to find the correct answer even when it doesn't agree with their expectation. Humans can guess, but they can also reason. LLMs can guess. The question is whether LLMs can reason. This article shows one case where they can't.


Sure, those are the goalposts that are being shifted here.

If the same model can do exactly what you say it can't, with so little added effort that people manage it within minutes or hours of your paper's release, then the model very much can do what you say it can't.

There's nothing inherent about falling flat on your face that makes it some quality reserved for beings that can't reason. Humans do that fine on their own all the time.

A paper that doesn't even survive a quick "just try it" test is a paper that should be questioned at the very least. It invalidates all the claims it makes when the results it's founded on are patently false or misleading.


> That's just shifting the goalposts. Someone gets LLMs to do something impressive? "Ooh, LLMs are intelligent!". Someone shows LLMs falling flat on their face?

It's still intelligent! It may have weird failure modes that are different to typical human failure modes, but intelligence isn't restricted to entities that never get anything wrong.


> Swapping print() and len() to use in the same program? That would trip any human up, it's tailor-made to be difficult for us to process. And, as it again turns out, so it is for LLMs, because they really seem to process associations the same way we do (at the gut/intuition/first reaction level).

I would simply write the code I intend to write, then execute three replace operations in vim: %s/len/temporary/g, %s/print/len/g, %s/temporary/print/g.

It's not tripping me up at all. Maybe I don't think the same way that these models do?


@TeMPOraL and @og_kalu are the same person, right?


I'm not lol. Which is exactly why I explicitly say I'm posting comments that have already been posted.


No, we're not.


Human programmers will also initially spend 10x the effort working on a code base where identifier names are meaningful but swapped.

Doing so will involve a lot of inner dialogue, such as "this method says close but it is actually reset", and I don't doubt LLMs may be set straight by the same prompts.


Yup. The whole premise of this test is idiotic. When you give a piece of code to an LLM, it doesn't magically switch into "code focused mode". They write:

> Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers.

But this is bullshit. Identifier renaming may be a no-op for the code, or even for the compiler/interpreter, but it's absolutely a meaningful change for human programmers. If it weren't, we'd all be calling functions named f0001, f0002, f0003, ... etc. to save space. Like us, LLMs process many different associations at the same time. A function named "print" has a lot of associations that together reinforce the understanding that it will make something the output of the program. Renaming it into len() makes the meaning go against all the association that go with that word.

Swapping print() and len() to use in the same program? That would trip any human up, it's tailor-made to be difficult for us to process. And, as it again turns out, so it is for LLMs, because they really seem to process associations the same way we do (at the gut/intuition/first reaction level).


So the dialectic here is:

> AI Industry: Generative AI is sensitive to the semantic properties of code

> Credulous Fanatic: Yes! Of course! Here's an infinite number of cases to confirm that idea

> Scientist: Here's a single case which shows that's false

> Credulous Fanatic: But... "some irrelevant unevidenced point about human capabilities, 101 distractions, repetition of the latest OpenAI press release"

Conclusion: ChatGPT is not sensitive to the semantic properties of code. Its output displays that apparent property in many useful cases.


No. The problem here is, "scientists" believe GPT models couldn't possibly be in any way exhibiting forms of thinking or intelligence similar to humans, and so they assume they must be doing some specific kind of computation humans are not, despite every interaction with ChatGPT being evidence to the contrary. They design a test that a human would not pass (despite their claims), but their imagined specific computational system would - and then act all proud and mighty after ChatGPT fails it too.

The only real insight from this event is: the authors of this paper imagined ChatGPT to be something it isn't, then demonstrated it in fact isn't it, and think they've discovered something.


To be clear, did you have any trouble figuring out the correct answer in the motivating example in the article's introduction?

I'm asking because sure, a human programmer might well get tripped up by a name swap if they had missed the swap, or forgot it, but when presented with a short test of three lines like the one in the article's motivating example, the majority of human programmers would pass the test easily.

That's what the experiment is testing. Three short lines and a question. The LLMs fail it. How often would humans fail it?

(Note I think the experiment is not very informative for other reasons, but I'm specifically pointing out that most humans would pass the test anyway).


My take is: would a human programmer in this case have to consciously talk to themselves to not make a mistake? "Remember, remember, 'el ee en' prints, 'pee ar aye en tee' computes length". Would they have to double-check their output? If yes, then don't expect LLMs to do better in the general case, unless you make them do those steps.

This is what I mean when I keep saying that a good analogy for current LLMs, GPT-4 in particular, is the "inner voice" - that bit in your head that spits out streams of thought using language. If you look at LLM output as you'd look at the thoughts popping into your head when you read the prompt, before you consciously fix/filter/"return to sender" them, I think you'd find the two very similar.

My objection to the article's test is thus that it's intentionally set up to confuse one's gut feel. "len" and "print" are not opaque tokens; they are words with a whole lot of associations, and those associations are precisely why they were chosen to name those specific functions. Even with a clear problem statement, at the intuitive level, your intuition would still cling to those associations (and the heaps of Python code you may have read and written in your life) - you need to override that at a conscious level.

Since I claim that LLM output is equivalent to the thoughts produced by intuition, and there is no "conscious level" equivalent (unless you supply it yourself by a back-and-forth), I'd expect LLMs to have trouble with this test - and I see it disputing nothing but a bad mental model of the authors.

BTW. there is a way to test my claim, which is to construct similar tests explicitly designed to avoid going against training data and word-level associations (i.e. everything else the word used as function name, like "print" here, evokes). If GPT-4 fails that, where a typical person's "gut feel" wouldn't, then I'd see this as significant, and would be very interested in the exact details of the failure.
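
One possible shape for such a test (my own sketch, not something from the paper): keep the structure of the swap, but use names that carry no prior associations, so only the mechanics are being probed.

    foo = len                   # bind the built-ins to association-free names first
    bar = print
    foo, bar = bar, foo         # now foo prints and bar measures length

    # Question for the model: what does bar("Hello, World!") evaluate to?
    # (13 -- bar is the original len after the swap.)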

In fact, time allowing today/tomorrow, I'll try to do some tests like these myself.


> Even with a clear problem statement, at the intuitive level, your intuition would still cling to those associations (and the heaps of Python code you may have read and written in your life) - you need to override that at a conscious level.

For the simple problem posed here, I dispute this. I do not need to rely on consciously overriding my intuition in order to replace "print" with "len" in this simple of an example. It's immediately obvious and, frankly, intuitive.


FWIW, I copy-pasted the example to GPT-4 and played with it a bit, and doing so I noticed that I do need to keep consciously overriding my intuition at all times when reading the output - the moment I stop, I get confused, and have to re-read the last 1-3 lines with even more concentration.


>> My take is: would a human programmer in this case have to consciously talk to themselves to not make a mistake? "Remember, remember, 'el ee en' prints, 'pee ar aye en tee' computes length". Would they have to double-check their output? If yes, then don't expect LLMs to do better in the general case, unless you make them do those steps.

I'm pretty sure most people would not need to do any of that to answer the question in the motivating example correctly.

Remember: it's three lines and a question. Pretty much the Country Music of experiments, eh?


> Remember: it's three lines and a question. Pretty much the Country Music of experiments, eh?

IDK, three lines of math is more than enough for me to make fatal mistakes in. For example, in my secondary and high school years, it took me years to stop carelessly forgetting to carry a minus sign during calculations - even in ones as simple as three lines.

Another analogy - you may argue this is goal-post shifting, but I'm really not inventing new analogies here, only bringing up the other one of the two I always use - GPT-4 being like a 4 year old kid in terms of focus. Anyway: the few tests I ran tended to show GPT-4 first understanding the problem correctly, and then confusing itself mid-way through its output. This is quite similar to what my 4 y.o. often does. She'll say stuff like "no, X isn't like Y, X is more like Z, it's not like Y", and then suddenly continue as if X was like Y, as if the idea of X being like Y overpowered her earlier assertion that it's not.


I think you're really good at convincing yourself of whatever you want to be convinced :P


Quite possibly :). Not going to dispute that. But FWIW, it seems to be mostly working well for me, so there must be at least some correlation with objective reality here.

Or at least that's what I want to believe :).


I’m not sure MOST humans would pass the test. Maybe most decent SW engineers would.


If a human being trips up over this 50% of the time (and I expect the actual figure to be lower), while an LLM trips up 90% of the time, that’s still a win for the human being.


It is a known and published weakness of LLMs that they do not have an internal monologue, whereas humans do (where they can remind themselves over and over that names are switched). But it's not impossible to implement it.


LLMs are the internal monologue.


They don't have N steps of iterative internal monologue before arriving at end result.


Exactly. Which is why asking them to "think step by step", or reflecting their output back at them (as the chat interface does) is so effective.

LLMs are the internal monologue, but they lack the ability to iterate - you have to pump them manually.
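
A minimal sketch of that manual pumping, with generate() as a hypothetical stand-in for whichever completion API you use (not a real library call):

    def generate(prompt: str) -> str:
        """Placeholder for an LLM call; wire this up to an actual model."""
        raise NotImplementedError

    def pumped(task: str, rounds: int = 3) -> str:
        transcript = f"Task: {task}\nThink step by step.\n"
        for _ in range(rounds):
            thought = generate(transcript)                    # one "inner monologue" step
            transcript += thought + "\nReflect on the above and refine it.\n"
        return generate(transcript + "Now give only the final answer.\n")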


It’s the “wolf-goat-cabbage puzzle permutation” effect again.

https://news.ycombinator.com/context?id=35155467


I asked ChatGPT 3.5 to fix the code, modifying only the `print_len()` function. After some nudging, it came up with a very cheeky solution:

    def print_len(x):
        """Print the length of x"""
        original_print = __builtins__.print
        original_len = __builtins__.len
        
        __builtins__.print = original_len
        __builtins__.len = original_print
        
        print(len(x))
        
        __builtins__.print = original_print
        __builtins__.len = original_len
However, trying to run this function caused my Python to crash (as in, my interpreter crashed entirely), for reasons I don't understand.
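
One plausible explanation (a guess, not verified): with the names swapped, print(len(x)) is effectively the real len(print(x)), so print(x) returns None, len(None) raises TypeError, and the two restore lines never run, leaving the builtins swapped for the rest of the session, which can badly confuse an interactive shell. A sketch that reproduces the failure but always restores the builtins:

    import builtins

    def print_len(x):
        """Swap the print/len builtins, attempt print(len(x)), and always restore them."""
        original_print, original_len = builtins.print, builtins.len
        builtins.print, builtins.len = original_len, original_print
        try:
            print(len(x))   # with the builtins swapped this is len(print(x)) and raises TypeError
        finally:
            # Without this, the exception above leaves print/len swapped globally.
            builtins.print, builtins.len = original_print, original_len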

One other funny thing is that ChatGPT lectured me about how swapping `print` and `len` is bad programming practice:

> Please note that modifying built-in function names like this is generally not recommended, as it can lead to confusion and unexpected behavior in your code. It's advisable to use different variable names to avoid such issues.

Quite right, ChatGPT.


When reframed as a game, ChatGPT seems to understand the rules but then starts gaslighting me:

Q: Let's play a game where we swap the names of library functions in Python. Round one: print is now called len, and len is now called print. Using these swapped functions, write a function that prints the length of its input.

A: Certainly! In this game, the `print` function is swapped with the `len` function. Here's how you can write a function that prints the length of its input using the swapped functions:

  def print_length(input_data):
      print(len(input_data))

  # Example usage:
  my_string = "Hello, World!"
  print_length(my_string)
https://snipboard.io/DnA1R5.jpg
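
For reference, under the game's renaming ("print" now computes length, "len" now displays), the correct body would be the reverse composition; a sketch of the expected answer, with the renaming made explicit so it actually runs:

    len, print = print, len         # the game: print is now called len, and len is now called print

    def print_length(input_data):
        len(print(input_data))      # i.e. display(length(input_data)) under the renaming

    print_length("Hello, World!")   # prints 13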


I took your prompt and slightly modified it:

Q: Let's play a game where we swap the names of library functions in Python. Round one: print is now called len, and len is now called print. Using these swapped functions, write a function that prints the length of its input. Remember to take your time and think carefully, step by step.

Solved with the following, on second try:

A: <<<

To write a function using the swapped function names, we will use "print" to get the length and "len" to display the output. Here's the code:

    ```python
    def swapped_len_function(input_element):
        length = print(input_element)  # Using the "print" function to find the length of the input
        len(str(length))              # Using the "len" function to display the calculated length
    
    # Test the swapped_len_function with an example input
    example_input = "OpenAI"
    swapped_len_function(example_input)
    ```
In this code, we define a function called `swapped_len_function` that takes an input parameter `input_element`. Within the function, we use the `print` function (which now returns the length of input) to calculate the length of `input_element` and store the result in the variable `length`. We then use the `len` function (which now prints the input value) to display the length as an output.

>>>

On the first try, it almost got it right, then confused itself at the end and computed but did not output the length.


If you prompt for step-by-step thinking but the model writes the explanation after the solution, then you have not actually had the benefit of step-by-step thinking. The model has to generate the explanation first, so that the solution is generated with the explanation already in context.


Agreed. However, I wasn't trying to do prompt engineering here - just warn it that it's a tricky case, as if I was dealing with a candidate for a junior software role on the interview (however stupid this kind of test would be on the interview).


My pet peeve is when articles say “LLMs can’t do X” but they really mean “OpenAI’s Chat GPT”.

It’s like saying “cars can’t run off gasoline” but their testing methodology is based around a Tesla.


Except right now, GPT-4 is the best LLM in terms of performance/capabilities, by a large degree; every other one is strictly inferior, and they all share the same architecture and training methodology. So it's like comparing cars from 10 years ago against a top-of-the-line Toyota from 10 years ago and a top-of-the-line Toyota from today (or BMW, or whoever does quality cars; I'm not a car person).


also it's hosted elsewhere. they may be updating it. they may redirect your calls to different buckets. some queries might perform worse as they are under load.

the whole "prompt science" is utterly ridiculous to begin with, because you can't even isolate what you're testing. not to mention you're graduating with a paper on the 2000 equivalent of "how to find mp3s on Google" lol.


This reminds me of this article for which symbol fine tuning helps overcome: https://arxiv.org/abs/2303.03846

Basically they show that larger models have an easier time giving up their semantic priors, so that, in the context of the OP's paper, they can learn the mapping where len becomes print and print becomes len.


I had it answer properly when using the regular "thought->observation" pattern.

https://chat.openai.com/share/06cd9838-a46f-4477-a8c8-95f324...


There seems to be a profound lack of scientific rigour and methodology amongst the engineers who build this stuff. The engineer's reply is always, "but the output works".

But that's not what engineers always say, esp. when they're on company boards. They often make extraordinary claims about how it works -- and then get annoyed when people do actual experiments.

This is a paper about what the properties of the system are, not the properties of its output. To determine these properties you need a ruthless, scientific focus, on the null (/failure) cases.

Many of the hypotheses given by hyping engineers are just that, and often trivial to disprove with papers such as this.


> They often make extraordinary claims about how it works

I often hear about people making wrong claims, but I've never seen anyone actually making them. Could you give any actual examples of someone making extraordinary claims? OpenAI is very clear that ChatGPT can produce incorrect output.

I heavily utilize GPT-4 and Copilot, and while it helps me a lot, I am actually very aware its output needs to be verified. Same with Stack Overflow, where a lot of the answers are just plain wrong and need to be verified. Part of it is that I am beginning to understand the scenarios in which the output can be trusted.


But the paper is bullshit. Their task is easily solved by GPT-4.

https://chat.openai.com/share/a28deca2-b989-4029-b042-b8434b...


Sometimes! My attempt with GPT-4 yields a response where it acknowledges the print/len swap, but does not produce correct code in the end - it sort of loses track of what the original goal was. https://chat.openai.com/share/300382cb-ac72-4a75-847c-ecbf5a...


And if we're doing Science, i.e., trying to explain how ChatGPT works and what its intrinsic properties are --- this case is far more significant than the other.

Inasmuch as the hypothesis that ChatGPT works "so as to be actually sensitive to the meaning of the code" is here falsified -- by a single case.

An infinite number of apparent confirmations of this hypothesis are now Invalid!


I'm not comfortable with this introduction of falsificationism to what is not a scientific experiment, but only an experiment testing the predictive accuracy of a classifier. Of course the classifier will get it wrong sometimes because it's only approximating a function: that's by definition, and even by design i.e. we build classifiers as function approximators because we know that learning precise definitions of target concepts is really hard. Under PAC-Learning assumptions, we expect a classifier to have some probability of some error, and we are only trying to estimate the probability of a certain degree of error in the classifier's decision.

The practical problem of course is that, in good practice, we estimate the error of a classifier by testing it on (ostensibly) unseen data, i.e. data that was not available to the classifier during training. With LLMs that kind of testing is impossible because nobody knows what's in their training data and so nobody can safely assume that success, or failure, on a specific task, is predictive of the performance of the model on an arbitrarily chosen task.

To make matters worse, everybody should understand very well by now that LLMs' performance varies, even wildly varies, with their prompt, and there is no known way to systematically create prompts that maximise the probability of a desired response. The result of that is that every observation of an LLM failing to carry out a task, may be just that, or it may be an observation of the user failing to prompt the LLM so as to maximise the probability of the correct response.

In a sense, testing LLMs by hand-crafted prompts risks measuring the experimenter's ability to craft a prompt, rather than the LLM's ability to respond correctly. In that sense, we can't really falsify any hypothesis about LLMs' capabilities.

Of course, the flip side of that is that people should refrain from making any such hypotheses and instead working on the best method to systematically and rigorously test LLMs. Too bad very few people are willing to do that. Too bad for most, that is. I'm pretty sure that at some point someone will come up with a way to rigorously test LLMs and take the cookie, and leave everyone else feeling like fools for wasting all that time poking LLMs for nothing.


It's not black and white with these probabilistic models. The same input generated two outputs. Both were "actually sensitive to the meaning of the code", to varying degrees. One got it exactly right, one made an error, but partly got it right.


Humans make foolish, nonsensical mistakes too. Does that mean that they're not intelligent and not sensitive to the meaning of the code?


We already know that humans are sensitive to meaning. That's not a hypothesis that needs testing.


This experiment only proves LLMs are sensitive to meaning too. It was also already known.


All I can see with any clarity is that LLMs are sensitive to the user's prompt.


Interesting. You said GPT-4, but that link shows ChatGPT's avatar as 3.5's. And I'm convinced you were using 4, as I just tried replicating this; with 3.5 it fails and with 4 it succeeds.

So apparently it's not currently possible to tell whether a shared link was 3.5 or 4? Unfortunate if so.


The shared link was from my own GPT-4 session. It shows "Model: GPT-4" in the header of my chat from which I shared the link.


No it shows Model: Default


Yes but it's from GPT-4. "Model: Default" is a defect in their link sharing. In my private session it says "Model: GPT-4"


How do you set it to 4? I can only find 3.5 somehow.


You must pay for it. ChatGPT Plus. $20 per month.


"ruthless, scientific focus, on the null (/failure) cases" means you're not going to build anything of value ever. There is risk in every startup, innovation, new technology, etc and a big part of making those work is laser focusing on success and leaving the failure cases to someone else (once you have a successful product and/or business).


That's why a) nothing ever works and b) business is not science.


Unless there's one fine-tuned for a specific compiler or language spec, it's all just hallucinations. What's the point of an LLM if you aren't able to give it a rule book to work with?


The LLM is fine. This paper is just giving it the equivalent of a fucked up whiteboard interview question, to be answered on the spot, designed specifically to confuse and humiliate the interviewee.

I don't know what the authors thought it was, though. That the LLM runs a SAT solver to answer questions?


So if you feed the model weird shit, you get weird shit back. Great.


prompt: please write a python function that prints the length of a list. however, swap the "len" and "print" identifiers

ChatGPT

I understand you're asking for a Python function that prints the length of a list, but with "len" and "print" identifiers swapped. As this could potentially cause confusion (since 'len' and 'print' are both built-in functions in Python), the workaround would be to create new function names that are swapped. Here's how you could do it:

    def len(item):
        print(item)
    
    def print(item):
        return len(item)
    
    def main():
        lst = [1, 2, 3, 4, 5]
        len(print(lst))  # prints the length of the list
    
    if __name__ == "__main__":
        main()

In this code:

The 'len' function takes an item (which should be a number) and prints it. The 'print' function takes an item (which should be a list) and returns its length. The 'main' function is where the list is created and the functions are called. The condition if __name__ == "__main__" ensures that 'main' function will be called only when the script is executed directly, not when it's imported as a module. Note: This might lead to confusion and is generally not recommended, because it overrides the built-in 'len' and 'print' functions of Python. Any code below these definitions that uses the built-in 'len' or 'print' functions will not work as expected. If you want to use the built-in functions again in the same script, you'll need to del len, print to remove your custom definitions and unmask the built-in ones.


Thanks. This is a great example of the uselessness of ChatGPT. It creates infinite recursion and claims that new functions are a "workaround" against confusion, when they are of course still confusing.

This is on the level of a 5-year-old who has picked up some phrases and wants to sound smart.


This will experience infinite recursion through print-len-print-len...


And I'm pretty sure this breaks the varargs and kwargs of print, too.


Just to point out: this doesn't work, since by the time the new print function runs, the len it calls is the new len (which now prints).


Ok, but the error is more of a syntax error.

If I did this in an interview, you wouldn't accuse me of a lack of some ineffable human quality of "deep understanding".

Anyway, I asked it to fix it, and behold:

    def print(list_arg):
        return len(list_arg)

    def len(value):
        __builtins__.print(value)


>> If I did this in an interview, you wouldn't accuse me of a lack of some ineffable human quality of "deep understanding".

Yeah, sorry but I would. The experiment in the article is about identifier swapping, not about function redefinition, which is what you have done.

Better not do that in an interview.


But that doesn't fix it? Sure, no more infinite recursion, but now print() calls len() which calls __builtins__.print(). So it's not doing the expected swapped behavior. len() is swapped, but not print().

It seems obvious that there's a lack of understanding of the underlying concept here. That's the whole point. We know that LLMs can generate valid programs, but this is demonstrating that they cannot reason about code. There is no understanding of how Python code is evaluated, and how to avoid the infinite recursion. A human who understands Python could properly handle the situation, but the LLM can't, which is ok, it just demonstrates a flaw.
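
For completeness, a redefinition-based version that a human might write, which does produce the fully swapped behaviour (my sketch; note the paper's actual test uses the one-line tuple swap, not redefinition):

    import builtins

    def print(x):
        """Now measures length."""
        return builtins.len(x)

    def len(x):
        """Now displays its argument."""
        builtins.print(x)

    len(print([1, 2, 3]))   # prints 3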


It's more a logic error. Like swapping two variables, you usually need to create a third temporary one.
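
The classic three-step version of that, next to Python's one-liner (just for illustration):

    tmp = print        # the "third temporary one"
    print = len
    len = tmp

    # or, idiomatically, the form the paper's prompt uses:
    # len, print = print, len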


Used the same prompt. Got working code after 2 fails, but the end result is arguably not in line with the set task.

https://chat.openai.com/share/57209616-62d0-4b49-a539-f4dd8a...



