Here's an example of it failing to understand some very simple Python code that any human with basic programming knowledge would not mess up:
------
Me: What is going to be the output of this code when `x = 0` and when `x = 900`:
```python
x = 0
for i in range(100):
    if x > 50:
        x += 1
    else:
        x += 5
    if x == 90:
        print("X is not 90")
    else:
        pass
```
GPT-4: When x = 0, the code will execute as follows:
The initial value of x is 0.
The for loop iterates 100 times.
During the first 11 iterations, since x is not greater than 50, it will increment by 5 each time.
After the 11th iteration, the value of x will be 55, which is greater than 50.
For the remaining 89 iterations, x will increment by 1 each time.
The final value of x after 100 iterations will be 55 + 89 = 144.
During these iterations, the value of x never becomes 90, so the print statement is never executed.
When x = 900, the code will execute as follows:
The initial value of x is 900.
The for loop iterates 100 times.
Since x is already greater than 50, it will increment by 1 each time.
The final value of x after 100 iterations will be 1000.
During these iterations, the value of x never becomes 90, so the print statement is never executed.
In both cases, the output of the code will be empty, as the print statement is never executed.
--------
That is not correct: when x = 0 it will print "X is not 90". I basically created some simple code that is very unlikely to exist in its training data, because it doesn't do anything useful, and the model cannot execute the code. A large part of the reason that these models seem so impressive is that they were trained on almost everything that exists on the internet, so probably any question you ask it has already been asked somewhere and any useful code you present to it or ask it to generate is included in its training data, or at least something close enough.
So, when I go out of my way to create simple tasks or questions that are highly unlikely to exist in its training data, yet which a human could easily handle, it often fails.
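For reference, a minimal traced version of the loop (the extra print formatting is mine, purely for illustration) confirms this: with x starting at 0 the message is printed exactly once, at i = 45; with x starting at 900 instead, x only grows from 900 upward, never equals 90, and nothing is printed.
```python
x = 0
for i in range(100):
    if x > 50:
        x += 1
    else:
        x += 5
    if x == 90:
        # With x starting at 0, this branch is reached exactly once, at i = 45.
        print(f"i = {i}: X is not 90")
```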
I would guess you were trying an older version, or you somehow confused it. I posted your code with only "Consider:" above it, and GPT-4 produced the correct answer on the first attempt:
The given Python code initializes a variable x with the value 0, then iterates over a range of 100 values. Inside the loop, there are two conditional statements:
The first conditional statement checks if x > 50. If true, it increments x by 1; otherwise, it increments x by 5.
The second conditional statement checks if x is equal to 90. If true, it prints "X is not 90"; otherwise, it does nothing (pass).
To better understand the code, let's walk through a few iterations:
i = 0, x = 0: x is not greater than 50, so x += 5 => x = 5
i = 1, x = 5: x is not greater than 50, so x += 5 => x = 10
i = 2, x = 10: x is not greater than 50, so x += 5 => x = 15
...
The loop continues until x > 50. This will happen when i = 10, and x = 50:
i = 10, x = 50: x is not greater than 50, so x += 5 => x = 55
Now x is greater than 50, so it will be incremented by 1 in each iteration:
i = 11, x = 55: x is greater than 50, so x += 1 => x = 56
i = 12, x = 56: x is greater than 50, so x += 1 => x = 57
...
This will continue until x reaches 90:
i = 34, x = 89: x is greater than 50, so x += 1 => x = 90
Now x is equal to 90, so the program will print "X is not 90". The loop will continue to increment x by 1 in the remaining iterations:
i = 35, x = 90: x is greater than 50, so x += 1 => x = 91
...
The code snippet will only print "X is not 90" once, when x becomes 90 in the loop.
GPT-4 (as well as all GPTs before it) has a limitation: it has to produce its output in a single pass. It cannot pause and think, and it cannot backtrack. So yes, it makes weird mistakes sometimes.
A human programmer will need to look at the code, then think a bit, then look at it again, etc.
You can put programmers into a similar situation: try reading code aloud instead of showing it to them on screen. If they can't answer right, does it mean they aren't intelligent? Intelligence =/= never making a mistake.
Now that you know GPT's limitations, perhaps you would consider asking questions one at a time instead of intentionally trying to confuse it. Considering multiple questions in a single pass increases error rate.
> so probably any question you ask it has already been asked somewhere and any useful code you present to it or ask it to generate is included in its training data
The "it just recites training data" idea is demonstrably false. Do a bit of combinatorics. Or write a unique piece of code which is not a gotcha question and try it.
GPT is a transformer model. Transformers use the attention mechanism. The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output.
"Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences ...
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output."
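As a rough illustration of what that passage describes, scaled dot-product attention fits in a few lines of NumPy. This is a minimal sketch, not the actual GPT implementation, which adds learned projections, multiple heads, masking, and so on:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key, regardless of distance in the sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of every query/key pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key axis
    return weights @ V                                   # weighted mixture of the values

# Toy example: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```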
Beyond that, also note that LLMs are probabilistic machines. The output can vary from run to run, and there are a handful of knobs (such as temperature) to modulate it.
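Temperature, for instance, is just a rescaling of the model's logits before sampling. A minimal sketch with made-up logits:
```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, 0.1]                              # made-up scores over a 4-token vocabulary
print(sample_with_temperature(logits, temperature=0.2))    # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))    # much more varied
```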
Finally, I'm pretty sure we (or the workers in the field more like it /g) don't have a firm grasp on why certain failure modes occur. Likely this is due to the fact that we (they) also don't really have a good grasp on how the damn thing actually works its 'magic'.
What is clear is that a significant subset of our semantic universe is embedded in symbols and their usage by us and this subset is somehow encoded in neural nets. This captured subset in LLMs is what drives their uncanny generative abilities. What is missing is precisely what would make it plausibly intelligent, plausibly a reasoning agent operating in a coherent semantic context.
There are some who claim our minds are just like LLMs. Some of us who pay attention to our minds sometimes catch it making nonsensical noises and correct it. (As you age you begin to notice these things..) So it is interesting to this sentient (who makes claims to being) that my mind is just like my body, it is aging, certain parts are degraded, etc., but my 'whateveritis' that is me, my self, is as timeless as ever, and seems to be a spectator of the aging mechanism ..
> The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output.
This is not quite true: GPT, specifically, is auto-regressive. It computes things only looking back, not forward.
Given that each token has only a fixed computing budget, it is likely that GPT precomputes information which will be relevant to later tokens, to be routed via attention.
In fact, this effect was demonstrated in practice: e.g. in a prompt like "Question: Where is the Eiffel tower located? Answer: " people found that information about "Paris" is routed from tokens "Eiffel tower", i.e. this associative memory was looked up earlier than it was needed.
So I was answering from that perspective: it can do better if it knows what to pre-compute.
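The "only looking back" part amounts to a causal mask on the attention scores: a minimal sketch of the idea (again NumPy, not the real implementation) is the earlier attention example with everything above the diagonal blocked out.
```python
import numpy as np

def causal_attention(Q, K, V):
    """Autoregressive variant: token i may attend only to tokens 0..i, never ahead."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                # forbid attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(4, 8))
print(causal_attention(Q, K, V).shape)                      # (4, 8)
```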
Like I said in the parent comment, I use this tool heavily every day, for coding and non-coding work. The above was meant to be a tiny illustration. I have had long back-and-forths with GPT-4 trying to get it to understand some nontrivial useful code, or to generate some useful code, and no matter how I phrased it, it got it completely wrong even though the output looked superficially plausible.
We are talking about a computer program that cannot make decisions. Oftentimes human intelligence is judged by the kinds of decisions a person makes.
It can make decisions. E.g. if you give it a task, it can formulate sub-tasks, then execute them, e.g. by selecting which API calls to make.
Obviously, if you call the model, you define the scope for it, and it can only make decisions within that scope.
The scope is potentially unbounded. E.g. you can give it access to a terminal of a Linux machine which has an internet connection and tell it to do whatever it likes. (Given RLHF perhaps it would be better to give some positive goal, e.g. perform actions which would eventually produce knowledge.)
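As a concrete (and entirely hypothetical) sketch of what "making decisions within a scope" can look like: a loop in which the model's text output is parsed into one of a few allowed actions and executed. The `call_model` function and the action format here are stand-ins, not a real API:
```python
import subprocess

def call_model(prompt: str) -> str:
    """Stand-in for an actual LLM API call; returns text like 'RUN: ls' or 'DONE'."""
    raise NotImplementedError("plug in a real model client here")

def agent_loop(goal: str, max_steps: int = 10) -> None:
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        action = call_model(transcript).strip()
        if action == "DONE":
            break
        if action.startswith("RUN: "):
            # The model decides which shell command to run; we execute it and
            # feed the output back so it can decide the next step.
            result = subprocess.run(action[5:], shell=True,
                                    capture_output=True, text=True)
            transcript += f"\n> {action}\n{result.stdout}{result.stderr}"
```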
In what sense? How do you define intelligence and what is missing?
> finding so many failure modes.
Failures do not indicate a lack of intelligence. People make errors all the time.
We should differentiate failures which arise from poor training from those that reflect a lack of general capabilities.
There are people who learned a bad way to do something and struggle to change that. Should we consider them unintelligent?