This paper evaluates performance compared to a 'human', which presumably means an average adult human without cognitive impairment. I had to dig into the references:
> In the first batch of participants collected via Amazon Mechanical Turk, each received 11 problems
> (this batch also only had two “minimal Problems,” as opposed to three such problems for everyone
> else). However, preliminary data examination showed that some participants did not fully follow
> the study instructions and had to be excluded (see Section 5.2). In response, we made the screening
> criteria more strict (requiring a Master Worker qualification, 99% of HITs approved with at least
> 2000 HIT history, as opposed to 95% approval requirement in the first batch). Participants in all
> but the first batch were paid $10 upon completing the experiment. Participants in the first batch
> were paid $5. In all batches, the median pay-per-hour exceeded the U.S. minimal wage.
(Arseny Moskvichev et al)
So in conclusion, this isn't a random sample of (adult) humans, and the paper doesn't give standard deviations.
It would've been more interesting if they had sampled an age range of humans on which we could place GPT-4, rather than just 'it's not as good', which is all this paper can really say.
This was a first-pass study in a field, addressing some of the criticisms leveled against an earlier study whose spatial reasoning problems were viewed as too hard. They seemingly made the spatial reasoning questions as easy as they could.
The qualifications they put on MTurk are pretty standard if you want humans who care about what they are doing responding to your study. It costs more to do this.
It is a limitation of science that is both budgetary and procedural.
By calling into question their results you seem to be suggesting that an average human would only 33% of the time be able to tell e.g. how many points are inside a box, or whether more points are inside or outside of a box. This is extremely basic spatial reasoning we are talking about.
The problem they were addressing with the settings is just noise in the results by cheap bots and clicky humans trying to earn $0.50. It is endemic on MTurk.
Did they? I took it more as a clarification of a point so people could more easily know what it meant, and a wish that they had done more so we could know more.
Their last sentence in fact notes that what the paper says is correct; they just wish we could make a more nuanced statement if it were done with more specific constraints.
If you have another way to provide actual humans with arbitrary representative demographics for answering ten minute surveys for a total cost of around $1 each the market would love to hear about it.
Alternatively if you wanted to fund researchers and studies so they could afford to do these studies in a better way, I don’t think anyone would complain.
Several researchers where I work used to use MTurk regularly and now have been forced to stop using it entirely because of this. They started getting "As a large language model" answers ... even in things where the only possible answer was 'true' or 'false' or '1-7'.
Another thing I take issue with is that this doesn't seem to use known ways to improve the performance of LLMs, such as chain-of-thought and tree-of-thought prompting.
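For context, chain-of-thought prompting just means structuring the prompt so the model writes out intermediate steps before committing to an answer. A minimal sketch of what that could look like for one of these grid tasks (the wording and helper name are mine, not anything from the paper):

    def cot_prompt(task_text: str) -> str:
        # Hypothetical wrapper: ask for explicit intermediate reasoning before the answer.
        return (
            "Solve the following puzzle.\n\n"
            f"{task_text}\n\n"
            "First, describe the transformation rule you see in the examples, step by step. "
            "Only after that, give the final output grid on its own line, prefixed with 'ANSWER:'."
        )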
As I point out, the critique is simply that the paper isn't very good. In fact I'd go as far as to say it's egregiously bad, even if the conclusion is 'correct' insofar as something was measured and they reported it.
The conclusion of "GPT-4 can't even compete against low-quality human effort like that found on MTurk" seems alright. The article would be better if it didn't claim to compare "humans", because it doesn't – or at least, not in the common sense definition of "comparing humans vs GPT-4". But to be honest I think that's a bit pedantic.
It would be different if it had found that GPT-4 outperformed MTurk, but if GPT-4 can't compete against MTurk on these tasks, then I don't think it will fare better against you or me with our full attention.
If it’s making a claim about humans vs LLMs, but shittified capitalism means that the humans are just using LLMs to make money churning on Mechanical Turk, then you can’t make the underlying claim.
That ignores random sample problems and the simplistic statistical analysis…so yeah it unfortunately does.
It doesn’t prove they are wrong, but it’s not credible to say they are right
It's only important that humans (or "humans") were about 90% correct. Whereas GPT-4 was only 33% correct. If those turks were using AI, that would mean there is AI out there much more capable than GPT-4.
The problem is that unless the data are credible and reliable, any interpretation of that data is meaningless. For or against one’s hypotheses doesn’t matter if the data is junk. If the data is junk you can’t move past that, it’s the end of the conversation because the data has no meaning.
From reading the paper and the original paper that the data for the MTurk/Prolific samples are drawn from, this is a convenience sample of 415 humans on two platforms. Each worker received a random sample of the ConceptARC problems, and the average score correct is assigned the "Human" benchmark.
Perhaps by "random sample problems" you mean that the study is not representative of all of humanity? If so we can still take the paper as evaluating these 415 humans who speak English against the two models. If as you say, the workers are actually just using LLMs then this implies there is some LLM that your average MTurk worker has access to that out-performs GPT 4 and GPT 4V. That seems *extremely* unlikely to say the least.
There is no need for any complex statistical analysis here since the question is simply comparing the scores on a test. It's a simple difference in means. Arguably, the main place that could benefit from additional statistical procedures would be weighting the sample to be representative of a target population, but that in no way affects the results of the study at hand.
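To make the "simple difference in means" point concrete, a crude two-proportion z-test is enough when the aggregate accuracies are this far apart. The sketch below uses the roughly 90% vs 33% figures from the thread; the item counts are placeholders, not numbers from the paper:

    from math import sqrt

    def two_proportion_z(p1, n1, p2, n2):
        # Pooled two-proportion z statistic for comparing two accuracy rates.
        p = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Illustrative only: ~90% vs ~33% accuracy; the n's below are made-up item counts.
    print(two_proportion_z(0.90, 500, 0.33, 480))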
My wife studies people for a living (experimental cognitive psychologist); the quality of MTurk is laughable. If that's our standard for higher-level cognition then the bar is low. You'll see the most basic "attention check" questions ("answer option C if you read the question") be failed routinely. Honestly, at this point I think GPT-4 would do a better job than most MTurkers at these tasks...
She has found that Prolific is substantially better (you have to pay more for it as well), however that may only be because it's a higher-cost/newer platform.
My take is the tasks on Turk are awful and will drive away anybody decent.
I had a time when I was running enough HITs to get a customer rep and felt I was getting OK results. I wanted to get better at running HITs so I thought I would “go native” as a Turk and try to make $50 or so but I could not find tasks to do that were at all reasonable. Instead they’d want me to “OCR” a receipt that was crumpled up and torn and unreadable in spots and said they’d punish me for any mistakes.
> In the first batch of participants collected via Amazon Mechanical Turk, each received 11 problems (this batch also only had two “minimal Problems,” as opposed to three such problems for everyone else). However, preliminary data examination showed that some participants did not fully follow the study instructions and had to be excluded (see Section 5.2).
If they stuck to the average Mechanical Turk worker instead of filtering for "Master Workers," the parent's conclusions likely would've aligned with those of the study. Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.
> Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.
MTurk, to a first approximation, is a marketplace that pays people pennies to fill out web forms. The obvious thing happens. The median Mechanical Turk worker probably either isn't a human, isn't just a (single) human, and/or is a (single) human but is barely paying attention + possibly using macros. Or even just button mashing.
That was true even before GPT-2. Tricks like attention checks and task-specific subtle captcha checks have been around for almost as long as the platform itself. Vaguely psychometric tasks such as ARC are particularly difficult -- designing hardened MTurk protocols in that regime is a fucking nightmare.
The type of study that the authors ran is useful if your goal is to determine whether you should use outputs from a model or deal with MTurk. But results from study designs like the one in the paper rarely generalize beyond the exact type of HIT you're studying and the exact workers you finally identify. And even then you need constant vigilance.
I genuinely have no idea why academics use MTurk for these types of small experiments. For a study of this size, getting human participants that fit some criteria to show up at a physical lab space or log in to a Zoom call is easier and more robust than getting a sufficiently non-noisy sample from MTurk. The first derivative on your dataset size has to be like an order of magnitude higher than the overall size of the task they're doing for the time investment of hardening an MTurk HIT to even begin to make sense.
This is just coming up with excuses for the MTurk workers. "they were barely paying attention", "they were button mashing", "they weren't a single human", etc.
It turns out that GPT-4 does not have those problems. The comparison in the paper is not really fair, since it does not compare average humans vs GPT-4, it compares "humans that did well at our task" vs GPT-4.
> This is just coming up with excuses for the MTurk workers
No. The authors are not trying to study MTurk market dynamics. They are trying to compare humans and LLMs.
Both questions are interesting and useful. This study is only asking about the second question. That's okay. Isolating specific questions and studying them without a bunch of confounds is one of the basic principles of experiment design. The experiment isn't intended to answer every question all at once. It's intended to answer one very specific question accurately.
LLMs can both be worse at Mensa tasks and also better than humans at a variety of reasoning tasks that have economic value. Or, LLMs can be worse at those reasoning tasks but still reasonably good enough and therefore better on a cost-adjusted basis. There's no contradiction there, and I don't think the authors have this confusion.
> The comparison in the paper is not really fair
The study is not trying to fairly compare these two methods of getting work done in general. It's trying to study whether LLMs have "abstraction abilities at humanlike levels", using Mensa puzzles as a proxy.
You can take issue with the goal of the study (like I do). But given that goal, the authors' protocols are completely reasonable as a minimal quality control.
Or, to put this another way: why would NOT filtering out clickbots and humans speedrunning surveys for $0.25/piece result in a more insightful study given the author's stated research question?
> It turns out that GPT-4 does not have those problems.
I think the authors would agree, but also point out that these problems aren't the ones they are studying in this particular paper. They would probably suggest that this is interesting future work for themselves, or for labor economists, and that their results in this paper could be incorporated into that larger study (which would hopefully generalize beyond MTurk in particular, since MTurk, inter alia, is such a uniquely chaotic subset of the labor market).
For me, the problems with the study are:
1. The question isn't particularly interesting because no one cares about Mensa tests. These problem sets make an implicit assumption that psychometric tools which have some amount of predictive power for humans will have similar predictive power for LLMs. I think that's a naive assumption, and that even if correlations exist, the underlying causes are so divergent that the results are difficult to operationalize. So I'm not really sure what to do with studies like this until I find an ethical business model that allows me to make money by automating Mensa-style test-taking en masse. Which I kind of hope will never exist, to be honest.
2. MTurk is a hit mess (typo, but sic). If you want to do this type of study just recruit human participants in the old fashioned ways.
But given the goal of the authors, I don't think applying MTurk filters is "unfair". In fact, if anything, they're probably not doing enough.
This is interesting in a 'human interest news' report way, but it doesn't do anything to judge current systems, any more than average people thinking older, less capable chatbots were human did.
What every paper I have seen so far is missing is that there are many ways to achieve super-human intelligence. (I need to give credit to Isaac Arthur of SFIA for this.)
Getting results faster is one way. AIs beat me in speed.
Getting results cheaper is another way. AI is cheaper than I am.
Knowledge across many fields is better. AI beats me here too.
Getting better results in one narrow field is another way, but only one of many ways. I love evaluations of human-produced work vs. machine-produced work. If we had quality evaluations (not the typo-riddled garbage most people use), if we compared AIs to people who work in those fields in occupations recognized by the US Dept of Labor, if we asked both sides to justify their answers, and if we had statistically significant sample sizes, then maybe we could get some good results on quality of work. I can imagine the US DOL spending billions of dollars benchmarking AIs against humans in all the occupations they recognize. Alternately, this could be a very profitable company.
Note that the goal of ASI isn't to beat you per se, but the best humans.
GPT-4 is generally not competitive with domain specialists (outside areas where machines have already been competitive for some time). That of course doesn't mean it isn't valuable - but it becomes more for human augmentation than replacement.
This might be the case now but shouldn’t be a particularly useful distinction for very long— on the grand scale of intelligence, the difference between the smartest human and the average human is tiny.
I had hopes that we would be modifying them genetically to help everyone have productive places in our society. Instead, we seem determined to live out the movie Idiocracy while simultaneously living out Bridge on the River Kwai.
Before ASI can beat the best humans, it is going to wipe the floor with me. Maybe I'm selfish, but it's the time when my boss decides my worth that I'm concentrating on.
These are not "ways" of achieving superhuman intelligence, these are potential forms of superhuman performance.
It's also a very broad definition, because by that definition we have had superhuman performance in computation for decades. And we've been able to write programs for very narrow tasks that outperform humans for that time as well.
Generally when people are comparing artificial intelligence against human intelligence, they want an AGI. Humans aren't good at just one narrow thing, that's why they aren't all replaced by specialized programs already. We are talking about being able to describe a problem and the artificial intelligence can take it from there. That's why the bar is AGI, which is a much higher bar than any of your broad definitions.
Though I think we should take note that the programs we write for very narrow tasks that outperform humans may start widening in scope.
Things like object recognition have moved from "only humans can do this" to machines are keeping up with us. And I think that this is how it will be for at least some time. Instead of one big blast where ASI is better than humans at everything, more and more categories of narrow intelligence will widen and become superhuman.
I view an LLM as a search engine that can find and present knowledge in a way that is easier to consume.
For instance: Find me a solution to 2x2. I see this as a search problem, not a reasoning problem.
I think many scientists don't look at it from this angle and push the agenda that somehow current AI can reason. It can't. It's just a search engine / pattern matching contraption that works so well it can fool people into thinking they interact with an intelligence.
Frankly, many people "think" in the same way. They store facts in their brain and then retrieve them when asked, but they have limited ability to arrive at new facts from those they already know, without external input.
In that sense, the "AI" can certainly replace humans in tasks that don't require thinking, or where there is a need to distil information from a wide array of sources in a useful way.
I see it as translation, not search. Search is already done fantastically well at small scale with simple indexing, and (before adversarial reactions from SEO) at internet scale with PageRank.
Asking if LLMs are really reasoning or not feels like an argument about terminology, like asking if A* really is route planning.
The following is a logical reasoning problem. Determine which answer, A, B, C, or D is correct
FXGB, HXIB, JXKB, _____, NXOB
A. OXPB  B. KXMB  C. LXLB  D. LXMB
ChatGPT
The pattern in the sequence "FXGB, HXIB, JXKB, _____, NXOB" suggests an alternating increment in the letters surrounding "X".
Analyzing each component:
The first letter sequence: F, H, J, __, N
The pattern here is increasing by 2: F (+2) → H, H (+2) → J. Following this, J (+2) → L. So, the first letter of the missing term is L.
The third letter sequence: G, I, K, __, O
The pattern is the same as the first sequence: G (+2) → I, I (+2) → K. So, K (+2) → M. The third letter of the missing term is M.
The last letter sequence: B, B, B, __, B
The pattern is consistent: the letter B repeats. So, the last letter of the missing term is B.
Combining these, the missing term is LXM(B).
The answer is C. LXL(B)D.
----
Unfortunately I did not make up the question myself, so is it already in the training set?
I asked the question on GPT 4 and got the right answer:
To solve this logical reasoning problem, we need to look for patterns in the given sequence: FXGB, HXIB, JXKB, _____, NXOB.
1. Analyzing the first and last letters of each group:
- FXGB → HXIB → JXKB → ? → NXOB
- F → H → J → ? → N
Here, the pattern in the first letter seems to be skipping one letter in the alphabet each time (F +2 → H, H +2 → J, J +2 → L). So, the first letter of the missing group should be L.
Looking at the last letter, we see a similar pattern:
- B → B → B → ? → B
The last letter is consistently B, so the last letter of the missing group should also be B.
2. Now, for the middle letters:
- FXGB → HXIB → JXKB → ? → NXOB
The middle letter appears to be consistently 'X', and the next letter is incrementing by one each time (G → I → K → ? → O). Following this pattern, the letter after 'X' should be M (K +1 → L, L +1 → M).
So, putting it all together, the sequence should be LXMB, which is option D. Therefore, the correct answer is D. LXMB.
ChatGPT uses v3.5, which isn't very smart. If you run this through GPT4, it solves it correctly:
To solve this problem, we need to identify the pattern in the given sequence.
Let's examine the sequence step by step:
1. FXGB
2. HXIB
3. JXKB
4. _____
5. NXOB
Now let's look at each component of the sequences:
The first letter in each set:
- F (6th letter of the alphabet)
- H (8th letter)
- J (10th letter)
- ? (We need to find this one)
- N (14th letter)
We can observe that the sequence is increasing by 2 each time (6, 8, 10, ?, 14). Therefore, the missing letter should be the 12th letter of the alphabet, which is L.
The second letter in each set is consistently 'X', so we can assume that it will remain 'X' in the missing set as well.
The third letter in each set:
- G (7th letter of the alphabet)
- I (9th letter)
- K (11th letter)
- ? (We need to find this one)
- O (15th letter)
Like the first letter, this sequence is also increasing by 2 each time (7, 9, 11, ?, 15). Therefore, the missing letter should be the 13th letter of the alphabet, which is M.
The fourth letter in each set:
- B (2nd letter of the alphabet)
- I (9th letter)
- B (2nd letter)
- ? (We need to find this one)
- B (2nd letter)
There doesn't seem to be a consistent pattern for the fourth letter, but we can see that it is alternating between 'B' and 'I'. Given the pattern, we would expect the fourth letter in the missing set to be 'B' to maintain the alternation.
Putting it all together, the missing set should be:
LXMB
Therefore, the answer is D. LXMB.
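As an aside, the +2 stepping both transcripts describe is easy to check mechanically; a quick sketch (mine, not part of either transcript):

    def term(i: int) -> str:
        # First and third letters advance by 2 per step, starting at F and G;
        # the second and fourth letters are fixed at X and B.
        first = chr(ord("F") + 2 * i)
        third = chr(ord("G") + 2 * i)
        return f"{first}X{third}B"

    print([term(i) for i in range(5)])
    # ['FXGB', 'HXIB', 'JXKB', 'LXMB', 'NXOB'] -> the missing term is LXMB, option D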
It blows my mind how people persistently proudly proclaim the superiority of the human mind over artificial intelligence, while making silly mistakes like comparing themselves against an outdated inferior model instead of the state-of-the-art!
This response is actually quite illustrative of the stochastic nature of LLMs. Under normal circumstances a human would never make this kind of error.
The system remains coherent long enough to produce the right answer, yet it still confidently selects the wrong answer, because it only understands the question as a function of statistical relationships between tokens, unlike humans, who understand the question at a conceptual level.
Maybe I'm missing what "abstraction" means here but seems like the tasks were centered around grids and other spatial problems, which are a very limited subset of abstraction/reasoning.
In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general. Positions, rotations, etc. are a concept that GPT4 finds very hard to apply, which is kinda unsurprising since it has no body, no world, no space; it "lives" in the realm of text. DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.
It's also worth remembering that blind humans who can recognize squares by feel do not have the ability to recognize squares by sight upon gaining vision.
I suspect the model is bad at these kinds of "reasoning" tasks in the same way that a newly-sighted person is bad at recognizing squares by sight.
> In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general.
The problem with a statement like that is that it leaves the door open to accepting any kind of canned generality as "abstraction in general". Abstract reasoning is indeed a fuzzy/slippery concept, and spatial reasoning may not capture it well, but I'm pretty sure it captures it better than a general impression of ChatGPT does.
> ...since it has no body, no world, no space; it "lives" in the realm of text.
There's a bizarre anthropomorphism in this thread, both the reflexive comparison of this software system to a blind human and the implicit call to be considerate of this thing's supposed disability.
Why is it bizarre to consider the limitations inherent in the input data on which the model is trained? Fundamentally, it still "sees" the world through text, and the extent to which it can "understand" spatial relationships is defined by that. It seems utterly unsurprising that this leads to very poor grasp of the actual concepts behind what things like "above" or "left" are - the text that humans produce when talking about such things kinda relies on the reader having their own experience (if not vision, then at least body awareness) that can be mapped to those concepts. You can explain "left" and "right" to a human by telling them which of their hands is which, and I can't help but wonder what the actual information payload this constitutes when you consider the body spatial awareness that is brought into context by association.
> Why is it bizarre to consider the limitations inherent in the input data on which the model is trained?
Sure, the thing is limited; the study is a demonstration of this (and general-purpose abilities have been claimed for LLMs at various points).
I was pushing back against the "it's like a blind person" anthropomorphizing argument [edit: especially the assumption these things learn through experience and reflection which the parent also makes]. Maybe if the thing "had eyes", it could learn spatial information and maybe it couldn't (though it would take a lot of work to make that metaphor meaningful). The thing certainly doesn't learn text in the fashion that human learns speech since humans don't digest the entire Internet before they can speak.
Apparently it doesn't improve abstract reasoning capability, because according to the article the multimodal gpt4 did just as dismally as the text-only gpt4. This was surprising to me, as I would have expected an improvement with a model that did include spatial relationships.
Technically true, but when those tokens are 1:1 mapped to text, I think we can simplify this down without losing anything important.
Of course, once you start using tokens for other things - as multimodal LMs already do - that changes. But this current crop of models still has visual modality in its infancy IMO, and gauging the overall performance of the model as a whole based on that is very questionable.
> Technically true, but when those tokens are 1:1 mapped to text
I don't know what GPT-4V does particularly, but my understanding is that multimodal models very often have an expanded token space with special tokens related to image handling, so, literally, there is not a 1:1 relationship of tokens to text.
A string of tokens is text. Tokens are just another alphabet, like Japanese, where many characters can represent the same sounds and a single character can sometimes be an entire word.
By the very fact that there's a paper here, whatever its merit, the authors of the paper have codified their concept of generality, and this doesn't validate the point I was replying to, which was essentially "my impression/feeling is that it is better".
Point is that it's good at abstract reasoning that isn't spatially grounded like in that paper. So it's not really leaving any door open. It's not a cop out. That's just how it is.
> DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.
This has nothing to do with having "no body, no world" and everything to do with the fact that training pictures where things are upside down are simply vastly rarer than pictures where they aren't.
What would directions be for an intelligent creature that lives in zero gravity? I just like thinking about this for the same reasons humans like writing speculative science fiction. Trying to guess what alien perspectives look like, might also give us insights when we're the ones making the alien.
However, North, South, East, and West are relative to the poles of the Earth. Something living in zero gravity would have to use some object as an anchor to determine the direction.
You’re also oriented based on objects. We don’t have an abstract compass pointing north 24/7 the way we can use our bodies to determine left and right or gravity to point down.
The solar system has a north pole and a south pole based on the rotation of the Sun. Basically the only places in which there isn't something to orient against are in the depths of inter-galactic-cluster voids with nothing around. And if a being is stuck in one of those voids, orientation is way down the list of problems they have.
FWIW there is some interesting variability among human cultures on that, as well. There are a few that actually use cardinal directions predominantly or exclusively instead of body-relative ones like "left" and "right".
No, but they would have front and back, and people from the bridge would share which way was “up” and “down” and “left” and “right” based on the controls.
* How would human beings perform on the text-only version of the tasks given to GPT-4?
* How would human beings perform if each grid is shown on its own, making it impossible to perform side-by-side visual comparisons?
* How would human beings perform if each grid is shown on its own only once, making it impossible to perform any back-and-forth comparisons?
* How could we give LLMs the ability to "pay attention" to different parts of images, as needed, so they can make back-and-forth comparisons between parts of different images to solve these kinds of visual reasoning tasks?
> How could we give LLMs the ability to "pay attention" to different parts of images, as needed, so they can make back-and-forth comparisons between parts of different images to solve these kinds of visual reasoning tasks?
Also, I want to know how much gain could be made by optimizing the prompt for GPT and by including things like CoT. The current version of the prompt is pretty bad both for humans and AI.
* How would human beings perform if they didn't know they were being tested (ie: if in the same mode they are in when writing comments on the internet)?
* How would human beings perform if the questions are based on culture war topics, which tend to invoke System 1 intuitive/emotional thinking?
If you look at the appendix, you can see example transcripts. The sample they provide looks like a very bad eval.
It’s encoding an originally visual problem into a textual matrix form, and then expecting GPT to recognize visual correlations. You simply can’t compare these two tasks! Most humans wouldn’t recognize the 5x5 matrix for a 4x4 square.
So the comparison with “human level” is completely invalid. And even the valid comparison is only measuring visio-spatial intelligence, not IQ.
Humans are biased towards grokking two-dimensional information arranged in grids because that's how vision works. (Non-V) GPT is not. Its natural gestalt has a vast number of dimensions. It is, and it should, be able to find correlations across data no matter how the input is arranged. (Note that for GPT text is not linear even if it appears so to humans.)
To humans it would be qualitatively more difficult to recognize patterns in four or six or ten-dimensional data than in two dimensions. To GPT, it's just another variation of the same problem, one that's at most quantitatively more difficult.
> And even the valid comparison is only measuring visio-spatial intelligence, not IQ.
You're the first one to mention IQ. The paper definitely does not. In any case "visio-spatial intelligence" is a human concept. GPT doesn't distinguish between "visio-spatial" and any other kind of pattern recognition.
It has been interesting to see evidence accumulating that shows, despite initial excitement bred by papers such as "Sparks", there is something missing from current language models.
Individually none of these results will ever get the attention of a "Sparks" type paper, but collectively a strong case has been built.
Sparks of AGI is not AGI. It's also possible that we're not testing LLMs fairly, or that merely slight tweaks to the architecture or methods would address the issues. I think this comment elaborates nicely:
I think embodiment and the encoding of the natural laws (gravity, force, etc) that go into that will be another huge step at grounding AI. People tend to gravitate to thinking about humanoid robots when that is mentioned (and thereby terminators), but honestly I would think things closer to sensor networks involving thousands or millions of bodies like a hivemind would be more likely (why stick at the human level of a single body if you didn't have to). Interaction with the world is a means of determining truth... The ability to perform science.
And as hard as embodiment is, it will be the easy part in my opinion. Continuous learning without losing the plot is going to be quite the challenge. If an LLM has something wrong, how does it update and change that bit of information without huge amounts of power use? How do you make the system learn 'important' things without filling up with junk/spam it is subject to? How do you keep the system aligned with a goal that is not destructive to itself or others?
But embodiment being a bottleneck could indicate that it's a data/training issue, rather than an architectural issue. Multimodal training data improves GPT-4 already, but that's still very little data compared to growing up to a full human adult. There are still many things to try.
That has always been my impression, despite the myriad ways that LLMs impress.
So much potential is lost just in the request/response limitation. While I’m waiting for a response from GPT-4, I’m continuing to think. Imagine if the reverse were true. AGI needs to be able to mull things over for spans of time.
At least for any company trying to sell a product, this is going to be an issue with operations costs.
Also this gets into the halting problem. How many resources do you expend on finding an answer? For a human, issues typically come up: we have to go pee, or eat, or something outside our body interrupts us. For an AI, how much time should it spend? Do we want to wake up one day finding our data centers running at full tilt?
This said, there have been some attempts at working on agent based systems that reach out for answers from multiple places and pool the data then run things like chain of thought on that data pool.
> Graduating with a bachelor's degree from Columbia University in 2003, she began post-graduate work in cybernetics, learning to construct positronic brains such that responses to given stimuli could be accurately predicted. She joined US Robots in 2008 as their first Robopsychologist, having earned her PhD. By 2029, when she left Earth for the first time to visit Hyper Base, her formal title was Head Psychologist.
Conclusion is obvious, but the paper is still probably necessary.
Of course LLM's can't reason. They pattern match answers to previously asked questions, and humans will read the text as a reasonable answer because we assign meaning to it, but there is simply no way an LLM could use a "mental model" to "reason" about a problem other than constructing sentences out of probable matches it's been trained on.
The conclusion may be obvious to you and me (although it's hard to know for certain, since these available LLMs are black boxes). But it's definitely not obvious to everyone. There are plenty of people saying this is the dawn of AGI, or that we're a few short steps from AGI. Whereas people like Gary Marcus (who knows tons more than I do) says LLMs are going off in the wrong direction.
Yes, LLMs can't reason 100% correctly, but neither do humans. We can often reason correctly, but it's not always the case.
Even reasoning, fundamental as it is, comes from feedback. Feedback from our actions teaches us how to reason. Learning from feedback is more general than reasoning - AI agents can definitely learn this way too, if they have enough freedom to explore. But you can't do it with supervised training sets alone.
You need to put language models into agents in environments, give them goals and rewards. Then they can make their own training data and mistakes, build up their own experience. Can't teach an AI based on how people make mistakes, it needs to fix its own mistakes, but that means to deploy it in the wild, where there are consequences for errors.
If you remember, DeepMind first tried to train a Go model on human game play, but it was limited. Then they started from scratch, and learning from feedback alone they surpassed human level, even though the feedback was one single bit of information at the end of a whole self-play game. And it had no pre-training prior to learning Go, unlike human players.
That presupposes that language encodes the world we live in completely, whereas in fact language is meaningless without the shared context of reality. 'up' and 'down' are only meaningful to intelligence that can experience space.
Essentially LLMs just are oracles for the shadows on Plato's cave.
The LLMs do indeed deal with Plato's shadows, but so do we - what we "see", after all, is not the actual up or down, but a series of neural activations from our retinas (which aren't even 3D, so concepts like "behind" are only captured by proxy). Such activations can all be readily encoded into tokens, which is exactly what models specifically trained to describe images do.
Do a reverse Chinese room experiment - remove from a human all the abilities multi-modal LLMs gain after training on human media. What's left? Homo ferus.
Most of our intelligence is encoded in the environment and language, it's a collective process, not an individual one. We're collectively, not individually, very smart.
Of course you're saying that LLMs can only train on textual data, whereas we are developing multimodal AI at this time that takes things like visual, audible, and whatever other kinds of sensor data and turns them into actionable information.
TLDR: Internal LLM representations correspond to an understanding of the visual world. We've all seen the Othello example, which is too constrained a world to mean much, but even more interesting is that LLMs can caption tokenized images with no pretraining on visual tasks whatsoever. Specifically, pass an image to an encoder-decoder visual model trained in a completely unsupervised manner on images -> take the encoded representation -> pass the encoded representation to an LLM as tokens -> get accurate captions. The tests were done on gpt-j, which is not multimodal and only has about 7bn params. The only caveat is that a linear mapping model needs to be trained to map the vector space from the encoder-decoder model to the embedding space of the language model, but this isn't doing any conceptual labour; it's only needed to align the completely arbitrary coordinate axes of the vision and language models, which were trained separately (akin to an American and a European agreeing to use metric or imperial — neither’s conception of the world changes).
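For anyone wondering what that "linear mapping" amounts to in practice, it is roughly a single learned projection from the vision encoder's feature space into the language model's embedding space. A minimal PyTorch sketch; the dimensions and the prefix-token framing are illustrative assumptions, not the exact setup of the work being referenced:

    import torch
    import torch.nn as nn

    class LinearBridge(nn.Module):
        # Map a pooled vision-encoder feature into a short sequence of "soft tokens"
        # that can be prepended to the language model's input embeddings.
        def __init__(self, vision_dim=768, llm_dim=4096, n_prefix=10):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim * n_prefix)
            self.n_prefix, self.llm_dim = n_prefix, llm_dim

        def forward(self, image_features):
            # image_features: (batch, vision_dim), e.g. a pooled encoder output
            x = self.proj(image_features)
            return x.view(-1, self.n_prefix, self.llm_dim)

    bridge = LinearBridge()
    soft_tokens = bridge(torch.randn(1, 768))  # shape: (1, 10, 4096)

The projection is linear precisely so that it cannot be doing the conceptual work itself; it only rotates and scales one coordinate system into the other.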
It's not intuitive, but it's hard to argue with these results. Even small LLMs can caption images. Sure, they don't get the low-level details like the texture of grass, but they get the gist.
I keep reading your sort of analysis, but honestly, those priors need updating. I had to update when learning this. If 7bn params can do it, 175bn params with multimodality can certainly do it.
It's true that humans need symbol grounding, but we don't see hundreds of billions of sequences. There are theoretical reasons (cf category theory) why this could work, albeit probably limited to gist rather than detail.
The real question isn't whether the LLM can reason.
The question is whether an assembly of components, one of which is an LLM (others would include memory and whatever else is needed to make it a self-contained loop with a notion of self-identity), can reason.
And this assumption has only gotten more common in the ~56 years since (so the anecdote goes) ELIZA fooled Weizenbaum's secretary.
"Consciousness" has ~22 different definitions; some are easily met, some are not even met by the humans giving those definitions (for example, I've seen people here on HN who write as if consciousness can somehow magically solve the halting problem).
It has to do with the fact that all our definitions of "sentient" and "intelligent" are extremely vague, often vacuous, and generally boil down to "I know it when I see it" in practice.
So there should be absolutely no surprise that, when something comes along that behaves in certain ways very much like a human (and unlike anything else), it is assumed to share the basic qualities of such. It is how we judge other people, after all - it's not like you do a brain scan to determine whether someone you are talking to is "sentient", whatever that means to you. You judge that by their behavior, modelling what a "sentient" would do and comparing the result to what they actually do.
I suspect this has to do with how shallowly people even regard their fellow man. To the point that if the outputs of an entity parrot humans well enough then, well, there couldn't be any differences whatsoever, and it must bear all the titles which we gift to humans.
Ftr, I mostly agree with the comment but downvoted it for the low quality "of course that's the answer, obviously" padded with some unhelpful tropes about trained pattern matching. (Let us know when you can prove that human communication is not just trained pattern matching)
I don't need to prove anything. We know for a fact that LLMs are stochastic parrots. That's why they confabulate, if they were "intelligent" they wouldn't just pass off nonsense as fact.
They have been fed incredible amounts of data which allows them to stochastically parrot really complex things, this makes people think they're intelligent. Designing a test that actually tests intelligence rather than parroting ability is tricky, but whenever people do they fail them.
I just logged into ChatGPT and asked it repeatedly to generate a novel thought experiment intended to prove that LLMs are incapable of reasoning. I kept trying to refine it by explaining what was wrong with each suggestion.
It completely failed. It just spat out variations of well known dilemmas or paradoxes, even when I specifically asked it to create something original it just made nonsense.
Ok, I'll bite.
I personally don't think human communication is simply pattern matching (e.g. we have drives, motives, relationships that we want to tell each other, these are internal systems, outside the visible band of communication).
That's a little off topic though, my post was not about communication, but reasoning.
Pattern matching in LLMs is specifically at the language level, as in "predict the next token in a stream", I should have probably specified this bit.
You seem to refer to a more general kind of pattern matching, where we're looking at a complicated structural model within which humans reason about the behavior of different system components and their interactions.
There is no way to form an internal representation of such systems within an LLM, which is why I said "of course" it can't reason. An LLM is not designed with this capability in mind.
If all you have is language, and predictive text, there is no way to use this to simulate and predict the rules in a higher-dimensional space than "stream of tokens", or let's put it another way: there would need to be an extremely high correlation between the pattern in the problem space and in language space for the likelihood of the model guessing right to be anywhere near human levels.
> There is no way to form an internal representation of such systems within an LLM, which is why I said "of course" it can't reason. An LLM is not designed with this capability in mind.
A complex system clearly does not need to be designed to evoke some behavior. Just look to the concept of emergent phenomena. See evolution through natural selection as an obvious counterexample.
Evolution selects for fitness and not reasoning ability. Yet here we are.
It is not known if the same isn't happening now for the transformer model.
LLMs are universal function approximators. If reasoning ability is at all representable through such a process, it is possible for LLMs to find it.
Neural networks are universal function approximators; transformers are universal sequence-to-sequence approximators. Crucially, the sequences here are the streams of tokens in an LLM. This means an LLM can do arbitrary language manipulation.
If the reasoning task requires something that can't be done using language manipulation in embeddings-space, it won't be able to do it.
Idk, maybe my mathematical intuition is wrong, but it seems obvious to me that if all you have is a function on a vector (representing a token stream), then the only transformations you can create are ones that can be expressed in that vector space. In other words, if you express a problem in a more complicated space (e.g. a visual problem, or an abstract algebra problem), you will not be able to solve it in the smaller token space, there's not enough information. You COULD first transform the vector into a different representation, perform the transformation there, and then encode back.
This step would have to be different for every reasoning problem, so the underlying transformation to solution space would have to be generic and move into one of N reasoning subsystems.
Also, if we hypothetically assume this limitation doesn't exist, we would still have the issue that a trained LLM represents a single such transformation function.
If we want generic reasoning skills, the approximated function would have to be different for every reasoning problem. Even if we use a subset of the vector's degrees of freedom to express each type of reasoning, it would make any truly reasoning LLM computationally intractable, because the embeddings space would blow up.
What couldn't you possibly represent as a bunch of numbers to the level of precision necessary to process it? We might not have sufficient compute power for LMs large enough for some tasks yet, but I don't see how it's different in principle.
And yes, of course such representation would be different for different problems. We also have organs in our body that do that for various types of sensory inputs that ultimately all feed into the brain to be processed as quantized impulses.
And GPT-4 is already supposed to be an "assembly of experts", i.e. different models trained for different tasks, with an orchestrating model weaving them together. This is also not a surprise, given that our brains aren't monolithic, either. You should be able to do it with a single model that's large enough, but it is likely more energy- and time-efficient to get there by putting smaller pieces together instead.
It's just that there is no way for the model to map the token meanings to a useful representation.
Yes, in the general case, you can make a model for most types of logic problems (but be careful there; there are real problems you couldn't represent as "a bunch of numbers" and a single transformation). The issue is that the model cannot build that representation even for simple cases, because it is set up to operate on language (and maybe some other fuzzy input type), which would have to be _reliably_ mapped into this number space, in different ways for different problems, without breaking the language processing.
> In other words, if you express a problem in a more complicated space (e.g. a visual problem, or an abstract algebra problem), you will not be able to solve it in the smaller token space, there's not enough information
You're aware multimodal transformers do exactly this?
Even that space is not large enough for the "any reasoning system" space.
The functions also have to be representable in a way that the model can transform them usefully. If the model does not have the capability to transform into a different arbitrary space for every separate sentence, reasoning is not possible.
Even then, reasoning would IMO be limited by other representation effects. But maybe we could have a discussion at that point.
I just don't think transformers are going to get us to an intelligent agent, which is completely ok, they are a useful ML tool that we can use to create somewhat accurate models for complicated systems, and that's what they're designed for.
> If the model does not have the capability to transform into a different arbitrary space for every separate sentence, reasoning is not possible.
You're making an authoritative statement on how reasoning works based on conjecture. I remind you that this field is in active study.
The "function" that sufficiently approximates the phenomenon of higher order reasoning may well mode collapse into a compact representation that need not exhaustively span the space of all "sentence representations".
You can illustrate for yourself that it is not at all clear your premise is a necessary condition.
Our brains can reason. Are you proposing its mechanism of action is by "transformation into a different arbitrary space for every sensory experience X"?
I mean maybe? But what if there exists some higher order space that actually generalizes perfectly and captures the basis vectors of these seemingly orthogonal spaces? That can also be the case. Which is why in truth we don't yet know, and to say your view is correct is suspect.
Hope you see now why this claim is an overstatement and missing some qualifiers.
Well, you need to be able to simulate what happens in the problem domain in order to arrive at a conclusion.
There isn't really any other feasible option.
This is what humans do, we visualize and predict the outcome of actions within a system. And if our imagination is not enough, we can outsource the reasoning to the outside world, by doing experiments or using tools like geometry etc. (drawing in the sand, drawing on paper).
It is impossible to arrive at a conclusion without "doing the work", unless you've seen the result before and can recite it.
Therefore, you need to be able to model the problem domain in order to solve the problem.
You don't have to be an AI "hype bro" to take issue with the reductive and trite notion that LLM's are stochastic parrots only. There is a continuum between that and AGI.
Well I take issue with the reductive and trite notion that just because an LLM can generate plausible text it's suddenly maybe conscious and intelligent, maybe about to end humanity etc.
It's exactly like the crypto hype wave. Everyone dreaming up semi-plausible futures, based on a whole chain of unfounded assumptions.
It's plausible text, and it's useful text. LLMs aren't just speculative vehicles in search for a problem, as most of crypto is, they are useful, today, right now. They don't require any assumptions to be so, nor do they have to be skynet world ending AGI to be that.
You can point out problematic extrapolation of doomers without being reductive towards the very real and very useful capabilities of LLMs.
The only thing the LLM is missing is a self-actuation loop.
We put a camera on a multimodal LLM, it interprets the visual world and sees before it a bunch of blocks. It looks at the task list it has that says "pick up red blocks, put them in blue bin". The visual component identifies the red blocks and the textual component issues commands to its drive unit, which calculates the best path and how to use its manipulators to pick up the blocks.
This is a very basic chain of existence. We have world identification, we have actuation on motivation, we have interaction with the environment. We can do this now. These goals have already been achieved. Companies are already testing more complex models with much more general instructions such as "Pick up everything you think is trash" or "Organize this room" to see the emergent behaviors that come out of the models.
You are seemingly a few years behind what has already been done, and why people are starting to get concerned.
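In pseudo-Python, the loop being described is just perceive -> decide -> act; every component below is a hypothetical stub, not any real product's API:

    # Toy perceive -> decide -> act loop; all stand-ins are illustrative.

    def perceive(frame):
        # Stand-in for a multimodal model describing the camera frame.
        return ["red block at (2, 3)", "green block at (4, 4)", "blue bin at (5, 1)"]

    def decide(observations, task="pick up red blocks, put them in the blue bin"):
        # Stand-in for the language model turning observations + task into actions.
        actions = []
        for obs in observations:
            if "red block" in obs:
                actions.append(("grasp", obs))
                actions.append(("place", "blue bin"))
        return actions

    def act(action):
        # Stand-in for the drive unit / manipulator controller.
        print("executing:", action)

    for step in decide(perceive(frame=None)):
        act(step)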
I've actually been working on a similar eval task that uses grids of symbols to evaluate LLM's reasoning ability around the game Wordle. My initial results are:
Interestingly, ChatGPT-4 wrote a short python script (unprompted) to develop a pseudo-CoT type generation and answered correctly. Phind, which also answered correctly, is also a code-based/code-enabled LLM.
Non-code-based LLMs appear to reason cogently when describing their games - what moves are illegal, what the best guess in certain conditions is, etc - but their actual answer, even in multiple choice, is comically wrong.
These are very initial results but my tentative hypothesis is symbolic logical reasoning - string manipulation / function calling type tasks - are greatly enhanced by having a code modality. How to switch between natural language and code-generation may offer a strong boost on tasks like these.
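To make "grids of symbols" concrete: the core of such an eval is just the Wordle feedback function, which is pure string manipulation. A rough sketch of that function (mine, not the actual harness):

    from collections import Counter

    def wordle_feedback(guess: str, answer: str) -> str:
        # 'G' = right letter, right spot; 'Y' = right letter, wrong spot; '.' = absent.
        guess, answer = guess.lower(), answer.lower()
        feedback = ["."] * len(guess)
        remaining = Counter()
        # First pass: exact matches.
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                feedback[i] = "G"
            else:
                remaining[a] += 1
        # Second pass: misplaced letters, limited by remaining letter counts.
        for i, g in enumerate(guess):
            if feedback[i] == "." and remaining[g] > 0:
                feedback[i] = "Y"
                remaining[g] -= 1
        return "".join(feedback)

    print(wordle_feedback("crane", "caper"))  # -> "GYY.Y"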
It has been really interesting, over the last few years of machine learning, to read that a model can't do this or that, and then the next week or month read that it can. ChatGPT and the models that have come after it seem to have accelerated this back and forth a lot. Unless you keep up with it closely and keep updating your information, what you knew it could or couldn't do well is probably no longer correct.
>The paper investigates the abstract reasoning abilities of text-only and multimodal versions of GPT-4 using the ConceptARC benchmark, concluding that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
The best thing about research like this is that it allows new models to be built, or improvements to be made on existing ones that lead them to pass these evaluations.
Yeah, but the question remains: are they passing because of a newly obtained/emergent quality, or are they passing because they were taught specifically how to pass, and is there even a difference at this point?
Well, while training AI we make sure that we include enough training for them to pass some evaluation, but then test them on things that were not in the training data to make sure that they are not "overfitted".
I don't think parent meant training specifically on that test data, but rather in that kind of task. Think more "LLM trained only on images of art" doing badly on this test, but then "LLM trained on images of art and IQ puzzles" doing better on this test.
It's kind of like asking what even is IQ if you can learn how to solve Mensa puzzles and improve your score. Does it mean you're more intelligent?
One guess for GPT4 is that it's an amalgamation of models. So you would have various models trained more specifically like the ones you mentioned, and you ask them all to start answering the query, then you choose which one produces better results, and then you present that to the user.
Alternatively you can have a decider model that knows what kind of query goes to which specialised one, then have that as a "hidden layer", whether it be at the application level or in the "neuron layers".
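As a toy illustration of the "decider as hidden layer" idea (purely speculative; nobody outside OpenAI knows how GPT-4 actually routes), the application-level version is just a router over specialised handlers:

    def code_expert(q):    return "[code model] " + q
    def vision_expert(q):  return "[vision model] " + q
    def general_expert(q): return "[general model] " + q

    EXPERTS = {
        "code":   ({"function", "bug", "python", "compile"}, code_expert),
        "vision": ({"image", "picture", "grid", "diagram"}, vision_expert),
    }

    def route(query):
        # Toy decider: score each specialist by keyword overlap, fall back to general.
        scores = {name: sum(kw in query.lower() for kw in kws)
                  for name, (kws, _) in EXPERTS.items()}
        best = max(scores, key=scores.get)
        handler = EXPERTS[best][1] if scores[best] > 0 else general_expert
        return handler(query)

    print(route("Why does this python function not compile?"))

In a real mixture-of-experts model the routing happens per token inside the network rather than per query at the application level, but the shape of the idea is the same.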
If you've checked out the docs for the assistant api, you can intuit that there is a higher level system which decides which subsystems to use to respond. The assistant determines whether to use a tool (vision, code interpreter, search, retrieval), as well as which code or text language model to use to generate the response.
Your last statement reminds me of the folktale of John Henry. That of generalized capability versus that of specialization.
Under a narrow range of tests, specialization is pretty much guaranteed to win. Specialization in all cases I know of comes at a cost to general capability. It's like one of those quips, "Pick any two: fast, cheap, good": the more you pull at one category, the more the costs rise in the other categories.
That's the idea, yes. However none but OpenAI knows exactly what ChatGPT was trained on. In fact, the dataset that it was trained on is so vast that they probably don't know either if it contains any given question.
IIRC last week I saw a study where GPT4 can solve some leetcode problems simply by giving it the number, no description. A clear example of overfitting.
These tests should be conducted on new questions.
And if we as humans no longer have the ability to create original questions, then maybe we should just retire.
Raw json files for each task from the ConceptARC paper are here: https://github.com/victorvikram/ConceptARC. It would be trivial to convert these into prompts using the format in the appendix of the paper.
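Assuming those files follow the standard ARC JSON layout (a dict with "train" and "test" lists of input/output integer grids), a generic conversion might look like the sketch below; the exact wording of the template in the paper's appendix isn't reproduced here:

    import json

    def grid_to_text(grid):
        # Render an integer grid as space-separated rows, one row per line.
        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

    def task_to_prompt(path):
        # Assumes the standard ARC JSON layout: {"train": [{"input": ..., "output": ...}], "test": [...]}.
        with open(path) as f:
            task = json.load(f)
        parts = ["Below are example input/output grid pairs that all illustrate one concept."]
        for i, pair in enumerate(task["train"], 1):
            parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
            parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
        parts.append("Test input:\n" + grid_to_text(task["test"][0]["input"]))
        parts.append("What is the corresponding output grid?")
        return "\n\n".join(parts)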
It was invented by a Hungarian to impress an Austrian. The Turk portion of the name is from the original Hungarian. It toured Europe for 80+ years. I have no clue what about this you think has anything to do with English culture.
That’s nothing. In American English, calling someone a turkey would be an insult. Turkey is how we refer to the nation Türkiye Cumhuriyeti. Imagine calling an entire nation an ungainly bird that we ritually consume every November. Almost as bad as giving Michael Jordan his own seat at the UN. But that’s what capitalism and the US hegemony have brought us.
> Imagine calling an entire nation an ungainly bird
The bird was named after the country. And the country after the ethnic groups who spoke the Turkic languages, which derive from the native term Türk, which goes back at least 1400 years. Changing the spelling and pronunciation changes nothing.
Saying that we called the country after the bird is buying into populist rabble-rousing without doing even five minutes of fact checking.