> These things aren’t deterministic so it’s hard to even use things like trial-and-error experiments to figure out what works, which as a computer scientist I find completely infuriating!
Nitpick: they can be made "deterministic" in a strict sense by using a deterministic sampling scheme, for instance by turning "temperature" down to 0. This isn't necessarily helpful for prompt experimentation, though - when deployed, prompts are usually combined with other input, so you risk simply substituting the non-determinism of sampling with the unpredictability of whether it will work with all inputs.
In my own experiments with OpenAI's GPT-4 API with temperature set to zero, I was still not getting deterministic outputs, with some small variations between completions. Not sure why, and I haven't had a chance to dig further or talk to their team about why and how this happens.
I think the right term to use here is "more reliable". Setting temperature to 0 will certainly make outputs more reliable for the same inputs. But GPT is inherently nondeterministic, both because it distributes lossy computations across different devices and because it uses several models under the hood. There's no getting around that when you have enough scale.
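For anyone who wants to try this themselves, here's a rough sketch (assuming the openai Python client and an API key in the environment; the prompt is just a placeholder) that sends the same prompt twice at temperature 0 and checks whether the completions match:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def complete(prompt: str) -> str:
        # temperature=0 makes sampling effectively greedy, but completions can
        # still differ slightly between calls because of non-deterministic
        # floating-point work on the backend
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content

    prompt = "Name three prime numbers greater than 100."  # placeholder prompt
    a = complete(prompt)
    b = complete(prompt)
    print("identical" if a == b else "different")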
> The fascinating thing is that capabilities of these models emerge at certain sizes and nobody knows why
Is there a more technical dive into this statement?
In particular, is it an emergent statistical property of the model and/or the training data size or an emergent illusion of the human observer of model outputs (the way we visually perceive fluid motion once a frame rate exceeds a certain threshold).
The nature of this "emergence" is interesting both from a theoretical and a practical point of view.
The "large" model size (I put the word in quotes, because large compared to what equivalent task?) might be somehow intrinsic, in which case it will spark a race to make computational resources of that size more commonplace. But it might also reflect an incomplete understanding of the model class.
Good reference. The sparsity of the data points (presumably due to the huge training costs) doesn't reveal too much at present but as the authors suggest the phenomenon might be reproduced in smaller models (which would make it easier to study in greater detail).
There is a paper saying that at around 6-7 billion params, something happens that makes the larger transformer qualitatively different from the smaller one. I forget which paper it was, though.
Can anyone give a layman's rundown on "it guesses the next token" in situations where it's seemingly able to apply what looks like logic to certain prompt requests?
The way I understand it, our language is an abstraction of our logic. We think using it and we communicate our logic with words. Literally, how do you reason within your head? By forming sentences and arguing with yourself using words? At least that is how I do it. I can't do any deep reasoning without first translating it into sentences.
So after learning so much text and forming an extremely dense network of connections between different words and phrases, an AI can mimic something similar to reasoning. At its core, it is still just predicting the next word. But the scale of this "prediction" is so large that it begins to mirror our own "reasoning" process. Because in the end, when we apply logic, we are doing it through our own network of concept connections, which is reflected in our language.
I guess the author did give a complete explanation when they mentioned arrays of floating point numbers representing connections. I just thought there was more to it that was omitted. This seems like a relatively simple mechanism that simulates a fascinatingly complicated process when performed at scale. Thanks for a great explanation!
By nature of its architecture, it can only predict the next token. However, the architecture may be flexible enough that it may be able to uncover the underlying patterns within its training data in order to help it predict the next token.
Then suppose there exists a well-defined algorithmic process for human-level intelligence, some approximation of which appreciably resides in the training data.
Then perhaps the LLM, or networks like it, have started to learn aspects of this algorithmic process and leverage it in order to successfully predict the next token.
In other words, by forcing it to predict the next token, but giving it the flexibility to do so in any way it deems helpful, LLMs have learned higher order concepts such as abstract reasoning and "general intelligence".
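To make "it can only predict the next token" concrete, here's a minimal sketch of a greedy decoding loop (assuming the Hugging Face transformers library and the small open "gpt2" checkpoint; the prompt is just an illustration). All the model ever emits is a score for every possible next token; anything that looks like reasoning has to emerge from doing that over and over:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # start from an arbitrary prompt
    input_ids = tokenizer("Alice has 3 apples and eats one, so she has",
                          return_tensors="pt").input_ids

    with torch.no_grad():
        for _ in range(10):                   # generate ten tokens, one at a time
            logits = model(input_ids).logits  # a score for every token in the vocabulary
            next_id = logits[0, -1].argmax()  # greedy: take the single most likely next token
            input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

    print(tokenizer.decode(input_ids[0]))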
> In other words, by forcing it to predict the next token, but giving it the flexibility to do so in any way it deems helpful, LLMs have learned higher order concepts such as abstract reasoning and "general intelligence".
To my understanding there is not any evidence of Abstract Reasoning or "General Intelligence" in Large Language Models such as ChatGPT.
For example, I found a website that defines and has abstract reasoning tests it says are currently "AI Resistant"[0]:
> Resistance to AI tools (like ChatGPT) is checked each week.
> Based on the last check, it's unlikely that AI tools will be able to help answer this question
The test is a "Combining Figures" test where one sees a series of three shapes in which the first two, overlaid on top of each other, create the final shape, and then a final series where one of the three shapes is missing and you have to figure out which of a set of possible shapes should go there.
It's pretty straightforward to show that you and I can pass this test, but ChatGPT 3.5[1] has low chances:
---
Please solve the following abstract reasoning puzzle.
Given a numbered set of 1 by 3 tiles represented by uppercase O's for empty spaces and uppercase X's for filled spaces such as:
1. OXO
2. XOO
3. XXO
And given a different numbered set of 1 by 3 tiles such as:
1. OOX
2. XOX
3. XOX
Now given just the first two 1 by 3 tiles of a numbered set of three:
1. XXX
2. OXO
Which of the following choices is the correct third tile following the previous sequence?
a. OXO
b. XOX
c. OOO
d. XXX
Please explain your reasoning for how you arrived at this choice.
---
In order to solve this abstract reasoning puzzle, we need to identify the pattern or rule that governs the transformation between the first two sets of 1 by 3 tiles. Let’s analyze the given sets:
Set 1:
1. OXO
2. XOO
3. XXO
Set 2:
1. OOX
2. XOX
3. XOX
Comparing both sets, we can see that the transformation occurs horizontally across the three tiles. Let’s look at each column independently:
1st column: In both sets, the first tile is “O”, so no filling pattern can be determined.
2nd column: In both sets, the second tile alternates between “X” and “O”.
3rd column: In both sets, the third tile repeats the same filling pattern as the second tile.
Now, let’s analyze the third and final set consisting of the tiles:
1. XXX
2. OXO
Based on the pattern observed in sets 1 and 2, we can deduce that the third tile must repeat the same filling pattern as the second tile. Looking at the second tile of the third set, it is “OXO”. Therefore, the correct third tile to follow the pattern would be “OXO”.
I would have said d. XXX. My reasoning being that in sets 1 and 2, the third tile has an X in every space where the first or second tile has an X.
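For what it's worth, that additive rule is trivial to write down in code. Here's a quick sketch (the tile strings are just the ones from the puzzle above) that overlays two tiles with a logical OR and checks the result against the answer choices:

    def overlay(a: str, b: str) -> str:
        # additive rule: a cell is X if either input tile has an X there
        return "".join("X" if "X" in pair else "O" for pair in zip(a, b))

    assert overlay("OXO", "XOO") == "XXO"   # set 1 from the puzzle above
    assert overlay("OOX", "XOX") == "XOX"   # set 2

    choices = {"a": "OXO", "b": "XOX", "c": "OOO", "d": "XXX"}
    answer = overlay("XXX", "OXO")          # the incomplete third set
    print([k for k, v in choices.items() if v == answer])   # ['d']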
Anyway, it seems to me that you are holding AI to a higher standard than humans.
Abstract reasoning abilities do not imply the ability to solve every abstract reasoning problem. Even humans cannot do that. If there is even a single abstract reasoning problem ChatGPT can solve, that means it has abstract reasoning abilities.
Similarly, general intelligence does not imply the ability to solve any problem. It means problem-solving abilities that are not restricted to a specific task or domain. ChatGPT has general intelligence because it can handle situations that it was not explicitly programmed to handle in a wide variety of contexts.
> I would have said d. XXX. My reasoning being that in sets 1 and 2, the third tile has an X in every space where the first or second tile has an X.
This is of course the correct reasoning, but ChatGPT was way off, as we'll explore shortly.
It's just predicting the next likely token, and while that clearly works for a lot of things, it seemingly does not work well for abstract reasoning and logic.
While I think it's fair to say that being able to solve one Abstract Reasoning problem would indicate having Abstract Reasoning abilities to some degree, this is a multiple choice question, so there is a solid chance of just lucking into the right answer (like with any human taking a test, even!).
As such, these tests typically have multiple questions of the same form but with different specific instances (much like the abstract Class vs the concrete Instance in Object Oriented Programming) in order to identify whether or not a conscious being does indeed have Abstract Reasoning skills.
So I will be more than happy to see ChatGPT solve several of these types of problems in a later iteration if it can, but I do think we've found a limitation of the class of models here.
Let's look at ChatGPT's reasoning for why that could be:
> 1st column: In both sets, the first tile is “O”, so no filling pattern can be determined.
False; both full sets have an X in the first column of the second row, so this is completely wrong.
> 2nd column: In both sets, the second tile alternates between “X” and “O”.
False; only the second set alternates between "X" and "O", while the first set has two consecutive "O"s in the first tile.
> 3rd column: In both sets, the third tile repeats the same filling pattern as the second tile.
False; only the second set repeats the same filling pattern as the second tile, while the first set is clearly a combination of disparate tiles.
(I specifically chose that representation to make clear that it was an Additive rather than Subtractive problem, by the way)
All three of these observations are incorrect, but they are stated so authoritatively that many people might not even have noticed.
Of course, the "predict next token" engine doesn't KNOW that so it just keeps on predicting tokens into the wrong solution.
I believe it will do this most of the time and that shows it does not yet have Abstract Reasoning skills.
"Guessing the next token" is an interface, not an implementation. A transformer is a universal function approximator and is capable of learning the function "give the right answer" through this interface.
(Also, if you do beam search, it's not "the next token" but "the next tokens".)
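If it helps, here's a rough sketch of that difference (assuming the Hugging Face transformers library and the "gpt2" checkpoint; purely illustrative): greedy decoding commits to one token at a time, while beam search keeps several candidate continuations alive in parallel and picks the best one overall:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tokenizer("The capital of France is", return_tensors="pt")

    # greedy: commit to the single most likely next token at every step
    greedy = model.generate(**inputs, max_new_tokens=5,
                            pad_token_id=tokenizer.eos_token_id)

    # beam search: track four candidate continuations and return them all
    beams = model.generate(**inputs, max_new_tokens=5, num_beams=4,
                           num_return_sequences=4,
                           pad_token_id=tokenizer.eos_token_id)

    print(tokenizer.decode(greedy[0]))
    for seq in beams:
        print(tokenizer.decode(seq))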
The training process forces it to optimise for predicting the next token, across an extremely large and diverse set of tokens. To do this, it has access to a very large, layered neural network, usually just called "the weights" after the particular property of the network that changes during training.
We largely do not know what that neural network does in order to successfully predict the next token. We know how it learned in an obvious sense: on each pass it measures how wrong its next-token predictions were (the loss), works out how much each weight contributed to that error, and nudges the weights in the direction that lowers the loss - that is gradient descent (a toy sketch of a single such step appears just below). We know in a general sense that "the neural net takes information from previous tokens and transforms it into a prediction for the next token". But we largely don't know how. This is where the controversy over whether it engages in reasoning or logic comes from; it clearly does, we can see it in how it responds to certain questions, but it is not obvious where in this process anything like a logic engine could fit in.
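As an aside on the mechanics, a toy sketch of a single such training step might look like this (a made-up two-layer model and random token data, nothing like the real training setup, just to show what "adjust the weights to lower the loss" means):

    import torch
    import torch.nn as nn

    vocab_size, dim = 100, 32
    # a toy "language model": embed each token, then score every possible next token
    model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    tokens = torch.randint(0, vocab_size, (1, 16))   # a random toy "document"
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # target = the next token at each position

    logits = model(inputs)                           # shape (1, 15, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()      # compute the gradient of the loss with respect to every weight
    optimizer.step()     # nudge each weight in the direction that lowers the loss
    optimizer.zero_grad()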
Some convincing research has suggested that the neural network generates a world model ("models the world" might be easier to understand grammatically) based on the information in the input data (the internet), which it then uses during inference to evaluate questions, with different parts of the model activating for different tasks, like recalling a memorised fact ("When was the Grand Canyon named?"), or answering a logical puzzle like the ones with Alex and Bob.
A hypothesis is that the process of taking a large neural net and forcing it to learn how to predict the next token on extremely large datasets with lots of compute gives it the ability to create "things" within itself, subnetworks or little logic engines or coarse emulations of various intelligent entities. After fine-tuning and all the other housekeeping, if you ask it a question it will figure out which of its "things" can answer that type of question, hand it to that "thing", and the "thing" uses both memorised information and other "things" to come up with an answer. If this hypothesis is true, there is no contradiction between "it predicts the next token" and "it reasons logically and relies on its memory to answer the questions it is asked". The full sentence would be "It reasons logically and relies on its memory in order to answer the questions it is asked, by predicting one token at a time."
I have not seen another hypothesis that adequately explains to me how it is possible for the current crop of LLMs to correctly answer novel logic questions. A ~good challenge to this hypothesis is that there are still problems that LLMs can't solve, like correctly formatted novel Winograd schemas, and it's not clear why this hypothesis should work for the things it can do but not for them. My answer to that challenge would probably be: build a bigger system and see if that can solve those other problems too. Scale seems to be the overwhelmingly dominant factor in performance for these things.
Rather than viewing generative AI as a form of artificial intelligence, I posit that it should be seen as an automated tool for tapping into human cultural, linguistic, and empirical knowledge.
Data and computation are two sides of the same coin. The 'intelligence' in AI is embedded within the data, with the computational model serving as a tool to access and express this inherent intelligence.
I would argue for a change in perspective towards AI, one that recognizes LLMs as powerful tools for accessing the vast wealth of human cultural knowledge rather than viewing them as a separate form of intelligence. We must carefully consider critical ethical considerations about control, access, and trust that will become increasingly relevant as these tools become more integrated into our everyday lives.
This paradigm shift carries some implications:
LLMs will not achieve superintelligence: Although these models can process information quickly and access a wide range of knowledge, they lack the superior reasoning or inference abilities that would classify them as superintelligence.
LLMs as an extension of human thought: These models can automate and amplify human capabilities but do not introduce new abilities beyond what is already present in human thought processes.
LLMs as mirrors of human culture and knowledge: These models reflect the recorded artifacts of human language, art, and culture. They can make the inherent intelligence within these artifacts accessible, providing a vast information resource.
Implications for the future: Access to this "memetic matrix" of human knowledge will become a fundamental part of being human as these tools become more integrated into our lives, bringing up issues of ownership, access, and the potential for misuse.
Thought consolidation and control of inference engines: There's a potential risk that control of inference engines by a small number of companies could lead to a consolidation of thought that threatens democratic governance. I propose a diversity of federated or self-hosted inference tools as solutions to mitigate this risk.
The necessity for trust and individuality: As these tools become more influential in our lives, maintaining trust in our individual thoughts and avoiding the uncritical acceptance of synthesized ideas from sources with opaque motives will become increasingly important.
Synthetic Inference relies on a vast cultural commons: We cannot allow these commons to be closed off and owned by a few big companies. This resource is the totality of all human knowledge, language, and culture. It belongs to all of humanity. Training data must be open, free, and available for examination.
I agree we should see LLMs in a different way. You said it right: the intelligence comes from the data they train on. And this data comes from an evolutionary process of language and ideas, memetics. So it evolves at a much faster speed than biological agents. It is not personal intelligence; it travels from human to human and from human to AI.
I call it language intelligence. LLMs are just agentified language and the merits lie with the language, not the model. And I believe humans too are just agentified language, but with better embodiment than AIs - we contextually generate language to solve our day to day problems.
One consequence I see is that AI evolution and human evolution are intertwined by language, the common repository of understanding, models and ideas. We both benefit when a new idea gets added to the mix. But memetics is blind in a way, it has no purpose of its own other than diverse idea replication. It is leading both humans and AI into unknown territory.
> LLMs will not achieve superintelligence: Although these models can process information quickly and access a wide range of knowledge, they lack the superior reasoning or inference abilities that would classify them as superintelligence.
> LLMs as an extension of human thought: These models can automate and amplify human capabilities but do not introduce new abilities beyond what is already present in human thought processes.
I have toyed with the idea of how society would look if we had "AI" that is superior to almost all humans at almost all tasks - with the sole exception being the fraction of humans who are able to create knowledge that is actually new, for humanity and the "AI" to exploit. There would be some interesting dynamics if the majority of people were completely useless in the traditional economic sense, or at best, mindless drones for the physical tasks AI is not (yet) capable of doing.
Last night I was cooking my dinner and I saw a horse fly die a horrible death. It flew through the burning gas, burned its wings off, landed on the hot cooktop and just lay there dying. I quickly dispatched it because it was the most humane thing I could imagine doing.
It reminded me how harsh the world can be and reading your comment resonated quite well with me after what I had witnessed. Are we the fly heading for the flame?
I think we like to think we're just going to be OK, that we'll just be lucky - after all, it's just progress, right? We've not yet encountered doom... right? However, things really do go badly sometimes, and it's a shame we don't actually think things through a little bit more before going full steam ahead into a situation where we might lose our wings and wind up on that hot plate. I do understand that it might mean we never end up like that fly as well...
I am quite neutral about AI progress, but I do think your question is an important one, and I've yet to hear a compelling counter-argument to the issue you've described. I guess the plan is that we try to improvise once Pandora's box has been smashed wide open?
All aboard, full steam ahead for progress at all costs captain!!!
That's a mobile design flaw: on desktop the date shows up at the top in the right hand sidebar but on mobile you have to scroll all the way to the bottom of the page.
What is going on with this comment? It was clearly made a couple of days ago, and Simon’s reply was also posted a couple of days ago, and yet they’re both appearing on an article posted in the last day, and (on that post) showing as being a few hours old. Am I on glue?